Jan 21, 2022 - The Eighty Five Percent Rule for optimal learning


A practice schedule or learning plan can be very important in skill and knowledge acquisition, hence the saying "perfect practice makes perfect". I first heard about this work on Andrew Huberman's podcast episode on goal setting. The results are significant, offering insights into training both ML algorithms and biological learners. The analysis techniques are also elegant; definitely brushed off some rust here!

Summary

  1. Examines the role of training difficulty on the rate of learning.
  2. The theoretical result is derived for learning algorithms relying on some form of gradient descent, in the context of binary classification.
  • Note that gradient descent here is defined loosely: the "parameters of the model are adjusted based on feedback in such a way as to reduce the average error rate over time". This is pretty general; reinforcement learning, stochastic gradient descent, and LMS filtering can all be seen as special cases.
  3. Results show that under Gaussian noise, the optimal error rate during training is around 15.87%, or conversely, the optimal training accuracy is about 85%.
  • The noise distribution profile is assumed to be fixed, i.e. the noise distribution won't change from Gaussian to Poisson.
  • The 15.87% is the standard normal CDF evaluated at -1: Φ(-1) ≈ 0.1587.
  • Under different fixed noise distributions, this optimal value can change.
  4. Training at a fixed error rate yields exponentially faster improvement in precision (proportional to accuracy) compared to training at fixed difficulty: O(√t) vs. O(√log(t)).
  5. Theoretical results are validated in simulations of perceptrons, a 2-layer NN on the MNIST dataset, and the Law and Gold model of perceptual learning (neurons in the MT area deciding the motion direction of dots at different coherence levels, using reinforcement learning rules).

Some background

The problem of optimal error rate pops up in many domains:

  • Curriculum learning, or self-paced learning in ML.
  • Level design in video games
  • Teaching curriculum in education

In education and game design, empirically we know that people are more engaged when the tasks presented aren't too easy, but only slightly beyond their current ability. Some call this the optimal condition for the "flow state".

Problem formulation

The problem formulation is fairly straightforward, for the case of binary classification with Gaussian noise.

  1. The agent makes a decision, represented by a decision variable h, computed as h = Φ(x, ϕ), where x is the stimulus and ϕ are the parameters of whatever learning algorithm is used.
  2. The decision variable h is a noisy representation of the true label Δ: h = Δ + n, where n ~ N(0, σ).
  3. If the decision boundary is set at h = 0 (see Figure 1A), such that the agent chooses A when h < 0, B when h > 0, and randomly otherwise, then the decision noise leads to an error rate of: \[ER = \int_{-\infty}^{0} p(h|\Delta,\sigma)dh=F(-\Delta/\sigma)=F(-\beta\Delta)\].
  • p() is the Gaussian distribution
  • F() is the Gaussian CDF
  • β = 1/σ is the precision (inverse noise), and essentially measures how "peaky" the decision variable distribution is. It can be thought of as the "skill" of the agent.
  • So, the error decreases with both agent skill (β) and ease of problem (Δ).
  4. Learning is essentially an optimization problem: change ϕ to minimize ER. This can be formulated as a general gradient descent problem: \[\frac{d\phi}{dt}=-\eta\nabla_\phi ER\].
  5. This gradient can be written as ∇ϕER = (dER/dβ)·∇ϕβ. We want to find the optimal difficulty Δ that maximizes the magnitude of dER/dβ.
  6. The optimal difficulty turns out to be Δ* = 1/β, which gives optimal error rate ER* = F(-1) ≈ 0.1587.
  • This is nice: as skill β grows, Δ* shrinks, i.e. the difficulty level should track the skill level.
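
To make this concrete, here is a minimal Python sketch (mine, not from the paper) that recovers the same optimum numerically under the Gaussian model above: learning speed is |dER/dβ| = Δ·f(βΔ), and maximizing it over Δ lands at Δ = 1/β and ER ≈ 0.1587.

import numpy as np
from scipy.stats import norm

# ER(Δ) = F(-β·Δ); learning speed is |dER/dβ| = Δ·f(β·Δ)
beta = 2.0                                # arbitrary skill level
deltas = np.linspace(0.01, 5, 10000)      # candidate difficulty levels
speed = deltas * norm.pdf(beta * deltas)  # |dER/dβ| at each Δ

best = deltas[np.argmax(speed)]
print(best, 1 / beta)                     # best Δ is ≈ 1/β
print(norm.cdf(-beta * best))             # optimal ER ≈ Φ(-1) ≈ 0.1587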

model illustration

Simulations

Validation of this theory boils down to the following elements:

  • How to quantify problem/stimulus difficulty Δ?
  • How to select problem with the desired difficulty?
  • How to adjust target difficulty level to maintain the desired error rate?
  • How to measure error rate ER?
  • How to measure skill level?
  1. Application to perceptron:
  • A fully trained teacher perceptron network's weights e are used to calculate the difficulty. The difficulty of a sample is equal to its distance from the decision boundary (higher is less difficult).
  • Skill level of the learner network is cot θ, where θ is the angle between the learner perceptron's weight vector and the teacher perceptron's weight vector.
  • Error rate is approximately Gaussian and can be determined from the weight vectors of the teacher and learner perceptrons, and the sample vectors.
  2. Application to two-layer NN:
  • A teacher network is trained first. The absolute value of the teacher network's sigmoid output is the difficulty of a stimulus.
  • Error rate is estimated from performance over the past 50 trials (not batch training).
  • Skill level is the final performance over the entire MNIST dataset.
  • Difficulty level is updated in proportion to the difference between the current and target accuracy levels.
  3. Application to the Law and Gold model:
  • Simulated the update rule of 7200 MT neurons according to the original reinforcement model.
  • Coherence is adjusted to hit the target accuracy level; this implicitly measures the difficulty level.
  • Accuracy/error rate is a running average over past trials.
  • Final skill level is measured by applying the ensemble to stimuli at different coherence levels (with learning disabled) and fitting a logistic regression.
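
All three applications share the same closed-loop recipe: estimate the error rate from recent trials and nudge the difficulty toward the target accuracy. Below is a toy sketch of that loop; the learner, whose hidden precision β grows along the ER gradient, and all the constants are my own inventions, not the paper's.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
target_acc = 0.85            # train at the optimal ~85% accuracy
beta = 0.5                   # learner's (hidden) precision, grows with learning
delta = 2.0                  # current difficulty (larger Δ = easier)
running_acc = target_acc

for trial in range(5000):
    p_correct = 1 - norm.cdf(-beta * delta)            # accuracy = 1 - ER
    correct = float(rng.random() < p_correct)
    running_acc = 0.98 * running_acc + 0.02 * correct  # running accuracy estimate
    delta = max(delta + 0.01 * (target_acc - running_acc), 1e-3)  # easier if below target
    beta += 0.001 * delta * norm.pdf(beta * delta)     # gradient-like skill growth

print(beta, delta, running_acc)  # precision grew while accuracy hovered near 85%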

Implications

  1. The optimal rate is dependent on the task (binary vs. multi-class) and the noise distribution.
  2. Batch learning makes things trickier -- the result then depends on the agent's learning rate.
  3. Multi-class problems may require finer-grained evaluation of difficulty and problem presentation (i.e. misclassifying 1 vs. 2, or 2 vs. 3).
  4. Not all models follow this framework, e.g. a Bayesian learner with perfect memory does not care about example presentation order.
  5. Can be a good model for optimal use of attention in learning -- suppose exerting attention changes the precision β, then the benefit of exerting attention is maximized at the optimal error rate.

Sep 30, 2018 - Linear Algebra Intuition

Continuing on my quest to have the best unifying intuition about linear algebra, I came across the excellent series Essence of Linear Algebra on YouTube by 3Blue1Brown.

This series really expanded on the intuition behind the fundamental concepts of Linear Algebra by illustrating them geometrically. I have vaguely arrived at the same sort of intuition through thinking about them before, but never this explicitly. My notes are here.

Chapter 1: Vectors, what even are they

Vectors can be visualized geometrically as arrows in 1-D, 2-D, 3-D, ..., n-D coordinate system, and can also be represented as a list of numbers (where the numbers represent the coordinate values).

The geometric interpretation here is important for building intuition and can later be generalized to more abstract vector spaces.

Chapter 2: Linear combinations, span, and basis vectors

Vectors (arrows, and lists of numbers) can form linear combinations, which involve multiplication by scalars and addition of vectors. Geometrically, multiplication by a scalar is equivalent to scaling the length of a vector by that factor. Vector addition means putting vectors tail to head and finding the resulting vector.

Any vector in 2D space, for example, can be described by a linear combination of a set of basis vectors.

In other words, 2D space is spanned by a set of basis vectors. Usually, the basis vectors are chosen to be [1,0]T and [0,1]T in 2-D, representing the unit î and ĵ vectors.

As a corollary, if a 3-by-3 matrix A has linearly-dependent columns, its columns do not span the entire 3D space (they span a plane or a line instead).

Chapter 3: Linear transformations and matrices

Multiplication of a 2-by-1 vector v by a 2-by-2 matrix A to yield a new 2-by-1 vector u, Av = u, can be thought of as a linear transformation. In fact, this multiplication can be thought of as transforming the original 2-D coordinate system into a new one, with the caveat that grid lines must remain parallel and evenly spaced after the transformation (think of a grid getting sheared or rotated).

Specifically, assuming the original vector v is written in terms of the basis vectors î and ĵ, the columns of A can be thought of as where î and ĵ land in the transformed coordinate system.

The matrix-vector multiplication then has the geometric meaning of: represent the original vector v as a linear combination of the original basis vectors î and ĵ, find where they map to in the transformed space (indicated by the columns of A), and yield u as the same linear combination of the transformed basis vectors.

This is a very powerful idea -- the specific operation of matrix-vector multiplication makes clear sense under this context.
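
A quick numpy check of this reading (the matrix is arbitrary): multiplying by A is the same as taking the original coordinates of v as weights on A's columns.

import numpy as np

A = np.array([[1.0, -1.0],    # column 0: where î lands
              [1.0,  2.0]])   # column 1: where ĵ lands
v = np.array([3.0, 2.0])      # v = 3·î + 2·ĵ

u = 3 * A[:, 0] + 2 * A[:, 1] # same combination of the transformed basis
print(u, A @ v)               # identical: [1. 7.] [1. 7.]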

Chapter 4: Matrix multiplication as compositions and Chapter 5: Three-dimensional linear transformations

If a matrix represents a transformation of the coordinate system (e.g. shear, rotation), then multiplication of matrices represents a sequence of transformations. For example, AB can represent first shearing the coordinate system (B), then rotating the resulting coordinate system (A).

This also makes it obvious why matrix multiplication is NOT commutative: shearing a rotated coordinate system can have a different end result than rotating a sheared coordinate system.

The inverse of a matrix A then represents performing the reverse coordinate-system transformation, such that A⁻¹A = AA⁻¹ represents a net-zero transformation of the original coordinate system, which is exactly the identity matrix.

Chapter 6: The determinant

This is a really cool idea. In many statistics formulas, there will be a condition that reads like "assuming A is positive-definite, we multiply by A⁻¹". Positive-definiteness relates to positive determinants. This gave me the sense that the determinant is a matrix analog of a real number's magnitude. But then what does a negative determinant mean?

In the 2D case, if a matrix represents a transformation of the coordinate system, the determinant then represents the area in the transformed coordinate system of a unit area in the original system. Imagine a unit square being stretched, what is the area of the resulting shape -- that is the value of the determinant.

A negative determinant means a change in the orientation of the resulting shape. Imagine a linear transformation as performing operations on a piece of paper (parallel lines on the paper are preserved through the operations). A matrix with determinant of -1 would correspond to operations that involve flipping the paper: î is originally 90 degrees clockwise from ĵ; after flipping the paper (x-y plane), î is 90 degrees counter-clockwise from ĵ.

In 3D, the determinant measures the scaling of a volume, and a negative determinant represents transformations that do not preserve the handedness of the basis vectors (i.e. the right-hand rule).

If a negative determinant represents flipping a 2D paper, then a 0 determinant means a unit area becomes zero. Similarly, in 3D, a 0 determinant means a unit volume becomes zero. What shapes have 0 volume? A plane, a line, and a point. What shapes have 0 area? A line and a point.

So if det(A)=0, then we know A maps vectors into a lower-dimensional space! And that is why positive-definiteness is important: it means the transformation preserves the dimension of the data!

From this intuition, the computation of determinant also makes a bit more sense (at least in the 2D case).
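
For instance, here is a quick numpy check of the area/orientation reading (the example matrices are mine):

import numpy as np

shear  = np.array([[1.0, 1.0], [0.0, 1.0]])  # slides the top edge sideways
flip   = np.array([[0.0, 1.0], [1.0, 0.0]])  # swaps î and ĵ: flips the paper
squash = np.array([[1.0, 2.0], [2.0, 4.0]])  # dependent columns: square -> line

print(np.linalg.det(shear))   #  1.0 (area preserved)
print(np.linalg.det(flip))    # -1.0 (area preserved, orientation reversed)
print(np.linalg.det(squash))  #  0.0 (unit area collapses to zero)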

Chapter 7: Inverse matrices, column space and null space

This chapter combines the geometric intuition from before and solving systems of linear equations to explain inverse matrices, column space, and null space.

Inverse matrices

Solving a system of linear equations in the form Ax = v (where A is 3-by-3, and x and v are 3-by-1 vectors) can be thought of as asking: what vector x in 3-D space, after the transformation represented by A, becomes the vector v?

In the nominal case where det(A) ≠ 0, there exists an inverse A⁻¹ that represents the reverse of the transformation represented by A, and thus x can be solved for by multiplying both sides of the equation by A⁻¹.

Intuitively this makes sense -- to get the vector pre-transformation, we simply reverse the transformation.

Now suppose det(A)=0; this means A maps a 3-D vector onto a lower-dimensional space, which can be a plane, a line, or even a point. In this case, no inverse A⁻¹ exists, because how can you transform these lower-dimensional constructs back into 3D space? (Note that in most textbooks, the justification that A⁻¹ exists only when det(A) is nonzero is made on the basis of the Gauss-Jordan form, which is not intuitive at all.)

Column Space and Null Space

So in the nominal case, each vector in 3D is mapped to a different vector in 3D space, and the set of all possible v's spans the entire 3D space. The 3D space is therefore the column space of the matrix A. The rank of A in this case is 3, because the column space is 3-dimensional.

The set of vectors x that after transformation by A yield the 0-vector is called the null space of the matrix A.

In cases where det(A)=0, the columns of A must be linearly dependent. This means its column space is not 3-dimensional, and therefore its rank is less than 3 (for a 3-by-3 matrix, it is rank-deficient rather than full-rank).

If this rank-deficient A maps x onto a plane, then its column space has rank 2; onto a line, rank 1; onto a point, rank 0. For a 3-by-3 matrix, 3 minus the rank of the column space is the rank, or dimension, of the null space.

Geometrically, this means that if a transformation compresses a cube into a plane, then an entire line of points is mapped onto the origin of the resulting plane (imagine a vertical compression of a cube: the entire z-axis is mapped onto the origin and is therefore the null space of this transformation). Similarly, if a cube is compressed into a line, then an entire plane of points is mapped onto the origin of the resulting number line.

Left and Right Inverse

Technically, the inverse A⁻¹ normally talked about refers to the left inverse (A⁻¹A = I). The full-rank matrices for which a left inverse exists perform a one-to-one (injective) transformation of vectors: distinct vectors x map to distinct vectors v.

For matrices with more columns than rows, the situation is a bit different. Using our intuition of treating the columns of a matrix as the mapping of the original basis vectors into a new coordinate system, consider the 2-by-3 matrix [1,0,0; 0,1,1] mapping a 3D vector to 2D. This matrix maps the basis vectors [1,0,0]T, [0,1,0]T, and [0,0,1]T into a plane. Each transformed vector can therefore have multiple corresponding vectors in the original 3D space (such an A is called surjective).

Consequently, if A is 2-by-3, there exists a 3-by-2 right-inverse matrix B such that AB = I, where I is 2-by-2. Geometrically, this says that if we transform a 2D vector into 3-space (B), we can recover the 2D vector by another transformation (A). Thus right-inverses deal with transformations involving dimensional changes.
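
A small numpy sketch of this, using the 2-by-3 example above; B = Aᵀ(AAᵀ)⁻¹ is one standard right inverse (assuming AAᵀ is invertible):

import numpy as np

A = np.array([[1.0, 0.0, 0.0],    # 2-by-3: maps 3D down to 2D
              [0.0, 1.0, 1.0]])

B = A.T @ np.linalg.inv(A @ A.T)  # right inverse: up-projects 2D into 3D
print(np.round(A @ B, 6))         # identity: B then A recovers any 2D vector
print(np.round(B @ A, 6))         # NOT identity: a 3D vector is not recoverable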

These explanations, in my opinion, are much more intuitive and easy to remember than the rules and diagram taught in all of the linear algebra courses I have ever taken, including 18.06.

Chapter 8: Nonsquare matrices as transformations between dimensions

This is a straightforward application of the idea that the columns of a matrix represent the mapping of the basis vectors.

Chapter 9: Dot product and duality

This one is pretty interesting. The dot product represents projection. At the same time, we can think of the dot product in 2D, v·u, as a matrix-vector multiplication vᵀu. Here we can think of the 1-by-2 "matrix" vᵀ as a transformation of the 2D basis vectors î and ĵ onto vectors on the number line. Therefore the dot product represents a 2D-to-1D transformation.

This is the idea of vector-transformation duality: each n-dimensional vector's transpose represents an n-to-1 dimensional linear transformation.
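
A one-line numpy check of this duality (the vectors are arbitrary): treating vᵀ as a 1-by-2 matrix gives the same number as the dot product.

import numpy as np

v = np.array([2.0, 1.0])
u = np.array([3.0, 4.0])

print(v.reshape(1, 2) @ u, np.dot(v, u))  # [10.] 10.0 -- a 2D-to-1D transformation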

Chapter 10: Cross products

This one is also not too insightful, perhaps because cross product's definition as a vector was strongly motivated by physics.

The cross product of two 3D vectors a and b yields a third vector c perpendicular to both (together obeying the right-hand rule) with magnitude equal to the area of the parallelogram formed by a and b. Thinking back, the cross-product formula involves calculating a type of determinant, which makes sense in terms of the area interpretation of the determinant.

Then the triple product c·(a×b) represents the volume of the parallelepiped formed by a, b, and c.

Chapter 11: Cross products in the light of linear transformations

This one basically explains the triple product by interpreting the dot product with the cross product, c·(a×b), as finding a 3D-to-1D linear transformation.

Not too useful.

Chapter 12: Change of basis

This also follows from the idea that a matrix represents mapping of the individual vectors of a coordinate system.

The idea is that coordinate systems are entirely arbitrary. We are used to having [1,0]T and [0,1]T as our basis vectors, but they can be a different set entirely. Suppose Jennifer has different basis vectors, [2,1]T and [1,1]T; how do we represent a vector [x0,y0]T from our coordinate system in Jennifer's coordinate system?

This is done by a change of basis: multiplying [x0,y0]T by the inverse of the matrix whose columns are Jennifer's basis vectors (the matrix itself translates the other way, from Jennifer's coordinates into ours). This is yet another application of the matrix-vector product.

image1

Change of basis can also be applied to translate linear transformations. For example, suppose we want to rotate the vector u=[x1,y1]T, written in Jennifer's coordinate system, by 90 degrees, and want the result in Jennifer's coordinate system. It may be tedious to figure out what the 90-degree rotation matrix is in Jennifer's coordinate system, but it is easily found in our system ([0,-1;1,0]).

So we can achieve this easily by first transforming u into our coordinate system, multiplying it by the mapping between Jennifer's and our coordinate system. Then apply the 90-degree rotation matrix, then apply the inverse mapping to get back the resulting vector in Jennifer's system.
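
In numpy the whole recipe is the sandwich P⁻¹RP, where P's columns are Jennifer's basis vectors (using the [2,1]T, [1,1]T example above and an input vector of my choosing):

import numpy as np

P = np.array([[2.0, 1.0],   # columns: Jennifer's basis vectors, in our coordinates
              [1.0, 1.0]])
R = np.array([[0.0, -1.0],  # 90-degree rotation, expressed in our basis
              [1.0,  0.0]])

u_jennifer = np.array([1.0, 2.0])             # a vector in Jennifer's language
print(np.linalg.inv(P) @ R @ P @ u_jennifer)  # the rotated vector, back in her language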

image2

This is a powerful idea and relates closely to eigenvectors, PCA, and SVD. SVD especially represents a matrix by a series of coordinate transformations (rotation, scaling and rotation).

Chapter 13: Eigenvectors and eigenvalues

This is motivated by the problem Av=λv, where v are the eigenvectors and λ are the eigenvalues of the matrix A (nominally, A is a square matrix; otherwise we use the SVD, but that is beside the point).

The eigenvectors are vectors that, after transformation by A, still point in the same direction as before. A unit vector may have a different length after the transformation while pointing in the same direction; the change in length is described by the eigenvalue. A negative eigenvalue means the eigenvector points in the reverse direction after the transformation.

The imagery: suppose we have a square piece of paper/mesh and we pull at the top-right and bottom-left corners to shear it. A line connecting these two corners will still have the same orientation, though a longer length. This line lies along an eigenvector, and the eigenvalue will be positive.

However, if a matrix represents a pure rotation, then it has no real eigenvectors: the characteristic equation det(A − λI) = 0 has no real solutions, and the eigenvalues are imaginary. Imaginary eigenvalues mean that a rotation is involved in the transformation.
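
A quick numpy illustration of both cases:

import numpy as np

shear = np.array([[1.0, 1.0], [0.0, 1.0]])   # pulls the top edge sideways
rot90 = np.array([[0.0, -1.0], [1.0, 0.0]])  # pure 90-degree rotation

w_shear, _ = np.linalg.eig(shear)
w_rot, _ = np.linalg.eig(rot90)
print(w_shear)  # [1. 1.]: vectors along the x-axis keep their direction
print(w_rot)    # [0.+1.j 0.-1.j]: imaginary, no real vector keeps its direction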

Chapter 14: Abstract vector spaces

This lesson extrapolates vectors, normally thought of as arrows in 1- to 3-D space and corresponding lists of numbers, to any construct that obeys the axioms of a vector space. Essentially, the members of a vector space follow the rules of linear combination (linearity).

An immediate generalization is thinking of functions as infinite-dimensional vectors (we can add scalar multiples of functions together and the result obeys the linearity rules). We can further define linear transformations on these functions, similar to how matrices act on geometric vector spaces.

An example of a linear transformation on functions is the derivative.

image3

This insight is pretty cool and really ties a lot of math and physics concepts together (e.g. modeling wavefunctions as vectors in quantum mechanics, with eigenfunctions as analogs of eigenvectors).

Aug 22, 2018 - LaTeX bibliography tricks

Writing my dissertation, I chose to use LaTeX because it makes everything look good, I can stay in Vim, type math easily, and don't have to deal with an actual Word-like processor's abysmal load speed when I'm writing 100+ pages. At the same time, some things that should be very easy took forever to figure out.

Problem

In addition to the bibliography of the actual dissertation chapters, I need to append a list of my own publications in the biography section (the last "chapter" of the entire document), formatted like a proper bibliography section. Why the biography? So future readers would know which John Smith you are, of course!

This is difficult for several reasons.

The top level dissertation.tex is something like

\documentclass[]{dissertation}
% Package setup, blabla
\begin{document}
\tableofcontents
\listoftables
\listoffigures
\include{./Chapter1/ch1}                        % Separate tex-file and folder for other chapters
\include{./Chapter2/ch2}
% \include for all the chapters
% Now in the end insert a single bibliography for all the chapters
\bibliographystyle{myStyle}                     % uses style defined in myStyle.bst
\addcontentsline{toc}{chapter}{Bibliography}    % Add it as a section in table of content
\bibliography{dissertation_bib_file}            % link to the bib file 
\include{./Biography/biography}
\end{document}

If I were using Word, I might just copy and paste some formatted lines to the end of the page and call it a day. But since I am a serious academic and want to do this the right way...it poses a slight struggle. I need to make bibtex format my publications and insert them at the end during the compile process. My requirements:

  1. I don't want the publication list to start with a "Reference" section heading like typical \bibliography{bib_file} commands produce.

  2. I want the items in the publication list to be in reverse chronological order, from most recent to least.

  3. I want to allow my publication list to be separate from the main bibliography section -- an item can appear in both places, and each list should be numbered independently.

Approach 1

Created a separate bibtex file mypub.bib, then inserted the following lines in biography.tex:

\nocite{*}                      % needed to show all the items in the bib-file. 
                                % Otherwise only those called with \cite{}
\bibliographystyle{myStyle}     % uses style defined in myStyle.bst
\bibliography{mypub}            % link to second bib file 

I don't remember the exact behavior, but it either

  1. Gives an error: pdflatex tells me I can't call \bibliography in multiple places. Or
  2. Inserts the entire dissertation's bibliography after the biography section.

Neither is desirable.

Approach 2

Searching for this problem online, you run across people wanting to

  1. Have a bibliography section for each chapter
  2. Divide the bibliography into nonoverlapping sets, such as journals, conference proceedings, etc.

Don't want those. I finally found this SO post, How to use multibib in order to insert a second bibliography as a part of an annex?, which besides suggesting multibib also mentions that biblatex would make everything a lot easier. But that would have required me to change the other chapters a bit. See also the multibib manual.

So new approach, in dissertation.tex:

% preamble stuff
\usepackage{multibib}
\newcites{pub}{headingTobeRemoved}  % create a new bib unit called 'pub' with heading 'headingTobeRemoved'
\begin{document}
% all the chapter includes
% Now in the end insert a single bibliography for all the chapters
\bibliographystyle{myStyle}                     % uses style defined in myStyle.bst
\addcontentsline{toc}{chapter}{Bibliography}    % Add it as a section in table of content
\bibliography{dissertation_bib_file}            % link to the bib file 
\include{./Biography/biography}
\end{document}

In biography.tex:

Blablabla memememememe
\nocitepub{*}
\bibliographystylepub{myStyle}
\bibliographypub{mypub}

Really important in the second reference section to use \nocitepub, \bibliographystylepub, and \bibliographypub such that it matches the name of the \newcites group.

Also important to run the following,

pdflatex dissertation.tex
bibtex dissertation.aux
bibtex pub.aux
pdflatex dissertation.tex
pdflatex dissertation.tex

where dissertation.aux contains the bib-entries to use in the main document's reference section, and pub.aux contains that for the biography's publication list.

This got me a list at the end of the biography chapter. But it starts with a big heading; it's not in reverse chronological order but alphabetical (because myStyle.bst organizes bibs that way); and the numbering is not ideal. For example, one of the items in mypub.bib is also in dissertation_bib_file.bib, and its numbering is the same in both places, such that in the publication list I have item orderings like 1, 2, 120, 37, 5...

No good.

Removing Header

In biography.tex, I included the following, suggested by an SO post:

Blablabla memememememe
\renewcommand{\chapter}[2]{}
\nocitepub{*}
\bibliographystylepub{myStyle}
\bibliographypub{mypub}

Internally, the bibliography implements its header as a section or chapter heading, so the \renewcommand suppresses it.

Get rid of weird ordering

This took the longest time and I could not find a satisfactory solution. Using

\usepackage[resetlabels]{multibib}

did not help. Nobody seems to even have this problem...so I gave up and just made the items use bullets instead. In biography.tex:

Blablabla memememeem

\renewcommand{\chapter}[2]{}
% ---- next three lines make bullets happen ---
\makeatletter
\renewcommand\@biblabel[1]{\textbullet} % Can put any desired symbol in the braces actually
\makeatother
% -----------------
\nocitepub{*}
\bibliographystylepub{myStyle}
\bibliographypub{mypub}

Reverse chronological order

This one is very tricky, and requires modifying the actual myStyle.bst file. This SO post gives a good idea of how these bst files work...they really have a weird syntax. Basically, in every bst file, a bunch of functions are defined and used to format individual reference entries from the bibtex entries; these entries are then sorted and inserted into the final file. The function that likely needs to be changed in any bst file is the presort function.

plainyr-rev.bst provides a working example of ordering reference entries in reverse chronological order. I decided to use jasa.bst, from the Journal of the American Statistical Association. The following are the changes needed:

%%%%%%% Extra added functions %%%%%%%%%%%%%%%
% From plainyr_rev.bst
FUNCTION {sort.format.month}
{ 't :=
  t #1 #3 substring$ "Jan" =
  t #1 #3 substring$ "jan" =
  or
  { "12" }
    { t #1 #3 substring$ "Feb" =
      t #1 #3 substring$ "feb" =
      or
      { "11" }
      { t #1 #3 substring$ "Mar" =
        t #1 #3 substring$ "mar" =
        or
        { "10" }
        { t #1 #3 substring$ "Apr" =
          t #1 #3 substring$ "apr" =
          or
          { "09" }
          { t #1 #3 substring$ "May" =
            t #1 #3 substring$ "may" =
            or
            { "08" }
            { t #1 #3 substring$ "Jun" =
              t #1 #3 substring$ "jun" =
              or
              { "07" }
              { t #1 #3 substring$ "Jul" =
                t #1 #3 substring$ "jul" =
                or
                { "06" }
                { t #1 #3 substring$ "Aug" =
                  t #1 #3 substring$ "aug" =
                  or
                  { "05" }
                  { t #1 #3 substring$ "Sep" =
                    t #1 #3 substring$ "sep" =
                    or
                    { "04" }
                    { t #1 #3 substring$ "Oct" =
                      t #1 #3 substring$ "oct" =
                      or
                      { "03" }
                      { t #1 #3 substring$ "Nov" =
                        t #1 #3 substring$ "nov" =
                        or
                        { "02" }
                        { t #1 #3 substring$ "Dec" =
                          t #1 #3 substring$ "dec" =
                          or
                          { "01" }
                          { "13" } % No month specified
                        if$
                        }
                      if$
                      }
                    if$
                    }
                  if$
                  }
                if$
                }
              if$
              }
            if$
            }
          if$
          }
        if$
        }
      if$
      }
    if$
    }
  if$
 
}

% Original jasa's presort function
%FUNCTION {presort}
%{ calc.label
%  label sortify
%  "    "
%  *
%  type$ "book" =
%  type$ "inbook" =
%  or
%    'author.editor.sort
%    { type$ "proceedings" =
%        'editor.sort
%        'author.sort
%      if$
%    }
%  if$
%  #1 entry.max$ substring$
%  'sort.label :=
%  sort.label
%  *
%  "    "
%  *
%  title field.or.null
%  sort.format.title
%  *
%  #1 entry.max$ substring$
%  'sort.key$ :=
%}

% New one from plainyr_rev.bst
FUNCTION {presort}
{
  % sort by reverse year
  reverse.year
  "    "
  *
  month field.or.null
  sort.format.month
  *
  " "
  *
  author field.or.null
  sort.format.names
  *
  "    "
  *
  title field.or.null
  sort.format.title
  *

  % cite key for definitive disambiguation
  cite$
  *

  % limit to maximum sort key length
  #1 entry.max$ substring$

  'sort.key$ :=
}

ITERATE {presort}

% Bunch of other stuff
% ...
EXECUTE {initialize.extra.label.stuff}

%ITERATE {forward.pass} % <--- comment out this step to avoid the 
                        %       year numbers in the reverse-chronological
                        %       entry to have letters following the year numbers

REVERSE {reverse.pass}

%FUNCTION {bib.sort.order}
%{ sort.label
%  "    "
%  *
%  year field.or.null sortify
%  *
%  "    "
%  *
%  title field.or.null
%  sort.format.title
%  *
%  #1 entry.max$ substring$
%  'sort.key$ :=
%}

%ITERATE {bib.sort.order}

SORT                    % Now things will sort as desired

Extra: Make my name bold

Might as well make things nicer and make my name bold in all the entries in my publication list. This post has many good approaches. Considering I'm likely only going to have to do this once (or very infrequently), I decided to simply alter the mypub.bib file.

Suppose your name is John A. Smith. Depending on the reference style and the specific bibtex entry, this might become something like Smith, J. A., Smith, John A., Smith, John, or Smith, J..

In bibtex, the author line may be something like:

% variation 1
author = {Smith, John and Doe, Jane}
% variation 2
author = {Smith, John A. and Doe, Jane}

Apparently the bst files simply go through these lines, split them on the keyword "and", take the last-name and first-name fields, and concatenate them together. And if only an initial is needed, they take the first letter of the first name. So we can simply make the correct fields bold:

% variation 1
author = { {\bf Smith}, {\bf J}{\bf ohn} and Doe, Jane}
% variation 2
author = { {\bf Smith}, {\bf J}{\bf ohn} {\bf A.} and Doe, Jane}

Note that the first name's first letter and the remaining letters are bolded separately, so the formatting stays correct even when the style abbreviates first names to initials.

Took way too long!

May 5, 2018 - Signal Processing Notes

There are some important signal processing points that I keep having to re-derive; they are rarely mentioned in textbooks and are buried deep in the math.

What is the physical significance of negative frequencies?

The FFT of a real signal usually has both positive and negative frequencies in its spectrum. In feedback analysis, we simply ignore the negative frequencies and look at the positive part.

A good illustration of what is actually happening is below:

spiral

See also this stackexchange

The spiral can spin counter-clockwise or clockwise, corresponding to positive or negative frequency. From the formula of the Fourier transform, \[F(\omega)=\int_{-\infty}^{\infty} f(t)e^{-j\omega t}dt\], we can see that the frequency spectrum of a signal corresponds to a decomposition into such spirals. Therefore both cos(ωt) and sin(ωt) are each the sum of two complex exponentials, e.g. cos(ωt) = (e^{jωt} + e^{-jωt})/2. This is in fact true for all real, measurable signals.

In contrast, in the Hilbert transform, the analytic signal x_c(t) = x_r(t) + j·HT{x_r(t)} has a single-sided spectrum, which is not physically realizable. However, it allows us to derive the instantaneous amplitude and frequency of the actual time-domain signal x_r(t).
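
Both claims are easy to check numerically. A small Python sketch (tone frequency and sampling rate are arbitrary), using scipy.signal.hilbert for the analytic signal:

import numpy as np
from scipy.signal import hilbert

fs, f0 = 1000, 50                    # sampling rate and tone frequency (Hz)
t = np.arange(0, 1, 1 / fs)
x = np.cos(2 * np.pi * f0 * t)       # real signal

freqs = np.fft.fftfreq(len(x), 1 / fs)
X = np.fft.fft(x)
print(abs(X[freqs == 50]), abs(X[freqs == -50]))    # equal peaks at +/- 50 Hz

Xc = np.fft.fft(hilbert(x))          # analytic signal: x + j*HT{x}
print(abs(Xc[freqs == 50]), abs(Xc[freqs == -50]))  # single-sided: -50 Hz is ~0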

Why is a linear phase response desirable for a system?

I always get tripped up by the language. I used to see a phase response that is not flat and think: the different frequencies have different phase lags, therefore they are going to combine at the output and the signal is going to be distorted.

The first part of that thought is correct, but the implication is different. A 90-degree phase lag for a low-frequency signal is a much longer time-domain delay than a 90-degree phase lag for a high-frequency signal. Therefore, if all frequencies need to be delayed by the same amount of time, the phase lag needs to grow in proportion to frequency, i.e. linearly.

In math, suppose an input signal x(t) = sin(ωt) is filtered to the output y(t) = sin(ω(t − τ)), which is a pure delay by τ. This can be written as y(t) = sin(ωt − ωτ) = sin(ωt + ϕ(ω)), where ϕ(ω) = −ωτ is the phase response of the system; the phase becomes more negative for larger frequencies.

And this is why linear phase response is desirable for a system.
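
As a sanity check, here is a small numpy sketch (my own toy signal) that applies the linear phase ϕ(ω) = −ωτ in the frequency domain and confirms that every component comes out delayed by the same number of samples:

import numpy as np

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

tau = 0.01                                  # desired delay: 10 ms = 10 samples
freqs = np.fft.fftfreq(len(x), 1 / fs)
H = np.exp(-1j * 2 * np.pi * freqs * tau)   # linear phase: phi(w) = -w*tau

y = np.fft.ifft(np.fft.fft(x) * H).real
print(np.allclose(y, np.roll(x, 10)))       # True: both sines delayed by 10 samples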

Feb 22, 2018 - Hints from biology

The success of deep neural network techniques has been so great that it has led to hype that general intelligence can be achieved soon. I have my doubts, based mostly on the amount of data needed compared to the number of examples people need to learn, but also on the lack of known biological mechanisms that could implement the backpropagation central to artificial neural networks. More neuroscience discoveries are needed.

The great Hinton made a big splash with the capsule theory. Marblestone et al., 2016 had many speculations on how backprop could be approximated by various mechanisms.

Some cool recent results:

  1. Delahunt 2018: Biological mechanisms for learning, a computational model of olfactory learning in the Manduca sexta Moth, with applications to neural nets. TR interview

    The insect olfactory system [...] process olfactory stimuli through a cascade of networks where large dimension shifts occur from stage to stage and where sparsity and randomness play a critical role in coding.

    Notably, a transition occurs from encoding the stimulus in a low-dimensional parameter space to encoding it in a high-dimensional parameter space, which is not too common in ANN architectures. Reminds me of kernel transformations.

    Learning is partly enabled by a neuromodulatory reward mechanism: octopamine stimulation of the antennal lobe, whose increased activity induces rewiring of the mushroom body through Hebbian plasticity. Enforced sparsity in the MB focuses Hebbian growth on the neurons most important for the representation of the learned odor.

    A potential augment to backprop, at least. Also, octopamine opens new transmission channels for wiring, expanding the solution space.

  2. Akrami 2018: PPC represents sensory history and mediates its effects on behavior. Shows PPC's role in the representation and use of prior stimulus information. Cool experiments showing the Bayesian aspect of the brain. As much as I dislike Bayesian techniques...

  3. Nikbakht 2018 Supralinear and supramodal integration of visual and tactile signals in rats: psychophysics and neuronal mechanisms.

    • Rats combine vision and touch to distinguish two grating orientation categories.
    • Performance with vision and touch together reveals synergy between the two channels.
    • PPC neuronal responses are invariant to modality.
    • PPC neurons carry information about object orientation and the rat's categorization.

Jan 31, 2018 - More linear models

Linear models are a crazy big subject, dividing into:

  • LM (linear models) -- this includes regressions by ordinary least squares (OLS), as well as all analysis of variance methods (ANOVA, ANCOVA, MANOVA). Assumes normal distribution of the dependent variable.

  • GLM (generalized linear models) -- linear predictor and dependent variable related via link functions (Poisson and logistic regression fall into this category).

In both LM and GLM, ANOVA and friends (and their GLM analogs) are achieved through the introduction of indicator variables representing "factor levels".

  • GLMM (generalized linear mixed models) -- extension of GLMs to include random effects; "mixed" refers to having both fixed and random effects.

A book that seems to have reasonable treatments on the connections is ANOVA and ANCOVA: A GLM approach.

MANOVA, repeated measures, and GLMM

Repeated measures is when the same set of subjects, divided into different treatment groups, is measured over time. It would be silly to treat all of the measurements as independent. The advantage is that there is less variance between measurements taken on the same subject. Consequently, a repeated-measures design allows the same statistical power with fewer subjects (compared to a design where N different subjects are measured at the different time points and all these measurements are treated as independent).

This can be achieved through:

  1. MANOVA approach - each subject's measurements across the time points form a vector, which becomes the dependent variable of our MANOVA.

  2. GLMM - controls for non-independence among the repeated observations for each individual by adding one or more random effects for individuals to the model.

Good explanation

Variance, Deviance, Sum of Squares (SS), Clustering, and PCA

In LM and ANOVA-based methods, we try to divide the total sum of squares (measuring the response deviation of all data points from the grand mean) into components that can be explained by the predictor variables. A good mental picture to have is:

image1.

Don't mind that it is in 2D; it actually illustrates the generalization of ANOVA into MANOVA. In ANOVA, response values are distributed along a line; in MANOVA, they are distributed in n-dimensional space. As we see here, the within-group sum of squares (SSW) is the sum of squared distances from individual points to their group centroid. The among-group sum of squares (SSA) is the sum of squared distances from group centroids to the overall centroid. In ANOVA methods, a predictor's effect is significant if the F statistic, proportional to the ratio of SSA to SSW, exceeds the critical value.

The analogous method in GLMs is the analysis of deviance, where we compare likelihood ratios instead of SS ratios. Still a bitch though.

This framework of comparing dissimilarity or distance is very useful, especially in constructing randomization-based permutation tests. The excellent paper by Anderson 2001 describes how to use this framework to derive a permutation-based non-parametric MANOVA test. This was implemented in MATLAB in the Fathom toolbox.

A key insight is as shown in the following picture

image1

The sum of squared distances from individual points to their centroid is equal to the sum of squared interpoint distances divided by the number of points. Why is this important? Anderson explains:

In the case of an analysis based on Euclidean distances, the average for each variable across the observations within a group constitutes the measure of central location for the group in Euclidean space, called a centroid. For many distance measures, however, the calculation of a central location may be problematic. For example, in the case of the semimetric Bray–Curtis measure, a simple average across replicates does not correspond to the ‘central location’ in multivariate Bray–Curtis space. An appropriate measure of central location on the basis of Bray–Curtis distances cannot be calculated easily directly from the data. This is why additive partitioning (in terms of ‘average’ differences among groups) has not been previously achieved using Bray–Curtis (or other semimetric) distances. However, the relationship shown in Fig. 2 can be applied to achieve the partitioning directly from interpoint distances.

Brilliant! Now we can use any dissimilarity measure -- SS, deviance, absolute value -- to run all sorts of tests. Note how linear models are similar to clustering methods, but supervised (i.e. the groups are defined by the predictors rather than discovered).
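
Here is a quick numpy check of that Fig. 2 identity on random data: the sum of squared distances to the centroid equals the sum of squared pairwise distances divided by the number of points.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))           # 20 points in 3D

ss_centroid = ((X - X.mean(axis=0)) ** 2).sum()

diffs = X[:, None, :] - X[None, :, :]  # all pairwise differences
ss_pairs = (diffs ** 2).sum() / 2      # halve: each pair counted twice
print(ss_centroid, ss_pairs / len(X))  # equal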

PCA

PCA is pretty magical. Bishop's Pattern Recognition and Machine Learning devotes an entire chapter (12) on it, covering PCA as:

  1. Projecting data to subspace to maximize the projected data's variance. This is the traditional and easiest approach to understand.

  2. A complementary formulation to the previous one: projecting data onto a subspace to minimize the projection error. This is closely related to regression by OLS, except the error minimized is the orthogonal projection error rather than the vertical one. There is even a stackexchange question on it.

  3. Probabilistic PCA. In this approach, PCA is described as the maximum likelihood solution of a probabilistic latent variable model. In this generative view, the data is described as samples drawn according to

    p(x|z) = N(x|Wz + μ, σ²I)

    This is a pretty mind-blowing formulation, which links PCA, traditionally a dimensionality-reduction technique, with factor analysis, which describes latent variables. There is a big discussion on stackexchange on this generative view of PCA.

  4. Bayesian PCA. Bishop is a huge Bayesian lover, so of course he talks about making PCA Bayesian. A cool advantage is automatic selection of principal components, achieved by giving the vectors of W a prior which is then shrunk during the EM procedure. Another cool thing about PPCA/Bayesian PCA is that the iterative procedure allows estimating the top principal components without computing the entire covariance matrix or its eigenvalue decomposition.

  5. Factor Analysis. This is treated as a generalization of PPCA, where instead of σ2I, we have

    p(x|z) = N(x|Wz + μ, Ψ)

    where Ψ is still a diagonal matrix, but the individual variance values are not all the same. This has the following consequence for using PCA vs. FA, quoting amoeba:

    As the dimensionality n increases, the diagonal [of the sampling covariance matrix] becomes in a way less and less important because there are only n elements on the diagonal and O(n^2) elements off the diagonal. As a result, for large n there is usually not much of a difference between PCA and FA at all [...] For small n they can indeed differ a lot.
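
To tie views 1 and 2 down concretely, here is a minimal numpy sketch (random correlated data of my choosing) showing that the top eigenvector of the covariance matrix is the direction of maximal projected variance:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data
X = X - X.mean(axis=0)

C = X.T @ X / (len(X) - 1)             # sample covariance
w, V = np.linalg.eigh(C)               # eigh returns ascending eigenvalues
pc1 = V[:, -1]                         # top principal direction

print(np.var(X @ pc1, ddof=1), w[-1])  # projected variance equals top eigenvalue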

Nonparametric tests and Randomization tests

Here is a nice chart from Kording lab's page.

One common misconception among neuroscientists (and me, until a short while ago) is to use conventional non-parametric tests such as Wilcoxon, Mann-Whitney, Kruskal-Wallis, etc. as a cure-all for violations of the assumptions of parametric tests. However, these tests simply replace the actual response values by their ranks, then proceed with calculating sums of squares, using a chi-squared test statistic instead of an F statistic.

However, as Petraitis & Dunham 2001 points out,

The chi-squared distribution of the test statistic is based on the assumption that the ranked data for each treatment level are samples drawn from distributions that differ only in location (e.g. mean or median). The underlying distributions are assumed to have the same shape (i.e. all other moments of the distributions -- the variance, skewness, and so on -- are identical). These assumptions about nonparametric tests are often not appreciated, and ecologists often assume that such tests are distribution-free.

Hence the motivation for many permutation based tests, which Anderson notes can detect both variance and location differences (but not correlation difference). Traditional parametric tests, on the other hand, are also sensitive to correlation and dispersion differences (hence why everyone loves them).

Jan 30, 2018 - Generalized Linear Model (GLM) - ANCOVA and Multiple comparison

I have the data shown below (image1), where the dependent variable is the count data bincnts (bin counts), which varies with the continuous covariate durations and the categorical predictor reward_cond (shown by the colors). Traditionally, this would call for a 1-way Analysis of Covariance (ANCOVA), which is essentially ANOVA with a covariate. Intuitively, ANCOVA fits different regression lines to different groups of data, and checks whether the categorical variable has a significant effect on the intercepts of the fitted regression lines (insert some pictures)

ANCOVA has the model assumptions that:

  1. Homogeneity of slopes.
  2. Normal distribution of the response variable.
  3. Equal variance due to different groups.
  4. Independence

My data clearly violates (2), being count data and all.

I could do a permutation-based ANCOVA, as outlined in Petraitis & Dunham 2001 and implemented in the Fathom toolbox, but it takes forever, running aoctool 1000 times for each permutation test (slope homogeneity, main effects, covariate, and multiple comparisons require separate permutations).

ANCOVA can also be implemented in the GLM framework, although at this point by ANCOVA I really mean linear models with categorical and continuous predictor variables and a continuous dependent variable. In MATLAB, with Poisson regression, we can do:

ds = table(durations, reward_cond, bincnts(:,n), 'VariableNames', {'durations', 'reward_cond','bincnts'});
lm = stepwiseglm(ds, 'constant', 'upper', 'interactions', 'distribution', 'poisson', 'DispersionFlag', true)

This gives the following:

1. Adding reward_cond, Deviance = 1180.1526, FStat = 46.078691, PValue = 1.1400431e-19
2. Adding durations, Deviance = 1167.4079, FStat = 8.7552452, PValue = 0.0031780653

lm = 


Generalized Linear regression model:
    log(bincnts) ~ 1 + durations + reward_cond
    Distribution = Poisson

Estimated Coefficients:
                     Estimate        SE        tStat       pValue  
                     _________    ________    _______    __________

    (Intercept)         2.9845    0.039066     76.397             0
    durations        -0.037733    0.012949     -2.914     0.0036674
    reward_cond_0     -0.20277    0.024162    -8.3921    2.1507e-16
    reward_cond_1     0.042287    0.061958    0.68252       0.49511


805 observations, 801 error degrees of freedom
Estimated Dispersion: 1.46
F-statistic vs. constant model: 33.9, p-value = 1.19e-20

Here we are fitting the GLM log(bincnts) = β0 + β1·durations + z0·(β2 + β3·durations) + z1·(β4 + β5·durations), where z0 and z1 are indicator variables for reward_cond = 0 and reward_cond = 1 (reward_cond = -1 is the baseline, absorbed into the intercept and durations terms).

durations is significant, but the interaction terms are not, so we actually see homogeneity of slopes. reward_cond_0 is significant but reward_cond_1 is not; this means the regression line (duration vs. bincnts) for the baseline reward condition (reward_cond = -1, represented by the intercept and durations terms) differs significantly from that of reward_cond_0, but not from that of reward_cond_1.

The fitted response values are shown below image2

Note in the printout that the dispersion is greater than 1 (1.46), meaning the variance of the data is bigger than expected for a Poisson distribution (where var = mean).

As can be seen in the fitted plot, the blue line is lower than the red and green lines, and red ≈ green. To get all these orderings we need to do post-hoc pairwise tests between the factor levels, and adjust for multiple comparisons. For regular ANCOVA, this can be done in MATLAB with aoctool and multcompare. Unfortunately multcompare doesn't work for GLMs.

So we do it in R. Here are my steps after looking through a bunch of stack exchange posts (the last one is on Poisson diagnostics).

First write the table in MATLAB to csv:
>> writetable(ds, '../processedData/lolita_s26_sig072b_bincntData.csv');

In R:

> mydata = read.csv("lolita_s26_sig072b_bincntData.csv")
> mydata$reward_cond = factor(mydata$reward_cond)
> model1 = glm(bincnts ~ durations + reward_cond + durations*reward_cond, family=poisson, data=mydata)
> summary(model1)

Call:
glm(formula = bincnts ~ durations + reward_cond + durations * 
    reward_cond, family = poisson, data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1671  -0.8967  -0.1151   0.6881   5.0604  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.98942    0.03994  74.841  < 2e-16 ***
durations              -0.03960    0.01396  -2.836 0.004568 ** 
reward_cond0           -0.24969    0.06993  -3.571 0.000356 ***
reward_cond1            0.19174    0.13787   1.391 0.164305    
durations:reward_cond0  0.01557    0.02305   0.675 0.499458    
durations:reward_cond1 -0.04990    0.04392  -1.136 0.255930    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1315.6  on 804  degrees of freedom
Residual deviance: 1165.3  on 799  degrees of freedom
AIC: 4825.8

Number of Fisher Scoring iterations: 4

None of the interaction terms are significant, so we fit a reduced model:

> model1 = glm(bincnts ~ durations + reward_cond, family=poisson, data=mydata)
> summary(model1)

Call:
glm(formula = bincnts ~ durations + reward_cond, family = poisson, 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1766  -0.8773  -0.1229   0.7185   5.0664  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.98452    0.03238  92.174  < 2e-16 ***
durations    -0.03773    0.01073  -3.516 0.000438 ***
reward_cond0 -0.20277    0.02003 -10.125  < 2e-16 ***
reward_cond1  0.04229    0.05135   0.823 0.410243    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1315.6  on 804  degrees of freedom
Residual deviance: 1167.4  on 801  degrees of freedom
AIC: 4824

Number of Fisher Scoring iterations: 4

This is pretty similar to MATLAB's results. Before continuing, we can check the dispersion, which is equal to the residual deviance divided by the degrees of freedom. According to this post:

> library(AER)
> deviance(model1)/model1$df.residual
[1] 1.457438
> dispersiontest(model1)

    Overdispersion test

data:  model1
z = 5.3538, p-value = 4.306e-08
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.448394 

Again, our dispersion agrees with MATLAB's estimate. Would this overdispersion invalidate our model fit? According to this CountData document, when this happens we can adjust the test for overdispersion by using an F test with an empirically estimated scale parameter instead of a chi-squared test. This is carried out in R by using the quasipoisson family in place of poisson errors. We do this by fitting the model with a quasipoisson family first, then comparing the model fit against a constant model (also quasipoisson) using anova with test="F":

> model1 = glm(bincnts ~ durations + reward_cond, family=quasipoisson, data=mydata)
> model2 = glm(bincnts ~ 1, family=quasipoisson, data=mydata)
> anova(model1, model2, test="F")
Analysis of Deviance Table

Model 1: bincnts ~ durations + reward_cond
Model 2: bincnts ~ 1
  Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
1       801     1167.4                                 
2       804     1315.6 -3  -148.18 33.931 < 2.2e-16 ***

Still significant! Nice. Another common way to deal with overdispersion is to use a negative binomial distribution instead. In general, according to Penn State Stat504, overdispersion can be adjusted for by using a quasi-likelihood function in the fitting.

Now for multiple comparisons: how do I know if the fitted means are significantly different between reward levels? Use the multcomp R package's glht function:

> library(multcomp)
> summary(glht(model1, mcp(reward_cond="Tukey")))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = bincnts ~ durations + reward_cond, family = quasipoisson, 
    data = mydata)

Linear Hypotheses:
            Estimate Std. Error z value Pr(>|z|)    
0 - -1 == 0 -0.20277    0.02416  -8.392  < 1e-05 ***
1 - -1 == 0  0.04229    0.06196   0.683    0.762    
1 - 0 == 0   0.24505    0.06018   4.072 9.42e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

Here we see that the difference between reward_cond=0 and reward_cond=-1, and that between reward_cond=1 and reward_cond=0, are significant, as we suspected from looking at the fitted plots. We are fortunate that there are no durations:reward_cond interactions.

But how would we do multiple comparison when interaction effects are present -- i.e. the regression lines have different slopes? We first fit the same data with interaction effects (which we know are not significant):

> model1 = glm(bincnts ~ durations + reward_cond + durations*reward_cond, family=quasipoisson, data=mydata)
> summary(model1)

Call:
glm(formula = bincnts ~ durations + reward_cond + durations * 
    reward_cond, family = quasipoisson, data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1671  -0.8967  -0.1151   0.6881   5.0604  

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.98942    0.04822  61.999  < 2e-16 ***
durations              -0.03960    0.01686  -2.349  0.01905 *  
reward_cond0           -0.24969    0.08442  -2.958  0.00319 ** 
reward_cond1            0.19174    0.16643   1.152  0.24963    
durations:reward_cond0  0.01557    0.02783   0.559  0.57601    
durations:reward_cond1 -0.04990    0.05302  -0.941  0.34693    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.457184)

    Null deviance: 1315.6  on 804  degrees of freedom
Residual deviance: 1165.3  on 799  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Previously, our second argument to glht was mcp(reward_cond="Tukey"). This tells glht (general linear hypothesis testing) to compare the levels of reward_cond and adjust the p-values with Tukey's method. When interaction effects are present, we need to define a contrast matrix instead.

> contrast.matrix = rbind("durations:reward_cond_-1 vs. durations:reward_cond0" = c(0,0,0,0,1,0),
+                         "durations:reward_cond_-1 vs. durations:reward_cond1" = c(0,0,0,0,0,1),
+                         "durations:reward_cond_0  vs. durations:reward_cond1" = c(0,0,0,0,-1,1))
> contrast.matrix
                                                    [,1] [,2] [,3] [,4] [,5]
durations:reward_cond_-1 vs. durations:reward_cond0    0    0    0    0    1
durations:reward_cond_-1 vs. durations:reward_cond1    0    0    0    0    0
durations:reward_cond_0 vs. durations:reward_cond1     0    0    0    0   -1
                                                    [,6]
durations:reward_cond_-1 vs. durations:reward_cond0    0
durations:reward_cond_-1 vs. durations:reward_cond1    1
durations:reward_cond_0 vs. durations:reward_cond1     1

The columns of the contrast matrix correspond to the rows of the glm coefficients. By default, the base level is the reference, so the first row of the contrast matrix asks for a comparison between the base slope (durations:reward_cond_-1) and durations:reward_cond0. We give this to glht:

> summary(glht(model1, contrast.matrix))

     Simultaneous Tests for General Linear Hypotheses

Fit: glm(formula = bincnts ~ durations + reward_cond + durations * 
    reward_cond, family = quasipoisson, data = mydata)

Linear Hypotheses:
                                                         Estimate Std. Error
durations:reward_cond_-1 vs. durations:reward_cond0 == 0  0.01557    0.02783
durations:reward_cond_-1 vs. durations:reward_cond1 == 0 -0.04990    0.05302
durations:reward_cond_0  vs. durations:reward_cond1 == 0 -0.06547    0.05493
                                                         z value Pr(>|z|)
durations:reward_cond_-1 vs. durations:reward_cond0 == 0   0.559    0.836
durations:reward_cond_-1 vs. durations:reward_cond1 == 0  -0.941    0.603
durations:reward_cond_0  vs. durations:reward_cond1 == 0  -1.192    0.445
(Adjusted p values reported -- single-step method)

And as we already knew, none of the interactions are significant.

Dec 4, 2017 - Even-Chen et al 2017: Augmenting intracortical brain-machine interface with neurally driven error detectors

Augmenting intracortical brain-machine interface with neurally driven error detectors

Detecting errors (outcome-error or execution-error) while performing tasks via BMI from the same cortical populations can improve BMI performance through error prevention and decoder adaptation. Justin Sanchez was one of the earliest proponents of this idea and published several theoretical works on it. Joseph Francis conducted studies on detecting reward signals in M1 and PMd in monkeys in 2015 and continues to do so. There have been multiple studies on detecting these error signals in humans, via ECoG or EEG. The detection of error signals or reward signals, which can be closely related, in motor and premotor areas has been rising in popularity owing to its implications for improving BMI.

It looks like Nir Even-Chen has gotten this very important first flag (with an experimental paradigm pretty similar to what I wanted to do as well, so at least my hypothesis is validated).

Experiment

The monkey first performed arm reaches to move a cursor to different targets. This training data was used to fit either a ReFIT decoder (requiring a second decoder fitting) or just a regular FIT decoder. Both decoders are velocity Kalman filters that utilized intention estimation when fitting the training data.

During the BMI task, whenever the cursor overlaps a target, a 300ms hold period starts. The color of the cursor changes depending on whether the hovered target is correct. If this hold period is completed, that target is selected. Target selection is followed by a 600ms waiting period, after which liquid reward and an auditory cue signal the outcome of the selection.

This target reach task is fashioned as a "typing task", i.e. the goal is to select specific sequences of targets, or letters.

Neural signals used were threshold-crossings.

Results

Task outcome-related neural differences

Trial-averaged PSTHs based on task outcome showed significant electrode-wise differences in the period [-300, 600ms] with respect to selection onset.

Online decoding of task outcome

This motivates decoding the task outcome from activities within different lengths of time in this window, on a trial-by-trial basis. The decoder used was a linear SVM on the top five principal components (PCs). There can be multiple ways of performing the PCA dimensionality reduction:

  1. Use the first n BMI trials as training trials. The task difficulty can be varied to achieve a certain success rate. Get the trial averages for the different task outcomes, perform PCA on them, and train the SVM using the top five PCs (see the sketch after this list).

In subsequent trials, project the trial's activities in the same time window onto the previously selected PCs, then run the SVM.

  2. Initialization of the decoder is the same as above. However, with every new trial, the error decoder can be retrained. More trials would then lead to a more accurate decoder. As the authors noted,

We found that decoding performance converged after a large quantity of approximately 2000 training trials, and that these decoders worked well across days.

  3. Initialize the outcome decoder using previous days' data -- this was the approach taken during the online experiments.
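
A minimal sketch of approach 1 in R, with made-up data, and simplified so that PCA is run directly on single-trial activity rather than on the trial-averaged responses as in the paper (the e1071 package provides the SVM):

    # Hypothetical data: 200 training trials, 50 neural features per trial.
    library(e1071)
    set.seed(1)
    X <- matrix(rnorm(200 * 50), nrow = 200)    # trials x features
    y <- factor(sample(c("success", "fail"), 200, replace = TRUE))

    pca <- prcomp(X, center = TRUE)             # PCA on training activity
    Z   <- pca$x[, 1:5]                         # top five PCs
    clf <- svm(Z, y, kernel = "linear")         # linear SVM on PC scores

    # A new trial is centered and projected into the same PC space, then classified.
    xnew <- matrix(rnorm(50), nrow = 1)
    znew <- scale(xnew, center = pca$center, scale = FALSE) %*% pca$rotation[, 1:5]
    predict(clf, znew)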

Online error-correction

Two methods of error-correction in the context of the experiment were implemented:

  1. Error auto-deletion: After detecting that an error has happened, the previously selected target or "letter" will be deleted and the previous target will be cued again.

  2. Error prevention: As the task-outcome can be decoded with decent accuracy before target selection is finalized, when an error outcome is detected, the required hold period is extended by 50ms, allowing the monkey more time to move the cursor. This is actually pretty clever.

They found that error prevention resulted in higher performance as measured by "bit-rate" for both monkeys.

Outcome error signal details

The first worry that comes to mind is whether these outcome error signals are in fact encoding the kinematic differences associated with different trial outcomes. These kinematic differences include cursor movements and arm movements (the monkey's arms were free to move).

Other confounding variables include (1) reward expectation (2) auditory feedback difference (3) colored cue differences.

To control for kinematic differences, three analyses were done:

  1. Linear regressions of the form \(y_k = Ax_k + b\) were performed, where \(x_k\) includes the cursor velocity or hand velocity, and \(y_k\) represents the neural activity vector at time k. The residuals \(y_k^{res} = y_k - Ax_k - b\) were then used to classify task outcome, and this did not affect the accuracy very much.

This analysis makes sense; however, why do they only regress out either cursor or hand velocity, but not both at the same time??

  2. Used either the hand or cursor velocity to decode trial outcome. The results were significantly better than chance, but also significantly worse than those using the putative error signal.

  3. Because of the BMI paradigm, there is knowledge of the causal relationship between the neural activities and the cursor velocity, as defined by the matrix M that linearly maps the two in the Kalman filter equation.

From the fundamental theorem of linear algebra, we know that a matrix is only capable of mapping vectors into its row space. This means the cursor velocity results only from the projection of the neural activity vectors onto M's row space. Shenoy and colleagues term this the output-potent subspace of the neural population activity.

In contrast, the output-null subspace is orthogonal to the output-potent subspace, and is therefore the null space of M. Thus, if the error signal is unrelated to the neural activities responsible for the decoded kinematics, we would expect it to lie in the output-null subspace. Quantitatively, this means the projection of the task-outcome signal into the output-null subspace should explain the majority of its variance.

To extract the outcome-related signal, neural activities are first trial-averaged based on the task outcome (success or fail), then subtracted from each other. The row space and null space of M are found via SVD. The outcome-related matrix (\(N \times T\), N = number of neurons, T = number of time bins) is then projected into these spaces. The variance of each projection is calculated by first subtracting the row means from each row, then taking the sum of squares of all elements in the matrix.
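
A minimal sketch of this variance computation in R, with hypothetical dimensions and random data standing in for the decoder matrix and the outcome-difference signal:

    # M_dec: 2 x N decoder matrix (x/y velocity from N neurons).
    # D: N x T outcome-difference matrix (success PSTH minus fail PSTH).
    N <- 96; T <- 45
    M_dec <- matrix(rnorm(2 * N), nrow = 2)
    D     <- matrix(rnorm(N * T), nrow = N)

    s      <- svd(M_dec)               # row space of M_dec is spanned by s$v
    V_pot  <- s$v                      # output-potent directions (N x 2)
    P_pot  <- V_pot %*% t(V_pot)       # projector onto the row space
    P_null <- diag(N) - P_pot          # projector onto the null space

    # Subtract row means, then take the sum of squares of all elements.
    proj_var <- function(A) sum((A - rowMeans(A))^2)
    proj_var(P_null %*% D) / proj_var(D)   # fraction of variance in the null space

Note that with random data this fraction is trivially high, since the null space contains N-2 of the N dimensions; the meaningful comparison in the paper is done on the real outcome signal.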

It turns out that the variance of the outcome signal explained by the projection into the null space is much greater than that explained by the row space. A good result.

To visualize the error signal, the principal components of the "difference mode" of the neural activities were plotted. The idea of applying "common mode" and "difference mode" to neural activities is similar to ANOVA quantifying the between-group and within-group variances. The common mode is the trial average regardless of task outcome. The difference mode is the difference between the trial-averaged success trials and failed trials.

To control for reward expectation, control experiments were run in which rewards were delivered on every trial regardless of success. It was found that this did not make a significant difference to the decoding performance. Not sure how they look, but according to Ramakrishnan 2017, M1 and PMd neurons exhibit a decreased firing rate in response to lower-than-expected reward. This, combined with the similar decoding performance, is a good sign that this error signal is different from reward expectation.

To control for auditory feedback, it was turned off; decoding performance did not decrease significantly.

To control for color cues, the color of the selected target stayed constant. This resulted in a significant but minor (5%) performance decrease. It may be due to the monkey's increased uncertainty in the absence of color changes. Or maybe it is due to a change in execution error -- the monkey expects the target to change, but it doesn't. More work needs to be done here.

It is very surprising that their monkeys performed so many trials under so many different conditions...in my experience they either freak out and perform terribly or just refuse, and we have to wait for them to get their shit together again.

Execution-Error

Execution-error differs from outcome-error in that the former is execution sensitive. In this context, execution-error implies an error-signal that varies with where the correct target is with respect to the selected target. In contrast, outcome-error is simply whether the selected target is correct.

I am not convinced that the authors' definition is correct here. Technically, outcome-error should be when the monkey selects the letter "A" but the letter "B" appears, and execution-error is when the monkey wants to move the cursor left, but the cursor goes in another direction.

Regardless, it was found that the direction of the correct target explained a small percentage (~15%) of the error-signal variance. Very promising!


This paper has basically validated and performed my planned (focus on past tense) first-stage experiments using BMI-controlled cursor tasks. Another thing to note is that PMd shows significant outcome modulation earlier than M1, fitting with the role of that area.

The next step would probably be online adaptation of the decoder informed by the error signal. An important point yet to be addressed: just as degrading and nonstationary signals motivate online adaptation of the kinematic decoder, the error-signal decoder would also require adaptation. This adaptation in the case of BMI-controlled typing is simple -- is the backspace being pressed? In other tasks, external knowledge would still be needed...

Oct 14, 2017 - Chi-squared post-hoc test, and how trolly statistical testing is

A chi-squared test is any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. It is commonly used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

The calculations for this test are fairly simple (good examples here), and it is used to compare whether the proportions of a category in a group are significantly different from expected.


On first look, the difference between ANOVA and chi-squared is pretty straight-forward. ANOVA tries to identify whether the variance observed in a dependent variable can be explained by different levels of categorical independent variables.

Do hours of sleep affect health? (One-way ANOVA)

Do gender and hours of sleep affect health? (Two-way ANOVA)

Using ANOVA, we would survey a bunch of people about their hours of sleep, gender, and health. Suppose we divide hours of sleep into 3 levels, and the health score varies from 1 to 10. In one-way ANOVA, for example, we would get the mean and standard deviation of the health scores for people with low, medium, or high hours of sleep. Depending on the overlap between those distributions, we can either reject or fail to reject the hypothesis that hours of sleep affect health.

Chi-squared can also be used to answer whether there is a relationship between hours of sleep and health. We can build a contingency table where the columns are levels of sleep, and the rows are levels of health score. The chi-squared test would then tell us whether the number of people in each cell deviates from what we would expect if sleep and health were independent.
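
A quick sketch in R with a made-up table:

    # Hypothetical counts: rows = health level, columns = hours of sleep.
    tab <- matrix(c(40, 20, 25, 30, 15, 45), nrow = 2,
                  dimnames = list(health = c("low", "high"),
                                  sleep  = c("low", "medium", "high")))
    chisq.test(tab)   # omnibus test of independence between sleep and health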

Of course, we can also simply divide the health score into either high or low, and make a logistic regression where the independent variable is the hours of sleep.

ANOVA is probably the most appropriate test here. Different people would give different answers to the same question. The point is that simply saying "chi-squared test deals with count data" is right and wrong at the same time. But I digress.


So, the chi-squared test itself serves as an omnibus test like ANOVA, indicating whether the observed counts in the cells of the contingency table are significantly different from expected. But it does not tell us which cells differ. In ANOVA, there is post-hoc testing, and MATLAB handles it very easily by passing the result from anova1 or anovan into multcompare, which also handles multiple comparisons.
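
For reference, the analogous omnibus-then-post-hoc flow in R (the language used earlier in this post) would be aov followed by TukeyHSD, sketched here with made-up data:

    # One-way ANOVA omnibus test, then Tukey-adjusted pairwise comparisons.
    health <- c(rnorm(30, 5), rnorm(30, 6), rnorm(30, 7))
    sleep  <- factor(rep(c("low", "medium", "high"), each = 30))
    fit <- aov(health ~ sleep)
    summary(fit)    # omnibus: does sleep level explain variance in health?
    TukeyHSD(fit)   # which pairs of sleep levels differ?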

In comparison, post-hoc chi-squared testing is not as common -- MATLAB does not have it, and it is not well-documented in other packages like SPSS.

There are several methods documented. My favorite, and probably the most intuitive one, is the residual method of Beasley and Schumacker 1995. After the omnibus chi-squared test rejects the null hypothesis, the post-hoc steps are:

  1. Make the contingency table M as in any Chi-squared test.

  2. Get the expected value \(E(i,j)\) for each cell. If \([i,j]\) indexes the table M, then \(E(i,j)=\left(\sum_i M(i,j)\right)\left(\sum_j M(i,j)\right)/n\), where \(n=\sum_i\sum_j M(i,j)\).

  3. Obtain the standardized residual for each cell: \(e(i,j)=\frac{M(i,j)-E(i,j)}{\sqrt{E(i,j)}}\). These values are equivalent to the square roots of each cell's chi-squared values.

  4. The standardized residuals follow a standard normal distribution, so we can obtain two-tailed or one-tailed p-values from them. Multiple comparison procedures can then be applied as usual to the resulting p-values.

MATLAB implementation here.
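
The same steps can be sketched in R, reusing the hypothetical table from above; conveniently, chisq.test already returns the Pearson residuals \((M(i,j)-E(i,j))/\sqrt{E(i,j)}\):

    res   <- chisq.test(tab)
    res$residuals                            # standardized (Pearson) residuals
    pvals <- 2 * pnorm(-abs(res$residuals))  # two-tailed p-values from N(0,1)
    matrix(p.adjust(pvals, method = "holm"), nrow = nrow(tab),
           dimnames = dimnames(tab))         # corrected for multiple comparisons

chisq.test also returns adjusted standardized residuals in $stdres, which some authors prefer for this post-hoc procedure.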

Oct 14, 2017 - More Jekyll and CSS

Github's gh-pages has updated to use rouge, a pure Ruby syntax highlighter. I have had enough of the build-failure emails caused by my use of the Python highlighter pygments. Time to upgrade.

Installing Ruby
My Ubuntu machine was outdated; according to github, I should have Ruby 2.x.x rather than staying at 1.9.

Installed the Ruby version manager -- rvm -- and followed the instructions on the github page. Had a strange permission error [https://github.com/bundler/bundler/issues/5211] during bundle install; finally solved it by deleting the offending directory.

Basic syntax highlighting using fenced code blocks can be achieved following the instructions here; however, enabling line numbers requires using the {% highlight <language> linenos %} tag, which does not work with the custom rouge syntax highlighting themes out of the box. It required a bunch of CSS styling to get them to work. In the following steps, /myjekyll represents the path to my jekyll site's root directory.

  1. All my css style sheets are in the directory /myjekyll/css. In my /myjekyll/_includes/header.html file, the syntax highlighting stylesheet is included as syntax.css. I did

    rougify style monokai > /myjekyll/css/syntax.css
    

This will give me good-looking fenced code blocks with the monokai theme, but the syntax highlighting and line numbers in line-numbered blocks will be completely screwed up. Specifically, the text colors will be those of the default highlighting theme.

  2. In the new syntax.css stylesheet, we have the following:

    .highlight {
      color: #f8f8f2;
      background-color: #49483e;
    }
    

    We want the code blocks to be consistent with these colors. Inspecting the code block elements allows us to set the appropriate css properties. Notably, a highlighted code block with line numbers is implemented as a table: one td for the line numbers, and one td for the code. I added the following to my /myjekyll/css/theme.css file (or whatever other stylesheet is included in your header):

    /* Syntax highlighting for monokai -- see syntax.css */
    .highlight pre,
    .highlight .table 
    {
        border: box;
        border-radius: 0px;
        border-collapse: collapse;
        padding-top: 0;
        padding-bottom: 0;
    }
    
    .highlight table td { 
        padding: 0px; 
        background-color: #49483e; 
        color: #f8f8f2;
        margin: 0px;
    }
    
    /* Border to the right of the line numbers */
    .highlight td.gutter {
        border-right: 1px solid white;
    }
    
    .highlight table pre { 
        margin: 0;
        background-color: #49483e; 
        color: #f8f8f2;
    }
    
    .highlight pre {
        background-color: #49483e; 
        color: #f8f8f2;
        border: 0px;    /* no border between cells! */
    }
    
    /* The code box extend the entire code block */
    .highlight table td.code {
        width: 100%;
    } 
    
  3. A fenced code block by default wraps long lines instead of creating a horizontal scrollbar, unlike the highlight tag. According to the internet, the scrollbar behavior can be restored by adding this to the stylesheet:

    /* fence-block style highlighting */
    pre code {
        white-space: pre;
    }
    
  4. One last modification for in-line code highlighting:

    /* highlighter-rouge */
    code.highlighter-rouge {
        background-color: #f8f8f8;
        border: 1px solid #ddd;
        border-radius: 4px;
        padding: 0 0.5em;
        color: #d14;
    }
    
  5. A final thing that is broken about the

    {% highlight %}

    tag is that using it inside a list will break the list. In this post, all items after the list with line numbers restarted their numbering from one. According to the ticket here, there is no easy solution to this because it is related to the kramdown parser. Using different indentation (3 spaces or 4 spaces) in list items does not change this. Some imperfect solutions are suggested here and here. None fixes the indentation problem. But, by placing

    {:start="3"}

    one line before my third list item allows the following items to have the correct numbering.