May 1, 2016 - Fogassi, Rizzolatti 2005: Parietal Lobe: From Action Organization to Intention Understanding

Parietal Lobe: From Action Organization to Intention Understanding

Inferior parietal lobule (IPL) neurons were studied while monkeys performed motor acts embedded in different actions, and while they observed similar acts done by an experimenter. Most IPL neurons coding a specific act (e.g. grasping) showed markedly different activations depending on context (e.g. grasping for eating vs. grasping for placing).

Many motor IPL neurons also discharged during the observation of acts done by others. The same context-dependence applied when these acts were observed.

These neurons fired during the observation of an act, before the beginning of the subsequent acts that specify the action - thus these neurons not only code the observed motor act but also allow the observer to understand the agent's intention.


Posterior parietal cortex has traditionally been considered association cortex. Besides "putting together" different sensory modalities, it also:

  1. Codes motor actions
  2. Provides the representations of these motor actions with specific sensory information.

Experiment

Set 1

Investigates neuron selectivity during action performance:

  1. Monkey starts from a fixed position, reaches for and grasps a piece of food located in front of it, and brings the food to its mouth.
  2. Monkey reaches for and grasps an object, and places it in a box on the table.
  3. Monkey reaches for and grasps an object, and places it into a container near its mouth.

Most observed neurons discharged differentially with respect to context.

Control for confounds:

  1. The monkey grasps an identical piece of food in both conditions - it is rewarded with the same piece of food in situation 2.
  2. Situation 3 aims to determine whether arm kinematics determined the different firing behavior - the responses turn out to be pretty much identical to situation 2.
  3. Force exerted while grasping - a mechanical device that could hold the objects with two different strengths was used; no change in neuron selectivity.

Set 2

Studied 41 mirror neurons that discharge both during grasping observation and grasping execution, in an experiment where the monkey observed an experimenter performing those actions.

Results:

  1. Some neurons are selective for grasping to eat, others for grasping to place - most have the same specificity as when the monkey performed those actions.
  2. Most neurons have the same selectivity regardless of whether the grasped object is food or a solid object. Some discharged more strongly in the placing condition when the grasped object was food.

The authors suggest that while counter-intuitive, this makes sense since motor acts are not related to one another independently of the global aim of the action, but appear to form prewired intentional chains in which each motor act is facilitated by the previously executed one. This reminds me of Churchland and Shenoy's theory about action planning as a trajectory in the null space.

Consequences

The authors suggest that the monkey's decision about what to do with the object is undoubtedly made before the onset of the grasping movement. The action intention is set before the beginning of the movements and is already reflected in the first motor act. This motor reflection of action intention and the chained motor organization of IPL neurons have profound consequences on [...] understanding the intention of others.

I can see how the authors reach this conclusion, and it would nicely explain why passive observation training for BMI works (even though mostly M1 and S1 neurons are recorded). But this explanation would make more sense if most of the difference in discharge rate happened BEFORE the monkey physically grasps the object/food...

It has been generally accepted that mirror neurons allow the observing individual to understand the goal of the observed motor act. The authors' interpretation then deduces that because the monkey knows the outcome of the motor act it executes, it recognizes the goal of the motor act done by another individual when this act triggers the same set of neurons that are active during the execution of that act.

Applications to BMI

This paper suggests mirror neurons as a mechanism for evaluating intention; it also suggests that depending on context and object, a simple grasp action may be coded differently. Thus a BMI arm trained on passive observation of grasping an object to place might not work well when the user tries to grasp food to eat. This is similar to what the Schwartz 10-DOF arm encountered, where the human subject had trouble grasping objects. The problem only went away when the training was done in a virtual space.

Apr 30, 2016 - Chavarriaga, Sobolewski & Millan 2014: Errare machinale est: the use of error-related potentials in brain-machine interfaces

Errare machinale est: the use of error-related potentials in brain-machine interfaces

This review focuses on electrophysiological correlates of error recognition in the human brain, in the context of human EEG. Millan calls these error-related potentials (ErrPs). It has been proposed to use these signals to improve BMI, e.g. by training the BMI decoder via reinforcement learning.

Error-related brain activity

Early reports in EEG date back to the early 1990s (Falkenstein et al., 1991; Gehring et al., 1993). They showed a characteristic EEG event-related potential elicited after subjects committed errors in a speeded response choice task, characterized by a negative potential deflection, termed the error-related negativity (ERN), appearing over fronto-central scalp areas at about 50-100ms after a subject's erroneous response. The negative component is followed by a centro-parietal positive deflection.

Modulations of this latter component have been linked to the subject's awareness of the error. ERN amplitude seems to be modulated by the importance of errors in the given task. These signals have been shown to be quite reliable over time and across different tasks.

Very encouraging for reinforcement-learning based schemes.

fMRI, EEG-based source localization, and intra-cranial recordings suggest that the fronto-central ERP modulations commonly involve the medial-frontal cortex, specifically the anterior cingulate cortex (ACC). This is consistent with ACC being involved in reward-anticipation and decision-making.

Error-related potentials for BMI

Ferrez and Millan 2008: a 2-class motor-imagery based BMI controlling a cursor moving in discrete steps -- showed that ERPs elicited after each command could be decoded as corresponding to the error or correct condition with an accuracy of about 80%. Only 5 subjects, but encouraging.

The ErrP classifiers maintained the same performance when tested several months after calibration - relatively stable signal form. Seemed to be mainly related to a general error-monitoring process, not task-dependent.

Possible Confounds (in EEG studies):

  1. Eye movement contamination: Need to make sure location or movement of target stimuli are balanced.
  2. Observed potentials may be more related to the rarity of the erroneous events (?)
  3. Attention level: subjects tend to have smaller ErrP amplitude when simply monitoring the device than when they are controlling it -- requires efficient calibration methods.

Error-Driven Learning

Trials in which the decoder performed correctly, as classified by the ErrP, can be incorporated into a learning set that can then be used to perform online retraining of the motor imagery (MI) classifier. Rather discrete, but a good architecture.
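A minimal sketch of how such ErrP-gated retraining could look (all variable names and the linear least-squares classifier below are stand-ins of my own for illustration, not the actual implementation):

% Sketch: ErrP-gated retraining of a linear MI classifier (toy data)
nFeat = 10; nTrials = 50;
Xtrain = randn(200, nFeat);  ytrain = sign(randn(200,1));   % existing learning set (simulated)
Xnew   = randn(nTrials, nFeat);                             % MI features from recent trials
yhat   = sign(randn(nTrials,1));                            % decoder outputs for those trials
errp   = rand(nTrials,1) < 0.2;                             % trials flagged as errors by the ErrP classifier

keep   = ~errp;                                  % keep only trials judged correct by ErrP
Xtrain = [Xtrain; Xnew(keep,:)];                 % grow the learning set
ytrain = [ytrain; yhat(keep)];                   % the decoder's own output serves as the label
w = [Xtrain, ones(size(Xtrain,1),1)] \ ytrain;   % periodically retrain the linear classifier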

Binary ErrP decoding can be affected substantially by the false positive rate (i.e. correct BMI actions misclassified as errors). A solution is to use methods relying on probabilistic error signals, taking into account the reliability of the ErrP decoder.

Millan has used ErrP to improve his shared-control BMI systems.

Apr 28, 2016 - Dadarlat, O'Doherty & Sabes 2015: A learning-based approach to artificial sensory feedback leads to optimal integration

A learning-based approach to artificial sensory feedback leads to optimal integration

Motivation

Maximally precise movements are achieved by combining estimates of limb or target position from multiple sensory modalities, weighting each by its relative reliability. Current BMIs rely on visual feedback alone, which makes fluid and precise natural movements hard to achieve. Artificial proprioception may improve the performance of neuroprosthetics.

Goal

Artificial proprioception should be able to:

  1. Provide sufficient information to allow competent performance in the absence of other sensory inputs.
  2. Permit multisensory integration with vision to reduce movement variability when both signals are available.
    Implemented via multichannel intracortical microstimulation (ICMS). This differs from most other work, which takes a biomimetic approach: the focus here is not on reproducing natural patterns of activity but on taking advantage of the natural mechanisms of sensorimotor learning and plasticity.

The hypothesis is that spatiotemporal correlation between a visual signal and a novel artificial signal in a behavioral context is sufficient for a monkey to learn to integrate the new modality.

Approach

Monkey looks at a screen, trying to move to an invisible target. The movement vector is the vector between the monkey's current hand position and the invisible target.

  1. Visual feedback: The movement vector is translated into a random moving-dot flow field. The coherence of the dot field is the percentage of dots moving in the same direction, so a 100% coherence dot field would have all dots moving in the direction of the movement vector.
  2. ICMS: a 96-channel array is implanted, of which 8 electrodes are stimulated. During feedback, the movement vector is decomposed into 8 components along directions equally spaced around a circle, each corresponding to a stimulating electrode. A higher relative frequency on an electrode means a greater component of the movement vector in that direction, and a common frequency scaling factor encodes the movement vector's length (see the sketch after this list).
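A minimal sketch of one way this encoding could be computed (the maximum frequency, the distance range, and the exact projection-to-frequency mapping below are illustrative assumptions of mine, not the paper's parameters):

% Sketch: map a movement vector onto 8 stimulation frequencies (illustrative only)
mv = [3, -4];                              % movement vector (cm), hand -> target
dirs = (0:7) * pi/4;                       % 8 directions equally spaced around a circle
unitVecs = [cos(dirs)', sin(dirs)'];       % 8x2 unit vectors, one per electrode
proj = max(unitVecs * mv', 0);             % positive component of mv along each direction
proj = proj / max(proj);                   % relative weighting across electrodes
fMax = 300;                                % assumed maximum stimulation frequency (Hz)
scale = min(norm(mv)/10, 1);               % common scaling with distance (assumed 10 cm range)
stimFreq = fMax * scale * proj;            % per-electrode stimulation frequency (Hz)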

Experiments included:

  1. Learning the task with only visual feedback, with different coherence to demonstrate performance differences due to changing sensory cue precision.
  2. Learning the task with only ICMS feedback. Trained by providing ICMS during the visual task to teach the monkey to associate the spatial relationship. No changing coherence.
  3. Performing the task with both.

Metrics:

  1. Percent trials correct
  2. Number of movement segments - smaller is better. Movement segments are found by assuming sub-movements have bell-shaped velocity profiles -- sub-movements are identified by threshold crossings of the radial velocity profile of a trajectory.
  3. Normalized movement time: normalized by the distance to target.
  4. Normalized path length: the integrated path length normalized by the distance from start to target.
  5. Mean and variance of the initial angle: used to compare the initial movement direction with the movement vector.
  6. Distance estimation: Assess the monkey's ability to estimate target distance from ICMS by regressing initial movement distance against target distance for ICMS-only trials, excluding trials for which the first movement segment was not the longest.

Evaluating Sensory Integration

Minimum variance integration, which is supposed to be the optimal integration, predicts that in the VIS+ICMS condition, as the visual coherence increases from 0% to 100%, the animal should transition from relying primarily on the ICMS cue to relying primarily on the visual cue -- under the minimum variance model, each sensory cue should be weighted inversely proportional to its variance (written out below).
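Written out, this is the standard minimum-variance (maximum-likelihood) cue combination for independent Gaussian cues, with $\hat{x}$ the position estimate from each cue:

\[\hat{x} = w_{VIS}\,\hat{x}_{VIS} + w_{ICMS}\,\hat{x}_{ICMS}, \qquad w_{VIS} = \frac{1/\sigma_{VIS}^2}{1/\sigma_{VIS}^2 + 1/\sigma_{ICMS}^2}, \quad w_{ICMS} = \frac{1/\sigma_{ICMS}^2}{1/\sigma_{VIS}^2 + 1/\sigma_{ICMS}^2}\]

The combined estimate then has variance \[\sigma^2 = \frac{\sigma_{VIS}^2\,\sigma_{ICMS}^2}{\sigma_{VIS}^2 + \sigma_{ICMS}^2} \leq \min(\sigma_{VIS}^2, \sigma_{ICMS}^2),\] which is why integrating both cues should reduce movement variability relative to either cue alone.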

Results

  1. Monkey learns all three tasks.
  2. VIS+ICMS outperforms the VIS-only condition for visual coherences less than 50%.
  3. Optimal (minimum variance) integration of the two sensory modalities was observed.

And only 8 electrodes!!

Apr 21, 2016 - Linear Algebra Notes


I can apply linear algebra, but often forget/confuse the spatial representations.

A good resource is Gilbert Strang's essay. I remember him talking about this when I took 18.06, but he was old at that point, and I barely went to class, so I didn't retain much. These notes of his still retain his stream-of-consciousness style, but the two diagrams summarize it well, and it is useful to go back often and make sure I still understand them.

image1

image2

The first one shows the relationships between the four subspaces of a matrix A. The second one adds the relationships between the orthonormal bases of those subspaces.

Strang concludes:

Finally, the eigenvectors of A lie in its column space and nullspace, not a natural pair. The dimensions of the spaces add to n, but the spaces are not orthogonal and they could even coincide. The better picture is the orthogonal one that leads to the SVD.

The first assertion is illuminating. Eigenvectors $v$ are defined such that $Av=\lambda v$. Suppose $A$ is $M\times N$; then $v$ must be $N\times 1$ while $Av$ is $M\times 1$, so the equation only makes sense when $M=N$. On the other hand, in SVD we have $A=U\Sigma V^T$ and

  • $Av_i=\sigma_i u_i$ for $i=1,...,r$, where $r$ is the rank (the common dimension of the column space and the row space).
    • $u_i$ are the basis for the column space (left singular vectors); $v_i$ are the basis for the row space (right singular vectors).
    • The dimensions in this equation match because the column space and row space are the more natural pair.
  • $Av_i=0$ and $A^Tu_i=0$ for $i>r$.
    • The $v_i$ with $i>r$ are a basis for the null space of $A$,
    • and the $u_i$ with $i>r$ are a basis for the null space of $A^T$.
    • They also match up similarly (Figure 2).
  • In $Ax=b$, $x$ is in the row space, $b$ is in the column space.
  • If $Ax=0$, then $x$ is in the null space (kernel) of $A$, which is the orthogonal complement of the row space of $A$, $C(A^T)$.
  • In $A^Tx=b$, $x$ is in the column space, $b$ is in the row space.
  • If $A^Tx=0$, then $x$ is in the cokernel of $A$, which is the orthogonal complement of the column space of $A$, $C(A)$.
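These relations are easy to check numerically; a quick MATLAB sketch with a random rank-deficient matrix (my own illustration):

% Verify the SVD subspace relations on a random rank-deficient matrix
M = 6; N = 4; r = 2;
A = randn(M,r) * randn(r,N);          % M x N matrix of rank r
[U, S, V] = svd(A);

% A*v_i = sigma_i * u_i for i <= r
disp(norm(A*V(:,1:r) - U(:,1:r)*S(1:r,1:r)))   % ~0

% A*v_i = 0 and A'*u_i = 0 for i > r
disp(norm(A*V(:,r+1:end)))            % ~0: the trailing v_i span the null space of A
disp(norm(A'*U(:,r+1:end)))           % ~0: the trailing u_i span the null space of A'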

Relationship with PCA

PCA Algorithm
Inputs: The centered data matrix X and k>1.

  1. Compute the SVD of X: [U,Γ,V]=svd(X).

  2. Let Vk=[v1,...,vk] be the first k columns of V.

  3. The PCA-feature matrix and the reconstructed data are:

    \[Z=XV_k, \qquad \hat{X}=XV_kV_k^T\]

So in PCA, the rows of the data matrix are the observations, and the columns are the coordinates in the original coordinate system. The principal components are then an orthonormal basis for the row space (the right singular vectors of the centered data). We can do PCA in MATLAB with pca or manually with svd:

% PCA test
a = rand(100,1);
b = 5*a+rand(100,1);

data = [a, b];  % rows are observations
plot(a,b, '.');


% PCA results:
%   coeffs = PCs expressed in the original coordinates [PxP]
%   score  = observations in PC-space [NxP]
%   latent = variance explained by each PC [Px1]
[coeffs, score, latent] = pca(data);
hold on;
plot(score(:,1), score(:,2), 'r.');

% SVD results
centeredData = bsxfun(@minus, data, mean(data,1));
[U, S, V] = svd(centeredData);
svd_score = centeredData*V;
plot(svd_score(:,1), svd_score(:,2), 'g.');

The results are shown below. Blue is the original data; green/red are the PCA results, and they overlap exactly. Note that MATLAB's pca centers the data by default.
image1

Projection into subspace (e.g. Kaufman 2014, Cortical activity in the null space, Even-Chen 2017, Neurally-driven error detectors)

Suppose we have a trial-averaged neural activity matrix $A$ of size $N\times T$, where $N$ is the number of neurons and $T$ is the number of time bins.

In an online BMI experiment, a linear decoder maps the neural activity vector $n_t$ of size $N\times 1$ to kinematics $v_t$ of size $2\times 1$ via the equation $v_t=wn_t$, where $w$ is of size $2\times N$.

Kaufman describes the output-potent neural activity as the collection of $n_t$ such that $v_t$ is non-zero. Therefore the output-potent subspace of $n_t$ is the row space of $w$; similarly, the output-null subspace is the null space of $w$.

By projecting the activity matrix $A$ onto these subspaces, we can get an idea of how much of the variance in the neural activity can be accounted for by the modulation used to control actuator kinematics (a sketch with simulated data follows the steps below):

  1. Use the SVD $w=U\Sigma V^T$ to get the row and null spaces of $w$, defined by the first $r$ columns of $V$ (call it $w_{potent}$) and the last $N-r$ columns of $V$ (call it $w_{null}$), respectively, where $r$ is the rank of $w$.
  2. Left-multiply $A$ by the transposes of these bases (i.e. $A_{potent}=w_{potent}^TA$ and $A_{null}=w_{null}^TA$).
  3. The variance of $A$ explained by the projection onto each subspace is obtained by first subtracting each row's mean from each matrix, then taking the Frobenius norm of the entire matrix (the L2 norm over all entries).
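A minimal sketch of this procedure with simulated data (the decoder $w$ and activity matrix $A$ below are random stand-ins; variable names are mine):

% Sketch: variance of neural activity in output-potent vs output-null subspaces
N = 30; T = 500;                         % neurons x time bins
A = randn(N, T);                         % trial-averaged activity (simulated)
w = randn(2, N);                         % linear decoder, v_t = w * n_t

[~, ~, V] = svd(w);
r = rank(w);
wPotent = V(:, 1:r);                     % basis for the row space of w (output-potent)
wNull   = V(:, r+1:end);                 % basis for the null space of w (output-null)

Apotent = wPotent' * A;                  % r x T
Anull   = wNull'   * A;                  % (N-r) x T

% variance = squared Frobenius norm after removing each row's mean
varOf = @(X) norm(bsxfun(@minus, X, mean(X,2)), 'fro')^2;
fprintf('potent: %.1f%%   null: %.1f%%\n', ...
    100*varOf(Apotent)/varOf(A), 100*varOf(Anull)/varOf(A));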

Philip Sabes' Linear Algebra notes

Apr 12, 2016 - Optimum mean square linear filter notes

Wiener Filter

The Wiener filter is used to produce an estimate of a target random process by LTI filtering of an observed noisy process, assuming known stationary signal and noise spectra. This is done by minimizing the mean square error (MSE) between the estimated process and the desired process.

The Wiener filter was one of the first decoding algorithms used with success in BMI, used by Wessberg in 2000.

Derivations of Wiener filter are all over the place.

  1. Many papers and classes refer to Haykin's Adaptive Filter Theory, a terrible book in terms of setting the motivation, presenting intuition, and its horrible typesetting.

  2. MIT's 6.011 notes on the Wiener filter have a good treatment of the DT noncausal Wiener filter derivation, and include some practical examples. Their treatment of the causal Wiener filter, which is probably the most important for real-time processing, is mathematical and really obscures the intuition -- the result is reached by completing the square and using Parseval's theorem, without mentioning the consequences in the time domain. My notes

  3. Max Kamenetsky's EE264 notes give the best derivation and numerical examples of the subject for practical application, even though only DT is considered. They might get taken down, so an extra copy is [here](/reading_list/assets/wienerFilter.pdf). Introducing the causal form of the Wiener filter using the input-whitening method developed by Shannon and Bode fits into the general framework nicely.

Wiener Filter Derivation Summary

  1. Both the input $x$ (observation of the target process) and the output $y$ (estimate of the target process $d$, after filtering) are random processes.

  2. Derive expression for cross-correlation between input and output.

  3. Derive expression for the MSE between the output and the target process.

  4. Minimize that expression and obtain the transfer function in terms of $S_{xx}$ and $S_{xd}$, where the $S$ are the Z or Laplace transforms of the correlations (written out below).
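For the DT noncausal case this minimization gives the standard result (stated here from memory, with the convention $R_{xd}[m]=E[x[n]\,d[n+m]]$; see the 6.011 notes for the careful version):

\[H(e^{j\omega}) = \frac{S_{xd}(e^{j\omega})}{S_{xx}(e^{j\omega})}\]

where $S_{xx}$ is the power spectral density of the observed process and $S_{xd}$ is the cross-spectral density between the observation and the target. The causal version requires the spectral factorization / input-whitening machinery mentioned above.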

This looks strangely similar to a linear regression. Not surprisingly, it is known to statisticians as the multivariate linear regression.

Kalman Filter

In the Wiener filter it is assumed that both the input and target processes are WSS. If not, one may assume local stationarity and conduct the filtering. However, nearly two decades after Wiener's work, Rudolf Kalman developed the Kalman filter, which is the optimum mean square linear filter for nonstationary processes (evolving under a certain state space model) AND stationary ones (converging in steady state to the Wiener filter).

Two good sources on Kalman Filter:

  1. Dan Simon's article describes the intuition and practical implementation of KF, specifically in an embedded control context. The KF should be coupled with a controller when the sensor reading is unreliable -- it filters out the sensor noise. My notes

    block-diagram

  2. Greg Welch and Gary Bishop's An Introduction to the Kalman Filter. Presents just the right amount of math in the derivation of the KF. My notes

  3. Update (12/9/2021): New best Kalman Filter derivation, by Thacker and Lacey. Full memo, Different link.

The general framework of the Kalman filter is (using notation from Welch and Bishop):

\[x_k = A x_{k-1} + B u_{k-1} + w_{k-1}\]
\[z_k = H x_k + v_k\]
\[p(w) \sim N(0, Q), \qquad p(v) \sim N(0, R)\]

The first equation is the state model. $x_k$ denotes the state of the system; this could be the velocity or position of a vehicle, or the desired velocity or position of a cursor in a BMI experiment. $A$ is the transition matrix -- it describes how the state transitions from past states. $u_k$ is the input to the system. $B$ describes how the inputs are coupled into the state. $w_k$ is the process noise.

The second equation is the observation model - it describes what kind of measurements we can observe from a system with some state. $z_k$ is the measurement; it could be the speedometer reading of the car (velocity being the state in this case), or neural firing rates in BMI (kinematics being the state). $H$ is the observation matrix (the tuning model in BMI applications). $v_k$ is the measurement or sensor noise.

One caveat is that both w and v are assumed to be Gaussian with zero mean.

The central insight of the Kalman filter is that, depending on whether the process noise or the sensor noise dominates, we weigh the state model and the observation model differently in their contributions to the new state estimate, resulting in the a posteriori state estimate \[\hat{x}_k=\hat{x}_k^- + K_k(z_k-H\hat{x}_k^-)\] Note the subscript $k$ denotes each discrete time step.

The first term $\hat{x}_k^-$ is the contribution from the state model - it is the a priori state estimate based on the state model only. The second term is the contribution from the observation model - the innovation $z_k-H\hat{x}_k^-$ weighted by the Kalman gain, which corrects the a priori estimate into the a posteriori one.

Note a priori denotes a state estimate at a given time step $k$ that uses only knowledge of the process prior to step $k$; a posteriori denotes a state estimate that also uses the measurement at step $k$, $z_k$.
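To see how the pieces fit together, here is a minimal sketch of the resulting predict/correct recursion for a scalar random-walk toy example (my own illustration; the gain $K_k$ used in the correction step is derived next):

% Sketch: scalar Kalman filter tracking a random walk (toy example)
A = 1; H = 1; Q = 0.01; R = 1;           % state/observation models and noise covariances
nSteps = 200;
x = cumsum(sqrt(Q)*randn(nSteps,1));     % true state (random walk)
z = H*x + sqrt(R)*randn(nSteps,1);       % noisy measurements

xhat = 0; P = 1;                         % initial estimate and error covariance
xhatAll = zeros(nSteps,1);
for k = 1:nSteps
    % predict (a priori estimate from the state model)
    xminus = A*xhat;
    Pminus = A*P*A' + Q;
    % correct (incorporate the measurement z_k)
    K = Pminus*H' / (H*Pminus*H' + R);   % Kalman gain
    xhat = xminus + K*(z(k) - H*xminus);
    P = (1 - K*H)*Pminus;
    xhatAll(k) = xhat;
end
plot(1:nSteps, x, 1:nSteps, z, '.', 1:nSteps, xhatAll);
legend('true state', 'measurement', 'KF estimate');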

$K_k$ is the Kalman gain, the derivation of which is the most mathematically involved step. The derivation involves the definition of:

  1. $P_k^-$: the a priori estimate error covariance \[P_k^-=E[e_k^-e_k^{-T}]=E[(x_k-\hat{x}_k^-)(x_k-\hat{x}_k^-)^T]\]
  2. $P_k$: the a posteriori estimate error covariance \[P_k=E[e_ke_k^T]=E[(x_k-\hat{x}_k)(x_k-\hat{x}_k)^T]\]

To derive $K$, we want to minimize the a posteriori estimate error covariance $P_k$, which makes intuitive sense: a good estimate of the state is one that minimizes the variance of the estimation error. A bunch of math later, we obtain \[K_k=P_k^-H^T(HP_k^-H^T+R)^{-1}\]

Looking at the a posteriori state estimate equation, we can see that:

  1. If the sensor noise is extremely low, i.e. $R\rightarrow 0$, then we should trust the sensor reading, and $\hat{x}_k$ should just be $H^{-1}z_k$. This is indeed the case, as $\lim_{R_k\to 0}K_k=H^{-1}$.
  2. If the process noise is extremely low, i.e. $Q\rightarrow 0$, then we should trust the state model on how the state should change, so $\hat{x}_k$ should just equal the a priori state estimate $\hat{x}_k^-$. This is indeed the case, as $\lim_{P_k^-\to 0}K_k=0$.

In implementation, the KF operates by alternating between a prediction phase and a correction phase. The prediction phase projects the state forward using the state model. The correction phase updates that prediction using the measurement and the observation model. The operation is recursive and efficient.

![KF_architecture](/reading_list/assets/KF_diagram.jpg){: .center-image }

Thus the KF is a generalization of the Wiener filter to the case where the input and reference processes are non-stationary. This is done by estimating the state, thereby tracking the changing statistics of the non-stationary process; the tracking is enabled by incorporating the measurements. The KF is superior to the Wiener filter also because:

  1. There are fewer parameters to be fitted.
  2. The movement and observation models are decoupled, providing more flexibility in system design.

**Extended Kalman Filter (EKF)**

The default KF addresses the general problem of trying to estimate a state $x$ governed by a linear stochastic model. When the state and/or observation model is nonlinear, the KF needs to linearize about the current mean and covariance: $A$, $B$, $H$, $w$ and $v$ are replaced by Jacobians evaluated at the current mean and covariance. The math is different, but the architecture is the same.

**Unscented Kalman Filter (UKF)**

The UKF is a better implementation of the KF for nonlinear state and observation models. Why use the UKF when we can use the EKF? According to [wikipedia](https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter), when the state and observation models

> are highly non-linear, the extended Kalman filter can give particularly poor performance. This is because the covariance is propagated through linearization of the underlying non-linear model. The unscented Kalman filter (UKF) uses a deterministic sampling technique known as the unscented transform to pick a minimal set of sample points (called sigma points) around the mean. These sigma points are then propagated through the non-linear functions, from which the mean and covariance of the estimate are then recovered. The result is a filter which more accurately captures the true mean and covariance.
In Zheng Li's 2009 paper, [Unscented Kalman Filter for Brain-Machine Interfaces](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006243), he applies the UKF to BMI. The key points were:

  1. The UKF allows a nonlinear neuron tuning model -- a nonlinear observation model. In the paper, the tuning model is quadratic and describes the tuning better than a linear one.
  2. The UKF also allows a nonlinear movement model -- a nonlinear state model.
  3. The movement model was made to be of degree n, meaning that the state equation is no longer a function of just the previous state, but of the previous n states. Incorporating this "movement history" acts as smoothing for the output kinematics.

The number of sigma points was chosen to be $2d+1$, where $d=4n$ is the dimension of the state space. If $n=10$, then $d=40$ -- a big state space. The name unscented is pretty dumb and provides no clue to how it modifies the KF.

**Adaptive UKF**

As a follow-up to the 2009 paper, Li published [Adaptive UKF](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3335277/). This algorithm modifies the n-degree UKF by including a

> Bayesian regression self-training method for updating the parameters of an unscented Kalman filter decoder. This novel paradigm uses the decoder’s output to periodically update the decoder’s neuronal tuning model in a Bayesian linear regression.

So, the tuning model/observation matrix $H$ is updated online periodically. Great for modeling the non-stationarity associated with plasticity.

**ReFIT-KF**

Apr 10, 2016 - Notes on Learning from Data

Last September I went through Caltech's Learning from Data course offered through EdX. I went through all the lectures and homework except for the final exam.

Finally got around to finishing it this week, after re-reading some forgotten material. All the code written for the problem sets is here. Some notes:

VC Dimension and Generalization

One of the most emphasized points in the course is how to estimate the generalization performance of an algorithm. The VC dimension can be used as a preliminary measure of an algorithm's generalization performance. A more complex hypothesis set is more likely to overfit and, given the same amount of training data, may lead to worse generalization. When possible, using a simpler hypothesis set is better; otherwise, use regularization.

Linear Regression

Simple, not very sexy, but surprisingly powerful in many applications. Combined with nonlinear transforms it can solve many classes of problems. We have many techniques that can solve linear problems, so in general we want to try to reduce problems to this class.

Validation

Cross-validation, especially leave-one-out, is an unbiased estimator of the out-of-sample error $E_{out}$.

Use validation for model selection, including parameters for regularization.

SVM

Very powerful technique. The motivation is simple - find the hyperplane that separates two classes of data with the greatest margin from the separating plane to the closest points of each class. Solving the equations becomes a quadratic programming problem: minimizing the norm of the weights of the plane (which is inversely proportional to the margin width), subject to all points being classified correctly (in hard-margin SVM).

This original formulation is called the primal problem.

The same problem can be formulated as a Lagrange multiplier problem, with an additional multiplier parameter $\alpha_i$ for each data point. This formulation requires minimizing over the weights of the maximum over the multipliers $\alpha$. Under the KKT conditions, we can switch the order to maximizing over the multipliers of the minimum over the weights. This allows us to write expressions for the weights in terms of $\alpha$ from the inner minimization (by taking the partial derivatives of the Lagrangian).

Finally, we reach another quadratic programming problem, where the objective is over the $\alpha$s, subject to constraints derived from the minimization over the weights.

The advantage of this dual problem is that the data points appear in the expression only through inner products -- allowing us to use the kernel trick, which is a computationally efficient way of doing nonlinear transformations into higher-dimensional (even infinite-dimensional) spaces.

One of the coolest things about SVM is that, according to strict VC-dimension analysis, after nonlinear transformation of the data points into infinite dimensions via the kernel trick, $E_{out}$ should be unbounded. However, because SVM searches for the hyperplane with the fattest separation margin, the effective VC dimension is significantly reduced, and thus $E_{out}$ is kept under control.

RBF

An RBF network classifier places an RBF on each training data point, and then classifies test points by summing those RBF values at the test point.

Since this is prohibitive when the data set is big, we can instead pick K representative points by finding the centroids of K clusters (via a clustering algorithm). The classification problem is thereby transformed into a linear regression problem, where the basis functions are RBFs centered on the cluster centers (a sketch follows below).
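A minimal sketch of this K-center RBF network on toy 2-D data (my own toy setup; K and the kernel width are arbitrary choices, and kmeans/pdist2 come from the Statistics Toolbox):

% Sketch: RBF network classifier via K-means centers + linear regression
X = [randn(100,2)+1; randn(100,2)-1];                     % toy 2-class data
y = [ones(100,1); -ones(100,1)];
K = 10; sigma = 1;

[~, centers] = kmeans(X, K);                              % K representative points
rbf = @(Z) exp(-pdist2(Z, centers).^2 / (2*sigma^2));     % N x K RBF features

Phi = [rbf(X), ones(size(X,1),1)];                        % design matrix with bias
w = Phi \ y;                                              % linear regression on RBF features

Xtest = randn(10,2);
yhat = sign([rbf(Xtest), ones(10,1)] * w);                % classify new points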

SVM with an RBF kernel still outperforms the regular RBF network classifier most of the time. Why do we use it? I can imagine doing unsupervised learning: given a cluster of data points with no labels, simply use clustering + RBFs to derive K labels and assign them to new samples.

Apr 5, 2016 - Wireless Headstage: Validation of signal quality IV - Results

Results of the comparison in terms of the metrics given in the metrics post.

Right click to open plots in a new window to see better

Victor Purpura

image1

In this plot, the x-axis is the acceptable time shift for moving a spike in the recording to match one in the reference, and is equal to 1/q, where q is the shift cost in the original VP formulation.

When the acceptable time shift is small, VP is a measure of the number of non-coincident spikes. When the shift is large, it is more a measure of difference in spike counts.

As expected, the performance is much worse when SNR is low, and Plexon performs well on the VP metric across all timescales. RHD has very close performance, except at the 10ms scale. This is most likely because the onboard spike sorting uses only a 0.5ms window rather than the full 1ms window used in Plexon, resulting in missed spikes or shifted spike times. I expect the difference between Plexon and RHD to decrease with shorter spike durations/features.

RHA has the worst performance, and this is especially apparent for low-SNR recordings. The performance difference between RHA and RHD is most likely due to RHA's use of AGC. When SNR is low, AGC also amplifies the noise level - therefore a burst of noise may result in an amplitude similar to a spike. This is also apparent when trying to align the RHA and reference waveforms prior to metric comparison - with simple peak-finding it is impossible to align the signals properly when SNR=1dB. More advanced frequency-domain techniques may be needed. Therefore, the cyan RHA curve should be treated as an outlier.

image2

The mean and standard deviation plot confirms the above. At timescales greater than 10ms, Plexon and RHD have very close performance across SNRs.

van Rossum

image3

The x-axis is the τ parameter used in the exponential kernel. As τ increases, the metric is supposed to go from measuring non-coincident spikes to measuring differences in spike counts and firing rate.

I would expect the curves to show the same trend as those for the VP distance, but instead they show the opposite trend. Further, the results don't agree with the other tests... therefore I must conclude that even the corrected Paiva formula is incorrect..?

image4

Schreiber

image5

The x-axis is the σ parameter used in the Gaussian kernel. As σ increases, the metric goes from measuring non-coincident spikes to measuring differences in spike counts and firing rates. The curves show the expected shape. And again, at timescales greater than 10ms, all three systems have similar performance.

image6

Binned Distance

image7

The x-axis is the bin size used in calculating the binned distance. The order of the curves is almost the same as that for the VP distance. Again, Plexon performs best at small timescales, and all three systems converge in performance with increasing bin size.

image8

The mean curve shows that the convergence for the binned distance is a lot steeper than that for the VP distance. This is really good news, as most of our decoding algorithms use binned firing rates. It shows that for bin sizes >= 25ms, the wireless systems have performance very close to Plexon.

Finally, the order of performance is again Plexon > RHD > RHA.

Apr 4, 2016 - Wireless Headstage: Validation of signal quality III - Computation and Methods

Some computation problems/notes I encountered while implementing the metrics. Code

Victor Purpura Distance

Victor-Purpura's MATLAB implementation, adapted from the Sellers algorithm for DNA sequence matching, runs pretty slowly despite its $O(N_iN_j)$ complexity.

For a test comparison of two spike trains, one with 8418 spike times and the other with 9782 spike times, it takes around 75 seconds. A bit too long. On Victor's page there are a few other MATLAB implementations, but they only deal with parallelizing calculations over different costs of shifting a spike. The mex function has weird usage. So I tried translating the algorithm into Julia and calling it from MATLAB. The result is spkd_qpara.jl.

Pretty straightforward. Importing my spike time vectors into Julia and running the function takes about 1.07 seconds, a significant speed-up. However, most of my analysis, plotting, and data are in MATLAB, so I would like to call Julia from MATLAB. This functionality is provided by jlcall.

Configuration by running jlconfig as instructed in the README was smooth, except for the message:

> Warning: Unable to save path to file '/usr/local/MATLAB/R2015a/toolbox/local/pathdef.m'. You can save your path to a different location by calling SAVEPATH with an input argument that specifies the full path. For MATLAB to use that path in future sessions, save the path to 'pathdef.m' in your MATLAB startup folder. 
> > In savepath (line 169)
>   In jlconfig (line 100) 

The warning is most likely specific to my machine.

Using any of the methods from Jl.m, however, was not smooth. I received the error:
Invalid MEX-file '/home/allenyin/.julia/v0.4/jlcall/m/jlcall.mexa64': libjulia.so: cannot open shared object file: No such file or directory.

So it seems MATLAB cannot find libjulia.so. This is documented in jlcall's bug list.

The simple fix, as suggested in the message thread, is to run in MATLAB
setenv('LD_RUN_PATH', [getenv('LD_RUN_PATH'), ':', '/usr/lib/x86_64-linux-gnu/julia'])
before running the jlconfig script. The path given is the location of the .so file.

Now to call victorD from Matlab, do:

Jl.include('spkd_qpara.jl');
Jl.call('victorD', spikeTimes1, spikeTimes2, 200);

The second call takes 1.41 seconds.

Van Rossum Distance

Paiva 2009 derived a computationally efficient estimator with order $O(N_iN_j)$ for the VR distance:

\[d_{vR}(S_i,S_j)=\frac{1}{2}\left[\sum_{m=1}^{N_i}\sum_{n=1}^{N_i}L_\tau(t_m^i-t_n^i)+\sum_{m=1}^{N_j}\sum_{n=1}^{N_j}L_\tau(t_m^j-t_n^j)\right]+\sum_{m=1}^{N_i}\sum_{n=1}^{N_j}L_\tau(t_m^i-t_n^j)\]

where $L_\tau(\cdot)=\exp(-|\cdot|/\tau)$ is the Laplacian kernel. However, this formula is wrong. By inspection we can see that this distance will never be 0, since it is a sum of exponentials. Further, sample calculations of the distance between less similar spike trains may even yield a smaller distance! These problems can be corrected by changing the sign of the last summation term. Unfortunately, this typo occurs in (Paiva et al., 2007) as well as Paiva's PhD thesis. However, as I can't derive the formula myself (there must be some autocorrelation trick I'm missing), I don't trust it very much.

It is, nevertheless, a much more efficient estimator than the naive approach of taking the distance between convolved spike trains, and it does not depend on time discretization.

Thomas Kreuz has several van Rossum MATLAB implementations posted on his website, but they don't work when τ<20ms.

Schreiber Distance

This dissimilarity measure also has a computationally efficient estimator of order $O(N_iN_j)$, given in Paiva 2010:

\[d_{CS}(S_i,S_j)=1-\frac{\sum_{m=1}^{N_i}\sum_{n=1}^{N_j}\exp\left[-\frac{(t_m^i-t_n^j)^2}{2\sigma^2}\right]}{\sqrt{\left(\sum_{m,n=1}^{N_i}\exp\left[-\frac{(t_m^i-t_n^i)^2}{2\sigma^2}\right]\right)\left(\sum_{m,n=1}^{N_j}\exp\left[-\frac{(t_m^j-t_n^j)^2}{2\sigma^2}\right]\right)}}\]

Generating Test Signals

Signals were generated using Leslie Smith and Nhamoinesu Mtetwa's Noisy Spike Generator MATLAB Software v1.1 (NSG) using makeReferenceSignals.m.

The sampling rate was set to 31250Hz since that is the sampling rate of the wireless headstages. This frequency is greater than the maximum Plexon ADC frequency of 20,000Hz, but offline analysis may simply use interpolation.

Notes on using this toolbox:

  1. The main script generatenoisysamples.m outputs a final signal that includes uncorrelated spike noise, correlated spike noise, and Gaussian noise. The NoiseSNR parameter only sets the Gaussian noise level. As this validation is of recording quality, not of a spike sorting algorithm, I modified the script to exclude the uncorrelated and correlated spike noise from the final output.

  2. The manual of the toolbox claims that the spike duration can be changed via the T_Delta_Integrate and N_Delta_Integrate parameters (spike duration is T_Delta_Integrate × N_Delta_Integrate). However, changing these parameters simply changes how many samples are taken from a pre-generated template file. These template files were generated by the author, and the script used was not available. Regardless, the default spike duration of ~1.8ms, while longer than most motor cortex spikes we encounter, can still be detected and sorted by all three systems tested.

Aligning Reference with Recording

After the reference signals are generated, they are converted to .wav files with Fs=31250Hz. All audio signals (including those with two distinct spikes firing) are available on soundcloud. The audio signals then go through the Plexon headstage test board, to which the wireless and Plexon headstages may connect.

As mentioned in Validation of signal quality I, since the reference signal and the recording cannot be started and stopped at exactly the same time, some alignment needed to be done before applying the spike train measurements.

The reference and recording are aligned at the start by aligning the peak of the first spike in the reference with that of the recording. I use the recorded analog waveform, rather than the recorded spike times, for alignment because the first spike may not be detected, and it's easier to examine the alignment quality visually from the analog waveforms.

The first spike peak of the reference signal is easily found by peak-finding in a 5ms window after the first spike time (which is known). The spike times generated by NSG mark the beginning of a spike, not the peak.

When SNR is high, peak-finding on the recording can also find the peak of the first recorded spike pretty easily. But with lower SNR, the noise level gets higher and the MinPeakHeight parameter of MATLAB's findpeaks() must be adjusted so that a noise peak is not mistaken for a signal peak. Of course, it's not always possible to align two signals, which contributes to bad spike train comparisons. The best MinPeakHeight found this way for each recorded signal is in compare_signals.m. The script checkAlignment.m helps with that (a sketch of the basic idea is below).
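A minimal sketch of the idea on toy signals (a simplified stand-in of my own, not the actual compare_signals.m / checkAlignment.m code):

% Sketch: align a recording to a reference by the first spike peak (toy signals)
Fs = 31250;
t = 0:1/Fs:1;
spike = -diff(gausswin(31)); spike = spike(:)'/max(abs(spike));   % crude spike shape
ref = zeros(size(t)); ref(1000:1000+numel(spike)-1) = spike;      % reference with one spike
rec = [zeros(1,137), ref(1:end-137)] + 0.05*randn(size(t));       % delayed, noisy recording

firstSpikeTime = 1000/Fs;                               % known from the generator
winIdx = round(firstSpikeTime*Fs) + (0:round(0.005*Fs));  % 5ms search window
[~, iRef] = max(ref(winIdx));  refPeak = winIdx(1) + iRef - 1;

[~, locsRec] = findpeaks(rec, 'MinPeakHeight', 0.5);    % tune MinPeakHeight per recording
recPeak = locsRec(1);

lag = recPeak - refPeak;                                % samples to shift
aligned = rec(1+lag:end);                               % recording aligned to the reference start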

As for the end of the recording - sometimes when I don't stop the input audio signal in time, extra signal is recorded. Thus after alignment, I cut the recording to within 102 seconds of the alignment start.

Mar 31, 2016 - Wireless Headstage: Validation of signal quality II - Metrics

While recent work such as A wireless transmission neural interface system for unconstrained non-human primates focuses on how the transmitted analog waveform compares to that recorded by wired commercial systems, our validation goal is somewhat different.

While our wireless system also streams the full analog waveform, the focus is on onboard spike sorting and reporting of spikes. Therefore, the most important measure is how the reported spike trains compare with the ground truth and with spike trains sorted/measured by commercial wired systems. Thus spike-train analysis is conducted here.

Systems compared:

  1. RHA-headstage with resistor-set RHA-Intan filtering, AGC, and four stages of IIR digital filtering. Final bandwidth is 500Hz-6.7kHz or 150Hz-10kHz.

  2. RHD-headstage with onchip-resistor-set RHD-Intan filtering (1Hz-10kHz), manually set fixed gain (Q7.8 ranges -128:0.5:127.5), and two stages of IIR digital filtering. Final bandwidth is 250Hz-9kHz.

    I observed that it's much easier to do spike sorting when the Intan onboard bandpass filter is kept at a wide bandwidth, so the displayed waveform is not too distorted. The IIR filters can apply sufficient filtering to remove low- and high-frequency noise and expose the key spike features, but having access to the signal prior to the IIR stage can be very helpful for sorting, especially when SNR is low. Finally, changing from AGC to a user-set fixed gain seems to yield much cleaner signals, since AGC amplifies noise as well, making it very difficult to sort under low SNR. The downside is that it's necessary to check the templates often to make sure the previously set gain is still enough.

  3. Plexon (old Harveybox..), standard spike filtering options. Waveform uses 20kHz sampling rate.

All three systems use manual PCA spike sorting. While the actual sorting interface has a time window of 1ms, the headstages use only templates that are 0.5ms long to sort spikes onboard.

Measures used:

To compare spike train performance, I referred to A comparison of binless spike train measures (Paiva, Park, Principe 2010) for a review and comparison of these measures. Very useful despite an amateur error in one of the formulas.

  1. Victor-Purpura Distance(VP)

    The first binless distance measure proposed in the literature. It defines the distance between two spike trains as the minimum cost of transforming one spike train into the other by means of just three basic operations: spike insertion (cost 1), spike deletion (cost 1), and shifting a spike by some interval $\Delta t$ (cost $q\Delta t$). The cost per time unit, $q$, sets the time scale of the analysis. For $q=0$ the distance is equal to the difference in spike counts, while for large $q$ the distance is linearly proportional to the number of non-coincident spikes, as it becomes more favorable to delete and reinsert all non-coincident spikes rather than shifting them (cost 2 per spike in this case). Thus, by increasing the cost, the distance is transformed from a rate distance to a temporal distance. Note that the cost is inversely proportional to the acceptable time shift between two spikes.

    image1
    Victor-Purpura spike train distance. Two spike trains X and Y and a path of basic operations (spike deletion, spike insertion, and spike shift) transforming one into the other. (Victor and Purpura, 1996).

  2. Van Rossum Distance(VR)

    Similar to the VP-distance, VR distance utilizes the full resolution of the spike times. The VR distance is the Euclidean distance between exponentially filtered spike trains.

    A spike train $S_i$ is defined as $S_i(t)=\sum_{m=1}^{N_i}\delta(t-t_m^i)$, where $\{t_m^i : m=1,...,N_i\}$ are the spike times.

    After convolving this spike train with a causal decaying exponential function $h(t)=\exp(-t/\tau)u(t)$, we get the filtered spike train $f_i(t)=\sum_{m=1}^{N_i}h(t-t_m^i)$. The van Rossum distance between two spike trains $S_i$ and $S_j$ is the Euclidean distance:

    \[d_{vR}(S_i,S_j)=\sqrt{\frac{1}{\tau}\int_0^\infty\left[f_i(t)-f_j(t)\right]^2dt}\]

    For VR, the parameter $\tau$ sets the time scale, and can be thought of as inversely related to VP's $q$ parameter. The metric goes from a count of non-coincident spikes (temporal relationship) to a difference in spike count (spike rates) as the kernel size $\tau$ is increased. This is opposite to what happens to the VP distance as $q$ is increased.

  3. Schreiber-correlation dissimilarity

    The Schreiber dissimilarity is based on a correlation measure defined in terms of Gaussian-filtered spike trains. In this approach, the filtered spike train $S_i$ becomes $g_i(t)=\sum_{m=1}^{N_i}G_{\sigma/2}(t-t_m^i)$, where $G_{\sigma/2}(t)=\exp(-t^2/\sigma^2)$ is the Gaussian kernel. The role of $\sigma$ here is similar to $\tau$ in the van Rossum distance, and is inversely related to $q$ in the VP distance. Assuming a discrete-time implementation of the measure, the filtered spike trains can be seen as vectors, for which the usual dot product can be used. Therefore a correlation-like dissimilarity measure can be derived:

    \[d_{CS}=1-r(S_i,S_j)=1-\frac{\langle g_i, g_j\rangle}{\|g_i\|\,\|g_j\|}\]

    This metric is always between 0 and 1, and can be thought of as a measure of the angle between the two filtered spike trains. Note that this quantity is not a strict distance metric because it does not fulfill the triangle inequality. Chicharro 2011 suggests that this measure's applicability to estimating reliability is limited because of the individual normalization of the spike trains, which renders the measure sensitive only to similarities in the temporal modulation of the individual rate profiles but neglects differences in the absolute spike count.

    image2
    Van Rossum spike train distance and Schreiber et al. similarity measure. Figure from (van Rossum, 2001) and (Schreiber et al., 2003)

  4. Binned distance ($d_B$)

    While both Paiva 2010 and Chicharro 2011 recommend not using a binned method to measure spike train similarity, it is the most directly related to BMI methods. The binned distance is simply the sum of the absolute differences between the two binned spike trains, which requires comparing spike trains of the same length (a sketch follows this list).
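A minimal sketch of the binned distance on toy spike trains (bin size and spike counts here are arbitrary):

% Sketch: binned distance between two spike trains
T = 100; binSize = 0.025;                      % seconds
edges = 0:binSize:T;
spk1 = sort(T*rand(1, 800));                   % toy spike times (s)
spk2 = sort(T*rand(1, 850));

c1 = histcounts(spk1, edges);                  % binned spike counts
c2 = histcounts(spk2, edges);
dB = sum(abs(c1 - c2));                        % binned distance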

Notes

According to simulation tests conducted by Paiva 2009, none of the above measures performs best consistently when measuring different aspects of spike-train similarity, including 1) discriminating different firing rates; 2) discriminating changing firing rates; 3) spike train synchrony. In the end, Paiva concludes:

  1. VP and van Rossum distances perform better at discriminating constant firing rates.
  2. Schreiber and binned distance perform better at discriminating synchrony.
  3. All measures performed similarly in discriminating instantaneous firing rates.

Ideally, validation should tell us exactly what percentage of detected spikes are true positives, and how many spikes we have missed. But that requires probabilistic calculations (e.g. is a spike a false positive, or just a shifted spike?) that are difficult, so I settle for these metrics, which are well established and relevant for BMI -- the firing rates are usually binned and normalized prior to decoding, which is sensitive to time scales. There are other time-scale independent spike train distances such as event synchronization (Quiroga 2002), SPIKE synchronization (Kreuz 2015), ISI-distance (Kreuz 2007a), and SPIKE-distance (Kreuz 2013), but they are in my opinion harder to interpret.

Mar 27, 2016 - Wireless Headstage: Validation of signal quality I

After finishing work on the wireless headstage, it occurred to me that no comprehensive validation has been done to confirm the spike-sorting and signal quality of the wireless headstages. So I'm conducting a semi-rigorous test of the RHA-headstage and RHD-headstage against Plexon recordings. This paper on a Blackrock-based wireless system has been useful.

For the first part, I want to examine the signal quality of the streamed raw waveform. The basic setup is:

  1. Generate a neural signal of known length and save it as an audio signal with sampling frequency Fs=31250Hz. This is chosen because the headstage's sampling frequency is 31250Hz.
  2. An extra button was added to gtkclient so that as soon as it starts recording the incoming data from the headstages, it also launches VLC to play that audio signal.
  3. Keep recording until the end of the audio signal.

The recorded signal can then be plotted and analyzed for signal quality.

Note that the command for VLC to play the audio signal is launched from within gtkclient through the system() command. As a result, the headstage is not actually recording that audio signal until after a short VLC initialization delay. This is consistently around 0.5 seconds, as measured by the VLC log.

To compare the recorded signal quality against the reference signal, the first test is visual inspection. Since the recorded signal is a simulated neural signal, this becomes -- do the spike features (sharp peaks and valleys) line up? A more quantitative measure is the cross-correlation between the reference and recorded signals, but the resulting value is a bit harder to interpret.

For the visual test, as a result of the VLC delay, we must shift the recorded waveform to align with the beginning of the reference signal. See below for the original and aligned signals. Blue is the reference, red is the recording. The signals are 100 seconds long, with dt=1/31250s.

image1

Note the short red segment at the beginning of the top plot; it signals the absence of the audio signal.

We can inspect the aligned waveforms (aligned by the peak of the first spike in each signal). In the beginning (shown below), we can see the spike features align fairly well. The squiggles in the red signal are due to noise and filtering artifacts, which are okay in this case but set the lower limit on the SNR at which we can distinguish neural signals.

image2

Around the middle of the signal (around 45.82s), we see the signals are no longer aligned. Rough inspection shows the red recorded signal has developed a lead over the blue signal - the recording has shifted to the left of the reference. The three labeled red peaks correspond to the three labeled blue peaks. This is confirmed by the relative times between the labeled peaks.

image3

We can further make a table of where the corresponding labeled peaks occur within their respective signals, in terms of sample number:

| Recording samp# | Ref samp# | Diff in samp# | Diff in ms |
|-----------------|-----------|---------------|------------|
| 1432167         | 1432972   | 805           | 25.76      |
| 1432707         | 1433515   | 808           | 25.856     |
| 1433025         | 1433830   | 805           | 25.76      |

The last column is calculated by dividing the difference in sample number by 31250Hz to obtain the lead in seconds, then converting to ms.

Near the end of the signal (around 96.056s), the lead becomes even greater, as shown below.

image4

And the corresponding table of labeled peak locations:

| Recording samp# | Ref samp# | Diff in samp# | Diff in ms |
|-----------------|-----------|---------------|------------|
| 3001750         | 3003284   | 1534          | 49.088     |
| 3002059         | 3003620   | 1561          | 49.952     |
| 3002839         | 3004374   | 1535          | 49.12      |
| 3003072         | 3004612   | 1540          | 49.28      |

The developing lead in sample number between the two signals is most likely due to loss of packets from the wireless headstage. Each packet contains new waveform samples; if packets are lost, our recorded signal is effectively shortened relative to the reference signal, resulting in this lead. And since the number of lost packets can only increase, this lead will simply grow with time.

While this is a problem when we are doing offline analysis of the recorded signals, it should not be a huge deal for real-time BMI applications where we care about the instantaneous firing rate -- as long as we don't lose too many packets. As Tim stated in his thesis:

1 in 3176 packets were rejected on average; assuming one bit error per packet, this equates to a bit error rate (BER) of 1.23e-6, or approximately one bit error per second. Efforts to reduce this have not been successful, but the system works acceptably presently.

This matches what I have seen empirically, and the comparisons against Plexon. As long as the BER is on the order of 1e-6, packet loss should not affect performance too much.

The current software keeps track of the total number of lost packets. It might be of interest to find how this changes through time. The previous tables seem to suggest that the number of samples lost increases linearly with time, which supports my hypothesis that the shift is due to lost packets. To get a better idea, however, I would want to perform the same analysis around different time points along the signal, probably within a sliding window. MATLAB's findpeaks() function can extract the positions of the different peaks within the signals, but matching them would be more difficult. While it is relatively easy via visual inspection, it seems to me to be NP-hard... what's an efficient algorithm to do this?

Update - Perhaps a sliding window of the Sellers algorithm to match the 'edit distance' of the peaks would answer the question, but that's not too important for our goal of validation.