Apr 21, 2016 - Linear Algebra Notes

I can apply linear algebra, but often forget/confuse the spatial representations.

A good resource is Gilbert Strang's essay. I remember him talking about this when I took 18.06, but he was old at that point, and I barely went to class, so I didn't retain much. These notes of his still retain his stream-of-consciousness style, but the two diagrams summarize it well, and it's useful to go back to them often to make sure I still understand.

image1

image2

The first one shows the relationships between the four subspaces of a matrix A. The second one adds in the relationships between the orthonormal bases of the subspaces.

Strang concludes:

Finally, the eigenvectors of A lie in its column space and nullspace, not a natural pair. The dimensions of the spaces add to n, but the spaces are not orthogonal and they could even coincide. The better picture is the orthogonal one that leads to the SVD.

The first assertion is illuminating. Eigenvectors $v$ are defined such that $Av=\lambda v$. Suppose $A$ is $M\times N$; then $v$ must be $N\times 1$ for $Av$ to be defined, while $Av$ is $M\times 1$, so eigenvectors only make sense when $M=N$. On the other hand, in SVD we have $A=U\Sigma V^T$ and

  • $Av_i=\sigma_i u_i$ for $i=1,...,r$, where $r$ is the rank, i.e. the dimension of both the column space and the row space.
    • $u_i$ are the basis for the column space (left singular vectors), $v_i$ are the basis for the row space (right singular vectors).
    • The dimensions in this equation match because the column space and row space are the more natural pair.
  • $Av_i=0$, $A^Tu_i=0$, for $i>r$.
    • $v_i$ (for $i>r$) are the basis for the null space of $A$,
    • $u_i$ (for $i>r$) are the basis for the null space of $A^T$ (the left null space).
    • They also match up similarly (Figure 2).
  • In $Ax=b$, $x$ is in the row space, $b$ is in the column space.
  • If $Ax=0$, then $x$ is in the null space (kernel) of $A$, which is the orthogonal complement of the row space of $A$, $C(A^T)$.
  • In $A^Tx=b$, $x$ is in the column space, $b$ is in the row space.
  • If $A^Tx=0$, then $x$ is in the cokernel of $A$, which is the orthogonal complement of the column space of $A$, $C(A)$.
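
To make these relationships concrete, here is a quick MATLAB check (my own sketch, not from Strang's notes) on a random rank-deficient matrix:

% Sketch: check the four-subspace relations numerically
A = randn(5,3) * randn(3,4);          % 5x4 matrix of rank 3
[U, S, V] = svd(A);
r = rank(A);

for i = 1:r
    norm(A*V(:,i) - S(i,i)*U(:,i))    % ~0: A*v_i = sigma_i*u_i (row/column space pair)
end

norm(A * V(:, r+1:end))               % ~0: v_i (i>r) span the null space of A
norm(A' * U(:, r+1:end))              % ~0: u_i (i>r) span the left null space N(A^T)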

Relationship with PCA

PCA Algorithm
Inputs: The centered data matrix X and k>1.

  1. Compute the SVD of X: [U,Γ,V]=svd(X).

  2. Let Vk=[v1,...,vk] be the first k columns of V.

  3. The PCA-feature matrix and the reconstructed data are:

    \[Z = XV_k, \qquad \hat{X} = XV_kV_k^T\]

So in PCA, the rows of the data matrix are the observations, and the columns are in the original coordinate system. The principal components are then the eigenvectors of the row space. We can do PCA in MATLAB with pca or manually with svd:

% PCA test
a = rand(100,1);
b = 5*a+rand(100,1);

data = [a, b];  % rows are observations
plot(a,b, '.');


% PCA results
% coeffs=PC in original coordinate[PxP], 
% scores=observation in PC-space [NxP], 
% latent=variance explained [Px1]
[coeffs, score, latent] = pca(data);
hold on;
plot(score(:,1), score(:,2), 'r.');

% SVD results
centeredData = bsxfun(@minus, data, mean(data,1));
[U, S, V] = svd(centeredData);
svd_score = centeredData*V;
plot(svd_score(:,1), svd_score(:,2), 'g.');

The results are shown below. Blue is the original data; green/red are the PCA results, and they overlap exactly. Note that MATLAB's pca centers the data by default.
image1

Projection into subspace (e.g. Kaufman 2014, Cortical activity in the null space, Even-Chen 2017, Neurally-driven error detectors)

Suppose we have a trial-averaged neural activity matrix A of size N×T, where N is the number of neurons and T is the number of time bins.

In an online BMI experiment, a linear decoder maps the neural activity vector $n_t$ of size $N\times 1$ to kinematics $v_t$ of size $2\times 1$ via $v_t=wn_t$, where $w$ is of size $2\times N$.

Kaufman describes the output-potent neural activities as the collection of $n_t$ such that $v_t$ is nonzero. Therefore the output-potent subspace of $n_t$ is the row space of $w$. Similarly, the output-null subspace is the null space of $w$.

By projecting the activity matrix A into these subspaces, we can get an idea of how much of the variance in the neural activities can be accounted for by modulation that controls actuator kinematics:

  1. Use the SVD $w=U\Sigma V^T$ to get its row and null space, defined by the first r columns of V (call it wpotent) and the last N−r columns of V (call it wnull), respectively, where r is the rank of w.
  2. Left-multiply A by (the transpose of) these basis matrices to get Apotent and Anull.
  3. The variance of A explained by the projection into each subspace is obtained by first subtracting each row's mean from each matrix, then taking the Frobenius norm of the matrix (the L2 norm of the entire matrix).
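
A minimal MATLAB sketch of these three steps (my own paraphrase of the procedure, with hypothetical variable names; whether to use Vpotent' or the full projector Vpotent*Vpotent' is a choice I'm glossing over here):

% Sketch: split activity into output-potent and output-null components
% Assumes w (2xN decoder) and A (NxT trial-averaged activity) already exist
[~, ~, V] = svd(w);                   % V is NxN with orthonormal columns
r = rank(w);                          % typically r = 2
Vpotent = V(:, 1:r);                  % basis of the row space of w
Vnull   = V(:, r+1:end);              % basis of the null space of w

Apotent = Vpotent' * A;               % r x T coordinates in the potent subspace
Anull   = Vnull'   * A;               % (N-r) x T coordinates in the null subspace

% Variance captured in each subspace: subtract row means, take Frobenius norm
varPotent = norm(bsxfun(@minus, Apotent, mean(Apotent,2)), 'fro')^2;
varNull   = norm(bsxfun(@minus, Anull,   mean(Anull,  2)), 'fro')^2;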

Philip Sabes' Linear Algebra notes

Apr 12, 2016 - Optimum mean square linear filter notes

Wiener Filter

The Wiener filter is used to produce an estimate of a target random process by LTI filtering of an observed noisy process, assuming known stationary signal and noise spectra. This is done by minimizing the mean square error (MSE) between the estimated random process and the desired process.

The Wiener filter is one of the first decoding algorithms used with success in BMI, notably by Wessberg in 2000.

Derivations of Wiener filter are all over the place.

  1. Many papers and classes refer to Haykin's Adaptive Filter Theory, a terrible book in terms of setting up the motivation, presenting intuition, and its horrible typesetting.

  2. MIT's 6.011 notes on the Wiener filter have a good treatment of the DT noncausal Wiener filter derivation, and include some practical examples. Their treatment of the causal Wiener filter, which is probably the most important for real-time processing, is mathematical and really obscures the intuition -- we reach the final equation by completing the square and using Parseval's theorem, without mentioning the consequences in the time domain. My notes

  3. Max Kamenetsky's EE264 notes give the best derivation and numerical examples of the subject for practical application, even though only DT is considered. It might get taken down, so an extra copy is [here](/reading_list/assets/wienerFilter.pdf). Introducing the causal form of the Wiener filter using the input-whitening method developed by Shannon and Bode fits nicely into the general Wiener filter framework.

Wiener Filter Derivation Summary

  1. Both the input x (observation of the target process) and the output y (estimate of the target process d, after filtering) are random processes.

  2. Derive expression for cross-correlation between input and output.

  3. Derive expression for the MSE between the output and the target process.

  4. Minimize the expression and obtain the transfer function in terms of $S_{xx}$ and $S_{xd}$, where the $S$'s are Z- or Laplace-transforms of the correlations.

This looks strangely similar to a linear regression. Not surprisingly, it is known to statisticians as multivariate linear regression.
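
To make the regression connection concrete, here is a small MATLAB sketch (my own illustration, not from any of the notes above) that fits a discrete-time FIR Wiener filter by least squares, which amounts to solving the normal equations built from the sample auto- and cross-correlations:

% Sketch: discrete-time FIR Wiener filter fit by least squares
rng(0);
N  = 5000;
d  = filter(1, [1 -0.95], randn(N,1));   % target process (AR(1))
x  = d + 0.5*randn(N,1);                 % noisy observation of the target
L  = 32;                                 % number of filter taps

X = zeros(N-L+1, L);                     % rows are length-L windows of x
for n = L:N
    X(n-L+1, :) = x(n:-1:n-L+1)';
end
w    = X \ d(L:N);                       % least squares ~ Rxx^{-1} * rxd
dhat = X * w;                            % Wiener estimate of d
mse  = mean((d(L:N) - dhat).^2);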

Kalman Filter

In the Wiener filter, it is assumed that both the input and target processes are WSS. If not, one may assume local stationarity and conduct the filtering anyway. However, nearly two decades after Wiener's work, Rudolf Kalman developed the Kalman filter, which is the optimum mean square linear filter for nonstationary processes (evolving under a certain state space model) AND stationary ones (converging in steady state to the Wiener filter).

Two good sources on Kalman Filter:

  1. Dan Simon's article describes the intuition and practical implementation of KF, specifically in an embedded control context. The KF should be coupled with a controller when the sensor reading is unreliable -- it filters out the sensor noise. My notes

    block-diagram

  2. Greg Welch and Gary Bishop's An Introduction to the Kalman Filter. Presents just the right amount of math in the derivation of the KF. My notes

  3. Update (12/9/2021): New best Kalman Filter derivation, by Thacker and Lacey. Full memo, Different link.

The general framework of the Kalman filter is (using notation from Welch and Bishop):

\[x_k = Ax_{k-1} + Bu_{k-1} + w_{k-1}\]
\[z_k = Hx_k + v_k\]
\[p(w) \sim N(0,Q), \qquad p(v) \sim N(0,R)\]

The first equation is the state model. $x_k$ denotes the state of the system; this could be the velocity or position of a vehicle, or the desired velocity or position of a cursor in a BMI experiment. $A$ is the transition matrix -- it describes how the state transitions from past states. $u_k$ is the input to the system. $B$ describes how the inputs are coupled into the state. $w_k$ is the process noise.

The second equation is the observation model -- it describes what kind of measurements we can observe from a system with some state. $z_k$ is the measurement; it could be the speedometer reading of a car (velocity would be the state in this case), or it could be neural firing rates in BMI (kinematics would be the state). $H$ is the observation matrix (the tuning model in BMI applications). $v_k$ is the measurement or sensor noise.

One caveat is that both $w$ and $v$ are assumed to be Gaussian with zero mean.

The central insight of the Kalman filter is that, depending on whether the process noise or the sensor noise dominates, we weigh the state model and the observation model differently in their contributions to the new state estimate, resulting in the a posteriori state estimate \[\hat{x}_k=\hat{x}_k^- + K_k(z_k-H\hat{x}_k^-)\] Note that the subscript k denotes each discrete time step.

The first term $\hat{x}_k^-$ is the contribution from the state model -- it is the a priori state estimate based on the state model only. The second term is the contribution from the observation model -- the innovation $z_k-H\hat{x}_k^-$ weighted by the Kalman gain, which corrects the a priori estimate to give the a posteriori estimate.

Note that a priori denotes a state estimate at step k that uses only knowledge of the process prior to step k; a posteriori denotes a state estimate that also uses the measurement at step k, $z_k$.

$K_k$ is the Kalman gain, and its derivation is the most mathematically involved step. The derivation involves the definitions of:

  1. $P_k^-$: the a priori estimate error covariance \[P_k^-=E[e_k^-e_k^{-T}]=E[(x_k-\hat{x}_k^-)(x_k-\hat{x}_k^-)^T]\]
  2. $P_k$: the a posteriori estimate error covariance \[P_k=E[e_ke_k^T]=E[(x_k-\hat{x}_k)(x_k-\hat{x}_k)^T]\]

To derive $K$, we want to minimize the a posteriori estimate error covariance $P_k$, which makes intuitive sense: a good estimate of the state is one that minimizes the variance of the estimation error. A bunch of math later, we obtain

\[K_k=P_k^-H^T(HP_k^-H^T+R)^{-1}\]

Looking at the a posteriori state estimate equation, we can see that:

  1. If the sensor noise is extremely low, i.e. $R\rightarrow 0$, then we should trust the sensor reading, and $\hat{x}_k$ should just be $H^{-1}z_k$. This is indeed the case, since $\lim_{R_k\to 0}K_k=H^{-1}$.
  2. If the process noise is extremely low, i.e. $Q\rightarrow 0$, then we should trust the state model on how the state should change, so $\hat{x}_k$ should just equal the a priori state estimate $\hat{x}_k^-$. This is indeed the case, since $\lim_{P_k^-\to 0}K_k=0$.

In implementation, the KF operates by alternating between a prediction phase and a correction phase. The prediction phase projects the state forward using the state model. The correction phase updates that prediction using the measurement and the observation model. The operation is recursive and efficient.

![KF_architecture](/reading_list/assets/KF_diagram.jpg)

Thus the KF is a generalization of the Wiener filter to the case where the input and reference processes are non-stationary. This is done by estimating the state, thereby tracking the changing statistics of the non-stationary process; the tracking is enabled by incorporating the measurements taken. The KF is superior to the Wiener filter also because:

  1. There are fewer parameters to be fitted.
  2. The movement and observation models are decoupled, providing more flexibility in system design.
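
For reference, here is a bare-bones MATLAB sketch of the predict/correct recursion in the Welch and Bishop notation, on a toy constant-velocity tracking example of my own (no input term u; all numbers are made up):

% Sketch: Kalman filter predict/correct loop (Welch & Bishop notation)
dt = 0.01;
A  = [1 dt; 0 1];           % state model: position and velocity
H  = [1 0];                 % we observe position only
Q  = 1e-4 * eye(2);         % process noise covariance
R  = 0.1;                   % measurement noise covariance

xhat = [0; 0];              % a posteriori state estimate
P    = eye(2);              % a posteriori error covariance

for k = 1:1000
    z = sin(k*dt) + sqrt(R)*randn;                  % simulated noisy measurement

    % Predict (time update)
    xhat_prior = A * xhat;                          % a priori state estimate
    P_prior    = A * P * A' + Q;                    % a priori error covariance

    % Correct (measurement update)
    K    = P_prior * H' / (H * P_prior * H' + R);   % Kalman gain
    xhat = xhat_prior + K * (z - H * xhat_prior);   % a posteriori estimate
    P    = (eye(2) - K * H) * P_prior;              % a posteriori covariance
end
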
Extended Kalman Filter (EKF)

The default KF addresses the general problem of trying to estimate a state $x$ governed by a linear stochastic model. When the state and/or observation model is nonlinear, the KF needs to linearize about the current mean and covariance. Therefore, $A$, $B$, $H$, $w$ and $v$ are replaced by Jacobians evaluated at the current mean and covariance. The math is different, but the architecture is the same.

Unscented Kalman Filter (UKF)

The UKF is a better implementation of the KF for nonlinear state and observation models. Why use the UKF when we can use the EKF? According to [wikipedia](https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter), when the state and observation models

> are highly non-linear, the extended Kalman filter can give particularly poor performance. This is because the covariance is propagated through linearization of the underlying non-linear model. The unscented Kalman filter (UKF) uses a deterministic sampling technique known as the unscented transform to pick a minimal set of sample points (called sigma points) around the mean. These sigma points are then propagated through the non-linear functions, from which the mean and covariance of the estimate are then recovered. The result is a filter which more accurately captures the true mean and covariance.

In Zheng Li's 2009 paper, [Unscented Kalman Filter for Brain-Machine Interfaces](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006243), he applies the UKF to BMI. The key points were:

  1. The UKF allows a nonlinear neuron tuning model -- a nonlinear observation model. In the paper, the tuning model is quadratic and describes the tuning better than a linear one.
  2. The UKF also allows a nonlinear movement model -- a nonlinear state model.
  3. The movement model was made to be of degree n, meaning that the state equation is no longer a function of only the previous state, but of the previous n states. Incorporating this "movement history" acts as smoothing for the output kinematics.

The number of sigma points was chosen to be $2d+1$, where $d=4n$ is the dimension of the state space. If $n=10$, then $d=40$ -- a big state space. The name unscented is pretty dumb and provides no clue as to how it modifies the KF.

Adaptive UKF

As a follow-up to the 2009 paper, Li published [Adaptive UKF](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3335277/). This algorithm modifies the n-degree UKF by including a

> Bayesian regression self-training method for updating the parameters of an unscented Kalman filter decoder. This novel paradigm uses the decoder's output to periodically update the decoder's neuronal tuning model in a Bayesian linear regression.

So, the tuning model/observation matrix $H$ is updated online periodically. Great for modeling the plasticity-associated non-stationarity.

ReFIT-KF

Apr 10, 2016 - Notes on Learning from Data

Last September I went through Caltech's Learning from Data course offered through edX. I went through all the lectures and homework except for the final exam.

Finally got around to finishing it this week, after re-reading some forgotten material. All the code written for the problem sets is here. Some notes:

VC Dimension and Generalization

One of the most emphasized points in the course is how to estimate the generalization performance of an algorithm. The VC dimension can be used as a preliminary measure of an algorithm's generalization performance. A more complex hypothesis set is more likely to overfit, and given the same amount of training data, may lead to worse generalization. When possible, using a simpler hypothesis set is better. Otherwise, use regularization.

Linear Regression

Simple, not very sexy, but surprisingly powerful in many applications. Combined with nonlinear transforms, it can solve many classes of problems. We have many techniques that can solve linear problems, so in general we want to try to reduce problems to this class.

Validation

Cross-validation, especially leave-one-out, is an unbiased estimator of the out-of-sample error $E_{out}$.

Use validation for model selection, including parameters for regularization.
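
For example, a leave-one-out sketch in MATLAB for picking the regularization parameter of ridge regression (X and y are hypothetical N×d features and N×1 targets, not from the course code):

% Sketch: leave-one-out CV to pick the regularization parameter lambda
lambdas = logspace(-4, 2, 20);
N = size(X, 1);
cvErr = zeros(size(lambdas));
for li = 1:length(lambdas)
    errs = zeros(N, 1);
    for n = 1:N
        idx = setdiff(1:N, n);                         % leave sample n out
        w = (X(idx,:)'*X(idx,:) + lambdas(li)*eye(size(X,2))) \ (X(idx,:)'*y(idx));
        errs(n) = (y(n) - X(n,:)*w)^2;                 % error on the held-out sample
    end
    cvErr(li) = mean(errs);                            % LOO estimate of Eout
end
[~, best] = min(cvErr);
bestLambda = lambdas(best);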

SVM

Very powerful technique. The motivation is simple -- find a hyperplane that separates two classes of data with the greatest margin from the separating plane to the closest points of each class. Solving the equations becomes a quadratic programming problem: minimize the norm of the weights of the plane (which is inversely proportional to the margin width), subject to all points being classified correctly (in hard-margin SVM).

This original formulation is called the primal problem.

The same problem can be formulated as a Lagrange multiplier problem, with an additional multiplier parameter αi for each data point. This problem requires minimizing over the weights after maximizing over the multipliers α. Under the KKT conditions, we can switch the order to maximizing over the multipliers after minimizing over the weights. This allows us to write expressions for the weights in terms of α from the inner minimization (by setting the partial derivatives of the expression to zero).

Finally, we reach another quadratic programming problem, where the objective is over the αs, subject to constraints coming from the minimization over the weights.

The advantage of this dual problem is that the data points enter the expression only through inner products -- allowing us to use the kernel trick, which is a computationally efficient way of doing nonlinear transformations to higher-order (even infinite) dimensions.

One of the coolest things about SVM is that, according to a strict VC-dimension analysis, after nonlinear transformation of the data points into infinite dimensions via the kernel trick, Eout should be unbounded. However, because SVM searches for the hyperplane with the fattest separating margin, the effective VC dimension is significantly reduced, and thus Eout is kept under control.
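
A quick MATLAB sketch of the hard-margin dual solved with quadprog on toy separable data (my own example, not from the course; requires the Optimization Toolbox):

% Sketch: hard-margin SVM via the dual QP
rng(1);
X = [randn(20,2)+2; randn(20,2)-2];     % two linearly separable clusters
y = [ones(20,1); -ones(20,1)];
N = size(X,1);

H   = (y*y') .* (X*X');                 % H_ij = y_i y_j x_i.x_j
f   = -ones(N,1);                       % maximize sum(alpha) - 0.5*a'Ha
Aeq = y'; beq = 0;                       % sum_i alpha_i y_i = 0
lb  = zeros(N,1);                        % alpha_i >= 0 (hard margin)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);

w  = X' * (alpha .* y);                  % recover weights from the multipliers
sv = alpha > 1e-5;                       % support vectors
b  = mean(y(sv) - X(sv,:)*w);            % bias from the support vectors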

RBF

An RBF network classifier places an RBF on each training data point, and then classifies a test point by summing those RBF values at the test point.

Since this is prohibitive when the data set is big, we can instead pick K representative points by finding the centroids of K clusters (via a clustering algorithm). The classification problem is thereby transformed into a linear regression problem, where the basis functions are the RBFs centered around the cluster centers.
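
A rough MATLAB sketch of that recipe -- kmeans (Statistics Toolbox) for the K centers, then a linear fit on the RBF features (X, y, xnew, and gamma are hypothetical placeholders):

% Sketch: RBF-network classifier via K centers + linear regression
% X: N x d data, y: N x 1 labels in {-1,+1}, xnew: 1 x d test point
K = 10; gamma = 1;
[~, centers] = kmeans(X, K);                % K representative points

Phi = zeros(size(X,1), K);                  % RBF feature matrix
for k = 1:K
    d2 = sum(bsxfun(@minus, X, centers(k,:)).^2, 2);
    Phi(:,k) = exp(-gamma * d2);
end
w = [ones(size(Phi,1),1) Phi] \ y;          % linear regression on RBF features

% Classify a new point
phi_new = exp(-gamma * sum(bsxfun(@minus, centers, xnew).^2, 2))';
yhat = sign([1 phi_new] * w);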

SVM with an RBF kernel still outperforms the regular RBF-network classifier most of the time. So why use it? I can imagine doing unsupervised learning: given a cluster of data points with no labels, simply use clustering + RBF to derive K labels and assign them to new samples.

Apr 5, 2016 - Wireless Headstage: Validation of signal quality IV - Results

Results of the comparison in terms of the metrics given in the metrics post.

Right click to open plots in a new window to see better

Victor Purpura

image1

In this plot, the x-axis is the acceptable time shift by which we can shift a spike in the recording to match one in the reference; it is equal to 1/q, where q is the shift cost in the original VP formulation.

When the acceptable time shift is small, VP is a measure of the number of non-coincident spikes. When the shift is large, it is more a measure of difference in spike counts.

As expected, the performance is much worse when SNR is low. Plexon performs well on the VP metric across all timescales. RHD has very close performance, except at the 10 ms scale. This is most likely because the onboard spike sorting uses only a 0.5 ms window rather than the full 1 ms window used in Plexon, resulting in missed spikes or shifted spike times. I expect the difference between Plexon and RHD to decrease with shorter spike durations/features.

RHA has the worst performance, and this is especially apparent for low-SNR recordings. The performance difference between RHA and RHD is most likely due to RHA's use of AGC. When SNR is low, AGC also amplifies the noise level -- therefore a burst of noise may have an amplitude similar to a spike. This is also apparent when trying to align the RHA and reference waveforms prior to metric comparison -- with simple peak-finding it is impossible to align the signals properly when SNR = 1 dB. More advanced frequency-domain techniques may be needed. Therefore, the cyan RHA curve should be treated as an outlier.

image2

The mean and standard deviation plot confirms the above. At time scales greater than 10 ms, Plexon and RHD have very close performance over different SNRs.

van Rossum

image3

The x-axis is the τ parameter used in the exponential kernel. As τ increases, the metric is supposed to go from measuring non-coincident spikes to measuring the difference in spike counts and firing rates.

I would expect the curves to show the same trend as in the VP distance, but instead they show the opposite trend. Further, the results don't agree with the other tests... therefore I must conclude that even the corrected Paiva formula is incorrect..?

image4

Schreiber

image5

The x-axis is the σ parameter used in the Gaussian kernel. As σ increases, the metric goes from measuring non-coincident spikes to measuring the difference in spike counts and firing rates. The curves show the expected shape. And again, at time scales greater than 10 ms, all three systems have similar performance.

image6

Binned Distance

image7

The x-axis is the bin size used in calculating the binned distance. The order of the curves is almost the same as in the VP distance. Again, Plexon performs best at small time scales, and all three systems converge in performance with increasing bin size.

image8

The mean curve shows that the convergence for the binned distance is a lot steeper than for the VP distance. This is really good news, as most of our decoding algorithms use binned firing rates. It shows that for bin sizes >= 25 ms, the wireless systems have performance very close to Plexon.

Finally, the order of performance is again Plexon > RHD > RHA.

Apr 4, 2016 - Wireless Headstage: Validation of signal quality III - Computation and Methods

Some computation problems/notes I encountered while implementing the metrics. Code

Victor Purpura Distance

Victor and Purpura's MATLAB implementation, adapted from the Sellers algorithm for DNA sequence matching, runs pretty slowly despite its $O(N_iN_j)$ complexity.

For a test comparison of two spike trains, one with 8418 spike times and the other with 9782 spike times, it takes around 75 seconds. A bit too long. On Victor's page there are a few other MATLAB implementations, but they only deal with parallelizing calculations over different costs of shifting a spike. The mex function has weird usage. So I tried translating the algorithm into Julia and calling it from MATLAB. The result is spkd_qpara.jl.

Pretty straightforward. Importing my spike time vectors into Julia and running the function takes about 1.07 seconds, a significant speed-up. However, most of my analysis, plotting, and data are in MATLAB, so I would like to call Julia from MATLAB. This functionality is provided by jlcall.

Configuration by running jlconfig as instructed in the README was smooth, except for the message:

> Warning: Unable to save path to file '/usr/local/MATLAB/R2015a/toolbox/local/pathdef.m'. You can save your path to a different location by calling SAVEPATH with an input argument that specifies the full path. For MATLAB to use that path in future sessions, save the path to 'pathdef.m' in your MATLAB startup folder. 
> > In savepath (line 169)
>   In jlconfig (line 100) 

Warning most likely related to my machine.

Using any of the methods from Jl.m, however, was not. I received the error:
Invalid MEX-file '/home/allenyin/.julia/v0.4/jlcall/m/jlcall.mexa64': libjulia.so: cannot open shared object file: No such file or directory.

So it seems MATLAB cannot find libjulia.so. This is documented in jlcall's bug list.

The simple fix, as suggested in the message thread, is to run in MATLAB

setenv('LD_RUN_PATH', [getenv('LD_RUN_PATH'), ':', '/usr/lib/x86_64-linux-gnu/julia'])

before running the jlconfig script. The path given is the location of the .so file.

Now to call victorD from Matlab, do:

Jl.include('spkd_qpara.jl');
Jl.call('victorD', spikeTimes1, spikeTimes2, 200);

The second call takes 1.41 seconds.

Van Rossum Distance

Paiva 2009 derived a computationally efficient estimator of order $O(N_iN_j)$ for the VR distance:

\[d_{vR}(S_i,S_j)=\frac{1}{2}\left[\sum_{m=1}^{N_i}\sum_{n=1}^{N_i}L_\tau(t_m^i-t_n^i)+\sum_{m=1}^{N_j}\sum_{n=1}^{N_j}L_\tau(t_m^j-t_n^j)\right]+\sum_{m=1}^{N_i}\sum_{n=1}^{N_j}L_\tau(t_m^i-t_n^j)\]

where $L_\tau(\cdot)=\exp(-|\cdot|/\tau)$ is the Laplacian kernel. However, this formula is wrong. By inspection we can see that this distance will never be 0, since it is a sum of exponentials. Further, sample calculations of the distance between less similar spike trains may even yield a smaller distance! These problems can be corrected by changing the sign of the last summation term. Unfortunately, this typo occurs in (Paiva et al., 2007) as well as Paiva's PhD thesis. However, as I can't derive the formula myself (there must be some autocorrelation trick I'm missing), I don't trust it very much.

It is nevertheless a much more efficient estimator than the naive approach of taking the distance between the convolved spike trains, and it does not depend on time discretization.
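
For what it's worth, here is a short MATLAB transcription of the estimator with the sign of the last term flipped as described above; treat it with the same suspicion as the formula itself (ti, tj, and tau are placeholders):

% Sketch: Paiva-style van Rossum estimator with the corrected (negative) last term
% ti, tj: column vectors of spike times; tau: kernel time constant
Lker = @(a, b) exp(-abs(bsxfun(@minus, a, b'))/tau);    % Laplacian kernel matrix
d_vr = 0.5*(sum(sum(Lker(ti, ti))) + sum(sum(Lker(tj, tj)))) ...
       - sum(sum(Lker(ti, tj)));
% Sanity check: with the minus sign, identical spike trains give d_vr = 0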

Thomas Kreuz has several van Rossum MATLAB implementations posted on his website, but they don't work when $\tau < 20$ ms.

Schreiber Distance

This dissimilarity measure also has a computationally efficient estimator of order $O(N_iN_j)$, given in Paiva 2010.

\[d_{CS}(S_i,S_j)=1-\frac{\sum_{m=1}^{N_i}\sum_{n=1}^{N_j}\exp\left[-\frac{(t_m^i-t_n^j)^2}{2\sigma^2}\right]}{\sqrt{\left(\sum_{m,n=1}^{N_i}\exp\left[-\frac{(t_m^i-t_n^i)^2}{2\sigma^2}\right]\right)\left(\sum_{m,n=1}^{N_j}\exp\left[-\frac{(t_m^j-t_n^j)^2}{2\sigma^2}\right]\right)}}\]
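
A direct MATLAB transcription of this estimator, assuming the denominator is the product of the norms as in the definition of $d_{CS}$ above (ti, tj, and sigma are placeholders):

% Sketch: Schreiber dissimilarity estimator (Gaussian kernel)
% ti, tj: column vectors of spike times; sigma: kernel width
Gker = @(a, b) exp(-bsxfun(@minus, a, b').^2 / (2*sigma^2));
num  = sum(sum(Gker(ti, tj)));
den  = sqrt(sum(sum(Gker(ti, ti))) * sum(sum(Gker(tj, tj))));
d_cs = 1 - num/den;                    % 0 for identical trains, at most 1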

Generating Test Signals

Signals were generated using Leslie Smith and Nhamoinesu Mtetwa's Noisy Spike Generator MATLAB Software v1.1 (NSG) using makeReferenceSignals.m.

Sampling rate was set to 31250Hz since that is the sample rate of the wireless headstages. This frequency is greater than the maximum Plexon ADC frequency of 20,000Hz, but offline analysis may simply use interpolation.

Notes on using this toolbox:

  1. The main script generatenoisysamples.m outputs a final signal that includes uncorrelated spike noise, correlated spike noise, and Gaussian noise. The NoiseSNR parameter only sets the Gaussian noise level. As this validation is about recording quality, not the spike-sorting algorithm, I modified the script to not include the uncorrelated and correlated spike noise in the final output.

  2. The manual of the toolbox claims that the spike duration can be changed through the T_Delta_Integrate and N_Delta_Integrate parameters (spike duration is T_Delta_Integrate × N_Delta_Integrate). However, changing these parameters simply changes how many samples are taken from a pre-generated template file. These template files were generated by the author, and the script used was not available. Regardless, the default spike duration of ~1.8 ms, while longer than most motor cortex spikes we encounter, can still be detected and sorted by all three systems tested.

Aligning Reference with Recording

After the reference signals are generated, they are converted to .wav files with Fs = 31250 Hz. All audio signals (including those with two distinct spikes firing) are available on soundcloud. The audio signals then go through the Plexon headstage test board, to which the wireless and Plexon headstages may connect.

As mentioned in Validation of signal quality I, since the reference signal and the recording cannot be started and stopped at exactly the same time, some alignment needed to be done before applying the spike train measurements.

The reference and recording are aligned at the start by aligning the peak of the first spike in the reference with that of the recording. I use the recorded analog waveform, rather than the recorded spike times, for alignment because the first spike may not be detected, and it's easier to examine the alignment quality visually from the analog waveforms.

The first spike peak of the reference signal is easily found by doing peak-finding on a 5ms window after the first spike-time (which is known). The spike times generated by NSG marks the beginning of a spike, not the peak.

When SNR is high, peak-finding on the recording can also find the peak of the first recorded spike pretty easily. But with lower SNR, the noise level gets higher and the MinPeakHeight parameter of MATLAB's findpeaks() must be adjusted so that a noise peak is not mistaken for a signal peak. Of course, it's not always possible to align the two signals, which contributes to bad spike train comparisons. The best MinPeakHeight found this way for each recorded signal is in compare_signals.m. The script checkAlignment.m helps with that.

For the end of the recording signal - sometimes when I don't stop the input audio signal in time, extra signal is recorded. Thus after alignment, I cut the recording to within 102 seconds of the alignment start.

Mar 31, 2016 - Wireless Headstage: Validation of signal quality II - Metrics

While recent work such as A wireless transmission neural interface system for unconstrained non-human primates focuses on how the transmitted analog waveform compares to that recorded by wired commercial systems, our validation goal is somewhat different.

While our wireless system also streams the full analog waveform, the focus is on onboard spike sorting and spike reporting. Therefore, the most important measure is how the reported spike trains compare with the ground truth and with those sorted/measured by commercial wired systems. Thus spike-train analysis is conducted here.

Systems compared:

  1. RHA-headstage with resistor-set RHA-Intan filtering, AGC, and four stages of IIR digital filtering. Final bandwidth is 500Hz-6.7kHz or 150Hz-10kHz.

  2. RHD-headstage with onchip-resistor-set RHD-Intan filtering (1Hz-10kHz), manually set fixed gain (Q7.8 ranges -128:0.5:127.5), and two stages of IIR digital filtering. Final bandwidth is 250Hz-9kHz.

    I observed that it's much easier to do spike sorting while keeping the Intan onboard bandpass filter at a wide bandwidth, so the displayed waveform is not too distorted. The IIR filters can apply sufficient filtering to remove low- and high-frequency noise and expose the key spike features, but having access to the signal prior to the IIR stage can be very helpful for sorting, especially when SNR is low. Finally, changing from AGC to a user-set fixed gain seems to yield much cleaner signals, since AGC amplifies noise as well, making it very difficult to sort under low SNR. The downside is that it's necessary to check the templates often to make sure the previously set gain is still sufficient.

  3. Plexon (old Harveybox..), standard spike filtering options. Waveform uses 20kHz sampling rate.

All three systems use manual PCA spike sorting. While the actual sorting interface has a time window of 1ms, the headstages use only templates that are 0.5ms long to sort spikes onboard.

Measures used:

To compare spike train performance, I referred to A comparison of binless spike train measures (Paiva, Park, and Principe 2010) for a review and comparison of these measures. Very useful, despite an amateur error in one of the formulas.

  1. Victor-Purpura Distance (VP)

    The first binless distance measure proposed in the literature. It defines the distance between two spike trains in terms of the minimum cost of transforming one spike train into the other by means of just three basic operations: spike insertion (cost 1), spike deletion (cost 1), and shifting a spike by some interval $\Delta t$ (cost $q\Delta t$). The cost per unit time, q, sets the time scale of the analysis. For q=0 the distance equals the difference in spike counts, while for large q the distance is linearly proportional to the number of non-coincident spikes, as it becomes more favorable to delete and reinsert all non-coincident spikes rather than shifting them (cost 2 in this case). Thus, by increasing the cost, the distance is transformed from a rate distance to a temporal distance. Note that the cost is inversely proportional to the acceptable time shift between two spikes. (A small MATLAB sketch of this recursion appears after this list.)

    image1
    Victor-Purpura spike train distance. Two spike trains X and Y and a path of basic operations (spike deletion, spike insertion, and spike shift) transforming one into the other. (Victor and Purpura, 1996).

  2. Van Rossum Distance (VR)

    Similar to the VP-distance, VR distance utilizes the full resolution of the spike times. The VR distance is the Euclidean distance between exponentially filtered spike trains.

    A spike train $S_i$ is defined as $S_i(t)=\sum_{m=1}^{N_i}\delta(t-t_m^i)$, where $\{t_m^i: m=1,...,N_i\}$ are the spike times.

    After convolving this spike train with a causal decaying exponential function $h(t)=\exp(-t/\tau)u(t)$, we get the filtered spike train $f_i(t)=\sum_{m=1}^{N_i}h(t-t_m^i)$. The Van Rossum distance between two spike trains $S_i$ and $S_j$ is the Euclidean distance:

    \[d_{vR}(S_i,S_j)=\sqrt{\frac{1}{\tau}\int_0^{\infty}\left[f_i(t)-f_j(t)\right]^2 dt}\]

    For VR, the parameter τ sets the time scale, and can be thought of as inversely related to VP's q parameter. The metric goes from a count of non-coincident spikes (temporal relationship) to a difference in spike counts (spike rates) as the kernel size τ is increased. This is the opposite of what happens to the VP distance as q is increased.

  3. Schreiber-correlation dissimilarity

    The Schreiber dissimilarity is based on a correlation measure defined in terms of Gaussian-filtered spike trains. In this approach, the filtered spike train $S_i$ becomes $g_i(t)=\sum_{m=1}^{N_i}G_{\sigma/2}(t-t_m^i)$, where $G_{\sigma/2}(t)=\exp(-t^2/\sigma^2)$ is the Gaussian kernel. The role of σ here is similar to τ in the Van Rossum distance, and is inversely related to q in the VP distance. Assuming a discrete-time implementation of the measure, the filtered spike trains can be seen as vectors, for which the usual dot product can be used. Therefore a correlation-like dissimilarity measure can be derived:

    \[d_{CS}=1-r(S_i,S_j)=1-\frac{g_i\cdot g_j}{\|g_i\|\,\|g_j\|}\]

    This metric is always between 0 and 1, and can be thought of as a measure of the angle between the two filtered spike trains. Note that this quantity is not a strict distance metric because it does not fulfill the triangle inequality. Chicharro 2011 suggests that this measure's applicability to estimating reliability is limited because of the individual normalization of the spike trains, which renders the measure sensitive only to similarities in the temporal modulation of the individual rate profiles but neglects differences in the absolute spike count.

    image2
    Van Rossum spike train distance and Schreiber et al. similarity measure. Figure from (van Rossum, 2001) and (Schreiber et al., 2003)

  4. Binned distance ($d_B$)

    While both Paiva 2010 and Chicharro 2011 recommend not using a binned method to measure spike train similarity, it is the most directly related to BMI methods. The binned distance is simply the sum of the absolute differences between the two binned spike trains, which requires us to compare spike trains of the same length.
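
As referenced in item 1, here is a compact MATLAB sketch of the Victor-Purpura edit-distance recursion (essentially the published spkd dynamic program; my own transcription, so double-check before using):

% Sketch: Victor-Purpura spike train distance via dynamic programming
% t1, t2: vectors of spike times; q: cost per unit time of shifting a spike
function d = vpDistance(t1, t2, q)
    n1 = length(t1); n2 = length(t2);
    G = zeros(n1+1, n2+1);
    G(:,1) = (0:n1)';                 % delete all spikes of t1
    G(1,:) = 0:n2;                    % insert all spikes of t2
    for i = 2:n1+1
        for j = 2:n2+1
            G(i,j) = min([G(i-1,j) + 1, ...                       % delete
                          G(i,j-1) + 1, ...                       % insert
                          G(i-1,j-1) + q*abs(t1(i-1)-t2(j-1))]);  % shift
        end
    end
    d = G(n1+1, n2+1);
end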

Notes

According to simulation tests conducted by Paiva 2009, none of the above measures performs the best or consistently when measuring different aspects of spike-train similarity, including 1) discriminating different firing rate; 2) discriminating changing firing rate; 3) Spike train synchrony. In the end, Paiva concludes:

  1. VP and van Rossume distances perform better discriminating constant firing rate.
  2. Schreiber and binned-distance perform better discriminating synchrony.
  3. All measures performed similar in discriminating instantaneous firing rates.

Ideally, validation should tell us exactly what percentage of detected spikes are true positives, and how many spikes we have missed. But that requires probabilistic calculation (e.g. is a spike false positive, or just a shifted spike) that is difficult, I settle for these metrics that are well established and are relevant for BMI -- the firing rates are usually binned and normalized prior to decoding, which is sensitive to time scales. There are other time-scale independent spiket rain distances such as event synchronization (Quiroga 2002), SPIKE synchronization (Kreuz 2015), ISI-distance (Kreuz 2007a), and SPIKE-distance (Kreuz 2013), but they are in my opinion harder to interpret.

Mar 27, 2016 - Wireless Headstage: Validation of signal quality I

After finishing work on the wireless headstage, it occurred to me that no comprehensive validation had been done to confirm the spike-sorting and signal quality of the wireless headstages. So I'm conducting a semi-rigorous test of the RHA headstage and RHD headstage against Plexon recordings. This paper on a Blackrock-based wireless system test has been useful.

First part, I want to examine the signal quality of the streamed raw waveform. The basic setup is:

  1. Generate a neural signal of known length, save this as an audio signal with sampling frequency Fs=31250Hz. This is chosen because the headstage's sampling frequency is 31250Hz.
  2. An extra button was added to gtkclient so that as soon as it starts recording the incoming data from the headstages, it also launches VLC to play that audio signal.
  3. Keep recording until the end of the audio signal.

The recorded signal can then be plotted and analyzed for signal quality.

Note that the command for VLC to play the audio signal is launched from within gtkclient through the system() command. As a result, the headstage is not actually recording the audio signal until after a short VLC initialization delay. This is consistently around 0.5 seconds, as measured from the VLC log.

To compare the recorded signal quality against the reference signal, the first test is visual inspection. Since the recorded signal is a simulated neural signal, this becomes: do the spike features (sharp peaks and valleys) line up? A more quantitative measure is the cross-correlation between the reference and recorded signals, but the resulting value is a bit harder to interpret.

For the visual test, as a result of the VLC delay, we must shift the recorded waveform to align with the beginning of the reference signal. See below for the original signals and the aligned signals. Blue is the reference, red is the recording. The signals are 100 seconds long, with dt = 1/31250 s.
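
For the quantitative alignment, a rough MATLAB sketch using xcorr to estimate the recording's lag relative to the reference (signal names are placeholders; it assumes the recording starts later than the reference, and xcorr requires the Signal Processing Toolbox):

% Sketch: estimate the VLC start delay by cross-correlation, then align
% ref, rec: reference and recorded waveforms, both sampled at Fs = 31250 Hz
Fs = 31250;
[c, lags] = xcorr(rec - mean(rec), ref - mean(ref));
[~, imax] = max(c);
delay = lags(imax);                    % samples by which rec lags ref

recAligned = rec(delay+1:end);         % drop the silent VLC lead-in
n = min(length(recAligned), length(ref));
plot((1:n)/Fs, ref(1:n), 'b', (1:n)/Fs, recAligned(1:n), 'r');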

image1

Note the short red segment at the beginning of the top plot; it signals the absence of the audio signal.

We can inspect the aligned waveforms (aligned by the peaks of the first spikes in each signal). In the beginning (shown below), we can see the spike features align fairly well. The squiggles in the red signal are due to noise and filtering artifacts, which are OK in this case but set the lower SNR limit at which we can distinguish neural signals.

image2

Around the middle of the signal (around 45.82 s), we see the signals are no longer aligned. Rough inspection shows the red recorded signal has developed a lead over the blue signal -- the recorded signal has shifted to the left of the reference signal. The three labeled red peaks correspond to the three labeled blue peaks. This is confirmed by the relative times between the labeled peaks.

image3

We can further make a table of where the corresponding labeled peaks occur within their respective signals, in terms of sample number:

| Recording samp# | Ref samp# | Diff in samp# | Diff in ms |
|-----------------|-----------|---------------|------------|
| 1432167         | 1432972   | 805           | 25.76      |
| 1432707         | 1433515   | 808           | 25.856     |
| 1433025         | 1433830   | 805           | 25.76      |

The last column is calculated by dividing the difference in sample number by 31250 Hz to obtain the lead in seconds, then converting to ms.

Near the end of the signal (around 96.056s), the lead becomes even greater, as shown below.

image4

And the corresponding table of labeled peak locations:

| Recording samp# | Ref samp# | Diff in samp# | Diff in ms |
|-----------------|-----------|---------------|------------|
| 3001750         | 3003284   | 1534          | 49.088     |
| 3002059         | 3003620   | 1561          | 49.952     |
| 3002839         | 3004374   | 1535          | 49.12      |
| 3003072         | 3004612   | 1540          | 49.28      |

The developing lead in sample number between the two signals is most likely due to loss of packets from the wireless headstage. Each packet contains new waveform samples, and if packets are lost, our recorded signal is effectively shortened relative to the reference signal, resulting in this lead. And since the number of lost packets can only increase, the lead will simply grow with time.

While this is a problem when we are doing offline analysis of the recorded signals, it should not be a huge deal for real-time BMI applications where we care about the instantaneous firing rate -- as long as we don't lose too many packets. As Tim has stated in his thesis:

1 in 3176 packets were rejected on average; assuming one bit error per packet, this equates to a bit error rate (BER) of 1.23e-6, or approximately one bit error per second. Efforts to reduce this have not been successful, but the system works acceptably presently.

This matches what I have seen empirically and in the comparisons against Plexon. As long as the BER is on the order of 1e-6, packet loss should not affect performance too much.

The current software keeps track of the total number of lost packets. It might be of interest to find how this changes through time. The previous tables seem to suggest that the number of samples lost increases linearly with time, which supports my hypothesis that the shift is due to lost packets. To get a better idea, however, I would want to perform the same analysis around different time points along the signal, probably within a sliding window. MATLAB's findpeaks() function can extract the positions of the different peaks within the signals, but matching them would be more difficult. While it is relatively easy via visual inspection, it seems to me to be NP-hard... what's an efficient algorithm to do this?

Update - Perhaps a sliding window of the Sellers algorithm to match the 'edit distance' of the peaks would answer the question, but that's not too important for our goal of validation.

Feb 17, 2016 - The Brain Electric - Science journalism on the field of BMI

Just finished reading The Brain Electric: The dramatic high-tech race to merge minds and machines

This book traces the development of modern BMI, following the works of Eric Leuthardt (EEG, ECoG), John Donoghue et al. (Utah Array, Cyberkinetics/Braingate), Andrew Schwartz, and Miguel Nicolelis. It does a good job describing the important publications and experiments. More importantly, I find the descriptions of how DARPA funding changes affected BMI, and the politics/races between the investigators, very interesting. The quotes are pretty believable, according to what I personally know about the people described.

Some interesting points:

Feb 17, 2016 - Compilation of unconventional electrodes

Conventional electrodes used in BMI include the Utah Array, Michigan Array, and the Duke microwire array, which have been used in most BMI work so far.

I compile here a list of unconventional electrodes that I come across.

  1. Multifunctional fibers for simultaneous optical, electrical and chemical interrogation of neural circuits in vivo - Polina Anikeeva, MIT

  2. Syringe-injectable electronics - Charles Lieber, Harvard

  3. Stent-electrode - Thomas Oxley, University of Melbourne.

    Measures only LFP up to 200Hz.

Jan 28, 2016 - Koralek et al 2012: Corticostriatal plasticity is necessary for learning intentional neuroprosthetic skills

Corticostriatal plasticity is necessary for learning intentional neuroprosthetic skills

Plasticity in the motor cortices and the striatum has been shown to accompany the learning of physical skills. The motor cortex and frontal cortices have also been implicated in the learning of abstract skills, and in learning to control neuroprosthetic devices irrespective of physical movement. Some studies suggest that the striatum is involved in learning abstract skills as well.

Question: Is the striatum required for abstract skill learning, and do corticostriatal circuits undergo plasticity during the learning of such skills, as they do during the learning of physical skills?

Method

Rats were trained to modulate the pitch of a tone to two different targets using two ensembles of M1 neurons: increasing the firing rate of one ensemble raises the pitch, while increasing the other lowers it. Different rewards were associated with reaching the two target tones.

Manipulation of expected reward values was used to establish causal relationships.

Findings

  1. Rats showed improved performance through training. Auditory feedback was crucial. Not much overt movement during the task. Causal relationship properly demonstrated.

  2. Performance improvements were accompanied by a significant increase in firing rates in the dorsal striatum (DS) in late learning (higher proficiency) compared with early learning.

    DS firing rates exhibited the greatest modulation during target reaching compared with baseline control periods, as observed during natural motor learning. Modulation was significantly greater in late learning.

    This indicates that DS neurons change their activity during volitional control of M1 activity, and the change increases with learning.

    Coherence between spiking activity in DS and M1 increases; this appeared to be related to learning to perform the task rather than to higher reward expectation --> corticostriatal plasticity is important in novel task learning.

  3. Knocked out NMDARs in striatal medium spiny neurons, which are critical for corticostriatal long-term potentiation. Resulting mice weren't able to learn the neuroprosthetic task.