This book traces the development of modern BMI, following the work of Eric Leuthardt (EEG, ECoG), John Donoghue et al. (Utah Array, Cyberkinetics/BrainGate), Andrew Schwartz, and Miguel Nicolelis. It does a good job describing the important publications and experiments. More importantly, I find the descriptions of how DARPA funding changes affect BMI, and of the politics/races between the investigators, very interesting. The quotes are pretty believable, according to what I personally know about the people described.
Some interesting points:
When DARPA decides on a project, there are "selected" organizations they prefer. Proposals that involve those organizations are favored. For example, APL and DEKA in the Revolutionizing Prosthetics program.
The APL technical digest has a series of articles describing the development of the Modular Prosthetic Limb, very cool!
Cyberkinetics really pioneered the way for BMI to be used in humans, but also demonstrated just how small the market for BMI is. The best bet is probably military applications first, then a trickle-down to civilian uses, methinks.
The intellectual property Cyberkinetics owns is ridiculous:
merged with Bionic Technologies, a neurotech firm cofounded by Normann, who held the patent for the Utah array. [...] added a neural decoding patent out of Brown, while also licensing an astonishingly broad patent held by Emory University's Donald Humphrey. Granted in 2001, Humphrey's patent laid claim to any system that uses sensors implanted in the central nervous system to record and process neural signals that are transmitted to an external device. In other words, most any BCI.
A philosophical point, brought up by Nicolelis in his book Beyond Boundaries, is that consciousness, and perhaps the brain itself, may be defined by the ability to perform actions. Interesting only in the sense that some have extrapolated the experimental results to this vague point. But without "thought-provoking" points like this, the book wouldn't be nearly as interesting to non-scientific readers.
Sensory perception may need to be investigated more to enable natural BMI -- the brain constructs models of the outside world and compares them to sensory inputs.
I may need to read more about mirror neurons.
Rizzolatti and his protégés' original research on mirror neurons:
Plasticity in the motor cortices and the striatum has been shown to accompany the learning of physical skills. The motor cortex and frontal cortices have also been implicated in the learning of abstract skills, and in learning to control neuroprosthetic devices irrespective of physical movement. Some studies suggest that the striatum is involved in learning abstract skills as well.
Question: Is the striatum required for abstract skill learning, and do corticostriatal circuits undergo plasticity during the learning of such skills as they do during the learning of physical skills?
Method
Rats were trained to modulate pitch to two different target tones with two ensembles of M1 neurons: increasing firing rates in one ensemble raises the pitch, while increasing rates in the other lowers it. Different rewards were associated with reaching the target tones.
Manipulation of expected reward values was used to establish causal relationships.
Findings
Rats showed improved performance through training. Auditory feedback was crucial. There were few overt movements during the task. The causal relationship was properly demonstrated.
Performance improvements were accompanied by a significant increase in firing rates in the dorsal striatum (DS) in late learning (higher proficiency) compared with early learning.
DS firing rates exhibited the greatest modulation during target reaching compared with baseline control periods, as observed during natural motor learning. Modulation was significantly greater in late learning.
This indicates that DS neurons change activity during volitional control of M1 activity, and that the change increases with learning.
Coherence between spiking activity in DS and M1 increased, and appeared to be related to learning to perform the task rather than to higher reward expectation --> corticostriatal plasticity is important in novel task learning.
Knocking out NMDARs in striatal medium spiny neurons, which are critical for corticostriatal long-term potentiation, produced mice that weren't able to learn the neuroprosthetic task.
Short review on how BCI can serve as a good platform for studying cognitive processes. Not much new information, but it spells out the advantages of BMI as a neuroscience platform in a lucid way. Includes some good references.
LMS is the second stage of processing in RHA-headstage's signal chain, after AGC. It aims to estimate the noise common to all electrodes and subtract that noise from each channel.
Adaptive Filtering Theory
Chapter 12 on adaptive interference cancelling in Adaptive Signal Processing by Widrow and Stearns is an excellent reference on the topic. Chapter 6 is a great reference on the LMS algorithm.
In adaptive noise cancelling, we assume the noise source injects into both the primary input and the reference input, while the signal of interest is localized in the primary input: the primary input is $d = s + n_0$, and the reference input is $n_1$, a noise correlated with $n_0$. An adaptive filter then tries to reconstruct the injected noise by combining the reference inputs. This approximation ($y$ in the figure below) is subtracted from the primary input to give the filtered signal ($\epsilon = d - y$). This filtered output is then fed back into the adaptive filter to tune the filter weights. The rationale is that a perfect adaptive filter will perfectly approximate the noise component $n_0$: $y$ will then become $n_0$, and the error signal $\epsilon = s$ is minimized.
An interesting question is, why would $\epsilon$ not be minimized toward 0? If we assume the signal $s$ is uncorrelated with the noises $n_0$ and $n_1$, but $n_0$ and $n_1$ are correlated with each other, then the minimum that $E[\epsilon^2]$ can achieve, in the least-mean-squared sense, is equal to $E[s^2]$. Unfortunately, this result is less intuitive and more easily seen through math (Widrow and Stearns, Ch 12.1).
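The key step of that math is short enough to state here (assuming zero-mean signals, with $s$ uncorrelated with both noises):

$$E[\epsilon^2] = E[(s + n_0 - y)^2] = E[s^2] + E[(n_0 - y)^2] + 2E[s(n_0 - y)] = E[s^2] + E[(n_0 - y)^2]$$

The cross term vanishes because $y$ is built only from the reference noise $n_1$, so $s$ is uncorrelated with $n_0 - y$. Adapting the weights can only shrink $E[(n_0 - y)^2]$, so the best achievable error power is $E[s^2]$: the filter converges toward the noise, not toward zero error.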
Another way of seeing LMS filtering: given an input vector $X_k$, we multiply it by a weight vector $W$ to produce an output $y_k = X_k^T W$, choosing $W$ to reduce the least-mean-squared error between $y_k$ and the desired response $d_k$. The error term $\epsilon_k = d_k - y_k$ is then fed back to adjust $W$. This is illustrated below:
In the case of adaptive noise cancelling, the primary input serves as the desired response, and since only its noise component $n_0$ is correlated with the reference input, the filter output converges toward $n_0$. In fact, for stationary stochastic inputs, the steady-state performance of slowly adapting adaptive filters closely approximates that of fixed Wiener filters.
A side tangent: why don't we use LMS with the Wiener filter in BMI decoding? In that case the adaptive filter output and the desired response would have to be decoded covariates, and it is difficult to see how that would increase decoding performance.
In contrast with the Wiener filter's derivation, where we estimate the gradient of the error in order to find the weight values that minimize it, in LMS we simply approximate the gradient of $\epsilon_k^2$ with $\hat{\nabla}_k = -2\epsilon_k X_k$. This turns out to be an unbiased estimator, and the weights can be updated as:

$$W_{k+1} = W_k + 2\mu\epsilon_k X_k$$

where $\mu$ is the update rate, similar to that in other gradient-descent algorithms.
The LMS algorithm is therefore a rather simple algorithm to serve as the adaptive-filter block in the adaptive noise cancellation framework.
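To make this concrete, below is a minimal floating-point sketch of an LMS adaptive noise canceller (plain textbook LMS, not the fixed-point variant the firmware uses later; all names are mine):

#include <stddef.h>

#define NTAPS 7  /* number of reference inputs */

/* One step of an LMS adaptive noise canceller.
 * d: primary input (signal + noise), x: reference inputs (correlated noise),
 * w: filter weights, mu: update rate.
 * Returns the error e = d - y, which is the noise-cancelled output. */
static double lms_step(double d, const double x[NTAPS], double w[NTAPS], double mu)
{
    double y = 0.0;                       /* linear combiner output */
    for (size_t k = 0; k < NTAPS; k++)
        y += w[k] * x[k];
    double e = d - y;                     /* epsilon = d - y */
    for (size_t k = 0; k < NTAPS; k++)
        w[k] += 2.0 * mu * e * x[k];      /* W_{k+1} = W_k + 2*mu*e*X_k */
    return e;
}

Driving lms_step with each new sample set implements the whole figure: the weights converge so that y approximates the correlated noise, and the returned error approximates the clean signal.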
Application in firmware
To apply these algorithms to our headstage application, we treat the current channel's signal as $d_c(n) = s_c(n) + \eta_c(n)$, where $s_c$ is the signal of interest on channel $c$ and $\eta_c(n)$ is the noise added to channel $c$ at the current time $n$. $\eta_c$ can be assumed to be Gaussian (even though in reality it probably isn't, e.g. chewing artifacts). It's reasonable to assume that $\eta_c$ is correlated with the noise observed in the other channels (since they share the same chip, board, etc.): $\eta_{c-1}$, $\eta_{c-2}$, and so on. The signals measured in the previous channels can then be used as the reference input sources to predict and model the current channel's noise.
This is implemented in the firmware by Tim Hanson as below:
/* LMS adaptive noise remover
want to predict the current channel based on samples from the previous 8.
at this point i1 will point to x(n-1) of the present channel.
(i2 is not used, we do not need to write taps -- IIR will take care of this.)
24 32-bit samples are written each time through this loop,
so to go back 8 channels add m1 =-784
to move forward one channel add m2 =112
for modifying i0, m3 =8 (move +2 weights)
and m0 =-28 (move -7 weights)
i0, as usual, points to the filter taps! */
r3= abs r3 (v) ||r7=[FP - FP_WEIGHTDECAY]; //set weightdecay to zero to disable LMS.
r4.l = (a0 =r0.l *r7.l), r4.h = (a1 =r0.h *r7.l) (iss2) || i1 += m1 ||[i2++]=r3; //saturate the sample, save the gain.
// apply LMS weights for noise cancelling
mnop ||r1=[i1++]||[i2++]=r4; //move to x1(n-1), save the saturated sample.
mnop ||r1=[i1++m2]||r2=[i0++]; //r1 = sample, r2 = weight.
a0 =r1.l *r2.l, a1 =r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
a0+=r1.l *r2.l, a1+=r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
a0+=r1.l *r2.l, a1+=r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
a0+=r1.l *r2.l, a1+=r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
a0+=r1.l *r2.l, a1+=r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
a0+=r1.l *r2.l, a1+=r1.h *r2.h ||r1=[i1++m2]||r2=[i0++];
r6.l = (a0+=r1.l *r2.l), r6.h = (a1+=r1.h *r2.h) ||r1=[i1++m1]||r2=[i0++m0];
// move i1 back to the integrator output 7 chans ago; move i0 back to LMS weight for that chan
r0=r0-|-r6 (s) ||[i2++]=r0||r1=[i1--]; //compute error, save original, move i1 back to saturated samples.
r6=r0>>>15 (v,s) ||r1=[i1++m2]||r2=[i0++]; //r1 = saturated sample, r2 = w0, i0 @ 1

.align 8
// UPDATE LMS weights
a0 =r2.l *r7.h, a1 =r2.h *r7.h || nop || nop; //load / decay weight.
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w1, i0 @ 0
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 0, i0 @ 2
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w2, i0 @ 1
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 1, i0 @ 3
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w3, i0 @ 2
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 2, i0 @ 4
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w4, i0 @ 3
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 3, i0 @ 5
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w5, i0 @ 4
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 4, i0 @ 6
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m2]||r2=[i0--]; //r2 = w6, i0 @ 5
a0 =r2.l *r7.h, a1 =r2.h *r7.h ||[i0++m3]=r5; //write 5, i0 @ 7
r5.l = (a0 +=r1.l *r6.l), r5.h = (a1 +=r1.h *r6.h) ||r1=[i1++m3]||r2=[i0--]; // inc to x1(n-1) r2 = w1, i0 @ 6
mnop ||[i0++]=r5; //write 6.
This is described by Tim in the following diagram, where $c$ = current channel number and $n$ = current sample number.
Prior to the start of line 12, i1 points to the AGC gain of the current channel $c$, i2 points to the integrator mean of the current channel, and i0 points to the AGC transfer-function coefficients that precede the LMS coefficients.
Line 12: r3 will contain the updated AGC gain, and r7 will contain the weight decay for LMS. The weight decay will be either 0 or 1 in Q15; 0 effectively disables LMS.
Line 13: r0 currently contains the AGC output. The multiplication multiplies the AGC output by the weight decay, saturates the sample, and moves it to r4. The updated AGC gain from the previous step is saved by writing through i2, which is then incremented to point at the saturated AGC output. i1 is moved to point to the AGC gain value corresponding to channel $c-7$.
Line 17: The saturated sample of channel $c-7$ is loaded into r1; i1 now points to the AGC output for channel $c-7$. The current channel's saturated AGC sample in r4 is saved, and i2 now points to the current channel's AGC output (from sample $n-1$).
Line 18: r1 is loaded with the AGC output of channel $c-7$; i1 points to the AGC output of channel $c-6$. r2 is loaded with the LMS weight corresponding to channel $c-7$, or $w_0$ in the diagram.
Line 19: a0 contains the product $w_0 x_0$ in the figure above. r1 is loaded with the AGC output of channel $c-6$ while i1 ends up pointing to the AGC output of $c-5$. r2 is loaded with $w_1$ while i0 ends up pointing to $w_2$.
Lines 20-24: Continue the adaptive-filter operation (the multiply/summation block in the diagram). At the end of line 24, r1 contains the AGC output of $c-1$, and i1 is moved to point to the AGC output of the current channel. r2 is loaded with $w_6$, and i0 is moved to point to the low-pass filter coefficients.
Line 25: r6 now contains the result of the entire multiply-accumulate operation of the linear combiner ($y$ in figure 2). r1 is loaded with the AGC output of the current channel, and i1 is then moved back to point at the AGC output of $c-7$. r2 is loaded with the low-pass filter coefficients, and i0 is moved back to point to $w_0$. These r1 and r2 values will not be used and will be overwritten by the next instruction.
Line 28: r0 originally contains the current channel's AGC output. After subtracting r6 from it, r0 contains the LMS-filtered output -- $\epsilon$ in figure 1. Meanwhile, the current AGC output is written out, after which i2 points to the delayed sample used by the IIR filters. r1 is loaded with the AGC output of $c-7$, after which i1 points to the saturated AGC output of $c-7$.
Line 29: After the vector shift of r0, r6 is loaded with the sign of the filtered LMS output, $\mathrm{sign}(\epsilon)$. r1 is loaded with the saturated AGC output of $c-7$; i1 then points to the saturated AGC output of $c-6$. r2 is loaded with the corresponding LMS weight $w_0$; i0 then points to $w_1$.
Lines 33-34: Update the weight $w_0$. r5 will contain the updated weight, equal to $w' = \gamma w + x\,\mathrm{sign}(\epsilon)$, where $w$ is the LMS weight, $\gamma$ is the weight decay (either 0 or 1, indicating whether LMS is on), $x$ is the saturated AGC output corresponding to the weight, and $\mathrm{sign}(\epsilon)$ is the sign of the current channel's filtered LMS output.
At the end of line 34, r1 is loaded with the saturated AGC output of $c-6$, then i1 points to the saturated AGC output of $c-5$. r2 is loaded with $w_1$, but i0 points back to $w_0$ in anticipation of its update being written.
Line 35: Here the updated weight $w_0$ is written (hence i0 pointing to that location after line 34), after which i0 skips ahead to point at $w_2$.
Line 36: r5 contains the updated weight $w_1$. r1 is loaded with the saturated AGC output of $c-5$; i1 points to the saturated AGC output of $c-4$. r2 is loaded with $w_2$; i0 then points to $w_1$ in anticipation of its update.
Line 46: r5 now contains the last updated weight, $w_6$. r1 is loaded with the current channel's saturated AGC output, then i1 is moved to point to the current channel's delayed sample (in anticipation of the IIR filter). i0 points to $w_6$.
Line 47: The updated $w_6$ is written back to memory.
The implemented algorithm is a modified LMS approach, in which the filter weights are updated by

$$w' = \gamma\,w + x\,\mathrm{sign}(\epsilon), \quad \text{or in register terms,} \quad \mathrm{r5} = \mathrm{r7.h}\times\mathrm{r2} + \mathrm{r6}\times\mathrm{r1},$$

where the registers are those used in lines 33-46.
Here, the old weight $w$ is loaded into r2. r7 contains the weight-decay factor $\gamma$; if it's set to 0, LMS is effectively disabled, as no weight update happens and the stage becomes a fixed gain stage.
r1 contains the saturated version of each reference channel's input. r6 is the sign of the error term.
In this modified LMS approach, if the error and the sample have the same sign, the weight increases (Hebb's rule). The implementation involves saturation and sign-only weight updates, so it's not analytically exact. However, it is fast, and according to Tim,
in practice the filter settles to final weight values within 4s. The filter offers 40dB of noise rejection.
If at any time the error term $\epsilon = 0$, the current channel can be completely predicted from the previous 7 channels, meaning the current channel is measuring only noise. This could serve as a good debugging tool.
Finally, at the end of this code block, r0 contains the LMS-filtered sample, i.e. $\epsilon$. The IIR stage following LMS operates directly on this filtered sample. Therefore only the regular and saturated AGC outputs are saved; the LMS filter output is not saved until the IIR stage stores it as a delayed sample (after the AGC-out slot).
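Putting the per-channel operation together, here is a small C model of what the assembly implements (a model only, with the Q15 arithmetic replaced by floats; names are mine):

#define NREF 7  /* the previous 7 channels serve as references */

/* One channel's pass through the LMS stage: predict the current channel's
 * AGC output from the saturated AGC outputs of the previous 7 channels,
 * subtract the prediction, and update the weights with the sign-based rule.
 * gamma is the weight decay (1 = LMS on, 0 = LMS off). In the fixed-point
 * code the +/-1 sign lives in Q15, so the effective step is xsat * 2^-15. */
static float lms_channel(float d, const float xsat[NREF], float w[NREF], float gamma)
{
    float y = 0.0f;                        /* linear combiner output */
    for (int k = 0; k < NREF; k++)
        y += w[k] * xsat[k];
    float e = d - y;                       /* LMS-filtered sample, epsilon */
    float s = (e >= 0.0f) ? 1.0f : -1.0f;  /* sign(epsilon) */
    for (int k = 0; k < NREF; k++)         /* r5 = r7.h*r2 + r6*r1 */
        w[k] = gamma * w[k] + s * xsat[k] * (1.0f / 32768.0f);
    return e;                              /* handed to the IIR stage */
}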
The effect of LMS is especially clear when the same signal is applied to all channels of an amplifier:
Simulated spike signals are applied through the Plexon headstage tester to channels 0-31. In the image shown, the first two channels have fairly low signal. This is expected: since all channels carry the same signal, the previous 7 channels can exactly reconstruct the signal in the current channel, and the subtraction leaves little behind.
Turning off LMS, we have the following:
The identical spike signals applied now show up loud and clear in the first two channels shown.
The radio-transmission protocol is described fairly well in Tim's thesis pg. 166-168.
Some important points to remember, and references:
Each 32-byte radio packet contains, in order from low to high byte:
6 samples from 4 channels of raw data (from any point in the signal chain, as selected by gtkclient).
8 bytes of template matching.
Each byte indicates whether a spike was detected for both templates of a group of 4 channels. If each bit represented one template on one channel, all 8 bits of the byte would be used. However, we assume that the two neurons represented by the two templates on the same channel cannot both fire at the same time, which allows an encoding scheme that compresses the matching byte into only 7 bits. The most significant bit of each of the first four template-matching bytes is then replaced by one bit of a packet-number-in-frame nibble. This nibble ranges from 0-15, representing the position of the current packet within the 16-packet radio frame.
The next four template-matching bytes have their most significant bits replaced by the bits of an echo nibble. This echo nibble is extracted from the last command packet gtkclient sent to the headstage, and allows a sort of SEQ-ACK exchange between the headstage and gtkclient, although it only reveals when a packet might have been lost -- there is no bandwidth or CPU budget to resend a lost data packet. Control packets, however, can be resent easily; the Sync headstage function in gtkclient is a good way of sending control packets to sync the headstage configuration to that in gtkclient.
There are a total of $128 \times 2 = 256$ template matches (two templates on each of the 128 channels), meaning it takes 4 packets to transmit all template matches, accompanied by 24 samples from 4 channels.
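For concreteness, the 32-byte layout can be written as a C struct (illustrative only -- the firmware assembles these bytes by hand, and the names are mine):

#include <stdint.h>

/* One 32-byte data packet, low byte to high byte.
 * The MSBs of match[0..3] carry the 4-bit packet-number-in-frame nibble;
 * the MSBs of match[4..7] carry the 4-bit echo nibble. */
typedef struct {
    uint8_t samples[24]; /* 6 consecutive samples x 4 channels */
    uint8_t match[8];    /* template-match bits, 7 usable bits per byte */
} radio_packet_t;

_Static_assert(sizeof(radio_packet_t) == 32, "data packet must be 32 bytes");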
The ADC runs at 1 Msps and serves 32 channels per amplifier, so a new sample set arrives every 32 µs. Each packet requires 6 samples, so each packet takes $6 \times 32\,\mu s = 192\,\mu s$ to fill, and 4 packets take $768\,\mu s$ to transmit.
The firmware sets aside a buffer big enough for two radio frames; each radio frame contains 16 packets. The radio starts sending when one frame is full, while the signal chain collates and saves new packets into the other frame. Each frame therefore takes about $16 \times 192\,\mu s = 3.072$ ms to send, which is less than the 4 ms free-run PLL time. Time to transmit, change to receive, and change back is 3.60 ms.
Received command packets consist of 4 (memory address, value) pairs. Each value is 32 bits. Each memory address should be 32 bits as well, but since the upper nibble of these addresses is always 0xf, it is replaced with the 4-bit echo value. In the radio loop handling received packets, the echo value is extracted and saved, the memory address is restored, and the values from the packet are written to the corresponding memory addresses.
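The receive side can be sketched the same way (again a model of the described protocol, not the firmware's actual code):

#include <stdint.h>

/* A command packet carries 4 (memory address, value) pairs. The upper
 * nibble of each address is always 0xf, so it is transmitted as the echo
 * nibble instead and must be restored before the write. */
typedef struct { uint32_t addr; uint32_t value; } cmd_pair_t;

static uint8_t handle_cmd_packet(const cmd_pair_t p[4])
{
    uint8_t echo = (uint8_t)(p[0].addr >> 28);        /* save the echo nibble */
    for (int k = 0; k < 4; k++) {
        uint32_t addr = p[k].addr | 0xf0000000u;      /* restore upper nibble */
        *(volatile uint32_t *)(uintptr_t)addr = p[k].value; /* do the write */
    }
    return echo;          /* echoed back in the next outgoing data packets */
}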
Debugging Notes
If the firmware used on headstage allows for radio transmission (i.e. radio_loop implements the described protocol), there are several ways to debug and check the quality of transmission.
If the headstage's firmware is compiled correctly, the bridge's firmware is compiled correctly, gtkclient is compiled correctly, and multicast is allowed, yet no signals show up in gtkclient, then something is wrong with the radio code within the headstage's firmware.
If gtkclient is displaying raw waveforms but the waveforms look suspicious -- for example, we apply sinusoidal signals but the displayed waveforms aren't very smooth -- that may be caused by errors in either the DSP code or the radio-transmission code. We can eliminate the former possibility if, after gtkclient instructs the headstage to change coefficients to induce oscillations (see the IIR oscillator post), we do indeed see sinusoids of the same frequency on all transmitted channels.
In gtkclient, if the pkts/sec field is changing but the dropped and BER fields stay constant at 0, something is wrong with the radio packaging process. As great as 0 dropped packets sounds, it is very unlikely.
I encountered these last two problems while finalizing the RHD-headstage firmware.
In the screenshots below, I applied square waves of known frequency to one of the amplifiers. The top three channels are floating; the fourth one is supposed to show the square wave applied. The firmware I used did not have any signal-chain steps, so I was simply streaming the acquired samples from the Intan.
800Hz signal applied
1kHz signal applied
1.2kHz signal applied
1.5kHz signal applied
2kHz signal applied
2.8kHz signal applied
It is apparent in most of these screenshots that I'm not getting a square wave back. In the others, I get something resembling a square wave, but the frequency is never right, and the waveform seems to contain multiple frequencies, both higher and lower than expected.
Transmitting the IIR-generated oscillations also showed distortions. The generated waveform is supposed to be 919 Hz. The recorded signals, dumped and plotted in Matlab, are shown below:
The first idea that came to mind for investigating what was going on was to perform an FFT. Below are FFTs of gtkclient recordings when test signals of different frequencies were applied, as well as of the IIR-generated oscillations.
800Hz signal FFT
1kHz signal FFT
3.2kHz signal FFT
5kHz signal FFT
None of the spectra look very clean, and the signal's actual frequency never holds the most energy. However, the highest peaks are consistently about 1300 Hz apart. Since the sampling frequency of the amplifiers is 31.25 kHz, 1300 Hz corresponds to a period of about 24 samples, or every 4 packets.
In the radio-transmission protocol, the spike-match bits loop every 4 packets -- the spike-match bits are saved in a buffer, and for every packet a pointer reads through this buffer. The pointer should therefore wrap around the buffer every 4 packets, since all spike-match bits are transferred in that time.
Very careful line-by-line inspection of the radio packet assembly/transmission code revealed that I had made a mistake when modifying the original transmission code to work with firmware versions that did not include SAA template matching. Fixing those errors solved the problem in all non-final RHD firmware versions.
That would suggest that after adding SAA, I only needed to copy the original radio packet assembly/transmission code from the RHA-headstage's radio5.asm to get everything working. But it didn't work, and the same 1300 Hz problem appeared. This eventually led me to a typo in my initial memory-allocation code in radio_bidi_asm, which executes even before any of the signal-chain or radio-loop threads.
More Resources
Tim has notes on his radio protocol hardware/software development on his website -- a good read in case of any future Nordic radio work.
L1 data memory - consists of 2 banks of 2-way associative cache SRAM, each one is 16kB.
databank A: 0xFF804000 to 0xFF808000
databank B: 0xFF904000 to 0xFF908000
L1 scratchpad RAM - accessible as data SRAM, cannot be configured as cache memory, 4kB: 0xFFB00000 to 0xFFB01000
For our firmware, we want to load:
The instructions into the 32kB SRAM instruction memory.
Signal chain calculations and intermediate values (W1), past samples for template comparison (T1), template match bits (MATCH), look-up tables (ENC_LUT and STATE_LUT), and radio packet frames to L1 databank A.
Signal chain coefficients/parameters (A1), template and aperture coefficients (A1), and pointer values (FP_) to L1 databank B.
This memory layout and the addresses are visualized in each firmware directory's memory_firmwareVersion.ods spreadsheet and memory_firmwareVersion.h header file.
To compile our code to fit this memory structure, we use the Blackfin Linux bare-metal toolchain, a GNU GCC-based toolchain that lets us compile assembly and C/C++ code for different Blackfin architectures. The installation comes with header files such as defBF532.h that list the memory addresses of the various MMRs and system registers.
After running make, the file Linker.map shows the order of the files linked into the final stage.dxe file, which includes:
crt0.o - from crt0.asm. Includes the global start() routine from which the BF532 starts execution, along with most of the processor peripheral setup, such as PLL/clock-frequency and interrupt-handler setup.
radio_AGC_IIR_SAA.o - from radio_AGC_IIR_SAA.asm. This is where all the DSP and interesting code go; the different firmware versions really differ only in this file.
enc.o - from enc.asm, which is generated by enc_create.cpp. It's a subroutine called by the data-memory setup code in radio_AGC_IIR_SAA.asm to install the encoding look-up tables.
divsi3.o - from common_bfin/divsi3.asm, a subroutine for signed division. Not actually used, but could be useful.
main.o - from main.c. The Blackfin starts execution in crt0.asm, which eventually jumps to the main routine within main.c. Within main, more Blackfin and Nordic radio-chip configuration is done; the radio setup (such as the radio channel) happens through the SPI interface. At the end of main(), execution jumps to the setup code radio_bidi_asm within radio_AGC_IIR_SAA.asm.
print.o - from common_bfin/print.h and common_bfin/print.c. Not actually needed for the headstage; a leftover from the compilation setup for the BF537.
util.o - from common_bfin/util.h and common_bfin/util.c. The functions defined are not actually needed for the headstage.
spi.o - from common_bfin/spi.h and common_bfin/spi.c. C functions for talking to the SPI port, needed for reading the flash memory and talking to the radio; these are called from the main() routine.
Finally, bftiny.x is the linker script that produces the final .dxe binary. Author unknown, but it works!
The assembly version of the final compiled code and its memory addresses can be found in decompile.asm. The code order follows the link order. See this document for more details on the compile process.
The flash process uploads the final stage.dxe file into the onboard flash memory. The Blackfin is wired to boot from flash: upon power-up, it reads the flash memory through SPI and loads the data and instructions into the corresponding memory addresses.
The architecture of the firmware (in radio_AGC_IIR_SAA.asm for the RHD-headstage, or radio5.asm for the RHA-headstage) is essentially two threads: the DSP service routine reads in samples from the amplifiers and does the DSP, while the radio control thread handles radio transmission and reception.
The radio control thread (radio_loop) fills the SPI read/write registers and sets flags for the radio to transmit packets. It also reads incoming packets and writes the requested changes to the requested memory locations. Interleaved between these instructions, the DSP service routine is called.
The DSP service routine blocks until a new set of samples (4 samples, one from each amplifier) is read and fully processed, then returns control to the radio control thread. To preserve state for each routine, circular buffers are used to store:
DSP signal-chain constants and coefficient values (A1).
Past stored samples for spike detection (T1).
Data to be transmitted by the radio (packet buffer).
After all amplifier channels have been processed once, template matches and raw samples are written to the packet buffers. The radio control thread monitors the packet buffer and transmits when packets are available.
One key limit on this threading architecture is the ADC sampling rate. The processor runs at 400 MHz and the ADC onboard the Intan runs at 1 MHz, so the code path for calling the DSP service routine, returning to execute an instruction in the radio loop, and coming back cannot exceed 400 MHz / 1 MHz = 400 clock cycles.
Whenever the code length violates this limit, data corruption and glitches can happen.
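In outline, the cooperative structure looks like the following (a C-flavored sketch only; the real code interleaves these in assembly rather than calling functions, and the names are mine):

/* Stub sketch of the firmware's cooperative two-thread structure. */
static void radio_loop_step(void) { /* fill SPI regs, flag TX, apply RX writes */ }
static void dsp_service(void)     { /* block for 4 samples, run signal chain */ }

int main(void)
{
    for (;;) {
        radio_loop_step();   /* a slice of the radio control thread */
        dsp_service();       /* must fit within the 400-cycle budget */
    }
}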
Outside of the values stored in the circular buffers, all persistent variables that do not have their own registers, such as the number of packets enqueued, are stored at fixed offsets from the frame pointer to permit one-cycle access latency.
The memory map spreadsheet is very helpful in debugging and understanding code execution.
The signal chain ends after the IIR stage, at which point spike matching happens.
Spike Templates
Spike templates are generated through spike sorting on gtkclient, then transmitted to and saved by the headstage. Spike sorting through PCA is described very well in Tim's thesis:
Spike sorting on the client is very similar to that used by Plexon Inc's SortClient and OfflineSorter: raw traces are thresholded, waveforms are extracted around the threshold, and a PCA projection is computed of these waveform snippets. Unlike Plexon, threshold crossing can be moved in both time and voltage domains, which permits considerably greater flexibility when centering a waveform within the relatively narrow 16-sample template region. [...] PCA projected waveforms are selected by the user via a polygonal lasso, and the mean and L1 norm of this sample is computed. The former is sent to the transceiver as the template, and the latter sets a guess at the aperture. Aperture can later be updated manually by the user; as the aperture is changed, the GUI updates the color labels of the PCA projected points.
Image below shows spike-sorting in action:
Note that this PCA template process can apply to any signal shape, not just a neural waveform. In the screenshot, channel 31 is just noise, which when projected into the PCA space vaguely forms two clusters. The overlaid waveforms (cyan and red) don't resemble any neural signal, but the fact that they overlay tells us that the noise detected on channel 31 is not purely stochastic, and probably contains interference from deterministic sources.
The 16-point templates in the time domain are limited to the purple strip in the waveform window. For each channel, a maximum of two 16-point templates (16 8-bit numbers each) can be stored on the headstage. For each template, the corresponding aperture is also stored as an 8-bit number.
Blackfin Implementation
The assembly implementation of template matching on the Blackfin (without LMS) is below:
// At end of signal chain for both group of two samples
// Template comparison, plexon style. Sliding window, no threshold.
//r2=samples from amp1 and amp2; r3=samples from amp3 and amp4. Pack them
r2=r2>>>8 (v); // vector shift 16-bit to 8 bits, preserve sign
r3=r3>>>8 (v);
r4= bytepack(r2, r3); // amp4,3,2,1 (hi to low byte)
r0=[FP - FP_8080]; // r0=0x80808080;
r4=r4 ^ r0; // convert to unsigned offset binary. SAA works on unsigned numbers.
r0=r4; // save a copy for later
.align 8; // template A: load order is t, t-15, t-14,...t-1
a0 = a1 =0||r2=[i0++]; //r2=template_A(t), r0 and r4 contains byte-packed samples just collected
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-15), r0=bytepack(t-15)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-14), r0=bytepack(t-14)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-13), r0=bytepack(t-13)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-12), r0=bytepack(t-12)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-11), r0=bytepack(t-11)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-10), r0=bytepack(t-10)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-9), r0=bytepack(t-9)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-8), r0=bytepack(t-8)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-7), r0=bytepack(t-7)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-6), r0=bytepack(t-6)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-5), r0=bytepack(t-5)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-4), r0=bytepack(t-4)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-3), r0=bytepack(t-3)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-2), r0=bytepack(t-2)
saa( r1:0, r3:2 ) ||r0=[i3++]||r2=[i0++]; //r2=template_A(t-1), r0=bytepack(t-1)
saa( r1:0, r3:2 ) ||[i3++]=r4; // write bytepack(t), inc i3
// saa results in a0.l, a0.h, a1.l, a1.h (amp4,3,2,1); compare to aperture
// i0 @ Aperture[amp1A, amp2A]
r0= a0, r1= a1 (fu) ||r2=[i0++]|| i3 -= m3; // r2=aperture[amp1A,amp2A], i3 @ saved bytepack(t-15)
// subtract and saturate - if the answer is negative --> spike!
r0=r0-|-r2 (s) ||r3=[i0++]; // r0=[amp1A match, amp2A match], r3=aperture[amp3A,amp4A]
r1=r1-|-r3 (s); // r1=[amp3A match, amp4A match]
// shift to bit 0, sign preserved
r0=r0>>>15 (v); // r0.h and r0.l will be 0xffff or 0x0000 (match or not)
r1=r1>>>15 (v);
r0=-r0 (v); // now 0x0001 or 0x0000 (match or not)
r1=-r1 (v); // save both for packing later
r1<<=1;
r6=r0+r1; // r6=[14 zeros][amp4A][amp2A][14 zeros][amp3A][amp1A]
In non-LMS versions of the firmware, incrementing by m3 moves the address up by 16 32-bit words. In LMS versions, incrementing by m3 moves the address up by 2 32-bit words.
As mentioned in the Blackfin-Intan post, the Blackfin works most efficiently on two 16-bit samples at once; therefore the signal chain operates twice before reaching the template-matching step, once for each pair of samples acquired in a SPORT cycle.
This efficiency is due to Blackfin's dual-MAC architecture, which can also operate on 4 unsigned 8-bit samples at the same time (treating each byte as a separate number).
Template matching implemented here simply subtracts the 16 template points from a channel's current and 15 past samples, sums the absolute differences, and compares that sum against the aperture value. If the sum is smaller, we declare that a spike has been detected.
Luckily, the Blackfin has the SAA instruction, which accumulates four separate sums of absolute differences between vectors of unsigned bytes. Therefore, at the end of the signal chain, after all four samples acquired have been processed, they are packed into a 32-bit word by first converting each signed 16-bit sample to offset binary, then taking the most significant byte (lines 1-7). These bytepacked words are saved for the past 15 samples of each group of four channels in a circular buffer.
The template vectors sent to the headstage are likewise written into memory in bytepacked form. Template matching thus becomes 16 successive SAA instructions (lines 11-27).
Lines 31-38 compare the SAA results against the stored apertures, yielding either 0 for no match or 1 for a match.
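In C, the detection criterion amounts to the following sketch (one channel shown; the firmware does four channels at once via SAA, and the names here are mine):

#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if the last 16 offset-binary 8-bit samples of a channel match
 * the stored 16-point template within its aperture (L1 distance). */
static int template_match(const uint8_t x[16], const uint8_t tmpl[16],
                          uint8_t aperture)
{
    unsigned sad = 0;                    /* sum of absolute differences */
    for (int k = 0; k < 16; k++)
        sad += (unsigned)abs((int)x[k] - (int)tmpl[k]);
    return sad < aperture;               /* smaller than aperture --> spike */
}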
Consequences of this method
The biggest upside of this spike-matching method is its speed and simplicity of implementation. Despite the simplicity, however, it is probably not the most optimal or correct method for spike detection, because:
In most sorting algorithms (e.g. Plexon's), the raw signal is first thresholded, and waveform snippets, usually 32 samples long, are compared to a template to accept/reject them or sort them into different units. The comparison metric (the aperture in our implementation) is usually the MSE, or L2 norm. Our implementation uses a 16-sample window and the L1 norm, which presumably may perform worse.
Because the spike sorting is based on 16-bit samples truncated to 8 bits, the templates themselves contain truncation noise, and the onboard spike matching truncates the samples again. How this truncation noise affects sorting quality has not been rigorously studied, but I expect it to contribute little (if at all) to any spike-detection performance hit.
Literature suggests that for isolating a fixed-pattern signal embedded in noise, the best solution might be a matched filter, which Tim has done a preliminary study on.
While trying to confirm that the radio transmission of the RHD-headstage firmware was working correctly, I needed a way to decouple the signal chain from the transmission for testing. A nice side effect of the cascaded-biquad IIR filter implementation is that, by writing the coefficients of a biquad in a certain way, sinusoidal oscillations of different frequencies can be generated. If this is done in the last biquad, pure sinusoids are transmitted to gtkclient (when the final output is selected for transmission, of course), and it becomes easy to see whether the data is getting corrupted in the transmission process (if not, any corruption would be due to the signal-chain code).
Below is the DI-biquad block diagram again:
If we assume nonzero initial conditions such that $y[n-1]$ and $y[n-2]$ are nonzero, then pure oscillations can be induced by setting the feedforward coefficients $b_0$, $b_1$, and $b_2$ to 0, in which case only feedback is present in the biquad.

The system function is then:

$$H(z) = \frac{1}{1 - a_1 z^{-1} - a_2 z^{-2}}$$

Let $a_1 = 2 - f_p$ and $a_2 = -1$; then the resulting conjugate poles will be at

$$z = \frac{(2 - f_p) \pm \sqrt{f_p^2 - 4 f_p}}{2}$$

The resulting normalized frequency of oscillation is $\omega = \angle z$, in radians/sample. The oscillation frequency in Hz is $f = \omega F_s / (2\pi)$, where $F_s$ is the sampling frequency (in our case 31.25 kHz for all channels).

So all we need is to find the $f_p$ that gives a desired oscillation frequency. To find the appropriate coefficients for the last biquad of my signal chain to induce oscillations, I used the following Matlab script. Note that the coefficient naming differs from the diagram -- $a_0$ and $a_1$ in the script are the same as $a_1$ and $a_2$ in the diagram.
% IIR oscillator - myopen_multi/gktclient_multi/matlab/IIR_oscillator.m
Fs = 31250;  % sampling frequency

% To get an oscillator of frequency f, set b0=0, b1=0, a1=-1, and a0=2-fp
fp = 0.035;
omega = angle((2 - fp + sqrt(fp^2 - 4*fp))/2);  % in rad/sample
f = omega * Fs/(2*pi);                          % convert to Hz

% But a0 is a fixed-point number, so convert fp:
fp_fixed = round(fp * 2^14);

% coefficients
b0 = 0;
b1 = 0;
b2 = 0;
a0 = (2^15 - fp_fixed)/2^14;
a1 = -1;

y = [0.5, 0.5];        % y[n-1] = 0.5, y[n-2] = 0.5
x = rand(1, 1002)*2;   % 1000 random values, including x[n-1] and x[n-2]
for i = 3:numel(x)
    val = b0*x(i) + b1*x(i-1) + b2*x(i-2) + a0*y(i-1) + a1*y(i-2);
    y = [y, val];
end
plot_fft(y, 31250);
Upon receiving that message, gtkclient assigns that bridge to one of the pre-determined radio channels.
The bridge then starts receiving wireless packets on the assigned radio channel.
gtkclient spawns a sock_thread to take care of each bridge. Previously, the array holding the detected bridge IP addresses was not locked, so it was sometimes possible for two threads to be assigned to the same bridge, which at best leaves us with one operational radio channel, and at worst no transmission at all. This can be solved by mutex-locking the shared array, as in the sketch below. In general, proper threading can be used to solve gtkclient-bridge communication issues.
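A minimal sketch of the fix, assuming pthreads (the names are illustrative, not the actual gtkclient variables):

#include <pthread.h>
#include <string.h>

#define MAX_BRIDGES 4

static char g_bridgeIP[MAX_BRIDGES][16];   /* detected bridge IPs */
static pthread_mutex_t g_bridgeLock = PTHREAD_MUTEX_INITIALIZER;

/* Claim a bridge slot under the lock, so two sock_threads can never be
 * assigned the same bridge. Returns the claimed index, or -1 if the
 * bridge is already handled (or the table is full). */
static int claim_bridge(const char *ip)
{
    int slot = -1;
    pthread_mutex_lock(&g_bridgeLock);
    for (int k = 0; k < MAX_BRIDGES; k++) {
        if (g_bridgeIP[k][0] == '\0') {        /* free slot: claim it */
            strncpy(g_bridgeIP[k], ip, sizeof g_bridgeIP[k] - 1);
            slot = k;
            break;
        }
        if (strcmp(g_bridgeIP[k], ip) == 0)    /* already claimed */
            break;
    }
    pthread_mutex_unlock(&g_bridgeLock);
    return slot;
}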
Dropped packets tracking
In bridge firmware:
When the bridge starts, it keeps a flag g_nextFlag which says what the echo of the next packet should be.
This echo value is extracted by getRadioPacket() in test_spi.c.
If flag > g_nextFlag, we know a packet has been dropped. The total number of dropped packets, g_dropped, is incremented by flag - g_nextFlag and is sent to gtkclient.
gtkclient also keeps track of this g_dropped statistic in g_dropped[tid], where tid corresponds to the headstage we are communicating with. If the count reported by a newly received packet is greater than the value in g_dropped[tid], the value in the array is replaced by the new count, and the difference between the new count and the old value is added to g_totalDropped.
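The bookkeeping on both sides reduces to a few lines (a sketch of the logic as described, not the actual sources; the bridge and gtkclient counters live in different programs):

#include <stdint.h>

/* Bridge side (test_spi.c): g_nextFlag is what the next echo should be. */
static uint32_t g_nextFlag, g_dropped;

static void bridge_track(uint32_t flag)
{
    if (flag > g_nextFlag)                 /* a gap means drops */
        g_dropped += flag - g_nextFlag;    /* running total, sent to gtkclient */
    g_nextFlag = flag + 1;
}

/* gtkclient side: one running count per headstage tid. */
static uint32_t g_droppedTid[4], g_totalDropped;

static void client_track(int tid, uint32_t reported)
{
    if (reported > g_droppedTid[tid]) {    /* bridge reported new drops */
        g_totalDropped += reported - g_droppedTid[tid];
        g_droppedTid[tid] = reported;
    }
}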
Waveform display
Each received wireless packet has 24 bytes of waveform data and 8 bytes of template-matching data. The waveform data is in fact 6 consecutive samples from 4 different channels, each sample one byte in size. However, it is important to know how to interpret this byte value before plotting.
In my earliest iteration of the radio_basic.asm firmware, I simply saved the high byte of the 16-bit sample acquired from the RHD2132. The amplifiers were configured to transmit samples in offset binary, meaning the most negative voltage with respect to the reference electrode is represented by 0x0000, the most positive by 0xffff, and the reference voltage by 0x7fff. Therefore, I want to treat the byte extracted from the packet as an unsigned char. The waveform plot window has a vertical range of [-1,1], so to plot, I mapped the unsigned char range [0,255] to [-1,1].
In later firmware iterations, AGC and IIR filter outputs are transmitted. Those samples have to be interpreted as signed chars, because the samples collected were converted to signed Q1.15 before going through the signal chain, and the outputs therefore end up in signed two's-complement format as well.
The location of the decimal point varies depending on which value is transmitted. For example, the AGC gain is Q7.8, so the decimal point falls at the end of the most significant byte being transmitted; the IIR filter result is Q1.15, so the decimal point falls after the most significant bit of the transmitted byte. But since we never mix waveforms of different formats in one display, this is not too important -- just keep it in mind when interpreting the plotted waveforms. Practically speaking, the shape of the waveform is what matters most for spike sorting, debugging, etc.
In those cases, to plot, I had to map the signed char range [-128,127] to [-1,1], which is a different function from the unsigned char mapping.
gtkclient by default used the signed char mapping. Therefore, when I first transmitted Intan data as unsigned data, the waveforms plotted were simply rail-to-rail oscillations due to value wrapping. There are two ways to display both the Intan data and the signal-chain data appropriately:
Change the mapping based on which part of the signal chain gtkclient has selected for the headstage to transmit.
Configure the RHD2132 to transmit samples in two's-complement format. This actually simplifies the AGC stage: since the signed 16-bit samples can be treated as Q1.15 numbers right away, we skip a conversion step from unsigned to signed fixed point. With this method, gtkclient can use the signed char mapping to display all parts of the signal chain.
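The two mappings differ only in how the byte is reinterpreted before scaling into the plot range; as helper functions (my own sketch, not gtkclient code):

#include <stdint.h>

/* Map a raw packet byte into the plot range [-1, 1].
 * Offset-binary bytes (raw Intan config) are unsigned: 0x00 -> -1, 0x80 -> 0.
 * Signal-chain bytes (AGC/IIR) are signed two's complement: -128 -> -1. */
static float plot_unsigned(uint8_t b) { return ((float)b - 128.0f) / 128.0f; }
static float plot_signed(int8_t b)    { return (float)b / 128.0f; }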
Multicast Problems
This problem happened during a demo of the wheelchair experiment using RHA-headstages, with the RHA-gtkclient running on Daedalus. A router was being used rather than a switch at the time. Despite configuring the router not to filter multicast traffic, the bridges could not be detected by gtkclient.
UDP sockets opened by gtkclient:
4340 and 4340+radio_channel are the bcastsock ports used to detect bridges. This is what we are interested in for the bridge problem.
4343: TCP server socket to receive BMI3 connections
8845: Strobe socket
Steps:
Used ss -u -a to check which UDP (multicast) sockets are open.
Monitor multicast traffic via wireshark:
sudo tshark -n -V -i eth0 igmp
Surprisingly, after executing this command, we started to receive bridge packets and gtkclient started working again.
I wasn't able to duplicate this scenario on other systems...