Jan 27, 2016 - Golub et al 2016: Brain-computer interfaces for dissecting cognitive processes underlying sensorimotor control

Brain-computer interfaces for dissecting cognitive processes underlying sensorimotor control

Short review on how BCIs can serve as a good platform for studying the cognitive processes underlying sensorimotor control. Not much new information, but it spells out the advantages of BMI as a neuroscience platform in a lucid way. Includes some good references.

Jan 26, 2016 - Wireless Headstage: Least-Mean Squared (LMS) adaptive noise cancellation.

Wireless Project github

LMS is the second stage of processing in RHA-headstage's signal chain, after AGC. It aims to estimate the noise common to all electrodes and subtract that noise from each channel.

Adaptive Filtering Theory

Chapter 12 on Adaptive Interference Cancelling in Adaptive Signal Processing by Widrow and Stearns is an excellent reference on adaptive interference cancelling. Chapter 6 of the same book is a great reference on the LMS algorithm.

In adaptive noise cancelling, we assume the noise source injects into both the primary input and the reference input, while the signal of interest is present only in the primary input. An adaptive filter then tries to reconstruct the injected noise by combining the reference inputs. This approximation ($y$ in the figure below) is subtracted from the primary input to give the filtered signal $\epsilon$. This filtered output $\epsilon$ is then fed back into the adaptive filter to tune the filter weights. The rationale is that a perfect adaptive filter will perfectly approximate the noise component $n_0$, $\epsilon$ will then become $s$, and the error signal is minimized.

An interesting question is: why would $\epsilon$ not be minimized toward 0? If we assume the signal $s$ is uncorrelated with the noises $n_0$ and $n_1$, but $n_0$ and $n_1$ are correlated with each other, then the minimum $\epsilon$ can achieve, in the least-mean-squared sense, is equal to $s$. Unfortunately, this result is less intuitive and more easily seen through math (Widrow and Stearns, Ch. 12.1).

image1

Another way of seeing LMS filtering is that, given an input vector $x$, we multiply it by a weight vector $w$ chosen to reduce the least-mean-squared error, which is the difference between $w^T x$ and the desired response $d$. The error term is then fed back to adjust $w$. This is illustrated below:

image2

In the case of adaptive noise cancelling, the desired response $d_k$ is the noise component present in the primary input ($n_0$). In fact, for stationary stochastic inputs, the steady-state performance of slowly adapting adaptive filters closely approximates that of fixed Wiener filters.

A side tangent: why don't we use LMS with a Wiener filter in BMI decoding? In that case, the adaptive filter output $y$ and the error $\epsilon$ would have to be the decoded covariates, and it is difficult to see how that would increase decoding performance.

In contrast with the Wiener filter's derivation, where we estimate the gradient of the error in order to find the $w$ values that minimize it, in LMS we simply approximate the gradient from the instantaneous error $\epsilon_k$. This turns out to be an unbiased estimate, and the weights can be updated as:

$$W_{k+1} = W_k + 2\mu\epsilon_k X_k$$

where μ is the update rate, similar to that in other gradient descent algorithms.

Therefore, the LMS algorithm is a rather simple algorithm to serve as the adaptive filter block in the adaptive noise cancellation framework.
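As a concrete reference, here is a minimal sketch of this update in C (plain floating point and hypothetical names, rather than the fixed-point arithmetic used on the headstage):

#include <stddef.h>

/* One LMS step: given reference inputs x[0..n-1] (noise correlated with
   the primary input) and the primary sample d, form the noise estimate
   y = w'x, output the error eps = d - y, and nudge the weights. */
float lms_step(float *w, const float *x, size_t n, float d, float mu)
{
    float y = 0.0f;                      /* adaptive filter output */
    for (size_t i = 0; i < n; i++)
        y += w[i] * x[i];
    float eps = d - y;                   /* filtered signal = error term */
    for (size_t i = 0; i < n; i++)
        w[i] += 2.0f * mu * eps * x[i];  /* W_{k+1} = W_k + 2*mu*eps_k*X_k */
    return eps;
}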

Application in firmware

To apply these ideas to our headstage, we treat the current channel's measurement as $s_n + e_n$, where $s_n$ is the signal of channel $n$ and $e_n$ is the noise added to channel $n$ at the current time. $e_n$ can be assumed to be Gaussian (even though in reality it probably isn't, e.g. chewing artifacts). It's reasonable to assume that $e_n$ is correlated with the noise observed on the other channels (since they share the same chip, board, etc.): $e_{n-1}$, $e_{n-2}$, etc. The signals measured on the previous channels can then be used as the reference inputs to predict and model the current channel's noise.

This is implemented in the firmware by Tim Hanson as below:


/* LMS adaptive noise remover
	want to predict the current channel based on samples from the previous 8.
	at this point i1 will point to x(n-1) of the present channel.
		(i2 is not used, we do not need to write taps -- IIR will take care of this.)
	24 32-bit samples are written each time through this loop,
	so to go back 8 channels add m1 = -784
	to move forward one channel add m2 = 112
	for modifying i0, m3 = 8 (move +2 weights)
		and m0 = -28 (move -7 weights)
	i0, as usual, points to the filter taps!
*/
	r3 = abs r3 (v) || r7 = [FP - FP_WEIGHTDECAY]; //set weightdecay to zero to disable LMS.
	r4.l = (a0 = r0.l * r7.l), r4.h = (a1 = r0.h * r7.l) (iss2) || i1 += m1 || [i2++] = r3;
				//saturate the sample, save the gain.

    // apply LMS weights for noise cancelling
	mnop || r1 = [i1++] || [i2++] = r4; //move to x1(n-1), save the saturated sample.
	mnop || r1 = [i1++m2] || r2 = [i0++]; //r1 = sample, r2 = weight.
	a0 = r1.l * r2.l, a1 = r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	a0+= r1.l * r2.l, a1+= r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	a0+= r1.l * r2.l, a1+= r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	a0+= r1.l * r2.l, a1+= r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	a0+= r1.l * r2.l, a1+= r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	a0+= r1.l * r2.l, a1+= r1.h * r2.h || r1 = [i1++m2] || r2 = [i0++];
	r6.l = (a0+= r1.l * r2.l), r6.h = (a1+= r1.h * r2.h) || r1 = [i1++m1] || r2 = [i0++m0];
    // move i1 back to the integrator output 7 chans ago; move i0 back to LMS weight for that chan

	r0 = r0 -|- r6 (s) || [i2++] = r0 || r1 = [i1--] ; //compute error, save original, move i1 back to saturated samples.
	r6 = r0 >>> 15 (v,s) || r1 = [i1++m2] || r2 = [i0++];//r1 = saturated sample, r2 = w0, i0 @ 1
.align 8

    // UPDATE LMS weights
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || nop || nop; //load / decay weight.
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w1, i0 @ 0
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 0, i0 @ 2
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w2, i0 @ 1
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 1, i0 @ 3
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w3, i0 @ 2
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 2, i0 @ 4
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w4, i0 @ 3
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 3, i0 @ 5
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w5, i0 @ 4
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 4, i0 @ 6
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m2] || r2 = [i0--];//r2 = w6, i0 @ 5
	a0 = r2.l * r7.h, a1 = r2.h * r7.h || [i0++m3] = r5; //write 5, i0 @ 7
r5.l = (a0 += r1.l * r6.l), r5.h = (a1 += r1.h * r6.h) || r1 = [i1++m3] || r2 = [i0--];// inc to x1(n-1) r2 = w1, i0 @ 6
	mnop || [i0++] = r5; //write 6.

This is described by Tim in the following diagram, where c=current channel number, n=current sample number.

image3

Prior to the start of line 12, i1 points to the AGC gain of the current channel c, i2 points to the integrator mean of the current channel, and i0 points to the AGC transfer function coefficients, just before the LMS coefficients.

Line 12: r3 will contain the updated AGC gain, and r7 will contain the weight decay for LMS. The weight decay will be either 0 or 1 in Q15; 0 effectively disables LMS.

Line 13: r0 currently contains the AGC output. The multiplication multiplies the AGC output by the weight decay, saturates the sample, and moves it to r4. The updated AGC gain from the previous step is saved by writing through i2, which is then incremented to point at the saturated AGC output. i1 is moved to point to the AGC gain value corresponding to channel (c-7)%32.

Line 17: The saturated sample of channel (c-7)%32 is loaded into r1; i1 now points to the AGC output for channel (c-7)%32. The current channel's saturated AGC sample in r4 is saved, and i2 now points to the current channel's AGC output (from sample n-1).

Line 18: r1 is loaded with the AGC output of channel (c-7)%32; i1 then points to the AGC output of channel (c-6)%32. r2 is loaded with the LMS weight corresponding to channel (c-7)%32, or LMS(n,6) in the diagram.

Line 19: a0 contains the product LMS(n,6)·x(n,(c-7)%32) in the figure above. r1 is loaded with the AGC output of channel (c-6)%32 while i1 ends up pointing to the AGC output of (c-5)%32. r2 is loaded with LMS(n,5) while i0 ends up pointing to LMS(n,4).

Lines 20-24: Continue the adaptive filter operation (the multiply/summation block in the diagram). At the end of line 24, r1 contains the AGC output of (c-1)%32, and i1 is moved to point to the AGC output of c. r2 is loaded with LMS(n,0), and i0 is moved to point to the low-pass filter coefficients.

Line 25: r6 now contains the result of the entire multiply-accumulate operation of the linear combiner ($y_k$ in figure 2). r1 is loaded with the AGC output of the current channel, and i1 is then moved to point at the AGC output of (c-7)%32. r2 is loaded with low-pass filter coefficients and i0 is moved back to point to LMS(n,6). These r1 and r2 values will not be used and will be overwritten in the next instruction.

Line 28: r0 originally contains the current channel's AGC output. Subtracting r6 from it, r0 will contain the LMS-filtered output -- $\epsilon = s + n_0 - y$ in figure 1. Meanwhile, the current AGC output is written out, after which i2 will point to the delayed sample used for the IIR filters. r1 is loaded with the AGC output of (c-7)%32, after which i1 points to the saturated AGC output of (c-7)%32.

Line 29: After shifting r0 vectorially, r6 is loaded with the sign of the filtered LMS output, or sign($\epsilon$). r1 is loaded with the saturated AGC output of (c-7)%32; i1 then points to the saturated AGC output of (c-6)%32. r2 is loaded with the corresponding LMS weight LMS(n,6); i0 then points to LMS(n,5).

Lines 33-34: Update the weight LMS(n,6). r5 will contain the updated weight, equal to $w\alpha + x\,\mathrm{sign}(\epsilon)$, where $w$ is the LMS weight, $\alpha$ is either 0 or 1 indicating whether LMS is on, $x$ is the saturated AGC output corresponding to the weight, and $\epsilon$ is the current channel's LMS-filtered output.

At the end of line 34, r1 is loaded with the saturated AGC output of (c-6)%32, then i1 points to the saturated AGC output of (c-5)%32. r2 is loaded with LMS(n,5), but i0 points to LMS(n,6).

Line 35: Here the updated LMS(n,6) weight is written (hence i0 pointing to that location after line 34), after which i0 skips to point at LMS(n,4).

Line 36: r5 contains the updated weight LMS(n,5). r1 is loaded with the saturated AGC output of (c-5)%32; i1 points to the saturated AGC output of (c-4)%32. r2 is loaded with LMS(n,4); i0 then points to LMS(n,5) in anticipation of its update.

Line 46: r5 now contains the last updated weight, LMS(n,0). r1 is loaded with the current channel's saturated AGC output, then i1 is moved to point at the current channel's delayed sample (in anticipation of the IIR filter). i0 points to LMS(n,0).

Line 47: The updated LMS(n,0) is written back to memory.

The implemented algorithm is a modified LMS approach, where the filter weights are updated by

$$W_{k+1} = W_k + X_k\,\mathrm{sign}(\epsilon_k) = \mathtt{r2} \cdot \mathtt{r7} + \mathtt{r1} \cdot \mathtt{r6}$$

where the registers are those used in lines 33-46.

Here, the old $w_k$ is loaded into r2. r7 contains a weight-decay factor. If it's set to 0, LMS is effectively disabled: no weight update happens and the stage becomes a fixed gain.

r1 contains the saturated version of each reference channel's input. r6 is the sign of the error term.

In this modified LMS approach, if the error and the sample have the same sign, the weight increases (Hebb's rule). This implementation involves saturation and sign-based (1-bit) weight updates, so it is not an exact LMS. However, it is fast, and according to Tim,

in practice the filter settles to final weight values within 4s. The filter offers 40dB of noise rejection.
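A scalar model of this modified update in C (my paraphrase of the assembly, not Tim's code; the step-size shift is illustrative):

#include <stdint.h>

/* One weight update of the modified (sign-error) LMS in Q15.
   w     : current weight (Q15)
   x_sat : saturated reference sample (Q15)
   eps   : LMS-filtered output of the current channel (Q15)
   alpha : weight decay in Q15 -- 0 disables LMS, 32767 (~1.0) enables it */
static int16_t lms_weight_update(int16_t w, int16_t x_sat, int16_t eps,
                                 int16_t alpha)
{
    int sign = (eps > 0) - (eps < 0);           /* sign(eps) in {-1, 0, 1} */
    int32_t wnew = ((int32_t)w * alpha) >> 15;  /* w * alpha, Q15 multiply */
    wnew += ((int32_t)sign * x_sat) >> 7;       /* + x * sign(eps); shift sets the step size */
    if (wnew > 32767)  wnew = 32767;            /* saturate to 16 bits */
    if (wnew < -32768) wnew = -32768;
    return (int16_t)wnew;
}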

If at any time the error term $\epsilon_k = 0$, the current channel can be completely predicted by the previous 7 channels, which means the current channel is measuring only noise. This could serve as a good debugging tool.

Finally, at the end of this code block, r0 contains the LMS-filtered sample, i.e. $\epsilon$. The IIR stage following LMS then operates directly on this filtered sample. Therefore only the regular and saturated AGC outputs are saved; the LMS filter output is not saved until the IIR stage stores it as a delayed sample (after the AGC-out slot).

Effects of removing LMS

LMS is now implemented with RHD headstages in radio_AGC_LMS_IIR_SAA.asm.

The effect of LMS is especially clear when the same signal is applied to all channels of an amplifier:

image4

Simulated spike signals are applied through the Plexon headstage tester to channels 0-31. In the image shown, the first two channels have fairly low signal. This is expected: since all channels carry the same signal, the previous 7 channels can exactly reconstruct the signal in the current channel, and the subtraction leaves little behind.

Turning off LMS, we have the following:

image5

The identical spike signals applied now show up loud and clear in the first two channels shown.

Jan 26, 2016 - Wireless Headstage: Radio transmission protocol between headstage and gtkclient.

Wireless Project github

The radio-transmission protocol is described fairly well in Tim's thesis pg. 166-168.

Some important points to remember, and references:

Each 32-byte radio packet contains, in order of low to high byte:

  • 6 samples from 4 channels of raw data (from any point of the signal chain, selected by gtkclient).

  • 8 bytes of template matching.

    Each byte indicates whether a spike was detected for both templates of a group of 4 channels - if each bit represented one template on one channel, all 8 bits of the byte would be taken. However, we make the assumption that the two neurons represented by the two templates on the same channel cannot both fire at the same time, which allows an encoding scheme that compresses the matching byte into only 7 bits. The most significant bits of the first four template-matching bytes are then replaced by the bits of a packet-#-in-frame nibble. This nibble ranges from 0-15, representing the position of the current packet within the 16-packet radio frame.

    The next four template-matching bytes have their most significant bits replaced by the bits of an echo nibble. This echo nibble is extracted from the last command packet gtkclient sent to the headstage, and allows a sort of SEQ-ACK exchange to happen between the headstage and gtkclient, although it only tells us when a packet might've been lost -- there is no bandwidth or CPU time to resend a lost packet. However, control packets can be resent easily. The Sync headstage option in gtkclient is a good way of sending control packets to sync the headstage configuration to that in gtkclient. A sketch of this packing follows below.

There are a total of $32 \frac{\text{channels}}{\text{amplifier}} \times 4 \frac{\text{amps}}{\text{headstage}} \times 2 \frac{\text{matches}}{\text{channel}} = 256 \frac{\text{matches}}{\text{headstage}}$, meaning it takes 4 packets to transmit all template matches, accompanied by 24 samples of 4 channels.
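To make the byte layout concrete, here is a sketch in C of assembling one such packet (the field names and the bit ordering of the two nibbles are my assumptions; the firmware does this in assembly):

#include <stdint.h>
#include <string.h>

/* 32-byte radio packet: 24 bytes of samples + 8 template-match bytes.
   Each match byte carries 7 bits of match flags; the MSBs of match
   bytes 0-3 hold the packet-#-in-frame nibble, and the MSBs of match
   bytes 4-7 hold the echo nibble. */
void pack_packet(uint8_t pkt[32], const uint8_t samples[24],
                 const uint8_t match7[8], /* 7 valid bits each */
                 uint8_t frame_pos,       /* 0..15 */
                 uint8_t echo)            /* 0..15 */
{
    memcpy(pkt, samples, 24);
    for (int i = 0; i < 4; i++) {
        /* one bit of frame_pos into the MSB of match bytes 0-3 */
        pkt[24 + i] = (match7[i] & 0x7F) | (((frame_pos >> (3 - i)) & 1) << 7);
        /* one bit of echo into the MSB of match bytes 4-7 */
        pkt[28 + i] = (match7[4 + i] & 0x7F) | (((echo >> (3 - i)) & 1) << 7);
    }
}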

The ADC runs at 1 Msps and each packet carries 6 samples per channel, so each packet spans $\frac{6 \text{ samples} \times 32 \text{ channels/sample}}{1 \text{ Msps}} = 192\,\mu s$, and 4 packets require $768\,\mu s$ to transmit.

The firmware sets aside a buffer big enough for two radio frames; each radio frame contains 16 packets. The radio starts sending when one frame is full, while the signal chain collates and saves new packets into the other frame. Each frame therefore takes about 3.072 ms to send, which is less than the 4 ms free-run PLL time. Time to transmit, change to receive, then back is 3.60 ms. So,

$$\text{outgoing BW} = 32 \tfrac{\text{bytes}}{\text{packet}} \times 8 \tfrac{\text{bits}}{\text{byte}} \times 16 \tfrac{\text{packets}}{\text{transmission}} \times \tfrac{1 \text{ transmission}}{3.072 \text{ ms}} = 1.333 \text{ Mbps}$$

$$\text{incoming BW} = 32 \tfrac{\text{bytes}}{\text{packet}} \times 8 \tfrac{\text{bits}}{\text{byte}} \times 1 \tfrac{\text{packet}}{\text{reception}} \times \tfrac{1 \text{ reception}}{3.072 \text{ ms}} = 83.33 \text{ kbps}$$

Received packets contain 4 (memory address, value) pairs. Each value is 32 bits. Each memory address should also be 32 bits, but since the upper nibble is always 0xf, it is replaced with the 4-bit echo value. In the radio loop handling received packets, the echo value is extracted and saved, the memory address is restored, and the values from the packet are then written to the corresponding memory addresses.
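A hypothetical C rendering of that handling (the pair layout and names are my assumptions; the firmware does this in the radio loop in assembly):

#include <stdint.h>

/* A received packet holds 4 (address, value) pairs. The address's upper
   nibble (always 0xf) was replaced by the echo nibble; restore it before
   writing the value to memory. */
void handle_command(const uint32_t pairs[8], uint8_t *echo_out)
{
    for (int i = 0; i < 4; i++) {
        uint32_t addr = pairs[2 * i];
        uint32_t val  = pairs[2 * i + 1];
        *echo_out = (uint8_t)(addr >> 28);          /* save the 4-bit echo */
        addr = (addr & 0x0FFFFFFFu) | 0xF0000000u;  /* restore upper nibble */
        *(volatile uint32_t *)addr = val;           /* apply the write */
    }
}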

Debugging Notes

If the firmware used on headstage allows for radio transmission (i.e. radio_loop implements the described protocol), there are several ways to debug and check the quality of transmission.

  1. If the headstage's firmware is compiled correctly, the bridge's firmware is compiled correctly, gtkclient is compiled correctly, and multicast is allowed, yet no signals show up in gtkclient, then there is something wrong with the radio code within the headstage's firmware.

  2. If gtkclient is displaying raw waveforms, but the waveforms look suspicious - for example, we apply sinusoidal signals but the displayed waveforms aren't very smooth - that may be caused by errors in either the DSP or the radio transmission code. We can eliminate the former possibility if, after gtkclient instructs the headstage to change coefficients to induce oscillations (see IIR oscillator post), we do indeed see sinusoids of the same frequency on all transmitted channels.

  3. In gtkclient, if the pkts/sec field is changing, but the dropped and BER fields stay constantly at 0, something is wrong with the radio packaging process. As great as 0 dropped packets sounds, it is very unlikely.

I encountered these last two problems while finalizing the RHD-headstage firmware.

In the screenshots below, I applied square waves of known frequency to one of the amplifiers. The top three channels are floating; the fourth one is supposed to show the applied signal. The firmware I used does not have any signal-chain steps, so I was simply streaming the acquired samples from the Intan.

800Hz signal applied

image1

1kHz signal applied

image2

1.2kHz signal applied

image3

1.5kHz signal applied

image4

2kHz signal applied

image5

2.8kHz signal applied

image6

It is apparent in most of these screenshots that I'm not getting a square wave back. In others, I get a resemblance of a square wave, but the frequency is never right, and the waveform seems to contain multiple frequencies, both higher and lower than what I expect.

Transmitting the IIR-generated oscillations also showed distortions. The generated waveform is supposed to be 919 Hz. The recorded signals, dumped and plotted in Matlab, are shown below:

image7

The first idea that came to mind was to perform an FFT. Below are FFTs of gtkclient recordings when sinusoidal signals of different frequencies were applied, as well as of the IIR-generated oscillations.

800Hz signal FFT

image8

1kHz signal FFT

image9

3.2kHz signal FFT

image10

5kHz signal FFT

image11

None of the spectra look very clean, and the signal's actual frequency never has the most energy. However, the highest peaks are consistently 1300 Hz apart. Since the sampling frequency of the amplifiers is 31.25 kHz, 1300 Hz corresponds to every 24 samples, or every 4 packets.

In the radio-transmission protocol, the spike-match bits are looped through every 4 packets -- the spike-match bits are saved in a buffer, and for every packet a pointer reads through this buffer. The pointer should therefore loop around the buffer every 4 packets, since all spike-match bits are transferred in that time.

Very careful line-by-line inspection of the radio packet assembly/transmission code revealed that I had made a mistake when modifying the original transmission code to work with firmware versions that did not include SAA template matching. Fixing those errors solved the problem in all non-final RHD firmware versions.

That would mean after adding SAA, I only needed to copy over the original radio packet assembly/transmission code from RHA-headstage's radio5.asm to get everything working. But it didn't work, and the same 1300 Hz problem appeared. This eventually led me to a typo in my initial memory-allocation code in radio_bidi_asm, which executes even before any of the signal-chain or radio-loop threads.

More Resources

Tim has notes on his radio protocol hardware/software development on his website -- a good read in case of any future Nordic radio work.

Jan 26, 2016 - Wireless Headstage: Blackfin BF532 headstage firmware setup

Wireless Project github

The Blackfin BF532 has the following internal memory architecture (see Blackfin Manual):

  • L1 instruction memory - SRAM (4-way set-associative cache)
    • 16kB instruction SRAM/cache: 0xFFA10000 - 0xFFA14000
    • 32kB SRAM instruction memory: 0xFFA08000 - 0xFFA10000
  • L1 data memory - consists of 2 banks of 2-way associative cache SRAM, each one is 16kB.
    • databank A: 0xFF804000 to 0xFF808000
    • databank B: 0xFF904000 to 0xFF908000
  • L1 scratchpad RAM - accessible as data SRAM, cannot be configured as cache memory, 4kB: 0xFFB00000 - 0xFFB01000

For our firmware, we want to load:

  • The instructions into the 32kB SRAM instruction memory.
  • Signal chain calculations and intermediate values (W1), past samples for template comparison (T1), template match bits (MATCH), look-up tables (ENC_LUT and STATE_LUT), and radio packet frames to L1 databank A.
  • Signal chain coefficients/parameters (A1), template and aperture coefficients (A1), and pointer values (FP_) to L1 databank B.

This memory layout and the addresses are visualized in each firmware directory's memory_firmwareVersion.ods file, and in the memory_firmwareVersion.h header files.

To compile our code to fit this memory structure, we use the blackfin linux bare-metal toolchain, a gnu gcc-based toolchain that allows us to compile assembly and C/C++ code for different blackfin architectures. The installation comes with header files such as defBF532.h that list the memory addresses of the various MMRs and system registers.

After running make, the file Linker.map shows the order of files linked in the final stage.dxe file, which includes:

  1. crt0.o - from crt0.asm. Includes the global start() routine from where BF532 starts execution. Also includes most of the processor peripheral setup, such as PLL and clock frequency, and interrupt handler setup.

    See VisualDSP++ 5.0 C/C++ Compiler and Library Manual for Blackfin for details on blackfin execution process.

  2. radio_AGC_IIR_SAA.o - from radio_AGC_IIR_SAA.asm; this is where all the DSP and interesting code go. The different firmware versions only really differ in this file.

  3. enc.o - from enc.asm, which is generated by enc_create.cpp. It's a subroutine called by the data-memory setup code in radio_AGC_IIR_SAA.asm to install the encoding look-up tables.

  4. divsi3.o - from common_bfin/divsi3.asm, a subroutine for signed division. Not actually used, but could be useful.

  5. main.o - from main.c. The blackfin starts execution from crt0.asm, which eventually jumps to the main routine within main.c. Within the main routine, more blackfin and Nordic radio chip configuration is done. The radio setup (such as the radio channel) is done through the SPI interface. At the end of main(), it jumps to the setup code radio_bidi_asm within radio_AGC_IIR_SAA.asm.

    See the section on Mixing C/C++ and Assembly Usage in VisualDSP++ 5.0 C/C++ Compiler and Library Manual for Blackfin

  6. print.o - from common_bfin/print.h and common_bfin/print.c. Not actually needed for the headstage; a leftover from the compilation setup for BF537.

  7. util.o - from common_bfin/util.h and common_bfin/util.c. The functions defined here are not actually needed for the headstage.

  8. spi.o - from common_bfin/spi.h and common_bfin/spi.c. Includes functions in C to talk to the SPI port, which is needed for reading the flash memory and talking to the radio - these functions are called from the main() routine.

Finally, the file bftiny.x is the linker script to produce the final binary .dxe file. Written by unknown author, but it works!

The assembly version of the final compiled code and their memory addresses can be found in decompile.asm. The code order follows the link order. See this document for more details on the compile process.

The flash process uploads the final stage.dxe file into the onboard flash memory. The blackfin is wired to boot from flash - upon power-up, the blackfin reads the flash memory through SPI and loads the data and instructions into the corresponding memory addresses.


The architecture of the firmware (in radio_AGC_IIR_SAA.asm for the RHD-headstage or radio5.asm for the RHA-headstage) is essentially two threads: the DSP service routine reads in samples from the amplifiers and does the DSP, while the radio control thread handles radio transmission and reception.

The radio control thread (radio_loop) fills the SPI read/write registers and changes flags to the radio to transmit packets. It also reads the incoming packets and writes the desired changes to the requested memory locations. Interleaved between these instructions, the DSP service routine is called.

The DSP service routine blocks until a new set of samples (4 samples, one from each amplifier) is read and fully processed, then returns control back to the radio control thread. To preserve state for each routine, circular buffers are used to store:

  1. DSP signal-chain calculated intermediate values (W1).
  2. DSP signal-chain constants and coefficient values (A1).
  3. Past stored samples for spike detection (T1).
  4. Data to be transmitted by the radio (packet buffer).

After all amplifier channels have been processed once, template matches and raw samples are written to the packet buffers. The radio control thread monitors the packet buffer and transmits when packets are available.

One key limit on this threading architecture is the ADC sampling rate. The processor operates at 400 MHz; the ADC onboard the Intan operates at 1 MHz. That means the code path of calling the DSP service routine, returning to execute an instruction in the radio loop, then going back, cannot exceed 400 clock cycles.

Whenever the code-length violates this limit, data corruption and glitches can happen.

Outside of the values stored in the circular buffers, all persistent variables that do not have their own registers, such as the number of packets enqueued, are stored at fixed offsets from the frame pointer to permit one-cycle access latency.

The memory map spreadsheet is very helpful in debugging and understanding code execution.

Jan 26, 2016 - Wireless Headstage: Spike sorting via PCA and spike detection with SAA

Wireless Project github

The signal chain ends after IIR, at which point spike matching happens.

Spike Templates

Spike templates are generated through spike sorting on gtkclient, then transmitted to and saved by the headstage. Spike sorting through PCA is described very well in Tim's thesis:

Spike sorting on the client is very similar to that used by Plexon Inc's SortClient and OfflineSorter: raw traces are thresholded, waveforms are extracted around the threshold, and a PCA projection is computed of these waveform snippets. Unlike Plexon, threshold crossing can be moved in both time and voltage domains, which permits considerably greater flexibility when centering a waveform within the relatively narrow 16-sample template region. [...] PCA projected waveforms are selected by the user via a polygonal lasso, and the mean and L1 norm of this sample is computed. The former is sent to the transceiver as the template, and the latter sets a guess at the aperture. Aperture can later be updated manually by the user; as the aperture is changed, the GUI updates the color labels of the PCA projected points.

Image below shows spike-sorting in action:

image1

Note that this PCA template process can apply to any signal shape, not just a neural signal waveform. In the screenshot, channel 31 is just noise, which when projected into the PCA space vaguely forms two clusters. The overlaid waveforms (cyan and red) don't resemble any neural signal. But the fact that they overlay tells us that the noise detected on channel 31 is not purely stochastic and probably contains interference from deterministic sources.

The 16-point templates in the time domain are limited to the purple strip in the waveform window. For each channel, a maximum of two sets of 16-point templates (16 8-bit numbers) can be stored on the headstage. For each template, the corresponding aperture is also stored as an 8-bit number. A hypothetical layout is sketched below.
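Conceptually, the per-channel template storage amounts to something like the following C layout (hypothetical; on the headstage the template points are actually stored bytepacked across groups of four channels):

#include <stdint.h>

/* Per-channel spike templates: two templates of 16 signed 8-bit points,
   each with an 8-bit aperture (the L1-norm acceptance radius). */
struct spike_template {
    int8_t  points[16];  /* 16-sample template, truncated to 8 bits */
    uint8_t aperture;    /* match if sum(|sample - point|) < aperture */
};

struct channel_templates {
    struct spike_template t[2];  /* at most two templates per channel */
};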

Blackfin Implementation

The assembly implementation of template matching on blackfin (without LMS) is as below:

// At end of signal chain for both group of two samples
// Template comparison, plexon style. Sliding window, no threshold.
   // r2=samples from amp1 and amp2; r3=samples from amp3 and amp4. Pack them
   r2 = r2 >>> 8 (v);   // vector shift 16-bit to 8 bits, preserve sign
   r3 = r3 >>> 8 (v);
   r4 = bytepack(r2, r3); // amp4,3,2,1 (hi to low byte)
   r0 = [FP - FP_8080];   // r0=0x80808080;
   r4 = r4 ^ r0;          // convert to unsigned offset binary. SAA works on unsigned numbers.
   r0 = r4;               // save a copy for later

.align 8;   // template A: load order is t, t-15, t-14,...t-1
    a0 = a1 = 0 || r2 = [i0++]; //r2=template_A(t), r0 and r4 contains byte-packed samples just collected
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-15), r0=bytepack(t-15)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-14), r0=bytepack(t-14)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-13), r0=bytepack(t-13)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-12), r0=bytepack(t-12)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-11), r0=bytepack(t-11)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-10), r0=bytepack(t-10)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-9),  r0=bytepack(t-9)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-8),  r0=bytepack(t-8)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-7),  r0=bytepack(t-7)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-6),  r0=bytepack(t-6)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-5),  r0=bytepack(t-5)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-4),  r0=bytepack(t-4)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-3),  r0=bytepack(t-3)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-2),  r0=bytepack(t-2)
    saa( r1:0, r3:2 ) || r0 = [i3++] || r2 = [i0++]; //r2=template_A(t-1),  r0=bytepack(t-1)
    saa( r1:0, r3:2 ) || [i3++] = r4; // write bytepack(t), inc i3

    // saa results in a0.l, a0.h, a1.l, a1.h (amp4,3,2,1); compare to aperture
    // i0 @ Aperture[amp1A, amp2A]
    r0 = a0, r1 = a1 (fu) || r2 = [i0++] || i3 -= m3; // r2=aperture[amp1A,amp2A], i3@saved bytepack(t-15)
    // subtract and saturate - if the answer is negative-->spike!
    r0 = r0 -|- r2 (s) || r3 = [i0++]; // r0=[amp1A match, amp2A match], r3=aperture[amp3A,amp4A]
    r1 = r1 -|- r3 (s);                // r1=[amp3A match, amp4A match]
    // shift to bit 0, sign preserved
    r0 = r0 >>> 15 (v); // r0.h and r0.l will be 0xffff or 0x0000 (match or not)
    r1 = r1 >>> 15 (v);
    r0 = -r0 (v);       // now 0x0001 or 0x0000 (match or not)
    r1 = -r1 (v);       // save both for packing later
    r1 << = 1;
    r6 = r0 + r1;       // r6=[14 zeros][amp4A][amp2A][14 zeros][amp3A][amp1A]

In non-LMS versions of the firmware, incrementing by m3 moves the address up by 16 32-bit words. In LMS versions, incrementing by m3 moves the address up by 2 32-bit words.

As mentioned in the Blackfins-Intan post, the Blackfin works efficiently with two 16-bit samples at once; therefore the signal chain operates twice before reaching the template matching step, once for each pair of samples acquired in a SPORT cycle.

This efficiency is due to Blackfin's dual-MAC architecture, which can also operate on 4 unsigned 8-bit samples at the same time (treating each byte as a separate number).

Template matching implemented here simply subtracts the 16 template points from a channel's current and 15 past samples, then compares the sum of their absolute differences to the aperture value. If the sum is smaller, we declare that a spike has been detected.

Luckily, the Blackfin has the SAA instruction, which accumulates four separate sums of absolute differences between vectors of unsigned bytes. Therefore, at the end of the signal chain, after all four acquired samples have been processed, they are packed into a 32-bit word by first converting each signed 16-bit sample to offset binary, then taking the most significant byte (lines 1-7). These bytepacked words are saved for the past 15 samples of each group of four channels in a circular buffer.

The template vectors stored on the headstage are also written into memory in the same bytepacked form. Thus template matching becomes 16 successive SAA instructions (lines 11-27).

Lines 31-38 compare the SAA results against the saved apertures, resulting in either 0 for no match, or 1 for a match.
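A reference model of this match in C (a sketch of the logic the SAA instructions implement, not the firmware code itself):

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between the 16 most recent samples
   (unsigned, offset binary) and a stored template, compared against
   the aperture. Returns 1 if a spike is declared. */
int template_match(const uint8_t recent[16],    /* x[t-15] .. x[t] */
                   const uint8_t template16[16],
                   uint8_t aperture)
{
    int sad = 0;
    for (int i = 0; i < 16; i++)
        sad += abs((int)recent[i] - (int)template16[i]);
    return sad < aperture;
}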

Consequences of this method

The biggest upside of this spike matching method is its speed and simplicity of implementation. However, despite the simplicity, this is probably not the most optimal or correct method for spike detection because:

  • In most sorting algorithms (e.g. Plexon's), the raw signal is first thresholded, then waveform snippets, usually 32 samples long, are compared to a template to accept/reject, or to sort them into different units. The comparison metric, or aperture in our implementation, is usually the MSE or L2 norm. Our implementation uses a 16-sample window and the L1 norm, which presumably has worse performance.

  • Because the spike sorting is based on 16-bit samples truncated to 8 bits, the templates themselves have truncation noise. The onboard spike matching also involves truncation of the samples. How this truncation noise affects sorting quality has not been rigorously studied, but I expect it to contribute little (if at all) to any spike-detection performance hit.

The literature suggests that for isolating a fixed-pattern signal embedded in noise, the best solution might be a matched filter, which Tim has done a preliminary study on.

Jan 26, 2016 - Wireless Headstage: Generate oscillations by setting IIR coefficients

Wireless Project github

While trying to confirm that the radio transmission for the RHD-headstage firmware was working correctly, I needed a way to decouple the signal chain from the transmission during testing. A nice side effect of the cascaded biquad IIR filter implementation is that, by writing the biquad coefficients in a certain way, sinusoidal oscillations of different frequencies can be generated. If this is done on the last biquad, pure sinusoids are transmitted to gtkclient (when the final output is selected for transmission, of course), and it is easy to see whether the data is getting corrupted in the transmission process (if not, then any corruption would be due to the signal-chain code).

Below is the DF1-biquad block diagram again:

image1

If we assume nonzero initial conditions such that $y[n-1]$ and $y[n-2]$ are nonzero, then pure oscillations can be induced by setting $b_0$, $b_1$ and $b_2$ to 0, in which case only feedback is present in the biquad.

The system function is then:

$$Y = a_1 z^{-1} Y + a_2 z^{-2} Y$$

Let $a_1 = 2 - f_p$ and $a_2 = -1$; the resulting conjugate poles will be at

$$p_{1,2} = \frac{(2 - f_p) \pm \sqrt{f_p^2 - 4f_p}}{2}$$

The resulting normalized frequency of oscillation is $\omega = \angle p_1$, in radians/sample. The oscillation frequency in Hz is $f = \frac{\omega F_s}{2\pi}$, where $F_s$ is the sampling frequency (in our case 31.25 kHz for all channels).

So all we need to find is $f_p$ for a desired oscillation frequency. To find the appropriate coefficients for the last biquad of my signal chain to induce oscillations, I used the following Matlab script. Note that the coefficient naming differs from the diagram -- a0 and a1 in the script are the same as $a_1$ and $a_2$ in the diagram.

% IIR oscillator - myopen_multi/gktclient_multi/matlab/IIR_oscillator.m
Fs = 31250;     % sampling frequency

% To get oscillator of frequency f, do:
% set b0=0, b1=0, a1=-1, and a0=2-fp
fp = 0.035;
omega = angle((2-fp+sqrt(fp^2-4*fp))/2); % in rad/sample
f = omega * Fs/(2*pi);  % convert to Hz

% But a0 is fixed-point number, convert fp:
fp_fixed = round(fp*2^14);

% coefficients
b0 = 0;
b1 = 0;
a0 = (2^15-fp_fixed)/2^14;
a1 = -1;

y = [0.5, 0.5];       % y[n-1]=0.5, y[n-2]=0.5
x = rand(1,1002)*2;   % 1000 random values, including x[n-1] and x[n-2]

for i=3:numel(x),
    val = b0*x(i) + b1*x(i-1) + b0*x(i-2) + a0*y(i-1) + a1*y(i-2);
    y = [y, val];
end

plot_fft(y, 31250);

In line 10, fp is converted to a number that can be represented by Q14 (see biquad IIR filter implementation for why).

Lines 19-27 simulate running a biquad with the newly found coefficients, and plot the FFT of the output waveform.

Jan 26, 2016 - Wireless Headstage: gtkclient notes

Wireless Project github

How gtkclient works:

  • Bridge gets assigned DHCP IP address.
  • gtkclient's multicast socket receives the bridge's neurobrg messages.
  • Upon receiving that message, gtkclient assigns that bridge to one of the pre-determined radio channels.
  • The bridge then starts receiving wireless packets on the assigned radio channel.

gtkclient spawns a sock_thread to take care of each bridge. Previously, the array holding the detected bridge IP addresses was not locked; this meant it was sometimes possible for two threads to be assigned to handle the same bridge, which at best leaves us with 1 operational radio channel, and at worst no transmission at all. This can be solved by mutex-locking the shared array. In general, proper threading can be used to solve gtkclient-bridge communication issues.


Dropped packets tracking

In bridge firmware:

  • When the bridge starts, it keeps a flag g_nextFlag which says what the echo of the next packet should be.

  • This echo value is extracted by getRadioPacket() in test_spi.c.

  • If flag > g_nextFlag, we know a packet has been dropped. The total number of dropped packets, g_dropped, is incremented by flag - g_nextFlag and is sent to gtkclient (see the sketch after this list).

  • gtkclient also keeps track of this g_dropped statistic in g_dropped[tid], where tid corresponds to the different headstages we are communicating with. If the echo flag of the newly received packet is greater than the value in g_dropped[tid], the value in the array is replaced by that echo flag. The difference between the new echo flag and the old value is then added to g_totalDropped.
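A minimal sketch of the bridge-side bookkeeping in C, assuming the 4-bit echo flag wraps modulo 16 (names follow the post; the wrap handling is my assumption):

#include <stdint.h>

static uint32_t g_dropped  = 0;
static uint8_t  g_nextFlag = 0;

/* Called for each received packet with its extracted 4-bit echo flag. */
void track_dropped(uint8_t flag)
{
    if (flag != g_nextFlag)                     /* one or more packets missed */
        g_dropped += (uint8_t)(flag - g_nextFlag) & 0x0F;
    g_nextFlag = (uint8_t)(flag + 1) & 0x0F;    /* echo expected next */
}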


Waveform display

Each received wireless packet has 24 bytes of waveform data and 8 bytes of template matching data. The waveform data is in fact 6 consecutive samples from each of 4 different channels, each sample one byte in size. However, it is important to know how we are supposed to interpret this byte value before plotting.

In my earliest iteration of the radio_basic.asm firmware, I simply saved the high byte of the 16-bit sample acquired from the RHD2132. The amplifiers were configured to transmit the samples in offset binary, meaning the most negative voltage with respect to the reference electrode is represented by 0x0000, the most positive by 0xffff, and the reference voltage by 0x7fff. Therefore, I would want to treat the byte extracted from the packet as an unsigned char. The waveform plot window has a vertical range of [-1,1]; therefore, to plot, I mapped the unsigned char range [0,255] to [-1,1].

In later firmware iterations, AGC and IIR filter outputs are transmitted. Those samples, however, have to be interpreted as signed char. This is because the samples collected were converted to signed Q1.15 before going through the signal chain, and therefore the outputs end up in signed two's complement format as well.

The location of the decimal point varies depending on which value is transmitted. For example, the AGC gain is Q7.8, so the decimal point is at the end of the most significant byte being transmitted. In comparison, the result of the IIR filter is Q1.15, so the decimal point is after the most significant bit of the transmitted byte. But since we never mix waveforms of different formats in one display, this is not too important; we just have to keep it in mind when interpreting the plotted waveforms. Practically speaking, the shape of the waveforms is what matters most for spike sorting, debugging, etc.

In those cases, to plot, I had to map the signed char range [-128,127] to [-1,1], which is a different function from the unsigned char mapping. Both mappings are sketched below.
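The two mappings, written out in C (my formulation):

#include <stdint.h>

/* Map a received sample byte onto the display range [-1, 1].
   Raw Intan samples (offset binary) use the unsigned mapping;
   AGC/IIR outputs (two's complement) use the signed mapping. */
float map_unsigned(uint8_t b) { return ((float)b - 128.0f) / 128.0f; }
float map_signed(int8_t b)    { return (float)b / 128.0f; }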

gtkclient by default used the signed char mapping. Therefore, when I first transmitted Intan data as unsigned data, the waveforms plotted were simply rail-to-rail oscillations due to value wrapping. There are two ways to display both Intan data and the signal-chain data appropriately:

  1. Change the mapping based on which part of the signal chain gtkclient has selected from the headstage to transmit.

  2. Configure the RHD2132 to transmit samples in two's complement format. This actually simplifies the AGC stage: since the signed 16-bit samples can be treated as Q1.15 numbers right away, we skip a step of conversion from unsigned to signed fixed point. With this method, gtkclient can use the signed char mapping to display all parts of the signal chain.


Multicast Problems

This problem happened during a demo of the wheelchair experiment using RHA-headstages, with RHA-gtkclient running on Daedalus. A router was used rather than a switch at the time. Despite configuring the router to not filter multicast traffic, the bridges could not be detected by gtkclient.

UDP sockets opened by gtkclient:

  • 4340 and 4340+radio_channel are the bcastsock sockets used to detect bridges. This is what we are interested in for the bridge problem.
  • 4343: TCP server socket to receive BMI3 connections
  • 8845: Strobe socket

Steps:

  1. Used ss -u -a to check what multicast sockets are open.

  2. Monitor multicast traffic via wireshark:

    sudo tshark -n -V -i eth0 igmp

    Surprisingly, after executing this command, we started to receive bridge packets and gtkclient started working again.

I wasn't able to duplicate this scenario on other systems...

Stackoverflow thread

Jan 6, 2016 - Wireless Headstage: Implementing direct-form I biquad IIR filters

Continuing to add to the RHD headstage's signal chain, I had to cut down on the number of biquads used to implement the Butterworth bandpass filters on the Blackfin. The original code uses four back-to-back biquads to implement an 8-pole Butterworth filter.

The reason for using a Butterworth filter is its maximally flat passband. The reason for using back-to-back biquads rather than one big chain (i.e. naively implementing the transfer function) is that each biquad uses only one summer, maximizing efficiency on embedded systems.

DF2-biquad

DF2-biquad

DF1-biquad

DF1-biquad

Most commonly, Direct Form II biquads are used in implementations because only one delay line is needed. However, as this great paper and [other common DSP resources](https://www.amazon.com/Understanding-Digital-Signal-Processing-3rd/dp/0137027419) suggest, Direct Form II biquads have limitations on the coefficient range. Therefore, Direct Form I biquads are more desirable in 16-bit fixed-point applications for their better resistance to coefficient quantization and better stability.

A lot of the original implementation rationale for the filters is discussed in Tim's post from a long time ago; it was decided that the filters would be Butterworth due to their excellent in-band performance. This post will mostly talk about how to generate the coefficients for the filters given the following assembly code structure, per this paper again:

r5 = [i0++] || r1 = [i1++];  // r5=b0(LPF), r1=x0(n-1); i0@b1(LPF), i1@x0(n-2)
a0 = r0.l * r5.l, a1 = r0.h * r5.h || r6 = [i0++] || r2 = [i1++];    // r6=b1(LPF), r2=x0(n-2)
a0 += r1.l * r6.l, a1 += r1.h * r6.h || r7 = [i0++] || r3 = [i1++];  // r7=a0(LPF), r3=y1(n-1)
a0 += r2.l * r5.l, a1 += r2.h * r5.h || r5 = [i0++] || r4 = [i1++];  // r5=a1(LPF), r4=y1(n-2)
a0 += r3.l * r7.l, a1 += r3.h * r7.h || [i2++] = r0;                 // update x0(n-1), i2@x0(n-2)
r0.l = (a0 += r4.l * r5.l), r0.h = (a1 += r4.h * r5.h) (s2rnd) || [i2++] = r1; // r0=y1(n), save x0(n-2)

This code is the most efficient implementation of DF1-biquad on Blackfin. To implement new filters, we need to

  1. Get new filter transfer function.
  2. Factor the transfer function into biquad forms.
  3. Make biquad coefficients into correct fixed-point format.

Note that since we are only implementing filters that are cascades of biquads, that means any filters that we implement must contain an even number of poles.

There are two approaches for steps 1 and 2. Approach 1 is easier for filters with a low number of poles; Approach 2 is easier for a high number of poles.

Approach 1

Since I only have enough code-path to implement two biquads (instead of four as Tim did), I need one LPF biquad and one HPF biquad. The filter can be designed by:

fs = 31250; % sampling frequency

% 2nd order butterworth LPF, cutoff at 9kHz
fc_LPF = 9000;
wn_LPF = (fc_LPF/fs)*2;  % normalized cutoff frequency (units=pi rad/sample)
[b_LPF, a_LPF] = butter(2, wn_LPF, 'low');  % descending powers of z, constant first

% b_LPF = [0.3665, 0.7329, 0.3665]
% a_LPF = [1, 0.2804, 0.1855]

% 2nd order butterworth HPF, cutoff at 500Hz
fc_HPF = 500;
wn_HPF = (fc_HPF/fs)*2;
[b_HPF, a_HPF] = butter(2, wn_HPF, 'high');

% b_HPF = [0.9314, -1.8628, 0.9314]
% a_HPF = [1, -1.8580, 0.8675]

% plot freq-response of bandpass filter
fvtool(conv(b_LPF,b_HPF), conv(a_LPF, a_HPF));

The transfer function of a biquad is $H(z) = \frac{Y}{X} = \frac{b_0 + b_1 z^{-1} + b_0 z^{-2}}{1 - a_0 z^{-1} - a_1 z^{-2}}$. However, the coefficients generated by the butter() function follow the form $H(z) = \frac{Y}{X} = \frac{b(1) + b(2) z^{-1} + b(3) z^{-2}}{a(1) + a(2) z^{-1} + a(3) z^{-2}}$, so we need to reverse the sign of the last two values in a_LPF and a_HPF.

To turn the generated coefficients into the actual values used on the Blackfin, we need to be mindful of how many bits they should occupy. Since the result of the biquad (line 6 of the assembly snippet) uses the (s2rnd) flag, the result of the summer must be Q1.14 (so the s2rnd result is Q1.15). This means all coefficients must fit within Q1.14 (and hopefully can be represented as such):

b_LPF_DSP = round(b_LPF*2^14);    % result is [b0,b1,b0]
a_LPF_DSP = round(a_LPF*2^14)*-1; % result is [dont-care, a0, a1]
% b0=6004, b1=12008, a0=-4594, a1=-3039

b_HPF_DSP = round(b_HPF*2^14);
a_HPF_DSP = round(a_HPF*2^14)*-1;
% b0=15260, b1=-30519, a0=30442, a1=-14213

These resulting coefficients are what I ended up using in the firmware. The two code snippets above are in gtkclient_multi/genIIRfilters.m.
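To see what the firmware computes with these Q1.14 coefficients, here is a C model of one DF1 biquad step (a sketch; the (s2rnd) doubling is folded into the final shift, and a0/a1 are stored sign-flipped, as above):

#include <stdint.h>

/* Direct Form I biquad modeling the Blackfin MAC arithmetic:
   y(n) = (b0*x(n) + b1*x(n-1) + b0*x(n-2) + a0*y(n-1) + a1*y(n-2)) >> 14
   Samples are Q1.15, coefficients Q1.14; >>14 includes the s2rnd doubling. */
typedef struct {
    int16_t b0, b1, a0, a1;  /* Q1.14 coefficients */
    int16_t x1, x2, y1, y2;  /* delay line */
} biquad_t;

int16_t biquad_df1(biquad_t *f, int16_t x)
{
    int32_t acc = (int32_t)f->b0 * x
                + (int32_t)f->b1 * f->x1
                + (int32_t)f->b0 * f->x2   /* symmetric feedforward */
                + (int32_t)f->a0 * f->y1
                + (int32_t)f->a1 * f->y2;
    int32_t y = acc >> 14;                 /* Q2.29 sum -> Q1.15 output */
    if (y > 32767)  y = 32767;             /* saturate like the hardware */
    if (y < -32768) y = -32768;
    f->x2 = f->x1; f->x1 = x;
    f->y2 = f->y1; f->y1 = (int16_t)y;
    return (int16_t)y;
}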

By scaling up the b coefficients of the low-pass filter, we can increase the DC gain without changing the shape of the gain curve (equivalent to increasing the in-band gain, thanks to the flatness of the Butterworth response). However, the resulting coefficients must be clamped within [-32768, 32767] so they can be properly stored in 16 bits.

This is desirable when high-frequency noise makes the incoming signal envelope large and limits the gain, and LMS then subtracts this common noise: with the extra IIR-stage gain, we can amplify the signal in the passband more. It is not applicable to my version of the headstage, because the raw samples are already 16 bits. This is now implemented in gtkclient for all firmware versions with an IIR stage.

The total gain of each biquad is limited by the coefficient range in Q15. For the LPF, the DC gain can be calculated as the sum of the filter's feedforward coefficients divided by 1 minus the sum of the filter's feedback coefficients (in decimal). Keeping the feedback coefficients fixed, multiplying both feedforward coefficients by some factor increases the DC gain of the LPF biquad by that same factor. In the gtkclient code, we assume the static gain is 1, i.e. 1 minus the sum of the feedback coefficients equals the sum of the feedforward coefficients. This is not entirely true; therefore, the IIR gain value represents how much extra gain the filter stage provides over the base gain.
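As a worked check with the LPF designed above (using the sign-flipped feedback coefficients $a_0 = -0.2804$, $a_1 = -0.1855$): $H(1) = \frac{b_0 + b_1 + b_0}{1 - a_0 - a_1} = \frac{0.3665 + 0.7329 + 0.3665}{1 + 0.2804 + 0.1855} \approx 1.0$. Doubling both feedforward coefficients would therefore double the DC gain to 2 while leaving the poles, and hence the shape of the response, unchanged.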

As the feedforward coefficient values have different magnitudes, the same gain may saturate just one of the coefficients, in which case the gain scaling changes the filter's shape. Therefore, in practice, the IIR gain entered should stay within [-2.5, 2.5]. To find the max IIR gain possible without saturating, find the smallest feedforward coefficient of the LPF biquad (m_lowpass_coeffs[0,1,4,5] for RHA or m_low_pass_coeffs[0,1] for RHD, in headstage.h) and divide 2 by that number.

Approach 2

The above approach is fine when we need one LPF biquad and one HPF biquad. But say we need an 8th-order Butterworth bandpass filter; using the above approach, we would need two HPF biquads and two LPF biquads. We can't simply use two copies of the same HPF biquad and two of the same LPF biquad: the resulting bandpass filter would have less bandwidth between the -3dB points (two poles at the same frequency mean a sharper roll-off from that point on).

The solution then requires us to factor the generated transfer function (gtkclient_multi/genIIRfilters2.m):

% design butterworth filter, and derive coefficients for
% the cascade of two direct-form I biquads

fs = 31250;
% 8th order butterworth bandpass filter
% B and A in descending order of z
[B,A] = butter(4, [1000,9000]/(fs/2), 'bandpass');

% Factor the numerator and denominator.
% First get the roots of the numerator and denominator:
rB = roots(B);
rA = roots(A);

% get the polynomials for each biquad, each from two (conjugate) roots
% abs to take care of round-off errors - product of conjugate roots cannot be imaginary.
pB1 = abs(poly(rB(1:2)));    % num
pA1 = abs(poly(rA(1:2)));    % den

pB2 = abs(poly(rB(3:4)));
pA2 = abs(poly(rA(3:4)));

pB3 = abs(poly(rB(5:6)));
pA3 = abs(poly(rA(5:6)));

pB4 = abs(poly(rB(7:8)));
pA4 = abs(poly(rA(7:8)));

final_num = conv(conv(pB1,pB2),conv(pB3,pB4));
final_den = conv(conv(pA1,pA2),conv(pA3,pA4));
fvtool(final_num, final_den);   % plot resulting biquad cascade behavior

% convert to fixed point coefficients (Q1.14)
% b1(1) and a1(1) are the highest-order terms.
b1 = round(pB1*2^14);
a1 = round(pA1*2^14)*-1;

b2 = round(pB2*2^14);
a2 = round(pA2*2^14)*-1;

b3 = round(pB3*2^14);
a3 = round(pA3*2^14)*-1;

b4 = round(pB4*2^14);
a4 = round(pA4*2^14)*-1;

Ironically, the filtering behavior looks better for the 4th-order filter generated from Approach 1. Shown below left is the 4th-order biquad cascade; right is the 8th-order cascade of biquads (4 of them) from Approach 2.

butterworth_comparison

In the 8th-order filter, the HPF behavior disappeared... not sure why.


Edit: 2/4/2016

While comparing the signal quality between RHD- and RHA-headstages, it seems that the RHD headstage is noisier, even in the middle of the passband (FFT analysis of acquired signals). The extra noise may therefore be due to the more compact board layout?..

As a basic attempt to reduce noise, I reduced the passband from [500 Hz, 9 kHz] to [500 Hz, 7 kHz]. This is accompanied by changes to my Intan passband setup, the gtkclient.cpp UI, and headstage.cpp resetBiquad.

Original bandwidth settings

  • Intan cutoff freq: [250Hz, 10kHz]
  • Intan DSPen setting: High pass above 300Hz
  • Firmware AGC highpass: > 225Hz
  • IIR setting: [500Hz, 9kHz]
    • LPF: b0=6004, b1=12008, a0=-4594, a1=-3039
    • HPF: b0=15260, b1=-30519, a0=30442, a1=-14213

Reduced bandwidth settings

  • Intan cutoff freq: [250Hz, 7.5kHz]
  • Intan DSPen setting: High pass above 300Hz
  • Firmware AGC highpass: > 225Hz
  • IIR setting: [500Hz, 7kHz]
    • LPF: b0=4041, b1=8081, a0=3139, a1=-2917
    • HPF: b0=15260, b1=-30519, a0=30442, a1=-14213

Edit: 4/6/2016

During [validation](WirelessValidationMetrics) of recording and sorting quality between the RHA-headstage, the RHD-headstage, and Plexon, a few changes to the final deployment firmware were made. Specifically, AGC was replaced by a fixed gain stage. The filtering settings were changed as well.

The new settings are:

  • Intan cutoff freq: [1Hz, 10kHz]
  • Intan DSPen setting: High pass above 1Hz
  • Firmware radio_AGC_IIR_SAA.asm AGC highpass: > 225Hz
  • Firmware radio_gain_IIR_SAA.asm (final) AGC highpass: none
  • IIR setting: [250, 9kHz]
    • LPF: b0=6004, b1=12008, a0=-4594, a1=-3039
    • HPF: b0=15812, b1=-31624, a0=31604, a1=-15260

We widened the Intan's hardware filter bandwidth so that, when sorting, the incoming waveform is less distorted and we can better see the spike features. The two-stage IIR filters (2 poles on each side) are sufficient to attenuate the undesired out-of-band signals.


Edit: 7/26/2017

After implementing LMS, the input to the IIR stage is much less noisy. Keeping the filter bandwidth setting the same -- [250 Hz, 9 kHz] -- the IIR gain makes the spikes much easier to see. RHD headstages have comparable performance to RHA headstages, upon qualitative evaluation (heh).

Jan 6, 2016 - Solving jekyll highlight line numbers

This helps a lot. Finally got the syntax highlighting and line numbers to work.

Update: 2/1/2016

Apparently jekyll (written in Ruby) changed its default highlighter to rouge, because pygments is written in Python and they think it's too slow to use a ruby wrapper for it.

This means github pages is also using rouge as the default highlighter, which unfortunately does not support assembly. So now all my assembly code snippets are just gray... and the code table is broken; now there is an extra gray box around the actual code box. Too much webdev, maybe fix later.

Dec 25, 2015 - Wireless Headstage: Blackfin BF532 communication with Intan RHD2132 via SPI emulation through SPORT

Wireless Project github

Background/Old implementation

We want to be able to acquire signals from as many channels as possible. The original headstage Tim designed has four Intan RHA2132 amplifiers, each of which can amplify signals from 32 channels. Based on signals sent to the chip, the internal multiplexer can be instructed to output the analog signal from any of the 32 channels. These measured voltage signals are then passed to a 12-bit ADC running at 1 MHz, which means each channel is sampled at about 31.25 kHz, standard for neural signal acquisition.

The ADC then sends out the conversion results in a serial interface that is compatible with several standards, such as SPI, QSPI, MICROWIRE, as well as Blackfin's SPORT interface.

This is great news because the SPORT interface provides 4 sets of buses, meaning the blackfin can talk to four ADCs and therefore sample from 4 RHA2132s at a time. That means we can sample 128 channels at 31.25 kHz each.

Further, since the RHA2132 acts simply as an analog amplifier, communication between the blackfin and RHA2132+ADC is simple:

  • Receive frame sync (RFS) provides the _CS signal to the ADC.
  • Receive serial clock (RSCLK) provides the SCLK signal to the ADC.
  • Receive data lines are connected to the ADC's data output.
  • The input of the ADC connects to the MUXout of the RHA2132.
  • The _reset and step signals of all RHA2132s connect to two GPIO pins on the blackfin.

Therefore, at the beginning of each call to the DSP routine, the data values are collected from the SPORT receive buffer, and the step line is toggled to step all of the RHA2132 muxes to the next channel. The pinout of this setup is (ADC or amp to blackfin):

  • Amplifier 1:
    • ADC1 SCLK - RSCLK0
    • ADC1 _CS - RFS0
    • ADC1 DATA - DR0PRI
    • Amp1 _reset - PF7
    • Amp1 step - PF8
  • Amplifier 2:
    • ADC2 SCLK - RSCLK0
    • ADC2 _CS - RFS0
    • ADC2 DATA - DR0SEC
    • Amp2 _reset - PF7
    • Amp2 step - PF8
  • Amplifier 3:
    • ADC3 SCLK - RSCLK1
    • ADC3 _CS - RFS1
    • ADC3 DATA - DR1PRI
    • Amp3 _reset - PF7
    • Amp3 step - PF8
  • Amplifier 4:
    • ADC4 SCLK - RSCLK1
    • ADC4 _CS - RFS1
    • ADC4 DATA - DR1SEC
    • Amp4 _reset - PF7
    • Amp4 step - PF8

Therefore the SPORT ports only receive. SPORT0 generates an internal frame sync, which becomes the external frame sync for SPORT1. The SPORT configurations are (in headstage_firmware/main.c and headstage_firmware/radio5.asm):

SPORT0_RCLKDIV = 1
SPORT0_RFSDIV = 19
SPORT0_RCR1 = 0x0603    // early frame sync, active high RFS, require RFS for every word
SPORT0_RCR2 = 0x010F

SPORT1_RCLKDIV = 1
SPORT1_RFSDIV = 19
SPORT1_RCR1 = 0x0401    // require RFS for every word
SPORT1_RCR2 = 0x010F 

The RSCLK frequency is set to $\frac{\text{SCLK}}{2(1 + \text{SPORTx\_RCLKDIV})} = \frac{80 \text{ MHz}}{4} = 20 \text{ MHz}$. Setting SPORTx_RFSDIV = 19 means the ADC's chip select is pulsed high every 20 SCLK cycles, for 1 clock cycle. Therefore the ADC is operated at 1 MHz. The ADC outputs 12-bit words, and with 20 clock cycles between chip-select pulses there are enough clock cycles for the SPORT to read the data output.

This setup is straightforward and works well.

New implementation with RHD2132

The Intan RHD2132 is essentially an RHA2132 with a 16-bit ADC built in. The ADC has a maximum sampling rate of 1MHz, which means that when the 32 channels are multiplexed, we can achieve a 31.25kHz sampling rate per channel.

However, the RHD2132 interfaces with other chips strictly through SPI. This introduces two complications:

  1. Instead of simply toggling a pin to reset the amplifier or step the channel the internal ADC is sampling, each conversion needs to be initiated by a conversion command. This means that if we keep using the SPORT ports for communication, we have to transmit at the same time as we receive.

  2. The SPI timing requirement is shown in the diagram below:

    image1

    The _CS high duration (tCSOFF) needs to be greater than 154ns. The datasheet mentions that the _CS line must be pulsed high between every 16-bit data transfer. Finally, tCS1 and tCS2 must be greater than 20.8ns.

The first modification is easily taken care of -- we just need to enable the SPORTx_TX ports to send Intan commands at the same time as conversion data is received.

Fulfilling all of the SPI timing requirements is impossible with the SPORT interface because:

The RHD2132 transmits and receives only when _CS is pulled low. That means both RFSync and TFSync, which dictate when a 16-bit word is received by or transmitted from the Blackfin, must be connected to _CS and have the same timing. This is easily done by having TFSync internally generated, and setting RFSync to be externally generated, receiving TFSync. RSCLKx and TSCLKx, which dictate when a bit is sampled or driven onto the bus, are edge-sync'd to RFSync and TFSync, respectively. At the same time, RSCLKx and TSCLKx have to be set to the same timing, and they connect to the RHD2132's SCLK.

The new Blackfin SPORT to RHD2132 amplifier connections are (amplifiers numbered from left to right):

  • Amplifier 1: SPORT1_SEC
    • SCLK+ to SCLK1
    • cs+¯ to FSync1
    • MOSI+ to DT1SEC
    • MISO+ to DR1SEC
  • Amplifier 2: SPORT1_PRI
    • SCLK+ to SCLK1
    • cs+¯ to FSync1
    • MOSI+ to DT1PRI
    • MISO+ to DR1PRI
  • Amplifier 3: SPORT0_SEC
    • SCLK+ to SCLK0
    • cs+¯ to FSync0
    • MOSI+ to DT0SEC
    • MISO+ to DR0SEC
  • Amplifier 4: SPORT0_PRI
    • SCLK+ to SCLK0
    • cs+¯ to FSync0
    • MOSI+ to DT0PRI
    • MISO+ to DR0PRI

where SCLK0 is RSCLK0 and TSCLK0 tied together, SCLK1 is RSCLK1 and TSCLK1 tied together, FSync0 is RFS0 and TFS0 tied together, and FSync1 is RFS1 and TFS1 tied together.

This means the falling edge of _CS always coincides with a falling or rising edge of SCLK, and likewise for the rising edge of _CS. However, tCS1 specifies the time from the falling edge of _CS to the next rising edge of SCLK, while tCS2 specifies the time from the last falling edge of SCLK to the rising edge of _CS. Given how the SPORT operates, one of these requirements will always be violated.

In fact, according to ADI engineers, it is not possible to fulfill all the timing requirements when using SPORT to emulate SPI. So I just hoped it would work anyway and set out to test it.

Testing

My final SPORT settings, collected from headstage2_firmware/main.c and headstage2_firmware/intan_setup_test.asm (they appear in other firmware versions as well; intan_setup_test is simply where I first tested them), are:

// SPORT0 transmit
// Using TCLK for late frame sync
SPORT0_TCLKDIV = 1;
SPORT0_TFSDIV = 3;
SPORT0_TCR2 = TXSE | 0x10;  // enable secondary; 17-bit words
SPORT0_TCR1 = TCKFE | LATFS | LTFS | TFSR | ITFS | ITCLK | TSPEN | DITFS;

// SPORT1 transmit
SPORT1_TCLKDIV = 1;
SPORT1_TFSDIV = 3;
SPORT1_TCR2 = TXSE | 0x10;  // enable secondary; 17-bit words
SPORT1_TCR1 = TCKFE | LATFS | LTFS | TFSR | ITFS | ITCLK | TSPEN | DITFS;

// SPORT0 receive
// Serial clock and frame sync derived from TSCLK and TFS, so RCLKDIV and RFSDIV are ignored
SPORT0_RCR2 = RXSE | 0x10;  // enable secondary; 17-bit words
SPORT0_RCR1 = RCKFE | LARFS | LRFS | RFSR | RSPEN;

// SPORT1 receive
// Serial clock and frame sync derived from TSCLK and TFS, so RCLKDIV and RFSDIV are ignored
SPORT1_RCR2 = RXSE | 0x10;  // enable secondary; 17-bit words
SPORT1_RCR1 = RCKFE | LARFS | LRFS | RFSR | RSPEN;

In this setup, the receive clock and frame sync are the same signals as the transmit clock and frame sync. TSCLK = RSCLK = the RHD2132's SCLK, which is again set to 20MHz.

LTFS sets the frame sync signal to be active low.

SPORTx_TFSDIV=3 would mean the frame sync or _CS is pulled low for 1 SCLK (50ns) every 4 cycles (200ns).

However, LATFS selects late frame sync, which means that when a frame sync triggers transmission or reception of a word, the line stays low for the duration of that word, and any further frame sync during this time is ignored by the SPORT. Afterwards, if it is not yet time for another frame sync, the line returns high; otherwise it stays low. Late frame sync also means the timing requirement between the falling edge of _CS and the next rising edge of SCLK (tCS1) is fulfilled.

Together, this means _CS will be low for 17 SCLK cycles (17×50ns = 850ns) and high for 3 SCLK cycles (3×50ns = 150ns). When the frame sync first goes low, it stays low for 17 SCLK cycles, at which point it returns high because 17 is not a multiple of 4. The next multiple of 4 is 20, which is when the frame sync is pulled low again.

The reason for _CS to be low for 17 SCLK cycles is twofold:

  • Because a data word is 16 bits, _CS has to stay low for more than 16 cycles. No matter what settings I use, the tCS2 requirement will be violated, which, given how the multiplexer works, affects the value of the last bit read. Since the data is read MSB-first, using 17 clock cycles means only the 17th bit, which I discard later anyway, is corrupted by the tCS2 violation. Therefore no harm done!

  • With _CS low for 17 cycles and high for 3 cycles, I get exactly a 20-cycle period, making it 1MHz. This is one of the only configurations that achieves that.

With this setup, the tCS2 timing requirement is not fulfilled at all, while tCSOFF is violated by 4ns.
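The timing numbers in one small C check (SCLK = 20MHz, datasheet limits from above):

/* _CS timing under the 17-low/3-high frame-sync scheme. */
#include <stdio.h>

int main(void) {
    const double cycle_ns = 1e9 / 20e6;   /* 50ns per SCLK cycle */
    double cs_low_ns = 17 * cycle_ns;     /* 850ns */
    double cs_high_ns = 3 * cycle_ns;     /* 150ns */
    double period_ns = cs_low_ns + cs_high_ns;
    printf("conversion period = %.0fns (%.1fMHz)\n", period_ns, 1e3 / period_ns);
    printf("tCSOFF = %.0fns, datasheet minimum 154ns -> violated by %.0fns\n",
           cs_high_ns, 154.0 - cs_high_ns);
    return 0;
}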

Initial testing meant checking the acquired sample values. Fortunately, every RHD2132 comes with ROM registers containing the ASCII values 'I', 'N', 'T', 'A', 'N'. Therefore, if the SPI emulation actually works, I can issue READ commands for those register locations to the Intan and check whether the received results are correct.

One thing to remember is that when a command to convert a channel is sent, the conversion result is not received until two SPORT cycles later. This also means that when the pipeline is full and you stop sending commands, it takes two more cycles for the pipeline to empty of new data. And since with the SPORT setup we can't read from the Intan without also sending, the last (old) command ends up being sent three times over the course of clearing the pipeline.
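Putting the register check and the pipeline bookkeeping together, a C sketch of the test logic. Here sport_transfer() is a hypothetical stand-in for one full-duplex SPORT word exchange (the real test lives in intan_setup_test.asm), the command encoding follows the RHD2000-series datasheet (READ = 0b11 in the two MSBs, register address in bits [13:8]), and the register value is assumed to arrive in the low byte of the result:

#include <stdint.h>

/* Hypothetical stand-in: push one 16-bit command into the transmit FIFO
 * and pop one 16-bit result from the receive FIFO. */
extern uint16_t sport_transfer(uint16_t cmd);

/* RHD2000-series READ command: 0b11 in bits [15:14], register address in [13:8]. */
static uint16_t rhd_read_cmd(uint8_t reg) {
    return (uint16_t)(0xC000 | ((reg & 0x3F) << 8));
}

/* Returns 1 if ROM registers 40..44 read back 'I','N','T','A','N'. */
int check_intan(void) {
    static const char expect[5] = {'I', 'N', 'T', 'A', 'N'};
    int ok = 1;
    /* Results lag commands by two transfers, so clock two extra words;
     * the filler commands just re-read register 44. */
    for (int i = 0; i < 5 + 2; i++) {
        uint16_t result = sport_transfer(rhd_read_cmd(i < 5 ? 40 + i : 44));
        if (i >= 2 && (char)(result & 0xFF) != expect[i - 2])
            ok = 0;
    }
    return ok;
}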

After establishing that, I needed to make sure I could sample and obtain correct waveforms. This is done by applying a sinusoid of known frequency to the amplifier. In the firmware, I simply read the channels four at a time (ch1 on all four amps, then ch2 on all four amps, and so on), while at the same time sending the commands telling the amplifiers to convert the next channel. Every time, I saved the sample read from the same amplifier.

Running this test program (the first version is headstage2_firmware/intan_setup_test.asm) and setting a breakpoint at the end of execution through JTAG, I can dump the saved data and plot it (using headstage2_firmware/plot_memdump.py).

The initial test result for all 32 channels of amplifier 1, with a 6400Hz signal applied to the first channel, is shown below:

image2

I have signal, but it's not what I expected at all: channels without any applied signal are oscillating, there are too many cycles, etc.

This is strange, because when I repeatedly converted just one channel with the same setup (sending CONVERT(0), CONVERT(0), ...) rather than converting each channel in turn (sending CONVERT(0), CONVERT(1), CONVERT(2), ...), I got a great result:

image3

I got two periods on the first channel of amplifier 1. The bottom 3 traces are the first channels of the other amplifiers, which are quiet as expected.

It turns out the difference is due to the reduced sampling frequency. In the second case, where the single channel is sampled at 1MHz, each period of the 6400Hz wave spans about 156 samples.

However, while calculating how many samples I needed to collect in the first case, I mistakenly used that 156 samples/period figure. When I sample 32 channels round-robin style, the effective sampling rate of each channel is 31.25kHz, which means only about 5 samples/period (31.25kHz/6400Hz ≈ 4.9). This is why my plots had way more periods than expected.
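The arithmetic of the mistake, in two lines of C:

/* Samples per period of the 6400Hz test tone at the two effective rates. */
#include <stdio.h>

int main(void) {
    const double f_sig = 6400.0;
    printf("one channel at 1MHz: %.1f samples/period\n", 1e6 / f_sig);         /* ~156 */
    printf("round-robin at 31.25kHz: %.1f samples/period\n", 31.25e3 / f_sig); /* ~4.9 */
    return 0;
}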

Secondly, the channels without applied signals were left floating, making them susceptible to all the noise within the amplifier and PCB planes. This results in the same oscillation frequency across channels, but different amplitudes and DC values.

After calculating the correct number of samples to collect to see 2 periods of 6400Hz on all 32 channels of an amplifier, correctly grounding the channels without applied signals, and turning on Intan's DSP highpass filter/offset removal, the results looked just as expected:

image4

In the plot above, channels 0, 15, 31 and 16 have the 6400Hz signal applied while the others are grounded.

Consequences of the new configuration

The RHD2132 also supports outputting ADC results either in unsigned offset binary, where the reference electrode voltage has the value 0x8000, or in signed two's complement, where the reference electrode has the value 0x0000 and voltages below it have the MSB set, two's-complement style.

This requires modifying the first stage of the firmware's signal chain, which converts the incoming samples to fixed-point Q1.15 format for further processing. See the post on modifying the AGC stage.
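For the offset-binary mode, the conversion to two's complement (and hence Q1.15) is just an XOR of the MSB. A minimal C sketch; the firmware does the equivalent in assembly:

#include <stdint.h>

/* Offset binary puts the reference at 0x8000; flipping the MSB re-centers the
 * reference at 0x0000, yielding an ordinary two's-complement (Q1.15) sample. */
static inline int16_t offset_binary_to_q15(uint16_t raw) {
    return (int16_t)(raw ^ 0x8000);
}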

Finally, the DITFS option makes the SPORT frame sync, read, and write happen every 1μs regardless of whether new commands have been pushed into the SPORT transmit FIFO. This means that if our code path (signal chain plus radio-transmission code) between SPORT FIFO reads/writes is too long, the previous command in the FIFO is sent again, and two cycles later the corresponding result is read. This then results in erroneous data.

However, this data corruption manifests in a somewhat counter-intuitive way. After establishing correct Intan and SPORT setup and successfully acquiring signals, I added minimally modified signal-chain code from the RHA-headstage's working firmware headstage_firmware/radio5.asm, and saved all pre-processed samples from one of the amplifiers in memory (headstage2_firmware/firmware1.asm). I then stopped execution when the designated buffer in memory was full, and dumped the results over JTAG for plotting. The plots, which I expected to be smooth sinusoids of the applied signal, were riddled with random spikes.

This cannot be due to the signal-chain code, since the saved samples came directly from the SPORT. A second test, where I simply added nops in between SPORT reads/writes, confirmed that the data corruption was due to the code path being too long. However, the number of nops that allows glitchless data acquisition is less than the signal-chain length used in radio5.asm. This is very strange, since the timing requirements for both headstage firmware setups are similar -- accounting for the radio-transmission code, the DSP code has to finish in about 800ns. But somehow the timing budget for the RHD-headstage is significantly smaller.

In the nop test done in firmware2.asm, I found I can fit 150 nops. I have yet to figure out why this is the case. Regardless, it means the original RHA-headstage firmware must be slimmed down to adapt it to the RHD-headstage. Since the radio-transmission code must stay the same, this reduction falls exclusively on the signal-chain code, which includes:

  1. high-pass + AGC = 5 cycles/2 channels * 2 = 10 cycles.

    In each signal-chain iteration 4 samples are acquired, 1 per amplifier, so all 4 must be processed. Blackfin works efficiently with two 16-bit samples at once, so samples from two channels are processed first, then the remaining two samples are processed in the rest of the signal-chain code.

  2. LMS = 27 cycles/2 channels * 2 = 54 cycles.

  3. IIR = 27 cycles/2 channels * 2 = 54 cycles.

  4. SAA = 58 cycles.

AGC is automatic gain control, LMS is least-mean-square adaptive noise cancellation, IIR is infinite impulse response filter, and SAA is used for spike template matching.

Among these, SAA is required for spike matching. AGC is required because the spike-sorting templates were made at a certain signal power level, so AGC is needed to hold each channel's power level fixed. Therefore, all of LMS and half of IIR were ripped out to reduce code length.
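Tallying those counts against the roughly 150-cycle window from the nop test shows why something had to go (plain C, numbers from the list above):

/* Signal-chain cycle budget vs. the ~150-cycle nop window. */
#include <stdio.h>

int main(void) {
    const int agc = 10, lms = 54, iir = 54, saa = 58;  /* cycles, from the list above */
    printf("full RHA chain: %d cycles\n", agc + lms + iir + saa);      /* 176 > 150 */
    printf("without LMS, half IIR: %d cycles\n", agc + iir / 2 + saa); /* 95 */
    return 0;
}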


Update: 7/26/2017

The reduction in available processing cycles kept bothering me, so I tried to figure out what causes it and how to squeeze out more cycles.

Initial RHD-headstage firmware flows like so:

  1. Initialize all memory bank values (_radio_bidi_asm).

  2. Set up Intan communication by writing to its config registers and calibrating the ADC (comm_setup).

  3. Enter the main loop, where signal acquisition and processing interleave with radio transmission.

In step 3, after getting the latest samples from the SPORT buffers, we increment the channel count, construct the proper request to the Intan chips, and send it through the SPORT again. The SPORT presumably shifts this out to the Intan amplifiers while simultaneously reading new samples from them. This is done through:

r7 = CHANNEL_SHIFTED; 
[p0 + (SPORT1_TX - SPORT0_RX)] = r7;   // SPORT1 primary TX
[p0 + (SPORT1_TX - SPORT0_RX)] = r7;   // SPORT1 sec TX
[p0 + (SPORT0_TX - SPORT0_RX)] = r7;   // SPORT0 primary TX
[p0 + (SPORT0_TX - SPORT0_RX)] = r7;   // SPORT0 sec TX

It turns out that if, during the Intan-setup phase prior to the start of signal acquisition, I leave the command for auto-incrementing the selected amplifier channel in the SPORT transmit registers (SPORT1_TX, SPORT0_TX) and delete the above snippet from the signal-acquisition code, the same behavior is achieved and I gain an extra 57 cycles to work with during step 3. This then allows me to fit the entire LMS step into the firmware.

Why does this happen? Probably because accessing the SPORT ports on the Blackfin takes more than 1 CCLK cycle (1/400MHz = 2.5ns), which is what typical instructions take. The ADSP-BF532 hardware reference, section 3-5, mentions that Peripheral Access Bus (PAB) and Memory-Mapped Register (MMR) accesses, which include SPORTx_TX and SPORTx_RX, take 2 SCLK cycles.

SCLK is set to 80MHz, so one SCLK cycle is 5 CCLK cycles. The code snippet above therefore takes 4×2 = 8 SCLK cycles, or 40 CCLK cycles. This is less than the extra 57 cycles I gained after deleting it, but I think it confirms what I'm seeing.