Jan 6, 2016 - Wireless Headstage: Implementing direct-form I biquad IIR filters

Continuing to add to the RHD headstage's signal chain, I have to cut down on the number of biquads used to implement the Butterworth bandpass filters on the Blackfin. The original code uses four back-to-back biquads to implement an 8-pole Butterworth filter.

The reason for using a Butterworth filter is its maximally flat passband. The reason for using back-to-back biquads rather than one big chain (i.e. naively implementing the full transfer function) is that each biquad uses only one summer, maximizing efficiency on embedded systems.

DF2-biquad

DF1-biquad

Most commonly, Direct Form II biquads are used in implementations because only one delay line is needed. However, as this great paper and [other common DSP resources](https://www.amazon.com/Understanding-Digital-Signal-Processing-3rd/dp/0137027419) suggest, Direct Form II biquads have limitations on their coefficient range. Direct Form I biquads are therefore more desirable in 16-bit fixed-point applications for their better resistance to coefficient quantization and better stability.

A lot of the original implementation rationale for the filters was discussed in Tim's post from a long time ago; there it was decided that the filters would be of Butterworth type due to their excellent in-band performance. This post will mostly talk about how to generate the coefficients for the filters, given the following assembly code structure (again per this paper):

r5 = [i0++] || r1 = [i1++];  // r5=b0(LPF), r1=x0(n-1); i0@b1(LPF), i1@x0(n-2)
a0 = r0.l * r5.l, a1 = r0.h * r5.h || r6 = [i0++] || r2 = [i1++];    // r6=b1(LPF), r2=x0(n-2)
a0 += r1.l * r6.l, a1 += r1.h * r6.h || r7 = [i0++] || r3 = [i1++];  // r7=a0(LPF), r3=y1(n-1)
a0 += r2.l * r5.l, a1 += r2.h * r5.h || r5 = [i0++] || r4 = [i1++];  // r5=a1(LPF), r4=y1(n-2)
a0 += r3.l * r7.l, a1 += r3.h * r7.h || [i2++] = r0;                 // update x0(n-1), i2@x0(n-2)
r0.l = (a0 += r4.l * r5.l), r0.h = (a1 += r4.h * r5.h) (s2rnd) || [i2++] = r1; // r0=y1(n), save x0(n-2)

This code is the most efficient implementation of a DF1 biquad on Blackfin. To implement new filters, we need to:

  1. Get new filter transfer function.
  2. Factor the transfer function into biquad forms.
  3. Make biquad coefficients into correct fixed-point format.

Note that since we only implement filters as cascades of biquads, any filter we implement must contain an even number of poles.
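For reference, here is a plain-Python sketch (my own, not from the repo) of the difference equation the assembly snippet above computes. Note the firmware's sign convention: the a-coefficients are stored pre-negated so they are added rather than subtracted, and the symmetric numerator reuses b0 for the x(n-2) term.

# Hypothetical reference model of the DF1 biquad computed by the assembly above.
# a0, a1 are assumed pre-negated (added, not subtracted), matching the firmware convention.
def df1_biquad(x, b0, b1, a0, a1):
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b0 * x2 + a0 * y1 + a1 * y2
        x2, x1 = x1, xn          # shift the input delay line
        y2, y1 = y1, yn          # shift the output delay line
        y.append(yn)
    return y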

There are two approaches for steps 1 and 2. Approach 1 is easier for filters with a small number of poles; Approach 2 is easier for filters with a large number of poles.

Approach 1

Since I only have enough code-path to implement two biquads (instead of four as Tim did), I need one LPF biquad and one HPF biquad. The filter can be designed by:

fs = 31250; % sampling frequency

% 2nd order butterworth LPF, cutoff at 9kHz
fc_LPF = 9000;
wn_LPF = (fc_LPF/fs)*2;  % normalized cutoff frequency (units=pi rad/sample)
[b_LPF, a_LPF] = butter(2, wn_LPF, 'low');  % coefficients in descending powers of z

% b_LPF = [0.3665, 0.7329, 0.3665]
% a_LPF = [1, 0.2804, 0.1855]

% 2nd order butterworth HPF, cutoff at 500Hz
fc_HPF = 500;
wn_HPF = (fc_HPF/fs)*2;
[b_HPF, a_HPF] = butter(2, wn_HPF, 'high');

% b_HPF = [0.9314, -1.8628, 0.9314]
% a_HPF = [1, -1.8580, 0.8675]

% plot freq-response of bandpass filter
fvtool(conv(b_LPF,b_HPF), conv(a_LPF, a_HPF));

The transfer function of a biquad as implemented above is \(H(z)=\frac{Y}{X}=\frac{b_0+b_1 z^{-1}+b_0 z^{-2}}{1-a_0 z^{-1}-a_1 z^{-2}}\). However, the coefficients generated by the butter() function follow the form \(H(z)=\frac{Y}{X}=\frac{b(1)+b(2)z^{-1}+b(3)z^{-2}}{a(1)+a(2)z^{-1}+a(3)z^{-2}}\), so we need to reverse the sign of the last two values in a_LPF and a_HPF.

To turn the generated coefficients into the actual values used on the Blackfin, we need to be mindful of how many bits they should occupy. Since the result of the biquad (line 6 of the assembly snippet) uses the (s2rnd) flag, the result of the summer must be Q1.14 (so the s2rnd result is Q1.15). This means all coefficients must be within Q1.14 (and hopefully can be represented as such):

b_LPF_DSP = round(b_LPF*2^14);    % result is [b0,b1,b0]
a_LPF_DSP = round(a_LPF*2^14)*-1; % result is [dont-care, a0, a1]
% b0=6004, b1=12008, a0=-4594, a1=-3039

b_HPF_DSP = round(b_HPF*2^14);
a_HPF_DSP = round(a_HPF*2^14)*-1;
% b0=15260, b1=-30519, a0=30442, a1=-14213

These resulting coefficients are what I ended up using in the firmware. The two code snippets above are in gtkclient_multi/genIIRfilters.m.

By scaling up the b-coefficients of the low-pass filter, we can increase the DC gain without changing the shape of the gain curve (equivalent to increasing the in-band gain, thanks to the flatness of the Butterworth response). However, the resulting coefficients must be clamped within [-32768, 32767] so they can still be stored in 16 bits.

This is desirable when high-frequency noise makes the incoming signal envelope high and limits the gain; LMS then subtracts this common noise, and with the extra IIR-stage gain we can amplify the signal in the passband more. It is not applicable to my version of the headstage, because the raw samples are already 16 bits. This is now implemented in gtkclient for all firmware versions with an IIR stage.

The total gain of each biquad is limited by the coefficient range in Q15. For the LPF, the DC gain can be calculated as the sum of the filter's feedforward coefficients divided by 1 minus the sum of the filter's feedback coefficients (in decimal). Keeping the feedback coefficients fixed, multiplying both feedforward coefficients by the same factor increases the DC gain of the LPF biquad by that factor. In the gtkclient code, we assumed the static gain is 1, i.e. 1 minus the sum of the feedback coefficients equals the sum of the feedforward coefficients. This is not exactly true; therefore, the IIR gain value represents how much extra gain the filter stage provides over the base gain.

As the feedforward coefficient values have different magnitudes, the same gain may saturate just one of the coefficients, in which case the gain scaling changes the filter's shape. Therefore, in practice, the IIR gain entered should stay within [-2.5, 2.5]. To find the max IIR gain possible without saturating, find the largest feedforward coefficient of the LPF biquad (m_lowpass_coeffs[0,1,4,5] for RHA or m_low_pass_coeffs[0,1] for RHD, in headstage.h) and divide 2 by its decimal value.
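To make the arithmetic concrete, here is a small Python check using the Q1.14 LPF coefficients from above (a sketch only; the variable names are mine):

# Sketch of the LPF biquad DC-gain and max-extra-gain arithmetic (values assumed from above).
b = [6004, 12008, 6004]   # Q1.14 feedforward coefficients (b0, b1, b0)
a = [-4594, -3039]        # Q1.14 feedback coefficients, already sign-flipped for the DSP

to_dec = lambda q: q / 2.0**14   # Q1.14 -> decimal

# DC gain = sum(feedforward) / (1 - sum(feedback)), everything in decimal
dc_gain = sum(map(to_dec, b)) / (1.0 - sum(map(to_dec, a)))
print(dc_gain)   # ~1.0, i.e. the assumed static gain

# Largest extra gain before the biggest feedforward coefficient leaves the Q1.14 range (+/-2)
max_gain = 2.0 / max(abs(to_dec(c)) for c in b)
print(max_gain)  # ~2.7, hence the [-2.5, 2.5] guideline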

Approach 2

The above approach is fine when we need one LPF biquad and one HPF biquad. But say we need an 8th-order Butterworth bandpass filter; with the above approach we would need two HPF biquads and two LPF biquads. We can't simply use two copies of the same HPF biquad and two copies of the same LPF biquad -- the resulting bandpass filter would have less bandwidth between the -3dB points (two poles at the same frequency mean a sharper roll-off from that point on).

The solution then, requires us to factor the generated transfer function (gtkclient_multi/genIIRfilters2.m):

% design butterworth filter, and derive coefficients for
% the cascade of two direct-form I biquads

fs = 31250;
% 8th order butterworth bandpass filter
% B and A in descending order of z
[B,A] = butter(4, [1000,9000]/(fs/2), 'bandpass');

% Factor the numerator and denominator.
% First get the roots of num and den
rB = roots(B);
rA = roots(A);

% get the polynomials for each biquad, each from two (conjugate) roots
% abs to take care of round-off errors - product of conjugate roots cannot be imaginary.
pB1 = abs(poly(rB(1:2)));    % num
pA1 = abs(poly(rA(1:2)));    % den

pB2 = abs(poly(rB(3:4)));
pA2 = abs(poly(rA(3:4)));

pB3 = abs(poly(rB(5:6)));
pA3 = abs(poly(rA(5:6)));

pB4 = abs(poly(rB(7:8)));
pA4 = abs(poly(rA(7:8)));

final_num = conv(conv(pB1,pB2),conv(pB3,pB4));
final_den = conv(conv(pA1,pA2),conv(pA3,pA4));
fvtool(final_num, final_den);   % plot resulting biquad cascade behavior

% convert to fixed point coefficients (Q1.14)
% b1(1) and a1(1) are the highest-order terms.
b1 = round(pB1*2^14);
a1 = round(pA1*2^14)*-1;

b2 = round(pB2*2^14);
a2 = round(pA2*2^14)*-1;

b3 = round(pB3*2^14);
a3 = round(pA3*2^14)*-1;

b4 = round(pB4*2^14);
a4 = round(pA4*2^14)*-1;

Ironically, the filtering behavior looks better for the 4th-order filter generated with Approach 1. Shown below, left is the 4th-order biquad cascade; right is the 8th-order cascade (4 biquads) from Approach 2.

butterworth_comparison

In the 8th-order filter, the HPF part disappeared...not sure why? (Likely the abs() in the factoring step: it flips the sign of negative numerator coefficients, so the high-pass zeros near z = 1 are destroyed; taking real() instead of abs() would be safer.)


Edit: 2/4/2016

While comparing signal quality between the RHD- and RHA-headstages, it seems the RHD headstage is noisier, even in the middle of the passband (based on FFT analysis of acquired signals). The extra noise may be due to the more compact board layout?..

As a basic attempt to reduce noise, I reduced the passband from [500Hz, 9kHz] to [500Hz, 7kHz]. This is accompanied by changes to the Intan passband setup, the gtkclient.cpp UI, and headstage.cpp resetBiquad.

Original bandwidth settings

  • Intan cutoff freq: [250Hz, 10kHz]
  • Intan DSPen setting: High pass above 300Hz
  • Firmware AGC highpass: > 225Hz
  • IIR setting: [500Hz, 9kHz]
    • LPF: b0=6004, b1=12008, a0=-4594, a1=-3039
    • HPF: b0=15260, b1=-30519, a0=30442, a1=-14213

Reduced bandwidth settings

  • Intan cutoff freq: [250Hz, 7.5kHz]
  • Intan DSPen setting: High pass above 300Hz
  • Firmware AGC highpass: > 225Hz
  • IIR setting: [500Hz, 7kHz]
    • LPF: b0=4041, b1=8081, a0=3139, a1=-2917
    • HPF: b0=15260, b1=-30519, a0=30442, a1=-14213

Edit: 4/6/2016

During [validation](WirelessValidationMetrics) of recording and sorting quality between the RHA-headstage, RHD-headstage, and Plexon, a few changes to the final deployment firmware were made. Specifically, AGC was replaced by a fixed gain stage, and the filtering settings were changed as well.

The new settings are:

  • Intan cutoff freq: [1Hz, 10kHz]
  • Intan DSPen setting: High pass above 1Hz
  • Firmware radio_AGC_IIR_SAA.asm AGC highpass: > 225Hz
  • Firmware radio_gain_IIR_SAA.asm (final) AGC highpass: none
  • IIR setting: [250Hz, 9kHz]
    • LPF: b0=6004, b1=12008, a0=-4594, a1=-3039
    • HPF: b0=15812, b1=-31624, a0=31604, a1=-15260

We widened the Intan's hardware filter bandwidth so that, when sorting, the incoming waveform is less distorted and we can better see the spike features. The two-stage IIR filter (2 poles on each side) is sufficient to attenuate the undesired out-of-band signals.


Edit: 7/26/2017

After implementing LMS, the input to the IIR stage is much less noisy. Keeping the filter bandwidth setting the same -- [250Hz, 9kHz] -- the IIR gain makes the spikes much easier to see. RHD headstages now have comparable performance to RHA headstages, upon qualitative evaluation (heh).

Jan 6, 2016 - Solving jekyll highlight line numbers

This helps a lot. Finally got the syntax highlighting and line numbers to work.

Update: 2/1/2016

Apparently jekyll (written in Ruby) changed its default highlighter to rouge, because pygments is written in Python and they think it's too slow to use a Ruby wrapper for it.

This means GitHub Pages is also using rouge as the default highlighter, which unfortunately does not support assembly. So now all my assembly code snippets are just gray...and the code table is broken -- there is now an extra gray box around the actual code box. Too much webdev, maybe fix later.

Dec 25, 2015 - Wireless Headstage: Blackfin BF532 communication with Intan RHD2132 via SPI emulation through SPORT

Wireless Project github

Background/Old implementation

We want to be able to acquire signals from as many channels as possible. The original headstage Tim designed has four Intan RHA2132 amplifiers, each of which can amplify signals from 32 channels. Based on signals sent to the chip, the internal multiplexer can be instructed to output the analog signal from any of the 32 channels. These measured voltage signals are then passed to a 12-bit ADC running at 1MHz, which means each channel is sampled at about 31.25kHz, standard for neural signal acquisition.

The ADC then sends out the conversion results in a serial interface that is compatible with several standards, such as SPI, QSPI, MICROWIRE, as well as Blackfin's SPORT interface.

This is great news because the SPORT interface has 4 sets of buses available, meaning the Blackfin can talk to four ADCs and therefore sample from four RHA2132 amplifiers at a time -- 128 channels at 31.25kHz each.

Further, since the RHA2132 acts simply as an analog amplifier, communication between the Blackfin and RHA2132+ADC is simple:

  • Receive frame sync (RFS) provides the CS¯ signal to the ADC.
  • Receive serial clock (RSCLK) provides the SCLK signal to the ADC.
  • Receive data lines are connected to the ADC's data output.
  • The input of the ADC connects to MUXout of RHA2132.
  • The reset¯ and step signal of all RHA2132 connects to two GPIO pins on the blackfin.

Therefore, at the beginning of each call to the DSP routine, the data values are collected from the SPORT receive buffer, and the step line is toggled to step all of the RHA2132s' muxes to the next channel. The pinout of this setup is (ADC or amp to Blackfin):

  • Amplifier 1:
    • ADC1 SCLK - RSCLK0
    • ADC1 CS¯ - RFS0
    • ADC1 DATA - DR0PRI
    • Amp1 reset¯ - PF7
    • Amp1 step - PF8
  • Amplifier 2:
    • ADC2 SCLK - RSCLK0
    • ADC2 CS¯ - RFS0
    • ADC2 DATA - DR0SEC
    • Amp2 reset¯ - PF7
    • Amp2 step - PF8
  • Amplifier 3:
    • ADC3 SCLK - RSCLK1
    • ADC3 CS¯ - RFS1
    • ADC3 DATA - DR1PRI
    • Amp3 reset¯ - PF7
    • Amp3 step - PF8
  • Amplifier 4:
    • ADC4 SCLK - RSCLK1
    • ADC4 CS¯ - RFS1
    • ADC4 DATA - DR1SEC
    • Amp4 reset¯ - PF7
    • Amp4 step - PF8

Therefore the SPORT ports only receive. SPORT0 generates an internal frame sync, which becomes the external frame sync for SPORT1. The SPORT configurations are (in headstage_firmware/main.c and headstage_firmware/radio5.asm):

SPORT0_RCLKDIV = 1
SPORT0_RFSDIV = 19
SPORT0_RCR1 = 0x0603    // early frame sync, active high RFS, require RFS for every word
SPORT0_RCR2 = 0x010F

SPORT1_RCLKDIV = 1
SPORT1_RFSDIV = 19
SPORT1_RCR1 = 0x0401    // require RFS for every word
SPORT1_RCR2 = 0x010F 

The RSCLK is set to \(\frac{SCLK}{2\times(1+SPORTx\_RCLKDIV)}=\frac{80MHz}{4}=20MHz\). Setting SPORTx_RFSDIV=19 means the ADC's chip select is pulsed high every 20 SCLK cycles for 1 clock cycle, so the ADC is operated at 1MHz. The ADC outputs a 12-bit word, and with 20 clock cycles between chip-select pulses there are enough cycles for the SPORT to read the data output.
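As a quick sanity check of those numbers, the divider arithmetic in Python (plain arithmetic, not actual register reads):

# Divider arithmetic for the RHA-headstage SPORT receive setup (illustrative values from above).
SCLK = 80e6                 # Blackfin system clock feeding the SPORT
RCLKDIV = 1
RFSDIV = 19

rsclk = SCLK / (2 * (1 + RCLKDIV))   # serial clock driven to the ADC
fs_adc = rsclk / (RFSDIV + 1)        # one chip-select pulse every RFSDIV+1 SCLK cycles
print(rsclk / 1e6, fs_adc / 1e6)     # 20.0 MHz serial clock, 1.0 MHz conversion rate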

This setup is straightforward and works well.

New implementation with RHD2132

Intan RHD2132 is essentially RHA2132 with a 16-bit ADC built in. The ADC has maximum sampling frequency of 1MHz, which means when the 32 channels are multiplexed, we can achieve 31.25kHz sampling frequency on each channel.

However, the RHD2132 needs to interface with other chips strictly through SPI. This introduces two complications:

  1. Instead of simply toggling a pin to reset the amplifier and step the channel the internal ADC is sampling, each conversion needs to be initiated by a conversion command. This means that if we keep using the SPORT ports for communication, we have to transmit at the same time as we receive.

  2. The SPI timing requirement is shown in the diagram below:

    image1

    The _CS high duration (tCSoff) needs to be greater than 154ns. The datasheet mentions that the _CS line must be pulsed high between every 16-bit data transfer. Finally, tCS1 and tCS2 must be greater than 20.8ns.

The first modification is easily taken care of -- we just need to enable the use of SPORTx_TX ports to send Intan commands at the same time as receiving conversion data.

Fulfilling all of the SPI timing requirements is impossible with the SPORT interface because:

The RHD2132 transmits and receives only when _CS is pulled low. That means both RFSync and TFSync, which dictate when a 16-bit word is received by or transmitted from the Blackfin, must be connected to _CS and have the same timing. This is easily done by having TFSync be internally generated, and setting RFSync to be externally generated, receiving TFSync. RSCLKx and TSCLKx, which dictate when a bit is sampled or driven out onto the bus, are edge-synced to RFSync and TFSync, respectively. At the same time, RSCLKx and TSCLKx have to be set to have the same timing, and connect to the RHD2132's SCLK.

The new Blackfin SPORT to RHD2132 amplifier connections are (amplifiers numbered from left to right):

  • Amplifier 1: SPORT1_SEC
    • SCLK+ to SCLK1
    • cs+¯ to FSync1
    • MOSI+ to DT1SEC
    • MISO+ to DR1SEC
  • Amplifier 2: SPORT1_PRI
    • SCLK+ to SCLK1
    • cs+¯ to FSync1
    • MOSI+ to DT1PRI
    • MISO+ to DR1PRI
  • Amplifier 3: SPORT0_SEC
    • SCLK+ to SCLK0
    • cs+¯ to FSync0
    • MOSI+ to DT0SEC
    • MISO+ to DR0SEC
  • Amplifier 4: SPORT0_PRI
    • SCLK+ to SCLK0
    • cs+¯ to FSync0
    • MOSI+ to DT0PRI
    • MISO+ to DR0PRI

where SCLK0 is RSCLK0 and TSCLK0 tied together, SCLK1 is RSCLK1 and TSCLK1 tied together, FSync0 is RFS0 and TFS0 tied together, and FSync1 is RFS1 and TFS1 tied together.

This means the falling edge of _CS always corresponds to either a falling or rising edge of SCLK, and similarly for the rising edge of _CS. However, tCS1 defines the timing between the falling edge of _CS and the next rising edge of SCLK, while tCS2 defines the timing between the last falling edge of SCLK and the rising edge of _CS. Given the operation of SPORT, one of these requirements will be violated.

In fact, according to ADI engineers, it is not possible to fulfill all the timing requirements when using SPORT to emulate SPI. So I just hoped it could work anyways and tried to test this.

Testing

My final SPORT settings, collected from headstage2_firmware/main.c and headstage2_firmware/intan_setup_test.asm (the same settings appear in the other firmware versions; intan_setup_test is simply where I first tested them), are:

// SPORT0 transmit
// Using TCLK for late-frame sync
SPORT0_TCLKDIV = 1;
SPORT0_TFSDIV = 3;
SPORT0_TCR2 = TXSE | 0x10;  // allow secondary, 17 bits word
SPORT0_TCR1 = TCKFE | LATFS | LTFS | TFSR | ITFS | ITCLK | TSPEN | DITFS;

// SPORT1 transmit
SPORT1_TCLKDIV = 1;
SPORT1_TFSDIV = 3;
SPORT1_TCR2 = TXSE | 0x10;  // allow secondary, 17 bits word
SPORT1_TCR1 = TCKFE | LATFS | LTFS | TFSR | ITFS | ITCLK | TSPEN | DITFS;

// SPORT0 receive
// Serial clock and frame syncs derived from TSCLK and TFS, so RCLKDIV and RFSDIV ignored
SPORT0_RCR2 = RXSE | 0x10;  // allow secondary, 17 bits word
SPORT0_RCR1 = RCKFE | LARFS | LRFS | RFSR | RSPEN;

// SPORT1 receive
// Serial clock and frame syncs derived from TSCLK and TFS, so RCLKDIV and RFSDIV ignored
SPORT1_RCR2 = RXSE | 0x010; // allow secondary, 17 bits word
SPORT1_RCR1 = RCKFE | LARFS | LRFS | RFSR | RSPEN;

In this setup, the receive clock and frame sync are the same as the transmit clock and frame sync. TSCLK=RSCLK=SCLK of the RHD2132 is again set to 20MHz.

LTFS sets the frame sync signal to be active low.

SPORTx_TFSDIV=3 would mean the frame sync or _CS is pulled low for 1 SCLK (50ns) every 4 cycles (200ns).

However, LATFS sets late frame sync, which means that when a frame sync triggers transmission or reception of a word, the frame sync stays low for the duration of that word, and any further frame sync during this time is ignored by the SPORT. After that, if it is not yet time for another frame sync, the frame sync returns high; otherwise it goes low again. Late frame sync also means the timing requirement between the falling edge of _CS and the next rising edge of SCLK is fulfilled.

Together, this means _CS will be low for 17 SCLK cycles (17×50ns=850ns) and high for 3 SCLK cycles (3×50ns=150ns). This is because, when frame sync first goes low, it stays low for 17 SCLK cycles, at which point it goes high because 17 is not a multiple of 4. The next multiple of 4 is 20, which is when frame sync is pulled low again.
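The same arithmetic as a small Python sketch (constants taken from the description above):

import math

# Frame-sync timing sketch for the RHD-headstage SPORT setup described above.
sclk_ns = 50       # 20 MHz serial clock period
word_bits = 17     # SLEN = 16 -> 17-bit words
tfsdiv = 3         # a frame sync is requested every TFSDIV+1 = 4 SCLK cycles

cs_low_ns = word_bits * sclk_ns    # late frame sync: _CS stays low for the whole word (850 ns)
# _CS then stays high until the next multiple of (TFSDIV+1) SCLK cycles
period_cycles = math.ceil(word_bits / (tfsdiv + 1)) * (tfsdiv + 1)   # 20 cycles
cs_high_ns = (period_cycles - word_bits) * sclk_ns                   # 150 ns
print(cs_low_ns, cs_high_ns, 1e9 / (period_cycles * sclk_ns))        # 850 150, 1 MHz period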

The reason for _CS to be low for 17 SCLK cycles is twofold:

  • Because a data word is 16 bits, _CS has to be low for more than 16 cycles. No matter what settings I use, the tCS2 requirement will be violated, which, based on how the multiplexer works, will affect the value of the last bit being read. Since the data is read MSBit first, having 17 clock cycles means only the 17th bit, which I will discard later anyway, is corrupted by the tCS2 violation. Therefore no harm done!

  • With _CS low for 17 cycles and high for 3 cycles, I get exactly a 20-cycle period, i.e. 1MHz. This is one of the only configurations that achieves that.

With this setup, the tCS2 timing requirement is not fulfilled at all, while tCSoff is violated by 4ns.

Initial testing involved checking the acquired sample values. Fortunately, every RHD2132 comes with registers that contain the ASCII values of 'I', 'N', 'T', 'A', 'N'. Therefore, if the SPI emulation actually works, I can just issue READ commands for those register locations and check whether the received results are correct.

One thing to remember is that when a command to convert a channel is sent, the conversion result is not received until two SPORT cycles later. This also means that, when the pipeline is full and you stop sending new commands, it takes two cycles for the pipeline to be completely empty of new data. But since with the SPORT setup we can't read from the Intan without also sending, the last (old) command has to be sent three times during the time it takes to clear the pipeline.

After establishing that, I then needed to make sure I could sample and obtain correct waveforms. This is done by applying a sinusoid of known frequency to the amplifier. In the firmware, I simply read the channels four at a time (ch1 on all four amps, then ch2 on all four amps, and so on), while sending commands telling the amplifiers to convert the next channel at the same time as reading. I save the samples read from the same amplifier every time.

Running this test program (first version is headstage2_firmware/intan_setup_test.asm) and setting breakpoint at the end of execution through JTAG, I can dump the saved data and plot them (using headstage2_firmware/plot_memdump.py).

The initial test result for all 32 channels of amplifier 1, with a 6400Hz signal applied to the first channel, looks like this:

image2

I have signal, but it's not what I expected at all...channels without signals are oscillating, there were too many cycles, etc.

This was strange, because when I repeatedly converted just one channel using the same setup (sending CONVERT(0), CONVERT(0), ...) rather than converting each channel in turn (sending CONVERT(0), CONVERT(1), CONVERT(2), ...), I got a great result:

image3

I got two periods on the first channel of amplifier 1. The bottom 3 are the first channels of the other amplifiers, which are quiet as expected.

Turns out the difference is due to the reduced sampling frequency. In the second case, where the sampling frequency is 1MHz, each period of the 6400Hz wave spans about 156 samples.

However, while calculating how many samples I needed to collect in the first case, I mistakenly used that 156 samples/period value. When I sample 32 channels round-robin style, the effective sampling rate on each channel is 31.25kHz, which means only about 5 samples/period are acquired. This is why my plots had way more periods than expected.
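The arithmetic behind this, for reference:

# Samples per period of the 6400 Hz test tone in the two test cases.
f_test = 6400.0
print(1e6 / f_test)      # ~156 samples/period when repeatedly converting one channel at 1 MHz
print(31250.0 / f_test)  # ~4.9 samples/period when round-robining all 32 channels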

Secondly, the channels without applied signals were left floating. This made them susceptible to all the noise within the amplifier and PCB planes, resulting in the same oscillation frequency but different amplitude and DC value.

After calculating the correct number of samples to collect to see two periods of 6400Hz on all 32 channels of an amplifier, correctly grounding the channels without applied signals, and turning on Intan's DSP highpass filter/offset removal, the results looked just as expected:

image4

In the plot above, channels 0, 15, 16 and 31 have the 6400Hz signal applied while the others are grounded.

Consequences of the new configuration

The RHD2132 also supports outputting ADC results either in unsigned offset binary, where the reference electrode voltage has the value 0x8000, or in signed two's complement, where the reference electrode has the value 0x0000 and values below it have the MSB set, two's complement style.

This would result in modifications of the first stage of the firmware's signal chain, which is to convert the incoming samples to fixed-point Q1.15 format for further processing. See the post on modifying the AGC stage.

Finally, the DITFS option makes the SPORT frame sync, read, and write happen every 1μs regardless of whether new commands have been pushed into the SPORT transmit FIFO. This means that if our code path (signal chain plus the radio transmission code) in between SPORT FIFO reads/writes is too long, the previous command in the FIFO will be sent again, and two cycles later the corresponding result will be read, resulting in erroneous data.

However, the manifestation of this data corruption is somewhat counter-intuitive. After establishing a correct Intan and SPORT setup and successfully acquiring signals, I added in minimally modified signal-chain code from the RHA-headstage's working firmware (headstage_firmware/radio5.asm), and saved all pre-processed samples from one of the amplifiers in memory (headstage2_firmware/firmware1.asm). I then stopped execution when the designated buffer in memory was full and dumped the results for plotting over JTAG. The plots, which I expected to be smooth sinusoids of the applied signal, were riddled with random spikes.

This cannot be due to the signal-chain code, since the samples saved came directly from the SPORT. A second test, where I simply added nops in between SPORT reads/writes, confirmed that the data corruption was due to the code path being too long. However, the number of nops that allowed glitchless data acquisition is less than the signal-chain length used in radio5.asm. This is very strange, since the timing requirements for both headstage firmware setups are similar -- accounting for the radio transmission code, the DSP code has to finish in about 800ns. But somehow the timing budget for the RHD-headstage is significantly smaller.

In the nop test done in firmware2.asm, it was found that I can fit 150 nops. I have yet to figure out why this is the case. Regardless, that means a reduction of the original RHA-headstage firmware is needed to adapt it to the RHD-headstage. Since the radio-transmission code must stay the same, this reduction falls exclusively on the signal-chain code, which includes:

  1. high-pass + AGC = 5 cycles/2 channels * 2 = 10 cycles.

    In each signal-chain iteration 4 samples are acquired, 1 per amplifier, so all 4 must be processed. Blackfin works efficiently with two 16-bit samples at once, so samples from two channels are processed first, then the remaining two samples are processed in the rest of the signal-chain code.

  2. LMS = 27 cycles/2 channels * 2 = 54 cycles.

  3. IIR = 27 cycles/2 channels * 2 = 54 cycles.

  4. SAA = 58 cycles.

AGC is automatic gain control, LMS is least-mean-square adaptive noise cancellation, IIR is infinite impulse response filter, and SAA is used for spike template matching.

Among these, SAA is required for spike matching. AGC is required because the templates for spike sorting were made at a certain signal power level, and AGC is needed to hold the channel's power at that level. Therefore, all of LMS and half of the IIR are ripped out to reduce code length, as the arithmetic below shows.
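Rough cycle-budget arithmetic (the ~150-cycle figure is from the nop test in firmware2.asm; the per-stage counts are from the list above):

# Rough cycle-budget arithmetic for the RHD signal chain.
budget = 150   # approximate number of spare cycles found by the nop test

full_chain = {'highpass+AGC': 10, 'LMS': 54, 'IIR': 54, 'SAA': 58}
print(sum(full_chain.values()))   # 176 cycles: does not fit

reduced = {'highpass+AGC': 10, 'IIR (half)': 27, 'SAA': 58}   # LMS removed, IIR halved
print(sum(reduced.values()))      # 95 cycles: fits within the budget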


Update: 7/26/2017

The reduction in available processing cycles kept bothering me, so I tried to figure out what caused it and how I could squeeze out more cycles.

Initial RHD-headstage firmware flows like so:

  1. Initialize all memory bank values (_radio_bidi_asm).

  2. Set up Intan communication by writing to its config registers and calibrating the ADC (comm_setup).

  3. Go to the main loop, where signal acquisition and processing interleave with radio transmission.

In step 3, after getting the latest samples from the SPORT buffers, we increment the channel count, construct the proper request to the Intan, and send it through SPORT again. The SPORT presumably outputs this to the Intan amplifiers simultaneously while reading new samples from them. This is done through:

r7 = CHANNEL_SHIFTED; 
[p0 + (SPORT1_TX - SPORT0_RX)] = r7;   // SPORT1 primary TX
[p0 + (SPORT1_TX - SPORT0_RX)] = r7;   // SPORT1 sec TX
[p0 + (SPORT0_TX - SPORT0_RX)] = r7;   // SPORT0 primary TX
[p0 + (SPORT0_TX - SPORT0_RX)] = r7;   // SPORT0 sec TX

Turns out that if, in the Intan-setup phase prior to the start of signal acquisition, I leave the command for auto-incrementing the selected amplifier channel in the SPORT transmit registers (SPORT1_TX, SPORT0_TX) and delete the above snippet from the signal acquisition code, the same behavior is achieved and I gain an extra 57 cycles to work with during step 3. This then allows me to fit the entire LMS step into the firmware.

Why does this happen? Probably because accessing the SPORT ports on the Blackfin takes more than 1 CCLK cycle (1/400MHz = 2.5ns), which is what most instructions take. The ADSP-BF532 hardware reference (section 3-5) mentions that Peripheral Access Bus (PAB) and Memory-Mapped Register (MMR) accesses, which include SPORTx_TX and SPORTx_RX, take 2 SCLK cycles.

SCLK is set to 80MHz, so one SCLK cycle is 5 CCLK cycles. The code snippet above takes 4*2=8 SCLK cycles, or 40 CCLK cycles. This is less than the extra 57 cycles I gained after deleting it, but I think it confirms what I see.
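The back-of-envelope arithmetic for that estimate:

# Cost of the four deleted SPORTx_TX MMR writes, per the hardware reference numbers above.
CCLK = 400e6
SCLK = 80e6
cclk_per_sclk = CCLK / SCLK        # 5 core cycles per system-clock cycle
writes = 4                         # the four SPORT TX writes in the deleted snippet
sclk_per_mmr_access = 2            # PAB/MMR access cost (hardware reference, section 3-5)
print(writes * sclk_per_mmr_access * cclk_per_sclk)   # 40 CCLK cycles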

Dec 24, 2015 - Wireless Headstage: AGC with Blackfin BF532 and Intan RHD2132

Wireless Project github

I have been working on improving our lab's old wireless headstage, which uses Analog Devices' Blackfin BF532, Intan's analog 32-channel amplifiers (RHA2132), and Nordic's NRF24L01 2.4GHz transceivers.
The improvement is replacing the RHA2132 with Intan's digital version, the RHD2132. The original headstage has four RHA2132s, each of which requires a voltage regulator and an ADC, both of which can be eliminated when using the RHD2132. The resulting (ideally working) board would be half the size.

The PCBs have been made. I originally imagined the firmware changes would be minimal -- essentially changing how the BF532 talks to the amplifiers through SPI, while keeping most of the signal chain code constant. The code architecture should not have to change much.

However, it has proved to be a lot more difficult than the expected plug-and-play scenario. One of the first issues was getting the Blackfin to talk to the Intans correctly via SPORT. Then I ran into the problem where the original signal chain code in between sample conversions was too long and resulted in data corruption. Therefore, I needed to pick parts of the signal chain to cut and modify.

This post documents my work to understand and adapt the integrator-highpass-AGC, the first stage of the signal chain.

Background

The general idea of this wireless system is that we acquire samples from four amplifiers at the same time, and then do processing on two of these samples at a time. Signals at different stages of the signal chain can be streamed to gtkclient -- the PC GUI -- for display and spike sorting. The samples are displayed in a Plexon-style interface, and templates are made using PCA. These templates are then sent back to the headstage. In the final firmware, at the end of the signal chain, template matching is done against those templates, and a match (or a spike) is found if the most recent 16 samples on a given channel are within a certain "aperture" of a template. (The template is a 16-point time series; the aperture is compared against the L1 norm of the difference between the template and the most recent 16 samples of the current channel, as sketched below.)
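A tiny sketch of the aperture test as described (function and variable names are hypothetical, not from the firmware or gtkclient):

import numpy as np

def spike_match(recent16, template, aperture):
    """Template match as described above: the most recent 16 samples of a channel count as
    a spike if the L1 norm of their difference from the 16-point template is within the aperture."""
    diff = np.asarray(recent16, dtype=float) - np.asarray(template, dtype=float)
    return np.sum(np.abs(diff)) <= aperture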

Because of this, it would be nice to keep the input signal to template matching at a stable level -- a signal with the correct spike shape but a different amplitude may not be categorized as a match. Thus enters Automatic Gain Control (AGC).

Algorithm

The basic idea is that the user sets a reference level for the signal power. If the signal power is above this reference level, decrease the front-end gain; if it is below, increase it. There are all kinds of fancy implementations, including ones using different attack/release rates (different gain increase/decrease rates). For embedded systems, I found this thread, and "TI DSP Applications With the TMS320 Family, Theory, Algorithms and Implementations, Volume 2" (pages 389-403), helpful in understanding how it works.

Tim describes the firmware in his thesis, page 162:

Four samples from each of the 4 ADCs are read at once; as the blackfin works efficiently with two 16-bit samples at once, two samples (from two of the 4 headstages) are then processed at a time. These two come in at 12 bits unsigned, and must be converted to 1.15 signed fixed-point, which is done by pre-gain (usually 4) followed by an integrator highpass stage. The transfer function of the high pass stage is H(z)=4(1-z^-1)/(1-z^-1(1-mu)); the normal value of mu is 800/16384 [...] This is followed by an AGC stage which applies a variable gain 0-128 in 7.8 fixed-point format; the absolute value of the scaled sample is compared to the AGC target (stored as the square root to permit 32-bit integers) and the difference is saturated at 1 bit plus sign. This permits the gain to ramp from 0 to 127.999 in 2^15 clocks.

The accompanying diagram is shown below:

image1

Below is the firmware related to this part. It is obvious now that I understand how it works, and Tim's thesis explanations now make perfect sense (they did before, but they seemed more like clues to an elaborate puzzle).

 //read in the samples -- SPORT0
 r0 = w[p0] (z); // SPORT0-primary: Ch0-31
 r1 = w[p0] (z); // SPORT0-sec:     Ch32-63
 r2.l = 0xfff;
 r0 = r0 & r2;
 r1 = r1 & r2;
 r1 <<= 16;  //secondary channel in the upper word.
 r2 = r0 + r1;
    
 //apply integrator highpass + gain (2, add twice)).
 r5 = [i0++]; //r5 = 32000,-16384. (lo, high)
.align 8
    a0 = r2.l * r5.l, a1 = r2.h * r5.l || r1 = [i1++]; // r1 = integrated mean
    a0 += r2.l * r5.l, a1 += r2.h * r5.l || r6 = [i0++]; //r6 = 16384 (0.5), 800 (mu)
    r0.l = (a0 += r1.l * r5.h), r0.h = (a1 += r1.h * r5.h) (s2rnd);
    a0 = r1.l * r6.l , a1 = r1.h * r6.l; //integrator
    r2.l = (a0 += r0.l * r6.h), r2.h = (a1 += r0.h * r6.h) (s2rnd)

    /*AGC*/	|| r5 = [i1++] || r7 = [i0++]; //r5 = gain, r7 AGC targets (sqrt)
    a0 = r0.l * r5.l, a1 = r0.h * r5.h || [i2++] = r2; //save mean, above
    a0 = a0 << 8 ; // 14 bits in SRC (note *4 above), amp to 16 bits, which leaves 2 more bits for amplification (*4)
    a1 = a1 << 8 ; //gain goes from 0-128 hence (don't forget about sign)
    r0.l = a0, r0.h = a1 ;  //default mode should work (treat both as signed fractions)
    a0 = abs a0, a1 = abs a1;
    r4.l = (a0 -= r7.l * r7.l), r4.h = (a1 -= r7.h * r7.h) (is) //subtract target, saturate, store difference
        || r6 = [i0++]; //r6.l = 16384 (1), r6.h = 1 (mu)
    a0 = r5.l * r6.l, a1 = r5.h * r6.l || nop; //load the gain again & scale.
    r3.l = (a0 -= r4.l * r6.h), r3.h = (a1 -= r4.h * r6.h) (s2rnd) || nop; //r6.h = 1 (mu); within a certain range gain will not change.
.align 8
    r3 = abs r3 (v) || r7 = [FP - FP_WEIGHTDECAY]; //set weightdecay to zero to disable LMS.
    r4.l = (a0 = r0.l * r7.l), r4.h = (a1 = r0.h * r7.l) (is) || i1 += m1 || [i2++] = r3;
                //saturate the sample (r4), save the gain (r3).

Lines 2-8 store the 12-bit samples in the low and high 16 bits of r2.

Lines 13 and 14 implement the pre-gain. r5.l=32000=0x7d00, which is 0.9765625 in Q15. Multiplying a 12-bit number by 0x7d00 in the Blackfin's default mode is equivalent to converting the 12-bit UNSIGNED sample into signed Q15 format. This IS VERY IMPORTANT, as I shall see later. Line 14 does it again and adds to the original result. This means that by line 15, a0 and a1 contain the original samples multiplied by 2 (although r5.l should really be 0x7fff, which is closer to 1 than 0x7d00 in Q15).

Line 15: r1.l = integrated mean, r5.h = -16384 = (Q15) -0.5. The multiply-accumulate (MAC) mode is s2rnd. So, in algebraic form, at the end of the instruction, a0 = 2x - 0.5m[n-1] and r0.l = 2*a0 = 4x - m[n-1]. Here x is the incoming sample and m[n-1] is the previous integrated mean. Note that since x is 12 bits, a0 and r0 are at most 14 bits, therefore still positive in Q15 format. Finally, r0 here is the output of the integrator-highpass stage (call it y), as also indicated by the diagram.

Line 16: As the comment hints, loads 0.5*m[n-1] into the accumulator. Both numbers are in Q15, the result in the accumulator is Q31.

Line 17: From line 15, we have r0 = 4x - m[n-1]. r6.h = mu = 800 = (Q15) 0.0244140625, a constant used to get the desired high-pass response. At the end of this instruction, a0 = 0.5m[n-1] + mu(4x - m[n-1]).

Since the 40-bit accumulator value is extracted into r2 with the (s2rnd) mode, r2.l and r2.h become

\(2(0.5m[n-1]+\mu(4x-m[n-1]))=m[n-1]+2\mu(4x-m[n-1])\)

As the comment in line 20 hints, this is the new mean.

To characterize the behavior of this integrator-filter stage, we have the following two equations: \(m = m z^{-1}(1-2\mu)+8\mu x\) and \(y = 4x - m z^{-1}\),

where m is the integrated mean, \(z^{-1}\) is a delay, x is the original 12-bit sample, and y is the output. Note again that since x is 12-bit unsigned, conversion into Q15 does not change anything, therefore the system function is easy to derive: \(\frac{Y}{X}=4\frac{1-z^{-1}}{1+z^{-1}(2\mu-1)}\)

The gain is 4, and the 3dB point is determined by \(z=1-2\mu\). In the code \(\mu\approx 0.0244\), so \(z=0.9512\). Note also, in Tim's thesis, \(\mu\) is actually \(0.0488\), this is because (s2rnd) was used in the actual implementation.

The -3dB frequency can be calculated by plugging the transfer function into MATLAB, since there's no easy way to get the Bode plot of a z-transform from its poles and zeros as with analog systems...unless substituting \(z=e^{j\omega}\) into the transfer function and calculating the magnitude response counts as easy...

>> num=[1,-1];
den=[1, -0.9512];
[h,w] = freqz(num,den,'whole',2001);
plot(w/pi,20*log10(abs(h)))
xlabel('Normalized Frequency (\times\pi rad/sample)')
ylabel('Magnitude (dB)')

This plots the two-sided magnitude response; we only care about the part where the normalized frequency runs from 0 to 1. The normalized frequency \(\omega_c\) at the -3dB point can be estimated from the plot. Convert to Hz via \(f_c = \frac{\omega_c \pi \cdot F_s}{2\pi} = \omega_c F_s/2\) (with \(\omega_c\) in units of \(\pi\) rad/sample). For this AGC setting, \(f_c \approx 225\)Hz.
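For a numerical version of the same estimate (my own sketch in scipy, not from the repo), locating the -3dB point of the response plotted above:

import numpy as np
from scipy.signal import freqz

# -3 dB corner of the integrator-highpass stage H(z) = (1 - z^-1) / (1 - 0.9512 z^-1)
fs = 31250
w, h = freqz([1, -1], [1, -0.9512], worN=8192)
mag_db = 20 * np.log10(np.abs(h) + 1e-12)               # epsilon avoids log(0) at DC
idx = np.argmin(np.abs(mag_db - (mag_db.max() - 3)))    # frequency closest to 3 dB below the peak
print(w[idx] / np.pi * fs / 2)                          # estimated corner frequency in Hz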

Lines 20-23 are the first step of AGC, and are pretty tricky. r5 contains the AGC gain, which as described in the thesis is in Q7.8 format (really Q8.8, but we never have negative gain here).

Line 20 multiplies r0 by the AGC gain. But the default multiplication mode treats both numbers as Q15 fractions, so the value inside the accumulator is treated as Q31 and would not be correct. In Q31, the binary point is after bit 31. But when a Q15 number multiplies a Q7.8 number, there are 15+8=23 fractional bits, so the binary point is in fact after bit 23.

Therefore, before we can extract the resulting 16-bit number in line 23 (MAC to half-register), we have to shift the accumulator left by 8. Since the default extraction option takes bits 31-16 after rounding, we then get the correct 16-bit result. One question might be: the result of a Q15 and Q7.8 multiplication also has 9 integer bits, and in the shift-and-extraction operation we are essentially throwing away 8 of them. It turns out the gain is never big enough for this to matter (in other words, Q7.8 gives a big range of available gain, but this range is not really needed).

Finally, r0 in line 23 is the output of the AGC stage.
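A small sketch of the binary-point bookkeeping described above, in plain Python integers (the Blackfin's fractional-mode left shift and rounding are omitted for clarity):

# Q15 sample times Q8.8 gain: the product has 15 + 8 = 23 fractional bits,
# so shift left by 8 to put the binary point where a Q31 extract (bits 31..16) expects it.
sample_q15 = 0x2000            # 0.25 in Q15
gain_q8_8  = 0x0200            # 2.0 in Q8.8

acc = sample_q15 * gain_q8_8   # 23 fractional bits
acc <<= 8                      # now 31 fractional bits
result_q15 = (acc >> 16) & 0xFFFF   # take bits 31..16, like the default extraction (no rounding here)
print(result_q15 / 2.0**15)         # 0.5 = 0.25 * 2.0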

It is important to note that r0 in line 20 is the difference between the new sample and the previous integrated mean, so when we take the absolute value of the accumulators in lines 21 and 23, we are essentially getting a measurement of the signal envelope's amplitude. The goal of AGC is to adjust the gain so the signal envelope stays within a target range. This is a very important point: since the Intan implements a highpass DSP filter on-chip, I initially thought I could skip the integrator-highpass stage and feed the AGC code the incoming samples directly. But I simply got rubbish, because the instantaneous signal value is not a good measurement of the signal power. The integrator is required to obtain the mean, and for this specific AGC implementation to work.

Line 25 does a clever bit of MAC trickery. a0 contains the difference between the signal level and the target level. Note that the target level is stored in memory as the square root of its value. Using the (is) mode treats all numbers involved as signed integers, and the extraction will be either 0x8000 or 0x7fff, for a negative or positive difference respectively. This is significant since 0x8000 and 0x7fff are, respectively, the most negative and most positive numbers representable in Q15, and it allows us to update the AGC gain without any if-statement style conditionals.

In line 26, r6.h=1 is not the value of mu, but simply indicates whether AGC is enabled.

Lines 27 and 28 involve yet another fixed-point trick; these two lines update the AGC gain. The AGC gain in r5 is loaded into the accumulator by multiplying it by 0x4000 (16384). Afterwards, the MAC with the s2rnd option subtracts either 0x8000 or 0x7fff from it. This whole procedure is equivalent to either adding \(2^{-8}\) to or subtracting \(2^{-7}\) from the Q7.8 value of the original AGC gain.

The key is noticing that multiplying a Q7.8 number by 0x4000, treating both as Q15 numbers, then extracting the resulting Q31 number in s2rnd mode, yields a 16-bit number with the same Q7.8 value as before. This can be demonstrated by the following lines, using functions defined in AGC_sim.py:

gain = 52                               # original gain
Q8_gain = int(decimal_to_Q8_8(52),16)
acc_val = int(decimal_to_Q1_31(hex_to_Q15(Q8_gain) * hex_to_Q15(16384)),16)
final_gain = mult_acc_to_halfReg(acc_val, 0, 0, 'sub', 's2rnd')
final_gain = hex_to_Q8_8(final_gain)    # should be 52 again.

In line 30, we simply take the absolute value of r3, which contains the new AGC gain, just to be safe.

Implementation for Intan

Understanding that the integrator stage is required for the AGC to work was crucial. Initially giving the unsigned 16-bit Intan samples to the AGC stage (starting around line 20), I simply got rail-to-rail oscillations that didn't make any sense.

Adding the integrator stage, however, still gave me similar results, just with less oscillation. The key to solving this is realizing that the original code was written to work with 12-bit unsigned samples. I wish this were emphasized more in the thesis, or perhaps I was just too n00b to realize the significance here.

Since the incoming samples from Intan are 16-bit unsigned in offset-binary format -- the reference value is 0x8000, 0 volts is 0x0000, and VDD is 0xffff -- the conversion step in line 13 (multiplying by 0x7fff) will often produce a negative result. For example, treating both 0x8000 and 0x7fff as Q15 numbers, their product is close to -1, so the resulting mean would be -2 rather than 0 (since 0x8000 means the signal is at the reference voltage). Further, this sign flip results in new values whose amplitudes differ from the original values.

There are two solutions to this problem:

  1. Right-shift the incoming 16-bit unsigned offset-binary samples down to 12 bits. Do this right after line 8, and the rest of the code stays unchanged. This approach requires the fewest changes to the old firmware (of course, lines 1-8 need to be altered to work with the Intan, but that's not algorithmic).

  2. The RHD2132 is capable of outputting the ADC samples in 16-bit signed, two's complement format. This would solve the above-mentioned problem of inadvertently flipped signs, since the samples would already be in Q15 format. But it would also require changing the integrator-highpass stage to obtain the same behavior. Specifically, in the system function there is a pre-gain of 4 in front. This is a result of line 14, and of the two s2rnd operations in lines 15 and 17.

    Now that our samples are full 16-bits, we do not need the extra gain. This requires changing the coefficient values in:

    • r5 in line 11: from r5.l=0x7fff, r5.h=0xc000 to r5.l=0x7fff, r5.h=0x8000.
    • r6 in line 14: from r6.l=0x4000, r6.h=800 to r6.l=0x7fff, r6.h=1600.

    In addition, delete the accumulator instruction in line 14, and change the MAC option from (s2rnd) to default in lines 15 and 17.

I ended up using the second approach, because I don't like throwing away bits. But both work.
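A small sketch of the two options (helper names are mine, purely illustrative):

# Two ways of handling the RHD2132's 16-bit samples, as discussed above (illustrative only).

def offset_binary_to_12bit(sample_u16):
    """Solution 1: shift the unsigned offset-binary sample down to 12 bits,
    so the original 12-bit signal-chain code can stay unchanged."""
    return sample_u16 >> 4

def sign_extend_twos_complement(sample_u16):
    """Solution 2: configure the chip for two's complement output and treat the 16-bit
    word as signed (already Q15 relative to the reference electrode)."""
    return sample_u16 - 0x10000 if sample_u16 & 0x8000 else sample_u16

print(offset_binary_to_12bit(0x8000))        # 2048, mid-scale of the 12-bit range
print(sign_extend_twos_complement(0x0000))   # 0, the reference level in two's complement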


Edit: 4/6/2016

After validation of recording and sorting quality between the RHA-headstage, RHD-headstage, and Plexon, it looks like AGC amplifies noise too much under low-SNR conditions.

As a result, the final deployment firmware uses a fixed pre-gain stage before the IIR stage. This gain is a Q7.8 value and can be set in the appropriately compiled version of gtkclient, in 0.5 increments, from -128 to 127.5.

Jul 28, 2015 - Gowda, Carmena 2014: Designing Dynamical Properties of Brain–Machine Interfaces to Optimize Task-Specific Performance

Designing Dynamical Properties of Brain–Machine Interfaces to Optimize Task-Specific Performance

Pretty heavy math, will have to go over this again. The premise is that small changes in closed-loop BMI dynamical properties can have significant and predictable effects on performance. PVKF was used as the specific algorithm to study.

Two undesired effects:

  • presence of unintended attractor points which pulled the cursor to arbitrary workspace positions.
  • a calibration-dependent control memory property which parameterized the tradeoff between speed and hold performance, likely driven by fluctuations in the closed-loop control memory of the decoder.

Both properties were not only present in experimental data, but their mathematical causes could also be traced back to the mechanics of the closed-loop decoder. For the KF, and generally for recursive Bayesian decoders, the closed-loop properties induced by interaction between the state-space model and the neural firing model can produce unintentional dynamical artifacts.

There are numerous benefits of keeping both the decoder and the neural ensemble fixed across sessions. However, CLDA is needed to adapt to nonstationarity of the ensemble, and initial calibration. But CLDA can also change system properties that one might prefer to keep fixed if one wishes to keep the system as consistent as possible before and after calibration.

References to Read

  • F. R. Willett, A. J. Suminski, A. H. Fagg and N. G. Hatsopoulos, "Improving brain-machine interface performance by decoding intended future movements", J. Neural Eng., vol. 10, no. 2, pp. 026011, 2013
  • B. Jarosiewicz, S. M. Chase, G. W. Fraser, M. Velliste, R. E. Kass and A. B. Schwartz, "Functional network reorganization during learning in a brain-computer interface paradigm", Proc. Nat. Acad. Sci., vol. 105, no. 49, pp. 19486-19491, 2008
  • E. J. Hwang, P. M. Bailey and R. A. Andersen, "Volitional control of neural activity relies on the natural motor repertoire", Curr. Biol., vol. 23, no. 5, pp. 353-361, 2013
  • W. Wu, Y. Gao, E. Bienenstock, J. P. Donoghue and M. J. Black, "Bayesian population decoding of motor cortical activity using a Kalman filter", Neural Comput., vol. 18, no. 1, pp. 80-118, 2006
  • Z. Li, J. E. O'Doherty, M. A. Lebedev and M. A. L. Nicolelis, "Adaptive decoding for brain-machine interfaces through Bayesian parameter updates", Neural Comput., vol. 23, no. 12, pp. 3162-3204, 2011
  • M. Golub, S. M. Chase and B. M. Yu, "Learning an internal dynamics model from control demonstration", Proc. Int. Conf. Mach. Learn., vol. 28, pp. 606-614, 2013

Jul 28, 2015 - Orsborn, Carmena 2014: Closed-loop decoder adaptation shapes neural plasticity for skillful neuroprosthetic control

Closed-loop decoder adaptation shapes neural plasticity for skillful neuroprosthetic control

Focuses on the acquisition of "neuroprosthetic skill": performance and neural representations that are robust over time and resistant to interference. As shown before, learning in brain-control facilitates the formation of BMI-specific control networks, and changes in cortical and corticostriatal plasticity show specificity for BMI control neurons. The evidence seems to suggest that neuroplasticity creates a specialized BMI control network that allows skillful control.

Cortical map formation has been shown to be sensitive to the details of the BMI system, such as which neurons are input to the decoder and the decoder parameters. Training new decoders regularly, even with the same neural ensemble, eliminated cortical map formation and the associated performance improvement. This suggests that a fixed decoder would allow neuroplasticity to yield continued task performance from day to day. However, the decoder's performance is sensitive to changes in the neural signals, so closed-loop adaptation can facilitate and maintain learning in the presence of changing neural inputs to the BMI. Therefore the control architecture of closed-loop decoder adaptation (CLDA) as a two-learner system can provide more robust BMI performance.

In the first experiment, a delayed center-out hold task was used. CLDA was used initially to train the decoder (via smooth-batch or ReFIT). CLDA was stopped once the monkey could navigate the cursor across the workspace -- the goal is to limit decoder adaptation to maximize improvement driven by neural adaptation; perhaps this allows for more robust neuroprosthetic skill retention?

Performance (measured by % correct, success rate - trials/min, and mean error - deviation from ideal trajectory) continued to improve after the decoders were held fixed. Intermittent CLDA was used to compensate for performance drops and changes in the neural ensemble.

One key finding was that

[...] gradual refinement of cursor control, with continued improvements in movement errors and success rates reached a plateau. These improvements were absent when CLDA was used each day to maximize performance starting from varying initial decoders that used different neural ensembles. Whereas CLDA could achieve high task performance, movement kinematics showed no improvements. Daily performance also showed variability commonly observed with daily re-training. This finding confirms that observed learning was not purely reflective of increased practice in BMI and highlights the importance of some degree of neural and decoder stability for learning.

The specifics of how exactly CLDA was used in their training scheme are needed for me to accept their conclusion (reseeding CLDA every day would introduce abrupt changes to the decoder, as the authors mention), but the idea of intermittent decoder adaptation plus neural plasticity can be useful. It also supports using reinforcement learning to train the decoder...

Neural data analysis:

  1. Tuning maps correlate strongly in later sessions; less tuning change in later sessions indicates saturation.
  2. Degree of neural adaptation depends on the amount of performance improvement - correlation of tuning maps with day 1 decreases with sessions and performance rate. Really just a corollary of analysis 1.
  3. Neural adaptation is shaped by decoder properties - tuning properties changed primarily when necessary to improve performance and were otherwise stable. Specifically, units were more likely to change their preferred directions if the initial decoder assigned them an "incorrect" PD. If they did the same analysis on the direct and indirect neuron populations from the Ganguly, Carmena 2011 paper, would they still reach the same results with the direct neuron population? What about the indirect neurons?
  4. They compared the onset time of directionally tuned activity and time of peak firing for each unit during different sessions. Averaging across all units and series, they found that after learning:
    • units were both directionally tuned earlier and reached peak firing earlier in the trial (Wilcoxon sign-rank test).
    • The majority of units developed tuning prior to the go-cue - indicating planning or preparation to move.
    • Cursor speed profiles also shifted earlier with learning, and clear increases in speed occurred prior to the go-cue. (WTF, the cursor isn't supposed to move before the go-cue!)

Second Experiment
To test the robustness of neuroprosthetic skill and the emerging neural map's resistance to interference from exposure to other contexts and perturbing neural inputs (specifically, from native motor networks), the BMI-SC task involved the monkey pressing down on a force sensor with the arm contralateral to the implants while performing the regular task. The decoder used in the BMI-SC and regular BMI tasks is the same.

Results here include:

  1. Isometric force task significantly disrupted the task performance.
  2. BMI-only performance and learning not disrupted by performing the BMI-SC task in the same day.
  3. Sessions in which BMI-only and BMI-SC tasks were performed in an A-B-A block structure showed minimal within-session interference between contexts, confirming the reversible modification hypothesis from Ganguly, Carmena 2011.
  4. BMI-SC performance also improved across the series, approaching BMI-only performance on the last day in one example. These improvements may be due in part to reduced interference of arm-movement-related activity with BMI control; the disruption from simultaneous arm movements may not be fully blocked by skill formation. Learning transfers between contexts.

Performance improved even when CLDA was used to fully adapt the decoder, suggesting that neural plasticity may provide benefits beyond decoder adaptation alone. Gradual adaptation might be key.

References to Read

  • Collinger, J.L., Wodlinger, B., Downey, J.E., Wang, W., Tyler-Kabara, E.C., Weber, D.J., McMorland, A.J., Velliste, M., Boninger, M.L., and Schwartz, A.B. (2013). High-performance neuroprosthetic control by an individual with tetraplegia. Lancet 381, 557–564
  • Dangi, S., Orsborn, A.L., Moorman, H.G., and Carmena, J.M. (2013). Design and analysis of closed-loop decoder adaptation algorithms for brain-machine interfaces. Neural Comput. 25, 1693–1731
  • Gilja, V., Nuyujukian, P., Chestek, C.A., Cunningham, J.P., Yu, B.M., Fan, J.M., Churchland, M.M., Kaufman, M.T., Kao, J.C., Ryu, S.I., and Shenoy, K.V. (2012). A high-performance neural prosthesis enabled by control algorithm design. Nat. Neurosci. 15, 1752–1757
  • Orsborn, A.L., Dangi, S., Moorman, H.G., and Carmena, J.M. (2012). Closedloop decoder adaptation on intermediate time-scales facilitates rapid BMI performance improvements independent of decoder initialization conditions. IEEE Trans. Neural Syst. Rehabil. Eng. 20, 468–477.
  • Suminski, A.J., Tkach, D.C., and Hatsopoulos, N.G. (2009). Exploiting multiple sensory modalities in brain-machine interfaces. Neural Netw. 22, 1224–1234
  • Suminski, A.J., Tkach, D.C., Fagg, A.H., and Hatsopoulos, N.G. (2010). Incorporating feedback from multiple sensory modalities enhances brain-machine interface control. J. Neurosci. 30, 16777–16787.
  • Wander, J.D., Blakely, T., Miller, K.J., Weaver, K.E., Johnson, L.A., Olson, J.D., Fetz, E.E., Rao, R.P.N., and Ojemann, J.G. (2013). Distributed cortical adaptation during learning of a brain-computer interface task. Proc. Natl. Acad. Sci. USA 110, 10818–10823.

Jul 28, 2015 - Raspopovic, Micera 2014: Restoring Natural Sensory Feedback in Real-Time Bidirectional Hand Prostheses

Restoring Natural Sensory Feedback in Real-Time Bidirectional Hand Prostheses

Sensory feedback was achieved by stimulating peripheral nerve fascicles, which in turn allowed real-time closed-loop control of a prosthetic hand by a human subject with a transradial amputation. To restore the lost sensory feedback, four TIMEs were implanted in the median and ulnar nerve fascicles, and two stimulation sites that were able to elicit distinct and repeatable sensations on the innervation territories of the two nerves (3-5) were selected at the end of systematic testing of all the contacts, then connected to the artificial hand's sensors. Sensations were elicited in a range from slight contact to just below the reported pain threshold, to dynamically control the intensity of stimulation delivered according to the prosthetic hand sensor readouts.

The participant controlled the prosthesis through voluntary contractions of the remaining muscles on the stump, being able to perform different (ulnar, pinch, and palmar) grasps, and hand opening by online processing of sEMG signals. The grasps were performed in terms of position control such that he was able to finely modulate the fastening and opening of the prosthetic hand.

The best part: with nerve stimulation, the subject could distinguish at least three different shapes, sizes, and levels of stiffness. Stimulation was delivered at 50 Hz as biphasic pulses, with only the amplitude varied.
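
As a rough sketch of what such a sensor-to-stimulation mapping could look like (the linear mapping, threshold values, and variable names below are my own illustration, not taken from the paper):

```matlab
% Hypothetical linear mapping from hand-sensor readout to stimulation amplitude.
% s_min/s_max and I_min/I_max (perception and pain thresholds) are made-up values.
s_min = 0.05;   % sensor reading at slight contact (normalized)
s_max = 1.0;    % sensor reading at full grasp force
I_min = 80;     % uA, just above perception threshold
I_max = 440;    % uA, just below reported pain threshold

s = 0.3;                                 % current sensor readout
s = min(max(s, s_min), s_max);           % clamp to the usable range
I_stim = I_min + (s - s_min)/(s_max - s_min) * (I_max - I_min);
% deliver 50 Hz biphasic, charge-balanced pulses at amplitude I_stim
```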

Jul 28, 2015 - Shenoy, Carmena 2014: Combining Decoder Design and Neural Adaptation in Brain-Machine Interfaces

Perspective article: Combining Decoder Design and Neural Adaptation in Brain-Machine Interfaces

Thesis: optimal codesign of decoders and concomitant engagement of neural adaptation is important.

Decoding vs. Learning.

The decoder-design approach (1) trains a decoder to predict natural or intended movement from previously recorded open-loop neural data and (2) assumes that the statistics of the neural activity remain the same when controlling the BMI during the closed-loop phase.

The neural adaptation approach (1) starts with a decoder that is not trained to predict natural or intended movement and instead has minimal structure in its decoder parameters, and (2) takes the view that there is a de novo, closed-loop system that the brain has to learn in order to achieve proficient BMI control.

These two approaches do not have to be exclusive.

Three experimental models:

  1. Hand-controlled training, hand-free BMI control. No context switch between training and control, so movements carry over from training to BMI. Leads to high performance with biomimetic decoder design.
  2. Hand-controlled training, hand-independent BMI control - involves a context change between training and control. The hand is free to move during brain control, but the same movements that were used to train the open-loop decoder cannot be made during BMI. This paradigm predicts more meaningful neural adaptation.
  3. Hand-constrained training, hand-constrained BMI control without context change, via passive observation. Less mismatch between closed-loop BMI operation and open-loop decoder training; can lead to high BMI performance but predicts less neural adaptation.

CLDA (e.g., Carmena's PVKF with SmoothBatch, or ReFIT) may combine decoder adaptation with neural adaptation. It requires "intention estimation," which I think is hard to generalize, but it is also the easiest form of closed-loop adaptation to implement.
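
For reference, a minimal sketch of what "intention estimation" usually means in these CLDA schemes - a ReFIT-style rotation of the decoded velocity onto the cursor-to-target direction. This is my own illustration of the general idea, not the paper's implementation:

```matlab
% ReFIT-style intention estimation (illustrative only):
% assume the user always intends to move straight toward the target, so rotate
% the decoded velocity onto the cursor-to-target direction while keeping its
% speed; the "intended" velocities are then used to re-fit/adapt the decoder.
cursor = [2; 1];            % current cursor position
target = [8; 5];            % current target position
v_dec  = [0.4; -0.1];       % velocity decoded on this bin

u = (target - cursor) / norm(target - cursor);   % unit vector toward target
v_int = norm(v_dec) * u;                         % rotated, speed-preserving velocity
% accumulate (neural activity, v_int) pairs over a batch, then re-fit the decoder
```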

Neural adaptation allows animals to effectively learn an at least slightly different decoder each day before being able to perform the task proficiently. But this is difficult for more complex tasks with higher DOF. The key is to pair stable neural populations with a fixed, static decoder. Differential modulation of neuronal populations based on their causal link to neuroprosthetic control, with the resulting networks co-existing with native cortical networks and switching without interference based on task requirements - nothing new here, just re-iterating the Ganguly-Carmena 2011 paper.

The real challenge is to control two different BMIs using two decoders driven by two separate neural ensembles!

**How do these adaptations happen? Basal ganglia?**

Mouse studies (Yin et al., 2009) showed increasing coherence between M1 and dorsolateral striatum neurons as the animals learned to brain-control using M1 neurons.

Changes in coherence were temporally precise and significantly more pronounced in direct M1 neurons than in those not controlling the BMI - consistent with formation of a BMI-specific network. Further, knockout mice lacking NMDA receptors in the striatum were not able to learn the same task, supporting the notion that corticostriatal plasticity is necessary for neuroprosthetic learning. Together these results suggest that corticobasal ganglia circuits are involved in BMI learning, even when it does not require physical movement. [...] neural adaptation not only elicits changes in motor cortical networks, but also recruits elements of the natural motor system outside of the cortex, such as the basal ganglia.
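
For context, this kind of corticostriatal coherence is typically quantified as magnitude-squared coherence between an M1 signal and a striatal signal; a toy MATLAB sketch with simulated data (all parameters are made up):

```matlab
% Toy magnitude-squared coherence between an M1 signal and a striatal signal.
fs = 1000;                        % sampling rate (Hz), assumed
t  = (0:1/fs:10)';
m1  = sin(2*pi*8*t) + 0.5*randn(size(t));        % fake M1 activity with an 8 Hz component
str = sin(2*pi*8*t + pi/6) + 0.5*randn(size(t)); % fake striatal activity, shared 8 Hz
[Cxy, f] = mscohere(m1, str, hamming(1024), 512, 1024, fs);
plot(f, Cxy); xlabel('Frequency (Hz)'); ylabel('Coherence');
```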

This is pretty cool...brain has so much generalization capability, motor learning is like the greatest API ever for BMI! On the other hand, if BMI learning is so closely coupled with the motor learning process (signaled by the activation of the corticobasal ganglia circuits), would that imply subjects who have more proficient motor learning would also be more proficient at learning different modes of BMI control (at least when using M1 ensembles)?

Can try a bimanual task where one hand is controlled by BMI and the other by the native arm.

How does the decoder know whether the user intends to use the BMI-controlled device? Motor cortex is always active, is not strictly lateralized, and is part of a coordinated bimanual circuit. This correlation may not be readily separated by the decoder, especially if neural activity from both contra- and ipsilateral motor cortex occupies the same neural dimensions.

Thus the onus would fall on neural adaptation to create a new motor control circuit that effectively decorrelates control of the intact arm from the prosthesis, so that the subject can have independent control of both her real and prosthetic limb and hand.

References to Read

  • Chase, S.M., Kass, R.E., and Schwartz, A.B. (2012). Behavioral and neural correlates of visuomotor adaptation observed through a brain-computer interface in primary motor cortex. J. Neurophysiol. 108, 624–644.
  • Chestek, C.A., Gilja, V., Nuyujukian, P., Foster, J.D., Fan, J.M., Kaufman, M.T., Churchland, M.M., Rivera-Alvidrez, Z., Cunningham, J.P., Ryu, S.I., and Shenoy, K.V. (2011). Long-term stability of neural prosthetic control signals from silicon cortical arrays in rhesus macaque motor cortex. J. Neural Eng. 8, 045005.
  • Dangi, S., Gowda, S., Moorman, H.G., Orsborn, A.L., So, K., Shanechi, M., and Carmena, J.M. (2014). Continuous closed-loop decoder adaptation with a recursive maximum likelihood algorithm allows for rapid performance acquisition in brain-machine interfaces. Neural Comput. 26, 1811–1839.
  • Durand, D.M., Ghovanloo, M., and Krames, E. (2014). Time to address the problems at the neural interface. J. Neural Eng. 11, 020201.
  • Fan, J.M., Nuyujukian, P., Kao, J.C., Chestek, C.A., Ryu, S.I., and Shenoy, K.V. (2014). Intention estimation in brain-machine interfaces. J. Neural Eng. 11, 016004.
  • Fetz, E.E. (2007). Volitional control of neural activity: implications for brain-computer interfaces. J. Physiol. 579, 571–579.
  • Gage, G., Ludwig, K., Otto, K., Ionides, E., and Kipke, D. (2005). Naive coadaptive cortical control. J. Neural Eng. 2, 52–63.
  • Jarosiewicz, B., Masse, N.Y., Bacher, D., Cash, S.S., Eskandar, E., Friehs, G., Donoghue, J.P., and Hochberg, L.R. (2013). Advantages of closed-loop calibration in intracortical brain-computer interfaces for people with tetraplegia. J. Neural Eng. 10, 046012.
  • Kao, J.C., Stavisky, S., Sussillo, D., Nuyujukian, P., and Shenoy, K.V. (2014). Information systems opportunities in brain-machine interface decoders. Proceedings of the IEEE 102, 666–682.
  • Koralek, A.C., Jin, X., Long, J.D., II, Costa, R.M., and Carmena, J.M. (2012). Corticostriatal plasticity is necessary for learning intentional neuroprosthetic skills. Nature 483, 331–335.
  • Koralek, A.C., Costa, R.M., and Carmena, J.M. (2013). Temporally precise cellspecific coherence develops in corticostriatal networks during learning. Neuron 79, 865–872.
  • Lebedev, M.A., Carmena, J.M., O’Doherty, J.E., Zacksenhouse, M., Henriquez, C.S., Principe, J.C., and Nicolelis, M.A. (2005). Cortical ensemble adaptation to represent velocity of an artificial actuator controlled by a brain-machine interface. J. Neurosci. 25, 4681–4693.
  • Merel, J., Fox, R., Jebara, T., and Paninski, L. (2013). A multi-agent control framework for co-adaptation in brain-computer interfaces. Proceedings of the 26th Annual Conference on Neural Information Processing Systems, 2841–2849.
  • Millán, J.d.R., and Carmena, J.M. (2010). Invasive or noninvasive: understanding brain-machine interface technology. IEEE Eng. Med. Biol. Mag. 29, 16–22.
  • Yin, H.H., Mulcare, S.P., Hilário, M.R., Clouse, E., Holloway, T., Davis, M.I., Hansson, A.C., Lovinger, D.M., and Costa, R.M. (2009). Dynamic reorganization of striatal circuits during the acquisition and consolidation of a skill. Nat. Neurosci. 12, 333–341.
  • Sussillo D, Nuyujukian P, Fan JM, Kao JC, Stavisky SD, Ryu SI, Shenoy KV (2012) A recurrent neural network for closed-loop intracortical brain-machine interface decoders. Journal of Neural Engineering. 9(2):026027

Jul 28, 2015 - Shenoy, Kaufman, Sahani, Churchland 2011: A dynamical systems view of motor preparation: Implications for neural prosthetic system design

A dynamical systems view of motor preparation: Implications for neural prosthetic system design

This book chapter/review paper seeks to tie together a bunch of Shenoy lab work (mostly Churchland's) on premotor cortex with a dynamical-systems hypothesis - the optimal subspace hypothesis. The central questions are: what are the neural processes that precede movement and lead to its initiation? And what links movement preparation to movement generation?

The experiment used to derive all the results here is an instructed-delay task, where a monkey first holds its hand at the center, a target then appears, then a go-cue appears, at which point the monkey moves to the target.

They first noticed that the delay time (between target appearance and go-cue) is inversely proportional to the reaction time (between go-cue and movement onset). This is hypothesized to result from having more time for movement preparation. So they recorded from dorsal premotor cortex (PMd) and M1, found activity changes there during the delay period, and sought to determine how activities there (individual neurons change in varied ways) contribute to the reduction in reaction time.

(Seems like they are begging the question in terms of which cortical areas to focus on. They mention that a number of areas show changes in activity during the delay period, yet focus only on PMd and M1 - probably because they only have implants there. The contributions of specific cortical areas to movement preparation therefore remain unclear.)

Optimal Subspace Hypothesis

If approaching this question from a dynamical systems perspective, then they will have to answer:

  • How does the activity of a neural population evolve to achieve the needed preparatory state?
  • How does this preparatory state impact the subsequent arm movement?
  • What are the underlying dynamics of the neural circuit?

Their explanation of the delay-period/RT observation is to hypothesize that time is needed for the neural population to reach some preparatory state that is necessary for the subsequent movement. Seems like a reasonable hypothesis to me. Specifically, motor preparation may be the act of optimizing preparatory activity so that the generated movement has the desired properties (what these properties are, they have yet to answer at all).

In a dynamical systems framework, they consider that, for a given reach, there is presumably some small sub-region of state space containing those values of the preparatory state P that are adequate to produce a successful reach. A smooth relationship is assumed to exist between firing rate and movement (reasonable); therefore, that small sub-region of space is conceived of as being contiguous.

This framework would then predict several things that one can test in the lab (although the order of experiments and hypothesis formulation is probably the opposite in practice here...).

A direct corollary of this framework would be that:

If prep activity is equivalent to initializing a traverse toward some optimal subspace for the subsequent movement, then these activities must differ for different movements. Movements are defined by their kinematic parameters, so prep activity should be characterized by, or tuned to, these different movement parameters as well.

This is presumably true; I have yet to read the Messier and Kalaska, 2000 paper.

Other predictions:

  • Prediction 1: Reach-speed modulation: Preparatory activity should covary with other meaningful aspects of movement, including peak reach speed. An experiment was conducted where monkeys were instructed to reach targets with either faster or slower speed, and the neural data during the preparatory period were analyzed.
  • Prediction 2: Reach-speed (trial-by-trial) correlation: Preparatory activity should correlate, on a trial-by-trial basis, with peak reach speed. This is mainly to demonstrate the smooth mapping between the neural space and movement parameters: a trial on the slower side of the instructed-fast condition should be closer to trials in the instructed-slow condition than the others. Closeness is measured via distance on the (speed, mean-firing-rate) axes.
  • Prediction 3: Across-trial firing-rate variance (Fano factor) reduces through time: Preparatory activity should become, through time, quite accurate and therefore quite similar across trials. In other words, at the end of the preparatory period the neural activity should be constrained inside the optimal subspace for the subsequent movement, which should manifest as a reduction in across-trial variance of the neural activity. The metric they used is the Fano factor (seems rather convoluted; a minimal sketch of the basic idea follows this list); need to read the actual method.
  • Prediction 4-I and 4-II: Lower Fano factor correlates with lower RTs.
  • Prediction 5: Perturbing neural activity increases RT: Delivering subthreshold electrical microstimulation (how do they define subthreshold here?) at different times prior to the go-cue in PMd resulted in longer RT when the stimulation was closer to the go-cue. Stimulating M1 impacted RT less. The effect of stimulation was specific to arm movements and produced little increase in saccadic eye-movement RT. The results align with the theory pretty well, but are the shifts in RT significant? Need to reference the actual paper.
  • Prediction 6: It should be possible to construct single-trial state-space neural trajectories and use them to directly confirm that across-trial firing-rate variability decreases through time. They recorded a bunch of neurons simultaneously and plotted the neural trajectories using Gaussian Process Factor Analysis (GPFA) with dimensionality reduction for visualization. The plots look fine, in that the trajectories run around and converge into different subspaces at different stages of the task, but this could also be due to dimension selection. Need to check the GPFA technique.
  • Prediction 7: Farther and faster along the loop reduces RT. This prediction tries to answer how the preparatory state at the time of the go-cue influences the subsequent movement. They hypothesize that this prep state serves as the initial state of a subsequent dynamical system - some region of the brain appears to monitor and wait for this to be true before "pulling the trigger" to initiate movement. Once the trigger is pulled, the further along the loop the state is, the lower the RT. They tested this with offline analysis, on a trial-by-trial basis, of how far along the loop the prep state was in relation to that trial's RT. Of course, this necessitates using the GPFA plots of the neural trajectories to find the actual loop.
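
A minimal sketch of the across-trial Fano factor as I understand it - spike counts in a sliding window, variance over mean across trials. The data and window parameters below are simulated/assumed; the paper's mean-matched version is more involved:

```matlab
% Across-trial Fano factor for one neuron: var/mean of spike counts across trials,
% computed in a sliding window relative to target onset. Data are simulated.
nTrials = 50; T = 1.0; fs = 1000;
spikes = rand(nTrials, T*fs) < 0.02;      % fake spike trains (~20 Hz Poisson-ish)

win  = 0.1*fs;                            % 100 ms counting window
step = 0.02*fs;                           % 20 ms step
starts = 1:step:(T*fs - win + 1);
fano = zeros(size(starts));
for k = 1:numel(starts)
    counts = sum(spikes(:, starts(k):starts(k)+win-1), 2);  % spike count per trial
    fano(k) = var(counts) / mean(counts);                   % Fano factor at this time
end
plot((starts + win/2)/fs, fano); xlabel('Time (s)'); ylabel('Fano factor');
```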

To really test these predictions, however, we need to monitor PMd during brain-control tasks - does the Fano factor analysis still hold? Does the RT still vary with the delay period in brain control? What else do these prep activities tune for? Does the location of the optimal subspace change following learning (Byron Yu's null hypothesis papers)?

Implications for BMI if this hypothesis turns out to be a good framework: How long does it take for the neural trajectory to change from baseline to the optimal subspace? Can we make decoders detect/classify these subspaces? Shenoy suggests ~200ms, as measured with single-trial neural trajectories, as the transition period needed to move from baseline to the optimal subspace, so decoding during this period should be avoided (will have to read the paper to evaluate this).

References to Read

  • Churchland, M. M., Afshar, A., & Shenoy, K. V. (2006a). A central source of movement variability. Neuron, 52, 1085–1096
  • Churchland, M. M., Cunningham, J. P., Kaufman, M. T.,Ryu, S. I., & Shenoy, K. V. (2010a). Cortical preparatory activity: Representation of movement or first cog in a dynamical machine? Neuron, 68, 387–400
  • Churchland, M. M., Kaufman, M. T., Cunningham, J. P., Ryu, S. I., & Shenoy, K. V. (2010b). Some basic features of the neural response in motor and premotor cortex. In Program No. 382.2. Neuroscience meeting planner. San Diego, CA: Society for Neuroscience Online
  • Churchland, M. M., Santhanam, G., & Shenoy, K. V. (2006b). Preparatory activity in premotor and motor cortex reflects the speed of the upcoming reach. Journal of Neurophysiology, 96, 3130–3146
  • Churchland, M. M., & Shenoy, K. V. (2007a). Delay of movement caused by disruption of cortical preparatory activity. Journal of Neurophysiology, 97, 348–359.
  • Churchland, M. M., Yu, B. M., Cunningham, J. P., Sugrue, L. P., Cohen, M. R., Corrado, G. S., et al. (2010c) Stimulus onset quenches neural variability: A widespread cortical phenomenon. Nature Neuroscience, 13, 369–378
  • Churchland, M. M., Yu, B. M., Ryu, S. I., Santhanam, G., & Shenoy, K. V. (2006c). Neural variability in premotor cortex provides a signature of motor preparation. The Journal of Neuroscience, 26, 3697–3712
  • Messier, J., & Kalaska, J. F. (2000). Covariation of primate dorsal premotor cell activity with direction and amplitude during a memorized-delay reaching task. Journal of Neurophysiology, 84, 152–165.
  • Santhanam, G., Ryu, S. I., Yu, B. M., Afshar, A., & Shenoy, K. V. (2006). A high-performance brain-computer interface. Nature, 442, 195–198.
  • Santhanam, G., Sahani, M., Ryu, S. I., & Shenoy, K. V. (2004). In: An extensible infrastructure for fully automated spike sorting during online experiments (pp. 4380–4384). Proceedings of the 26th annual international conference of IEEE EMBS, San Francisco, CA.
  • Yu, B. M., Afshar, A., Santhanam, G., Ryu, S. I., Shenoy, K. V., & Sahani, M. (2006). Extracting dynamical structure embedded in neural activity. In Y. Weiss, B. Schölkopf & J. Platt (Eds.), Advances in neural information processing systems, (Vol. 18, pp. 1545–1552). Cambridge, MA: MIT Press
  • Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., & Sahani, M. (2009). Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102, 614–635.

Jul 27, 2015 - Francis 2015 - Toward an Autonomous Brain Machine Interface: Integrating Sensorimotor Reward Modulation and Reinforcement Learning

Toward an Autonomous Brain Machine Interface: Integrating Sensorimotor Reward Modulation and Reinforcement Learning

This paper seeks to demonstrate that single units/multiunits and local field potentials in M1 are modulated by reward expectation during reaching movements, and that this modulation is present even while subjects passively view cursor motions that are predictive of either reward or nonreward. They also tried to classify whether a trial is rewarding vs. nonrewarding based on the neural data, on a moment-to-moment basis.

Experiments:
The monkey's arm wears an exoskeletal robot throughout all tasks. A single target is used in all trials.

  1. Manual Task. The monkey was required to hold at the center target for 325ms before the peripheral target appeared, then hold an additional 300ms before the center target disappeared and the go cue appeared for moving to the peripheral target, then hold there for 325ms before receiving a liquid reward or no reward.

    The paper does not manually control for monkey arm trajectory, speed, etc. But in offline analysis they selected kinematically indistinguishable trajectories between the two reward contingencies to isolate the effect of reward.

  2. OT1. The monkey's arm is fixed by the robot and cannot move. The monkey would fixate at the center target and observe it change color; red represents a rewarding trial, blue a nonrewarding trial. The cursor would then move toward the peripheral target at a constant speed, with movement toward a red target resulting in a reward once the cursor arrived inside the target. For blue targets, reward was withheld. The monkey had to view the target plane to start a trial and maintain visual gaze until the color cue was given.

  3. OT2. The monkey observed the cursor moving away from or toward a target, with movement toward the target resulting in a reward once the cursor arrived inside the target. No reward otherwise.

Results:

Some neurons fired at higher rates throughout reward trials, while others behaved the opposite way. Both contralateral and ipsilateral M1 contain units that simultaneously correlate with reward and kinematics during reaching and observation... but how do we identify which units are reward-modulated in a less controlled BMI experiment?

Classifiers were trained on PCA components to classify reward vs. nonreward trials. The average classifier performance (over trial length) is around 70%, pretty good.
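
A bare-bones version of that kind of analysis might look like the following (LDA on PCA scores with cross-validation); the classifier choice and all parameters here are my own guesses at an analogous pipeline, not the paper's exact method:

```matlab
% Classify reward vs. nonreward trials from neural features via PCA + LDA.
% X: trials x features (e.g., binned firing rates), y: labels. Simulated here.
nTrials = 200; nFeat = 120;
y = [zeros(nTrials/2,1); ones(nTrials/2,1)];          % 0 = nonreward, 1 = reward
X = randn(nTrials, nFeat) + 0.5*y*randn(1, nFeat);    % fake, weakly separable data

[coeff, score] = pca(X);                % principal components of the feature matrix
Z = score(:, 1:10);                     % keep the first 10 PCs
cv = cvpartition(y, 'KFold', 10);       % 10-fold cross-validation
mdl = fitcdiscr(Z, y, 'CVPartition', cv);   % linear discriminant classifier
acc = 1 - kfoldLoss(mdl);               % cross-validated accuracy
```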

Simulated RL-BMI using TD-learning, with a neural network for the action-reward mapping. Works well; not too surprising, but not too much value either with a two-state classification task.
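
To make the idea concrete, here is a toy TD-learning sketch in which the scalar reward would come from the decoded reward signal; it uses a tabular Q instead of the paper's neural network, and everything below is an assumption for illustration:

```matlab
% Minimal tabular TD/Q-learning sketch for a two-action task where the
% "reward" signal stands in for the classified M1 reward modulation.
nSteps = 1000; alpha = 0.1; epsilon = 0.1;
Q = zeros(1, 2);                          % one state, two actions (e.g., toward/away)
correct = 2;                              % pretend action 2 is the rewarded one
for t = 1:nSteps
    if rand < epsilon
        a = randi(2);                     % explore
    else
        [~, a] = max(Q);                  % exploit current value estimates
    end
    r = double(a == correct);             % stand-in for the decoded reward signal
    Q(a) = Q(a) + alpha * (r - Q(a));     % TD update (no next state in this toy task)
end
```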

References to Read

  • Dura-Bernal S, Chadderdon GL, Neymotin XZ, Przekwas A, Francis JT, Lytton WW (2014) IEEE Signal Processing in Medicine and Biology Symposium (SPMB'13) Virtual musculoskeletal arm and robotic arm driven by a biomimetic model of sensorimotor cortex with reinforcement learning
  • Hosp JA, Pekanovic A, Rioult-Pedotti MS, Luft AR (2011) Dopaminergic projections from midbrain to primary motor cortex mediate motor skill learning. J Neurosci 31:2481–2487.
  • Legenstein R, Chase SM, Schwartz AB, Maass W (2010) A reward-modulated hebbian learning rule can explain experimentally observed network reorganization in a brain control task. J Neurosci 30:8400–8410.
  • Todorov E, Jordan MI (2002) Optimal feedback control as a theory of motor coordination. Nat Neurosci 5:1226–1235.