Towards autonomous neuroprosthetic control using Hebbian reinforcement learning

Theory paper toward which the Sanchez group has been working. The RLBMI decoder architecture is now presented as an actor-critic method of RL, where the neuroprosthetic controller (actor) learns the neural state-to-action mapping based on the user’s evaluative feedback. The role of the critic is to translate the user’s feedback into an explicit training signal that can be used by the actor for adaptation. The aim is to learn and automatically produce a stable neural-to-motor mapping and to respond to perturbations by readjusting the parameters to maintain performance, using only binary evaluative feedback (a rough sketch of this loop follows the metrics list below). The metrics used for the controller’s performance were:

  • speed of convergence
  • generalization
  • accuracy and recovery from perturbation
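
As I read the architecture, the critic’s only job is to convert the user’s evaluative feedback into a binary training signal that the actor uses to adapt its state-to-action mapping. A rough, self-contained sketch of that closed loop (the function names and structure are my own assumptions, not the paper’s implementation):

```python
# Rough sketch of the RLBMI actor-critic loop as I understand it.
# All names and details here are my own assumptions for illustration.

def critic(user_feedback_is_positive: bool) -> int:
    """Translate the user's evaluative feedback into a binary
    training signal (+1 rewarded, -1 not rewarded) for the actor."""
    return 1 if user_feedback_is_positive else -1

def run_step(actor, get_neural_state, get_user_feedback):
    """One closed-loop step: decode an action from the neural state,
    observe the user's evaluative feedback, and adapt the actor."""
    x = get_neural_state()             # firing-rate feature vector
    action = actor.select_action(x)    # neural state -> action mapping
    signal = critic(get_user_feedback(action))
    actor.update(x, action, signal)    # reward-gated adaptation
    return action, signal
```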

Reinforcement Learning, control architecture, learning algorithm - Instead of using Q-learning with a NN to act as the action-reward mapping, a Hebbian reinforcement learning (HRL) framework is used. I’m confused as to why this group decided to switch the RL framework. I am not very familiar with all the ML techniques yet, but according to Scholarpedia, we need to distinguish between the machine learning/TD-learning perspective […] and the neuronal perspective. The machine learning perspective deals with states, values, actions, etc., whereas the neuronal perspective tries to obtain neuronal signals related to reward expectation or prediction error. Regardless, the basic strategy is the same: initialize a random decoder and, based on feedback about whether the decoded action is correct or not, update the NN weights. With more trials, both the decoder and the user adapt toward each other.
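
A toy illustration of what a reward-gated Hebbian update could look like under this strategy (my own minimal sketch with made-up parameters, not the paper’s exact equations): the weight change is the product of the pre-synaptic activity and the chosen action’s post-synaptic activity, with the sign set by the binary feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_actions = 20, 4
W = rng.normal(scale=0.1, size=(n_actions, n_neurons))  # random initial decoder
eta = 0.05                                              # learning rate (assumed)

def select_action(x):
    """Pick the action with the largest activation for state x."""
    return int(np.argmax(W @ x))

def hebbian_update(x, action, feedback):
    """feedback is +1 (rewarded) or -1 (not rewarded): strengthen the
    state->action association when rewarded, weaken it otherwise."""
    global W
    post = np.zeros(n_actions)
    post[action] = 1.0
    W += eta * feedback * np.outer(post, x)

# one simulated trial with a random firing-rate feature vector
x = rng.poisson(5, size=n_neurons).astype(float)
a = select_action(x)
hebbian_update(x, a, feedback=+1)
```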

Tasks

  • The sequential mode is where the controller has to perform a sequence of actions over multiple steps in order to accomplish the goal of a trial. This means there can be multiple sequences of actions that eventually reach the end goal, and thus multiple solutions that could be learned by the HRL. Reward/feedback is received by the controller after every action. The reward criterion for movement given in this paper is whether, at a given time step, the action moves the manipulandum closer to the target (see the sketch after this list).

    But I can imagine this approach suffering from a problem common to greedy algorithms: the greedy choice at each step does not always yield the optimal overall solution. It is also possible that this intention estimation is too constraining - the animal might not actually want to move along the most direct trajectory possible. That said, this may not matter much for BMI patients, where efficiency is probably the most important factor.

  • The episodic mode is where the controller can select only one action in each trial, so that action must achieve the goal for a successful trial. The episodic learning mode is suitable for classification tasks; this is what Sanchez 2011 used for their center-out task. Since achieving the trial goal requires only a single action here, the HRL can learn more quickly in the episodic task than in the sequential task. Experience replay can be used to seed learning in this paradigm.

    Since the sequential mode is a sequence of actions, using the episodic mode to initialize the decoders for the sequential mode can probably speed up the learning process.
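
The two feedback modes, as I read them, reduce to something like the following (illustrative only; the distance metric and helper names are my assumptions):

```python
import numpy as np

def sequential_feedback(pos, new_pos, target):
    """Sequential mode: +1 if the step moved the manipulandum closer
    to the target, -1 otherwise (the greedy per-step criterion)."""
    before = np.linalg.norm(np.asarray(target) - np.asarray(pos))
    after = np.linalg.norm(np.asarray(target) - np.asarray(new_pos))
    return 1 if after < before else -1

def episodic_feedback(action, correct_action):
    """Episodic mode: only one action per trial, so feedback is simply
    whether that single action achieved the goal (classification-style)."""
    return 1 if action == correct_action else -1

# example: a step to the right from (0, 0) toward a target at (3, 0)
print(sequential_feedback((0, 0), (1, 0), (3, 0)))    # -> 1
print(episodic_feedback(action=2, correct_action=2))  # -> 1
```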

The network was first tested in simulation experiments developed for a reaching task in a 2D grid space, which is useful for validating the computational model. Three components:

  1. Synthetic neural data generator, with each neuron tuned to one action (moving left, right, up, or down) or to none. The goal of the user is to reach targets in 2D space. The input feature is firing rates in 100 ms bins (a minimal generator sketch follows this list).
  2. Neuroprosthetic controller (actor).
  3. Behavioral paradigm - gives binary feedback to the controller.
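
A minimal sketch of a synthetic neural data generator of the kind described above (the number of neurons, firing rates, and tuning strengths are my own assumptions, not the paper’s values):

```python
import numpy as np

rng = np.random.default_rng(1)

N_NEURONS = 40
ACTIONS = ["left", "right", "up", "down"]
BIN_S = 0.1  # 100 ms bins

# each neuron is tuned to one action or to none (None)
labels = ACTIONS + [None]
tuning = np.array([labels[i] for i in rng.integers(len(labels), size=N_NEURONS)],
                  dtype=object)

def firing_counts(intended_action, base_hz=5.0, tuned_hz=20.0):
    """Return one 100 ms bin of spike counts: neurons tuned to the
    intended action fire at a higher rate, the rest at baseline."""
    rates = np.where(tuning == intended_action, tuned_hz, base_hz)
    return rng.poisson(rates * BIN_S)

x = firing_counts("left")  # feature vector fed to the actor (controller)
```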

Experiments were done in common marmosets, using a Go/No-Go motor task to move a robot arm to spatial targets on either the left or right side of the monkey. Catch trial - a training technique to ensure the monkeys understood the necessity of the robot movements, in which the robot moved in the direction opposite to the one commanded by the monkey and thus the monkey received no reward.

Results

Simulation results showed the weights converging to achieve high success rates for both tasks, as expected. The controller was also robust against neuron loss and tuning changes.

The monkey neural data were used offline to map to four different directions. I find this analysis unconvincing - it makes little sense, since the monkey was simply trained to perform a Go/No-Go arm movement. This is a weak part of the paper.

References to Read

  • Francis J T and Song W 2011 Neuroplasticity of the sensorimotor cortex during learning Neural Plast. 2011 310737
  • Heliot R et al 2010 Learning in closed-loop brain-machine interfaces: modeling and experimental validation IEEE Trans. Syst. Man Cybern. B 40 1387–97
  • Prasad A et al 2012 Comprehensive characterization and failure modes of tungsten microwire arrays in chronic neural implants J. Neural Eng. 9 056015
  • Polikov V S, Tresco P A and Reichert W M 2005 Response of brain tissue to chronically implanted neural electrodes J. Neurosci. Methods 148 1–18
  • Mahmoudi B and Sanchez J C 2011 A symbiotic brain-machine interface through value-based decision making PLoS One 6 e14760
  • Schultz W, Tremblay L and Hollerman J R 1998 Reward prediction in primate basal ganglia and frontal cortex Neuropharmacology 37 421–9
  • Izawa E I, Aoki N and Matsushima T 2005 Neural correlates of the proximity and quantity of anticipated food rewards in the ventral striatum of domestic chicks Eur. J. Neurosci. 22 1502–12
  • Wawrzyński P 2009 Real-time reinforcement learning by sequential actor–critics and experience replay Neural Netw. 22 1484–97
  • Schultz W 2000 Multiple reward signals in the brain Nature Rev. Neurosci. 1 199–207
  • Prins N et al 2013 Feature extraction and unsupervised classification of neural population reward signals for reinforcement based BMI Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC) (Osaka, Japan, 2013) at press
  • Ribas-Fernandes J J F et al 2011 A neural signature of hierarchical reinforcement learning Neuron 71 370–9