Control of a Center-Out Reaching Task Using a Reinforcement Learning Brain-Machine Interface

I am a bit puzzled why this paper was published in 2011, after DiGiovanna 2009, when the experiment is less complicated but has the same premise. I speculate it might be because this paper demonstrates RLBMI experiments in rhesus monkeys instead of rats.

From the abstract: Neural recordings obtained from the primary motor cortex were used to adapt a decoder using only sequences of neuronal activation and reinforced interaction with the environment. From a naive state, the system was able to achieve 100% of the targets (in a center-out reaching task) without any a priori knowledge of the correct neural-to-motor mapping. Results show that the coupling of motor and reward information in an adaptive BMI decoder has the potential to create more realistic and functional models necessary for future BMI control.

Experiment: A single monkey was trained to perform a center-out reaching task (to two targets per trial, sequentially) with its arm attached to an exoskeleton arm (presumably so that during brain control the exoskeleton could prevent the arm from moving; unfortunately, nothing about arm movements during brain control was described). The monkey was implanted in S1, M1, and PMd, in the regions representing the right shoulder and elbow, with Utah arrays (450 um inter-electrode spacing). From 96 channels, between 190 and 240 units were sorted.

Decoder: Reinforcement learning with a multilayer perceptron (MLP) neural network. Adaptation is focused on maximizing reward through successful completion of trials by the agent. The agent/BMI controller models its cursor-control problem as a Markov Decision Process (MDP), with neural modulation as the state s (neural data corresponding to all the units) and the discrete movements performed by the RL agent as the actions a (in this experiment, simply one of the 8 directions in which the targets can appear). Each action taken in a state changes the state of the environment with a certain probability - the transition probability. The agent expects a reward r when taking an action in a given state. Q-learning is used to approximate this action-value function, and the MLP maps state-action pairs to their expected reward values, as sketched in the code below the equations.

\[\begin{align} P^{a}\_{ss^{\prime}} & = Pr \lbrace s\_{t+1}=s^{\prime}|s\_t=s,a\_t=a\rbrace \\ R^{a}\_{ss^{\prime}} & = E\lbrace r\_{t+1}|s\_t=s,a\_t=a,s\_{t+1}=s^{\prime}\rbrace\\ Q(s\_t,a\_t) & \gets Q(s\_t,a\_t)+\alpha\left(r\_{t+1}+\gamma\max\_{a}Q(s\_{t+1},a)-Q(s\_t,a\_t)\right) \end{align}\]
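A minimal sketch of this setup, not the paper's actual implementation: an MLP that outputs one Q-value per discrete direction given a binned firing-rate state, updated with the Q-learning rule above. The layer sizes, learning rate, and epsilon-greedy exploration are my own illustrative assumptions.

```python
import numpy as np

N_UNITS   = 200    # sorted units (paper reports 190-240)
N_ACTIONS = 8      # one discrete movement per possible target direction
N_HIDDEN  = 30     # hidden layer size: an assumption, not from the paper
ALPHA     = 0.01   # learning rate
GAMMA     = 0.9    # discount factor
EPSILON   = 0.1    # epsilon-greedy exploration rate

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (N_HIDDEN, N_UNITS))    # input -> hidden weights
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, (N_ACTIONS, N_HIDDEN))  # hidden -> Q-value weights
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Forward pass: one Q-value per discrete action for this neural state."""
    h = np.tanh(W1 @ state + b1)
    return W2 @ h + b2, h

def select_action(state):
    """Epsilon-greedy choice among the 8 target directions."""
    q, _ = q_values(state)
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q))

def q_update(state, action, reward, next_state, terminal):
    """One gradient step toward the Q-learning target
    r + gamma * max_a' Q(s', a') (just r on terminal steps)."""
    global W1, b1, W2, b2
    q, h = q_values(state)
    target = reward
    if not terminal:
        q_next, _ = q_values(next_state)
        target += GAMMA * np.max(q_next)
    td_error = target - q[action]            # temporal-difference error
    # Backpropagate the TD error through the selected action's output only.
    W2[action] += ALPHA * td_error * h
    b2[action] += ALPHA * td_error
    dh = td_error * W2[action] * (1 - h**2)  # gradient at hidden layer
    W1 += ALPHA * np.outer(dh, state)
    b1 += ALPHA * dh
```

In closed loop, each trial would presumably call select_action on the binned firing rates, deliver reward according to whether the chosen direction matched the cued target, and then call q_update.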

Results: Performance evolves and improves over time from the randomized initialization state, reaching an accuracy of around 97%. But this task is easily done with a simple logistic regression classifier. The authors make the distinction that the decoder here does not require an a priori training signal: the only feedback is whether the action was correct. While this demonstration is not so impressive, in more complex experiments RL can potentially be much more useful than supervised learning methods.
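To illustrate that point, a hedged sketch of the supervised baseline: if direction labels were available, a multinomial logistic regression on binned firing rates would be the obvious comparison. The data below are synthetic, and the shapes are guesses; this is not the paper's data or analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_units, n_dirs = 500, 200, 8
directions = rng.integers(n_dirs, size=n_trials)            # true target per trial
tuning = rng.normal(0, 1, (n_dirs, n_units))                # fake tuning profiles
rates = tuning[directions] + rng.normal(0, 1, (n_trials, n_units))

clf = LogisticRegression(max_iter=1000).fit(rates[:400], directions[:400])
print("held-out accuracy:", clf.score(rates[400:], directions[400:]))
```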

But again, no neural data is shown.

References to Read (not too urgent)

  • K. V. Shenoy, D. Meeker, S. Cao, S. A. Kureshi, B. Pesaran, C. A. Buneo, A. P. Batista, P. P. Mitra, J. W. Burdick, and R. A. Andersen, “Neural prosthetic control signals from plan activity”, NeuroReport, vol. 14, pp. 591-597, 2003.
  • Y. Gao, M. J. Black, E. Bienenstock, W. Wu, and J. P. Donoghue, “A quantitative comparison of linear and non-linear models of motor cortical activity for the encoding and decoding of arm motions”, in The 1st International IEEE EMBS Conference on Neural Engineering, Capri, Italy, 2003.