Sep 17, 2025 - State of Vibe Coding: Blog migration from Jekyll to Eleventy

The previous version of this blog was built with Jekyll. I'm bad at webdev and it took me a while to figure it out, so I've been reluctant to do any refactor, UI or otherwise.

Vibe coding has been taking off recently, and reading all the optimistic user stories with cursor one-shotting projects, I decided to try it out by migrating this site from Jekyll to a more modern static site framework. My main goals:

  • Automatic tag generation: Previously I had to make an independent HTML page for each tag.
  • Better LaTeX support: This might not have been Jekyll-specific, but the behavior has been very inconsistent.
  • Simplified configuration: Jekyll had multiple config files.
  • Flexible date handling: Previously my markdown file names had to follow a certain date convention, which made writing new posts higher friction.
  • Permalink system: I kept getting confused about which tag I needed to use. This is probably not Jekyll-specific, but I wanted an easier method to cross-link posts.
  • Add dark mode: The me of ten years ago didn't know about it.

Good vibes

Architecture

I started with Gemini and chatted with it about my requirements; it suggested Hugo, Jekyll, Eleventy, and Astro. After learning about each framework's language (JS vs. Python/Go), (perceived) ease of use, build speed, flexibility w.r.t. templating (e.g. Jekyll is very opinionated about how to structure the code), and stability, I decided on Eleventy.

I started with Cursor and told the agent the current architecture of my jekyll site, my requirements of the new eleventy site, and asked it to:

  • Give me a migration plan
  • Keep track of the migration plan and status in a new document.

The reason for this was twofold:

  • I found this was a good way to navigate a non-trivial project. As can be seen in the document, while most of the markdown posts could stay the same, the migration involved a lot of JavaScript and templating changes. Web-dev link redirection, templating syntax, and CSS structure have always confused me, and the process would've been very unmaintainable without an organized log. Scrolling through the Cursor agent windows is very slow, especially as the context gets longer.
  • Past experience showed me that LLMs can often get stuck in a local minimum and end up going in circles trying to solve a problem. It's only with human supervision and hinting (e.g. "stop using approach 1, 2, ... try along this way") that there's hope for it to get out of the rut and make progress. But giving useful hints requires the supervisor (me) to actually have an idea of what's going on. This is easy if I'm familiar with the technology; otherwise I need additional cognitive scaffolding.

The initial generated migration plan had a big-tech RFC feel to it (I wonder why..) and I had to manually trim down some verbose components.

Cooking

The proposed plan looked fine, so I clicked through all the agent-generated actions (generating new files, updating existing files, terminal commands to install node components, etc.). After specific steps, I would tell the agent to record what it had done in the appropriate sections of the migration document.

Bad Vibes

Letting the agent cook got me to local testing. Everything mostly worked, but there were very subtle problems:

  • Code formatting errors
  • Math formatting errors
  • Broken links
  • Broken tags

Some of these problems had a straightforward but not simple fix that Cursor identified; e.g. math formatting needs to be consistent across posts, which required some manual regex.

Bad Gemini2.5-Pro

Then I ran out of free Cursor credits. As this is a one-time project, I continued by copying and pasting relevant files and describing the problem to Gemini2.5 directly -- an approach I used often in the early days of using LLMs. The difference in efficiency was immediate:

  • The context availability to Gemini is limited by my understanding of the project. Even if I have prompted Gemini to ask for any additional information that it might need, it's a much slower dev process.
    • AFAIK the planning and reflection behavior in Cursor are model dependent, but Gemini isn't an available backend.
  • I also started manually updating sections in the migration doc, and didn't care too much about keeping it accurate. It had served its purpose and now the vibes were taking over. Subsequent commit notes served as my educational notes.
  • I spent maybe an hour on the code formatting problem with Gemini, iterating through multiple failed solutions. It was clear that it was stuck, but I didn't know enough about Eleventy and the Nunjucks templating system to give useful hints.

Good Claude-4-Sonnet

After two weeks I went back to it, with fresh Cursor credits. The agent solved the code formatting problem in essentially one shot. The backing model was Claude-4-Sonnet, and I asked how it knew what to do. It mentioned:

Yes, I've encountered this exact problem before - it's actually a very common issue when setting up Prism.js with static site generators like Eleventy, Jekyll, or Hugo. Here's how I diagnosed it:..
I've seen this exact pattern in:
Jekyll sites using Rouge + Prism
Hugo sites with syntax highlighting
Eleventy sites (like yours)
Even some WordPress themes
The solution is always the same: include the Prism JavaScript files so the line numbers plugin can run and generate the actual number elements.
It's one of those "aha!" moments when you realize that some features need both server-side processing AND client-side JavaScript to work properly!

And there lies the importance of having good post-training data for your models!

Claude was able to one-/few-shot implementing dark mode and the theme and layout changes. When it works, it is very satisfying, and asking it to explain itself also accelerated my own learning. This is peak vibe-coding.

Importance of informed prompts

During site deployment, the site was broken -- bad styling, broken links, etc. This didn't happen during local testing. I found that being very specific when describing the problems, e.g. "clicking on this link took me to this url, which gives 404", made them much more likely to be few-shotted than saying "The links are broken!!".

This is obvious, but I suspect the lack of this practice contributes partially to the reported 19% slowdown in developer productivity with AI.

Conclusion

State of Vibe coding

I've been using LLMs in increasing capacity over the last two years and have personally become at least 2x more productive in terms of lines of code and diffs generated in the company setting. The usage of AI tools there was mostly autocomplete and direct chat sessions.

Cursor-style UI with tighter code context integration is super fun to work with and extremely satisfying when it works.

In my experiences now, AI-coding tools are extremely efficient when:

  • The user is already a domain expert and has good context over the existing code base.
    • Better supervision and hints can be provided to the agents
    • Can break down specific tasks to delegate to the agents
  • The user is a n00b and needs help ramping up on architectural decisions and learning a new framework
    • The Eleventy documentation sucks and I don't really want to allocate brain synapses to learning web frameworks. LLMs can answer targeted questions for me.

Relying on training data

It was clear that Claude-4-Sonnet was better than Gemini2.5-Pro and GPT5 at solving coding problems in this instance -- it one-shotted more often and got stuck in stupid loops less often. But I get the sense that this was likely due to having better SFT data (i.e. the problems I encountered were more in-distribution with the model's training data).

If I knew as much about web-dev as either of these models, how would I have approached the problems?

  • Search through the space of all potential failure points
  • Evaluate which one is likely the culprit
  • Test and check

The thinking models are clearly doing that to a degree. But getting stuck indicates to me that the models aren't paying attention to previously failed approaches -- one might even frame it as a continual learning problem -- and that hypothesis generation is limited in OOD scenarios.

Value-add of AI products

Cursor is clearly useful and improves developer efficiency by increasing the developer-LLM bandwidth (faster context ingestion). I have not used Claude-CLI tool yet, but from what I've read it does not solve the problems of getting stuck, yet.

Jul 14, 2024 - Semantic encoding at the single-cell level

A very cool paper from the Williams lab at Harvard-MGH came out this month: Semantic encoding during language comprehension at single-cell resolution.

They record from 10 awake neurosurgery patients, from the superior posterior middle frontal gyrus within the dorsal prefrontal cortex of the language-dominant hemisphere, while the patients listened to different short sentences. Comprehension was confirmed by asking follow-up questions about the sentences. 133 well-isolated units from the 10 patients were collectively analyzed.

The results are very satisfying. Also see nature commentary on this paper.

Semantic tuning

They found something akin to "semantic tuning" on the single neuron level to the words in the sentence.

  • This is done by correlating neuron firings to the semantic content of each word in time, where the semantic content of a word is a multi-dimensional embedding vector (derived from models like word2vec).
  • A neuron is tuned to a "semantic domain" if its firing rate is significantly higher for that domain vs. others.
  • They observed that most of the neurons exhibited semantic selectivity to only one semantic domain, though given the 1-vs-all construction used to determine semantic tuning (sketched after this list), this conclusion feels a bit weak.
  • As a control, many semantic-selective neurons also distinguished real vs. non-words.
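To make the 1-vs-all construction concrete, here is a minimal sketch of how such a tuning test could look, assuming `firing_rates` holds one neuron's rate around each word and `domains` holds each word's semantic-domain label (hypothetical names; the paper's exact statistics may differ):

```python
import numpy as np
from scipy import stats

def semantic_tuning(firing_rates, domains, n_domains, alpha=0.05):
    """1-vs-all test (my reading of the analysis): a neuron is 'tuned' to a
    semantic domain if its firing rate is significantly higher for words in
    that domain than for words in all other domains.
    firing_rates: (n_words,) spike rate around each word for one neuron
    domains:      (n_words,) integer semantic-domain label per word"""
    tuned = []
    for d in range(n_domains):
        in_d, out_d = firing_rates[domains == d], firing_rates[domains != d]
        _, p = stats.mannwhitneyu(in_d, out_d, alternative="greater")
        if p < alpha / n_domains:        # Bonferroni correction over domains
            tuned.append(d)
    return tuned

# synthetic check: a neuron that fires more for domain 3
rng = np.random.default_rng(0)
domains = rng.integers(0, 9, size=400)
rates = rng.poisson(5, size=400) + 4 * (domains == 3)
print(semantic_tuning(rates, domains, n_domains=9))   # -> [3]
```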

image1

Generalizable semantic selectivity

  • Semantic decoders generalize to words not used in the training set (31+/-7%)
  • Semantic decoders work when a different word-embedding model is used (25+/-5%)
  • Decoding performance holds regardless of position in a sentence (23% vs 29%)
  • Works for multi-unit activities (25%)

Considering they use a support vector classifier with only 43 neurons, this is really good.
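For intuition, the decoding setup can be sketched as a cross-validated linear SVC on a pseudo-population response matrix. The data below is purely synthetic and the pipeline is my assumption of the setup, not the paper's code:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_words, n_neurons) firing rates of the 43 semantically tuned units
# y: (n_words,) semantic-domain label of each word  (both synthetic here)
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(500, 43)).astype(float)
y = rng.integers(0, 9, size=500)          # e.g. 9 semantic domains

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"decoding accuracy: {acc:.2f} (chance ≈ {1/9:.2f})")
```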

An additional control found that a different story "narrative" (different theme and style) does not affect semantic decoding (28% accuracy using decoders trained on a different narrative).

The decoding experiments used the responses of the collective semantically tuned neurons from all 10 participants (they can do this since the tasks are the same across participants). They checked that the semantic decoding generalizability holds for individual participants.

image2

Context-dependence

  • Presenting words without context yields much lower semantic selectivity from the units compared to when they were presented in a sentence.
  • Homophone pairs (words that sound the same but mean different things) showed bigger differences in semantic-selective units compared to non-homophone pairs (words that sound different but semantically similar).
  • Context helped with semantic decoding
    • They assigned a "surprisal" metric to each word using an LSTM: high surprisal means the word is unlikely given the context (see the definition after this list);
    • They looked at the decoding performance as a function of surprisal
    • Decoding performance for low-surprisal words significantly higher than for high-surprisal words
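For reference, the standard definition of surprisal (not quoted from the paper) is

$$ \text{surprisal}(w_t) = -\log P\left(w_t \mid w_1, \ldots, w_{t-1}\right), $$

so low-surprisal words are the ones the LSTM finds predictable from context.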

Neural representation of the semantic space

Even though a neuron might be selective primarily to a single semantic domain, the actual semantic representation could be distributed (perhaps in a sparse manner). Statistical significance was established with permutation tests.

They regressed the responses of all 133 units onto the embedding vectors (300-dimensional) of all words in the study.

  • This results in a set of model weights for each neuron (i.e. how much each neuron encodes a particular semantic dimension)
  • The concatenated set of model weights is then a neural representation of the semantic space (neurons-by-embeddings, 133x300 in this case); see the sketch after this list.
  • The top 5 PCs account for 81% of the activities of semantically selective neurons.
  • Differences in neuronal activities correlated with word-vector distance (measured with cosine similarity), r=0.17.
  • Word pairs with smaller hierarchical semantic distance (cophenetic distance) elicited more similar neuronal activities, r=0.36.
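A rough sketch of this regression-plus-PCA analysis, on synthetic stand-ins for the responses R and embeddings E (my reading of the method, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

# hypothetical stand-ins: R = (n_words, 133) firing rates, E = (n_words, 300) word embeddings
rng = np.random.default_rng(0)
E = rng.normal(size=(450, 300))
R = E @ rng.normal(size=(300, 133)) + rng.normal(scale=5.0, size=(450, 133))

# regress each unit's response onto the embedding dimensions
W = np.linalg.lstsq(E, R, rcond=None)[0]          # (300, 133): semantic weights per neuron
print(W.shape)

# low-dimensional structure of the "neural semantic space" (neurons x embedding dims)
pca = PCA(n_components=5).fit(W.T)
print("top-5 PC variance explained:", round(pca.explained_variance_ratio_.sum(), 2))

# do pairwise differences in population activity track semantic (embedding) distance?
neural_d = pdist(R, metric="euclidean")           # word-pair distances, neural space
semantic_d = pdist(E, metric="cosine")            # word-pair distances, embedding space
print("r =", round(pearsonr(neural_d, semantic_d)[0], 2))
```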

These last two points are interesting. It FEELS right, since hierarchical semantic organization probably allows a more efficient coding scheme for a large and expanding semantic space.

image3

Impact

This work is spiritually similar to the Huth/Gallant approach of looking at fMRI during story-listening to examine language processing. But the detailed single-neuron results make it reminiscent of the classic Georgopoulos motor control papers that largely formed the basis of BMI (1, 2).

While the decoding accuracy here (0.2-0.3) looks much lower than the initial motor cortex decoding of arm trajectories in the early papers, it is VERY GOOD considering the much higher dimensionality of the semantic space. And while the results might not be too surprising -- we know semantic processing has to happen SOMEWHERE in the brain -- it is surprising how elegant the results here are.

The natural next step IMO is obviously to record from more neurons with more sentences, etc. I would then love to see:

  1. Fine-tune an LLM with the recordings: since the neural activities are correlated with semantic content, they could be projected into a language model's embedding space.
  2. Try to reconstruct sentences' semantic meaning; the LLM can additionally be used to sample from the embedding space for sentence "visualization".

And this would be a huge step toward what most people perceive as "thought"-decoding vs. speech-decoding (which deals more with the mechanics of speech production, such as tones and frequencies, vs. language aspects such as semantics).

What else is needed?

The discussion section of the paper is a good read, and this section stands out regarding different aspects of semantic processing:

Modality-dependence

As the present findings focus on auditory language processing, however, it is also interesting to speculate whether these semantic representations may be modality independent, generalizing to reading comprehension, or even generalize to non-linguistic stimuli, such as pictures or videos or nonspeech sounds.

Production vs. Comprehension

It remains to be discovered whether similar semantic representations would be observed across languages, including in bilingual speakers, and whether accessing word meanings in language comprehension and production would elicit similar responses (for example, whether the representations would be similar when participants understand the word ‘sun’ versus produce the word ‘sun’).

Perhaps the most relevant aspect to semantic readout. It's unclear whether semantic processing in the production of language (as close to thoughts as we can currently define) is similar to that during comprehension. A publication from the same group examines speech production (phonemes, syllables, etc.) in the same brain region (the second paper says posterior middle frontal gyrus of the language-dominant prefrontal cortex; the illustration looks similar), examining the organization of the cortical column, and saw its activities transition from articulation planning to production.

It would be great to know if the semantic selectivity holds during speech production as well -- the combined findings suggest there's a high likelihood.

Cortical Distribution

It is also unknown whether similar semantic selectivity is present across other parts of the brain such as the temporal cortex, how finer-grained distinctions are represented, and how representations of specific words are composed into phrase- and sentence-level meanings.

Language and speech neuroscience has evolved quickly in the past two decades, with the traditional thesis that Broca's area is responsible for language production being challenged by more evidence implicating the precentral gyrus/premotor cortex.

Meanwhile, the hypothesis that Wernicke's area (posterior temporal lobe) underlies language understanding has withstood the test of time better. How it connects to the semantic processing observed in this paper in the prefrontal gyrus (e.g. is it downstream or upstream in language production) should certainly be addressed.

My (hopeful) hypothesis is that the prefrontal gyrus area here participates in both semantic understanding and production. I don't think this is far-fetched given the motor/premotor cortex's roles in both action observation and production across decades of BMI studies.

Apr 5, 2024 - A Cross-Modal approach to silent speech with LLM-Enhanced recognition

Paper link

This paper advances the SOTA on silent-speech decoding from EMG recorded on the face. "Silent" here means unvocalized or "mimed" speech. The dataset comes from Gaddy 2022.

image1

Image above shows the overall flow of the work:

  1. Model is trained to align EMG (from vocalized and silent) and audio into a shared latent space from which text-decoding can be trained. This training utilizes some new technique they call "cross-modal contrastive loss" (crossCon) and "supervised temporal contrastive loss" (supTCon). More on this later.
  2. They take the 10 best models trained with different loss and dataset settings and combine them into an ensemble.
  3. For inference, they get the decoded beam-search output from these different models, and pass them into a fine-tuned LLM, to infer the best text transcription. They call this LLM-based decoding "LLM Integrated Scoring Adjustment" (LISA).

Datasets

The datasets (Gaddy 2022, plus Librispeech) contain:

  1. EMG, Audio, and Text recorded simultaneously during vocalized speech
  2. EMG and Text for silent speech
  3. Librispeech: Synchronized Audio + Text

Techniques

A key challenge in decoding silent speech from EMG is the lack of labeled data, so a variety of techniques are used to overcome this, drawing inspiration from the self-supervised learning techniques that have recently advanced automatic speech recognition (ASR).

Cross-modality Contrastive Loss (crossCon): Aims to make cross-modality embeddings at the same time point more similar than all other pairs. This is essentially the same as a CLIP-style loss.
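My reading of crossCon is a CLIP-style InfoNCE objective over time-aligned EMG/audio embeddings; a hedged torch sketch, not the paper's implementation (the temperature and the symmetric form are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(emg_emb, audio_emb, temperature=0.1):
    """CLIP-style loss: for each time step t, the EMG embedding at t should be
    more similar to the audio embedding at t than to any other time step.
    emg_emb, audio_emb: (T, D) time-aligned embeddings from the two encoders."""
    emg = F.normalize(emg_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = emg @ audio.T / temperature          # (T, T) similarity matrix
    targets = torch.arange(emg.shape[0])          # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = cross_modal_contrastive_loss(torch.randn(64, 256), torch.randn(64, 256))
```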

image2

Supervised temporal contrastive loss (supTCon): This loss aims to leverage un-synchronized temporal data by making data at time points with the same label more similar than other pairs.

image3

Dynamic time warping (DTW): To apply crossCon and supTCon to silent speech and audio data, it's important to have labels for the silent speech EMG. DTW leverages the fact that vocalized EMG and audio are synchronized, by:

  1. Use DTW to align vocalized and silent EMG
  2. Pair the aligned silent EMG with the vocalized audio embeddings (see the DTW sketch after this list).
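A bare-bones DTW implementation for intuition (the paper's exact features and distance may differ):

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two feature sequences a: (T1, D), b: (T2, D).
    Returns the list of (i, j) index pairs of the optimal alignment."""
    T1, T2 = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (T1, T2)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                  cost[i, j - 1],
                                                  cost[i - 1, j - 1])
    # trace back from the end to recover the alignment path
    path, i, j = [], T1, T2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return path[::-1]

# e.g. align silent-EMG frames to vocalized-EMG frames, then transfer the
# time-synchronized audio embeddings to the silent EMG via the alignment path
path = dtw_path(np.random.randn(40, 8), np.random.randn(55, 8))
```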

Using audio-text data: To further increase the amount of training data, Librispeech is used. Since the final output is text, this results in more training data for the audio encoder, as well as the joint-embedding-to-text path.

All these tricks together maximize the amount of training data available to the models. I think there are some implicit assumptions here:

  1. EMG and audio are more similar to each other than EMG and text are, since both audio and EMG have temporal structure.

The use of a joint-embedding space between EMG and Audio is crucial, as it allows for different ways to utilize available data.

LISA: An LLM (GPT3.5 or GPT4) is fine-tuned on the EMG/audio-to-text outputs of the ensemble models and the ground-truth text transcriptions. This is done on the validation dataset. Using the LLM to output the final text transcription (given an engineered prompt and the beam-search paths), instead of the typical beam-search method, yielded significant improvements. And this technique can replace other language-model-based speech-decoding post-processing (e.g. on invasive speech-decoder output) as well!

Details:

  1. CrossCon + DTW performed the best. It's interesting to note that DTW with longer time steps (10 ms per timepoint) performs better.
  2. SupTCon loss didn't actually help.
  3. Mini-batch balancing: Each minibatch has at least one Gaddy-silent sample. Vocalized Gaddy samples are class-balanced with Gaddy-silent sample. The rest of the mini-batch is sub-sampled from Librispeech. This is important to ensure the different encoders are jointly optimized.
  4. GeLU is used instead of ReLU for improved numerical stability.
  5. The final loss equals a weighted sum of the EMG CTC loss, the audio CTC loss, crossCon, and supTCon losses (see the expression after this list).
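In other words, with hypothetical weights $\lambda_i$, the training objective reads roughly as

$$ \mathcal{L} = \lambda_1 \mathcal{L}^{\text{EMG}}_{\text{CTC}} + \lambda_2 \mathcal{L}^{\text{audio}}_{\text{CTC}} + \lambda_3 \mathcal{L}_{\text{crossCon}} + \lambda_4 \mathcal{L}_{\text{supTCon}}. $$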

Final Results on Word-Error Rate (WER)

For final MONA LISA performance (joint-model + LLM output):

  1. SOTA on Gaddy silent speech: 28.8% to 12.2%
  2. SOTA on vocal EMG speech: 23.3% to 3.7%
  3. SOTA on Brain-to-Text: 9.8% to 8.9%

Additional useful reference

Cites Acceptability of Speech and Silent Speech Input Methods in Private and Public:

The performance threshold for SSIs to become a viable alternative to existing automatic speech recognition (ASR) systems is approximately 15% WER

Dec 17, 2023 - Neurips 2023 neuro-ml round up

Neurips 2023 has been incredibly awesome to scan through. The paper list is long and behind a paywall, but usually searching for the paper titles will bring something up on arxiv or in some tweet-thread related to it.

Patrick Mineault (OG Building 8 ) has collected a list of NeuroAI papers from Neurips which has been very useful to scan through.

Quirky papers

Time Series as Images: Vision Transformer for Irregularly Sampled Time Series

Instead of trying to figure out how to align differently sampled time series for a time series classification task, plot them, send the image to a vision transformer, add a linear prediction head on top, and be done with it.

vitst

And this actually works:

We conduct a comprehensive investigation and validation of the proposed approach, ViTST, which has demonstrated its superior performance over state-of-the-art (SoTA) methods specifically designed for irregularly sampled time series. Specifically, ViTST exceeded prior SoTA by 2.2% and 0.7% in absolute AUROC points, and 1.3% and 2.9% in absolute AUPRC points for healthcare datasets P19 [29] and P12 [12], respectively. For the human activity dataset, PAM [28], we observed improvements of 7.3% in accuracy, 6.3% in precision, 6.2% in recall, and 6.7% in F1 score (absolute points) over existing SoTA methods.

Even though most of the "plot" image is simply empty space, the attention map shows the transformer attending to the actual lines, and to regions with more changes.

vitst_attention

Why does this work? I'd think it's because ViTST acts as an excellent feature extractor, since DL vision models contain representations of primitive features typically present in line signals (e.g. edges, curves, etc.). Yet using a pretrained ResNet showed much worse performance vs. the pretrained SWIN-transformer (but still higher than the trained-from-scratch SWIN-transformer). That suggests the transformer's attention between different time series (or different regions of the plot) might make a difference.

Should we use it? Probably not -- lots of compute and memory is being wasted here producing mostly empty pixels. But it's a sign of the coming trend of leveraging pre-trained models or dying trying.

Training with heterogenous multi-modal data

Data collection sucks and everyone knows it, especially neuroscientists. How we wish we could just bust out some kind of ImageNet, CIFAR, or COCO like the vision people. Nope, datasets are always too heterogeneous in sensors, protocols, or modalities. Transformers are making it easier to combine them now though (see, for example, the previous paper).

BIOT: Biosignal Transformer for Cross-data Learning in the Wild

biot

Main contributions:

  • Biosignal transformer (BIOT): a generic biosignal learning model BIOT by tokenizing biosignals of various formats into unified "sentences."
  • Knowledge transfer across different data: BIOT can enable joint (pre-)training and knowledge transfer across different biosignal datasets in the wild, which could inspire the research of large foundation models for biosignals.
  • Strong empirical performance. We evaluate our BIOT on several unsupervised and supervised EEG, ECG, and human sensory datasets. Results show that BIOT outperforms baseline models and can utilize the models pre-trained on similar data of other formats to benefit the current task.

Fancy words aside, the main takeaways:

  1. Segment the time series into 1 s chunks (called tokens), then parametrize each token with 3 embeddings: [channels, samples] --> [(dim_emb1 + dim_emb2 + dim_emb3),] (see the sketch after this list).
  2. Pass them through a linear transformer (use reduced-rank form of self-attention).
  3. Profit with transformer embedding outputs..
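A hedged sketch of what that tokenization could look like (hypothetical sizes; the paper derives per-chunk spectral features, while a linear projection of the raw samples stands in here):

```python
import torch
import torch.nn as nn

class BiosignalTokenizer(nn.Module):
    """Turn a (channels, samples) biosignal into a 'sentence' of tokens:
    each token is a 1 s chunk of one channel, embedded as
    sample-embedding + channel-embedding + position-embedding."""
    def __init__(self, fs=200, n_channels=16, d_model=256, max_chunks=60):
        super().__init__()
        self.fs = fs
        self.sample_proj = nn.Linear(fs, d_model)       # embeds the 1 s of samples
        self.channel_emb = nn.Embedding(n_channels, d_model)
        self.pos_emb = nn.Embedding(max_chunks, d_model)

    def forward(self, x):                               # x: (channels, samples)
        C, S = x.shape
        n_chunks = S // self.fs
        chunks = x[:, :n_chunks * self.fs].reshape(C, n_chunks, self.fs)
        tok = self.sample_proj(chunks)                  # (C, n_chunks, d_model)
        tok = tok + self.channel_emb(torch.arange(C))[:, None, :]
        tok = tok + self.pos_emb(torch.arange(n_chunks))[None, :, :]
        return tok.reshape(C * n_chunks, -1)            # flattened token "sentence"

tokens = BiosignalTokenizer()(torch.randn(16, 200 * 30))  # 16-ch, 30 s of EEG at 200 Hz
print(tokens.shape)                                       # (480, 256)
```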

This is a very similar approach to POYO1, which uses relative position embeddings and does not need to explicitly chunk into 1 s windows.

biot_tokenization

Interestingly, the BIOT paper claims to be "the first multi-channel time series learning model that can handle biosignals of various formats". And both BIOT and POYO1 are in Neurips 2023.

Leveraging LLM for decoding

Continuing the trend of using LLMs at the end of all the ML stacks... such as decoding mental images by conditioning diffusion models on fMRI, now we can decode "language" from EEG much better.

DeWave: Discrete Encoding of EEG Waves for EEG to Text Translation

Context:

Press releases such as this one would have you believe they have "developed a portable, non-invasive system that can decode silent thoughts and turn them into text".

But what exactly are the "silent thoughts"?

study participants silently read passages of text while wearing a cap that recorded electrical brain activity through their scalp using an electroencephalogram (EEG)

This is different from what we typically think of as "thoughts"; it's more similar to decoding movies from neural activities (similar to Alexander Huth's and Joe Culver's works). Now we continue:

dewave

Main contributions:

  • This paper introduces discrete codex encoding to EEG waves and proposes a new framework, DeWave, for open vocabulary EEG-to-Text translation.
  • By utilizing discrete codex, DeWave is the first work to realize the raw EEG wave-to-text translation, where a self-supervised wave encoding model and contrastive learning-based EEG-to-text alignment are introduced to improve the coding ability.
  • Experimental results suggest that DeWave reaches SOTA performance on EEG translation, where it achieves 41.35 BLEU-1 and 33.71 Rouge-1, which outperforms the previous baselines by 3.06% and 6.34% respectively

The paper does decoding with/without training data markers indicating where the subject is looking. The case without markers is much more interesting and sidesteps the labeling problem.

The overall approach:

  1. Use a conformer to vectorize the EEG signals into embeddings,
  2. the embeddings are mapped to a set of discrete "symbols" (codes) via a learned "codex" (a sketch of this lookup follows this list),
  3. The codex representations are fed into pre-trained BART (BERT+GPT) and get the output hidden states. A fully connected layer is applied on the hidden states to generate English tokens from pre-trained BART vocabulary V.
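The codex reads to me like a VQ-VAE-style codebook; a hedged sketch of the nearest-neighbor lookup in step 2 (the straight-through gradient trick is my assumption, not something stated in the paper):

```python
import torch

def codex_lookup(embeddings, codebook):
    """Map continuous EEG embeddings to discrete codex entries by nearest neighbor.
    embeddings: (T, D) conformer outputs;  codebook: (K, D) learned codex vectors.
    Returns the code indices and the quantized embeddings fed to BART."""
    d = torch.cdist(embeddings, codebook)      # (T, K) pairwise distances
    codes = d.argmin(dim=-1)                   # nearest codex entry per time step
    quantized = codebook[codes]
    # straight-through estimator so gradients can flow back to the encoder
    quantized = embeddings + (quantized - embeddings).detach()
    return codes, quantized

codes, q = codex_lookup(torch.randn(56, 512), torch.randn(1024, 512))
```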

It's hard to decipher some of the details of this paper, but recording notes here for future me.

Training paradigm:

  • In the first stage, they do not involve the language model in weight updates. The target of the first stage is to train a proper encoder projection to theta_codex and a discrete codex representation C for the language model.
  • In the second stage, the gradient of all weights, including language model theta_BART is opened to fine-tune the whole system.

dewave_pretrain

The codex approach is very interesting -- instead of feeding EEG embeddings directly to the pre-trained BART, it gets converted into this intermediate representation. The rationale given was this:

It is widely accepted that EEG features have a strong data distribution variance across different human subjects. Meanwhile, the datasets can only have samples from a few human subjects due to the expense of data collection. This severely weakened the generalized ability of EEG-based deep learning models. By introducing discrete encoding, we could alleviate the input variance to a large degree as the encoding is based on checking the nearest neighbor in the codex book.
The codex contains fewer time-wise properties which could alleviate the order mismatch between event markers (eye fixations) and language outputs.

This needs temporal alignment between segments of signals and words (or "labels"?).

Not sure if this is back-rationalization, but the training supports it:

To train the encoder as well as the codex, two self-supervised approaches were used:

  1. Encoder-decoder Reconstruction: raw waves -> embeddings -> codex -> embeddings -> raw waves. Notably subsequent ablation studies showed that a larger codex size is not necessarily better, which makes sense here as we expect a lower-D latent space for this approach to work.
  2. Language (word2vec embeddings) and codex alignment via contrastive learning (of course!)

And this approach works, even though EEG is traditionally very shitty. Sure, they used new graphene-based dry electrodes which supposedly approach wet-gel electrode performance, but it's still surprising. Though I'm not versed enough in EEG to understand how significant the margin over SOTA is.

I don't buy the justification for the codex representation though, as the ablation study shows the effect of the codex on word-level EEG features is minimal.

dewave_codex

Predicting brain responses with large pre-trained models

Alex Huth is making it rain in Neurips this year with a series of fMRI response encoding papers, involving language and vision/video. One that stood out the most to me was Scaling laws for language encoding models in fMRI, also see tweet-thread.

Takeaways:

  • Predicting fMRI brain response to story-listening with LLM: Brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar logarithmic behavior was observed when scaling the size of the fMRI training set.
  • Similar trend exists for acoustic encoding models that use HuBERT, WavLM, and Whisper.

A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.

On the surprising effectiveness of neural decoding with large pre-trained models..

Deep-learning models were only "weakly" inspired by the brain, and even though the Transformer architecture isn't exactly neuro-inspired, decoding and encoding neural responses using large pre-trained models has been surprisingly effective and "straight-forward".

My take is that these large pre-trained models (language, vision, audio, etc.) encode the structure of that particular modality. Brains process these modalities differently from these models, but presumably there's some kind of latent space that both can be mapped onto. So this style of encoder-decoder approach can be thought of as latent space alignment (which can include the temporal dimension as well!)

This is a step up from other dynamic-programming style alignment techniques such as viterbi decoding and CTC-loss, and much more elegant conceptually.

Feb 26, 2023 - Matrix Cookbook

Still not great at matrix math -- why are they so much harder than trig identities...

Great reference on matrix identities and derivations, saved a copy here.

Feb 26, 2023 - PCA vs. FA: Theoretical and practical differences

I remember spending an afternoon understanding the theoretical and practical differences between PCA and FA several years ago, when factor analysis (FA) started to appear more frequently in the neuroscience and BMI literature. It was confusing because it seemingly measured the same thing as the popular tool principal component analysis (PCA), but in a much more computationally complex way. When I tried FA on the neural data I was working on at the time, I didn't see much difference -- reconstruction using the top-n PC components and the assumed n common factors accounted for similar amounts of variance in the data.

Reading the recent paper relating neuronal pairwise correlations and dimensionality reduction made me double back on this again -- the motivating question was: can we derive similar results using PCA? The answer was no, and looking into this deepened my understanding of these two tools.

Problem Formulation

Both PCA and FA seek to provide a low-rank approximation of a given covariance (or correlation) matrix. "Low-rank" means that only a limited number of principal components or latent factors is used. If we have an n×n covariance matrix of the data C, then we have the following model formulations:

$$
\begin{aligned}
\text{PCA:} \quad & C \approx WW^\top \\
\text{PPCA:} \quad & C \approx WW^\top + \sigma^2 I \\
\text{FA:} \quad & C \approx WW^\top + \Psi
\end{aligned}
$$

Here W is a matrix with k columns (k < n), representing the small number of principal components or latent factors, I is the identity matrix, and Ψ is a diagonal matrix. Each method can be formulated as finding W (in FA's case, also Ψ) to minimize the norm of the difference between the left-hand and right-hand sides.

Note that PPCA can be thought of as an intermediate between PCA and FA: while the noise term σ²I makes it a generative model like FA, it practically acts like PCA (in that W spans the same subspace in both).

Difference in model assumptions

The principal components of PCA are derived from a linear combination of the feature vectors, akin to a rotation (and scaling) of the feature space. The directions of the PCs are those along which variance is maximized. The PCs may happen to correspond to subjectively meaningful constructs, but this is not guaranteed by the model assumptions.

In contrast, FA is a generative model with the built-in assumption that a number of latent factors led to the observed features. The factors are usually derived from EM algorithms assuming the distribution of the data is generated according to multi-variate Gaussians.

Consequences

FA reconstructs and explains all the pairwise covariances with a few factors, while PCA cannot do so as successfully. This is because PCA extracts eigenvectors of the data distribution, while FA seeks latent factors that maximize the covariance explained. Note that it doesn't make much sense to treat a factor from FA as a "direction" in the feature space as in PCA, because it's not a transformation of the feature space.

A very useful illustration is given in ttnphns' stackexchange post, showing this difference:

FA_vs_PCA

The errors of reconstruction using either PC1 or F1 have different shapes in the feature space: the errors from the FA reconstruction are uncorrelated across features, while those from the PCA reconstruction are correlated.

Applications

When to use PCA vs. FA?

Ideally, if the goal is to find latent explanatory variables, then FA's generative model assumption is better suited. If the goal is dimensionality reduction, such as when fitting a regression model with highly correlated features, PCA would be preferred.

So why would so many papers use PCA instead of FA, and interpret principal components to represent some latent factors?

The interpretation here is theoretically not sound. But practically this is OK, since as the number of features n increases, the results of PCA approach FA. See amoeba's great simulations on stackexchange, as well as ttnphns's simulations.

FA=PCA

Why is this the case, if the model formulation and computations are so different? Referencing amoeba --

From the model formulations, PCA finds W to best approximate the sample covariance matrix, $C \approx WW^\top$, while FA finds W to best approximate the off-diagonal entries of C, i.e. $\operatorname{offdiag}(C) \approx WW^\top$ (remember that FA tries to capture the pairwise correlations between features). The diagonal elements of C in FA are taken care of by Ψ.

This means:

  1. If the diagonals of C are small (i.e. all features have low noise), so that the off-diagonal elements dominate C, then FA approaches PCA.
  2. If n is very large, then C is also very large, and the contribution of the diagonal elements becomes small compared to that of the off-diagonal elements, so once again PCA results approach FA. A good way to think about this is in terms of the residual reconstruction error picture -- as n grows, the residual error from the PC reconstruction becomes more isotropic, approaching that of FA (see the simulation sketch below).
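A small simulation in the spirit of the stackexchange posts cited above (not their code): generate data from a k-factor model and watch the PCA and FA subspaces converge as the number of features n grows.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

def principal_angle(A, B):
    """Largest principal angle (degrees) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s.min(), -1, 1)))

rng = np.random.default_rng(0)
k, n_obs = 5, 2000
for n in [10, 50, 200]:                       # number of features
    W = rng.normal(size=(n, k))               # true loadings
    psi = rng.uniform(0.5, 2.0, size=n)       # heteroscedastic private noise
    Z = rng.normal(size=(n_obs, k))
    X = Z @ W.T + rng.normal(size=(n_obs, n)) * np.sqrt(psi)

    pc = PCA(n_components=k).fit(X).components_.T            # (n, k)
    fa = FactorAnalysis(n_components=k).fit(X).components_.T  # (n, k)
    print(f"n={n:4d}  PCA-vs-FA subspace angle: {principal_angle(pc, fa):5.1f} deg")
```

The angle between the two subspaces should shrink as n/k grows, consistent with the rule of thumb below.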

When is n large enough for PCA to approximate FA?

When the ratio n/k, where k is the expected number of latent factors, is big. Usually a ratio of 10 is a good threshold from simulation results. This also explains my past observations in neural data -- where the number of features is ~100 (number of neurons), and latent factors is ~10 (cursor/actuator position/velocity).

The other, more practical reason is simply that PCA is way easier to compute!

If PCA approaches FA under large n, why use FA at all?

Beyond better matching the model assumptions to the question under investigation, the FA formulation enables easier interpretation of shared covariance and its relationship with pairwise correlations. This was the key insight in Umakantha 2021, for example.

Conclusions

  1. PCA and FA differ in model assumptions; notably, FA assumes the data is generated by some underlying latent factors.
  2. PCA approximates the full covariance matrix (including its diagonal), while FA seeks to minimize the off-diagonal reconstruction errors of the covariance matrix.
  3. PCA approaches FA results when the number of features is big compared to the assumed number of latent factors.

Feb 13, 2023 - Neuronal correlations and dimensionality reduction

Bridging neuronal correlations and dimensionality reduction

Pairwise correlations between individual neurons, and dimensionality reduction based methods to characterize population statistics, are widely used to measure how neural populations covary. This paper establishes mathematical relationships between the two approaches and demonstrates that summarizing population-wide covariability using any single activity statistic is insufficient.

The graphical abstract and highlights on the publication are actually very informative after reading through some of the paper:

abstract

In typical Byron Yu/Aaron Batista fashion, this paper presents a clever application of dimensionality reduction (specifically factor analysis).

Neuroscience literature often presents pairwise statistics to characterize neural populations (e.g. average spike-count correlation before and after BMI learning). They first propose that this measure, r_sc mean, needs to be complemented by its pairwise standard deviation, r_sc SD, then connect how changes in this pair of pairwise metrics relate to population-level metrics obtained through dimensionality reduction.

motivation

The next three figures illustrate the population-level metrics and their relationship with the pairwise metrics. The central idea is that population activity can have different degrees of covariation, which can be decomposed into shared variation along a number of latent co-fluctuations.

  1. Loading similarity: How similarly individual neurons are weighted in a latent co-fluctuation (i.e. how coordinated the population activity is).
  2. Percent shared variance: How much of each neuron's fluctuations is captured by the latent co-fluctuations.
  3. Dimensionality: The number of co-fluctuations needed to capture the shared variance in the population activity (similar to the number of PCs in PCA); a sketch of how these can be computed from a factor analysis fit follows this list.
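A sketch of how these three metrics can be pulled out of a factor analysis fit, following my reading of the definitions (the paper's exact conventions may differ):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def population_metrics(X, n_factors=10, shared_var_cutoff=0.95):
    """X: (trials, neurons) spike counts. Returns loading similarity,
    mean percent shared variance, and dimensionality (my reading of the
    definitions; not the authors' code)."""
    fa = FactorAnalysis(n_components=n_factors).fit(X)
    L = fa.components_.T                   # (neurons, factors) loading matrix
    psi = fa.noise_variance_               # (neurons,) private variances

    shared = np.sum(L**2, axis=1)                 # shared variance per neuron
    pct_shared = shared / (shared + psi)          # percent shared variance

    # loading similarity: how uniform the dominant co-fluctuation's weights are
    u = np.linalg.svd(L, full_matrices=False)[0][:, 0]   # unit-norm loading vector
    loading_sim = 1 - np.var(u) * len(u)          # ~1 = identical weights, ~0 = dissimilar

    # dimensionality: factors needed to capture most of the shared covariance
    eigvals = np.sort(np.linalg.eigvalsh(L @ L.T))[::-1]
    dim = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), shared_var_cutoff) + 1)
    return loading_sim, pct_shared.mean(), dim

X = np.random.default_rng(0).poisson(5, size=(1000, 80)).astype(float)
print(population_metrics(X))
```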

population_metric_intuition

population_and_pairwise_metrics

summary

If all this sounds like factor analysis (FA), that's because it is a different way of interpreting FA. The crux of the paper is below:

FA

loading_similarity

shared_variance

dimensionality

Nov 1, 2022 - Signal Processing Notes 2

The field of DSP is pretty broad, but I found that knowing/remembering the following basic concepts is useful both in practice and in interviews (giving or receiving). Understanding the concepts below can also make debugging unexpected results much easier, and I've had to come back to these concepts many times.

FIR vs. IIR filters

  1. FIR filters are easy to implement, always causal.
  2. IIR filters are more concise to implement, but they require feedback. Higher-order IIR filters can be implemented as cascades of biquads (SOS implementations) for numerical stability (not needed for FIR); see the scipy example after this list.
  3. FIR filters can be made exactly linear phase (symmetric taps).
  4. IIR filters can try to be linear phase within the passband.
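A quick scipy example of the SOS-based IIR design mentioned in point 2, with a zero-phase (non-causal) variant for comparison:

```python
import numpy as np
from scipy import signal

fs = 1000.0
# 4th-order Butterworth band-pass, designed directly as second-order sections
sos = signal.butter(4, [8, 30], btype="bandpass", fs=fs, output="sos")

t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 20 * t) + np.random.default_rng(0).normal(scale=0.5, size=t.size)

y_causal = signal.sosfilt(sos, x)         # causal, introduces (nonlinear) phase delay
y_zerophase = signal.sosfiltfilt(sos, x)  # forward-backward: zero phase, but non-causal
```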

Phase Delay

  1. Linear phase is when the phase-shift introduced to different frequencies are linearly increasing.
  2. This means higher-frequency signals are delayed more in phase.
  3. The result is that all frequencies are delayed by the same amount of time, despite being shifted by different amounts of phase.
  4. Intuitively, linear phase means the filtered signal will have similar shape as before.

Spectrum Leakage

  1. This phenomenon stems from the assumption of DFT, that the input signal is ONE PERIOD of a PERIODIC SIGNAL.
  2. If the windowed portion of the signal is not an integer multiple of the signal's period, then the resulting frequency spectrum may not have a bin corresponding exactly to the frequency of interest -- the resulting DFT is not sharp.
  3. The frequency of interest is "spread" out into the surrounding frequency bins, producing "leakage".
  4. This can be solved with "zero-padding", adding zeros to the end of the time-domain signal. This approximates the effect of taking the CTFT of a windowed periodic signal.
  5. The result is that the DFT of this zero-padded signal looks like an interpolation of the previously "spread-out" DFT of the non- zero-padded signal. For signal with a single tone, this would recover the correct peak frequency.
  6. Zero-padding does not improve frequency resolution (interpolation doesn't increase resolution). To improve resolution we need a longer recording of the signal. See the numpy demo after this list.
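A small numpy illustration of points 4-6: zero-padding interpolates the spectrum of an off-bin tone, but it does not add resolution.

```python
import numpy as np

fs, T = 100.0, 1.0                      # 1 s of data sampled at 100 Hz
t = np.arange(0, T, 1 / fs)
x = np.sin(2 * np.pi * 10.4 * t)        # 10.4 Hz: not an integer number of cycles in the window

def peak_freq(x, n_fft):
    X = np.abs(np.fft.rfft(x, n=n_fft))          # n_fft > len(x) zero-pads the signal
    f = np.fft.rfftfreq(n_fft, d=1 / fs)
    return f[np.argmax(X)]

print(peak_freq(x, len(x)))       # 10.0 Hz: energy leaks into the neighboring 1 Hz bins
print(peak_freq(x, 16 * len(x)))  # ~10.4 Hz: the interpolated spectrum recovers the peak
```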

FFT^2 vs. PSD of a signal

  1. PSD usually applies to a stochastic process (usually stationary). For non-stationary processes such as speech, STFT should be used.
  2. Wiener-Khinchin Theorem: For stationary stochastic process, PSD is defined as the Fourier Transform of the Autocorrelation Sequence of the signal. From this we get the amount of power per frequency bin.
  3. PSD can be estimated by taking the magnitude squared of the FT of the signal -- this is called the Periodogram.
  • This is not a consistent estimator as it does not tend to a limit with increasing sample size, as the individual values are exponentially distributed.
  4. An alternative is to take a truncated version of the autocovariance sequence, then take its Fourier Transform.
  • This leads to a spectral window of some width, has lower sampling variability, and with minor assumptions IS a consistent estimator.
  5. The key weakness of the periodogram is that it takes only one "realization" of a stochastic process and therefore has high variability.
  • The PSD can be thought of as a random variable -- you need to average over many outcomes to get a decent estimate.
  • In fact, another definition of PSD is "an average of the magnitude squared of the Fourier Transform".
  • Very CHEAP computationally, but has high variance.
  6. Consequently, averaging multiple periodograms approaches the PSD (see the scipy sketch after this list).
  7. The unit of PSD is /Hz. Integrating the PSD over a band of frequencies gives the power in that band (watts).
  8. For a mean-zero signal, the integral of the entire PSD is equal to the variance.
  9. Key duality: a quadratic quantity in the frequency domain (energy spectral density in the deterministic case, power spectral density in the stochastic case) corresponds to a correlation (which is essentially a convolution) in the time domain.
  10. The multi-taper approach is a fancy way of averaging periodograms. It averages a pre-determined number of periodograms obtained by applying different windows (tapers) to the same signal. The windows selected have two key properties:
  • The windows are orthogonal, which means the periodograms are uncorrelated, so averaging them gives an estimate with lower variance than using just a single taper.
  • The windows have the best possible concentration in the frequency domain for a fixed signal length, to minimize spectral leakage.
  • Probably the best estimator for stationary time series that aren't super long.
  11. Informative resource with a good discussion on why there are so many ways to calculate a PSD: 1
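A scipy illustration of points 5-8: a single periodogram of white noise is very noisy, Welch's averaged periodograms have much lower variance, and the integral of the PSD recovers the variance.

```python
import numpy as np
from scipy import signal

fs = 500.0
rng = np.random.default_rng(0)
x = rng.normal(size=int(60 * fs))                   # 60 s of white noise (flat true PSD)

f1, p_per = signal.periodogram(x, fs=fs)            # single periodogram: high variance
f2, p_welch = signal.welch(x, fs=fs, nperseg=2048)  # averaged segments: much lower variance

print(np.std(p_per) / np.mean(p_per))       # ~1 (periodogram values are ~exponentially distributed)
print(np.std(p_welch) / np.mean(p_welch))   # much smaller after averaging
print(np.trapz(p_welch, f2), np.var(x))     # integral of the PSD ≈ signal variance
```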

Savitzky-Golay filter

I actually haven't encountered this in grad school, probably because I dealt mostly with spikes, and didn't need the smoothing abilities that SGF is especially suited for. Everyone in my job seems to love it for some handwavy reason so I wanted to demystify it. A good overall discussion of the pros/cons beyond the basic formulation is on stackoverflow. Some highlights:

  1. Let's get it straight, SGF is not magical, adaptive or anything. It's an FIR filter designed with local polynomial approximation.
  2. If the noise spectrum significantly overlaps with the signal spectrum, a more careful approach is required, and brute-force attenuation will not work well because either you leave too much noise (by choosing the cut-off frequency too high) or you distort the desired signal too much. In this case Savitzky-Golay (S-G) filters may be a good choice.
  3. SGF tutorial:
  • Can be thought of as an FIR low-pass filter, with a flat passband and moderate attenuation in the stop band.
  • Good for preserving peak shapes and height, bad for rejecting noise.
  • Use when you want to smooth, but the noise and signal share similar frequency.
  4. An SGF with a smaller window length means less attenuation at higher frequencies; this means it distorts the peaks less, but noise that's slightly out of band is not filtered as sharply (see the frequency-response sketch after this list).
  5. A longer window attenuates more, but keeps the amplitude constant for a basic sinusoid.
  6. Examples/tests:
  • Clean signal of 1 Hz, corrupted with 10 Hz and 40 Hz (yes, the noise is out of band). A len-31 SGF would perform better than len-101 because it distorts the peaks less, and the resulting waveform is almost identical to the base 1 Hz waveform.
  • Clean signal of 1 Hz, corrupted with 1.2 Hz and 40 Hz. A len-31 SGF results in a smoother version of the original signal. A len-101 SGF results in a 1 Hz signal with smaller amplitude. In this case, len-101 correctly filtered out the high-frequency components better.
  7. In practice, I actually found SGF hard to use/tune. Much better to do actual adaptive filtering (e.g. H-infinity, LMS) if a noise measurement is available.
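To see point 1 and the window-length trade-off directly, one can inspect the frequency response of the SGF taps (scipy exposes them via savgol_coeffs):

```python
import numpy as np
from scipy.signal import savgol_coeffs, freqz

fs = 1000.0
for window in (31, 101):
    h = savgol_coeffs(window, polyorder=3)     # the SGF is just an FIR filter: these are its taps
    w, H = freqz(h, worN=2048, fs=fs)
    # first -3 dB crossing: shorter window -> higher cutoff (less smoothing, less peak distortion)
    cutoff = w[np.argmax(np.abs(H) < np.sqrt(0.5))]
    print(f"window={window}: approx -3 dB cutoff ≈ {cutoff:.1f} Hz")
```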

Utilizing ergodicity to increase SNR

  • In diffusion correlation spectroscopy, for example, SNR is proportional to the photon rate (more signal per time) and the integration time (for a more accurate autocorrelation calculation).
  • However, more than one measurement can be taken at a time, increasing the number of measurements spatially can then achieve similar SNR increase as a greater photon rates and/or integration time.
  • This is based on the assumption of ergodicity -- i.e. a random process averaged over time/space has the same mean and variance.

Time Series ARIMA models and Signal Processing relations

  1. PSU stats510 is a great reference for the basics.
  2. Moving Average (MA) are like FIR filters
  3. Autogressive (AR) models are like IIR filters.
  4. Every autoregressive-integrated-moving-average (ARIMA) model can be converted to an infinite-order MA model -- similar to how IIR filters can be approximated by FIR filters!
  5. When decomposing (trend + seasonal + random) additive models, trend is extracted by sliding window centered moving averages (similar to low-pass filtering), with a window length equal to the seasonal span.
  6. Smoothing: LOWESS/LOESS are equivalent to Savitzky-Golay -- i.e. fitting regressions or polynomials locally to each point, may include weighting function applied to different points.
  7. We shouldn't blindly apply exponential smoothing because the underlying process might not be well modeled by an ARIMA(0,1,1). The reason is that exponential moving average (EMA) smoothing is equivalent to an ARIMA(0,1,1) model (i.e. a first-order moving average on the differenced series).
  8. Diagnostics:
  • AR models should show decaying autocorrelation function (ACF) and cutoff in partial autocorrelation function (PACF).
  • MA models should show cutoff ACF, and decaying PACF.
  • ACF and PACF both showing spike-cutoff patterns suggests ARIMA model.
  • Seasonal trends show periodic cutoff ACF or PACF
  9. How to fit an ARIMA model to linear model residuals:
  • Cochrane-Orcutt (example) does pre-whitening, then applies OLS to get the beta coefficients and SEs. The R-squared after this procedure is usually less than that of plain OLS.
  • ARIMAX: fitting a regression with an ARIMA error structure; can be done with Cochrane-Orcutt, but better with maximum-likelihood estimation to jointly estimate both the regression model and the error ARIMA model.
    • Can treat this as a state-space model -- the ARIMA error describes the state transition, the exogeneous regressors describe the observation model.
    • Some connection to Kalman filtering after some manipulations.

Yule-Walker...because I can never remember this name..

  1. The Yule-Walker equations are used to fit AR models of a given order (quick example after this list).
  2. Order structure is determined by maximum-likelihood or AIC of the fit to the residuals.
  3. Wiener-Hopf equation is a generalization of the Yule-Walker equations.
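A quick statsmodels example of fitting an AR model via the Yule-Walker equations (order chosen by hand here rather than by AIC):

```python
import numpy as np
from statsmodels.regression.linear_model import yule_walker

# simulate an AR(2) process: x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

rho, sigma = yule_walker(x, order=2, method="mle")
print(rho)    # should be close to [0.6, -0.3]
print(sigma)  # innovation standard deviation, close to 1
```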

DSP-implementation gotchas:

  1. SOS/biquad implementation are better than transfer function implementations in practice due to robustness in overflow and quantization error. For example, see scipy issue.
  2. When implementing a cascade of filters in DSP, it's important to think about and set the initial conditions for the 2nd and later stages. Each of these filter stages should have its initial condition set to the step-response steady state of the previous filter, and so on.

Jun 6, 2022 - Physiology of endurance training

My newest hobby is high-altitude mountaineering. There's so much different advice out there about how to improve endurance and speed, to go faster for longer. Information from runners, climbers, and cyclists can be very different.

I got nerdy and dug into the physiology and evaluated the main training methods:

  • Long-duration low-intensity (LD)
  • Interval/tempo-runs/maximal-steady-state (Interval)
  • High intensity interval training (HIIT)

To find which one, or which combination, is the optimal training solution for climbing big mountains, along with other training techniques.

Here is my summary. Interestingly, I found the recommended polarized training combination of LD and HIIT to be the same as the one I experienced during college crew, and also similar to Nimsdai's training (he doesn't do much HIIT though).

If you have time, go read the Training for the New Alpinism book; it's legit and it passes my personal physiology check.

Now I can get ahead with my training without constantly wondering if I'm doing the most optimal things (why don't you get a coach, d00d?!)

Jan 28, 2022 - Challenge Point Framework for motor learning

Challenge Point: A framework for conceptualizing the effects of various practice conditions in motor learning

The challenge point framework is a generalization of the 85% optimal learning rule result. Perhaps it's better to call the 85% optimal learning rule a prediction within the challenge point framework (which came before it).

Essentials

The Challenge Point Framework (CPF) provides a conceptual framework for factors influencing motor skill learning (unclear if it applies to cognitive learning, but some empirical observations in education and language learning seem to suggest so).

The main interacting factors in this framework are:

  • Skill level of the learner regarding a specific task
  • Task difficulty
  • Practice environment

The CPF suggests that as one's skill in a task increases, learning is optimized when task difficulty is increased as well. This relationship is explained by a higher ability to utilize additional information for learning as skill increases, and additional information improves learning. In contrast, extra information presented to someone with a low skill level, and thus a low ability to utilize information, impedes the learning process by overwhelming the cognitive resources available during learning.

A key idea is that increasing task difficulty is associated with increasing information available to the learner for the following reasons:

  1. Model someone highly skilled in a motor task as having a very high expected success for their movement plan. In this case, a negative result would yield more information about the learner's internal model. In contrast, a positive result does not provide much useful information.
  2. When task difficulty is low, he would expect success, therefore learning is minimal.
  3. When task difficulty is high, likelihood for negative result is higher, therefore the potential information available to the learner is higher. More potential information implies more learning is possible.

A related way to see it is that "practice leads to redundancy, less uncertainty, and, hence, to reduced information". The more that practice leads to better expectations, the less information there will be to process.

Therefore, in this framework, factors that contribute to motor learning can be easily evaluated to predict their effects on skill performance and learning. Factors influencing learning include:

  1. Task difficulty: note this can be divided into inherent or nominal difficulty, and functional difficulty. A nominally difficult task can be made less functionally difficult by introducing helpful feedback, for example.
  2. Practice schedule (changes the functional difficulty): blocked, random, randomized block. Random practice schedule increases the functional difficulty wrt blocked practice schedule due to contextual interference.
  3. Feedback (also known as knowledge of result - KR) and feedback schedule:
  • More frequent feedback lowers functional difficulty
  • Random feedback schedule increases functional difficulty, compared to blocked schedule.

This framework also predicts an optimal challenge point, as a function of task difficulty and skill level, at which learning (or the availability of utilizable potential information) is maximized.

Details

Definitions

Nominal task difficulty: The difficulty of a particular task within the constraints of an experimental protocol. The nominal difficulty of a task is considered to reflect a constant amount of task difficulty, regardless of who is performing the task and under what conditions it is being performed. This makes the most sense in comparison with other skills; for example, kicking a ball 50 meters has more nominal difficulty than kicking it 1 meter, and less than kicking it 100 meters.

Functional task difficulty: How challenging the task is relative to the skill level of the individual performing the task and to the conditions under which it is being performed. Ex: kicking a ball 50 meters has the same nominal difficulty to amateur and pro, but different functional difficulty (i.e. success rate).

Practically, nominal task difficulty is probably not important to think about.

Assumptions

Learning is a problem-solving process in which the goal of an action represents the problem to be solved and the evolution of a movement configuration represents the performer's attempt to solve the problem.

Sources of information available during and after each attempt are remembered and form the basis for learning, resulting in improved skill -- this is practice.

Two sources of information are critical for learning: the action plan (known a priori to the learner), and feedback (obtained during or after the attempt).

Optimal challenge point

In CPF, learning is directly related to the information available and interpretable in a performance instance, which, in turn is tied to the functional difficulty of the task. The central thesis is then:

Information represents a challenge to the performer and that when information is present, there is potential to learn from it

Subsequent corollaries are:

  1. Learning cannot occur in the absence of information
  2. Learning will be impeded in the presence of too much information (too much challenge, cognitive overload).
  3. Learning achievement depends on an optimal amount of information, which differs as a function of the skill level of the individual.

Therefore, the factors contributing to functional task difficulty interact to dictate the optimal amount of interpretable information, and thus the potential for learning.

Corollary 2 derives from the observation that if information is to result in learning, it must be interpretable. The total amount of information one can interpret is governed by one's information-processing capabilities, which change with practice.

As skill improves, the expectation for performance becomes more challenging. So to generate a challenge for learning, one must obtain increased information, which can arise only from an increase in the functional task difficulty. Luckily, both information-processing capability and skill level increases with practice.

optimal_challenge_points

CPF_interactions

Predictions of CPF

  1. Practice variables that influence action planning information via contextual interference (most often random practice schedule is proxy for contextual interference):
  • For tasks with differing levels of nominal difficulty, the advantage of random practice (vs. blocked practice) for learning will be largest for tasks of lowest nominal difficulty and smallest for tasks of highest nominal difficulty.
  • For individuals with differing skill levels, low levels of CI will be better for beginning skill levels and higher levels of CI will be better for more highly skilled individuals (via increasing functional difficulty).
  • Modeled information (i.e. examples and prior demos) decreases functional difficulty.
  2. Practice variables that influence feedback information (knowledge of result, among other things):
  • For tasks of high nominal difficulty, more frequent or immediate presentation of KR, or both, will yield the largest learning effect. For tasks of low nominal difficulty, less frequent or immediate presentation of KR, or both, will yield the largest learning effect
  • For tasks about which multiple sources of augmented information can be provided, the schedule of presenting the information will influence learning. For tasks of low nominal difficulty, a random schedule of augmented feedback presentation will facilitate learning as compared with a blocked presentation. For tasks high in nominal difficulty, a blocked presentation will produce better learning than a random schedule.

How to apply?

  1. Athletic skills are perhaps the most obvious example: boxers practice individual punches first (blocked practice schedule, frequent feedback, easy), before mixing them into combinations and sparring (random practice schedule, summary/infrequent feedback, difficult).
  2. Tutorials: Introduce concepts one at a time (less information and less difficulty)