Meta Bets AI Can Read Minds. Scientists Watch
Meta AI unveiled TRIBE v2 on March 26, a multimodal model that predicts human brain responses to video, audio, and text. The company calls it a "foundation model for brain activity." Whether that label holds up is the real question.
TRIBE v2 — the name stands for Trimodal Brain Encoder — combines Meta's own LLaMA 3.2 language model, V-JEPA2 video model, and Wav2Vec-BERT audio model into a single transformer architecture that maps onto the cortical surface. It outputs predictions across roughly 20,000 vertices on the fsaverage5 brain mesh. The training set is a step change from its predecessor: more than 700 individuals and over 1,115 hours of fMRI recordings, compared to the original TRIBE, which was trained on fMRI data from just four subjects.
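The "roughly 20,000 vertices" figure can be checked with a little arithmetic. The sketch below relies on one assumption not stated in the announcement: that fsaverage5 is an icosahedron subdivided five times per hemisphere, which is how the FreeSurfer fsaverage meshes are usually described. The function name is illustrative only.

```python
def icosphere_vertices(order: int) -> int:
    """Vertex count of an icosahedron subdivided `order` times: 10 * 4**order + 2."""
    return 10 * 4 ** order + 2


# Assumption: fsaverage5 corresponds to subdivision order 5, one mesh per hemisphere.
per_hemisphere = icosphere_vertices(5)
whole_cortex = 2 * per_hemisphere  # left + right hemispheres

print(per_hemisphere, whole_cortex)
```

Under that assumption, the two hemispheres together come to a little over 20,000 vertices, in line with the number Meta quotes.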
That four-to-700 scale jump is the part worth taking seriously. The original TRIBE model, a one-billion-parameter system, won the Algonauts 2025 brain encoding competition outright — 263 teams entered, and TRIBE beat them all by a meaningful margin, according to the competition results paper. The competition tested predicting fMRI responses across 1,000 whole-brain parcels while participants watched Friends episodes and feature films, including a silent black-and-white Charlie Chaplin film held out as an out-of-distribution test. TRIBE's key tricks were modality dropout during training and a parcel-specific ensembling scheme that weighted each sub-model by how well it performed on individual brain regions.
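The two tricks named above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the TRIBE implementation: the competition write-up is not quoted here, so the dropout probability, the use of Pearson correlation as the per-parcel score, and the softmax weighting with a temperature are all choices made for this example, and every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def modality_dropout(features, p_drop, rng):
    """Zero out whole modalities at random, always keeping at least one.

    features: dict mapping modality name -> array of shape (time, dim).
    """
    names = list(features)
    keep = rng.random(len(names)) >= p_drop
    if not keep.any():
        # Never drop everything: re-enable one modality at random.
        keep[rng.integers(len(names))] = True
    return {n: (features[n] if k else np.zeros_like(features[n]))
            for n, k in zip(names, keep)}


def parcel_scores(pred, target):
    """Pearson correlation per parcel; inputs have shape (time, parcels)."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + 1e-12
    return (p * t).sum(axis=0) / denom


def parcel_ensemble(test_preds, val_preds, val_target, temperature=0.1):
    """Weight each sub-model per parcel by its validation score (assumed softmax)."""
    scores = np.stack([parcel_scores(v, val_target) for v in val_preds])
    logits = scores / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # (models, parcels)
    combined = (weights[:, None, :] * test_preds).sum(axis=0)
    return combined, weights


# Toy demo on synthetic data: 3 sub-models, 50 time points, 4 parcels.
val_target = rng.standard_normal((50, 4))
test_target = rng.standard_normal((50, 4))
noise_levels = [0.5, 1.0, 2.0]
val_preds = np.stack([val_target + s * rng.standard_normal((50, 4)) for s in noise_levels])
test_preds = np.stack([test_target + s * rng.standard_normal((50, 4)) for s in noise_levels])

ensemble, weights = parcel_ensemble(test_preds, val_preds, val_target)
dropped = modality_dropout(
    {"text": np.ones((50, 8)), "audio": np.ones((50, 8)), "video": np.ones((50, 8))},
    p_drop=0.9,
    rng=rng,
)
```

The point of weighting per parcel rather than globally is that a sub-model can be strong in some brain regions and weak in others. Dropping whole modalities during training, meanwhile, forces the network to cope when one input stream is missing, as with the silent Chaplin film.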
TRIBE v2 generalizes to new individuals without retraining — it achieves a two-to-three-times improvement over previous methods on movies and audiobooks, according to Meta's announcement. The company claims a 70-fold increase in resolution over comparable systems. Those numbers are in the announcement. The paper is not yet on arXiv.
That matters. The 70-fold figure describes spatial resolution on an averaged cortical surface: specifically, BOLD signal mapped onto the fsaverage5 mesh. BOLD is a hemodynamic proxy for neural activity, not direct neural firing. It measures blood oxygenation, which begins to rise one to two seconds after neuronal firing and typically peaks several seconds later. Calling it a map of what the brain is "doing" is technically defensible, but it is also linguistically convenient.
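How sluggish the BOLD signal is can be illustrated with a textbook hemodynamic response function. The sketch below is not drawn from Meta's announcement: it assumes a double-gamma response with parameter values commonly used as defaults in fMRI analysis software, and convolves it with a single brief burst of simulated neural activity. All names and parameter values are assumptions made for illustration.

```python
import math

import numpy as np


def canonical_hrf(t, peak_shape=6.0, undershoot_shape=16.0, undershoot_ratio=1 / 6):
    """Double-gamma hemodynamic response (assumed textbook-style parameters)."""
    def gamma_pdf(x, shape):
        # Gamma density with unit scale, valid for x >= 0.
        return x ** (shape - 1) * np.exp(-x) / math.gamma(shape)

    return gamma_pdf(t, peak_shape) - undershoot_ratio * gamma_pdf(t, undershoot_shape)


dt = 0.1                       # seconds per sample
t = np.arange(0, 30, dt)       # 30-second window
hrf = canonical_hrf(t)

# One brief burst of simulated neural activity at t = 1 s.
neural = np.zeros_like(t)
event_index = 10
neural[event_index] = 1.0

# Predicted BOLD signal: neural activity convolved with the response function.
bold = np.convolve(neural, hrf)[: len(t)]
peak_lag = t[np.argmax(bold)] - t[event_index]

print(f"BOLD response peaks about {peak_lag:.1f} s after the neural event")
```

With these assumed parameters, the simulated response peaks around five seconds after the event, which is why fMRI-based models describe a delayed and smoothed echo of neural activity rather than the firing itself.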
There's another caveat built into the model itself: TRIBE v2 predicts responses for the average subject, not for any individual. The averaged-brain output is standard in neuroscience encoding models — it smooths out individual variation to reveal population-level patterns. But it means the model is predicting what a statistical composite brain does, not what your brain does. Whether that distinction matters depends on the application, and Meta is clearly hoping it doesn't become a dealbreaker.
The "foundation model" framing is the most aggressive claim in the announcement. The term implies something analogous to GPT-4 or CLIP: a single system that can be adapted to many tasks without task-specific training. TRIBE v2 can generalize to unseen individuals without retraining, which is genuinely novel for brain encoding. But whether it has the broad, adaptable generality that "foundation model" implies for language or vision, or proves brittle outside the conditions it was trained on, is an open question that the preprint, once published, should begin to answer.
The paper, titled "A foundation model of vision, audition, and language for in-silico neuroscience," lists Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brookes, Katelyn Begany, Joséphine Raugel, Hubert Banville, and Jean-Rémi King as authors. King, a Meta FAIR researcher, has published extensively on brain encoding models; this line of work traces back to his earlier language-to-cortex mapping research.
TRIBE v2 is out under a CC BY-NC-4.0 license, with model weights, code, and a demo available on GitHub. The non-commercial restriction will limit commercial adoption but leaves the door open for any researcher willing to work within that scope.
The honest summary: Meta's FAIR team built a brain encoding model at a scale that makes prior work look like a pilot study. The generalization result — predicting new individuals without retraining — is the most interesting technical claim, and it deserves scrutiny. "Foundation model for brain activity" is branding. The paper will tell us how much substance is underneath it.