In this tutorial, we explore how we can decode linguistic features directly from brain signals using a modern neuroAI pipeline. We work with MEG data and build an end-to-end system that transforms raw neural activity into meaningful predictions: in this case, estimating word length from brain responses. We set up the environment, load and process neural events, design a custom feature extractor, and construct a structured data pipeline using NeuralSet. From there, we train a convolutional neural network to learn patterns in the temporal and spatial structure of MEG signals. Throughout the process, we focus on building a clean, modular workflow that mirrors real-world neuroAI research practices.
```python
import subprocess, sys, importlib, pkgutil

def pip_install(*pkgs):
    print(f"pip install {' '.join(pkgs)} ...")
    r = subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs],
                       capture_output=True, text=True)
    if r.returncode != 0:
        print("pip STDOUT:", r.stdout[-2000:])
        print("pip STDERR:", r.stderr[-2000:])
        raise RuntimeError("pip install failed; see output above.")
    print("  ok")

pip_install("numpy>=2.0,<2.3")
pip_install("neuralset")
pip_install("neuralfetch")

import numpy as np
print(f"numpy {np.__version__} OK")

import warnings, typing as tp
warnings.filterwarnings("ignore")

import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

import neuralset as ns
from neuralset import extractors as ext_mod
```
We install and validate all required dependencies, ensuring critical packages such as NumPy and NeuralSet are properly configured. We perform a quick NumPy check to avoid runtime issues later in the pipeline. We then import all core libraries needed for data processing, modeling, and visualization.
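The install-and-validate pattern above can also be done with the standard library alone. Below is a minimal, hypothetical sketch (the helper names `parse_version` and `check_requirement` are ours, not part of NeuralSet) that verifies an installed package satisfies a minimum version before the pipeline proceeds:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """'2.1.3' -> (2, 1, 3); stops at the first non-numeric component."""
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

def check_requirement(pkg: str, minimum: tuple) -> bool:
    """True if `pkg` is installed at >= `minimum` (compared as version tuples)."""
    try:
        return parse_version(version(pkg)) >= minimum
    except PackageNotFoundError:
        return False

# numpy was pinned to >=2.0,<2.3 in the install step above
print("numpy >= 2.0 :", check_requirement("numpy", (2, 0)))
print("python       :", sys.version_info[:2])
```

Failing fast on a bad environment here is cheaper than hitting an ABI mismatch deep inside the training loop.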
```python
def deep_import(pkg_name: str):
    try:
        pkg = importlib.import_module(pkg_name)
    except Exception as e:
        print(f"⚠️ could not import {pkg_name}: {e}")
        return
    if not hasattr(pkg, "__path__"):
        return
    for m in pkgutil.walk_packages(pkg.__path__, prefix=pkg_name + "."):
        try:
            importlib.import_module(m.name)
        except Exception:
            pass

deep_import("neuralfetch")
deep_import("neuralset")

torch.manual_seed(0); np.random.seed(0)

catalog = ns.Study.catalog()
print(f"\n{len(catalog)} studies registered.")

preferred = ["Fake2025Meg", "Test2025Meg", "Test2023Meg"]
study_name = next((n for n in preferred if n in catalog), None)
if study_name is None:
    meg_studies = [n for n, c in catalog.items() if "Meg" in c.neuro_types()]
    study_name = meg_studies[0] if meg_studies else None
if study_name is None:
    raise RuntimeError(
        "No MEG study available. Catalog: "
        f"{sorted(catalog.keys())[:20]}… "
        "Install neuralfetch correctly (pip install neuralfetch) and re-run."
    )
print(f"→ Using study: {study_name}")
```
We dynamically import all submodules from NeuralFetch and NeuralSet to ensure that all available studies are properly registered. We seed the random number generator for reproducibility and inspect the study catalog to identify available MEG datasets. We then select an appropriate study to use as the foundation for our pipeline.
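The `walk_packages`-based eager import used by `deep_import` can be exercised on any package that has submodules. Here is a small, self-contained sketch on the standard library's `json` package (nothing NeuralFetch-specific is assumed); the name `list_submodules` is ours:

```python
import importlib, pkgutil

def list_submodules(pkg_name: str) -> list:
    """Import a package, then walk and import every submodule under it."""
    pkg = importlib.import_module(pkg_name)
    if not hasattr(pkg, "__path__"):      # plain module: nothing to walk
        return []
    names = []
    for m in pkgutil.walk_packages(pkg.__path__, prefix=pkg_name + "."):
        names.append(m.name)
        try:
            importlib.import_module(m.name)   # side effect: runs any registration code
        except Exception:
            pass                              # tolerate optional deps, as deep_import does
    return names

print(list_submodules("json"))
```

The point of importing every submodule is the side effects: libraries that register plugins or studies at import time (as NeuralFetch does here) only populate their catalogs once the defining module has actually been executed.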
```python
class CharCount(ext_mod.BaseStatic):
    event_types: tp.Literal["Word"] = "Word"

    def get_static(self, event) -> torch.Tensor:
        return torch.tensor([float(len(event.text))], dtype=torch.float32)

print("\nBuilding chain...")
chain = ns.Chain(steps=[
    {"name": study_name, "path": str(ns.CACHE_FOLDER)},
    {"name": "QueryEvents", "query": "type in ['Word', 'Meg']"},
])
events = chain.run()
print(f" → {len(events)} events; types={sorted(events.type.unique().tolist())}")
print(f" → Words: {(events.type=='Word').sum()} | "
      f"timelines: {events.timeline.nunique()}")

print("\nSample words:")
print(events[events.type=='Word'][["start","duration","text","timeline"]]
      .head(5).to_string(index=False))

print("\nBuilding segmenter...")
segmenter = ns.dataloader.Segmenter(
    extractors={
        "meg": {"name": "MegExtractor", "frequency": 100.0},
        "char_count": CharCount(aggregation="trigger"),
    },
    trigger_query="type == 'Word'",
    start=-0.2,
    duration=0.8,
    drop_incomplete=True,
)
dataset = segmenter.apply(events)
print(f" → SegmentDataset: {len(dataset)} segments")

s0 = dataset[0]
print(f"\nSingle item:\n meg : {tuple(s0.data['meg'].shape)}")
print(f" char_count : {s0.data['char_count'].item()} "
      f"(word: {s0.segments[0].trigger.text!r})")
```
We define a custom extractor that computes the character count of each word event, enabling us to create a supervised learning target. We build a processing chain to load and filter relevant events from the selected study. We then segment the MEG signals around word events and construct a dataset ready for modeling.
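To make the windowing concrete: at the extractor's 100 Hz sampling rate, a segment spanning `start=-0.2` to `start + duration = +0.6` seconds around each word onset covers 80 samples. A plain-NumPy sketch of that index arithmetic, independent of the Segmenter API (the helper `segment_indices` is hypothetical, for illustration only):

```python
import numpy as np

sfreq = 100.0              # Hz, matching the MegExtractor setting above
start, duration = -0.2, 0.8

def segment_indices(trigger_time: float) -> np.ndarray:
    """Sample indices covering [trigger+start, trigger+start+duration)."""
    first = int(round((trigger_time + start) * sfreq))
    n = int(round(duration * sfreq))
    return np.arange(first, first + n)

idx = segment_indices(trigger_time=1.5)   # word onset at t = 1.5 s
print(len(idx), idx[0] / sfreq, idx[-1] / sfreq)
```

A trigger at 1.5 s therefore yields 80 samples running from 1.3 s to just under 2.1 s; `drop_incomplete=True` discards triggers whose window would extend past the recording's edges.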
```python
rng = np.random.RandomState(42)
perm = rng.permutation(len(dataset))
n_tr, n_va = int(0.70*len(dataset)), int(0.15*len(dataset))
train_ds = dataset.select(perm[:n_tr])
val_ds   = dataset.select(perm[n_tr:n_tr+n_va])
test_ds  = dataset.select(perm[n_tr+n_va:])
print(f"\nSplit | train={len(train_ds)} val={len(val_ds)} test={len(test_ds)}")

mk = lambda d, sh: DataLoader(d, batch_size=32, shuffle=sh,
                              collate_fn=d.collate_fn, drop_last=False)
train_loader, val_loader, test_loader = mk(train_ds, True), mk(val_ds, False), mk(test_ds, False)

probe = next(iter(train_loader))
n_ch, n_t = probe.data["meg"].shape[-2:]
print(f" → batch[meg] shape: {tuple(probe.data['meg'].shape)}")
print(f" → batch[char] shape: {tuple(probe.data['char_count'].shape)}")

class MEGDecoder(nn.Module):
    def __init__(self, n_channels: int, mid: int = 64):
        super().__init__()
        self.spatial   = nn.Conv1d(n_channels, mid, 1)
        self.bn0       = nn.BatchNorm1d(mid)
        self.temporal1 = nn.Conv1d(mid, mid, 7, padding=3)
        self.bn1       = nn.BatchNorm1d(mid)
        self.temporal2 = nn.Conv1d(mid, mid//2, 7, padding=3)
        self.bn2       = nn.BatchNorm1d(mid//2)
        self.pool      = nn.AdaptiveAvgPool1d(1)
        self.head      = nn.Linear(mid//2, 1)
        self.drop      = nn.Dropout(0.3)

    def forward(self, x):
        x = F.gelu(self.bn0(self.spatial(x)))
        x = F.gelu(self.bn1(self.temporal1(x)))
        x = self.drop(x)
        x = F.gelu(self.bn2(self.temporal2(x)))
        return self.head(self.pool(x).squeeze(-1)).squeeze(-1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MEGDecoder(n_channels=n_ch).to(device)
print(f"\nDevice: {device} | params: {sum(p.numel() for p in model.parameters()):,}")

train_targets = torch.cat([b.data["char_count"].squeeze(-1) for b in train_loader])
y_mean, y_std = train_targets.mean().item(), train_targets.std().item() + 1e-6
print(f"Target μ={y_mean:.2f} σ={y_std:.2f}")

def prep(batch):
    x = batch.data["meg"].to(device).float()
    y = batch.data["char_count"].squeeze(-1).to(device).float()
    x = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-6)
    y = (y - y_mean) / y_std
    return x, y
```
We split the dataset into training, validation, and test sets to ensure proper model evaluation. We create data loaders and inspect batch shapes to confirm correct data formatting. We then define a convolutional neural network and prepare normalized inputs and targets for stable training.
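The normalization inside `prep()` z-scores each channel of each segment over its own time axis, which keeps channels with very different baseline amplitudes on a common scale. A standalone torch sketch on dummy data, showing that the result has mean near 0 and standard deviation near 1 per channel:

```python
import torch

# dummy (batch, channels, time) MEG-shaped tensor with a deliberate offset and scale
x = torch.randn(4, 8, 80) * 3.0 + 5.0

# same per-channel, per-segment z-scoring as in prep() above
x_norm = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-6)

print(x_norm.mean(-1).abs().max())   # near 0 for every (batch, channel) pair
print(x_norm.std(-1).mean())         # near 1
```

Note that the targets use global statistics (`y_mean`, `y_std` computed once over the training set) rather than per-batch statistics, so that validation and test targets are scaled identically to training.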
```python
EPOCHS = 15
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)
loss_fn = nn.MSELoss()
hist = {"tr": [], "va": [], "r": []}

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a*b).sum() / (a.norm()*b.norm() + 1e-8)

print("\n" + "="*64)
print(f"{'Epoch':>5} | {'train':>9} | {'val':>9} | {'val_r':>7}")
print("="*64)
for ep in range(EPOCHS):
    model.train(); tr = []
    for batch in train_loader:
        x, y = prep(batch)
        loss = loss_fn(model(x), y)
        opt.zero_grad(); loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step(); tr.append(loss.item())
    sched.step()

    model.eval(); va, P, T = [], [], []
    with torch.no_grad():
        for batch in val_loader:
            x, y = prep(batch); p = model(x)
            va.append(loss_fn(p, y).item()); P.append(p.cpu()); T.append(y.cpu())
    P, T = torch.cat(P), torch.cat(T)
    r = pearson(P, T).item()
    hist["tr"].append(np.mean(tr)); hist["va"].append(np.mean(va)); hist["r"].append(r)
    print(f"{ep+1:>5d} | {np.mean(tr):>9.4f} | {np.mean(va):>9.4f} | {r:>+7.3f}")

model.eval(); P, T = [], []
with torch.no_grad():
    for batch in test_loader:
        x, y = prep(batch)
        P.append(model(x).cpu()); T.append(y.cpu())
P, T = torch.cat(P), torch.cat(T)
test_r = pearson(P, T).item()
test_mse = ((P - T) ** 2).mean().item()
print(f"\nTEST | Pearson r = {test_r:+.3f} MSE = {test_mse:.3f}")
print("(Synthetic-MEG signals are random by design — small/zero r is expected.)")

fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].plot(hist["tr"], label="train"); ax[0].plot(hist["va"], label="val")
ax[0].set(xlabel="Epoch", ylabel="MSE", title="Loss curves"); ax[0].legend(); ax[0].grid(alpha=.3)
ax[1].plot(hist["r"], color="C2"); ax[1].axhline(0, color="k", ls="--", alpha=.4)
ax[1].set(xlabel="Epoch", ylabel="Pearson r", title="Validation correlation"); ax[1].grid(alpha=.3)
m = float(max(T.abs().max(), P.abs().max()))
ax[2].scatter(T.numpy(), P.numpy(), s=10, alpha=.35)
ax[2].plot([-m, m], [-m, m], "k--", alpha=.4)
ax[2].set(xlabel="True (z-scored char count)", ylabel="Predicted",
          title=f"Test predictions (r = {test_r:+.3f})"); ax[2].grid(alpha=.3)
plt.tight_layout(); plt.show()

print("\n✅ Tutorial complete!")
print(f" • Study used        : {study_name}")
print(" • Pipeline          : Chain → Segmenter → SegmentDataset → DataLoader")
print(" • Custom extractor  : CharCount (subclass of BaseStatic)")
print(" • Built-in extractor: MegExtractor @ 100 Hz")
print(" • Model             : 1×1 spatial conv + 2 temporal convs + linear head")
```
We train the neural network using a structured training loop with loss tracking and learning rate scheduling. We evaluate the model on the validation and test sets using metrics such as MSE and Pearson’s correlation. Also, we visualize training performance and predictions to understand how well the model learns from the data.
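As a sanity check on the evaluation metric, the hand-rolled `pearson()` used above should agree with NumPy's `corrcoef`. A standalone sketch on synthetic "predictions" with a known linear relation to the targets:

```python
import numpy as np
import torch

torch.manual_seed(0)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

t = torch.linspace(0, 1, 100)            # "true" targets
p = 2.0 * t + 0.1 * torch.randn(100)     # "predictions": a noisy linear map of t

r_torch = pearson(p, t).item()
r_numpy = float(np.corrcoef(p.numpy(), t.numpy())[0, 1])
print(round(r_torch, 4), round(r_numpy, 4))   # the two values should agree closely
```

Because Pearson's r is invariant to the affine scaling applied in `prep()`, the reported correlations are the same whether they are computed on raw or z-scored character counts; MSE, in contrast, is reported in z-scored units.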
In conclusion, we demonstrated how we can bridge neural data and language understanding using deep learning. We implemented a full pipeline, from raw event extraction to model training and evaluation, while maintaining flexibility through reusable components like chains, segmenters, and extractors. Although we worked with synthetic MEG signals, the framework we built is directly applicable to real-world datasets and more complex decoding tasks. This exercise highlights how we can combine neuroscience, machine learning, and structured pipelines to advance interpretable brain decoding systems, laying a strong foundation for more advanced neuroAI applications.
Sana Hassan
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


