Immune Receptor Rearrangement Model-based enCOder DECoder (IRRM-CODEC).
irrm-codec contains two neural models for working with TCR CDR3 amino-acid sequences and TCRemP embeddings:
- forward model: predicts a TCRemP embedding from CDR3 sequence input
- inverse model: reconstructs a CDR3 sequence from a TCRemP embedding
The repository is organized as a small training package with Python entrypoints, Slurm launchers, and analysis notebooks for both single-run and multi-chain workflows.
- irrm_codec/: package with data loading, tokenization, datasets, models, losses, utilities and training entrypoints
- scripts/: operational launchers; train_forward.sh and train_inverse.sh are Slurm array jobs for the seven background-100k chains, while calc_pgen_1mm.sh and train_pgen.sh are local bash wrappers around the Python modules
- slurm/: additional single-job and array-job sbatch examples plus slurm/README.md with cluster usage notes
- notebooks/: runnable notebooks for training walkthroughs, artifact inspection, pgen analysis, and cross-chain summaries
- notebooks/projects/immunestatus/vdjrearm/airr_format/: checked-in AIRR examples used by some notebooks as local fallback inputs
- artifacts/: default output directory for checkpoints and run metadata
Create the conda environment:
conda create -n irrm-codec python=3.11 -y
conda activate irrm-codec
pip install -r requirements.txt
requirements.txt pins torch==2.4.1 and adds the PyTorch cu121 wheel index to avoid pulling newer CUDA 13 builds that may require a newer NVIDIA driver.
For pgen workflows, requirements.txt installs mirpy-lib directly from the
antigenomics/mirpy GitHub repository because the required API is newer than the
currently usable PyPI release in this workflow.
Update the environment after dependency changes:
pip install -r requirements.txt --upgrade
Register the environment as a Jupyter kernel:
python -m ipykernel install --user --name irrm-codec --display-name "Python (irrm-codec)"
Project dependencies, including notebook packages, are installed from requirements.txt.
Training expects two separate input files, following the same general idea as the tcrempnet workflow.
Accepted formats:
.tsv.airr.csv.parquet
Required columns:
- junction_aa
- v_call
- j_call
- locus
Optional:
- clone_id
Required for the embeddings input:
- parquet file
Supported embedding layouts:
- one column with vector values, for example tcremp_emb
- many numeric embedding columns plus clone_id
If AIRR contains clone_id, the AIRR table and embeddings table are merged by clone_id. If AIRR does not contain clone_id but the two tables have the same number of rows, embeddings are matched to AIRR rows by row order.
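The alignment rule above can be sketched with pandas. This is a minimal illustration, not the package's loader: the function name `align_airr_and_embeddings` is hypothetical, and falling back to row order when only one table carries clone_id is an assumption of the sketch.

```python
import pandas as pd


def align_airr_and_embeddings(airr: pd.DataFrame, emb: pd.DataFrame,
                              clone_id_col: str = "clone_id") -> pd.DataFrame:
    """Attach embeddings to AIRR rows by clone_id, else by row order."""
    if clone_id_col in airr.columns and clone_id_col in emb.columns:
        # Inner join keeps only clonotypes present in both tables.
        return airr.merge(emb, on=clone_id_col, how="inner")
    if len(airr) == len(emb):
        # No shared key: pair row i of AIRR with row i of the embeddings.
        return pd.concat(
            [airr.reset_index(drop=True), emb.reset_index(drop=True)], axis=1
        )
    raise ValueError("no shared clone_id and row counts differ; cannot align")


# Toy tables: the clone_id order differs on purpose.
airr = pd.DataFrame({"junction_aa": ["CASSF", "CASSL"], "clone_id": [2, 1]})
emb = pd.DataFrame({"clone_id": [1, 2], "tcremp_emb": [[0.1, 0.2], [0.3, 0.4]]})
merged = align_airr_and_embeddings(airr, emb)

# Same data without clone_id falls back to row order.
by_order = align_airr_and_embeddings(
    airr.drop(columns="clone_id"), emb.drop(columns="clone_id")
)
```

Row-order matching is only safe when both files were written from the same table without reordering, which is why a shared clone_id is the more robust option.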
Run the training modules directly for local or notebook-driven experiments:
python -m irrm_codec.train_forward \
--airr-path data/sample_airr.tsv \
--embeddings-path data/sample_embeddings.parquet \
--locus alpha \
--output-dir artifacts/forward

python -m irrm_codec.train_inverse \
--airr-path data/sample_airr.tsv \
--embeddings-path data/sample_embeddings.parquet \
--locus alpha \
--output-dir artifacts/inverse

For cluster runs, use the Slurm launchers:
- scripts/train_forward.sh: 7-chain array job for forward training on the background-100k layout
- scripts/train_inverse.sh: 7-chain array job for inverse training on the background-100k layout
- slurm/train_forward.sbatch: single forward run with overridable paths and hyperparameters
- slurm/train_pgen_background_100k_array.sbatch: array launcher for one pgen regressor per chain after pgen preprocessing
Examples:
sbatch scripts/train_forward.sh
sbatch scripts/train_inverse.sh
Useful optional flags:
- --clone-id-col clone_id
- --embedding-column tcremp_emb
- --wandb-project irrm-codec
- --wandb-entity ...
- --wandb-run-name ...
- --wandb-dir ...
- --wandb-mode online|offline|disabled
- --max-len 40
- --batch-size ...
- --epochs ...
- --seed 42
- --log-interval 10
- --no-progress
Use the dedicated module to compute 1-mismatch pgen values through mirpy's
mir.basic.pgen.OlgaModel.compute_pgen_junction_aa_1mm.
Example:
python -m irrm_codec.calc_pgen_1mm \
--airr-path data/sample_airr.tsv \
--output-path artifacts/pgen/sample_airr_pgen.tsv \
--chain TRB \
--species human \
--locus beta \
--threads 8 \
--chunk-size 1000
Notes:
- --threads controls how many independent worker processes run in parallel.
- Each worker gets its own contiguous part of the filtered AIRR table, reads the AIRR file inside the child process, and never receives a shared in-memory sequence list from the parent.
- --chunk-size controls how many sequences a worker processes before flushing an intermediate result chunk to disk.
- Completed chunks are reused on rerun, so a killed job resumes from the last successful on-disk save.
- The output table keeps the original AIRR columns and appends pgen_1mm and log10_pgen_1mm.
- The code is compatible with the current mirpy-lib package and falls back to a local checkout at ../mirpy when needed.
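The partitioning and resume behaviour described in the notes can be pictured with the sketch below. It is an illustration only: the helper names and the `chunk_<start>_<stop>.tsv` file naming are hypothetical, and the real module may split rows and name intermediate files differently.

```python
from pathlib import Path


def worker_ranges(n_rows: int, n_workers: int) -> list[tuple[int, int]]:
    """Split [0, n_rows) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def pending_chunks(start: int, stop: int, chunk_size: int,
                   chunk_dir: Path) -> list[tuple[int, int]]:
    """Return the chunks of one worker range that have no result file yet."""
    todo = []
    for lo in range(start, stop, chunk_size):
        hi = min(lo + chunk_size, stop)
        # Hypothetical naming scheme for intermediate chunk files.
        if not (chunk_dir / f"chunk_{lo}_{hi}.tsv").exists():
            todo.append((lo, hi))
    return todo
```

On a rerun, each worker would compute only the chunks returned by `pending_chunks`, so at most one chunk of work per worker is lost when a job is killed.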
The matching local bash wrapper is:
AIRR_PATH=data/sample_airr.tsv \
OUTPUT_PATH=artifacts/pgen/sample_airr_pgen.tsv \
CHAIN=TRB \
LOCUS=beta \
bash scripts/calc_pgen_1mm.sh
Use the dedicated training module to fit a scalar regressor from CDR3 sequence to log10_pgen_1mm.
Example:
python -m irrm_codec.train_pgen \
--airr-path artifacts/pgen/sample_airr_pgen.tsv \
--output-dir artifacts/pgen_model/trb \
--target-col log10_pgen_1mm \
--locus beta \
--batch-size 256 \
--epochs 40
Notes:
- The model reuses the sequence encoder structure from the forward model and predicts one scalar per clonotype.
- The training target is taken directly from the AIRR table, so pgen preprocessing must be run first.
- Metrics include Huber loss, RMSE, and MAE.
- W&B logging can be configured with --wandb-project, --wandb-entity, --wandb-run-name, --wandb-dir, and --wandb-mode.
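For readers unfamiliar with the Huber loss named in the metrics note, here is a small NumPy sketch of its standard definition. The threshold `delta = 1.0` is an assumption (it matches the default of PyTorch's `torch.nn.HuberLoss`); the value used by the training module is not stated here.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    quadratic = 0.5 * err ** 2                 # used where |error| <= delta
    linear = delta * (err - 0.5 * delta)       # used where |error| > delta
    return float(np.mean(np.where(err <= delta, quadratic, linear)))


# Errors of 0.5 and 2.0 contribute 0.125 and 1.5 respectively.
example = huber_loss([0.0, 0.0], [0.5, 2.0])
```

Compared with plain squared error, the linear branch limits the influence of a few clonotypes with extreme log10_pgen_1mm residuals.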
The matching local bash wrapper is:
AIRR_PATH=artifacts/pgen/sample_airr_pgen.tsv \
OUTPUT_DIR=artifacts/pgen_model/trb \
LOCUS=beta \
bash scripts/train_pgen.sh
The forward and inverse training modules:
- load AIRR and embeddings from separate files
- align them by clone_id or by row order when clone_id is absent and row counts match
- filter by locus
- validate sequence and embedding inputs
- split data into train, validation and test subsets
- fit embedding normalization on the train split only
- save normalization parameters for later inference
- save both the best and the latest model checkpoints
- write per-epoch history and final test metrics
- log train, validation, and test metrics to Weights & Biases
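The normalization steps in the list above (fit on the train split only, save the parameters, reuse them at inference) can be sketched as follows. Per-dimension statistics, the `eps` guard, and the function names are assumptions of this sketch; only the mean.npy and std.npy file names come from this README.

```python
import tempfile
from pathlib import Path

import numpy as np


def fit_normalization(train_emb: np.ndarray, eps: float = 1e-8):
    """Per-dimension mean and std computed from the train split only."""
    mean = train_emb.mean(axis=0)
    std = train_emb.std(axis=0) + eps  # eps guards against constant dimensions
    return mean, std


def normalize(emb, mean, std):
    return (emb - mean) / std


def denormalize(z, mean, std):
    return z * std + mean


# Synthetic stand-ins for the train and test embedding matrices.
rng = np.random.default_rng(42)
train = rng.normal(loc=3.0, scale=2.0, size=(1000, 8))
test = rng.normal(loc=3.0, scale=2.0, size=(200, 8))

mean, std = fit_normalization(train)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    np.save(out_dir / "mean.npy", mean)
    np.save(out_dir / "std.npy", std)
    # At inference time the saved parameters are loaded, never refitted.
    loaded_mean = np.load(out_dir / "mean.npy")
    loaded_std = np.load(out_dir / "std.npy")

test_z = normalize(test, loaded_mean, loaded_std)
restored = denormalize(test_z, loaded_mean, loaded_std)
```

Fitting on the train split alone keeps validation and test statistics out of the model inputs, so the reported test metrics are not affected by leakage through the normalization.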
The pgen regressor follows the same split, checkpoint, and metrics pattern, but it reads a single AIRR table with a target column instead of a separate embeddings parquet and it does not save embedding normalization arrays.
For remote runs, configure W&B with environment variables before launching training:
export WANDB_API_KEY=...
export WANDB_ENTITY=...
export WANDB_PROJECT=irrm-codec
If the server cannot reach the internet during training, you can log locally first and sync later:
export WANDB_MODE=offline
wandb sync path/to/wandb/offline-run
Forward and inverse runs write to their output directories, for example artifacts/forward or artifacts/inverse.
Saved files:
- best.pt: checkpoint with the best validation loss
- last.pt: checkpoint from the final epoch
- mean.npy: train-split embedding mean
- std.npy: train-split embedding standard deviation
- history.json: epoch-by-epoch training and validation metrics
- test_metrics.json: final metrics on the test split
- data_stats.json: dataset summary, merge statistics and artifact paths
Pgen runs follow the same layout except that they do not write mean.npy and std.npy.
If you ran training locally from the CLI or notebooks, look in the output directory you passed with --output-dir. The defaults and common examples in this repository are:
- forward demo run: artifacts/forward or artifacts/forward_demo_trb
- inverse demo run: artifacts/inverse or artifacts/inverse_demo_trb
- pgen demo run: artifacts/pgen_model/trb
If you ran the cluster launchers that are currently checked into the repo, the model roots are:
- forward background-100k array runs: /projects/immunestatus/vdjrearm/irrmcodec/forward_background_100k/<chain>
- inverse background-100k array runs: /projects/immunestatus/vdjrearm/irrmcodec/inverse_background_100k/<chain>
- pgen background-100k array runs: /projects/immunestatus/vdjrearm/irrmcodec/pgen_background_100k/<chain>
For example:
- /projects/immunestatus/vdjrearm/irrmcodec/forward_background_100k/trb
- /projects/immunestatus/vdjrearm/irrmcodec/inverse_background_100k/trb
- /projects/immunestatus/vdjrearm/irrmcodec/pgen_background_100k/trb
Supported chain directory names in the multi-chain runs are:
- igh
- igk
- igl
- tra
- trb
- trd
- trg
Inside each model directory, the main files to look for are best.pt and last.pt. The surrounding metadata is stored alongside them: history.json, test_metrics.json, data_stats.json, and for forward/inverse also mean.npy and std.npy.
The committed main branch already contains executed summary notebooks for:
- forward multi-chain evaluation: notebooks/forward_background_100k_multichain_analysis.ipynb
- inverse multi-chain evaluation: notebooks/inverse_background_100k_multichain_analysis.ipynb
- pgen regression quality and inference speed: notebooks/pgen_run_and_analysis.ipynb
Cross-chain summary over igh, igk, igl, tra, trb, trd, trg:
- mean recomputed test loss: 0.1741
- mean recomputed test cosine: 0.8871
- best cosine: trg = 0.9323, closely followed by trd = 0.9320 and igl = 0.9202
- lowest recomputed test loss: trd = 0.0805
- hardest chains in this sweep: igh (cosine = 0.7906) and trb (eval_test_loss = 0.2868)
| chain | epochs_ran | best_val_loss | eval_test_loss | eval_test_cosine |
|---|---|---|---|---|
| igh | 27 | 0.2669 | 0.2668 | 0.7906 |
| igk | 30 | 0.1968 | 0.1987 | 0.8712 |
| igl | 23 | 0.1411 | 0.1381 | 0.9202 |
| tra | 39 | 0.1473 | 0.1484 | 0.9035 |
| trb | 40 | 0.1871 | 0.2868 | 0.8599 |
| trd | 36 | 0.0775 | 0.0805 | 0.9320 |
| trg | 35 | 0.0981 | 0.0995 | 0.9323 |
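As a reading aid for the eval_test_cosine column, the sketch below shows one common way to compute a mean cosine similarity between predicted and target embeddings. Whether the notebook evaluates this in normalized or raw embedding space is not stated here, so treat the function as an illustration of the metric rather than the notebook's exact code.

```python
import numpy as np


def mean_cosine(pred, target, eps: float = 1e-12) -> float:
    """Average row-wise cosine similarity between two (n, d) arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    dot = np.sum(pred * target, axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
    # eps avoids division by zero for all-zero rows.
    return float(np.mean(dot / np.maximum(norms, eps)))


# One perfectly aligned pair and one orthogonal pair.
score = mean_cosine([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [1.0, 0.0]])
```

Cosine similarity ignores vector length, so it complements the loss column: a chain can have well-oriented predictions and still show a higher loss if magnitudes are off.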
Figure: training dynamics and test distributions across chains.
Cross-chain summary over the same seven chains:
- mean test loss: 0.0936
- mean token accuracy: 0.9723
- mean length accuracy: 0.9920
- mean exact match: 0.4962
- best exact-match chains: trg = 0.7658, trb = 0.7371, tra = 0.7112
- lowest exact-match chain: igh = 0.1614
| chain | loss | token_accuracy | length_accuracy | exact_match |
|---|---|---|---|---|
| igh | 0.2589 | 0.9199 | 0.9798 | 0.1614 |
| igk | 0.0958 | 0.9739 | 0.9963 | 0.3178 |
| igl | 0.0896 | 0.9740 | 0.9937 | 0.2766 |
| tra | 0.0369 | 0.9907 | 0.9977 | 0.7112 |
| trb | 0.0407 | 0.9884 | 0.9937 | 0.7371 |
| trd | 0.0985 | 0.9697 | 0.9895 | 0.5036 |
| trg | 0.0349 | 0.9895 | 0.9931 | 0.7658 |
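The three accuracy columns can be illustrated with the sketch below. The definitions are assumptions made for illustration (token accuracy over all fixed-length positions including gaps, length accuracy as agreement in the number of non-gap residues, exact match as identity of the whole padded sequence); the training code remains the authority on how the reported values are computed.

```python
def inverse_metrics(pred: list[str], true: list[str], gap: str = "-") -> dict[str, float]:
    """Token, length and exact-match accuracy for equal-length padded strings."""
    assert pred and len(pred) == len(true)
    n_tokens = token_hits = length_hits = exact_hits = 0
    for p, t in zip(pred, true):
        assert len(p) == len(t), "sequences must share the fixed length"
        n_tokens += len(t)
        token_hits += sum(a == b for a, b in zip(p, t))
        # Length is compared after removing gap symbols.
        length_hits += int(len(p.replace(gap, "")) == len(t.replace(gap, "")))
        exact_hits += int(p == t)
    n = len(true)
    return {
        "token_accuracy": token_hits / n_tokens,
        "length_accuracy": length_hits / n,
        "exact_match": exact_hits / n,
    }


# The second prediction drops one residue, which costs a single token.
metrics = inverse_metrics(["CASS--F", "CAS---F"], ["CASS--F", "CASS--F"])
```

Under these definitions a single wrong token in a 40-token sequence still gives 97.5% token accuracy for that sequence but an exact match of zero, which is why exact match sits far below token accuracy in the table.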
Figure: training dynamics, convergence of length/exact-match, and mismatch counts across chains.
Figure: letter-level substitution errors from the executed TRB inverse-analysis notebook.
Saved notebook metrics on the 10k test split:
- RMSE = 0.5914
- MAE = 0.4731
- bias = 0.4399
- Pearson r = 0.9874
- Spearman rho = 0.9861
- R^2 = 0.9437
| RMSE | MAE | bias | Pearson r | Spearman rho | R^2 | n_test |
|---|---|---|---|---|---|---|
| 0.5914 | 0.4731 | 0.4399 | 0.9874 | 0.9861 | 0.9437 | 10000 |
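A minimal NumPy sketch of how metrics of this kind are usually computed is given below. The sign convention for bias (prediction minus target) is an assumption, and Spearman rho is omitted to keep the example dependency-free.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict[str, float]:
    """RMSE, MAE, bias, Pearson r and R^2 for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true  # assumed convention: positive bias means over-prediction
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "bias": float(np.mean(err)),
        "pearson_r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Predictions shifted by a constant 0.5: perfectly correlated, yet R^2 < 1.
shifted = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

In this toy case Pearson r is 1.0 while R^2 is 0.8, because a constant offset does not change correlation but does add squared error. A systematic offset is one possible reading of a table where R^2 is noticeably below the squared Pearson r and the bias is nonzero.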
The committed notebook benchmark was executed on CPU and reports throughput and per-sequence latency on the 10k TRB test split:
- batch 1: 324.0 seq/s, 3.086 ms/sequence
- batch 16: 4934.1 seq/s, 0.203 ms/sequence
- batch 64: 9705.4 seq/s, 0.103 ms/sequence
- batch 256: 11001.8 seq/s, 0.0909 ms/sequence
- batch 1024: 9230.5 seq/s, 0.1083 ms/sequence
The best throughput in this saved benchmark is at batch_size = 256.
| batch_size | seq_per_second | ms_per_sequence |
|---|---|---|
| 1 | 324.0 | 3.0860 |
| 16 | 4934.1 | 0.2027 |
| 64 | 9705.4 | 0.1030 |
| 256 | 11001.8 | 0.0909 |
| 1024 | 9230.5 | 0.1083 |
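The two benchmark columns are reciprocals of each other: ms_per_sequence = 1000 / seq_per_second. The sketch below shows a generic way to time batched inference; `predict_batch` is a hypothetical callable standing in for the model, and the notebook's actual timing code may differ (for example in warm-up or repetition).

```python
import time


def benchmark(predict_batch, sequences, batch_size: int) -> dict[str, float]:
    """Time one pass over `sequences` in batches of `batch_size`."""
    start = time.perf_counter()
    for i in range(0, len(sequences), batch_size):
        predict_batch(sequences[i:i + batch_size])
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-12)
    seq_per_second = len(sequences) / elapsed
    return {
        "batch_size": batch_size,
        "seq_per_second": seq_per_second,
        "ms_per_sequence": 1000.0 / seq_per_second,
    }


def dummy_predict(batch):
    # Stand-in for a model call; it only measures string lengths.
    return [len(seq) for seq in batch]


result = benchmark(dummy_predict, ["CASSLGQYF"] * 1000, batch_size=16)
```

Throughput typically rises with batch size until per-batch overhead is amortized and then flattens or drops, which is consistent with the peak at batch_size = 256 in the saved CPU run.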
Main notebooks live in notebooks/ and are split by workflow:
- notebooks/example_run_and_analysis.ipynb: forward-model walkthrough from input inspection to training, metrics, plots, and checkpoint restore
- notebooks/example_inverse_run_and_analysis.ipynb: inverse-model analogue of the forward walkthrough
- notebooks/pgen_run_and_analysis.ipynb: analysis notebook for trained sequence-to-log10_pgen_1mm models, including residual plots and inference timing
- notebooks/forward_background_100k_multichain_analysis.ipynb: aggregate evaluation for all saved forward background-100k runs
- notebooks/inverse_background_100k_multichain_analysis.ipynb: aggregate evaluation for all saved inverse background-100k runs
- notebooks/generate_datasets.ipynb: helper notebook for preparing prototype AIRR datasets for IGH, IGK, IGL, TRD, and TRG
- notebooks/trb_1kk.ipynb: compact TRB forward-run demo
Historical or scratch material:
- notebooks/example_run_and_analysis-Copy1.ipynb: older copy of the forward walkthrough kept for reference
If you want one place to start, use:
- forward training: notebooks/example_run_and_analysis.ipynb
- inverse training: notebooks/example_inverse_run_and_analysis.ipynb
- pgen model analysis and timing: notebooks/pgen_run_and_analysis.ipynb
There is currently no dedicated pytest suite in the repository, so validation is done through smoke checks plus notebook-backed end-to-end checks.
Recommended checks after pip install -r requirements.txt:
python -m compileall irrm_codec
python -m irrm_codec.train_forward --help
python -m irrm_codec.train_inverse --help
python -m irrm_codec.calc_pgen_1mm --help
python -m irrm_codec.train_pgen --help
Recommended workflow checks:
- Run one short forward training job with reduced epochs and confirm that best.pt, last.pt, history.json, test_metrics.json, and data_stats.json appear in the output directory.
- Run one short inverse training job and check that test_metrics.json contains token_accuracy, length_accuracy, and exact_match.
- Run calc_pgen_1mm on a small AIRR subset and confirm that the output table contains pgen_1mm and log10_pgen_1mm.
- Train the pgen regressor on that output and check that test_metrics.json contains rmse and mae.
- Open the matching notebook and verify that it can read the produced artifacts without manual file edits beyond the input-path cell.
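The file-presence part of these checks is easy to script. The helper below is a hypothetical convenience, not part of the package; the file names are the ones listed in this README, and `with_normalization=False` corresponds to pgen runs, which do not write mean.npy and std.npy.

```python
from pathlib import Path

COMMON_ARTIFACTS = (
    "best.pt", "last.pt", "history.json", "test_metrics.json", "data_stats.json",
)
NORMALIZATION_ARTIFACTS = ("mean.npy", "std.npy")


def missing_artifacts(output_dir, with_normalization: bool = True) -> list[str]:
    """List expected run files that are absent from `output_dir`."""
    expected = COMMON_ARTIFACTS + (NORMALIZATION_ARTIFACTS if with_normalization else ())
    return [name for name in expected if not (Path(output_dir) / name).is_file()]
```

For example, `missing_artifacts("artifacts/forward")` should return an empty list after a successful forward run, and `missing_artifacts("artifacts/pgen_model/trb", with_normalization=False)` after a pgen run.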
If you are validating the cluster wrappers instead of local modules:
- submit sbatch scripts/train_forward.sh or sbatch scripts/train_inverse.sh for the 7-chain array jobs
- submit the examples from slurm/README.md for single-run forward or pgen workflows
- inspect logs under slurm/logs/ or the cluster log directory configured in the script headers
- The inverse model predicts the full fixed-length token sequence of length 40 in parallel.
- The current pipeline expects one chain at a time and uses locus filtering to select it.
- By default, CDR3 sequences are converted to fixed length 40 before tokenization by inserting - gaps after residue 4 and before the last 3 residues.
- Training outputs are intentionally ignored by git via .gitignore.
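The fixed-length conversion can be pictured with the sketch below. It is one possible reading of the rule, not the package's tokenizer: how the gaps are divided between the two insertion sites (here split as evenly as possible, with the extra gap on the right) and the handling of sequences shorter than 7 or longer than the target length (here rejected) are assumptions. The function name `pad_cdr3` is hypothetical.

```python
def pad_cdr3(seq: str, max_len: int = 40, gap: str = "-") -> str:
    """Pad a CDR3 to `max_len` with gaps after residue 4 and before the last 3."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len}: {seq!r}")
    if len(seq) < 7:
        # Assumption: the 4-residue head and 3-residue tail must not overlap.
        raise ValueError(f"sequence shorter than 7 residues: {seq!r}")
    head, middle, tail = seq[:4], seq[4:-3], seq[-3:]
    n_gaps = max_len - len(seq)
    n_left = n_gaps // 2       # assumed even split between the two sites
    n_right = n_gaps - n_left
    return head + gap * n_left + middle + gap * n_right + tail


# A short target length keeps the example readable.
padded = pad_cdr3("CASSLGQYF", max_len=12)
recovered = padded.replace("-", "")
```

Keeping the conserved ends of the CDR3 at fixed positions means the variable middle absorbs all length variation, and the original sequence is recovered by removing the gap symbols.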




