Given noisy two-dimensional protein coordinates, infer the diffusion coefficient, anomalous exponent, and protein state, each of which can change at arbitrary points throughout the time series.
This project is for the Anomalous Diffusion Challenge 2024. It is a custom 3-stacked bidirectional LSTM model (with skip connections and dropout) developed by myself to infer diffusion coefficients, anomalous exponents, and states for noisy protein trajectories (fractional Brownian motion) undergoing an arbitrary number of changepoints. This is common when dealing with single molecule tracks from super-resolution fluorescence microscopy and informs us of changes in behaviour (e.g., confinement, clustering, directed motion, etc.). Features were selected from an extensive literature review and forward feature selection. Model finished in the Top 5.
This project is particularly interested in determining:
- $K$ (diffusion coefficient)
- $\alpha$ (anomalous exponent)
- Protein state:
  - 0: trapped
  - 1: confined
  - 2: freely diffusing
  - 3: directed motion
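The state encoding above can be expressed as a simple lookup (the helper name below is my own shorthand, not from the project code):

```python
# Integer state codes used by the classification output, per the list above.
STATE_LABELS = {
    0: "trapped",
    1: "confined",
    2: "freely diffusing",
    3: "directed motion",
}

def decode_states(states):
    """Map a sequence of integer state codes to human-readable names."""
    return [STATE_LABELS[s] for s in states]
```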
For more information on the challenge please refer to:
G. Muñoz-Gil, H. Bachimanchi ... C. Manzo
In-principle accepted at Nature Communications (Registered Report Phase 1)
arXiv:2311.18100
https://doi.org/10.48550/arXiv.2311.18100

Feel free to download the code and run the models as notebooks.
Training time varies depending on dataset size and GPU.
For this work, we simulate ~5 million tracks, taking about 10 hours on a single NVIDIA RTX A5000, with ~44 iterations/sec using a batch size of 32. Convergence typically occurs after ~20 epochs.
Generates simulated protein tracks using AnDi:
- Uses multiprocessing for faster simulation.
- Tracks are saved individually to avoid memory issues.
- Files are then concatenated using `concat.py`.
- For faster training, tracks are converted into pickled dataset classes (`pickle_data.py`).
- Includes a commented `TimeSeriesDataset` class for on-the-fly training from individual files if data does not fit into memory (~6x slower).
- You may provide your own dataset and skip this step.
- Alternatively, use provided model weights to skip training.
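For intuition, a single noisy 2-D fractional Brownian motion track can be simulated with numpy/scipy alone. This is a minimal stand-in for the andi-datasets generator used in the project, with illustrative defaults:

```python
import numpy as np
from scipy.linalg import toeplitz, cholesky

def fbm_track_2d(T=100, alpha=0.5, K=1.0, noise_sigma=0.1, rng=None):
    """Simulate one noisy 2-D fBm track (sketch, not the andi-datasets API).

    Fractional Gaussian noise is drawn via a Cholesky factor of its
    covariance, cumulatively summed into fBm, scaled by the diffusion
    coefficient, and corrupted with Gaussian localisation noise.
    """
    rng = np.random.default_rng(rng)
    H = alpha / 2.0  # Hurst exponent of the underlying fBm
    k = np.arange(T)
    # Autocovariance of fractional Gaussian noise at unit time step.
    gamma = 0.5 * (np.abs(k + 1) ** (2 * H) - 2 * np.abs(k) ** (2 * H)
                   + np.abs(k - 1) ** (2 * H))
    L = cholesky(toeplitz(gamma), lower=True)
    increments = L @ rng.standard_normal((T, 2))   # correlated steps in x and y
    track = np.sqrt(2 * K) * np.cumsum(increments, axis=0)
    return track + noise_sigma * rng.standard_normal((T, 2))
```

With `alpha = 1` this reduces to ordinary Brownian motion; changepoints can be mimicked by concatenating segments simulated with different `alpha` and `K`.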
All training and inference scripts:
- Includes notebooks for:
  - `train_alpha`
  - `train_k`
  - `train_state`
  - `inference`
- Written as notebooks to ease experimentation and understanding.
- `inference.ipynb` includes changepoint detection using ruptures with a penalty (the number of changepoints is unknown).
- A notebook for hyperparameter tuning (learning rate, L2 lambda, batch size, dropout, model layers) using Optuna is also included.
- Currently optimises learning rate and L2 lambda.
- Easily extendable to other parameters.
Models used for training:
- Handles both regression (`K`, `alpha`) and classification (`state`).
- Uses the same base model three times with different output layers. Total trainable parameters: ~512k × 3 ≈ 1.5M.
- A combined model returning `[K, alpha, state]` could be developed, but this is more difficult: it mixes regression and classification and requires careful tuning of a weighted loss function.
- Multiple architectures (e.g., LSTM+CNN, Transformer+Attention) were tested but showed no performance improvement over the final models. Layer stacking continued until no further improvement.
- Model inputs are (x, y) coordinates per timestep, and model outputs are time series of the same length for the given variable.
- If only interested in one variable (e.g., `K`), train only the relevant model. Notebooks are self-contained for modular use, at the cost of some code duplication.
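A minimal PyTorch sketch of the backbone described above (3-stacked bidirectional LSTM with skip connections and dropout); hidden sizes and dropout rate are illustrative, not the competition configuration:

```python
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    """Sketch: 3 stacked bidirectional LSTMs with additive skip
    connections and dropout, ending in a per-timestep head."""

    def __init__(self, hidden=64, dropout=0.2, n_out=1):
        super().__init__()
        self.lstm1 = nn.LSTM(2, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        # Per-timestep head: n_out=1 for K or alpha regression,
        # n_out=4 for the four state logits.
        self.head = nn.Linear(2 * hidden, n_out)

    def forward(self, xy):                  # xy: (batch, T, 2)
        h1, _ = self.lstm1(xy)
        h2, _ = self.lstm2(self.drop(h1))
        h2 = h2 + h1                        # skip connection
        h3, _ = self.lstm3(self.drop(h2))
        h3 = h3 + h2                        # skip connection
        return self.head(h3)                # (batch, T, n_out)
```

The same backbone is instantiated three times, once per target, differing only in the output head and loss (MSE-style for regression, cross-entropy for state).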
Utility functions used during training and inference:
- Feature extraction
- Post-processing
- Dataset class
- Plotting tools
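As one example of the feature-extraction utilities, the time-averaged mean-squared displacement is a classic hand-crafted feature for diffusion inference. This is an illustrative function, not necessarily part of the project's selected feature set:

```python
import numpy as np

def tamsd(track, max_lag=10):
    """Time-averaged MSD of a (T, 2) track for lags 1..max_lag.

    For normal diffusion the TA-MSD grows linearly in the lag; its
    log-log slope estimates the anomalous exponent alpha.
    """
    T = len(track)
    lags = np.arange(1, min(max_lag, T - 1) + 1)
    return np.array([
        np.mean(np.sum((track[lag:] - track[:-lag]) ** 2, axis=1))
        for lag in lags
    ])
```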
utils/plotting.py:
- Generates multiple `.svg` figures for high-quality downstream editing (e.g., in Inkscape).
- Destination folder for saving and editing generated figures.
- Miscellaneous exploratory scripts and code.
Note: Some changes will be made to the code for tidying and clarity.
pandas
pyarrow
numpy
scikit-learn
andi-datasets
IPython
pylops (optional)
ruptures
scipy
torch
tensorboard
torchinfo
tqdm
matplotlib
torchmetrics
optuna (optional)