Deep Learning Research Engineer — building frontier AI architectures from scratch in raw PyTorch.
LLMs · Latent Diffusion · Multimodal · Video Understanding · Agentic ML
12 from-scratch projects · 78% memory optimization · 878-test agentic platform · 860M-param UNet trained from random init
Deep Learning Research Engineer · LLM Engineer · GenAI / Diffusion Engineer · Agentic ML Engineer
Remote-friendly · Available worldwide
Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents) and exploring mixture-of-depths routing for sub-1B parameter LLMs.
| Area | Stack |
|---|---|
| Architectures | Transformers · GQA · MLA · RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · Diffusion UNet · VAE · GAN · CycleGAN · ST-GCN · HRNet · SigLIP |
| Optimization & numerics | BF16 · FP16 · FP8 · Flash Attention 2 · SDPA · torch.compile · channels_last · Gradient checkpointing · μP scaling · WSD LR · NorMuon · Chunked cross-entropy · Disk-backed token caching · Fused optimizers |
| Hardware validated | A100 80GB · RTX 5090 (Blackwell) · RTX 6000 Ada · RTX 3090 · P100 · 2× T4 |
| Tooling | HuggingFace · diffusers · tiktoken · W&B · Comet · safetensors · ONNX · TensorRT · FastAPI · pydantic v2 · ChromaDB · Ollama Cloud |
- 78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
- Training loss 0.0947 at epoch 16 on Stable Diffusion 1.x (860M UNet) trained from random init across a 7-phase curriculum on 2× RTX 5090.
- ~30 FPS inference on RTX 3090 for skeleton-based action recognition, served via ONNX + TensorRT + FastAPI.
- 878 passing tests · 15 cooperating phases · 23 agents · 61 tools · 186 models in the Autonomous ML Research Engineer platform — full paper-to-conclusions loop with self-repair and provider-agnostic LLM routing.
- 415.6M active / 868.6M stored params in FusionLLM — a novel hybrid of MLA + Gated Delta Net + MoE + MTP in a 24-layer decoder.
- 643-line technical deep-dive on MLA (Multi-Head Latent Attention) covering KV-cache math, low-rank compression, the absorption-trick derivation, and decoupled RoPE mechanics.
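
The chunked cross-entropy named in the memory bullet can be illustrated with a minimal sketch. It is written in plain Python so it runs without a GPU or PyTorch; the function names (`project`, `token_nll`, `chunked_cross_entropy`) and the toy shapes are assumptions for illustration, not code from the LLaMA-3-Lite repository. The point it shows is that logits are computed for one slice of positions at a time, so the full positions-by-vocabulary logit matrix never exists in memory at once, while the mean loss matches the unchunked result.

```python
import math

def project(hidden, weight):
    """Logits for one position: hidden (d,) against weight (vocab x d)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def token_nll(logits, target):
    """Negative log-likelihood of `target`, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def full_cross_entropy(hiddens, weight, targets):
    """Reference version: materialize every row of logits, then average."""
    all_logits = [project(h, weight) for h in hiddens]
    return sum(token_nll(l, t) for l, t in zip(all_logits, targets)) / len(targets)

def chunked_cross_entropy(hiddens, weight, targets, chunk_size):
    """Same mean loss, but at most `chunk_size` rows of logits exist at a time."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        chunk_logits = [project(h, weight) for h in hiddens[start:start + chunk_size]]
        chunk_targets = targets[start:start + chunk_size]
        total += sum(token_nll(l, t) for l, t in zip(chunk_logits, chunk_targets))
        # chunk_logits is rebound on the next iteration, so this slice can be freed.
    return total / len(targets)

# Toy data (hypothetical): 5 positions, hidden size 3, vocabulary of 4.
hiddens = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
weight = [[0.2 * (v - j) for j in range(3)] for v in range(4)]
targets = [0, 3, 1, 2, 3]

full = full_cross_entropy(hiddens, weight, targets)
chunked = chunked_cross_entropy(hiddens, weight, targets, chunk_size=2)
```

In a PyTorch training loop the same pattern would be applied to slices of the hidden-state tensor, usually together with recomputing each slice in the backward pass; that part is not shown here.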
| Domain | Project | Highlight | Hardware | Repo |
|---|---|---|---|---|
| LLM | DeepSeek-v3-Lite (422M) | MLA + AuxLossFreeGate MoE + MTP, end-to-end with absorption-trick inference | A100 80GB | → |
| LLM | LLaMA-3-Lite (515M) | GQA · RoPE θ=500K · SwiGLU · RMSNorm · FA2 · chunked CE · 78% memory cut | A100 80GB | → |
| LLM | FusionLLM (415.6M / 868.6M) | Novel MLA + Gated Delta Net + MoE + MTP hybrid · NorMuon + CautiousAdamW · WSD | A100 80GB | → |
| LLM | GPT-From-Scratch | 200-line educational GPT-2 with fused QKV; HF weight loading | MPS / CUDA | → |
| LLM | TranslationLM (EN→IT) | Encoder–decoder Transformer · loss 6.17 → 2.28 · BLEU/CER/WER | P100 | → |
| Vision | Stable Diffusion 1.x (860M UNet) | Custom UNet trained from random init · 7 phases · 1.3M+ images · best loss 0.0947 | 2× RTX 5090 | → |
| Vision | ActionRecognition (120 cls) | HRNet pose + Two-Stream CTR-GCN · ~30 FPS · ONNX + TensorRT | RTX 3090 | → |
| Vision | FaceAgingCycleGAN (256²) | Per-layer AdaIN conditioning · 3-scale PatchGAN · LSGAN + R1 GP | RTX 6000 Ada | → |
| Vision | FaceGenerationVAE (β-VAE) | 50 epochs · recon MSE 0.0152 · linear KL annealing · bilinear-upsample decoder | P100 | → |
| Vision | DCGAN-Face-Generation | 50 epochs · 202K CelebA · D loss → ln 2 ≈ 0.693 equilibrium | 2× T4 | → |
| Multimodal | VisionLangModel (PaliGemma-style) | SigLIP ViT + Gemma decoder + linear projector · zero pretrained weights | P100 | → |
| Agentic | Autonomous ML Research Engineer | 15-phase multi-agent platform · paper → plan → patch → train → evaluate → report | Local + Ollama Cloud | → |
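
The `ln 2 ≈ 0.693` figure in the DCGAN row follows from binary cross-entropy: once the discriminator can no longer tell real from generated images it outputs 0.5 for everything, and each term of the loss becomes `-ln 0.5`. A quick check, assuming the reported discriminator loss is averaged over the real and fake terms rather than summed (the `bce` helper is illustrative, not taken from the repository):

```python
import math

def bce(prob, label):
    """Binary cross-entropy for a single prediction `prob` in (0, 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# At equilibrium the discriminator outputs 0.5 for real (label 1) and fake (label 0).
real_term = bce(0.5, 1)
fake_term = bce(0.5, 0)

d_loss_mean = 0.5 * (real_term + fake_term)  # averaged convention (assumed here)
d_loss_sum = real_term + fake_term           # summed convention, for comparison
```

Under the summed convention the same equilibrium would read as `2 ln 2`, about 1.386, so which number to expect depends on how the two terms are combined in the training loop.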
- Multi-Head Latent Attention — A Technical Deep-Dive — 643-line reference covering KV-cache math, low-rank compression algebra, the absorption-trick derivation, decoupled RoPE mechanics, and SDPA vs manual attention trade-offs in DeepSeek-V2/V3.
- From-scratch PyTorch — no Trainer, no Lightning, no accelerate; every layer written by hand
- Single-GPU feasibility — BF16, gradient checkpointing, FA2, `channels_last`, fused optimizers
- Faithful reproductions — DeepSeek-V3, LLaMA-3, PaliGemma, DCGAN implemented to the paper
- Novel hybrids — FusionLLM (MLA + GDN + MoE + MTP), FaceAgingCycleGAN (AdaIN-conditioned CycleGAN)
- Production hygiene — atomic checkpoints (`.tmp.pt` → `os.rename`), full RNG-state reproducibility, W&B / Comet tracking, CI lint + tests
- Data pipelines — resumable download → filter → tokenize → shard → streaming loader, with dedup and document packing
- Post-training & inference — speculative decoding (MTP-as-draft), Min-SNR loss weighting, EMA, classifier-free guidance
- Hardware breadth — MPS / CPU → Kaggle T4 / P100 → A100 80GB → 2× RTX 5090 → RTX 6000 Ada
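
The atomic-checkpoint pattern listed under production hygiene can be sketched as follows. This is a minimal illustration under stated assumptions, not the repositories' code: `save_checkpoint_atomic` and `load_checkpoint` are hypothetical helpers, `pickle` stands in for `torch.save` so the example runs without PyTorch, and only Python's own RNG state is captured. The state is written to a temporary `.tmp.pt` file next to the target and then renamed over the final path, so a crash mid-write leaves the previous checkpoint untouched.

```python
import os
import pickle
import random
import tempfile

def save_checkpoint_atomic(state, path):
    """Write `state` to a temp file, then rename it onto `path`."""
    tmp_path = path + ".tmp.pt"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)   # stand-in for torch.save(state, f)
        f.flush()
        os.fsync(f.fileno())    # push bytes to disk before the rename
    os.rename(tmp_path, path)   # atomic on POSIX when both paths share a filesystem

def load_checkpoint(path):
    """Read back a checkpoint written by save_checkpoint_atomic."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: save the step count plus RNG state, then restore and replay the draws.
ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "step_100.pt")

random.seed(0)
save_checkpoint_atomic({"step": 100, "rng": random.getstate()}, ckpt_path)
expected = [random.random() for _ in range(3)]

restored = load_checkpoint(ckpt_path)
random.setstate(restored["rng"])
replayed = [random.random() for _ in range(3)]  # matches `expected`
```

On Windows `os.rename` raises if the target already exists, whereas `os.replace` overwrites on both platforms; a full training setup would also store the NumPy, PyTorch CPU, and CUDA generator states alongside the Python one.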
B.Tech, 2024 · Heritage Institute of Technology, Kolkata. Self-taught in deep learning through two years of from-scratch implementation. Engineering discipline from infrastructure and constraint work carries over directly to memory budgets, distributed training, and reproducible ML systems.
Last updated 2026-06-27 · Open to remote and on-site roles

