I build ML systems end-to-end - from training runs on rented GPUs to production inference APIs that actually stay up. My focus is the gap between "it works in the notebook" and "it runs at scale without falling over."
Current obsessions: LLM alignment (why does reward accuracy diverge from factuality?), model compression without killing accuracy, and distributed ML pipelines that don't require a dedicated ops team to maintain.
| Metric | Result | Project |
|---|---|---|
| DPO Reward Accuracy | 82% (peak 88%) | distill-align-llm |
| Factuality - LLM-judge | 75.7% on 500-prompt benchmark | distill-align-llm |
| Model Compression | 107.7M -> 65.2M params, 93.2% F1 retained | clinical-nlp-optimization |
| Inference Speedup | 39ms -> 10.8ms, 1.9× faster | clinical-nlp-optimization |
| Weak Label Generation | 19,506 entities from 7,064 PubMed abstracts | clinical-nlp-optimization |
| SLA Compliance | 97% of requests under 50ms (100-req load test) | clinical-nlp-optimization |
| Training Cost | ~$27 total, SFT + DPO on Llama-3.1-8B | distill-align-llm |
distill-align-llm - SFT -> DPO alignment on Llama-3.1-8B, Live dashboard
Full alignment pipeline using QLoRA (r=16, α=32, 4-bit NF4). Trained on RunPod for ~$27 total.
The main finding: reward accuracy and factuality are not the same thing. The DPO model reaches 82% reward accuracy, yet scores only 17.6% factuality under strict keyword matching on 51 prompts and 75.7% on a proper 500-prompt LLM-judge benchmark. Same model, two very different factuality numbers. The evaluation methodology matters more than people admit.
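To make the distinction concrete, here is a minimal sketch of the two metrics. It assumes reward accuracy is computed the usual way for DPO (the share of preference pairs where the chosen response gets a higher implicit reward than the rejected one) and that strict keyword matching means a case-insensitive substring check. The function names and toy data are hypothetical and not taken from the repo.

```python
from typing import Sequence


def reward_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of preference pairs where the chosen response out-scores the rejected one."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need two equal-length, non-empty reward lists")
    wins = sum(c > r for c, r in zip(chosen, rejected))
    return wins / len(chosen)


def strict_keyword_factuality(answers: Sequence[str], keywords: Sequence[str]) -> float:
    """Fraction of answers containing their reference keyword (case-insensitive substring)."""
    if len(answers) != len(keywords) or not answers:
        raise ValueError("need two equal-length, non-empty lists")
    hits = sum(k.lower() in a.lower() for a, k in zip(answers, keywords))
    return hits / len(answers)


# Toy data (made up for illustration only).
toy_chosen = [1.2, 0.4, 2.0, -0.1]
toy_rejected = [0.3, 0.9, 1.1, -0.5]
toy_answers = [
    "The capital of France is Paris.",
    "It is the French capital city.",  # correct paraphrase, but no keyword hit
    "Water boils at 100 degrees Celsius.",
]
toy_keywords = ["Paris", "Paris", "100"]

toy_reward_acc = reward_accuracy(toy_chosen, toy_rejected)
toy_factuality = strict_keyword_factuality(toy_answers, toy_keywords)
```

The second toy answer is a correct paraphrase that the substring check still counts as a miss. That is one way a strict keyword metric can undercount relative to an LLM judge.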
Token probability analysis showed the model knows the answers - it just suppresses them. Median correct token rank after SFT/DPO: position 2. It's a generation suppression problem, not a forgetting problem.
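The rank analysis can be sketched roughly as follows. This assumes ranks are 1-based (rank 1 is the token greedy decoding would pick) and are computed from the raw next-token logits at each answer position. The helper name and the toy logits are hypothetical.

```python
from statistics import median
from typing import Sequence


def token_rank(logits: Sequence[float], correct_id: int) -> int:
    """1-based rank of the correct token; 1 means it would be the greedy pick."""
    target = logits[correct_id]
    return 1 + sum(score > target for score in logits)


# Toy (logits, correct_token_id) pairs, made up for illustration.
toy_positions = [
    ([0.1, 2.0, 1.5], 2),        # one token scores higher -> rank 2
    ([3.0, 0.2, 0.1], 0),        # top token -> rank 1
    ([0.5, 0.9, 2.2, 1.0], 3),   # one token scores higher -> rank 2
]

toy_ranks = [token_rank(logits, cid) for logits, cid in toy_positions]
toy_median_rank = median(toy_ranks)
```

A low median rank paired with a wrong greedy output is the pattern described above: the correct token is near the top of the distribution but is not the one emitted.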
Stack: PyTorch, HuggingFace TRL, PEFT, bitsandbytes, Streamlit, pytest (44 passing)
clinical-nlp-optimization - Knowledge distillation + distributed NLP pipeline for clinical NER
Compressed Bio_ClinicalBERT (107.7M params) down to DistilClinicalBERT (65.2M) while retaining 93.2% of F1. Deployed as a FastAPI inference server with Prometheus + OpenTelemetry observability.
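The exact distillation objective is not stated here, so the following is only a sketch of the common soft-target formulation: a temperature-scaled KL term against the teacher, blended with ordinary cross-entropy on the gold label. The temperature, `alpha`, and function names are assumptions for illustration, written in plain Python rather than PyTorch so it runs without dependencies.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    temperature: float = 2.0,  # assumed value
    alpha: float = 0.5,        # assumed soft/hard mixing weight
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0
    )
    soft *= temperature ** 2  # keeps soft-term gradients comparable across temperatures
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


# Toy logits for a 3-class token classification step (made up).
toy_teacher = [2.0, 0.5, -1.0]
toy_student = [0.1, 0.2, 0.3]
toy_loss = distillation_loss(toy_student, toy_teacher, label=0)
```

When the student matches the teacher exactly, the soft term vanishes and only the weighted cross-entropy remains.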
The pipeline covers six components end-to-end: distillation, distributed weak labeling on PySpark/EMR, ONNX pruning + INT8 quantization, LangChain agentic evaluation, statistical A/B testing (Mann-Whitney + Wilcoxon), and a production observability stack. 97% SLA compliance on a 100-request load test.
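For reference, the SLA figure can be computed from raw load-test latencies along these lines. The 50 ms threshold comes from the results table. The function name and the synthetic samples are assumptions, not the actual measurements.

```python
from typing import Sequence


def sla_compliance(latencies_ms: Sequence[float], threshold_ms: float = 50.0) -> float:
    """Fraction of requests that finished strictly under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    return sum(t < threshold_ms for t in latencies_ms) / len(latencies_ms)


# Synthetic 100-request sample (made up): 97 fast requests, 3 slow outliers.
toy_latencies = [12.0] * 97 + [80.0] * 3
toy_compliance = sla_compliance(toy_latencies)
```

With only 100 requests, each request moves the figure by a full percentage point, so a larger load test would give a tighter estimate.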
Stack: PyTorch, HuggingFace, ONNX Runtime, PySpark, AWS EMR/S3/Lambda, FastAPI, LangChain, Prometheus, Grafana, Terraform
TheInheritableAgent - Cryptographic AI inheritance, Auth0 for AI Agents Hackathon
When someone passes away, their family inherits their belongings - but never their way of thinking. TheInheritableAgent lets a parent's AI-extracted decision patterns be inherited by their child through cryptographically scoped Auth0 tokens, while keeping every piece of personal data permanently inaccessible.
The boundary is enforced at the identity layer, not application code. 2-of-3 trustee multi-sig, step-up auth for sensitive topics, multi-generational token delegation where scopes can only shrink - never expand.
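The two invariants can be illustrated in a few lines of plain Python. This is only a sketch of the rules themselves, not of how Auth0 enforces them, since the project enforces them at the identity layer. The scope strings, trustee names, and function names are hypothetical.

```python
from typing import FrozenSet, Iterable


def delegate_scopes(parent_scopes: FrozenSet[str], requested: Iterable[str]) -> FrozenSet[str]:
    """Return the child's scopes, refusing any request that exceeds the parent's grant."""
    requested_set = frozenset(requested)
    extra = requested_set - parent_scopes
    if extra:
        raise PermissionError(f"cannot expand scopes: {sorted(extra)}")
    return requested_set


def trustees_approve(
    approvals: Iterable[str], trustees: FrozenSet[str], threshold: int = 2
) -> bool:
    """True when at least `threshold` distinct registered trustees have approved."""
    return len(set(approvals) & trustees) >= threshold


# Hypothetical three-generation delegation chain.
parent = frozenset({"patterns:read", "patterns:finance"})
child = delegate_scopes(parent, ["patterns:read"])  # narrowing is allowed

try:
    delegate_scopes(child, ["patterns:finance"])  # grandchild tries to regain a scope
    expansion_blocked = False
except PermissionError:
    expansion_blocked = True

toy_trustees = frozenset({"alice", "bob", "carol"})
quorum_met = trustees_approve(["alice", "bob"], toy_trustees)
quorum_faked = trustees_approve(["alice", "alice", "mallory"], toy_trustees)
```

Because each delegation is checked against the immediate parent's grant, a scope dropped at one generation cannot reappear further down the chain.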
Stack: Python, Flask, Auth0 Token Vault, JWT, Claude API
PHI/PII Parser - FHIR-compliant redaction on AWS Lambda
Reads HL7 FHIR Bundle JSON files from S3, detects and redacts PII/PHI fields (name, DOB, SSN, address), and writes a cleaned CSV back to S3. Two deployment modes: FastAPI for local on-demand scanning, Lambda for serverless auto-triggering on every S3 upload.
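The core transform can be sketched without the S3 and Lambda plumbing. The version below assumes only `Patient` resources are processed, that SSNs appear as identifiers under the FHIR `us-ssn` system, and that redaction means replacing the value with a fixed marker. The column layout and function names are hypothetical.

```python
import csv
import io
import json

SSN_SYSTEM = "http://hl7.org/fhir/sid/us-ssn"
REDACTED = "[REDACTED]"
COLUMNS = ["id", "gender", "name", "birthDate", "ssn", "address"]


def redact_patient(patient: dict) -> dict:
    """Flatten one Patient resource into a row with PII/PHI fields masked."""
    has_ssn = any(
        ident.get("system") == SSN_SYSTEM for ident in patient.get("identifier", [])
    )
    return {
        "id": patient.get("id", ""),
        "gender": patient.get("gender", ""),
        "name": REDACTED if patient.get("name") else "",
        "birthDate": REDACTED if patient.get("birthDate") else "",
        "ssn": REDACTED if has_ssn else "",
        "address": REDACTED if patient.get("address") else "",
    }


def bundle_to_redacted_csv(bundle_json: str) -> str:
    """Parse a FHIR Bundle and return a CSV string of redacted Patient rows."""
    bundle = json.loads(bundle_json)
    rows = [
        redact_patient(entry["resource"])
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType") == "Patient"
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Synthetic bundle (fake data) for illustration.
toy_bundle = json.dumps({
    "resourceType": "Bundle",
    "entry": [{
        "resource": {
            "resourceType": "Patient",
            "id": "p1",
            "gender": "female",
            "name": [{"family": "Doe", "given": ["Jane"]}],
            "birthDate": "1980-01-01",
            "identifier": [{"system": SSN_SYSTEM, "value": "000-00-0000"}],
            "address": [{"city": "Springfield"}],
        }
    }],
})
toy_csv = bundle_to_redacted_csv(toy_bundle)
```

In either deployment mode, the S3 read and write would wrap this function. Keeping the transform free of I/O also makes it straightforward to test locally.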
Stack: Python, FastAPI, AWS S3/Lambda, Pydantic v2, Docker/LocalStack
Agentic AI Parenting - Multi-agent system on Google ADK + FatSecret
A modular parenting agent built on Google's Agent Development Kit. Root agent delegates to specialized sub-agents: parenting analyst, nutrition meal planner (FatSecret API), and a basic pediatric medical advisor. Stateful sessions remember user context across turns.
Stack: Python, Google ADK, Gemini, LiteLLM, FatSecret Nutrition API



