I build production-grade ML and AI systems — not models in notebooks. Every project here is a complete platform: training, evaluation, governance, monitoring, serving, and fairness — with evidence artifacts that prove each component works.
My focus areas: agentic LLM pipelines with deterministic safety guarantees, recommendation and ranking systems, credit and risk decisioning, developer intelligence tooling, experiment analysis platforms, and ML data quality auditing.
Production-simulated recommendation platform with IPS debiasing, MMR diversity reranking, delayed attribution, exposure governance, and offline A/B simulation.
- IPS debiasing: click × (1/propensity(rank)), clip max_weight=5.0 · Naive NDCG@10=0.134 → IPS-weighted NDCG@10=0.522
- MMR diversity: λ=0.70, max_per_seller=2, category_cap=50% · Seller Gini 0.74 → 0.582
- A/B simulation: 4,132 sessions, 650 items, 80 sellers, 33 JSON artifacts, 15 failure scenarios
- A/B decision: HOLD_SIMULATED (NDCG −7.9% relative, exceeds 5% threshold)
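A minimal sketch of the clipped IPS weighting described above, `click × (1/propensity(rank))` with `max_weight=5.0`. The function names and the propensity table are hypothetical illustrations, not PulseRank's actual code or data.

```python
from typing import Dict, List


def ips_weight(propensity: float, max_weight: float = 5.0) -> float:
    """Inverse propensity weight, clipped at max_weight to bound variance."""
    if propensity <= 0:
        raise ValueError("propensity must be positive")
    return min(1.0 / propensity, max_weight)


def ips_weighted_clicks(
    clicks: List[int],
    ranks: List[int],
    propensity_by_rank: Dict[int, float],
    max_weight: float = 5.0,
) -> List[float]:
    """Reweight each click by the clipped inverse propensity of its rank."""
    return [
        click * ips_weight(propensity_by_rank[rank], max_weight)
        for click, rank in zip(clicks, ranks)
    ]


# Hypothetical examination propensities per display rank (assumed values).
propensity_by_rank = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.1}
weighted = ips_weighted_clicks([1, 0, 1, 1], [1, 2, 3, 4], propensity_by_rank)
print(weighted)  # [1.0, 0.0, 4.0, 5.0]  (rank 4 would be 10.0 without clipping)
```

Clipping trades a small amount of bias for lower variance: without it, a single click at a rarely examined rank could dominate the weighted metric.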
RAG + agentic platform for version-safe developer change decisions. Detects conflicting documentation before LLM synthesis (LLM-Last architecture).
- Hybrid retrieval: BM25 + vector + RRF with version pre-filtering · Recall@5 = 0.97
- Conflict detection: 4 conflict types, 3 verdict levels (BLOCKED/RISKY/SAFE) · Wrong-version answer rate = 0.0
- Goal Mode: 6 agentic components, retry cap = 2, recovery actions · Macro F1 = 0.966
- Semver utilities: parse_semver, semver_lt, semver_distance (≥2 → CRITICAL, =1 → HIGH)
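The hybrid retrieval step above can be sketched as reciprocal rank fusion over version-filtered result lists. This is an illustrative sketch, not DevPulse's implementation: `prefilter_by_version`, `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used RRF default) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def prefilter_by_version(
    ranking: List[str], doc_versions: Dict[str, str], target: str
) -> List[str]:
    """Drop documents whose version differs from the target before fusion."""
    return [doc for doc in ranking if doc_versions.get(doc) == target]


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


bm25_hits = ["a", "b", "c"]    # hypothetical lexical ranking
vector_hits = ["b", "c", "a"]  # hypothetical embedding ranking
print(rrf_fuse([bm25_hits, vector_hits]))  # ['b', 'a', 'c']
```

Filtering by version before fusion, rather than after, keeps wrong-version documents from occupying ranks and shifting the fused scores of the correct-version ones.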
Full ML lifecycle credit scoring platform on Home Credit dataset. Champion/challenger governance, ECOA fairness auditing, PSI drift monitoring, 5-gate promotion framework.
- Champion XGBoost: PR AUC = 0.2611, ROC AUC = 0.7663, ECE = 0.0046, Brier = 0.0673
- Fairness: Disparate Impact = 1.059 (ECOA safe harbor), SHAP-generated adverse action codes
- Drift: Day 14 PSI = 0.2296 (DRIFT ALERT > 0.20) → Policy v1.0 → v1.1 deployed
- Challenger: LightGBM 5/5 gates passed, HOLD (delta PR AUC = −0.0001, below material threshold)
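For reference, a minimal sketch of the Population Stability Index behind the drift alert above. Only the 0.20 alert threshold comes from this README; the bin counts, the `eps` floor, and the function name are assumptions for illustration.

```python
import math
from typing import Sequence

DRIFT_THRESHOLD = 0.20  # alert level stated above


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), on bin proportions."""
    if len(expected) != len(actual):
        raise ValueError("both inputs must use the same bins")
    e_total, a_total = sum(expected), sum(actual)
    total = 0.0
    for e_count, a_count in zip(expected, actual):
        # Floor proportions at eps so empty bins do not produce log(0).
        e_pct = max(e_count / e_total, eps)
        a_pct = max(a_count / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


baseline_bins = [50, 30, 20]  # hypothetical training-time score histogram
day_n_bins = [20, 30, 50]     # hypothetical production histogram
value = psi(baseline_bins, day_n_bins)
print("DRIFT ALERT" if value > DRIFT_THRESHOLD else "stable")
```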
Production experiment analysis platform with CUPED variance reduction, guardrail-first decisioning, A/A calibration, streaming early warning, and SRM detection.
- CUPED: Pre-experiment covariate adjustment · Variance reduction up to 40%
- Guardrail-first: Experiment blocked if any guardrail metric degrades beyond threshold
- A/A validation: 1,000 run calibration · False positive rate verified at α = 0.05
- Streaming: Early warning system with sequential testing (mSPRT)
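A minimal sketch of the CUPED adjustment named above, assuming a single pre-experiment covariate `x` per user and an outcome `y`. The function name and sample values are hypothetical; in practice the coefficient is usually estimated on data pooled across arms.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def cuped_adjust(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    # With a constant covariate there is nothing to adjust for.
    theta = cov_xy / var_x if var_x > 0 else 0.0
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical pre-experiment metric
in_period = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical in-experiment metric
adjusted = cuped_adjust(in_period, pre_period)
print(pvariance(adjusted) < pvariance(in_period))  # True
```

The adjustment leaves the mean of the metric unchanged and removes the part of its variance explained by the covariate, so the achievable reduction depends on how strongly pre-period and in-period behavior correlate.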
DataFrame-native library for decomposing metric movements into mix shift vs. rate shift. Answers "a metric moved — where did it come from?"
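The arithmetic behind a mix/rate decomposition can be sketched in plain Python. This is one common formulation, not necessarily the one MetricLens uses; the `decompose` helper and the segment numbers below are hypothetical. With segment share `w` and segment rate `r`, the overall metric is `Σ w·r`, and its change splits into three per-segment terms.

```python
from typing import Dict, Tuple

Segment = Tuple[float, float]  # (traffic share, metric rate)


def decompose(
    before: Dict[str, Segment], after: Dict[str, Segment]
) -> Dict[str, Dict[str, float]]:
    """Split the change in sum(share * rate) into per-segment effects."""
    effects = {}
    for seg in sorted(set(before) | set(after)):
        w0, r0 = before.get(seg, (0.0, 0.0))
        w1, r1 = after.get(seg, (0.0, 0.0))
        effects[seg] = {
            "mix_effect": (w1 - w0) * r0,                  # share moved, rate held
            "rate_effect": w0 * (r1 - r0),                 # rate moved, share held
            "interaction_effect": (w1 - w0) * (r1 - r0),   # both moved together
        }
    return effects


before = {"US": (0.6, 0.10), "DE": (0.4, 0.05)}  # hypothetical segments
after = {"US": (0.5, 0.12), "DE": (0.5, 0.05)}
for seg, parts in decompose(before, after).items():
    print(seg, parts)
```

The three effects summed over all segments equal the total change in the overall metric, which makes the decomposition easy to sanity-check.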
from metriclens import MetricLens
lens = MetricLens(df_before, df_after, segment_col="country", metric_col="conversion")
lens.decompose() # → mix_effect, rate_effect, interaction_effect per segment
Platform-agnostic A/B experiment auditor. Catches SRM, underpowered tests, peeking violations, and multiple comparison issues before you ship a wrong decision.
trialcheck audit experiment_results.json --alpha 0.05 --min-power 0.80
Pre-training tabular ML auditor for feature leakage. Detects target leakage, temporal leakage, train/test overlap, and near-duplicate features before you train.
from featureleakagelens import LeakageAuditor
report = LeakageAuditor(df_train, df_test, target="label").audit()
Evaluation dataset quality auditor for LLM and RAG applications. Validates golden sets for answer completeness, question ambiguity, context coverage, and contamination.
from goldensetauditor import GoldenSetAuditor
report = GoldenSetAuditor(golden_jsonl="eval_set.jsonl").audit()
Pre-indexing QA auditor for RAG document ingestion. Runs 11 deterministic checks on exported chunks, including missing pages, OCR noise, duplicates, encoding corruption, and poor split boundaries.
pip install docingestqa
docingestqa audit chunks.jsonl --output report.html
Audits AI inference routing decisions: profiles calls across model configurations, finds the cost/quality Pareto frontier, and flags dominated configs with routing recommendations.
from inferencelens import InferenceLens
report = InferenceLens(task_type="summarization").profile(prompts)
# → Pareto frontier, routing rules, PASS/WARN/FAIL verdict
Three production pipelines applying the same principles under domain pressure:
| System | Domain | Primary Failure Mode |
|---|---|---|
| lendflow | Financial underwriting | When to stop or escalate |
| agentreliabilitylab | Cyber threat triage | When to stop or escalate |
| nexussupply | Supplier risk intelligence | Conflicting signal fusion |
Each uses LangGraph with deterministic-first architecture: scores are computed before LLM synthesis, escalation paths are explicit, and graceful degradation is guaranteed.
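The deterministic-first pattern can be illustrated without LangGraph as a plain-Python sketch. Every name and threshold here (`Decision`, `decide`, `run_pipeline`, `approve_at`, `reject_at`) is hypothetical and not taken from the three repositories. The score and verdict are fixed before any LLM call, and a failed synthesis step falls back to a deterministic summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Decision:
    score: float
    verdict: str    # "APPROVE", "ESCALATE", or "REJECT"
    summary: str
    llm_used: bool


def decide(score: float, approve_at: float = 0.8, reject_at: float = 0.3) -> str:
    """Explicit escalation rule: the middle band goes to a human."""
    if score >= approve_at:
        return "APPROVE"
    if score <= reject_at:
        return "REJECT"
    return "ESCALATE"


def run_pipeline(
    features: Dict[str, float],
    scorer: Callable[[Dict[str, float]], float],
    synthesize: Optional[Callable[[float, str], str]] = None,
) -> Decision:
    """Compute score and verdict first; the LLM only writes the narrative."""
    score = scorer(features)
    verdict = decide(score)
    summary = f"verdict={verdict} score={score:.2f}"  # deterministic fallback
    llm_used = False
    if synthesize is not None:
        try:
            summary = synthesize(score, verdict)
            llm_used = True
        except Exception:
            # Degrade gracefully: keep the deterministic summary.
            pass
    return Decision(score, verdict, summary, llm_used)


def broken_llm(score: float, verdict: str) -> str:
    raise TimeoutError("simulated LLM outage")


result = run_pipeline({"risk": 0.5}, lambda f: 1.0 - f["risk"], broken_llm)
print(result.verdict, result.llm_used)  # ESCALATE False
```

Because the verdict never depends on the LLM output, an outage or a bad generation changes only the wording of the summary, not the decision.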
Every project here addresses one of three failure modes in AI systems:
① How does it know it's working correctly?
The system needs a verification signal independent of its own confidence.
TrialCheck · MetricLens · FeatureLeakageLens · DocIngestQA · GoldenSetAuditor · RiskFrame · PulseRank · InferenceLens
② When should it stop or escalate?
The system needs explicit rules for when automated decisions require human judgment.
MetaSignal · LendFlow · AgentReliabilityLab
③ How does it handle conflicting information?
The system needs a deterministic anchor when multiple signals disagree.
DevPulse · NexusSupply
The original nine are domain-agnostic auditors. The three applied systems test the same thesis under production pressure.
Raw Application Data ──────────────────────────────────► RiskFrame
(risk scoring)
User Queries / Developer Questions ────────────────────► DevPulse
(version-safe answers)
Marketplace Events / Clicks ────────────────────────────► PulseRank
(ranking + debiasing)
A/B Experiment Results ────────────────────────────────► MetaSignal / TrialCheck
(experiment validity)
ML Training Data ──────────────────────────────────────► FeatureLeakageLens
(pre-training audit)
RAG Chunks / Eval Sets ─────────────────────────────────► DocIngestQA / GoldenSetAuditor
(data quality gates)
Metric Movement Explanation ───────────────────────────► MetricLens
(decomposition)
Every platform feeds downstream governance: PulseRank's A/B decisions are validated by MetaSignal's experiment framework. RiskFrame's evaluation data is audited by FeatureLeakageLens pre-training and GoldenSetAuditor post-training. DevPulse's RAG corpus passes through DocIngestQA before indexing.
All repositories here follow a consistent production standard:
- Tests: pytest with golden scenarios, integrity tests, and property-based checks
- Artifacts: JSON evidence artifacts for all model and evaluation results
- CI/CD: GitHub Actions with lint, type-check, and test gates
- Versioning: Semantic versioning (major.minor.patch) with CHANGELOG
- Documentation: Defense documents, PRD documents, and README-level architecture diagrams