Deterministic evidence-surface scanner for bio/medical AI repositories.
No LLM. No API key. No model runtime. No secrets sent anywhere.
Bio and medical AI repositories vary enormously in evidence quality — from rigorous academic tools to marketing-grade demos that carry clinical language with no data provenance, no reproducibility path, and no clinical-use disclaimer. Manual review is slow and inconsistent.
STEM BIO-AI scans the observable repository surface — README, docs, code structure, CI configuration, dependency manifests, changelogs — and maps detected signals to a structured evidence tier (T0–T4). The scan runs in seconds on a local clone, produces machine-readable JSON and PDF reports, and makes every scoring decision traceable to a specific file, line, and pattern.
A T4 score means strong observable evidence signals. It does not mean the repository is safe for clinical deployment — that requires independent expert validation.
git clone https://github.com/flamehaven01/STEM-BIO-AI.git
cd STEM-BIO-AI
pip install stem-ai

# editable local install with PDF output support
pip install -e .[pdf]
# fastest path: scan a local repository
stem /path/to/bio-ai-repo
# 7-page full evidence packet with proof trace
stem scan /path/to/bio-ai-repo --level 3 --format all --explain

# workflow-oriented CLI
stem scan /path/to/bio-ai-repo --level 2
stem scan /path/to/bio-ai-repo --policy strict_clinical_adjacency
stem gate /path/to/bio-ai-repo --min-tier T2
stem policy list
stem policy explain strict_clinical_adjacency
stem policy derive --clinical-strictness 4 --code-integrity-priority 3 --reproducibility-priority 2 --structured-limitations-requirement 3
stem policy simulate /path/to/bio-ai-repo --clinical-strictness 4 --code-integrity-priority 3 --reproducibility-priority 2 --structured-limitations-requirement 3
stem policy simulate /path/to/bio-ai-repo --profile-file policy/drafts/scoring_profile.reproducibility_first.v1.json
stem advisory validate /path/to/bio-ai-repo
stem advisory packet /path/to/bio-ai-repo --output advisory_out
stem advisory check-response /path/to/bio-ai-repo --response provider_advisory.json

# backward-compatible shortcuts still work
stem /path/to/bio-ai-repo --level 3 --format all --explain
stem audit /path/to/bio-ai-repo --tier-gate T3 --quiet

Clone the target repository first; the CLI operates on local paths only.
Calibration profiles are implemented in mirror_only mode in 1.7.8. --policy changes what profile is surfaced in artifacts, while policy derive and policy simulate provide governed preview lanes without mutating the authoritative deterministic score path. policy simulate --profile-file <path> allows local schema-valid profile experiments without registering a new named policy. In the current rule scope, strict_clinical_adjacency is the only release-grade named recommendation; stronger reproducibility postures still fall back to preview_only simulation deltas rather than a named profile.
Researchers and domain specialists are expected to influence calibration through derive, simulate, and documented preview/profile proposals. The intent interview uses a governed 1–5 posture scale, while official score-affecting policy changes still require profile promotion rather than direct ad hoc tuning.
Full CLI reference: docs/CLI_REFERENCE.md
Proof surfaces
- Demo: Hugging Face Space
- API contract: docs/API_CONTRACT.md
- Secret handling: docs/ADVISORY_SECRET_HANDLING.md
- Advisory runtime boundary: docs/ADVISORY_RUNTIME.md
- Example audits: docs/EXAMPLE_AUDITS.md
- Scoring rationale: docs/SCORING_RATIONALE.md
- Calibration profile architecture: docs/CALIBRATION_PROFILE_DESIGN.md
- AIRI data governance: docs/AIRI_DATA_GOVERNANCE.md
- Third-party data attribution: docs/THIRD_PARTY_DATA.md
- Deterministic diagnostics: docs/DETERMINISTIC_DIAGNOSTICS.md
- Regulatory traceability mapping: docs/REGULATORY_MAPPING.md
- Regulatory basis registry: docs/regulatory_basis_registry.v1.json
- CLI reference: docs/CLI_REFERENCE.md
- T0 Rejected (0–39): insufficient evidence — do not rely on without independent expert validation
- T1 Quarantine (40–54): exploratory review only — expert validation required before any use
- T2 Caution (55–69): research reference and supervised non-clinical technical review only
- T3 Supervised (70–84): supervised institutional review candidate
- T4 Candidate (85–100): strong evidence posture — clinical deployment still requires independent validation
Clinical-adjacent repositories without an explicit disclaimer are hard-capped at T2 (score ≤ 69). Repositories with unbounded CA-DIRECT claims are hard-capped at T0 (score ≤ 39).
Tier boundary derivation and calibration gap disclosures: docs/SCORING_RATIONALE.md.
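
The tier bands and hard caps described above can be sketched as a small lookup. This is an illustration only, not the scanner's implementation: the function name `assign_tier` and the flags `clinical_adjacent`, `has_disclaimer`, and `unbounded_ca_direct` are hypothetical, and treating fractional scores as belonging to the lower band is an assumption. Only the band boundaries and cap values come from this README.

```python
# Illustrative sketch; names are hypothetical, bands and caps come from the README.
TIER_BANDS = [
    (85, "T4 Candidate"),
    (70, "T3 Supervised"),
    (55, "T2 Caution"),
    (40, "T1 Quarantine"),
    (0, "T0 Rejected"),
]


def assign_tier(score, clinical_adjacent=False, has_disclaimer=True,
                unbounded_ca_direct=False):
    """Apply the hard caps, then map the capped score to a tier label."""
    capped = max(0.0, min(100.0, float(score)))
    if clinical_adjacent and not has_disclaimer:
        capped = min(capped, 69.0)  # hard cap at T2
    if unbounded_ca_direct:
        capped = min(capped, 39.0)  # hard cap at T0
    for floor, label in TIER_BANDS:
        if capped >= floor:
            return capped, label
    return capped, "T0 Rejected"
```

Under these assumptions, a raw score of 90 for a clinical-adjacent repository without a disclaimer comes back as `(69.0, "T2 Caution")`.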
Final = (Stage 1 × 0.40) + (Stage 2R × 0.20) + (Stage 3 × 0.40) − C1 Penalty
| Stage | Weight | What Is Measured |
|---|---|---|
| Stage 1 README Evidence | 40% | Bio-domain vocabulary; H1–H6 hype-claim penalties; R1–R5 responsibility signals (limitations, regulatory framing, clinical disclaimer, demographic-bias, reproducibility) |
| Stage 2R Repo-Local Consistency | 20% | Vocabulary overlap across README, docs, package metadata, CI, and tests; limitation repetition; contradiction, staleness, and unsupported-workflow deductions |
| Stage 3 Code/Bio Responsibility | 40% | CI presence; domain test coverage; changelog hygiene (T3); data provenance and IRB/dataset citation (B1); bias/limitation measurement evidence (B2); conflict-of-interest disclosure (B3) |
| Stage 4 Replication Evidence | Separate lane | Containers; reproducibility targets; dependency locks/pins; dataset and model artifact references; seed, CLI, and citation signals; license/use-scope restrictions |
| C1–C6 Code Integrity | Penalty / advisory | Hardcoded credentials (C1, −10 pts); dependency pinning and external-service fragility (C2); deprecated patient-adjacent paths (C3); fail-open exception handlers (C4); compliance and clinical-boundary integrity (C5); mock-auth or no-auth local/self-host boundary warnings (C6) |
Stage 4 is reported as replication_score / replication_tier and does not affect score.final_score. Full scoring rationale and calibration gap disclosures are in docs/SCORING_RATIONALE.md.
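
As a worked example of the weighting formula, the arithmetic can be sketched as below. The function name `final_score` and the sample stage scores are hypothetical, and rounding to two decimals is an assumption; the scanner's own rounding and clamping behavior is documented in docs/SCORING_RATIONALE.md, not here. Stage 4 is deliberately absent because it is reported in a separate lane.

```python
def final_score(stage1, stage2r, stage3, c1_penalty=0.0):
    """Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - C1 penalty."""
    weighted = stage1 * 0.40 + stage2r * 0.20 + stage3 * 0.40
    return round(weighted - c1_penalty, 2)


# Hypothetical stage scores, chosen only to illustrate the arithmetic.
print(final_score(80, 60, 70))                 # 32 + 12 + 28 = 72.0
print(final_score(80, 60, 70, c1_penalty=10))  # 62.0 after a C1 credential penalty
```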
flowchart LR
A[Target repository] --> B[LOCAL_ANALYSIS scanner]
B --> C[Stage 1\nREADME evidence]
B --> D[Stage 2R\nRepo-local consistency]
B --> E[Stage 3\nCode/bio responsibility]
B --> F[Stage 4\nReplication lane]
B --> K[C1–C6\nCode integrity]
B --> CC[CC1–CC3\nAST contract detectors]
C --> G[Weighted evidence score]
D --> G
E --> G
K --> G
CC --> R[code_contract + AIRI coverage]
F --> H[replication_score / tier]
G --> I[Canonical JSON result]
H --> I
R --> I
I --> L[Evidence ledger]
I --> M[Explain trace]
I --> N[Markdown report]
I --> O[PDF packets 1p / 5p / 7p]
I --> P[Interactive HTML dashboard]
Core modules: stem_ai/scanner.py, stem_ai/render.py, stem_ai/cli.py, stem_ai/detectors.py, stem_ai/detector_surface.py, stem_ai/detector_ast.py, stem_ai/detector_bio.py, stem_ai/detector_contract.py, stem_ai/detector_stage4.py, stem_ai/evidence.py, stem_ai/airi_risk_mapping.py, stem_ai/app.py
Each run writes to --out DIR (default: stem_output/).
The plain stem <repo> and stem scan <repo> paths now default to --level 3, which emits the full 7-page evidence packet unless you select a lower level explicitly.
audits/ is retained only for historical benchmark and reference artifacts; routine CLI output should land in stem_output/<repo_slug>/.
| Level | Pages | Audience | Artifacts |
|---|---|---|---|
| --level 1 | 1 | Executive / triage (legacy) | Score, tier, stage cards, code integrity summary |
| --level 2 | 5 | Standard audit review | Level 1 + Stage 1/2R/3/4 breakdown, AIRI summary, closeout page |
| --level 3 | 7 | Full evidence packet | Level 2 + Stage 4 replication page, code integrity deep dive, remediation roadmap, metadata page |
<repo>_experiment_results.json # machine-readable score + full evidence object
<repo>_report.html # interactive 5-section HTML dashboard (v1.7.0+)
<repo>_report.md # human-readable audit report
<repo>_brief_1p.pdf # Level 1 executive dashboard
<repo>_detailed_5p.pdf # Level 2 standard review packet
<repo>_detailed_7p.pdf # Level 3 full review packet
<repo>_explain.txt # --explain: file/line/snippet proof trace
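
Downstream tooling that consumes the JSON result might check the tier along these lines. The CLI already provides `stem gate --min-tier`, so this is only a sketch for custom pipelines. It assumes the dotted names `score.final_score` and `score.formal_tier` used elsewhere in this README correspond to a nested `score` object and that the tier label begins with `T0` to `T4`; both are assumptions, and the helper names are hypothetical.

```python
import json
import re


def tier_rank(tier):
    """Extract the numeric rank from a tier label such as 'T3' or 'T3 Supervised'."""
    match = re.match(r"T([0-4])", str(tier).strip())
    if match is None:
        raise ValueError(f"unrecognized tier label: {tier!r}")
    return int(match.group(1))


def meets_min_tier(result, min_tier):
    """Return True when the result's formal tier is at or above min_tier."""
    return tier_rank(result["score"]["formal_tier"]) >= tier_rank(min_tier)


# Hypothetical payload standing in for <repo>_experiment_results.json.
sample = json.loads('{"score": {"final_score": 72.0, "formal_tier": "T3"}}')
print(meets_min_tier(sample, "T2"))  # True
print(meets_min_tier(sample, "T4"))  # False
```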
--format html generates a self-contained interactive dashboard (v1.7.0+). Single .html file — no network, no external dependencies.
Example interactive HTML audit
- Open in browser: https://htmlpreview.github.io/?https://raw.githubusercontent.com/flamehaven01/STEM-BIO-AI/main/docs/assets/report-preview/yorkeccak_bio_report.html
- Raw HTML artifact: docs/assets/report-preview/yorkeccak_bio_report.html
5 sections: Executive Summary · Decision Path · Code Integrity · AIRI Coverage · Evidence Detail
Interactive features: sticky scroll-spy nav · repo hyperlink in the hero header · ? tooltip icons on every metric · click-to-expand integrity cards · covered/gaps + domain filtering for AIRI risks · FAIL/WARN/PASS/INFO filter on the evidence ledger.
Current 1.7.8 HTML semantics:
- Decision Path explains score construction and policy posture with Configured, Not Rewritten
- Code Integrity surfaces the split between C4 fail-open exceptions, C5 compliance/boundary integrity, and C6 mock-auth/no-auth trust boundaries
- AIRI Coverage distinguishes the full local AIRI registry, the curated runtime bundle, and the detector mapping registry
- covered AIRI rows carry bounded why mapped reasoning derived from detector-trigger evidence plus the local detector-mapping registry
This is a review aid, not a claim that AIRI independently verified the repository.
Sample PDF: Download the 7-page full packet preview
Every scored item maps to a concrete, inspectable detection method. No inference, no LLM judgment.
Full detection table
| Component | Detection Method |
|---|---|
| Stage 1 baseline | Non-zero README present (+60 base) |
| Stage 1 domain signal | Bio-domain keyword regex in README and package metadata |
| Stage 1 hype penalties (H1–H6) | Regex: clinical certainty, regulatory approval, autonomous replacement, breakthrough marketing, universal generalization, perfect accuracy claims |
| Stage 1 responsibility signals (R1–R5) | Regex: limitations section, regulatory framework, clinical disclaimer (CA-severity-weighted), demographic-bias disclosure, reproducibility provisions |
| Stage 2R consistency | Vocabulary set intersection across README/docs/package/tests; limitation repetition; clinical-boundary contradiction, version-staleness, and workflow-support deductions |
| Stage 3 T1 CI | .github/workflows/ contains at least one file |
| Stage 3 T2 domain tests | tests/ directory text contains bio-domain vocabulary (regex) |
| Stage 3 T3 changelog | CHANGELOG file presence + bug-fix/patch/security entry detection (3-tier: 0/+5/+15) |
| Stage 3 B1 data provenance | Dependency manifest presence + IRB/dataset-citation language detection (3-tier: 0/+10/+15) |
| Stage 3 B2 bias measurement | Bias/limitations vocabulary + quantitative measurement evidence (subgroup analysis, AUROC, demographic parity) (3-tier: 0/+8/+15) |
| Stage 3 B3 COI/funding | Funding, grant, sponsor, conflict-of-interest language in README/docs/FUNDING.md |
| Stage 4 containers | Dockerfile or compose file present |
| Stage 4 reproducibility target | Makefile with reproduce/eval/benchmark/test targets |
| Stage 4 dependency lock | Environment/lock/requirements file; exact pins or hash evidence |
| Stage 4 artifact references | Dataset/model/checkpoint URLs or checksum files |
| Stage 4 citation/interface | CITATION.cff; argparse CLI entry points (AST) |
| Stage 4 license restriction | Non-commercial, research-only, academic-only, no-clinical-use restrictions in LICENSE/README |
| CA severity | Clinical/diagnostic phrase regex in README, docs, and package metadata |
| C1 credentials | AWS AKIA*, OpenAI sk-*, GitHub ghp_*, api_key=... patterns; obvious placeholders excluded from penalty |
| C2 dependency pinning | == or hash pin vs. loose >=, ~=, <, > ranges |
| C3 deprecated paths | Patient-metadata patterns in deprecated/, legacy/, archive/ directories |
| C4 fail-open | except Exception: pass or except: pass in Python source (AST) |
| C5 compliance boundary integrity | Unsupported legal/compliance claims or missing clinical-boundary integrity in reviewed sources |
| CC1 clinical zero default | AST scan of function defaults: keyword-only and positional params named confidence_threshold, score_threshold, min_confidence, etc. defaulted to 0.0 |
| CC2 API contract | README-declared names cross-checked against __all__ exports; phantom APIs flagged |
| CC3 shallow validator | validate_* / check_* functions using only len() (no regex structure check) flagged as insufficient for clinical/PII validation |
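
To illustrate the kind of check the C2 row describes, here is a minimal sketch that classifies one requirements-file line as hashed, pinned, loose, or unpinned. It is not the detector in stem_ai/; the function name `classify_requirement` and the four labels are hypothetical, and the handling of comments and environment markers is a simplifying assumption.

```python
import re

# Loose range operators named in the C2 row: >=, ~=, <, >
LOOSE = re.compile(r"~=|[<>]")


def classify_requirement(line):
    """Classify one requirements.txt-style line; returns None for blank or comment lines."""
    spec = line.split("#", 1)[0].strip()  # drop trailing comments
    if not spec:
        return None
    if "--hash=" in spec:
        return "hashed"
    spec = spec.split(";", 1)[0]  # ignore environment markers
    if LOOSE.search(spec):
        return "loose"
    if "==" in spec:
        return "pinned"
    return "unpinned"


for example in ["numpy==1.26.4", "pandas>=2.0", "scipy"]:
    print(example, "->", classify_requirement(example))
```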
Stage 2R and Stage 3 rubric artifacts now surface additive detector_id and decision_basis fields so reviewers can see which bounded detector or contradiction rule produced a deduction or credit.
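
The C4 row in the detection table describes an AST-based check for fail-open handlers. A minimal sketch of that idea using Python's standard `ast` module follows. It is an illustration under stated assumptions, not the code in stem_ai/detector_ast.py: the function name `find_fail_open_handlers` is hypothetical, and it only matches handlers whose entire body is a single `pass`.

```python
import ast


def find_fail_open_handlers(source):
    """Return line numbers of handlers shaped like `except: pass` or `except Exception: pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        bare = node.type is None
        broad = isinstance(node.type, ast.Name) and node.type.id == "Exception"
        only_pass = len(node.body) == 1 and isinstance(node.body[0], ast.Pass)
        if (bare or broad) and only_pass:
            hits.append(node.lineno)
    return sorted(hits)


sample = '''
try:
    load_patient_record()
except Exception:
    pass

try:
    parse()
except ValueError:
    pass
'''
print(find_fail_open_handlers(sample))  # [4]
```

Because the match is on the syntax tree rather than on text, a narrower handler such as `except ValueError: pass` is not flagged in this sketch.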
The advisory system exports a sanitized, provider-neutral handoff packet and validates provider responses — without making any provider API call.
stem advisory validate /path/to/repo # offline contract check
stem advisory packet /path/to/repo # export sanitized input packet
stem advisory check-response /path/to/repo --response FILE

Non-negotiable rules (enforced by the validator):
- Provider output cannot override score.final_score or score.formal_tier
- Every advisory item must cite exact finding_id strings from allowed_finding_ids
- Raw repository source text is not included in provider packets
- Responses containing clinical safety, efficacy, regulatory, or medical-advice claims are rejected
- allowed_finding_ids is capped at 40 entries per packet
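
Three of the rules above (no score override, allowlisted citations, the 40-entry cap) lend themselves to a short sketch. The response field names `score`, `items`, and `finding_ids` are hypothetical stand-ins; the real contract shapes are defined in docs/API_CONTRACT.md. The clinical-claim rejection rule is not modeled here.

```python
MAX_ALLOWED_FINDING_IDS = 40  # cap stated in the rules above


def check_advisory_response(response, allowed_finding_ids):
    """Return a list of rule violations; an empty list means this sketch found none."""
    errors = []
    if len(allowed_finding_ids) > MAX_ALLOWED_FINDING_IDS:
        errors.append("allowed_finding_ids exceeds 40 entries")
    allowed = set(allowed_finding_ids)

    # Provider output must not carry score overrides.
    score = response.get("score")
    if isinstance(score, dict):
        for key in ("final_score", "formal_tier"):
            if key in score:
                errors.append(f"response attempts to set score.{key}")

    # Every advisory item must cite exact, allowlisted finding_id strings.
    for index, item in enumerate(response.get("items", [])):
        cited = item.get("finding_ids", [])
        if not cited:
            errors.append(f"item {index} cites no finding_id")
        for finding_id in cited:
            if finding_id not in allowed:
                errors.append(f"item {index} cites unknown finding_id {finding_id!r}")
    return errors
```

Returning a list of violations instead of raising on the first one lets a reviewer see every problem in a single pass.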
Packet hardening added in v1.5.7:
- provider_request now carries a secret-free request schema plus deterministic argument-validation status
- contract_schemas exports the advisory input/output contract shapes for downstream validators
- packet_contract confirms allowlist parity, snippet omission, and non-negative omission counts before handoff
Secret boundary hardening added in v1.5.9:
- provider-specific environment variables are recognized before the generic advisory key fallback
- provider handoff metadata exports endpoint-policy validation and the expected env-var name, never the key value
- embedded-credential URLs are rejected; cloud providers require https; plain http is limited to localhost
- .env files are ignored by default; .env.example documents supported variable names only
- --advisory call is now the explicit provider-call boundary, with centralized redaction, logging-policy export, child-env allowlist reporting, and artifact pre-write sanitization
Full contract: docs/API_CONTRACT.md
Secret policy: docs/ADVISORY_SECRET_HANDLING.md
Runtime boundary: docs/ADVISORY_RUNTIME.md
STEM BIO-AI uses local derived data from the MIT AI Risk Repository (AIRI) as a broader risk-vocabulary layer around deterministic repository findings.
Upstream references:
- MIT AI Risk Repository: https://airisk.mit.edu/
- AI Incident Tracker: https://airisk.mit.edu/ai-incident-tracker
How AIRI is used here:
- AIRI does not replace the local scoring and audit system
- AIRI does not prove harm, causality, clinical safety, or regulatory status
- AIRI helps place local findings into a wider risk vocabulary for review
In the current 1.7.8 line, AIRI is used through three local governed layers:
- full normalized local registry
- curated runtime bundle used by deterministic scans
- detector-to-risk mapping registry plus known-gap tracking
This allows STEM BIO-AI to keep scan behavior local and deterministic while still surfacing broader AI risk language, provenance, and bundle-scope boundaries in runtime artifacts.
License / provenance note:
- Upstream AIRI source license: MIT
- Local attribution and usage details: docs/AIRI_DATA_GOVERNANCE.md, docs/THIRD_PARTY_DATA.md
STEM BIO-AI can help teams become more audit-ready, but it does not by itself create certification, attestation, or legal compliance.
What can be prepared internally:
- runtime and security evidence review
- control-matrix and evidence-room preparation
- validation-package assembly for electronic records / signature workflows
- gap assessment for logging, access control, change control, retention, and traceability
- independent third-party audit readiness and penetration-test readiness
What still requires external review or attestation:
- SOC 2 report issuance
- ISO 13485 certification
- strong 21 CFR Part 11 compliant claims
- independent audit passed claims
In other words: internal teams can do substantial readiness work, but external claims still require external auditors, certification bodies, or independent assessors.
Related boundary guidance: docs/REGULATORY_MAPPING.md
The repository keeps a versioned MICA memory layer under memory/ for agent-session initialization,
drift control, and release provenance. Historical snapshots are retained as archive; the active layer
is selected by memory/mica.yaml.
The active package now follows the non-breaking MICA v0.2.4 runtime contract:
- memory/mica.yaml is the composition contract
- python tools/mica_pct.py . validates package integrity
- python tools/mica_runtime.py . --format text emits a portable session summary
- DI binding remains progressive rather than speculative
- critical invariants are not mass-rewritten just to satisfy schema formality
Operational reference: docs/MICA_MEMORY.md
Live demo: huggingface.co/spaces/Flamehaven/stem-bio-ai
The Space runs the same deterministic local scanner on public GitHub repositories. No provider API call is made.
Run locally:
pip install -e .[demo]
python app.py

STEM-BIO-AI/
stem_ai/ # Core Python package
docs/ # API contract, advisory runtime/secret policy, scoring rationale, MICA policy, report previews
memory/ # Versioned MICA archive/playbook/lessons; active layer selected by mica.yaml
audits/ # Historical benchmark/reference artifacts only
stem_output/ # Default live CLI output root (generated, ignored)
scripts/ # Benchmark and validation scripts
tests/ # Regression test suite
app.py # HuggingFace Spaces / Gradio entry point
pyproject.toml # Package metadata and extras
SKILL.md # Universal agent skill definition
CHANGELOG.md # Version history
# Claude Code
git clone --depth 1 https://github.com/flamehaven01/STEM-BIO-AI.git ~/.claude/skills/stem-bio-ai
# Generic agent frameworks
git clone --depth 1 https://github.com/flamehaven01/STEM-BIO-AI.git ~/.agents/skills/stem-bio-ai

See CONTRIBUTING.md. High-value areas: rubric discrimination examples, clinical-adjacency trigger refinements, additional bio-domain benchmark repositories, report rendering improvements.
Preferred citation metadata lives in CITATION.cff.
Current concept DOI-backed archive for the 1.7.8 line:
@software{stem-bio-ai,
author = {Yun, Kwansub},
title = {STEM BIO-AI: Deterministic Evidence-Surface Scanner for Bio/Medical AI Repositories},
version = {1.7.8},
year = {2026},
doi = {10.5281/zenodo.20154479},
url = {https://doi.org/10.5281/zenodo.20154479}
}

Apache 2.0. See LICENSE.
Maintained by flamehaven01









