STEM BIO-AI

Deterministic evidence-surface scanner for bio/medical AI repositories.
No LLM. No API key. No model runtime. No secrets sent anywhere.

Why STEM BIO-AI

Bio and medical AI repositories vary enormously in evidence quality — from rigorous academic tools to marketing-grade demos that carry clinical language with no data provenance, no reproducibility path, and no clinical-use disclaimer. Manual review is slow and inconsistent.

STEM BIO-AI scans the observable repository surface — README, docs, code structure, CI configuration, dependency manifests, changelogs — and maps detected signals to a structured evidence tier (T0–T4). The scan runs in seconds on a local clone, produces machine-readable JSON and PDF reports, and makes every scoring decision traceable to a specific file, line, and pattern.

A T4 score means strong observable evidence signals. It does not mean the repository is safe for clinical deployment — that requires independent expert validation.

Quick Start

git clone https://github.com/flamehaven01/STEM-BIO-AI.git
cd STEM-BIO-AI
pip install stem-ai

# editable local install with PDF output support
pip install -e ".[pdf]"

# fastest path: scan a local repository
stem /path/to/bio-ai-repo

# 7-page full evidence packet with proof trace
stem scan /path/to/bio-ai-repo --level 3 --format all --explain

# workflow-oriented CLI
stem scan /path/to/bio-ai-repo --level 2
stem scan /path/to/bio-ai-repo --policy strict_clinical_adjacency
stem gate /path/to/bio-ai-repo --min-tier T2
stem policy list
stem policy explain strict_clinical_adjacency
stem policy derive --clinical-strictness 4 --code-integrity-priority 3 --reproducibility-priority 2 --structured-limitations-requirement 3
stem policy simulate /path/to/bio-ai-repo --clinical-strictness 4 --code-integrity-priority 3 --reproducibility-priority 2 --structured-limitations-requirement 3
stem policy simulate /path/to/bio-ai-repo --profile-file policy/drafts/scoring_profile.reproducibility_first.v1.json
stem advisory validate /path/to/bio-ai-repo
stem advisory packet /path/to/bio-ai-repo --output advisory_out
stem advisory check-response /path/to/bio-ai-repo --response provider_advisory.json

# backward-compatible shortcuts still work
stem /path/to/bio-ai-repo --level 3 --format all --explain
stem audit /path/to/bio-ai-repo --tier-gate T3 --quiet

Clone the target repository first; the CLI operates on local paths only.

Calibration profiles are implemented in mirror_only mode in 1.7.8. --policy changes which profile is surfaced in artifacts, while policy derive and policy simulate provide governed preview lanes that do not mutate the authoritative deterministic score path. policy simulate --profile-file <path> allows local experiments with schema-valid profiles without registering a new named policy. In the current rule scope, strict_clinical_adjacency is the only release-grade named recommendation; stronger reproducibility postures still fall back to preview_only simulation deltas rather than a named profile.

Researchers and domain specialists are expected to influence calibration through derive, simulate, and documented preview/profile proposals. The intent interview uses a governed 1–5 posture scale, while official score-affecting policy changes still require profile promotion rather than direct ad hoc tuning.

Full CLI reference: docs/CLI_REFERENCE.md

Proof surfaces

Demo: Hugging Face Space
API contract: docs/API_CONTRACT.md
Secret handling: docs/ADVISORY_SECRET_HANDLING.md
Advisory runtime boundary: docs/ADVISORY_RUNTIME.md
Example audits: docs/EXAMPLE_AUDITS.md
Scoring rationale: docs/SCORING_RATIONALE.md
Calibration profile architecture: docs/CALIBRATION_PROFILE_DESIGN.md
AIRI data governance: docs/AIRI_DATA_GOVERNANCE.md
Third-party data attribution: docs/THIRD_PARTY_DATA.md
Deterministic diagnostics: docs/DETERMINISTIC_DIAGNOSTICS.md
Regulatory traceability mapping: docs/REGULATORY_MAPPING.md
Regulatory basis registry: docs/regulatory_basis_registry.v1.json
CLI reference: docs/CLI_REFERENCE.md

Triage Tiers

T0 Rejected (0–39): insufficient evidence — do not rely on without independent expert validation
T1 Quarantine (40–54): exploratory review only — expert validation required before any use
T2 Caution (55–69): research reference and supervised non-clinical technical review only
T3 Supervised (70–84): supervised institutional review candidate
T4 Candidate (85–100): strong evidence posture — clinical deployment still requires independent validation

Clinical-adjacent repositories without an explicit disclaimer are hard-capped at T2 (score ≤ 69). Repositories with unbounded CA-DIRECT claims are hard-capped at T0 (score ≤ 39).
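
The tier bands and the two hard caps above can be read as a small lookup. The following is a minimal sketch, not the scanner's own code: the function names `apply_caps` and `tier_for` and the boolean flags are hypothetical, and it assumes a score falls into the band whose lower bound it meets or exceeds.

```python
# Tier bands copied from the list above: (lower bound, tier, label).
TIER_BANDS = [
    (85, "T4", "Candidate"),
    (70, "T3", "Supervised"),
    (55, "T2", "Caution"),
    (40, "T1", "Quarantine"),
    (0, "T0", "Rejected"),
]

T2_CAP = 69  # clinical-adjacent repository without an explicit disclaimer
T0_CAP = 39  # repository with unbounded CA-DIRECT claims


def apply_caps(score, missing_disclaimer=False, unbounded_ca_direct=False):
    """Clamp a score to 0-100, then apply the two hard caps (hypothetical helper)."""
    capped = max(0, min(100, score))
    if missing_disclaimer:
        capped = min(capped, T2_CAP)
    if unbounded_ca_direct:
        capped = min(capped, T0_CAP)
    return capped


def tier_for(score):
    """Map a 0-100 score to (tier, label) using the bands above."""
    for lower, tier, label in TIER_BANDS:
        if score >= lower:
            return tier, label
    return "T0", "Rejected"


if __name__ == "__main__":
    # A raw 92 would be T4, but a missing disclaimer caps it at 69.
    print(tier_for(apply_caps(92, missing_disclaimer=True)))
```

Under these assumptions a raw score of 92 with a missing disclaimer lands in T2, and the same score with unbounded CA-DIRECT claims lands in T0.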

Tier boundary derivation and calibration gap disclosures: docs/SCORING_RATIONALE.md.

Scoring Model

Final = (Stage 1 × 0.40) + (Stage 2R × 0.20) + (Stage 3 × 0.40) − C1 Penalty

Stage	Weight	What Is Measured
Stage 1 README Evidence	40%	Bio-domain vocabulary; H1–H6 hype-claim penalties; R1–R5 responsibility signals (limitations, regulatory framing, clinical disclaimer, demographic-bias, reproducibility)
Stage 2R Repo-Local Consistency	20%	Vocabulary overlap across README, docs, package metadata, CI, and tests; limitation repetition; contradiction, staleness, and unsupported-workflow deductions
Stage 3 Code/Bio Responsibility	40%	CI presence (T1); domain test coverage (T2); changelog hygiene (T3); data provenance and IRB/dataset citation (B1); bias/limitation measurement evidence (B2); conflict-of-interest disclosure (B3)
Stage 4 Replication Evidence	Separate lane	Containers; reproducibility targets; dependency locks/pins; dataset and model artifact references; seed, CLI, and citation signals; license/use-scope restrictions
C1–C6 Code Integrity	Penalty / advisory	Hardcoded credentials (C1, −10 pts); dependency pinning and external-service fragility (C2); deprecated patient-adjacent paths (C3); fail-open exception handlers (C4); compliance and clinical-boundary integrity (C5); mock-auth or no-auth local/self-host boundary warnings (C6)

Stage 4 is reported as replication_score / replication_tier and does not affect score.final_score. Full scoring rationale and calibration gap disclosures are in docs/SCORING_RATIONALE.md.
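
As a worked example of the weighted formula above, the sketch below combines three stage scores and the C1 penalty. It is an illustration only: the function name `final_score` and the `c1_triggered` flag are hypothetical, each stage score is assumed to be on a 0–100 scale, and any rounding or clamping the real scanner applies is not modelled. Stage 4 is deliberately absent because it does not feed the final score.

```python
# Weights from the Scoring Model formula; Stage 4 is a separate lane.
STAGE_WEIGHTS = {"stage_1": 0.40, "stage_2r": 0.20, "stage_3": 0.40}
C1_PENALTY_POINTS = 10  # hardcoded-credential penalty (C1, -10 pts)


def final_score(stage_1, stage_2r, stage_3, c1_triggered=False):
    """Weighted sum of the three scored stages minus the C1 penalty (sketch)."""
    weighted = (
        stage_1 * STAGE_WEIGHTS["stage_1"]
        + stage_2r * STAGE_WEIGHTS["stage_2r"]
        + stage_3 * STAGE_WEIGHTS["stage_3"]
    )
    penalty = C1_PENALTY_POINTS if c1_triggered else 0
    return weighted - penalty


if __name__ == "__main__":
    # 80*0.40 + 70*0.20 + 60*0.40 = 32 + 14 + 24 = 70, minus 10 for C1.
    print(round(final_score(80, 70, 60, c1_triggered=True), 2))
```

With stage scores of 80, 70, and 60 the weighted sum is 70; a triggered C1 penalty brings it to 60, which moves the result from the T3 band into T2.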

Architecture

flowchart LR
    A[Target repository] --> B[LOCAL_ANALYSIS scanner]
    B --> C[Stage 1\nREADME evidence]
    B --> D[Stage 2R\nRepo-local consistency]
    B --> E[Stage 3\nCode/bio responsibility]
    B --> F[Stage 4\nReplication lane]
    B --> K[C1–C6\nCode integrity]
    B --> CC[CC1–CC3\nAST contract detectors]
    C --> G[Weighted evidence score]
    D --> G
    E --> G
    K --> G
    CC --> R[code_contract + AIRI coverage]
    F --> H[replication_score / tier]
    G --> I[Canonical JSON result]
    H --> I
    R --> I
    I --> L[Evidence ledger]
    I --> M[Explain trace]
    I --> N[Markdown report]
    I --> O[PDF packets 1p / 5p / 7p]
    I --> P[Interactive HTML dashboard]

Core modules: stem_ai/scanner.py, stem_ai/render.py, stem_ai/cli.py, stem_ai/detectors.py, stem_ai/detector_surface.py, stem_ai/detector_ast.py, stem_ai/detector_bio.py, stem_ai/detector_contract.py, stem_ai/detector_stage4.py, stem_ai/evidence.py, stem_ai/airi_risk_mapping.py, stem_ai/app.py

Output Artifacts

Each run writes to --out DIR (default: stem_output/). Plain stem <repo> and stem scan <repo> now default to --level 3, which emits the full 7-page evidence packet unless you explicitly select a lower level. audits/ is retained only for historical benchmark and reference artifacts; routine CLI output should land in stem_output/<repo_slug>/.

Level	Pages	Audience	Artifacts
`--level 1`	1	Executive / triage (legacy)	Score, tier, stage cards, code integrity summary
`--level 2`	5	Standard audit review	Level 1 + Stage 1/2R/3/4 breakdown, AIRI summary, closeout page
`--level 3`	7	Full evidence packet	Level 2 + Stage 4 replication page, code integrity deep dive, remediation roadmap, metadata page

<repo>_experiment_results.json   # machine-readable score + full evidence object
<repo>_report.html               # interactive 5-section HTML dashboard (v1.7.0+)
<repo>_report.md                 # human-readable audit report
<repo>_brief_1p.pdf              # Level 1 executive dashboard
<repo>_detailed_5p.pdf           # Level 2 standard review packet
<repo>_detailed_7p.pdf           # Level 3 full review packet
<repo>_explain.txt               # --explain: file/line/snippet proof trace
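
For downstream tooling, the JSON artifact is the natural integration point. The sketch below shows one way to pull headline fields out of a results file. The field names `score.final_score`, `score.formal_tier`, `replication_score`, and `replication_tier` come from this README, but their exact nesting is an assumption here (a top-level `score` object and top-level replication fields); check docs/API_CONTRACT.md for the actual shape. The helper names are hypothetical.

```python
import json
from pathlib import Path


def summarize(result):
    """Extract headline fields from a parsed results object (assumed layout)."""
    score = result.get("score", {})
    return {
        "final_score": score.get("final_score"),
        "formal_tier": score.get("formal_tier"),
        # Assumed to sit at the top level; adjust if the contract nests them.
        "replication_score": result.get("replication_score"),
        "replication_tier": result.get("replication_tier"),
    }


def load_summary(path):
    """Read a <repo>_experiment_results.json file and summarize it."""
    text = Path(path).read_text(encoding="utf-8")
    return summarize(json.loads(text))
```

Missing fields come back as `None` rather than raising, so a script can report an incomplete artifact instead of crashing on it.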

HTML Report Dashboard

--format html generates a self-contained interactive dashboard (v1.7.0+). Single .html file — no network, no external dependencies.

Example interactive HTML audit

Open in browser: https://htmlpreview.github.io/?https://raw.githubusercontent.com/flamehaven01/STEM-BIO-AI/main/docs/assets/report-preview/yorkeccak_bio_report.html
Raw HTML artifact: docs/assets/report-preview/yorkeccak_bio_report.html

5 sections: Executive Summary · Decision Path · Code Integrity · AIRI Coverage · Evidence Detail

Interactive features: sticky scroll-spy nav · repo hyperlink in the hero header · ? tooltip icons on every metric · click-to-expand integrity cards · covered/gaps + domain filtering for AIRI risks · FAIL/WARN/PASS/INFO filter on the evidence ledger.

Current 1.7.8 HTML semantics:

Decision Path explains score construction and policy posture with "Configured, Not Rewritten"
Code Integrity surfaces the split between C4 fail-open exceptions, C5 compliance/boundary integrity, and C6 mock-auth/no-auth trust boundaries
AIRI Coverage distinguishes the full local AIRI registry, the curated runtime bundle, and the detector mapping registry
Covered AIRI rows carry bounded "why mapped" reasoning derived from detector-trigger evidence plus the local detector-mapping registry

This is a review aid, not a claim that AIRI independently verified the repository.

Report Preview

Sample PDF: Download the 7-page full packet preview

[Preview images: full-packet pages 1 through 7]

Detection Methods

Every scored item maps to a concrete, inspectable detection method. No inference, no LLM judgment.

Full detection table

Component	Detection Method
Stage 1 baseline	Non-empty README present (+60 base)
Stage 1 domain signal	Bio-domain keyword regex in README and package metadata
Stage 1 hype penalties (H1–H6)	Regex: clinical certainty, regulatory approval, autonomous replacement, breakthrough marketing, universal generalization, perfect accuracy claims
Stage 1 responsibility signals (R1–R5)	Regex: limitations section, regulatory framework, clinical disclaimer (CA-severity-weighted), demographic-bias disclosure, reproducibility provisions
Stage 2R consistency	Vocabulary set intersection across README/docs/package/tests; limitation repetition; clinical-boundary contradiction, version-staleness, and workflow-support deductions
Stage 3 T1 CI	`.github/workflows/` contains at least one file
Stage 3 T2 domain tests	`tests/` directory text contains bio-domain vocabulary (regex)
Stage 3 T3 changelog	CHANGELOG file presence + bug-fix/patch/security entry detection (3-tier: 0/+5/+15)
Stage 3 B1 data provenance	Dependency manifest presence + IRB/dataset-citation language detection (3-tier: 0/+10/+15)
Stage 3 B2 bias measurement	Bias/limitations vocabulary + quantitative measurement evidence (subgroup analysis, AUROC, demographic parity) (3-tier: 0/+8/+15)
Stage 3 B3 COI/funding	Funding, grant, sponsor, conflict-of-interest language in README/docs/FUNDING.md
Stage 4 containers	Dockerfile or compose file present
Stage 4 reproducibility target	Makefile with reproduce/eval/benchmark/test targets
Stage 4 dependency lock	Environment/lock/requirements file; exact pins or hash evidence
Stage 4 artifact references	Dataset/model/checkpoint URLs or checksum files
Stage 4 citation/interface	CITATION.cff; argparse CLI entry points (AST)
Stage 4 license restriction	Non-commercial, research-only, academic-only, no-clinical-use restrictions in LICENSE/README
CA severity	Clinical/diagnostic phrase regex in README, docs, and package metadata
C1 credentials	AWS `AKIA`, OpenAI `sk-`, GitHub `ghp_*`, `api_key=...` patterns; obvious placeholders excluded from penalty
C2 dependency pinning	`==` or hash pin vs. loose `>=`, `~=`, `<`, `>` ranges
C3 deprecated paths	Patient-metadata patterns in `deprecated/`, `legacy/`, `archive/` directories
C4 fail-open	`except Exception: pass` or `except: pass` in Python source (AST)
C5 compliance boundary integrity	Unsupported legal/compliance claims or missing clinical-boundary integrity in reviewed sources
CC1 clinical zero default	AST scan of function defaults: keyword-only and positional params named `confidence_threshold`, `score_threshold`, `min_confidence`, etc. defaulted to `0.0`
CC2 API contract	README-declared names cross-checked against `__all__` exports; phantom APIs flagged
CC3 shallow validator	`validate_` / `check_` functions using only `len()` (no regex structure check) flagged as insufficient for clinical/PII validation
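
To make the C4 row concrete, here is a minimal sketch of an AST-based fail-open check using Python's standard `ast` module. It illustrates the technique named in the table, not the project's detector: the function name `find_fail_open_handlers` is hypothetical, and it only flags handlers whose entire body is a single `pass`.

```python
import ast


def find_fail_open_handlers(source, filename="<string>"):
    """Return line numbers of `except: pass` / `except Exception: pass` handlers."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        is_bare = node.type is None  # `except:`
        is_broad = isinstance(node.type, ast.Name) and node.type.id == "Exception"
        only_pass = len(node.body) == 1 and isinstance(node.body[0], ast.Pass)
        if (is_bare or is_broad) and only_pass:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''
try:
    load()
except Exception:
    pass

try:
    load()
except ValueError:
    pass

try:
    load()
except:
    pass
'''

if __name__ == "__main__":
    print(find_fail_open_handlers(SAMPLE))
```

The narrow `except ValueError: pass` handler is not reported, since the table describes only the bare and `Exception` forms.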

Stage 2R and Stage 3 rubric artifacts now surface additive detector_id and decision_basis fields so reviewers can see which bounded detector or contradiction rule produced a deduction or credit.
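
The CC1 row in the detection table describes a scan of function defaults. A rough sketch of that idea follows, again with the standard `ast` module. The function name is hypothetical, the name set contains only the three parameter names the table spells out (the table's "etc." is left unfilled), and this version treats any numeric zero as a match, which is slightly broader than the literal `0.0` in the table.

```python
import ast

# Only the names listed explicitly in the detection table.
THRESHOLD_NAMES = {"confidence_threshold", "score_threshold", "min_confidence"}


def _is_zero(node):
    """True for a numeric literal equal to zero (bools excluded)."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
        and node.value == 0
    )


def find_zero_threshold_defaults(source):
    """Return sorted (function, parameter) pairs whose threshold defaults to zero."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        positional = args.posonlyargs + args.args
        # Positional defaults align with the last len(defaults) parameters.
        start = len(positional) - len(args.defaults)
        pairs = list(zip(positional[start:], args.defaults))
        # Keyword-only defaults align one-to-one; None means "no default".
        pairs += [(a, d) for a, d in zip(args.kwonlyargs, args.kw_defaults) if d is not None]
        for arg, default in pairs:
            if arg.arg in THRESHOLD_NAMES and _is_zero(default):
                hits.append((node.name, arg.arg))
    return sorted(hits)


SAMPLE = '''
def predict(x, confidence_threshold=0.0):
    return x

def rank(items, *, min_confidence=0.0, top_k=5):
    return items

def safe(x, score_threshold=0.5):
    return x
'''
```

Both positional and keyword-only parameters are covered, matching the table's wording; `safe` is not reported because its default is non-zero.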

AI Advisory Contract

The advisory system exports a sanitized, provider-neutral handoff packet and validates provider responses — without making any provider API call.

stem advisory validate /path/to/repo                # offline contract check
stem advisory packet /path/to/repo                  # export sanitized input packet
stem advisory check-response /path/to/repo --response FILE

Non-negotiable rules (enforced by the validator):

Provider output cannot override score.final_score or score.formal_tier
Every advisory item must cite exact finding_id strings from allowed_finding_ids
Raw repository source text is not included in provider packets
Responses containing clinical safety, efficacy, regulatory, or medical-advice claims are rejected
allowed_finding_ids is capped at 40 entries per packet
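
Three of the rules above are mechanical enough to sketch. The code below is an illustration, not the shipped validator: `allowed_finding_ids`, `finding_id`, `final_score`, and `formal_tier` are named in this README, while the response layout (`items`, `finding_ids`, a `score` object) and the `F-001`-style identifiers are hypothetical. It does not attempt the clinical-claim rejection or the source-text omission checks.

```python
MAX_ALLOWED_FINDING_IDS = 40
PROTECTED_SCORE_FIELDS = ("final_score", "formal_tier")


def check_response(packet, response):
    """Return a list of rule violations; an empty list means these checks passed."""
    errors = []
    allowed = packet.get("allowed_finding_ids", [])
    if len(allowed) > MAX_ALLOWED_FINDING_IDS:
        errors.append("allowed_finding_ids exceeds 40 entries")
    allowed_set = set(allowed)

    # Rule: provider output cannot override the deterministic score or tier.
    for field in PROTECTED_SCORE_FIELDS:
        if field in response.get("score", {}):
            errors.append(f"response attempts to set score.{field}")

    # Rule: every advisory item cites exact finding_id strings from the allowlist.
    for index, item in enumerate(response.get("items", [])):
        cited = item.get("finding_ids", [])
        if not cited:
            errors.append(f"item {index} cites no finding_id")
        for finding_id in cited:
            if finding_id not in allowed_set:
                errors.append(f"item {index} cites unknown finding_id {finding_id!r}")
    return errors


if __name__ == "__main__":
    packet = {"allowed_finding_ids": ["F-001", "F-002"]}
    response = {"items": [{"finding_ids": ["F-001"]}]}
    print(check_response(packet, response))
```

Returning a list of messages rather than raising on the first failure lets a reviewer see every violation in a single pass.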

Packet hardening added in v1.5.7:

provider_request now carries a secret-free request schema plus deterministic argument-validation status
contract_schemas exports the advisory input/output contract shapes for downstream validators
packet_contract confirms allowlist parity, snippet omission, and non-negative omission counts before handoff

Secret boundary hardening added in v1.5.9:

provider-specific environment variables are recognized before the generic advisory key fallback
provider handoff metadata exports endpoint-policy validation and the expected env-var name, never the key value
embedded-credential URLs are rejected; cloud providers require https; plain http is limited to localhost
.env files are ignored by default; .env.example documents supported variable names only
--advisory call is now the explicit provider-call boundary, with centralized redaction, logging-policy export, child-env allowlist reporting, and artifact pre-write sanitization
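
The endpoint rules in this list (no embedded credentials, https for cloud providers, plain http only for localhost) can be sketched with `urllib.parse`. This is a hedged illustration: the function name `check_endpoint` is hypothetical, and treating `127.0.0.1` and `::1` as equivalent to `localhost` is an assumption that the source does not state.

```python
from urllib.parse import urlsplit

# Assumption: loopback addresses count as "localhost" for the http exception.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def check_endpoint(url):
    """Return (ok, reason) for a provider endpoint URL (sketch)."""
    parts = urlsplit(url)
    if parts.username is not None or parts.password is not None:
        return False, "embedded credentials are not allowed"
    host = parts.hostname or ""
    if not host:
        return False, "missing host"
    if parts.scheme == "https":
        return True, "https endpoint"
    if parts.scheme == "http" and host in LOCAL_HOSTS:
        return True, "plain http allowed for localhost only"
    return False, "cloud endpoints require https"


if __name__ == "__main__":
    for candidate in (
        "https://api.example.com/v1",
        "http://localhost:11434/api",
        "http://api.example.com/v1",
        "https://user:secret@api.example.com/v1",
    ):
        print(candidate, check_endpoint(candidate))
```

The credential check runs first so that a URL carrying a username or password is rejected even when it uses https.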

Full contract: docs/API_CONTRACT.md
Secret policy: docs/ADVISORY_SECRET_HANDLING.md
Runtime boundary: docs/ADVISORY_RUNTIME.md

The AI Risk Repository (AIRI)

STEM BIO-AI uses local derived data from the MIT AI Risk Repository (AIRI) as a broader risk-vocabulary layer around deterministic repository findings.

Upstream references:

MIT AI Risk Repository: https://airisk.mit.edu/
AI Incident Tracker: https://airisk.mit.edu/ai-incident-tracker

How AIRI is used here:

AIRI does not replace the local scoring and audit system
AIRI does not prove harm, causality, clinical safety, or regulatory status
AIRI helps place local findings into a wider risk vocabulary for review

In the current 1.7.8 line, AIRI is used through three local governed layers:

full normalized local registry
curated runtime bundle used by deterministic scans
detector-to-risk mapping registry plus known-gap tracking

This allows STEM BIO-AI to keep scan behavior local and deterministic while still surfacing broader AI risk language, provenance, and bundle-scope boundaries in runtime artifacts.

License / provenance note:

Upstream AIRI source license: MIT
Local attribution and usage details: docs/AIRI_DATA_GOVERNANCE.md, docs/THIRD_PARTY_DATA.md

Runtime / Security / Compliance Boundary

STEM BIO-AI can help teams become more audit-ready, but it does not by itself create certification, attestation, or legal compliance.

What can be prepared internally:

runtime and security evidence review
control-matrix and evidence-room preparation
validation-package assembly for electronic records / signature workflows
gap assessment for logging, access control, change control, retention, and traceability
independent third-party audit readiness and penetration-test readiness

What still requires external review or attestation:

SOC 2 report issuance
ISO 13485 certification
strong "21 CFR Part 11 compliant" claims
"independent audit passed" claims

In other words: internal teams can do substantial readiness work, but external claims still require external auditors, certification bodies, or independent assessors.

Related boundary guidance: docs/REGULATORY_MAPPING.md

MICA Memory Layer

The repository keeps a versioned MICA memory layer under memory/ for agent-session initialization, drift control, and release provenance. Historical snapshots are retained as an archive; the active layer is selected by memory/mica.yaml.

The active package now follows the non-breaking MICA v0.2.4 runtime contract:

memory/mica.yaml is the composition contract
python tools/mica_pct.py . validates package integrity
python tools/mica_runtime.py . --format text emits a portable session summary
DI binding remains progressive rather than speculative; critical invariants are not mass-rewritten just to satisfy schema formality

Operational reference: docs/MICA_MEMORY.md

Web Demo

Live demo: huggingface.co/spaces/Flamehaven/stem-bio-ai

The Space runs the same deterministic local scanner on public GitHub repositories. No provider API call is made.

Run locally:

pip install -e ".[demo]"
python app.py

Repository Structure

STEM-BIO-AI/
  stem_ai/              # Core Python package
  docs/                 # API contract, advisory runtime/secret policy, scoring rationale, MICA policy, report previews
  memory/               # Versioned MICA archive/playbook/lessons; active layer selected by mica.yaml
  audits/               # Historical benchmark/reference artifacts only
  stem_output/          # Default live CLI output root (generated, ignored)
  scripts/              # Benchmark and validation scripts
  tests/                # Regression test suite
  app.py                # Hugging Face Spaces / Gradio entry point
  pyproject.toml        # Package metadata and extras
  SKILL.md              # Universal agent skill definition
  CHANGELOG.md          # Version history

Agent Skill Install

# Claude Code
git clone --depth 1 https://github.com/flamehaven01/STEM-BIO-AI.git ~/.claude/skills/stem-bio-ai

# Generic agent frameworks
git clone --depth 1 https://github.com/flamehaven01/STEM-BIO-AI.git ~/.agents/skills/stem-bio-ai

Contributing

See CONTRIBUTING.md. High-value areas: rubric discrimination examples, clinical-adjacency trigger refinements, additional bio-domain benchmark repositories, report rendering improvements.

Citation

Preferred citation metadata lives in CITATION.cff.

Current concept DOI-backed archive for the 1.7.8 line:

https://doi.org/10.5281/zenodo.20154479

@software{stem-bio-ai,
  author  = {Yun, Kwansub},
  title   = {STEM BIO-AI: Deterministic Evidence-Surface Scanner for Bio/Medical AI Repositories},
  version = {1.7.8},
  year    = {2026},
  doi     = {10.5281/zenodo.20154479},
  url     = {https://doi.org/10.5281/zenodo.20154479}
}

License

Apache 2.0. See LICENSE.

Maintained by flamehaven01
