sonicboom15/medchart

medchart

ICD-10 coding from medical chart text using Daimon + Neo4j + Qdrant.

The LLM searches a vector store of 46k ICD-10 codes, verifies each candidate against a Neo4j taxonomy graph, and returns only confirmed codes, each with a confidence score derived from the graph rather than from the model.

Architecture

Chart image / PDF / plain text
        ↓
Surya OCR  (layout detection → region OCR → structured text)
        ↓
Daimon sidecar  (LLM + auto-generated tools)
  ├─ icd10_search    → Qdrant  (semantic vector lookup)
  └─ icd10_graph_*   → Neo4j   (ICD-10 taxonomy traversal)
        ↓
CodeAssignment[]  { code, description, confidence, verified }

Confidence is computed from the graph — not from the model:

  • 1.0 — exact code confirmed in Neo4j
  • 0.5 — parent category exists, specific subcode does not
  • 0.0 — not found at all (hallucinated)
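
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: graph_confidence and the known_codes set are hypothetical stand-ins for a Neo4j lookup, and the sketch assumes the parent category is the three-character prefix before the dot, as in ICD-10-CM.

```python
def graph_confidence(code: str, known_codes: set[str]) -> float:
    """Score a candidate code against a set of known ICD-10 codes.

    `known_codes` is a hypothetical in-memory stand-in for the Neo4j graph.
    """
    code = code.strip().upper()
    if code in known_codes:
        return 1.0  # exact code confirmed
    category = code.split(".", 1)[0]  # assumed parent: text before the dot
    if category in known_codes:
        return 0.5  # parent category exists, specific subcode does not
    return 0.0  # not found at all


# Toy data for illustration only; "J30.99" and "Z99.99" are made-up candidates.
known = {"J30", "J30.1", "J45", "J45.20"}
for candidate in ("J30.1", "J30.99", "Z99.99"):
    print(candidate, graph_confidence(candidate, known))
```

Keeping the score a pure function of graph membership means the same candidate always gets the same confidence, whatever the model claims about it.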

Quick start

1. Set your API key

cp .env.example .env
# edit .env and add ANTHROPIC_API_KEY

Fully local alternative: change the coder component in daimon-config/config.yaml from type: anthropic to type: llamacpp and pull qwen2.5:7b via Ollama. No API key required, but coding quality is lower.

2. Start infrastructure

docker compose up -d

This starts Neo4j, Qdrant, and, in local mode, Ollama. If you have no NVIDIA GPU, remove the deploy: GPU block from docker-compose.yml first.

3. Start Daimon

daimon serve --config daimon-config/config.yaml

This requires the Daimon binary, which you can download from the Daimon releases page.

4. Install Python dependencies

pip install -e ".[dev]"

5. Load ICD-10 codes

Download the CMS 2025 ICD-10-CM tabular XML from:

https://www.cms.gov/files/zip/2025-code-tables-tabular-and-index.zip

Extract, then run:

python -m icd10.loader data/icd10cm_tabular_2025.xml

This takes about 10 minutes: each of the 46k codes is embedded into the vector store and inserted into the Neo4j graph.

6. Code MTSamples transcriptions

Download mtsamples.csv from Kaggle, then:

python -m pipeline.ingest --csv data/mtsamples.csv --out data/coded.jsonl --limit 50

7. Code a single chart file

from pipeline.coder import code_file

codes = code_file("path/to/chart.png")
for c in codes:
    print(c.code, f"{c.confidence:.0%}", c.description)

Output format

Each record in coded.jsonl:

{
  "sample_name": "Allergic Rhinitis",
  "specialty": "Allergy / Immunology",
  "codes": [
    {"code": "J30.1", "description": "Allergic rhinitis due to pollen", "confidence": 1.0, "verified": true},
    {"code": "J45.20", "description": "Mild intermittent asthma, uncomplicated", "confidence": 1.0, "verified": true}
  ]
}
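
Records in this shape can be post-processed with the standard library alone. The following is a minimal sketch, assuming each record occupies a single line of coded.jsonl and has the fields shown above; verified_codes is a hypothetical helper, not part of the repository, and the sample string stands in for the file contents.

```python
import json


def verified_codes(lines, min_confidence=1.0):
    """Yield (sample_name, code) pairs for verified codes at or above a threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for entry in record.get("codes", []):
            if entry.get("verified") and entry.get("confidence", 0.0) >= min_confidence:
                yield record.get("sample_name"), entry["code"]


# One record, serialized on a single line as JSONL expects.
sample = json.dumps({
    "sample_name": "Allergic Rhinitis",
    "specialty": "Allergy / Immunology",
    "codes": [
        {"code": "J30.1", "description": "Allergic rhinitis due to pollen",
         "confidence": 1.0, "verified": True},
        {"code": "J45.20", "description": "Mild intermittent asthma, uncomplicated",
         "confidence": 1.0, "verified": True},
    ],
})

for name, code in verified_codes([sample]):
    print(name, code)
```

To run it against real output, pass an open file handle (for example open("data/coded.jsonl")) instead of the sample list; lowering min_confidence to 0.5 would also admit category-level matches, provided they are marked verified.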

Project structure

daimon-config/config.yaml   Daimon sidecar config (LLM + vector + graph)
docker-compose.yml          Neo4j + Qdrant + Ollama
ocr/layout.py               Surya hierarchical OCR pipeline
icd10/loader.py             CMS XML parser → Qdrant + Neo4j
pipeline/coder.py           Daimon client → CodeAssignment[]
pipeline/ingest.py          MTSamples batch processor with Rich console
data/                       Datasets (gitignored)

License

Apache 2.0
