Commit c50af4a
Nick Vaccarello authored and committed
docs(math): add Appendix A for v0.2 pipeline (JSONL mapping, CE, temperature scaling, EIG, evidence-aware stop, triage)
1 parent: bdf7ec3
1 file changed: foundational_brain/BEHIND_THE_SCENES.md (42 additions, 0 deletions)
- Try tiny synthetic cases where the answer is obvious.

This document lives in `foundational_brain/BEHIND_THE_SCENES.md` and explains the math that downstream models build on.

---

## Appendix A (v0.2 pipeline specifics)
### A1. JSONL → vector mapping

- Each record has a free‑text symptom map (Name → Severity 0–10) and a `label_name`.
- Names are mapped to fixed symptom IDs; we build x = [presence; severity], where presence_i = 1 if severity_i > 0 else 0, and severities are rescaled so that severity_i ∈ [0, 1].
- The label is mapped to a class index and then to a one‑hot vector y.
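The mapping above can be sketched in a few lines. The symptom vocabulary and class list here are hypothetical placeholders; the real IDs live in the v0.2 data code.

```python
import json

# Hypothetical fixed vocabularies; the actual IDs come from the v0.2 generator.
SYMPTOMS = ["cough", "fever", "dysuria", "frequency"]
CLASSES = ["respiratory_infection", "uti"]

def record_to_xy(record):
    """Map one JSONL record {"symptoms": {name: severity 0-10}, "label_name": ...}
    to x = [presence; severity] and a one-hot y."""
    # Rescale raw 0-10 severities into [0, 1]; absent symptoms default to 0.
    sev = [record["symptoms"].get(name, 0) / 10.0 for name in SYMPTOMS]
    presence = [1.0 if s > 0 else 0.0 for s in sev]
    x = presence + sev
    y = [0.0] * len(CLASSES)
    y[CLASSES.index(record["label_name"])] = 1.0
    return x, y

line = '{"symptoms": {"dysuria": 8, "frequency": 5}, "label_name": "uti"}'
x, y = record_to_xy(json.loads(line))
```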
### A2. Class balance and explicit negatives

- Balanced per‑class counts (or class weights) suppress skew from the class prior.
- Explicit negatives encode the absence of key symptoms (e.g., dysuria = 0 and frequency = 0 in respiratory cases), teaching the model that missing symptoms are strong negative evidence.
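The class-weight variant can be sketched as inverse-frequency weights normalized to mean 1. This is a common convention, not necessarily the pipeline's exact scheme.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights, normalized to mean 1, to suppress prior skew."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # Rare classes get weight > 1, common classes get weight < 1.
    return {c: n / (k * counts[c]) for c in counts}

w = class_weights(["uti", "uti", "uti", "resp"])
```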
### A3. Training objective (softmax + cross‑entropy)

- Same equations as in the main text; we minimize the negative log‑likelihood (NLL) with SGD.
- A validation split drives early stopping; we select the epoch with the lowest validation loss.
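The objective can be sketched directly; `grad_logits` shows the standard softmax-plus-CE identity that the gradient with respect to the logits collapses to p - y.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max logit for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def nll(z, true_idx):
    """Cross-entropy for one example: -log of the true-class probability."""
    return -math.log(softmax(z)[true_idx])

def grad_logits(z, true_idx):
    """dNLL/dz for softmax + cross-entropy is simply p - y."""
    g = softmax(z)
    g[true_idx] -= 1.0
    return g

loss = nll([2.0, 0.5, -1.0], 0)
grad = grad_logits([2.0, 0.5, -1.0], 0)
```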
### A4. Probability calibration (temperature scaling)

- Pick T* on the validation set by minimizing NLL(softmax(z/T)) over T.
- At inference, ŷ = softmax(z/T*). Since dividing the logits by a scalar preserves their ordering, this improves reliability (confidence ≈ accuracy) without changing the predicted class.
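Fitting T* can be sketched as a grid search over validation NLL (the actual pipeline may use a proper optimizer). The toy logits below are overconfident (margin 4 but only 3/4 correct), so the fitted T* comes out above 1.

```python
import math

def softmax_t(z, T):
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]  # equivalent to softmax(z / T)
    s = sum(exps)
    return [e / s for e in exps]

def val_nll(logits, labels, T):
    """Mean NLL of the validation set at temperature T."""
    return -sum(math.log(softmax_t(z, T)[y]) for z, y in zip(logits, labels)) / len(labels)

def fit_temperature(logits, labels):
    """Grid-search T* in [0.5, 5.0] minimizing validation NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda T: val_nll(logits, labels, T))

logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
labels = [0, 0, 0, 0]
t_star = fit_temperature(logits, labels)
```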
### A5. Expected Information Gain (EIG) in adaptive questioning

- Start from the current posterior P(d) (after clinical rules).
- For a candidate symptom s, approximate P(yes|d) from per‑disease symptom frequencies.
- Marginals: P(yes) = Σ_d P(d) P(yes|d); P(no) = 1 − P(yes).
- Updated posteriors: P(d|yes) ∝ P(d) P(yes|d); P(d|no) ∝ P(d) (1 − P(yes|d)).
- Entropy: H(P) = −Σ_d P(d) log P(d).
- EIG(s) = H(P) − [P(yes) H(P(d|yes)) + P(no) H(P(d|no))]. We ask the symptom s with the highest EIG.
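The steps above translate almost directly into code. This sketch assumes 0 < P(yes) < 1 so both branch posteriors are well defined.

```python
import math

def entropy(p):
    """Shannon entropy in nats: H(P) = -sum p log p."""
    return -sum(q * math.log(q) for q in p if q > 0)

def normalize(p):
    s = sum(p)
    return [q / s for q in p]

def eig(posterior, p_yes_given_d):
    """EIG(s) = H(P) - [P(yes) H(P|yes) + P(no) H(P|no)]."""
    p_yes = sum(p * l for p, l in zip(posterior, p_yes_given_d))
    p_no = 1.0 - p_yes
    # Bayes update on each branch: P(d|yes) ∝ P(d) P(yes|d), etc.
    post_yes = normalize([p * l for p, l in zip(posterior, p_yes_given_d)])
    post_no = normalize([p * (1.0 - l) for p, l in zip(posterior, p_yes_given_d)])
    return entropy(posterior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))

informative = eig([0.5, 0.5], [0.9, 0.1])    # symptom discriminates the diseases
uninformative = eig([0.5, 0.5], [0.5, 0.5])  # same likelihood under every disease
```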
### A6. Evidence‑aware stop and triage

- We stop only if (a) the top‑1 probability is ≥ the threshold and (b) minimal supporting evidence exists (e.g., at least one genitourinary (GU) key symptom for a UTI prediction).
- The first question is selected from a small triage set (respiratory vs. GU vs. GI discriminators) to reduce early ambiguity.
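A sketch of the stop rule. The `KEY_EVIDENCE` table, class names, and threshold are hypothetical stand-ins for the pipeline's real evidence sets; classes without a key set are treated as needing no extra evidence here.

```python
# Hypothetical key-evidence sets; the real ones live in the pipeline code.
KEY_EVIDENCE = {"uti": {"dysuria", "frequency"}}
CLASSES = ("respiratory_infection", "uti")

def should_stop(probs, observed_positive, threshold=0.85):
    """Stop only when top-1 confidence is high AND at least one key symptom
    for the predicted class has actually been observed positive."""
    top = max(range(len(probs)), key=probs.__getitem__)
    name = CLASSES[top]
    confident = probs[top] >= threshold
    # No key set registered for this class -> no extra evidence required.
    supported = not KEY_EVIDENCE.get(name) or bool(KEY_EVIDENCE[name] & observed_positive)
    return confident and supported
```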
### A7. Quick metrics

- A confusion matrix and ECE bins provide a fast snapshot of class separation and calibration.
- ECE = Σ_bins (n_bin / N) · |mean_conf − mean_acc|, i.e., the bin‑weighted gap between mean confidence and accuracy.
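The ECE formula can be sketched with equal-width confidence bins:

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin-weighted |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            mean_acc = sum(o for _, o in b) / len(b)
            err += len(b) / total * abs(mean_conf - mean_acc)
    return err
```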
### A8. Where to look

- Data generation: `medical_diagnosis_model/data/generate_v02.py`
- Training from JSONL: `medical_diagnosis_model/versions/v2/medical_neural_network_v2.py`
- Pipeline + metrics: `medical_diagnosis_model/tools/train_pipeline.py`
