- Try tiny synthetic cases where the answer is obvious.

This document lives in `foundational_brain/BEHIND_THE_SCENES.md` and explains the math that downstream models build on.

---

## Appendix A (v0.2 pipeline specifics)

### A1. JSONL → vector mapping
- Each record has a free‑text symptom map (Name → Severity 0–10) and a `label_name`.
- We map names to fixed symptom IDs and build x = [presence; severity], where presence_i = 1 if severity_i > 0 else 0 and severity_i ∈ [0,1].
- The label is mapped to a class index and then to a one‑hot y.
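The A1 mapping can be sketched as follows. This is a minimal illustration, not the repo's actual code: the symptom vocabulary and class names (`SYMPTOM_IDS`, `CLASSES`) are placeholder assumptions.

```python
# Hypothetical fixed vocabularies; the real ones live in the v0.2 pipeline.
SYMPTOM_IDS = {"fever": 0, "cough": 1, "dysuria": 2}
CLASSES = {"flu": 0, "uti": 1}

def record_to_xy(record):
    """Map one JSONL record {"symptoms": {name: severity 0-10}, "label_name": str}
    to (x, y), where x = [presence; severity] and y is one-hot."""
    n = len(SYMPTOM_IDS)
    presence = [0.0] * n
    severity = [0.0] * n
    for name, sev in record["symptoms"].items():
        i = SYMPTOM_IDS[name]
        severity[i] = sev / 10.0               # scale severity into [0, 1]
        presence[i] = 1.0 if sev > 0 else 0.0  # presence_i = 1 iff severity_i > 0
    x = presence + severity                    # concatenated feature vector
    y = [0.0] * len(CLASSES)
    y[CLASSES[record["label_name"]]] = 1.0     # one-hot label
    return x, y
```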

### A2. Class balance and explicit negatives
- Balanced per‑class counts (or class weights) suppress prior skew.
- Explicit negatives encode the “absence of key symptoms” (e.g., dysuria=0 and frequency=0 in respiratory cases), teaching the model that missing evidence is itself strong negative evidence.
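The class-weight alternative mentioned above can be sketched with standard inverse-frequency weighting; whether the repo balances counts at generation time or weights the loss instead is not specified here, so treat this as one common option.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight_c = N / (K * n_c), so every class
    contributes roughly equally to the loss regardless of prior skew."""
    counts = Counter(labels)
    total = len(labels)
    k = len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}
```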

### A3. Training objective (softmax + cross‑entropy)
- Same equations as in the main text; we optimize the NLL with SGD.
- A validation split drives early stopping; we select the best epoch by lowest validation loss.
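The per-example objective can be written out as a short sketch (pure-Python for clarity; the pipeline's actual implementation may differ):

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, true_idx):
    """NLL of the true class under softmax(z); this is what SGD minimizes
    per example (summed or averaged over the batch)."""
    return -math.log(softmax(z)[true_idx])
```

With uninformative logits `[0, 0]`, the loss is log 2, the entropy of a uniform guess over two classes.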

### A4. Probability calibration (temperature scaling)
- Pick T* on the validation set by minimizing NLL(softmax(z/T)).
- At inference, ŷ = softmax(z/T*). This improves reliability (confidence ≈ accuracy).
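Picking T* can be done with a simple grid search over validation logits, a stand-in for gradient-based fitting; the grid range here is an assumption, not the repo's setting:

```python
import math

def mean_nll(logits, labels, T):
    """Mean NLL of validation logits at temperature T, via log-sum-exp."""
    total = 0.0
    for z, y in zip(logits, labels):
        zt = [v / T for v in z]
        m = max(zt)
        logsum = m + math.log(sum(math.exp(v - m) for v in zt))
        total += logsum - zt[y]  # -log softmax(zt)[y]
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Grid-search T* minimizing validation NLL (grid values are illustrative)."""
    grid = grid or [0.5 + 0.1 * i for i in range(51)]  # T in [0.5, 5.5]
    return min(grid, key=lambda T: mean_nll(logits, labels, T))
```

For an overconfident model (large logits, some mistakes), the fitted T* comes out above 1, flattening the softmax so confidence tracks accuracy more closely.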

### A5. Expected Information Gain (EIG) in adaptive questioning
- Start from the current posterior P(d) (after clinical rules).
- For a candidate symptom s, approximate P(yes|d) from per‑disease symptom frequencies.
- P(yes) = Σ_d P(d) P(yes|d); P(no) = 1 − P(yes).
- Posteriors: P(d|yes) ∝ P(d) P(yes|d) and P(d|no) ∝ P(d) (1 − P(yes|d)).
- Entropy: H(P) = −Σ_d P(d) log P(d).
- EIG(s) = H(P) − [P(yes) H(P(d|yes)) + P(no) H(P(d|no))]. We ask the symptom s with the highest EIG.
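The bullets above translate directly into code. A minimal sketch, with diseases indexed by position:

```python
import math

def entropy(p):
    """H(P) = -sum_d P(d) log P(d), skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_info_gain(prior, p_yes_given_d):
    """EIG of asking about one symptom; both arguments are lists over diseases."""
    p_yes = sum(pd * py for pd, py in zip(prior, p_yes_given_d))
    p_no = 1.0 - p_yes
    # Unnormalized posteriors P(d) P(yes|d) and P(d) (1 - P(yes|d)), then normalize.
    post_yes = [pd * py for pd, py in zip(prior, p_yes_given_d)]
    post_no = [pd * (1.0 - py) for pd, py in zip(prior, p_yes_given_d)]
    post_yes = [v / p_yes for v in post_yes] if p_yes > 0 else prior
    post_no = [v / p_no for v in post_no] if p_no > 0 else prior
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))
```

A perfectly discriminating question on a uniform two-disease prior yields EIG = log 2 (the full entropy), while a question with identical answer likelihoods for every disease yields EIG = 0, matching the intuition that it teaches us nothing.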

### A6. Evidence‑aware stop and triage
- We stop only if (a) the top‑1 probability ≥ threshold and (b) minimal supporting evidence exists (e.g., at least one GU key for UTI).
- The first question is selected from a small triage set (respiratory vs GU vs GI discriminators) to reduce early ambiguity.
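The two-part stop rule can be sketched as below; the threshold value and the key-symptom sets are illustrative assumptions, not the repo's actual configuration:

```python
def should_stop(probs, observed, key_evidence, threshold=0.85):
    """Stop only if (a) the top class clears the probability threshold AND
    (b) at least one of its key symptoms was observed positive.

    probs: {disease: probability}; observed: {symptom: severity};
    key_evidence: {disease: set of key symptom names} (hypothetical names).
    """
    top = max(probs, key=probs.get)
    if probs[top] < threshold:
        return False  # (a) fails: not confident enough yet
    keys = key_evidence.get(top, set())
    # (b): require at least one positive key symptom supporting the top class.
    return any(observed.get(s, 0) > 0 for s in keys)
```

Condition (b) is what makes the stop "evidence-aware": a high softmax score alone, with no supporting key symptom observed, keeps the questioning going.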

### A7. Quick metrics
- A confusion matrix and ECE bins give a fast snapshot of class separation and calibration.
- ECE = Σ_bins (n_bin/N) |mean_conf − mean_acc|.
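The ECE formula above, with equal-width confidence bins, can be sketched as:

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: sum over bins of (n_bin/N) * |mean_conf - mean_acc|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        i = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[i].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in b) / len(b)
        mean_acc = sum(1 for _, ok in b if ok) / len(b)
        err += (len(b) / total) * abs(mean_conf - mean_acc)
    return err
```

A model that says 80% and is right 80% of the time scores 0; a model that says 100% and is right half the time scores 0.5.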

### A8. Where to look
- Data gen: `medical_diagnosis_model/data/generate_v02.py`
- Train from JSONL: `medical_diagnosis_model/versions/v2/medical_neural_network_v2.py`
- Pipeline + metrics: `medical_diagnosis_model/tools/train_pipeline.py`