Model checkpoint on Hugging Face: ruby0322/pd-exit-site-classification. The model can be downloaded from that repository.
- e41 result: leveraged autoresearch to efficiently reach 96.8% infection-screening accuracy after training for 30+ epochs, outperforming NTUH's work.
- Top-performing recipe: MobileNetV3 transfer learning, differential-LR fine-tuning, and positive-class reweighting.
- Dataset: ImageFolder-style dataset rooted at ./dataset
- Classes: 5 classes, with class_4 treated as infection-positive
- Primary metric: bin_acc
- Secondary metric: mc_acc
- Training budget: fixed 300 seconds wall-clock per experiment
- Canonical image size: 384
- Loop summary artifacts: analysis_summary.json and analysis_summary.md
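The recipe names positive-class reweighting without spelling out the scheme. One common approach, shown here as a minimal sketch, is inverse-frequency class weights with an extra multiplier on the infection-positive class. The helper name class_weights, the positive_boost parameter, and the weighting formula are assumptions for illustration, not the repository's actual code.

```python
from collections import Counter

POSITIVE_CLASS = 4  # class_4 is the infection-positive class
NUM_CLASSES = 5


def class_weights(labels, positive_boost=2.0):
    """Inverse-frequency class weights with an extra boost for the positive class.

    Hypothetical helper: the boost factor and the weighting scheme are
    illustrative assumptions, not taken from train.py.
    """
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for cls in range(NUM_CLASSES):
        # Guard against classes that are absent from the sample.
        n = max(counts.get(cls, 0), 1)
        w = total / (NUM_CLASSES * n)
        if cls == POSITIVE_CLASS:
            w *= positive_boost
        weights.append(w)
    return weights


# Made-up label sample, for illustration only.
weights = class_weights([0, 0, 0, 0, 1, 1, 2, 3, 4, 4])
print(weights)
```

A list like this could be passed as the per-class weight of a weighted cross-entropy loss, so that mistakes on class_4 cost more than mistakes on the other classes.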
At the time of writing, the current screening frontier is the e34... configuration, which reaches bin_acc=0.951271 with mc_acc=0.635593.
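To make the gap between the two numbers concrete, here is a minimal sketch of how the metrics might be computed. It assumes bin_acc is the accuracy after collapsing the five classes into class_4 versus everything else, and mc_acc is plain 5-way accuracy. The exact definitions live in prepare.py, so treat this as an illustration rather than the evaluation harness itself.

```python
POSITIVE_CLASS = 4  # class_4 is treated as infection-positive


def mc_acc(y_true, y_pred):
    """Fraction of samples whose 5-way class prediction is exactly right."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def bin_acc(y_true, y_pred):
    """Accuracy after collapsing labels to class_4 vs. all other classes."""
    correct = sum(
        (t == POSITIVE_CLASS) == (p == POSITIVE_CLASS)
        for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true)


# Made-up labels and predictions, for illustration only.
y_true = [0, 1, 2, 3, 4, 4]
y_pred = [1, 1, 0, 3, 4, 2]
print(f"mc_acc:  {mc_acc(y_true, y_pred):.6f}")
print(f"bin_acc: {bin_acc(y_true, y_pred):.6f}")
```

Under these assumed definitions, confusions among the four non-positive classes lower mc_acc but not bin_acc, which is why bin_acc can sit well above mc_acc.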
The repo is intentionally small. These are the important files:
- prepare.py: shared constants, dataset validation, and the fixed evaluation harness. Do not modify it during experiments.
- train.py: the model/training file the agent iterates on.
- program.md: the research-loop instructions the agent follows.
- results.tsv: append-only experiment log, kept out of git.
- summarize_results.py: derives the current frontier and idea hints from results.tsv.
- analysis.ipynb: human-facing notebook for bin_acc-first analysis with mc_acc as side context.
The loop is:
- edit train.py
- run a 5-minute experiment
- parse the footer metrics from stdout
- append a row to results.tsv
- regenerate analysis_summary.json and analysis_summary.md
- keep or discard the change based on the frontier rules in program.md
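The append step of the loop can be sketched as follows. This assumes results.tsv is tab-separated with a header row and the columns commit, mc_acc, bin_acc, memory_gb, status, description. The helper name append_result, the header row, and the example status values are assumptions, not something program.md is known to prescribe.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["commit", "mc_acc", "bin_acc", "memory_gb", "status", "description"]


def append_result(path, row):
    """Append one experiment row, writing the header first if the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demo against a throwaway file; commits, statuses, and values are made up.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "results.tsv"
    append_result(log_path, {
        "commit": "aaa0001", "mc_acc": "0.563600", "bin_acc": "0.712300",
        "memory_gb": "1.2", "status": "keep", "description": "baseline",
    })
    append_result(log_path, {
        "commit": "aaa0002", "mc_acc": "0.550000", "bin_acc": "0.700000",
        "memory_gb": "1.2", "status": "discard", "description": "hypothetical tweak",
    })
    logged_lines = log_path.read_text(encoding="utf-8").splitlines()

print(logged_lines[0])
```

Opening the file in append mode keeps the log append-only: earlier rows are never rewritten, even for discarded experiments.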
Each result row has this schema:
commit mc_acc bin_acc memory_gb status description
The keep/discard policy is:
- keep if bin_acc is strictly higher than the current best
- keep if bin_acc ties the current best and mc_acc improves
- keep if both metrics tie and the code becomes simpler
- discard otherwise
This means the loop is explicitly screening-first, not multiclass-first.
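The policy amounts to a lexicographic comparison on (bin_acc, mc_acc) with code simplicity as the final tie-breaker. A minimal sketch follows. The function name should_keep and the dict layout are hypothetical, and treating exact float equality as a tie is an assumption.

```python
def should_keep(candidate, best, code_is_simpler=False):
    """Return True if the candidate run should replace the current best.

    candidate and best are dicts with float "bin_acc" and "mc_acc" entries.
    code_is_simpler is a judgment call, so the caller passes it in.
    """
    if candidate["bin_acc"] != best["bin_acc"]:
        return candidate["bin_acc"] > best["bin_acc"]
    if candidate["mc_acc"] != best["mc_acc"]:
        return candidate["mc_acc"] > best["mc_acc"]
    return code_is_simpler


best = {"bin_acc": 0.951271, "mc_acc": 0.635593}

# Higher bin_acc wins even though mc_acc drops (hypothetical candidate values).
print(should_keep({"bin_acc": 0.960000, "mc_acc": 0.500000}, best))
# Higher mc_acc alone is not enough when bin_acc regresses.
print(should_keep({"bin_acc": 0.950000, "mc_acc": 0.900000}, best))
```

The two calls show the screening-first ordering: a run with better bin_acc is kept regardless of mc_acc, and a large mc_acc gain cannot rescue a bin_acc regression.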
Requirements: Python 3.10, a virtual environment, and preferably a CUDA-capable GPU for the full research loop. Dependencies live in requirements.txt.
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
# Validate dataset structure and sample image readability
python prepare.py
# Run one training job
python train.py
# Derive loop summary artifacts from results.tsv
python summarize_results.py
If prepare.py succeeds and train.py prints the metric footer, the setup is ready.
train.py prints a machine-readable footer that the loop uses to log results:
---
mc_acc: 0.563600
bin_acc: 0.712300
train_seconds: 300.2
train_stopped_budget: true
peak_vram_mb: 1234.5
arch: baseline
optimizer: sgd
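A footer in this shape is straightforward to parse. The sketch below assumes the footer is everything after the last line consisting of --- and that each footer line is a key: value pair. The function name parse_footer and the type coercion rules are illustrative choices, not the loop's actual parser.

```python
def parse_footer(stdout):
    """Parse 'key: value' lines that follow the last '---' separator line."""
    lines = stdout.splitlines()
    # Use the last separator so earlier log output is ignored.
    sep_indexes = [i for i, line in enumerate(lines) if line.strip() == "---"]
    if not sep_indexes:
        raise ValueError("no '---' footer separator found")
    metrics = {}
    for line in lines[sep_indexes[-1] + 1:]:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        key, value = key.strip(), value.strip()
        if value in ("true", "false"):
            metrics[key] = value == "true"
            continue
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # non-numeric fields such as arch or optimizer
    return metrics


# The first line stands in for arbitrary training output and is made up.
sample_stdout = """epoch 1 done
---
mc_acc: 0.563600
bin_acc: 0.712300
train_seconds: 300.2
train_stopped_budget: true
peak_vram_mb: 1234.5
arch: baseline
optimizer: sgd
"""

footer = parse_footer(sample_stdout)
print(footer["bin_acc"], footer["train_stopped_budget"], footer["arch"])
```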
The summary script turns results.tsv into:
- analysis_summary.json: machine-readable frontier state for the agent
- analysis_summary.md: short human-readable summary
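As a rough illustration of what deriving frontier state from the log could involve, the sketch below picks the row with the highest bin_acc, breaking ties by mc_acc, and emits it as JSON. The JSON field names, the function name summarize, and the sample rows are all hypothetical. The real summarize_results.py also reports near misses and idea hints, which this sketch leaves out.

```python
import csv
import io
import json


def summarize(tsv_text):
    """Pick the frontier row: highest bin_acc, ties broken by mc_acc."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return {"num_runs": 0, "frontier": None}
    best = max(rows, key=lambda r: (float(r["bin_acc"]), float(r["mc_acc"])))
    return {
        "num_runs": len(rows),
        "frontier": {
            "commit": best["commit"],
            "bin_acc": float(best["bin_acc"]),
            "mc_acc": float(best["mc_acc"]),
            "description": best["description"],
        },
    }


# Made-up log contents, for illustration only.
SAMPLE_TSV = (
    "commit\tmc_acc\tbin_acc\tmemory_gb\tstatus\tdescription\n"
    "aaa0001\t0.563600\t0.712300\t1.2\tkeep\tbaseline\n"
    "aaa0002\t0.601000\t0.712300\t1.3\tkeep\thypothetical tweak\n"
    "aaa0003\t0.590000\t0.700000\t1.3\tdiscard\thypothetical regression\n"
)

summary = summarize(SAMPLE_TSV)
print(json.dumps(summary, indent=2))
```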
The notebook in analysis.ipynb visualizes the same history with bin_acc as the main plot and mc_acc as supporting context.
Point your coding agent at program.md and let it drive the experiment loop. A minimal prompt is:
Read program.md, set up the run, and start the next experiment loop.
The agent is expected to:
- use program.md as the source of truth
- modify only train.py
- leave prepare.py untouched
- update results.tsv after every run
- regenerate the summary files with python summarize_results.py
- use analysis_summary.json before choosing the next experiment
prepare.py dataset validation + fixed evaluation harness
train.py image-classification model and training loop
program.md agent instructions for the research loop
results.tsv experiment log (ignored by git)
summarize_results.py frontier summarizer for the loop
analysis.ipynb notebook analysis of experiment history
requirements.txt Python dependencies
- Single-file experimentation: the agent only edits train.py, which keeps diffs reviewable.
- Fixed-time comparison: every run gets the same 300-second budget, so results are comparable on the same machine.
- Screening-first optimization: bin_acc defines the frontier; mc_acc is important but secondary.
- Derived loop memory: analysis_summary.json gives the agent a compact view of the frontier, near misses, and recently bad directions.
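The fixed-time design can be illustrated with a wall-clock-bounded loop. This is a minimal sketch, not the code in train.py: the function name train_with_budget, the train_one_step callback, and the max_steps parameter are assumptions made for the example.

```python
import time

TIME_BUDGET_SECONDS = 300  # fixed wall-clock budget per experiment


def train_with_budget(train_one_step, budget_seconds=TIME_BUDGET_SECONDS, max_steps=None):
    """Call train_one_step() until the wall-clock budget or max_steps is reached.

    Returns (steps_done, elapsed_seconds, stopped_by_budget).
    """
    start = time.monotonic()
    steps = 0
    stopped_by_budget = False
    while max_steps is None or steps < max_steps:
        # The budget is checked between steps, never in the middle of one.
        if time.monotonic() - start >= budget_seconds:
            stopped_by_budget = True
            break
        train_one_step()
        steps += 1
    elapsed = time.monotonic() - start
    return steps, elapsed, stopped_by_budget


# Tiny demo: a fake 10 ms "training step" under a 50 ms budget.
steps, elapsed, stopped = train_with_budget(lambda: time.sleep(0.01), budget_seconds=0.05)
print(f"train_seconds: {elapsed:.1f}")
print(f"train_stopped_budget: {str(stopped).lower()}")
```

Because the budget is only checked between steps, a run can overshoot the limit by up to one step, which would be consistent with a footer value such as train_seconds: 300.2.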
The code will select CUDA when available and fall back to CPU otherwise, but the intended research-loop setting is a single GPU. CPU runs are useful for smoke tests, not for efficient overnight search.
MIT

