Repository structure and where to look for each part of the project.
- `BenchmarkAnalysis/` — Scripts and outputs for benchmark tables/plots
- `BenchmarkAndLayerwiseRuns/` — Raw run artifacts for benchmarks and layerwise analyses
- `LayerwiseAnalysis/` — Scripts and outputs for layerwise figures/appendix
- `LinearProbe/` — Linear probe experiments, results, and plotting
- `PruneStudyAnalysis/` — Analysis code, plots, and summaries for prune studies
- `PruneStudyRuns/` — Raw run artifacts for prune studies
- `StreetMathDataset/` — Dataset assets, data, and dataset-related docs
- `README.md` — This file
- `generate_benchmark_tables.py`: Build LaTeX tables for benchmark results.
- `plot_benchmark_overall.py`: Generate overall benchmark plots.
- `Benchmark_overall.tex`, `Benchmark_by_subtopic.tex`: LaTeX outputs.
- `plots/`: Generated figures.
- `energy_consumption_table`: Energy/compute table artifact.
- `gsm8k-runs/`, `streetmath-runs/`: Raw run outputs used by benchmark and layerwise analyses.
- `generate_layerwise_figures.py`: Produce layerwise analysis figures.
- `appendix_d_figures.tex`: LaTeX output for appendix figures.
- `figures/`, `analysis/`, `averages/`: Generated assets and intermediates.
- `setup_linear_probe_experiments.py`: Configure/launch linear probe experiments.
- `analyze_probe_results.py`: Aggregate and analyze probe runs.
- `plot_acc_per_layer.py`, `plot_accuracy_by_number.py`: Plotting utilities.
- `generate_latex_tables.py`: LaTeX table generation.
- `results/`, `tables/`, `figures/`, `analysis_output/`: Generated outputs.
- `0_*` directories: Per-model run folders.
- `mapping.txt`: Label/value mapping used by probes.
- `requirements.txt`: Local deps for probe analysis.
- `analyze_prune_study.py`: Analyze pruning experiments.
- `plot_model_correlations.py`: Correlation plots across models.
- `plots/`, `summaries/`: Generated outputs.
- Per-model folders for pruning experiment outputs (e.g., `Falcon-H1-7B-Instruct*`, `qwen2.5-3b-instruct*`).
- `streetmath_dataset/`, `streetmath_benchmark/`: Core dataset assets.
- `data/`: Data files used by dataset/benchmark.
- `CUSTOMIZE_MODEL.md`: Notes on adapting models to the dataset.
- `requirements.txt`: Local deps for dataset scripts.
The benchmark runner lives in `StreetMathDataset/streetmath_benchmark`. It can load the dataset from Hugging Face (requires network access) or from a local JSONL copy via `--local-jsonl`.
Run with transformers (local model):

Without hint:

```shell
python -m streetmath_benchmark.cli \
  --provider transformers \
  --model qwen/qwen2.5-32b-instruct \
  --max-tokens 256 \
  --local-jsonl data/street_math_test.jsonl \
  --output ../runs/qwen2_5_32b_instruct/qwen2_5_32b_instruct.jsonl \
  --log-file ../runs/qwen2_5_32b_instruct/qwen2_5_32b_instruct.log \
  --log-raw-response
```
With hint, recording layerwise metrics:

```shell
python -m streetmath_benchmark.cli \
  --provider transformers \
  --model qwen/qwen2.5-32b-instruct \
  --max-tokens 256 \
  --local-jsonl data/street_math_test.jsonl \
  --output ../runs/qwen2_5_32b_instruct_hint/qwen2_5_32b_instruct_hint.jsonl \
  --log-file ../runs/qwen2_5_32b_instruct_hint/qwen2_5_32b_instruct_hint.log \
  --log-raw-response \
  --layerwise \
  --layerwise-output ../runs/qwen2_5_32b_instruct_hint/qwen2_5_32b_instruct_layerwise_hint.jsonl \
  --layerwise-metrics spectral_entropies,effective_ranks,gradient_norms,trace_covariances \
  --layerwise-max-tokens 256 \
  --hint
```
Notes

- The runner writes one JSON object per line and appends a final summary record with counts and the average score.
- You can customize instructions with `--prompt-template-file` or disable the system prompt via `--no-system`.
- For diffusion models, pass `--arch diffusion` and `--diffusion-steps` to estimate FLOPs.
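Because each run file is JSONL with the summary appended as the last line, splitting per-example records from the summary is a one-pass read. A minimal sketch (the field names inside the records are runner-specific and not documented here, so the helper only relies on line position):

```python
import json

def load_run(path):
    """Split a benchmark run JSONL into (examples, summary).

    Assumes the runner's documented layout: one JSON object per line,
    with the summary record appended as the final line. Field names
    within each record are an assumption; inspect your own run files.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        raise ValueError(f"empty run file: {path}")
    # Last line is the summary; everything before it is per-example output.
    return records[:-1], records[-1]
```

Usage: `examples, summary = load_run("runs/my_run/my_run.jsonl")`, then aggregate over `examples` or read the counts from `summary`.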
The dataset is released on Hugging Face at `StreetMath/StreetMath`.
The generator creates questions emphasizing approximation. Outputs are JSONL files such as `data/street_math_train.jsonl` and `data/street_math_test.jsonl`.
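A quick sanity check after generation is to count the records in each split. A small sketch (it deliberately avoids assuming any field names inside the question records, only the one-JSON-object-per-line layout):

```python
def split_sizes(*paths):
    """Count records in each generated JSONL split.

    Each non-empty line is one question; no per-record schema is assumed.
    """
    sizes = {}
    for path in paths:
        with open(path) as f:
            sizes[path] = sum(1 for line in f if line.strip())
    return sizes
```

For example, `split_sizes("data/street_math_train.jsonl", "data/street_math_test.jsonl")` should report the `--train`/`--test` counts passed to the generator.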
Recommended invocation from the repo root:

```shell
export PYTHONPATH=StreetMathDataset:$PYTHONPATH
python -m streetmath_dataset.scripts.generate_street_math_dataset --seed 42 --train 1000 --test 1000 --outdir StreetMathDataset/data
```

Please visit https://github.com/ctseng777/MathNeuro for the adapted version of MathNeuro. See the original repository for usage.