Minnow-Math-1.5B

Scientific & Mathematical Language Model Trained from Raw arXiv LaTeX

Minnow-Math-1.5B is a 1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

Paper (arXiv): https://arxiv.org/pdf/2602.17288
Model (HuggingFace): https://huggingface.co/KiteFishAI/Minnow-Math-1.5B

This repository documents the dataset construction pipeline, tokenizer design, and training process used to build a domain-specialized scientific language model under constrained compute (2× A100 80GB GPUs).
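Quickstart

If the Hugging Face checkpoint is in standard Transformers format (an assumption; this is a quickstart sketch, not the repo's documented usage), the base model can be loaded and sampled as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID from the link above; AutoModel loading assumes a standard
# Transformers-format checkpoint.
model_id = "KiteFishAI/Minnow-Math-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A LaTeX-flavored prompt, matching the model's arXiv training data.
prompt = r"\begin{theorem} Let $G$ be a group of order $p^2$. Then"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Since this is a raw base model, expect continuation-style output rather than answers to questions.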

Motivation

Most open models are trained on heterogeneous web corpora.
KiteFish-A1, the project behind Minnow-Math-1.5B, explores a different direction:

What happens when a language model is trained purely on structured scientific LaTeX archives?

The goal is to study domain specialization, engineering trade-offs, and training dynamics under realistic compute budgets.

Model Specifications

Component        Value
Parameters       ~1.5B
Architecture     LLaMA-style dense transformer
Layers           24
Hidden Size      2048
FFN Size         5504
Attention Heads  16
Vocabulary       102,400
Context Length   4096 (trained on 768-token sequences)
Precision        bfloat16
Embeddings       Untied
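
For reference, a minimal sketch of how this table maps onto a Hugging Face LlamaConfig. The mapping is an assumption based on the LLaMA-style architecture noted above; the released checkpoint's own config file is authoritative:

from transformers import LlamaConfig

# All values come from the specification table above.
config = LlamaConfig(
    vocab_size=102_400,
    hidden_size=2048,
    intermediate_size=5504,
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=4096,
    tie_word_embeddings=False,  # untied input/output embeddings
)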

Training Summary

Pretraining Tokens: 52.18B
Post-training Tokens: 5B
Processed Corpus Size: ~200GB
Experimental Runs: 24
Hardware: 2× NVIDIA A100 (80GB)

Optimization stack (see the config sketch after this list):

  • AdamW
  • ZeRO Stage 2
  • Gradient checkpointing
  • bf16 mixed precision
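
For illustration, this stack might translate into a DeepSpeed configuration like the sketch below. Micro-batch size, accumulation steps, and learning rate are assumptions, not values from the paper:

# Illustrative DeepSpeed settings: ZeRO stage 2 with bf16, AdamW.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumption
    "gradient_accumulation_steps": 16,     # assumption
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},  # lr is an assumption
    "gradient_clipping": 1.0,
}
# Gradient checkpointing is enabled on the model itself, e.g. via
# Transformers: model.gradient_checkpointing_enable()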

Validation Perplexity: ~4.2 on a held-out scientific corpus

Training operated in a data-rich regime (~38 tokens per parameter), prioritizing domain robustness over benchmark optimization.

Dataset Pipeline

Constructed directly from raw arXiv LaTeX archives:

  1. Metadata filtering (subject selection, removal of withdrawn papers)
  2. .tar.gz archive extraction
  3. Multi-file LaTeX resolution (\input, \include; sketched after this list)
  4. Cleaning and normalization
  5. Deduplication
  6. Domain-aware tokenizer training (102k vocabulary)
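
Step 3 is the most error-prone in practice, since many arXiv papers split their source across files. A minimal sketch of recursive \input/\include flattening, assuming sources live in one directory and omitted extensions default to .tex (the repo's actual resolver may handle more cases):

import re
from pathlib import Path

INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")

def flatten_latex(path: Path, seen: set | None = None) -> str:
    """Recursively inline \\input/\\include directives into one string."""
    seen = set() if seen is None else seen
    resolved = path.resolve()
    if resolved in seen:  # guard against circular includes
        return ""
    seen.add(resolved)
    text = path.read_text(errors="ignore")

    def splice(match):
        target = path.parent / match.group(1)
        if not target.suffix:
            target = target.with_suffix(".tex")
        return flatten_latex(target, seen) if target.exists() else ""

    return INPUT_RE.sub(splice, text)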

Key engineering challenges included inconsistent LaTeX extraction, language-detection failures on formula-heavy text, symbol fragmentation during tokenization, and large-scale I/O bottlenecks.
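
As one mitigation for symbol fragmentation, the tokenizer can be trained directly on the cleaned LaTeX corpus so that common commands and symbols stay intact. A hypothetical sketch with the Hugging Face tokenizers library; shard paths and special tokens are placeholders, and only the vocabulary size comes from this README:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE over cleaned LaTeX shards.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=102_400,
    special_tokens=["<unk>", "<s>", "</s>"],  # placeholders
)
tokenizer.train(files=["corpus/shard_0000.txt"], trainer=trainer)  # path is a placeholder
tokenizer.save("minnow-math-tokenizer.json")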

Performance Characteristics

Minnow-Math-1.5B is a base model.

It demonstrates:

  • Strong familiarity with scientific writing style
  • Stable LaTeX structural modeling
  • Symbolic fluency

It does not include:

  • Instruction tuning
  • RLHF or preference alignment
  • Benchmark optimization

Downstream benchmark accuracy is currently modest without supervised fine-tuning or LoRA adaptation.
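
For downstream adaptation, a hypothetical LoRA setup with the peft library is sketched below. The rank and target modules are assumptions for a LLaMA-style model, not a recipe from the paper:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("KiteFishAI/Minnow-Math-1.5B")
# Attach low-rank adapters to the attention projections; module names
# assume a LLaMA-style architecture.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable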

This release is intended primarily for research and experimentation.

Reproducing Training

Requirements:

  • Python 3.10+
  • PyTorch
  • Transformers
  • DeepSpeed
  • 2× A100 GPUs recommended

Install dependencies:

pip install -r requirements.txt

Launch training:

deepspeed train.py
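
With the recommended two GPUs, the launcher's GPU count can also be pinned explicitly. The --num_gpus flag belongs to the standard DeepSpeed launcher; any arguments train.py itself takes are repo-specific:

deepspeed --num_gpus 2 train.py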

Citation

If you use this work, please cite:

@article{kitefish2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={Your Name},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}

License

MIT License © 2026 KiteFish
