Scientific & Mathematical Language Model Trained from Raw arXiv LaTeX
Minnow-Math-1.5B is a 1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.
Paper (arXiv): https://arxiv.org/pdf/2602.17288
Model (HuggingFace): https://huggingface.co/KiteFishAI/Minnow-Math-1.5B
This repository documents the dataset construction pipeline, tokenizer design, and training process used to build a domain-specialized scientific language model under constrained compute (2× A100 80GB GPUs).
Most open models are trained on heterogeneous web corpora.
KiteFish-A1 explores a different direction:
What happens when a language model is trained purely on structured scientific LaTeX archives?
The goal is to study domain specialization, engineering trade-offs, and training dynamics under realistic compute budgets.
| Component | Value |
|---|---|
| Parameters | ~1.5B |
| Architecture | LLaMA-style dense transformer |
| Layers | 24 |
| Hidden Size | 2048 |
| FFN Size | 5504 |
| Attention Heads | 16 |
| Vocabulary | 102,400 |
| Context Length | 4096 (trained at 768 tokens) |
| Precision | bfloat16 |
| Embeddings | Untied |
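As a sanity check, the headline parameter count can be estimated from the table above. This is a rough sketch that assumes a LLaMA-style gated (SwiGLU) FFN with three projection matrices and ignores the comparatively tiny norm weights; the exact total depends on architectural details not listed in the table, which is why it comes out somewhat above the rounded ~1.5B headline.

```python
# Rough parameter-count estimate from the architecture table above.
# Assumption: LLaMA-style SwiGLU FFN (gate/up/down projections);
# RMSNorm weights are negligible and omitted.

vocab, hidden, ffn, layers = 102_400, 2048, 5504, 24

embeddings = 2 * vocab * hidden   # untied input + output embeddings
attention  = 4 * hidden * hidden  # Q, K, V, O projections per layer
ffn_params = 3 * hidden * ffn     # gate, up, down projections per layer
per_layer  = attention + ffn_params

total = embeddings + layers * per_layer
print(f"{total / 1e9:.2f}B parameters")  # → 1.63B parameters
```

With a non-gated FFN the same arithmetic lands closer to 1.4B, so the headline figure is consistent up to rounding and these unstated details.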
- Pretraining Tokens: 52.18B
- Post-training Tokens: 5B
- Processed Corpus Size: ~200GB
- Experimental Runs: 24
- Hardware: 2× NVIDIA A100 (80GB)
Optimization stack:
- AdamW
- ZeRO Stage 2
- Gradient checkpointing
- bf16 mixed precision
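A minimal DeepSpeed configuration matching this stack might look like the sketch below. The batch sizes, learning rate, and schedule are illustrative assumptions, not the repository's actual settings; note that gradient checkpointing is typically enabled on the model side (e.g. via `model.gradient_checkpointing_enable()` in Transformers) rather than in the DeepSpeed JSON.

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 16,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 3e-4, "betas": [0.9, 0.95], "weight_decay": 0.1 }
  },
  "gradient_clipping": 1.0
}
```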
Validation Perplexity: ~4.2 on a held-out scientific corpus
Training operated in a data-rich regime (~38 tokens per parameter across pre- and post-training), prioritizing domain robustness over benchmark optimization.
The corpus was constructed directly from raw arXiv LaTeX archives:
- Metadata filtering (subject selection, withdrawn removal)
- `.tar.gz` archive extraction
- Multi-file LaTeX resolution (`\input`, `\include`)
- Cleaning and normalization
- Deduplication
- Domain-aware tokenizer training (102k vocabulary)
Key engineering challenges included LaTeX extraction inconsistencies, formula-heavy language detection issues, symbol fragmentation during tokenization, and large-scale I/O bottlenecks.
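The symbol-fragmentation problem can be illustrated with pre-tokenization rules: a generic word/punctuation split shatters LaTeX control sequences, while a LaTeX-aware pattern keeps them whole. A minimal sketch (illustrative only, not the project's actual tokenizer):

```python
import re

# Generic pre-tokenization: words or single punctuation marks.
# This splits "\mathbb" into "\" and "mathbb".
generic = re.compile(r"\w+|[^\w\s]")

# LaTeX-aware pre-tokenization: keep control sequences intact.
latex_aware = re.compile(r"\\[a-zA-Z]+|\w+|[^\w\s]")

text = r"\mathbb{R}^n \to \mathbb{R}"
print(generic.findall(text))      # starts ['\\', 'mathbb', '{', 'R', ...]
print(latex_aware.findall(text))  # starts ['\\mathbb', '{', 'R', ...]
```

Training the vocabulary over units like `\mathbb` rather than stray backslashes is one way a domain-aware tokenizer avoids fragmenting mathematical symbols.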
Minnow-Math-1.5B is a base model.
It demonstrates:
- Strong familiarity with scientific writing style
- Stable LaTeX structural modeling
- Symbolic fluency
It does not include:
- Instruction tuning
- RLHF or preference alignment
- Benchmark optimization
Downstream benchmark accuracy is currently modest without supervised fine-tuning or LoRA adaptation.
This release is intended primarily for research and experimentation.
Requirements:
- Python 3.10+
- PyTorch
- Transformers
- DeepSpeed
- 2× A100 GPUs recommended
Install dependencies:
```
pip install -r requirements.txt
```

Launch training:

```
deepspeed train.py
```

If you use this work, please cite:
```bibtex
@article{kitefish2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={Your Name},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```

MIT License © 2026 KiteFish