The current plan is that, for each model variant, the following runs will be executed in order:
- 3b run with 300b tokens (cc only, using FineWeb-Edu)
- 3b run with 1T tokens (mixed data, using FineWeb-Edu + Dolma v1.7)
- 7b run with 2T-4T tokens (TBD)
Models will be evaluated and the results reported back in this repo.
We will keep a running log of the experiments and rough schedules (which depend on GPU availability).