AstroPT: a Large Observation (foundation) Model for astronomy 🔭

Welcome to our simple repository for training astronomical large observation models. This repository began its life as Andrej Karpathy's nanoGPT and has been adapted to work with astronomical observation data. In train.py you will find a ~300-line boilerplate training loop, and in model.py a ~300-line GPT model definition with an MLP tokeniser and a regression loss.

Check out the UniverseTBD Discord for updates: discord.gg/MNEVegvfJq

Read the docs here: astropt.readthedocs.io

There is some deep lore about our logo

How does AstroPT work?

AstroPT is an autoregressive transformer under the hood.

Similarly to language models that predict the next word in a sentence, AstroPT processes sequences of astronomical data chunks to predict what comes next.

The intuition here is that this next-token-prediction task requires the model to internalise some understanding of the physical processes underlying the training data.

This is just like how a text GPT needs to have some knowledge of geography to guess a country's capital given a description of that country, or some knowledge of coding to write compilable Fortran.

Below we can see this principle applied to a galaxy image, where we split the image into chunks and pass them into an AstroPT model:
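To make the chunking step concrete, here is a minimal NumPy sketch of splitting an image into a sequence of patch "tokens" and pairing each one with its successor as a next-token target. The `patchify` helper, the 64x64 image, the patch size of 16, and the plain raster ordering are all assumptions for illustration; AstroPT's own patching and patch ordering may differ.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Hypothetical helper for illustration; patches come out in raster order
    (left to right, top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Cut both spatial axes into (block index, offset within block).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two block indices to the front: (rows, cols, p, p, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# A dummy 64x64 RGB "galaxy image" standing in for real data.
image = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
tokens = patchify(image, patch_size=16)

# Next-token prediction: the model sees patch i and is asked for patch i + 1.
inputs, targets = tokens[:-1], tokens[1:]
print(tokens.shape, inputs.shape, targets.shape)
```

Because the targets are continuous pixel values rather than discrete vocabulary entries, the prediction is scored with a regression-style loss instead of a cross-entropy over tokens.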

Of course we can apply this next-token-prediction task across many modalities due to its flexibility.

Check out our work on Euclid data for an example, where we chain galaxy image tokens and spectral energy distribution data and pass them into a single, unified AstroPT model.

Masked autoencoder (MAE) objective

As well as the default autoregressive objective, AstroPT can be pretrained with a BERT-style masked autoencoder objective (He et al. 2021, Devlin et al. 2019). A fraction of the image patches is replaced by a learnable mask token, the full patch sequence is processed by the bidirectional encoder, and the masked patches are reconstructed. Switch objectives with the objective config field ("ar" or "mae"); MAE additionally requires bidirectional attention (attn_type="full") and uses AstroPT's existing learned (BERT-style) positional embeddings. The same scripts/train.py runs both objectives — see config/astropt_mae.py for an example. MAE currently supports a single image modality.
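The masking step described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not AstroPT's implementation: `mask_patches` and `mask_ratio` are hypothetical names, the zero vector stands in for a learnable mask token, and the "reconstruction" is a placeholder for the encoder output.

```python
import numpy as np


def mask_patches(patches, mask_token, mask_ratio, rng):
    """Replace a random fraction of patches with a shared mask token.

    Hypothetical helper: returns the corrupted sequence and a boolean array
    marking which positions were masked.
    """
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[masked_idx] = True
    corrupted = patches.copy()
    corrupted[is_masked] = mask_token
    return corrupted, is_masked


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 768))   # 16 flattened image patches
mask_token = np.zeros(768)             # stand-in for a learnable vector
corrupted, is_masked = mask_patches(patches, mask_token, mask_ratio=0.75, rng=rng)

# A bidirectional encoder would process the full `corrupted` sequence here.
reconstruction = corrupted             # placeholder for the model output

# The reconstruction loss is computed on the masked positions only.
loss = np.mean((reconstruction[is_masked] - patches[is_masked]) ** 2)
print(int(is_masked.sum()), float(loss))
```

Every position stays in the sequence, masked or not, which is why this objective needs full (bidirectional) attention rather than a causal mask.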

I just want to run it! 🗣️

Okay I hear you! First you need to install the model:

Install

You can install via pip from PyPI:

pip install astropt

Or, to install locally from a git clone using uv:

git clone https://github.com/Smith42/astroPT.git
cd astroPT
uv sync

Load a pre-trained model

To load and run a pre-trained AstroPT model from HuggingFace you can use the load_astropt function:

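import torch

from astropt.model_utils import load_astropt

# Download the checkpoint from HuggingFace and build the model.
model = load_astropt(
    repo_id="smith42/astropt_v2.0",
    path="astropt/095M",
    weights_filename="ckpt.pt",
)
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)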

where repo_id is the HuggingFace repository ID, path is the path within the repository that contains the AstroPT model checkpoint, and weights_filename is the name of the checkpoint file at that path.

Pre-trained models

Below are some pre-trained models you can load with the code snippet above. Please make sure that you are using the correct version of AstroPT to load these!

Survey	Modalities	AstroPT version	Model weights	Dataset	Paper
DESI Legacy Survey	JPG galaxy imagery	v1.0.0	AstroPT	Galaxies Dataset	arXiv:2405.14930
Euclid	FITS VIS, NISP galaxy imagery and SED data	v1.0.2	AstroPT-Euclid	Euclid Training Dataset	arXiv:2503.15312
DESI Legacy Survey	JPG galaxy imagery	v2.0.5	AstroPT v2.0	Galaxies Dataset v2.0	arXiv:2405.14930

Scripts for pre-training and processing data

Check out scripts for a collection of all the scripts we have used to get the results in these papers, and scripts/train.py for an example boilerplate script for pre-training your own AstroPT. config contains example user configurations for pre-training.

AstroPT trains for roughly one epoch, so train.py can save intermediate checkpoints by step count. Set num_checkpoints=N to save N snapshots across [0, max_iters], always including the first (random-init) and last (final) step. checkpoint_schedule selects the spacing and is one of "log" (default; geometric and dense early, which is good for probing how representations emerge over training), "even" (uniform), or "manual" (use the explicit checkpoint_steps list, e.g. --checkpoint_steps=[0,512,4096,30000]).

These checkpoints are independent of the best-val ckpt.pt. Each checkpoint also stores optimizer state, so budget disk accordingly.
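As a rough illustration of how the three schedules differ, the sketch below computes the checkpoint steps for each. It is a guess at the behaviour described above, not the code in train.py: the function name `checkpoint_steps_for` and the exact spacing and rounding rules are assumptions.

```python
import numpy as np


def checkpoint_steps_for(max_iters, num_checkpoints, schedule="log", manual_steps=None):
    """Return sorted, de-duplicated step counts at which to save a checkpoint.

    Hypothetical helper; step 0 and step max_iters are always included for
    the "log" and "even" schedules.
    """
    if schedule == "manual":
        return sorted(set(manual_steps))
    if num_checkpoints < 2:
        raise ValueError("need at least 2 checkpoints to include first and last step")
    if schedule == "even":
        steps = np.linspace(0, max_iters, num_checkpoints)
    elif schedule == "log":
        # Step 0 cannot sit on a geometric grid, so prepend it and space the
        # remaining checkpoints geometrically between 1 and max_iters.
        steps = np.concatenate([[0], np.geomspace(1, max_iters, num_checkpoints - 1)])
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")
    # Rounding can merge early steps, so the result may hold fewer than N steps.
    return sorted({int(round(float(s))) for s in steps})


even_steps = checkpoint_steps_for(30000, 4, schedule="even")
log_steps = checkpoint_steps_for(30000, 5, schedule="log")
manual = checkpoint_steps_for(30000, 0, schedule="manual", manual_steps=[0, 512, 4096, 30000])
print(even_steps, log_steps, manual)
```

The geometric spacing puts most snapshots in the first few hundred steps, where the loss and the learned representations change fastest.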

scripts/linear_probe.py has an example script for inferring embeddings from a pre-trained model and running a finetuning routine on them 🌝.
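For readers new to the idea, a linear probe fits a linear model on frozen embeddings to test how much of a target property they encode. The sketch below uses random stand-in embeddings and synthetic labels rather than real AstroPT outputs, and a least-squares fit rather than whatever routine scripts/linear_probe.py actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, emb_dim = 500, 32

# Stand-in for embeddings extracted from a frozen, pre-trained model.
embeddings = rng.normal(size=(n_samples, emb_dim))
# Synthetic target that is (almost) linear in the embeddings.
true_weights = rng.normal(size=emb_dim)
labels = embeddings @ true_weights + 0.01 * rng.normal(size=n_samples)

# Hold out the last 100 samples for evaluation.
n_train = 400
ones = np.ones((n_samples, 1))
design = np.hstack([embeddings, ones])  # extra column acts as a bias term
X_train, X_test = design[:n_train], design[n_train:]
y_train, y_test = labels[:n_train], labels[n_train:]

# The probe itself: ordinary least squares on top of the frozen features.
probe_weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
predictions = X_test @ probe_weights

# Coefficient of determination on the held-out set.
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"held-out R^2: {r2:.4f}")
```

Only the probe's weights are fitted; the model that produced the embeddings is never updated, so the score reflects what the pre-trained representation already contains.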

And finally scripts/finetune.py has an example LoRA finetune routine.
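As background on the technique, LoRA keeps a pre-trained weight matrix frozen and learns a low-rank update alongside it. The NumPy sketch below shows that idea for a single linear layer. The shapes, `rank`, and `alpha` are illustrative values, and none of it is taken from scripts/finetune.py.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, alpha, rank):
    """Linear layer whose effective weight is W + (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, alpha, rank)

# Only A and B are updated during finetuning; W stays fixed.
trainable_params = A.size + B.size
frozen_params = W.size
print(trainable_params, frozen_params)
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and finetuning only has to learn the small A and B matrices.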

Multi-GPU streaming

When streaming the dataset from HuggingFace under DDP, train.py shards the stream across ranks with split_dataset_by_node so each GPU sees disjoint data (otherwise every rank replays the same stream and you get no data-throughput scaling), and applies a buffered shuffle (size shuffle_buffer_size) to the training stream.
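The two ideas in this paragraph, disjoint per-rank shards and a buffered shuffle, can be illustrated with plain Python iterators. This is a stand-in to show the behaviour, not how train.py or the HuggingFace `datasets` library implements it: `shard_stream` and `buffered_shuffle` are hypothetical helpers, and the integer stream stands in for a streamed dataset.

```python
import random
from itertools import islice


def shard_stream(stream, rank, world_size):
    """Give each rank every world_size-th example, offset by its rank."""
    return islice(stream, rank, None, world_size)


def buffered_shuffle(stream, buffer_size, seed):
    """Approximately shuffle a stream using a fixed-size buffer.

    Once the buffer is full, each incoming item evicts a randomly chosen
    buffered item, which is yielded. Leftovers are shuffled at the end.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer


world_size = 4
per_rank = [
    list(
        buffered_shuffle(
            shard_stream(iter(range(100)), rank, world_size),
            buffer_size=8,
            seed=rank,
        )
    )
    for rank in range(world_size)
]

for rank, examples in enumerate(per_rank):
    print(rank, examples[:5])
```

A larger buffer gives a better approximation to a full shuffle at the cost of holding more examples in memory, which is the trade-off a setting like shuffle_buffer_size controls.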

Contributors

Ryan Roberts 💻 🤔 🖋	Mike Smith 💻 🤔 🖋 🔣	mhuertascompany 🤔 🖋	Malgorzata Siudek 🤔 🖋 💻 🔣	gimarso 🤔 💻	Víctor Alonso 🐛	Ashod Khederlarian 💻
SogolSanjaripour 💻 🤔	ksd3 🤔 💻