torch_measure

PyTorch-native measurement science toolkit for AI evaluation.

Benchmark scores increasingly gate deployment decisions but rarely predict how a model will behave in production. torch_measure brings the measurement-science apparatus — item response theory, adaptive testing, psychometric metrics, and factor models — to the PyTorch ecosystem, so AI evaluations can be designed and interpreted with the rigor the stakes now demand.

Installation

With pip:

pip install torch_measure

With uv (faster; drop-in replacement for pip):

uv pip install torch_measure        # into the active environment
uv add torch_measure                # into a uv-managed project

With optional dependencies (same syntax for pip and uv pip install):

pip install torch_measure[all]          # Everything
pip install torch_measure[bayesian]     # Pyro-based Bayesian IRT
pip install torch_measure[data]         # HuggingFace data loaders
pip install torch_measure[viz]          # Visualization

Quick Start

import torch
from torch_measure.models import Rasch
from torch_measure.data import ResponseMatrix

# Create a binary response matrix (models x items)
responses = torch.bernoulli(torch.rand(50, 200))
rm = ResponseMatrix(responses)

# Fit a Rasch (1PL) model
model = Rasch(n_subjects=rm.n_rows, n_items=rm.n_cols)
model.fit(rm.data, method="mle")

# Get estimated abilities and difficulties
abilities = model.ability          # (50,) subject ability parameters
difficulties = model.difficulty    # (200,) item difficulty parameters

# Predict response probabilities
probs = model.predict()            # (50, 200) predicted P(correct)
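
Under the hood, Rasch is the one-parameter logistic model: P(correct) = sigmoid(ability - difficulty). A minimal sanity check, assuming predict() uses this standard parameterization (the broadcasting below is illustrative, not the library's internals):

import torch

# 1PL response function: P(subject i answers item j correctly)
# is sigmoid(theta_i - b_j); (50, 1) minus (200,) broadcasts to (50, 200)
manual_probs = torch.sigmoid(abilities.unsqueeze(1) - difficulties.unsqueeze(0))
assert torch.allclose(manual_probs, probs)   # holds if predict() is standard 1PL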

Loading a Benchmark

torch_measure.datasets.load() returns a LongFormData object backed by the measurement-db HuggingFace bucket. Pivot into the legacy wide-form ResponseMatrix only when you want to fit classical IRT on averaged-across-trial responses.

from torch_measure.datasets import list_datasets, info, load

list_datasets()                       # names sourced from benchmarks.parquet
info("mtbench")                       # DatasetInfo with modality/domain/license/...

data = load("mtbench")                # long-form: responses, items, subjects, traces, info
data.responses.head()
rm = data.to_response_matrix()        # opt-in pivot to ResponseMatrix
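
If you need the reshape outside the library, the long-to-wide pivot is a plain pandas operation. A sketch, assuming hypothetical column names (subject, item, correct) rather than the library's documented schema:

import pandas as pd   # pandas assumed as the backing frame type

# One row per (subject, item, trial); averaging over trials yields the
# "averaged-across-trial" wide form that classical IRT expects
long = data.responses                       # assumed columns: subject, item, correct
wide = long.pivot_table(index="subject", columns="item",
                        values="correct", aggfunc="mean")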

Adaptive Testing

from torch_measure.cat import AdaptiveTester

# Efficiently estimate a new model's ability using fewer items
tester = AdaptiveTester(model, strategy="fisher")   # select items by Fisher information
# new_model_responses: binary responses from the model being measured
estimated_ability = tester.run(responses=new_model_responses, budget=50)
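
For intuition: under the Rasch model, item information is I_j(theta) = p_j(1 - p_j), so Fisher selection amounts to picking the unadministered item whose difficulty is closest to the current ability estimate. A minimal sketch of that rule (fisher_select is hypothetical, not a library function):

import torch

def fisher_select(theta, difficulties, administered):
    # Rasch item information I_j(theta) = p_j * (1 - p_j) peaks at p_j = 0.5,
    # i.e. where item difficulty matches the current ability estimate
    p = torch.sigmoid(theta - difficulties)
    info = p * (1 - p)
    info[administered] = float("-inf")      # never re-administer an item
    return int(info.argmax())

next_item = fisher_select(torch.tensor(0.0), difficulties, administered=[3, 17])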

Psychometric Metrics

from torch_measure.metrics import tetrachoric_correlation, infit_statistics, expected_calibration_error

# Compute tetrachoric correlation matrix
corr = tetrachoric_correlation(rm.data)

# Evaluate model fit (probs = model.predict() from the Quick Start)
infit = infit_statistics(probs, rm.data)

# Calibration quality
ece = expected_calibration_error(probs, rm.data)
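
Expected calibration error itself is simple to state: bin predictions by confidence, compare each bin's mean predicted probability against its empirical accuracy, and average the gaps weighted by bin mass. A reference sketch with equal-width bins (the library's binning strategy may differ):

import torch

def ece_sketch(probs, labels, n_bins=10):
    # Assign each prediction to one of n_bins equal-width bins on [0, 1]
    probs, labels = probs.flatten(), labels.flatten().float()
    bins = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    ece = torch.tensor(0.0)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = (probs[mask].mean() - labels[mask].mean()).abs()
            ece = ece + mask.float().mean() * gap   # weight gap by bin mass
    return ece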

Features

Module                   Description
torch_measure.models     IRT (Rasch, 2PL, 3PL, Amortized, Many-Facet), Beta IRT (BetaRasch, Beta2PL), factor models, rotation
torch_measure.cat        Computerized Adaptive Testing with Fisher information selection
torch_measure.fitting    MLE, EM, JML, and Bayesian SVI parameter estimation
torch_measure.metrics    Tetrachoric correlation, Mokken scalability, infit/outfit, ECE, DIF
torch_measure.data       Response matrices, masking strategies, HuggingFace/HELM loaders
torch_measure.viz        Response heatmaps, ICCs, information plots, academic styling

Why torch_measure?

Measurement science — item response theory, adaptive testing, reliability, validity — has answered "how much should I trust this score?" for decades in education, psychology, and clinical assessment. torch_measure brings that apparatus to the PyTorch ecosystem, so evaluation can be done with the same rigor as training.

  • GPU-accelerated: All models are PyTorch nn.Modules — train on GPU, use autograd (see the device sketch after this list).
  • Amortized inference: Predict item parameters from embeddings without per-item calibration.
  • Built for LLM-era data: Scales to large benchmark matrices, handles missing responses, composes with modern ML pipelines.
  • Composable: Mix IRT, factor models, and adaptive testing freely.
  • Research-ready: Powers 6+ published papers from AIMS Foundations.
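
Because every model subclasses nn.Module, device placement follows ordinary PyTorch conventions. A minimal sketch, assuming fit() runs on whatever device the parameters and data share:

import torch
from torch_measure.models import Rasch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Rasch(n_subjects=50, n_items=200).to(device)
model.fit(rm.data.to(device), method="mle")    # rm as in the Quick Start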

Citation

If you use torch_measure in your research, please cite:

@software{torch_measure,
  title={torch\_measure: PyTorch-native Measurement Science Toolkit},
  author={AIMS Foundations},
  url={https://github.com/aims-foundations/torch_measure},
  year={2026}
}

Contributing

We welcome contributions! Please see our contributing guidelines for details, or drop by our Discord to chat.

License

MIT License. See LICENSE for details.
