torch_measure

PyTorch-native measurement science toolkit for AI evaluation.

Benchmark scores increasingly gate deployment decisions but rarely predict how a model will behave in production. torch_measure brings the measurement-science apparatus — item response theory, adaptive testing, psychometric metrics, and factor models — to the PyTorch ecosystem, so AI evaluations can be designed and interpreted with the rigor the stakes now demand.

Installation

With pip:

pip install torch_measure

With uv (faster; drop-in replacement for pip):

uv pip install torch_measure        # into the active environment
uv add torch_measure                # into a uv-managed project

With optional dependencies (same syntax for pip and uv pip install):

pip install torch_measure[all]          # Everything
pip install torch_measure[bayesian]     # Pyro-based Bayesian IRT
pip install torch_measure[data]         # HuggingFace data loaders
pip install torch_measure[viz]          # Visualization

Quick Start

import torch
from torch_measure.models import Rasch
from torch_measure.data import ResponseMatrix

# Create a binary response matrix (models x items)
responses = torch.bernoulli(torch.rand(50, 200))
rm = ResponseMatrix(responses)

# Fit a Rasch (1PL) model
model = Rasch(n_subjects=rm.n_rows, n_items=rm.n_cols)
model.fit(rm.data, method="mle")

# Get estimated abilities and difficulties
abilities = model.ability          # (50,) subject ability parameters
difficulties = model.difficulty    # (200,) item difficulty parameters

# Predict response probabilities
probs = model.predict()            # (50, 200) predicted P(correct)
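
Under the hood, Rasch is the one-parameter logistic model: P(correct) = sigmoid(ability - difficulty). A minimal sanity check, assuming predict() uses this standard parameterization (the broadcasting below is illustrative, not the library's internals):

import torch

# 1PL response function: P(subject i answers item j correctly)
# is sigmoid(theta_i - b_j); (50, 1) minus (200,) broadcasts to (50, 200)
manual_probs = torch.sigmoid(abilities.unsqueeze(1) - difficulties.unsqueeze(0))
assert torch.allclose(manual_probs, probs)   # holds if predict() is standard 1PL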

Loading a Benchmark

torch_measure.datasets.load() returns a LongFormData object backed by the measurement-db HuggingFace bucket. Pivot into the legacy wide-form ResponseMatrix only when you want to fit classical IRT on averaged-across-trial responses.

from torch_measure.datasets import list_datasets, info, load

list_datasets()                       # names sourced from benchmarks.parquet
info("mtbench")                       # DatasetInfo with modality/domain/license/...

data = load("mtbench")                # long-form: responses, items, subjects, traces, info
data.responses.head()
rm = data.to_response_matrix()        # opt-in pivot to ResponseMatrix
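
If you need the reshape outside the library, the long-to-wide pivot is a plain pandas operation. A sketch, assuming hypothetical column names (subject, item, correct) rather than the library's documented schema:

import pandas as pd   # pandas assumed as the backing frame type

# One row per (subject, item, trial); averaging over trials yields the
# "averaged-across-trial" wide form that classical IRT expects
long = data.responses                       # assumed columns: subject, item, correct
wide = long.pivot_table(index="subject", columns="item",
                        values="correct", aggfunc="mean")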

Adaptive Testing

from torch_measure.cat import AdaptiveTester

# Efficiently estimate a new model's ability using fewer items
tester = AdaptiveTester(model, strategy="fisher")   # select items by Fisher information
# new_model_responses: binary responses from the model being measured
estimated_ability = tester.run(responses=new_model_responses, budget=50)
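
For intuition: under the Rasch model, item information is I_j(theta) = p_j(1 - p_j), so Fisher selection amounts to picking the unadministered item whose difficulty is closest to the current ability estimate. A minimal sketch of that rule (fisher_select is hypothetical, not a library function):

import torch

def fisher_select(theta, difficulties, administered):
    # Rasch item information I_j(theta) = p_j * (1 - p_j) peaks at p_j = 0.5,
    # i.e. where item difficulty matches the current ability estimate
    p = torch.sigmoid(theta - difficulties)
    info = p * (1 - p)
    info[administered] = float("-inf")      # never re-administer an item
    return int(info.argmax())

next_item = fisher_select(torch.tensor(0.0), difficulties, administered=[3, 17])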

Psychometric Metrics

from torch_measure.metrics import tetrachoric_correlation, infit_statistics, expected_calibration_error

# Compute tetrachoric correlation matrix
corr = tetrachoric_correlation(rm.data)

# Evaluate model fit (probs = model.predict() from the Quick Start)
infit = infit_statistics(probs, rm.data)

# Calibration quality
ece = expected_calibration_error(probs, rm.data)
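
Expected calibration error itself is simple to state: bin predictions by confidence, compare each bin's mean predicted probability against its empirical accuracy, and average the gaps weighted by bin mass. A reference sketch with equal-width bins (the library's binning strategy may differ):

import torch

def ece_sketch(probs, labels, n_bins=10):
    # Assign each prediction to one of n_bins equal-width bins on [0, 1]
    probs, labels = probs.flatten(), labels.flatten().float()
    bins = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    ece = torch.tensor(0.0)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = (probs[mask].mean() - labels[mask].mean()).abs()
            ece = ece + mask.float().mean() * gap   # weight gap by bin mass
    return ece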

Features

Module                   Description
torch_measure.models     IRT (Rasch, 2PL, 3PL, Amortized, Many-Facet), Beta IRT (BetaRasch, Beta2PL), factor models, rotation
torch_measure.cat        Computerized Adaptive Testing with Fisher information selection
torch_measure.fitting    MLE, EM, JML, and Bayesian SVI parameter estimation
torch_measure.metrics    Tetrachoric correlation, Mokken scalability, infit/outfit, ECE, DIF
torch_measure.data       Response matrices, masking strategies, HuggingFace/HELM loaders
torch_measure.viz        Response heatmaps, ICCs, information plots, academic styling

Why torch_measure?

Measurement science — item response theory, adaptive testing, reliability, validity — has answered "how much should I trust this score?" for decades in education, psychology, and clinical assessment. torch_measure brings that apparatus to the PyTorch ecosystem, so evaluation can be done with the same rigor as training.

  • GPU-accelerated: All models are PyTorch nn.Modules — train on GPU, use autograd (see the device sketch after this list).
  • Amortized inference: Predict item parameters from embeddings without per-item calibration.
  • Built for LLM-era data: Scales to large benchmark matrices, handles missing responses, composes with modern ML pipelines.
  • Composable: Mix IRT, factor models, and adaptive testing freely.
  • Research-ready: Powers 6+ published papers from AIMS Foundations.
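
Because every model subclasses nn.Module, device placement follows ordinary PyTorch conventions. A minimal sketch, assuming fit() runs on whatever device the parameters and data share:

import torch
from torch_measure.models import Rasch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Rasch(n_subjects=50, n_items=200).to(device)
model.fit(rm.data.to(device), method="mle")    # rm as in the Quick Start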

Citation

If you use torch_measure in your research, please cite:

@software{torch_measure,
  title={torch\_measure: PyTorch-native Measurement Science Toolkit},
  author={AIMS Foundations},
  url={https://github.com/aims-foundations/torch_measure},
  year={2026}
}

Contributing

We welcome contributions! Please see our contributing guidelines for details, or drop by our Discord to chat.

License

MIT License. See LICENSE for details.
