LLM training data querying through infini-gram mini

Note: I am not the author of the original infini-gram mini engine or paper. This repository is a personal build on top of the original infini-gram mini project (paper, project home). All credit for the core engine and indexing pipeline goes to Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi.

What this repo adds on top of the original:

An Apptainer container definition for reproducible HPC deployment (scripts/infini_gram_mini.def)
SLURM job scripts for building the container and running indexing jobs (scripts/)
A pyproject.toml for installing the package locally with pip
A CLI query tool (api/query.py) with count / search / find subcommands and formatted output
A REST API server (api/api_server.py) for serving an index over HTTP

Repository layout

infini-gram-mini/
├── infini-gram-mini/              # Upstream source code
│   ├── engine/
│   │   └── src/
│   │       ├── cpp_engine.cpp     # C++ query backend (pybind11)
│   │       ├── cpp_engine.h
│   │       ├── engine.py          # Python wrapper
│   │       └── models.py
│   └── indexing/
│       ├── cpp/
│       │   └── indexing.cpp       # Compiled to cpp_indexing binary
│       ├── rust_indexing          # Compiled Rust binary (built from third_party/suffix_array)
│       └── indexing.py            # Core prepare / build_sa_bwt logic
├── scripts/
│   ├── build_apptainer.sh         # SLURM script: builds the Apptainer image
│   ├── infini_gram_mini.def       # Apptainer container definition
│   ├── index_parquet.sh           # SLURM: single-job indexing
│   └── index_parquet_array.sh     # SLURM: array-job indexing
├── third_party/
│   ├── nlohmann/                  # JSON header
│   ├── parallel_sdsl/             # SDSL + divsufsort (used by indexing)
│   ├── sdsl/                      # SDSL + divsufsort (used by query engine)
│   └── suffix_array/              # Rust source for rust_indexing binary
├── api/
│   ├── query.py                   # CLI query tool (count / search / find)
│   ├── api_server.py              # Flask REST API server
│   └── api_config.json            # Index config (edit this to point at your indexes)
└── pyproject.toml                 # Python package definition

Installation

Two paths are supported: plain Python (compile everything natively on the host) or Apptainer (recommended for HPC — bundles all toolchains in one image).

Option A — Python (bare metal)

Requirements: Python ≥ 3.10, GCC ≥ 11, Rust toolchain (cargo).

1. Install Python dependencies

pip install pybind11 pyarrow numpy zstandard

Or install the package directly (this also records the dependencies):

pip install -e .

2. Build the Rust suffix-array binary

cd third_party/suffix_array
cargo build --release
cp target/release/rust_indexing ../../infini-gram-mini/indexing/rust_indexing

3. Compile the C++ indexing binary

cd infini-gram-mini/indexing
g++ -std=c++17 -O3 \
    cpp/indexing.cpp -o cpp/cpp_indexing \
    -I../../third_party/parallel_sdsl/include \
    -L../../third_party/parallel_sdsl/lib \
    -lsdsl -ldivsufsort -ldivsufsort64

4. Compile the query engine (Python extension)

cd infini-gram-mini/engine
c++ -std=c++17 -O3 -shared -fPIC \
    $(python3 -m pybind11 --includes) \
    src/cpp_engine.cpp \
    -o src/cpp_engine$(python3-config --extension-suffix) \
    -I../../third_party/sdsl/include \
    -L../../third_party/sdsl/lib \
    -lsdsl -ldivsufsort -ldivsufsort64 -pthread
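As a quick sanity check after step 4, the filename the compile command should have produced can be computed from Python itself. This is a minimal sketch and not part of the repository: the helper name `expected_extension_path` is hypothetical, and it assumes the command above was run from infini-gram-mini/engine so that the output lands in src/.

```python
import sysconfig
from pathlib import Path


def expected_extension_path(src_dir: str = "src") -> Path:
    """Return the path the compile command above writes to.

    sysconfig's EXT_SUFFIX is the value that
    `python3-config --extension-suffix` prints for this interpreter.
    """
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    return Path(src_dir) / f"cpp_engine{suffix}"


if __name__ == "__main__":
    path = expected_extension_path()
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```

If the file is reported missing, the usual cause is building with a different interpreter than the one used to run the check, since the suffix encodes the Python version and platform.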

Option B — Apptainer (recommended for HPC)

The container bundles Python 3.12, GCC, Rust, and all compiled binaries in one reproducible image (x86_64 / amd64 only). Build it once; run it on any compatible node.

1. Build the image (submit as a SLURM job)

From the repo root:

sbatch scripts/build_apptainer.sh

The finished image is written to <path>/infini_gram_mini.sif.

2. Run indexing

Via Python CLI:

python -m indexing.indexing \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output

Via apptainer:

apptainer run /path/to/infini_gram_mini.sif \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output

Or submit via SLURM:

sbatch scripts/index_parquet.sh        # single job
sbatch scripts/index_parquet_array.sh  # array job

3. Query inside the container

Bind-mount the repo so the container can read your config and index:

apptainer exec \
    --bind /path/to/infini-gram-mini:/repo \
    --bind /path/to/index:/index \
    /path/to/infini_gram_mini.sif \
    python3 /repo/api/query.py \
        --config /repo/api/api_config.json \
        count "natural language processing"

Querying an existing index

1. Edit the index config

api/api_config.json is a JSON array where each entry describes one named index.

index_dirs accepts two forms:

A root directory string — all immediate subdirectories are loaded as shards (sorted). Convenient when the indexing job produces save_dir/00/, save_dir/01/, etc.
A list of directory strings — explicit list of shard paths, loaded in the given order.

[
    {
        "name": "my_index",
        "index_dirs": "/path/to/index",
        "load_to_ram": false,
        "get_metadata": true
    }
]

Or with an explicit shard list:

[
    {
        "name": "my_index",
        "index_dirs": ["/path/to/index/00", "/path/to/index/01"],
        "load_to_ram": false,
        "get_metadata": true
    }
]
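The two forms of index_dirs described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour (sorted immediate subdirectories for a string, caller order for a list), not the code the engine actually runs; `resolve_index_dirs` and `load_index_config` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def resolve_index_dirs(index_dirs: Union[str, List[str]]) -> List[str]:
    """Expand an index_dirs value into an ordered list of shard paths."""
    if isinstance(index_dirs, str):
        # Root-directory form: immediate subdirectories, sorted by name.
        root = Path(index_dirs)
        return sorted(str(p) for p in root.iterdir() if p.is_dir())
    # List form: keep the order given in the config.
    return list(index_dirs)


def load_index_config(config_path: str) -> Dict[str, List[str]]:
    """Map each index name in the JSON array to its resolved shard list."""
    with open(config_path, encoding="utf-8") as f:
        entries = json.load(f)
    return {e["name"]: resolve_index_dirs(e["index_dirs"]) for e in entries}
```

Running `load_index_config("api_config.json")` before starting a long job is a cheap way to confirm that every configured path exists and that the shard order is what you expect.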

2. Query via CLI

api/query.py loads the engine directly — no server needed:

cd api

# Count occurrences
python query.py --config api_config.json count "natural language processing"
# Count:   83,470
# Latency: 12.3 ms

# Search and retrieve matching document contexts (query highlighted in output)
python query.py --config api_config.json search "transformer model"

# Target a specific named index
python query.py --config api_config.json --index my_index count "BERT"

# Customise result count and context window
python query.py --config api_config.json search "GPT" --max_results 5 --max_ctx_len 300

# Raw suffix-array intervals per shard
python query.py --config api_config.json find "natural language processing"

3. (Optional) REST API server

Start the server:

cd api
python api_server.py --config api_config.json --port 5000

Then query it from any client:

# via query.py
python query.py --api_url http://localhost:5000 count "natural language processing"

# via curl
curl -s http://localhost:5000/query \
     -H "Content-Type: application/json" \
     -d '{"index": "my_index", "query_type": "count", "query": "natural language processing"}'
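The same request as the curl call above can be made from Python with only the standard library. This is a sketch under assumptions: the endpoint and payload fields are copied from the curl example, while the helper names are hypothetical and the response is assumed to be a JSON object (its exact fields are not documented here).

```python
import json
import urllib.request


def build_query_request(api_url: str, index: str, query_type: str,
                        query: str) -> urllib.request.Request:
    """Build the POST request that the curl example sends to /query."""
    payload = {"index": index, "query_type": query_type, "query": query}
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(api_url: str, index: str, query_type: str, query: str,
              timeout: float = 30.0) -> dict:
    """Send the request and decode the body (assumed to be JSON)."""
    req = build_query_request(api_url, index, query_type, query)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires api_server.py to be running on localhost:5000.
    print(run_query("http://localhost:5000", "my_index", "count",
                    "natural language processing"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log without a running server.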

4. (Optional) Raw Python API

index_dirs accepts a root directory (auto-discovers sorted subdirs) or an explicit list of shard paths:

from engine.src import InfiniGramMiniEngine

# Option A: single root directory — loads all subdirectories as shards
engine = InfiniGramMiniEngine(
    index_dirs="/path/to/index",
    load_to_ram=False,
    get_metadata=True,
)

# Option B: explicit list of shard directories
engine = InfiniGramMiniEngine(
    index_dirs=["/path/to/index/00", "/path/to/index/01"],
    load_to_ram=False,
    get_metadata=True,
)

engine.count("natural language processing")
# {'count': 83470}

engine.find("natural language processing")
# {'cnt': 83470, 'segment_by_shard': [[442381579355, 442381620985], ...]}

engine.get_doc_by_rank(s=0, rank=442381579355, needle_len=27, max_ctx_len=200)
# {'doc_ix': ..., 'text': '...', 'metadata': {...}}
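To go from a find result to documents, each rank inside a shard's segment can be passed to get_doc_by_rank. The sketch below only enumerates the (shard, rank) pairs. It assumes each segment is a half-open [start, end) interval, which is not stated above, and `iter_match_ranks` is a hypothetical helper rather than part of the engine.

```python
from typing import Dict, Iterator, Tuple


def iter_match_ranks(find_result: Dict, max_results: int) -> Iterator[Tuple[int, int]]:
    """Yield (shard, rank) pairs for up to max_results matches.

    Assumption: every entry of segment_by_shard is a half-open
    [start, end) interval of suffix-array ranks for that shard.
    """
    remaining = max_results
    for shard, (start, end) in enumerate(find_result["segment_by_shard"]):
        # Cap this shard's range by what is still allowed overall.
        for rank in range(start, min(end, start + remaining)):
            yield shard, rank
            remaining -= 1
        if remaining <= 0:
            return


# Usage against a loaded engine (needs a real index, so shown as comments).
# needle_len is assumed to be the query's UTF-8 byte length, which is 27
# for the example query used above.
#
# query = "natural language processing"
# result = engine.find(query)
# for s, rank in iter_match_ranks(result, max_results=5):
#     doc = engine.get_doc_by_rank(s=s, rank=rank,
#                                  needle_len=len(query.encode("utf-8")),
#                                  max_ctx_len=200)
```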

Indexing a new dataset

Supported input formats: .parquet (with text column), .jsonl, .json.gz, .zst.

cd infini-gram-mini/indexing
python jobs/index_parquet.py \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output

The script automatically selects between the few-files and many-files pipeline depending on file count, and writes each index shard to its own subdirectory: save_dir/00/, save_dir/01/, and so on.

Tuning `--num_shards` for SLURM array jobs

--num_shards N sets a target size per shard (total_dataset_size / N), not the exact shard count. Files larger than the target are split into line-range slices; the number of shards actually produced depends on how those slices pack together and can be fewer than N.

Use --dry_run to check the actual count before submitting — it uses only file sizes (no decompression, completes in under a second):

python -m indexing.indexing \
    --data_dir /path/to/data \
    --num_shards 20 \
    --dry_run
# → prints: 11

Then set --array=0-10 instead of --array=0-19 to avoid wasting jobs on shards that have nothing to do.
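The arithmetic behind these two numbers is small enough to sketch. The helpers below are hypothetical and only restate what is described above: the per-shard target is total size divided by N, and the array range is 0-based and inclusive, so it ends at the dry-run count minus one. They assume array task i handles shard i.

```python
def target_shard_bytes(total_dataset_bytes: int, num_shards: int) -> float:
    """Target size per shard implied by --num_shards (total / N)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return total_dataset_bytes / num_shards


def slurm_array_spec(actual_shards: int) -> str:
    """0-based inclusive --array range covering every produced shard."""
    if actual_shards <= 0:
        raise ValueError("actual_shards must be positive")
    return f"0-{actual_shards - 1}"


if __name__ == "__main__":
    # The dry run above printed 11, so 11 array tasks are enough.
    print(f"--array={slurm_array_spec(11)}")
```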

Memory sizing: for compressed formats (.zst, .json.gz), the decompressed corpus per shard is typically 4–6× the on-disk slice size. Step 3 (FM-index / wavelet tree) needs roughly 2–3× the corpus size in RAM. Target ≤ 50 GB of corpus per shard to stay within 300 GB nodes.
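The rule of thumb above can be turned into a rough range estimate. This is a back-of-the-envelope sketch using only the multipliers quoted in the text (4–6× expansion, 2–3× RAM); the function names are hypothetical and real usage will vary with the data.

```python
from typing import Tuple

Range = Tuple[float, float]


def ram_for_corpus_gb(corpus_gb: float, ram_factor: Range = (2.0, 3.0)) -> Range:
    """Low/high RAM estimate (GB) for a given decompressed corpus size."""
    return corpus_gb * ram_factor[0], corpus_gb * ram_factor[1]


def ram_for_slice_gb(slice_gb: float,
                     expansion: Range = (4.0, 6.0),
                     ram_factor: Range = (2.0, 3.0)) -> Range:
    """Low/high RAM estimate (GB) from the compressed on-disk slice size."""
    low, _ = ram_for_corpus_gb(slice_gb * expansion[0], ram_factor)
    _, high = ram_for_corpus_gb(slice_gb * expansion[1], ram_factor)
    return low, high


if __name__ == "__main__":
    print(ram_for_corpus_gb(50))  # 50 GB corpus per shard
    print(ram_for_slice_gb(10))   # 10 GB compressed slice
```

With these multipliers a 50 GB corpus needs roughly 100 to 150 GB of RAM, and a 10 GB compressed slice lands between 80 and 180 GB.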

If you hit "too many open files":

ulimit -n 1048576
# or pass --ulimit 1048576 to the indexing script
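When indexing is launched from a Python wrapper, the same limit can be raised in-process with the standard `resource` module (Unix only). This is a sketch, not what the indexing script's --ulimit flag is known to do; the helper names are hypothetical. An unprivileged process can raise its soft limit only up to the hard limit, so the requested value is clamped.

```python
import resource


def pick_soft_limit(soft: int, hard: int, wanted: int) -> int:
    """Choose a new soft limit: never lower it, never exceed the hard limit."""
    inf = resource.RLIM_INFINITY
    if soft == inf:
        return soft  # already unlimited
    target = wanted if hard == inf else min(wanted, hard)
    return max(soft, target)


def raise_open_file_limit(wanted: int = 1048576) -> int:
    """Raise RLIMIT_NOFILE for this process and its children."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = pick_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    print("open-file soft limit is now", raise_open_file_limit())
```

If the hard limit itself is too low, only an administrator (or the cluster's limits configuration) can raise it.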

SLURM example

#!/bin/bash
#SBATCH --job-name=index
#SBATCH --partition=your_partition
#SBATCH --nodes=1
#SBATCH --time=20:00:00
#SBATCH --cpus-per-task=64

cd /path/to/infini-gram-mini/infini-gram-mini/indexing
python jobs/index_parquet.py \
    --data_dir /path/to/data \
    --save_dir /path/to/index \
    --num_shards 10 \
    --mem 500 \
    --cpus $SLURM_CPUS_PER_TASK

Citation

If you use infini-gram mini, please cite the original paper:

@misc{xu2025infinigramminiexactngram,
  title={Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index},
  author={Hao Xu and Jiacheng Liu and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2025},
  eprint={2506.12229},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.12229},
}
