Note: I am not the author of the original infini-gram mini engine or paper. This repository is a personal build on top of the original infini-gram mini project (paper, project home). All credit for the core engine and indexing pipeline goes to Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi.
What this repo adds on top of the original:
- An Apptainer container definition for reproducible HPC deployment (`scripts/infini_gram_mini.def`)
- SLURM job scripts for building the container and running indexing jobs (`scripts/`)
- A `pyproject.toml` for installing the package locally with pip
- A CLI query tool (`api/query.py`) with count / search / find subcommands and formatted output
- A REST API server (`api/api_server.py`) for serving an index over HTTP
```
infini-gram-mini/
├── infini-gram-mini/            # Upstream source code
│   ├── engine/
│   │   └── src/
│   │       ├── cpp_engine.cpp   # C++ query backend (pybind11)
│   │       ├── cpp_engine.h
│   │       ├── engine.py        # Python wrapper
│   │       └── models.py
│   └── indexing/
│       ├── cpp/
│       │   └── indexing.cpp     # Compiled to cpp_indexing binary
│       ├── rust_indexing        # Compiled Rust binary (built from third_party/suffix_array)
│       └── indexing.py          # Core prepare / build_sa_bwt logic
├── scripts/
│   ├── build_apptainer.sh       # SLURM script: builds the Apptainer image
│   ├── infini_gram_mini.def     # Apptainer container definition
│   ├── index_parquet.sh         # SLURM: single-job indexing
│   └── index_parquet_array.sh   # SLURM: array-job indexing
├── third_party/
│   ├── nlohmann/                # JSON header
│   ├── parallel_sdsl/           # SDSL + divsufsort (used by indexing)
│   ├── sdsl/                    # SDSL + divsufsort (used by query engine)
│   └── suffix_array/            # Rust source for rust_indexing binary
├── api/
│   ├── query.py                 # CLI query tool (count / search / find)
│   ├── api_server.py            # Flask REST API server
│   └── api_config.json          # Index config (edit this to point at your indexes)
└── pyproject.toml               # Python package definition
```
Two paths are supported: plain Python (compile everything natively on the host) or Apptainer (recommended for HPC — bundles all toolchains in one image).
Requirements: Python ≥ 3.10, GCC ≥ 11, Rust toolchain (cargo).
```bash
pip install pybind11 pyarrow numpy zstandard
```

Or install the package directly (this also records the dependencies):

```bash
pip install -e .
```

Build the Rust indexing binary:

```bash
cd third_party/suffix_array
cargo build --release
cp target/release/rust_indexing ../../infini-gram-mini/indexing/rust_indexing
```

Build the C++ indexing binary:

```bash
cd infini-gram-mini/indexing
g++ -std=c++17 -O3 \
    cpp/indexing.cpp -o cpp/cpp_indexing \
    -I../../third_party/parallel_sdsl/include \
    -L../../third_party/parallel_sdsl/lib \
    -lsdsl -ldivsufsort -ldivsufsort64
```

Build the query engine extension:

```bash
cd infini-gram-mini/engine
c++ -std=c++17 -O3 -shared -fPIC \
    $(python3 -m pybind11 --includes) \
    src/cpp_engine.cpp \
    -o src/cpp_engine$(python3-config --extension-suffix) \
    -I../../third_party/sdsl/include \
    -L../../third_party/sdsl/lib \
    -lsdsl -ldivsufsort -ldivsufsort64 -pthread
```

The container bundles Python 3.12, GCC, Rust, and all compiled binaries in one reproducible image (x86_64 / amd64 only). Build it once; run it on any compatible node.
From the repo root:
```bash
sbatch scripts/build_apptainer.sh
```

The finished image is written to `<path>/infini_gram_mini.sif`.
Via Python CLI:
```bash
python -m indexing.indexing \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output
```

Via Apptainer:

```bash
apptainer run /path/to/infini_gram_mini.sif \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output
```

Or submit via SLURM:

```bash
sbatch scripts/index_parquet.sh        # single job
sbatch scripts/index_parquet_array.sh  # array job
```

Bind-mount the repo so the container can read your config and index:
```bash
apptainer exec \
    --bind /path/to/infini-gram-mini:/repo \
    --bind /path/to/index:/index \
    /path/to/infini_gram_mini.sif \
    python3 /repo/api/query.py \
    --config /repo/api/api_config.json \
    count "natural language processing"
```

`api/api_config.json` is a JSON array where each entry describes one named index.
`index_dirs` accepts two forms:
- A root directory string: all immediate subdirectories are loaded as shards (sorted). Convenient when the indexing job produces `save_dir/00/`, `save_dir/01/`, etc.
- A list of directory strings: an explicit list of shard paths, loaded in the given order.
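For the root-directory form, the shard resolution can be sketched as follows (`discover_shards` is a hypothetical helper for illustration; the actual loading logic lives in the engine and may differ):

```python
from pathlib import Path

def discover_shards(index_dirs):
    """Resolve index_dirs into an ordered list of shard directories.

    A string is treated as a root whose immediate subdirectories are
    the shards, sorted by name; a list is used as-is, in the given
    order. Sketch only -- the real loader may differ.
    """
    if isinstance(index_dirs, str):
        root = Path(index_dirs)
        return sorted(str(p) for p in root.iterdir() if p.is_dir())
    return list(index_dirs)
```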
```json
[
  {
    "name": "my_index",
    "index_dirs": "/path/to/index",
    "load_to_ram": false,
    "get_metadata": true
  }
]
```

Or with an explicit shard list:

```json
[
  {
    "name": "my_index",
    "index_dirs": ["/path/to/index/00", "/path/to/index/01"],
    "load_to_ram": false,
    "get_metadata": true
  }
]
```

`api/query.py` loads the engine directly; no server needed:
```bash
cd api

# Count occurrences
python query.py --config api_config.json count "natural language processing"
# Count: 83,470
# Latency: 12.3 ms

# Search and retrieve matching document contexts (query highlighted in output)
python query.py --config api_config.json search "transformer model"

# Target a specific named index
python query.py --config api_config.json --index my_index count "BERT"

# Customise result count and context window
python query.py --config api_config.json search "GPT" --max_results 5 --max_ctx_len 300

# Raw suffix-array intervals per shard
python query.py --config api_config.json find "natural language processing"
```

Start the server:

```bash
cd api
python api_server.py --config api_config.json --port 5000
```

Then query it from any client:
```bash
# via query.py
python query.py --api_url http://localhost:5000 count "natural language processing"

# via curl
curl -s http://localhost:5000/query \
    -H "Content-Type: application/json" \
    -d '{"index": "my_index", "query_type": "count", "query": "natural language processing"}'
```

`index_dirs` accepts a root directory (auto-discovers sorted subdirs) or an explicit list of shard paths:
```python
from engine.src import InfiniGramMiniEngine

# Option A: a single root directory loads all subdirectories as shards
engine = InfiniGramMiniEngine(
    index_dirs="/path/to/index",
    load_to_ram=False,
    get_metadata=True,
)

# Option B: explicit list of shard directories
engine = InfiniGramMiniEngine(
    index_dirs=["/path/to/index/00", "/path/to/index/01"],
    load_to_ram=False,
    get_metadata=True,
)

engine.count("natural language processing")
# {'count': 83470}

engine.find("natural language processing")
# {'cnt': 83470, 'segment_by_shard': [[442381579355, 442381620985], ...]}

engine.get_doc_by_rank(s=0, rank=442381579355, needle_len=27, max_ctx_len=200)
# {'doc_ix': ..., 'text': '...', 'metadata': {...}}
```

Supported input formats: `.parquet` (with a `text` column), `.jsonl`, `.json.gz`, `.zst`.
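As an illustration of the line-oriented formats, a minimal reader for the stdlib-friendly ones might look like this. It is a sketch, not the pipeline's actual reader: it assumes `.json.gz` is gzipped JSON lines and that each record stores its content under a `"text"` key (matching the parquet text-column convention); `.parquet` and `.zst` would additionally need `pyarrow` / `zstandard` from the setup step.

```python
import gzip
import json
from pathlib import Path

def iter_texts(path):
    """Yield the text of each record in one input file (sketch only).

    Covers the stdlib-readable formats: .jsonl as plain JSON lines,
    .json.gz as gzipped JSON lines. Other formats raise ValueError.
    """
    path = Path(path)
    if path.name.endswith(".json.gz"):
        opener = gzip.open
    elif path.suffix == ".jsonl":
        opener = open
    else:
        raise ValueError(f"unhandled format: {path.name}")
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]
```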
```bash
cd infini-gram-mini/indexing
python jobs/index_parquet.py \
    --data_dir /path/to/parquet_files \
    --save_dir /path/to/index_output
```

The script automatically selects between the few-files and many-files pipelines depending on file count, and writes one index shard per subdirectory: `save_dir/00/`, `save_dir/01/`, etc.
`--num_shards N` sets a target size per shard (total_dataset_size / N), not the exact shard count. Files larger than the target are split into line-range slices; the actual number of shards produced depends on how those slices pack together and can be fewer than N.
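The packing behaviour can be illustrated with a toy model (`plan_shards` is a hypothetical helper; the real pipeline slices by line ranges rather than exact byte counts, so its results will differ):

```python
def plan_shards(file_sizes, num_shards):
    """Estimate how many shards a greedy size-based packing yields.

    Simplified model of the --num_shards semantics: the target is
    total_size / num_shards, files larger than the target are cut into
    target-sized slices, and slices are packed greedily, closing a
    shard once it reaches the target. Because shards can overshoot the
    target, the result may be fewer than num_shards.
    """
    total = sum(file_sizes)
    target = total / num_shards
    slices = []
    for size in file_sizes:
        while size > target:          # split oversized files
            slices.append(target)
            size -= target
        if size > 0:
            slices.append(size)
    shards, current = 0, 0.0
    for s in slices:
        current += s
        if current >= target:         # close the shard, possibly overshooting
            shards += 1
            current = 0.0
    if current > 0:                   # leftover partial shard
        shards += 1
    return shards
```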
Use `--dry_run` to check the actual count before submitting; it uses only file sizes (no decompression) and completes in under a second:

```bash
python -m indexing.indexing \
    --data_dir /path/to/data \
    --num_shards 20 \
    --dry_run
# → prints: 11
```

Then set `--array=0-10` instead of `--array=0-19` to avoid wasting jobs on shards that have nothing to do.
Memory sizing: for compressed formats (`.zst`, `.json.gz`), the decompressed corpus per shard is typically 4–6× the on-disk slice size. Step 3 (FM-index / wavelet tree construction) needs roughly 2–3× the corpus size in RAM. Target ≤ 50 GB of corpus per shard to stay within 300 GB nodes.
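Those rules of thumb translate into a quick sizing calculation (illustrative helper, not part of the repo; the expansion and FM-index factors default to the estimates above, with the conservative 3× build factor):

```python
def shard_memory_plan(compressed_gb, num_shards, expansion=5, fmindex_factor=3):
    """Back-of-envelope RAM sizing per shard for compressed inputs.

    Decompressed corpus is assumed ~4-6x the on-disk slice (default 5x),
    and the FM-index build needs ~2-3x the corpus size in RAM (default 3x).
    """
    slice_gb = compressed_gb / num_shards
    corpus_gb = slice_gb * expansion
    peak_ram_gb = corpus_gb * fmindex_factor
    return {
        "slice_gb": slice_gb,
        "corpus_gb": corpus_gb,
        "peak_ram_gb": peak_ram_gb,
        # the <= 50 GB corpus-per-shard guideline for 300 GB nodes
        "fits_300gb_node": corpus_gb <= 50,
    }
```

For example, 100 GB of `.zst` input split into 10 shards gives 10 GB slices, an estimated 50 GB corpus per shard, and a ~150 GB peak during the FM-index build.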
If you hit "too many open files":

```bash
ulimit -n 1048576
# or pass --ulimit 1048576 to the indexing script
```

Example SLURM job script:

```bash
#!/bin/bash
#SBATCH --job-name=index
#SBATCH --partition=your_partition
#SBATCH --nodes=1
#SBATCH --time=20:00:00
#SBATCH --cpus-per-task=64

cd /path/to/infini-gram-mini/infini-gram-mini/indexing
python jobs/index_parquet.py \
    --data_dir /path/to/data \
    --save_dir /path/to/index \
    --num_shards 10 \
    --mem 500 \
    --cpus $SLURM_CPUS_PER_TASK
```

If you use infini-gram mini, please cite the original paper:
```bibtex
@misc{xu2025infinigramminiexactngram,
  title={Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index},
  author={Hao Xu and Jiacheng Liu and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2025},
  eprint={2506.12229},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.12229},
}
```