Summary
openai-harmony currently causes avoidable failures in air-gapped and restricted-network environments because loading a Harmony encoding depends on external .tiktoken vocab files that are not shipped as part of the normal package or GPT-OSS model artifacts.
This is painful for:
- air-gapped deployments,
- containers with no egress,
- reproducible builds,
- CI,
- clusters with strict startup/network policies.
Even when the model weights and tokenizer.json are already present locally, Harmony can still fail because the required .tiktoken vocab file is missing.
Problem
At the moment, the high-level workflow for loading Harmony encodings is not self-contained enough for offline use.
Practical problems:
- A user may already have the GPT-OSS model repo locally, but Harmony still needs separate .tiktoken files.
- Users often discover this only at runtime.
- Runtime download is fragile in containers and distributed systems.
- Air-gapped users need manual workarounds and environment-specific packaging logic.
- This also creates ambiguity about which tokenizer asset is the canonical one:
  - package asset,
  - model-repo asset,
  - remote download target,
  - cache artifact.
Proposed solution
Please make vocab resolution fully offline-friendly and deterministic.
0th priority (highest, if added)
Allow explicit vocab-file / vocab-directory overrides through:
- CLI arguments, if Harmony has or will have CLI entrypoints for encoding-related workflows,
- function parameters in the public API.
Examples:
load_harmony_encoding(name, vocab_path=..., vocab_dir=...)
load_harmony_encoding(name, model_dir=...)
Reason:
Explicit parameters are the most predictable option and the easiest to debug. They should override everything else.
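A minimal sketch of how such overrides might resolve to a concrete file. The helper name resolve_explicit_vocab, the <encoding>.tiktoken naming inside vocab_dir, and the rule that vocab_path wins over vocab_dir are all assumptions for illustration, not existing Harmony behavior.

```python
from pathlib import Path
from typing import Optional


def resolve_explicit_vocab(
    encoding: str,
    vocab_path: Optional[str] = None,
    vocab_dir: Optional[str] = None,
) -> Optional[Path]:
    """Hypothetical helper: turn explicit overrides into a vocab file path.

    Returns None when no override was given, so the caller can fall through
    to lower-priority sources. Raises when an override was given but does not
    point at an existing file, because a silent fallback would hide an
    operator mistake.
    """
    if vocab_path is not None:
        # Assumption: a direct file path beats a directory override.
        candidate = Path(vocab_path)
    elif vocab_dir is not None:
        # Assumption: files in the directory are named "<encoding>.tiktoken".
        candidate = Path(vocab_dir) / f"{encoding}.tiktoken"
    else:
        return None
    if not candidate.is_file():
        raise FileNotFoundError(f"explicit vocab override not found: {candidate}")
    return candidate
```

Failing loudly on a bad explicit path, instead of falling back to another source, is a deliberate choice here: if an operator names a file, they most likely want that file or an error.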
1st priority
Support vocabulary files provided through environment variables.
This already exists in some form and should remain supported and documented as the primary non-code override for operators.
2nd priority
Load vocabulary files from the GPT-OSS model repo.
The GPT-OSS model repos should include the required .tiktoken files as official artifacts, alongside the existing tokenizer files.
This would let users mirror a model repo into an air-gapped environment and have all tokenizer-related assets available without a second manual download path.
3rd priority
Bundle vocabulary files inside the openai-harmony package itself.
The package should include the necessary .tiktoken files so that the default experience works offline with zero extra steps.
This is especially important because these files are small compared with model weights and are fundamental runtime assets, not optional extras.
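As a rough illustration of package-bundled lookup, the files could be read through the standard importlib.resources API. The package name passed in and the <encoding>.tiktoken resource layout are assumptions; I am not claiming openai-harmony is structured this way today.

```python
from importlib import resources
from typing import Optional


def read_bundled_vocab(package: str, encoding: str) -> Optional[bytes]:
    """Return bundled vocab bytes, or None if the package does not ship them.

    Both `package` and the "<encoding>.tiktoken" resource name are
    hypothetical; they only illustrate the lookup.
    """
    try:
        resource = resources.files(package).joinpath(f"{encoding}.tiktoken")
    except ModuleNotFoundError:
        # The package itself is not installed.
        return None
    if not resource.is_file():
        return None
    return resource.read_bytes()
```

Reading through importlib.resources rather than a filesystem path keeps this working when the package is installed from a zip or wheel-like archive.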
Requested lookup order
Please make vocab lookup deterministic with this precedence:
1. Explicit CLI argument / function parameter
2. Environment-variable-provided path
3. Model repo files
4. Package-bundled files
5. Remote download/cache fallback (optional, preferably deprecated as the default expectation)
This order gives:
- maximum operator control,
- strong air-gap support,
- good UX,
- backward compatibility.
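The precedence above could be sketched as an ordered list of sources, where the first one that yields an existing file wins. All names here (Source, from_env_dir, resolve_vocab) are hypothetical, and the environment variable name is left as a parameter because I do not want to guess at what Harmony would call it.

```python
import os
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# A source is a human-readable label plus a zero-argument lookup that
# returns a candidate path, or None when the source is not configured.
Source = Tuple[str, Callable[[], Optional[Path]]]


def from_env_dir(var: str, encoding: str) -> Optional[Path]:
    """Build a candidate path from a directory named by environment variable `var`."""
    base = os.environ.get(var)
    return Path(base) / f"{encoding}.tiktoken" if base else None


def resolve_vocab(sources: List[Source]) -> Tuple[Optional[Path], List[str]]:
    """Try each source in order; return the first existing file and the labels tried."""
    tried: List[str] = []
    for label, lookup in sources:
        tried.append(label)
        path = lookup()
        if path is not None and path.is_file():
            return path, tried
    return None, tried
```

Because the order is just the order of the list, the precedence is easy to document and test, and the list of labels tried can be reused when reporting a failure.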
Why both package and model repo should include the files
These two placements solve different real deployment patterns:
Package-bundled files
Best for:
- local development,
- pip installs,
- simple containers,
- tests,
- examples,
- out-of-the-box UX.
Model-repo-bundled files
Best for:
- mirrored Hugging Face repos,
- air-gapped clusters,
- artifact promotion pipelines,
- environments where model assets are managed separately from Python packages.
Shipping the files in both places would eliminate a large class of deployment failures.
Missing pieces that would make this complete
I think the full solution should also include:
1. Hash / integrity verification
Please continue verifying known hashes for vocab files regardless of source:
- explicit file path,
- env-provided file,
- model repo file,
- package file,
- downloaded file.
That keeps the current security and correctness guarantees.
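A source-independent check might look like the following. This assumes SHA-256 digests; I have not confirmed which hash Harmony uses internally, and the expected digest would have to come from Harmony's own table of known hashes.

```python
import hashlib
from pathlib import Path


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise ValueError if the file's SHA-256 digest differs from expected_hex.

    Assumption: SHA-256 is the digest in use; swap the algorithm if not.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large vocab files are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(
            f"hash mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

Running the same function on every resolved path, whatever source it came from, means a stale or corrupted local file is caught the same way a bad download would be.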
2. Model-directory discovery
If a model directory is known, Harmony should be able to look for:
- o200k_base.tiktoken
- cl100k_base.tiktoken
in predictable locations, for example:
- <model_dir>/
- <model_dir>/encodings/
- <model_dir>/tokenizer/
- or another documented convention.
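The discovery step could be as small as the sketch below. The subdirectory list simply mirrors the examples above and is not an established convention; find_in_model_dir is a hypothetical name.

```python
from pathlib import Path
from typing import Optional, Sequence

# Subdirectories to probe, relative to the model directory, in order.
# These mirror the examples in this issue; the real convention is undecided.
DEFAULT_SUBDIRS: Sequence[str] = ("", "encodings", "tokenizer")


def find_in_model_dir(
    model_dir: str,
    encoding: str,
    subdirs: Sequence[str] = DEFAULT_SUBDIRS,
) -> Optional[Path]:
    """Return the first "<encoding>.tiktoken" found under model_dir, or None."""
    root = Path(model_dir)
    for sub in subdirs:
        # An empty `sub` means the model directory itself.
        candidate = root / sub / f"{encoding}.tiktoken"
        if candidate.is_file():
            return candidate
    return None
```

A fixed, documented probe order matters more than the exact directory names, since it is what makes the result predictable when a file exists in more than one place.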
3. Clear error messages
When resolution fails, the error should say exactly what was tried, in order.
For example:
- checked explicit parameter
- checked env path
- checked model repo
- checked package resource
- attempted remote download
This would make debugging much easier.
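One possible shape for such an error, with a hypothetical class name and message format:

```python
from typing import List, Tuple


class VocabResolutionError(RuntimeError):
    """Hypothetical error that lists every source tried, in order."""

    def __init__(self, encoding: str, attempts: List[Tuple[str, str]]):
        # Each attempt is (source label, what happened), e.g.
        # ("model repo", "no .tiktoken file under the model directory").
        self.encoding = encoding
        self.attempts = attempts
        lines = [f"could not resolve vocab file for encoding {encoding!r}; tried:"]
        for step, (source, detail) in enumerate(attempts, start=1):
            lines.append(f"  {step}. {source}: {detail}")
        super().__init__("\n".join(lines))
```

Keeping the attempts as structured data on the exception, in addition to the formatted message, would let callers log or inspect them without parsing text.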
4. Air-gap documentation
Please add a short section to the docs with:
- what files are required,
- where Harmony searches for them,
- how to package them into a container,
- how to use model-repo-local resolution,
- and how to disable runtime network dependency entirely.
5. Stable public API
If there is already a lower-level file-loading helper internally, please expose a stable public API that makes offline usage first-class instead of requiring users to rely on internal details.
Suggested implementation direction
A possible implementation approach:
- Add public API parameters such as:
  - vocab_path
  - vocab_dir
  - model_dir
- Add packaged vocab resources to openai-harmony.
- Add the required .tiktoken files to GPT-OSS model repos.
- Implement the lookup precedence:
  - explicit parameter
  - env var
  - model repo
  - package resource
  - remote/cache fallback
- Preserve hash checking for all sources.
Expected outcome
With this change:
- a user who installs openai-harmony gets a working default offline experience,
- a user who mirrors a GPT-OSS model repo gets the needed vocab files automatically,
- operators still retain explicit override control,
- air-gapped deployments become straightforward and reproducible,
- runtime download becomes a fallback rather than a requirement.
Motivation
Tokenizer vocab files are core runtime assets. Requiring them to be fetched at runtime makes deployments less reproducible and less robust than they need to be.
For air-gapped users, this is not just inconvenient; it is a hard failure mode.
Request
Please consider:
- bundling the required .tiktoken vocab files into the openai-harmony package,
- adding the same vocab files to GPT-OSS model repos,
- and introducing an explicit public API / CLI override path with the precedence described above.