Skip to content

bug: missing tiktoken vocabulary files complicate air-gapped deployment, please consider bundling them into the package and GPT-OSS model repos #101

@PeganovAnton

Description

@PeganovAnton

Summary

openai-harmony currently causes avoidable failures in air-gapped and restricted-network environments because loading a Harmony encoding depends on external .tiktoken vocab files that are not shipped as part of the normal package or GPT-OSS model artifacts.

This is painful for:

  • air-gapped deployments,
  • containers with no egress,
  • reproducible builds,
  • CI,
  • clusters with strict startup/network policies.

Even when the model weights and tokenizer.json are already present locally, Harmony can still fail because the required .tiktoken vocab file is missing.

Problem

At the moment, the high-level workflow for loading Harmony encodings is not self-contained enough for offline use.

Practical problems:

  1. A user may already have the GPT-OSS model repo locally, but Harmony still needs separate .tiktoken files.
  2. Users often discover this only at runtime.
  3. Runtime download is fragile in containers and distributed systems.
  4. Air-gapped users need manual workarounds and environment-specific packaging logic.
  5. This also creates ambiguity about which tokenizer asset is the canonical one:
    • package asset,
    • model-repo asset,
    • remote download target,
    • cache artifact.

Proposed solution

Please make vocab resolution fully offline-friendly and deterministic.

0th priority (highest, if added)

Allow explicit vocab-file / vocab-directory overrides through:

  • CLI arguments, if Harmony has or will have CLI entrypoints for encoding-related workflows,
  • function parameters in the public API.

Examples:

  • load_harmony_encoding(name, vocab_path=..., vocab_dir=...)
  • load_harmony_encoding(name, model_dir=...)

Reason:
explicit parameters are the most predictable and easiest to debug. They should override everything else.

1st priority

Support vocabulary files provided through environment variables.

This already exists in some form via environment variables, and should remain supported and documented as the primary non-code override for operators.

2nd priority

Load vocabulary files from the GPT-OSS model repo.

The GPT-OSS model repos should include the required .tiktoken files as official artifacts, alongside the existing tokenizer files.

This would let users mirror a model repo into an air-gapped environment and have all tokenizer-related assets available without a second manual download path.

3rd priority

Bundle vocabulary files inside the openai-harmony package itself.

The package should include the necessary .tiktoken files so that the default experience works offline with zero extra steps.

This is especially important because these files are small compared with model weights and are fundamental runtime assets, not optional extras.

Requested lookup order

Please make vocab lookup deterministic with this precedence:

  1. Explicit CLI argument / function parameter
  2. Environment-variable-provided path
  3. Model repo files
  4. Package-bundled files
  5. Remote download/cache fallback (optional, preferably deprecated as the default expectation)

This order gives:

  • maximum operator control,
  • strong air-gap support,
  • good UX,
  • backward compatibility.

Why both package and model repo should include the files

These two placements solve different real deployment patterns:

Package-bundled files

Best for:

  • local development,
  • pip installs,
  • simple containers,
  • tests,
  • examples,
  • out-of-the-box UX.

Model-repo-bundled files

Best for:

  • mirrored Hugging Face repos,
  • air-gapped clusters,
  • artifact promotion pipelines,
  • environments where model assets are managed separately from Python packages.

Shipping the files in both places would eliminate a large class of deployment failures.

Missing pieces that would make this complete

I think the full solution should also include:

1. Hash / integrity verification

Please continue verifying known hashes for vocab files regardless of source:

  • explicit file path,
  • env-provided file,
  • model repo file,
  • package file,
  • downloaded file.

That keeps the current security and correctness guarantees.

2. Model-directory discovery

If a model directory is known, Harmony should be able to look for:

  • o200k_base.tiktoken
  • cl100k_base.tiktoken

in predictable locations, for example:

  • <model_dir>/
  • <model_dir>/encodings/
  • <model_dir>/tokenizer/
  • or another documented convention.

3. Clear error messages

When resolution fails, the error should say exactly what was tried, in order.

For example:

  • checked explicit parameter
  • checked env path
  • checked model repo
  • checked package resource
  • attempted remote download

This would make debugging much easier.

4. Air-gap documentation

Please add a short section to the docs with:

  • what files are required,
  • where Harmony searches for them,
  • how to package them into a container,
  • how to use model-repo-local resolution,
  • and how to disable runtime network dependency entirely.

5. Stable public API

If there is already a lower-level file-loading helper internally, please expose a stable public API that makes offline usage first-class instead of requiring users to rely on internal details.

Suggested implementation direction

A possible implementation approach:

  1. Add public API parameters such as:

    • vocab_path
    • vocab_dir
    • model_dir
  2. Add packaged vocab resources to openai-harmony.

  3. Add the required .tiktoken files to GPT-OSS model repos.

  4. Implement the lookup precedence:

    • explicit parameter
    • env var
    • model repo
    • package resource
    • remote/cache fallback
  5. Preserve hash checking for all sources.

Expected outcome

With this change:

  • a user who installs openai-harmony gets a working default offline experience,
  • a user who mirrors a GPT-OSS model repo gets the needed vocab files automatically,
  • operators still retain explicit override control,
  • air-gapped deployments become straightforward and reproducible,
  • runtime download becomes a fallback rather than a requirement.

Motivation

Tokenizer vocab files are core runtime assets. Requiring them to be fetched at runtime makes deployments less reproducible and less robust than they need to be.

For air-gapped users, this is not just inconvenient; it is a hard failure mode.

Request

Please consider:

  1. bundling the required .tiktoken vocab files into the openai-harmony package,
  2. adding the same vocab files to GPT-OSS model repos,
  3. and introducing an explicit public API / CLI override path with the precedence described above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions