bug: missing tiktoken vocabulary files complicate air-gapped deployment, please consider bundling them into the package and GPT-OSS model repos

## Summary

`openai-harmony` currently causes avoidable failures in air-gapped and restricted-network environments because loading a Harmony encoding depends on external `.tiktoken` vocab files that are not shipped as part of the normal package or GPT-OSS model artifacts.

This is painful for:
- air-gapped deployments,
- containers with no egress,
- reproducible builds,
- CI,
- clusters with strict startup/network policies.

Even when the model weights and `tokenizer.json` are already present locally, Harmony can still fail because the required `.tiktoken` vocab file is missing.

## Problem

At the moment, the high-level workflow for loading Harmony encodings is not self-contained enough for offline use.

Practical problems:
1. A user may already have the GPT-OSS model repo locally, but Harmony still needs separate `.tiktoken` files.
2. Users often discover this only at runtime.
3. Runtime download is fragile in containers and distributed systems.
4. Air-gapped users need manual workarounds and environment-specific packaging logic.
5. This also creates ambiguity about which tokenizer asset is the canonical one:
   - package asset,
   - model-repo asset,
   - remote download target,
   - cache artifact.

## Proposed solution

Please make vocab resolution fully offline-friendly and deterministic.

### 0th priority (highest, if added)
Allow explicit vocab-file / vocab-directory overrides through:
- CLI arguments, if Harmony has or will have CLI entrypoints for encoding-related workflows,
- function parameters in the public API.

Examples:
- `load_harmony_encoding(name, vocab_path=..., vocab_dir=...)`
- `load_harmony_encoding(name, model_dir=...)`

Reason:
explicit parameters are the most predictable and easiest to debug. They should override everything else.

### 1st priority
Support vocabulary files provided through environment variables.

This already exists in some form via environment variables, and should remain supported and documented as the primary non-code override for operators.

### 2nd priority
Load vocabulary files from the GPT-OSS model repo.

The GPT-OSS model repos should include the required `.tiktoken` files as official artifacts, alongside the existing tokenizer files.

This would let users mirror a model repo into an air-gapped environment and have all tokenizer-related assets available without a second manual download path.

### 3rd priority
Bundle vocabulary files inside the `openai-harmony` package itself.

The package should include the necessary `.tiktoken` files so that the default experience works offline with zero extra steps.

This is especially important because these files are small compared with model weights and are fundamental runtime assets, not optional extras.

## Requested lookup order

Please make vocab lookup deterministic with this precedence:

1. Explicit CLI argument / function parameter
2. Environment-variable-provided path
3. Model repo files
4. Package-bundled files
5. Remote download/cache fallback (optional, preferably deprecated as the default expectation)

This order gives:
- maximum operator control,
- strong air-gap support,
- good UX,
- backward compatibility.

## Why both package and model repo should include the files

These two placements solve different real deployment patterns:

### Package-bundled files
Best for:
- local development,
- pip installs,
- simple containers,
- tests,
- examples,
- out-of-the-box UX.

### Model-repo-bundled files
Best for:
- mirrored Hugging Face repos,
- air-gapped clusters,
- artifact promotion pipelines,
- environments where model assets are managed separately from Python packages.

Shipping the files in both places would eliminate a large class of deployment failures.

## Missing pieces that would make this complete

I think the full solution should also include:

### 1. Hash / integrity verification
Please continue verifying known hashes for vocab files regardless of source:
- explicit file path,
- env-provided file,
- model repo file,
- package file,
- downloaded file.

That keeps the current security and correctness guarantees.

### 2. Model-directory discovery
If a model directory is known, Harmony should be able to look for:
- `o200k_base.tiktoken`
- `cl100k_base.tiktoken`

in predictable locations, for example:
- `<model_dir>/`
- `<model_dir>/encodings/`
- `<model_dir>/tokenizer/`
- or another documented convention.

### 3. Clear error messages
When resolution fails, the error should say exactly what was tried, in order.

For example:
- checked explicit parameter
- checked env path
- checked model repo
- checked package resource
- attempted remote download

This would make debugging much easier.

### 4. Air-gap documentation
Please add a short section to the docs with:
- what files are required,
- where Harmony searches for them,
- how to package them into a container,
- how to use model-repo-local resolution,
- and how to disable runtime network dependency entirely.

### 5. Stable public API
If there is already a lower-level file-loading helper internally, please expose a stable public API that makes offline usage first-class instead of requiring users to rely on internal details.

## Suggested implementation direction

A possible implementation approach:

1. Add public API parameters such as:
   - `vocab_path`
   - `vocab_dir`
   - `model_dir`

2. Add packaged vocab resources to `openai-harmony`.

3. Add the required `.tiktoken` files to GPT-OSS model repos.

4. Implement the lookup precedence:
   - explicit parameter
   - env var
   - model repo
   - package resource
   - remote/cache fallback

5. Preserve hash checking for all sources.

## Expected outcome

With this change:
- a user who installs `openai-harmony` gets a working default offline experience,
- a user who mirrors a GPT-OSS model repo gets the needed vocab files automatically,
- operators still retain explicit override control,
- air-gapped deployments become straightforward and reproducible,
- runtime download becomes a fallback rather than a requirement.

## Motivation

Tokenizer vocab files are core runtime assets. Requiring them to be fetched at runtime makes deployments less reproducible and less robust than they need to be.

For air-gapped users, this is not just inconvenient; it is a hard failure mode.

## Request

Please consider:
1. bundling the required `.tiktoken` vocab files into the `openai-harmony` package,
2. adding the same vocab files to GPT-OSS model repos,
3. and introducing an explicit public API / CLI override path with the precedence described above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: missing tiktoken vocabulary files complicate air-gapped deployment, please consider bundling them into the package and GPT-OSS model repos #101

Summary

Problem

Proposed solution

0th priority (highest, if added)

1st priority

2nd priority

3rd priority

Requested lookup order

Why both package and model repo should include the files

Package-bundled files

Model-repo-bundled files

Missing pieces that would make this complete

1. Hash / integrity verification

2. Model-directory discovery

3. Clear error messages

4. Air-gap documentation

5. Stable public API

Suggested implementation direction

Expected outcome

Motivation

Request

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bug: missing tiktoken vocabulary files complicate air-gapped deployment, please consider bundling them into the package and GPT-OSS model repos #101

Description

Summary

Problem

Proposed solution

0th priority (highest, if added)

1st priority

2nd priority

3rd priority

Requested lookup order

Why both package and model repo should include the files

Package-bundled files

Model-repo-bundled files

Missing pieces that would make this complete

1. Hash / integrity verification

2. Model-directory discovery

3. Clear error messages

4. Air-gap documentation

5. Stable public API

Suggested implementation direction

Expected outcome

Motivation

Request

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions