Version: 1.1.0
Status: Stable
ZON includes advanced compression and optimization features that reduce token count and improve LLM accuracy. Most of these features are applied automatically by the encoder when beneficial; type coercion must be enabled explicitly.
Introduced: v1.1.0
Purpose: Compress sequential numeric columns
Instead of storing absolute values, delta encoding stores the difference from the previous value:
# Without delta:
ids:@(1000):id
1,2,3,4,5,...,1000
# With delta (`:delta` marker):
ids:@(1000):id:delta
1,+1,+1,+1,+1,...,+1
Token Savings: Up to 70% for sequential IDs or timestamps.
Delta encoding is automatically applied when ALL conditions are met:
- Column contains only numbers
- Column has ≥5 values
- Values are sequential (small deltas)
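To make the three conditions concrete, here is a minimal sketch of how such a check and the delta round trip could look. It is not the library's implementation: the helper names and the `MAX_ABS_DELTA` cutoff are assumptions for illustration (the document only says "small deltas"), and the sketch handles integers only.

```python
MIN_VALUES = 5         # from the "≥5 values" condition
MAX_ABS_DELTA = 1000   # ASSUMPTION: illustrative cutoff for "small deltas"


def should_delta_encode(values):
    """Return True when a column meets all three conditions (integers only)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return False
    if len(values) < MIN_VALUES:
        return False
    deltas = [cur - prev for prev, cur in zip(values, values[1:])]
    return all(abs(d) <= MAX_ABS_DELTA for d in deltas)


def delta_encode(values):
    """Keep the first value absolute; store the rest as signed differences."""
    if not values:
        return []
    tokens = [str(values[0])]
    for prev, cur in zip(values, values[1:]):
        tokens.append(f"{cur - prev:+d}")  # "+d" forces an explicit sign
    return tokens


def delta_decode(tokens):
    """Rebuild absolute values by accumulating the signed differences."""
    if not tokens:
        return []
    values = [int(tokens[0])]
    for token in tokens[1:]:
        values.append(values[-1] + int(token))  # int() accepts "+60" and "-3"
    return values
```

Decoding is a running sum, so a column survives the round trip unchanged as long as every delta was written with its sign.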
from zon import encode
# Sequential IDs
data = {
'records': [
{'id': i + 1, 'name': f'Record {i}'}
for i in range(1000)
]
}
zon_str = encode(data)
print(zon_str)
# records:@(1000):id:delta,name
# 1,Record 0
# +1,Record 1
# +1,Record 2
# ...

Timestamps:
logs = [
{'timestamp': 1609459200, 'message': 'Started'},
{'timestamp': 1609459260, 'message': 'Processing'}, # +60
{'timestamp': 1609459320, 'message': 'Done'} # +60
]
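# Encoded as (three rows shown for brevity; per the conditions above, delta applies once a column has ≥5 values):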
# logs:@(3):message,timestamp:delta
# Started,1609459200
# Processing,+60
# Done,+60

Delta encoding is automatically reversed during decoding:
from zon import decode
zon_str = """
records:@(3):id:delta,name
1,Alice
+1,Bob
+1,Carol
"""
data = decode(zon_str)
print(data['records'])
# [
# {'id': 1, 'name': 'Alice'},
# {'id': 2, 'name': 'Bob'},
# {'id': 3, 'name': 'Carol'}
# ]

Introduced: v1.0.3
Purpose: Deduplicate repeated string values
When a column has many repeated values, ZON creates a dictionary and stores indices:
# Without dictionary:
shipments:@(150):status,...
pending,...
delivered,...
pending,...
in-transit,...
pending,...
...
# With dictionary:
status[3]:delivered,in-transit,pending
shipments:@(150):status,...
2,... # "pending"
0,... # "delivered"
2,... # "pending"
1,... # "in-transit"
2,... # "pending"
...
Dictionary compression is automatically applied when:
- Column has ≥10 values
- Column has ≤10 unique values
- Compression ratio > 1.2x
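The thresholds above can be sketched as a single function. This is an illustration, not the encoder's actual code: the function name is hypothetical, the alphabetical dictionary order is inferred from the example above, and the way the compression ratio is measured (characters before versus characters after) is an assumption, since the document does not define it.

```python
MIN_VALUES = 10   # "≥10 values"
MAX_UNIQUE = 10   # "≤10 unique values"
MIN_RATIO = 1.2   # "compression ratio > 1.2x"


def dictionary_compress(values):
    """Return (dictionary, indices) when compression pays off, else None."""
    if len(values) < MIN_VALUES:
        return None
    dictionary = sorted(set(values))  # ASSUMPTION: sorted, as in the example
    if len(dictionary) > MAX_UNIQUE:
        return None
    index_of = {value: i for i, value in enumerate(dictionary)}
    indices = [index_of[value] for value in values]

    # ASSUMPTION: ratio = raw characters / (dictionary + index characters).
    raw_size = sum(len(value) for value in values)
    packed_size = sum(len(v) for v in dictionary) + sum(len(str(i)) for i in indices)
    if raw_size / packed_size <= MIN_RATIO:
        return None
    return dictionary, indices
```

Under this measure, very short strings fail the ratio test because an index is no shorter than the value it replaces, which is the kind of case the third condition is there to exclude.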
from zon import encode
shipments = [
{'id': i, 'status': ['pending', 'delivered', 'in-transit'][i % 3]}
for i in range(100)
]
zon_str = encode({'shipments': shipments})
print(zon_str)
# status[3]:delivered,in-transit,pending
# shipments:@(100):id,status
# 0,2 # id:0, status:"pending"
# 1,0 # id:1, status:"delivered"
# 2,1 # id:2, status:"in-transit"
# ...

Dictionary compression works with flattened nested fields:
data = {
'users': [
{'name': 'Alice', 'address': {'city': 'NYC'}},
{'name': 'Bob', 'address': {'city': 'LAX'}},
{'name': 'Carol', 'address': {'city': 'NYC'}}
]
}
# A dictionary is created for "address.city" once the column meets the thresholds above

Real-world examples:
| Dataset | Without Dict | With Dict | Savings |
|---|---|---|---|
| E-commerce orders | 45k tokens | 28k tokens | 38% |
| Log files | 120k tokens | 65k tokens | 46% |
| User roles | 8k tokens | 3k tokens | 63% |
Introduced: v1.1.0
Purpose: Handle "stringified" values from LLMs
LLMs sometimes return numbers or booleans as strings:
{
"age": "25", // Should be number
"active": "true" // Should be boolean
}

Enable type coercion in the encoder:
from zon import ZonEncoder
encoder = ZonEncoder(
anchor_interval=None, # default
enable_dictionary=True, # default
enable_type_coercion=True # ✅ Enable type coercion
)
data = {
'users': [
{'age': "25", 'active': "true"}, # Strings
{'age': "30", 'active': "false"}
]
}
zon_str = encoder.encode(data)
print(zon_str)
# users:@(2):active,age
# T,25 # Coerced to boolean and number
# F,30

- Analyzes entire column
- Detects if all values are "coercible" (e.g., "123"→123)
- Coerces entire column to the target type
| From | To | Description |
|---|---|---|
| "123" | 123 | Number strings |
| "true" | T | Boolean strings |
| "false" | F | Boolean strings |
| "null" | null | Null strings |
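The all-or-nothing column behaviour can be sketched as follows. This is a simplified illustration rather than the library's logic: the helper names are hypothetical, the number pattern is an assumed syntax, and requiring every value in a column to share one kind is an assumption about how mixed columns are treated.

```python
import re

# ASSUMPTION: plain decimal syntax; the real encoder may accept more forms.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def classify(value):
    """Name the type a string could be coerced to, or None if not coercible."""
    if not isinstance(value, str):
        return None
    if value in ("true", "false"):
        return "boolean"
    if value == "null":
        return "null"
    if _NUMBER.fullmatch(value):
        return "number"
    return None


def convert(value, kind):
    """Convert one string that classify() already labelled with `kind`."""
    if kind == "boolean":
        return value == "true"
    if kind == "null":
        return None
    return float(value) if "." in value else int(value)


def coerce_column(values):
    """Coerce the whole column, or leave it untouched if any value disagrees."""
    kinds = {classify(value) for value in values}
    if len(kinds) != 1 or None in kinds:
        return list(values)
    kind = kinds.pop()
    return [convert(value, kind) for value in values]
```

Checking the whole column first means one stray value such as `"abc"` leaves every value as a string, so a column never ends up with mixed types.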
The decoder also supports type coercion for LLM-generated ZON:
from zon import decode
options = {'enable_type_coercion': True}
data = decode(llm_output, **options)

Introduced: v1.1.0
Purpose: Efficiently encode nested objects with missing fields
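As a rough sketch of the idea behind this feature, the code below flattens nested keys into dot-separated column names and fills absent fields with `None` (ZON's `null`). The function names are hypothetical, leaving values below the fifth level unflattened is an assumption, and the sketch does not reproduce the inline `key:value` form the encoder uses for rarely present columns.

```python
MAX_DEPTH = 5  # nesting limit stated in this section


def flatten(record, prefix="", depth=1):
    """Flatten nested dicts into a single dict keyed by dot-separated paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value and depth < MAX_DEPTH:
            flat.update(flatten(value, path, depth + 1))
        else:
            # ASSUMPTION: anything deeper than MAX_DEPTH is kept as-is.
            flat[path] = value
    return flat


def to_table(records):
    """Return (columns, rows); fields missing from a record become None."""
    flat_records = [flatten(record) for record in records]
    columns = sorted({key for flat in flat_records for key in flat})
    rows = [[flat.get(column) for column in columns] for flat in flat_records]
    return columns, rows
```

Every record ends up with the same columns regardless of which nested fields it actually carried, which is what lets rows with missing data share one table header.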
Nested fields are flattened with dot notation:
from zon import encode
data = {
'users': [
{'id': 1, 'profile': {'bio': 'Developer'}},
{'id': 2, 'profile': None},
{'id': 3, 'profile': {'bio': 'Designer'}}
]
}
zon_str = encode(data)
# users:@(3):id,profile.bio
# 1,Developer
# 2,null
# 3,Designer

Supports up to 5 levels of nesting:
data = {
'items': [{
'a': {'b': {'c': {'d': {'e': 'Deep!'}}}}
}]
}
# Flattened to:
# items:@(1):a.b.c.d.e
# Deep!

Missing values are preserved:
data = {
'products': [
{'id': 1, 'meta': {'color': 'red', 'size': 'L'}},
{'id': 2}, # No meta
{'id': 3, 'meta': {'color': 'blue'}} # No size
]
}
# Core: id, meta.color
# Sparse (inline): meta.size
# products:@(3):id,meta.color
# 1,red,meta.size:L
# 2,null
# 3,blue

- Delta encoding: Best for time-series and sequential IDs
- Dictionary compression: Best for categorical data (status, roles, countries)
- Type coercion: Enable when dealing with LLM outputs
- Sparse encoding: Automatic, no configuration needed
- API Reference - Full API documentation
- SPEC.md - Format specification
- LLM Best Practices - Using with LLMs