diff --git a/README.md b/README.md index d52ec29..c71607f 100644 --- a/README.md +++ b/README.md @@ -7,6 +7,8 @@ This is a Ruby port of the [TOON library](https://github.com/johannschopplich/toon) originally written in TypeScript. +TOON excels at **uniform complex objects** โ€“ multiple fields per row, same structure across items. It borrows YAML's indentation-based structure for nested objects and CSV's tabular format for uniform data rows, then optimizes both for token efficiency in LLM contexts. + ## Why TOON? AI is becoming cheaper and more accessible, but larger context windows allow for larger data inputs as well. **LLM tokens still cost money** โ€“ and standard JSON is verbose and token-expensive: @@ -28,6 +30,78 @@ users[2]{id,name,role}: 2,Bob,user ``` +## Format Comparison + +Format familiarity matters as much as token count. + +- **CSV:** best for uniform tables. +- **JSON:** best for non-uniform data. +- **TOON:** best for uniform complex (but not deeply nested) objects. + +TOON switches to list format for non-uniform arrays. In those cases, JSON can be cheaper at scale. + +## Key Features + +- ๐Ÿ’ธ **Token-efficient:** typically 30โ€“60% fewer tokens than JSON +- ๐Ÿคฟ **LLM-friendly guardrails:** explicit lengths and field lists help models validate output +- ๐Ÿฑ **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes) +- ๐Ÿ“ **Indentation-based structure:** replaces braces with whitespace for better readability +- ๐Ÿงบ **Tabular arrays:** declare keys once, then stream rows without repetition + +## Benchmarks + +### Token Efficiency + +``` +โญ GitHub Repositories โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 8,745 tokens + vs JSON: 15,145 ๐Ÿ’ฐ 42.3% saved + vs XML: 17,095 ๐Ÿ’ฐ 48.8% saved + +๐Ÿ“ˆ Daily Analytics โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 4,507 tokens + vs JSON: 10,977 ๐Ÿ’ฐ 58.9% saved + vs XML: 13,128 ๐Ÿ’ฐ 65.7% saved + +๐Ÿ›’ E-Commerce Order โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 166 tokens + vs JSON: 257 ๐Ÿ’ฐ 35.4% saved + vs XML: 271 ๐Ÿ’ฐ 38.7% saved + +โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +Total โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 13,418 tokens + vs JSON: 26,379 ๐Ÿ’ฐ 49.1% saved + vs XML: 30,494 ๐Ÿ’ฐ 56.0% saved +``` + +> **Note:** Measured with `gpt-tokenizer` using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers. + +### Retrieval Accuracy + +Tested across **3 LLMs** with data retrieval tasks: + +``` +gpt-5-nano + toon โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 99.4% (158/159) + yaml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘ 95.0% (151/159) + csv โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ 92.5% (147/159) + json โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ 92.5% (147/159) + xml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ 91.2% (145/159) + +claude-haiku-4-5 + toon โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘ 75.5% (120/159) + xml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘ 75.5% (120/159) + csv โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘ 75.5% (120/159) + json โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘ 75.5% (120/159) + yaml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘ 74.2% (118/159) + +gemini-2.5-flash + xml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ 91.8% (146/159) + csv โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘ 86.2% (137/159) + toon โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘ 84.9% (135/159) + json โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘ 81.8% (130/159) + yaml โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘ 78.6% (125/159) +``` + +**Advantage:** TOON achieves **86.6% accuracy** (vs JSON's 83.2%) while using **46.3% fewer tokens**. + ## Installation Add this line to your application's Gemfile: @@ -77,54 +151,22 @@ user: preferences[0]: ``` -## Key Features - -- ๐Ÿ’ธ **Token-efficient:** typically 30โ€“60% fewer tokens than JSON -- ๐Ÿคฟ **LLM-friendly guardrails:** explicit lengths and field lists help models validate output -- ๐Ÿฑ **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes) -- ๐Ÿ“ **Indentation-based structure:** replaces braces with whitespace for better readability -- ๐Ÿงบ **Tabular arrays:** declare keys once, then stream rows without repetition - -## API - -### `Toon.encode(value, **options)` +## Canonical Formatting Rules -Converts any value to TOON format. +TOON formatting is deterministic and minimal: -**Parameters:** - -- `value` โ€“ Any value to encode (Hash, Array, primitives, or nested structures) -- `indent` โ€“ Number of spaces per indentation level (default: `2`) -- `delimiter` โ€“ Delimiter for array values and tabular rows: `','`, `"\t"`, or `'|'` (default: `','`) -- `length_marker` โ€“ Optional marker to prefix array lengths: `'#'` or `false` (default: `false`) - -**Returns:** - -A TOON-formatted string with no trailing newline or spaces. - -**Examples:** - -```ruby -# Basic usage -Toon.encode({ 'id' => 1, 'name' => 'Ada' }) -# => "id: 1\nname: Ada" - -# Tabular arrays -items = [ - { 'sku' => 'A1', 'qty' => 2, 'price' => 9.99 }, - { 'sku' => 'B2', 'qty' => 1, 'price' => 14.5 } -] -Toon.encode({ 'items' => items }) -# => "items[2]{sku,qty,price}:\n A1,2,9.99\n B2,1,14.5" - -# Custom delimiter (tab) -Toon.encode({ 'items' => items }, delimiter: "\t") -# => "items[2\t]{sku\tqty\tprice}:\n A1\t2\t9.99\n B2\t1\t14.5" - -# Length marker -Toon.encode({ 'tags' => ['a', 'b', 'c'] }, length_marker: '#') -# => "tags[#3]: a,b,c" -``` +- **Indentation**: 2 spaces per nesting level. +- **Lines**: + - `key: value` for primitives (single space after colon). + - `key:` for nested/empty objects (no trailing space on that line). +- **Arrays**: + - Delimiter encoding: Comma delimiters are implicit in array headers (e.g., `tags[3]:`, `items[2]{id,name}:`). Tab and pipe delimiters are explicitly shown in array headers (e.g., `tags[3|]:`, `items[2 ]{id name}:`). + - Primitive arrays inline: `key[N]: v1,v2` (comma) or `key[N]: v1v2` (tab/pipe). + - Tabular arrays: `key[N]{f1,f2}: โ€ฆ` (comma) or `key[N]{f1f2}: โ€ฆ` (tab/pipe). + - List items: two spaces, hyphen, space (`" - โ€ฆ"`). +- **Whitespace invariants**: + - No trailing spaces at end of any line. + - No trailing newline at end of output. ## Format Overview @@ -165,6 +207,8 @@ user: ### Arrays +> **Tip:** TOON includes the array length in brackets (e.g., `items[3]`). When using comma delimiters (default), the delimiter is implicit. When using tab or pipe delimiters, the delimiter is explicitly shown in the header (e.g., `tags[2|]` or `[2 ]`). This encoding helps LLMs identify the delimiter and track the number of elements, reducing errors when generating or validating structured output. + #### Primitive Arrays (Inline) ```ruby @@ -194,6 +238,30 @@ items[2]{sku,qty,price}: B2,1,14.5 ``` +**Tabular formatting applies recursively:** nested arrays of objects (whether as object properties or inside list items) also use tabular format if they meet the same requirements. + +```ruby +Toon.encode({ + 'items' => [ + { + 'users' => [ + { 'id' => 1, 'name' => 'Ada' }, + { 'id' => 2, 'name' => 'Bob' } + ], + 'status' => 'active' + } + ] +}) +``` + +``` +items[1]: + - users[2]{id,name}: + 1,Ada + 2,Bob + status: active +``` + #### Mixed and Non-Uniform Arrays Arrays that don't meet the tabular requirements use list format: @@ -211,6 +279,90 @@ items[3]: - text ``` +When objects appear in list format, the first field is placed on the hyphen line: + +``` +items[2]: + - id: 1 + name: First + - id: 2 + name: Second + extra: true +``` + +> **Note:** **Nested array indentation:** When the first field of a list item is an array (primitive, tabular, or nested), its contents are indented two spaces under the header line, and subsequent fields of the same object appear at that same indentation level. This remains unambiguous because list items begin with `"- "`, tabular arrays declare a fixed row count in their header, and object fields contain `":"`. + +#### Arrays of Arrays + +When you have arrays containing primitive inner arrays: + +```ruby +Toon.encode({ + 'pairs' => [ + [1, 2], + [3, 4] + ] +}) +``` + +``` +pairs[2]: + - [2]: 1,2 + - [2]: 3,4 +``` + +#### Empty Arrays and Objects + +Empty containers have special representations: + +```ruby +Toon.encode({ 'items' => [] }) # items[0]: +Toon.encode([]) # [0]: +Toon.encode({}) # (empty output) +Toon.encode({ 'config' => {} }) # config: +``` + +## API + +### `Toon.encode(value, **options)` + +Converts any value to TOON format. + +**Parameters:** + +- `value` โ€“ Any value to encode (Hash, Array, primitives, or nested structures) +- `indent` โ€“ Number of spaces per indentation level (default: `2`) +- `delimiter` โ€“ Delimiter for array values and tabular rows: `','`, `"\t"`, or `'|'` (default: `','`) +- `length_marker` โ€“ Optional marker to prefix array lengths: `'#'` or `false` (default: `false`) + +**Returns:** + +A TOON-formatted string with no trailing newline or spaces. + +**Examples:** + +```ruby +# Basic usage +Toon.encode({ 'id' => 1, 'name' => 'Ada' }) +# => "id: 1\nname: Ada" + +# Tabular arrays +items = [ + { 'sku' => 'A1', 'qty' => 2, 'price' => 9.99 }, + { 'sku' => 'B2', 'qty' => 1, 'price' => 14.5 } +] +Toon.encode({ 'items' => items }) +# => "items[2]{sku,qty,price}:\n A1,2,9.99\n B2,1,14.5" + +# Custom delimiter (tab) +Toon.encode({ 'items' => items }, delimiter: "\t") +# => "items[2 ]{sku qty price}:\n A1\t2\t9.99\n B2\t1\t14.5" + +# Length marker +Toon.encode({ 'tags' => ['a', 'b', 'c'] }, length_marker: '#') +# => "tags[#3]: a,b,c" +``` + ### Delimiter Options The `delimiter` option allows you to choose between comma (default), tab, or pipe delimiters: @@ -235,6 +387,33 @@ items[2 ]{sku name qty}: B2 Gadget 1 ``` +**Benefits:** + +- Tabs are single characters and often tokenize more efficiently than commas. +- Tabs rarely appear in natural text, reducing the need for quote-escaping. +- The delimiter is explicitly encoded in the array header, making it self-descriptive. + +**Considerations:** + +- Some terminals and editors may collapse or expand tabs visually. +- String values containing tabs will still require quoting. + +#### Pipe Delimiter (`|`) + +Pipe delimiters offer a middle ground between commas and tabs: + +```ruby +Toon.encode(data, delimiter: '|') +``` + +Output: + +``` +items[2|]{sku|name|qty}: + A1|Widget|2 + B2|Gadget|1 +``` + ### Length Marker Option The `length_marker` option adds a hash (`#`) prefix to array lengths: @@ -260,6 +439,71 @@ items[#2]{sku,qty}: B2,1 ``` +## Quoting Rules + +TOON quotes strings **only when necessary** to maximize token efficiency. Inner spaces are allowed; leading or trailing spaces force quotes. Unicode and emoji are safe unquoted. + +> **Note:** When using alternative delimiters (tab or pipe), the quoting rules adapt automatically. Strings containing the active delimiter will be quoted, while other delimiters remain safe. + +### Keys + +Keys are quoted when any of the following is true: + +| Condition | Examples | +|---|---| +| Contains spaces, commas, colons, quotes, control chars | `"full name"`, `"a,b"`, `"order:id"`, `"tab\there"` | +| Contains brackets or braces | `"[index]"`, `"{key}"` | +| Leading hyphen | `"-lead"` | +| Numeric-only key | `"123"` | +| Empty key | `""` | + +**Notes:** + +- Quotes and control characters in keys are escaped (e.g., `"he said \"hi\""`, `"line\nbreak"`). + +### String Values + +String values are quoted when any of the following is true: + +| Condition | Examples | +|---|---| +| Empty string | `""` | +| Contains active delimiter, colon, quote, backslash, or control chars | `"a,b"` (comma), `"a\tb"` (tab), `"a\|b"` (pipe), `"a:b"`, `"say \"hi\""`, `"C:\\Users"`, `"line1\\nline2"` | +| Leading or trailing spaces | `" padded "`, `" "` | +| Looks like boolean/number/null | `"true"`, `"false"`, `"null"`, `"42"`, `"-3.14"`, `"1e-6"`, `"05"` | +| Starts with `"- "` (list-like) | `"- item"` | +| Looks like structural token | `"[5]"`, `"{key}"`, `"[3]: x,y"` | + +> **Important:** **Delimiter-aware quoting:** Unquoted strings never contain `:` or the active delimiter. This makes TOON reliably parseable with simple heuristics: split key/value on first `: `, and split array values on the delimiter declared in the array header. When using tab or pipe delimiters, commas don't need quoting โ€“ only the active delimiter triggers quoting for both array values and object values. + +### Examples + +``` +note: "hello, world" +items[3]: foo,"true","- item" +hello ๐Ÿ‘‹ world // unquoted +" padded " // quoted +value: null // null value +name: "" // empty string (quoted) +text: "line1\nline2" // multi-line string (escaped) +``` + +## Tabular Format Requirements + +For arrays of objects to use the efficient tabular format, all of the following must be true: + +| Requirement | Detail | +|---|---| +| All elements are objects | No primitives in the array | +| Identical key sets | No missing or extra keys across rows | +| Primitive values only | No nested arrays or objects | +| Header delimiter | Comma is implicit in headers (`[N]{f1,f2}`); tab and pipe are explicit (`[N ]{f1 f2}`, `[N|]{f1|f2}`) | +| Header key order | Taken from the first object | +| Header key quoting | Same rules as object keys; keys containing the active delimiter must be quoted | +| Row value quoting | Same rules as string values; values containing the active delimiter must be quoted | + +If any condition fails, TOON falls back to list format. + ## Type Conversions Some Ruby types are automatically normalized: @@ -272,25 +516,37 @@ Some Ruby types are automatically normalized: | `Float::INFINITY`, `Float::NAN` | `null` | | `Set` | Array | -## Quoting Rules +## Using TOON in LLM Prompts -TOON quotes strings **only when necessary** to maximize token efficiency: +TOON works best when you show the format instead of describing it. The structure is self-documenting โ€“ models parse it naturally once they see the pattern. -- Empty strings are quoted: `""` -- Strings with leading/trailing spaces: `" padded "` -- Strings that look like booleans/numbers: `"true"`, `"42"` -- Strings with structural characters: `"a,b"`, `"a:b"`, `"[5]"` -- The active delimiter triggers quoting +### Sending TOON to LLMs (Input) -Keys follow similar rules and are quoted when needed. +Wrap your encoded data in a fenced code block (label it \`\`\`toon for clarity). The indentation and headers are usually enough โ€“ models treat it like familiar YAML or CSV. The explicit length markers (`[N]`) and field headers (`{field1,field2}`) help the model track structure, especially for large tables. -## Using TOON in LLM Prompts +### Generating TOON from LLMs (Output) + +For output, be more explicit. When you want the model to **generate** TOON: -When incorporating TOON into your LLM workflows: +- **Show the expected header** (`users[N]{id,name,role}:`). The model fills rows instead of repeating keys, reducing generation errors. +- **State the rules**: 2-space indent, no trailing spaces, `[N]` matches row count. + +Here's a prompt that works for both reading and generating: + +``` +Data is in TOON format (2-space indent, arrays show length and fields). + +\```toon +users[3]{id,name,role,lastLogin}: + 1,Alice,admin,2025-01-15T10:30:00Z + 2,Bob,user,2025-01-14T15:22:00Z + 3,Charlie,user,2025-01-13T09:45:00Z +\``` + +Task: Return only users with role "user" as TOON. Use the same header. Set [N] to match the row count. Output only the code block. +``` -- Wrap TOON data in a fenced code block in your prompt -- Tell the model: "Do not add extra punctuation or spaces; follow the exact TOON format." -- When asking the model to generate TOON, specify the same rules (2-space indentation, no trailing spaces, quoting rules) +> **Tip:** For large uniform tables, use `Toon.encode(data, delimiter: "\t")` and tell the model "fields are tab-separated." Tabs often tokenize better than commas and reduce the need for quote-escaping. ## Notes and Limitations @@ -299,6 +555,49 @@ When incorporating TOON into your LLM workflows: - **Tabular arrays** require all objects to have exactly the same keys with primitive values only. - **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys. +## Quick Reference + +``` +// Object +{ id: 1, name: 'Ada' } โ†’ id: 1 + name: Ada + +// Nested object +{ user: { id: 1 } } โ†’ user: + id: 1 + +// Primitive array (inline) +{ tags: ['foo', 'bar'] } โ†’ tags[2]: foo,bar + +// Tabular array (uniform objects) +{ items: [ โ†’ items[2]{id,qty}: + { id: 1, qty: 5 }, 1,5 + { id: 2, qty: 3 } 2,3 +]} + +// Mixed / non-uniform (list) +{ items: [1, { a: 1 }, 'x'] } โ†’ items[3]: + - 1 + - a: 1 + - x + +// Array of arrays +{ pairs: [[1, 2], [3, 4]] } โ†’ pairs[2]: + - [2]: 1,2 + - [2]: 3,4 + +// Root array +['x', 'y'] โ†’ [2]: x,y + +// Empty containers +{} โ†’ (empty output) +{ items: [] } โ†’ items[0]: + +// Special quoting +{ note: 'hello, world' } โ†’ note: "hello, world" +{ items: ['true', true] } โ†’ items[2]: "true",true +``` + ## Development After checking out the repo, run: @@ -330,4 +629,3 @@ The gem is available as open source under the terms of the [MIT License](LICENSE ## Credits This is a Ruby port of the original [TOON library](https://github.com/johannschopplich/toon) by [Johann Schopplich](https://github.com/johannschopplich). -