Add tool-schema support to SFT tokenization by hamishivi · Pull Request #1734 · allenai/open-instruct

hamishivi · 2026-06-23T20:57:01Z

Summary

Pass the dataset tools column to apply_chat_template during SFT tokenization so tool schemas are rendered into the prompt. A new _normalize_tools_for_chat_template helper accepts a list/dict/JSON-string (and treats empty/None as no tools), rejecting tool-name-string lists.
Derive assistant labels from token offset mappings against the rendered conversation (prefix-stable per-assistant-turn spans) instead of mask_labels, so labels stay correct when tools are present.
Consume the tools column during SFT tokenization rather than persisting it to the cached dataset (_SFT_TOKENIZE_FNS).
Add sft_tulu_tokenize_without_truncation_v1 and register last_turn_tulu_tokenize_and_truncate_v1 (which now also forwards tools).
Bump DATASET_CACHE_VERSION to v7 to invalidate stale caches.

Notes

For the no-tools tulu path the new tokenizer is byte-identical to the previous mask_labels path: the existing GOLD_SFT golden-hash test passes unchanged.
This PR intentionally excludes the unrelated dataset_config_name plumbing and the sft_filter_v1 rename that were bundled alongside this work on the source branch.

Test plan

uv run pytest open_instruct/test_dataset_transformation.py (20 passed locally), including:
- existing GOLD_SFT/preference/rlvr golden-hash tests (unchanged),
- new TestToolNormalization (JSON parsing, dict-wrapping, empty handling, rejection cases),
- new TestSFTTuluTokenizeLabels (assistant-only trainable labels, tools column consumed, no-truncation variant).

GPU_TESTS=bypass

Made with Cursor

Pass the dataset `tools` column to `apply_chat_template` during SFT tokenization so tool schemas are rendered into the prompt. JSON-encoded tool schemas are parsed via a new `_normalize_tools_for_chat_template` helper. Assistant labels are now derived from token offset mappings against the rendered conversation (prefix-stable per-assistant-turn spans), which keeps labels correct when tools are present. The tools column is consumed by SFT tokenization rather than persisted to the cached dataset. Bumps the dataset cache version to v7. Adds `sft_tulu_tokenize_without_truncation_v1` and registers `last_turn_tulu_tokenize_and_truncate_v1`. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization, allowing tool schemas to be parsed, normalized, and passed to chat templates, while deriving assistant labels from offset mappings. Feedback on these changes highlights three key areas for improvement: explicitly validating that a fast tokenizer is used since offset mapping is required, robustly handling assistant content offsets for templates that do not end with a newline, and fixing a potential TypeError when the tools column contains a JSON-encoded null string.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

A JSON-encoded "null" (or "") decodes to None/empty after json.loads, which previously fell through to the list check and raised a spurious TypeError. Re-check for None/empty after parsing so these decode to no tools gracefully. Co-authored-by: Cursor <cursoragent@cursor.com>

The offset-mapping-based label derivation relies on return_offsets_mapping, which slow (Python) tokenizers do not support. Raise a clear error pointing at use_fast=True instead of the opaque ValueError/NotImplementedError. Co-authored-by: Cursor <cursoragent@cursor.com>

…heuristic The previous content_offset = assistant_text.find(chr(10)) heuristic assumed the assistant header always ends in a newline and the content never starts with one, which mis-masks templates like simple_chat (no newline header) or simple_concat_with_space (no header, multiline content). Locate the actual content string (rfind, falling back to the newline heuristic) and train any token that overlaps the content span so a header/content boundary token stays trainable. The tulu template result (and its golden-hash test) is unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>

Type the normalized tools as a gradual list (assignable to apply_chat_template's list[dict | Callable]) and assert the tokenize=False renderings are str so ty can narrow them before startswith/slicing. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T21:50:09Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing and passing the tools column to the chat template and deriving assistant labels from offset mappings. It also adds robust tool normalization and comprehensive unit tests. The feedback suggests extending the new offset-based tokenization logic to support last-turn-only tokenization, which would allow deprecating the fragile legacy mask_labels function. Additionally, it is recommended to handle float('nan') values in the tool normalization function to prevent potential type errors when processing datasets loaded from pandas or CSVs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Add a last_turn_only flag to _tokenize_tulu_sft_with_assistant_labels and route last_turn_tulu_tokenize_and_truncate_v1 through it, so the last-turn path forwards the tools column to the chat template and drops the legacy mask_labels path. Also guard the content match with content.strip() to avoid matching whitespace-only content. Golden preference/SFT hashes are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>

pandas/CSV-backed datasets can represent a missing object cell as float('nan'); normalize it to no tools instead of falling through to a TypeError. Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T22:03:00Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template, while deriving assistant labels using offset mappings. It also ensures the tools column is consumed rather than persisted for SFT transforms. The review feedback highlights two valuable optimization and robustness opportunities: pre-rendering chat template prefixes to avoid redundant Jinja2 rendering inside a loop, and using stripped content for substring matching to prevent failures caused by whitespace formatting variations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Search for content.strip() in the rendered text so leading/trailing whitespace in the message content (which templates may strip) doesn't make rfind miss and fall back to the newline heuristic. Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T22:38:56Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template. It also transitions assistant label derivation to use offset mappings instead of the legacy mask_labels path, and ensures the tools column is consumed rather than persisted during SFT tokenization. The review feedback highlights two key improvements: handling prefix-stability check failures caused by trailing eos_token on intermediate turns, and supporting block-based content lists (where content is a list of dictionaries) to extract text content instead of falling back to the newline heuristic.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist

Code Review

This pull request adds tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template. It refactors tokenization to derive assistant labels using offset mappings from fast tokenizers, adds a non-truncating tokenization variant, and ensures the tools column is consumed rather than persisted during SFT transformations. Feedback on the changes suggests improving the robustness of the last_turn_only logic to correctly identify the final assistant turn even if the conversation does not end with an assistant message.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Find the index of the last assistant message instead of assuming it is the final message, so a conversation ending with a user/system turn still trains its real last assistant turn. Co-authored-by: Cursor <cursoragent@cursor.com>

Some chat templates append eos_token only on the final turn (loop.last), so rendered_before ends with an eos that is absent at that position in the longer render, spuriously failing the prefix-stability check. Strip a trailing eos_token and retry before raising. Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T22:59:41Z

/gemini review

gemini-code-assist

Code Review

This pull request adds tool-schema support to SFT tokenization by parsing the tools column, passing it to apply_chat_template, and deriving assistant labels from offset mappings. It also introduces new tokenization variants and extensive unit tests. The review feedback identifies two important issues: a potential ValueError when a conversation starts with an assistant message due to an empty message slice passed to apply_chat_template, and a bug where the tools column is not actually removed from target_columns when it defaults to all dataset columns. Code suggestions are provided to resolve both issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

When the first message is an assistant turn (message_idx == 0), messages[:0] is empty and apply_chat_template([]) raises in most tokenizers; treat the prefix as an empty string instead. Co-authored-by: Cursor <cursoragent@cursor.com>

When target_columns is None it defaults to dataset.column_names (which includes tools), so skipping _preserve_column was insufficient — explicitly remove TOOLS_COLUMN_KEY for SFT tokenization so the consumed column isn't persisted. Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T23:18:20Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization. It adds normalization for tool schemas, implements offset-mapping-based assistant label derivation for fast tokenizers, and ensures the tools column is consumed rather than persisted during SFT tokenization. Feedback was provided regarding the newline heuristic fallback in _tokenize_tulu_sft_with_assistant_labels, which can be fragile and incorrectly mask assistant content; a more robust approach using add_generation_prompt=True was suggested.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

When the assistant content can't be located verbatim, derive the header offset from add_generation_prompt (the exact assistant header the template emits) and raise if it can't be determined, replacing the fragile first-newline heuristic that mis-masked header-without-newline templates with multiline content. Co-authored-by: Cursor <cursoragent@cursor.com>

hamishivi · 2026-06-23T23:27:55Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing and passing the tools column to the chat template, deriving assistant labels from offset mappings, and ensuring the tools column is consumed rather than persisted. It also increments the dataset cache version to v7 and adds comprehensive unit tests. The review feedback identifies a bug in the offset-based label derivation where the content offset can be overestimated by the length of the EOS token when the fallback path is triggered and the trailing EOS token was stripped from the prefix. A code suggestion is provided to track whether the EOS token was stripped and adjust the generation prompt rendering accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-23T23:29:24Z

+        # An empty conversation (assistant is the first message) raises in most chat
+        # templates, so treat the prefix as empty.
+        if message_idx == 0:
+            rendered_before = ""
+        else:
+            rendered_before = tokenizer.apply_chat_template(
+                conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=False
+            )
+        rendered_through_message = tokenizer.apply_chat_template(
+            conversation=messages[: message_idx + 1], tools=tools, tokenize=False, add_generation_prompt=False
+        )
+        assert isinstance(rendered_before, str)
+        assert isinstance(rendered_through_message, str)
+        if not rendered_through_message.startswith(rendered_before):
+            # Some templates append eos_token only on the final turn (loop.last), so
+            # rendered_before ends with an eos that is absent at that position in the
+            # longer render. Strip a trailing eos_token and retry before failing.
+            eos_token = tokenizer.eos_token
+            if eos_token and rendered_before.endswith(eos_token):
+                stripped = rendered_before[: -len(eos_token)]
+                if rendered_through_message.startswith(stripped):
+                    rendered_before = stripped
+            if not rendered_through_message.startswith(rendered_before):
+                raise ValueError("Chat template rendering is not prefix-stable; cannot compute assistant label spans.")
+
+        assistant_text = rendered_through_message[len(rendered_before) :]
+        # Locate where the assistant's actual content begins so the header (e.g.
+        # "<|assistant|>\n" or "Assistant: ") is masked but the content is trained.
+        # Prefer matching the real content string (rfind avoids matching header
+        # look-alikes). When the content can't be located verbatim (empty content,
+        # tool calls, or template-transformed text), determine the header exactly from
+        # the chat template's generation prompt rather than guessing with a newline.
+        content = message.get("content")
+        stripped_content = content.strip() if isinstance(content, str) else ""
+        content_offset = assistant_text.rfind(stripped_content) if stripped_content else -1
+        if content_offset == -1:
+            # The generation prompt is exactly the assistant header the template emits
+            # before the assistant content, so its length gives the header offset.
+            rendered_with_generation_prompt = tokenizer.apply_chat_template(
+                conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=True
+            )
+            assert isinstance(rendered_with_generation_prompt, str)
+            if not rendered_with_generation_prompt.startswith(rendered_before):
+                raise ValueError(
+                    "Cannot determine the assistant header offset for label masking: "
+                    "add_generation_prompt did not extend the rendered prefix."
+                )
+            content_offset = len(rendered_with_generation_prompt) - len(rendered_before)
+            if content_offset > len(assistant_text):
+                raise ValueError("Computed assistant content offset exceeds the assistant turn length.")
+        trainable_char_spans.append((len(rendered_before) + content_offset, len(rendered_through_message)))


When the chat template appends the eos_token only on the final turn, the rendered_before prefix has its trailing eos_token stripped to match rendered_through_message. However, if the assistant content cannot be located verbatim (triggering the fallback that uses add_generation_prompt=True), the rendered_with_generation_prompt is rendered as the final turn of its sub-conversation and will still contain the trailing eos_token. This causes content_offset to be overestimated by len(eos_token), which incorrectly masks the first few characters of the assistant's actual response. We should track whether the eos_token was stripped from rendered_before and adjust rendered_with_generation_prompt accordingly.

stripped_eos = False # An empty conversation (assistant is the first message) raises in most chat # templates, so treat the prefix as empty. if message_idx == 0: rendered_before = "" else: rendered_before = tokenizer.apply_chat_template( conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=False ) rendered_through_message = tokenizer.apply_chat_template( conversation=messages[: message_idx + 1], tools=tools, tokenize=False, add_generation_prompt=False ) assert isinstance(rendered_before, str) assert isinstance(rendered_through_message, str) if not rendered_through_message.startswith(rendered_before): # Some templates append eos_token only on the final turn (loop.last), so # rendered_before ends with an eos that is absent at that position in the # longer render. Strip a trailing eos_token and retry before failing. eos_token = tokenizer.eos_token if eos_token and rendered_before.endswith(eos_token): stripped = rendered_before[: -len(eos_token)] if rendered_through_message.startswith(stripped): rendered_before = stripped stripped_eos = True if not rendered_through_message.startswith(rendered_before): raise ValueError("Chat template rendering is not prefix-stable; cannot compute assistant label spans.") assistant_text = rendered_through_message[len(rendered_before) :] # Locate where the assistant's actual content begins so the header (e.g. # "<|assistant|>\n" or "Assistant: ") is masked but the content is trained. # Prefer matching the real content string (rfind avoids matching header # look-alikes). When the content can't be located verbatim (empty content, # tool calls, or template-transformed text), determine the header exactly from # the chat template's generation prompt rather than guessing with a newline. content = message.get("content") stripped_content = content.strip() if isinstance(content, str) else "" content_offset = assistant_text.rfind(stripped_content) if stripped_content else -1 if content_offset == -1: # The generation prompt is exactly the assistant header the template emits # before the assistant content, so its length gives the header offset. rendered_with_generation_prompt = tokenizer.apply_chat_template( conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=True ) assert isinstance(rendered_with_generation_prompt, str) eos_token = tokenizer.eos_token if stripped_eos and eos_token and rendered_with_generation_prompt.startswith(rendered_before + eos_token): rendered_with_generation_prompt = rendered_before + rendered_with_generation_prompt[len(rendered_before) + len(eos_token):] if not rendered_with_generation_prompt.startswith(rendered_before): raise ValueError( "Cannot determine the assistant header offset for label masking: " "add_generation_prompt did not extend the rendered prefix." ) content_offset = len(rendered_with_generation_prompt) - len(rendered_before) if content_offset > len(assistant_text): raise ValueError("Computed assistant content offset exceeds the assistant turn length.") trainable_char_spans.append((len(rendered_before) + content_offset, len(rendered_through_message)))

Replace the accumulated rfind/eos-strip/newline/generation-prompt-fallback logic with one rule: the trainable span is [len(render(messages[:k], add_generation_prompt=True)), len(render(messages[:k+1]))], with both required to be prefixes of the full render. If not (e.g. templates that append eos only on the final turn, or lack add_generation_prompt support), raise a clear error instead of silently mis-masking. Tulu output (and the golden hashes) are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>

farhatkevin

Just to make sure I understand: should the older sft_tokenize_v1 / sft_tokenize_mask_out_prompt_v1 paths also be treated as consuming tools, even though only the Tulu SFT tokenizers pass tools into apply_chat_template?

hamishivi and others added 2 commits June 23, 2026 13:56

Add CHANGELOG entry for SFT tools support (#1734)

be3ac3a

Co-authored-by: Cursor <cursoragent@cursor.com>