Skip to content

Add tool-schema support to SFT tokenization#1734

Open
hamishivi wants to merge 16 commits into
mainfrom
hamishivi/sft-tools-support
Open

Add tool-schema support to SFT tokenization#1734
hamishivi wants to merge 16 commits into
mainfrom
hamishivi/sft-tools-support

Conversation

@hamishivi

Copy link
Copy Markdown
Collaborator

Summary

  • Pass the dataset tools column to apply_chat_template during SFT tokenization so tool schemas are rendered into the prompt. A new _normalize_tools_for_chat_template helper accepts a list/dict/JSON-string (and treats empty/None as no tools), rejecting tool-name-string lists.
  • Derive assistant labels from token offset mappings against the rendered conversation (prefix-stable per-assistant-turn spans) instead of mask_labels, so labels stay correct when tools are present.
  • Consume the tools column during SFT tokenization rather than persisting it to the cached dataset (_SFT_TOKENIZE_FNS).
  • Add sft_tulu_tokenize_without_truncation_v1 and register last_turn_tulu_tokenize_and_truncate_v1 (which now also forwards tools).
  • Bump DATASET_CACHE_VERSION to v7 to invalidate stale caches.

Notes

  • For the no-tools tulu path the new tokenizer is byte-identical to the previous mask_labels path: the existing GOLD_SFT golden-hash test passes unchanged.
  • This PR intentionally excludes the unrelated dataset_config_name plumbing and the sft_filter_v1 rename that were bundled alongside this work on the source branch.

Test plan

  • uv run pytest open_instruct/test_dataset_transformation.py (20 passed locally), including:
    • existing GOLD_SFT/preference/rlvr golden-hash tests (unchanged),
    • new TestToolNormalization (JSON parsing, dict-wrapping, empty handling, rejection cases),
    • new TestSFTTuluTokenizeLabels (assistant-only trainable labels, tools column consumed, no-truncation variant).

GPU_TESTS=bypass

Made with Cursor

hamishivi and others added 2 commits June 23, 2026 13:56
Pass the dataset `tools` column to `apply_chat_template` during SFT
tokenization so tool schemas are rendered into the prompt. JSON-encoded tool
schemas are parsed via a new `_normalize_tools_for_chat_template` helper.

Assistant labels are now derived from token offset mappings against the
rendered conversation (prefix-stable per-assistant-turn spans), which keeps
labels correct when tools are present. The tools column is consumed by SFT
tokenization rather than persisted to the cached dataset. Bumps the dataset
cache version to v7. Adds `sft_tulu_tokenize_without_truncation_v1` and
registers `last_turn_tulu_tokenize_and_truncate_v1`.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization, allowing tool schemas to be parsed, normalized, and passed to chat templates, while deriving assistant labels from offset mappings. Feedback on these changes highlights three key areas for improvement: explicitly validating that a fast tokenizer is used since offset mapping is required, robustly handling assistant content offsets for templates that do not end with a newline, and fixing a potential TypeError when the tools column contains a JSON-encoded null string.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
hamishivi and others added 5 commits June 23, 2026 14:21
A JSON-encoded "null" (or "") decodes to None/empty after json.loads, which
previously fell through to the list check and raised a spurious TypeError.
Re-check for None/empty after parsing so these decode to no tools gracefully.

Co-authored-by: Cursor <cursoragent@cursor.com>
The offset-mapping-based label derivation relies on return_offsets_mapping,
which slow (Python) tokenizers do not support. Raise a clear error pointing at
use_fast=True instead of the opaque ValueError/NotImplementedError.

Co-authored-by: Cursor <cursoragent@cursor.com>
…heuristic

The previous content_offset = assistant_text.find(chr(10)) heuristic assumed the
assistant header always ends in a newline and the content never starts with one,
which mis-masks templates like simple_chat (no newline header) or
simple_concat_with_space (no header, multiline content). Locate the actual content
string (rfind, falling back to the newline heuristic) and train any token that
overlaps the content span so a header/content boundary token stays trainable. The
tulu template result (and its golden-hash test) is unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Type the normalized tools as a gradual list (assignable to apply_chat_template's
list[dict | Callable]) and assert the tokenize=False renderings are str so ty can
narrow them before startswith/slicing.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing and passing the tools column to the chat template and deriving assistant labels from offset mappings. It also adds robust tool normalization and comprehensive unit tests. The feedback suggests extending the new offset-based tokenization logic to support last-turn-only tokenization, which would allow deprecating the fragile legacy mask_labels function. Additionally, it is recommended to handle float('nan') values in the tool normalization function to prevent potential type errors when processing datasets loaded from pandas or CSVs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py
Comment thread open_instruct/dataset_transformation.py
hamishivi and others added 2 commits June 23, 2026 14:58
Add a last_turn_only flag to _tokenize_tulu_sft_with_assistant_labels and route
last_turn_tulu_tokenize_and_truncate_v1 through it, so the last-turn path forwards
the tools column to the chat template and drops the legacy mask_labels path. Also
guard the content match with content.strip() to avoid matching whitespace-only
content. Golden preference/SFT hashes are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
pandas/CSV-backed datasets can represent a missing object cell as float('nan');
normalize it to no tools instead of falling through to a TypeError.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template, while deriving assistant labels using offset mappings. It also ensures the tools column is consumed rather than persisted for SFT transforms. The review feedback highlights two valuable optimization and robustness opportunities: pre-rendering chat template prefixes to avoid redundant Jinja2 rendering inside a loop, and using stripped content for substring matching to prevent failures caused by whitespace formatting variations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py
Comment thread open_instruct/dataset_transformation.py Outdated
Search for content.strip() in the rendered text so leading/trailing whitespace
in the message content (which templates may strip) doesn't make rfind miss and
fall back to the newline heuristic.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template. It also transitions assistant label derivation to use offset mappings instead of the legacy mask_labels path, and ensures the tools column is consumed rather than persisted during SFT tokenization. The review feedback highlights two key improvements: handling prefix-stability check failures caused by trailing eos_token on intermediate turns, and supporting block-based content lists (where content is a list of dictionaries) to extract text content instead of falling back to the newline heuristic.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds tool-schema support to SFT tokenization by parsing the tools column and passing it to apply_chat_template. It refactors tokenization to derive assistant labels using offset mappings from fast tokenizers, adds a non-truncating tokenization variant, and ensures the tools column is consumed rather than persisted during SFT transformations. Feedback on the changes suggests improving the robustness of the last_turn_only logic to correctly identify the final assistant turn even if the conversation does not end with an assistant message.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py
hamishivi and others added 2 commits June 23, 2026 15:45
Find the index of the last assistant message instead of assuming it is the final
message, so a conversation ending with a user/system turn still trains its real
last assistant turn.

Co-authored-by: Cursor <cursoragent@cursor.com>
Some chat templates append eos_token only on the final turn (loop.last), so
rendered_before ends with an eos that is absent at that position in the longer
render, spuriously failing the prefix-stability check. Strip a trailing eos_token
and retry before raising.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds tool-schema support to SFT tokenization by parsing the tools column, passing it to apply_chat_template, and deriving assistant labels from offset mappings. It also introduces new tokenization variants and extensive unit tests. The review feedback identifies two important issues: a potential ValueError when a conversation starts with an assistant message due to an empty message slice passed to apply_chat_template, and a bug where the tools column is not actually removed from target_columns when it defaults to all dataset columns. Code suggestions are provided to resolve both issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py
hamishivi and others added 2 commits June 23, 2026 16:03
When the first message is an assistant turn (message_idx == 0), messages[:0] is
empty and apply_chat_template([]) raises in most tokenizers; treat the prefix as
an empty string instead.

Co-authored-by: Cursor <cursoragent@cursor.com>
When target_columns is None it defaults to dataset.column_names (which includes
tools), so skipping _preserve_column was insufficient — explicitly remove
TOOLS_COLUMN_KEY for SFT tokenization so the consumed column isn't persisted.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization. It adds normalization for tool schemas, implements offset-mapping-based assistant label derivation for fast tokenizers, and ensures the tools column is consumed rather than persisted during SFT tokenization. Feedback was provided regarding the newline heuristic fallback in _tokenize_tulu_sft_with_assistant_labels, which can be fragile and incorrectly mask assistant content; a more robust approach using add_generation_prompt=True was suggested.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py Outdated
When the assistant content can't be located verbatim, derive the header offset
from add_generation_prompt (the exact assistant header the template emits) and
raise if it can't be determined, replacing the fragile first-newline heuristic
that mis-masked header-without-newline templates with multiline content.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hamishivi

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tool-schema support to SFT tokenization by parsing and passing the tools column to the chat template, deriving assistant labels from offset mappings, and ensuring the tools column is consumed rather than persisted. It also increments the dataset cache version to v7 and adds comprehensive unit tests. The review feedback identifies a bug in the offset-based label derivation where the content offset can be overestimated by the length of the EOS token when the fallback path is triggered and the trailing EOS token was stripped from the prefix. A code suggestion is provided to track whether the EOS token was stripped and adjust the generation prompt rendering accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread open_instruct/dataset_transformation.py Outdated
Comment on lines +1251 to +1301
# An empty conversation (assistant is the first message) raises in most chat
# templates, so treat the prefix as empty.
if message_idx == 0:
rendered_before = ""
else:
rendered_before = tokenizer.apply_chat_template(
conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=False
)
rendered_through_message = tokenizer.apply_chat_template(
conversation=messages[: message_idx + 1], tools=tools, tokenize=False, add_generation_prompt=False
)
assert isinstance(rendered_before, str)
assert isinstance(rendered_through_message, str)
if not rendered_through_message.startswith(rendered_before):
# Some templates append eos_token only on the final turn (loop.last), so
# rendered_before ends with an eos that is absent at that position in the
# longer render. Strip a trailing eos_token and retry before failing.
eos_token = tokenizer.eos_token
if eos_token and rendered_before.endswith(eos_token):
stripped = rendered_before[: -len(eos_token)]
if rendered_through_message.startswith(stripped):
rendered_before = stripped
if not rendered_through_message.startswith(rendered_before):
raise ValueError("Chat template rendering is not prefix-stable; cannot compute assistant label spans.")

assistant_text = rendered_through_message[len(rendered_before) :]
# Locate where the assistant's actual content begins so the header (e.g.
# "<|assistant|>\n" or "Assistant: ") is masked but the content is trained.
# Prefer matching the real content string (rfind avoids matching header
# look-alikes). When the content can't be located verbatim (empty content,
# tool calls, or template-transformed text), determine the header exactly from
# the chat template's generation prompt rather than guessing with a newline.
content = message.get("content")
stripped_content = content.strip() if isinstance(content, str) else ""
content_offset = assistant_text.rfind(stripped_content) if stripped_content else -1
if content_offset == -1:
# The generation prompt is exactly the assistant header the template emits
# before the assistant content, so its length gives the header offset.
rendered_with_generation_prompt = tokenizer.apply_chat_template(
conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=True
)
assert isinstance(rendered_with_generation_prompt, str)
if not rendered_with_generation_prompt.startswith(rendered_before):
raise ValueError(
"Cannot determine the assistant header offset for label masking: "
"add_generation_prompt did not extend the rendered prefix."
)
content_offset = len(rendered_with_generation_prompt) - len(rendered_before)
if content_offset > len(assistant_text):
raise ValueError("Computed assistant content offset exceeds the assistant turn length.")
trainable_char_spans.append((len(rendered_before) + content_offset, len(rendered_through_message)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When the chat template appends the eos_token only on the final turn, the rendered_before prefix has its trailing eos_token stripped to match rendered_through_message. However, if the assistant content cannot be located verbatim (triggering the fallback that uses add_generation_prompt=True), the rendered_with_generation_prompt is rendered as the final turn of its sub-conversation and will still contain the trailing eos_token. This causes content_offset to be overestimated by len(eos_token), which incorrectly masks the first few characters of the assistant's actual response. We should track whether the eos_token was stripped from rendered_before and adjust rendered_with_generation_prompt accordingly.

        stripped_eos = False
        # An empty conversation (assistant is the first message) raises in most chat
        # templates, so treat the prefix as empty.
        if message_idx == 0:
            rendered_before = ""
        else:
            rendered_before = tokenizer.apply_chat_template(
                conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=False
            )
        rendered_through_message = tokenizer.apply_chat_template(
            conversation=messages[: message_idx + 1], tools=tools, tokenize=False, add_generation_prompt=False
        )
        assert isinstance(rendered_before, str)
        assert isinstance(rendered_through_message, str)
        if not rendered_through_message.startswith(rendered_before):
            # Some templates append eos_token only on the final turn (loop.last), so
            # rendered_before ends with an eos that is absent at that position in the
            # longer render. Strip a trailing eos_token and retry before failing.
            eos_token = tokenizer.eos_token
            if eos_token and rendered_before.endswith(eos_token):
                stripped = rendered_before[: -len(eos_token)]
                if rendered_through_message.startswith(stripped):
                    rendered_before = stripped
                    stripped_eos = True
            if not rendered_through_message.startswith(rendered_before):
                raise ValueError("Chat template rendering is not prefix-stable; cannot compute assistant label spans.")

        assistant_text = rendered_through_message[len(rendered_before) :]
        # Locate where the assistant's actual content begins so the header (e.g.
        # "<|assistant|>\n" or "Assistant: ") is masked but the content is trained.
        # Prefer matching the real content string (rfind avoids matching header
        # look-alikes). When the content can't be located verbatim (empty content,
        # tool calls, or template-transformed text), determine the header exactly from
        # the chat template's generation prompt rather than guessing with a newline.
        content = message.get("content")
        stripped_content = content.strip() if isinstance(content, str) else ""
        content_offset = assistant_text.rfind(stripped_content) if stripped_content else -1
        if content_offset == -1:
            # The generation prompt is exactly the assistant header the template emits
            # before the assistant content, so its length gives the header offset.
            rendered_with_generation_prompt = tokenizer.apply_chat_template(
                conversation=messages[:message_idx], tools=tools, tokenize=False, add_generation_prompt=True
            )
            assert isinstance(rendered_with_generation_prompt, str)
            eos_token = tokenizer.eos_token
            if stripped_eos and eos_token and rendered_with_generation_prompt.startswith(rendered_before + eos_token):
                rendered_with_generation_prompt = rendered_before + rendered_with_generation_prompt[len(rendered_before) + len(eos_token):]
            if not rendered_with_generation_prompt.startswith(rendered_before):
                raise ValueError(
                    "Cannot determine the assistant header offset for label masking: "
                    "add_generation_prompt did not extend the rendered prefix."
                )
            content_offset = len(rendered_with_generation_prompt) - len(rendered_before)
            if content_offset > len(assistant_text):
                raise ValueError("Computed assistant content offset exceeds the assistant turn length.")
        trainable_char_spans.append((len(rendered_before) + content_offset, len(rendered_through_message)))

Replace the accumulated rfind/eos-strip/newline/generation-prompt-fallback logic
with one rule: the trainable span is [len(render(messages[:k], add_generation_prompt=True)),
len(render(messages[:k+1]))], with both required to be prefixes of the full render.
If not (e.g. templates that append eos only on the final turn, or lack
add_generation_prompt support), raise a clear error instead of silently mis-masking.
Tulu output (and the golden hashes) are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
@farhatkevin farhatkevin self-requested a review June 25, 2026 21:47

@farhatkevin farhatkevin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I understand: should the older sft_tokenize_v1 / sft_tokenize_mask_out_prompt_v1 paths also be treated as consuming tools, even though only the Tulu SFT tokenizers pass tools into apply_chat_template?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants