
Think masking #45

Merged

fabianlim merged 28 commits into garrett361:main from divya-kumari32:think-masking on Sep 24, 2025
Conversation

divya-kumari32 (Collaborator) commented Aug 26, 2025:

This PR adds support for optionally appending \n<think>\n to the default asst_tag during span-based label masking for SFT, but only for samples that contain <think> content and only when the user explicitly enables the mask_think_tag flag. This additionally supports thinking models that do not enable thinking mode by default, as well as training datasets in which think and no-think samples are mixed.

Some assistant responses in a sample might contain a <think> block that represents internal monologue/reasoning by the model. During SFT, the user might not want the model to learn to generate the <think> tag itself. Instead, we optionally allow <think> to be treated as part of the assistant tag span boundary, so masking cleanly includes or excludes it.
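
To make the mechanics concrete, here is a minimal single-turn sketch of span-based label masking (mask_prompt_labels and IGNORE_INDEX are illustrative names, not the PR's actual helpers): every token up to and including the assistant tag gets the ignore index, so appending \n<think>\n to the tag pulls the <think> marker inside the masked span.

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores

def mask_prompt_labels(input_ids: list[int], tag_ids: list[int]) -> list[int]:
    """Mask everything up to and including the assistant tag (single-turn case)."""
    labels = list(input_ids)
    n = len(tag_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start : start + n] == tag_ids:
            # Everything before the assistant's response carries no loss.
            labels[: start + n] = [IGNORE_INDEX] * (start + n)
            break
    return labels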

The behavior is controlled by two conditions:

  1. mask_think_tag (passed by the user)
     • False: masking ignores <think> completely. This may be because the user doesn’t want <think> included, the model is not a “thinking” model, or for other design/training reasons.
     • True: masking may append <think> to the assistant tag, depending on the sample’s content.
  2. has_thinking_content(messages) (sample check)
     • Returns True if an assistant message has an explicit thought field, or contains "<think>" inside its content string.

The <think> suffix is appended to the assistant tag only when both conditions are true:

if mask_think_tag and has_thinking_content(messages):
    asst_tag += think_tag
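
For reference, has_thinking_content can be reconstructed from the excerpt quoted in the review thread below; roughly as follows (the exact signature and surrounding module layout are assumptions):

def has_thinking_content(messages: list[dict]) -> bool:
    """Return True if any assistant message carries <think> reasoning content."""
    for message in messages:
        if message.get("role") == "assistant":
            # Check for an explicit 'thought' field or a '<think>' tag in the content.
            if message.get("thought") or (
                isinstance(message.get("content"), str) and "<think>" in message["content"]
            ):
                return True
    return False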

Additional notes (other minor changes):
Tokenizers add special tokens during encoding by default, so we explicitly set add_special_tokens=False here to disable the injection of tokenizer-specific BOS/EOS or SEP tokens that would otherwise be added automatically to the encoded sequence. This is essential for correct behavior in tasks like:

  • Span matching or pattern matching on tokenized control tags (e.g., <|start_of_role|>assistant<|end_of_role|>)
  • Label masking for SFT or instruction-tuning pipelines, where precise position tracking is required

Why is this necessary:
  • By default, tokenizers for models like LLaMA, Mistral, and Phi-2 have add_bos_token=True, which means they automatically prepend a BOS token. This would shift all token positions by 1 and break any span-matching logic.
  • Other models, like Qwen, have add_bos_token=False by default, so they behave differently, leading to inconsistent behavior across model families if not explicitly handled.
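
A quick way to observe the shift (the model name is only an example of a tokenizer whose default is add_bos_token=True; the final assertion holds when exactly one BOS token is prepended):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # prepends BOS by default
tag = "<|start_of_role|>assistant<|end_of_role|>"

with_specials = tok.encode(tag)                               # [BOS, ...tag tokens]
without_specials = tok.encode(tag, add_special_tokens=False)  # [...tag tokens only]

# Any tag span located against `without_specials` is off by one in
# `with_specials`, silently breaking position-based label masking.
assert with_specials[1:] == without_specials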

garrett361 (Owner) commented:

The unit tests that @fabianlim wrote are failing. Please make sure this isn't indicative of a bug, and update the tests if necessary.

dangxuanhong (Collaborator) commented:

Hi @fabianlim @divya-kumari32, should we address the above concerns in order to get this PR merged?

@Swanand-Kadhe Did you use this branch [think-masking](https://github.com/divya-kumari32/open-instruct/tree/think-masking) in running experiments on the v7d1l2 dataset?

Review thread on open_instruct/dataset_transformation.py (outdated):
  tokenizer: PreTrainedTokenizer,
  max_seq_length: int,
- asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>",
+ asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>\n<think>\n",
Suggested change:
- asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>\n<think>\n",
+ asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>",

Review thread on open_instruct/dataset_transformation.py (outdated):
if message.get("role") == "assistant":
# Check for an explicit 'thought' field or a '<think>' tag in the content.
if message.get("thought") or (
isinstance(message.get("content"), str) and "<think>" in message["content"]
Suggested change:
- isinstance(message.get("content"), str) and "<think>" in message["content"]
+ isinstance(message.get("content"), str) and think_tag.strip() in message["content"]

fabianlim (Collaborator) left a review comment:

Can you accept the two suggestions?

Review thread on open_instruct/dataset_transformation.py (outdated):
**additional_inputs,
)

# If the user has set `append_think_tag=True` and the current sample is a thinking sample,
Collaborator:

ok, thanks for the comment update... so can we then change append_think_tag to mask_think_tag_if_present? Would that be more clear?

divya-kumari32 (Author) replied on Sep 24, 2025:

done

Review thread on open_instruct/dataset_transformation.py (outdated):
Comment on lines +1273 to +1276
if append_think_tag:
    if has_thinking_content(messages):
        asst_tag += think_tag
Suggested change:
- if append_think_tag:
-     if has_thinking_content(messages):
-         asst_tag += think_tag
+ if append_think_tag and has_thinking_content(messages):
+     asst_tag += think_tag

divya-kumari32 (Author) replied on Sep 24, 2025:

will do

@fabianlim fabianlim merged commit 9db0eec into garrett361:main Sep 24, 2025
2 checks passed
