Think masking #45
Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>
The unit tests that @fabianlim wrote are failing. Please make sure this isn't indicative of a bug, and update the tests if necessary.
…n-instruct into think-masking
Hi @fabianlim @divya-kumari32, should we address the above concerns in order to get this PR merged? @Swanand-Kadhe Did you use this branch
```diff
     tokenizer: PreTrainedTokenizer,
     max_seq_length: int,
-    asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>",
+    asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>\n<think>\n",
```

Suggested change:

```diff
-    asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>\n<think>\n",
+    asst_tag: str = "<|start_of_role|>assistant<|end_of_role|>",
```
```python
if message.get("role") == "assistant":
    # Check for an explicit 'thought' field or a '<think>' tag in the content.
    if message.get("thought") or (
        isinstance(message.get("content"), str) and "<think>" in message["content"]
```

Suggested change:

```diff
-        isinstance(message.get("content"), str) and "<think>" in message["content"]
+        isinstance(message.get("content"), str) and think_tag.strip() in message["content"]
```
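The suggested check can be sketched as a small standalone helper (a hypothetical sketch; the PR's actual `has_thinking_content` may differ in signature and defaults):

```python
def has_thinking_content(messages, think_tag="\n<think>\n"):
    """Return True if any assistant message carries reasoning content,
    either via an explicit 'thought' field or an inline think tag."""
    for message in messages:
        if message.get("role") != "assistant":
            continue
        # An explicit 'thought' field marks the sample as a thinking sample.
        if message.get("thought"):
            return True
        content = message.get("content")
        # Per the review suggestion, match the stripped tag so surrounding
        # newlines in the configured think_tag do not cause misses.
        if isinstance(content, str) and think_tag.strip() in content:
            return True
    return False
```

Matching on `think_tag.strip()` rather than a hard-coded `"<think>"` keeps the check consistent with whatever tag string the caller configures.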
**fabianlim** left a comment:

Can you accept the two suggestions?
```python
        **additional_inputs,
    )
```

```python
    # If the user has set `append_think_tag=True` and the current sample is a thinking sample,
```

Ok, thanks for the comment update. So can we then change `append_think_tag` to `mask_think_tag_if_present`? Would that be more clear?
```python
if append_think_tag:
    if has_thinking_content(messages):
        asst_tag += think_tag
```

Suggested change:

```diff
-if append_think_tag:
-    if has_thinking_content(messages):
-        asst_tag += think_tag
+if append_think_tag and has_thinking_content(messages):
+    asst_tag += think_tag
```
This PR adds support for optionally appending `\n<think>\n` to the default `asst_tag` during span-based label masking for SFT, but only for samples that contain `<think>` content and when the user explicitly enables it via the `mask_think_tag` flag. This feature additionally supports models that are thinking models but do not enable thinking mode by default, and covers the case where think and no-think samples are mixed in the dataset during training.

Some assistant responses in a sample might contain a `<think>` block that represents internal monologue/reasoning by the model. During SFT, the user might not want the model to learn to generate the `<think>` tag itself. Instead, we optionally allow `<think>` to be treated as part of the assistant tag span boundary, so masking cleanly includes or excludes it.

The behavior is controlled by two conditions:
1. `mask_think_tag` (passed by the user)
   - `False`: masking ignores `<think>` completely. Useful if the user wants `<think>` included, the model is not a "thinking" model, or for other design/training reasons.
   - `True`: masking may append `<think>` to the assistant tag, depending on the passed sample's content.
2. `has_thinking_content(messages)` (sample check)
   - Returns `True` if the assistant message has an explicit `thought` field, or `"<think>"` inside its content string.

The `<think>` suffix is appended to the assistant tag only when both are true.

Additional notes:
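One way to picture the two-condition gating described above is this minimal sketch (the flag and helper names mirror the PR's `mask_think_tag` and `has_thinking_content`, but the wrapper function and its defaults are hypothetical):

```python
def resolve_asst_tag(messages, mask_think_tag,
                     asst_tag="<|start_of_role|>assistant<|end_of_role|>",
                     think_tag="\n<think>\n"):
    """Append the think tag to the assistant tag only when the user opted in
    AND the sample actually contains thinking content."""
    def has_thinking_content(msgs):
        # Simplified stand-in for the PR's helper: explicit 'thought' field
        # or an inline '<think>' tag in an assistant message.
        return any(
            m.get("role") == "assistant"
            and (m.get("thought")
                 or (isinstance(m.get("content"), str) and "<think>" in m["content"]))
            for m in msgs
        )

    if mask_think_tag and has_thinking_content(messages):
        asst_tag += think_tag
    return asst_tag
```

With this gating, a plain (no-think) sample keeps the default tag even when `mask_think_tag=True`, which is what makes mixed think/no-think datasets safe.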
Other minor changes:
Although it is common for `add_special_tokens` to be set to `False` during tokenizer encoding, we explicitly set `add_special_tokens=False` here to disable the injection of tokenizer-specific BOS/EOS or SEP tokens that would otherwise be automatically added to the encoded sequence. This is essential for correct behavior in tasks like the span matching performed here.

Why is this necessary:
- Some tokenizers default to `add_bos_token=True`, which means they automatically prepend a BOS token. This would shift all token positions by 1 and break any span matching logic.
- Other tokenizers default to `add_bos_token=False`, so they behave differently, leading to inconsistent behavior across model families if not explicitly handled.
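The position-shift problem can be illustrated with a toy example (a contrived sketch using a fake whitespace tokenizer, not the PR's actual tokenizer; it only shows why an auto-prepended BOS breaks span matching):

```python
def encode(text, add_special_tokens):
    """Toy 'tokenization': one token per whitespace-separated piece."""
    tokens = text.split()
    if add_special_tokens:
        tokens = ["<bos>"] + tokens  # mimics add_bos_token=True behavior
    return tokens

def find_span(haystack_tokens, needle_tokens):
    """Return the start index of needle_tokens inside haystack_tokens, or -1."""
    n = len(needle_tokens)
    for i in range(len(haystack_tokens) - n + 1):
        if haystack_tokens[i:i + n] == needle_tokens:
            return i
    return -1

full = "user hello assistant hi there"
tag = "assistant"

seq = encode(full, add_special_tokens=True)      # ["<bos>", "user", ...]
# Encoding the tag WITH special tokens injects "<bos>" into the needle,
# so it no longer matches the tag's tokens inside the full sequence.
shifted = encode(tag, add_special_tokens=True)   # ["<bos>", "assistant"]
clean = encode(tag, add_special_tokens=False)    # ["assistant"]
```

Here `find_span(seq, shifted)` fails (returns -1) while `find_span(seq, clean)` finds the tag, which is exactly why the encoding of span boundaries must not inject special tokens.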