Hi everyone, I was following the LLM course tutorial and got stuck on something.
Problem
There's a code snippet that tokenizes the sentence pairs from the MRPC dataset:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
According to the tutorial this should work, but I get the following error:
Traceback (most recent call last):
  File "/Users/leepil/projects/hf/main.py", line 10, in <module>
    tokenized_dataset = tokenizer(
        raw_datasets["train"]["sentence1"],
        ...<2 lines>...
        truncation=True,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_base.py", line 2559, in __call__
    encodings = self._encode_plus(
        text=text,
        ...<4 lines>...
        **all_kwargs,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_tokenizers.py", line 799, in _encode_plus
    raise ValueError(
    ...<2 lines>...
    )
ValueError: text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) or `list[list[str]]` (batch of pretokenized examples) or `list[tuple[list[str], list[str]]]` (batch of pretokenized sequence pairs).
After investigating the error, I found that PreTrainedTokenizerBase in transformers/tokenization_utils_base.py only accepts input of the following types:
text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None,
However, the type of raw_datasets["train"]["sentence1"] is apparently datasets.arrow_dataset.Dataset, which is not one of the accepted types.
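A minimal sketch of what seems to be happening, in plain Python with no external libraries. Both names are assumptions made for illustration: `is_valid_text_input` is a simplified stand-in for the kind of isinstance-based check the error message implies (it is not the actual transformers code), and `FakeColumn` is a hypothetical stand-in for the object returned by indexing the dataset, which behaves like a sequence but is not a `list`.

```python
def is_valid_text_input(text) -> bool:
    """Simplified, hypothetical check for str, list[str] or list[list[str]]."""
    if isinstance(text, str):
        return True
    if isinstance(text, (list, tuple)):
        if len(text) == 0:
            return True
        first = text[0]
        if isinstance(first, str):
            return True
        if isinstance(first, (list, tuple)):
            return len(first) == 0 or isinstance(first[0], str)
    return False


class FakeColumn:
    """Hypothetical stand-in for a dataset column: sequence-like, but not a list."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]


column = FakeColumn(["first sentence", "second sentence"])
print(is_valid_text_input(column))        # False: not a str, list or tuple
print(is_valid_text_input(list(column)))  # True: a plain list of str
```

If the real check works along these lines, any sequence-like object that is not literally a `list` or `tuple` would be rejected even though it contains only strings.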
Suggestion
Given that, wouldn't it be better to wrap the input arguments in list() so the tokenizer can be called properly, like this:
tokenized_dataset = tokenizer(
    list(raw_datasets["train"]["sentence1"]),
    list(raw_datasets["train"]["sentence2"]),
    padding=True,
    truncation=True,
)
This code gives me the output I expected. I'm very new to this library, so please let me know if I missed something.
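The same idea could be wrapped in a small helper so the conversion happens in one place. This is only a sketch under stated assumptions: `tokenize_pairs` and `dummy_tokenizer` are hypothetical names that exist only for this example, and the dummy merely records the types it receives so the snippet runs without transformers installed. A real tokenizer would be passed in its place.

```python
from collections.abc import Callable, Iterable


def tokenize_pairs(tokenizer: Callable, first: Iterable[str], second: Iterable[str], **kwargs):
    """Materialize both columns as plain lists, then call the tokenizer."""
    first_list, second_list = list(first), list(second)
    if len(first_list) != len(second_list):
        raise ValueError("sentence columns must have the same length")
    return tokenizer(first_list, second_list, **kwargs)


def dummy_tokenizer(text, text_pair, **kwargs):
    """Hypothetical stand-in that only reports the argument types it received."""
    return {"text_type": type(text).__name__, "pair_type": type(text_pair).__name__}


# Iterators stand in for any column-like object that is not a list.
result = tokenize_pairs(
    dummy_tokenizer,
    iter(["a", "b"]),
    iter(["c", "d"]),
    padding=True,
    truncation=True,
)
print(result)  # {'text_type': 'list', 'pair_type': 'list'}
```

Note that calling list() on a whole column loads every value into memory at once, which is fine for a small dataset like this one but may not suit larger ones.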