Code error in ch 3, Processing the data #1181

@junepil

Hi everyone, I was following the LLM course tutorial and got stuck on something.

Problem

There's a code snippet that tokenizes the sentence pairs of the MRPC dataset:

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

According to the tutorial this should work, but I encountered the following error:

Traceback (most recent call last):
  File "/Users/leepil/projects/hf/main.py", line 10, in <module>
    tokenized_dataset = tokenizer(
      raw_datasets["train"]["sentence1"],
    ...<2 lines>...
      truncation=True,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_base.py", line 2559, in __call__
    encodings = self._encode_plus(
        text=text,
    ...<4 lines>...
        **all_kwargs,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_tokenizers.py", line 799, in _encode_plus
    raise ValueError(
    ...<2 lines>...
    )
ValueError: text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) or `list[list[str]]` (batch of pretokenized examples) or `list[tuple[list[str], list[str]]]` (batch of pretokenized sequence pairs).

After investigating the error, I found that PreTrainedTokenizerBase in transformers/tokenization_utils_base.py only accepts input of the following types:

text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None,

And apparently the type of raw_datasets["train"]["sentence1"] is datasets.arrow_dataset.Dataset.
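
The mismatch can be illustrated without installing either library. The following is a minimal sketch, not the actual transformers code: `FakeColumn` and `is_valid_text_input` are hypothetical stand-ins, and the check is an assumed simplification of the validation that raises the ValueError above.

```python
from collections.abc import Sequence


class FakeColumn(Sequence):
    """Hypothetical stand-in for a dataset column: sequence-like, but not a list."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


def is_valid_text_input(text):
    """Assumed simplification of the tokenizer check: str, or list/tuple of str."""
    if isinstance(text, str):
        return True
    if isinstance(text, (list, tuple)):
        return len(text) == 0 or isinstance(text[0], str)
    return False


column = FakeColumn(["first sentence", "second sentence"])
print(is_valid_text_input(column))        # False: iterable, but not a list
print(is_valid_text_input(list(column)))  # True: a plain list of str
```

An object can support indexing and iteration and still fail an `isinstance(..., (list, tuple))` check, which would explain why the original call is rejected.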

Suggestion

Because of that, wouldn't it be better to wrap the input arguments in list() so the tokenizer can be called properly, like this:

tokenized_dataset = tokenizer(
  list(raw_datasets["train"]["sentence1"]),
  list(raw_datasets["train"]["sentence2"]),
  padding=True,
  truncation=True,
)

This code gives me the output I expected. I'm very new to this library, so please let me know if I missed something.
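
If the conversion is needed in several places, it could be wrapped in a small helper that also verifies the element types. This is only a sketch: `to_text_list` is a hypothetical name and not part of transformers or datasets.

```python
def to_text_list(column):
    """Hypothetical helper: materialize any iterable of strings as a plain list."""
    texts = list(column)
    for i, item in enumerate(texts):
        if not isinstance(item, str):
            raise TypeError(f"item {i} is {type(item).__name__}, expected str")
    return texts


# Works for any iterable, for example a tuple or a generator of strings.
sentences = to_text_list(s for s in ("first sentence", "second sentence"))
print(sentences)

# Assumed usage with the snippet above (requires the course's objects):
# tokenizer(
#     to_text_list(raw_datasets["train"]["sentence1"]),
#     to_text_list(raw_datasets["train"]["sentence2"]),
#     padding=True,
#     truncation=True,
# )
```

The explicit type check makes a missing or non-string entry fail early with a message that points at the offending index, rather than inside the tokenizer.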
