Code error in ch 3, Processing the data

Hi everyone, I was working through the LLM course tutorial and got stuck on something.

# Problem

There's a code snippet about [tokenizing the dataset](https://huggingface.co/learn/llm-course/en/chapter3/2#preprocessing-a-dataset) from *MRPC*:

```py
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
```
According to the tutorial this should work, but I get the following error:
```text
Traceback (most recent call last):
  File "/Users/leepil/projects/hf/main.py", line 10, in <module>
    tokenized_dataset = tokenizer(
      raw_datasets["train"]["sentence1"],
    ...<2 lines>...
      truncation=True,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_base.py", line 2559, in __call__
    encodings = self._encode_plus(
        text=text,
    ...<4 lines>...
        **all_kwargs,
    )
  File "/Users/leepil/projects/hf/.venv/lib/python3.14/site-packages/transformers/tokenization_utils_tokenizers.py", line 799, in _encode_plus
    raise ValueError(
    ...<2 lines>...
    )
ValueError: text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) or `list[list[str]]` (batch of pretokenized examples) or `list[tuple[list[str], list[str]]]` (batch of pretokenized sequence pairs).
```

After investigating the error, I found that `PreTrainedTokenizerBase` in `transformers/tokenization_utils_base.py` only accepts a `text` input of one of these types:
```py
text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None,
```

But apparently `raw_datasets["train"]["sentence1"]` is of type `datasets.arrow_dataset.Dataset`, not a plain `list[str]`, so it fails that check.
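To illustrate what I think is going on, here is a minimal sketch that needs neither `transformers` nor `datasets`. `ColumnLike` and `looks_like_text_batch` are hypothetical names I made up for this example; the check is a simplified stand-in for the tokenizer's validation, not the library's actual code.

```python
from collections.abc import Iterator


class ColumnLike:
    """Hypothetical stand-in for a dataset column: iterable and indexable, but not a list."""

    def __init__(self, values: list[str]) -> None:
        self._values = values

    def __iter__(self) -> Iterator[str]:
        return iter(self._values)

    def __len__(self) -> int:
        return len(self._values)

    def __getitem__(self, index: int) -> str:
        return self._values[index]


def looks_like_text_batch(obj: object) -> bool:
    """Simplified check (an assumption): accept a str, or a list/tuple of str."""
    if isinstance(obj, str):
        return True
    if isinstance(obj, (list, tuple)):
        return all(isinstance(item, str) for item in obj)
    return False


column = ColumnLike(["first sentence", "second sentence"])
print(looks_like_text_batch(column))        # False: iterable, but not a list
print(looks_like_text_batch(list(column)))  # True: a plain list of str
```

If the real check works along these lines, any sequence-like object that is not a `list` or `tuple` would be rejected even though it holds only strings.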

# Suggestion
Given that, wouldn't it be better to wrap the input arguments in `list()` so the tokenizer can be called properly, like this:
```py
tokenized_dataset = tokenizer(
  list(raw_datasets["train"]["sentence1"]),
  list(raw_datasets["train"]["sentence2"]),
  padding=True,
  truncation=True,
)
```
This code gives me the output I expected. I'm very new to this library, so please let me know if I missed something.
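If a bare `list()` feels too implicit, a small helper could do the conversion and fail early with a clearer message. This is only a sketch: `to_text_list` is a hypothetical name, not something from `transformers` or `datasets`, and a generator stands in for the dataset column so the snippet runs on its own.

```python
from collections.abc import Iterable


def to_text_list(column: Iterable[object]) -> list[str]:
    """Materialize a column as list[str], failing early on non-string items."""
    texts = list(column)
    for position, item in enumerate(texts):
        if not isinstance(item, str):
            raise TypeError(
                f"item {position} is {type(item).__name__}, expected str"
            )
    return texts


# A generator stands in for the dataset column here (assumption for the demo).
sentences = to_text_list(s for s in ["hello there", "general kenobi"])
print(sentences)
```

With the real dataset, the call would then pass `to_text_list(raw_datasets["train"]["sentence1"])` and `to_text_list(raw_datasets["train"]["sentence2"])` to the tokenizer in place of the `list(...)` arguments.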

Code error in ch 3, Processing the data #1181
