Thank you for this impressive and well-executed work. I truly appreciate the clarity and rigor of the experiments.
I have a question regarding the figures: I understand the relationship between training loss and training steps, and also how to derive the number of training tokens from the steps, given the number of tokens processed per step. Based on the provided hyper-parameters:
batch_size = 5
block_size = 1024
gradient_accumulation_steps = 12
# this makes total number of tokens be 300B
max_iters = 100000
Multiplying these values yields a total of 5 × 1024 × 12 × 100,000 ≈ 6.14B tokens, which is far below the 5000B tokens shown on the x-axis of the training loss curves. Could you clarify how the 5000B token count is computed in this context?
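For reference, here is the quick sanity check I used for the arithmetic (my own sketch, not code from the paper; variable names mirror the hyper-parameters quoted above):

```python
# Hyper-parameters as quoted above
batch_size = 5
block_size = 1024
gradient_accumulation_steps = 12
max_iters = 100_000

# Tokens consumed per optimizer step, then total over training
tokens_per_step = batch_size * block_size * gradient_accumulation_steps
total_tokens = tokens_per_step * max_iters
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.3f}B)")
# → 6,144,000,000 tokens (~6.144B)
```

This is how I arrived at the ~6.14B figure; please let me know if one of these hyper-parameters is interpreted differently (e.g., per-GPU batch size across many workers), since that would change the product.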