Thank you for this impressive and well-executed work. I truly appreciate the clarity and rigor of the experiments.
I have a question regarding the figures: I understand the relationship between training loss and training steps, and also how to derive the number of training tokens from the steps, given the number of tokens processed per step. Based on the provided hyper-parameters:
batch_size = 5
block_size = 1024
gradient_accumulation_steps = 12
# this makes total number of tokens be 300B
max_iters = 100000
Multiplying these values yields a total of 5 × 1024 × 12 × 100,000 ≈ 6.14B tokens, which is far below the 5000B tokens shown on the x-axis of the training loss curves. Could you clarify how the 5000B token count is computed in this context?
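For reference, here is the quick sanity check I used for the arithmetic (my own sketch, not code from the paper; variable names mirror the hyper-parameters quoted above):

```python
# Hyper-parameters as quoted above
batch_size = 5
block_size = 1024
gradient_accumulation_steps = 12
max_iters = 100_000

# Tokens consumed per optimizer step, then total over training
tokens_per_step = batch_size * block_size * gradient_accumulation_steps
total_tokens = tokens_per_step * max_iters
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.3f}B)")
# → 6,144,000,000 tokens (~6.144B)
```

This is how I arrived at the ~6.14B figure; please let me know if one of these hyper-parameters is interpreted differently (e.g., per-GPU batch size across many workers), since that would change the product.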