All changes we make to the assignment code or PDF will be documented in this file.
- code: fix
test_get_batchto handle "AssertionError: Torch not compiled with CUDA enabled". - handout: clarify that gradient clipping norm is calculated over all the parameters.
- code: fix gradient clipping test comparing wrong tensors
- code: test skipping parameters with no gradient and properly computing norm with multiple parameters
- handout: edit expected TinyStories run time to 30-40 minutes.
- handout: add more details about how to use
np.memmapor themmap_modeflag tonp.load. - code: fix
get_tokenizer()docstring. - handout: specify that problem
main_experimentshould use the same settings as TinyStories. - code: replace mentions of layernorm with RMSNorm.
- handout: clarify example of preferring lexicographically greater merges to specify that we want tuple comparison.
- handout: fix expected number of training tokens for TinyStories, should be 327,680,000.
- code: fix typo in
run_get_lr_cosine_schedulereturn docstring. - code: fix typo in
test_tokenizer.py
- code: skip
Tokenizermemory-related tests on non-Linux systems, since support for RLIMIT_AS is inconsistent. - code: reduce increase atol on end-to-end Transformer forward pass tests.
- code: remove dropout in model-related tests to improve determinism across platforms.
- code: add
attn_pdroptorun_multihead_self_attentionadapter. - code: clarify
{q,k,v}_projdimension orders in the adapters. - code: increase atol on cross-entropy tests
- code: remove unnecessary warning in
test_get_lr_cosine_schedule
- handout: fix signature of
Tokenizer.__init__to includeself. - handout: mention that
Tokenizer.from_filesshould be a class method. - handout: clarify list of model hyperparameters listed in
adamwAccounting. - handout: clarify that
adamwAccounting(b) considers a GPT-2 XL-shaped model (with our architecture), not necessarily the literal GPT-2 XL model. - handout: moved softmax problem to where softmax is first mentioned (Scaled Dot-Product Attention, Section 3.4.3)
- handout: removed redundant initialization (t = 0) in AdamW pseudocode
- handout: added resources needed for BPE training
- handout: edit
adamWAccounting, part (d) to define MFU and mention that the backward pass is typically assumed to have twice the FLOPS of the forward pass. - handout: provide a hint about desired behavior when a user passes in input IDs
to
Tokenizer.decodethat correspond to invalid UTF-8 bytes.
- handout: added some more information about submitting to the leaderboard.
- code: add a note to README.md that pull requests and issues are welcome and encouraged.
- handout: edit motivation for pre-tokenization to include a note about desired behavior with tokens that differ only in punctuation.
- handout: remove total number of points after each section.
- handout: mention that large language models (e.g., LLaMA and GPT-3) often use AdamW betas of (0.9, 0.95) (in contrast to the PyTorch defaults of (0.9, 0.999)).
- handout: explicitly mention the deliverable in the
adamwproblem. - code: rename
test_serialization::test_checkpointtotest_serialization::test_checkpointingto match the handout. - code: slightly relax the time limit in
test_train_bpe_speed.
- code: fix an issue in the
train_bpetests where the expected merges and vocab did not properly reflect tiebreaking with the lexicographically greatest pair.- This occurred because our reference implementation (which checks against HF) follows the GPT-2 tokenizer in remapping bytes that aren't human-readable to printable unicode strings. To match the HF code, we were erroneously tiebreaking on this remapped unicode representation instead of the original bytes.
- handout: fix the expected number of non-embedding parameters for model with recommended TinyStories hyperparameters (section 7.2).
- handout: replace
<|endofsequence|>with<|endoftext|>in thedecodingproblem. - code: fix the setup command (
pip install -e .'[test]')to improve zsh compatibility. - handout: fix various trivial typos and formatting errors.
Initial release.