Description
Prerequisites
You need to know Git. If you have never used Git before, please read up on this: link
Before you start, you should be familiar with git add, git commit, git push, and git pull.
Setup
Before we start, if you are on Windows, please install WSL: link
If you are on Mac, install Homebrew, which is a package manager for Mac: link
Python Tooling (not urgent)
We use uv as our package manager: link
To install the pre-commit hooks (for Python code in src/ and future tests): link
C++ Tooling
Run this command in the project root: cmake -S . -B build. This generates the Makefiles required by the build process and places them in the build directory.
There is nothing to build right now.
You also need to know the basics of cmake. For now, these are the only commands you need:
- cmake --build build (this builds the project)
- cmake --build build --target format (this formats the code)
- cmake --build build --target lint (this lints the code)
The format and lint targets depend on clang-format and clang-tidy, which are our formatter and linter, respectively.
Installation is below:
- WSL: sudo apt update && sudo apt install clang-format
- macOS: brew install clang-format llvm (llvm includes clang-tidy). You may need to create a symlink or export the path on Mac; ask AI how to do this if you don't know how.
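On macOS, Homebrew installs llvm as "keg-only", so its binaries (including clang-tidy) are not linked into your PATH by default. The sketch below shows one way to do the "export path" option. It assumes llvm was installed with Homebrew, and check_tool is a hypothetical helper written for this example, not part of the project. On WSL the brew step is simply skipped.

```shell
# Hypothetical helper (not part of the project): report whether a tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# Assumption: llvm was installed with Homebrew, which keeps it out of PATH (keg-only).
# Append its bin directory so clang-tidy can be found in this shell session.
# Appending (rather than prepending) keeps the system compiler first in PATH.
if command -v brew >/dev/null 2>&1; then
  export PATH="$PATH:$(brew --prefix llvm)/bin"
fi

check_tool clang-format
check_tool clang-tidy
```

The export only lasts for the current shell session. To keep it, add the export line to your shell startup file (for example ~/.zshrc, the default shell config on recent macOS).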
Project Structure
- csrc: contains all of the C++ code. All headers are in csrc/include, and all source files are in csrc/src/libtokenizer.
- src: contains the Python code.
- csrc/core: should contain the tokenizer logic itself.
- runtime module: a separate module that saves the merges intermittently (i.e. every K merges) for callbacks/logging. We can plan this out in the future.
- data ingestion: a separate module for loading data. There's a chance that not all of the data will fit in main memory at the same time. We can plan this out in the future.
Note: Eventually we will bind the C++ code to Python with pybind. This will happen in csrc/bindings.
Task: First Contribution
- Switch to a new branch: git checkout -b <your-branch-name>. As a general practice, try to make branch names concise but informative.
- Edit pyproject.toml and include your name and email under "authors".
- Run git add . and git commit -m "your-commit-message".
- Run git push -u origin <your-branch-name>.
- On GitHub, create a Pull Request.
- Ask a team member to approve your Pull Request. Once merged, you should see your changes reflected.
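The command-line part of the steps above can be sketched as a single shell function. This is only an illustration: first_contribution is a hypothetical helper, not something in the repository, and the branch name and commit message in the usage line are placeholders. It assumes the remote is named origin, as in the steps above.

```shell
# Hypothetical helper (not part of the project): branch, commit, and push in one go.
# Run it from the project root AFTER editing pyproject.toml; uncommitted edits
# carry over to the new branch.
first_contribution() {
  branch="$1"    # e.g. add-jane-to-authors (placeholder name)
  message="$2"   # commit message
  git checkout -b "$branch" &&
    git add . &&
    git commit -m "$message" &&
    git push -u origin "$branch"
}

# Usage (uncomment and adjust):
# first_contribution add-jane-to-authors "Add Jane Doe to authors"
```

Each command is chained with &&, so the function stops at the first step that fails instead of pushing a half-finished change. Creating the Pull Request and asking for approval still happen on GitHub.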
Resolving Conflicts
If your branch conflicts with the branch you are merging into, you may need to rebase. Learn how to rebase here: link
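As a rough sketch of what a rebase looks like in practice: the helper below fetches the latest remote state and replays your branch on top of it. rebase_onto_default is a hypothetical name, and the base branch being called main is an assumption; pass a different name if the repository uses one.

```shell
# Hypothetical helper (not part of the project): replay the current branch on top
# of the latest remote base branch.
# Assumption: the base branch is called "main"; pass another name if it differs.
rebase_onto_default() {
  base="${1:-main}"
  git fetch origin &&
    git rebase "origin/$base"
}

# Usage (uncomment and adjust):
# rebase_onto_default main
#
# If the rebase stops on a conflict:
#   1. edit the conflicted files and remove the conflict markers
#   2. git add <file>
#   3. git rebase --continue    (or: git rebase --abort to go back)
# If the branch was already pushed, update it afterwards with:
#   git push --force-with-lease
```

A rebase rewrites the commits on your branch, which is why a branch that was already pushed needs a force push afterwards. --force-with-lease refuses to overwrite the remote branch if someone else has pushed to it in the meantime.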
That should be enough to get you started. If I missed anything, or if anything is confusing, just let me know.