setup: project dependencies + first contribution #1

@clay-arras

Prerequisites

You need to know Git. If you have never used Git before, please read up on it first: link
Before you start, you should be familiar with git add, git commit, git push, and git pull.


Setup

Before we start, if you are on Windows, please install WSL: link
If you are on Mac, install Homebrew, which is a package manager for Mac: link

Python Tooling (not urgent)

We use uv as our package manager: link
To install the pre-commit hooks (for Python code in src/ and future tests): link

C++ Tooling

Run this command in the project root: cmake -S . -B build
This generates the Makefiles required by the build process. There is nothing to build right now.

You also need to know the basics of cmake. For now, these are the only commands you need:

  • cmake --build build (this builds the project)
  • cmake --build build --target format (this formats the code with clang-format)
  • cmake --build build --target lint (this lints the code with clang-tidy)

These depend on clang-format and clang-tidy, which are our formatter and linter, respectively.
Installation is below:

  • WSL: sudo apt update && sudo apt install clang-format clang-tidy
  • macOS: brew install clang-format llvm (llvm includes clang-tidy). You may need to create a symlink or add clang-tidy to your PATH on Mac; ask AI how to do this if you don't know how.
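To confirm the install worked, a quick check like the one below can help. It is only a sketch: check_tool is a hypothetical helper name, not something in the project, and the macOS branch assumes Homebrew's llvm is keg-only (not linked onto PATH by default), which is why clang-tidy may not be found right after brew install.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1 ($(command -v "$1"))"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# macOS only (assumption: llvm came from Homebrew and is keg-only).
# Put its bin directory on PATH for this shell session so clang-tidy resolves.
if [ "$(uname)" = "Darwin" ] && command -v brew >/dev/null 2>&1; then
    PATH="$(brew --prefix llvm)/bin:$PATH"
    export PATH
fi

# Count how many of the tools this guide relies on are still missing.
missing=0
for tool in cmake clang-format clang-tidy; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

To make the PATH change permanent on Mac, the same export line would go in your shell startup file (for example ~/.zshrc).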

Project Structure

  • csrc: contains all of the C++ code. All headers are in csrc/include, and all source files are in csrc/src/libtokenizer.
  • src: contains the Python code.
  • csrc/core: should contain the tokenizer logic itself.
  • runtime module: a separate module that saves the merges periodically (i.e. every K merges) for callbacks/logging. We can plan this out in the future.
  • data ingestion: a separate module for loading data. There's a chance that not all of the data will fit in main memory at once. We can plan this out in the future.

Note: Eventually we will bind the C++ code to Python with pybind. This will happen in csrc/bindings.


Task: First Contribution

  1. Switch to a new branch: git checkout -b <your-branch-name>. As a general practice, try to make branch names concise but informative.
  2. Edit pyproject.toml and include your name and email under "authors".
  3. Run git add . and git commit -m "your-commit-message".
  4. Run git push -u origin <your-branch-name>.
  5. On GitHub, create a Pull Request.
  6. Ask a team member to approve your Pull Request. Once merged, you should see your changes reflected.
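If you want to rehearse steps 1 to 4 before touching the real repository, the sketch below runs them in a throwaway sandbox. Everything in it is a stand-in: the bare repository acts as a fake "origin", the branch name add-author-your-name is just an example, and the pyproject.toml written here is a hypothetical minimal file (the real one already exists, so you would only edit its authors entry).

```shell
#!/bin/sh
set -eu

# Throwaway sandbox: a bare repo stands in for the remote, plus a working clone.
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet --bare origin.git
git clone --quiet origin.git work 2>/dev/null
cd work
git config user.name "Your Name"
git config user.email "you@example.com"

# Step 1: switch to a new branch (example name, keep yours concise but informative).
git checkout --quiet -b add-author-your-name

# Step 2: add yourself under "authors". This file is a hypothetical stand-in
# for the project's real pyproject.toml.
cat > pyproject.toml <<'EOF'
[project]
name = "example"
version = "0.0.0"
authors = [
    { name = "Your Name", email = "you@example.com" },
]
EOF

# Step 3: stage and commit.
git add .
git commit --quiet -m "setup: add Your Name to authors"

# Step 4: push and set the upstream so later pushes need no arguments.
git push --quiet -u origin add-author-your-name

# The branch should now be listed on the (fake) remote.
git ls-remote --heads origin
```

Steps 5 and 6 happen on GitHub itself, so they are not part of the rehearsal.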

Resolving Conflicts

You may need to rebase. Learn how to rebase here: link
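For a feel of what a rebase does, here is a sandbox rehearsal you can run without touching the project. The branch names main and my-feature are assumptions (this guide does not say what the project's default branch is), and the bare repository is a stand-in for the real remote. The rehearsal uses separate files, so no conflict occurs.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox with a fake remote and a working clone.
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet --bare origin.git
git clone --quiet origin.git work 2>/dev/null
cd work
git config user.name "Your Name"
git config user.email "you@example.com"

# Seed the default branch (assumed to be called main).
git checkout --quiet -b main
echo "base" > README.md
git add README.md
git commit --quiet -m "initial commit"
git push --quiet -u origin main

# Your work happens on a feature branch.
git checkout --quiet -b my-feature
echo "feature" > feature.txt
git add feature.txt
git commit --quiet -m "add feature"

# Meanwhile main moves on, simulating a teammate's merged Pull Request.
git checkout --quiet main
echo "teammate" > teammate.txt
git add teammate.txt
git commit --quiet -m "teammate change"
git push --quiet origin main

# The rebase: fetch the latest main and replay your commits on top of it.
git checkout --quiet my-feature
git fetch --quiet origin
git rebase --quiet origin/main

# Your commit now sits directly on top of the teammate's commit.
git --no-pager log --oneline
```

When a real rebase stops on a conflict, edit the conflicting files, run git add on them, then git rebase --continue (or git rebase --abort to back out). If the branch was already pushed, the rewritten history needs git push --force-with-lease.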


That should be enough to get you started. If I missed anything or if anything is confusing, just let me know.