2 changes: 1 addition & 1 deletion psyche-book/src/development/models.md
@@ -31,7 +31,7 @@ cargo run --example train -- \

## Adding a new model type

The `train` example currently asssumes your model is a Llama or Deepseek v2/v3 model, and instantiates it via `(LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained`.
The `train` example currently assumes your model is a Llama or Deepseek v2/v3 model, and instantiates it via `(LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained`.

We currently only support causal language models - to implement a new one, you can create a file similar to `llama_for_causal_lm` and implement your model, ensuring you provide a trait impl for `CausalLM`.

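The hunk above points at docs describing how to add a model type by implementing the `CausalLM` trait. As a rough illustration of the shape such an impl takes — the real trait lives in the Psyche codebase, so the method name (`forward`), the `Logits` type, and `MyModelForCausalLM` here are all illustrative assumptions, not the actual API:

```rust
// Illustrative sketch only: the real `CausalLM` trait is defined in the
// Psyche codebase; the names and signatures below are assumptions.

/// Next-token logits over the vocabulary.
struct Logits(Vec<f32>);

trait CausalLM {
    /// Map a sequence of token ids to next-token logits.
    fn forward(&self, token_ids: &[u32]) -> Logits;
}

/// A toy model that always predicts a uniform distribution,
/// standing in for a file like `llama_for_causal_lm`.
struct MyModelForCausalLM {
    vocab_size: usize,
}

impl CausalLM for MyModelForCausalLM {
    fn forward(&self, _token_ids: &[u32]) -> Logits {
        let p = 1.0 / self.vocab_size as f32;
        Logits(vec![p; self.vocab_size])
    }
}
```

The point is only that a new causal LM plugs in by providing the trait impl, alongside a `from_pretrained`-style constructor as the `train` example expects.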
4 changes: 2 additions & 2 deletions psyche-book/src/enduser/join-run.md
@@ -88,7 +88,7 @@ We recommend using a dedicated RPC service such as [Helius](https://www.helius.d

## Additional config variables

In general it's not neccesary to change these variables to join a run since we provide sensible defaults,
In general it's not necessary to change these variables to join a run since we provide sensible defaults,
though you might need to.

**`NVIDIA_DRIVER_CAPABILITIES`** - An environment variable that the NVIDIA Container Toolkit uses to determine which compute capabilities should be provided to your container. It is recommended to set it to 'all', e.g. `NVIDIA_DRIVER_CAPABILITIES=all`.
@@ -101,7 +101,7 @@ though you might need to.
**`TENSOR_PARALLELISM`** - Number of GPUs to distribute the model across, this lets you train a model you can't fit on one single GPU.

- If you have 1 GPU, set this to `1`
- If your have `n` GPUs you can distribute the model across all of them by setting it to `n`.
- If you have `n` GPUs you can distribute the model across all of them by setting it to `n`.

**`MICRO_BATCH_SIZE`** - Number of samples processed per GPU per training step

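Taken together, the join-run variables touched by this diff might be set like so before launching the client. The exports match the documented recommendations; the launch command itself is setup-specific, so the `docker run` line is only an illustrative placeholder, not the actual invocation:

```shell
# Expose all GPU capabilities to the container (the recommended value).
export NVIDIA_DRIVER_CAPABILITIES=all

# Distribute the model across 2 GPUs...
export TENSOR_PARALLELISM=2

# ...processing 8 samples per GPU per training step.
export MICRO_BATCH_SIZE=8

# Placeholder only — the real launch command depends on your setup:
# docker run --gpus all \
#   -e NVIDIA_DRIVER_CAPABILITIES -e TENSOR_PARALLELISM -e MICRO_BATCH_SIZE \
#   <psyche-client-image>
```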
2 changes: 1 addition & 1 deletion psyche-book/src/enduser/run-config.md
@@ -37,7 +37,7 @@ round_witness_time = 1
# than 100.
min_clients = 1

# minumum number of clients required before we transition from WaitingForMembers to Warmup.
# minimum number of clients required before we transition from WaitingForMembers to Warmup.
# must be equal to or greater than min_clients.
init_min_clients = 1

2 changes: 1 addition & 1 deletion psyche-book/src/explain/general-workflow.md
@@ -103,7 +103,7 @@ Upon exiting the _Cooldown_ phase, the Coordinator transitions to the next epoch

### It all comes together

Here's is an overview of how the state of the run can change depending on the situation:
Here's an overview of how the state of the run can change depending on the situation:

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'35px'}}}%%
4 changes: 2 additions & 2 deletions psyche-book/src/explain/glossary.md
@@ -82,7 +82,7 @@ A `RunState` indicating that the training run has completed its configured `tota
CI (Continuous Integration) service based on `Nix`, used by `Psyche`.

**Health Check**
A verification procedure (`health_check()`) initiated by designated `witness` clients. Its purpose is to monitor peer clients and confirm they are actively processing their assigned training batches. When a witness client detects a peer that appears unresponsive or failing (`unhealthy`), it notifies the central coordinator. The coordinator independently verifies the status of the reported peer by running its own health check. If this verification is verified then the peer is marked as `unhealthy` and is kicked.
A verification procedure (`health_check()`) initiated by designated `witness` clients. Its purpose is to monitor peer clients and confirm they are actively processing their assigned training batches. When a witness client detects a peer that appears unresponsive or failing (`unhealthy`), it notifies the central coordinator. The coordinator independently verifies the status of the reported peer by running its own health check. If the verification confirms the peer is unhealthy, the peer is marked as `unhealthy` and removed from the run.

**Healthy**
The desired `ClientState`, indicating the client is connected, responsive, and participating correctly in the training process. Only Healthy clients typically receive `Rewards`.
@@ -118,7 +118,7 @@ A feature that allows progressing early from the `RoundTrain` phase to the `Witn
A `RunState` where the training process is temporarily stopped by manual intervention. Can be resumed later.

**P2P**
Peer-to-Peer, meaning a client acts both as a client and as a server, sharing data with it's peers. This is the intended way of data-sharing during a stable run.
Peer-to-Peer, meaning a client acts both as a client and as a server, sharing data with its peers. This is the intended way of data-sharing during a stable run.

**Psyche**
Nous Research's set of systems that enable distributed training of transformer-based AI models over the internet.
2 changes: 1 addition & 1 deletion psyche-book/src/explain/index.md
@@ -8,7 +8,7 @@ The core system is composed of three main actors:

- **[Data Provider](./data-provider.md)**: Each run requires training data. This data could be served by the Psyche Data Provider server, over HTTP, or loaded from local copies of a dataset.

Psyche provides two different implementations of the network, one for [decentralized](./general-workflow.md#decentralized-backend) runs that use the Solana Blockchain with the coordinator running in it and another for [centralized](./general-workflow.md#centralized-backend) runs that use the Coordinator as a regular TCP server and mostly is mostly used to test local runs and as a dev oriented tool.
Psyche provides two different implementations of the network, one for [decentralized](./general-workflow.md#decentralized-backend) runs that use the Solana Blockchain with the coordinator running in it and another for [centralized](./general-workflow.md#centralized-backend) runs that use the Coordinator as a regular TCP server and is mostly used to test local runs and as a dev oriented tool.

## Sample topologies
