
improve model debugging workflow#611

Open
ethernet8023 wants to merge 8 commits into main from train-refactor

Conversation

@ethernet8023
Contributor

@ethernet8023 ethernet8023 commented Mar 2, 2026

  • adds the ability to run `nix run .#train -- config config.toml` on a regular Psyche config TOML, to test things locally before deploying to the network
  • adds a `dump-config` command to run-manager that dumps the config for an existing run out as a TOML to be edited / played with.
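
To illustrate the intended loop (the `dump-config` invocation and every key below are hypothetical, shown for shape only; the dumped file is the real source of truth):

```toml
# Hypothetical sketch of the dump -> edit -> train loop:
#   run-manager dump-config > config.toml        # invocation is illustrative
#   nix run .#train -- config config.toml        # command from the PR description
#
# The keys below are made up for illustration; the dumped TOML defines the real shape.
[model]
architecture = "HfLlama"   # hypothetical key; value matches a variant from this PR

[training]
batch_size = 8             # hypothetical key
```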

@ethernet8023 ethernet8023 changed the title Train refactor improve model debuggingw orkflow Mar 2, 2026
@ethernet8023 ethernet8023 force-pushed the train-refactor branch 3 times, most recently from 628fbfb to 42b290b Compare March 3, 2026 19:23
@ethernet8023 ethernet8023 changed the title improve model debuggingw orkflow improve model debugging workflow Mar 3, 2026
@ethernet8023 ethernet8023 force-pushed the train-refactor branch 3 times, most recently from 969388d to 308eec8 Compare March 3, 2026 22:27
@ethernet8023 ethernet8023 force-pushed the train-refactor branch 6 times, most recently from 066f766 to d74613b Compare March 11, 2026 18:25
@ethernet8023 ethernet8023 marked this pull request as ready for review March 11, 2026 19:39
@ethernet8023 ethernet8023 force-pushed the train-refactor branch 5 times, most recently from 984ef03 to f4475ce Compare March 12, 2026 14:45
They're instantiated via `(LlamaForCausalLM|DeepseekForCausalLM)::from_pretrained` or `(PythonCausalLM::new|PythonDistributedCausalLM::new)`.

There's alpha-level support for models written in Python. See the [Python](./python.md) docs for more information.
We currently only support causal language models. To implement a new one, you can create a file similar to `llama_for_causal_lm` and implement your model, ensuring you provide a trait impl for `CausalLM` - or, preferably, add your model to [our TorchTitan fork](https://github.com/nousResearch/torchtitan).
Collaborator


add your model to our Torchtitan fork

what does this mean exactly? and is it torchtitan format or safetensors or something else?

Contributor Author


it's for if you're adding a new model architecture - it's just "implement your model architecture in our torch titan fork". new model shapes in existing architectures don't need anything fancy like this.

Collaborator


ahhh ok I see ty

Comment thread tools/rust-tools/run-manager/src/commands/run/dump_config.rs
@ethernet8023 ethernet8023 enabled auto-merge March 18, 2026 22:06
…o python model string

also, type UnsupportedArchitecture more strictly
useful for testing combinations of feature sets to ensure compilation
with all of them
- now you can `train` on a regular config toml!
- moved train to its own package
- modified train to work more like tp/dp from the main binary
- updated docs to reflect changes
- added a script to build a tiny version of any model architecture,
which dumps a config in the exact TOML shape required to feed it back
in when setting a config!
there was no path that returned an error here
we try a few dirs to see if they are executable first
Collaborator

@rob-maron rob-maron left a comment


This looks great Ari, I'm trying this out today. TY

Comment on lines 45 to +57

impl LLMArchitecture {
    /// Separate from a display impl, since we actually match on these in the codebase
    pub fn to_python_model_string(&self) -> String {
        match self {
            LLMArchitecture::HfLlama => "HfLlama",
            LLMArchitecture::HfDeepseek => "HfDeepseek",
            LLMArchitecture::HfAuto => "HfAuto",
            LLMArchitecture::Torchtitan => "Torchtitan",
        }
        .to_string()
    }
}
Collaborator


Nice 👍
