Benchmark test for single controller #131

Closed
abaheti95 wants to merge 5 commits into main from ashu/benchmark_test

Conversation

@abaheti95
Collaborator

Uses a Llama 8B math dataset instead of Qwen open_r1. Adjusts the generation length and sequence length so that rewards and hill-climbing are visible within 5 steps.

[mlflow green curve](https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/3739963835954932/runs?o=7395834863327820&searchFilter=&orderByKey=attributes.start_time&orderByAsc=false&startTime=ALL&lifecycleFilter=Active&modelVersionFilter=All+Runs&datasetsFilter=W10%3D&compareRunsMode=CHART)

abaheti95 and others added 5 commits July 29, 2025 14:55
torch.distributed.barrier has a fixed, backend-dependent timeout, so it
cannot keep all MCT-managed processes alive. We therefore need a new
barrier mechanism. This implementation refactors the existing SyncActor
and generalizes it to serve as a barrier between clients.
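
For readers unfamiliar with the idea, here is a minimal sketch of a reusable barrier that waits with no deadline. It is not the code from this PR: `ClientBarrier` and `run_clients` are hypothetical names, and plain threads stand in for the clients that the real SyncActor coordinates across processes.

```python
import threading


class ClientBarrier:
    """Reusable barrier that blocks until `num_clients` callers have arrived.

    Hypothetical stand-in for the generalized SyncActor; not the PR's code.
    """

    def __init__(self, num_clients: int) -> None:
        if num_clients < 1:
            raise ValueError("num_clients must be >= 1")
        self._num_clients = num_clients
        self._arrived = 0
        self._generation = 0  # incremented each time the barrier releases
        self._cond = threading.Condition()

    def wait(self) -> int:
        """Block with no timeout until every client has called wait().

        Returns the generation index that this call belonged to.
        """
        with self._cond:
            generation = self._generation
            self._arrived += 1
            if self._arrived == self._num_clients:
                # Last arrival: reset for reuse and wake everyone up.
                self._arrived = 0
                self._generation += 1
                self._cond.notify_all()
            else:
                # Wait without a deadline; the loop guards against spurious wakeups.
                while generation == self._generation:
                    self._cond.wait()
            return generation


def run_clients(num_clients: int, rounds: int) -> list[list[int]]:
    """Run `num_clients` threads through `rounds` barrier phases."""
    barrier = ClientBarrier(num_clients)
    results: list[list[int]] = [[] for _ in range(num_clients)]

    def client(idx: int) -> None:
        for _ in range(rounds):
            results[idx].append(barrier.wait())

    threads = [threading.Thread(target=client, args=(i,)) for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The generation counter is what makes the barrier reusable: a client that races ahead into the next round cannot be mistaken for a late arrival from the previous one.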
@abaheti95 abaheti95 closed this Aug 4, 2025
3 participants