GitHub - palmshed/ml: Machine learning framework.

distributed ml framework

c++17 distributed training
real-time performance and task monitoring
web dashboard — http://localhost:8080
simple build: cmake -B build -S . && cmake --build build
run: mpirun -n <num_processes> ./distributed_ml
kubernetes-ready (docker, helm, autoscaling)
ci/cd on github actions (macos m1)
apache 2.0 license

docker

build the docker image — this may take several minutes:

docker build -t distributed-ml:latest .

run the container (exposes dashboard on port 8080):

docker run -p 8080:8080 distributed-ml:latest

the container runs training first, then keeps the dashboard server running until the container is stopped.
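because training runs before the dashboard comes up (in the local run log further down, the "dashboard server listening" line appears after training completes), a script that starts the container may need to wait until port 8080 answers. a minimal sketch, assuming curl is installed; the helper name `wait_for_dashboard`, the retry count and the delay are illustrative and not part of the project:

```shell
# wait_for_dashboard URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers with a successful HTTP status, or gives up.
# Hypothetical helper, not shipped with the repository.
wait_for_dashboard() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: non-zero exit on HTTP errors, -o /dev/null: discard body.
    # CURL can be overridden to point at a different client binary.
    if "${CURL:-curl}" -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

usage: start the container in the background with `docker run -d --name distributed-ml -p 8080:8080 distributed-ml:latest` (the container name is an arbitrary choice), then call `wait_for_dashboard http://localhost:8080/ 60 5`. `docker logs -f distributed-ml` shows training progress in the meantime.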

api endpoints

GET / - api info and available endpoints
GET /tasks - list all tasks
GET /performance - performance metrics
POST /tasks - create a new task
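
the endpoints above can be called from a shell with curl. a minimal sketch, assuming the dashboard is reachable on localhost:8080 as in the docker example; the helper names `api_get` and `api_post_json` are hypothetical, and the request format for POST /tasks is not documented here, so sending JSON is a guess:

```shell
# Base URL of the dashboard server; set BASE_URL to target another host.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# api_get PATH: send a GET request and print the response body.
api_get() {
  "${CURL:-curl}" -sS "${BASE_URL}$1"
}

# api_post_json PATH JSON: send a POST request with a JSON body.
# ASSUMPTION: the README does not state which content type or fields
# POST /tasks expects; JSON is a guess and may need adjusting.
api_post_json() {
  "${CURL:-curl}" -sS -X POST -H 'Content-Type: application/json' -d "$2" "${BASE_URL}$1"
}
```

usage: `api_get /`, `api_get /tasks` and `api_get /performance` cover the three GET endpoints. `api_post_json /tasks '{"name": "demo"}'` would create a task, where the `name` field is only a placeholder; check the response of `GET /` or the server source for the actual schema.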

example (local run)

$ cd build && mpirun -np 1 ./distributed_ml
[info] mpi initialized successfully
[info] configuration set: lr=0.01, epochs=100, batchsize=32
[info] distributed trainer initialized. rank: 0, world size: 1
[info] node 0 received 1000 training samples
[info] model parameters synchronized
[info] epoch 1/100 - global loss: 31.7566, gradient norm: 1.02059
[info] epoch 2/100 - global loss: 31.7566, gradient norm: 1.02059
[info] epoch 3/100 - global loss: 31.7566, gradient norm: 1.02059
[info] epoch 4/100 - global loss: 31.7566, gradient norm: 1.02059
[info] early stopping triggered
[info] distributed training completed
dashboard server listening on: http://0.0.0.0:8080/
[info] results aggregated from node 0
training metrics: {
    "batch_size": 32,
    "epochs": 100,
    "learning_rate": 0.01,
    "local_data_size": 1000,
    "rank": 0,
    "total_data_size": 1000,
    "world_size": 1
}
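
to compare runs or plot the loss curve, the per-epoch lines in the log above can be reduced to plain columns. a minimal sketch, assuming the log lines look exactly like the ones shown (the real format may differ, for example in letter case or prefix); the function name `extract_epoch_metrics` is hypothetical:

```shell
# extract_epoch_metrics: read a training log on stdin and print one line
# per epoch as "<epoch> <global loss> <gradient norm>".
# ASSUMPTION: lines match the sample log format shown in this README.
extract_epoch_metrics() {
  sed -nE 's/^\[info\] epoch ([0-9]+)\/[0-9]+ - global loss: ([0-9.eE+-]+), gradient norm: ([0-9.eE+-]+).*$/\1 \2 \3/p'
}
```

usage: `mpirun -np 1 ./distributed_ml | extract_epoch_metrics`. append `2>&1` before the pipe if the logger turns out to write to stderr.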

