This repo is structured so you can test each feature individually before wiring components together.
Feature 1 is an API-only FastAPI service with an in-memory job store and a simple “executor” that can:
- compute a small CPU GEMM result summary (for tiny shapes), or
- simulate a result for larger shapes (deterministic checksum)
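
The two executor modes above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function names, the `MAX_REAL_DIM` threshold, the seeded random inputs, and the shape of the returned summary are all assumptions made for the example.

```python
import hashlib
import random

# Assumption: shapes at or below this size are computed for real. The
# threshold the service actually uses is not stated in this README.
MAX_REAL_DIM = 128


def simulate_gemm(m: int, n: int, k: int, dtype: str) -> dict:
    """Return a fake result whose checksum depends only on the inputs."""
    key = f"{m}x{n}x{k}:{dtype}".encode()
    return {"mode": "simulated", "checksum": hashlib.sha256(key).hexdigest()[:16]}


def compute_gemm(m: int, n: int, k: int, seed: int = 0) -> dict:
    """Multiply two small seeded random matrices in pure Python and summarize C.

    dtype is ignored here; Python floats stand in for fp32 in this sketch.
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(k)] for _ in range(m)]
    b = [[rng.random() for _ in range(n)] for _ in range(k)]
    bt = list(zip(*b))  # columns of B, for row-by-column dot products
    c = [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]
    flat = [v for row in c for v in row]
    return {"mode": "computed", "sum": sum(flat), "max": max(flat)}


def execute(m: int, n: int, k: int, dtype: str = "fp32", simulate: bool = False) -> dict:
    """Compute tiny shapes for real; simulate when asked or when the shape is large."""
    if simulate or max(m, n, k) > MAX_REAL_DIM:
        return simulate_gemm(m, n, k, dtype)
    return compute_gemm(m, n, k)
```

Because the simulated checksum is a hash of the request parameters, submitting the same large job twice yields the same result, which keeps tests repeatable without doing any real work.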
This lets you test:
- request validation
- job lifecycle (QUEUED/RUNNING/DONE/FAILED)
- metrics endpoint
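
To make these three test targets concrete, here is a small self-contained sketch of an in-memory job store. It is a hypothetical illustration only: the class and function names, the allowed dtype set, the transition table, and the metrics format are assumptions, and the real service may differ on every one of them.

```python
import uuid
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class JobState(str, Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


# Assumed legal transitions; DONE and FAILED are terminal.
TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.DONE, JobState.FAILED},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


def validate_request(m: int, n: int, k: int, dtype: str) -> None:
    """Reject bad shapes or dtypes (the allowed dtype set is an assumption)."""
    if min(m, n, k) < 1:
        raise ValueError("m, n and k must be positive")
    if dtype not in {"fp16", "fp32"}:
        raise ValueError(f"unsupported dtype: {dtype}")


@dataclass
class Job:
    params: dict
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: JobState = JobState.QUEUED


class JobStore:
    """In-memory job store: a dict keyed by job id."""

    def __init__(self):
        self.jobs = {}

    def submit(self, m, n, k, dtype="fp32"):
        validate_request(m, n, k, dtype)
        job = Job(params={"m": m, "n": n, "k": k, "dtype": dtype})
        self.jobs[job.id] = job
        return job

    def advance(self, job_id, new_state):
        job = self.jobs[job_id]
        if new_state not in TRANSITIONS[job.state]:
            raise ValueError(f"illegal transition {job.state.value} -> {new_state.value}")
        job.state = new_state

    def metrics(self):
        # Count jobs per state, e.g. {"QUEUED": 2, "DONE": 1}.
        return dict(Counter(job.state.value for job in self.jobs.values()))
```

A test can then submit a job, walk it through `QUEUED`, `RUNNING`, and `DONE` with `advance`, and check that `metrics()` reports the expected counts.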
Repository layout:
- api/: FastAPI service (Feature 1)
- client/: simple CLI client to submit jobs to the API
- worker_cuda/: placeholder for the standalone CUDA kernel benchmark (Feature 4 later)
- docs/: notes / architecture

To run Feature 1 locally:
cd api
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt

Start the API server:
uvicorn app.main:app --reload --port 8000

In a second terminal, from the api directory, submit a small job and a larger simulated job:
python ../client/submit_job.py --m 64 --n 64 --k 64 --dtype fp32 --repeats 5
python ../client/submit_job.py --m 4096 --n 4096 --k 4096 --dtype fp16 --simulate

Check job status, job result, and service metrics:
# replace JOB_ID
curl http://127.0.0.1:8000/v1/jobs/JOB_ID
curl http://127.0.0.1:8000/v1/jobs/JOB_ID/result
curl http://127.0.0.1:8000/metrics

Planned next features:
- Redis queue + metadata store (API still runnable alone)
- Worker stub (no CUDA yet): pulls from Redis, produces results
- Standalone CUDA tiled GEMM binary (benchmark CLI)
- Integrate worker + CUDA + batching/streams
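
As a rough picture of what the worker stub in this list might look like, the sketch below uses Python's in-process `queue.Queue` as a stand-in for Redis, so it runs without any external service. Everything here is an assumption for illustration: the job tuple layout, the `worker_loop` name, the sentinel-based shutdown, and the result fields. The only fixed fact is that a GEMM of shape (m, n, k) performs 2·m·n·k floating-point operations.

```python
import queue
import threading


def worker_loop(jobs: queue.Queue, results: dict, stop: object) -> None:
    """Pull jobs until the stop sentinel arrives, writing one result per job."""
    while True:
        job = jobs.get()
        if job is stop:
            break
        job_id, m, n, k = job
        # No CUDA yet: report the operation count instead of running a kernel.
        results[job_id] = {"status": "DONE", "flops": 2 * m * n * k}


def run_jobs(job_list):
    """Feed jobs to a single worker thread and return the collected results."""
    jobs = queue.Queue()   # stand-in for the Redis queue
    results = {}           # stand-in for the metadata store
    stop = object()        # unique sentinel that tells the worker to exit
    worker = threading.Thread(target=worker_loop, args=(jobs, results, stop))
    worker.start()
    for job in job_list:
        jobs.put(job)
    jobs.put(stop)
    worker.join()
    return results
```

Calling `run_jobs([("job-1", 64, 64, 64)])` returns a dict with one entry for `"job-1"`. Swapping the queue and the results dict for Redis-backed equivalents would leave the loop structure unchanged, which is the point of building the stub before the CUDA kernel.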