Why now (Jay 2026-06-14)
@taos FLUX image-gen OOM'd the shared 3060 because @taOSmd had ~9.4GB of Ollama models loaded. The A2A GPU-lease protocol (#893) is the stop-gap; the proper fix is a taOS resource queue that both consumers submit to instead of negotiating by hand.
Current state (scaffolding exists)
scheduling/resource_manager.py: discovers hardware, queries Ollama for loaded models, tracks VRAM/RAM pressure, adjusts job-queue concurrency. Awareness only, no enforcement.
scheduler/scheduler.py + scheduler/resource.py: Phase 1, deliberately minimal -- synchronous routing, per-resource concurrency semaphore, first-resource-that-passes-admission runs. Priority queues, aging, preemption, max_wait_ms are Phase 2 and intentionally absent (per the module docstrings).
scheduler/core_aware_scheduler.py (#172): load_with_core_awareness(), not yet wired into the dispatch path.
The gap to build (Phase 2)
Per-device VRAM budget in resource.py: track total/used/free per accelerator. Free is read in real time and already accounts for external Ollama usage via the resource_manager read, so @taOSmd co-tenancy is visible.
Admission control: a GPU job (load model / generate / inference) is admitted only if est_vram <= free_vram; otherwise it queues.
Wait-queue with aging: blocked jobs queue FIFO, with aging to avoid starvation; max_wait_ms is enforced.
Eviction: before rejecting, evict idle keep-alive models (LifecycleManager) to make room.
Single arbiter for co-tenants: route both taOS backends AND (ideally) taOSmd's Ollama loads through admission. First step: taOS-side admission respects real-time free VRAM and queues/evicts; full mutual arbitration has taOSmd defer GPU loads to the same manager.
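A minimal sketch of how the Phase 2 items (VRAM budget, admission, aged wait-queue, eviction) could fit together, to anchor the brainstorm. Everything here is an assumption rather than existing taOS code: the names Job, GpuBudget and AdmissionQueue, the aging_ms parameter, largest-first eviction, rejecting a job once it has waited past max_wait_ms, and the 12288 MB card size and 8000 MB FLUX estimate in the usage lines.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    est_vram_mb: int
    enqueued_ms: int
    priority: int = 0  # higher runs first


@dataclass
class GpuBudget:
    total_mb: int
    external_mb: int = 0  # VRAM held by a co-tenant, e.g. Ollama models
    running: dict[str, int] = field(default_factory=dict)      # job -> MB
    idle_models: dict[str, int] = field(default_factory=dict)  # keep-alive model -> MB

    @property
    def free_mb(self) -> int:
        held = sum(self.running.values()) + sum(self.idle_models.values())
        return self.total_mb - self.external_mb - held


class AdmissionQueue:
    def __init__(self, gpu: GpuBudget, max_wait_ms: int, aging_ms: int = 1000):
        self.gpu = gpu
        self.max_wait_ms = max_wait_ms
        self.aging_ms = aging_ms  # one priority point per aging_ms waited
        self.waiting: list[Job] = []

    def _rank(self, job: Job, now_ms: int) -> tuple[int, int]:
        aged = job.priority + (now_ms - job.enqueued_ms) // self.aging_ms
        return (-aged, job.enqueued_ms)  # highest aged priority, then FIFO

    def _evict_idle_for(self, need_mb: int) -> None:
        # Evict only if doing so can actually make the job fit.
        if self.gpu.free_mb + sum(self.gpu.idle_models.values()) < need_mb:
            return
        by_size = sorted(self.gpu.idle_models, key=self.gpu.idle_models.get, reverse=True)
        for name in by_size:
            if self.gpu.free_mb >= need_mb:
                break
            del self.gpu.idle_models[name]

    def pump(self, now_ms: int) -> tuple[list[str], list[str]]:
        """Admit jobs in rank order; return (started, timed_out) names."""
        started, timed_out = [], []
        for job in sorted(self.waiting, key=lambda j: self._rank(j, now_ms)):
            self._evict_idle_for(job.est_vram_mb)
            if job.est_vram_mb <= self.gpu.free_mb:
                self.gpu.running[job.name] = job.est_vram_mb
                started.append(job.name)
            elif now_ms - job.enqueued_ms >= self.max_wait_ms:
                timed_out.append(job.name)
            else:
                break  # head-of-line blocking: no backfill past a waiting job
        done = set(started) | set(timed_out)
        self.waiting = [j for j in self.waiting if j.name not in done]
        return started, timed_out

    def submit(self, job: Job, now_ms: int) -> str:
        self.waiting.append(job)
        started, timed_out = self.pump(now_ms)
        if job.name in started:
            return "admitted"
        return "rejected" if job.name in timed_out else "queued"

    def release(self, name: str, now_ms: int) -> tuple[list[str], list[str]]:
        self.gpu.running.pop(name, None)
        return self.pump(now_ms)


# Usage: FLUX queues while a co-tenant holds ~9.4 GB, then runs once it is freed.
gpu = GpuBudget(total_mb=12288, external_mb=9400)
queue = AdmissionQueue(gpu, max_wait_ms=60_000)
status = queue.submit(Job("flux", est_vram_mb=8000, enqueued_ms=0), now_ms=0)
gpu.external_mb = 0  # co-tenant unloaded its models
started, timed_out = queue.pump(now_ms=5_000)
```

The loop stops at the first job that neither fits nor has timed out, so a small job cannot jump ahead of a large one that is still waiting; combined with aging, that is what keeps large jobs from starving. Backfilling small jobs would improve utilisation but would need a separate starvation guard. Note that external_mb is not evictable here, which matches the first-step scope (taOS-side admission only).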
Acceptance
Two consumers (image-gen + chat models) never co-load past a GPU's VRAM; over-budget work queues and runs when it fits (after eviction), with no OOM. The describe_image_capabilities tool reports live VRAM + queue depth. Subsumes the #893 A2A stop-gap. Brainstorm -> spec -> Phase 2 impl; do not block the current storybook demo on it.
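For the reporting part of the acceptance criteria, a rough sketch of what a live VRAM + queue depth payload might look like. The DeviceVram type and every field name below are hypothetical; the real describe_image_capabilities schema is not specified in this issue.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class DeviceVram:
    device: str
    total_mb: int
    used_mb: int  # includes co-tenant usage, e.g. Ollama models

    @property
    def free_mb(self) -> int:
        return self.total_mb - self.used_mb


def describe_image_capabilities(devices: list[DeviceVram], queue_depth: int) -> dict:
    """Hypothetical payload: live VRAM per device plus current queue depth."""
    return {
        # asdict() skips properties, so free_mb is added explicitly.
        "devices": [{**asdict(d), "free_mb": d.free_mb} for d in devices],
        "queue_depth": queue_depth,
        # Largest job that could be admitted right now without eviction.
        "largest_admissible_mb": max((d.free_mb for d in devices), default=0),
    }


report = describe_image_capabilities(
    [DeviceVram(device="cuda:0", total_mb=12288, used_mb=9400)],
    queue_depth=1,
)
```

Exposing the largest admissible size alongside queue depth would let a caller tell "will queue" apart from "will run now" before submitting.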