Scheduler Phase 2: VRAM-accounted admission + queue + eviction (the GPU arbiter both agents use) #894

@jaylfc

Why now (Jay 2026-06-14)

@taos FLUX image-gen OOM'd on the shared 3060 because @taOSmd had ~9.4 GB of Ollama models loaded. The A2A GPU-lease protocol (#893) is the stop-gap; the proper fix is a taOS resource queue that both consumers submit to instead of negotiating by hand.

Current state (scaffolding exists)

  • scheduling/resource_manager.py: discovers hardware, queries Ollama for loaded models, tracks VRAM/RAM pressure, adjusts job-queue concurrency. Awareness only, no enforcement.
  • scheduler/scheduler.py + scheduler/resource.py: Phase 1, deliberately minimal -- synchronous routing, per-resource concurrency semaphore, first-resource-that-passes-admission runs. Priority queues, aging, preemption, max_wait_ms are Phase 2 and intentionally absent (per the module docstrings).
  • scheduler/core_aware_scheduler.py (#172): load_with_core_awareness(), not yet wired into the dispatch path.
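
For reference, the Phase 1 behavior described above (synchronous routing, a per-resource concurrency semaphore, first resource that passes admission runs) can be sketched roughly as follows. This is a minimal illustration, not the code in scheduler/scheduler.py: `Resource`, `route`, and the `admits` predicate are hypothetical names, and raising when nothing admits the job is an assumption about what "no queue" means in practice.

```python
import threading
from typing import Any, Callable, Dict, List, Tuple

Job = Dict[str, Any]


class Resource:
    """Hypothetical stand-in for a Phase 1 resource (not the taOS class)."""

    def __init__(self, name: str, max_concurrency: int,
                 admits: Callable[[Job], bool]) -> None:
        self.name = name
        self._admits = admits
        # Per-resource concurrency cap.
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def try_run(self, job: Job, fn: Callable[[Job], Any]) -> Tuple[bool, Any]:
        # Refuse immediately if admission fails or no slot is free:
        # Phase 1 has no wait-queue, so nothing blocks here.
        if not self._admits(job) or not self._slots.acquire(blocking=False):
            return False, None
        try:
            return True, fn(job)
        finally:
            self._slots.release()


def route(job: Job, resources: List[Resource],
          fn: Callable[[Job], Any]) -> Tuple[str, Any]:
    """Run the job on the first resource that admits it."""
    for resource in resources:
        ran, result = resource.try_run(job, fn)
        if ran:
            return resource.name, result
    # Assumption: with no queue, an unplaceable job is an error.
    raise RuntimeError("no resource admitted the job")
```

The point of the sketch is what is missing: there is no place for a refused job to wait, which is the gap Phase 2 fills.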

The gap to build (Phase 2)

  1. Per-device VRAM budget in resource.py: track total/used/free per accelerator (free = real-time, already accounts for external Ollama via the resource_manager read, so @taOSmd co-tenancy is visible).
  2. Admission control: a GPU job (load model / generate / inference) is admitted only if est_vram <= free_vram; otherwise it queues.
  3. Wait-queue with aging: queue blocked jobs, FIFO + aging to avoid starvation; max_wait_ms enforcement.
  4. Eviction: before rejecting, evict idle keep-alive models (LifecycleManager) to make room.
  5. Single arbiter for co-tenants: route both taOS backends AND (ideally) taOSmd's Ollama loads through admission. First step: taOS-side admission respects real-time free VRAM and queues/evicts; full mutual arbitration has taOSmd defer GPU loads to the same manager.
  6. Wire TaskRouter.generate_image + core-aware loading (#172) through the admission path.
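
As a starting point for the spec, items 2 to 4 (admission, aging wait-queue with max_wait_ms, evict-before-reject) could look something like the sketch below. Everything here is hypothetical: `VramArbiter`, `Job`, `probe_free_mb`, and `evict_idle` are illustrative names, not existing taOS APIs. It also assumes the probe reports free VRAM before this arbiter's own in-flight jobs, which are tracked as reservations; a real version would have to avoid double counting once a load becomes visible to the probe.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Job:
    name: str
    est_vram_mb: int
    priority: int = 0                   # higher runs first
    max_wait_ms: Optional[int] = None   # None means wait indefinitely
    enqueued_at: float = 0.0


@dataclass
class VramArbiter:
    # Hypothetical callbacks standing in for the resource_manager read
    # and LifecycleManager eviction.
    probe_free_mb: Callable[[], int]    # free VRAM, excluding our reservations
    evict_idle: Callable[[int], int]    # asked to free N MB; returns MB freed
    aging_per_sec: float = 1.0          # priority points gained per second queued
    reserved: Dict[str, int] = field(default_factory=dict)
    waiting: List[Job] = field(default_factory=list)

    def free_mb(self) -> int:
        return self.probe_free_mb() - sum(self.reserved.values())

    def _try_admit(self, job: Job) -> bool:
        shortfall = job.est_vram_mb - self.free_mb()
        if shortfall > 0:
            self.evict_idle(shortfall)          # evict before queueing
        if job.est_vram_mb <= self.free_mb():   # re-read after eviction
            self.reserved[job.name] = job.est_vram_mb
            return True
        return False

    def submit(self, job: Job, now: float) -> str:
        if self._try_admit(job):
            return "admitted"
        job.enqueued_at = now
        self.waiting.append(job)
        return "queued"

    def release(self, job_name: str) -> None:
        self.reserved.pop(job_name, None)

    def drain(self, now: float) -> List[Tuple[str, str]]:
        """Re-check the queue; returns (job name, outcome) pairs."""
        def score(j: Job) -> float:
            return j.priority + self.aging_per_sec * (now - j.enqueued_at)

        results = []
        # Highest aged priority first; ties fall back to FIFO order.
        for job in sorted(self.waiting, key=lambda j: (-score(j), j.enqueued_at)):
            waited_ms = (now - job.enqueued_at) * 1000
            if job.max_wait_ms is not None and waited_ms > job.max_wait_ms:
                self.waiting.remove(job)
                results.append((job.name, "timed_out"))
            elif self._try_admit(job):
                self.waiting.remove(job)
                results.append((job.name, "admitted"))
        return results
```

One design question this leaves open: `drain` lets a smaller job jump past a larger one that still does not fit. That keeps the GPU busy but can delay big jobs, so the spec should decide whether aging alone is enough or whether the head of the queue should block smaller work after some threshold.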

Acceptance

Two consumers (image-gen + chat models) never co-load past a GPU's VRAM; over-budget work queues and runs when it fits (after eviction), with no OOM. The describe_image_capabilities tool reports live VRAM + queue depth. Subsumes the #893 A2A stop-gap. Brainstorm -> spec -> Phase 2 impl; do not block the current storybook demo on it.
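
To make the "reports live VRAM + queue depth" criterion concrete, here is one possible shape for that report. The actual payload of describe_image_capabilities is not specified in this issue, so `GpuStatus`, `describe_capabilities`, and all field names below are assumptions for discussion only.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class GpuStatus:
    # Hypothetical snapshot; field names are not taken from taOS.
    device: str
    total_vram_mb: int
    free_vram_mb: int
    queue_depth: int


def describe_capabilities(status: GpuStatus, est_vram_mb: int) -> Dict[str, Any]:
    """Build a report a caller could use to decide whether to submit now."""
    report = asdict(status)
    report["used_vram_mb"] = status.total_vram_mb - status.free_vram_mb
    # Same rule as admission control: est_vram <= free_vram.
    report["can_admit_now"] = est_vram_mb <= status.free_vram_mb
    return report
```

A caller that sees `can_admit_now` false with a non-zero `queue_depth` would know its job is going to wait rather than fail.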

Metadata

Assignees: No one assigned
Labels: feature, kilo-duplicate, kilo-triaged
Status: Todo
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests