Scheduler Phase 2: VRAM-accounted admission + queue + eviction (the GPU arbiter both agents use)

## Why now (Jay 2026-06-14)
@taOS FLUX image-gen OOM'd on the shared 3060 because @taOSmd had ~9.4 GB of Ollama models loaded. The A2A GPU-lease protocol (#893) is the stop-gap; the proper fix is the taOS resource queue that both consumers submit to instead of negotiating by hand.

## Current state (scaffolding exists)
- `scheduling/resource_manager.py`: discovers hardware, **queries Ollama for loaded models**, tracks VRAM/RAM pressure, adjusts job-queue concurrency. Awareness only, no enforcement.
- `scheduler/scheduler.py` + `scheduler/resource.py`: Phase 1, deliberately minimal -- synchronous routing, per-resource concurrency semaphore, first-resource-that-passes-admission runs. Priority queues, aging, preemption, max_wait_ms are Phase 2 and intentionally absent (per the module docstrings).
- `scheduler/core_aware_scheduler.py` (#172): `load_with_core_awareness()`, not yet wired into the dispatch path.

## The gap to build (Phase 2)
1. **Per-device VRAM budget** in resource.py: track total/used/free per accelerator (free = real-time, already accounts for external Ollama via the resource_manager read, so @taOSmd co-tenancy is visible).
2. **Admission control**: a GPU job (load model / generate / inference) is admitted only if est_vram <= free_vram; otherwise it queues.
3. **Wait-queue with aging**: queue blocked jobs, FIFO + aging to avoid starvation; max_wait_ms enforcement.
4. **Eviction**: before rejecting, evict idle keep-alive models (LifecycleManager) to make room.
5. **Single arbiter for co-tenants**: route both taOS backends AND (ideally) taOSmd's Ollama loads through admission. First step: taOS-side admission respects real-time free VRAM and queues/evicts; full mutual arbitration has taOSmd defer GPU loads to the same manager.
6. Wire TaskRouter.generate_image + core-aware loading (#172) through the admission path.
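Items 1, 2 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the real `resource.py` or LifecycleManager API: `DeviceBudget`, `LoadedModel`, `admit` and the `external_mb` field are hypothetical names, and the largest-idle-first eviction order is an assumption. The sketch treats VRAM held by an external Ollama as a number the caller refreshes from the resource_manager read.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LoadedModel:
    name: str
    vram_mb: int
    idle: bool  # True when only a keep-alive holds the model, no in-flight work


@dataclass
class DeviceBudget:
    """Hypothetical per-accelerator VRAM ledger (names are assumptions)."""
    total_mb: int
    external_mb: int = 0  # VRAM held by co-tenants, e.g. an external Ollama
    models: Dict[str, LoadedModel] = field(default_factory=dict)

    @property
    def used_mb(self) -> int:
        return self.external_mb + sum(m.vram_mb for m in self.models.values())

    @property
    def free_mb(self) -> int:
        return self.total_mb - self.used_mb

    def evict_idle_until(self, needed_mb: int) -> List[str]:
        """Evict idle models, largest first, until needed_mb fits or none remain."""
        evicted: List[str] = []
        idle = sorted(
            (m for m in self.models.values() if m.idle),
            key=lambda m: m.vram_mb,
            reverse=True,
        )
        for model in idle:
            if self.free_mb >= needed_mb:
                break
            del self.models[model.name]
            evicted.append(model.name)
        return evicted


def admit(device: DeviceBudget, name: str, est_vram_mb: int) -> str:
    """Return 'admitted' if the job fits (evicting idle models if that helps),
    otherwise 'queued'. Nothing is evicted when eviction could not make room."""
    if est_vram_mb > device.free_mb:
        reclaimable = sum(m.vram_mb for m in device.models.values() if m.idle)
        if device.free_mb + reclaimable >= est_vram_mb:
            device.evict_idle_until(est_vram_mb)
    if est_vram_mb <= device.free_mb:
        device.models[name] = LoadedModel(name, est_vram_mb, idle=False)
        return "admitted"
    return "queued"
```

With illustrative figures (a 12288 MB device, 9400 MB held externally, and an assumed 8000 MB estimate for the image job), `admit` returns `"queued"` rather than letting the load OOM. Once the external usage drops, the same call returns `"admitted"`. Skipping eviction when it cannot free enough room is a design choice here, so idle models are not unloaded for a job that will queue anyway.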

## Acceptance
Two consumers (image-gen + chat models) never co-load past a GPU's VRAM; over-budget work queues and runs when it fits (after eviction), with no OOM. The describe_image_capabilities tool reports live VRAM + queue depth. Subsumes the #893 A2A stop-gap. Brainstorm -> spec -> Phase 2 impl; do not block the current storybook demo on it.
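The "queues and runs when it fits" behaviour could look something like the sketch below, which combines priority, linear aging and a `max_wait_ms` cutoff. Everything here is an assumption for illustration: `QueuedJob`, `AgingQueue`, `drain`, the `aging_ms_per_level` parameter and the strict head-of-line policy are hypothetical, not the Phase 2 design.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueuedJob:
    name: str
    est_vram_mb: int
    priority: int      # lower value runs first
    enqueued_ms: int
    max_wait_ms: int


class AgingQueue:
    """Hypothetical wait-queue: priority order, with waiting time improving rank."""

    def __init__(self, aging_ms_per_level: int = 1000) -> None:
        # Waiting this many milliseconds is worth one priority level.
        self.aging_ms_per_level = aging_ms_per_level
        self.jobs: List[QueuedJob] = []

    def push(self, job: QueuedJob) -> None:
        self.jobs.append(job)

    def effective_priority(self, job: QueuedJob, now_ms: int) -> float:
        waited = now_ms - job.enqueued_ms
        return job.priority - waited / self.aging_ms_per_level

    def drain(self, now_ms: int, free_mb: int) -> Tuple[List[str], List[str]]:
        """Return (started, timed_out) job names for the given free VRAM."""
        timed_out = [j for j in self.jobs if now_ms - j.enqueued_ms > j.max_wait_ms]
        waiting = [j for j in self.jobs if now_ms - j.enqueued_ms <= j.max_wait_ms]
        # Order by aged priority; ties fall back to FIFO on enqueue time.
        ordered = sorted(
            waiting,
            key=lambda j: (self.effective_priority(j, now_ms), j.enqueued_ms),
        )
        started: List[QueuedJob] = []
        for job in ordered:
            if job.est_vram_mb > free_mb:
                break  # strict head-of-line: smaller jobs do not jump ahead
            free_mb -= job.est_vram_mb
            started.append(job)
        self.jobs = [j for j in waiting if j not in started]
        return [j.name for j in started], [j.name for j in timed_out]
```

Because every job ages at the same rate, an old low-priority job eventually ranks ahead of any newly arriving high-priority job, which is what prevents starvation. Stopping at the first job that does not fit keeps a large image job from being bypassed indefinitely by small chat loads, at the cost of some idle VRAM; a backfilling variant would trade that the other way.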

Scheduler Phase 2: VRAM-accounted admission + queue + eviction (the GPU arbiter both agents use) #894
