5 changes: 3 additions & 2 deletions docs/architecture/api.md
@@ -21,8 +21,9 @@ yields `None`. Import it from `batchling`.
strict queue key `(provider, endpoint, model)`.
- **`dry_run` behavior**: when `dry_run=True`, requests are still intercepted, queued,
and grouped using normal window/size triggers, but provider batch submission and polling
-are skipped. Requests resolve with synthetic `httpx.Response` objects marked with
-`x-batchling-dry-run: 1`.
+are skipped. Intercepted requests raise `DryRunEarlyExit` on return instead of
+producing synthetic provider responses. A static dry-run summary report is emitted
+at context teardown.
- **`cache` behavior**: when `cache=True` (default), intercepted requests are fingerprinted
and looked up in a persistent request cache. Cache hits bypass queueing and resume polling
from an existing provider batch when not in dry-run mode.
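
The strict queue key above can be sketched with plain dictionaries. This is an illustrative sketch, not batchling's internals; the request shape here is assumed:

```python
from collections import defaultdict

# Hypothetical intercepted requests; in batchling these would be real
# provider calls captured by the interception layer.
requests = [
    {"provider": "openai", "endpoint": "/v1/responses", "model": "gpt-4o-mini"},
    {"provider": "openai", "endpoint": "/v1/responses", "model": "gpt-4o-mini"},
    {"provider": "anthropic", "endpoint": "/v1/messages", "model": "claude-haiku-4-5"},
]

# Group by the strict queue key (provider, endpoint, model): only requests
# sharing all three fields can land in the same provider batch.
queues = defaultdict(list)
for req in requests:
    key = (req["provider"], req["endpoint"], req["model"])
    queues[key].append(req)

for key, members in queues.items():
    print(key, len(members))
```

Here the two `gpt-4o-mini` requests share one queue while the `claude-haiku-4-5` request gets its own, so a dry run would report two expected batches.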
Expand Down
8 changes: 5 additions & 3 deletions docs/architecture/context.md
@@ -10,6 +10,7 @@ a context variable.
- Yield `None` for scope-only lifecycle control.
- Support sync and async context manager patterns for cleanup and context scoping.
- Start and stop optional Rich live activity display while the context is active.
- In dry-run mode, aggregate and print a static Rich summary at teardown.

## Flow summary

@@ -20,9 +21,10 @@ a context variable.
4. If `live_display=True`, the context attempts to start Rich panel rendering at
enter-time when terminal auto-detection passes (`TTY`, non-`dumb`, non-`CI`).
Otherwise it registers an `INFO` logging fallback that emits progress at poll-time.
-5. `__aexit__` resets the context and awaits `batcher.close()` to flush pending work.
-6. The live display listener is removed and the panel is stopped when context cleanup
-finishes.
+5. In dry-run mode, a dedicated summary listener is also registered at enter-time.
+6. `__aexit__` resets the context and awaits `batcher.close()` to flush pending work.
+7. On teardown, the context prints one static dry-run summary report when dry-run is enabled.
+8. Display/listener cleanup runs after close completes.
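
The enter/teardown flow above can be sketched as a toy async context manager. `DrySummaryContext`, `on_request`, and the report shape are all hypothetical stand-ins, not batchling's real classes:

```python
import asyncio

class DrySummaryContext:
    """Toy sketch of the flow above; not batchling's actual context class."""

    def __init__(self):
        self.counts = {}    # totals seen by the enter-time summary listener
        self.report = None  # static summary, built once at teardown

    def on_request(self, key):
        # Step 5: a dedicated summary listener aggregates per-queue totals.
        self.counts[key] = self.counts.get(key, 0) + 1

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        # Steps 6-8: flush pending work, emit one static report, clean up.
        await asyncio.sleep(0)  # stands in for awaiting `batcher.close()`
        self.report = dict(sorted(self.counts.items()))
        for key, n in self.report.items():
            print(f"{key}: {n} request(s)")
        return False

ctx = DrySummaryContext()

async def main():
    async with ctx:
        ctx.on_request(("openai", "/v1/responses", "gpt-4o-mini"))
        ctx.on_request(("openai", "/v1/responses", "gpt-4o-mini"))

asyncio.run(main())
```

Nothing is printed while the context is open; the single report appears only once `__aexit__` runs, mirroring the static summary at teardown.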

## Code reference

6 changes: 4 additions & 2 deletions docs/architecture/core.md
@@ -34,9 +34,11 @@ resolves futures back to callers.
7. `close()` flushes remaining requests and cancels timers.

In `dry_run` mode, step 3 and provider polling are bypassed: `_process_batch()` still
-creates `_ActiveBatch` for tracking, then resolves each request immediately with a
-synthetic `httpx.Response` (`200`) marked with `x-batchling-dry-run: 1`.
+creates `_ActiveBatch` for tracking, then resolves each request by raising
+`DryRunEarlyExit`.
Cache lookups remain enabled in dry-run mode for hit accounting, but cache writes are disabled.
+`close()` also waits for in-flight background submission/poll tasks so teardown
+reporting has stable totals.
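
The close-time drain can be sketched with a plain asyncio task set. `MiniBatcher` is an assumed toy class, not batchling's real batcher:

```python
import asyncio

class MiniBatcher:
    """Toy sketch of the close-time drain; not batchling's real batcher."""

    def __init__(self):
        self.tasks = set()  # in-flight background submission/poll tasks
        self.completed = 0

    def submit(self, delay):
        # Background work registers itself so close() can find it later.
        task = asyncio.create_task(self._work(delay))
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)

    async def _work(self, delay):
        await asyncio.sleep(delay)
        self.completed += 1

    async def close(self):
        # Drain every in-flight task so teardown totals are stable.
        if self.tasks:
            await asyncio.gather(*list(self.tasks))

async def main():
    batcher = MiniBatcher()
    for delay in (0.01, 0.02, 0.03):
        batcher.submit(delay)
    await batcher.close()
    return batcher.completed

completed = asyncio.run(main())
print(completed)  # all three background tasks finished before close() returned
```

Without the `gather` in `close()`, teardown could observe `completed` mid-flight and report partial totals.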

## Extension notes

35 changes: 32 additions & 3 deletions docs/dry-run.md
@@ -4,9 +4,38 @@

This feature exists so users can debug and better understand what WILL happen when they ultimately disable the flag, giving them the transparency required to be confident in the library.

-In practice, the dry run feature deactivates all batch submissions, but everything is done virtually, which means we can count incoming requests, number of batch we would have created, etc..
-
-To put it simply, it provides users with an exact breakdown of what their batched inference run would have been for real.
+In practice, dry-run deactivates all provider submissions while keeping the
+internal batching path active (queueing, windowing, and per-queue grouping).
+
+To put it simply, it provides users with an exact breakdown of what their
+batched inference run would have been for real.

Sample output:

```text
╭────────────────────────────────────────────── batchling dry run summary ───────────────────────────────────────────────╮
│ Batchable Requests: 8 - Cache Hit Requests: 0 │
│ ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ │
│ ┃ provider ┃ endpoint ┃ model ┃ expected reques… ┃ expected batch… ┃ │
│ ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │
│ │ anthropic │ /v1/messages │ claude-haiku-4-5 │ 1 │ 1 │ │
│ │ doubleword │ /v1/responses │ openai/gpt-oss-20b │ 1 │ 1 │ │
│ │ gemini │ /v1beta/models/gemini-2.5-flash-… │ gemini-2.5-flash-lite │ 1 │ 1 │ │
│ │ groq │ /openai/v1/chat/completions │ llama-3.1-8b-instant │ 1 │ 1 │ │
│ │ mistral │ /v1/chat/completions │ mistral-medium-2505 │ 1 │ 1 │ │
│ │ openai │ /v1/responses │ gpt-4o-mini │ 1 │ 1 │ │
│ │ together │ /v1/chat/completions │ google/gemma-3n-E4B-it │ 1 │ 1 │ │
│ │ xai │ /v1/chat/completions │ grok-4-1-fast-non-reasoning │ 1 │ 1 │ │
│ └─────────────┴───────────────────────────────────┴─────────────────────────────┴──────────────────┴─────────────────┘ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

## Avoid partial counts

Dry-run exits as soon as the first intercepted request returns, which can lead
to partial totals if requests are awaited one by one. To let batchling see the
full request set before exit, schedule requests together and await them with
`asyncio.gather`.
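
The advice above can be sketched with stdlib asyncio alone; `fake_request` is a hypothetical stand-in for an intercepted call, not a batchling API:

```python
import asyncio

async def fake_request(i):
    # Stand-in for an intercepted provider call; in a real run each of
    # these would be an API request captured by batchling.
    await asyncio.sleep(0)
    return i

async def main():
    # Awaiting one by one (`for i in range(8): await fake_request(i)`)
    # would surface the first return, and the dry-run exit, before the
    # remaining requests are ever seen. Scheduling the full set and
    # awaiting it together avoids partial totals:
    return await asyncio.gather(*(fake_request(i) for i in range(8)))

results = asyncio.run(main())
print(results)  # results arrive in submission order
```

With all eight requests scheduled up front, the summary table counts the full set rather than whatever happened to complete first.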

## Activating dry run
