feat: answer-span extraction + POST /v1/answer endpoint (Phase 1.1 + 3.3)#16
Conversation
Phase 1.1 + 3.3 of the engine plan: turns the whitepaper's
"answer has a verbatim quote with a page number" claim into two
demonstrable, regulator-defensible features.
pkg/retrieval/span.go
- SpanExtractor.Extract(ctx, content, query) runs one LLM call,
asks for the SHORTEST verbatim quote from the section that
answers the query, returns AnswerSpan{Start, End, Text} with
byte offsets back into content.
- Two-pass locator: exact substring first, then whitespace-
normalised match for content with weird linebreaks/spacing.
- When the model paraphrases despite the verbatim rule, offsets
are sentinel (-1, -1) but Text is preserved so callers can flag.
- Schema-validated JSON output ({found, quote}) with the same
tolerant parser pattern as ParseSelection (code fences, leading
prose).
internal/api/server.go
- Deps gains LLM (llmgate.Client), LLMModel (default fallback),
AnswerSpan + Answer config blocks. cmd/engine/main.go wires them.
- /v1/query: when retrieval.answer_span.enabled, every returned
section gets an `answer_span` field. Bounded concurrency
(config) via a semaphore; failures are logged and dropped.
- POST /v1/answer (NEW): single round-trip runs retrieval +
per-section span extraction + a synthesis LLM call, returns
{answer, citations:[{section_id, title, page_start, page_end,
quote, quote_start, quote_end}], strategy, model, usage,
elapsed_ms}. The model is instructed to cite by section_id.
pkg/config/config.go
- RetrievalConfig.AnswerSpan + .Answer blocks; defaults match
behaviour: AnswerSpan disabled, Answer.MaxSections=5,
Answer.MaxAnswerTokens=1024.
- VLE_RETRIEVAL_ANSWER_SPAN_ENABLED / _MODEL / _MAX_CONCURRENCY
and VLE_RETRIEVAL_ANSWER_MODEL / _MAX_SECTIONS env overrides
follow the existing pattern.
openapi.yaml
- /v1/answer endpoint + AnswerRequest / AnswerResponse /
AnswerCitation / AnswerSpan schemas. answer_span added as an
optional field on QuerySection (omitempty in JSON).
Tests:
- pkg/retrieval/span_test.go covers the verbatim path, the
not-found path, whitespace-normalised location, the
hallucinated-quote sentinel-offset path, empty inputs,
code-fence stripping, leading-prose stripping, and locateQuote
edge cases. All pass.
- go build ./..., go vet ./..., go test ./... — all green.
|
Warning Review limit reached
More reviews will be available in 11 minutes and 35 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Reviewer's GuideImplements configurable answer-span extraction for /v1/query sections and adds a new /v1/answer endpoint that performs retrieval, per-section span extraction, and answer synthesis via an LLM with grounded citations and usage accounting. Sequence diagram for POST /v1/answer end-to-end flowsequenceDiagram
actor Client
participant APIServer as APIServer_handleAnswer
participant DB as DB_LoadTree
participant Strategy as RetrievalStrategy
participant Storage as Storage_Get
participant SpanExtractor as SpanExtractor
participant LLM as LLM_Client
Client->>APIServer: POST /v1/answer (AnswerRequest)
APIServer-->>Client: [if no LLM] 501
APIServer-->>Client: [if no strategy] 503
APIServer->>DB: LoadTree(document_id)
DB-->>APIServer: Tree or error
APIServer->>Strategy: Select / SelectWithCost(query, budget)
Strategy-->>APIServer: section_ids, retrievalUsage
loop sections
APIServer->>Storage: Get(content_ref)
Storage-->>APIServer: sectionContent
end
APIServer->>SpanExtractor: Extract(ctx, content, query)
SpanExtractor->>LLM: Complete(span extraction)
LLM-->>SpanExtractor: span quote + usage
SpanExtractor-->>APIServer: AnswerSpan
APIServer->>LLM: Complete(synthesis prompt)
LLM-->>APIServer: answerText + synthUsage
APIServer-->>Client: AnswerResponse{answer, citations, usage}
Sequence diagram for /v1/query with optional answer-span extractionsequenceDiagram
actor Client
participant APIServer as APIServer_handleQuery
participant Strategy as RetrievalStrategy
participant DB as DB_LoadTree
participant Storage as Storage_Get
participant SpanExtractor as SpanExtractor
participant LLM as LLM_Client
Client->>APIServer: POST /v1/query
APIServer->>DB: LoadTree(document_id)
DB-->>APIServer: Tree
APIServer->>Strategy: Select(query, budget)
Strategy-->>APIServer: section_ids
loop sections
APIServer->>Storage: Get(content_ref)
Storage-->>APIServer: sectionContent
end
alt answer_span enabled and LLM configured
APIServer->>SpanExtractor: Extract(ctx, content, query)
SpanExtractor->>LLM: Complete(span extraction)
LLM-->>SpanExtractor: AnswerSpan
SpanExtractor-->>APIServer: AnswerSpan
else
APIServer-->>APIServer: [skip span extraction]
end
APIServer-->>Client: QueryResponse{sections[answer_span?]}
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- In
applyEnvOverrides, you add env vars forVLE_RETRIEVAL_ANSWER_SPAN_*andVLE_RETRIEVAL_ANSWER_*but skip overrides forAnswerSpan.MaxQuoteLenandAnswer.MaxAnswerTokens, which makes those settings configurable only via YAML; consider adding env hooks for these to keep configuration surfaces consistent. - Both
spanExtractorandhandleAnswercan end up constructing LLM requests with an emptymodelif no config or request model is provided; it may be more robust to validate the resolved model name early and return a clear 4xx/5xx error rather than relying on downstream llmgate failures.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `applyEnvOverrides`, you add env vars for `VLE_RETRIEVAL_ANSWER_SPAN_*` and `VLE_RETRIEVAL_ANSWER_*` but skip overrides for `AnswerSpan.MaxQuoteLen` and `Answer.MaxAnswerTokens`, which makes those settings configurable only via YAML; consider adding env hooks for these to keep configuration surfaces consistent.
- Both `spanExtractor` and `handleAnswer` can end up constructing LLM requests with an empty `model` if no config or request model is provided; it may be more robust to validate the resolved model name early and return a clear 4xx/5xx error rather than relying on downstream llmgate failures.
## Individual Comments
### Comment 1
<location path="internal/api/server.go" line_range="630-639" />
<code_context>
+ totalUsage := retrieval.Usage{}
</code_context>
<issue_to_address>
**issue (bug_risk):** LLM usage from span extraction is not included in the reported usage totals
In `handleAnswer`, `totalUsage` only aggregates `retrievalUsage` and `synthUsage`, while the `Usage` returned from span extraction (`runSpansConcurrent` → `SpanExtractor.Extract`) is dropped. This causes the `/v1/answer` `usage` block to under-report tokens, cost, and `llm_calls`. Please propagate the span-extraction `Usage` out of `runSpansConcurrent` (e.g., via an accumulator or channel) and fold it into `totalUsage` for accurate accounting.
</issue_to_address>
### Comment 2
<location path="pkg/retrieval/span.go" line_range="71-72" />
<code_context>
+ maxQuote = 400
+ }
+
+ user := fmt.Sprintf(
+ "Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
+ sectionContent, query, maxQuote,
+ )
</code_context>
<issue_to_address>
**suggestion (performance):** Prompt injects full section content without regard to token budget; consider bounding or truncating by token count
`SpanExtractor.Extract` passes the full `sectionContent` into the prompt and only bounds the quote length, not the context size. For very large sections this can cause excessive prompt tokens, driving up cost and latency. Since you already have context-budgeting logic elsewhere, consider adding a similar guard here (e.g., truncate `sectionContent` by an approximate token or character limit) so span extraction remains bounded for large sections.
Suggested implementation:
```golang
maxQuote := e.MaxQuoteLen
if maxQuote <= 0 {
maxQuote = 400
}
// Bound the section content so span extraction prompts remain within a reasonable budget.
const maxSectionChars = 8000
if len(sectionContent) > maxSectionChars {
sectionContent = sectionContent[:maxSectionChars]
}
user := fmt.Sprintf(
"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
sectionContent, query, maxQuote,
)
```
If your codebase already has a shared token/character budgeting helper (e.g., for retrieval or summarization prompts), you may want to:
1. Replace the hard-coded `maxSectionChars` with a call to that helper, or derive it from a configuration value on `SpanExtractor` (such as a context budget field if one exists).
2. Tune `maxSectionChars` or the helper to target a specific token budget for this model (e.g., estimating 3–4 chars per token).
</issue_to_address>
### Comment 3
<location path="pkg/retrieval/span.go" line_range="106-107" />
<code_context>
+ if !found || strings.TrimSpace(quote) == "" {
+ return nil, usage, nil
+ }
+ if len(quote) > maxQuote {
+ quote = quote[:maxQuote]
+ }
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Truncating strings by byte length risks cutting in the middle of UTF-8 runes
Here and in `synthesiseAnswer` (4000-char truncation), `s[:N]` truncates by bytes, not runes. With non-ASCII input this can split UTF-8 code points and yield invalid UTF-8, making offsets and rendering unreliable. Please switch to rune-aware truncation (e.g., via `[]rune` or a helper that truncates by runes while keeping byte offsets explicit).
Suggested implementation:
```golang
if !found || strings.TrimSpace(quote) == "" {
return nil, usage, nil
}
quote = truncateRunes(quote, maxQuote)
```
```golang
func truncateRunes(s string, maxRunes int) string {
if maxRunes <= 0 {
return ""
}
runes := []rune(s)
if len(runes) <= maxRunes {
return s
}
return string(runes[:maxRunes])
}
import (
"context"
```
1. In the `synthesiseAnswer` function (the 4000-char truncation mentioned in your comment), replace any `if len(x) > N { x = x[:N] }` or equivalent with `x = truncateRunes(x, N)` to make truncation rune-aware there as well.
2. If `maxQuote` or the 4000 limit are intended as byte limits rather than character limits, adjust the helper (or add an additional helper) to enforce both a rune count and a byte-length ceiling while still avoiding cutting UTF-8 runes in half (e.g., by walking runes until the next rune would exceed the byte budget).
</issue_to_address>
### Comment 4
<location path="pkg/retrieval/span_test.go" line_range="116-125" />
<code_context>
+ }
+}
+
+func TestSpanExtractor_EmptyInput(t *testing.T) {
+ m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
+ e := NewSpanExtractor(m, "gemini-2.5-flash")
+ if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
+ t.Errorf("empty content should yield nil span without an LLM call")
+ }
+ if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
+ t.Errorf("empty query should yield nil span without an LLM call")
+ }
+}
+
+func TestParseSpanResponse_CodeFence(t *testing.T) {
</code_context>
<issue_to_address>
**suggestion (testing):** Assert that the LLM is not called on empty input to fully exercise the optimization
Right now the test only verifies that the span is nil. Since the behavior also includes *skipping* the LLM call, please also assert `m.calls == 0` after each invocation to lock in that optimization and avoid future regressions that reintroduce unnecessary LLM traffic for empty inputs.
```suggestion
func TestSpanExtractor_EmptyInput(t *testing.T) {
m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
e := NewSpanExtractor(m, "gemini-2.5-flash")
if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
t.Errorf("empty content should yield nil span without an LLM call")
}
if m.calls != 0 {
t.Errorf("expected no LLM calls for empty content, got %d", m.calls)
}
if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
t.Errorf("empty query should yield nil span without an LLM call")
}
if m.calls != 0 {
t.Errorf("expected no LLM calls for empty query, got %d", m.calls)
}
}
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| totalUsage := retrieval.Usage{} | ||
|
|
||
| var ids []tree.SectionID | ||
| var retrievalUsage retrieval.Usage | ||
| if cs, ok := d.Strategy.(retrieval.CostStrategy); ok { | ||
| res, err := cs.SelectWithCost(r.Context(), t, body.Query, budget) | ||
| if err != nil { | ||
| writeErr(w, http.StatusInternalServerError, "retrieval failed: "+err.Error()) | ||
| return | ||
| } |
There was a problem hiding this comment.
issue (bug_risk): LLM usage from span extraction is not included in the reported usage totals
In handleAnswer, totalUsage only aggregates retrievalUsage and synthUsage, while the Usage returned from span extraction (runSpansConcurrent → SpanExtractor.Extract) is dropped. This causes the /v1/answer usage block to under-report tokens, cost, and llm_calls. Please propagate the span-extraction Usage out of runSpansConcurrent (e.g., via an accumulator or channel) and fold it into totalUsage for accurate accounting.
| user := fmt.Sprintf( | ||
| "Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).", |
There was a problem hiding this comment.
suggestion (performance): Prompt injects full section content without regard to token budget; consider bounding or truncating by token count
SpanExtractor.Extract passes the full sectionContent into the prompt and only bounds the quote length, not the context size. For very large sections this can cause excessive prompt tokens, driving up cost and latency. Since you already have context-budgeting logic elsewhere, consider adding a similar guard here (e.g., truncate sectionContent by an approximate token or character limit) so span extraction remains bounded for large sections.
Suggested implementation:
maxQuote := e.MaxQuoteLen
if maxQuote <= 0 {
maxQuote = 400
}
// Bound the section content so span extraction prompts remain within a reasonable budget.
const maxSectionChars = 8000
if len(sectionContent) > maxSectionChars {
sectionContent = sectionContent[:maxSectionChars]
}
user := fmt.Sprintf(
"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
sectionContent, query, maxQuote,
)If your codebase already has a shared token/character budgeting helper (e.g., for retrieval or summarization prompts), you may want to:
- Replace the hard-coded
maxSectionCharswith a call to that helper, or derive it from a configuration value onSpanExtractor(such as a context budget field if one exists). - Tune
maxSectionCharsor the helper to target a specific token budget for this model (e.g., estimating 3–4 chars per token).
| if len(quote) > maxQuote { | ||
| quote = quote[:maxQuote] |
There was a problem hiding this comment.
suggestion (bug_risk): Truncating strings by byte length risks cutting in the middle of UTF-8 runes
Here and in synthesiseAnswer (4000-char truncation), s[:N] truncates by bytes, not runes. With non-ASCII input this can split UTF-8 code points and yield invalid UTF-8, making offsets and rendering unreliable. Please switch to rune-aware truncation (e.g., via []rune or a helper that truncates by runes while keeping byte offsets explicit).
Suggested implementation:
if !found || strings.TrimSpace(quote) == "" {
return nil, usage, nil
}
quote = truncateRunes(quote, maxQuote)func truncateRunes(s string, maxRunes int) string {
if maxRunes <= 0 {
return ""
}
runes := []rune(s)
if len(runes) <= maxRunes {
return s
}
return string(runes[:maxRunes])
}
import (
"context"- In the
synthesiseAnswerfunction (the 4000-char truncation mentioned in your comment), replace anyif len(x) > N { x = x[:N] }or equivalent withx = truncateRunes(x, N)to make truncation rune-aware there as well. - If
maxQuoteor the 4000 limit are intended as byte limits rather than character limits, adjust the helper (or add an additional helper) to enforce both a rune count and a byte-length ceiling while still avoiding cutting UTF-8 runes in half (e.g., by walking runes until the next rune would exceed the byte budget).
| func TestSpanExtractor_EmptyInput(t *testing.T) { | ||
| m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`} | ||
| e := NewSpanExtractor(m, "gemini-2.5-flash") | ||
| if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil { | ||
| t.Errorf("empty content should yield nil span without an LLM call") | ||
| } | ||
| if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil { | ||
| t.Errorf("empty query should yield nil span without an LLM call") | ||
| } | ||
| } |
There was a problem hiding this comment.
suggestion (testing): Assert that the LLM is not called on empty input to fully exercise the optimization
Right now the test only verifies that the span is nil. Since the behavior also includes skipping the LLM call, please also assert m.calls == 0 after each invocation to lock in that optimization and avoid future regressions that reintroduce unnecessary LLM traffic for empty inputs.
| func TestSpanExtractor_EmptyInput(t *testing.T) { | |
| m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`} | |
| e := NewSpanExtractor(m, "gemini-2.5-flash") | |
| if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil { | |
| t.Errorf("empty content should yield nil span without an LLM call") | |
| } | |
| if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil { | |
| t.Errorf("empty query should yield nil span without an LLM call") | |
| } | |
| } | |
| func TestSpanExtractor_EmptyInput(t *testing.T) { | |
| m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`} | |
| e := NewSpanExtractor(m, "gemini-2.5-flash") | |
| if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil { | |
| t.Errorf("empty content should yield nil span without an LLM call") | |
| } | |
| if m.calls != 0 { | |
| t.Errorf("expected no LLM calls for empty content, got %d", m.calls) | |
| } | |
| if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil { | |
| t.Errorf("empty query should yield nil span without an LLM call") | |
| } | |
| if m.calls != 0 { | |
| t.Errorf("expected no LLM calls for empty query, got %d", m.calls) | |
| } | |
| } |
Summary
Lands Phase 1.1 (per-section answer-span extraction) and Phase 3.3 (
POST /v1/answer) of the engine plan. Turns the whitepaper's "the answer has a verbatim quote with a page number" claim into two demonstrable features.pkg/retrieval/span.go—SpanExtractor.Extract(ctx, content, query)runs one LLM call asking for the shortest verbatim span that answers the query, returnsAnswerSpan{Start, End, Text}with byte offsets back into the content. Two-pass locator (exact substring → whitespace-normalised) handles real PDFs with weird linebreaks/spacing. Sentinel offsets(-1, -1)flag paraphrased quotes./v1/queryopt-in: whenretrieval.answer_span.enabled: true, every returned section gets an extraanswer_spanfield. Bounded concurrency via semaphore. Failures are logged and dropped — span extraction is best-effort.POST /v1/answer(new): single round-trip runs retrieval + per-section span extraction + a synthesis LLM call, returns{answer, citations: [{section_id, title, page_start, page_end, quote, quote_start, quote_end}], strategy, model, usage, elapsed_ms}. The most regulator-defensible endpoint — every claim carries a replayable provenance trail.What's wired
DepsgainsLLM,LLMModel,AnswerSpan,Answerfields.cmd/engine/main.gopopulates them from config (adds amodelFor(LLMConfig)helper).pkg/config/config.go:AnswerSpanBlock(enabled, model, max_concurrency, max_quote_len) +AnswerBlock(model, max_sections, max_answer_tokens). Five newVLE_RETRIEVAL_ANSWER_*env overrides.config.example.yaml: documented retrieval blocks with defaults.openapi.yaml:/v1/answerendpoint +AnswerRequest/AnswerResponse/AnswerCitation/AnswerSpanschemas.answer_spanadded toQuerySection.Test plan
go build ./...go vet ./...go test ./...— all greenRisk envelope
/v1/querycallers see no behaviour change./v1/answerreturns 501 when no LLM client is configured (defensive — never hangs)./v1/answeris additive; existing endpoints untouched.Open consequences
/v1/answeralways extracts spans (it grounds the citations);/v1/queryhonours the opt-in flag.Summary by Sourcery
Add configurable answer-span extraction to retrieval responses and introduce a new quote-grounded /v1/answer endpoint that runs retrieval, span extraction, and synthesis in a single LLM-backed call.
New Features:
Enhancements:
Tests: