Skip to content

feat: answer-span extraction + POST /v1/answer endpoint (Phase 1.1 + 3.3)#16

Merged
hallelx2 merged 1 commit into
mainfrom
feat/answer-span-and-answer-api
May 27, 2026
Merged

feat: answer-span extraction + POST /v1/answer endpoint (Phase 1.1 + 3.3)#16
hallelx2 merged 1 commit into
mainfrom
feat/answer-span-and-answer-api

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

@hallelx2 hallelx2 commented May 27, 2026

Summary

Lands Phase 1.1 (per-section answer-span extraction) and Phase 3.3 (POST /v1/answer) of the engine plan. Turns the whitepaper's "the answer has a verbatim quote with a page number" claim into two demonstrable features.

  • pkg/retrieval/span.goSpanExtractor.Extract(ctx, content, query) runs one LLM call asking for the shortest verbatim span that answers the query, returns AnswerSpan{Start, End, Text} with byte offsets back into the content. Two-pass locator (exact substring → whitespace-normalised) handles real PDFs with weird linebreaks/spacing. Sentinel offsets (-1, -1) flag paraphrased quotes.

  • /v1/query opt-in: when retrieval.answer_span.enabled: true, every returned section gets an extra answer_span field. Bounded concurrency via semaphore. Failures are logged and dropped — span extraction is best-effort.

  • POST /v1/answer (new): single round-trip runs retrieval + per-section span extraction + a synthesis LLM call, returns {answer, citations: [{section_id, title, page_start, page_end, quote, quote_start, quote_end}], strategy, model, usage, elapsed_ms}. The most regulator-defensible endpoint — every claim carries a replayable provenance trail.

What's wired

  • Deps gains LLM, LLMModel, AnswerSpan, Answer fields. cmd/engine/main.go populates them from config (adds a modelFor(LLMConfig) helper).
  • pkg/config/config.go: AnswerSpanBlock (enabled, model, max_concurrency, max_quote_len) + AnswerBlock (model, max_sections, max_answer_tokens). Five new VLE_RETRIEVAL_ANSWER_* env overrides.
  • config.example.yaml: documented retrieval blocks with defaults.
  • openapi.yaml: /v1/answer endpoint + AnswerRequest / AnswerResponse / AnswerCitation / AnswerSpan schemas. answer_span added to QuerySection.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./... — all green
  • Span extractor unit tests cover: verbatim match → resolved offsets, not-found → nil span, whitespace-normalised location, hallucinated quote → sentinel offsets, empty inputs → no LLM call, code-fence + leading-prose tolerant parser, locateQuote edge cases

Risk envelope

  • Opt-in via config — default-disabled. Existing /v1/query callers see no behaviour change.
  • /v1/answer returns 501 when no LLM client is configured (defensive — never hangs).
  • /v1/answer is additive; existing endpoints untouched.

Open consequences

  • For now /v1/answer always extracts spans (it grounds the citations); /v1/query honours the opt-in flag.
  • The synthesis prompt sends section content (capped at 4000 chars/section) alongside spans, so the model isn't blind when extraction returns nothing. Future optimisation: spans-only mode for cost-sensitive deployments.

Summary by Sourcery

Add configurable answer-span extraction to retrieval responses and introduce a new quote-grounded /v1/answer endpoint that runs retrieval, span extraction, and synthesis in a single LLM-backed call.

New Features:

  • Introduce per-section answer-span extraction that enriches /v1/query sections with an optional answer_span field containing a verbatim quote and byte offsets.
  • Add the /v1/answer endpoint that returns a synthesised, quote-grounded answer with structured citations, usage metadata, and elapsed time.

Enhancements:

  • Wire a shared LLM client and default model into the API server dependencies for non-retrieval LLM calls, with configuration blocks for answer-span and answer synthesis behaviour.
  • Extend OpenAPI, configuration defaults, and example config to document and control answer-span extraction and the /v1/answer endpoint.
  • Factor shared section-with-content handling in the query path and add concurrent span-extraction orchestration with bounded concurrency.

Tests:

  • Add unit tests for the span extractor, including verbatim and whitespace-normalised matches, not-found behaviour, hallucinated quotes, empty-input handling, and robust response parsing.

Phase 1.1 + 3.3 of the engine plan: turns the whitepaper's
"answer has a verbatim quote with a page number" claim into two
demonstrable, regulator-defensible features.

pkg/retrieval/span.go
  - SpanExtractor.Extract(ctx, content, query) runs one LLM call,
    asks for the SHORTEST verbatim quote from the section that
    answers the query, returns AnswerSpan{Start, End, Text} with
    byte offsets back into content.
  - Two-pass locator: exact substring first, then whitespace-
    normalised match for content with weird linebreaks/spacing.
  - When the model paraphrases despite the verbatim rule, offsets
    are sentinel (-1, -1) but Text is preserved so callers can flag.
  - Schema-validated JSON output ({found, quote}) with the same
    tolerant parser pattern as ParseSelection (code fences, leading
    prose).

internal/api/server.go
  - Deps gains LLM (llmgate.Client), LLMModel (default fallback),
    AnswerSpan + Answer config blocks. cmd/engine/main.go wires them.
  - /v1/query: when retrieval.answer_span.enabled, every returned
    section gets an `answer_span` field. Bounded concurrency
    (config) via a semaphore; failures are logged and dropped.
  - POST /v1/answer (NEW): single round-trip runs retrieval +
    per-section span extraction + a synthesis LLM call, returns
    {answer, citations:[{section_id, title, page_start, page_end,
    quote, quote_start, quote_end}], strategy, model, usage,
    elapsed_ms}. The model is instructed to cite by section_id.

pkg/config/config.go
  - RetrievalConfig.AnswerSpan + .Answer blocks; defaults match
    behaviour: AnswerSpan disabled, Answer.MaxSections=5,
    Answer.MaxAnswerTokens=1024.
  - VLE_RETRIEVAL_ANSWER_SPAN_ENABLED / _MODEL / _MAX_CONCURRENCY
    and VLE_RETRIEVAL_ANSWER_MODEL / _MAX_SECTIONS env overrides
    follow the existing pattern.

openapi.yaml
  - /v1/answer endpoint + AnswerRequest / AnswerResponse /
    AnswerCitation / AnswerSpan schemas. answer_span added as an
    optional field on QuerySection (omitempty in JSON).

Tests:
  - pkg/retrieval/span_test.go covers the verbatim path, the
    not-found path, whitespace-normalised location, the
    hallucinated-quote sentinel-offset path, empty inputs,
    code-fence stripping, leading-prose stripping, and locateQuote
    edge cases. All pass.
  - go build ./..., go vet ./..., go test ./... — all green.
Copilot AI review requested due to automatic review settings May 27, 2026 00:42
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 11 minutes and 35 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 83b995cb-81c2-400f-9286-72a962dea5e3

📥 Commits

Reviewing files that changed from the base of the PR and between 1100390 and 39e677f.

📒 Files selected for processing (7)
  • cmd/engine/main.go
  • config.example.yaml
  • internal/api/server.go
  • openapi.yaml
  • pkg/config/config.go
  • pkg/retrieval/span.go
  • pkg/retrieval/span_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/answer-span-and-answer-api

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 27, 2026

Reviewer's Guide

Implements configurable answer-span extraction for /v1/query sections and adds a new /v1/answer endpoint that performs retrieval, per-section span extraction, and answer synthesis via an LLM with grounded citations and usage accounting.

Sequence diagram for POST /v1/answer end-to-end flow

sequenceDiagram
    actor Client
    participant APIServer as APIServer_handleAnswer
    participant DB as DB_LoadTree
    participant Strategy as RetrievalStrategy
    participant Storage as Storage_Get
    participant SpanExtractor as SpanExtractor
    participant LLM as LLM_Client

    Client->>APIServer: POST /v1/answer (AnswerRequest)
    APIServer-->>Client: [if no LLM] 501
    APIServer-->>Client: [if no strategy] 503

    APIServer->>DB: LoadTree(document_id)
    DB-->>APIServer: Tree or error

    APIServer->>Strategy: Select / SelectWithCost(query, budget)
    Strategy-->>APIServer: section_ids, retrievalUsage

    loop sections
        APIServer->>Storage: Get(content_ref)
        Storage-->>APIServer: sectionContent
    end

    APIServer->>SpanExtractor: Extract(ctx, content, query)
    SpanExtractor->>LLM: Complete(span extraction)
    LLM-->>SpanExtractor: span quote + usage
    SpanExtractor-->>APIServer: AnswerSpan

    APIServer->>LLM: Complete(synthesis prompt)
    LLM-->>APIServer: answerText + synthUsage

    APIServer-->>Client: AnswerResponse{answer, citations, usage}
Loading

Sequence diagram for /v1/query with optional answer-span extraction

sequenceDiagram
    actor Client
    participant APIServer as APIServer_handleQuery
    participant Strategy as RetrievalStrategy
    participant DB as DB_LoadTree
    participant Storage as Storage_Get
    participant SpanExtractor as SpanExtractor
    participant LLM as LLM_Client

    Client->>APIServer: POST /v1/query
    APIServer->>DB: LoadTree(document_id)
    DB-->>APIServer: Tree

    APIServer->>Strategy: Select(query, budget)
    Strategy-->>APIServer: section_ids

    loop sections
        APIServer->>Storage: Get(content_ref)
        Storage-->>APIServer: sectionContent
    end

    alt answer_span enabled and LLM configured
        APIServer->>SpanExtractor: Extract(ctx, content, query)
        SpanExtractor->>LLM: Complete(span extraction)
        LLM-->>SpanExtractor: AnswerSpan
        SpanExtractor-->>APIServer: AnswerSpan
    else
        APIServer-->>APIServer: [skip span extraction]
    end

    APIServer-->>Client: QueryResponse{sections[answer_span?]}
Loading

File-Level Changes

Change Details Files
Add answer-span extraction component and integrate it into retrieval flow and /v1/query responses.
  • Introduce retrieval.AnswerSpan type and SpanExtractor with LLM JSON-mode call to produce shortest verbatim quote plus byte offsets, including whitespace-normalised locator and robust response parser.
  • Implement locateQuote, collapseWS, and mapNormToOriginal helpers to resolve quote positions in real-world PDF-like text, with sentinel offsets when the quote cannot be matched verbatim.
  • Add unit tests for the span extractor, locator, and parser covering verbatim matches, no-evidence cases, whitespace-normalised matches, hallucinated quotes, empty-input behaviour, and fenced/leading-prose JSON outputs.
pkg/retrieval/span.go
pkg/retrieval/span_test.go
Wire optional answer-span extraction into /v1/query and add a shared sectionWithContent representation.
  • Refactor handleQuery to materialise sections as sectionWithContent structs, optionally run span extraction concurrently with a bounded semaphore, then convert back to API maps including optional answer_span.
  • Add sectionWithContentToMap helper to centralise section-to-JSON shaping (including page range, candidate questions, and answer_span fields).
  • Introduce runSpansConcurrent helper that fans out extraction calls with MaxConcurrency, skips empty content, respects context cancellation, and logs non-fatal failures.
internal/api/server.go
Introduce new /v1/answer endpoint that runs retrieval, span extraction, and answer synthesis with grounded citations and usage accounting.
  • Add handleAnswer to validate requests, load the document tree, run retrieval with ContextBudget (supporting CostStrategy usage accounting), cap sections using configurable defaults, load section content, and always run span extraction.
  • Implement synthesiseAnswer to call the shared LLM client with a grounded prompt that includes section metadata, spans, and truncated content, and return answer text plus token/cost usage.
  • Shape AnswerResponse with answer, citations (including quote and byte offsets when available), strategy name, effective model, usage summary, and elapsed_ms; return 501 when LLM is not configured and 503 when no retrieval strategy is available.
internal/api/server.go
Extend API surface and configuration to support answer-span and /v1/answer features and wire LLM dependencies into the server.
  • Augment Deps with LLM client, default LLMModel, and AnswerSpan/Answer config blocks; wire them from engine config in cmd/engine/main.go and add a modelFor helper to derive the default chat model by driver.
  • Expand RetrievalConfig with AnswerSpanBlock and AnswerBlock, set safe defaults in config.Default, and add environment overrides for enabling span extraction and tuning models, concurrency, and max sections.
  • Extend OpenAPI spec with /v1/answer path, AnswerRequest/AnswerResponse/AnswerCitation/AnswerSpan schemas, and add answer_span to QuerySection; document new retrieval.answer_span and retrieval.answer blocks in config.example.yaml.
internal/api/server.go
cmd/engine/main.go
pkg/config/config.go
openapi.yaml
config.example.yaml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@hallelx2 hallelx2 merged commit d92db83 into main May 27, 2026
6 of 9 checks passed
@hallelx2 hallelx2 deleted the feat/answer-span-and-answer-api branch May 27, 2026 00:43
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • In applyEnvOverrides, you add env vars for VLE_RETRIEVAL_ANSWER_SPAN_* and VLE_RETRIEVAL_ANSWER_* but skip overrides for AnswerSpan.MaxQuoteLen and Answer.MaxAnswerTokens, which makes those settings configurable only via YAML; consider adding env hooks for these to keep configuration surfaces consistent.
  • Both spanExtractor and handleAnswer can end up constructing LLM requests with an empty model if no config or request model is provided; it may be more robust to validate the resolved model name early and return a clear 4xx/5xx error rather than relying on downstream llmgate failures.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `applyEnvOverrides`, you add env vars for `VLE_RETRIEVAL_ANSWER_SPAN_*` and `VLE_RETRIEVAL_ANSWER_*` but skip overrides for `AnswerSpan.MaxQuoteLen` and `Answer.MaxAnswerTokens`, which makes those settings configurable only via YAML; consider adding env hooks for these to keep configuration surfaces consistent.
- Both `spanExtractor` and `handleAnswer` can end up constructing LLM requests with an empty `model` if no config or request model is provided; it may be more robust to validate the resolved model name early and return a clear 4xx/5xx error rather than relying on downstream llmgate failures.

## Individual Comments

### Comment 1
<location path="internal/api/server.go" line_range="630-639" />
<code_context>
+	totalUsage := retrieval.Usage{}
</code_context>
<issue_to_address>
**issue (bug_risk):** LLM usage from span extraction is not included in the reported usage totals

In `handleAnswer`, `totalUsage` only aggregates `retrievalUsage` and `synthUsage`, while the `Usage` returned from span extraction (`runSpansConcurrent``SpanExtractor.Extract`) is dropped. This causes the `/v1/answer` `usage` block to under-report tokens, cost, and `llm_calls`. Please propagate the span-extraction `Usage` out of `runSpansConcurrent` (e.g., via an accumulator or channel) and fold it into `totalUsage` for accurate accounting.
</issue_to_address>

### Comment 2
<location path="pkg/retrieval/span.go" line_range="71-72" />
<code_context>
+		maxQuote = 400
+	}
+
+	user := fmt.Sprintf(
+		"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
+		sectionContent, query, maxQuote,
+	)
</code_context>
<issue_to_address>
**suggestion (performance):** Prompt injects full section content without regard to token budget; consider bounding or truncating by token count

`SpanExtractor.Extract` passes the full `sectionContent` into the prompt and only bounds the quote length, not the context size. For very large sections this can cause excessive prompt tokens, driving up cost and latency. Since you already have context-budgeting logic elsewhere, consider adding a similar guard here (e.g., truncate `sectionContent` by an approximate token or character limit) so span extraction remains bounded for large sections.

Suggested implementation:

```golang
	maxQuote := e.MaxQuoteLen
	if maxQuote <= 0 {
		maxQuote = 400
	}

	// Bound the section content so span extraction prompts remain within a reasonable budget.
	const maxSectionChars = 8000
	if len(sectionContent) > maxSectionChars {
		sectionContent = sectionContent[:maxSectionChars]
	}

	user := fmt.Sprintf(
		"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
		sectionContent, query, maxQuote,
	)

```

If your codebase already has a shared token/character budgeting helper (e.g., for retrieval or summarization prompts), you may want to:
1. Replace the hard-coded `maxSectionChars` with a call to that helper, or derive it from a configuration value on `SpanExtractor` (such as a context budget field if one exists).
2. Tune `maxSectionChars` or the helper to target a specific token budget for this model (e.g., estimating 3–4 chars per token).
</issue_to_address>

### Comment 3
<location path="pkg/retrieval/span.go" line_range="106-107" />
<code_context>
+	if !found || strings.TrimSpace(quote) == "" {
+		return nil, usage, nil
+	}
+	if len(quote) > maxQuote {
+		quote = quote[:maxQuote]
+	}
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Truncating strings by byte length risks cutting in the middle of UTF-8 runes

Here and in `synthesiseAnswer` (4000-char truncation), `s[:N]` truncates by bytes, not runes. With non-ASCII input this can split UTF-8 code points and yield invalid UTF-8, making offsets and rendering unreliable. Please switch to rune-aware truncation (e.g., via `[]rune` or a helper that truncates by runes while keeping byte offsets explicit).

Suggested implementation:

```golang
	if !found || strings.TrimSpace(quote) == "" {
		return nil, usage, nil
	}
	quote = truncateRunes(quote, maxQuote)

```

```golang
func truncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}

	runes := []rune(s)
	if len(runes) <= maxRunes {
		return s
	}

	return string(runes[:maxRunes])
}

import (
	"context"

```

1. In the `synthesiseAnswer` function (the 4000-char truncation mentioned in your comment), replace any `if len(x) > N { x = x[:N] }` or equivalent with `x = truncateRunes(x, N)` to make truncation rune-aware there as well.
2. If `maxQuote` or the 4000 limit are intended as byte limits rather than character limits, adjust the helper (or add an additional helper) to enforce both a rune count and a byte-length ceiling while still avoiding cutting UTF-8 runes in half (e.g., by walking runes until the next rune would exceed the byte budget).
</issue_to_address>

### Comment 4
<location path="pkg/retrieval/span_test.go" line_range="116-125" />
<code_context>
+	}
+}
+
+func TestSpanExtractor_EmptyInput(t *testing.T) {
+	m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
+	e := NewSpanExtractor(m, "gemini-2.5-flash")
+	if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
+		t.Errorf("empty content should yield nil span without an LLM call")
+	}
+	if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
+		t.Errorf("empty query should yield nil span without an LLM call")
+	}
+}
+
+func TestParseSpanResponse_CodeFence(t *testing.T) {
</code_context>
<issue_to_address>
**suggestion (testing):** Assert that the LLM is not called on empty input to fully exercise the optimization

Right now the test only verifies that the span is nil. Since the behavior also includes *skipping* the LLM call, please also assert `m.calls == 0` after each invocation to lock in that optimization and avoid future regressions that reintroduce unnecessary LLM traffic for empty inputs.

```suggestion
func TestSpanExtractor_EmptyInput(t *testing.T) {
	m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
	e := NewSpanExtractor(m, "gemini-2.5-flash")

	if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
		t.Errorf("empty content should yield nil span without an LLM call")
	}
	if m.calls != 0 {
		t.Errorf("expected no LLM calls for empty content, got %d", m.calls)
	}

	if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
		t.Errorf("empty query should yield nil span without an LLM call")
	}
	if m.calls != 0 {
		t.Errorf("expected no LLM calls for empty query, got %d", m.calls)
	}
}
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread internal/api/server.go
Comment on lines +630 to +639
totalUsage := retrieval.Usage{}

var ids []tree.SectionID
var retrievalUsage retrieval.Usage
if cs, ok := d.Strategy.(retrieval.CostStrategy); ok {
res, err := cs.SelectWithCost(r.Context(), t, body.Query, budget)
if err != nil {
writeErr(w, http.StatusInternalServerError, "retrieval failed: "+err.Error())
return
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): LLM usage from span extraction is not included in the reported usage totals

In handleAnswer, totalUsage only aggregates retrievalUsage and synthUsage, while the Usage returned from span extraction (runSpansConcurrentSpanExtractor.Extract) is dropped. This causes the /v1/answer usage block to under-report tokens, cost, and llm_calls. Please propagate the span-extraction Usage out of runSpansConcurrent (e.g., via an accumulator or channel) and fold it into totalUsage for accurate accounting.

Comment thread pkg/retrieval/span.go
Comment on lines +71 to +72
user := fmt.Sprintf(
"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (performance): Prompt injects full section content without regard to token budget; consider bounding or truncating by token count

SpanExtractor.Extract passes the full sectionContent into the prompt and only bounds the quote length, not the context size. For very large sections this can cause excessive prompt tokens, driving up cost and latency. Since you already have context-budgeting logic elsewhere, consider adding a similar guard here (e.g., truncate sectionContent by an approximate token or character limit) so span extraction remains bounded for large sections.

Suggested implementation:

	maxQuote := e.MaxQuoteLen
	if maxQuote <= 0 {
		maxQuote = 400
	}

	// Bound the section content so span extraction prompts remain within a reasonable budget.
	const maxSectionChars = 8000
	if len(sectionContent) > maxSectionChars {
		sectionContent = sectionContent[:maxSectionChars]
	}

	user := fmt.Sprintf(
		"Section content:\n---\n%s\n---\n\nUser query:\n%s\n\nReturn a JSON object with `found` (boolean) and `quote` (string, verbatim from the section, ≤ %d characters).",
		sectionContent, query, maxQuote,
	)

If your codebase already has a shared token/character budgeting helper (e.g., for retrieval or summarization prompts), you may want to:

  1. Replace the hard-coded maxSectionChars with a call to that helper, or derive it from a configuration value on SpanExtractor (such as a context budget field if one exists).
  2. Tune maxSectionChars or the helper to target a specific token budget for this model (e.g., estimating 3–4 chars per token).

Comment thread pkg/retrieval/span.go
Comment on lines +106 to +107
if len(quote) > maxQuote {
quote = quote[:maxQuote]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Truncating strings by byte length risks cutting in the middle of UTF-8 runes

Here and in synthesiseAnswer (4000-char truncation), s[:N] truncates by bytes, not runes. With non-ASCII input this can split UTF-8 code points and yield invalid UTF-8, making offsets and rendering unreliable. Please switch to rune-aware truncation (e.g., via []rune or a helper that truncates by runes while keeping byte offsets explicit).

Suggested implementation:

	if !found || strings.TrimSpace(quote) == "" {
		return nil, usage, nil
	}
	quote = truncateRunes(quote, maxQuote)
func truncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}

	runes := []rune(s)
	if len(runes) <= maxRunes {
		return s
	}

	return string(runes[:maxRunes])
}

import (
	"context"
  1. In the synthesiseAnswer function (the 4000-char truncation mentioned in your comment), replace any if len(x) > N { x = x[:N] } or equivalent with x = truncateRunes(x, N) to make truncation rune-aware there as well.
  2. If maxQuote or the 4000 limit are intended as byte limits rather than character limits, adjust the helper (or add an additional helper) to enforce both a rune count and a byte-length ceiling while still avoiding cutting UTF-8 runes in half (e.g., by walking runes until the next rune would exceed the byte budget).

Comment on lines +116 to +125
func TestSpanExtractor_EmptyInput(t *testing.T) {
m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
e := NewSpanExtractor(m, "gemini-2.5-flash")
if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
t.Errorf("empty content should yield nil span without an LLM call")
}
if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
t.Errorf("empty query should yield nil span without an LLM call")
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Assert that the LLM is not called on empty input to fully exercise the optimization

Right now the test only verifies that the span is nil. Since the behavior also includes skipping the LLM call, please also assert m.calls == 0 after each invocation to lock in that optimization and avoid future regressions that reintroduce unnecessary LLM traffic for empty inputs.

Suggested change
func TestSpanExtractor_EmptyInput(t *testing.T) {
m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
e := NewSpanExtractor(m, "gemini-2.5-flash")
if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
t.Errorf("empty content should yield nil span without an LLM call")
}
if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
t.Errorf("empty query should yield nil span without an LLM call")
}
}
func TestSpanExtractor_EmptyInput(t *testing.T) {
m := &spanMockLLM{reply: `{"found":true,"quote":"x"}`}
e := NewSpanExtractor(m, "gemini-2.5-flash")
if span, _, _ := e.Extract(context.Background(), "", "Q"); span != nil {
t.Errorf("empty content should yield nil span without an LLM call")
}
if m.calls != 0 {
t.Errorf("expected no LLM calls for empty content, got %d", m.calls)
}
if span, _, _ := e.Extract(context.Background(), "content", ""); span != nil {
t.Errorf("empty query should yield nil span without an LLM call")
}
if m.calls != 0 {
t.Errorf("expected no LLM calls for empty query, got %d", m.calls)
}
}

@hallelx2 hallelx2 review requested due to automatic review settings May 27, 2026 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant