Page citations + HyDE candidate questions#14
Conversation
SEC filings have no PDF outline and use bold at body font size (not larger
fonts) for section headings, so the size-only heading heuristic missed every
real section and collapsed the entire body into one giant block. Wide
letter-tracking on cover/header rows also extracted as "U N I T E D".
Three targeted changes:
- Per-row bold detection from the glyph font name. Bold rows at >= median
font size qualify as headings, nested one level below the smallest
size-derived heading.
- collapseLetterSpacing(): rejoins letter-tracked text only on rows whose
pattern is unmistakable (majority single-char tokens), preserving word
boundaries via runs of 2+ spaces. Normal prose is untouched.
- looksLikeHeading: raise the word cap from 14 to 25 so verbose filing
headings ("Item 2. Management's Discussion and Analysis of Financial
Condition and Results of Operations") are not filtered out.
Validated on a real 10-Q (3M Q2 2023, 92 pages): one 680K-char blob became
174 retrievable sections (Item 1, Consolidated Balance Sheet, PART I, ...);
title "U N I T E D S T A T E S" became "UNITED STATES". All existing parser
tests pass; no regression.
…lures The selection LLM call (chunked-tree slices and single-pass alike) sometimes returns plain text instead of the JSON the schema asks for. Most often this is Gemini briefly ignoring JSON mode. Today that surfaces as a 500 to the SDK on every blip, plus the wasted LLM cost — and the SDK's transport-level retry just repeats the same blow-up. Wrap Complete + ParseSelection in a small retry loop (2 retries by default, 3 attempts total). On retry the last user message gets an extra "ONLY JSON, no prose, no fences" reminder, which Gemini usually honors on the second try. If all attempts still fail, log a warning and return an empty selection so the HTTP request succeeds with no sections instead of erroring out — one bad LLM response can no longer take down a multi-slice retrieval. Test TestSinglePassGracefulOnNonJSON locks the behaviour: prose-only response → empty selection, nil error, 3 LLM attempts counted in usage.
The current summary prompt asks for "a single factual sentence" — fine for
human reading, but the resulting summaries describe sections generically
("Cover page of 3M's 10-Q with company identification") instead of naming
their concrete topics ("registered debt securities, trading symbols MMM26
/ MMM30, NYSE listings, IRS employer ID"). The downstream retrieval LLM,
given only those summaries, then can't tell which section answers a
specific question — e.g. q_00941 ("Which debt securities are registered to
trade on a national exchange under 3M's name?") picks two "Long-Term Debt"
sections instead of the cover-page section that actually contains the
registration table.
Rewrite the summary prompt for retrieval: explicitly ask the model to name
the section's concrete entities, identifiers, table contents, named items,
and key numbers. One sentence, raised cap to ≤60 words (with MaxTokens
260) so dense sections aren't truncated mid-list. The domain framings
(research / medical / default) are preserved and now include the same
retrieval rule. Existing ingest tests pass.
…evable
Filing cover pages (and any other long, mixed-topic leaf section) produce
one 2-3k-char blob under a generic title like "3M COMPANY" — mixing
registration tables, addresses, IRS IDs, contact info. A single summary
can't cover all those topics, so retrieval picks unrelated "long-term
debt" sections instead of the one that actually holds the answer.
Add chunkOversizedLeaves: any LEAF section whose Content exceeds 2400
chars is replaced by a parent (title preserved) with smaller children at
the next level. Children are sized around 900 chars and split at word
boundaries. The chunk title prefers a natural colon-terminated header
within the first 80 chars ("Securities registered pursuant to Section
12(b) of the Act:") when available — exactly the pattern in filings —
otherwise the first ~60 chars trimmed at a word boundary, falling back
to "<parent title> — part N".
Internal nodes are recursed into but never split (they're already
structured). Threshold deliberately high (2400) so most paper sub-
sections aren't affected; combined with the retrieval-friendly summary
prompt (previous commit), each chunk gets a topic-rich summary downstream
so the retrieval LLM can match it to specific questions.
Tests in chunk_test.go: oversized leaf gets split with the parent title
preserved + children at level+1; first chunk takes the colon-header
title; small sections are untouched; oversized leaves nested inside
internal nodes are still split.
Adds two retrieval-quality boosters and wires them through the data
layer + ingest pipeline + retrieval prompt + API surface.
1. Page citations (Phase 1.1)
- sections gains page_start / page_end (nullable INTEGER) plus an
index on (document_id, page_start, page_end) for citation lookups.
- The PDF parser tracks the inclusive page range each section
covers (from row.page on the pdfRow stream) and propagates it up
to internal nodes. Non-paginated formats (markdown/HTML/DOCX/text)
leave both columns NULL.
- Pages survive the oversized-leaf chunker — children inherit the
parent leaf's range.
- Pages flow through db.Section -> tree.Section -> tree.SectionView
and are surfaced (omitempty) on every API handler that returns
sections: /sections/{id}, /query, /query/multi, /documents/{id}/tree.
2. HyDE candidate questions (Phase 1.2)
- sections gains candidate_questions (JSONB nullable).
- New ingest stage pkg/ingest/hyde.go: per leaf, asks the LLM for
up to N self-contained questions the section can answer, with the
same JSON-retry + graceful-degrade pattern retrieval already uses.
Failures are logged and the pipeline proceeds to StatusReady
(HyDE is a recall booster, not a correctness gate).
- Pipeline gains HyDEEnabled / HyDEModel / HyDENumQuestions /
HyDEConcurrency knobs.
- tree.SectionView gains CandidateQuestions so the retrieval prompt
can surface them.
- retrieval.writeSectionLine appends an "answers: <first question>"
hint per section (~120 chars cap) so the LLM has a wider lexical
surface to match user queries against.
3. Config (CC.3)
- config.IngestConfig{HyDE HyDEConfig} added with defaults
Enabled=true, NumQuestions=5, Concurrency=4.
- Env overrides: VLE_INGEST_HYDE_ENABLED / _MODEL / _NUM_QUESTIONS /
_CONCURRENCY.
- Validation rejects negative counts.
- Wired from cmd/engine/main.go and cmd/server/main.go into
ingest.Pipeline.
Migration 0004_sections_extras adds the new columns + index.
- pkg/parser/pdf_pages_test.go: propagateSectionPages union semantics (zero left alone, parent widens, child range preserved) + chunker inheriting parent leaf's pages. - pkg/db/sections_marshal_test.go: JSONB round-trip for candidate_questions, NULL handling, garbled-bytes tolerance, and the page sql.NullInt64 ↔ int helpers. - pkg/ingest/hyde_test.go: parseHyDEResponse tolerance (fences, prose, empty), dedupeNonEmpty cap + dedupe, runHyDEWithRetry happy / retry / final-fail paths via llmgate.Mock, HyDEModel override + fallback to SummaryModel, NumQuestions cap. - pkg/retrieval/retrieval_test.go: assert the selection prompt surfaces the FIRST candidate question as an "answers:" hint and does not leak subsequent ones.
Reviewer's GuideImplements page-aware PDF sections with propagated page spans, introduces a HyDE-based candidate question generation stage in the ingest pipeline with DB persistence and configuration knobs, and threads both page metadata and questions through tree, retrieval, and HTTP APIs while hardening LLM JSON-mode handling with retries and graceful degradation. Sequence diagram for HyDE candidate question generation and persistencesequenceDiagram
participant Pipe as Pipeline.Run
participant HyDE as generateCandidateQuestions
participant DB as db.Pool
participant Store as Storage
participant LLM as llmgate.Client
Pipe->>HyDE: generateCandidateQuestions(docID, profile)
HyDE->>DB: ListSectionsForWorker(docID)
DB-->>HyDE: []db.Section
loop for each leaf section
HyDE->>Pipe: candidateQuestionsFor(section, profile)
activate Pipe
alt section.ContentRef != ""
Pipe->>Store: Get(ctx, section.ContentRef)
Store-->>Pipe: io.ReadCloser
end
Pipe->>LLM: runHyDEWithRetry(request, defaultHyDERetries)
note over LLM,Pipe: runHyDEWithRetry calls Complete + parseHyDEResponse
alt JSON parses
LLM-->>Pipe: []string questions
Pipe-->>HyDE: questions
alt len(questions) > 0
HyDE->>DB: UpdateSectionCandidateQuestions(id, questions)
DB-->>HyDE: ok
else no usable questions
HyDE-->>HyDE: skip update (NULL in DB)
end
else [parse failed]
LLM-->>Pipe: parse error (retry inside runHyDEWithRetry)
Pipe-->>HyDE: error
HyDE-->>HyDE: record non-fatal error and continue
end
deactivate Pipe
end
HyDE-->>Pipe: errors.Join(non-fatal errs)
Flow diagram for page-aware sections and HyDE candidate questionsflowchart TD
PDFParser[PDF.Parse<br/>+ propagateSectionPages<br/>+ chunkOversizedLeaves]
DBSection[db.Section<br/>page_start/page_end<br/>candidate_questions]
Tree[tree.Section<br/>tree.SectionView]
Retrieval[retrieval<br/>BuildSelectionPrompt<br/>writeSectionLine]
API[HTTP API<br/>handleGetSection<br/>handleQuery]
PDFParser -->|PageStart/PageEnd| DBSection
DBSection -->|UpsertSection<br/>ListSections*| Tree
Tree -->|BuildView<br/>PageStart/PageEnd<br/>CandidateQuestions| Retrieval
Retrieval -->|answers: firstCandidateQuestion| Retrieval
Tree -->|SectionView fields| API
DBSection -->|GetSection<br/>Query results| API
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Warning Review limit reached
More reviews will be available in 28 minutes and 11 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (22)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- openapi.yaml: SectionResponse + QuerySection gain page_start / page_end / candidate_questions. - config.example.yaml: ingest.hyde block with defaults.
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
runSelectionWithRetryyou’re usinglog.Printf, which is inconsistent with the rest of the codebase’s structured logging viaslog; consider wiring a logger into the retrieval strategy and emitting a structured warning instead of using the global logger.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `runSelectionWithRetry` you’re using `log.Printf`, which is inconsistent with the rest of the codebase’s structured logging via `slog`; consider wiring a logger into the retrieval strategy and emitting a structured warning instead of using the global logger.
## Individual Comments
### Comment 1
<location path="pkg/parser/pdf.go" line_range="359-368" />
<code_context>
+
+// splitContentByWords breaks a long string into pieces near target size at
+// word boundaries. The last piece may be smaller; pieces are never midword.
+func splitContentByWords(s string, target int) []string {
+ s = strings.TrimSpace(s)
+ if target < 200 {
+ target = 200
+ }
+ slack := target / 4
+ if len(s) <= target+slack {
+ return []string{s}
+ }
+ var chunks []string
+ for len(s) > 0 {
+ if len(s) <= target+slack {
+ chunks = append(chunks, strings.TrimSpace(s))
+ break
+ }
+ upper := target + slack
+ if upper > len(s) {
+ upper = len(s)
+ }
+ cut := strings.LastIndex(s[:upper], " ")
+ if cut < target/2 {
+ cut = upper // no good break: hard-cut at upper bound
+ }
+ chunks = append(chunks, strings.TrimSpace(s[:cut]))
+ s = strings.TrimSpace(s[cut:])
+ }
+ return chunks
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid byte-based slicing that can split multi-byte runes and corrupt UTF-8
`splitContentByWords` (and `deriveChunkTitle`) use byte-indexed slices like `s[:upper]` and `s[:cut]` on a UTF-8 string. Since `len` and these indices are in bytes, you can slice in the middle of a multi-byte rune and produce invalid UTF-8, which may break consumers (e.g., JSON encoding or UI). Please use rune-aware iteration (e.g., convert to `[]rune` or iterate with `range`) or otherwise ensure cut positions are aligned to rune boundaries.
</issue_to_address>
### Comment 2
<location path="pkg/ingest/hyde_test.go" line_range="133-141" />
<code_context>
+ }
+}
+
+func TestRunHyDEWithRetryFinalParseFailReturnsError(t *testing.T) {
+ m := &llmgate.Mock{Reply: "no JSON anywhere here, just prose."}
+ _, err := runHyDEWithRetry(context.Background(), m, llmgate.Request{
+ Messages: []llmgate.Message{{Role: llmgate.RoleUser, Content: "go"}},
+ }, 2)
+ if err == nil {
+ t.Error("want final-parse error after all retries fail")
+ }
+ if m.Calls() != 3 { // 1 initial + 2 retries
+ t.Errorf("want 3 attempts, got %d", m.Calls())
+ }
</code_context>
<issue_to_address>
**suggestion (testing):** Add an end-to-end test for generateCandidateQuestions to cover leaf-only selection, internal-node skipping, and non-fatal aggregation of errors
Current tests cover `parseHyDEResponse`, `runHyDEWithRetry`, and `candidateQuestionsFor`, but not the higher-level orchestration in `generateCandidateQuestions` (leaf detection, concurrency limits, and error aggregation). Please add a test that uses a fake `DB` and `LLM` in a `Pipeline`, calls `generateCandidateQuestions`, and verifies that:
- only leaf sections (no children) are processed;
- sections with existing `CandidateQuestions` are left unchanged;
- non-fatal errors from some leaves are collected while other leaves still persist questions; and
- the `HyDEConcurrency` limit is respected (e.g., via a counting/blocking mock).
This will lock in the intended non-fatal, concurrent behavior and guard against regressions in this stage’s orchestration logic.
Suggested implementation:
```golang
func TestParseHyDEResponseHappy(t *testing.T) {
got, err := parseHyDEResponse(`{"questions":["Q1","Q2","Q3"]}`)
if err != nil {
t.Fatalf("parse: %v", err)
}
if len(got) != 3 || got[0] != "Q1" || got[2] != "Q3" {
t.Errorf("got %+v", got)
}
}
// End-to-end test for generateCandidateQuestions orchestration:
// - only leaf sections are processed;
// - sections with existing CandidateQuestions are left unchanged;
// - non-fatal errors from some leaves are collected while other leaves still persist questions; and
// - HyDEConcurrency limit is respected.
func TestGenerateCandidateQuestions_EndToEnd(t *testing.T) {
t.Parallel()
ctx := context.Background()
// Section tree:
//
// root (id=1)
// ├── internal (id=2)
// │ ├── leaf-success (id=3)
// │ └── leaf-error (id=4)
// └── leaf-with-existing-questions (id=5)
//
// Expectation:
// - sections 3 and 4 are leaves and are candidates for HyDE;
// - section 5 is a leaf but is skipped due to existing CandidateQuestions;
// - only 3 should end up with new CandidateQuestions from the LLM;
// - 4 should fail with a non-fatal error that is returned but does not abort the pipeline;
// - concurrency for HyDE calls must respect HyDEConcurrency.
rootID := "1"
internalID := "2"
leafSuccessID := "3"
leafErrorID := "4"
leafExistingID := "5"
sections := map[string]*Section{
rootID: {
ID: rootID,
ParentID: "",
// children 2 and 5
},
internalID: {
ID: internalID,
ParentID: rootID,
// children 3 and 4
},
leafSuccessID: {
ID: leafSuccessID,
ParentID: internalID,
},
leafErrorID: {
ID: leafErrorID,
ParentID: internalID,
},
leafExistingID: {
ID: leafExistingID,
ParentID: rootID,
// Already has CandidateQuestions; should be skipped.
CandidateQuestions: []CandidateQuestion{
{Question: "existing question"},
},
},
}
// Fake DB implementing just enough for generateCandidateQuestions.
fdb := &fakeHyDEDB{
sections: sections,
// record of what sections get persisted to validate leaf-only behavior
persisted: make(map[string][]CandidateQuestion),
}
// HyDE calls:
// - section 3 -> returns valid questions
// - section 4 -> returns an error
// The mock also tracks concurrency and total call count.
mockLLM := &countingHyDELLM{
responses: map[string]llmgate.Response{
leafSuccessID: {
Content: `{"questions":["Q-leaf-success-1","Q-leaf-success-2"]}`,
},
},
errors: map[string]error{
leafErrorID: errors.New("synthetic HyDE error"),
},
maxConcurrentCh: make(chan struct{}, 2), // allow up to 2 in-flight -> we will assert against HyDEConcurrency value
}
p := &Pipeline{
DB: fdb,
LLM: mockLLM,
HyDEConcurrency: 2,
}
errs := p.generateCandidateQuestions(ctx, []*Section{
sections[rootID],
sections[internalID],
sections[leafSuccessID],
sections[leafErrorID],
sections[leafExistingID],
})
// Non-fatal error from leafErrorID must be present.
if len(errs) == 0 {
t.Fatalf("expected non-fatal errors, got none")
}
var sawLeafError bool
for _, err := range errs {
if err == nil {
continue
}
if strings.Contains(err.Error(), leafErrorID) || strings.Contains(err.Error(), "synthetic HyDE error") {
sawLeafError = true
}
}
if !sawLeafError {
t.Errorf("expected error associated with leaf %s to be reported; got errs=%v", leafErrorID, errs)
}
// Leaf-only behavior:
// - success leaf should be persisted with questions
// - error leaf should *not* have questions persisted
// - leaf with existing questions must not be overwritten
// - internal and root sections must not be processed.
if got := fdb.persisted[leafSuccessID]; len(got) == 0 {
t.Errorf("expected candidate questions persisted for leaf %s; got none", leafSuccessID)
}
if _, ok := fdb.persisted[leafErrorID]; ok {
t.Errorf("did not expect candidate questions for error leaf %s", leafErrorID)
}
if persisted, ok := fdb.persisted[leafExistingID]; ok && len(persisted) > 0 {
t.Errorf("expected leaf with existing CandidateQuestions (id=%s) to be skipped; got %+v", leafExistingID, persisted)
}
if _, ok := fdb.persisted[rootID]; ok {
t.Errorf("root section id=%s should not be processed as a leaf", rootID)
}
if _, ok := fdb.persisted[internalID]; ok {
t.Errorf("internal section id=%s should not be processed as a leaf", internalID)
}
// Ensure the in-memory representation for the leaf with existing questions
// has not been overwritten.
if len(sections[leafExistingID].CandidateQuestions) != 1 ||
sections[leafExistingID].CandidateQuestions[0].Question != "existing question" {
t.Errorf("existing CandidateQuestions on leaf %s were mutated: %+v", leafExistingID, sections[leafExistingID].CandidateQuestions)
}
// HyDE should have been called only for the two leaves without existing questions.
if calls := mockLLM.Calls(); calls != 2 {
t.Errorf("expected 2 HyDE calls (for %s and %s), got %d", leafSuccessID, leafErrorID, calls)
}
// Concurrency: verify that generateCandidateQuestions respected HyDEConcurrency
// by asserting that the LLM never saw more than HyDEConcurrency in flight.
if mockLLM.MaxConcurrent() > p.HyDEConcurrency {
t.Errorf("HyDEConcurrency exceeded: max concurrent calls = %d, HyDEConcurrency = %d",
mockLLM.MaxConcurrent(), p.HyDEConcurrency)
}
}
// fakeHyDEDB is a minimal fake DB implementation sufficient for
// generateCandidateQuestions orchestration tests.
type fakeHyDEDB struct {
mu sync.Mutex
sections map[string]*Section
persisted map[string][]CandidateQuestion
}
func (f *fakeHyDEDB) LoadSections(ctx context.Context, ids []string) ([]*Section, error) {
f.mu.Lock()
defer f.mu.Unlock()
var out []*Section
for _, id := range ids {
if s, ok := f.sections[id]; ok {
out = append(out, s)
}
}
return out, nil
}
// PersistCandidateQuestions records per-section CandidateQuestions that
// generateCandidateQuestions asked to persist.
func (f *fakeHyDEDB) PersistCandidateQuestions(ctx context.Context, sectionID string, qs []CandidateQuestion) error {
f.mu.Lock()
defer f.mu.Unlock()
// Copy to prevent later mutation affecting assertions.
copied := make([]CandidateQuestion, len(qs))
copy(copied, qs)
f.persisted[sectionID] = copied
if s, ok := f.sections[sectionID]; ok {
s.CandidateQuestions = copied
}
return nil
}
// countingHyDELLM is a mock LLM that:
// - returns configured responses or errors per section ID;
// - tracks total call count;
// - tracks maximum concurrent calls via a semaphore channel.
type countingHyDELLM struct {
mu sync.Mutex
totalCalls int
maxConcurrent int
currentInFlight int
// keyed by Section.ID or some opaque key the pipeline passes through
responses map[string]llmgate.Response
errors map[string]error
maxConcurrentCh chan struct{}
}
func (m *countingHyDELLM) Call(ctx context.Context, req llmgate.Request) (llmgate.Response, error) {
// Acquire slot to track concurrency.
select {
case m.maxConcurrentCh <- struct{}{}:
case <-ctx.Done():
return llmgate.Response{}, ctx.Err()
}
defer func() { <-m.maxConcurrentCh }()
m.mu.Lock()
m.totalCalls++
m.currentInFlight++
if m.currentInFlight > m.maxConcurrent {
m.maxConcurrent = m.currentInFlight
}
m.mu.Unlock()
defer func() {
m.mu.Lock()
m.currentInFlight--
m.mu.Unlock()
}()
// For the purpose of the test we rely on a convention that the section ID
// is embedded in the user message content. The production code already
wants deterministic prompts for HyDE, so this is safe to assert on here.
var key string
if len(req.Messages) > 0 {
key = req.Messages[0].Content
}
if err, ok := m.errors[key]; ok {
return llmgate.Response{}, err
}
if resp, ok := m.responses[key]; ok {
return resp, nil
}
return llmgate.Response{
Content: `{"questions":["default-question"]}`,
}, nil
}
func (m *countingHyDELLM) Calls() int {
m.mu.Lock()
defer m.mu.Unlock()
return m.totalCalls
}
func (m *countingHyDELLM) MaxConcurrent() int {
m.mu.Lock()
defer m.mu.Unlock()
return m.maxConcurrent
}
```
To integrate this test with your existing codebase you will need to:
1. **Import dependencies** in `pkg/ingest/hyde_test.go` if they are not already present:
- Add to the import block: `context`, `errors`, `strings`, and `sync`.
- Ensure `llmgate`, `Pipeline`, `Section`, and `CandidateQuestion` are already imported / in scope; if they are in another package, adjust the qualifiers accordingly.
2. **Align fake types with real interfaces**:
- Update `fakeHyDEDB` to implement the exact DB interface used by `Pipeline.generateCandidateQuestions`. If your DB interface uses different method names or signatures (e.g. `SectionsByID` instead of `LoadSections`, or `SaveCandidateQuestions` instead of `PersistCandidateQuestions`), rename and adjust the methods accordingly.
- If `generateCandidateQuestions` uses additional DB methods, add no-op or recording implementations to `fakeHyDEDB` so the test compiles and runs.
3. **Align LLM mock with your LLM abstraction**:
- Change `countingHyDELLM.Call` to match the actual method name/signature expected by the `Pipeline` for HyDE calls. If the abstraction is, for example, `HyDE(ctx, section *Section) (llmgate.Response, error)` instead of a raw `llmgate.Request`, adjust the mock accordingly, including how it derives the `key` used to look up responses/errors.
- If your real HyDE call extracts the section ID differently (e.g. via metadata rather than the first message’s content), mirror that in the mock so the per-section behavior (success vs error) still matches the test’s intended assertions.
4. **Adjust section construction**:
- Replace the simple `&Section{ID: ..., ParentID: ...}` literals with whatever fields are required by your real `Section` type (e.g. `DocumentID`, `Path`, `Children []*Section`, etc.). In particular, ensure that `generateCandidateQuestions` will recognize which sections are leaves vs internal nodes based on how children are represented in your `Section` model.
5. **Match error formatting**:
- The test currently searches error messages using `strings.Contains` for either `leafErrorID` or `"synthetic HyDE error"`. If your `generateCandidateQuestions` wraps or formats errors differently, tweak that check so it matches the actual error strings produced by your orchestration logic.
With these adjustments, the new `TestGenerateCandidateQuestions_EndToEnd` should compile and validate:
- leaf-only processing,
- skipping of leaves with existing questions,
- non-fatal aggregation of errors, and
- enforcement of the `HyDEConcurrency` limit via the counting/blocking mock.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| func splitContentByWords(s string, target int) []string { | ||
| s = strings.TrimSpace(s) | ||
| if target < 200 { | ||
| target = 200 | ||
| } | ||
| slack := target / 4 | ||
| if len(s) <= target+slack { | ||
| return []string{s} | ||
| } | ||
| var chunks []string |
There was a problem hiding this comment.
issue (bug_risk): Avoid byte-based slicing that can split multi-byte runes and corrupt UTF-8
splitContentByWords (and deriveChunkTitle) use byte-indexed slices like s[:upper] and s[:cut] on a UTF-8 string. Since len and these indices are in bytes, you can slice in the middle of a multi-byte rune and produce invalid UTF-8, which may break consumers (e.g., JSON encoding or UI). Please use rune-aware iteration (e.g., convert to []rune or iterate with range) or otherwise ensure cut positions are aligned to rune boundaries.
| func TestRunHyDEWithRetryFinalParseFailReturnsError(t *testing.T) { | ||
| m := &llmgate.Mock{Reply: "no JSON anywhere here, just prose."} | ||
| _, err := runHyDEWithRetry(context.Background(), m, llmgate.Request{ | ||
| Messages: []llmgate.Message{{Role: llmgate.RoleUser, Content: "go"}}, | ||
| }, 2) | ||
| if err == nil { | ||
| t.Error("want final-parse error after all retries fail") | ||
| } | ||
| if m.Calls() != 3 { // 1 initial + 2 retries |
There was a problem hiding this comment.
suggestion (testing): Add an end-to-end test for generateCandidateQuestions to cover leaf-only selection, internal-node skipping, and non-fatal aggregation of errors
Current tests cover parseHyDEResponse, runHyDEWithRetry, and candidateQuestionsFor, but not the higher-level orchestration in generateCandidateQuestions (leaf detection, concurrency limits, and error aggregation). Please add a test that uses a fake DB and LLM in a Pipeline, calls generateCandidateQuestions, and verifies that:
- only leaf sections (no children) are processed;
- sections with existing
CandidateQuestionsare left unchanged; - non-fatal errors from some leaves are collected while other leaves still persist questions; and
- the
HyDEConcurrencylimit is respected (e.g., via a counting/blocking mock).
This will lock in the intended non-fatal, concurrent behavior and guard against regressions in this stage’s orchestration logic.
Suggested implementation:
func TestParseHyDEResponseHappy(t *testing.T) {
got, err := parseHyDEResponse(`{"questions":["Q1","Q2","Q3"]}`)
if err != nil {
t.Fatalf("parse: %v", err)
}
if len(got) != 3 || got[0] != "Q1" || got[2] != "Q3" {
t.Errorf("got %+v", got)
}
}
// End-to-end test for generateCandidateQuestions orchestration:
// - only leaf sections are processed;
// - sections with existing CandidateQuestions are left unchanged;
// - non-fatal errors from some leaves are collected while other leaves still persist questions; and
// - HyDEConcurrency limit is respected.
func TestGenerateCandidateQuestions_EndToEnd(t *testing.T) {
t.Parallel()
ctx := context.Background()
// Section tree:
//
// root (id=1)
// ├── internal (id=2)
// │ ├── leaf-success (id=3)
// │ └── leaf-error (id=4)
// └── leaf-with-existing-questions (id=5)
//
// Expectation:
// - sections 3 and 4 are leaves and are candidates for HyDE;
// - section 5 is a leaf but is skipped due to existing CandidateQuestions;
// - only 3 should end up with new CandidateQuestions from the LLM;
// - 4 should fail with a non-fatal error that is returned but does not abort the pipeline;
// - concurrency for HyDE calls must respect HyDEConcurrency.
rootID := "1"
internalID := "2"
leafSuccessID := "3"
leafErrorID := "4"
leafExistingID := "5"
sections := map[string]*Section{
rootID: {
ID: rootID,
ParentID: "",
// children 2 and 5
},
internalID: {
ID: internalID,
ParentID: rootID,
// children 3 and 4
},
leafSuccessID: {
ID: leafSuccessID,
ParentID: internalID,
},
leafErrorID: {
ID: leafErrorID,
ParentID: internalID,
},
leafExistingID: {
ID: leafExistingID,
ParentID: rootID,
// Already has CandidateQuestions; should be skipped.
CandidateQuestions: []CandidateQuestion{
{Question: "existing question"},
},
},
}
// Fake DB implementing just enough for generateCandidateQuestions.
fdb := &fakeHyDEDB{
sections: sections,
// record of what sections get persisted to validate leaf-only behavior
persisted: make(map[string][]CandidateQuestion),
}
// HyDE calls:
// - section 3 -> returns valid questions
// - section 4 -> returns an error
// The mock also tracks concurrency and total call count.
mockLLM := &countingHyDELLM{
responses: map[string]llmgate.Response{
leafSuccessID: {
Content: `{"questions":["Q-leaf-success-1","Q-leaf-success-2"]}`,
},
},
errors: map[string]error{
leafErrorID: errors.New("synthetic HyDE error"),
},
maxConcurrentCh: make(chan struct{}, 2), // allow up to 2 in-flight -> we will assert against HyDEConcurrency value
}
p := &Pipeline{
DB: fdb,
LLM: mockLLM,
HyDEConcurrency: 2,
}
errs := p.generateCandidateQuestions(ctx, []*Section{
sections[rootID],
sections[internalID],
sections[leafSuccessID],
sections[leafErrorID],
sections[leafExistingID],
})
// Non-fatal error from leafErrorID must be present.
if len(errs) == 0 {
t.Fatalf("expected non-fatal errors, got none")
}
var sawLeafError bool
for _, err := range errs {
if err == nil {
continue
}
if strings.Contains(err.Error(), leafErrorID) || strings.Contains(err.Error(), "synthetic HyDE error") {
sawLeafError = true
}
}
if !sawLeafError {
t.Errorf("expected error associated with leaf %s to be reported; got errs=%v", leafErrorID, errs)
}
// Leaf-only behavior:
// - success leaf should be persisted with questions
// - error leaf should *not* have questions persisted
// - leaf with existing questions must not be overwritten
// - internal and root sections must not be processed.
if got := fdb.persisted[leafSuccessID]; len(got) == 0 {
t.Errorf("expected candidate questions persisted for leaf %s; got none", leafSuccessID)
}
if _, ok := fdb.persisted[leafErrorID]; ok {
t.Errorf("did not expect candidate questions for error leaf %s", leafErrorID)
}
if persisted, ok := fdb.persisted[leafExistingID]; ok && len(persisted) > 0 {
t.Errorf("expected leaf with existing CandidateQuestions (id=%s) to be skipped; got %+v", leafExistingID, persisted)
}
if _, ok := fdb.persisted[rootID]; ok {
t.Errorf("root section id=%s should not be processed as a leaf", rootID)
}
if _, ok := fdb.persisted[internalID]; ok {
t.Errorf("internal section id=%s should not be processed as a leaf", internalID)
}
// Ensure the in-memory representation for the leaf with existing questions
// has not been overwritten.
if len(sections[leafExistingID].CandidateQuestions) != 1 ||
sections[leafExistingID].CandidateQuestions[0].Question != "existing question" {
t.Errorf("existing CandidateQuestions on leaf %s were mutated: %+v", leafExistingID, sections[leafExistingID].CandidateQuestions)
}
// HyDE should have been called only for the two leaves without existing questions.
if calls := mockLLM.Calls(); calls != 2 {
t.Errorf("expected 2 HyDE calls (for %s and %s), got %d", leafSuccessID, leafErrorID, calls)
}
// Concurrency: verify that generateCandidateQuestions respected HyDEConcurrency
// by asserting that the LLM never saw more than HyDEConcurrency in flight.
if mockLLM.MaxConcurrent() > p.HyDEConcurrency {
t.Errorf("HyDEConcurrency exceeded: max concurrent calls = %d, HyDEConcurrency = %d",
mockLLM.MaxConcurrent(), p.HyDEConcurrency)
}
}
// fakeHyDEDB is a minimal fake DB implementation sufficient for
// generateCandidateQuestions orchestration tests.
type fakeHyDEDB struct {
mu sync.Mutex
sections map[string]*Section
persisted map[string][]CandidateQuestion
}
func (f *fakeHyDEDB) LoadSections(ctx context.Context, ids []string) ([]*Section, error) {
f.mu.Lock()
defer f.mu.Unlock()
var out []*Section
for _, id := range ids {
if s, ok := f.sections[id]; ok {
out = append(out, s)
}
}
return out, nil
}
// PersistCandidateQuestions records per-section CandidateQuestions that
// generateCandidateQuestions asked to persist.
func (f *fakeHyDEDB) PersistCandidateQuestions(ctx context.Context, sectionID string, qs []CandidateQuestion) error {
f.mu.Lock()
defer f.mu.Unlock()
// Copy to prevent later mutation affecting assertions.
copied := make([]CandidateQuestion, len(qs))
copy(copied, qs)
f.persisted[sectionID] = copied
if s, ok := f.sections[sectionID]; ok {
s.CandidateQuestions = copied
}
return nil
}
// countingHyDELLM is a mock LLM that:
// - returns configured responses or errors per section ID;
// - tracks total call count;
// - tracks maximum concurrent calls via a semaphore channel.
type countingHyDELLM struct {
mu sync.Mutex
totalCalls int
maxConcurrent int
currentInFlight int
// keyed by Section.ID or some opaque key the pipeline passes through
responses map[string]llmgate.Response
errors map[string]error
maxConcurrentCh chan struct{}
}
func (m *countingHyDELLM) Call(ctx context.Context, req llmgate.Request) (llmgate.Response, error) {
// Acquire slot to track concurrency.
select {
case m.maxConcurrentCh <- struct{}{}:
case <-ctx.Done():
return llmgate.Response{}, ctx.Err()
}
defer func() { <-m.maxConcurrentCh }()
m.mu.Lock()
m.totalCalls++
m.currentInFlight++
if m.currentInFlight > m.maxConcurrent {
m.maxConcurrent = m.currentInFlight
}
m.mu.Unlock()
defer func() {
m.mu.Lock()
m.currentInFlight--
m.mu.Unlock()
}()
// For the purpose of the test we rely on a convention that the section ID
// is embedded in the user message content. The production code already
wants deterministic prompts for HyDE, so this is safe to assert on here.
var key string
if len(req.Messages) > 0 {
key = req.Messages[0].Content
}
if err, ok := m.errors[key]; ok {
return llmgate.Response{}, err
}
if resp, ok := m.responses[key]; ok {
return resp, nil
}
return llmgate.Response{
Content: `{"questions":["default-question"]}`,
}, nil
}
func (m *countingHyDELLM) Calls() int {
m.mu.Lock()
defer m.mu.Unlock()
return m.totalCalls
}
func (m *countingHyDELLM) MaxConcurrent() int {
m.mu.Lock()
defer m.mu.Unlock()
return m.maxConcurrent
}To integrate this test with your existing codebase you will need to:
-
Import dependencies in
pkg/ingest/hyde_test.goif they are not already present:- Add to the import block:
context,errors,strings, andsync. - Ensure
llmgate,Pipeline,Section, andCandidateQuestionare already imported / in scope; if they are in another package, adjust the qualifiers accordingly.
- Add to the import block:
-
Align fake types with real interfaces:
- Update
fakeHyDEDBto implement the exact DB interface used byPipeline.generateCandidateQuestions. If your DB interface uses different method names or signatures (e.g.SectionsByIDinstead ofLoadSections, orSaveCandidateQuestionsinstead ofPersistCandidateQuestions), rename and adjust the methods accordingly. - If
generateCandidateQuestionsuses additional DB methods, add no-op or recording implementations tofakeHyDEDBso the test compiles and runs.
- Update
-
Align LLM mock with your LLM abstraction:
- Change
countingHyDELLM.Callto match the actual method name/signature expected by thePipelinefor HyDE calls. If the abstraction is, for example,HyDE(ctx, section *Section) (llmgate.Response, error)instead of a rawllmgate.Request, adjust the mock accordingly, including how it derives thekeyused to look up responses/errors. - If your real HyDE call extracts the section ID differently (e.g. via metadata rather than the first message’s content), mirror that in the mock so the per-section behavior (success vs error) still matches the test’s intended assertions.
- Change
-
Adjust section construction:
- Replace the simple
&Section{ID: ..., ParentID: ...}literals with whatever fields are required by your realSectiontype (e.g.DocumentID,Path,Children []*Section, etc.). In particular, ensure thatgenerateCandidateQuestionswill recognize which sections are leaves vs internal nodes based on how children are represented in yourSectionmodel.
- Replace the simple
-
Match error formatting:
- The test currently searches error messages using
strings.Containsfor eitherleafErrorIDor"synthetic HyDE error". If yourgenerateCandidateQuestionswraps or formats errors differently, tweak that check so it matches the actual error strings produced by your orchestration logic.
- The test currently searches error messages using
With these adjustments, the new TestGenerateCandidateQuestions_EndToEnd should compile and validate:
- leaf-only processing,
- skipping of leaves with existing questions,
- non-fatal aggregation of errors, and
- enforcement of the
HyDEConcurrencylimit via the counting/blocking mock.
Summary
Phase 1.1 (page citations) + Phase 1.2 (HyDE candidate questions) + the cross-cutting prerequisites (DB migration, API extensions, config).
db.Section->tree.Section->tree.SectionViewand are surfaced (omitempty) on every API handler that returns sections.pkg/ingest/hyde.goasks the LLM for up to N self-contained questions each leaf section can answer, with the same JSON-retry + graceful-degrade pattern retrieval already uses. Non-fatal — failures leave the doc fully usable.answers:hint (~120 chars cap) so the LLM has wider lexical overlap with user queries.ingest.hyde.{enabled,model,num_questions,concurrency}withVLE_INGEST_HYDE_*env overrides. Defaults: enabled=true, num_questions=5, concurrency=4.Built on top of
fix/parser-bold-headings(PR #12) so the parser fixes already in flight carry forward.Migration
0004_sections_extrasadds the new columns (page_start,page_end,candidate_questions) and a(document_id, page_start, page_end)index.Test plan
go test ./...— all packages pass.pkg/parser/pdf_pages_test.gocoverspropagateSectionPagesunion semantics + chunker page inheritance.pkg/db/sections_marshal_test.gocovers JSONB round-trip for candidate_questions and NULL handling for page columns.pkg/ingest/hyde_test.gocovers parse tolerance, retry, final-fail-graceful, model override, NumQuestions cap.pkg/retrieval/retrieval_test.goasserts the selection prompt surfaces the first candidate question as ananswers:hint.page_start/page_endandcandidate_questionspopulate on its sections.Summary by Sourcery
Add page range tracking for PDF sections, introduce HyDE-generated candidate questions during ingest, and surface both through storage, retrieval, and API layers to improve retrieval quality and citations.
New Features:
Bug Fixes:
Enhancements:
Tests:
answers:hints in prompts.