feat: PGVector-backed log indexing and semantic search (#24) #80
khat190 wants to merge 2 commits into
@khat190 is attempting to deploy a commit to the Deekshith Gowda HS's projects Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough
This PR implements semantic search over deployment logs by combining pgvector storage, Cohere embeddings, Groq summarization, and Inngest background jobs. Logs are automatically indexed when deployments go live and can be searched via authenticated API endpoints. A scheduled cron job re-indexes active sandboxes weekly.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
lib/vector-indexer.ts (1)
65-88: ⚡ Quick win: Consider adding a timeout to external API calls.
The `fetch` calls to Cohere lack a timeout. If the Cohere API is slow or unresponsive, the Inngest function could hang until the platform timeout kicks in, wasting resources. Adding `AbortSignal.timeout()` would allow graceful failure and retry.
♻️ Proposed fix
 async function embedTexts(texts: string[]): Promise<number[][]> {
   const response = await fetch("https://api.cohere.com/v1/embed", {
     method: "POST",
     headers: {
       Authorization: `Bearer ${getCohereApiKey()}`,
       "Content-Type": "application/json",
       "X-Client-Name": "secdev",
     },
     body: JSON.stringify({
       model: "embed-english-v3.0",
       texts,
       input_type: "search_document",
       truncate: "END",
     }),
+    signal: AbortSignal.timeout(30_000), // 30s timeout
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/vector-indexer.ts` around lines 65 - 88, The embedTexts function currently calls fetch without a timeout; update it to create an AbortSignal via AbortSignal.timeout(...) (or an AbortController with setTimeout) and pass the signal option into fetch to enforce a timeout (choose an appropriate duration, e.g., 10s). Ensure you clear any timers if using AbortController, handle aborts and propagate a clear error message (e.g., "Cohere embed request timed out") alongside other non-OK responses, and keep the rest of the response handling in embedTexts unchanged.
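The timeout suggestion above could be wrapped in a small reusable helper. The following is a minimal sketch, not code from the PR: `fetchWithTimeout` and `isTimeoutError` are hypothetical names, and it assumes a runtime where `fetch` and `AbortSignal.timeout()` are available as globals (for example, a recent Node.js release).

```typescript
// Sketch only: `fetchWithTimeout` and `isTimeoutError` are hypothetical helpers,
// not functions from the PR.

// AbortSignal.timeout() rejects with a "TimeoutError"; a manually aborted
// AbortController rejects with an "AbortError". Both are treated as a timeout here.
function isTimeoutError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  return name === "TimeoutError" || name === "AbortError";
}

async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  timeoutMs: number,
): Promise<Response> {
  try {
    // The signal aborts the request once timeoutMs has elapsed.
    return await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    if (isTimeoutError(err)) {
      // Surface a clear message instead of a generic abort error.
      throw new Error(`Request to ${url} timed out after ${timeoutMs} ms`);
    }
    throw err; // Network and other failures pass through unchanged.
  }
}
```

Because the helper rethrows, the caller still sees a failure and can apply whatever retry policy it already has.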
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/api/logs/search/route.ts`:
- Around line 135-160: The handleTrigger function currently allows any
authenticated user to trigger indexing for any sandboxId; before calling
inngest.send, query your sandbox/store to verify ownership (e.g., fetch sandbox
by sandboxId and confirm its ownerId or userId matches the authenticated userId)
and return a 403/400 JSON response if the sandbox is missing or not owned by the
requester; ensure this ownership check happens in handleTrigger immediately
after validating sandboxId and before invoking inngest.send so only owners can
enqueue indexing events.
In `@lib/db.ts`:
- Around line 266-267: Update the outdated comment in the Log vector index table
block to reflect that the schema uses Cohere 1024-dim embeddings (vector(1024))
instead of "1536-dim OpenAI embeddings"; locate the comment text near the "Log
vector index table" header in lib/db.ts and change the wording to mention Cohere
1024-dim embeddings so it matches the schema and avoids confusion.
In `@lib/vector-indexer.ts`:
- Around line 156-175: indexSandbox currently calls getDb() and runs queries
without ensuring the schema exists; add an awaited call to ensureTables() at the
start of indexSandbox (before calling getDb()) so the DB schema is created
before any queries run, and import/require ensureTables (the function referenced
in lib/deployer.ts) if it isn’t already imported; make sure the call is awaited (await
ensureTables()) before proceeding to const sql = getDb().
---
Nitpick comments:
In `@lib/vector-indexer.ts`:
- Around line 65-88: The embedTexts function currently calls fetch without a
timeout; update it to create an AbortSignal via AbortSignal.timeout(...) (or an
AbortController with setTimeout) and pass the signal option into fetch to
enforce a timeout (choose an appropriate duration, e.g., 10s). Ensure you clear
any timers if using AbortController, handle aborts and propagate a clear error
message (e.g., "Cohere embed request timed out") alongside other non-OK
responses, and keep the rest of the response handling in embedTexts unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 65369052-7980-4098-b703-3bac9f7189d6
📒 Files selected for processing (5)
- app/api/inngest/route.ts
- app/api/logs/search/route.ts
- lib/db.ts
- lib/deployer.ts
- lib/vector-indexer.ts
async function handleTrigger(req: NextRequest, userId: string) {
  let body: { sandboxId?: string; ttlDays?: number };
  try {
    body = await req.json();
  } catch {
    return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 });
  }

  const { sandboxId, ttlDays = 30 } = body;
  if (!sandboxId) {
    return NextResponse.json(
      { error: "sandboxId is required" },
      { status: 400 }
    );
  }

  await inngest.send({
    name: "log/index.requested",
    data: { sandboxId, userId, ttlDays },
  });

  return NextResponse.json({
    ok: true,
    message: `Indexing triggered for sandbox ${sandboxId}`,
  });
}
Missing sandbox ownership validation – IDOR vulnerability.
Any authenticated user can trigger indexing for any sandboxId. The vectors are stored with the triggering user's userId, allowing that user to subsequently search and view another user's deployment logs.
Before sending the Inngest event, verify the authenticated user owns the sandbox:
🔒 Proposed fix
+import { getDeployment } from "@/lib/deployer";
+
 async function handleTrigger(req: NextRequest, userId: string) {
   let body: { sandboxId?: string; ttlDays?: number };
   try {
     body = await req.json();
   } catch {
     return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 });
   }

   const { sandboxId, ttlDays = 30 } = body;
   if (!sandboxId) {
     return NextResponse.json(
       { error: "sandboxId is required" },
       { status: 400 }
     );
   }

+  // Verify ownership before triggering indexing
+  const deployment = await getDeployment(sandboxId);
+  if (!deployment || deployment.userId !== userId) {
+    return NextResponse.json({ error: "Not found" }, { status: 404 });
+  }
+
   await inngest.send({
     name: "log/index.requested",
     data: { sandboxId, userId, ttlDays },
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/api/logs/search/route.ts` around lines 135 - 160, The handleTrigger
function currently allows any authenticated user to trigger indexing for any
sandboxId; before calling inngest.send, query your sandbox/store to verify
ownership (e.g., fetch sandbox by sandboxId and confirm its ownerId or userId
matches the authenticated userId) and return a 403/400 JSON response if the
sandbox is missing or not owned by the requester; ensure this ownership check
happens in handleTrigger immediately after validating sandboxId and before
invoking inngest.send so only owners can enqueue indexing events.
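The ownership rule described in this comment can be isolated as a pure function, which keeps it easy to unit-test apart from the route handler. This is a sketch under assumptions: `DeploymentRecord`, `OwnershipResult`, and `checkSandboxOwnership` are hypothetical names, and the real record returned by the project's deployment lookup may have a different shape.

```typescript
// Hypothetical record shape; the actual deployment type in the repo may differ.
interface DeploymentRecord {
  sandboxId: string;
  userId: string;
}

type OwnershipResult =
  | { ok: true }
  | { ok: false; status: 404; error: string };

// Returns the same "Not found" result for a missing sandbox and for a sandbox
// owned by someone else, so the response does not reveal which IDs exist.
function checkSandboxOwnership(
  deployment: DeploymentRecord | null | undefined,
  requesterId: string,
): OwnershipResult {
  if (!deployment || deployment.userId !== requesterId) {
    return { ok: false, status: 404, error: "Not found" };
  }
  return { ok: true };
}
```

A handler would call this right after validating `sandboxId` and return early on `ok: false`, before any indexing event is enqueued.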
// ── Log vector index table ────────────────────────────────────────────────
// Stores chunked log text + 1536-dim OpenAI embeddings for semantic search.
Outdated comment: mentions OpenAI 1536-dim, but schema uses Cohere 1024-dim.
The comment says "1536-dim OpenAI embeddings" but the schema defines vector(1024) for Cohere embeddings. This inconsistency could confuse future maintainers.
📝 Proposed fix
- // ── Log vector index table ────────────────────────────────────────────────
- // Stores chunked log text + 1536-dim OpenAI embeddings for semantic search.
+ // ── Log vector index table ────────────────────────────────────────────────
+ // Stores chunked log text + 1024-dim Cohere embeddings for semantic search.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/db.ts` around lines 266 - 267, Update the outdated comment in the Log
vector index table block to reflect that the schema uses Cohere 1024-dim
embeddings (vector(1024)) instead of "1536-dim OpenAI embeddings"; locate the
comment text near the "Log vector index table" header in lib/db.ts and change
the wording to mention Cohere 1024-dim embeddings so it matches the schema and
avoids confusion.
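Since the column is declared as vector(1024), an embedding of any other length would be rejected at insert time. A guard in application code could catch the mismatch earlier and with a clearer message. The sketch below is illustrative only: `EMBEDDING_DIM` and `toPgVectorLiteral` are hypothetical names, and it assumes vectors are passed to pgvector in its bracketed text form.

```typescript
// Assumed to match the vector(1024) column mentioned in the review comment.
const EMBEDDING_DIM = 1024;

// Validates an embedding and serializes it to pgvector's text form, e.g. "[0.1,0.2]".
function toPgVectorLiteral(embedding: number[]): string {
  if (embedding.length !== EMBEDDING_DIM) {
    throw new Error(
      `Expected ${EMBEDDING_DIM} dimensions, got ${embedding.length}`,
    );
  }
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}
```

Keeping the dimension in one named constant also gives the schema comment a single source to stay consistent with.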
async function indexSandbox(opts: {
  sandboxId: string;
  userId: string;
  ttlDays: number;
}): Promise<{ chunksIndexed: number; pruned: number }> {
  const { sandboxId, userId, ttlDays } = opts;
  const sql = getDb();

  // 1. Find where we left off
  const lastId = await getLastIndexedLogId(sandboxId);

  // 2. Fetch new log lines
  const rows = (await sql`
    SELECT id, msg, level, ts
    FROM deployment_logs
    WHERE sandbox_id = ${sandboxId}
      AND id > ${lastId}
    ORDER BY id ASC
    LIMIT ${CHUNK_SIZE_LINES * MAX_CHUNKS_PER_RUN}
  `) as LogRow[];
Missing ensureTables() call before database access.
The indexSandbox function uses getDb() directly without calling ensureTables() first. Per the upstream contract in lib/deployer.ts (lines 44-55), ensureTables() should be called before DB access to ensure schema exists. If the cron job or indexing runs before any deployment has initialized the schema, queries will fail.
🐛 Proposed fix
+import {
+  ensureTables,
+  getDb,
+  insertLogVector,
+  getLastIndexedLogId,
+  pruneExpiredLogVectors,
+} from "@/lib/db";

 async function indexSandbox(opts: {
   sandboxId: string;
   userId: string;
   ttlDays: number;
 }): Promise<{ chunksIndexed: number; pruned: number }> {
   const { sandboxId, userId, ttlDays } = opts;
+  await ensureTables();
   const sql = getDb();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/vector-indexer.ts` around lines 156 - 175, indexSandbox currently calls
getDb() and runs queries without ensuring the schema exists; add an awaited call
to ensureTables() at the start of indexSandbox (before calling getDb()) so the
DB schema is created before any queries run, and import/require ensureTables
(the function referenced in lib/deployer.ts) if it isn’t already imported;
make sure the call is awaited (await ensureTables()), then proceed to const sql =
getDb().
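If a schema-initialization call is added to every indexing run, it may be worth making sure the setup work executes only once per process. The following sketch shows one way to do that with a memoized promise. `once` is a hypothetical helper, and whether the project's `ensureTables()` already behaves this way is not stated in the review.

```typescript
// Hypothetical helper: wraps an async initializer so concurrent and repeated
// callers share a single in-flight (or completed) run.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = init().catch((err) => {
        // Reset on failure so a later call can retry the initialization.
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}
```

Wrapping a schema-creation function this way keeps `await ensureTables()` cheap on hot paths, while a failed first attempt does not permanently poison later calls.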
What changed
- Enabled pgvector extension and added a new `deployment_log_vectors` table in `lib/db.ts`
- Added database helper functions:
  - `insertLogVector`
  - `searchLogVectors`
  - `pruneExpiredLogVectors`
  - `getLastIndexedLogId`
- Created `lib/vector-indexer.ts` with two Inngest functions:
  - `indexDeploymentLogs` (on-demand indexing)
  - `cronReindexLogs` (daily re-indexing job)
- Created `app/api/logs/search/route.ts` with:
- Updated `lib/deployer.ts` to trigger log indexing when a deployment becomes live
- Updated `app/api/inngest/route.ts` to register the new Inngest functions
Why
Deployment logs were already being stored, but searching them required exact keyword matches. This feature introduces semantic search, allowing developers to find relevant logs using natural-language queries such as:
Logs are chunked, embedded using Cohere embeddings, and stored as vectors in PostgreSQL using pgvector. A retention policy automatically removes vectors older than 30 days to keep storage usage under control.
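The chunking and retention steps described above can be illustrated with two small pure functions. This is a sketch, not the PR's implementation: `LogLine`, `LogChunk`, `chunkLogLines`, and `isExpired` are hypothetical names, and the actual chunk size and pruning query are defined in the PR's own files.

```typescript
// Hypothetical shapes for illustration only.
interface LogLine {
  id: number;
  msg: string;
}

interface LogChunk {
  firstId: number;
  lastId: number;
  text: string;
}

// Groups consecutive log lines into fixed-size chunks; the last chunk may be shorter.
function chunkLogLines(lines: LogLine[], linesPerChunk: number): LogChunk[] {
  if (linesPerChunk <= 0) {
    throw new RangeError("linesPerChunk must be positive");
  }
  const chunks: LogChunk[] = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    const slice = lines.slice(i, i + linesPerChunk);
    chunks.push({
      firstId: slice[0].id,
      lastId: slice[slice.length - 1].id,
      text: slice.map((line) => line.msg).join("\n"),
    });
  }
  return chunks;
}

// True when a vector created at `createdAt` is older than the retention window.
function isExpired(createdAt: Date, now: Date, ttlDays: number): boolean {
  const ttlMs = ttlDays * 24 * 60 * 60 * 1000;
  return now.getTime() - createdAt.getTime() > ttlMs;
}
```

Recording the last log ID in each chunk is what lets an incremental indexer resume from where the previous run stopped, and the expiry check mirrors the 30-day retention policy mentioned above.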
How to test
- Add `COHERE_API_KEY` to `.env`
Request body:
{ "sandboxId": "<sandbox-id>" }
AI Assistance
I used Claude AI as a learning and implementation aid while working with pgvector, embeddings, and Inngest workflows. All generated code was reviewed, tested, and modified as needed before submission.
The semantic search flow was tested locally using real Cohere embeddings and returned relevant results for sample deployment log queries.
Known trade-offs
Closes #24
Summary by CodeRabbit