feat: PGVector-backed log indexing and semantic search (#24) by khat190 · Pull Request #80 · deekshithgowda85/SecDev

khat190 · 2026-06-01T00:22:01Z

What changed

Enabled pgvector extension and added a new deployment_log_vectors table in lib/db.ts
Added database helper functions:
- insertLogVector
- searchLogVectors
- pruneExpiredLogVectors
- getLastIndexedLogId
Created lib/vector-indexer.ts with two Inngest functions:
- indexDeploymentLogs (on-demand indexing)
- cronReindexLogs (daily re-indexing job)
Created app/api/logs/search/route.ts with:
- GET semantic search endpoint
- POST semantic search endpoint
- Manual indexing trigger endpoint
Updated lib/deployer.ts to trigger log indexing when a deployment becomes live
Updated app/api/inngest/route.ts to register the new Inngest functions

Why

Deployment logs were already being stored, but searching them required exact keyword matches. This feature introduces semantic search, allowing developers to find relevant logs using natural-language queries such as:

"memory error during build"
"deployment failed after install step"
"container startup issues"

Logs are chunked, embedded using Cohere embeddings, and stored as vectors in PostgreSQL using pgvector. A retention policy automatically removes vectors older than 30 days to keep storage usage under control.
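
As a rough illustration of the chunking step described above, the sketch below groups consecutive log lines into fixed-size chunks before they would be embedded. The names `LogLine`, `LogChunk`, and `chunkLogLines`, and the chunk size used in the usage note, are assumptions made for this example; the actual chunking logic in lib/vector-indexer.ts may differ.

```typescript
// Hypothetical shape of a stored log line (not taken from the PR).
interface LogLine {
  id: number;
  msg: string;
}

// Hypothetical chunk: the joined text plus the range of log ids it covers.
interface LogChunk {
  firstId: number;
  lastId: number;
  text: string;
}

// Group consecutive log lines into chunks of at most `size` lines.
function chunkLogLines(lines: LogLine[], size: number): LogChunk[] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: LogChunk[] = [];
  for (let i = 0; i < lines.length; i += size) {
    const group = lines.slice(i, i + size);
    chunks.push({
      firstId: group[0].id,
      lastId: group[group.length - 1].id,
      text: group.map((line) => line.msg).join("\n"),
    });
  }
  return chunks;
}
```

With five log lines and a chunk size of 2, this yields three chunks, the last containing a single line. Recording the last log id per chunk is one way to support resuming from where indexing left off.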

How to test

Add a valid COHERE_API_KEY to .env
Trigger indexing:

POST /api/logs/search?action=trigger

Request body:

{
  "sandboxId": "<sandbox-id>"
}

Perform a semantic search:

GET /api/logs/search?q=build+failed

Verify that semantically related deployment logs are returned even when exact keywords do not match.
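
For scripted testing, the search URL can be built rather than typed by hand. This is a minimal sketch: the helper name `buildSearchUrl` and the base URL `http://localhost:3000` are assumptions, and because the endpoint is authenticated, a real request would also need to carry the session credentials.

```typescript
// Build the semantic search URL for a natural-language query.
// `baseUrl` is a placeholder for wherever the app is running.
function buildSearchUrl(baseUrl: string, query: string): string {
  const url = new URL("/api/logs/search", baseUrl);
  // URLSearchParams uses form encoding, so spaces become "+".
  url.searchParams.set("q", query);
  return url.toString();
}

const exampleUrl = buildSearchUrl("http://localhost:3000", "build failed");
// exampleUrl is "http://localhost:3000/api/logs/search?q=build+failed"
```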

Tested locally: triggered indexing for a test sandbox, ran semantic
search for "build failed" and received correct matching log chunks
in the response.

AI Assistance

I used Claude AI as a learning and implementation aid while working with pgvector, embeddings, and Inngest workflows. All generated code was reviewed, tested, and modified as needed before submission.
The semantic search flow was tested locally using real Cohere embeddings and returned relevant results for sample deployment log queries.

Known trade-offs

Uses Cohere embeddings instead of OpenAI embeddings because Cohere provides a free tier without requiring a credit card.
Embeddings are 1024-dimensional instead of 1536-dimensional, reducing storage requirements at the cost of some representational capacity.
Daily re-indexing introduces a small amount of background processing overhead.
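
To put the storage trade-off in numbers: pgvector documents the size of a `vector` value as 4 bytes per dimension plus 8 bytes of overhead. The sketch below applies that formula; the function name `vectorBytes` is made up for this example, and the figures exclude the chunk text, other columns, and index overhead.

```typescript
// Per-value storage for a pgvector `vector`: 4 bytes per dimension + 8 bytes.
function vectorBytes(dimensions: number): number {
  return 4 * dimensions + 8;
}

const cohereBytes = vectorBytes(1024); // 4104 bytes per embedding
const openAiBytes = vectorBytes(1536); // 6152 bytes per embedding

// Fraction of vector storage saved by using 1024 instead of 1536 dimensions.
const savedFraction = 1 - cohereBytes / openAiBytes;
```

By this estimate the 1024-dimensional embeddings take roughly one third less vector storage per chunk than 1536-dimensional ones.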

Closes #24

Summary by CodeRabbit

New Features
- Added semantic log search to find similar deployment logs by content similarity
- Logs automatically indexed for search when deployments transition to live status
- Manual log indexing trigger available with configurable retention settings
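
The retention setting can be pictured as an expiry timestamp computed when a chunk is indexed. The following is only a sketch of that idea: `expiresAt` and `isExpired` are hypothetical helpers, and the PR's `pruneExpiredLogVectors` presumably performs the comparison in SQL rather than in application code.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Expiry time for a vector indexed at `indexedAt` with a TTL in days.
function expiresAt(indexedAt: Date, ttlDays: number): Date {
  return new Date(indexedAt.getTime() + ttlDays * MS_PER_DAY);
}

// A vector is eligible for pruning once its expiry time has passed.
function isExpired(expiry: Date, now: Date): boolean {
  return expiry.getTime() <= now.getTime();
}
```

For example, a chunk indexed at 2026-06-01T00:00:00Z with the default 30-day TTL would expire at 2026-07-01T00:00:00Z.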

vercel · 2026-06-01T00:22:06Z

@khat190 is attempting to deploy a commit to the Deekshith Gowda HS's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-06-01T00:22:12Z

📝 Walkthrough

Walkthrough

This PR implements semantic search over deployment logs by combining pgvector storage, Cohere embeddings, Groq summarization, and Inngest background jobs. Logs are automatically indexed when deployments go live and can be searched via authenticated API endpoints. A scheduled cron job re-indexes active sandboxes weekly.

Changes

Semantic Log Search

Layer / File(s)	Summary
Vector database schema and operations `lib/db.ts`	Enables pgvector extension and creates `deployment_log_vectors` table with vector embeddings, TTL-based expiry, and HNSW indexing. Exports `LogVector` interface and CRUD functions: `insertLogVector`, `searchLogVectors` (cosine similarity with optional sandbox filter), `pruneExpiredLogVectors`, and `getLastIndexedLogId` for incremental indexing.
Indexing pipeline with embeddings `lib/vector-indexer.ts`	Implements Cohere API integration for text embeddings (search_document mode for chunks, search_query mode for queries), optional Groq-based chunk summarization, and core `indexSandbox` routine that fetches new logs, chunks them, embeds in batch, and persists to database. Exports two Inngest functions: `indexDeploymentLogs` (on-demand) and `cronReindexLogs` (weekly re-index of active sandboxes).
Deployment-triggered indexing `lib/deployer.ts`	Extends `runDeploymentPipeline` to accept userId and emits fire-and-forget Inngest event `log/index.requested` with sandbox, user, and 30-day TTL when deployment transitions to live status.
Search API and Inngest wiring `app/api/logs/search/route.ts`, `app/api/inngest/route.ts`	Adds authenticated GET/POST `/api/logs/search` supporting query embedding, per-user vector search with optional sandbox scoping, and `?action=trigger` to manually request indexing. Registers `indexDeploymentLogs` and `cronReindexLogs` with Inngest route handler.
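
For readers unfamiliar with the ranking used by `searchLogVectors`, here is a small sketch of cosine similarity in plain TypeScript. It is illustrative only: in the PR the comparison happens inside PostgreSQL, where pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity). The function names below are made up for this example.

```typescript
// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new RangeError("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new RangeError("cosine similarity is undefined for zero vectors");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance: smaller means more similar.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}
```

Identical directions give a distance of 0 and orthogonal vectors give 1, so ordering results by ascending distance returns the most similar log chunks first.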

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

deekshithgowda85/SecDev#24: The PR directly implements the pgvector-backed log indexing and search feature described in this issue, including vector embeddings, Inngest orchestration, and the search API.

Possibly related PRs

deekshithgowda85/SecDev#2: Both PRs modify app/api/inngest/route.ts to register additional Inngest functions with the Next.js route handler wiring.

Suggested labels

gssoc:approved, quality:clean

Poem

🐰 Log chunks meet embeddings bright,
Cohere dreams in vectors' light,
Inngest orchestrates the dance,
Search semantic at a glance—
Logs transformed to insight's flight! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the primary feature: PGVector-backed log indexing and semantic search capabilities being added to the codebase.
Docstring Coverage	✅ Passed	Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

lib/vector-indexer.ts (1)

65-88: ⚡ Quick win

Consider adding a timeout to external API calls.

The fetch calls to Cohere lack a timeout. If the Cohere API is slow or unresponsive, the Inngest function could hang until the platform timeout kicks in, wasting resources. Adding AbortSignal.timeout() would allow graceful failure and retry.

♻️ Proposed fix

 async function embedTexts(texts: string[]): Promise<number[][]> {
   const response = await fetch("https://api.cohere.com/v1/embed", {
     method: "POST",
     headers: {
       Authorization: `Bearer ${getCohereApiKey()}`,
       "Content-Type": "application/json",
       "X-Client-Name": "secdev",
     },
     body: JSON.stringify({
       model: "embed-english-v3.0",
       texts,
       input_type: "search_document",
       truncate: "END",
     }),
+    signal: AbortSignal.timeout(30_000), // 30s timeout
   });
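
If the timeout is adopted, the caller also needs to recognise the resulting error. A signal created with `AbortSignal.timeout()` aborts with a `TimeoutError`, while a manually aborted controller produces an `AbortError`. The helper below is a hypothetical sketch of mapping those names to clearer messages; `describeEmbedError` is not part of the PR, and the exact error object `fetch` rejects with can vary by runtime.

```typescript
// Map an error thrown by the embed request to a readable message.
// Checks the `name` property by duck typing so it works for DOMException too.
function describeEmbedError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  if (name === "TimeoutError") {
    return "Cohere embed request timed out";
  }
  if (name === "AbortError") {
    return "Cohere embed request was aborted";
  }
  return err instanceof Error ? err.message : String(err);
}
```

Wrapping the `fetch` call in `try`/`catch` and rethrowing `new Error(describeEmbedError(err))` would let the Inngest function fail with a clear message and retry.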

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/vector-indexer.ts` around lines 65 - 88, The embedTexts function
currently calls fetch without a timeout; update it to create an AbortSignal via
AbortSignal.timeout(...) (or an AbortController with setTimeout) and pass the
signal option into fetch to enforce a timeout (choose an appropriate duration,
e.g., 10s). Ensure you clear any timers if using AbortController, handle aborts
and propagate a clear error message (e.g., "Cohere embed request timed out")
alongside other non-OK responses, and keep the rest of the response handling in
embedTexts unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/api/logs/search/route.ts`:
- Around line 135-160: The handleTrigger function currently allows any
authenticated user to trigger indexing for any sandboxId; before calling
inngest.send, query your sandbox/store to verify ownership (e.g., fetch sandbox
by sandboxId and confirm its ownerId or userId matches the authenticated userId)
and return a 403/400 JSON response if the sandbox is missing or not owned by the
requester; ensure this ownership check happens in handleTrigger immediately
after validating sandboxId and before invoking inngest.send so only owners can
enqueue indexing events.

In `@lib/db.ts`:
- Around line 266-267: Update the outdated comment in the Log vector index table
block to reflect that the schema uses Cohere 1024-dim embeddings (vector(1024))
instead of "1536-dim OpenAI embeddings"; locate the comment text near the "Log
vector index table" header in lib/db.ts and change the wording to mention Cohere
1024-dim embeddings so it matches the schema and avoids confusion.

In `@lib/vector-indexer.ts`:
- Around line 156-175: indexSandbox currently calls getDb() and runs queries
without ensuring the schema exists; add an awaited call to ensureTables() at the
start of indexSandbox (before calling getDb()) so the DB schema is created
before any queries run, and import/require ensureTables (the function referenced
in lib/deployer.ts) if it isn’t already imported; ensureTables is awaited (await
ensureTables()) and then proceed to const sql = getDb().

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 65369052-7980-4098-b703-3bac9f7189d6

📥 Commits

Reviewing files that changed from the base of the PR and between 2d65a13 and d0ad7f2.

📒 Files selected for processing (5)

app/api/inngest/route.ts
app/api/logs/search/route.ts
lib/db.ts
lib/deployer.ts
lib/vector-indexer.ts

coderabbitai · 2026-06-01T00:28:02Z

+async function handleTrigger(req: NextRequest, userId: string) {
+  let body: { sandboxId?: string; ttlDays?: number };
+  try {
+    body = await req.json();
+  } catch {
+    return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 });
+  }
+
+  const { sandboxId, ttlDays = 30 } = body;
+  if (!sandboxId) {
+    return NextResponse.json(
+      { error: "sandboxId is required" },
+      { status: 400 }
+    );
+  }
+
+  await inngest.send({
+    name: "log/index.requested",
+    data: { sandboxId, userId, ttlDays },
+  });
+
+  return NextResponse.json({
+    ok: true,
+    message: `Indexing triggered for sandbox ${sandboxId}`,
+  });
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing sandbox ownership validation – IDOR vulnerability.

Any authenticated user can trigger indexing for any sandboxId. The vectors are stored with the triggering user's userId, allowing that user to subsequently search and view another user's deployment logs.

Before sending the Inngest event, verify the authenticated user owns the sandbox:

🔒 Proposed fix

+import { getDeployment } from "@/lib/deployer";
+
 async function handleTrigger(req: NextRequest, userId: string) {
   let body: { sandboxId?: string; ttlDays?: number };
   try {
     body = await req.json();
   } catch {
     return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 });
   }

   const { sandboxId, ttlDays = 30 } = body;
   if (!sandboxId) {
     return NextResponse.json(
       { error: "sandboxId is required" },
       { status: 400 }
     );
   }

+  // Verify ownership before triggering indexing
+  const deployment = await getDeployment(sandboxId);
+  if (!deployment || deployment.userId !== userId) {
+    return NextResponse.json({ error: "Not found" }, { status: 404 });
+  }
+
   await inngest.send({
     name: "log/index.requested",
     data: { sandboxId, userId, ttlDays },
   });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/logs/search/route.ts` around lines 135 - 160, The handleTrigger function currently allows any authenticated user to trigger indexing for any sandboxId; before calling inngest.send, query your sandbox/store to verify ownership (e.g., fetch sandbox by sandboxId and confirm its ownerId or userId matches the authenticated userId) and return a 403/400 JSON response if the sandbox is missing or not owned by the requester; ensure this ownership check happens in handleTrigger immediately after validating sandboxId and before invoking inngest.send so only owners can enqueue indexing events.

coderabbitai · 2026-06-01T00:28:02Z

+  // ── Log vector index table ────────────────────────────────────────────────
+  // Stores chunked log text + 1536-dim OpenAI embeddings for semantic search.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Outdated comment: mentions OpenAI 1536-dim, but schema uses Cohere 1024-dim.

The comment says "1536-dim OpenAI embeddings" but the schema defines vector(1024) for Cohere embeddings. This inconsistency could confuse future maintainers.

📝 Proposed fix

-  // ── Log vector index table ────────────────────────────────────────────────
-  // Stores chunked log text + 1536-dim OpenAI embeddings for semantic search.
+  // ── Log vector index table ────────────────────────────────────────────────
+  // Stores chunked log text + 1024-dim Cohere embeddings for semantic search.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/db.ts` around lines 266 - 267, Update the outdated comment in the Log vector index table block to reflect that the schema uses Cohere 1024-dim embeddings (vector(1024)) instead of "1536-dim OpenAI embeddings"; locate the comment text near the "Log vector index table" header in lib/db.ts and change the wording to mention Cohere 1024-dim embeddings so it matches the schema and avoids confusion.

coderabbitai · 2026-06-01T00:28:02Z

+async function indexSandbox(opts: {
+  sandboxId: string;
+  userId: string;
+  ttlDays: number;
+}): Promise<{ chunksIndexed: number; pruned: number }> {
+  const { sandboxId, userId, ttlDays } = opts;
+  const sql = getDb();
+
+  // 1. Find where we left off
+  const lastId = await getLastIndexedLogId(sandboxId);
+
+  // 2. Fetch new log lines
+  const rows = (await sql`
+    SELECT id, msg, level, ts
+    FROM   deployment_logs
+    WHERE  sandbox_id = ${sandboxId}
+      AND  id         > ${lastId}
+    ORDER  BY id ASC
+    LIMIT  ${CHUNK_SIZE_LINES * MAX_CHUNKS_PER_RUN}
+  `) as LogRow[];


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing ensureTables() call before database access.

The indexSandbox function uses getDb() directly without calling ensureTables() first. Per the upstream contract in lib/deployer.ts (lines 44-55), ensureTables() should be called before DB access to ensure schema exists. If the cron job or indexing runs before any deployment has initialized the schema, queries will fail.

🐛 Proposed fix

+import {
+  ensureTables,
+  getDb,
+  insertLogVector,
+  getLastIndexedLogId,
+  pruneExpiredLogVectors,
+} from "@/lib/db";

 async function indexSandbox(opts: {
   sandboxId: string;
   userId: string;
   ttlDays: number;
 }): Promise<{ chunksIndexed: number; pruned: number }> {
   const { sandboxId, userId, ttlDays } = opts;
+  await ensureTables();
   const sql = getDb();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/vector-indexer.ts` around lines 156 - 175, indexSandbox currently calls getDb() and runs queries without ensuring the schema exists; add an awaited call to ensureTables() at the start of indexSandbox (before calling getDb()) so the DB schema is created before any queries run, and import/require ensureTables (the function referenced in lib/deployer.ts) if it isn’t already imported; ensureTables is awaited (await ensureTables()) and then proceed to const sql = getDb().

khat190 added 2 commits June 1, 2026 05:28

feat: add pgvector-backed log indexing and semantic search (deekshith…

27ccbb2

…gowda85#24)

Merge remote-tracking branch 'upstream/prod' into feat/pgvector-log-s…

d0ad7f2

…earch

khat190 wants to merge 2 commits into deekshithgowda85:prod from khat190:feat/pgvector-log-search
