A lightweight, local-only Retrieval Augmented Generation (RAG) API.
- Express - HTTP API layer
- better-sqlite3 + sqlite-vector - local storage and vector similarity search
- node-llama-cpp - local embedding + generation models (no cloud dependency)
The app reads .txt files from documents/, splits them into chunks, generates embeddings, stores them in SQLite, retrieves the most similar chunks for a query, and uses a local LLM to produce the final answer.
- Runs a local HTTP API on http://localhost:3000
- Reads files from ./documents
- Splits file content into chunks by blank lines (\n\n+)
- Creates embeddings with model Qwen3-Embedding-0.6B
- Stores embeddings in rag.db using cosine distance
- Retrieves top 8 (can be modified) chunks for a query
- Generates the final answer with model Qwen3-0.6B
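
The chunking and retrieval steps listed above can be sketched in plain JavaScript. This is only an in-memory illustration, not the code in index.js: the app itself delegates similarity search to sqlite-vector, and the names splitIntoChunks, cosineDistance, and topK are hypothetical. The vectors in a real run come from the embedding model; here they are whatever arrays the caller passes in.

```javascript
// Sketch only: in-memory stand-ins for the chunking and retrieval steps.
// splitIntoChunks, cosineDistance, and topK are hypothetical names.

// Split text on one or more blank lines and drop empty chunks.
function splitIntoChunks(text) {
  return text
    .split(/\n\n+/)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}

// Cosine distance = 1 - cosine similarity (0 means same direction).
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k rows whose vectors are closest to the query vector.
// Each row is assumed to look like { text, vector }.
function topK(queryVector, rows, k = 8) {
  return rows
    .map((row) => ({ ...row, distance: cosineDistance(queryVector, row.vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}
```

Sorting every row is fine for a handful of documents. A vector extension such as sqlite-vector exists so that this scan does not have to happen in application code.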
This is intentionally a small local project (not production-ready):
- no auth
- no multi-user isolation
- no background jobs / queueing
- no cloud storage
- Node.js (v24.14.1)
- Install dependencies:
npm install
postinstall automatically downloads the two GGUF models into ./models.
- Make sure this folder exists and contains your text files:
documents/
- Start the API:
npm start
Or with watch mode:
npm run start:watch

Base URL: http://localhost:3000
curl -X GET http://localhost:3000/documents
Example response:
{
"totalFiles": 2,
"files": [{ "name": "cv.txt" }, { "name": "projects.txt" }]
}
The file must exist inside ./documents.
curl -X POST http://localhost:3000/documents/embed \
-H "Content-Type: application/json" \
-d '{"fileName":"example.txt"}'
Example response:
{
"message": "Embeddings created successfully",
"chunksStored": 12
}
curl -X POST http://localhost:3000/documents/search \
-H "Content-Type: application/json" \
-d '{"query":"What backend experience do you have?"}'
Example response:
{
"generatedAnswer": "..."
}
This removes rows from SQLite, not the physical file in documents/.
curl -X DELETE http://localhost:3000/documents/cv.txt
Example response:
{
"message": "File deleted successfully"
}
- npm start - run API
- npm run start:watch - run API with Node watch mode
- npm run models:pull - download models manually
- npm run models:check - open node-llama-cpp chat check
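
The curl calls shown earlier can also be scripted from Node, which ships a global fetch in recent versions. The sketch below only mirrors the four requests documented in this README. buildRequest and send are hypothetical helper names, and it assumes the server answers with JSON in error cases too, which the README does not state.

```javascript
// Sketch only: request descriptors for the four endpoints documented above.
// BASE_URL matches the README; buildRequest and send are hypothetical helpers.
const BASE_URL = "http://localhost:3000";

function buildRequest(action, value) {
  const jsonHeaders = { "Content-Type": "application/json" };
  switch (action) {
    case "list": // GET /documents
      return { url: `${BASE_URL}/documents`, options: { method: "GET" } };
    case "embed": // POST /documents/embed with { fileName }
      return {
        url: `${BASE_URL}/documents/embed`,
        options: {
          method: "POST",
          headers: jsonHeaders,
          body: JSON.stringify({ fileName: value }),
        },
      };
    case "search": // POST /documents/search with { query }
      return {
        url: `${BASE_URL}/documents/search`,
        options: {
          method: "POST",
          headers: jsonHeaders,
          body: JSON.stringify({ query: value }),
        },
      };
    case "delete": // DELETE /documents/<file name>
      return {
        url: `${BASE_URL}/documents/${encodeURIComponent(value)}`,
        options: { method: "DELETE" },
      };
    default:
      throw new Error(`Unknown action: ${action}`);
  }
}

// Send the request with Node's built-in fetch and parse the JSON body.
// Assumption: every response, including errors, is JSON.
async function send(action, value) {
  const { url, options } = buildRequest(action, value);
  const response = await fetch(url, options);
  return response.json();
}
```

With the API running, send("search", "What backend experience do you have?") resolves to an object with a generatedAnswer field, as in the example response. To re-embed a file that is already stored, call send("delete", "cv.txt") first and then send("embed", "cv.txt").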
- index.js - API server + embedding/search flow
- rag.db - local SQLite database (created at runtime)
- models/ - downloaded GGUF models
- documents/ - your local source files for indexing
- First request that needs a model can take longer (model load).
- The .txt file is first read as plain text (parsed into one full string), then split into chunks by blank lines using /\n\n+/ (paragraph-style boundaries).
- 1024 means the number of values in each embedding vector (dimensions), not text length.
- f32 means each of those 1024 values is stored as a 32-bit float (float32) in SQLite via vector_as_f32(...).
- Embedding is currently one-file-at-a-time via API.
- If a file is already embedded, /documents/embed returns an error until you delete it from the DB first.
- I'm using macOS on Apple Silicon, which is why @sqliteai/sqlite-vector-darwin-arm64 is loaded; it is installed automatically. For other platforms, check the sqlite-vector docs.
- The prompt should be modified in index.js to fit your use case; currently it's a simple instruction + retrieved chunks. You can also add system instructions or few-shot examples as needed.
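
To make the 1024 and f32 notes concrete, here is a small sketch of the size arithmetic: 1024 float32 values take 4 bytes each, so one embedding is 4096 bytes of raw data. toFloat32Buffer is a hypothetical helper, and the values are placeholders rather than real model output. How index.js actually hands the vector to vector_as_f32(...) is not shown in this README, so treat the packing step as an assumption.

```javascript
// Sketch only: how a 1024-dimensional float32 vector maps to bytes.
const DIMENSIONS = 1024; // number of values per embedding, per the notes above

// Pack a plain array of numbers into a float32 byte buffer.
// toFloat32Buffer is a hypothetical helper, not part of this project.
function toFloat32Buffer(values) {
  const floats = Float32Array.from(values);
  return Buffer.from(floats.buffer, floats.byteOffset, floats.byteLength);
}

const embedding = new Array(DIMENSIONS).fill(0.5); // placeholder values
const blob = toFloat32Buffer(embedding);

console.log(blob.byteLength); // 1024 values * 4 bytes each = 4096
```

The byte size depends only on the number of dimensions, not on how long the source chunk was.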