Ledgerline is a vectorless RAG demo stack for financial PDFs: documents are ingested, parsed into a hierarchical tree (PageIndex), cached in Redis, and queried through an API gateway that uses an LLM to navigate the tree and answer questions. There is no vector database and no fixed chunking step in the retrieval path.
```mermaid
flowchart LR
    User["User / UI"] --> GW["API Gateway (Go)"]
    User --> IG["Ingestion (Go)"]
    IG --> S3[("S3 / MinIO")]
    IG --> K1["Kafka: ingested"]
    K1 --> PS["Parser (Python)"]
    PS --> S3
    PS --> K2["Kafka: parsed"]
    K2 --> CS["Cache (Python)"]
    CS --> Redis[(Redis)]
    GW --> CS
    GW --> LLM["LLM API"]
```
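The gateway's tree-navigation step can be sketched in a few lines. This is an illustrative sketch only: the node fields (`title`, `summary`, `children`) and the keyword-overlap chooser are assumptions standing in for the parser's actual schema and the real LLM call.

```python
# Hedged sketch of LLM-guided navigation over a PageIndex-style tree.
# Field names ("title", "summary", "children") are illustrative, not the
# actual schema produced by the parser service.

def navigate(node, question, choose):
    """Walk the tree by repeatedly asking `choose` (an LLM in the real
    stack; any callable here) which child is most relevant."""
    path = [node["title"]]
    while node.get("children"):
        node = choose(question, node["children"])
        path.append(node["title"])
    return node, path

def keyword_choose(question, children):
    # Stand-in for the LLM: pick the child whose summary shares the
    # most words with the question.
    words = set(question.lower().split())
    return max(children, key=lambda c: len(words & set(c["summary"].lower().split())))

tree = {
    "title": "10-K",
    "children": [
        {"title": "Risk Factors", "summary": "risks competition regulation", "children": []},
        {"title": "Financial Statements", "summary": "revenue income balance sheet", "children": []},
    ],
}
leaf, path = navigate(tree, "what is the revenue", keyword_choose)
```

The point of the vectorless design is exactly this loop: relevance is decided per node at query time, so there is no embedding index to build or keep fresh.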
| Requirement | Notes |
|---|---|
| Docker Desktop | Compose v2; used for Kafka, Redis, MinIO, Postgres, UIs |
| Go 1.21+ | Ingestion + API gateway |
| Python 3.11+ | Parser, cache, evaluation |
| Node.js 18+ | Optional web/ dashboard |
| LLM credentials | Set at least one provider in .env (see below) |
- Copy the template and edit values:

```shell
cp .env.example .env
```

- S3 bucket names (local) must match what Docker creates. In `docker/docker-compose.yml` the MinIO init job creates `pageindex-documents-dev` and `pageindex-trees-dev`. [.env.example](.env.example) is aligned with those names; if you change either side, keep them in sync or uploads will fail with `NoSuchBucket`.
- LLM keys (see `.env.example` for the full list):
  - Recommended for this repo: set `CLAUDE_API_KEY` and `CLAUDE_MODEL` for Anthropic.
  - Alternatives documented in `.env.example`: Gemini (`GEMINI_API_KEY`), OpenAI-compatible endpoints, etc.
- The gateway loads `.env` from the repo root or service directory. The parser and cache load the repo-root `.env` automatically when you start them from `services/*`.
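Before starting services it can help to sanity-check that the keys the gateway expects are present. Below is a minimal sketch of such a check; the required key names (`CLAUDE_API_KEY`, `CLAUDE_MODEL`) come from the recommendation above, and the parser is a simplification of what real dotenv loaders do (no quoting or interpolation).

```python
# Hedged sketch: parse a minimal .env-style file and report missing keys.
# Required key names are taken from the README's Anthropic recommendation.

REQUIRED = ["CLAUDE_API_KEY", "CLAUDE_MODEL"]

def parse_env(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env, required=REQUIRED):
    return [k for k in required if not env.get(k)]

sample = "# LLM\nCLAUDE_API_KEY=sk-test\nCLAUDE_MODEL=claude-sonnet\n"
env = parse_env(sample)
```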
From the repository root:
```shell
make up
```

Or:

```shell
docker compose -f docker/docker-compose.yml up -d
```

Typical local URLs:
| Service | URL / host |
|---|---|
| Kafka UI | http://localhost:8090 |
| MinIO API | http://localhost:9000 (user minioadmin, password minioadmin) |
| MinIO console | http://localhost:9001 |
| Redis | localhost:6379 |
| PostgreSQL (evaluation) | localhost:5433 (see compose; DB pageindex_eval, user pageindex) |
Stop containers (keep volumes): `make down`. Nuke data: `make clean`.
From the repo root (PowerShell):

```powershell
.\scripts\dev\start-local.ps1
```

- Add `-WithWeb` to also start the Next.js app on port 3000.
- Add `-SkipHealthWait` to skip the post-start health poll.
Stop app processes on ports 8080–8083: `.\scripts\dev\stop-local.ps1`. Add `-IncludeWeb` to also stop whatever is listening on port 3000. The Docker stack is not stopped; use `make down` for infra.
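The post-start health poll that `start-local.ps1` performs can be approximated in Python. This is a sketch under the assumption that each service exposes `GET /health` on ports 8080–8083 (the same endpoints the quickstart curls); the retry count and delay are arbitrary.

```python
# Hedged sketch of a post-start health poll over the four service ports.
import time
import urllib.request

PORTS = [8080, 8081, 8082, 8083]

def healthy(port, timeout=2.0):
    """Return True if GET /health on the port answers 200."""
    try:
        with urllib.request.urlopen(f"http://localhost:{port}/health", timeout=timeout) as r:
            return r.status == 200
    except OSError:  # URLError subclasses OSError; covers refused/timeout
        return False

def wait_all(ports=PORTS, attempts=30, delay=1.0):
    """Poll until every port is healthy, or give up after `attempts` rounds."""
    for _ in range(attempts):
        if all(healthy(p) for p in ports):
            return True
        time.sleep(delay)
    return False
```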
Use four terminals from the repo root (after `make up` and with Python deps installed):

```shell
make run-ingestion
make run-parser
make run-cache
make run-gateway
```

Optional evaluation consumer:

```shell
make run-evaluation
```

To run the web dashboard:

```shell
cd web
npm ci
npm run dev
```

Open http://localhost:3000. The UI expects the API gateway (and the rest of the stack) to be running.
Representative views from the Next.js UI (sidebar navigation is the same across pages).
1. Confirm health:

```shell
curl -s http://localhost:8080/health
curl -s http://localhost:8081/health
curl -s http://localhost:8082/health
curl -s http://localhost:8083/health
```

2. Upload a PDF:

```shell
curl -s -F "file=@path/to/file.pdf" http://localhost:8080/documents/upload
```

Save `doc_id` from the JSON response.
3. Wait for parsing (depends on PDF size and LLM latency), then query:
```shell
curl -s -X POST http://localhost:8083/query \
  -H "Content-Type: application/json" \
  -d "{\"doc_id\":\"YOUR_DOC_ID\",\"question\":\"What is the main topic?\"}"
```

4. WebSocket streaming (requires a WS client such as `wscat`): connect to `ws://localhost:8083/ws` and send JSON `{"doc_id":"...","question":"..."}`.
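The same query can be issued from Python with only the standard library. A minimal sketch; the endpoint and payload shape mirror the curl example above, and the response body is assumed to be JSON.

```python
# Hedged sketch: POST a question to the gateway's /query endpoint.
import json
import urllib.request

def build_query(doc_id, question):
    """Serialize the payload shape used by the curl example."""
    return json.dumps({"doc_id": doc_id, "question": question}).encode()

def ask(doc_id, question, base="http://localhost:8083"):
    req = urllib.request.Request(
        f"{base}/query",
        data=build_query(doc_id, question),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: answers wait on LLM tree navigation.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```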
With the stack running, you can generate tiny local PDFs and upload them:
```shell
# From repo root (installs reportlab if needed)
python scripts/seed/create-test-pdf.py
```

Then ingest everything under `sample-data/` (create the folder locally; it is gitignored):

```powershell
.\scripts\seed\ingest-documents.ps1
```

Or run the full fetch → ingest → verify flow:

```powershell
.\scripts\seed\run-seeding.ps1
```

The fetch step runs `create-test-pdf.py` (not a remote dataset). You can also place your own PDFs in `sample-data/` and run `ingest-documents.ps1` alone.
Machine-readable contract: docs/api-spec.yaml.
With infra and all four core services running, after a document is parsed and cached, query the gateway (replace DOC_ID):
```shell
curl -s -X POST http://localhost:8083/query -H "Content-Type: application/json" -d "{\"doc_id\":\"DOC_ID\",\"question\":\"What is this document about?\"}"
```

High-level checklist:
- Infrastructure: Terraform under `infrastructure/terraform/` (VPC, EKS, MSK, ElastiCache, S3, IAM, etc.; see variables and modules there).
- Manifests: kubernetes/ base + overlays; adjust images and secrets for your registry.
- Cluster access: `kubectl` configured for the target cluster; apply overlays (e.g. `make deploy-dev` / `make deploy-prod` if your Makefile targets match your environment).
Details vary by account and region; treat the above as pointers, not a full runbook.
| Symptom | Things to check |
|---|---|
| `NoSuchBucket` on upload | `S3_BUCKET_*` in `.env` vs MinIO bucket names in `docker/docker-compose.yml` |
| Empty or nonsense answers | Missing LLM API key in .env; gateway may fall back to mock behavior |
| Cache hit rate stays zero | Parser must consume documents.parsed and cache must warm; repeat reads after a successful parse |
| Browser cannot call cache service | Prefer the gateway for browser-facing APIs (CORS) |
| Evaluation metrics stuck at zero | Postgres URL/credentials must match the running DB; check the `queries.completed` topic and the sampling rate |
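For the cache-related rows, it can help to look at what the cache service actually wrote to Redis. A sketch using `redis-cli`; the key pattern `doc:{doc_id}:tree` is purely an assumption for illustration, so check the cache service source for the real schema.

```python
# Hedged sketch: check whether a parsed tree landed in Redis.
# The key pattern below is an ASSUMPTION, not the cache service's schema.
import subprocess

def tree_key(doc_id):
    """Assumed key layout; verify against services/cache before relying on it."""
    return f"doc:{doc_id}:tree"

def is_cached(doc_id, port=6379):
    """Ask the local Redis (via redis-cli) whether the key exists."""
    out = subprocess.run(
        ["redis-cli", "-p", str(port), "EXISTS", tree_key(doc_id)],
        capture_output=True, text=True,
    )
    return out.stdout.strip() == "1"
```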
Issues and PRs welcome. Run `make test` and relevant integration checks before submitting. Keep secrets out of git (`.env` is ignored).



