Hybrid BM25 + vector search over large PDF / DOCX / TXT documents, with interchangeable German and English queries, optional Elasticsearch backend, and optional DeepSeek (Ollama) LLM integration for query expansion, re-ranking, and explanations.
Designed for 1 000+ page documents. Ships with a CLI, a FastAPI REST API, and a zero-dependency web UI.
- Features
- Quick start
- Installation
- Running the app
- When to use Elasticsearch vs in-memory
- Web UI walkthrough
- REST API
- Configuration
- Architecture
- Output files
- Persistence & databases
- Scoring formula
- Performance benchmarks
- Docker
- Project layout
- Troubleshooting
- License
- Cross-lingual matching — search `morgen` → also finds tomorrow, morning; search `hometown` → also finds Heimatstadt, Geburtsort.
- Hybrid retrieval — BM25 + dense vectors + proximity + NER boost, fused with a min-max normalized weighted sum.
- Elasticsearch OR in-memory — runs entirely in-memory (BM25 + FAISS) with no infrastructure. Enable ES for scale.
- DeepSeek (Ollama) integration at three well-defined points:
- Query expansion (synonyms, paraphrases, translation)
- Top-K re-ranking
- Per-page relevance explanations
- Three-pane UI — ranked results, top-5 full page text with highlights (primary + cross-lingual equivalent), and a compact per-page counts table.
- Session logging — every search is captured to JSON files that reset on server restart, plus timestamped archives under `data/sessions/` and `data/elastic/` that survive restarts.
- SQLite persistence + memoization — `inputs.db` tracks every ingested document by SHA-256 hash; re-uploading the same file skips parse/embed and reuses the cached chunks + FAISS index. `outputs.db` logs every search with pointers back to its archived JSON.
git clone <repo-url>
cd "DE-EN Search"
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -r requirements.txt
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
uvicorn docsearch.server:app --reload

Open http://127.0.0.1:8000/, upload a PDF, search.
| Component | Version | Required? |
|---|---|---|
| Python | 3.10 – 3.12 | ✅ |
| pip | latest | ✅ |
| Elasticsearch | 8.13+ | Optional |
| Docker Desktop | any | Optional (for ES / full stack) |
| Ollama | 0.1.40+ | Optional (for DeepSeek LLM) |
git clone <repo-url>
cd "DE-EN Search"
python -m venv .venv
# activate:
.venv\Scripts\activate # Windows PowerShell
source .venv/bin/activate   # macOS / Linux

pip install --upgrade pip
pip install -r requirements.txt

python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm

Larger models (de_core_news_lg, en_core_web_lg) give better NER — set SPACY_DE / SPACY_EN env vars to override.
# install ollama: https://ollama.com/download
ollama pull deepseek-r1 # or deepseek-r1:7b for a smaller/faster variant
ollama serve                  # if not already running

The app auto-detects Ollama at http://localhost:11434. If Ollama isn't running, the three LLM hooks simply no-op.
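The auto-detection amounts to a liveness probe against the Ollama HTTP API. A minimal sketch of such a check (the helper name and exact behaviour here are illustrative; the project's real client lives in `docsearch/llm.py`):

```python
import requests  # third-party HTTP client, used here for illustration

OLLAMA_HOST = "http://localhost:11434"

def ollama_available(host: str = OLLAMA_HOST, timeout: float = 1.0) -> bool:
    """Cheap liveness probe: /api/tags lists the locally pulled models."""
    try:
        return requests.get(f"{host}/api/tags", timeout=timeout).ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # When this returns False, expansion / re-ranking / explanations are skipped.
    print("Ollama reachable:", ollama_available())
```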
docker compose up -d elasticsearch

uvicorn docsearch.server:app --reload

- Web UI: http://127.0.0.1:8000/
- Swagger: http://127.0.0.1:8000/docs
- Health: http://127.0.0.1:8000/healthz
# ingest a document (no ES)
python -m docsearch.cli ingest sample.pdf --id sample --no-es
# search
python -m docsearch.cli search "morgen" --no-es
python -m docsearch.cli search "hometown of Phil" --no-esBoth backends score with BM25 and serve k-NN over the same embeddings, so ranking quality is indistinguishable on small/medium corpora. Elasticsearch is only worth the operational overhead at scale or when you need richer German-language analyzers.
| Situation | Backend | Why |
|---|---|---|
| Single PDF, local demo, rapid iteration | In-memory (--no-es) | Zero infra, < 2 s startup, easy to debug |
| Corpus ≤ ~20 000 chunks (≈ 6–7 thousand-page PDFs) | In-memory | Latency is identical, no JVM to babysit |
| Corpus > 20 000 chunks, or multi-doc library | Elasticsearch | Disk-backed index, survives restarts, scales horizontally |
| German-heavy text needing stemming / compound handling | Elasticsearch | light_german stemmer + index-time synonym graph |
| Production deploy with multiple users / persistent index | Elasticsearch | Persistence, concurrency, filtering, aggregations |
| CI pipeline / ephemeral container | In-memory | No external service required |
| Feature | In-memory | Elasticsearch |
|---|---|---|
| BM25 ranking | ✅ rank-bm25 | ✅ native |
| Dense vector (cosine) | ✅ FAISS IndexFlatIP | ✅ dense_vector + kNN |
| DE↔EN synonym expansion | ✅ query-time only | ✅ index & query time |
| German stemmer / stop-words | ❌ | ✅ light_german |
| Snippet highlighting | Regex (coarse) | Unified highlighter (precise) |
| Persistence across restarts | Pickle files in data/ | Full ES index |
| Multi-process / multi-user | ❌ | ✅ |
| Cold-start time | ~2 s | ~20–30 s (JVM + warm-up) |
| First ingest of 1000-page PDF | ~75 s | ~80 s |
| Query latency (hybrid, 3000 chunks) | ~40 ms | ~60 ms |
No Docker, no services to start.
# CLI
python -m docsearch.cli ingest sample.pdf --id sample --no-es
python -m docsearch.cli search "morgen" --no-es
# Web UI / REST API
uvicorn docsearch.server:app --reload
# open http://127.0.0.1:8000/

The web UI uses in-memory by default unless `USE_ES=1` is set in the environment and Elasticsearch is reachable.
Needs Docker Desktop running (or a local ES 8.x).
# 1) Start Elasticsearch (single-node, security off, port 9200)
docker compose up -d elasticsearch
# 2) Wait ~30 s until healthy, then verify
curl http://localhost:9200
# 3) Ingest (creates the 'doc_chunks' index with DE↔EN synonyms + analyzers)
python -m docsearch.cli ingest sample.pdf --id sample --recreate
# 4) Search
python -m docsearch.cli search "morgen"
# 5) Start the web UI pointing at ES
set USE_ES=1 # Windows cmd
$env:USE_ES=1 # PowerShell
export USE_ES=1 # bash
uvicorn docsearch.server:app --reload

Shut ES down cleanly:

docker compose down

Or run the full stack (app + ES) in one go:

docker compose up --build
# app on http://127.0.0.1:8000, ES on http://127.0.0.1:9200

Both paths write to the same data/ folder (chunks.pkl, faiss.index, pages.pkl). Switching from in-memory to ES requires re-ingesting once so the ES index gets populated:
docker compose up -d elasticsearch
python -m docsearch.cli ingest sample.pdf --id sample --recreate

Switching back to in-memory just needs `--no-es` on the CLI or `USE_ES=0` for the server — the FAISS index is already on disk.
Start with in-memory — it's the backend every Quick-start command uses. Flip to Elasticsearch only when:
- You index more than one document, or
- You care about German stemming quality, or
- You deploy to a server with multiple users.
The UI has three stacked input panels and a three-column results grid:
Three fields:
- File picker — choose a `.pdf`, `.docx`, or `.txt`.
- Doc ID — logical id (e.g. `sample`). Re-using the id overwrites.
- Index — runs parse → chunk → embed → index.
- Keyword / phrase input.
- Query language: `Auto` / `German` / `English` — pins the primary language so counts and highlighting match your intent.
- Top pages: how many to rank.
- Use DeepSeek (Ollama) toggle + live status pill (green/red).
| Column | What it shows |
|---|---|
| 1 — Ranked results | All top-N pages with snippets, DE/EN counts, hybrid score, visual score bar. |
| 2 — Top 5 full pages | Full page text for the top 5, toggle between primary (yellow highlights) and cross-lingual equivalent (purple highlights). Includes LLM explanation if available. |
| 3 — Page counts | Compact table: Page / Mentions / DE / EN. |
multipart/form-data:

- `file` — uploaded document
- `doc_id` — string
- `recreate` — bool (re-create ES index if enabled)
Returns {"doc_id": "...", "chunks": N}.
Query string:

- `q` — user query (required)
- `top` — int, default `20`
- `lang` — `auto` | `de` | `en`, default `auto`
- `use_llm` — bool, default `true`
Returns:

```json
{
  "query": "morgen",
  "lang_hint": "de",
  "timestamp": "...",
  "expanded": {
    "raw": "...", "lang": "de", "translated": "tomorrow",
    "synonyms": [...], "paraphrases": [...], "entities": [...], "terms": [...],
    "llm_used": true
  },
  "pages": [
    { "page": 1, "mentions": 24, "de": 20, "en": 4, "score": 3.91, "snippets": [...] }
  ],
  "top_full": [
    { "page": 1, "primary_hits": "<html>...", "equivalent_hits": "<html>...",
      "explanation": "…", "llm_score": 0.82, ... }
  ],
  "page_counts": [
    { "page": 1, "mentions": 24, "de": 20, "en": 4 }
  ],
  "llm": {
    "enabled": true, "available": true, "model": "deepseek-r1:latest",
    "rerank_used": true, "explain_used": true, "query_expansion_used": true
  }
}
```
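For example, from Python with `requests` (a sketch only: `/ingest` is the path named in the persistence section below, while the search path is assumed here to be `/search` — check the Swagger UI at `/docs` for the exact routes):

```python
import requests

BASE = "http://127.0.0.1:8000"

# Ingest a document (multipart/form-data, fields as listed above)
with open("sample.pdf", "rb") as fh:
    resp = requests.post(
        f"{BASE}/ingest",
        files={"file": ("sample.pdf", fh, "application/pdf")},
        data={"doc_id": "sample", "recreate": "false"},
    )
print(resp.json())   # {"doc_id": "sample", "chunks": N}

# Search it (query-string parameters as listed above)
resp = requests.get(
    f"{BASE}/search",
    params={"q": "morgen", "top": 10, "lang": "auto", "use_llm": "false"},
)
for page in resp.json()["pages"]:
    print(page["page"], page["mentions"], page["score"])
```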
All knobs live in docsearch/config.py and are
env-overridable:
| Variable | Default | Purpose |
|---|---|---|
| ES_HOST | http://localhost:9200 | Elasticsearch URL |
| ES_INDEX | doc_chunks | Index name |
| USE_ES | 1 | Toggle ES |
| EMBED_MODEL | paraphrase-multilingual-MiniLM-L12-v2 | Sentence-Transformer model |
| SPACY_DE / SPACY_EN | *_sm models | spaCy models |
| USE_MT | 0 | Enable MarianMT translation (slow first load) |
| OLLAMA_HOST | http://localhost:11434 | Ollama endpoint |
| OLLAMA_MODEL | deepseek-r1:latest | Ollama model tag |
| USE_LLM | 1 | Global LLM toggle |
| LLM_RERANK_K | 10 | # candidates sent to LLM re-ranker |
| LLM_EXPLAIN_K | 5 | # pages explained |
| LLM_RERANK_WEIGHT | 0.4 | Blend weight LLM:hybrid |
┌──────────┐ parse ┌────────────┐ chunk+NER ┌────────────┐
│ PDF/DOCX │ ──────────▶ │ pages │ ────────────▶ │ chunks │
└──────────┘ └────────────┘ └──────┬─────┘
│
┌────────────── embed (LaBSE-class) ────────────────┤
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ FAISS │ │ Elasticsearch │
│ (cosine) │ │ BM25 + dense_vec │
└──────────────┘ └──────────────────┘
│ │
│ query → QueryExpander │
│ ├─ spaCy lemma + NER │
│ ├─ DE↔EN synonym dict │
│ ├─ MarianMT (optional) │
│ └─ ★ Ollama/DeepSeek expansion ★ │
▼ ▼
└──────────── Hybrid fuser ──────────────────────────┘
│
▼
α·BM25̂ + β·cosinê + γ·proximitŷ + δ·NER
│
▼
★ DeepSeek re-ranking (blend) ★
│
▼
Page aggregator
│
▼
★ DeepSeek explanations (top 5) ★
│
▼
JSON → UI (3 columns)
The three ★ are the only places the LLM is invoked.
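As an illustration of the ★ query-expansion hook, a call against Ollama's `/api/generate` endpoint might look like the sketch below. The prompt, function name, and JSON-list output format are assumptions; the project's real prompts live in `docsearch/llm.py`.

```python
import json
import requests

OLLAMA_HOST = "http://localhost:11434"
OLLAMA_MODEL = "deepseek-r1:latest"

def llm_expand(query: str) -> list[str]:
    """Ask the local model for DE/EN synonyms, translations and paraphrases."""
    prompt = (
        "Return a JSON list of German and English synonyms, translations and "
        f"short paraphrases for this search query: {query!r}"
    )
    resp = requests.post(
        f"{OLLAMA_HOST}/api/generate",
        json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
        timeout=60,
    )
    try:
        return json.loads(resp.json().get("response", "[]"))
    except json.JSONDecodeError:
        return []   # model answered in prose — degrade to no expansion

print(llm_expand("morgen"))   # e.g. ["tomorrow", "morning", "Morgen", ...]
```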
Everything that the app writes lives under ./data/.
| File | Contents |
|---|---|
| data/session_output.json | Array appended with {query, timestamp, doc_id, top_pages, llm} for every search this run. |
| data/elastic_output.json | Overwritten each search with the raw BM25 + vector hit lists, ES host/index, and mode (elasticsearch or in-memory-bm25). |
When Elasticsearch isn't available, elastic_output.json still captures the
in-memory BM25 hits under "mode": "in-memory-bm25" so the contract is
identical.
Every ingest and every search also writes a timestamped copy so the full history survives restarts.
| Directory | Filename pattern | Contents |
|---|---|---|
| data/sessions/ | <utc-ts>__<query-slug>.json | One file per search (same shape as a session_output.json row). |
| data/elastic/ | <utc-ts>__<query-slug>.json | One file per search — raw BM25 + vector hits. |
| data/ingests/ | <doc-id>__<utc-ts>.json | One manifest per ingest — file hash, size, chunk metadata (chunk_id, page_no, lang, entities, text preview), pointers to on-disk artefacts. |
| Path | Contents |
|---|---|
| data/docs/<safe_doc_id>/chunks.pkl | Pickled list of Chunk dataclasses. |
| data/docs/<safe_doc_id>/pages.pkl | Pickled {page_no: page_text} dict. |
| data/docs/<safe_doc_id>/faiss.index | FAISS cosine index for this doc. |
| data/docs/<safe_doc_id>/faiss_meta.pkl | Per-vector metadata (chunk_id, page_no, lang, entities). |
| data/chunks.pkl, data/pages.pkl, data/faiss.index, data/faiss_meta.pkl | Legacy canonical copies of the last-ingested doc — used by the fallback loader on server start. |
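These artefacts are plain pickle/FAISS files, so they can be inspected outside the app. A sketch (assuming `faiss_meta.pkl` is a list aligned with the FAISS vector ids, and that the query is encoded with the same model recorded as `embed_model` in `inputs.db`):

```python
import pickle

import faiss
from sentence_transformers import SentenceTransformer

DOC_DIR = "data/docs/sample"                       # <safe_doc_id> = "sample"
MODEL = "paraphrase-multilingual-MiniLM-L12-v2"    # default EMBED_MODEL

with open(f"{DOC_DIR}/chunks.pkl", "rb") as fh:
    chunks = pickle.load(fh)                       # list of Chunk dataclasses
with open(f"{DOC_DIR}/faiss_meta.pkl", "rb") as fh:
    meta = pickle.load(fh)                         # per-vector chunk_id / page_no / lang / entities

index = faiss.read_index(f"{DOC_DIR}/faiss.index")

# IndexFlatIP over L2-normalized vectors = cosine similarity
query_vec = SentenceTransformer(MODEL).encode(["morgen"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 5)
for score, i in zip(scores[0], ids[0]):
    print(round(float(score), 3), meta[i])
```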
A pair of SQLite databases tie everything above together.
Table documents — primary key doc_id, indexed file_hash.
| Column | Notes |
|---|---|
| doc_id | Primary key — user-supplied logical id. |
| filename | Original uploaded filename. |
| file_hash | SHA-256 of the file bytes. Indexed — used for memoization. |
| file_size, num_pages, num_chunks | Summary stats. |
| embed_model | Model that produced the vectors (so stale indexes can be detected). |
| ingested_at | UTC ISO timestamp. |
| chunks_path, pages_path, faiss_path, faiss_meta_path | Pointers into data/docs/<doc_id>/…. |
| manifest_path | Pointer to the data/ingests/…json manifest. |
Memoization: on every POST /ingest, the SHA-256 is computed and looked
up in documents. If the file has been seen before and the on-disk
artefacts still exist, the parse/embed/index steps are skipped entirely and
the cached chunks/FAISS are reused. A second ingest of the same file under
a different doc_id registers a new row that points at the same artefacts.
Force a rebuild with recreate=true (REST) or --recreate (CLI).
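Conceptually, the memoization check is just a hash lookup against the `documents` table. A sketch, not the project's actual code (which lives in `docsearch/pipeline.py` and `docsearch/db.py`):

```python
import hashlib
import sqlite3

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def cached_artefacts(db_path: str, file_path: str):
    """Return (chunks_path, faiss_path) for a previously ingested file, else None."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT chunks_path, faiss_path FROM documents WHERE file_hash = ?",
            (sha256_of(file_path),),
        ).fetchone()
    finally:
        con.close()

hit = cached_artefacts("data/db/inputs.db", "sample.pdf")
print("cache hit" if hit else "cache miss", hit)
```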
Table search_logs — auto-increment log_id, foreign key doc_id.
| Column | Notes |
|---|---|
| log_id | Auto-increment PK. |
| doc_id | Which document was queried. |
| query, lang_hint, use_llm, num_pages | Request parameters. |
| timestamp, duration_ms | UTC ISO + measured wall-clock. |
| session_json | Absolute path to the archived data/sessions/…json. |
| elastic_json | Absolute path to the archived data/elastic/…json. |
| top_pages | JSON string — compact top-10 summary. |
| llm_flags | JSON string — {enabled, available, model, rerank_used, explain_used, query_expansion_used}. |
Because every log row stores the JSON-archive paths, you can replay any historical search by reading the referenced file — nothing is lost even across server restarts.
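For example, replaying the most recent search (a sketch; it assumes the archived session JSON carries a `top_pages` list, per the live-file description above):

```python
import json
import sqlite3

con = sqlite3.connect("data/db/outputs.db")
log_id, query, session_json = con.execute(
    "SELECT log_id, query, session_json FROM search_logs ORDER BY log_id DESC LIMIT 1"
).fetchone()
con.close()

with open(session_json, encoding="utf-8") as fh:
    session = json.load(fh)       # same shape as a session_output.json row

print(f"log {log_id}: {query!r} -> {len(session.get('top_pages', []))} archived pages")
```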
| Endpoint | Purpose |
|---|---|
| GET /docs-db | JSON listing of all ingested documents. |
| GET /logs?doc_id=…&limit=… | JSON listing of search logs (optionally filtered by doc_id). |
For ad-hoc queries, open the .db files with any SQLite browser (e.g.
DB Browser for SQLite) or:
sqlite3 data/db/inputs.db ".schema documents"
sqlite3 data/db/outputs.db "SELECT log_id,doc_id,query,duration_ms FROM search_logs ORDER BY log_id DESC LIMIT 20;"

data/
├── session_output.json # live, reset each server start
├── elastic_output.json # live, overwritten each search
├── sessions/ # archive — one file per search
├── elastic/ # archive — one file per search
├── ingests/ # archive — one manifest per ingest
├── docs/<safe_doc_id>/ # per-doc chunks/pages/faiss (memoized)
│ ├── chunks.pkl
│ ├── pages.pkl
│ ├── faiss.index
│ └── faiss_meta.pkl
├── chunks.pkl pages.pkl # legacy canonical copies (last ingest)
├── faiss.index faiss_meta.pkl
└── db/
├── inputs.db # documents table (PK doc_id)
└── outputs.db # search_logs table (FK doc_id)
Everything under data/ is in .gitignore — regenerated from the source
files on demand.
Per chunk c against query Q:
score(c) = α·BM25̂(Q,c) + β·cosinê(Q,c) + γ·proximitŷ(Q,c) + δ·NER(c)
subject to α+β+γ+δ = 1
All four components are min-max normalized over the candidate pool.
- BM25 — standard (`k1=1.2`, `b=0.75`)
- cosine — on L2-normalized embeddings
- proximity — `m / (1 + span)` over the smallest window containing all query terms (`m` = unique query terms, `span` = token distance); 0 if the terms are not found within `prox_window=30`
- NER — `1` if a query-extracted entity appears in the chunk's entities, else `0`
Per page:
score(p) = log(1 + mentions(p)) · max_c score(c) + λ · mean_c score(c)
with λ=0.3.
If LLM re-ranking is on, for the top LLM_RERANK_K:
final(p) = (1 - w)·score(p) + w·llm_score·max_base
with w = LLM_RERANK_WEIGHT (default 0.4).
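The fusion and aggregation above reduce to a few lines. A sketch with illustrative weights (the real weights and the exact mentions count live in `docsearch/ranker.py`; here the number of matching chunks on a page stands in for `mentions(p)`):

```python
from collections import defaultdict
from math import log

ALPHA, BETA, GAMMA, DELTA = 0.4, 0.4, 0.1, 0.1   # illustrative; must sum to 1
LAMBDA = 0.3

def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def score_pages(candidates):
    """candidates: dicts with raw bm25/cosine/proximity/ner scores and a page_no."""
    bm25 = minmax([c["bm25"] for c in candidates])
    cos  = minmax([c["cosine"] for c in candidates])
    prox = minmax([c["proximity"] for c in candidates])
    ner  = [c["ner"] for c in candidates]            # already in [0, 1]
    chunk_scores = [ALPHA * b + BETA * v + GAMMA * p + DELTA * n
                    for b, v, p, n in zip(bm25, cos, prox, ner)]

    # page score = log(1 + mentions) * max chunk score + lambda * mean chunk score
    per_page = defaultdict(list)
    for c, s in zip(candidates, chunk_scores):
        per_page[c["page_no"]].append(s)
    return {page: log(1 + len(ss)) * max(ss) + LAMBDA * sum(ss) / len(ss)
            for page, ss in per_page.items()}
```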
Indicative wall-clock on a ~1 000-page PDF (≈ 3 000 chunks, AMD Ryzen 5, CPU-only). Your mileage will vary with model size, disk I/O, and Ollama model selection.
| Configuration | Time |
|---|---|
| Parse only (PyMuPDF) | ~3 s |
| Chunk + lemmatize (spaCy *_sm) | ~8 s |
| Embed (paraphrase-multilingual-MiniLM, CPU) | ~60 s |
| Embed (LaBSE, CPU) | ~180 s |
| Embed (LaBSE, GPU) | ~15 s |
| FAISS build | < 1 s |
| ES bulk index (3 000 docs) | ~6 s |
| Configuration | Time / query |
|---|---|
| In-memory BM25 only | ~15 ms |
| FAISS only | ~5 ms |
| Hybrid (BM25 + FAISS), no LLM | ~40 ms |
| Elasticsearch BM25 + kNN | ~60 ms |
| Hybrid + LLM query expansion (deepseek-r1:7b) | +1.5 – 4 s |
| Hybrid + LLM re-ranking (top-10) | +2 – 6 s |
| Hybrid + LLM explanations (top-5) | +4 – 10 s |
| Hybrid + all 3 LLM stages | ~8 – 20 s total |
Tips to stay fast:
- Use the smaller `deepseek-r1:7b` or `deepseek-r1:1.5b`.
- Lower `LLM_RERANK_K` and `LLM_EXPLAIN_K`.
- Keep the UI toggle off for rapid iteration; flip it on when you want the richer output.
# full stack (ES + app)
docker compose up --build
# ES only
docker compose up -d elasticsearch

The compose file mounts ./analysis/ into the ES container so the DE↔EN synonym filter is loaded at index creation time.
docsearch/
├── config.py # all env-tunable settings
├── parser.py # PDF / DOCX / TXT → (page_no, text)
├── preprocess.py # langdetect + spaCy lemma + NER + chunking
├── embedder.py # sentence-transformers wrapper
├── indexer_es.py # Elasticsearch mapping + BM25 + kNN
├── indexer_faiss.py # FAISS cosine index + metadata
├── synonyms.py # DE↔EN seed dictionary + word-bucket classifier
├── query.py # query expansion (terms, synonyms, MT, LLM★, NER)
├── ranker.py # proximity, weighted hybrid, page aggregation
├── llm.py # Ollama client — the 3 LLM integration points
├── output_log.py # session + elastic + ingest archive writers
├── db.py # SQLite (inputs.db + outputs.db) persistence
├── pipeline.py # DocSearch orchestrator (hash-memoized ingest)
├── cli.py # `python -m docsearch.cli ...`
├── server.py # FastAPI app
└── static/
└── index.html # self-contained web UI
analysis/
└── de_en_synonyms.txt # mounted into ES for synonym_graph filter
data/ # runtime artefacts (FAISS, pickles, output JSON)
- pymupdf `.exe` / permission error on `pip install` → you're installing into system Python. Use a virtualenv (see Installation).
- `OSError: [E050] Can't find model 'de_core_news_sm'` → download the model inside the active venv.
- LLM status pill is red → Ollama isn't running, or the model named in `OLLAMA_MODEL` isn't pulled. Run `ollama list` to check.
- "No document indexed yet" → ingest at least once (UI panel 1, or CLI).
- Docker error `dockerDesktopLinuxEngine: cannot find file` → start Docker Desktop and wait for the whale icon to go solid.
- `embeddings.position_ids | UNEXPECTED` → benign Sentence-Transformers checkpoint warning; ignore.
MIT.
{ "query": "morgen", "lang_hint": "de", "timestamp": "...", "expanded": { "raw": "...", "lang": "de", "translated": "tomorrow", "synonyms": [...], "paraphrases": [...], "entities": [...], "terms": [...], "llm_used": true }, "pages": [ { "page": 1, "mentions": 24, "de": 20, "en": 4, "score": 3.91, "snippets": [...] } ], "top_full": [ { "page": 1, "primary_hits": "<html>...", "equivalent_hits": "<html>...", "explanation": "…", "llm_score": 0.82, ... } ], "page_counts": [ { "page": 1, "mentions": 24, "de": 20, "en": 4 } ], "llm": { "enabled": true, "available": true, "model": "deepseek-r1:latest", "rerank_used": true, "explain_used": true, "query_expansion_used": true } }