A local financial document classifier that examines documents in PDF format to determine:
- Source: the originating institution (e.g., "HDFC", "Axis Bank")
- Account Type: the account category (e.g., "Credit Card", "Savings Account")
- Statement Type: the kind of statement (e.g., "Monthly Statement", "Interest Certificate")
The classifier extracts text from the document and returns a candidate with a confidence score. Everything runs locally; no external API calls are made. A downstream extractor pipeline (out of scope) consumes this output.
This classifier will be used as a tool by a financial agent (outside the scope of this repo). Given a file, the agent will invoke this classification tool to determine the source, account type, and statement type. After user confirmation, the agent will invoke another statement-specific tool to extract the full contents.
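The agent-side integration is out of scope for this repo, but as a rough sketch, a caller could assemble the classifier invocation from the flags documented in the usage section below (the helper function here is hypothetical, not part of the repo):

```python
# Sketch: build the argv list an agent could pass to subprocess.run()
# to invoke the classification tool. build_classify_command is a
# hypothetical helper; flag names come from the classify.sh usage text.
def build_classify_command(file, model="llama3.1:8b", kb="./kb.yaml",
                           timeout=60, verbose=False):
    cmd = ["./classify.sh", "--model", model, "--kb", kb,
           "--timeout", str(timeout)]
    if verbose:
        cmd.append("--verbose")
    cmd.append(file)
    return cmd

print(build_classify_command("statement_jan2025.pdf"))
```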
- Docker
- Python
- Ollama
  - download from https://ollama.ai and install, OR
  - brew install ollama (also install the command-line tool ollama; instructions are part of the Ollama downloader)
  - ollama pull llama3.1:8b to download the model
  - ollama list to list all the downloaded models
Instructions to use the classifier from this repo
./setup.sh
./start-server.sh    # Start Ollama, pull default model if needed
./stop-server.sh     # Stop Ollama server

Institutions, account types, and statement types are defined in kb.yaml. The classifier matches documents against entries in this file.
Documents that don't match any KB entry return an empty candidates list.
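The kb.yaml schema is not documented in this section; purely as an illustration, an entry might pair an institution with its account and statement types (every field name below is hypothetical — the real schema is defined by the repo):

```yaml
# Hypothetical kb.yaml structure, for illustration only.
institutions:
  - name: HDFC
    account_types: [Credit Card, Savings Account]
    statement_types: [Monthly Statement, Interest Certificate]
```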
Usage: classify.sh [options] <file>
Options:
--help Show usage
--model <name> Ollama model (default: llama3.1:8b)
--kb <path> Knowledge base file (default: ./kb.yaml)
--timeout <seconds> LLM request timeout (default: 60)
--verbose Show intermediate steps on stderr
Arguments:
  <file>                Path to input file

Output format on successful extraction
{
"input": {
"file": "statement_jan2025.pdf",
"file_type": "pdf"
},
"result": {
"status" : "success",
"confidence": "HIGH",
"issuer": "...",
"account_type" : "...",
"statement_type" : "..."
},
"additional_info" : {
"pages_analyzed": [1, 12],
"text_analysis" : {
"status" : "success",
"confidence": "HIGH",
"issuer": "...",
"account_type" : "...",
"statement_type" : "...",
"evidence" : [
... list of matching fingerprints ...
]
},
"llm_classification" : {
"status" : "success",
"confidence": "HIGH",
"issuer": "...",
"account_type" : "...",
"statement_type" : "...",
"evidence" : [
... list of matching evidence ...
]
}
}
}

Notes
- status can be "success", "no_match", or "error"
- confidence can be HIGH, MEDIUM, or LOW
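A consumer can sanity-check the success payload against the documented status and confidence values before handing it to the downstream extractor; a minimal sketch (the validator function is illustrative, not part of the repo):

```python
import json

# Allowed values taken from the notes above.
VALID_STATUS = {"success", "no_match", "error"}
VALID_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}

def validate_result(payload):
    # Check the top-level "result" object against the documented enums.
    result = json.loads(payload)["result"]
    return result["status"] in VALID_STATUS and \
           result["confidence"] in VALID_CONFIDENCE

sample = '''{"input": {"file": "statement_jan2025.pdf", "file_type": "pdf"},
             "result": {"status": "success", "confidence": "HIGH",
                        "issuer": "HDFC", "account_type": "Credit Card",
                        "statement_type": "Monthly Statement"}}'''
print(validate_result(sample))  # True
```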
Output format on error
{
"file": "bad.txt",
"error": "... details of error ...."
}

Exit Codes
- 0: Classified successfully
- 1: No match or not classifiable
- 2: Error (unsupported file, server down, extraction failure, etc.)
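A caller can branch on these exit codes after running the script; a minimal sketch (the mapping helper is illustrative, not part of the repo):

```python
# Map the documented classify.sh exit codes to outcome labels.
EXIT_CODES = {
    0: "classified",  # classified successfully
    1: "no_match",    # no match or not classifiable
    2: "error",       # unsupported file, server down, extraction failure
}

def outcome(returncode):
    # Anything outside the documented set is treated as unknown.
    return EXIT_CODES.get(returncode, "unknown")

print(outcome(0))  # classified
```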
See DESIGN.md for the full design document covering architecture, classification algorithm, prompt structure, output format, error handling, and future enhancements.
./classify.sh --verbose test-data/AAA.pdf > output.txt 2>&1

While users should use only the top-level classify.sh, the lower-level scripts can also be used directly for debugging:
python -m venv .venv
source .venv/bin/activate

followed by

python scripts/extract_pdf.py test-data/FFF.pdf > {textfile}
python scripts/llm_classifier.py {textfile} kb.yaml --model llama3.1:8b --timeout 60 --verbose

The classification scripts in this repo will ultimately be used as a tool in an LLM-based financial agent. This section contains the process for packaging and deployment.
TBD