File Classifier

A local financial document classifier that examines PDF documents to determine:

Source: the originating institution (e.g., "HDFC", "Axis Bank")
Account Type: the account category (e.g., "Credit Card", "Savings Account")
Statement Type: the kind of statement (e.g., "Monthly Statement", "Interest Certificate")

The classifier extracts text from the document and returns a candidate classification with a confidence score. Everything runs locally, with no external API calls. A downstream extractor pipeline (out of scope) consumes this output.

This classifier will be used as a tool by a financial agent (outside the scope of this repo). Given a file, the agent invokes this classification tool to determine the source, account type, and statement type. After the user confirms the result, the agent invokes a separate statement-specific tool to extract the full contents.

Usage: classify.sh [options] <file>

Options:
  --help                Show usage
  --model <name>        Ollama model (default: llama3.1:8b)
  --kb <path>           Knowledge base file (default: ./kb.yaml)
  --timeout <seconds>   LLM request timeout (default: 60)
  --verbose             Show intermediate steps on stderr

Arguments:
  <file>                Path to input file

Output format on successful extraction

{
  "input": {
    "file": "statement_jan2025.pdf",
    "file_type": "pdf"
  },
  "result": {
    "status": "success",
    "confidence": "HIGH",
    "issuer": "...",
    "account_type": "...",
    "statement_type": "..."
  },
  "additional_info": {
    "pages_analyzed": [1, 12],
    "text_analysis": {
      "status": "success",
      "confidence": "HIGH",
      "issuer": "...",
      "account_type": "...",
      "statement_type": "...",
      "evidence": [
        ... list of matching fingerprints ...
      ]
    },
    "llm_classification": {
      "status": "success",
      "confidence": "HIGH",
      "issuer": "...",
      "account_type": "...",
      "statement_type": "...",
      "evidence": [
        ... list of matching evidence ...
      ]
    }
  }
}

Notes

status is one of "success", "no_match", or "error"
confidence is one of "HIGH", "MEDIUM", or "LOW"
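
A consumer of this output may want to check the status and confidence values before acting on them. The following is a minimal sketch, not part of this repo: the names `Classification` and `parse_result` are hypothetical, and it assumes that the fields other than status may be absent when there is no match.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Allowed values, taken from the Notes above.
VALID_STATUS = {"success", "no_match", "error"}
VALID_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}


@dataclass
class Classification:
    """Hypothetical container for the top-level "result" object."""
    status: str
    confidence: Optional[str]
    issuer: Optional[str]
    account_type: Optional[str]
    statement_type: Optional[str]


def parse_result(raw: str) -> Classification:
    """Parse the classifier's JSON output and validate the enumerated fields."""
    result = json.loads(raw)["result"]
    status = result["status"]
    if status not in VALID_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    # Assumption: confidence and the label fields may be missing on no_match.
    confidence = result.get("confidence")
    if confidence is not None and confidence not in VALID_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {confidence!r}")
    return Classification(
        status=status,
        confidence=confidence,
        issuer=result.get("issuer"),
        account_type=result.get("account_type"),
        statement_type=result.get("statement_type"),
    )
```

Validating up front keeps a malformed or unexpected value from silently reaching the downstream extractor.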

Output format on error

{
  "file": "bad.txt",
  "error": "... details of error ...."
}

Exit Codes

0 : Classified successfully
1 : No match or not classifiable
2 : Error (unsupported file, server down, extraction failure, etc.)
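
A calling agent could branch on these exit codes roughly as follows. This is an illustrative sketch under stated assumptions, not the agent's actual integration: `interpret_run` and `run_classifier` are hypothetical names, the JSON payload is assumed to be written to stdout (the usage text only states that --verbose output goes to stderr), and any exit code not listed above is treated as an error.

```python
import json
import subprocess
from typing import Any, Dict, Tuple

# Exit codes as documented above.
EXIT_MEANINGS = {0: "classified", 1: "no_match", 2: "error"}


def interpret_run(returncode: int, stdout: str) -> Tuple[str, Dict[str, Any]]:
    """Map an exit code and captured stdout to (outcome, payload)."""
    # Assumption: undocumented exit codes are treated as errors.
    outcome = EXIT_MEANINGS.get(returncode, "error")
    if not stdout.strip():
        return outcome, {}
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        # Unparseable output is reported as an error regardless of exit code.
        return "error", {"error": "stdout was not valid JSON"}
    return outcome, payload


def run_classifier(file: str, script: str = "./classify.sh") -> Tuple[str, Dict[str, Any]]:
    """Run the classifier on one file and interpret the result."""
    # Assumption: the JSON result is printed on stdout.
    proc = subprocess.run([script, file], capture_output=True, text=True)
    return interpret_run(proc.returncode, proc.stdout)
```

Keeping the interpretation separate from the subprocess call lets the mapping be tested without an Ollama server running.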

Design

See DESIGN.md for the full design document covering architecture, classification algorithm, prompt structure, output format, error handling, and future enhancements.

Debugging

./classify.sh --verbose test-data/AAA.pdf > output.txt 2>&1

While users should use only the top-level classify.sh, the lower-level scripts can also be run directly for debugging. First create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate

followed by

python scripts/extract_pdf.py test-data/FFF.pdf > {textfile}
python scripts/llm_classifier.py {textfile} kb.yaml --model llama3.1:8b --timeout 60 --verbose

Deployment

The classification scripts in this repo will ultimately be used as a tool in an LLM-based financial agent. This section describes the process for packaging and deploying them.

TBD
