Classify PDF documents based on text analysis and LLM

Sdaas/pdf-classifier

File Classifier

A local financial document classifier that examines PDF documents to determine:

  • Source: the originating institution (e.g., "HDFC", "Axis Bank")
  • Account Type: the account category (e.g., "Credit Card", "Savings Account")
  • Statement Type: the kind of statement (e.g., "Monthly Statement", "Interest Certificate")

The classifier extracts text from each document and returns a candidate classification with a confidence score. Everything runs locally, with no external API calls. A downstream extractor pipeline (out of scope) consumes this output.

This classifier will be used as a tool by a financial agent (outside the scope of this repo). Given a file, the agent invokes this classification tool to determine the source, account type, and statement type. After user confirmation, the agent invokes a separate statement-specific tool to extract the full contents.

Dependencies

  • Docker
  • Python
  • Ollama
    • download from https://ollama.ai and install
    • OR brew install ollama
    • Also install the ollama command-line tool (instructions are part of the Ollama downloader)
    • ollama pull llama3.1:8b - download the model
    • ollama list - list all the downloaded models
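Before running the classifier, it can help to confirm that the model is actually present. The following is a minimal sketch, not part of this repo: it assumes the Ollama server is listening on its default local address and uses Ollama's /api/tags endpoint, which lists downloaded models. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def is_model_available(model: str, url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> bool:
    """Return True if the local Ollama server lists `model`.

    Returns False if the model is missing, the server is unreachable,
    or the response is not valid JSON.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        return False
    return model in model_names(payload)
```

For example, is_model_available("llama3.1:8b") should return True once ollama pull llama3.1:8b has completed and the server is running.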

Usage

Instructions to use the classifier from this repo

First-time setup

./setup.sh

Server management

./start-server.sh              # Start Ollama, pull default model if needed
./stop-server.sh               # Stop Ollama server

Knowledge Base

Institutions, account types, and statement types are defined in kb.yaml. The classifier matches documents against entries in this file. Documents that don't match any KB entry return an empty candidates list.
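To illustrate the matching idea, here is a minimal sketch under stated assumptions. The schema of kb.yaml is not shown in this README, so the entry shape below (issuer, account_type, statement_type, fingerprints) is hypothetical, as is the rule that a single matching fingerprint is enough to produce a candidate. The actual algorithm is described in DESIGN.md.

```python
# Hypothetical in-memory stand-in for kb.yaml; the real schema may differ.
KB = [
    {
        "issuer": "HDFC",
        "account_type": "Credit Card",
        "statement_type": "Monthly Statement",
        "fingerprints": ["hdfc bank", "credit card statement"],
    },
    {
        "issuer": "Axis Bank",
        "account_type": "Savings Account",
        "statement_type": "Interest Certificate",
        "fingerprints": ["axis bank", "interest certificate"],
    },
]


def find_candidates(text: str, kb: list[dict]) -> list[dict]:
    """Return KB entries with at least one fingerprint found in the text.

    Matching is case-insensitive substring search (an assumption).
    Candidates are ordered by number of matched fingerprints, most first.
    """
    lowered = text.lower()
    candidates = []
    for entry in kb:
        matched = [fp for fp in entry["fingerprints"] if fp.lower() in lowered]
        if matched:
            candidates.append({**entry, "evidence": matched})
    candidates.sort(key=lambda c: len(c["evidence"]), reverse=True)
    return candidates
```

Text that matches no entry yields an empty list, which mirrors the empty candidates list described above.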

CLI Interface

Usage: classify.sh [options] <file>

Options:
  --help                Show usage
  --model <name>        Ollama model (default: llama3.1:8b)
  --kb <path>           Knowledge base file (default: ./kb.yaml)
  --timeout <seconds>   LLM request timeout (default: 60)
  --verbose             Show intermediate steps on stderr

Arguments:
  <file>                Path to input file

Output format on successful extraction

{
  "input": {
    "file": "statement_jan2025.pdf",
    "file_type": "pdf"
  },
  "result": {
    "status": "success",
    "confidence": "HIGH",
    "issuer": "...",
    "account_type": "...",
    "statement_type": "..."
  },
  "additional_info": {
    "pages_analyzed": [1, 12],
    "text_analysis": {
      "status": "success",
      "confidence": "HIGH",
      "issuer": "...",
      "account_type": "...",
      "statement_type": "...",
      "evidence": [
        ... list of matching fingerprints ...
      ]
    },
    "llm_classification": {
      "status": "success",
      "confidence": "HIGH",
      "issuer": "...",
      "account_type": "...",
      "statement_type": "...",
      "evidence": [
        ... list of matching evidence ...
      ]
    }
  }
}

Notes

  • status is one of "success", "no_match", or "error"
  • confidence is one of HIGH, MEDIUM, or LOW

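A consumer of this output may want to check whether the two classifiers agree before trusting the top-level result. The sketch below is one possible way to do that, assuming the JSON shape shown above; the summarize function and its return shape are hypothetical and not part of this repo.

```python
import json

# The three classification fields present in each result block above.
FIELDS = ("issuer", "account_type", "statement_type")


def summarize(output_json: str) -> dict:
    """Pull the final classification out of the classifier output and
    list the fields on which text analysis and the LLM disagree."""
    data = json.loads(output_json)
    result = data["result"]
    info = data.get("additional_info", {})
    text = info.get("text_analysis", {})
    llm = info.get("llm_classification", {})
    disagreements = [f for f in FIELDS if text.get(f) != llm.get(f)]
    return {
        "status": result["status"],
        "confidence": result.get("confidence"),
        "classification": {f: result.get(f) for f in FIELDS},
        "disagreements": disagreements,
    }
```

An agent could, for instance, ask the user for confirmation more explicitly when disagreements is non-empty or confidence is LOW.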
Output format on error

{
  "file": "bad.txt",
  "error": "... details of error ...."
}

Exit Codes

  • 0 : Classified successfully
  • 1 : No match or not classifiable
  • 2 : Error (unsupported file, server down, extraction failure, etc.)
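The exit codes make it straightforward to call the classifier from another program. The following is a minimal sketch of such a wrapper, assuming the JSON result is written to stdout (the README only states that --verbose output goes to stderr). The function names and the outcome labels are hypothetical.

```python
import json
import subprocess

# Mapping of the documented exit codes to short labels (labels are illustrative).
EXIT_MEANINGS = {0: "classified", 1: "no_match", 2: "error"}


def interpret(returncode: int, stdout: str) -> dict:
    """Combine the exit code and captured stdout into a single outcome dict."""
    outcome = EXIT_MEANINGS.get(returncode, "unknown")
    try:
        payload = json.loads(stdout) if stdout.strip() else None
    except json.JSONDecodeError:
        payload = None  # stdout was not valid JSON
    return {"outcome": outcome, "payload": payload}


def classify(pdf_path: str, script: str = "./classify.sh") -> dict:
    """Run the classifier script on one file and interpret its result."""
    proc = subprocess.run([script, pdf_path], capture_output=True, text=True)
    return interpret(proc.returncode, proc.stdout)
```

Calling classify("test-data/AAA.pdf") from the repository root would then return a dict whose outcome field reflects the exit code, with payload holding the parsed JSON when available.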

Design

See DESIGN.md for the full design document covering architecture, classification algorithm, prompt structure, output format, error handling, and future enhancements.

Debugging

./classify.sh --verbose test-data/AAA.pdf > output.txt 2>&1

While users should normally use only the top-level classify.sh, the lower-level scripts can also be run directly for debugging:

python -m venv .venv
source .venv/bin/activate

followed by

python scripts/extract_pdf.py test-data/FFF.pdf > {textfile}
python scripts/llm_classifier.py {textfile} kb.yaml --model llama3.1:8b --timeout 60 --verbose

Deployment

The classification scripts in this repo will ultimately be used as a tool in an LLM-based financial agent. This section describes the process for packaging and deploying them.

TBD
