
# Persona-Driven Document Intelligence

**Adobe Hackathon 2025 Project**

This repository implements a persona-aware document analysis engine that processes collections of PDFs to extract relevant information based on specific user roles (personas) and job tasks. It uses a hybrid AI approach that combines semantic similarity, keyword extraction, and statistical analysis to produce structured JSON outputs.


## 📁 Folder Structure

```
.
├── Dockerfile                   # Dockerfile for containerization
├── input/                       # Primary input directory (can hold multiple collections)
├── output/                      # Output JSONs for each collection
│   ├── Collection 1_output.json
│   ├── Collection 2_output.json
│   ├── Collection 3_output.json
│   └── Collection 4_output.json
├── requirements.txt             # Python dependency list
├── sample_dataset/              # Example data for local testing
│   ├── Collection 1/
│   │   ├── challenge1b_input.json
│   │   └── PDFs/
│   ├── Collection 2/
│   │   ├── challenge1b_input.json
│   │   └── PDFs/
│   ├── Collection 3/
│   │   ├── challenge1b_input.json
│   │   ├── challenge1b_output.json
│   │   └── PDFs/
│   └── Collection 4/
│       ├── challenge1b_input.json
│       └── PDFs/
├── script.py                    # Main analysis script
└── README.md                    # This file
```

## 🧠 Approach & Models Used

### Overview

The script implements a job-aware PDF analyzer: given a persona and task, it reads each PDF, extracts semantically relevant sections with a hybrid NLP pipeline, and scores them by balancing semantic embeddings, keyword extraction, and TF-IDF statistics.


### 🔍 Core Models & Libraries

#### 1. 🧠 Sentence Transformers (Semantic Embedding)

```python
SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
```

- **Usage:** converts sections and the query into high-dimensional vector embeddings
- **Fallback:** `paraphrase-MiniLM-L6-v2`
- **Purpose:** computes semantic similarity between document sections and the user task
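
A minimal sketch of how these calls fit together, assuming the cosine-similarity utility from `sentence-transformers` (the query and section strings below are hypothetical):

```python
from sentence_transformers import SentenceTransformer, util

# Primary embedding model, with the lighter fallback if loading fails.
try:
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
except Exception:
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Hypothetical persona+task query and extracted section texts.
query = "Travel Planner: plan a 4-day trip for 10 college friends"
sections = ["Nightlife and Entertainment", "Packing Tips", "Water Sports"]

query_emb = model.encode(query, convert_to_tensor=True)
section_embs = model.encode(sections, convert_to_tensor=True)

# One cosine-similarity score per section; higher means more relevant.
scores = util.cos_sim(query_emb, section_embs)[0]
```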

#### 2. 🧪 spaCy (NLP Pipeline)

```python
spacy.load("en_core_web_sm")
```

- Part-of-speech tagging
- Named entity recognition
- Lemmatization and noun phrase extraction
- Used to validate headings and keyword tokens
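
A sketch of the spaCy calls this implies (the exact validation rules live in `script.py`; the sentence is a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Plan a 4-day trip for a group of 10 college friends.")

# Building blocks for keyword candidates and heading validation.
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
entities = [(ent.text, ent.label_) for ent in doc.ents]
content_lemmas = [tok.lemma_ for tok in doc
                  if tok.pos_ in {"NOUN", "VERB", "ADJ"} and not tok.is_stop]
```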

#### 3. 🔑 KeyBERT (Keyword Extraction)

```python
KeyBERT(self.sentence_model)
```

- Extracts key phrases from the persona + job context using BERT embeddings
- Helps identify which content is most important to extract from the documents
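
Sketched usage, self-contained for clarity (the `keyphrase_ngram_range` and `top_n` values are assumptions, not confirmed settings):

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# KeyBERT is built around the same sentence-transformer model.
kw_model = KeyBERT(SentenceTransformer('paraphrase-MiniLM-L6-v2'))

context = "Persona: Travel Planner. Job: plan a 4-day trip for 10 college friends."
keywords = kw_model.extract_keywords(
    context,
    keyphrase_ngram_range=(1, 3),  # assumed; not a confirmed setting
    stop_words='english',
    top_n=10,
)  # -> list of (phrase, relevance) pairs
```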

#### 4. 🧮 TF-IDF Vectorizer

```python
TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
```

- Adds a statistical perspective to the semantic pipeline
- Used to compute token overlap and match strength
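
For instance (a hedged sketch; `query` and `sections` are placeholder inputs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "plan a 4-day trip for college friends"
sections = ["Nightlife and Entertainment", "Packing Tips", "Water Sports"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')

# Fit on sections plus the query so all share one vocabulary,
# then score the query row against every section row.
matrix = vectorizer.fit_transform(sections + [query])
tfidf_scores = cosine_similarity(matrix[-1], matrix[:-1])[0]
```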

### ⚙️ Processing Pipeline (3 Phases)

#### Phase 1: Enhanced Heading Extraction

Extracts headings using:

- **Font analysis:** large, bold fonts
- **Position heuristics:** standalone lines, punctuation checks
- **Regex + semantic scoring:** chapter titles, ALL CAPS, numbered headings
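
A minimal sketch of the font/position heuristic, assuming PyMuPDF (`fitz`) as the PDF reader (the README does not name the library, and the size threshold is a placeholder):

```python
import fitz  # PyMuPDF; an assumed choice of PDF library

def candidate_headings(pdf_path: str, min_size: float = 13.0) -> list[str]:
    """Collect short, standalone lines set in large or bold type."""
    headings = []
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        large = span["size"] >= min_size
                        bold = "bold" in span["font"].lower()
                        # Headings are short and rarely end with a period.
                        if text and (large or bold) and not text.endswith("."):
                            headings.append(text)
    return headings
```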

#### Phase 2: Contextual Keyword Generation

Generates keywords using:

- **KeyBERT** for context-aware phrases
- **spaCy** for noun/verb/adjective chunks
- **Semantic expansion** using sentence similarity
- **Custom rules** to ensure domain specificity
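
The semantic-expansion step could look like this sketch (the candidate list and the 0.3 threshold are illustrative, not values from `script.py`):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

task = "Plan a 4-day trip for a group of 10 college friends"
candidates = ["itinerary", "budget accommodation", "nightlife", "tax law"]

# Keep only candidates that stay semantically close to the task.
sims = util.cos_sim(model.encode(task), model.encode(candidates))[0]
expanded = [kw for kw, s in zip(candidates, sims) if float(s) > 0.3]
```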

#### Phase 3: Hybrid Scoring System

Calculates final scores with:

- **TF-IDF similarity** (30%)
- **Semantic embedding similarity** (40%)
- **Individual keyword matching** (30%)
- **Document length normalization** to ensure fairness
- **Perfect-match bonus** and **title relevance scoring**
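
Put together, the blend might look like this sketch (the weights come from the list above; the normalization and bonus formulas are assumptions, with the exact versions defined in `script.py`):

```python
import math

# Weights stated above: TF-IDF 30%, semantic 40%, keyword matching 30%.
W_TFIDF, W_SEMANTIC, W_KEYWORD = 0.30, 0.40, 0.30

def hybrid_score(tfidf_sim: float, semantic_sim: float, keyword_match: float,
                 n_tokens: int, title_relevance: float = 0.0,
                 perfect_match: bool = False) -> float:
    base = (W_TFIDF * tfidf_sim
            + W_SEMANTIC * semantic_sim
            + W_KEYWORD * keyword_match)
    # Length normalization (assumed form): damp very long sections.
    base /= 1.0 + math.log1p(n_tokens) / 10.0
    # Bonus terms named above; the magnitudes are placeholders.
    if perfect_match:
        base += 0.10
    return base + 0.05 * title_relevance
```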

## 🧪 Key Innovations

- Fairness through length normalization
- Multimodal heading extraction (visual + positional + semantic)
- Persona-driven keyword expansion
- Hybrid scoring system (semantic + statistical + keyword)

## 🏗️ Architecture Diagram

```mermaid
graph TD
    A[PDF Input] --> B[Heading Extraction]
    B --> C[Font Analysis]
    B --> D[Position Analysis]
    B --> E[Content Pattern Analysis]

    F[Persona + Task] --> G[Keyword Generation]
    G --> H[KeyBERT]
    G --> I[spaCy NLP]
    G --> J[Semantic Expansion]

    E --> K[Section Filtering]
    J --> K
    K --> L[TF-IDF Similarity]
    K --> M[Semantic Similarity]
    K --> N[Keyword Matching]

    L --> O[Hybrid Score Calculation]
    M --> O
    N --> O
    O --> P[Normalization]
    P --> Q[Final Ranked Output]
```

## 📤 Output Format

Output is saved in two places:

- The global `output/` directory, as `Collection X_output.json`
- Inside each input collection folder (optional)

Each output includes:

- Extracted sections
- Relevance scores
- Matching metadata
- Document statistics
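
An illustrative shape for one record, written as a Python dict so the fields can be annotated (all field names and values here are placeholders, not the exact `challenge1b_output.json` schema):

```python
example_output = {
    "metadata": {                       # matching metadata
        "persona": "Travel Planner",
        "job_to_be_done": "Plan a 4-day trip for 10 college friends",
        "input_documents": ["guide.pdf"],
    },
    "extracted_sections": [             # ranked by the hybrid score
        {
            "document": "guide.pdf",
            "section_title": "Packing Checklist",
            "relevance_score": 0.87,
            "page_number": 3,
        },
    ],
    "document_statistics": {"pdfs_processed": 7, "sections_scored": 120},
}
```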

## 🐳 Docker Usage

### 🔧 Build the Image

```bash
docker build --platform linux/amd64 -t <image-name> .
```

Replace `<image-name>` with your tag, e.g., `persona-doc-intelligence:v1`.


### ▶️ Run the Container with Default Input

```bash
docker run --rm \
  -v $(pwd)/input:/app/input:ro \
  -v $(pwd)/output:/app/output:rw \
  --network none \
  <image-name>
```

### ▶️ Run with Sample Dataset

```bash
docker run --rm \
  -v $(pwd)/sample_dataset:/app/input:ro \
  -v $(pwd)/output:/app/output:rw \
  --network none \
  <image-name>
```

## ⏱️ Performance

- Efficient processing: 5–10 PDFs in under 60 seconds
- Optimized for batch inference and multi-collection analysis

## 🧾 License

This project was developed for the Adobe Hackathon and is provided for research and demonstration purposes.