
# Persona-Driven Document Intelligence

**Adobe Hackathon 2025 Project**

This repository implements a persona-aware document analysis engine that processes collections of PDFs to extract relevant information based on specific user roles (personas) and job tasks. It uses a hybrid AI approach that combines semantic similarity, keyword extraction, and statistical analysis to produce structured JSON outputs.


## 📁 Folder Structure

```
.
├── Dockerfile                   # Dockerfile for containerization
├── input/                       # Primary input directory (can hold multiple collections)
├── output/                      # Output JSONs for each collection
│   ├── Collection 1_output.json
│   ├── Collection 2_output.json
│   ├── Collection 3_output.json
│   └── Collection 4_output.json
├── requirements.txt             # Python dependency list
├── sample_dataset/              # Example data for local testing
│   ├── Collection 1/
│   │   ├── challenge1b_input.json
│   │   └── PDFs/
│   ├── Collection 2/
│   │   ├── challenge1b_input.json
│   │   └── PDFs/
│   ├── Collection 3/
│   │   ├── challenge1b_input.json
│   │   ├── challenge1b_output.json
│   │   └── PDFs/
│   └── Collection 4/
│       ├── challenge1b_input.json
│       └── PDFs/
├── script.py                    # Main analysis script
└── README.md                    # This file
```

## 🧠 Approach & Models Used

### Overview

The script implements a job-aware PDF analyzer: given a persona and task, it reads each PDF, extracts semantically relevant sections with a hybrid NLP pipeline, and scores them by balancing semantic embeddings, keyword extraction, and TF-IDF statistics.


### 🔍 Core Models & Libraries

#### 1. 🧠 Sentence Transformers (Semantic Embedding)

```python
SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
```

- **Usage:** converts sections and the query into high-dimensional vector embeddings
- **Fallback:** `paraphrase-MiniLM-L6-v2`
- **Purpose:** computes semantic similarity between document sections and the user task
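
A minimal sketch of how these calls fit together, assuming the cosine-similarity utility from `sentence-transformers` (the query and section strings below are hypothetical):

```python
from sentence_transformers import SentenceTransformer, util

# Primary embedding model, with the lighter fallback if loading fails.
try:
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
except Exception:
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Hypothetical persona+task query and extracted section texts.
query = "Travel Planner: plan a 4-day trip for 10 college friends"
sections = ["Nightlife and Entertainment", "Packing Tips", "Water Sports"]

query_emb = model.encode(query, convert_to_tensor=True)
section_embs = model.encode(sections, convert_to_tensor=True)

# One cosine-similarity score per section; higher means more relevant.
scores = util.cos_sim(query_emb, section_embs)[0]
```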

#### 2. 🧪 spaCy (NLP Pipeline)

```python
spacy.load("en_core_web_sm")
```

- Part-of-speech tagging
- Named entity recognition
- Lemmatization and noun phrase extraction
- Used to validate headings and keyword tokens
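
A sketch of the spaCy calls this implies (the exact validation rules live in `script.py`; the sentence is a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Plan a 4-day trip for a group of 10 college friends.")

# Building blocks for keyword candidates and heading validation.
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
entities = [(ent.text, ent.label_) for ent in doc.ents]
content_lemmas = [tok.lemma_ for tok in doc
                  if tok.pos_ in {"NOUN", "VERB", "ADJ"} and not tok.is_stop]
```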

#### 3. 🔑 KeyBERT (Keyword Extraction)

```python
KeyBERT(self.sentence_model)
```

- Extracts key phrases from the persona + job context using BERT embeddings
- Helps identify which content is most important to extract from the documents
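
Sketched usage, self-contained for clarity (the `keyphrase_ngram_range` and `top_n` values are assumptions, not confirmed settings):

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# KeyBERT is built around the same sentence-transformer model.
kw_model = KeyBERT(SentenceTransformer('paraphrase-MiniLM-L6-v2'))

context = "Persona: Travel Planner. Job: plan a 4-day trip for 10 college friends."
keywords = kw_model.extract_keywords(
    context,
    keyphrase_ngram_range=(1, 3),  # assumed; not a confirmed setting
    stop_words='english',
    top_n=10,
)  # -> list of (phrase, relevance) pairs
```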

#### 4. 🧮 TF-IDF Vectorizer

```python
TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
```

- Adds a statistical perspective to the semantic pipeline
- Used to compute token overlap and match strength
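
For instance (a hedged sketch; `query` and `sections` are placeholder inputs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "plan a 4-day trip for college friends"
sections = ["Nightlife and Entertainment", "Packing Tips", "Water Sports"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')

# Fit on sections plus the query so all share one vocabulary,
# then score the query row against every section row.
matrix = vectorizer.fit_transform(sections + [query])
tfidf_scores = cosine_similarity(matrix[-1], matrix[:-1])[0]
```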

### ⚙️ Processing Pipeline (3 Phases)

#### Phase 1: Enhanced Heading Extraction

Extracts headings using:

- **Font analysis:** large, bold fonts
- **Position heuristics:** standalone lines, punctuation checks
- **Regex + semantic scoring:** chapter titles, ALL CAPS, numbered headings
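
A minimal sketch of the font/position heuristic, assuming PyMuPDF (`fitz`) as the PDF reader (the README does not name the library, and the size threshold is a placeholder):

```python
import fitz  # PyMuPDF; an assumed choice of PDF library

def candidate_headings(pdf_path: str, min_size: float = 13.0) -> list[str]:
    """Collect short, standalone lines set in large or bold type."""
    headings = []
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        large = span["size"] >= min_size
                        bold = "bold" in span["font"].lower()
                        # Headings are short and rarely end with a period.
                        if text and (large or bold) and not text.endswith("."):
                            headings.append(text)
    return headings
```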

#### Phase 2: Contextual Keyword Generation

Generates keywords using:

- **KeyBERT** for context-aware phrases
- **spaCy** for noun/verb/adjective chunks
- **Semantic expansion** using sentence similarity
- **Custom rules** to ensure domain specificity
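
The semantic-expansion step could look like this sketch (the candidate list and the 0.3 threshold are illustrative, not values from `script.py`):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

task = "Plan a 4-day trip for a group of 10 college friends"
candidates = ["itinerary", "budget accommodation", "nightlife", "tax law"]

# Keep only candidates that stay semantically close to the task.
sims = util.cos_sim(model.encode(task), model.encode(candidates))[0]
expanded = [kw for kw, s in zip(candidates, sims) if float(s) > 0.3]
```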

#### Phase 3: Hybrid Scoring System

Calculates final scores with:

- **TF-IDF similarity** (30%)
- **Semantic embedding similarity** (40%)
- **Individual keyword matching** (30%)
- **Document length normalization** to ensure fairness
- **Perfect-match bonus** and **title relevance scoring**
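
Put together, the blend might look like this sketch (the weights come from the list above; the normalization and bonus formulas are assumptions, with the exact versions defined in `script.py`):

```python
import math

# Weights stated above: TF-IDF 30%, semantic 40%, keyword matching 30%.
W_TFIDF, W_SEMANTIC, W_KEYWORD = 0.30, 0.40, 0.30

def hybrid_score(tfidf_sim: float, semantic_sim: float, keyword_match: float,
                 n_tokens: int, title_relevance: float = 0.0,
                 perfect_match: bool = False) -> float:
    base = (W_TFIDF * tfidf_sim
            + W_SEMANTIC * semantic_sim
            + W_KEYWORD * keyword_match)
    # Length normalization (assumed form): damp very long sections.
    base /= 1.0 + math.log1p(n_tokens) / 10.0
    # Bonus terms named above; the magnitudes are placeholders.
    if perfect_match:
        base += 0.10
    return base + 0.05 * title_relevance
```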

## 🧪 Key Innovations

- Fairness through length normalization
- Multimodal heading extraction (visual + positional + semantic)
- Persona-driven keyword expansion
- Hybrid scoring system (semantic + statistical + keyword)

## 🏗️ Architecture Diagram

```mermaid
graph TD
    A[PDF Input] --> B[Heading Extraction]
    B --> C[Font Analysis]
    B --> D[Position Analysis]
    B --> E[Content Pattern Analysis]

    F[Persona + Task] --> G[Keyword Generation]
    G --> H[KeyBERT]
    G --> I[spaCy NLP]
    G --> J[Semantic Expansion]

    E --> K[Section Filtering]
    J --> K
    K --> L[TF-IDF Similarity]
    K --> M[Semantic Similarity]
    K --> N[Keyword Matching]

    L --> O[Hybrid Score Calculation]
    M --> O
    N --> O
    O --> P[Normalization]
    P --> Q[Final Ranked Output]
```

## 📤 Output Format

Output is saved in two places:

- The global `output/` directory, as `Collection X_output.json`
- Inside each input collection folder (optional)

Each output includes:

- Extracted sections
- Relevance scores
- Matching metadata
- Document statistics
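
An illustrative shape for one record, written as a Python dict so the fields can be annotated (all field names and values here are placeholders, not the exact `challenge1b_output.json` schema):

```python
example_output = {
    "metadata": {                       # matching metadata
        "persona": "Travel Planner",
        "job_to_be_done": "Plan a 4-day trip for 10 college friends",
        "input_documents": ["guide.pdf"],
    },
    "extracted_sections": [             # ranked by the hybrid score
        {
            "document": "guide.pdf",
            "section_title": "Packing Checklist",
            "relevance_score": 0.87,
            "page_number": 3,
        },
    ],
    "document_statistics": {"pdfs_processed": 7, "sections_scored": 120},
}
```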

## 🐳 Docker Usage

### 🔧 Build the Image

```bash
docker build --platform linux/amd64 -t <image-name> .
```

Replace `<image-name>` with your tag, e.g., `persona-doc-intelligence:v1`.


### ▶️ Run the Container with Default Input

```bash
docker run --rm \
  -v $(pwd)/input:/app/input:ro \
  -v $(pwd)/output:/app/output:rw \
  --network none \
  <image-name>
```

### ▶️ Run with Sample Dataset

```bash
docker run --rm \
  -v $(pwd)/sample_dataset:/app/input:ro \
  -v $(pwd)/output:/app/output:rw \
  --network none \
  <image-name>
```

## ⏱️ Performance

- Efficient processing: 5–10 PDFs in under 60 seconds
- Optimized for batch inference and multi-collection analysis

## 🧾 License

This project was developed for the Adobe Hackathon and is provided for research and demonstration purposes.