Product Data Extraction Platform

A platform for extracting structured product data from PDF documents using NLP and storing it in a PostgreSQL database.

Overview

This system extracts product specifications, SKUs, dimensions, and certifications from product documentation PDFs. It uses spaCy for Named Entity Recognition (NER) and exposes a REST API for managing the extraction process.
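To make the target output concrete, here is a minimal sketch of pulling two of the listed fields (SKUs and dimensions) out of already-extracted text. It deliberately uses plain regular expressions as a stand-in, not the NER pipeline this project uses. The patterns, the `ProductRecord` class, and the `extract_product_fields` function are all hypothetical names chosen for illustration and do not come from the repository.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only; real product sheets will
# need patterns (or a trained NER model) matched to their own layout.
SKU_PATTERN = re.compile(r"\bSKU[:\s]+([A-Z0-9-]+)")
DIMENSION_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*[xX]\s*(\d+(?:\.\d+)?)\s*[xX]\s*(\d+(?:\.\d+)?)\s*(mm|cm|in)\b"
)


@dataclass
class ProductRecord:
    """Structured fields found in one block of text (assumed shape)."""
    skus: list = field(default_factory=list)
    dimensions: list = field(default_factory=list)


def extract_product_fields(text: str) -> ProductRecord:
    """Collect SKU codes and L x W x H dimensions from plain text."""
    skus = SKU_PATTERN.findall(text)
    dimensions = [
        (float(a), float(b), float(c), unit)
        for a, b, c, unit in DIMENSION_PATTERN.findall(text)
    ]
    return ProductRecord(skus=skus, dimensions=dimensions)
```

Calling `extract_product_fields("SKU: AB-1234 measures 10 x 20.5 x 3 mm")` yields one SKU and one dimension tuple. A rule-based pass like this can be a useful baseline to compare against NER output, but it is not a replacement for it.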

Features

PDF text extraction with OCR support
Named Entity Recognition for product data
PostgreSQL database for structured storage
REST API for document upload and querying
Batch processing for multiple documents

Setup

Prerequisites

Python 3.8+
PostgreSQL
Tesseract OCR (optional, for image-based PDFs)

Installation

Clone the repository:

git clone https://github.com/muhtalhakhan/Product-Data-Extraction.git
cd Product-Data-Extraction

Run the initialization script:

chmod +x init_db.sh
./init_db.sh

This script will:

Verify PostgreSQL is running
Create the database
Initialize the schema
Install required packages
Download spaCy models

Usage

Processing PDFs

Place PDF files in the data/raw directory, then run:

python run_pipeline.py --input data/raw --output-dir data/processed
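Before running the pipeline on a large folder, it can help to check which files would be candidates. The sketch below is a standalone pre-flight helper, not part of `run_pipeline.py`. It assumes that only files directly inside the input directory with a `.pdf` extension (in any letter case) are of interest. How the pipeline itself selects files is not documented here, and `find_pdfs` is a hypothetical name.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return the PDF files directly inside input_dir, in sorted order.

    Assumption: the search is non-recursive and matches the extension
    case-insensitively. The real pipeline may behave differently.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `for pdf in find_pdfs("data/raw"): print(pdf.name)` lists the candidate files, and an empty result is a quick hint that the PDFs were placed in the wrong directory.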

API

Start the API server:

python -m src.api.main

The API is served at http://localhost:8000 and exposes the following endpoints:

POST /api/documents - Upload PDFs
GET /api/products - Query extracted products
GET /api/statistics - System stats

The Swagger UI is available at http://localhost:8000/docs.
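The read-only endpoints above can be queried from a script once the server is running. The following is a minimal client sketch using only the Python standard library. It assumes the endpoints return JSON, which this README does not state explicitly. The `build_url` and `get_json` helpers are hypothetical names, and any query parameters are placeholders, so check the Swagger UI for the parameters the API actually accepts.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address given in this README


def build_url(path, params=None, base_url=BASE_URL):
    """Join the base URL and an endpoint path, adding optional query params."""
    url = urljoin(base_url, path)
    if params:
        url = f"{url}?{urlencode(params)}"
    return url


def get_json(path, params=None, timeout=10.0):
    """Send a GET request and decode the body as JSON (assumed format)."""
    with urlopen(build_url(path, params), timeout=timeout) as response:
        return json.load(response)
```

With the server started, `get_json("/api/statistics")` or `get_json("/api/products")` would fetch the corresponding resource. Uploading through POST /api/documents is left out because the expected request format is not described here.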

Project Structure

.
├── data/
│   ├── raw/           # Raw PDF files
│   └── processed/     # Processing results
├── src/
│   ├── api/           # REST API
│   ├── database/      # Database operations
│   ├── pdf_processing/ # PDF extraction
│   ├── nlp/           # Entity extraction
│   └── utils/         # Utilities
├── tests/             # Test suite
├── init_db.sh         # Database setup
├── requirements.txt   # Python dependencies
└── README.md

License

MIT License
