A platform for extracting structured product data from PDF documents using NLP and storing it in a PostgreSQL database.
This system extracts product specifications, SKUs, dimensions, and certifications from product documentation PDFs. It uses spaCy for Named Entity Recognition (NER) and provides a RESTful API for managing the extraction process.
- PDF text extraction with OCR support
- Named Entity Recognition for product data
- PostgreSQL database for structured storage
- REST API for document upload and querying
- Batch processing for multiple documents
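To give a feel for the kind of output the entity extraction step produces, here is a minimal sketch. It deliberately swaps the spaCy NER model for plain regular expressions so that it runs with the standard library alone. The patterns, the labels `SKU` and `DIMENSIONS`, and the function name `extract_entities` are illustrative assumptions, not the project's actual pipeline.

```python
import re

# Hypothetical patterns: the project's real spaCy model and entity labels
# are not described in this README, so these are stand-ins for illustration.
SKU_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")
DIMENSION_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s*[x×]\s*\d+(?:\.\d+)?"
    r"(?:\s*[x×]\s*\d+(?:\.\d+)?)?\s*(?:mm|cm|m|in)\b"
)


def extract_entities(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in (("SKU", SKU_PATTERN), ("DIMENSIONS", DIMENSION_PATTERN)):
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order by position in the text
    return [(label, value) for _, label, value in found]
```

A trained NER model generalizes to formats that fixed patterns miss, which is presumably why the project uses spaCy; the sketch only shows the shape of the extracted data.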
- Python 3.8+
- PostgreSQL
- Tesseract OCR (optional, for image-based PDFs)
- Clone the repository:
git clone https://github.com/muhtalhakhan/Product-Data-Extraction.git
cd Product-Data-Extraction
- Run the initialization script:
chmod +x init_db.sh
./init_db.sh
This script will:
- Verify PostgreSQL is running
- Create the database
- Initialize the schema
- Install required packages
- Download spaCy models
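If the script fails at its first step, it can help to check by hand whether PostgreSQL is accepting connections. The sketch below is one way to do that from Python. It assumes the server listens on the default port 5432 on localhost, and it is not taken from init_db.sh, whose contents are not shown here. The function name `postgres_is_reachable` is hypothetical.

```python
import socket


def postgres_is_reachable(host="localhost", port=5432, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only confirms that something is listening on the port; it does not
    authenticate or verify that the listener is actually PostgreSQL.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result of False usually means the service is stopped or listening on a different host or port.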
Place PDF files in the data/raw directory, then run:
python run_pipeline.py --input data/raw --output-dir data/processed
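When the pipeline is launched from another script, a small pre-flight check can catch an empty input directory before any processing starts. The following is a sketch under that assumption. It uses only the `--input` and `--output-dir` flags shown above, and the helper name `build_pipeline_command` is hypothetical.

```python
import sys
from pathlib import Path


def build_pipeline_command(input_dir, output_dir):
    """Return (command, pdf_paths) for a batch run over the PDFs in input_dir."""
    pdfs = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {input_dir}")
    command = [
        sys.executable, "run_pipeline.py",
        "--input", str(input_dir),
        "--output-dir", str(output_dir),
    ]
    return command, pdfs
```

The returned command can be passed to `subprocess.run(command, check=True)` from the repository root.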
Start the API server:
python -m src.api.main
Access the API at http://localhost:8000:
- POST /api/documents: Upload PDFs
- GET /api/products: Query extracted products
- GET /api/statistics: System stats
The Swagger UI is available at http://localhost:8000/docs.
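The endpoints can also be called from a script. Below is a minimal client sketch using only the Python standard library. Several details are assumptions rather than documented behavior: the multipart form field name `file`, the example query filter, and JSON response bodies. The Swagger UI is the authoritative source for the real request and response schemas.

```python
import json
import uuid
from urllib import parse, request

BASE_URL = "http://localhost:8000"


def build_upload_request(filename, pdf_bytes, field_name="file"):
    """Build a multipart/form-data POST for /api/documents.

    The form field name "file" is an assumption; confirm it in the Swagger UI.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return request.Request(
        f"{BASE_URL}/api/documents", data=body, headers=headers, method="POST"
    )


def build_products_request(**filters):
    """Build a GET for /api/products; any filter names are hypothetical."""
    query = parse.urlencode(filters)
    url = f"{BASE_URL}/api/products" + (f"?{query}" if query else "")
    return request.Request(url, method="GET")


def send(req):
    """Send a prepared request and decode the body, assuming it is JSON."""
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_products_request())` would fetch the product list, and `send(build_upload_request("spec.pdf", data))` would upload a document whose bytes are in `data`.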
.
├── data/
│ ├── raw/ # Raw PDF files
│ └── processed/ # Processing results
├── src/
│ ├── api/ # REST API
│ ├── database/ # Database operations
│ ├── pdf_processing/ # PDF extraction
│ ├── nlp/ # Entity extraction
│ └── utils/ # Utilities
├── tests/ # Test suite
├── init_db.sh # Database setup
├── requirements.txt # Python dependencies
└── README.md