This project is a PDF data extraction tool that pulls specific fields from academic papers. It uses the PyMuPDF library to read PDF files, OpenAI's GPT-4 to extract information from the text, regular expressions (`re`) for pattern matching, and scholarly for academic metadata retrieval.
- Extracts metadata such as authors, title, source, document type, keywords, abstract, affiliations, corresponding author, publication year, volume, issue, DOI, and unique article identifier from the first page of a PDF.
- Extracts references from the entire PDF text using regular expressions (`re`) or GPT.
- Saves extracted data to a JSON file.
- Saves full text of the PDF to a text file.
- Python 3.7 or higher
- PyMuPDF
- OpenAI API key
- re (Python standard library)
- scholarly
- python-dotenv
- Clone the repository:

      git clone https://github.com/gtraskas/pdf-data-extractor.git
      cd pdf-data-extractor

- Install the required packages:

      pip install -r requirements.txt
- Set up your OpenAI API key:
  - Create a `.env` file in the root directory.
  - Add your OpenAI API key to the `.env` file:

        OPENAI_API_KEY=your_openai_api_key
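Behind the scenes, python-dotenv's `load_dotenv()` reads the `.env` file into the process environment, and the script then looks the key up there. A minimal sketch of that lookup (the fallback value and error message are illustrative, not taken from the tool):

```python
import os

# In the real script, python-dotenv's load_dotenv() populates os.environ
# from the .env file. Here we simulate that step with a placeholder value
# so the sketch runs without a .env file present.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```

Keeping the key in `.env` (and out of version control) is the point of this setup step.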
- Place the PDF files you want to process in the `data/input` directory.
- Run the script:

      python extract_pdf_data.py

- The extracted data will be saved to `data/output/extracted_data.json`.
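The save step can be reproduced in miniature like this. The field names and values below are illustrative stand-ins (the real schema follows the field list above), and the sketch writes to the current directory rather than `data/output/` to stay self-contained:

```python
import json

# Illustrative record -- the actual tool fills these fields from the PDF.
extracted = {
    "title": "An Example Paper",
    "authors": ["A. Author", "B. Author"],
    "publication_year": 2023,
    "doi": "10.1000/example.doi",
}

# Save extracted data to a JSON file, as the tool does for its output.
with open("extracted_data.json", "w", encoding="utf-8") as f:
    json.dump(extracted, f, ensure_ascii=False, indent=2)
```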
- The `extract_fields` function in `extract_pdf_data.py` can be customized to extract additional fields or to modify the extraction logic.
- You can choose to extract references using either regular expressions (`re`) or GPT by setting the `use_gpt_for_references` flag in the `extract_fields` method.
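As an illustration of the regex path, here is a simplified sketch. The pattern below is an assumption for numbered bracket-style references, not the tool's actual expression, and real reference sections vary widely:

```python
import re

# Toy references section; real PDFs differ a lot in formatting.
text = """References
[1] A. Author, "A Paper Title," Journal of Examples, 2020.
[2] B. Author, "Another Title," Example Conf., 2021.
"""

# Simplified pattern: a bracketed number followed by everything up to
# the next bracketed number (at a line start) or the end of the text.
pattern = re.compile(r"\[\d+\].*?(?=\n\[\d+\]|\Z)", re.S)
references = [m.group(0).strip() for m in pattern.finditer(text)]

print(len(references))  # 2 references found in the toy text
```

The GPT path trades this brittleness for API cost: it handles unnumbered or inconsistently formatted references better, which is why the flag exists.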
- Ensure that your OpenAI API key is correctly set in the `.env` file.
- If you encounter issues with PDF text extraction, verify that the PDFs are not scanned images, as this tool does not perform OCR.
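A quick way to spot the scanned-image case is to check whether the extractor returns any text at all. A hedged sketch (the threshold is an arbitrary assumption, and `looks_like_scanned_pdf` is a hypothetical helper, not part of the tool):

```python
def looks_like_scanned_pdf(extracted_text: str, min_chars: int = 50) -> bool:
    """Heuristic: if a text extractor (e.g. PyMuPDF) returns almost no
    characters, the PDF is probably a scanned image and would need OCR."""
    return len(extracted_text.strip()) < min_chars

# A real paper yields plenty of text; a scanned PDF yields little or none.
print(looks_like_scanned_pdf(""))                         # likely scanned
print(looks_like_scanned_pdf("Abstract: " + "x" * 200))   # real text layer
```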