PDF Data Extractor

This project extracts specific fields from academic papers in PDF form. It uses the PyMuPDF library to read PDF files, OpenAI's GPT-4 to extract information from the text, Python's re module for pattern matching, and scholarly for academic metadata retrieval.

Features

  • Extracts metadata such as authors, title, source, document type, keywords, abstract, affiliations, corresponding author, publication year, volume, issue, DOI, and unique article identifier from the first page of a PDF.
  • Extracts references from the entire PDF text using regular expressions (re) or GPT.
  • Saves extracted data to a JSON file.
  • Saves full text of the PDF to a text file.
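The regex-based reference extraction mentioned above could look roughly like the following. This is a minimal sketch, not the repository's code: the function name split_references, the heading pattern, and the assumption that entries are numbered as [1], [2], ... are all illustrative.

```python
import re

# Hypothetical patterns; the expressions used in extract_pdf_data.py may differ.
HEADING = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE)
MARKER = re.compile(r"^\s*\[(\d+)\]\s*", re.MULTILINE)


def split_references(full_text):
    """Return the reference entries that follow a 'References' heading."""
    headings = list(HEADING.finditer(full_text))
    if not headings:
        return []
    # Use the last heading so an earlier stand-alone line does not win.
    tail = full_text[headings[-1].end():]
    # MARKER has one capture group, so split() yields
    # [preamble, number, entry, number, entry, ...].
    parts = MARKER.split(tail)
    entries = parts[2::2]
    # Collapse line breaks and repeated spaces inside each entry.
    return [" ".join(entry.split()) for entry in entries if entry.strip()]
```

Reference lists that use other numbering styles (for example "1." or author-year entries) would need a different marker pattern, which is one reason a GPT-based fallback can be useful.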

Requirements

  • Python 3.7 or higher
  • PyMuPDF
  • OpenAI API key
  • re (part of the Python standard library; no installation needed)
  • scholarly
  • python-dotenv

Installation

  1. Clone the repository:

    git clone https://github.com/gtraskas/pdf-data-extractor.git
    cd pdf-data-extractor
  2. Install the required packages:

    pip install -r requirements.txt
  3. Set up your OpenAI API key:

    • Create a .env file in the root directory.

    • Add your OpenAI API key to the .env file:

      OPENAI_API_KEY=your_openai_api_key
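To confirm the key is visible to Python before running the extractor, a small check along these lines may help. It is a sketch under assumptions: require_api_key is a hypothetical helper, not part of this repository, and it only reads the environment. With python-dotenv installed, calling load_dotenv() first copies the values from .env into the environment.

```python
import os

# With python-dotenv installed, load the .env file first:
#     from dotenv import load_dotenv
#     load_dotenv()


def require_api_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # Treat the placeholder from the instructions as "not set".
    if not key or key == "your_openai_api_key":
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to the .env file in the root directory."
        )
    return key
```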
      

Usage

  1. Place the PDF files you want to process in the data/input directory.

  2. Run the script:

    python extract_pdf_data.py
  3. The extracted data will be saved to data/output/extracted_data.json.
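Once the script has run, the output file can be inspected with the standard json module. The sketch below assumes only that the file is valid JSON. Its top-level layout (a list of records, or an object keyed by file name) is not documented here, so the helper names load_extracted and describe are illustrative and the code just reports what it finds.

```python
import json
from pathlib import Path


def load_extracted(path="data/output/extracted_data.json"):
    """Parse the extractor's JSON output and return it unchanged."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(data):
    """Return the top-level type name and, for containers, the entry count."""
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    kind, size = describe(load_extracted())
    print(f"Top-level {kind} with {size} entries")
```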

Customization

  • The extract_fields function in extract_pdf_data.py can be customized to extract additional fields or modify the extraction logic.
  • You can choose to extract references using either regular expressions (re) or GPT by setting the use_gpt_for_references flag in the extract_fields function.

Troubleshooting

  • Ensure that your OpenAI API key is correctly set in the .env file.
  • If you encounter issues with PDF text extraction, verify that the PDFs are not scanned images, as this tool does not perform OCR.
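One way to spot scanned PDFs is to check how much text each page yields. The helper below is a hypothetical sketch, not part of this tool: it takes the per-page text (for example from PyMuPDF's page.get_text()) and flags pages with almost no characters. The 20-character threshold is an arbitrary assumption.

```python
# The per-page text could be gathered with PyMuPDF, for example:
#     import fitz
#     page_texts = [page.get_text() for page in fitz.open("paper.pdf")]


def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages with fewer than min_chars visible characters."""
    suspect = []
    for number, text in enumerate(page_texts, start=1):
        # Ignore whitespace so blank lines do not count as content.
        visible = len("".join(text.split()))
        if visible < min_chars:
            suspect.append(number)
    return suspect
```

If every page is flagged, the PDF is probably a scanned image and would need OCR in a separate step before this tool can process it.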
