This project is a PDF data extraction tool that pulls specific fields from academic papers. It uses the PyMuPDF library to read PDF files, OpenAI's GPT-4 to extract information from the text, regular expressions (`re`) for pattern matching, and scholarly for academic metadata retrieval.
- Extracts metadata such as authors, title, source, document type, keywords, abstract, affiliations, corresponding author, publication year, volume, issue, DOI, and unique article identifier from the first page of a PDF.
- Extracts references from the entire PDF text using regular expressions (`re`) or GPT.
- Saves extracted data to a JSON file.
- Saves full text of the PDF to a text file.
- Python 3.7 or higher
- PyMuPDF
- OpenAI API key
- re (Python standard library)
- scholarly
- python-dotenv
- Clone the repository:

      git clone https://github.com/gtraskas/pdf-data-extractor.git
      cd pdf-data-extractor

- Install the required packages:

      pip install -r requirements.txt
- Set up your OpenAI API key:
  - Create a `.env` file in the root directory.
  - Add your OpenAI API key to the `.env` file:

        OPENAI_API_KEY=your_openai_api_key
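Behind the scenes, python-dotenv's `load_dotenv()` reads the `.env` file into the process environment, and the script then looks the key up there. A minimal sketch of that lookup (the fallback value and error message are illustrative, not taken from the tool):

```python
import os

# In the real script, python-dotenv's load_dotenv() populates os.environ
# from the .env file. Here we simulate that step with a placeholder value
# so the sketch runs without a .env file present.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```

Keeping the key in `.env` (and out of version control) is the point of this setup step.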
- Place the PDF files you want to process in the `data/input` directory.
- Run the script:

      python extract_pdf_data.py

- The extracted data will be saved to `data/output/extracted_data.json`.
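The save step can be reproduced in miniature like this. The field names and values below are illustrative stand-ins (the real schema follows the field list above), and the sketch writes to the current directory rather than `data/output/` to stay self-contained:

```python
import json

# Illustrative record -- the actual tool fills these fields from the PDF.
extracted = {
    "title": "An Example Paper",
    "authors": ["A. Author", "B. Author"],
    "publication_year": 2023,
    "doi": "10.1000/example.doi",
}

# Save extracted data to a JSON file, as the tool does for its output.
with open("extracted_data.json", "w", encoding="utf-8") as f:
    json.dump(extracted, f, ensure_ascii=False, indent=2)
```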
- The `extract_fields` function in `extract_pdf_data.py` can be customized to extract additional fields or to modify the extraction logic.
- You can choose to extract references using either regular expressions (`re`) or GPT by setting the `use_gpt_for_references` flag in the `extract_fields` method.
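As an illustration of the regex path, here is a simplified sketch. The pattern below is an assumption for numbered bracket-style references, not the tool's actual expression, and real reference sections vary widely:

```python
import re

# Toy references section; real PDFs differ a lot in formatting.
text = """References
[1] A. Author, "A Paper Title," Journal of Examples, 2020.
[2] B. Author, "Another Title," Example Conf., 2021.
"""

# Simplified pattern: a bracketed number followed by everything up to
# the next bracketed number (at a line start) or the end of the text.
pattern = re.compile(r"\[\d+\].*?(?=\n\[\d+\]|\Z)", re.S)
references = [m.group(0).strip() for m in pattern.finditer(text)]

print(len(references))  # 2 references found in the toy text
```

The GPT path trades this brittleness for API cost: it handles unnumbered or inconsistently formatted references better, which is why the flag exists.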
- Ensure that your OpenAI API key is correctly set in the `.env` file.
- If you encounter issues with PDF text extraction, verify that the PDFs are not scanned images, as this tool does not perform OCR.
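A quick way to spot the scanned-image case is to check whether the extractor returns any text at all. A hedged sketch (the threshold is an arbitrary assumption, and `looks_like_scanned_pdf` is a hypothetical helper, not part of the tool):

```python
def looks_like_scanned_pdf(extracted_text: str, min_chars: int = 50) -> bool:
    """Heuristic: if a text extractor (e.g. PyMuPDF) returns almost no
    characters, the PDF is probably a scanned image and would need OCR."""
    return len(extracted_text.strip()) < min_chars

# A real paper yields plenty of text; a scanned PDF yields little or none.
print(looks_like_scanned_pdf(""))                         # likely scanned
print(looks_like_scanned_pdf("Abstract: " + "x" * 200))   # real text layer
```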