Basic PDF Parser 📝

A Python-based PDF parser that extracts questions, options, tables, and images from structured text-based PDFs such as multiple-choice question papers.

Features ✨

Question & Option Extraction: Extracts numbered questions (e.g., Q1., 1.) and associated lettered multiple-choice options (e.g., a., b., c., d.).
Table Extraction: Extracts tables from the PDF pages using pdfplumber.
Image Extraction: Extracts images from the PDF pages.
Structured Output: Saves the final extracted data as a structured JSON file (result.json).
File Management: Stores all extracted images in a dedicated folder (Pics extraction).
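
The question and option formats listed above can be matched with Python's `re` module. The following is a minimal sketch, not the code in `parsing.py`: the function name `parse_questions`, both regular expressions, and the output keys (`number`, `question`, `options`) are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: "Q1." or "1." starts a question, "a." to "d." starts an option.
QUESTION_RE = re.compile(r"^(?:Q\s*)?(\d+)\.\s+(.+)$", re.IGNORECASE)
OPTION_RE = re.compile(r"^([a-d])\.\s+(.+)$", re.IGNORECASE)


def parse_questions(text):
    """Group lines of extracted page text into questions with lettered options."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        question_match = QUESTION_RE.match(line)
        option_match = OPTION_RE.match(line)
        if question_match:
            # A new numbered line starts a new question record.
            current = {
                "number": int(question_match.group(1)),
                "question": question_match.group(2),
                "options": {},
            }
            questions.append(current)
        elif option_match and current is not None:
            # Lettered lines attach to the most recent question.
            current["options"][option_match.group(1).lower()] = option_match.group(2)
        elif current is not None and not current["options"]:
            # Unmatched text before any option is treated as a wrapped question line.
            current["question"] += " " + line
    return questions
```

A sketch like this shares the limitation noted later in this README: text without clear numbering yields no question records, and a wrapped line that happens to start with a number followed by a period would be mistaken for a new question.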

Installation ⚙️

Clone the repository:

git clone https://github.com/lck6055/Basic-pdf-parser.git
cd Basic-pdf-parser

Install the required Python libraries:
```
pip install -r requirement.txt
```

Usage 🖥️

Prepare the PDF: Place your target PDF file in the project directory and, if necessary, update the PDF_PATH variable inside the parsing.py script.
Run the parser:
```
python parsing.py
```

Output:

JSON Data: A file named result.json containing the extracted questions and options.
Images: Extracted images saved within the folder Pics extraction.
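
To illustrate the output layout, the sketch below writes a list of parsed records to `result.json` and creates the image folder. The helper name `save_results` and the record keys used in the usage example are hypothetical; the actual schema produced by `parsing.py` may differ.

```python
import json
from pathlib import Path


def save_results(questions, out_dir=".", json_name="result.json",
                 image_dir_name="Pics extraction"):
    """Write parsed records to a JSON file and make sure the image folder exists."""
    out_path = Path(out_dir)
    image_dir = out_path / image_dir_name
    image_dir.mkdir(parents=True, exist_ok=True)

    json_path = out_path / json_name
    with json_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII question text readable in the file.
        json.dump(questions, fh, indent=2, ensure_ascii=False)
    return json_path, image_dir


if __name__ == "__main__":
    # Hypothetical record shape, for illustration only.
    example = [{"number": 1, "question": "What is 2 + 2?",
                "options": {"a": "3", "b": "4"}, "images": []}]
    save_results(example)
```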

Running Mode & Limitations ⚠️

❗ IMPORTANT: This is a prototype and still under development. The known limitations are listed below.

| Aspect | Details |
| --- | --- |
| Best Used With | Text-based PDFs. It will not work on scanned PDFs without Optical Character Recognition (OCR). |
| Question Format | Assumes clearly numbered questions (e.g., `Q1.`, `1.`). |
| Option Format | Assumes lettered options (e.g., `a.`, `b.`, `c.`, `d.`). |
| Image Caveat | Image extraction is page-based and may include all images on the page, not strictly tied to the exact question block. |
| Table Caveat | Tables are extracted using `pdfplumber` and may not be perfect for highly complex or irregular table layouts. |
| Failure Condition | PDFs without clear numbering or those consisting solely of scanned images may result in incomplete or empty JSON data. |
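
Because the parser depends on a text layer, it can help to check for one before parsing. A minimal sketch follows, assuming `pdfplumber` is installed. The helper names `has_text_layer` and `read_page_texts` and the `min_chars` threshold are illustrative assumptions and are not part of `parsing.py`.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least min_chars of non-whitespace text."""
    for text in page_texts:
        # Image-only pages may yield None or an empty string.
        if text and len("".join(text.split())) >= min_chars:
            return True
    return False


def read_page_texts(pdf_path):
    """Collect the extracted text of each page using pdfplumber."""
    import pdfplumber  # imported here so has_text_layer can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

A call such as `has_text_layer(read_page_texts(PDF_PATH))` returning `False` suggests the file is scanned and would need OCR before this parser can produce useful output.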

References 📚

This project utilizes the following key libraries and documentation:

PyMuPDF Documentation
pdfplumber Documentation
Python official docs for re and json modules
Documentation Assistance: An AI language model was used to help structure and write the content of this README file.
