parsing / pdf parser

with GUI
Update 03/26: PDF_Parser-Sevenof9_v7i is another 25% faster thanks to page chunking for each CPU core
Check the PDF before converting it to text: open any page, ideally one near the beginning and one near the end, select the text with the mouse, and paste it into an editor. Can you read what you copied? If not, this parser will not work, and neither will any other program. In that case either the PDF is copy-protected and you must remove the protection first, or the page is just an image and you must run OCR first.
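
The manual copy-and-paste check above can also be approximated in a script. This is only a rough sketch, not part of the parser itself: the function names, the 20-character minimum, and the 60% readability ratio are assumptions chosen for illustration. It uses pdfplumber's `open`, `pages`, and `extract_text`.

```python
def text_is_usable(sample, min_chars=20, min_ratio=0.6):
    """Return True if `sample` looks like real, readable text.

    The thresholds are illustrative assumptions, not values from the parser.
    """
    compact = "".join(sample.split())  # drop all whitespace
    if len(compact) < min_chars:
        return False  # empty or almost empty: probably an image-only page
    readable = sum(ch.isalnum() for ch in compact)
    return readable / len(compact) >= min_ratio


def pdf_has_copyable_text(pdf_path):
    """Check the first and the last page of a PDF for extractable text."""
    import pdfplumber  # third-party; imported here so the helper above needs no install

    with pdfplumber.open(pdf_path) as pdf:
        if not pdf.pages:
            return False
        samples = [pdf.pages[0].extract_text(), pdf.pages[-1].extract_text()]
    return all(text_is_usable(s or "") for s in samples)
```

A blank cover or back page would fail this check even though the rest of the PDF is fine, so treat a negative result as a hint to look at the file by hand rather than as a final verdict.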

PDF to TXT converter ready to chunk for your RAG

EXE: Windows only
PY available (en), should run everywhere
EXE files are available on Hugging Face (or under Releases, on the right side):
https://huggingface.co/kalle07/pdf2txt_parser_converter

⇨ give me a ❤️, if you like ;)

newest: PDF Parser - Sevenof9_v7i.py

Most LLM applications simply convert your PDF to txt and nothing more, as if you had saved the PDF as a txt file. Text blocks often end up mixed and tables become unreadable, so it is better to convert the PDF with the help of a parser.
I work with "pdfplumber/pdfminer", without OCR (no images), so the PDF must contain copyable text.

Works with a single PDF, a list of multiple PDFs, or a folder
Intelligent multiprocessing, ~10-30 pages per second
Error tolerant: if a PDF cannot be converted, it is skipped without any special handling
Instant view of the result: click a PDF in the list at the top
Removes about 5% of the margins around the page
Converts some common tables to JSON inside the txt file
Adds the absolute PAGE number to each page
Adds the tag “chapter” or “important” to text in a large and/or bold font
All txt files are created in the PDF's original folder, with the same name and the extension .txt
Existing txt files are overwritten if you convert the same PDF again
If a page has many text blocks, blocks that you would read first may appear further down in the output. (This is a compromise between many layout options.)
Small blocks of text (such as units or individual numbers), usually near diagrams and sketches, appear at the end of each page
I advise against feeding a PDF file directly into RAG (embedding), because you never know how the extracted text will look, and bad input can lead to poor results
Tested on 300 PDF files, ~30000 pages
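
Several of the steps listed above (page chunking per core, trimming about 5% of the margins, tables as JSON, page numbers) can be sketched with pdfplumber as follows. This is a simplified illustration under assumptions, not the actual code of the parser: every function name, the `PAGE n` tag format, and the reading of "5%" as 5% per edge are made up for this example. The pdfplumber calls used are `open`, `pages`, `bbox`, `crop`, `extract_text`, and `extract_tables`.

```python
import json


def chunk_pages(n_pages, n_workers):
    """Split page indices 0..n_pages-1 into contiguous ranges, one per worker."""
    if n_pages <= 0:
        return []
    n_workers = max(1, min(n_workers, n_pages))
    base, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # first chunks absorb the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks


def crop_box(bbox, margin=0.05):
    """Shrink a (x0, top, x1, bottom) box by `margin` of its size on every edge."""
    x0, top, x1, bottom = bbox
    dx, dy = (x1 - x0) * margin, (bottom - top) * margin
    return (x0 + dx, top + dy, x1 - dx, bottom - dy)


def table_to_json(rows):
    """Serialize one table (list of rows, first row taken as header) to JSON."""
    header, *body = rows
    keys = [str(h or f"col{i}") for i, h in enumerate(header)]
    return json.dumps([dict(zip(keys, row)) for row in body], ensure_ascii=False)


def convert_chunk(pdf_path, pages):
    """Worker: open the PDF, convert only `pages`, return that chunk as text."""
    import pdfplumber  # third-party; the helpers above need only the standard library

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for index in pages:
            page = pdf.pages[index]
            body = page.crop(crop_box(page.bbox))
            parts.append(f"PAGE {index + 1}")  # hypothetical tag format
            parts.append(body.extract_text() or "")
            parts.extend(table_to_json(t) for t in body.extract_tables() if t)
    return "\n".join(parts)


def convert_parallel(pdf_path, n_pages, n_workers):
    """Run one convert_chunk per process and join the results in page order."""
    from concurrent.futures import ProcessPoolExecutor

    chunks = chunk_pages(n_pages, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        texts = pool.map(convert_chunk, [pdf_path] * len(chunks), chunks)
        return "\n".join(texts)
```

For example, `chunk_pages(10, 4)` gives the ranges 0-2, 3-5, 6-7, and 8-9. On Windows, `convert_parallel` has to be called from inside an `if __name__ == "__main__":` block. In this sketch the table cells also show up in the plain text of the page; a real converter would have to handle that overlap.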

I created this with my own brain and the help of AI. I am not a coder, sorry, so I will not fulfill feature requests unless there are real errors.
Handling the GUI, the functionality, and compiling it on top is really hard for me.
For the Python file you will of course need to install any missing libraries.

INSTALL:
python -m venv venv
venv\Scripts\activate # On Windows
source venv/bin/activate # On Linux/macOS (use this instead of the line above)
pip install -r requirements.txt
python version_xyz.py

now have fun and leave a comment if you like ;)
on discord "sevenof9"
my raw-txt-snippet extractor
https://github.com/kalle07/raw-txt-snippet-creator
my embedder collection:
https://huggingface.co/kalle07/embedder_collection

I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!
