Complete source code for PDF Extracter, a project to extract and normalize data from PDFs.
The frontend is built with Next.js, and the backend is powered by FastAPI.
- Project Overview
- Features
- Architecture
- Models
- Setup Instructions
- Running Locally
- Deployment
- Environment Variables
- License
PDF Extracter allows users to upload PDF documents and extract structured data using multiple extraction models. The system normalizes the extracted data into a consistent format for downstream usage.
- Upload PDF files from frontend
- Choose extraction model (OmniDocs or Docling)
- Backend processing with FastAPI
- Normalized output for easy consumption
- OAuth login support via Google (NextAuth.js)
- Supports multiple file uploads
- Handles complex layouts and tables
PDF Extracter
│
├── backend (FastAPI)
│ ├── app
│ │ ├── main.py # FastAPI entrypoint
│ │ ├── extractor_factory.py # Factory for model selection
│ │ ├── facade.py # Facade for extraction pipeline
│ │ ├── utils # Helper functions, normalization
│ │ └── models # Model handlers (OmniDocs, Docling)
│ ├── requirements.txt
│ └── deploy.py # Modal deployment script
│
├── frontend (Next.js)
│ ├── app # Pages and API routes
│ ├── components # UI components
│ ├── lib # API functions, utilities
│ ├── public
│ ├── .env.local # Local environment variables
│ ├── package.json
│ └── next.config.mjs
│
└── README.md
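The roles of extractor_factory.py and facade.py in the tree above can be illustrated with a minimal sketch. This is not the project's actual code: the class names, the get_extractor and extract_pdf functions, and the returned dictionaries are all hypothetical placeholders, and the real handlers would call the OmniDocs and Docling models instead of returning stub data.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Common interface each model handler is assumed to implement."""

    @abstractmethod
    def extract(self, pdf_bytes: bytes) -> dict:
        ...


class OmniDocsExtractor(Extractor):
    def extract(self, pdf_bytes: bytes) -> dict:
        # Placeholder: a real handler would run the OmniDocs model here.
        return {"model": "omnidocs", "size": len(pdf_bytes)}


class DoclingExtractor(Extractor):
    def extract(self, pdf_bytes: bytes) -> dict:
        # Placeholder: a real handler would run Docling here.
        return {"model": "docling", "size": len(pdf_bytes)}


# Hypothetical registry mapping the model name sent by the frontend
# to the handler class that serves it.
_REGISTRY = {"omnidocs": OmniDocsExtractor, "docling": DoclingExtractor}


def get_extractor(name: str) -> Extractor:
    """Factory: turn a model name into a handler instance."""
    try:
        return _REGISTRY[name.strip().lower()]()
    except KeyError:
        raise ValueError(f"Unknown extraction model: {name!r}") from None


def extract_pdf(pdf_bytes: bytes, model: str) -> dict:
    """Facade: a single call that hides model selection from the caller."""
    return get_extractor(model).extract(pdf_bytes)
```

With this split, an API route only needs to call extract_pdf(data, model), and adding a third model means registering one more class rather than touching the route.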
OmniDocs
- Proprietary PDF extraction model
- Handles structured and semi-structured documents
- Returns normalized JSON output
Docling
- Open-source document extraction model
- Good for text-heavy PDFs
- Returns normalized JSON output
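To make "normalized JSON output" concrete, here is a small sketch of what a normalization step could look like. The schema (model, page_count, text, tables) and the shape of the raw result are assumptions made for illustration; the project's actual output format is not documented here and may differ.

```python
import json


def normalize(raw: dict, model: str) -> dict:
    """Map a raw model result onto one shared shape (hypothetical schema)."""
    pages = raw.get("pages", [])
    return {
        "model": model,
        "page_count": len(pages),
        # Join per-page text, trimming stray whitespace from each page.
        "text": "\n".join(p.get("text", "").strip() for p in pages),
        # Flatten the tables found on every page into a single list.
        "tables": [t for p in pages for t in p.get("tables", [])],
    }


# Hypothetical raw result, standing in for whatever a model handler returns.
raw_result = {
    "pages": [
        {"text": " Invoice 001 ", "tables": [[["Item", "Qty"], ["Pen", "2"]]]},
        {"text": "Total: 4.00"},
    ]
}

normalized = normalize(raw_result, "docling")
print(json.dumps(normalized, indent=2))
```

Because both models would pass through the same function, downstream consumers can rely on the same keys regardless of which model produced the result.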
cd backend
python -m venv venv
# Activate the virtual environment
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
pip install -r requirements.txt
cd frontend
npm install
Create a .env.local file with the following variables:
NEXTAUTH_URL=https://<your-vercel-subdomain>.vercel.app
NEXTAUTH_SECRET=<random-string>
GOOGLE_CLIENT_ID=<google-client-id>
GOOGLE_CLIENT_SECRET=<google-client-secret>
uvicorn app.main:app --reload
Access at: http://localhost:8000
npm run dev
Access at: http://localhost:3000
- Go to Vercel dashboard → Import Project from GitHub
- Set Root Directory: frontend
- Add environment variables (see above)
- Deploy the project
Access live app at:
https://<your-vercel-subdomain>.vercel.app
- Deploy with Modal (deploy.py) or any cloud provider
- Ensure frontend points to the backend API URL
NEXTAUTH_URL=https://<your-vercel-subdomain>.vercel.app
NEXTAUTH_SECRET=<random-string>
GOOGLE_CLIENT_ID=<google-client-id>
GOOGLE_CLIENT_SECRET=<google-client-secret>
Add backend-specific credentials if needed.
Note: Ensure Google OAuth redirect URIs match your deployed domain:
https://<your-vercel-subdomain>.vercel.app/api/auth/callback/google
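A quick way to catch a missing variable or a mismatched redirect URI before deploying is a small check script. The sketch below is an optional helper, not part of the repository: the function names and the sample values are hypothetical, and the parser only handles simple KEY=VALUE lines.

```python
REQUIRED_KEYS = (
    "NEXTAUTH_URL",
    "NEXTAUTH_SECRET",
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def missing_keys(env: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]


def google_redirect_uri(env: dict) -> str:
    """Build the callback URL to register in the Google OAuth console."""
    return env["NEXTAUTH_URL"].rstrip("/") + "/api/auth/callback/google"


# Sample content standing in for a real .env.local file (values are made up).
sample = """
NEXTAUTH_URL=http://localhost:3000
NEXTAUTH_SECRET=change-me
GOOGLE_CLIENT_ID=example-id
"""

env = parse_env(sample)
print(missing_keys(env))         # ['GOOGLE_CLIENT_SECRET']
print(google_redirect_uri(env))  # http://localhost:3000/api/auth/callback/google
```

To check a real file, read it with open(".env.local").read() and pass the text to parse_env.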
Ensure .gitignore excludes:
node_modules
.next
venv
__pycache__
*.pyc
.env.local
.env
MIT License © Suhas-30
Need help? Open an issue or reach out to the maintainer.