Skip to content

kdeepak2001/jobyaari-pdf-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

3 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿค– AI-Powered Job Data Extractor

Intelligent automation tool to extract structured information from ANY job notification PDF using 100% FREE Google Gemini API.

Perfect for government jobs, private sector positions, internships, and international opportunities!

Status Python AI Cost


๐ŸŽฏ What It Does

Automatically extracts key information from ANY job notification PDF - whether it's government, private sector, startup, or international positions:

  • โœ… Organization Name - Company/Department/Commission name
  • โœ… Post Name - Job title/designation/role
  • โœ… Total Vacancies - Number of open positions
  • โœ… Salary Range - Compensation/pay scale
  • โœ… Location - Job location/remote options
  • โœ… Qualifications - Required education/certifications
  • โœ… Experience Required - Years/type of experience needed
  • โœ… Application Deadline - Last date to apply
  • โœ… Application Fee - Fee amount (if applicable)
  • โœ… Age Limit - Age criteria (if applicable) Returns clean JSON format ready for database/API integration, automation workflows, or chatbot training.

๐Ÿ’ผ Use Cases

For Job Portals (like JobYaari)

  • Automate government job notification processing
  • Extract data from UPSC, SSC, Railway notifications
  • Feed structured data to blog/content generators
  • Build searchable job databases

For Recruitment Platforms

  • Process company job postings
  • Standardize data from multiple sources
  • Auto-populate job listings
  • Track hiring trends

For Job Seekers

  • Quick summary of lengthy PDFs
  • Compare multiple opportunities
  • Track application deadlines
  • Build personal job database

For Developers

  • API integration ready
  • Automation workflow component
  • Training data for chatbots
  • Analytics and insights

๐Ÿš€ Technologies Used

Technology Purpose
Python 3.13 Core programming language
PyPDF2 PDF text extraction
Google Gemini 2.5 Flash API AI-powered data extraction (FREE!)
OpenAI SDK API client library

โœจ Key Features

๐Ÿ’ฐ Completely FREE

  • Uses Google Gemini API - no credit card required
  • 250 requests/day on free tier
  • Perfect for startup/MVP stage

โšก Fast Processing

  • Extracts data in 5-10 seconds
  • Handles multi-page PDFs
  • Real-time progress updates

๐ŸŽฏ High Accuracy

  • AI understands complex government documents
  • Handles various PDF formats
  • Smart "Not mentioned" handling for missing data

๐Ÿ“Š Clean Output

  • JSON format for easy integration
  • Structured, consistent data
  • Ready for automation workflows

๐Ÿ”’ Privacy First

  • All processing on-device
  • No data stored externally
  • API key stays local

๐Ÿ“ฆ Installation

Prerequisites

  • Python 3.10 or higher
  • FREE Google Gemini API key

Setup

  1. Clone this repository git clone https://github.com/kdeepak2001/jobyaari-pdf-extractor.git cd jobyaari-pdf-extractor
  2. Install required packages pip install PyPDF2 openai
  3. Get FREE Gemini API key
    • Go to Google AI Studio
    • Sign in with Google account
    • Click "Get API key" โ†’ "Create API key in new project"
    • Copy your key (starts with AIzaSy...)
  4. Add your API key to the code
    • Open pdf_extractor.py in any text editor
    • Find line 14: api_key="YOUR_GEMINI_API_KEY_HERE"
    • Replace YOUR_GEMINI_API_KEY_HERE with your actual key
    • Save the file

๐ŸŽฎ Usage

Basic Usage

  1. Place your PDF in the same folder and name it sample_job.pdf

  2. Run the extractor python pdf_extractor.py

  3. Check results in extracted_job_data.json

Example Output

{ organization_name": "UNION PUBLIC SERVICE COMMISSION", "post_name": "Combined Medical Services Examination 2025", "total_vacancies": "150", "salary": "Rs. 56,100 - 1,77,500", "location": "All India", "qualifications": "MBBS degree from recognized university", "experience_required": "Not mentioned", "application_deadline": "15th October 2025", "application_fee": "Rs. 100", "age_limit": "21-32 years }


๐Ÿ’ก Built For JobYaari Internship

This project demonstrates:

โœ… AI Automation Skills

  • Practical use of LLMs (Large Language Models)
  • Prompt engineering for structured extraction
  • API integration and error handling

โœ… Problem-Solving Ability

  • Addresses real JobYaari use case
  • Automates manual data entry
  • Scalable solution (250 PDFs/day free)

โœ… Cost-Effective Thinking

  • $0 monthly cost vs paid alternatives
  • Perfect for startup MVP stage
  • Sustainable for long-term use

โœ… Production-Ready Code

  • Clean, documented code
  • Error handling and user feedback
  • Modular, maintainable structure

๐ŸŽฏ Use Cases for JobYaari

Current Implementation

  • Manual PDF processing automation
  • Quick data extraction from notifications
  • JSON output for database storage

Future Integration Possibilities

  • Blog Generator: Feed extracted data to GPT for article generation
  • Social Media Automation: Auto-create posts from job data
  • Chatbot Training: Build FAQ database from job notifications
  • Analytics Dashboard: Track trends in government hiring
  • Multi-PDF Batch Processing: Process 100+ PDFs automatically

๐Ÿ“ˆ Scalability

Free Tier Capacity

  • 250 requests/day = 7,500 PDFs/month
  • 250K tokens/minute = Handles large multi-page PDFs
  • No time limit on free tier

Performance

  • Average processing: 8-12 seconds/PDF
  • Can process 30-40 PDFs/hour
  • Suitable for JobYaari's daily job notification volume

๐Ÿ› ๏ธ Technical Details

How It Works

  1. PDF Text Extraction (PyPDF2)

    • Reads all pages from PDF
    • Extracts raw text content
    • Handles multi-page documents
  2. AI Processing (Gemini API)

    • Sends text to Gemini 2.5 Flash model
    • Uses structured prompt for JSON extraction
    • Temperature set to 0.1 for consistency
  3. Data Structuring

    • Validates JSON output
    • Saves to file for persistence
    • Ready for database insertion

Code Structure

Component Purpose
extract_pdf_text() Reads PDF file and extracts raw text from all pages
extract_job_details() Sends text to Gemini AI and extracts structured JSON data
main execution Coordinates the workflow and handles file I/O

๐Ÿš€ Future Enhancements

  • OCR Support - Handle scanned/image-based PDFs
  • Batch Processing - Process multiple PDFs in one run
  • Web Interface - Streamlit UI for non-technical users
  • API Endpoint - REST API for integration
  • Auto-Categorization - Classify jobs by department/type
  • Email Notifications - Alert when new jobs extracted
  • Database Integration - Direct MongoDB/PostgreSQL save

๐Ÿ‘จโ€๐Ÿ’ป About

Built by: K Deepak
Purpose: AI Agent Development Internship Application at JobYaari
Built in: October 2025
Time to Build: 1 day (concept to deployment)

Why This Project?

I'm passionate about using AI to solve real-world problems and make information accessible. While researching JobYaari's AI Agent Development Internship, I identified their core challenge: automating government job notification processing. Instead of just applying with a resume, I built the exact solution they need. This demonstrates my approach: Don't just talk about skills - build working solutions.

JobYaari's mission to simplify job notification pdf for millions of aspirants aligns perfectly with my goal to build practical, scalable AI solutions for social impact. This tool processes 250 PDFs/day at zero cost - perfect for their growth stage.


๐Ÿค Contributing

This is a portfolio project, but suggestions are welcome! Found a bug? Have an idea? Open an issue!


๐Ÿ“ License

MIT License - Free to use and modify


๐Ÿ™ Acknowledgments

  • Google for FREE Gemini API access
  • JobYaari for the inspiration and mission
  • Open source community for amazing tools

๐Ÿ“ง Connect With Me

GitHub LinkedIn Email Open to: AI/ML Internships | Prompt Engineering Roles | Automation Projects


๐Ÿ’ก Interested in this project or want to collaborate? Feel free to reach out!


โญ Star This Repository

If you find this project useful, please give it a star! โญ It helps others discover this solution and supports my application to JobYaari!


Built with โค๏ธ for automating job information accessiblity

About

AI-powered PDF job data extractor using FREE Google Gemini API

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages