A Streamlit app to upload a scanned PDF of UPSC answer sheets, run OCR (Tesseract or Google Vision), parse question-answer pairs, and evaluate answers using a rubric. It works without an LLM (heuristic baseline) or with OpenAI for improved parsing and scoring.
- PDF to image rendering (PyMuPDF)
- OCR via:
  - Tesseract (local, free)
  - Google Cloud Vision (service account JSON)
- Q/A parsing:
  - Heuristic baseline (no LLM)
  - OpenAI model (e.g., gpt-4o-mini) for structured JSON extraction
- Evaluation:
  - Heuristic baseline scoring
  - OpenAI-based rubric scoring (Relevance, Accuracy, Depth, Structure, Language)
- JSON report download
- Toggle to skip the first two pages during OCR (useful to ignore cover/index pages)
- Optional factual grounding via web search to produce specific, cited comments
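This README does not spell out how the heuristic (no-LLM) Q/A parsing works. One plausible baseline is to split the OCR text on question markers such as "Q1." or "Question 2:". The sketch below assumes that marker format and assumes the first line after a marker is the question and the rest is the answer. The name `split_qa` and the regex are hypothetical and are not taken from core/evaluator.py.

```python
import re

# Assumed marker format: "Q1.", "Q 2)", "Question 3:" at the start of a line.
QUESTION_MARKER = re.compile(
    r"^\s*(?:Q|Question)\s*(\d+)\s*[.:)]\s*",
    re.IGNORECASE | re.MULTILINE,
)


def split_qa(ocr_text: str) -> list[dict]:
    """Split OCR text into question/answer pairs using question markers."""
    matches = list(QUESTION_MARKER.finditer(ocr_text))
    pairs = []
    for i, match in enumerate(matches):
        # A block runs from the end of this marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        block = ocr_text[match.end():end].strip()
        # Assumption: the first line is the question, the remainder is the answer.
        question, _, answer = block.partition("\n")
        pairs.append({
            "number": int(match.group(1)),
            "question": question.strip(),
            "answer": " ".join(answer.split()),  # collapse OCR line breaks
        })
    return pairs
```

Text before the first marker (for example, a cover page) is ignored, which is one reason a marker-based split is a reasonable starting point for scanned answer sheets. Handwritten markers that OCR misreads will still be missed, which is where the OpenAI parsing option is likely to help.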
- Install Python 3.10+ and pip.
- Create a virtual environment (optional but recommended):

      py -m venv .venv
      .\.venv\Scripts\Activate.ps1

- Install Python dependencies:

      pip install -r requirements.txt

- Install Tesseract (for local OCR):
  - Download the Windows installer from: https://github.com/UB-Mannheim/tesseract/wiki
  - After install, note the path to tesseract.exe (commonly C:\Program Files\Tesseract-OCR\tesseract.exe).
  - In the app sidebar, provide this path if Tesseract is not on PATH.
- Optional: Configure Google Vision OCR
  - Create a Google Cloud project and enable the Vision API.
  - Create a service account and download the JSON key (a .json file).
  - In the app sidebar, choose "Google Vision" as the OCR provider and upload the JSON key file when prompted. The app stores it in a temporary file for this session and uses it for OCR.
- Optional: Use OpenAI for parsing/scoring
  - Set your OpenAI key in the sidebar, or export it before running:

        $env:OPENAI_API_KEY = "sk-..."

- Run the app:

      streamlit run app.py

  Then open the local URL shown (typically http://localhost:8501).
- Start with Tesseract OCR for a quick baseline. If handwriting is poor, try Google Vision.
- If you don’t have an OpenAI key, the app still works with heuristic parsing and scoring.
- Adjust rubric weights in the sidebar based on the marking scheme.
- If your PDF's first two pages are covers or instructions, enable "Skip first 2 pages during OCR" in the sidebar. Rendering/preview still shows all pages, but OCR starts from page 3.
- Enable "Factual grounding via web search" when using OpenAI to make comments more specific and include citations. This uses DuckDuckGo search; no API key required.
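Two of the tips above, skipping the first two pages and adjusting rubric weights, can be illustrated in a few lines. This is a sketch under stated assumptions rather than the app's actual code: per-criterion scores are assumed to lie between 0 and 1, the total is assumed to be out of 10 marks, and the names `pages_to_ocr` and `weighted_score` are hypothetical.

```python
RUBRIC = ("Relevance", "Accuracy", "Depth", "Structure", "Language")


def pages_to_ocr(page_count: int, skip_first_two: bool) -> list[int]:
    """Return 0-based page indices to OCR; with skipping on, OCR starts at page 3."""
    start = 2 if skip_first_two else 0
    return list(range(start, page_count))


def weighted_score(
    scores: dict[str, float],
    weights: dict[str, float],
    max_marks: float = 10.0,  # assumed scale, not stated in this README
) -> float:
    """Combine per-criterion scores (each 0..1) into marks out of max_marks."""
    total_weight = sum(weights[name] for name in RUBRIC)
    if total_weight <= 0:
        raise ValueError("at least one rubric weight must be positive")
    # Normalise so the weights do not have to sum to 1.
    blended = sum(scores[name] * weights[name] for name in RUBRIC) / total_weight
    return round(blended * max_marks, 2)
```

Normalising by the total weight means only the relative sizes of the weights matter, so doubling the weight on Relevance shifts the emphasis without changing the maximum mark.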
- app.py – Streamlit UI
- core/ – Core logic package
  - evaluator.py – Evaluation and parsing logic
  - vlm.py – Vision Language Model integration
  - ocr.py – OCR handling (Tesseract/Google Vision)
  - pdf.py – PDF processing
  - common.py – Shared utilities
  - models.py – Data models
- requirements.txt – Python deps
- EvaluationPrototype.ipynb – Original prototype notebook
- PyMuPDF or Tesseract import errors: ensure pip install -r requirements.txt was successful.
- Tesseract not found: provide the full path to tesseract.exe in the sidebar.
- Google Vision errors: verify the service account has Vision API access and that you uploaded a valid JSON key in the sidebar.
- OpenAI errors: check the model name (e.g., gpt-4o-mini) and key validity.
- Error "cannot import name 'Sentinel' from 'typing_extensions'": your environment has an outdated typing_extensions.
  - Fix:

        pip install -U typing_extensions
        # or ensure prod env uses the pinned version from requirements.txt
        pip install -r requirements.txt
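To narrow down which of the import problems above applies, a small diagnostic script can report what the active environment can actually see. This is a sketch, not part of the repository: `fitz` is PyMuPDF's import name, while `pytesseract` is only assumed to be the Tesseract wrapper listed in requirements.txt.

```python
import importlib
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def has_sentinel() -> bool:
    """True if the installed typing_extensions exposes Sentinel."""
    if not module_available("typing_extensions"):
        return False
    module = importlib.import_module("typing_extensions")
    return hasattr(module, "Sentinel")


if __name__ == "__main__":
    # "pytesseract" is an assumption; adjust to match requirements.txt.
    for name in ("fitz", "pytesseract", "typing_extensions"):
        print(f"{name}: {'ok' if module_available(name) else 'missing'}")
    status = "ok" if has_sentinel() else "missing (upgrade typing_extensions)"
    print(f"typing_extensions.Sentinel: {status}")
```

Run it with the same interpreter that launches Streamlit (for example, inside the activated .venv), since a mismatch between environments is a common cause of these errors.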