Extracts structured data from a handwritten AJ Bell Bare Trust Dealing Account application form (PDF) and outputs a validated JSON file.
pip install -r requirements.txt

Create a .env file in the project folder:
GEMINI_API_KEY=your_key_here
python main.py --input DataExtraction.pdf --output output.json

Add `--verbose` to see per-page field counts, token usage, and cost breakdown.
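For orientation, the command-line interface above could be declared roughly as follows. This is a minimal sketch, not the project's actual `main.py`: the `build_parser` helper, the help strings, and the choice to make `--input` and `--output` required are assumptions; only the three flag names come from the usage line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage line."""
    parser = argparse.ArgumentParser(
        description="Extract form fields from a scanned PDF into JSON."
    )
    # Assumption: both paths are mandatory.
    parser.add_argument("--input", required=True, help="Path to the scanned PDF form")
    parser.add_argument("--output", required=True, help="Path for the JSON result")
    # A boolean switch: present means True, absent means False.
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print per-page field counts, token usage, and cost breakdown",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.verbose)
```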
output.json — all form fields mapped to the schema defined in schemas/JSONSchema.txt.
- PDF is converted to images (300 DPI, one per page)
- Each page image is sent to Gemini vision — fields extracted as XML with chain-of-thought reasoning
- All extracted fields are mapped to the target JSON schema using Gemini structured output
- Result is validated against the schema; errors trigger an automatic correction pass
- Post-processing applies deterministic fixes (NI numbers, sort code formatting, country casing)
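
The validate-then-correct step in the list above can be pictured as a small loop. The sketch below is illustrative only: `find_errors` is a toy stand-in for real schema validation, `correct` stands in for the model call that repairs the record, and the field names and the single-pass limit are assumptions rather than the project's actual behaviour.

```python
from typing import Callable


def find_errors(record: dict, required: dict[str, type]) -> list[str]:
    """Toy validator: report missing keys and values of the wrong type."""
    errors = []
    for key, expected in required.items():
        if key not in record:
            errors.append(f"{key}: missing")
        elif record[key] is not None and not isinstance(record[key], expected):
            errors.append(f"{key}: expected {expected.__name__}")
    return errors


def validate_with_correction(
    record: dict,
    required: dict[str, type],
    correct: Callable[[dict, list[str]], dict],
    max_passes: int = 1,  # assumption: one correction pass
) -> tuple[dict, list[str]]:
    """Validate; on errors, hand record and errors to `correct` and re-validate."""
    errors = find_errors(record, required)
    passes = 0
    while errors and passes < max_passes:
        record = correct(record, errors)
        errors = find_errors(record, required)
        passes += 1
    return record, errors


def fake_correct(record: dict, errors: list[str]) -> dict:
    """Hypothetical stand-in for the model-driven correction call."""
    fixed = dict(record)
    fixed["sort_code"] = str(fixed.get("sort_code", ""))
    return fixed


required = {"surname": str, "sort_code": str}
record, remaining = validate_with_correction(
    {"surname": "Smith", "sort_code": 123456}, required, fake_correct
)
print(record, remaining)
```

Returning the remaining errors alongside the record lets the caller decide whether to fail or to write the output with a warning when the correction pass does not fully succeed.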
- The form is assumed to be the standard 8-page AJ Bell Bare Trust Dealing Account form. Other page counts are handled with a generic fallback, but page-specific accuracy will be lower.
- Titles (Dr/Mr/Mrs/Miss/Ms/Other) are marked by striking through all options except the chosen one — not by circling or ticking.
- Whatever is physically written in the Surname box is treated as the surname, even if it looks like a first name. The applicant's input is preserved as-is.
- Unticked checkboxes are extracted as `false`, not `null`. A `null` checkbox means that field is absent from the page entirely.
- NI numbers starting with `N1` followed by a digit are corrected to `NI`, because the digit 1 is a common OCR misread of capital I.
- Sort codes are formatted as `XX-XX-XX`. If the model returns 6 bare digits, post-processing adds the dashes.
- Country fields matching "uk" in any casing are normalised to `"UK"`.
- If Section D (Second Trustee) is left completely blank, it is not included in `TrusteeDetails`.
- Signature rows with no name written are not included in `trustee_signatures`.
- The `proof_of_registration_attached` field is always `null`: the form does not have a checkbox for this; it is a separate document attached externally.
- `identity_verification_consent` and `fraud_prevention_checks_consent` are `null` on the standard form, because these checkboxes do not appear on the version of the form used.
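
The three deterministic fixes described above (NI prefix, sort code dashes, country casing) are simple enough to sketch with the standard library. The function names below are hypothetical and this is not the project's actual post-processing code. Trimming whitespace around the country value is an added assumption.

```python
import re


def fix_ni_number(value: str) -> str:
    """Replace a leading 'N1' with 'NI' when a digit follows it."""
    return re.sub(r"^N1(?=\d)", "NI", value)


def format_sort_code(value: str) -> str:
    """Insert dashes into exactly six bare digits; leave other input unchanged."""
    if re.fullmatch(r"\d{6}", value):
        return f"{value[0:2]}-{value[2:4]}-{value[4:6]}"
    return value


def normalise_country(value: str) -> str:
    """Map 'uk' in any casing to 'UK' (assumption: surrounding spaces ignored)."""
    return "UK" if value.strip().lower() == "uk" else value


print(fix_ni_number("N1123456A"))   # NI123456A
print(format_sort_code("123456"))   # 12-34-56
print(normalise_country("Uk"))      # UK
```

Each function returns its input unchanged when the rule does not apply, so the fixes can be run on every record without first checking whether a correction is needed.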