Automated order processing system with separate steps for email extraction and AI validation.
The project is organized into independent steps that can be run separately:
- Step 0 (step0_daily_offer_pdf.py): Extracts daily offer PDF from email and converts to readable text using AI Vision
- Step 1 (step1_email_reader.py): Reads Gmail emails and extracts order data
- Step 2 (step2_ai_order_validation.py): Classifies orders using LLM
- Step 3 (step3_reference_validation.py): Validates VIP numbers and product codes against reference data
- AI package (ai/): Reusable AI utilities for all LLM-powered steps
```
project/
├── step0_daily_offer_pdf.py          # Step 0: Daily offer PDF extraction
├── step1_email_reader.py             # Step 1: Gmail email extraction
├── step2_ai_order_validation.py      # Step 2: LLM order classification
├── step3_reference_validation.py     # Step 3: VIP & product code validation
├── ai/                               # Reusable AI utilities
│   ├── __init__.py
│   ├── core.py                       # High-level LLM pipeline functions
│   ├── llm_client.py                 # OpenAI-compatible API client
│   ├── prompt_builder.py             # Legacy prompt builder
│   └── schemas.py                    # Pydantic validation schemas
├── prompts/                          # LLM prompt templates
│   └── order_classification.md
├── reference_data/                   # Reference files for validation
│   ├── VIP.xlsx                      # Valid VIP numbers
│   └── product_codes.xls             # Valid product codes
├── out_step0_daily_offer/            # Step 0 output (daily offer PDF + extracted text)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       ├── pdf_raw/                  # Downloaded PDF
│       ├── extracted/                # Extracted text
│       │   ├── text_full.txt         # Complete extracted text
│       │   └── text_by_page.json     # Text by page
│       └── meta.json                 # Extraction metadata
├── out_step1_email_inputs/           # Step 1 output (extracted emails)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       └── *.json                    # One JSON per email
├── out_step2_ai_order_validation/    # Step 2 output (classifications)
│   └── YYYYMMDD_HHMMSS/
│       ├── raw/                      # Request, response, parsed (all files)
│       ├── order/                    # Confirmed orders
│       ├── not_order/                # Non-orders (marketing, etc.)
│       └── needs_manual/             # Requires human review
├── out_step3_reference_validation/   # Step 3 output (validated orders)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       ├── extracted_reference_data/ # Extracted VIP & product lists
│       ├── valid_data/               # Orders with valid VIP & products
│       ├── invalid_data/             # Orders with invalid VIP or products
│       ├── step3_report.json         # Validation summary
│       └── step3_summary.csv         # CSV summary of all orders
├── credentials.json                  # Gmail OAuth credentials
├── token.json                        # Gmail OAuth token (auto-generated)
└── .env.example                      # Environment config template
```
- Create `.env` from `.env.example` and set `OPENAI_API_KEY`
- Install dependencies:

```
pip install requests pydantic beautifulsoup4 google-api-python-client openpyxl pyexcel pyexcel-xls python-dotenv pandas xlrd
```

- Set up Gmail OAuth:
  - Download OAuth credentials from Google Cloud Console
  - Save as `credentials.json` in the project root
- Add reference data (for Step 3):
  - Create the `reference_data/` folder
  - Add `VIP.xlsx` with valid VIP numbers
  - Add `product_codes.xls` with valid product codes
Step 0 - Extract Daily Offer (optional, run once per day):

```bash
# Using AI Vision (recommended - handles complex PDFs with custom fonts)
python step0_daily_offer_pdf.py --ai-vision

# Using AI Vision with GPT-4o (more accurate but slower/costlier)
python step0_daily_offer_pdf.py --ai-vision --ai-model gpt-4o

# Using OCR (fallback if AI Vision unavailable, requires Tesseract)
python step0_daily_offer_pdf.py --ocr

# Basic mode (tries the PDF text layer only)
python step0_daily_offer_pdf.py
```

- Searches for the latest email from `sales@green-master.eu`
- Downloads the daily offer PDF
- Extracts text using AI Vision, OCR, or the PDF text layer
- Saves to `out_step0_daily_offer/TIMESTAMP/`
- AI Vision recommended: handles PDFs with custom font encoding and complex layouts

Note: for Step 0 with AI Vision, also install `pip install pdf2image Pillow`.
Step 1 - Extract emails:

```bash
python step1_email_reader.py
```

- Reads unread emails from the Gmail Primary tab (last 1 day)
- Extracts Excel attachments in memory (never saves them to disk)
- Saves email data to `out_step1_email_inputs/TIMESTAMP/`
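The in-memory handling of attachments can be sketched with the standard library alone. The Gmail API delivers attachment bytes base64url-encoded; decoding into a `BytesIO` buffer lets openpyxl or pyexcel open the workbook without ever writing to disk. This is an illustrative sketch, not the project's actual code; `attachment_to_buffer` and the sample payload are hypothetical.

```python
# Sketch of in-memory attachment handling. The Gmail API returns
# attachment bodies base64url-encoded; the sample payload below is
# dummy data standing in for a real .xlsx attachment.
import base64
import io


def attachment_to_buffer(attachment_data: str) -> io.BytesIO:
    """Decode a base64url attachment body into an in-memory buffer.

    The resulting BytesIO can be passed to openpyxl/pyexcel in place
    of a filename, so the workbook never touches the filesystem.
    """
    raw = base64.urlsafe_b64decode(attachment_data)
    return io.BytesIO(raw)


# Dummy bytes standing in for an .xlsx payload (.xlsx files are ZIP
# archives, so real ones begin with the bytes b"PK"):
payload = base64.urlsafe_b64encode(b"PK\x03\x04fake-xlsx").decode("ascii")
buffer = attachment_to_buffer(payload)
print(buffer.read(2))  # b'PK'
```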
Step 2 - Classify with AI:

```bash
# Process the latest batch
python step2_ai_order_validation.py

# Process a specific batch
python step2_ai_order_validation.py --batch 20260110_130000
```

- Reads emails from the Step 1 output
- Calls the LLM for classification
- Saves results to `out_step2_ai_order_validation/TIMESTAMP/`
Step 3 - Validate against reference data:

```bash
# Auto-detect the latest batch (recommended)
python step3_reference_validation.py

# Specify custom paths
python step3_reference_validation.py \
    --step2_input out_step2_ai_order_validation/20260110_130000/order \
    --vip_file reference_data/VIP.xlsx \
    --product_file reference_data/product_codes.xls
```

- Reads classified orders from the Step 2 output
- Validates VIP numbers against `VIP.xlsx`
- Validates product codes against `product_codes.xls`
- Saves valid/invalid orders to `out_step3_reference_validation/TIMESTAMP/`
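The core of Step 3 is a pair of membership checks. A minimal sketch, assuming the reference sets have already been loaded from `VIP.xlsx` and `product_codes.xls` (here they are inlined dummies, and `validate_order` is a hypothetical helper, not the script's actual function):

```python
# Sketch of Step 3's membership checks. The real script loads the
# reference sets from VIP.xlsx and product_codes.xls; here they are
# inlined for illustration.
def validate_order(order: dict, valid_vips: set, valid_products: set) -> dict:
    """Return the order annotated with validation results."""
    bad_products = [
        item["product_code"]
        for item in order["items"]
        if item["product_code"] not in valid_products
    ]
    return {
        **order,
        "vip_valid": order["vip_number"] in valid_vips,
        "invalid_products": bad_products,
    }


order = {"vip_number": "12345", "items": [{"product_code": "3000950", "qty": 2}]}
result = validate_order(order, valid_vips={"12345"}, valid_products={"3000950"})
print(result["vip_valid"], result["invalid_products"])  # True []
```

An order would land in `valid_data/` only when `vip_valid` is true and `invalid_products` is empty.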
Reusable AI utilities for building LLM-powered steps:
- `load_prompt_template()` - Load prompts from the `prompts/` directory
- `build_chat_messages()` - Build chat messages from a system prompt + user data
- `call_llm_and_save()` - Call the LLM API and save the request/response
- `extract_assistant_content()` - Extract text from the API response
- `parse_and_save_response()` - Parse and validate LLM output with Pydantic
- `save_json_atomic()` - Atomically save JSON files
```python
from ai.core import load_prompt_template, build_chat_messages, call_llm_and_save

# Load prompt
system_prompt = load_prompt_template("order_classification.md")

# Build messages
messages = build_chat_messages(system_prompt, {"body_text": "..."})

# Call LLM
response = call_llm_and_save(
    messages=messages,
    output_dir="output/raw",
    base_filename="email_123",
    model="gpt-4o-mini",
)
```

The LLM returns structured JSON. Each email can contain multiple orders, and all orders from the same email share the same `email_id`:
```json
{
  "email_id": "19bad18a2820ecc7",
  "classification": "order" | "not_order" | "needs_manual",
  "confidence": 0.0-1.0,
  "orders": [
    {
      "email_id": "19bad18a2820ecc7",
      "vip_number": "12345",
      "items": [
        {"product_code": "3000950", "qty": 2}
      ],
      "order_status": "order" | "needs_manual",
      "reasons": ["explanation..."],
      "questions_for_human": ["question..."]
    }
  ]
}
```

Important: `email_id` is propagated through all steps (Step 1 → Step 2 → Step 3) to track which email each order originated from. When one email contains multiple orders, they all share the same `email_id`.
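The structure above maps naturally onto Pydantic models like those in `ai/schemas.py`. The following is a sketch reconstructed from the JSON example, not the project's actual schema; field names come from the example, but the real models may differ:

```python
# Sketch of Pydantic models matching the JSON example above.
# Hypothetical reconstruction - the actual schemas live in ai/schemas.py.
from typing import List, Literal

from pydantic import BaseModel, Field


class Item(BaseModel):
    product_code: str
    qty: int


class Order(BaseModel):
    email_id: str
    vip_number: str
    items: List[Item]
    order_status: Literal["order", "needs_manual"]
    reasons: List[str] = []
    questions_for_human: List[str] = []


class EmailClassification(BaseModel):
    email_id: str
    classification: Literal["order", "not_order", "needs_manual"]
    confidence: float = Field(ge=0.0, le=1.0)
    orders: List[Order]


# Validating a raw LLM response dict against the schema:
sample = {
    "email_id": "19bad18a2820ecc7",
    "classification": "order",
    "confidence": 0.92,
    "orders": [
        {
            "email_id": "19bad18a2820ecc7",
            "vip_number": "12345",
            "items": [{"product_code": "3000950", "qty": 2}],
            "order_status": "order",
        }
    ],
}
parsed = EmailClassification(**sample)
```

Validation failures (missing fields, out-of-range confidence, unknown classification values) raise a `ValidationError`, which is one way a response could be routed to `needs_manual/`.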
Classification logic:
- `order`: All orders are valid and actionable
- `not_order`: No orders found (marketing, newsletter, etc.)
- `needs_manual`: At least one order requires human review
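The email-level label follows from the per-order statuses. A hypothetical sketch of that rule (`classify_email` is illustrative; the real decision is made by the LLM and Step 2's post-processing):

```python
def classify_email(order_statuses: list) -> str:
    """Derive the email-level label from per-order statuses.

    Illustrative helper mirroring the rules above:
    no orders -> not_order; any order needing review -> needs_manual;
    otherwise -> order.
    """
    if not order_statuses:
        return "not_order"
    if any(status == "needs_manual" for status in order_statuses):
        return "needs_manual"
    return "order"


print(classify_email([]))                         # not_order
print(classify_email(["order", "order"]))         # order
print(classify_email(["order", "needs_manual"]))  # needs_manual
```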
Auto mark as read: emails are marked as read when `classification != "needs_manual"` AND `confidence >= 0.85`.
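Expressed as a predicate (an illustrative helper, not the script's actual function):

```python
def should_mark_read(classification: str, confidence: float) -> bool:
    """Auto mark-as-read rule: confident and not flagged for review."""
    return classification != "needs_manual" and confidence >= 0.85


print(should_mark_read("order", 0.92))         # True
print(should_mark_read("order", 0.60))         # False - confidence too low
print(should_mark_read("needs_manual", 0.99))  # False - flagged for review
```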
- Separate steps: Step 1 extracts emails, Step 2 classifies them, Step 3 validates
- In-memory processing: Excel attachments are never saved to disk
- Multi-order support: One email can contain multiple orders (same or different VIP numbers)
- Email tracking: Each order carries an `email_id` to trace it back to the original email; all orders from the same email share the same `email_id`, which is preserved through all steps (Step 1 → Step 2 → Step 3)
- Batch organization: Each run creates a timestamped folder
- Reusable AI layer: the `ai/` package can be used for future steps (step4, step5, etc.)
- Non-Excel attachments: PDFs, images, etc. are skipped (only metadata is recorded)
Default: `is:unread newer_than:1d -in:spam -in:trash category:primary`

- Only unread emails
- From the last 1 day
- Primary tab only
- Excludes spam and trash

Modify `GMAIL_QUERY` in `step1_email_reader.py` to change this.
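For example, a hypothetical variant that also picks up already-read emails and widens the window to three days (the query uses standard Gmail search operators; `GMAIL_QUERY` is the constant named above):

```python
# Hypothetical variant of the constant in step1_email_reader.py:
# drop is:unread and widen the window from 1 day to 3 days.
GMAIL_QUERY = "newer_than:3d -in:spam -in:trash category:primary"
```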