Automated order processing system with separate steps for email extraction and AI validation.
The project is organized into independent steps that can be run separately:
- Step 0 (step0_daily_offer_pdf.py): Extracts daily offer PDF from email and converts to readable text using AI Vision
- Step 1 (step1_email_reader.py): Reads Gmail emails and extracts order data
- Step 2 (step2_ai_order_validation.py): Classifies orders using LLM
- Step 3 (step3_reference_validation.py): Validates VIP numbers and product codes against reference data
- AI package (ai/): Reusable AI utilities for all LLM-powered steps
```
project/
├── step0_daily_offer_pdf.py          # Step 0: Daily offer PDF extraction
├── step1_email_reader.py             # Step 1: Gmail email extraction
├── step2_ai_order_validation.py      # Step 2: LLM order classification
├── step3_reference_validation.py     # Step 3: VIP & product code validation
├── ai/                               # Reusable AI utilities
│   ├── __init__.py
│   ├── core.py                       # High-level LLM pipeline functions
│   ├── llm_client.py                 # OpenAI-compatible API client
│   ├── prompt_builder.py             # Legacy prompt builder
│   └── schemas.py                    # Pydantic validation schemas
├── prompts/                          # LLM prompt templates
│   └── order_classification.md
├── reference_data/                   # Reference files for validation
│   ├── VIP.xlsx                      # Valid VIP numbers
│   └── product_codes.xls             # Valid product codes
├── out_step0_daily_offer/            # Step 0 output (daily offer PDF + extracted text)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       ├── pdf_raw/                  # Downloaded PDF
│       ├── extracted/                # Extracted text
│       │   ├── text_full.txt         # Complete extracted text
│       │   └── text_by_page.json     # Text by page
│       └── meta.json                 # Extraction metadata
├── out_step1_email_inputs/           # Step 1 output (extracted emails)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       └── *.json                    # One JSON per email
├── out_step2_ai_order_validation/    # Step 2 output (classifications)
│   └── YYYYMMDD_HHMMSS/
│       ├── raw/                      # Request, response, parsed (all files)
│       ├── order/                    # Confirmed orders
│       ├── not_order/                # Non-orders (marketing, etc.)
│       └── needs_manual/             # Requires human review
├── out_step3_reference_validation/   # Step 3 output (validated orders)
│   └── YYYYMMDD_HHMMSS/              # Batch timestamp folder
│       ├── extracted_reference_data/ # Extracted VIP & product lists
│       ├── valid_data/               # Orders with valid VIP & products
│       ├── invalid_data/             # Orders with invalid VIP or products
│       ├── step3_report.json         # Validation summary
│       └── step3_summary.csv         # CSV summary of all orders
├── credentials.json                  # Gmail OAuth credentials
├── token.json                        # Gmail OAuth token (auto-generated)
└── .env.example                      # Environment config template
```
- Create `.env` from `.env.example` and set `OPENAI_API_KEY`
- Install dependencies:

```
pip install requests pydantic beautifulsoup4 google-api-python-client openpyxl pyexcel pyexcel-xls python-dotenv pandas xlrd
```

- Set up Gmail OAuth:
  - Download OAuth credentials from Google Cloud Console
  - Save as `credentials.json` in the project root
- Add reference data (for Step 3):
  - Create the `reference_data/` folder
  - Add `VIP.xlsx` with valid VIP numbers
  - Add `product_codes.xls` with valid product codes
Step 0 - Extract Daily Offer (optional, run once per day):

```bash
# Using AI Vision (recommended - handles complex PDFs with custom fonts)
python step0_daily_offer_pdf.py --ai-vision

# Using AI Vision with GPT-4o (more accurate but slower/costlier)
python step0_daily_offer_pdf.py --ai-vision --ai-model gpt-4o

# Using OCR (fallback if AI Vision unavailable, requires Tesseract)
python step0_daily_offer_pdf.py --ocr

# Basic mode (tries the PDF text layer only)
python step0_daily_offer_pdf.py
```

- Searches for the latest email from `sales@green-master.eu`
- Downloads the daily offer PDF
- Extracts text using AI Vision, OCR, or the PDF text layer
- Saves to `out_step0_daily_offer/TIMESTAMP/`
- AI Vision recommended: handles PDFs with custom font encoding and complex layouts

Note: for Step 0 with AI Vision, also install `pip install pdf2image Pillow`.
Step 1 - Extract emails:

```bash
python step1_email_reader.py
```

- Reads unread emails from the Gmail Primary tab (last 1 day)
- Extracts Excel attachments in memory (never saves them to disk)
- Saves email data to `out_step1_email_inputs/TIMESTAMP/`
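The in-memory handling of attachments can be sketched with the standard library alone. The Gmail API delivers attachment bytes base64url-encoded; decoding into a `BytesIO` buffer lets openpyxl or pyexcel open the workbook without ever writing to disk. This is an illustrative sketch, not the project's actual code; `attachment_to_buffer` and the sample payload are hypothetical.

```python
# Sketch of in-memory attachment handling. The Gmail API returns
# attachment bodies base64url-encoded; the sample payload below is
# dummy data standing in for a real .xlsx attachment.
import base64
import io


def attachment_to_buffer(attachment_data: str) -> io.BytesIO:
    """Decode a base64url attachment body into an in-memory buffer.

    The resulting BytesIO can be passed to openpyxl/pyexcel in place
    of a filename, so the workbook never touches the filesystem.
    """
    raw = base64.urlsafe_b64decode(attachment_data)
    return io.BytesIO(raw)


# Dummy bytes standing in for an .xlsx payload (.xlsx files are ZIP
# archives, so real ones begin with the bytes b"PK"):
payload = base64.urlsafe_b64encode(b"PK\x03\x04fake-xlsx").decode("ascii")
buffer = attachment_to_buffer(payload)
print(buffer.read(2))  # b'PK'
```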
Step 2 - Classify with AI:

```bash
# Process the latest batch
python step2_ai_order_validation.py

# Process a specific batch
python step2_ai_order_validation.py --batch 20260110_130000
```

- Reads emails from the Step 1 output
- Calls the LLM for classification
- Saves results to `out_step2_ai_order_validation/TIMESTAMP/`
Step 3 - Validate against reference data:

```bash
# Auto-detect the latest batch (recommended)
python step3_reference_validation.py

# Specify custom paths
python step3_reference_validation.py \
    --step2_input out_step2_ai_order_validation/20260110_130000/order \
    --vip_file reference_data/VIP.xlsx \
    --product_file reference_data/product_codes.xls
```

- Reads classified orders from the Step 2 output
- Validates VIP numbers against `VIP.xlsx`
- Validates product codes against `product_codes.xls`
- Saves valid/invalid orders to `out_step3_reference_validation/TIMESTAMP/`
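The core of Step 3 is a pair of membership checks. A minimal sketch, assuming the reference sets have already been loaded from `VIP.xlsx` and `product_codes.xls` (here they are inlined dummies, and `validate_order` is a hypothetical helper, not the script's actual function):

```python
# Sketch of Step 3's membership checks. The real script loads the
# reference sets from VIP.xlsx and product_codes.xls; here they are
# inlined for illustration.
def validate_order(order: dict, valid_vips: set, valid_products: set) -> dict:
    """Return the order annotated with validation results."""
    bad_products = [
        item["product_code"]
        for item in order["items"]
        if item["product_code"] not in valid_products
    ]
    return {
        **order,
        "vip_valid": order["vip_number"] in valid_vips,
        "invalid_products": bad_products,
    }


order = {"vip_number": "12345", "items": [{"product_code": "3000950", "qty": 2}]}
result = validate_order(order, valid_vips={"12345"}, valid_products={"3000950"})
print(result["vip_valid"], result["invalid_products"])  # True []
```

An order would land in `valid_data/` only when `vip_valid` is true and `invalid_products` is empty.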
Reusable AI utilities for building LLM-powered steps:
- `load_prompt_template()` - Load prompts from the `prompts/` directory
- `build_chat_messages()` - Build chat messages from a system prompt + user data
- `call_llm_and_save()` - Call the LLM API and save the request/response
- `extract_assistant_content()` - Extract text from the API response
- `parse_and_save_response()` - Parse and validate LLM output with Pydantic
- `save_json_atomic()` - Atomically save JSON files
```python
from ai.core import load_prompt_template, build_chat_messages, call_llm_and_save

# Load prompt
system_prompt = load_prompt_template("order_classification.md")

# Build messages
messages = build_chat_messages(system_prompt, {"body_text": "..."})

# Call LLM
response = call_llm_and_save(
    messages=messages,
    output_dir="output/raw",
    base_filename="email_123",
    model="gpt-4o-mini",
)
```

The LLM returns structured JSON. Each email can contain multiple orders, and all orders from the same email share the same `email_id`:
```json
{
  "email_id": "19bad18a2820ecc7",
  "classification": "order" | "not_order" | "needs_manual",
  "confidence": 0.0-1.0,
  "orders": [
    {
      "email_id": "19bad18a2820ecc7",
      "vip_number": "12345",
      "items": [
        {"product_code": "3000950", "qty": 2}
      ],
      "order_status": "order" | "needs_manual",
      "reasons": ["explanation..."],
      "questions_for_human": ["question..."]
    }
  ]
}
```

Important: `email_id` is propagated through all steps (Step 1 → Step 2 → Step 3) to track which email each order originated from. When one email contains multiple orders, they all share the same `email_id`.
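The structure above maps naturally onto Pydantic models like those in `ai/schemas.py`. The following is a sketch reconstructed from the JSON example, not the project's actual schema; field names come from the example, but the real models may differ:

```python
# Sketch of Pydantic models matching the JSON example above.
# Hypothetical reconstruction - the actual schemas live in ai/schemas.py.
from typing import List, Literal

from pydantic import BaseModel, Field


class Item(BaseModel):
    product_code: str
    qty: int


class Order(BaseModel):
    email_id: str
    vip_number: str
    items: List[Item]
    order_status: Literal["order", "needs_manual"]
    reasons: List[str] = []
    questions_for_human: List[str] = []


class EmailClassification(BaseModel):
    email_id: str
    classification: Literal["order", "not_order", "needs_manual"]
    confidence: float = Field(ge=0.0, le=1.0)
    orders: List[Order]


# Validating a raw LLM response dict against the schema:
sample = {
    "email_id": "19bad18a2820ecc7",
    "classification": "order",
    "confidence": 0.92,
    "orders": [
        {
            "email_id": "19bad18a2820ecc7",
            "vip_number": "12345",
            "items": [{"product_code": "3000950", "qty": 2}],
            "order_status": "order",
        }
    ],
}
parsed = EmailClassification(**sample)
```

Validation failures (missing fields, out-of-range confidence, unknown classification values) raise a `ValidationError`, which is one way a response could be routed to `needs_manual/`.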
Classification logic:
- `order`: All orders are valid and actionable
- `not_order`: No orders found (marketing, newsletter, etc.)
- `needs_manual`: At least one order requires human review
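The email-level label follows from the per-order statuses. A hypothetical sketch of that rule (`classify_email` is illustrative; the real decision is made by the LLM and Step 2's post-processing):

```python
def classify_email(order_statuses: list) -> str:
    """Derive the email-level label from per-order statuses.

    Illustrative helper mirroring the rules above:
    no orders -> not_order; any order needing review -> needs_manual;
    otherwise -> order.
    """
    if not order_statuses:
        return "not_order"
    if any(status == "needs_manual" for status in order_statuses):
        return "needs_manual"
    return "order"


print(classify_email([]))                         # not_order
print(classify_email(["order", "order"]))         # order
print(classify_email(["order", "needs_manual"]))  # needs_manual
```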
Auto mark as read: emails are marked as read when `classification != "needs_manual"` AND `confidence >= 0.85`.
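Expressed as a predicate (an illustrative helper, not the script's actual function):

```python
def should_mark_read(classification: str, confidence: float) -> bool:
    """Auto mark-as-read rule: confident and not flagged for review."""
    return classification != "needs_manual" and confidence >= 0.85


print(should_mark_read("order", 0.92))         # True
print(should_mark_read("order", 0.60))         # False - confidence too low
print(should_mark_read("needs_manual", 0.99))  # False - flagged for review
```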
- Separate steps: Step 1 extracts emails, Step 2 classifies them, Step 3 validates
- In-memory processing: Excel attachments are never saved to disk
- Multi-order support: One email can contain multiple orders (same or different VIP numbers)
- Email tracking: Each order carries an `email_id` to trace it back to the original email; all orders from the same email share the same `email_id`, which is preserved through all steps (Step 1 → Step 2 → Step 3)
- Batch organization: Each run creates a timestamped folder
- Reusable AI layer: the `ai/` package can be used for future steps (step4, step5, etc.)
- Non-Excel attachments: PDFs, images, etc. are skipped (only metadata is recorded)
Default: `is:unread newer_than:1d -in:spam -in:trash category:primary`

- Only unread emails
- From the last 1 day
- Primary tab only
- Excludes spam and trash

Modify `GMAIL_QUERY` in `step1_email_reader.py` to change this.
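For example, a hypothetical variant that also picks up already-read emails and widens the window to three days (the query uses standard Gmail search operators; `GMAIL_QUERY` is the constant named above):

```python
# Hypothetical variant of the constant in step1_email_reader.py:
# drop is:unread and widen the window from 1 day to 3 days.
GMAIL_QUERY = "newer_than:3d -in:spam -in:trash category:primary"
```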