POP Translation Pipeline

This repository contains a production-oriented pipeline for translating agricultural POP (Package of Practices) PDF documents into English using the Gemini API and generating editable Word (.docx) outputs for review by agriculture experts.

The pipeline was designed for POP documents that contain agricultural instructions, tables, chemical names, crop names, pest/disease names, formulation codes, units, doses, percentages, images, and structured technical content.

Objective

The objective of this project is to translate priority agricultural POP documents into English while preserving the document structure as much as possible.

The final output is generated as an editable .docx file so that agriculture experts can easily:

review translations,
correct technical terms,
edit table values,
add missing points,
add comments or suggestions,
and finalize the document for downstream use.

Current Pipeline

The current pipeline processes a source PDF one page at a time.

Source PDF
   ↓
Split into page-wise PDFs
   ↓
Send each page PDF to Gemini
   ↓
Generate translated HTML per page
   ↓
Extract and inject original page images
   ↓
Merge translated HTML pages
   ↓
Convert combined HTML to DOCX using Pandoc
   ↓
Final editable Word file

The pipeline is implemented in: scripts/run_pop_to_docx.py

The translation prompt is stored separately in: prompts/page_to_pdf.txt
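
The merge step in the flow above can be illustrated with standard-library Python. This is a minimal sketch, not the code in scripts/run_pop_to_docx.py: the function names `extract_body` and `combine_pages`, the regex-based body extraction, and the `<section>` wrapper are assumptions made for illustration. Only the per-page file name final_with_images.html and the page_XXX folder naming come from this README.

```python
import re
from html import escape
from pathlib import Path

# Hypothetical helper pattern: grabs whatever sits between <body> and </body>.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(markup: str) -> str:
    """Return the inner <body> content, or the whole string if there is no <body> tag."""
    match = BODY_RE.search(markup)
    return match.group(1).strip() if match else markup.strip()


def combine_pages(workdir: Path, title: str) -> str:
    """Concatenate page_XXX/final_with_images.html files in page order."""
    sections = []
    # Zero-padded folder names (page_001, page_002, ...) sort correctly as text.
    for page_dir in sorted(workdir.glob("page_*")):
        page_file = page_dir / "final_with_images.html"
        if not page_file.is_file():
            continue  # page not finished yet; leave it out of the merge
        body = extract_body(page_file.read_text(encoding="utf-8"))
        sections.append(f'<section id="{page_dir.name}">\n{body}\n</section>')
    joined = "\n".join(sections)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'<meta charset="utf-8">\n<title>{escape(title)}</title>\n'
        f"</head>\n<body>\n{joined}\n</body>\n</html>\n"
    )
```

Wrapping each page in its own section keeps page boundaries visible in the combined HTML, which may help when a problematic page has to be located or regenerated.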

Repository Structure

POP-Translation/
├── prompts/
│   └── page_to_pdf.txt
│
├── scripts/
│   └── run_pop_to_docx.py
│
├── requirements.txt
├── README.md
└── .gitignore

Recommended local working structure:

POP_Work/
├── Data/
│   └── Karnataka/
│       ├── Ginger/
│       ├── Onion/
│       ├── Bengal gram/
│       ├── Sugarcane/
│       └── ...
│
├── Workdir/
│   └── Karnataka/
│       └── <Crop>/
│           └── <Document_Output_Folder>/
│
├── prompts/
├── scripts/
├── requirements.txt
└── README.md

Main Features

Page-wise PDF splitting
Gemini-based translation
HTML output generation
Original image extraction from PDF pages
Image injection into translated HTML
Combined HTML generation
DOCX generation using Pandoc
Resume support
Page-wise intermediate outputs
Runtime logging
Parallel page translation with configurable concurrency
Optional page range processing
Skip translation and rebuild DOCX from existing HTML
Safe reruns without overwriting completed pages

Gemini Translation Configuration

Current configuration:

Model: gemini-3.1-pro-preview
Thinking level: HIGH
Google Search: Enabled
Input: Page-wise PDF bytes
Output: Clean translated HTML

Requirements

Recommended Python version:

Python 3.11+

Install required Python packages using:

pip install -r requirements.txt

Pandoc is required for converting the combined HTML into .docx. Install Pandoc separately, then verify the installation:

pandoc --version
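
For orientation, the HTML-to-DOCX step could be driven from Python roughly as follows. This is a sketch under assumptions: the function names `build_pandoc_command` and `convert_html_to_docx` are hypothetical, and the actual script may pass different Pandoc options. The `-f`, `-t`, and `-o` flags are standard Pandoc options for input format, output format, and output file.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(html_path: Path, docx_path: Path) -> list[str]:
    """Build the argument list for an HTML -> DOCX conversion."""
    return ["pandoc", str(html_path), "-f", "html", "-t", "docx", "-o", str(docx_path)]


def convert_html_to_docx(html_path: Path, docx_path: Path) -> None:
    """Run Pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("Pandoc was not found on PATH; install it and rerun.")
    docx_path.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Pandoc exits with an error code.
    subprocess.run(build_pandoc_command(html_path, docx_path), check=True)
```

Checking for the executable with `shutil.which` before calling it turns a missing Pandoc installation into an explicit error instead of a less obvious subprocess failure.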

Environment Setup

Clone the repository:

git clone https://github.com/vicharanashala/POP-Translation.git
cd POP-Translation

Create and activate a virtual environment (Windows):

python -m venv venv
venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set the Gemini API key (Windows Command Prompt):

set GEMINI_API_KEY=PASTE_YOUR_GEMINI_API_KEY_HERE
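
A script can then read the key from the environment and fail early when it is missing. The sketch below is illustrative only; the function name `load_api_key` and the error wording are assumptions, not taken from scripts/run_pop_to_docx.py.

```python
import os


def load_api_key(env=os.environ) -> str:
    """Return the Gemini API key from the environment, or raise if it is unusable."""
    key = env.get("GEMINI_API_KEY", "").strip()
    # Also reject the placeholder text from the setup instructions.
    if not key or key == "PASTE_YOUR_GEMINI_API_KEY_HERE":
        raise RuntimeError("GEMINI_API_KEY is not set; set it before running the pipeline.")
    return key
```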

Input Folder

Place source PDFs inside the Data/ folder. Example:

Data/
└── Karnataka/
    └── Ginger/
        └── Rhizome rot management in ginger_KVK Krishi Vigyana Kendra, Shivamogga, Karnataka.pdf

Basic Usage

Run the pipeline from the project root. Example:

python scripts\run_pop_to_docx.py --source-pdf "Data\Karnataka\Ginger\Rhizome rot management in ginger_KVK Krishi Vigyana Kendra, Shivamogga, Karnataka.pdf" --workdir-root "Workdir\Karnataka\Ginger\Rhizome rot management in ginger_KVK Krishi Vigyana Kendra, Shivamogga, Karnataka" --doc-name "Rhizome rot management in ginger_KVK Krishi Vigyana Kendra, Shivamogga, Karnataka" --prompt-file "prompts\page_to_pdf.txt" --start-page 1 --end-page 1 --concurrency 1

Important Command Arguments

Argument : Purpose

--source-pdf : Path to input PDF
--workdir-root : Working/output folder for that document
--doc-name : Base name used for combined HTML and DOCX
--prompt-file : Path to translation prompt
--start-page : First page to process
--end-page : Last page to process
--concurrency : Number of pages translated in parallel
--overwrite : Reprocess existing outputs
--skip-translation : Skip Gemini translation and reuse existing translated.html
--skip-image-injection : Skip image extraction/injection and reuse existing final_with_images.html
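
The arguments listed above map naturally onto an `argparse` parser. The following is a sketch, not the script's actual parser: the flag names come from this README, but which flags are required, their defaults, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags documented in this README (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Translate a POP PDF into an editable DOCX.")
    parser.add_argument("--source-pdf", required=True, help="Path to input PDF")
    parser.add_argument("--workdir-root", required=True, help="Working/output folder")
    parser.add_argument("--doc-name", required=True, help="Base name for HTML and DOCX")
    parser.add_argument("--prompt-file", required=True, help="Path to translation prompt")
    # Assumption: leaving the page range unset means "process the whole document".
    parser.add_argument("--start-page", type=int, default=None, help="First page to process")
    parser.add_argument("--end-page", type=int, default=None, help="Last page to process")
    parser.add_argument("--concurrency", type=int, default=1, help="Pages translated in parallel")
    parser.add_argument("--overwrite", action="store_true", help="Reprocess existing outputs")
    parser.add_argument("--skip-translation", action="store_true",
                        help="Reuse existing translated.html")
    parser.add_argument("--skip-image-injection", action="store_true",
                        help="Reuse existing final_with_images.html")
    return parser
```

With this parser, `build_parser().parse_args([...])` exposes the values as attributes such as `args.source_pdf` and `args.skip_translation`, because `argparse` converts hyphens in flag names to underscores.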

Resume Behaviour

The pipeline is resumable.

Each page has its own folder:

page_001/
├── page_001.pdf
├── translated.html
├── images/
└── final_with_images.html

If the script stops midway, rerun the same command without --overwrite.

The script skips pages whose outputs are already present:

translated.html already exists (translation is skipped)
final_with_images.html already exists (image injection is skipped)

Only missing pages will be processed again.
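
The resume check can be pictured as a simple file-existence test per page. This is a minimal sketch assuming that a page counts as translated once its translated.html file exists; the function name `pages_to_translate` is hypothetical and the real script may apply additional checks.

```python
from pathlib import Path


def pages_to_translate(workdir: Path, pages: range, overwrite: bool = False) -> list[int]:
    """Return the page numbers that still need translation."""
    pending = []
    for number in pages:
        output = workdir / f"page_{number:03d}" / "translated.html"
        # With overwrite, every page is redone; otherwise only pages without output.
        if overwrite or not output.is_file():
            pending.append(number)
    return pending
```

The same pattern would apply to image injection by testing for final_with_images.html instead.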

Output Structure

For each processed document, the output folder contains:

Workdir/<State>/<Crop>/<Document_Name>/
├── page_001/
│   ├── page_001.pdf
│   ├── translated.html
│   ├── images/
│   └── final_with_images.html
│
├── page_002/
│   ├── page_002.pdf
│   ├── translated.html
│   ├── images/
│   └── final_with_images.html
│
├── translation_summary.json
├── image_injection_summary.json
├── pipeline_runtime.log
├── <doc-name>_combined_pages_001_to_XXX.html
└── final_output/
    └── <doc-name>_translated_pages_001_to_XXX.docx

Final DOCX output is saved inside:

final_output/
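
The naming scheme in the tree above can be expressed as a few small helpers. This sketch assumes that XXX stands for a zero-padded, three-digit page number, matching the page_001 folder style; the helper names are hypothetical.

```python
def page_dir_name(page: int) -> str:
    """Folder name for one page, e.g. page 7 -> page_007."""
    return f"page_{page:03d}"


def combined_html_name(doc_name: str, first: int, last: int) -> str:
    """File name of the merged HTML for a page range."""
    return f"{doc_name}_combined_pages_{first:03d}_to_{last:03d}.html"


def final_docx_name(doc_name: str, first: int, last: int) -> str:
    """File name of the final Word document for a page range."""
    return f"{doc_name}_translated_pages_{first:03d}_to_{last:03d}.docx"
```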

Runtime Logs

The script creates a runtime log:

pipeline_runtime.log

This log records:

pipeline configuration,
page splitting status,
translation progress,
Gemini request timing,
retry attempts,
image injection progress,
combined HTML generation,
DOCX conversion,
final output path.
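
A log file like this can be produced with Python's standard `logging` module. The sketch below is illustrative: the logger name `pop_pipeline`, the message format, and the function name `setup_logger` are assumptions, and only the file name pipeline_runtime.log comes from this README.

```python
import logging
from pathlib import Path


def setup_logger(workdir: Path) -> logging.Logger:
    """Create a logger that appends to pipeline_runtime.log inside the work folder."""
    logger = logging.getLogger("pop_pipeline")
    logger.setLevel(logging.INFO)
    # Remove handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(workdir / "pipeline_runtime.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as `logger.info("page %d translated in %.1fs", 1, 2.5)` then writes one timestamped line per event.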

Parallel Processing

The pipeline supports concurrent page translation. Example:

--concurrency 2

Higher concurrency may increase speed, but can also increase the chance of API rate-limit errors or stuck requests.
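
One common way to bound parallel requests and recover from transient failures is a thread pool combined with retries. The sketch below shows that pattern under assumptions: `task` stands in for whatever function sends one page to Gemini, the names `with_retries` and `translate_pages` are hypothetical, and the exponential backoff schedule is an illustrative choice rather than the script's documented behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def with_retries(task: Callable[[int], str], page: int,
                 attempts: int = 3, base_delay: float = 2.0) -> str:
    """Call task(page), retrying on failure with exponentially growing waits."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task(page)
        except Exception:
            if attempt == attempts:
                raise  # out of retries; let the caller see the error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


def translate_pages(pages: Iterable[int], task: Callable[[int], str],
                    concurrency: int = 2, attempts: int = 3,
                    base_delay: float = 2.0) -> dict[int, str]:
    """Run task for every page with at most `concurrency` pages in flight."""
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {
            pool.submit(with_retries, task, page, attempts, base_delay): page
            for page in pages
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Keeping `concurrency` low limits how many requests can hit a rate limit at once, while the backoff gives the API time to recover before the next attempt.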

Known Limitations

Pandoc may produce incomplete DOCX files for some very large or complex HTML documents.
Some documents may need chunk-wise DOCX generation.
A few problematic pages may need to be generated separately.
The script currently assumes Pandoc is installed and available on the system PATH.
API rate limits may affect concurrency.
Output quality depends on the source PDF quality and Gemini response.
