sanbrasp/PDFTranslator

PDFTranslator

A small CLI tool that extracts text from a PDF and (optionally) translates it.


About

This program takes a PDF file as input and translates its text to English (or to another target language that the translation server supports).
This might simplify workflows for users who are more comfortable working with English task descriptions and documents.
WIP - Will be buggy.
Created by a first-year backend programming student in the second semester of a four-semester course, so be gentle.

AI has been used as a sparring partner, learning tool, for suggestions, documentation assistance, and ease of repetition (less writing, more coding).

This program is first and foremost a personal learning project and not intended to be a perfect work of art.
However, suggestions, advice, collabs, etc. are warmly welcome.


Requirements

  • .NET 10 SDK
  • JetBrains Rider (optional; any IDE works)
  • Self‑hosted LibreTranslate (free): run via Docker or Python (pip)

LibreTranslate is a free, open‑source machine translation API designed to be self‑hosted. It exposes simple REST endpoints like POST /translate, POST /detect, and GET /languages. Self‑hosted instances do not require an API key; API keys are only needed for managed/paid instances.

LibreTranslate Docs


Quick Start

dotnet restore   # only if needed
dotnet build
dotnet run --project src/PDFTranslator.Cli -- -i "sample.pdf" -o "output.txt" -t "en" --provider dummy

Start LibreTranslate:
Option A (Docker - Recommended)

docker run -d --name libretranslate -p 5000:5000 libretranslate/libretranslate:latest

To persist downloaded models across restarts (faster startup), remove the container and recreate it with a named volume:

docker rm -f libretranslate

docker run -d --name libretranslate -p 5000:5000 -v lt-data:/home/libretranslate/.local libretranslate/libretranslate:latest

LibreTranslate is designed for self‑hosting and exposes the documented endpoints used by this app.
LibreTranslate GitHub

Option B (Python, pip)

pip install libretranslate
libretranslate
# serves on http://localhost:5000

Verify server
Health or languages (either is fine):

Invoke-RestMethod http://127.0.0.1:5000/health

Invoke-RestMethod http://127.0.0.1:5000/languages

/languages returns the list of supported language codes (e.g., en, nb, …). The translator in this project queries this endpoint to align your requested codes to what the server actually supports.
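
The alignment idea can be sketched in a few lines of Python. This is only an illustration, not the app's C# code: the function name align_language_code, the region-suffix rule, and the "no" to "nb" alias are assumptions. The one thing taken from the API is that GET /languages returns a JSON array of objects that each carry a "code" field.

```python
def align_language_code(requested, languages):
    """Map a requested code onto one the server reports, or return None.

    `languages` is the parsed JSON from GET /languages: a list of
    objects that each have a "code" field.
    """
    # Index by lowercase so the lookup is case-insensitive, but return
    # the code exactly as the server spells it.
    supported = {entry["code"].lower(): entry["code"] for entry in languages}
    wanted = requested.strip().lower()

    # 1. Exact match, e.g. "EN" -> "en".
    if wanted in supported:
        return supported[wanted]

    # 2. Drop a region suffix, e.g. "nb-NO" or "nb_NO" -> "nb".
    base = wanted.replace("_", "-").split("-")[0]
    if base in supported:
        return supported[base]

    # 3. Hypothetical alias table (an assumption for this sketch):
    #    generic Norwegian "no" falls back to Bokmål "nb".
    aliases = {"no": "nb"}
    alias = aliases.get(base)
    if alias in supported:
        return supported[alias]

    return None
```

Returning None rather than guessing lets the caller report an unsupported language instead of sending a request the server will reject.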


How to use

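Run against a local LibreTranslate server. Absolute paths are safe; -o is optional (the app generates non‑overwriting filenames automatically). Adjust the --project path to match where you run the command from (the Quick Start above uses src/PDFTranslator.Cli).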

dotnet run --project PDFTranslator.Cli -- `
  -i "C:\path\to\your.pdf" `
  -t en `
  --provider libre `
  --libre-url "http://127.0.0.1:5000"

Example 1 - Norwegian -> English

dotnet run --project PDFTranslator.Cli -- `
  -i "C:\docs\eksempel.pdf" `
  -t en `
  --provider libre `
  --libre-url "http://127.0.0.1:5000"

Example 2 - English -> Norwegian

dotnet run --project PDFTranslator.Cli -- `
  -i "C:\docs\article.pdf" `
  -t nb `
  --provider libre `
  --libre-url "http://127.0.0.1:5000"

Optional environment variables
You can set these to avoid passing flags on every run:

  • LIBRETRANSLATE_URL: e.g., http://127.0.0.1:5000
  • LIBRETRANSLATE_API_KEY: only if you use a managed/paid instance (not needed for self‑host)

LibreTranslate API Usage
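
As a rough illustration of how a flag, an environment variable, and a built-in default can be combined, here is a Python sketch. The URL order (flag, then LIBRETRANSLATE_URL, then http://localhost:5000) follows the --libre-url description in the CLI options; applying the same order to the API key, the trailing-slash trimming, and the function name resolve_libre_settings are assumptions, not the app's actual code.

```python
import os

DEFAULT_URL = "http://localhost:5000"


def resolve_libre_settings(flag_url=None, flag_key=None, env=None):
    """Return (base_url, api_key) using flag > environment > default."""
    # Allow a plain dict to be injected so the function is easy to test.
    env = os.environ if env is None else env

    url = flag_url or env.get("LIBRETRANSLATE_URL") or DEFAULT_URL
    # Self-hosted servers need no key, so None is a valid result here.
    key = flag_key or env.get("LIBRETRANSLATE_API_KEY") or None

    # Trim a trailing slash so "/translate" can be appended safely.
    return url.rstrip("/"), key
```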

CLI Options

-i, --input      Path to input PDF file (required)
-t, --target     Target language code, e.g., en, nb, de (required)
-o, --output     Output text file (optional; unique name auto-generated if omitted)
--provider       Translation provider: libre (real) or dummy (echo). Default: libre
--libre-url      Base URL for LibreTranslate (defaults to $LIBRETRANSLATE_URL or http://localhost:5000)
--libre-key      API key for LibreTranslate (not required for self-hosted)
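
One way to generate a non-overwriting output name when -o is omitted is sketched below in Python. The naming scheme ("article.nb.txt", then "article.nb (1).txt", and so on) and the function name unique_output_path are hypothetical; the app's real scheme may differ.

```python
from pathlib import Path


def unique_output_path(input_pdf, target_lang, exists=None):
    """Pick an output path next to the PDF that does not exist yet."""
    # `exists` can be injected for testing; by default ask the filesystem.
    exists = exists or (lambda path: path.exists())

    source = Path(input_pdf)
    candidate = source.with_name(f"{source.stem}.{target_lang}.txt")

    counter = 1
    while exists(candidate):
        # Append an increasing counter until a free name is found.
        candidate = source.with_name(f"{source.stem}.{target_lang} ({counter}).txt")
        counter += 1
    return candidate
```

The check and the later file creation are separate steps, so two runs started at the same moment could still pick the same name; for a single-user CLI that is usually acceptable.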

Verifying Server (PowerShell)
PowerShell can mangle JSON bodies if formatting/quoting isn’t careful. These two patterns are reliable:
A) Hashtable → ConvertTo‑Json → UTF‑8 bytes

$payload = @{
  q      = "Hei, hvordan går det?"
  source = "nb"
  target = "en"
  format = "text"
} | ConvertTo-Json -Depth 10

$bytes = [System.Text.Encoding]::UTF8.GetBytes($payload)

Invoke-RestMethod -Uri "http://127.0.0.1:5000/translate" -Method Post -ContentType "application/json; charset=utf-8" -Body $bytes

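B) curl.exe (Command Prompt syntax: ^ continues the line and \" escapes quotes; in PowerShell, prefer pattern A)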

curl.exe -s -X POST "http://127.0.0.1:5000/translate" ^
  -H "Content-Type: application/json; charset=utf-8" ^
  -d "{\"q\":\"Hei, hvordan går det?\",\"source\":\"nb\",\"target\":\"en\",\"format\":\"text\"}"

LibreTranslate’s /translate and /detect JSON contracts are documented; sending the body as a UTF‑8 byte array or using curl avoids PowerShell encoding issues.
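
For comparison, the same request can be built with Python's standard library. This sketch mirrors pattern A (serialize to JSON, then encode to UTF‑8 bytes explicitly). The function names build_translate_request and translate are made up for this example; the payload fields are the ones used above, and the response is read from the translatedText field.

```python
import json
import urllib.request


def build_translate_request(base_url, text, source, target):
    """Build a POST /translate request with an explicit UTF-8 JSON body."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    # ensure_ascii=False keeps characters such as "å" as-is; encoding to
    # bytes here matches the charset declared in the Content-Type header.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/translate",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def translate(base_url, text, source, target, timeout=30):
    """Send the request and return the translated string."""
    request = build_translate_request(base_url, text, source, target)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["translatedText"]
```

With a server running, translate("http://127.0.0.1:5000", "Hei, hvordan går det?", "nb", "en") should return the English text.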


Acknowledgements

This project was developed with assistance from Microsoft M365 Copilot, used primarily to:

  • Generate initial project scaffolding ideas and folder structure.
  • Draft example implementations (e.g., DummyTranslator, PdfPigTextExtractor) in C#.
  • Provide troubleshooting guidance for Rider, .NET SDK project setup, NuGet references, and xUnit configuration.
  • Offer recommendations for architecture, naming, and clean code organization.
  • Suggest documentation text (including this section).

All code has been reviewed, modified, or rewritten by the human developer before inclusion. Copilot was used as a supportive tool — not an autonomous code generator — and responsibility for all design decisions, final code, and validation remains with the project author.

External libraries acknowledged:

  • UglyToad.PdfPig — PDF text extraction
  • xUnit — Unit testing framework
  • Microsoft.Extensions.DependencyInjection — Lightweight dependency injection

🛠️ Troubleshooting

The container is “running” but /languages fails or /health says no
(Hinting towards Little Britain, are we?)
First boot downloads/initializes models and can take a while; check logs (docker logs -f libretranslate) and wait until you see Gunicorn “Listening at: …:5000”. This is expected per self‑host quickstart guidance. [github.com], [docs.libre...nslate.com]
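
If you would rather script the wait than watch the logs, a small polling loop does the job. The Python sketch below is an assumption-laden example, not part of this project: the names http_ok and wait_until_ready, the 30 attempts, and the 2-second delay are all arbitrary choices. It polls /languages because that endpoint only answers once the server is serving requests.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(base_url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll GET /languages until it succeeds or the attempts run out."""
    url = base_url.rstrip("/") + "/languages"
    for attempt in range(attempts):
        if probe(url):
            return True
        # Skip the pause after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```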

Output is still the same language
The app now:

  • Detects the source language from multiple samples and uses the majority.
  • Batches long text (API supports q: string[]). [deepwiki.com]
  • Aligns codes to what the server supports via /languages. [docs.libre...nslate.com]
  • Applies fallback attempts if the first pass looks unchanged.

If it still appears unchanged, verify the PDF actually contains extractable text (not scanned images).
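
The first two steps can be illustrated with a short Python sketch. It is a simplified stand-in for the app's C# logic, not a copy of it: the function names, the paragraph-based splitting, and the 2000-character budget are assumptions. majority_language expects one detected code per sample (for example, the top result of a /detect call per sample), and each batch from batch_paragraphs is a list of strings that could be sent as q.

```python
from collections import Counter


def majority_language(detections):
    """Return the most frequent language code, ignoring empty results."""
    votes = Counter(code for code in detections if code)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def batch_paragraphs(text, max_chars=2000):
    """Group paragraphs into lists whose combined length fits max_chars."""
    batches, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Start a new batch when adding this paragraph would exceed the budget.
        if current and size + len(paragraph) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph)
    if current:
        batches.append(current)
    return batches
```

A single paragraph longer than max_chars still ends up in a batch of its own here; a fuller implementation would split it further, for instance at sentence boundaries.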

Scanned PDFs or image‑only content
PdfPig extracts embedded text; it does not perform OCR. If the output is empty or unchanged for a scanned document, that is the likely cause. An OCR fallback (e.g., Tesseract) is not implemented yet and may come in a future update.

Public instances vs. self‑host
Public/mirror instances can be flaky; the LibreTranslate project explicitly notes that only the official hosted instance offers high availability. Self‑hosting is the best free, always‑on approach. [github.com]


Sources

  • Microsoft Copilot
  • LibreTranslate – Official “API Usage” Guide
  • LibreTranslate – “Get Supported Languages” API Reference
  • LibreTranslate – Translation Endpoints Deep Dive (covers batch q: [])
  • LibreTranslate – Project README / Self‑hosting Overview (GitHub)
  • LibreTranslate – Official Documentation Home
  • Argos Open Tech – Argos Translate (the engine behind LibreTranslate)
  • Argos Translate – Official Documentation
  • LibreTranslate – Supported Languages / Codes (includes Norwegian Bokmål)
  • LibreTranslate – Status / Reliability Notes about Public Instances
