Process of progress


Current version

(Generated by AI)

🔧 Code changes — Infrastructure (PDFTranslator.Infrastructure)

Added real translator: LibreTranslateTranslator

Uses HttpClient to call LibreTranslate’s REST API (POST /translate) with JSON. Purpose: enable free, real machine translation (self‑hosted).
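A minimal sketch of what such a call could look like. The JSON field names (q, source, target, format, translatedText) follow LibreTranslate's documented API; the type and member names (TranslateRequest, TranslateResponse, LibreTranslateSketch) are hypothetical and not taken from the actual LibreTranslateTranslator.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json.Serialization;
using System.Threading;
using System.Threading.Tasks;

// Request body for POST /translate (hypothetical type name).
public sealed record TranslateRequest(
    [property: JsonPropertyName("q")] string Q,
    [property: JsonPropertyName("source")] string Source,
    [property: JsonPropertyName("target")] string Target,
    [property: JsonPropertyName("format")] string Format = "text");

// Response body when q is a single string (hypothetical type name).
public sealed record TranslateResponse(
    [property: JsonPropertyName("translatedText")] string TranslatedText);

public sealed class LibreTranslateSketch
{
    private readonly HttpClient _http;

    // The HttpClient is expected to carry the BaseAddress of the LibreTranslate instance.
    public LibreTranslateSketch(HttpClient http) => _http = http;

    public async Task<string> TranslateAsync(
        string text, string source, string target, CancellationToken ct = default)
    {
        using var response = await _http.PostAsJsonAsync(
            "translate", new TranslateRequest(text, source, target), ct);

        // Surface non-2xx responses with the server's error body.
        if (!response.IsSuccessStatusCode)
        {
            var body = await response.Content.ReadAsStringAsync(ct);
            throw new HttpRequestException(
                $"LibreTranslate returned {(int)response.StatusCode}: {body}");
        }

        var payload = await response.Content.ReadFromJsonAsync<TranslateResponse>(cancellationToken: ct);
        return payload?.TranslatedText
            ?? throw new InvalidOperationException("Response had no translatedText.");
    }
}
```

For a local instance, the client could be created as `new HttpClient { BaseAddress = new Uri("http://localhost:5000/"), Timeout = TimeSpan.FromSeconds(20) }`; the trailing slash matters so that the relative path resolves to /translate.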

Detect‑then‑translate logic

Calls POST /detect first to get the source language, then translates with explicit source → avoids "auto" no‑ops. Purpose: stop cases where long/mixed text leads to “unchanged” output.

Language code alignment

Queries GET /languages and normalizes codes (e.g., nb/no) to what the server actually supports. Purpose: guarantee the pair we send is valid on the running instance.
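The normalization step could look roughly like the sketch below. It assumes the set of supported codes has already been read from GET /languages; the alias table and the names LanguageCodes and Resolve are hypothetical, and only the nb/no pair comes from the notes above.

```csharp
using System;
using System.Collections.Generic;

public static class LanguageCodes
{
    // Hypothetical alias table: codes that may stand in for each other, in order of preference.
    private static readonly Dictionary<string, string[]> Aliases =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["nb"] = new[] { "nb", "no" },
            ["no"] = new[] { "no", "nb" },
        };

    // Returns the first alias of `requested` that the server supports, or null when none match.
    public static string? Resolve(string requested, IReadOnlySet<string> supported)
    {
        var code = requested.Trim().ToLowerInvariant();
        var candidates = Aliases.TryGetValue(code, out var alternatives)
            ? alternatives
            : new[] { code };

        foreach (var candidate in candidates)
        {
            if (supported.Contains(candidate)) return candidate;
        }
        return null;
    }
}
```

Returning null rather than the unmodified code lets the caller fail early with a clear message instead of sending a pair the server would reject.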

Batch translation + chunking

Splits large text into paragraph‑aware chunks and posts q: string[] in one request; stitches results back together. Purpose: improve translation reliability and throughput on long documents.
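One way to do paragraph-aware chunking, as a sketch only: it assumes paragraphs are separated by blank lines and that a character budget per chunk is a good enough size limit. The names Chunker, SplitByParagraph and Stitch are hypothetical; the real splitter may use different rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Chunker
{
    // Groups blank-line-separated paragraphs into chunks of at most maxChars.
    // A paragraph is never split; one that exceeds maxChars becomes its own chunk.
    public static List<string> SplitByParagraph(string text, int maxChars)
    {
        var chunks = new List<string>();
        var current = new StringBuilder();
        var paragraphs = text.Replace("\r\n", "\n")
                             .Split("\n\n", StringSplitOptions.RemoveEmptyEntries);

        foreach (var paragraph in paragraphs)
        {
            // +2 accounts for the blank-line separator re-inserted between paragraphs.
            if (current.Length > 0 && current.Length + 2 + paragraph.Length > maxChars)
            {
                chunks.Add(current.ToString());
                current.Clear();
            }
            if (current.Length > 0) current.Append("\n\n");
            current.Append(paragraph);
        }

        if (current.Length > 0) chunks.Add(current.ToString());
        return chunks;
    }

    // Rejoins translated chunks with the same blank-line separator.
    public static string Stitch(IEnumerable<string> translatedChunks) =>
        string.Join("\n\n", translatedChunks);
}
```

The resulting list maps directly onto the q: string[] field of a single /translate request, and Stitch reverses the split once the translated array comes back in the same order.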

Fallbacks & resilience

If the first pass returns text equal to the input (lenient comparison), retries once with a stricter source (the detected language); if the output is still unchanged, retries with a forced Norwegian source (nb, then no, if supported). Adds a 20s HTTP timeout and surfaces clear errors for non‑2xx responses.
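The retry ladder can be sketched as below. What "lenient comparison" means in the real code is not stated here, so this version assumes it ignores case and whitespace; UnchangedFallback, LooksUnchanged and TranslateWithFallbackAsync are hypothetical names, and the translate call is passed in as a delegate to keep the sketch independent of HTTP.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class UnchangedFallback
{
    // Lenient equality (assumption): ignore case and all whitespace differences.
    public static bool LooksUnchanged(string input, string output)
    {
        static string Normalize(string s) =>
            new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToLowerInvariant();

        return Normalize(input) == Normalize(output);
    }

    // Tries each source candidate in order and returns the first result that
    // differs from the input. If every attempt is unchanged, returns the last one.
    public static async Task<string> TranslateWithFallbackAsync(
        string text,
        IEnumerable<string> sourceCandidates,
        Func<string, Task<string>> translateWithSource)
    {
        var last = text;
        foreach (var source in sourceCandidates)
        {
            last = await translateWithSource(source);
            if (!LooksUnchanged(text, last)) return last;
        }
        return last;
    }
}
```

With this shape, the order described above becomes the candidate list, for example the first-pass source, then the detected code, then nb and no when the server supports them.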


🖥️ Code changes — CLI (PDFTranslator.Cli)

Provider wiring

Added --provider libre (kept dummy) and resolved base URL/API key from flags or env vars. Defaults designed for local self‑host (--libre-url http://localhost:5000) so it stays free.

Output safety

Made -o/--output optional; if omitted, auto‑generates a unique path: .translated..txt (appends (2), (3)… if needed). Purpose: hands‑free runs without overwriting.
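The non-overwriting part can be sketched independently of how the base file name is built. The helper below is hypothetical (OutputPaths.MakeUnique is not a name from the codebase), and the exact suffix format, a space followed by the counter in parentheses, is an assumption.

```csharp
using System;
using System.IO;

public static class OutputPaths
{
    // Returns desiredPath if nothing exists there; otherwise inserts " (2)", " (3)", ...
    // before the extension until a free name is found.
    // `exists` is injectable for testing and defaults to File.Exists.
    public static string MakeUnique(string desiredPath, Func<string, bool>? exists = null)
    {
        exists ??= path => File.Exists(path);
        if (!exists(desiredPath)) return desiredPath;

        var directory = Path.GetDirectoryName(desiredPath) ?? string.Empty;
        var stem = Path.GetFileNameWithoutExtension(desiredPath);
        var extension = Path.GetExtension(desiredPath);

        for (var n = 2; ; n++)
        {
            var candidate = Path.Combine(directory, $"{stem} ({n}){extension}");
            if (!exists(candidate)) return candidate;
        }
    }
}
```

Checking for existence and then writing is not atomic, so two runs started at the same instant could still pick the same name; for a single-user CLI that is usually acceptable.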

Source language info (non‑blocking)

After PDF extraction, prints a detected source (informational only) so you can see what the server thinks; the text is always translated to the target you requested. Purpose: transparency without accidental “same→same” skips.


🧭 Runtime / DevOps setup (so it stays free)

Self‑hosted LibreTranslate container

Guided you to run libretranslate/libretranslate locally and wait for initial model load. Added notes on first‑run model downloads, health checks, and persisting models via a Docker volume to speed up restarts.

PowerShell API testing fixes

Showed reliable ways to call /translate in PowerShell (e.g., using UTF‑8 byte arrays or curl.exe) to avoid malformed JSON bodies.


🎯 Resulting behavior (user‑visible)

Run the CLI with --provider libre and your local URL:

dotnet run --project PDFTranslator.Cli -- -i "<your.pdf>" -t en --provider libre --libre-url "http://localhost:5000"

The app now:

  • Extracts text from the PDF.
  • Detects source robustly (multi‑snippet voting).
  • Aligns source/target to the server’s supported language codes.
  • Translates in batch for long text (reliable and faster).
  • Writes to a non‑overwriting output file next to the input.

Confirmed: Norwegian → English works on your local instance.
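The multi‑snippet voting mentioned in the list above could work along these lines. This is a sketch under assumptions: each snippet is sent to /detect separately, the top result per snippet is kept as a (language, confidence) pair, and votes are weighted by summed confidence. The names DetectionVoting and PickLanguage are hypothetical, and the real implementation may count votes differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DetectionVoting
{
    // Picks the language with the highest summed confidence across per-snippet detections.
    // Returns null when there are no votes.
    public static string? PickLanguage(IEnumerable<(string Language, double Confidence)> votes) =>
        votes.GroupBy(v => v.Language, StringComparer.OrdinalIgnoreCase)
             .OrderByDescending(g => g.Sum(v => v.Confidence))
             .Select(g => g.Key)
             .FirstOrDefault();
}
```

Summing confidence means one confident outlier snippet (say, an English abstract in a Norwegian document) is outvoted by several moderately confident snippets from the body text.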


📌 What we did not change

Your Core interfaces are unchanged (ITextExtractor, ITranslator). Existing tests still pass; you can add more when ready (e.g., tests that simulate translator behavior).



Previous version

(Generated by AI)

🎯 Goal

Build a clean .NET 10 solution in Rider with:

  • A CLI tool that extracts text from a PDF
  • A pluggable translation layer (currently a dummy pass-through)
  • Unit tests (xUnit) to verify behavior
  • Proper project references and NuGet dependencies

Why: This gives you a modular, testable codebase that’s easy to extend (e.g., real translation provider, output-to-PDF, better CLI parsing).


🧱 Solution Structure (and Why)

Projects:

PDFTranslator.Core — Interfaces (ITextExtractor, ITranslator) Why: Encapsulates contracts to keep implementations swappable and testable.

PDFTranslator.Infrastructure — Implementations (PdfPigTextExtractor, DummyTranslator) Why: Keeps dependencies (PdfPig and future translation SDKs) out of Core.

PDFTranslator.Cli — Console app (wires DI + argument parsing) Why: A thin entry point that composes services and handles I/O.

PDFTranslator.Tests — xUnit tests Why: Validates behavior (now and as you extend functionality).

(There was also a TranslatorApp project in the solution; we left it in place because it is harmless, but it can be removed to keep things tidy.)


🛠️ Project Creation (Rider-first)

Created the solution and four .NET SDK projects (not Rider’s C#-only model) so each has a .csproj. Why: SDK projects support NuGet, MSBuild, dotnet CLI, CI, and test discovery.

Verified Target Framework for all projects: net10.0. Why: Ensures consistency across compile/runtime and Rider’s analyzers.

Added project references:

  • Infrastructure → Core
  • CLI → Infrastructure
  • Tests → Core & Infrastructure

Why: Establishes correct build-time dependencies.


📦 NuGet Packages (and Why)

Infrastructure: UglyToad.PdfPig Why: Simple, MIT-licensed PDF text extraction.

CLI: Microsoft.Extensions.DependencyInjection Why: Minimal DI to wire interfaces to implementations without a heavy framework.

Tests: xunit, xunit.runner.visualstudio, Microsoft.NET.Test.Sdk, coverlet.collector Why: xUnit is lightweight with great Rider/CLI integration; SDK + runner enable discovery; coverlet is ready for coverage.


🧩 Core Contracts

  • ITextExtractor — string ExtractText(string pdfPath)
  • ITranslator — Task<string> TranslateAsync(string text, string targetLanguage, CancellationToken)

Why: Clean separation of concerns; easy to mock in tests; makes swapping providers trivial (e.g., Azure Translator later).
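Written out as C#, the two contracts could look like the sketch below. It assumes TranslateAsync returns Task<string>, since the translated text has to come back to the caller; the parameter name cancellationToken is also an assumption. EchoTranslator is a hypothetical pass-through in the spirit of DummyTranslator, included only to show how small a swappable implementation can be.

```csharp
using System.Threading;
using System.Threading.Tasks;

public interface ITextExtractor
{
    // Returns the plain text of the PDF at pdfPath.
    string ExtractText(string pdfPath);
}

public interface ITranslator
{
    // Returns the text translated into targetLanguage (e.g. "en").
    Task<string> TranslateAsync(string text, string targetLanguage, CancellationToken cancellationToken);
}

// Hypothetical pass-through implementation: returns the input unchanged.
public sealed class EchoTranslator : ITranslator
{
    public Task<string> TranslateAsync(string text, string targetLanguage, CancellationToken cancellationToken)
        => Task.FromResult(text);
}
```

Because callers depend only on ITranslator, a test can pass EchoTranslator (or any other fake) wherever a real provider would go.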


🧪 Implementations

PdfPigTextExtractor

Validates path, opens PDF once, iterates pages, aggregates text with spacing. Why: Robust text extraction with readable output.

DummyTranslator

Echoes input text. Why: Enables end-to-end prototype without external API keys.


🖥️ CLI Wiring

Manual argument parsing for:

-i/--input, -o/--output, -t/--target, --provider

DI container registers ITextExtractor and ITranslator (provider selectable; dummy for now). Writes output to file, creating the directory if needed. Help text prints with literal < and > characters (as you prefer).

Why: Keeps dependencies minimal now; we can upgrade to a richer parser later.
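Manual parsing of those four options needs little more than a loop and a switch. The sketch below is illustrative only: CliOptions, ArgParser and the default values for target and provider are hypothetical, and the real CLI may handle errors and help text differently.

```csharp
using System;

public sealed record CliOptions(string? Input, string? Output, string Target, string Provider);

public static class ArgParser
{
    public static CliOptions Parse(string[] args)
    {
        string? input = null, output = null;
        string target = "en", provider = "dummy"; // hypothetical defaults

        for (var i = 0; i < args.Length; i++)
        {
            switch (args[i])
            {
                case "-i": case "--input": input = Next(args, ref i); break;
                case "-o": case "--output": output = Next(args, ref i); break;
                case "-t": case "--target": target = Next(args, ref i); break;
                case "--provider": provider = Next(args, ref i); break;
                default: throw new ArgumentException($"Unknown argument: {args[i]}");
            }
        }
        return new CliOptions(input, output, target, provider);
    }

    // Consumes and returns the value that follows the flag at index i.
    private static string Next(string[] args, ref int i)
    {
        if (i + 1 >= args.Length) throw new ArgumentException($"Missing value for {args[i]}");
        return args[++i];
    }
}
```

A null Input can then be reported as a usage error by the caller.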


✅ Build & Run

  • Build: dotnet build
  • Run prototype:
dotnet run --project PDFTranslator.Cli -- -i "sample.pdf" -o "output.txt" -t "en" --provider dummy
  • Rider: Uses an auto-created run configuration; you only needed to set Program Arguments and (optionally) Working Directory.

Why: Fast feedback loop from Rider or CLI.


🧪 Tests

Added: DummyTranslatorTests (xUnit) → verifies echo behavior. Outcome: Tests discovered and passed:

total: 1, failed: 0, succeeded: 1, skipped: 0

Why: Ensures the test harness and project references are correctly wired.


🧯 Troubleshooting We Solved

No .csproj visible in Rider

Root cause: The earlier projects were Rider C#-only projects rather than SDK projects, compounded by visibility settings. Fix: Created SDK projects (with .csproj), moved files, and ensured visibility (“Edit Project File”).

xUnit not found errors (Xunit, [Fact])

Root cause: Missing xUnit packages in test project. Fix: Installed xunit, Microsoft.NET.Test.Sdk, xunit.runner.visualstudio, coverlet.collector.

Ambiguous Assert (NUnit vs xUnit) & missing NUnit types

Root cause: Test project pulled in NUnit via global usings and template UnitTest1.cs. Fix: Removed NUnit global using from .csproj, uninstalled NUnit packages, deleted UnitTest1.cs, cleaned bin/obj, rebuilt.

Why this matters: Ensures a clean, unambiguous xUnit-only test setup moving forward.


🧹 Optional Housekeeping (nice-to-have)

Remove TranslatorApp if not needed:

dotnet sln remove TranslatorApp/TranslatorApp.csproj

Add a .gitignore (build artifacts, IDE folders) and .editorconfig (code style). Why: Keeps the repo clean and consistent.


🚀 Ready Next Steps (pick one and I’ll implement with line-by-line explanations)

Real translation provider (--provider azure)

  • Add AzureCognitiveTranslator using HttpClient
  • Read AZURE_TRANSLATOR_KEY, AZURE_TRANSLATOR_ENDPOINT, and (if required) AZURE_TRANSLATOR_REGION from env vars
  • Validate inputs and return translated text

Write translated text back to a PDF

  • Add --out-format pdf|txt
  • Implement a simple PDF writer (e.g., QuestPDF) that paginates text neatly

Better CLI UX

  • Adopt System.CommandLine or Spectre.Console.Cli
  • Strong validation, nicer help, future subcommands

CI/CD

  • GitHub Actions or GitLab CI: restore, build, test on push/PR
  • Optional artifact publishing