Skip to content

Releases: notoriouslab/doc-cleaner

v1.2.0 — DXF, PPTX, PPT, DOC support + security hardening

08 Apr 03:28

Choose a tag to compare

New Format Support

  • DXF (.dxf): Extract text annotations, dimensions, layer names, block attributes from engineering drawings via ezdxf
  • PPTX (.pptx): Extract slide text, tables (Markdown pipe tables), and speaker notes via python-pptx
  • PPT (.ppt): Legacy PowerPoint extraction via macOS textutil
  • DOC (.doc): Legacy Word extraction via macOS textutil

Breaking Changes

  • YAML frontmatter: source_path renamed to sourcePath (camelCase, consistent with pubDate)

Security

  • Fix YAML newline injection in frontmatter escaping
  • Add entity count limit (50,000) to DXF parser to prevent resource exhaustion
  • Add ZIP decompressed size check (500MB) and slide limit (500) to PPTX parser
  • Add 60s timeout to all textutil subprocess calls

Improvements

  • Deduplicate textutil conversion logic into shared parsers/_textutil.py
  • Updated README (zh/en) with new format documentation
  • Added CHANGELOG.md

Install optional dependencies

pip install python-pptx   # for PPTX
pip install ezdxf          # for DXF
# PPT/DOC: macOS textutil built-in, no extra install

v1.1.0 — High-quality PDF extraction + Ad cleaning

29 Mar 10:34

Choose a tag to compare

What's New

High-quality PDF extraction via opendataloader-pdf

doc-cleaner now supports opendataloader-pdf as an optional PDF extraction backend. When installed (with Java 11+), it automatically becomes the primary extractor — producing proper Markdown pipe tables directly from PDFs where PyMuPDF previously failed completely.

  • Auto-detected at runtime; falls back to PyMuPDF if not installed
  • Tables extracted as pipe tables (|col1|col2|col3|), no AI needed
  • Image references and blank line noise automatically cleaned from output
  • Many previously "layout-broken" PDFs now classified as native text, skipping AI entirely
# Install (optional)
brew install openjdk@21
pip install opendataloader-pdf
# That's it — doc-cleaner auto-detects and uses it

Inline ad removal (ad_strip_patterns)

New config option for removing promotional blocks embedded between useful content — complementing the existing tail truncation (ad_truncation_patterns).

{
  "ad_strip_patterns": ["※運動賺回饋", "【行銷活動】"]
}
Setting Behavior Use case
ad_truncation_patterns Truncate everything after first match End-of-document disclaimers
ad_strip_patterns Remove each matched paragraph Inline promotional blocks

Security hardening

  • glob.escape() prevents special characters in filenames from affecting cleanup
  • Strip patterns use compile-then-search instead of regex concatenation (ReDoS prevention)
  • Both ad_truncation_patterns and ad_strip_patterns validated at startup
  • ODL classifier adds per-page density check to avoid false positives on image-heavy PDFs
  • Cutoff pattern join uses non-capturing groups

Other

  • Version bump to 1.1.0
  • Updated README (zh-TW + en) with new features and install instructions
  • config.example.json updated with ad_strip_patterns field

v1.0.3 — Groq backend, PII redaction, parser fixes

14 Mar 09:48

Choose a tag to compare

What's New

Groq backend support (community contribution by @sunnyworm)

  • New --ai groq option for fast cloud inference via Groq's OpenAI-compatible API
  • Default model: meta-llama/llama-4-scout-17b-16e-instruct
  • Supports text and vision (up to 5 images per request)
  • User-Agent header fix for Cloudflare transport fingerprinting (403/1010)

PII redaction (opt-in)

  • New classifiers/pii.py module for Taiwan personal data patterns (身分證字號, 電話, email, etc.)
  • Dual-pass redaction: before AI processing + after output rendering
  • Opt-in via config.json: "pii": {"enabled": true}

Bug Fixes

  • DOCX header detection: fixed false negatives on CJK financial tables where date/amount cells (e.g. 2024-01-15, 1,234.56) were misclassified as non-header rows
  • XLSX wide table truncation: added safety net for extremely wide tables where even a single row exceeds the per-sheet character budget
  • Code cleanup: moved import time to file top, removed extra blank line in pdf.py
  • Various fixes from expert code review rounds (#1, #2)

Upgrade

pip install -r requirements.txt  # adds pikepdf if not present

No config changes required. PII redaction is off by default.

Full Changelog: v1.0.1...v1.0.3

v1.0.1 — Security Hardening / 安全強化

10 Mar 05:01

Choose a tag to compare

What's Changed / 變更內容

Security / 安全性

  • Ollama SSRF prevention / Ollama SSRF 防護: host config now whitelisted to localhost / 127.0.0.1 / ::1 only — prevents accidental data leakage to remote servers via malicious config
    Ollama host 設定現僅允許 localhost,防止惡意設定將文件內容洩漏到遠端伺服器
  • YAML tag injection fix / YAML 標籤注入修復: double quotes in tags are now escaped in frontmatter output
    frontmatter 標籤中的雙引號現在會正確跳脫
  • Password length cap / 密碼長度上限: --password CLI argument capped at 1024 characters
    CLI 密碼參數上限 1024 字元
  • Symlink escape protection / Symlink 逃脫防護: collect_files() now uses os.path.realpath() and rejects symlinks pointing outside the input directory
    目錄掃描加入路徑正規化,拒絕指向輸入目錄外部的 symlink
  • Tempfile race condition fix / 暫存檔競態修復: DOCX textutil fallback now uses TemporaryDirectory context manager for guaranteed cleanup
    DOCX 解析的 textutil 備援改用 TemporaryDirectory,保證自動清理

Upgrading / 升級方式

cd doc-cleaner && git pull

No config changes needed. Drop-in replacement for v1.0.0.
不需要改設定,直接替換 v1.0.0。

Full Changelog: v1.0.0...v1.0.1