Releases: notoriouslab/doc-cleaner
Releases · notoriouslab/doc-cleaner
v1.2.0 — DXF, PPTX, PPT, DOC support + security hardening
New Format Support
- DXF (
.dxf): Extract text annotations, dimensions, layer names, block attributes from engineering drawings viaezdxf - PPTX (
.pptx): Extract slide text, tables (Markdown pipe tables), and speaker notes viapython-pptx - PPT (
.ppt): Legacy PowerPoint extraction via macOStextutil - DOC (
.doc): Legacy Word extraction via macOStextutil
Breaking Changes
- YAML frontmatter:
source_pathrenamed tosourcePath(camelCase, consistent withpubDate)
Security
- Fix YAML newline injection in frontmatter escaping
- Add entity count limit (50,000) to DXF parser to prevent resource exhaustion
- Add ZIP decompressed size check (500MB) and slide limit (500) to PPTX parser
- Add 60s timeout to all
textutilsubprocess calls
Improvements
- Deduplicate
textutilconversion logic into sharedparsers/_textutil.py - Updated README (zh/en) with new format documentation
- Added CHANGELOG.md
Install optional dependencies
pip install python-pptx # for PPTX
pip install ezdxf # for DXF
# PPT/DOC: macOS textutil built-in, no extra installv1.1.0 — High-quality PDF extraction + Ad cleaning
What's New
High-quality PDF extraction via opendataloader-pdf
doc-cleaner now supports opendataloader-pdf as an optional PDF extraction backend. When installed (with Java 11+), it automatically becomes the primary extractor — producing proper Markdown pipe tables directly from PDFs where PyMuPDF previously failed completely.
- Auto-detected at runtime; falls back to PyMuPDF if not installed
- Tables extracted as pipe tables (
|col1|col2|col3|), no AI needed - Image references and blank line noise automatically cleaned from output
- Many previously "layout-broken" PDFs now classified as native text, skipping AI entirely
# Install (optional)
brew install openjdk@21
pip install opendataloader-pdf
# That's it — doc-cleaner auto-detects and uses itInline ad removal (ad_strip_patterns)
New config option for removing promotional blocks embedded between useful content — complementing the existing tail truncation (ad_truncation_patterns).
{
"ad_strip_patterns": ["※運動賺回饋", "【行銷活動】"]
}| Setting | Behavior | Use case |
|---|---|---|
ad_truncation_patterns |
Truncate everything after first match | End-of-document disclaimers |
ad_strip_patterns |
Remove each matched paragraph | Inline promotional blocks |
Security hardening
glob.escape()prevents special characters in filenames from affecting cleanup- Strip patterns use compile-then-search instead of regex concatenation (ReDoS prevention)
- Both
ad_truncation_patternsandad_strip_patternsvalidated at startup - ODL classifier adds per-page density check to avoid false positives on image-heavy PDFs
- Cutoff pattern join uses non-capturing groups
Other
- Version bump to 1.1.0
- Updated README (zh-TW + en) with new features and install instructions
config.example.jsonupdated withad_strip_patternsfield
v1.0.3 — Groq backend, PII redaction, parser fixes
What's New
Groq backend support (community contribution by @sunnyworm)
- New
--ai groqoption for fast cloud inference via Groq's OpenAI-compatible API - Default model:
meta-llama/llama-4-scout-17b-16e-instruct - Supports text and vision (up to 5 images per request)
User-Agentheader fix for Cloudflare transport fingerprinting (403/1010)
PII redaction (opt-in)
- New
classifiers/pii.pymodule for Taiwan personal data patterns (身分證字號, 電話, email, etc.) - Dual-pass redaction: before AI processing + after output rendering
- Opt-in via
config.json:"pii": {"enabled": true}
Bug Fixes
- DOCX header detection: fixed false negatives on CJK financial tables where date/amount cells (e.g.
2024-01-15,1,234.56) were misclassified as non-header rows - XLSX wide table truncation: added safety net for extremely wide tables where even a single row exceeds the per-sheet character budget
- Code cleanup: moved
import timeto file top, removed extra blank line inpdf.py - Various fixes from expert code review rounds (#1, #2)
Upgrade
pip install -r requirements.txt # adds pikepdf if not presentNo config changes required. PII redaction is off by default.
Full Changelog: v1.0.1...v1.0.3
v1.0.1 — Security Hardening / 安全強化
What's Changed / 變更內容
Security / 安全性
- Ollama SSRF prevention / Ollama SSRF 防護:
hostconfig now whitelisted tolocalhost/127.0.0.1/::1only — prevents accidental data leakage to remote servers via malicious config
Ollama host 設定現僅允許 localhost,防止惡意設定將文件內容洩漏到遠端伺服器 - YAML tag injection fix / YAML 標籤注入修復: double quotes in tags are now escaped in frontmatter output
frontmatter 標籤中的雙引號現在會正確跳脫 - Password length cap / 密碼長度上限:
--passwordCLI argument capped at 1024 characters
CLI 密碼參數上限 1024 字元 - Symlink escape protection / Symlink 逃脫防護:
collect_files()now usesos.path.realpath()and rejects symlinks pointing outside the input directory
目錄掃描加入路徑正規化,拒絕指向輸入目錄外部的 symlink - Tempfile race condition fix / 暫存檔競態修復: DOCX textutil fallback now uses
TemporaryDirectorycontext manager for guaranteed cleanup
DOCX 解析的 textutil 備援改用 TemporaryDirectory,保證自動清理
Upgrading / 升級方式
cd doc-cleaner && git pullNo config changes needed. Drop-in replacement for v1.0.0.
不需要改設定,直接替換 v1.0.0。
Full Changelog: v1.0.0...v1.0.1