Releases · notoriouslab/doc-cleaner

08 Apr 03:28

notoriouslab

v1.2.0

dd74ea9

v1.2.0 — DXF, PPTX, PPT, DOC support + security hardening Latest

Latest

New Format Support

DXF (.dxf): Extract text annotations, dimensions, layer names, block attributes from engineering drawings via ezdxf
PPTX (.pptx): Extract slide text, tables (Markdown pipe tables), and speaker notes via python-pptx
PPT (.ppt): Legacy PowerPoint extraction via macOS textutil
DOC (.doc): Legacy Word extraction via macOS textutil

Breaking Changes

YAML frontmatter: source_path renamed to sourcePath (camelCase, consistent with pubDate)

Security

Fix YAML newline injection in frontmatter escaping
Add entity count limit (50,000) to DXF parser to prevent resource exhaustion
Add ZIP decompressed size check (500MB) and slide limit (500) to PPTX parser
Add 60s timeout to all textutil subprocess calls

Improvements

Deduplicate textutil conversion logic into shared parsers/_textutil.py
Updated README (zh/en) with new format documentation
Added CHANGELOG.md

Install optional dependencies

pip install python-pptx   # for PPTX
pip install ezdxf          # for DXF
# PPT/DOC: macOS textutil built-in, no extra install

Assets 2

29 Mar 10:34

notoriouslab

v1.1.0

1cb6309

v1.1.0 — High-quality PDF extraction + Ad cleaning

What's New

High-quality PDF extraction via opendataloader-pdf

doc-cleaner now supports opendataloader-pdf as an optional PDF extraction backend. When installed (with Java 11+), it automatically becomes the primary extractor — producing proper Markdown pipe tables directly from PDFs where PyMuPDF previously failed completely.

Auto-detected at runtime; falls back to PyMuPDF if not installed
Tables extracted as pipe tables (|col1|col2|col3|), no AI needed
Image references and blank line noise automatically cleaned from output
Many previously "layout-broken" PDFs now classified as native text, skipping AI entirely

# Install (optional)
brew install openjdk@21
pip install opendataloader-pdf
# That's it — doc-cleaner auto-detects and uses it

Inline ad removal (`ad_strip_patterns`)

New config option for removing promotional blocks embedded between useful content — complementing the existing tail truncation (ad_truncation_patterns).

{
  "ad_strip_patterns": ["※運動賺回饋", "【行銷活動】"]
}

Setting	Behavior	Use case
`ad_truncation_patterns`	Truncate everything after first match	End-of-document disclaimers
`ad_strip_patterns`	Remove each matched paragraph	Inline promotional blocks

Security hardening

glob.escape() prevents special characters in filenames from affecting cleanup
Strip patterns use compile-then-search instead of regex concatenation (ReDoS prevention)
Both ad_truncation_patterns and ad_strip_patterns validated at startup
ODL classifier adds per-page density check to avoid false positives on image-heavy PDFs
Cutoff pattern join uses non-capturing groups

Other

Version bump to 1.1.0
Updated README (zh-TW + en) with new features and install instructions
config.example.json updated with ad_strip_patterns field

Assets 2

14 Mar 09:48

notoriouslab

v1.0.3

fe2353b

v1.0.3 — Groq backend, PII redaction, parser fixes

What's New

Groq backend support (community contribution by @sunnyworm)

New --ai groq option for fast cloud inference via Groq's OpenAI-compatible API
Default model: meta-llama/llama-4-scout-17b-16e-instruct
Supports text and vision (up to 5 images per request)
User-Agent header fix for Cloudflare transport fingerprinting (403/1010)

PII redaction (opt-in)

New classifiers/pii.py module for Taiwan personal data patterns (身分證字號, 電話, email, etc.)
Dual-pass redaction: before AI processing + after output rendering
Opt-in via config.json: "pii": {"enabled": true}

Bug Fixes

DOCX header detection: fixed false negatives on CJK financial tables where date/amount cells (e.g. 2024-01-15, 1,234.56) were misclassified as non-header rows
XLSX wide table truncation: added safety net for extremely wide tables where even a single row exceeds the per-sheet character budget
Code cleanup: moved import time to file top, removed extra blank line in pdf.py
Various fixes from expert code review rounds (#1, #2)

Upgrade

pip install -r requirements.txt  # adds pikepdf if not present

No config changes required. PII redaction is off by default.

Full Changelog: v1.0.1...v1.0.3

Contributors

sunnyworm

Assets 2

10 Mar 05:01

notoriouslab

v1.0.1

1b635ce

v1.0.1 — Security Hardening / 安全強化

What's Changed / 變更內容

Security / 安全性

Ollama SSRF prevention / Ollama SSRF 防護: host config now whitelisted to localhost / 127.0.0.1 / ::1 only — prevents accidental data leakage to remote servers via malicious config
Ollama host 設定現僅允許 localhost，防止惡意設定將文件內容洩漏到遠端伺服器
YAML tag injection fix / YAML 標籤注入修復: double quotes in tags are now escaped in frontmatter output
frontmatter 標籤中的雙引號現在會正確跳脫
Password length cap / 密碼長度上限: --password CLI argument capped at 1024 characters
CLI 密碼參數上限 1024 字元
Symlink escape protection / Symlink 逃脫防護: collect_files() now uses os.path.realpath() and rejects symlinks pointing outside the input directory
目錄掃描加入路徑正規化，拒絕指向輸入目錄外部的 symlink
Tempfile race condition fix / 暫存檔競態修復: DOCX textutil fallback now uses TemporaryDirectory context manager for guaranteed cleanup
DOCX 解析的 textutil 備援改用 TemporaryDirectory，保證自動清理

Upgrading / 升級方式

cd doc-cleaner && git pull

No config changes needed. Drop-in replacement for v1.0.0.
不需要改設定，直接替換 v1.0.0。

Full Changelog: v1.0.0...v1.0.1

Assets 2

Releases: notoriouslab/doc-cleaner

v1.2.0 — DXF, PPTX, PPT, DOC support + security hardening

New Format Support

Breaking Changes

Security

Improvements

Install optional dependencies

Uh oh!

v1.1.0 — High-quality PDF extraction + Ad cleaning

What's New

High-quality PDF extraction via opendataloader-pdf

Inline ad removal (ad_strip_patterns)

Security hardening

Other

Uh oh!

v1.0.3 — Groq backend, PII redaction, parser fixes

What's New

Groq backend support (community contribution by @sunnyworm)

PII redaction (opt-in)

Bug Fixes

Upgrade

Contributors

Uh oh!

v1.0.1 — Security Hardening / 安全強化

What's Changed / 變更內容

Security / 安全性

Upgrading / 升級方式

Uh oh!

Inline ad removal (`ad_strip_patterns`)