A high-performance Python utility that aggregates word counts across multiple .docx files. Unlike standard counters, it is built for very large collections: it streams words through generators so that memory use stays low, and it records specific text "milestones" along the way (e.g., every millionth word).
- Zero-RAM Streaming: Handles billions of words without exhausting memory by yielding one word at a time rather than loading every file into memory at once.
- Milestone Snapshots: Automatically identifies and captures a 10-word searchable snippet every time a global word count milestone (default: 1,000,000 words) is reached.
- Universal Folder Picker: Includes a cross-platform GUI (Windows/macOS) to select directories, with a terminal fallback for headless environments.
- Table Extraction: Unlike basic counters, this script extracts and counts text hidden inside Word Tables.
- Clean CLI Output: Generates a formatted table showing individual file counts and a final global total.
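
The streaming, table-extraction, and milestone features above can be sketched roughly as follows. This is a minimal illustration, not the code in counter.py: the helper names `iter_docx_words` and `count_with_milestones` are hypothetical, and the sketch assumes a snippet is the last `SNIPPET_SIZE` words ending at the milestone word, and that paragraphs are read before tables rather than in document order.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple

MILESTONE_INTERVAL = 1_000_000  # default taken from the configuration table
SNIPPET_SIZE = 10


def iter_docx_words(path: str) -> Iterator[str]:
    """Yield words from body paragraphs, then from table cells (hypothetical helper)."""
    # python-docx is imported lazily so the counting logic below runs without it.
    from docx import Document

    document = Document(path)
    for paragraph in document.paragraphs:
        yield from paragraph.text.split()
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.text.split()


def count_with_milestones(
    words: Iterable[str],
    interval: int = MILESTONE_INTERVAL,
    snippet_size: int = SNIPPET_SIZE,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Count words from any iterable, keeping only a small rolling window in memory."""
    window = deque(maxlen=snippet_size)  # holds at most the last snippet_size words
    milestones: List[Tuple[int, str]] = []
    total = 0
    for word in words:
        total += 1
        window.append(word)
        if total % interval == 0:
            # Assumption: the snippet is the run of words ending at the milestone.
            milestones.append((total, " ".join(window)))
    return total, milestones
```

To keep a single global count across files, the per-file generators could be chained, for example with `itertools.chain.from_iterable(iter_docx_words(p) for p in paths)`. Note that python-docx parses each file when it is opened, so the saving in this sketch comes from never holding the words of all files at once. Merged table cells may also be reported more than once by `row.cells`, which this sketch does not deduplicate.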
- Python 3.x
- Library: `python-docx`
- OS: Windows or macOS (GUI folder picker supported on both).
pip install python-docx
You can customize the script behavior by editing the globals at the top of counter.py:
| Global | Purpose | Default |
|---|---|---|
| `MILESTONE_INTERVAL` | Frequency of snippets (in words). | `1_000_000` |
| `SNIPPET_SIZE` | Length of the searchable string captured (in words). | `10` |
| `COL_WIDTH_FILE` | Adjusts terminal table width for long filenames. | `45` |
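
As an illustration of how these globals might look and how `COL_WIDTH_FILE` could shape the report, here is a small sketch. The `format_row` helper, the truncation with `...`, and the 12-character count column are assumptions made for the example, not details confirmed by counter.py.

```python
MILESTONE_INTERVAL = 1_000_000  # words between snapshots
SNIPPET_SIZE = 10               # words captured per snapshot
COL_WIDTH_FILE = 45             # width of the filename column


def format_row(filename: str, count: int, width: int = COL_WIDTH_FILE) -> str:
    """Format one report line: left-aligned filename, right-aligned count."""
    if len(filename) > width:
        # Assumption: overly long names are shortened so the columns stay aligned.
        filename = filename[: width - 3] + "..."
    return f"{filename:<{width}} | {count:>12,}"
```

Raising `COL_WIDTH_FILE` widens the first column, which avoids truncation when filenames are long.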
- Run the script: `python counter.py`
- A folder selection dialog will appear. Select the folder containing your `.docx` files.
- The script will process files alphabetically, printing a live tally to the terminal.
- View your Milestone Snapshots at the end of the report to see exactly where each million-word mark was hit.
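
The folder selection step, with its terminal fallback, could look something like the sketch below. It assumes the GUI is built on `tkinter` (the README does not say which toolkit is used), and the function name `pick_folder` and its `use_gui` and `ask` parameters are hypothetical, added so the fallback path can be exercised without a display.

```python
from pathlib import Path
from typing import Callable, Optional


def pick_folder(use_gui: bool = True, ask: Callable[[str], str] = input) -> Optional[Path]:
    """Return the chosen folder, or None if the user cancels or enters nothing."""
    if use_gui:
        try:
            import tkinter as tk
            from tkinter import filedialog

            root = tk.Tk()
            root.withdraw()  # hide the empty main window, show only the dialog
            chosen = filedialog.askdirectory(title="Select the folder containing .docx files")
            root.destroy()
            return Path(chosen) if chosen else None
        except Exception:
            # tkinter is missing or no display is available (headless): fall through.
            pass
    typed = ask("Path to folder containing .docx files: ").strip()
    return Path(typed) if typed else None
```

On a headless machine, calling `pick_folder()` would skip the dialog and prompt for a path in the terminal instead.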
- Encrypted Files: The script will skip password-protected `.docx` files as they cannot be read via XML streaming.
- Non-Docx: Only files ending in `.docx` are processed; `.doc` (legacy) or `.rtf` files are ignored.
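
One way to apply both rules before counting is sketched below. It relies on the fact that a regular `.docx` file is a ZIP archive, and assumes that password-protected files fail that check, which is typical for encrypted Office documents but not necessarily how counter.py detects them. The name `find_docx_files` and the case-insensitive extension match are also assumptions.

```python
import zipfile
from pathlib import Path
from typing import List


def find_docx_files(folder: Path) -> List[Path]:
    """Return readable .docx files in the folder, sorted alphabetically by name."""
    candidates = sorted(
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".docx"),
        key=lambda p: p.name.lower(),
    )
    readable: List[Path] = []
    for path in candidates:
        if zipfile.is_zipfile(path):
            readable.append(path)
        else:
            # Not a ZIP container: likely encrypted or corrupt, so skip it.
            print(f"Skipping unreadable file: {path.name}")
    return readable
```

Files with other extensions, such as `.doc` or `.rtf`, never reach the ZIP check, so they are ignored silently rather than reported as skipped.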