
zromick/FolderWordCounter

Folder Word Counter

A Python utility that aggregates word counts across multiple .docx files. Unlike standard counters, it is built for very large document collections: it streams words through generators to keep memory usage low, and it records text "milestones" (e.g., every millionth word) along the way.

🚀 Key Features

  • Zero-RAM Streaming: Handles very large document sets by yielding words one at a time through generators, rather than loading every file's text into memory at once.
  • Milestone Snapshots: Automatically identifies and captures a 10-word searchable snippet every time a global word count milestone (default: 1,000,000 words) is reached.
  • Universal Folder Picker: Includes a cross-platform GUI (Windows/macOS) to select directories, with a terminal fallback for headless environments.
  • Table Extraction: Unlike basic counters, this script also extracts and counts text inside Word tables.
  • Clean CLI Output: Generates a formatted table showing individual file counts and a final global total.
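The streaming and table-extraction features above could be sketched roughly as follows. This is an illustration, not the code in counter.py: the function names are hypothetical, and it assumes the standard python-docx objects (`Document`, `paragraphs`, `tables`, `rows`, `cells`).

```python
from typing import Iterable, Iterator


def iter_document_words(doc) -> Iterator[str]:
    """Yield words from a python-docx Document-like object, one at a time.

    Body paragraphs are visited first, then every cell of every
    top-level table, so text inside tables is counted too.
    """
    for paragraph in doc.paragraphs:
        yield from paragraph.text.split()
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.text.split()


def iter_docx_words(path: str) -> Iterator[str]:
    """Open one .docx file with python-docx and stream its words."""
    from docx import Document  # lazy import; requires `pip install python-docx`

    yield from iter_document_words(Document(path))


def count_words(words: Iterable[str]) -> int:
    """Consume a word stream without building a list of all words."""
    return sum(1 for _ in words)
```

In this sketch python-docx still parses each file as a whole, so only one document is held in memory at a time while the words are yielded lazily. It also visits paragraphs before tables, which simplifies the code but does not preserve the original reading order.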

🛠️ Requirements

  • Python 3.x
  • Library: python-docx
  • OS: Windows or macOS (GUI folder picker supported on both).
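A folder picker with a terminal fallback might look like the sketch below. It assumes the GUI is built on the standard-library tkinter module, which the README does not confirm; `pick_folder` and its parameters are hypothetical names.

```python
from typing import Callable, Optional


def _gui_pick() -> str:
    """Show a native folder dialog; returns '' if the user cancels."""
    import tkinter
    from tkinter import filedialog

    root = tkinter.Tk()  # raises tkinter.TclError when no display is available
    root.withdraw()      # hide the empty main window
    try:
        return filedialog.askdirectory(title="Select the folder containing .docx files")
    finally:
        root.destroy()


def pick_folder(use_gui: bool = True,
                ask: Callable[[str], str] = input) -> Optional[str]:
    """Return the chosen folder path, or None if nothing was chosen."""
    if use_gui:
        try:
            return _gui_pick() or None
        except Exception:
            # ImportError (no tkinter) or TclError (headless): use the terminal.
            pass
    typed = ask("Folder path: ").strip()
    return typed or None
```

Passing `use_gui=False` forces the terminal prompt, which is convenient on servers or in automated runs.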

Installation

pip install python-docx

💻 Configuration

You can customize the script behavior by editing the globals at the top of counter.py:

| Global | Purpose | Default |
|---|---|---|
| MILESTONE_INTERVAL | Frequency of snippets (in words). | 1_000_000 |
| SNIPPET_SIZE | Length of the searchable string captured (in words). | 10 |
| COL_WIDTH_FILE | Adjusts terminal table width for long filenames. | 45 |
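To show how MILESTONE_INTERVAL and SNIPPET_SIZE interact, here is a minimal sketch of milestone tracking over any stream of words. The function name is hypothetical, and it assumes the snippet is the run of words ending at the milestone word; the actual script may capture the words that follow instead.

```python
from collections import deque
from typing import Iterable, List, Tuple

MILESTONE_INTERVAL = 1_000_000  # default from the table above
SNIPPET_SIZE = 10               # default from the table above


def find_milestones(
    words: Iterable[str],
    interval: int = MILESTONE_INTERVAL,
    snippet_size: int = SNIPPET_SIZE,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Return (total_words, [(word_count, snippet), ...])."""
    recent = deque(maxlen=snippet_size)  # only the last few words are kept
    milestones = []
    total = 0
    for word in words:
        total += 1
        recent.append(word)
        if total % interval == 0:
            # Assumption: the snippet ends at the milestone word itself.
            milestones.append((total, " ".join(recent)))
    return total, milestones
```

Because the deque is bounded, memory use depends on SNIPPET_SIZE and the number of milestones, not on the total number of words processed.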

📖 Usage

  1. Run the script:
python counter.py
  2. A folder selection dialog will appear. Select the folder containing your .docx files.
  3. The script will process files alphabetically, printing a live tally to the terminal.
  4. View your Milestone Snapshots at the end of the report to see exactly where each million-word mark was hit.
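The tally printed during a run could be formatted along these lines. This is a sketch only: the layout, the count-column width, and the function names are assumptions, and only COL_WIDTH_FILE comes from the Configuration section.

```python
COL_WIDTH_FILE = 45   # default from the Configuration section
COL_WIDTH_COUNT = 12  # hypothetical width for the count column


def format_row(name: str, count: int, width: int = COL_WIDTH_FILE) -> str:
    """Left-align the filename, right-align the count with separators."""
    if len(name) > width:
        name = name[: width - 3] + "..."  # truncate long filenames
    return f"{name:<{width}} {count:>{COL_WIDTH_COUNT},}"


def format_report(counts: dict) -> str:
    """Build the full table: header, one row per file, and a global total."""
    rule = "-" * (COL_WIDTH_FILE + 1 + COL_WIDTH_COUNT)
    lines = [f"{'File':<{COL_WIDTH_FILE}} {'Words':>{COL_WIDTH_COUNT}}", rule]
    for name in sorted(counts):  # alphabetical order
        lines.append(format_row(name, counts[name]))
    lines.append(rule)
    lines.append(format_row("TOTAL", sum(counts.values())))
    return "\n".join(lines)
```

For example, `print(format_report({"b.docx": 2, "a.docx": 1}))` lists a.docx before b.docx and ends with a TOTAL row.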

⚠️ Limitations

  • Encrypted Files: The script will skip password-protected .docx files as they cannot be read via XML streaming.
  • Non-Docx: Only files ending in .docx are processed; legacy .doc and .rtf files are ignored.
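Both limitations can be illustrated with a short sketch of file discovery and error handling. The names are hypothetical, `count_file` stands in for whatever function counts the words of one document, and matching the extension case-insensitively is an assumption.

```python
import os
from typing import Callable, Dict, List, Tuple


def list_docx_files(folder: str) -> List[str]:
    """Return the .docx filenames in a folder, sorted alphabetically."""
    names = [n for n in os.listdir(folder) if n.lower().endswith(".docx")]
    return sorted(names, key=str.lower)


def count_folder(
    folder: str, count_file: Callable[[str], int]
) -> Tuple[Dict[str, int], List[str]]:
    """Count each file; collect the names of files that could not be read."""
    counts: Dict[str, int] = {}
    skipped: List[str] = []
    for name in list_docx_files(folder):
        try:
            counts[name] = count_file(os.path.join(folder, name))
        except Exception:
            # e.g. a password-protected file that the reader cannot open
            skipped.append(name)
    return counts, skipped
```

Reporting the skipped names at the end of a run makes it clear which files were left out of the total.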
