docx2everything

Convert DOCX files to plain text or markdown format with preserved structure.

Installation

pip install docx2everything

Or install from source:

# Modern way (recommended)
pip install .

# Or using setup.py (deprecated but still works)
python setup.py install

Testing Without Installation

The CLI script works directly without installation - no PYTHONPATH needed!

Using CLI (no installation required):

# Extract text
python3 bin/docx2everything demo.docx

# Convert to markdown
python3 bin/docx2everything --markdown demo.docx > output.md

# With images
python3 bin/docx2everything --markdown -i images/ demo.docx > output.md

Using Python:

# Set PYTHONPATH to current directory
PYTHONPATH=. python3 -c "import docx2everything; print(docx2everything.process('demo.docx')[:100])"

In Python script:

import sys
sys.path.insert(0, '/path/to/python-docx2txt')

import docx2everything
text = docx2everything.process('document.docx')

Usage

Command Line

Extract plain text:

docx2everything document.docx

Convert to markdown:

docx2everything --markdown document.docx > output.md

Extract images:

docx2everything -i images/ document.docx

Markdown with images:

docx2everything --markdown -i images/ document.docx > output.md

Python API

import docx2everything

# Extract plain text
text = docx2everything.process("document.docx")

# Convert to markdown
markdown = docx2everything.process_to_markdown("document.docx")

# Extract images
text = docx2everything.process("document.docx", img_dir="images/")

# Markdown with images
markdown = docx2everything.process_to_markdown("document.docx", img_dir="images/")

Features

✅ Plain text extraction
✅ Markdown conversion with preserved structure:
- Tables → Markdown tables (with merged cells support, alignment hints)
- Lists → Bulleted/numbered lists (with proper sequence tracking)
- Headings → Markdown headings (#, ##, ###, etc.) with custom style detection
- Formatting → Bold, italic, strikethrough
- Links → Markdown links
- Images → Markdown image references
- Footnotes → Markdown footnote references [^1]
- Endnotes → Markdown endnote references [^1]
- Comments → Inline HTML comments with author info
- Charts → Chart placeholders with type and metadata *[Chart: Title (Chart Type)]*
- Page breaks → HTML comments 
- Section breaks → HTML comments 
✅ Image extraction
✅ Header and footer support
✅ Custom style detection (parses styles.xml for better heading detection)
✅ Table formatting (column alignment detection and hints)
✅ Robust error handling for malformed DOCX files

Requirements

Python 3.6+

License

MIT License - see LICENSE.txt

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bin		bin
docx2everything		docx2everything
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
PUBLISH.md		PUBLISH.md
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docx2everything

Installation

Testing Without Installation

Usage

Command Line

Python API

Features

Requirements

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

docx2everything

Installation

Testing Without Installation

Usage

Command Line

Python API

Features

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages