A comprehensive Python CLI tool for analyzing Excel files (.xlsx and .xls) and extracting all data necessary for programmatic file duplication. This tool provides detailed analysis of cell data, formulas, formatting, charts, images, data validation, conditional formatting, and more.
-
Comprehensive Analysis: Extracts all Excel file components:
- Cell data, formulas, and calculated values
- Complete formatting (fonts, colors, borders, alignment, number formats)
- Merged cells, column widths, row heights
- Data validation rules
- Conditional formatting rules
- Charts with full configuration (series, axes, titles, legends)
- Images and shapes (with base64 encoding)
- Print settings, headers, and footers
- Freeze panes, auto filters, and sheet properties
-
Multiple Output Formats: Generate analysis in various formats:
- JSON: Complete hierarchical data structure
- HTML: Interactive visual report with tabs for each worksheet
- Text: Human-readable text format with tables
- CSV: Cell data export (one file per worksheet)
- Excel: Summary analysis in Excel format
-
Format Support:
- ✅
.xlsxfiles (full support) - ✅
.xlsfiles (legacy format with some limitations)
- ✅
-
Command-Line Interface: Easy-to-use CLI for programmers
- Verbose mode for detailed progress
- Multiple output formats in a single run
- Customizable output directory
# Clone or navigate to the project directory
cd /path/to/ExcelAnalyzer
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install the package in development mode
pip install -e .pip install -e /path/to/ExcelAnalyzer# Analyze an Excel file and generate JSON output
excel-analyzer myfile.xlsx --json
# Generate multiple output formats
excel-analyzer myfile.xlsx --json --html --text
# Generate all output formats
excel-analyzer myfile.xlsx --json --html --text --csv --excel
# Enable verbose mode for detailed progress
excel-analyzer myfile.xlsx --json --html -v
# Specify output directory
excel-analyzer myfile.xlsx --json -o /path/to/outputusage: excel-analyzer [-h] [--json] [--html] [--text] [--csv] [--excel]
[-o OUTPUT_DIR] [-v] [--version]
input_file
positional arguments:
input_file Path to Excel file (.xlsx or .xls)
optional arguments:
-h, --help show this help message and exit
--json Generate JSON output
--html Generate HTML output
--text Generate text output
--csv Generate CSV output (one file per sheet)
--excel Generate Excel summary output
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory for generated files (default: current directory)
-v, --verbose Enable verbose output with detailed progress
--version show program's version number and exit
You can also use the analyzer as a Python module:
from excel_analyzer.analyzers import XLSXAnalyzer
from excel_analyzer.formatters import JSONFormatter
# Create analyzer
analyzer = XLSXAnalyzer()
# Analyze file
workbook_model = analyzer.analyze('myfile.xlsx', verbose=True)
# Generate output
formatter = JSONFormatter()
formatter.format(workbook_model, 'output.json', verbose=True)
# Access data programmatically
for worksheet in workbook_model.worksheets:
print(f"Worksheet: {worksheet.name}")
print(f" Cells: {len(worksheet.cells)}")
print(f" Charts: {len(worksheet.charts)}")
print(f" Images: {len(worksheet.images)}"){
"metadata": {
"analyzer_version": "1.0.0",
"analyzed_at": "2026-02-08T15:30:00",
"source_file": "/path/to/file.xlsx"
},
"workbook": {
"file_format": "xlsx",
"properties": {
"creator": "John Doe",
"created": "2026-01-15T10:00:00"
},
"worksheets": [
{
"name": "Sheet1",
"index": 0,
"cells": [
{
"coordinate": "A1",
"value": "Product",
"data_type": "s",
"formatting": {
"font": {
"name": "Calibri",
"size": 11,
"bold": true,
"color": {"type": "rgb", "value": "#000000"}
},
"fill": {
"pattern_type": "solid",
"fg_color": {"type": "rgb", "value": "#DDEBF7"}
}
}
}
],
"merged_cells": ["A1:B1"],
"charts": [...],
"images": [...]
}
]
}
}Interactive HTML report with:
- File metadata and properties
- Tabbed interface for multiple worksheets
- Styled tables showing cell data
- Chart and image information
- Embedded images (base64)
- Color-coded sections
Human-readable text format:
================================================================================
EXCEL WORKBOOK ANALYSIS
================================================================================
File: /path/to/file.xlsx
Format: XLSX
Analyzer Version: 1.0.0
--------------------------------------------------------------------------------
WORKBOOK PROPERTIES
--------------------------------------------------------------------------------
Creator: John Doe
Created: 2026-01-15
Total Worksheets: 3
================================================================================
WORKSHEET: Sheet1
================================================================================
Sheet Index: 0
Total Cells: 288
Sheet State: visible
Cell Data (showing first 100 cells):
+------+------+----------+-----------+---------+
| Cell | Type | Value | Formula | Format |
+------+------+----------+-----------+---------+
| A1 | s | Product | | General |
| B1 | s | Price | | General |
+------+------+----------+-----------+---------+
One CSV file per worksheet containing cell data (values only, no formatting).
Summary Excel file with:
- Overview sheet with file metadata
- One analysis sheet per source worksheet
- Tables with cell data, charts, and images information
The project is organized into modular components:
excel_analyzer/
├── analyzers/ # Analysis modules
│ ├── xlsx_analyzer.py # XLSX file analyzer
│ ├── xls_analyzer.py # XLS file analyzer
│ ├── cell_analyzer.py # Cell data extraction
│ ├── format_analyzer.py # Formatting extraction
│ ├── structure_analyzer.py # Merged cells, dimensions
│ ├── chart_analyzer.py # Chart extraction
│ ├── image_analyzer.py # Image extraction
│ └── ...
├── models/ # Data models
│ ├── workbook.py # Workbook model
│ ├── worksheet.py # Worksheet model
│ ├── cell.py # Cell model
│ ├── formatting.py # Formatting models
│ ├── chart.py # Chart models
│ └── image.py # Image model
├── formatters/ # Output formatters
│ ├── json_formatter.py
│ ├── html_formatter.py
│ ├── text_formatter.py
│ ├── csv_formatter.py
│ └── excel_formatter.py
├── utils/ # Utilities
│ ├── color_utils.py
│ ├── file_utils.py
│ └── logging_utils.py
└── cli.py # CLI interface
The legacy .xls format has some limitations compared to .xlsx:
- Limited chart extraction (charts are difficult to parse in XLS)
- Limited image extraction
- No conditional formatting support
- Basic formatting only
These limitations are due to the xlrd library's capabilities and the binary nature of the XLS format.
openpyxl>=3.1.0- XLSX file parsingxlrd>=2.0.0- XLS file parsingPillow>=10.0.0- Image handlingJinja2>=3.1.0- HTML template renderingcolorama>=0.4.6- Colored terminal outputtqdm>=4.65.0- Progress barstabulate>=0.9.0- Text table formattinglxml>=4.9.0- XML parsing
This tool is ideal for:
- File Migration: Analyze Excel files before migrating to a new format
- Documentation: Generate comprehensive documentation of Excel file structures
- Validation: Verify that Excel files contain expected data and formatting
- Backup & Recovery: Create detailed snapshots of Excel files for recovery purposes
- Programmatic Duplication: Extract all information needed to recreate Excel files programmatically
- Auditing: Review complex Excel files without opening them
- Data Extraction: Extract data from Excel files for processing in other systems
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests (when implemented)
pytestThe project follows PEP 8 style guidelines. Format code with:
black excel_analyzer/You must specify at least one output format using --json, --html, --text, --csv, or --excel.
Check that the file path is correct and the file is accessible.
Only .xlsx and .xls files are supported. Other formats like .xlsb or .xlsm may not work correctly.
For very large Excel files:
- Use
--jsononly to reduce memory usage - Avoid
--htmlwhich embeds images and can be memory-intensive - Close other applications to free up memory
- Initial release
- Support for .xlsx and .xls formats
- Multiple output formats (JSON, HTML, Text, CSV, Excel)
- Comprehensive extraction of all Excel features
- CLI interface with verbose mode
This project is provided as-is for educational and commercial use.
Contributions are welcome! Areas for improvement:
- Additional output formats (XML, YAML)
- Excel file recreation from JSON
- Diff tool to compare two Excel files
- Web interface for visual analysis
- Support for .xlsb format
- Performance optimizations for large files
Enginerd-2019
- Built with openpyxl for XLSX parsing
- Uses xlrd for legacy XLS support
- Jinja2 for HTML template rendering
- Inspired by the need for comprehensive Excel file documentation tools