CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Running the CLI (22+ Tools Available)

# Session management
dataops session start --who "analyst" --purpose "data analysis"
dataops session status
dataops session end --status success

# Data profiling & quality
dataops run csv-profile data.csv -o profile.json
dataops run quality-score data.csv -o score.json
dataops run dq-validate data.csv rules.json -o violations.csv

# Schema & mapping
dataops run schema-infer data.csv schema.json
dataops run map-suggest source.csv target.csv mapping.json
dataops run field-correspond directory/ correspondences.json

# Data transformation
dataops run column-cut input.csv output.csv -c col1 col2
dataops run row-filter input.csv output.csv "column > value"
dataops run csv-join left.csv right.csv output.csv --on key
dataops run csv-aggregate input.csv output.csv --group-by col
dataops run csv-transform input.csv output.csv mapping.csv
dataops run csv-fill input.csv output.csv group_col

# SQL & analysis
dataops run csv-sql "SELECT * FROM table" -i table=file.csv -o output.csv
dataops run csv-sql-multi -q "QUERY1" -q "QUERY2" -o output.csv
dataops run csv-diff file1.csv file2.csv -o diff.csv
dataops run csv-pivot input.csv output.csv --operation pivot
dataops run func-dep-check input.csv -o dependencies.txt

# Data organization
dataops run csv-split large.csv chunks/ --chunk-size 1000
dataops run csv-consolidate input.csv output.csv --uid-column id
dataops run csv-merge directory/ merged.csv --pattern "*.csv"

# Data cleansing
dataops run csv-clean input.csv output.csv --remove-duplicates
dataops run dedupe-er input.csv output.csv -m col1 col2 -t 85

# Lineage
dataops lineage build

Development Tasks

# Run all tests (except known issues)
pytest -k "not test_poor_quality_detection"

# Run specific test file
pytest tests/test_data_quality_tools.py

# Run with coverage
pytest --cov=dataops --cov-report=html

# Code formatting (line length 100)
black dataops/ --line-length 100

# Linting
flake8 dataops/ --max-line-length=100

# Type checking (strict mode)
mypy dataops/ --strict

# Install for development
pip install -e ".[dev,mcp]"

Architecture Overview

Core Design Pattern

The project uses a decorator-based audit wrapper pattern: every data operation is wrapped with @audit_operation to create an immutable audit trail. All operations are grouped into sessions, identified by ULIDs, whose state persists across CLI invocations.
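
A minimal sketch of the idea behind the wrapper is shown below. The real decorator in dataops/audit/ records more (session ID, actor, and so on); the event fields and log layout here are illustrative only.

    import functools
    import hashlib
    import json
    import time
    from pathlib import Path

    def _sha256(path: str) -> str:
        """Hash a file in chunks so large CSVs never have to fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit_operation(tool_name: str, tool_version: str):
        """Simplified stand-in for the real decorator: append one NDJSON event per call."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(input_path: str, output_path: str, **kwargs):
                start = time.time()
                result, metadata = func(input_path, output_path, **kwargs)
                event = {
                    "tool": tool_name,
                    "version": tool_version,
                    "input_sha256": _sha256(input_path),
                    "output_sha256": _sha256(output_path),
                    "duration_s": round(time.time() - start, 3),
                    "metadata": metadata,
                }
                audit_dir = Path(".audit")
                audit_dir.mkdir(exist_ok=True)
                # Append-only: one JSON object per line, never rewritten
                with (audit_dir / "events.ndjson").open("a") as log:
                    log.write(json.dumps(event) + "\n")
                return result, metadata
            return wrapper
        return decorator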

Key Architectural Components

  1. Audit System (dataops/audit/): The @audit_operation decorator wraps every tool function to:

    • Log execution with SHA-256 hashes of inputs/outputs
    • Track timing and metadata
    • Create append-only NDJSON event streams in .audit/
    • Support both human and LLM actors
  2. Session Management (dataops/audit/session.py): Sessions group related operations using:

    • ULID-based unique identifiers
    • Persistent state in .audit/sessions/current
    • JSON manifests + NDJSON event logs per session
  3. Tool Architecture (dataops/tools/): Each tool follows a standardized pattern:

    • Standalone module with main function
    • @audit_operation wrapper for tracking
    • Dual interfaces: CLI command + MCP method
    • Auto-discovery for MCP integration
  4. MCP Integration (dataops/mcp/): Model Context Protocol support:

    • Auto-discovery server scans all tools (see the sketch after this list)
    • Automatic parameter mapping
    • Session persistence across calls
    • Full audit trail for AI operations
  5. Data Flow:

    data/ (read-only) → tools → scratch/ (temp) → artifacts/ (output)
                          ↓
                      .audit/ (logs)
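
A rough sketch of how tool auto-discovery (item 4 above) can work is shown below; the actual server in dataops/mcp/ may filter and register tools differently, so treat the selection rule and names here as assumptions.

    import importlib
    import inspect
    import pkgutil
    from typing import Callable

    import dataops.tools  # the package that auto-discovery scans

    def discover_tools() -> dict[str, Callable]:
        """Scan dataops/tools/ and collect public functions as MCP-callable tools."""
        tools: dict[str, Callable] = {}
        for module_info in pkgutil.iter_modules(dataops.tools.__path__):
            module = importlib.import_module(f"dataops.tools.{module_info.name}")
            for name, obj in inspect.getmembers(module, inspect.isfunction):
                if not name.startswith("_"):
                    # Parameters are later mapped to MCP arguments from the signature
                    tools[name] = obj
        return tools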
    

Critical Design Decisions

  • Immutable Inputs: Never modify files in data/ directory - enforced by audit wrapper
  • Hash-Based Tracking: All file operations include SHA-256 hashes for integrity/change detection
  • Dual Interface Pattern: Every tool exposes both CLI (via Click) and MCP interfaces
  • Session Persistence: Session state survives across CLI invocations using file-based storage
  • Tool Auto-Discovery: New tools automatically available via MCP without configuration

Complete Tool Set (22+ Tools)

Data Profiling & Quality

  • csv_profile - Comprehensive statistical profiling
  • dq_validate - Rule-based validation
  • quality_score - Multi-dimensional quality scoring

Schema & Mapping

  • schema_infer - Automatic schema detection
  • map_suggest - AI-powered field mapping
  • field_correspond - Cross-file correspondence

Data Transformation

  • column_cut - Column selection
  • row_filter - Row filtering
  • csv_join - Join operations
  • csv_aggregate - Aggregation
  • csv_transform - Schema transformation
  • csv_fill - Smart missing value filling

SQL & Analysis

  • csv_sql - SQL on CSV files
  • csv_sql_multi - Multi-query pipelines
  • csv_diff - File comparison
  • csv_pivot - Pivot/unpivot operations
  • func_dep_check - Dependency analysis

Data Organization

  • csv_split - Split large files
  • csv_consolidate - Consolidate duplicates
  • csv_merge - Merge multiple files

Data Cleansing

  • csv_clean - Data standardization
  • dedupe_er - Fuzzy deduplication

Adding New Features

Adding a New Tool

  1. Create a module in dataops/tools/ with a function following the pattern:
    from dataops.audit import audit_operation  # import path may differ; see dataops/audit/

    @audit_operation(
        tool_name="your_tool",
        tool_version="0.1.0"
    )
    def your_tool(input_path: str, output_path: str, **kwargs) -> tuple:
        # Implementation; return a (result, metadata) tuple
        result = {...}
        metadata = {...}
        return result, metadata
  2. Add CLI command in dataops/cli/tool_commands.py (a minimal Click sketch follows this list)
  3. Tool automatically available via MCP through auto-discovery
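
A minimal Click sketch for step 2; the command name, options, and import path are illustrative, so follow the existing commands in dataops/cli/tool_commands.py and register the new command with the same run group they use.

    import click

    from dataops.tools.your_tool import your_tool  # hypothetical module from step 1

    @click.command(name="your-tool")
    @click.argument("input_path", type=click.Path(exists=True))
    @click.argument("output_path", type=click.Path())
    def your_tool_command(input_path: str, output_path: str) -> None:
        """Run your_tool on INPUT_PATH and write the result to OUTPUT_PATH."""
        result, metadata = your_tool(input_path, output_path)
        click.echo(f"your_tool finished: {metadata}")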

Testing New Code

  • Place unit tests in tests/test_<module>.py
  • Use tempfile.TemporaryDirectory() for all file operations (see the sketch below)
  • Verify audit logs are created correctly
  • Check both CLI and programmatic interfaces
  • Test with sample CSV files in data/
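
A minimal pytest sketch along these lines; the import path, keyword argument, and audit-log location are assumptions to adapt against the existing tests.

    import csv
    import tempfile
    from pathlib import Path

    from dataops.tools.column_cut import column_cut  # assumed import path

    def test_column_cut_writes_output_and_audit_log() -> None:
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "input.csv"
            dst = Path(tmp) / "output.csv"
            with src.open("w", newline="") as fh:
                csv.writer(fh).writerows([["id", "name", "score"], ["1", "a", "10"]])

            column_cut(str(src), str(dst), columns=["id", "name"])  # assumed keyword

            assert dst.exists()
            # Audit events land under .audit/ as append-only NDJSON (location assumed)
            assert any(Path(".audit").rglob("*.ndjson"))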

Important Notes

  • Python 3.11+ required
  • Always run black before committing (line length 100)
  • Type hints are mandatory - mypy runs in strict mode
  • All data transformations must use @audit_operation decorator
  • MCP server uses auto-discovery - new tools available without restart
  • Never modify files in data/ directory
  • Use scratch/ for intermediate files
  • Use artifacts/ for final outputs
  • Known issue: test_poor_quality_detection expects score < 70 but gets 78.14

Testing Commands

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with all dependencies
pip install -e ".[dev,mcp]"
pip install python-Levenshtein  # for optimal dedupe_er performance

# Run tests (skipping known issue)
pytest -k "not test_poor_quality_detection"

# Run specific new tool tests
pytest tests/test_data_quality_tools.py -v
pytest tests/test_schema_mapping_tools.py -v
pytest tests/test_csv_diff.py tests/test_csv_pivot.py tests/test_csv_sql_multi.py -v
pytest tests/test_csv_clean.py tests/test_dedupe_er.py -v

# Check code quality
black dataops/ --check --line-length 100
flake8 dataops/ --max-line-length=100
mypy dataops/ --strict

MCP Server Setup

For Claude Desktop integration:

  1. Start the auto-discovery server:

    python start_mcp_auto.py

  2. Configure Claude Desktop:

    {
      "mcpServers": {
        "dataops-toolkit": {
          "command": "python",
          "args": ["/path/to/dataops-toolkit/start_mcp_auto.py"],
          "env": {"PYTHONPATH": "/path/to/dataops-toolkit"}
        }
      }
    }

  3. All tools are automatically available to Claude!

Code Style Guidelines

  • Line length: 100 characters
  • Use Black formatter
  • Type hints required for all functions
  • Docstrings for all public functions
  • Use pathlib.Path for file paths
  • Handle pandas FutureWarnings appropriately
  • Use logging instead of print statements
  • Comprehensive error handling with clear messages (illustrated in the sketch below)
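
A small example pulling several of these conventions together (the function and its behavior are illustrative, not part of the toolkit):

    import logging
    from pathlib import Path

    import pandas as pd

    logger = logging.getLogger(__name__)

    def load_csv(path: Path) -> pd.DataFrame:
        """Load a CSV file, raising a clear error if it is missing or empty."""
        if not path.exists():
            raise FileNotFoundError(f"Input file not found: {path}")
        df = pd.read_csv(path)
        if df.empty:
            raise ValueError(f"Input file contains no rows: {path}")
        logger.info("Loaded %d rows from %s", len(df), path)
        return df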