CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Running the CLI (22+ Tools Available)

# Session management
dataops session start --who "analyst" --purpose "data analysis"
dataops session status
dataops session end --status success

# Data profiling & quality
dataops run csv-profile data.csv -o profile.json
dataops run quality-score data.csv -o score.json
dataops run dq-validate data.csv rules.json -o violations.csv

# Schema & mapping
dataops run schema-infer data.csv schema.json
dataops run map-suggest source.csv target.csv mapping.json
dataops run field-correspond directory/ correspondences.json

# Data transformation
dataops run column-cut input.csv output.csv -c col1 col2
dataops run row-filter input.csv output.csv "column > value"
dataops run csv-join left.csv right.csv output.csv --on key
dataops run csv-aggregate input.csv output.csv --group-by col
dataops run csv-transform input.csv output.csv mapping.csv
dataops run csv-fill input.csv output.csv group_col

# SQL & analysis
dataops run csv-sql "SELECT * FROM table" -i table=file.csv -o output.csv
dataops run csv-sql-multi -q "QUERY1" -q "QUERY2" -o output.csv
dataops run csv-diff file1.csv file2.csv -o diff.csv
dataops run csv-pivot input.csv output.csv --operation pivot
dataops run func-dep-check input.csv -o dependencies.txt

# Data organization
dataops run csv-split large.csv chunks/ --chunk-size 1000
dataops run csv-consolidate input.csv output.csv --uid-column id
dataops run csv-merge directory/ merged.csv --pattern "*.csv"

# Data cleansing
dataops run csv-clean input.csv output.csv --remove-duplicates
dataops run dedupe-er input.csv output.csv -m col1 col2 -t 85

# Lineage
dataops lineage build

Development Tasks

# Run all tests (except known issues)
pytest -k "not test_poor_quality_detection"

# Run specific test file
pytest tests/test_data_quality_tools.py

# Run with coverage
pytest --cov=dataops --cov-report=html

# Code formatting (line length 100)
black dataops/ --line-length 100

# Linting
flake8 dataops/ --max-line-length=100

# Type checking (strict mode)
mypy dataops/ --strict

# Install for development
pip install -e ".[dev,mcp]"

Architecture Overview

Core Design Pattern

The project uses a decorator-based audit wrapper pattern: every data operation is wrapped with @audit_operation to create an immutable audit trail. All operations are grouped into sessions, identified by ULIDs, whose state persists across CLI invocations.
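
A minimal sketch of the idea behind the wrapper is shown below. The real decorator in dataops/audit/ records more (session ID, actor, and so on); the event fields and log layout here are illustrative only.

    import functools
    import hashlib
    import json
    import time
    from pathlib import Path

    def _sha256(path: str) -> str:
        """Hash a file in chunks so large CSVs never have to fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit_operation(tool_name: str, tool_version: str):
        """Simplified stand-in for the real decorator: append one NDJSON event per call."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(input_path: str, output_path: str, **kwargs):
                start = time.time()
                result, metadata = func(input_path, output_path, **kwargs)
                event = {
                    "tool": tool_name,
                    "version": tool_version,
                    "input_sha256": _sha256(input_path),
                    "output_sha256": _sha256(output_path),
                    "duration_s": round(time.time() - start, 3),
                    "metadata": metadata,
                }
                audit_dir = Path(".audit")
                audit_dir.mkdir(exist_ok=True)
                # Append-only: one JSON object per line, never rewritten
                with (audit_dir / "events.ndjson").open("a") as log:
                    log.write(json.dumps(event) + "\n")
                return result, metadata
            return wrapper
        return decorator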

Key Architectural Components

  1. Audit System (dataops/audit/): The @audit_operation decorator wraps every tool function to:

    • Log execution with SHA-256 hashes of inputs/outputs
    • Track timing and metadata
    • Create append-only NDJSON event streams in .audit/
    • Support both human and LLM actors
  2. Session Management (dataops/audit/session.py): Sessions group related operations using:

    • ULID-based unique identifiers
    • Persistent state in .audit/sessions/current
    • JSON manifests + NDJSON event logs per session
  3. Tool Architecture (dataops/tools/): Each tool follows a standardized pattern:

    • Standalone module with main function
    • @audit_operation wrapper for tracking
    • Dual interfaces: CLI command + MCP method
    • Auto-discovery for MCP integration
  4. MCP Integration (dataops/mcp/): Model Context Protocol support:

    • Auto-discovery server scans all tools (see the sketch after this list)
    • Automatic parameter mapping
    • Session persistence across calls
    • Full audit trail for AI operations
  5. Data Flow:

    data/ (read-only) → tools → scratch/ (temp) → artifacts/ (output)
                          ↓
                      .audit/ (logs)
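
A rough sketch of how tool auto-discovery (item 4 above) can work is shown below; the actual server in dataops/mcp/ may filter and register tools differently, so treat the selection rule and names here as assumptions.

    import importlib
    import inspect
    import pkgutil
    from typing import Callable

    import dataops.tools  # the package that auto-discovery scans

    def discover_tools() -> dict[str, Callable]:
        """Scan dataops/tools/ and collect public functions as MCP-callable tools."""
        tools: dict[str, Callable] = {}
        for module_info in pkgutil.iter_modules(dataops.tools.__path__):
            module = importlib.import_module(f"dataops.tools.{module_info.name}")
            for name, obj in inspect.getmembers(module, inspect.isfunction):
                if not name.startswith("_"):
                    # Parameters are later mapped to MCP arguments from the signature
                    tools[name] = obj
        return tools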
    

Critical Design Decisions

  • Immutable Inputs: Never modify files in data/ directory - enforced by audit wrapper
  • Hash-Based Tracking: All file operations include SHA-256 hashes for integrity/change detection
  • Dual Interface Pattern: Every tool exposes both CLI (via Click) and MCP interfaces
  • Session Persistence: Session state survives across CLI invocations using file-based storage
  • Tool Auto-Discovery: New tools automatically available via MCP without configuration

Complete Tool Set (22+ Tools)

Data Profiling & Quality

  • csv_profile - Comprehensive statistical profiling
  • dq_validate - Rule-based validation
  • quality_score - Multi-dimensional quality scoring

Schema & Mapping

  • schema_infer - Automatic schema detection
  • map_suggest - AI-powered field mapping
  • field_correspond - Cross-file correspondence

Data Transformation

  • column_cut - Column selection
  • row_filter - Row filtering
  • csv_join - Join operations
  • csv_aggregate - Aggregation
  • csv_transform - Schema transformation
  • csv_fill - Smart missing value filling

SQL & Analysis

  • csv_sql - SQL on CSV files
  • csv_sql_multi - Multi-query pipelines
  • csv_diff - File comparison
  • csv_pivot - Pivot/unpivot operations
  • func_dep_check - Dependency analysis

Data Organization

  • csv_split - Split large files
  • csv_consolidate - Consolidate duplicates
  • csv_merge - Merge multiple files

Data Cleansing

  • csv_clean - Data standardization
  • dedupe_er - Fuzzy deduplication

Adding New Features

Adding a New Tool

  1. Create a module in dataops/tools/ with a function following the pattern:
    from dataops.audit import audit_operation  # import path may differ; see dataops/audit/

    @audit_operation(
        tool_name="your_tool",
        tool_version="0.1.0"
    )
    def your_tool(input_path: str, output_path: str, **kwargs) -> tuple:
        # Implementation; return a (result, metadata) tuple
        result = {...}
        metadata = {...}
        return result, metadata
  2. Add CLI command in dataops/cli/tool_commands.py (a minimal Click sketch follows this list)
  3. Tool automatically available via MCP through auto-discovery
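
A minimal Click sketch for step 2; the command name, options, and import path are illustrative, so follow the existing commands in dataops/cli/tool_commands.py and register the new command with the same run group they use.

    import click

    from dataops.tools.your_tool import your_tool  # hypothetical module from step 1

    @click.command(name="your-tool")
    @click.argument("input_path", type=click.Path(exists=True))
    @click.argument("output_path", type=click.Path())
    def your_tool_command(input_path: str, output_path: str) -> None:
        """Run your_tool on INPUT_PATH and write the result to OUTPUT_PATH."""
        result, metadata = your_tool(input_path, output_path)
        click.echo(f"your_tool finished: {metadata}")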

Testing New Code

  • Place unit tests in tests/test_<module>.py
  • Use tempfile.TemporaryDirectory() for all file operations (see the sketch below)
  • Verify audit logs are created correctly
  • Check both CLI and programmatic interfaces
  • Test with sample CSV files in data/
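
A minimal pytest sketch along these lines; the import path, keyword argument, and audit-log location are assumptions to adapt against the existing tests.

    import csv
    import tempfile
    from pathlib import Path

    from dataops.tools.column_cut import column_cut  # assumed import path

    def test_column_cut_writes_output_and_audit_log() -> None:
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "input.csv"
            dst = Path(tmp) / "output.csv"
            with src.open("w", newline="") as fh:
                csv.writer(fh).writerows([["id", "name", "score"], ["1", "a", "10"]])

            column_cut(str(src), str(dst), columns=["id", "name"])  # assumed keyword

            assert dst.exists()
            # Audit events land under .audit/ as append-only NDJSON (location assumed)
            assert any(Path(".audit").rglob("*.ndjson"))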

Important Notes

  • Python 3.11+ required
  • Always run black before committing (line length 100)
  • Type hints are mandatory - mypy runs in strict mode
  • All data transformations must use @audit_operation decorator
  • MCP server uses auto-discovery - new tools available without restart
  • Never modify files in data/ directory
  • Use scratch/ for intermediate files
  • Use artifacts/ for final outputs
  • Known issue: test_poor_quality_detection expects score < 70 but gets 78.14

Testing Commands

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with all dependencies
pip install -e ".[dev,mcp]"
pip install python-Levenshtein  # for optimal dedupe_er performance

# Run tests (skipping known issue)
pytest -k "not test_poor_quality_detection"

# Run specific new tool tests
pytest tests/test_data_quality_tools.py -v
pytest tests/test_schema_mapping_tools.py -v
pytest tests/test_csv_diff.py tests/test_csv_pivot.py tests/test_csv_sql_multi.py -v
pytest tests/test_csv_clean.py tests/test_dedupe_er.py -v

# Check code quality
black dataops/ --check --line-length 100
flake8 dataops/ --max-line-length=100
mypy dataops/ --strict

MCP Server Setup

For Claude Desktop integration:

  1. Start the auto-discovery server:

    python start_mcp_auto.py

  2. Configure Claude Desktop:

    {
      "mcpServers": {
        "dataops-toolkit": {
          "command": "python",
          "args": ["/path/to/dataops-toolkit/start_mcp_auto.py"],
          "env": {"PYTHONPATH": "/path/to/dataops-toolkit"}
        }
      }
    }

  3. All tools are automatically available to Claude!

Code Style Guidelines

  • Line length: 100 characters
  • Use Black formatter
  • Type hints required for all functions
  • Docstrings for all public functions
  • Use pathlib.Path for file paths
  • Handle pandas FutureWarnings appropriately
  • Use logging instead of print statements
  • Comprehensive error handling with clear messages (illustrated in the sketch below)
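
A small example pulling several of these conventions together (the function and its behavior are illustrative, not part of the toolkit):

    import logging
    from pathlib import Path

    import pandas as pd

    logger = logging.getLogger(__name__)

    def load_csv(path: Path) -> pd.DataFrame:
        """Load a CSV file, raising a clear error if it is missing or empty."""
        if not path.exists():
            raise FileNotFoundError(f"Input file not found: {path}")
        df = pd.read_csv(path)
        if df.empty:
            raise ValueError(f"Input file contains no rows: {path}")
        logger.info("Loaded %d rows from %s", len(df), path)
        return df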