
MCP Integration Guide

Complete guide for setting up and using the Model Context Protocol (MCP) server with the DataOps Toolkit.

Overview

The DataOps Toolkit includes a full Model Context Protocol (MCP) server implementation that enables AI assistants like Claude to interact naturally with all data transformation tools while maintaining complete audit trails.

Key Features:

  • Automatic tool discovery - all tools instantly available
  • Full audit trail for every operation
  • Session management with persistence
  • Support for all 22+ DataOps tools
  • Claude Desktop integration ready

Quick Start

1. Installation

cd dataops-toolkit

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install with MCP support
pip install -e ".[mcp]"

2. Start the MCP Server

# Auto-discovery mode (recommended)
python start_mcp_auto.py

# The server will:
# - Auto-discover all tools in dataops/tools/
# - Start on stdio transport for Claude Desktop
# - Create MCP methods for each tool

Claude Desktop Integration

Setup Instructions

  1. Find your Claude Desktop config file:

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    • Linux: ~/.config/Claude/claude_desktop_config.json
  2. Edit the configuration:

{
  "mcpServers": {
    "dataops-toolkit": {
      "command": "python",
      "args": ["/full/path/to/dataops-toolkit/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/full/path/to/dataops-toolkit"
      }
    }
  }
}
  3. Restart Claude Desktop

  4. Test the integration - Ask Claude:

    • "What DataOps tools do you have available?"
    • "Can you profile the CSV file at data/customers.csv?"
    • "Start a DataOps session for data cleaning"

Available MCP Tools

All 22+ tools are automatically exposed via MCP:

Data Profiling & Quality

  • csv_profile - Comprehensive statistical profiling
  • dq_validate - Rule-based data validation
  • quality_score - Multi-dimensional quality scoring

Schema & Mapping

  • schema_infer - Automatic schema detection
  • map_suggest - AI-powered field mapping
  • field_correspond - Cross-file field correspondence

Data Transformation

  • column_cut - Column selection and reordering
  • row_filter - Conditional row filtering
  • csv_join - Join operations on CSV files
  • csv_aggregate - Group by and aggregation
  • csv_transform - Schema transformation
  • csv_fill - Smart missing value filling

SQL & Advanced Operations

  • csv_sql - SQL queries on CSV files
  • csv_sql_multi - Multi-query SQL pipelines
  • csv_diff - File comparison and diff
  • csv_pivot - Pivot/unpivot operations
  • func_dep_check - Functional dependency analysis

Data Organization

  • csv_split - Split large files into chunks
  • csv_consolidate - Consolidate duplicates
  • csv_merge - Merge multiple CSVs

Data Cleansing

  • csv_clean - Data standardization
  • dedupe_er - Fuzzy deduplication

Session Management

  • session_start - Start audit session
  • session_end - End session
  • session_status - Get current session

Usage Examples

Basic Workflow with Claude

Human: I have customer data at data/customers.csv that needs cleaning.
Can you assess its quality and clean it up?

Claude: I'll help you assess and clean your customer data. Let me start by
analyzing its current quality.

[Claude automatically:]
1. Starts a session
2. Profiles the data
3. Calculates quality score
4. Identifies issues
5. Applies cleaning operations
6. Validates results
7. Ends session with full audit trail

Programmatic MCP Usage

# Example of how tools are called via MCP
{
    "method": "csv_profile",
    "params": {
        "input_file": "data/customers.csv",
        "output_file": "reports/profile.json",
        "sample_size": 10000
    }
}

# Response includes:
{
    "success": true,
    "result": {
        "rows_analyzed": 10000,
        "columns_profiled": 15,
        "data_completeness": 92.5,
        "quality_issues_found": 3,
        "output_file": "reports/profile.json",
        "operation_id": "01HXYZ..."
    }
}
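The request/response envelopes above can be wired together with a small dispatcher. The sketch below is illustrative only: `dispatch` and the `tools` registry are assumed names, not the toolkit's actual server code.

```python
# Illustrative sketch only: route an MCP-style request (the envelope
# shown above) to a registered tool function. `dispatch` and the
# `tools` registry are assumed names, not the toolkit's real API.
def dispatch(request: dict, tools: dict) -> dict:
    method = request.get("method")
    fn = tools.get(method)
    if fn is None:
        return {"success": False, "error": f"tool not found: {method}"}
    try:
        # Tool functions are called with the request's params as kwargs.
        result = fn(**request.get("params", {}))
        return {"success": True, "result": result}
    except TypeError as exc:  # wrong or missing parameters
        return {"success": False, "error": str(exc)}
```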

Advanced Features

Auto-Discovery System

The MCP server automatically discovers tools by:

  1. Scanning dataops/tools/ directory
  2. Finding functions with @audit_operation decorator
  3. Extracting parameters and documentation
  4. Creating MCP method definitions
  5. No restart needed when adding tools!
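The steps above can be sketched in a few lines of Python. Note the assumptions: `audit_operation` here is a stand-in that tags functions with a marker attribute, and `discover_tools` takes a module argument; the real decorator and discovery internals may differ.

```python
import inspect
from types import ModuleType


def audit_operation(tool_name: str, tool_version: str):
    """Stand-in for the real decorator: tags a function as discoverable."""
    def wrap(fn):
        fn._is_audited = True
        fn._tool_name = tool_name
        fn._tool_version = tool_version
        return fn
    return wrap


def discover_tools(module: ModuleType) -> dict:
    """Collect every tagged function with its parameters and docstring."""
    tools = {}
    for _, fn in inspect.getmembers(module, inspect.isfunction):
        if getattr(fn, "_is_audited", False):
            tools[fn._tool_name] = {
                "callable": fn,
                "params": list(inspect.signature(fn).parameters),
                "doc": inspect.getdoc(fn),
            }
    return tools
```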

Session Persistence

Sessions persist across MCP calls:

# First call starts session
session_start(who="claude", purpose="analysis")

# Subsequent calls use active session
csv_profile("input.csv")  # Automatically uses current session
csv_clean("input.csv", "clean.csv")  # Same session

# End session to finalize
session_end(status="success")

Audit Trail

Every MCP operation is fully audited:

  • Operation timestamp and duration
  • Input/output file SHA-256 hashes
  • Parameters used
  • Actor information (AI agent)
  • Session grouping
  • Lineage tracking
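As a sketch of what one such record might contain (field names follow the bullets above, not the toolkit's exact schema):

```python
import hashlib
import time


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large inputs stay memory-friendly."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_record(tool: str, params: dict, input_file: str,
                 session_id: str, actor: str) -> dict:
    """One audit entry; each field corresponds to a bullet above."""
    return {
        "timestamp": time.time(),
        "tool": tool,
        "params": params,
        "input_sha256": sha256_of(input_file),  # cryptographic lineage
        "actor": actor,            # e.g. the AI agent identity
        "session_id": session_id,  # groups related operations
    }
```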

Configuration Options

Environment Variables

# Set working directory
export DATAOPS_WORK_DIR=/path/to/your/data

# Set audit directory
export DATAOPS_AUDIT_DIR=/path/to/audit

# Enable debug logging
export MCP_LOG_LEVEL=DEBUG
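Server-side, these variables might be consumed along these lines (the variable names come from the docs; the defaults and the `load_config` helper are illustrative):

```python
import os


def load_config(env=os.environ) -> dict:
    """Resolve runtime settings from the environment, with fallbacks."""
    work_dir = env.get("DATAOPS_WORK_DIR", os.getcwd())
    return {
        "work_dir": work_dir,
        "audit_dir": env.get("DATAOPS_AUDIT_DIR",
                             os.path.join(work_dir, ".audit")),
        "log_level": env.get("MCP_LOG_LEVEL", "INFO"),
    }
```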

Multiple Projects

Configure multiple MCP servers for different projects:

{
  "mcpServers": {
    "dataops-sales": {
      "command": "python",
      "args": ["/path/to/dataops/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/path/to/dataops",
        "DATAOPS_WORK_DIR": "/data/sales"
      }
    },
    "dataops-marketing": {
      "command": "python",
      "args": ["/path/to/dataops/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/path/to/dataops",
        "DATAOPS_WORK_DIR": "/data/marketing"
      }
    }
  }
}

Troubleshooting

Common Issues and Solutions

1. "MCP server not found" in Claude

  • Check file paths are absolute, not relative
  • Ensure Python is in PATH
  • Verify PYTHONPATH is set correctly
  • Restart Claude Desktop after config changes

2. "Tool not found" errors

  • Verify tool exists in dataops/tools/
  • Check tool has @audit_operation decorator
  • Ensure proper Python imports

3. "Session not found" errors

  • Always start a session before operations
  • Check .audit/sessions/current exists
  • Verify session hasn't expired

4. File path issues

  • Use absolute paths or paths relative to working directory
  • Check DATAOPS_WORK_DIR if set
  • Ensure data files exist and are readable

Debug Mode

Enable detailed logging:

# In your config.json
"env": {
    "PYTHONPATH": "/path/to/dataops",
    "MCP_LOG_LEVEL": "DEBUG"
}

Testing MCP Connection

# test_mcp.py — list every tool the server would expose

# Test tool discovery
from dataops.mcp.auto_discovery import discover_tools

tools = discover_tools()
print(f"Found {len(tools)} tools:")
for name in sorted(tools.keys()):
    print(f"  - {name}")

Security Considerations

  1. Read-Only Data Directory: Source data in data/ is never modified
  2. Audit Everything: All operations logged with cryptographic hashes
  3. Session Isolation: Each session has its own audit trail
  4. Path Validation: Prevents directory traversal attacks
  5. Parameter Sanitization: All inputs validated
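Item 4 (path validation) typically means resolving the requested path and rejecting anything that escapes the working directory. A minimal sketch of the idea, with `validate_path` as an assumed name:

```python
from pathlib import Path


def validate_path(requested: str, work_dir: str) -> Path:
    """Resolve `requested` under `work_dir`; reject traversal attempts."""
    base = Path(work_dir).resolve()
    candidate = (base / requested).resolve()
    # After resolution, the candidate must still sit inside the base dir.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes working directory: {requested}")
    return candidate
```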

Performance Optimization

Large Files

# Use sampling for profiling
csv_profile("large.csv", sample_size=10000)

# Split before processing
csv_split("huge.csv", "chunks/", chunk_size=50000)

Batch Operations

# Process multiple files efficiently
session_start(who="claude", purpose="batch_processing")

for file in files:
    csv_clean(file, f"clean_{file}")

session_end()  # Single audit trail for batch

Extending MCP Support

Adding New Tools

  1. Create tool in dataops/tools/:
# dataops/tools/my_tool.py
from dataops.audit.wrapper import audit_operation

@audit_operation(
    tool_name="my_tool",
    tool_version="0.1.0"
)
def my_tool(input_file: str, output_file: str, **kwargs):
    """Tool description for MCP."""
    # Implementation
    return result, metadata
  2. Tool is automatically available via MCP!

Custom Workflows

Create composite operations:

# dataops/tools/quality_pipeline.py
@audit_operation(
    tool_name="quality_pipeline",
    tool_version="0.1.0"
)
def quality_pipeline(input_file: str, output_dir: str):
    """Run complete quality assessment pipeline."""

    # Profile
    profile = csv_profile(input_file, f"{output_dir}/profile.json")

    # Score
    score = quality_score(input_file, f"{output_dir}/score.json")

    # Validate
    validate = dq_validate(input_file, "rules.json",
                          f"{output_dir}/violations.csv")

    return {
        "profile": profile,
        "score": score,
        "validation": validate
    }

Best Practices

  1. Always use sessions for grouping related operations
  2. Specify output files to maintain lineage
  3. Use descriptive purposes in session starts
  4. Clean scratch directory periodically
  5. Monitor audit size - archive old sessions
  6. Test with small data before processing large files
  7. Document custom rules for validation tools

Getting Help

Version Compatibility

  • Python: 3.11+
  • MCP Protocol: Latest
  • Claude Desktop: 1.0+
  • Pandas: 2.0+
  • DuckDB: 0.9+