
MCP Integration Guide

Complete guide for setting up and using the Model Context Protocol (MCP) server with the DataOps Toolkit.

Overview

The DataOps Toolkit includes a full Model Context Protocol (MCP) server implementation that enables AI assistants like Claude to interact naturally with all data transformation tools while maintaining complete audit trails.

Key Features:

  • Automatic tool discovery - all tools instantly available
  • Full audit trail for every operation
  • Session management with persistence
  • Support for all 22+ DataOps tools
  • Claude Desktop integration ready

Quick Start

1. Installation

cd dataops-toolkit

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install with MCP support
pip install -e ".[mcp]"

2. Start the MCP Server

# Auto-discovery mode (recommended)
python start_mcp_auto.py

# The server will:
# - Auto-discover all tools in dataops/tools/
# - Start on stdio transport for Claude Desktop
# - Create MCP methods for each tool

Claude Desktop Integration

Setup Instructions

  1. Find your Claude Desktop config file:

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    • Linux: ~/.config/Claude/claude_desktop_config.json
  2. Edit the configuration:

{
  "mcpServers": {
    "dataops-toolkit": {
      "command": "python",
      "args": ["/full/path/to/dataops-toolkit/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/full/path/to/dataops-toolkit"
      }
    }
  }
}
  3. Restart Claude Desktop

  4. Test the integration - Ask Claude:

    • "What DataOps tools do you have available?"
    • "Can you profile the CSV file at data/customers.csv?"
    • "Start a DataOps session for data cleaning"

Available MCP Tools

All 22+ tools are automatically exposed via MCP:

Data Profiling & Quality

  • csv_profile - Comprehensive statistical profiling
  • dq_validate - Rule-based data validation
  • quality_score - Multi-dimensional quality scoring

Schema & Mapping

  • schema_infer - Automatic schema detection
  • map_suggest - AI-powered field mapping
  • field_correspond - Cross-file field correspondence

Data Transformation

  • column_cut - Column selection and reordering
  • row_filter - Conditional row filtering
  • csv_join - Join operations on CSV files
  • csv_aggregate - Group by and aggregation
  • csv_transform - Schema transformation
  • csv_fill - Smart missing value filling

SQL & Advanced Operations

  • csv_sql - SQL queries on CSV files
  • csv_sql_multi - Multi-query SQL pipelines
  • csv_diff - File comparison and diff
  • csv_pivot - Pivot/unpivot operations
  • func_dep_check - Functional dependency analysis

Data Organization

  • csv_split - Split large files into chunks
  • csv_consolidate - Consolidate duplicates
  • csv_merge - Merge multiple CSVs

Data Cleansing

  • csv_clean - Data standardization
  • dedupe_er - Fuzzy deduplication

Session Management

  • session_start - Start audit session
  • session_end - End session
  • session_status - Get current session

Usage Examples

Basic Workflow with Claude

Human: I have customer data at data/customers.csv that needs cleaning.
Can you assess its quality and clean it up?

Claude: I'll help you assess and clean your customer data. Let me start by
analyzing its current quality.

[Claude automatically:]
1. Starts a session
2. Profiles the data
3. Calculates quality score
4. Identifies issues
5. Applies cleaning operations
6. Validates results
7. Ends session with full audit trail

Programmatic MCP Usage

# Example of how tools are called via MCP
{
    "method": "csv_profile",
    "params": {
        "input_file": "data/customers.csv",
        "output_file": "reports/profile.json",
        "sample_size": 10000
    }
}

# Response includes:
{
    "success": true,
    "result": {
        "rows_analyzed": 10000,
        "columns_profiled": 15,
        "data_completeness": 92.5,
        "quality_issues_found": 3,
        "output_file": "reports/profile.json",
        "operation_id": "01HXYZ..."
    }
}
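The request/response envelopes above can be wired together with a small dispatcher. The sketch below is illustrative only: `dispatch` and the `tools` registry are assumed names, not the toolkit's actual server code.

```python
# Illustrative sketch only: route an MCP-style request (the envelope
# shown above) to a registered tool function. `dispatch` and the
# `tools` registry are assumed names, not the toolkit's real API.
def dispatch(request: dict, tools: dict) -> dict:
    method = request.get("method")
    fn = tools.get(method)
    if fn is None:
        return {"success": False, "error": f"tool not found: {method}"}
    try:
        # Tool functions are called with the request's params as kwargs.
        result = fn(**request.get("params", {}))
        return {"success": True, "result": result}
    except TypeError as exc:  # wrong or missing parameters
        return {"success": False, "error": str(exc)}
```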

Advanced Features

Auto-Discovery System

The MCP server automatically discovers tools by:

  1. Scanning dataops/tools/ directory
  2. Finding functions with @audit_operation decorator
  3. Extracting parameters and documentation
  4. Creating MCP method definitions
  5. No restart needed when adding tools!
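The steps above can be sketched in a few lines of Python. Note the assumptions: `audit_operation` here is a stand-in that tags functions with a marker attribute, and `discover_tools` takes a module argument; the real decorator and discovery internals may differ.

```python
import inspect
from types import ModuleType


def audit_operation(tool_name: str, tool_version: str):
    """Stand-in for the real decorator: tags a function as discoverable."""
    def wrap(fn):
        fn._is_audited = True
        fn._tool_name = tool_name
        fn._tool_version = tool_version
        return fn
    return wrap


def discover_tools(module: ModuleType) -> dict:
    """Collect every tagged function with its parameters and docstring."""
    tools = {}
    for _, fn in inspect.getmembers(module, inspect.isfunction):
        if getattr(fn, "_is_audited", False):
            tools[fn._tool_name] = {
                "callable": fn,
                "params": list(inspect.signature(fn).parameters),
                "doc": inspect.getdoc(fn),
            }
    return tools
```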

Session Persistence

Sessions persist across MCP calls:

# First call starts session
session_start(who="claude", purpose="analysis")

# Subsequent calls use active session
csv_profile("input.csv")  # Automatically uses current session
csv_clean("input.csv", "clean.csv")  # Same session

# End session to finalize
session_end(status="success")

Audit Trail

Every MCP operation is fully audited:

  • Operation timestamp and duration
  • Input/output file SHA-256 hashes
  • Parameters used
  • Actor information (AI agent)
  • Session grouping
  • Lineage tracking
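As a sketch of what one such record might contain (field names follow the bullets above, not the toolkit's exact schema):

```python
import hashlib
import time


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large inputs stay memory-friendly."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_record(tool: str, params: dict, input_file: str,
                 session_id: str, actor: str) -> dict:
    """One audit entry; each field corresponds to a bullet above."""
    return {
        "timestamp": time.time(),
        "tool": tool,
        "params": params,
        "input_sha256": sha256_of(input_file),  # cryptographic lineage
        "actor": actor,            # e.g. the AI agent identity
        "session_id": session_id,  # groups related operations
    }
```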

Configuration Options

Environment Variables

# Set working directory
export DATAOPS_WORK_DIR=/path/to/your/data

# Set audit directory
export DATAOPS_AUDIT_DIR=/path/to/audit

# Enable debug logging
export MCP_LOG_LEVEL=DEBUG
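Server-side, these variables might be consumed along these lines (the variable names come from the docs; the defaults and the `load_config` helper are illustrative):

```python
import os


def load_config(env=os.environ) -> dict:
    """Resolve runtime settings from the environment, with fallbacks."""
    work_dir = env.get("DATAOPS_WORK_DIR", os.getcwd())
    return {
        "work_dir": work_dir,
        "audit_dir": env.get("DATAOPS_AUDIT_DIR",
                             os.path.join(work_dir, ".audit")),
        "log_level": env.get("MCP_LOG_LEVEL", "INFO"),
    }
```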

Multiple Projects

Configure multiple MCP servers for different projects:

{
  "mcpServers": {
    "dataops-sales": {
      "command": "python",
      "args": ["/path/to/dataops/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/path/to/dataops",
        "DATAOPS_WORK_DIR": "/data/sales"
      }
    },
    "dataops-marketing": {
      "command": "python",
      "args": ["/path/to/dataops/start_mcp_auto.py"],
      "env": {
        "PYTHONPATH": "/path/to/dataops",
        "DATAOPS_WORK_DIR": "/data/marketing"
      }
    }
  }
}

Troubleshooting

Common Issues and Solutions

1. "MCP server not found" in Claude

  • Check file paths are absolute, not relative
  • Ensure Python is in PATH
  • Verify PYTHONPATH is set correctly
  • Restart Claude Desktop after config changes

2. "Tool not found" errors

  • Verify tool exists in dataops/tools/
  • Check tool has @audit_operation decorator
  • Ensure proper Python imports

3. "Session not found" errors

  • Always start a session before operations
  • Check .audit/sessions/current exists
  • Verify session hasn't expired

4. File path issues

  • Use absolute paths or paths relative to working directory
  • Check DATAOPS_WORK_DIR if set
  • Ensure data files exist and are readable

Debug Mode

Enable detailed logging:

# In your config.json
"env": {
    "PYTHONPATH": "/path/to/dataops",
    "MCP_LOG_LEVEL": "DEBUG"
}

Testing MCP Connection

# test_mcp.py — list every tool the server would expose

# Test tool discovery
from dataops.mcp.auto_discovery import discover_tools

tools = discover_tools()
print(f"Found {len(tools)} tools:")
for name in sorted(tools.keys()):
    print(f"  - {name}")

Security Considerations

  1. Read-Only Data Directory: Source data in data/ is never modified
  2. Audit Everything: All operations logged with cryptographic hashes
  3. Session Isolation: Each session has its own audit trail
  4. Path Validation: Prevents directory traversal attacks
  5. Parameter Sanitization: All inputs validated
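Item 4 (path validation) typically means resolving the requested path and rejecting anything that escapes the working directory. A minimal sketch of the idea, with `validate_path` as an assumed name:

```python
from pathlib import Path


def validate_path(requested: str, work_dir: str) -> Path:
    """Resolve `requested` under `work_dir`; reject traversal attempts."""
    base = Path(work_dir).resolve()
    candidate = (base / requested).resolve()
    # After resolution, the candidate must still sit inside the base dir.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes working directory: {requested}")
    return candidate
```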

Performance Optimization

Large Files

# Use sampling for profiling
csv_profile("large.csv", sample_size=10000)

# Split before processing
csv_split("huge.csv", "chunks/", chunk_size=50000)

Batch Operations

# Process multiple files efficiently
session_start(who="claude", purpose="batch_processing")

for file in files:
    csv_clean(file, f"clean_{file}")

session_end()  # Single audit trail for batch

Extending MCP Support

Adding New Tools

  1. Create tool in dataops/tools/:
# dataops/tools/my_tool.py
from dataops.audit.wrapper import audit_operation

@audit_operation(
    tool_name="my_tool",
    tool_version="0.1.0"
)
def my_tool(input_file: str, output_file: str, **kwargs):
    """Tool description for MCP."""
    # Implementation
    return result, metadata
  2. Tool is automatically available via MCP!

Custom Workflows

Create composite operations:

# dataops/tools/quality_pipeline.py
@audit_operation(
    tool_name="quality_pipeline",
    tool_version="0.1.0"
)
def quality_pipeline(input_file: str, output_dir: str):
    """Run complete quality assessment pipeline."""

    # Profile
    profile = csv_profile(input_file, f"{output_dir}/profile.json")

    # Score
    score = quality_score(input_file, f"{output_dir}/score.json")

    # Validate
    validate = dq_validate(input_file, "rules.json",
                          f"{output_dir}/violations.csv")

    return {
        "profile": profile,
        "score": score,
        "validation": validate
    }

Best Practices

  1. Always use sessions for grouping related operations
  2. Specify output files to maintain lineage
  3. Use descriptive purposes in session starts
  4. Clean scratch directory periodically
  5. Monitor audit size - archive old sessions
  6. Test with small data before processing large files
  7. Document custom rules for validation tools

Getting Help

Version Compatibility

  • Python: 3.11+
  • MCP Protocol: Latest
  • Claude Desktop: 1.0+
  • Pandas: 2.0+
  • DuckDB: 0.9+