CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is the LLM-Driven Data Integration Toolkit, a CLI-first framework for enterprise data integration and ERP migrations using Claude Code CLI agents and Model Context Protocol (MCP) servers. The project focuses on CSV-based workflows where all data is provided as CSV files within a working directory.

Architecture

The system follows a modular architecture with:

  • Primary Planner Agent (Codex CLI/Claude Code): Main orchestrator
  • MCP Servers: Each tool runs as a lightweight microservice with a JSON-RPC interface (see the sketch after this list)
  • Sub-Agents: Specialized Claude Code agents for dedicated tasks
  • Human CLI Commands: Mirror MCP methods for human analysts
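
A minimal sketch of what a single tool call over that JSON-RPC interface might look like, written as Python dictionaries. The method name, parameters, and IDs below are illustrative assumptions, not a fixed contract.

# Hypothetical JSON-RPC 2.0 exchange for one MCP tool call (illustrative names only)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "csv_profile",
    "params": {
        "input": "data/customers.csv",
        "output_dir": "scratch/",
        "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
    },
}
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "outputs": ["scratch/customers_profile.json"],
        "op_id": "op-0001",
        "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
        "log_path": ".audit/01J8ZK3V7Q9XH2M4N6P8R0T2V4/events.ndjson",
    },
}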

Development Setup Commands

# Set up Python environment (Python 3.11+ required)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (when requirements.txt is created)
pip install -r requirements.txt

# Run tests (when implemented)
python -m pytest tests/

# Type checking (when mypy is configured)
mypy dataops/

# Linting (when configured)
python -m flake8 dataops/
python -m black dataops/

Core Module Structure

The codebase should be organized as follows:

  • dataops/audit/ - Session management, event logging, and hashing utilities
  • dataops/lineage/ - Dependency graph builder and reporting
  • dataops/tools/ - Individual tool implementations
  • dataops/mcp/ - MCP server implementations
  • dataops/cli/ - CLI commands and entry points
  • dataops/agents/ - Sub-agent implementations

Directory Conventions

  • data/ - Read-only staging inputs (NEVER modify)
  • scratch/ - Ephemeral intermediates (safe to delete)
  • artifacts/ - Durable outputs for handoff/sign-off
  • ref/ - Reference files (rules, mappings)
  • .audit/ - Append-only event logs per session
  • .cache/ - Local indices (duckdb/parquet)

Development Stages

When implementing features, follow this staged approach:

  1. Stage 1: Core audit library (dataops.audit) with session management
  2. Stage 2: Lineage builder (dataops.lineage) for dependency graphs
  3. Stage 3: Initial tools (column_cut, csv_sql, func_dep_check)
  4. Stage 4: MCP server integration
  5. Stage 5: Sub-agents and additional tools

Key Implementation Requirements

Session Management

  • Every operation must be wrapped in a session with a unique SESSION_ID (ULID), as sketched below
  • Sessions start with: dataops session start --who "name" --purpose "description"
  • Sessions end with: dataops session end --status success|aborted
  • All events logged to .audit/<SESSION_ID>/events.ndjson
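
A minimal sketch of how the session wrapper in dataops.audit might look. The function name, event fields, and the use of uuid4 (standing in for a real ULID generator) are assumptions, not an existing API.

# Hypothetical session wrapper for dataops.audit (names are illustrative)
import json
import uuid
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def session(who: str, purpose: str, audit_root: Path = Path(".audit")):
    """Open an audit session, yield its ID, and always write a closing event."""
    session_id = uuid.uuid4().hex          # stand-in; the spec calls for a ULID
    log_path = audit_root / session_id / "events.ndjson"
    log_path.parent.mkdir(parents=True, exist_ok=True)

    def log(event: dict) -> None:
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    log({"event": "session_start", "session_id": session_id,
         "who": who, "purpose": purpose})
    status = "success"
    try:
        yield session_id
    except Exception:
        status = "aborted"
        raise
    finally:
        log({"event": "session_end", "session_id": session_id, "status": status})

The dataops session start/end CLI commands above would be thin wrappers around the same machinery.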

Event Logging Format

All operations must log structured NDJSON events (one JSON object per line; see the example after this list) with:

  • Timestamp, session_id, event_id, actor info
  • Tool name and version
  • Operation details including parameters and code references
  • Input/output file paths with SHA-256 hashes
  • Performance metrics (duration, rows read/written)
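
An illustrative single event, assembled in Python. The field names are an assumption derived from the list above, not a frozen schema, and the hashes are truncated placeholders.

# Hypothetical NDJSON audit event (one JSON object per line in events.ndjson)
import json

event = {
    "ts": "2025-01-15T10:32:07.412Z",
    "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
    "event_id": "evt-000042",
    "actor": {"kind": "agent", "name": "profiler"},
    "tool": {"name": "csv_profile", "version": "0.1.0"},
    "operation": {"params": {"input": "data/customers.csv"},
                  "code_ref": "dataops/tools/csv_profile.py"},
    "inputs": [{"path": "data/customers.csv", "sha256": "ab12..."}],      # truncated
    "outputs": [{"path": "scratch/customers_profile.json", "sha256": "cd34..."}],
    "metrics": {"duration_ms": 812, "rows_read": 10000, "rows_written": 0},
}
print(json.dumps(event))   # appended as one line to .audit/<SESSION_ID>/events.ndjson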

Write Policy

  • NEVER modify files in data/ directory
  • All outputs go to scratch/ or artifacts/
  • Overwriting existing files is forbidden by default
  • Content hashes (SHA-256) required for all file operations (see the write-guard sketch below)
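
A minimal sketch of a write guard that enforces these rules; the helper name and signature are hypothetical.

# Hypothetical write guard: no writes under data/, no silent overwrites,
# and every write returns the SHA-256 of the file's content.
import hashlib
from pathlib import Path

def safe_write(path: Path, content: bytes, allow_overwrite: bool = False) -> str:
    path = path.resolve()
    data_dir = Path("data").resolve()
    if path == data_dir or data_dir in path.parents:
        raise PermissionError(f"writes into data/ are forbidden: {path}")
    if path.exists() and not allow_overwrite:
        raise FileExistsError(f"refusing to overwrite {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return hashlib.sha256(content).hexdigest()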

Tool Development

Each tool must:

  • Expose both MCP method and CLI command
  • Accept only CSV files from working directory
  • Output JSON/CSV/Markdown artifacts
  • Include full audit logging wrapper
  • Return: { outputs: [...], op_id, session_id, log_path } (see the sketch below)
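
A sketch of one tool entry point honoring this contract, using column_cut as the example; the body, op_id handling, and signature are illustrative.

# Hypothetical tool entry point returning the required envelope
import csv
from pathlib import Path

def column_cut(input_csv: Path, columns: list[str], out_dir: Path,
               session_id: str) -> dict:
    """Select/reorder columns from a CSV and write the result under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{input_csv.stem}_cut.csv"
    with input_csv.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in columns})
    return {
        "outputs": [str(out_path)],
        "op_id": "op-0001",                      # would come from the audit layer
        "session_id": session_id,
        "log_path": f".audit/{session_id}/events.ndjson",
    }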

Testing Requirements

  • Unit tests for each module
  • Integration tests for CLI commands
  • MCP server endpoint tests
  • Deterministic output verification (hashes, timestamps; see the example test below)
  • Sample CSV datasets in tests/data/
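
A sketch of a deterministic-output test with pytest. It assumes a column_cut callable like the one sketched under Tool Development and a sample fixture in tests/data/; both are placeholders.

# Hypothetical pytest check: identical inputs must produce byte-identical outputs
import hashlib
from pathlib import Path

def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def test_column_cut_is_deterministic(tmp_path):
    src = Path("tests/data/customers.csv")        # sample fixture (placeholder)
    a = column_cut(src, ["id", "name"], tmp_path / "a", session_id="test")
    b = column_cut(src, ["id", "name"], tmp_path / "b", session_id="test")
    assert _sha256(Path(a["outputs"][0])) == _sha256(Path(b["outputs"][0]))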

Code Standards

  • Python 3.11+ with type hints
  • PEP 8 style compliance
  • Comprehensive docstrings for public APIs
  • Dependency injection for configuration
  • Feature branches with frequent rebasing
  • Meaningful commit messages referencing features

Security Requirements

  • Sandbox custom Python transforms
  • Validate and sanitize all CLI/MCP parameters
  • Enforce directory access controls (see the path-guard sketch below)
  • No external access without explicit configuration
  • Never log sensitive data in plain text
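
One possible shape for the directory access controls: resolve every user-supplied path and accept it only if it falls under an approved root. The helper name and allow-list are illustrative.

# Hypothetical path guard: reject any parameter that resolves outside the
# approved working-directory roots (data/ remains read-only per the Write Policy)
from pathlib import Path

ALLOWED_ROOTS = ("data", "scratch", "artifacts", "ref")

def resolve_inside_workdir(raw: str, workdir: Path) -> Path:
    candidate = (workdir / raw).resolve()
    for root in ALLOWED_ROOTS:
        allowed = (workdir / root).resolve()
        if candidate == allowed or allowed in candidate.parents:
            return candidate
    raise PermissionError(f"path escapes the working directory: {raw}")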

Core Tools to Implement

Priority tools for initial implementation:

  1. csv_profile - Column profiling with types, nulls, distributions
  2. schema_infer - Logical schema inference from CSVs
  3. map_suggest - Source-to-target field mapping suggestions
  4. dq_validate - Data quality rule enforcement
  5. csv_sql - SQL queries over CSVs via DuckDB (sketched after this list)
  6. column_cut - Column selection/reordering
  7. func_dep_check - Functional dependency detection
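
A minimal sketch of the csv_sql core using DuckDB's Python API. The signature is illustrative, and in a real tool the query would be validated per the Security Requirements before execution.

# Hypothetical core of csv_sql: run SQL over CSVs with DuckDB and materialize
# the result as a new CSV (outputs belong under scratch/ or artifacts/)
from pathlib import Path
import duckdb

def csv_sql(query: str, out_path: Path) -> Path:
    # Queries reference CSVs directly, e.g.
    #   SELECT country, count(*) AS n
    #   FROM read_csv_auto('data/customers.csv') GROUP BY country
    con = duckdb.connect()                       # in-memory database
    con.execute(f"COPY ({query}) TO '{out_path}' (HEADER, DELIMITER ',')")
    con.close()
    return out_path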

Sub-Agent Specializations

Each sub-agent has a limited, focused toolset:

  • Profiler Agent: csv_profile, schema_infer
  • SQL Desk Agent: csv_sql, column_cut, row_filter
  • DQ Agent: dq_validate
  • Mapping Agent: map_suggest, map_apply
  • Reconciliation Agent: reconcile_balances, post_load_checks