This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is the LLM-Driven Data Integration Toolkit, a CLI-first framework for enterprise data integration and ERP migrations using Claude Code CLI agents and Model Context Protocol (MCP) servers. The project focuses on CSV-based workflows where all data is provided as CSV files within a working directory.
The system follows a modular architecture with:
- Primary Planner Agent (Codex CLI/Claude Code): Main orchestrator
- MCP Servers: Each tool runs as a lightweight microservice with a JSON-RPC interface
- Sub-Agents: Specialized Claude Code agents for dedicated tasks
- Human CLI Commands: Mirror MCP methods for human analysts
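To make the "CLI mirrors MCP" idea concrete, here is a minimal sketch in which one core function backs both a human-facing CLI command and, on the server side, an MCP method. All names here (`column_cut`, the `dataops column-cut` command) are illustrative assumptions, not a finalized API.

```python
# Sketch: one core function, callable from the human CLI and registerable as an
# MCP method by the server layer. Names are illustrative, not final.
import argparse
import csv
from pathlib import Path


def column_cut(input_path: Path, output_path: Path, columns: list[str]) -> dict:
    """Core implementation shared by the CLI command and the MCP method."""
    with input_path.open(newline="") as src, output_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: row.get(col, "") for col in columns})
    return {"outputs": [str(output_path)]}


def main() -> None:
    # Human-facing entry point; the MCP server would call column_cut() directly.
    parser = argparse.ArgumentParser(prog="dataops column-cut")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    parser.add_argument("--columns", nargs="+", required=True)
    args = parser.parse_args()
    print(column_cut(args.input, args.output, args.columns))


if __name__ == "__main__":
    main()
```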
# Set up Python environment (Python 3.11+ required)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies (when requirements.txt is created)
pip install -r requirements.txt
# Run tests (when implemented)
python -m pytest tests/
# Type checking (when mypy is configured)
mypy dataops/
# Linting (when configured)
python -m flake8 dataops/
python -m black dataops/

The codebase should be organized as follows:
- `dataops/audit/` - Session management, event logging, and hashing utilities
- `dataops/lineage/` - Dependency graph builder and reporting
- `dataops/tools/` - Individual tool implementations
- `dataops/mcp/` - MCP server implementations
- `dataops/cli/` - CLI commands and entry points
- `dataops/agents/` - Sub-agent implementations
- `data/` - Read-only staging inputs (NEVER modify)
- `scratch/` - Ephemeral intermediates (safe to delete)
- `artifacts/` - Durable outputs for handoff/sign-off
- `ref/` - Reference files (rules, mappings)
- `.audit/` - Append-only event logs per session
- `.cache/` - Local indices (duckdb/parquet)
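One way to keep this layout consistent across tools is to centralize it so no tool hard-codes paths; the following `Workspace` class and property names are assumptions, not a fixed API.

```python
# Sketch: a single source of truth for the working-directory layout.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Workspace:
    root: Path

    @property
    def data(self) -> Path:
        return self.root / "data"        # read-only staging inputs (never modified)

    @property
    def scratch(self) -> Path:
        return self.root / "scratch"     # ephemeral intermediates

    @property
    def artifacts(self) -> Path:
        return self.root / "artifacts"   # durable outputs for handoff

    @property
    def audit(self) -> Path:
        return self.root / ".audit"      # append-only event logs
```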
When implementing features, follow this staged approach:
- Stage 1: Core audit library (`dataops.audit`) with session management
- Stage 2: Lineage builder (`dataops.lineage`) for dependency graphs
- Stage 3: Initial tools (`column_cut`, `csv_sql`, `func_dep_check`)
- Stage 4: MCP server integration
- Stage 5: Sub-agents and additional tools
- Every operation must be wrapped in a session with unique SESSION_ID (ULID)
- Sessions start with: `dataops session start --who "name" --purpose "description"`
- Sessions end with: `dataops session end --status success|aborted`
- All events logged to `.audit/<SESSION_ID>/events.ndjson`
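A minimal sketch of this session wrapping, assuming `.audit/<SESSION_ID>/events.ndjson` as the log location; the ULID is stubbed with `uuid4` here (a real implementation would use a ULID library), and the helper names are illustrative.

```python
# Sketch: a context manager that opens an append-only event log under
# .audit/<SESSION_ID>/ and records session start/end events.
import json
import time
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def session(who: str, purpose: str, audit_root: Path = Path(".audit")):
    session_id = uuid.uuid4().hex  # stand-in for a real ULID
    log_dir = audit_root / session_id
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / "events.ndjson"

    def emit(event: dict) -> None:
        # Append one JSON object per line (NDJSON).
        event.update({"session_id": session_id, "ts": time.time()})
        with log_path.open("a") as fh:
            fh.write(json.dumps(event) + "\n")

    emit({"event": "session_start", "who": who, "purpose": purpose})
    try:
        yield session_id, emit
        emit({"event": "session_end", "status": "success"})
    except Exception:
        emit({"event": "session_end", "status": "aborted"})
        raise
```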
All operations must log structured NDJSON events with:
- Timestamp, session_id, event_id, actor info
- Tool name and version
- Operation details including parameters and code references
- Input/output file paths with SHA-256 hashes
- Performance metrics (duration, rows read/written)
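For illustration, a single line of `events.ndjson` could carry a record shaped like the following Python dict; the exact field names are assumptions derived from the list above, and all values are placeholders.

```python
# Illustrative shape of one NDJSON audit event (one JSON object per line).
example_event = {
    "ts": "2025-01-01T12:00:00Z",
    "session_id": "01HZX...",                      # ULID of the enclosing session
    "event_id": "01HZY...",
    "actor": {"kind": "agent", "name": "sql-desk"},
    "tool": {"name": "csv_sql", "version": "0.1.0"},
    "operation": {
        "params": {"query": "SELECT ..."},
        "code_ref": "dataops/tools/csv_sql.py",
    },
    "inputs": [{"path": "data/customers.csv", "sha256": "ab12..."}],
    "outputs": [{"path": "scratch/customers_clean.csv", "sha256": "cd34..."}],
    "metrics": {"duration_ms": 142, "rows_read": 1000, "rows_written": 998},
}
```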
- NEVER modify files in the `data/` directory
- All outputs go to `scratch/` or `artifacts/`
- Overwrite forbidden by default
- Content hashes (SHA-256) required for all file operations
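A sketch of the hashing and no-overwrite rules, assuming output paths are given relative to the working directory and confined to `scratch/` and `artifacts/`; the helper names are illustrative.

```python
# Sketch: SHA-256 content hashing plus a guard that enforces the output rules.
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large CSVs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checked_output_path(path: Path, allow_overwrite: bool = False) -> Path:
    """Reject outputs outside scratch/ or artifacts/ and refuse silent overwrites."""
    allowed = {"scratch", "artifacts"}
    if not path.parts or path.parts[0] not in allowed:
        raise PermissionError(f"outputs must go under scratch/ or artifacts/: {path}")
    if path.exists() and not allow_overwrite:
        raise FileExistsError(f"refusing to overwrite {path}")
    return path
```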
Each tool must:
- Expose both an MCP method and a CLI command
- Accept only CSV files from the working directory
- Output JSON/CSV/Markdown artifacts
- Include full audit logging wrapper
- Return: `{ outputs: [...], op_id, session_id, log_path }`
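One possible typed shape for that return value, as a sketch rather than a fixed schema:

```python
# Sketch: common return structure for every tool.
from typing import TypedDict


class ToolResult(TypedDict):
    outputs: list[str]   # paths of files written under scratch/ or artifacts/
    op_id: str           # unique id of this operation
    session_id: str      # ULID of the enclosing session
    log_path: str        # path to the NDJSON event log for the session
```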
- Unit tests for each module
- Integration tests for CLI commands
- MCP server endpoint tests
- Deterministic output verification (hashes, timestamps)
- Sample CSV datasets in `tests/data/`
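A sketch of a deterministic-output test: run a tool twice on a sample CSV from `tests/data/` and compare SHA-256 hashes of the results. The tool import and the fixture file name are hypothetical.

```python
# Sketch: pytest check that a tool produces byte-identical output across runs.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def test_column_cut_is_deterministic(tmp_path: Path) -> None:
    from dataops.tools.column_cut import column_cut  # hypothetical import

    src = Path("tests/data/customers.csv")           # hypothetical sample dataset
    out1, out2 = tmp_path / "a.csv", tmp_path / "b.csv"
    column_cut(src, out1, columns=["id", "name"])
    column_cut(src, out2, columns=["id", "name"])
    assert sha256_of(out1) == sha256_of(out2)
```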
- Python 3.11+ with type hints
- PEP 8 style compliance
- Comprehensive docstrings for public APIs
- Dependency injection for configuration
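A brief sketch of the dependency-injection point: configuration is passed to each tool explicitly rather than read from globals or environment variables. The names are illustrative.

```python
# Sketch: explicit configuration object injected into tools.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ToolConfig:
    workspace_root: Path
    audit_root: Path
    allow_overwrite: bool = False


def run_csv_profile(input_csv: Path, config: ToolConfig) -> None:
    # The tool receives everything it needs via `config`; nothing is pulled
    # from module-level state, which keeps it testable and auditable.
    ...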
- Feature branches with frequent rebasing
- Meaningful commit messages referencing features
- Sandbox custom Python transforms
- Validate and sanitize all CLI/MCP parameters
- Enforce directory access controls
- No external access without explicit configuration
- Never log sensitive data in plain text
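A hedged sketch of parameter validation and directory access control: resolve every user-supplied path and reject anything that escapes the working directory or is not a CSV. The function name is an assumption.

```python
# Sketch: validate a user-supplied path before any tool touches it.
from pathlib import Path


def validate_csv_path(raw: str, working_dir: Path) -> Path:
    path = (working_dir / raw).resolve()
    if not path.is_relative_to(working_dir.resolve()):
        raise PermissionError(f"path escapes the working directory: {raw}")
    if path.suffix.lower() != ".csv":
        raise ValueError(f"only CSV inputs are accepted: {raw}")
    return path
```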
Priority tools for initial implementation:
- `csv_profile` - Column profiling with types, nulls, distributions
- `schema_infer` - Logical schema inference from CSVs
- `map_suggest` - Source-to-target field mapping suggestions
- `dq_validate` - Data quality rule enforcement
- `csv_sql` - SQL queries over CSVs via DuckDB
- `column_cut` - Column selection/reordering
- `func_dep_check` - Functional dependency detection
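As a sketch of the DuckDB-backed `csv_sql` tool (the function signature and view registration are assumptions, not the final interface):

```python
# Sketch: run a SQL query over CSV files using DuckDB.
from pathlib import Path

import duckdb


def csv_sql(query: str, tables: dict[str, Path]) -> duckdb.DuckDBPyRelation:
    con = duckdb.connect()  # in-memory database
    for name, csv_path in tables.items():
        # Expose each CSV as a view so the query can refer to it by name.
        # A real tool would validate `name` and `csv_path` first (see Security).
        con.execute(
            f"CREATE VIEW {name} AS SELECT * FROM read_csv_auto('{csv_path}')"
        )
    return con.sql(query)


# Example usage (paths and table names are illustrative):
# rel = csv_sql("SELECT country, count(*) FROM customers GROUP BY country",
#               {"customers": Path("data/customers.csv")})
# print(rel.fetchall())
```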
Each sub-agent has limited, focused toolsets:
- Profiler Agent: `csv_profile`, `schema_infer`
- SQL Desk Agent: `csv_sql`, `column_cut`, `row_filter`
- DQ Agent: `dq_validate`
- Mapping Agent: `map_suggest`, `map_apply`
- Reconciliation Agent: `reconcile_balances`, `post_load_checks`
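A simple way to enforce these limited toolsets is an allow-list checked before dispatching any tool call; this is a sketch, and the agent keys are chosen for illustration.

```python
# Sketch: per-agent tool allow-list checked before dispatch.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "profiler": {"csv_profile", "schema_infer"},
    "sql_desk": {"csv_sql", "column_cut", "row_filter"},
    "dq": {"dq_validate"},
    "mapping": {"map_suggest", "map_apply"},
    "reconciliation": {"reconcile_balances", "post_load_checks"},
}


def check_tool_access(agent: str, tool: str) -> None:
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        raise PermissionError(f"agent {agent!r} may not call tool {tool!r}")
```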