CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is the LLM-Driven Data Integration Toolkit, a CLI-first framework for enterprise data integration and ERP migrations using Claude Code CLI agents and Model Context Protocol (MCP) servers. The project focuses on CSV-based workflows where all data is provided as CSV files within a working directory.

Architecture

The system follows a modular architecture with:

  • Primary Planner Agent (Codex CLI/Claude Code): Main orchestrator
  • MCP Servers: Each tool runs as a lightweight microservice with a JSON-RPC interface (see the sketch after this list)
  • Sub-Agents: Specialized Claude Code agents for dedicated tasks
  • Human CLI Commands: Mirror MCP methods for human analysts
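
A minimal sketch of what a single tool call over that JSON-RPC interface might look like, written as Python dictionaries. The method name, parameters, and IDs below are illustrative assumptions, not a fixed contract.

# Hypothetical JSON-RPC 2.0 exchange for one MCP tool call (illustrative names only)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "csv_profile",
    "params": {
        "input": "data/customers.csv",
        "output_dir": "scratch/",
        "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
    },
}
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "outputs": ["scratch/customers_profile.json"],
        "op_id": "op-0001",
        "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
        "log_path": ".audit/01J8ZK3V7Q9XH2M4N6P8R0T2V4/events.ndjson",
    },
}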

Development Setup Commands

# Set up Python environment (Python 3.11+ required)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (when requirements.txt is created)
pip install -r requirements.txt

# Run tests (when implemented)
python -m pytest tests/

# Type checking (when mypy is configured)
mypy dataops/

# Linting (when configured)
python -m flake8 dataops/
python -m black dataops/

Core Module Structure

The codebase should be organized as follows:

  • dataops/audit/ - Session management, event logging, and hashing utilities
  • dataops/lineage/ - Dependency graph builder and reporting
  • dataops/tools/ - Individual tool implementations
  • dataops/mcp/ - MCP server implementations
  • dataops/cli/ - CLI commands and entry points
  • dataops/agents/ - Sub-agent implementations

Directory Conventions

  • data/ - Read-only staging inputs (NEVER modify)
  • scratch/ - Ephemeral intermediates (safe to delete)
  • artifacts/ - Durable outputs for handoff/sign-off
  • ref/ - Reference files (rules, mappings)
  • .audit/ - Append-only event logs per session
  • .cache/ - Local indices (duckdb/parquet)

Development Stages

When implementing features, follow this staged approach:

  1. Stage 1: Core audit library (dataops.audit) with session management
  2. Stage 2: Lineage builder (dataops.lineage) for dependency graphs
  3. Stage 3: Initial tools (column_cut, csv_sql, func_dep_check)
  4. Stage 4: MCP server integration
  5. Stage 5: Sub-agents and additional tools

Key Implementation Requirements

Session Management

  • Every operation must be wrapped in a session with a unique SESSION_ID (ULID), as sketched below
  • Sessions start with: dataops session start --who "name" --purpose "description"
  • Sessions end with: dataops session end --status success|aborted
  • All events logged to .audit/<SESSION_ID>/events.ndjson
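
A minimal sketch of how the session wrapper in dataops.audit might look. The function name, event fields, and the use of uuid4 (standing in for a real ULID generator) are assumptions, not an existing API.

# Hypothetical session wrapper for dataops.audit (names are illustrative)
import json
import uuid
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def session(who: str, purpose: str, audit_root: Path = Path(".audit")):
    """Open an audit session, yield its ID, and always write a closing event."""
    session_id = uuid.uuid4().hex          # stand-in; the spec calls for a ULID
    log_path = audit_root / session_id / "events.ndjson"
    log_path.parent.mkdir(parents=True, exist_ok=True)

    def log(event: dict) -> None:
        with log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    log({"event": "session_start", "session_id": session_id,
         "who": who, "purpose": purpose})
    status = "success"
    try:
        yield session_id
    except Exception:
        status = "aborted"
        raise
    finally:
        log({"event": "session_end", "session_id": session_id, "status": status})

The dataops session start/end CLI commands above would be thin wrappers around the same machinery.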

Event Logging Format

All operations must log structured NDJSON events (one JSON object per line; see the example after this list) with:

  • Timestamp, session_id, event_id, actor info
  • Tool name and version
  • Operation details including parameters and code references
  • Input/output file paths with SHA-256 hashes
  • Performance metrics (duration, rows read/written)
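
An illustrative single event, assembled in Python. The field names are an assumption derived from the list above, not a frozen schema, and the hashes are truncated placeholders.

# Hypothetical NDJSON audit event (one JSON object per line in events.ndjson)
import json

event = {
    "ts": "2025-01-15T10:32:07.412Z",
    "session_id": "01J8ZK3V7Q9XH2M4N6P8R0T2V4",
    "event_id": "evt-000042",
    "actor": {"kind": "agent", "name": "profiler"},
    "tool": {"name": "csv_profile", "version": "0.1.0"},
    "operation": {"params": {"input": "data/customers.csv"},
                  "code_ref": "dataops/tools/csv_profile.py"},
    "inputs": [{"path": "data/customers.csv", "sha256": "ab12..."}],      # truncated
    "outputs": [{"path": "scratch/customers_profile.json", "sha256": "cd34..."}],
    "metrics": {"duration_ms": 812, "rows_read": 10000, "rows_written": 0},
}
print(json.dumps(event))   # appended as one line to .audit/<SESSION_ID>/events.ndjson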

Write Policy

  • NEVER modify files in data/ directory
  • All outputs go to scratch/ or artifacts/
  • Overwriting existing files is forbidden by default
  • Content hashes (SHA-256) required for all file operations (see the write-guard sketch below)
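
A minimal sketch of a write guard that enforces these rules; the helper name and signature are hypothetical.

# Hypothetical write guard: no writes under data/, no silent overwrites,
# and every write returns the SHA-256 of the file's content.
import hashlib
from pathlib import Path

def safe_write(path: Path, content: bytes, allow_overwrite: bool = False) -> str:
    path = path.resolve()
    data_dir = Path("data").resolve()
    if path == data_dir or data_dir in path.parents:
        raise PermissionError(f"writes into data/ are forbidden: {path}")
    if path.exists() and not allow_overwrite:
        raise FileExistsError(f"refusing to overwrite {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return hashlib.sha256(content).hexdigest()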

Tool Development

Each tool must:

  • Expose both MCP method and CLI command
  • Accept only CSV files from working directory
  • Output JSON/CSV/Markdown artifacts
  • Include full audit logging wrapper
  • Return: { outputs: [...], op_id, session_id, log_path } (see the sketch below)
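
A sketch of one tool entry point honoring this contract, using column_cut as the example; the body, op_id handling, and signature are illustrative.

# Hypothetical tool entry point returning the required envelope
import csv
from pathlib import Path

def column_cut(input_csv: Path, columns: list[str], out_dir: Path,
               session_id: str) -> dict:
    """Select/reorder columns from a CSV and write the result under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{input_csv.stem}_cut.csv"
    with input_csv.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in columns})
    return {
        "outputs": [str(out_path)],
        "op_id": "op-0001",                      # would come from the audit layer
        "session_id": session_id,
        "log_path": f".audit/{session_id}/events.ndjson",
    }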

Testing Requirements

  • Unit tests for each module
  • Integration tests for CLI commands
  • MCP server endpoint tests
  • Deterministic output verification (hashes, timestamps; see the example test below)
  • Sample CSV datasets in tests/data/
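
A sketch of a deterministic-output test with pytest. It assumes a column_cut callable like the one sketched under Tool Development and a sample fixture in tests/data/; both are placeholders.

# Hypothetical pytest check: identical inputs must produce byte-identical outputs
import hashlib
from pathlib import Path

def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def test_column_cut_is_deterministic(tmp_path):
    src = Path("tests/data/customers.csv")        # sample fixture (placeholder)
    a = column_cut(src, ["id", "name"], tmp_path / "a", session_id="test")
    b = column_cut(src, ["id", "name"], tmp_path / "b", session_id="test")
    assert _sha256(Path(a["outputs"][0])) == _sha256(Path(b["outputs"][0]))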

Code Standards

  • Python 3.11+ with type hints
  • PEP 8 style compliance
  • Comprehensive docstrings for public APIs
  • Dependency injection for configuration
  • Feature branches with frequent rebasing
  • Meaningful commit messages referencing features

Security Requirements

  • Sandbox custom Python transforms
  • Validate and sanitize all CLI/MCP parameters
  • Enforce directory access controls (see the path-guard sketch below)
  • No external access without explicit configuration
  • Never log sensitive data in plain text
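
One possible shape for the directory access controls: resolve every user-supplied path and accept it only if it falls under an approved root. The helper name and allow-list are illustrative.

# Hypothetical path guard: reject any parameter that resolves outside the
# approved working-directory roots (data/ remains read-only per the Write Policy)
from pathlib import Path

ALLOWED_ROOTS = ("data", "scratch", "artifacts", "ref")

def resolve_inside_workdir(raw: str, workdir: Path) -> Path:
    candidate = (workdir / raw).resolve()
    for root in ALLOWED_ROOTS:
        allowed = (workdir / root).resolve()
        if candidate == allowed or allowed in candidate.parents:
            return candidate
    raise PermissionError(f"path escapes the working directory: {raw}")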

Core Tools to Implement

Priority tools for initial implementation:

  1. csv_profile - Column profiling with types, nulls, distributions
  2. schema_infer - Logical schema inference from CSVs
  3. map_suggest - Source-to-target field mapping suggestions
  4. dq_validate - Data quality rule enforcement
  5. csv_sql - SQL queries over CSVs via DuckDB (sketched after this list)
  6. column_cut - Column selection/reordering
  7. func_dep_check - Functional dependency detection
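
A minimal sketch of the csv_sql core using DuckDB's Python API. The signature is illustrative, and in a real tool the query would be validated per the Security Requirements before execution.

# Hypothetical core of csv_sql: run SQL over CSVs with DuckDB and materialize
# the result as a new CSV (outputs belong under scratch/ or artifacts/)
from pathlib import Path
import duckdb

def csv_sql(query: str, out_path: Path) -> Path:
    # Queries reference CSVs directly, e.g.
    #   SELECT country, count(*) AS n
    #   FROM read_csv_auto('data/customers.csv') GROUP BY country
    con = duckdb.connect()                       # in-memory database
    con.execute(f"COPY ({query}) TO '{out_path}' (HEADER, DELIMITER ',')")
    con.close()
    return out_path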

Sub-Agent Specializations

Each sub-agent has a limited, focused toolset:

  • Profiler Agent: csv_profile, schema_infer
  • SQL Desk Agent: csv_sql, column_cut, row_filter
  • DQ Agent: dq_validate
  • Mapping Agent: map_suggest, map_apply
  • Reconciliation Agent: reconcile_balances, post_load_checks