This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands

### Data Operations

```bash
# Session management
dataops session start --who "analyst" --purpose "data analysis"
dataops session status
dataops session end --status success
# Data profiling & quality
dataops run csv-profile data.csv -o profile.json
dataops run quality-score data.csv -o score.json
dataops run dq-validate data.csv rules.json -o violations.csv
# Schema & mapping
dataops run schema-infer data.csv schema.json
dataops run map-suggest source.csv target.csv mapping.json
dataops run field-correspond directory/ correspondences.json
# Data transformation
dataops run column-cut input.csv output.csv -c col1 col2
dataops run row-filter input.csv output.csv "column > value"
dataops run csv-join left.csv right.csv output.csv --on key
dataops run csv-aggregate input.csv output.csv --group-by col
dataops run csv-transform input.csv output.csv mapping.csv
dataops run csv-fill input.csv output.csv group_col
# SQL & analysis
dataops run csv-sql "SELECT * FROM table" -i table=file.csv -o output.csv
dataops run csv-sql-multi -q "QUERY1" -q "QUERY2" -o output.csv
dataops run csv-diff file1.csv file2.csv -o diff.csv
dataops run csv-pivot input.csv output.csv --operation pivot
dataops run func-dep-check input.csv -o dependencies.txt
# Data organization
dataops run csv-split large.csv chunks/ --chunk-size 1000
dataops run csv-consolidate input.csv output.csv --uid-column id
dataops run csv-merge directory/ merged.csv --pattern "*.csv"
# Data cleansing
dataops run csv-clean input.csv output.csv --remove-duplicates
dataops run dedupe-er input.csv output.csv -m col1 col2 -t 85
# Lineage
dataops lineage build
```

### Development

```bash
# Run all tests (except known issues)
pytest -k "not test_poor_quality_detection"
# Run specific test file
pytest tests/test_data_quality_tools.py
# Run with coverage
pytest --cov=dataops --cov-report=html
# Code formatting (line length 100)
black dataops/ --line-length 100
# Linting
flake8 dataops/ --max-line-length=100
# Type checking (strict mode)
mypy dataops/ --strict
# Install for development
pip install -e ".[dev,mcp]"
```

## Architecture

The project uses a decorator-based audit wrapper pattern: every data operation is wrapped with `@audit_operation` to create an immutable audit trail. All operations are grouped into sessions with ULID identifiers that persist across CLI invocations.
### Audit System (`dataops/audit/`)

The `@audit_operation` decorator (sketched below) wraps every tool function to:

- Log execution with SHA-256 hashes of inputs/outputs
- Track timing and metadata
- Create append-only NDJSON event streams in `.audit/`
- Support both human and LLM actors
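For orientation, here is a minimal sketch of that decorator pattern. It is illustrative only: the event field names, the `events.ndjson` filename, and the wrapper signature are assumptions, not the actual `dataops/audit` implementation.

```python
# Hypothetical sketch of the audit wrapper; field names, file layout,
# and signature are assumptions, not the real dataops/audit code.
import functools
import hashlib
import json
import time
from pathlib import Path

AUDIT_DIR = Path(".audit")  # assumed location of the event stream


def _sha256(path: str) -> str:
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_operation(tool_name: str, tool_version: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(input_path: str, output_path: str, **kwargs):
            start = time.time()
            result, metadata = func(input_path, output_path, **kwargs)
            event = {
                "tool": tool_name,
                "version": tool_version,
                "input_sha256": _sha256(input_path),
                "output_sha256": _sha256(output_path),
                "duration_s": round(time.time() - start, 3),
                "metadata": metadata,
            }
            AUDIT_DIR.mkdir(exist_ok=True)
            # Append-only NDJSON: one JSON object per line, never rewritten.
            with (AUDIT_DIR / "events.ndjson").open("a") as f:
                f.write(json.dumps(event, default=str) + "\n")
            return result, metadata

        return wrapper

    return decorator
```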
### Session Management (`dataops/audit/session.py`)

Sessions group related operations (see the sketch below) using:

- ULID-based unique identifiers
- Persistent state in `.audit/sessions/current`
- JSON manifests + NDJSON event logs per session
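File-based persistence is what lets a session span multiple CLI invocations. A sketch of the idea, assuming `.audit/sessions/current` simply stores the active session ID (the real `session.py` may structure this differently):

```python
# Illustrative only: assumes .audit/sessions/current holds the active
# session ID. ULID generation is left to the caller here.
import json
from pathlib import Path

SESSIONS_DIR = Path(".audit/sessions")


def start_session(session_id: str, who: str, purpose: str) -> None:
    """Write a session manifest and mark the session as current."""
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    manifest = {"id": session_id, "who": who, "purpose": purpose}
    (SESSIONS_DIR / f"{session_id}.json").write_text(json.dumps(manifest))
    # On-disk state means the next CLI invocation sees the same session.
    (SESSIONS_DIR / "current").write_text(session_id)


def current_session() -> str | None:
    pointer = SESSIONS_DIR / "current"
    return pointer.read_text().strip() if pointer.exists() else None
```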
### Tool Architecture (`dataops/tools/`)

Each tool follows a standardized pattern (see the skeleton under "Adding a New Tool" below):

- Standalone module with a main function
- `@audit_operation` wrapper for tracking
- Dual interfaces: CLI command + MCP method
- Auto-discovery for MCP integration
### MCP Integration (`dataops/mcp/`)

Model Context Protocol support:

- Auto-discovery server scans all tools (see the sketch below)
- Automatic parameter mapping
- Session persistence across calls
- Full audit trail for AI operations
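Auto-discovery typically reduces to scanning a package for public callables and exposing each one as an MCP method. A sketch of that scan; the package path comes from this repo, but the conventions (public functions only, one method per function) are assumptions:

```python
# Sketch of tool auto-discovery over dataops.tools; the "public
# function == tool" convention is an assumption, not the real code.
import importlib
import inspect
import pkgutil
from typing import Callable


def discover_tools(package_name: str = "dataops.tools") -> dict[str, Callable]:
    package = importlib.import_module(package_name)
    tools: dict[str, Callable] = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        for name, obj in inspect.getmembers(module, inspect.isfunction):
            if not name.startswith("_"):
                tools[name] = obj  # would be registered as an MCP method
    return tools
```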
### Data Flow

```
data/ (read-only) → tools → scratch/ (temp) → artifacts/ (output)
                      ↓
                  .audit/ (logs)
```
## Key Principles

- **Immutable Inputs**: Never modify files in the `data/` directory; enforced by the audit wrapper (see the guard sketch below)
- **Hash-Based Tracking**: All file operations include SHA-256 hashes for integrity and change detection
- **Dual Interface Pattern**: Every tool exposes both CLI (via Click) and MCP interfaces
- **Session Persistence**: Session state survives across CLI invocations using file-based storage
- **Tool Auto-Discovery**: New tools are automatically available via MCP without configuration
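The read-only guarantee on `data/` can be checked mechanically before any write. A guard along these lines is one plausible way the audit wrapper could enforce it; the actual enforcement may differ:

```python
# Hypothetical guard: refuse any output path that resolves into data/.
from pathlib import Path


def assert_not_in_data(output_path: str, data_dir: str = "data") -> None:
    out = Path(output_path).resolve()
    data = Path(data_dir).resolve()
    if out == data or data in out.parents:
        raise PermissionError(
            f"Refusing to write {out}: {data} holds read-only input data."
        )
```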
## Available Tools

**Data profiling & quality**

- `csv_profile`: Comprehensive statistical profiling
- `dq_validate`: Rule-based validation
- `quality_score`: Multi-dimensional quality scoring

**Schema & mapping**

- `schema_infer`: Automatic schema detection
- `map_suggest`: AI-powered field mapping
- `field_correspond`: Cross-file correspondence

**Data transformation**

- `column_cut`: Column selection
- `row_filter`: Row filtering
- `csv_join`: Join operations
- `csv_aggregate`: Aggregation
- `csv_transform`: Schema transformation
- `csv_fill`: Smart missing-value filling

**SQL & analysis**

- `csv_sql`: SQL on CSV files
- `csv_sql_multi`: Multi-query pipelines
- `csv_diff`: File comparison
- `csv_pivot`: Pivot/unpivot operations
- `func_dep_check`: Functional dependency analysis

**Data organization**

- `csv_split`: Split large files
- `csv_consolidate`: Consolidate duplicates
- `csv_merge`: Merge multiple files

**Data cleansing**

- `csv_clean`: Data standardization
- `dedupe_er`: Fuzzy deduplication
## Adding a New Tool

1. Create a module in `dataops/tools/` with a function following the pattern:

```python
@audit_operation(
    tool_name="your_tool",
    tool_version="0.1.0",
)
def your_tool(input_path: str, output_path: str, **kwargs) -> tuple:
    # Implementation
    result = {...}
    metadata = {...}
    return result, metadata
```

2. Add a CLI command in `dataops/cli/tool_commands.py` (see the Click sketch below).
3. The tool is automatically available via MCP through auto-discovery.
## Testing Guidelines

- Place unit tests in `tests/test_<module>.py`
- Use `tempfile.TemporaryDirectory()` for all file operations (see the example below)
- Verify that audit logs are created correctly
- Check both the CLI and programmatic interfaces
- Test with the sample CSV files in `data/`
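A minimal test shape following these guidelines. The `csv_clean` import path and keyword argument are assumptions mirroring the CLI flag `--remove-duplicates`, not verified signatures:

```python
# Assumed import path and signature, mirroring the CLI's
# `csv-clean --remove-duplicates`; adjust to the real API.
import tempfile
from pathlib import Path

from dataops.tools.csv_clean import csv_clean


def test_csv_clean_removes_duplicates() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.csv"
        src.write_text("id,name\n1,Alice\n1,Alice\n")
        out = Path(tmp) / "out.csv"
        result, metadata = csv_clean(str(src), str(out), remove_duplicates=True)
        assert out.exists()
        # The audit wrapper should have appended an event for this call.
        assert Path(".audit").exists()
```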
## Development Notes

- Python 3.11+ required
- Always run `black` before committing (line length 100)
- Type hints are mandatory; `mypy` runs in strict mode
- All data transformations must use the `@audit_operation` decorator
- The MCP server uses auto-discovery; new tools are available without a restart
- Never modify files in the `data/` directory
- Use `scratch/` for intermediate files
- Use `artifacts/` for final outputs
- Known issue: `test_poor_quality_detection` expects a score < 70 but gets 78.14
## Environment Setup

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install with all dependencies
pip install -e ".[dev,mcp]"
pip install python-Levenshtein # For dedupe_er optimal performance
# Run tests (skipping known issue)
pytest -k "not test_poor_quality_detection"
# Run specific new tool tests
pytest tests/test_data_quality_tools.py -v
pytest tests/test_schema_mapping_tools.py -v
pytest tests/test_csv_diff.py tests/test_csv_pivot.py tests/test_csv_sql_multi.py -v
pytest tests/test_csv_clean.py tests/test_dedupe_er.py -v
# Check code quality
black dataops/ --check --line-length 100
flake8 dataops/ --max-line-length=100
mypy dataops/ --strict
```

## MCP Server Setup

For Claude Desktop integration:

1. Start the auto-discovery server:

```bash
python start_mcp_auto.py
```

2. Configure Claude Desktop:

```json
{
  "mcpServers": {
    "dataops-toolkit": {
      "command": "python",
      "args": ["/path/to/dataops-toolkit/start_mcp_auto.py"],
      "env": {"PYTHONPATH": "/path/to/dataops-toolkit"}
    }
  }
}
```

3. All tools are automatically available to Claude.
## Code Style

- Line length: 100 characters
- Use the Black formatter
- Type hints required for all functions
- Docstrings for all public functions
- Use `pathlib.Path` for file paths
- Handle pandas `FutureWarning`s appropriately
- Use `logging` instead of `print` statements
- Comprehensive error handling with clear messages
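A small hypothetical helper showing several of these conventions together:

```python
# Hypothetical helper illustrating the house style: pathlib, logging,
# type hints, a docstring, and explicit error handling.
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def ensure_output_dir(path: str) -> Path:
    """Create the parent directory for an output path and return the path."""
    out = Path(path)
    try:
        out.parent.mkdir(parents=True, exist_ok=True)
    except OSError as exc:
        logger.error("Could not create %s: %s", out.parent, exc)
        raise
    logger.info("Output directory ready: %s", out.parent)
    return out
```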