DataOps Toolkit

A CLI-first, LLM-friendly framework for auditable data transformations with built-in lineage tracking. Every operation is logged with cryptographic hashes, creating an immutable audit trail perfect for compliance, debugging, and understanding complex data pipelines.
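
A concrete way to picture the hashing: each file an operation touches can be fingerprinted with SHA-256, so any later change to its bytes is detectable. The snippet below is a minimal standard-library sketch of that idea, not the toolkit's internal code:

import hashlib
from pathlib import Path

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so even very large CSVs can be fingerprinted cheaply."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of_file("data/customers.csv"))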

🎯 Key Features

  • Immutable Audit Trail: Every operation logged with SHA-256 hashes
  • Automatic Data Lineage: Visual dependency graphs showing complete data flow
  • LLM-Friendly: Designed for seamless integration with AI agents via CLI and MCP
  • CLI-First: Human and machine-readable commands
  • Zero Data Loss: Source files are read-only, all outputs traceable
  • Session Management: Group related operations with metadata
  • Extensible: Easy to add new tools following the framework
  • Model Context Protocol: Full MCP server support for AI integration

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/dataops-toolkit.git
cd dataops-toolkit

# Create virtual environment (Python 3.11+ required)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install -e .

# For development and testing
pip install -e ".[dev]"

# For MCP server support
pip install -e ".[mcp]"

Basic Usage

# Start a session
dataops session start --who "analyst" --purpose "Q4 sales analysis"

# Profile your data
dataops run csv-profile data/customers.csv -o reports/profile.json

# Validate data quality
dataops run dq-validate data/orders.csv rules/order_rules.json

# Run SQL on CSV files
dataops run csv-sql "SELECT * FROM customers WHERE tier = 'Gold'" \
  -i customers=data/customers.csv \
  -o artifacts/gold_customers.csv

# Deduplicate records with fuzzy matching
dataops run dedupe-er data/contacts.csv artifacts/clean_contacts.csv \
  -m name -m email --threshold 85

# Generate lineage graph
dataops lineage build

# End session
dataops session end

📁 Project Structure

your-project/
├── data/           # Source data (read-only)
├── scratch/        # Temporary files (deletable)
├── artifacts/      # Final outputs
├── rules/          # Validation rules and mappings
└── .audit/         # Audit trails (immutable)
    └── SESSION_ID/
        ├── manifest.json     # Session metadata
        ├── events.ndjson     # Operation log
        └── lineage.html      # Visual graph

🛠️ Complete Tool Set

Core Framework ✅

| Component | Status | Description |
| --- | --- | --- |
| Session Management | ✅ Complete | Start/end sessions, persistent across CLI calls |
| Event Logging | ✅ Complete | NDJSON audit trail with SHA-256 hashes |
| Lineage Builder | ✅ Complete | Dependency graphs in HTML/JSON/DOT/GraphML |
| Audit Wrapper | ✅ Complete | Automatic operation tracking decorator (see the sketch below) |
| MCP Server | ✅ Complete | Model Context Protocol for AI agents |
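
The Audit Wrapper is the toolkit's own decorator; the following is only a rough sketch of the pattern it describes, where the function names, event fields, and log path are assumptions rather than the real API. The idea is to hash declared inputs and outputs and append one NDJSON event per call:

import functools
import hashlib
import json
import time
from pathlib import Path

def _sha256(path: str) -> str:
    # Fingerprint a file so the logged event can prove which bytes were read or written
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def audited(tool_name: str, log_path: str = ".audit/events.ndjson"):
    """Hypothetical operation-tracking decorator, sketched for illustration only."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(inputs: list[str], outputs: list[str], **kwargs):
            result = func(inputs, outputs, **kwargs)
            event = {
                "ts": time.time(),
                "tool": tool_name,
                "inputs": [{"path": p, "sha256": _sha256(p)} for p in inputs],
                "outputs": [{"path": p, "sha256": _sha256(p)} for p in outputs],
            }
            with open(log_path, "a") as log:
                log.write(json.dumps(event) + "\n")  # one NDJSON line per operation
            return result
        return wrapper
    return decorator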

Data Profiling & Quality Tools ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| csv_profile | dataops run csv-profile | Comprehensive statistical profiling with data types, patterns, distributions |
| dq_validate | dataops run dq-validate | Rule-based validation against business rules and constraints |
| quality_score | dataops run quality-score | Calculate an overall data quality score across 6 dimensions (see the sketch below) |
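
Which six dimensions quality_score uses is defined by the tool itself; purely as an illustration of the general approach, two commonly used dimensions (completeness and uniqueness) can be computed with pandas and averaged onto a 0-100 scale:

import pandas as pd

def toy_quality_score(csv_path: str) -> dict:
    """Illustrative only: the real tool scores six dimensions, this sketch scores two."""
    df = pd.read_csv(csv_path)
    dimensions = {
        "completeness": 1.0 - df.isna().mean().mean(),  # share of non-missing cells
        "uniqueness": 1.0 - df.duplicated().mean(),      # share of non-duplicate rows
    }
    return {"dimensions": dimensions,
            "overall": round(100 * sum(dimensions.values()) / len(dimensions), 2)}

print(toy_quality_score("raw_data.csv"))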

Schema & Mapping Tools ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| schema_infer | dataops run schema-infer | Automatically infer schema, data types, and relationships |
| map_suggest | dataops run map-suggest | AI-powered field mapping suggestions between schemas (see the sketch below) |
| field_correspond | dataops run field-correspond | Find field correspondences across multiple CSV files |
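
How map_suggest ranks candidates is internal to the tool; a minimal baseline that assumes nothing about its actual model is to score header-name similarity between the two schemas with the standard library's difflib:

import difflib
import pandas as pd

def suggest_mapping(source_csv: str, target_csv: str) -> dict[str, str]:
    """Naive header-name matching; a stand-in for the tool's AI-powered suggestions."""
    source_cols = list(pd.read_csv(source_csv, nrows=0).columns)
    target_cols = list(pd.read_csv(target_csv, nrows=0).columns)
    targets_lower = {t.lower(): t for t in target_cols}
    mapping = {}
    for col in source_cols:
        hits = difflib.get_close_matches(col.lower(), list(targets_lower), n=1, cutoff=0.6)
        if hits:
            mapping[col] = targets_lower[hits[0]]
    return mapping

print(suggest_mapping("source.csv", "target_template.csv"))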

Data Transformation Tools ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| column_cut | dataops run column-cut | Select and reorder CSV columns |
| row_filter | dataops run row-filter | Filter rows based on conditions |
| csv_join | dataops run csv-join | Join two CSV files on specified columns |
| csv_aggregate | dataops run csv-aggregate | Group and aggregate CSV data |
| csv_transform | dataops run csv-transform | Transform CSV to a target schema using a field mapping |
| csv_fill | dataops run csv-fill | Fill missing values using similar rows within groups (see the sketch below) |
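
What "fill missing values using similar rows within groups" means operationally is up to csv_fill; as one plausible interpretation (the grouping column customer_segment comes from the cleaning pipeline later in this README, everything else is an assumption), a pandas sketch fills each gap with the most frequent value seen in the same group:

import pandas as pd

df = pd.read_csv("cleaned.csv")

def fill_with_group_mode(group: pd.DataFrame) -> pd.DataFrame:
    modes = group.mode()
    if modes.empty:
        return group                    # nothing observed in this group to borrow from
    return group.fillna(modes.iloc[0])  # fill each column with its most frequent group value

filled = df.groupby("customer_segment", group_keys=False).apply(fill_with_group_mode)
filled.to_csv("complete.csv", index=False)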

SQL & Advanced Operations ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| csv_sql | dataops run csv-sql | SQL queries over CSV files (DuckDB; see the sketch below) |
| csv_sql_multi | dataops run csv-sql-multi | Multi-query SQL with intermediate tables |
| csv_diff | dataops run csv-diff | Compare two CSV files and identify differences |
| csv_pivot | dataops run csv-pivot | Pivot/unpivot operations for reshaping data |
| func_dep_check | dataops run func-dep-check | Find functional dependencies in data |
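
csv_sql is backed by DuckDB, which can query CSV files in place. Independent of the dataops CLI, that underlying capability looks like this, reusing the Gold-tier query from the Quick Start:

import duckdb

# DuckDB treats a CSV file as a table without loading it into a database first
gold = duckdb.sql("""
    SELECT *
    FROM read_csv_auto('data/customers.csv')
    WHERE tier = 'Gold'
""").df()

gold.to_csv("artifacts/gold_customers.csv", index=False)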

Data Organization Tools ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| csv_split | dataops run csv-split | Split large CSV files into chunks |
| csv_consolidate | dataops run csv-consolidate | Consolidate duplicate rows by unique ID |
| csv_merge | dataops run csv-merge | Merge multiple CSVs with different schemas |

Data Cleansing Tools ✅

| Tool | Command | Purpose |
| --- | --- | --- |
| csv_clean | dataops run csv-clean | Standardize and clean data (encoding, formats, whitespace) |
| dedupe_er | dataops run dedupe-er | Entity resolution with fuzzy matching and deduplication (see the sketch below) |
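
How dedupe_er scores matches is internal to the tool; to give a feel for what an 85-point threshold on name and email means, here is a small standard-library sketch that scores string similarity on a 0-100 scale (quadratic in the number of rows, so illustration only):

import difflib
import pandas as pd

def similarity(a, b) -> float:
    """Rough 0-100 similarity, comparable in spirit to the CLI's --threshold value."""
    return 100 * difflib.SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

df = pd.read_csv("data/contacts.csv")
kept_rows = []
for _, row in df.iterrows():
    duplicate = any(similarity(row["name"], kept["name"]) >= 85
                    and similarity(row["email"], kept["email"]) >= 85
                    for kept in kept_rows)
    if not duplicate:
        kept_rows.append(row)

pd.DataFrame(kept_rows).to_csv("artifacts/clean_contacts.csv", index=False)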

🤖 LLM Integration

Model Context Protocol (MCP) Server

The toolkit includes full MCP integration for seamless AI agent interaction:

# Start the MCP server
python start_mcp_auto.py

# Or use with Claude Desktop (add to config):
{
  "mcpServers": {
    "dataops-toolkit": {
      "command": "python",
      "args": ["/path/to/dataops-toolkit/start_mcp_auto.py"],
      "env": {"PYTHONPATH": "/path/to/dataops-toolkit"}
    }
  }
}

For AI Agents (Claude, GPT, etc.)

# Example: driving the dataops CLI via subprocess (check=True makes a failed step raise)
import subprocess

# Start session
subprocess.run(["dataops", "session", "start", "--who", "ai-agent", "--purpose", "data-cleaning"], check=True)

# Profile data
subprocess.run(["dataops", "run", "csv-profile", "input.csv", "-o", "profile.json"], check=True)

# Clean data
subprocess.run(["dataops", "run", "csv-clean", "input.csv", "clean.csv", "--remove-duplicates"], check=True)

# End session
subprocess.run(["dataops", "session", "end"], check=True)

📊 Example Workflows

Data Quality Assessment Pipeline

# 1. Profile the data
dataops run csv-profile raw_data.csv -o quality/profile.json

# 2. Calculate quality score
dataops run quality-score raw_data.csv -o quality/score.json

# 3. Validate against rules (an illustrative rules file is sketched after this block)
dataops run dq-validate raw_data.csv rules.json -o quality/violations.csv

# 4. Generate quality report
dataops lineage build --format html
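
The schema of a rules file is defined by dq-validate; purely to convey the flavour, a hypothetical rules.json for step 3 might declare per-column constraints like these (every key below is an illustrative assumption, not the documented format):

import json

rules = {
    "columns": {
        "order_id":   {"required": True, "unique": True},
        "amount":     {"required": True, "min": 0},
        "order_date": {"required": True, "format": "YYYY-MM-DD"},
    }
}

with open("rules.json", "w") as f:
    json.dump(rules, f, indent=2)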

Data Cleaning Pipeline

# 1. Deduplicate records
dataops run dedupe-er dirty_data.csv deduped.csv -t 90

# 2. Clean and standardize
dataops run csv-clean deduped.csv cleaned.csv --remove-empty-rows

# 3. Fill missing values
dataops run csv-fill cleaned.csv complete.csv customer_segment

# 4. Final validation
dataops run quality-score complete.csv -o final_quality.json

Schema Migration Pipeline

# 1. Infer source schema
dataops run schema-infer source.csv source_schema.json

# 2. Generate mapping suggestions
dataops run map-suggest source.csv target_template.csv mapping.json

# 3. Transform to new schema
dataops run csv-transform source.csv migrated.csv mapping.json

# 4. Validate migration
dataops run csv-diff source.csv migrated.csv -o migration_report.csv

🧪 Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_data_quality_tools.py

# Run with coverage
pytest --cov=dataops --cov-report=html

# Skip known failing test
pytest -k "not test_poor_quality_detection"

📝 Known Issues

  • Quality Score Test: One test (test_poor_quality_detection) expects score < 70 but gets 78.14
  • Date Format Warnings: Pandas date inference warnings in some tools (cosmetic, doesn't affect functionality)
  • ULID Ordering: Test-only issue with ULID ordering in same millisecond

See docs/KNOWN_ISSUES.md for details.

📚 Documentation

Additional guides live in the docs/ directory; see, for example, docs/KNOWN_ISSUES.md referenced above.

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • How to add new tools
  • Code style guidelines (Black, line length 100)
  • Testing requirements
  • Pull request process

📋 Future Roadmap

  • Cloud storage adapters (S3, GCS, Azure)
  • Database connectors (PostgreSQL, MySQL, Snowflake)
  • Real-time streaming support
  • Web UI for lineage visualization
  • Advanced privacy controls (PII detection/masking)
  • Parallel processing for large datasets
  • Data versioning and rollback
  • Automated data quality monitoring

🙏 Acknowledgments

  • Built with Python, DuckDB, Pandas, and NetworkX
  • Designed for seamless LLM integration
  • Inspired by data engineering best practices

Ready to transform your data workflows with complete auditability? Get started with the Quick Start guide above!