
DataCheck - A Linter for Data Pipelines


DataCheck enforces deterministic validation rules at the pipeline boundary. Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.

Your data source  →  [DataCheck rules]  →  exit 0: pipeline continues
                                        →  exit 1: pipeline stops

View the Documentation for full details.

Mental Model

Code has linters. Infrastructure has policy enforcement. Data pipelines need gates.

DataCheck is that gate.

Why DataCheck?

Most teams detect bad data after the fact - broken dashboards, wrong reports, angry stakeholders. DataCheck enforces validation rules before bad data moves downstream, the same way a linter enforces code quality before bad code ships.

  • Fail fast - structured exit codes stop pipelines at the gate, not after the damage is done
  • Deterministic - rules are explicit and binary. No heuristics. No anomaly scoring. No statistical guessing.
  • SQL pushdown - database checks run as a single aggregate SELECT; no data leaves your warehouse
  • Zero infrastructure - one pip install, one YAML file, runs anywhere
  • CI-native - SARIF output to GitHub Security tab, GitHub Action, Apache Airflow operators

Validate Where Data Lives

For databases, DataCheck executes validation as aggregate SQL inside your warehouse.

  • No data pulled into pandas
  • No row transfer
  • No separate compute layer
  • Single aggregate SELECT per rule set

Validation happens where the data already lives.
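
As an illustration, a not_null rule plus a 0-10000 range rule on an amount column could collapse into one aggregate query along these lines (the table and column names are hypothetical, and the exact SQL DataCheck generates may differ):

-- Illustrative only: one pass over the table, one count per rule
SELECT
    COUNT(*)                                  AS total_rows,
    COUNT(*) - COUNT(amount)                  AS amount_null_count,
    SUM(CASE WHEN amount < 0 OR amount > 10000
             THEN 1 ELSE 0 END)               AS amount_out_of_range_count
FROM orders;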

Why not observability?

DataCheck is not a data observability platform. It does not provide dashboards, trend analysis, anomaly detection, or SaaS backends. Those tools answer "what happened?" - DataCheck answers "does this data meet our rules right now?" Enforcement happens at the gate; investigation happens after.

What DataCheck Is Not

  • Not a monitoring dashboard
  • Not anomaly detection
  • Not a SaaS platform
  • Not a data catalog

It is an enforcement layer.

Demo

Quickstart demo (video): install DataCheck, generate an ecommerce config with sample data, and run validation - all in one go.

Setup

Requirements

To use DataCheck, you need the following installed on your system.

Python 3.10, 3.11, or 3.12

To check your existing version, use the CLI command: python --version or python3 --version.

Pip 21.0 or greater

To check your pip version: pip --version

Installation

DataCheck is available on public PyPI as datacheck-cli.

pip install datacheck-cli

To install with support for a specific data source, use extras:

pip install datacheck-cli[postgresql]    # PostgreSQL
pip install datacheck-cli[mysql]         # MySQL
pip install datacheck-cli[snowflake]     # Snowflake
pip install datacheck-cli[bigquery]      # BigQuery
pip install datacheck-cli[redshift]      # Redshift
pip install datacheck-cli[s3]            # S3
pip install datacheck-cli[all]           # All data sources

Quickstart

To see detailed logs on any command, add --verbose or -v.

Create a config

Option 1 - Start from a template:

datacheck config init --with-sample-data
datacheck config init --template ecommerce --with-sample-data

Option 2 - Write manually. The config defines both the data source and the validation rules.

# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

DataCheck auto-discovers config files in this order: .datacheck.yaml → .datacheck.yml → datacheck.yaml → datacheck.yml. To specify a config explicitly, use the --config flag.

Run validation

datacheck validate                          # auto-discover config
datacheck validate data.csv                 # direct file
datacheck validate --config checks.yaml
echo $?  # 1 if any error-severity rule fails

Data source

| Option | Short | Description |
| --- | --- | --- |
| [DATA_SOURCE] | | Positional: file path or connection string |
| --config | -c | Path to config file (auto-discovered if not set) |
| --source | | Named source from sources.yaml |
| --sources-file | | Path to sources YAML file |
| --table | -t | Database table name |
| --where | -w | WHERE clause for filtering |
| --query | -q | Custom SQL query (alternative to --table) |
| --schema | -s | Schema/dataset name (databases and warehouses) |
| --warehouse | | Snowflake warehouse name |
| --credentials | | Path to credentials file (e.g., BigQuery service account JSON) |
| --region | | Cloud region (Redshift IAM auth) |
| --cluster | | Cluster identifier (Redshift IAM auth) |
| --iam-auth | | Use IAM authentication (Redshift) |
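
For example, the source flags can be combined to validate a filtered slice of a warehouse table (the source, table, and WHERE clause here are illustrative):

datacheck validate --config checks.yaml --sources-file sources.yaml \
  --source production_db --table orders --where "status = 'shipped'"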

Output

| Option | Short | Description |
| --- | --- | --- |
| --output | -o | Save results to file |
| --format | -f | Output format: json (default), sarif, markdown, csv |
| --csv-export | | Export failure details as CSV |
| --suggestions / --no-suggestions | | Show actionable fix suggestions (default: on) |

Execution

| Option | Short | Description |
| --- | --- | --- |
| --parallel | | Enable multi-core execution |
| --workers | | Number of worker processes (default: CPU count) |
| --chunk-size | | Rows per chunk for parallel processing (default: 100000) |
| --progress / --no-progress | | Show progress bar (default: on) |
| --slack-webhook | | Slack webhook URL for result notifications |

Logging

| Option | Short | Description |
| --- | --- | --- |
| --verbose | -v | Set log level to DEBUG |
| --log-level | | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | | console (default) or json |
| --log-file | | Path to log file (enables rotation) |
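
For example, a large local file can be validated in parallel while writing a SARIF report and JSON logs (the values are illustrative):

datacheck validate ./data/orders.csv --parallel --workers 8 --chunk-size 500000 \
  --format sarif --output results.sarif --log-format json --log-level INFO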

Data Source Configuration

File-based data sources are defined inline under data_source in your config. For databases and cloud storage, define named sources in a separate sources.yaml file and reference them.

CSV / Parquet

data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

data_source:
  type: parquet
  path: ./data/orders.parquet

Databases (PostgreSQL, Snowflake, BigQuery, etc.)

For database connections, use named sources in a sources.yaml file. The inline data_source config only supports file-based sources (csv, parquet).

SQL pushdown: database checks run as a single aggregate SELECT per rule - no rows are transferred to the validator. Validation happens inside your warehouse.

# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public

Then reference the source in your config:

# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: id_check
    column: id
    rules:
      not_null: true

Cloud Storage (S3)

Access cloud files via named sources in sources.yaml:

# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

Then run validation against the named source:

datacheck validate --source s3_data --sources-file sources.yaml

Switching Sources at Runtime

Pass --source to pick a different named source at run time without editing the config:

datacheck validate --source snowflake_wh --config checks.yaml

Individual checks can also override the default source:

sources_file: sources.yaml
source: production_db
table: customers

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Environment Variables

Config files support environment variable substitution:

# In sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                    # Required variable
    port: ${DB_PORT:-5432}              # Variable with default value
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values.

CI/CD Integration

DataCheck is built for pipelines. Rules fail hard and fast - no soft warnings that let bad data slip through unnoticed.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | One or more error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |

Rules can have severity: error (default), severity: warning, or severity: info. Only error-severity failures cause exit code 1 and stop the pipeline.
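
For example, a non-blocking check might look like the sketch below - it assumes severity is set on the check entry; see the documentation for the exact placement:

checks:
  - name: amount_check
    column: amount
    severity: warning   # assumed placement: a failure here is reported but does not cause exit code 1
    rules:
      not_null: true
      min: 0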

GitHub Actions (with SARIF to Security tab)

Results appear as PR annotations and in the GitHub Security tab via SARIF 2.1.0 output:

# .github/workflows/data-quality.yml
name: Data Quality Gate
on: [push, pull_request]

permissions:
  contents: read
  security-events: write   # Required for SARIF upload

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: squrtech/datacheck-action@v1
        with:
          config: .datacheck.yaml

Or, without the GitHub Action, generate the SARIF file and upload it yourself:

      - name: Install DataCheck
        run: pip install datacheck-cli

      - name: Run data quality gate
        run: datacheck validate -c .datacheck.yaml --format sarif --output results.sarif

      - name: Upload SARIF to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: results.sarif

Apache Airflow

Use the built-in Airflow operators to gate DAG tasks on data quality:

from airflow_provider_datacheck.operators.datacheck import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/config/orders.datacheck.yaml",
    source_name="production_db",
    table="orders",
    min_pass_rate=100.0,   # Fail if any rule fails
    fail_on_error=True,
)

The operator raises AirflowException when validation fails, halting the DAG at the gate.
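
A typical gate pattern puts the operator upstream of the load step so a failed gate halts everything after it. A minimal sketch, assuming Airflow 2.4+ (EmptyOperator stands in for the real load task):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow_provider_datacheck.operators.datacheck import DataCheckOperator

with DAG("orders_quality_gate", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    validate_orders = DataCheckOperator(
        task_id="validate_orders",
        config_path="/config/orders.datacheck.yaml",
        source_name="production_db",
        table="orders",
        fail_on_error=True,
    )
    load_orders = EmptyOperator(task_id="load_orders")  # placeholder for the real load step

    validate_orders >> load_orders  # load runs only if the gate passes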

Any CI runner

Works with any CI system that respects exit codes:

pip install datacheck-cli
datacheck validate -c .datacheck.yaml
# exits 1 if any error-severity rule fails
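
Because the exit codes are structured, a wrapper script can tell data failures apart from operational errors - a minimal sketch:

datacheck validate -c .datacheck.yaml
case $? in
  0) echo "data quality gate passed" ;;
  1) echo "error-severity rule failed - blocking the pipeline"; exit 1 ;;
  *) echo "DataCheck could not run (config, data loading, or unexpected error)"; exit 1 ;;
esac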

Enforce Schema Contracts

Capture a schema baseline and compare future data against it. Detects column additions, removals, type changes, and nullable changes. Use --fail-on-breaking to exit 1 on breaking changes. The data source can be provided directly, read from your config, or loaded from a named source.

# Auto-discover config or use named source
datacheck schema capture                              # Save current schema as baseline
datacheck schema compare                              # Compare - reports changes, exit 0
datacheck schema compare --fail-on-breaking           # Compare - exit 1 on breaking changes

# Direct file path
datacheck schema capture data.csv
datacheck schema compare data.csv --fail-on-breaking

# Named source
datacheck schema capture --source production_db --sources-file sources.yaml

# Other schema commands
datacheck schema show                                 # Display saved baseline
datacheck schema list                                 # List saved baselines
datacheck schema history                              # View capture history

schema compare options:

| Option | Short | Description |
| --- | --- | --- |
| [DATA_SOURCE] | | Positional: file path or connection string |
| --config | -c | Path to config file |
| --source | | Named source from sources.yaml |
| --sources-file | | Path to sources YAML file |
| --table | -t | Database table name |
| --baseline | -b | Name of baseline to compare against (default: baseline) |
| --baseline-dir | | Directory containing baselines (default: .datacheck/schemas) |
| --rename-threshold | | Similarity threshold for rename detection (default: 0.8) |
| --fail-on-breaking | | Exit 1 if breaking changes are detected |
| --format | -f | Output format: terminal (default) or json |

Python API

Use DataCheck programmatically within your pipelines:

from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

if not summary.all_passed:
    raise ValueError("Data quality gate failed - halting pipeline")

Available Rules

| Category | Rules |
| --- | --- |
| Null & Uniqueness | not_null, unique, unique_combination |
| Numeric | min, max, range, boolean |
| String & Pattern | regex, allowed_values, length, min_length, max_length, type |
| Temporal | max_age, timestamp_range (or date_range), no_future_timestamps, date_format_valid (or date_format) |
| Cross-Column | unique_combination, foreign_key_exists (Python API), sum_equals |
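
As a sketch of how a few of these rules might be combined in a config - the exact value shapes per rule are assumptions; see the documentation:

checks:
  - name: status_check
    column: status
    rules:
      not_null: true
      allowed_values: ["pending", "shipped", "delivered"]   # assumed: list of permitted values

  - name: email_check
    column: email
    rules:
      regex: '^[^@]+@[^@]+\.[^@]+$'   # assumed: pattern supplied as a string
      max_length: 255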

Roadmap

What's coming next:

  • Data Contracts format - --format datacontract aligned with the datacontract.com open spec.
  • dbt integration - generate DataCheck rules directly from your dbt schema YAML.
  • Streaming validation - chunk-based ingestion for 100M+ row datasets without loading into memory.

Development

git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install

See CONTRIBUTING.md for guidelines.


License

Apache License 2.0 - see LICENSE for details.