
DataCheck - A Linter for Data Pipelines


DataCheck enforces deterministic validation rules at the pipeline boundary. Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.

Your data source  →  [DataCheck rules]  →  exit 0: pipeline continues
                                        →  exit 1: pipeline stops

View the Documentation for full details.

Mental Model

Code has linters. Infrastructure has policy enforcement. Data pipelines need gates.

DataCheck is that gate.

Why DataCheck?

Most teams detect bad data after the fact - broken dashboards, wrong reports, angry stakeholders. DataCheck enforces validation rules before bad data moves downstream, the same way a linter enforces code quality before bad code ships.

  • Fail fast - structured exit codes stop pipelines at the gate, not after the damage is done
  • Deterministic - rules are explicit and binary. No heuristics. No anomaly scoring. No statistical guessing.
  • SQL pushdown - database checks run as a single aggregate SELECT; no data leaves your warehouse
  • Zero infrastructure - one pip install, one YAML file, runs anywhere
  • CI-native - SARIF output to GitHub Security tab, GitHub Action, Apache Airflow operators

Validate Where Data Lives

For databases, DataCheck executes validation as aggregate SQL inside your warehouse.

  • No data pulled into pandas
  • No row transfer
  • No separate compute layer
  • Single aggregate SELECT per rule set

Validation happens where the data already lives.
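
As an illustration, a not_null rule plus a 0-10000 range rule on an amount column could collapse into one aggregate query along these lines (the table and column names are hypothetical, and the exact SQL DataCheck generates may differ):

-- Illustrative only: one pass over the table, one count per rule
SELECT
    COUNT(*)                                  AS total_rows,
    COUNT(*) - COUNT(amount)                  AS amount_null_count,
    SUM(CASE WHEN amount < 0 OR amount > 10000
             THEN 1 ELSE 0 END)               AS amount_out_of_range_count
FROM orders;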

Why not observability?

DataCheck is not a data observability platform. It does not provide dashboards, trend analysis, anomaly detection, or SaaS backends. Those tools answer "what happened?" - DataCheck answers "does this data meet our rules right now?" Enforcement happens at the gate; investigation happens after.

What DataCheck Is Not

  • Not a monitoring dashboard
  • Not anomaly detection
  • Not a SaaS platform
  • Not a data catalog

It is an enforcement layer.

Demo

Quickstart demo (video): install DataCheck, generate an ecommerce config with sample data, and run validation - all in one go.

Setup

Requirements

To use DataCheck, you need the following installed on your system.

Python 3.10, 3.11, or 3.12

To check your existing version, use the CLI command: python --version or python3 --version.

Pip 21.0 or greater

To check your pip version: pip --version

Installation

DataCheck is available on public PyPI as datacheck-cli.

pip install datacheck-cli

To install with support for a specific data source, use extras:

pip install datacheck-cli[postgresql]    # PostgreSQL
pip install datacheck-cli[mysql]         # MySQL
pip install datacheck-cli[snowflake]     # Snowflake
pip install datacheck-cli[bigquery]      # BigQuery
pip install datacheck-cli[redshift]      # Redshift
pip install datacheck-cli[s3]            # S3
pip install datacheck-cli[all]           # All data sources

Quickstart

To see detailed logs on any command, add --verbose or -v.

Create a config

Option 1 - Start from a template:

datacheck config init --with-sample-data
datacheck config init --template ecommerce --with-sample-data

Option 2 - Write manually. The config defines both the data source and the validation rules.

# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

DataCheck auto-discovers config files in this order: .datacheck.yaml → .datacheck.yml → datacheck.yaml → datacheck.yml. To specify a config explicitly, use the --config flag.

Run validation

datacheck validate                          # auto-discover config
datacheck validate data.csv                 # direct file
datacheck validate --config checks.yaml
echo $?  # 1 if any error-severity rule fails

Data source

| Option | Short | Description |
| --- | --- | --- |
| [DATA_SOURCE] | | Positional: file path or connection string |
| --config | -c | Path to config file (auto-discovered if not set) |
| --source | | Named source from sources.yaml |
| --sources-file | | Path to sources YAML file |
| --table | -t | Database table name |
| --where | -w | WHERE clause for filtering |
| --query | -q | Custom SQL query (alternative to --table) |
| --schema | -s | Schema/dataset name (databases and warehouses) |
| --warehouse | | Snowflake warehouse name |
| --credentials | | Path to credentials file (e.g., BigQuery service account JSON) |
| --region | | Cloud region (Redshift IAM auth) |
| --cluster | | Cluster identifier (Redshift IAM auth) |
| --iam-auth | | Use IAM authentication (Redshift) |
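
For example, the source flags can be combined to validate a filtered slice of a warehouse table (the source, table, and WHERE clause here are illustrative):

datacheck validate --config checks.yaml --sources-file sources.yaml \
  --source production_db --table orders --where "status = 'shipped'"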

Output

| Option | Short | Description |
| --- | --- | --- |
| --output | -o | Save results to file |
| --format | -f | Output format: json (default), sarif, markdown, csv |
| --csv-export | | Export failure details as CSV |
| --suggestions / --no-suggestions | | Show actionable fix suggestions (default: on) |

Execution

| Option | Short | Description |
| --- | --- | --- |
| --parallel | | Enable multi-core execution |
| --workers | | Number of worker processes (default: CPU count) |
| --chunk-size | | Rows per chunk for parallel processing (default: 100000) |
| --progress / --no-progress | | Show progress bar (default: on) |
| --slack-webhook | | Slack webhook URL for result notifications |

Logging

| Option | Short | Description |
| --- | --- | --- |
| --verbose | -v | Set log level to DEBUG |
| --log-level | | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | | console (default) or json |
| --log-file | | Path to log file (enables rotation) |
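
For example, a large local file can be validated in parallel while writing a SARIF report and JSON logs (the values are illustrative):

datacheck validate ./data/orders.csv --parallel --workers 8 --chunk-size 500000 \
  --format sarif --output results.sarif --log-format json --log-level INFO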

Data Source Configuration

File-based data sources are defined inline under data_source in your config. For databases and cloud storage, define named sources in a separate sources.yaml file and reference them.

CSV / Parquet

data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

data_source:
  type: parquet
  path: ./data/orders.parquet

Databases (PostgreSQL, Snowflake, BigQuery, etc.)

For database connections, use named sources in a sources.yaml file. The inline data_source config only supports file-based sources (csv, parquet).

SQL pushdown: database checks run as a single aggregate SELECT per rule - no rows are transferred to the validator. Validation happens inside your warehouse.

# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public

Then reference the source in your config:

# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: id_check
    column: id
    rules:
      not_null: true

Cloud Storage (S3)

Access cloud files via named sources in sources.yaml:

# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

Then run validation against the named source:

datacheck validate --source s3_data --sources-file sources.yaml

Switching Sources at Runtime

Pass --source to pick a different named source at run time without editing the config:

datacheck validate --source snowflake_wh --config checks.yaml

Individual checks can also override the default source:

sources_file: sources.yaml
source: production_db
table: customers

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Environment Variables

Config files support environment variable substitution:

# In sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                    # Required variable
    port: ${DB_PORT:-5432}              # Variable with default value
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values.

CI/CD Integration

DataCheck is built for pipelines. Rules fail hard and fast - no soft warnings that let bad data slip through unnoticed.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | One or more error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |

Rules can have severity: error (default), severity: warning, or severity: info. Only error-severity failures cause exit code 1 and stop the pipeline.
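
For example, a non-blocking check might look like the sketch below - it assumes severity is set on the check entry; see the documentation for the exact placement:

checks:
  - name: amount_check
    column: amount
    severity: warning   # assumed placement: a failure here is reported but does not cause exit code 1
    rules:
      not_null: true
      min: 0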

GitHub Actions (with SARIF to Security tab)

Results appear as PR annotations and in the GitHub Security tab via SARIF 2.1.0 output:

# .github/workflows/data-quality.yml
name: Data Quality Gate
on: [push, pull_request]

permissions:
  contents: read
  security-events: write   # Required for SARIF upload

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: squrtech/datacheck-action@v1
        with:
          config: .datacheck.yaml

Or, without the GitHub Action, generate the SARIF file and upload it yourself:

      - name: Install DataCheck
        run: pip install datacheck-cli

      - name: Run data quality gate
        run: datacheck validate -c .datacheck.yaml --format sarif --output results.sarif

      - name: Upload SARIF to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: results.sarif

Apache Airflow

Use the built-in Airflow operators to gate DAG tasks on data quality:

from airflow_provider_datacheck.operators.datacheck import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/config/orders.datacheck.yaml",
    source_name="production_db",
    table="orders",
    min_pass_rate=100.0,   # Fail if any rule fails
    fail_on_error=True,
)

The operator raises AirflowException when validation fails, halting the DAG at the gate.
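
A typical gate pattern puts the operator upstream of the load step so a failed gate halts everything after it. A minimal sketch, assuming Airflow 2.4+ (EmptyOperator stands in for the real load task):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow_provider_datacheck.operators.datacheck import DataCheckOperator

with DAG("orders_quality_gate", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    validate_orders = DataCheckOperator(
        task_id="validate_orders",
        config_path="/config/orders.datacheck.yaml",
        source_name="production_db",
        table="orders",
        fail_on_error=True,
    )
    load_orders = EmptyOperator(task_id="load_orders")  # placeholder for the real load step

    validate_orders >> load_orders  # load runs only if the gate passes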

Any CI runner

Works with any CI system that respects exit codes:

pip install datacheck-cli
datacheck validate -c .datacheck.yaml
# exits 1 if any error-severity rule fails
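
Because the exit codes are structured, a wrapper script can tell data failures apart from operational errors - a minimal sketch:

datacheck validate -c .datacheck.yaml
case $? in
  0) echo "data quality gate passed" ;;
  1) echo "error-severity rule failed - blocking the pipeline"; exit 1 ;;
  *) echo "DataCheck could not run (config, data loading, or unexpected error)"; exit 1 ;;
esac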

Enforce Schema Contracts

Capture a schema baseline and compare future data against it. Detects column additions, removals, type changes, and nullable changes. Use --fail-on-breaking to exit 1 on breaking changes. The data source can be provided directly, read from your config, or loaded from a named source.

# Auto-discover config or use named source
datacheck schema capture                              # Save current schema as baseline
datacheck schema compare                              # Compare - reports changes, exit 0
datacheck schema compare --fail-on-breaking           # Compare - exit 1 on breaking changes

# Direct file path
datacheck schema capture data.csv
datacheck schema compare data.csv --fail-on-breaking

# Named source
datacheck schema capture --source production_db --sources-file sources.yaml

# Other schema commands
datacheck schema show                                 # Display saved baseline
datacheck schema list                                 # List saved baselines
datacheck schema history                              # View capture history

schema compare options:

| Option | Short | Description |
| --- | --- | --- |
| [DATA_SOURCE] | | Positional: file path or connection string |
| --config | -c | Path to config file |
| --source | | Named source from sources.yaml |
| --sources-file | | Path to sources YAML file |
| --table | -t | Database table name |
| --baseline | -b | Name of baseline to compare against (default: baseline) |
| --baseline-dir | | Directory containing baselines (default: .datacheck/schemas) |
| --rename-threshold | | Similarity threshold for rename detection (default: 0.8) |
| --fail-on-breaking | | Exit 1 if breaking changes are detected |
| --format | -f | Output format: terminal (default) or json |

Python API

Use DataCheck programmatically within your pipelines:

from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

if not summary.all_passed:
    raise ValueError("Data quality gate failed - halting pipeline")

Available Rules

| Category | Rules |
| --- | --- |
| Null & Uniqueness | not_null, unique, unique_combination |
| Numeric | min, max, range, boolean |
| String & Pattern | regex, allowed_values, length, min_length, max_length, type |
| Temporal | max_age, timestamp_range (or date_range), no_future_timestamps, date_format_valid (or date_format) |
| Cross-Column | unique_combination, foreign_key_exists (Python API), sum_equals |
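
As a sketch of how a few of these rules might be combined in a config - the exact value shapes per rule are assumptions; see the documentation:

checks:
  - name: status_check
    column: status
    rules:
      not_null: true
      allowed_values: ["pending", "shipped", "delivered"]   # assumed: list of permitted values

  - name: email_check
    column: email
    rules:
      regex: '^[^@]+@[^@]+\.[^@]+$'   # assumed: pattern supplied as a string
      max_length: 255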

Roadmap

What's coming next:

  • Data Contracts format - --format datacontract aligned with the datacontract.com open spec.
  • dbt integration - generate DataCheck rules directly from your dbt schema YAML.
  • Streaming validation - chunk-based ingestion for 100M+ row datasets without loading into memory.

Development

git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install

See CONTRIBUTING.md for guidelines.


License

Apache License 2.0 - see LICENSE for details.