DataCheck enforces deterministic validation rules at the pipeline boundary. Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
Your data source → [DataCheck rules] → exit 0: pipeline continues
→ exit 1: pipeline stops
View the Documentation for full details.
Code has linters. Infrastructure has policy enforcement. Data pipelines need gates.
DataCheck is that gate.
Most teams detect bad data after the fact - broken dashboards, wrong reports, angry stakeholders. DataCheck enforces validation rules before bad data moves downstream, the same way a linter enforces code quality before bad code ships.
- Fail fast - structured exit codes stop pipelines at the gate, not after the damage is done
- Deterministic - rules are explicit and binary. No heuristics. No anomaly scoring. No statistical guessing.
- SQL pushdown - database checks run as a single aggregate `SELECT`; no data leaves your warehouse
- Zero infrastructure - one `pip install`, one YAML file, runs anywhere
- CI-native - SARIF output to the GitHub Security tab, a GitHub Action, Apache Airflow operators
For databases, DataCheck executes validation as aggregate SQL inside your warehouse.
- No data pulled into pandas
- No row transfer
- No separate compute layer
- Single aggregate `SELECT` per rule set
Validation happens where the data already lives.
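To make the idea concrete, here is a rough sketch of how several rules could be folded into one aggregate query. This is illustrative only - it is not DataCheck's actual generated SQL, and the table and conditions are hypothetical:

```python
# Rough sketch: fold several rules into one aggregate query per rule set.
# Illustrative only -- not DataCheck's actual generated SQL.
def build_aggregate_query(table: str, rules: dict[str, list[str]]) -> str:
    """Map each rule to a conditional count; one SELECT covers the whole rule set."""
    exprs = ["COUNT(*) AS total_rows"]
    for column, conditions in rules.items():
        for i, cond in enumerate(conditions):
            # Each failing condition becomes one counter column.
            exprs.append(f"SUM(CASE WHEN {cond} THEN 1 ELSE 0 END) AS {column}_fail_{i}")
    return f"SELECT {', '.join(exprs)} FROM {table}"

query = build_aggregate_query(
    "orders",
    {"id": ["id IS NULL"], "amount": ["amount < 0", "amount > 10000"]},
)
print(query)
```

One round trip returns one row of counters, regardless of how many rules are in the set.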
DataCheck is not a data observability platform. It does not provide dashboards, trend analysis, anomaly detection, or SaaS backends. Those tools answer "what happened?" - DataCheck answers "does this data meet our rules right now?" Enforcement happens at the gate; investigation happens after.
- Not a monitoring dashboard
- Not anomaly detection
- Not a SaaS platform
- Not a data catalog
It is an enforcement layer.
Install DataCheck, generate an ecommerce config with sample data, and run validation - all in one go.
To use DataCheck, you need the following installed on your system:

- Python 3.10, 3.11, or 3.12 - check with `python --version` or `python3 --version`
- pip 21.0 or greater - check with `pip --version`
DataCheck is available on public PyPI as `datacheck-cli`.

```bash
pip install datacheck-cli
```

To install with support for a specific data source, use extras:

```bash
pip install datacheck-cli[postgresql]  # PostgreSQL
pip install datacheck-cli[mysql]       # MySQL
pip install datacheck-cli[snowflake]   # Snowflake
pip install datacheck-cli[bigquery]    # BigQuery
pip install datacheck-cli[redshift]    # Redshift
pip install datacheck-cli[s3]          # S3
pip install datacheck-cli[all]         # All data sources
```

To see detailed logs on any command, add `--verbose` or `-v`.
Option 1 - Start from a template:

```bash
datacheck config init --with-sample-data
datacheck config init --template ecommerce --with-sample-data
```

Option 2 - Write manually. The config defines both the data source and the validation rules.
```yaml
# .datacheck.yaml
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true
  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000
```

DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.datacheck.yml` → `datacheck.yaml` → `datacheck.yml`. To specify a config explicitly, use the `--config` flag.
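The discovery order above amounts to a first-match lookup. A minimal sketch (illustrative, not DataCheck's internals):

```python
from pathlib import Path

# Auto-discovery order, first match wins (mirrors the documented order).
CANDIDATES = [".datacheck.yaml", ".datacheck.yml", "datacheck.yaml", "datacheck.yml"]

def discover_config(directory: Path):
    """Return the first config file present in `directory`, or None."""
    for name in CANDIDATES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```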
```bash
datacheck validate                       # auto-discover config
datacheck validate data.csv              # direct file
datacheck validate --config checks.yaml
echo $?                                  # 1 if any error-severity rule fails
```

Data source
| Option | Short | Description |
|---|---|---|
| `[DATA_SOURCE]` | | Positional: file path or connection string |
| `--config` | `-c` | Path to config file (auto-discovered if not set) |
| `--source` | | Named source from `sources.yaml` |
| `--sources-file` | | Path to sources YAML file |
| `--table` | `-t` | Database table name |
| `--where` | `-w` | WHERE clause for filtering |
| `--query` | `-q` | Custom SQL query (alternative to `--table`) |
| `--schema` | `-s` | Schema/dataset name (databases and warehouses) |
| `--warehouse` | | Snowflake warehouse name |
| `--credentials` | | Path to credentials file (e.g., BigQuery service account JSON) |
| `--region` | | Cloud region (Redshift IAM auth) |
| `--cluster` | | Cluster identifier (Redshift IAM auth) |
| `--iam-auth` | | Use IAM authentication (Redshift) |
Output

| Option | Short | Description |
|---|---|---|
| `--output` | `-o` | Save results to file |
| `--format` | `-f` | Output format: json (default), sarif, markdown, csv |
| `--csv-export` | | Export failure details as CSV |
| `--suggestions` / `--no-suggestions` | | Show actionable fix suggestions (default: on) |
Execution

| Option | Short | Description |
|---|---|---|
| `--parallel` | | Enable multi-core execution |
| `--workers` | | Number of worker processes (default: CPU count) |
| `--chunk-size` | | Rows per chunk for parallel processing (default: 100000) |
| `--progress` / `--no-progress` | | Show progress bar (default: on) |
| `--slack-webhook` | | Slack webhook URL for result notifications |
Logging

| Option | Short | Description |
|---|---|---|
| `--verbose` | `-v` | Set log level to DEBUG |
| `--log-level` | | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--log-format` | | console (default) or json |
| `--log-file` | | Path to log file (enables rotation) |
File-based data sources are defined inline under data_source in your config. For databases and cloud storage, define named sources in a separate sources.yaml file and reference them.
```yaml
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8
```

```yaml
data_source:
  type: parquet
  path: ./data/orders.parquet
```

For database connections, use named sources in a sources.yaml file. The inline data_source config only supports file-based sources (csv, parquet).
SQL pushdown: database checks run as a single aggregate `SELECT` per rule - no rows are transferred to the validator. Validation happens inside your warehouse.
```yaml
# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public
```

Then reference the source in your config:
```yaml
# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
```

Access cloud files via named sources in sources.yaml:
```yaml
# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
```

```bash
datacheck validate --source s3_data --sources-file sources.yaml
```

Switch sources at runtime:

```bash
datacheck validate --source snowflake_wh --config checks.yaml
```

Individual checks can also override the default source:
```yaml
sources_file: sources.yaml
source: production_db
table: customers

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true
  - name: order_total
    column: total
    source: snowflake_wh  # Override source for this check
    table: orders
    rules:
      min: 0
```

Config files support environment variable substitution:
```yaml
# In sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}        # Required variable
    port: ${DB_PORT:-5432}  # Variable with default value
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}
```

Use `datacheck config env` to list all variables referenced in a config and their current values.
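The `${VAR}` / `${VAR:-default}` syntax follows shell-style substitution. A minimal sketch of the expansion logic (assumed behaviour for illustration, not DataCheck's implementation):

```python
import re

# Matches ${NAME} and ${NAME:-default} (shell-style substitution, assumed semantics).
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def substitute(text: str, env: dict) -> str:
    """Expand variables from `env`; fall back to the default; error if neither exists."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"required variable {name} is not set")
    return _PATTERN.sub(repl, text)

print(substitute("port: ${DB_PORT:-5432}", {}))                    # port: 5432
print(substitute("host: ${DB_HOST}", {"DB_HOST": "db.internal"}))  # host: db.internal
```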
DataCheck is built for pipelines. Rules fail hard and fast - no soft warnings that let bad data slip through unnoticed.
| Code | Meaning |
|---|---|
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | One or more error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |
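A pipeline driver can branch on these codes. A minimal sketch in Python, assuming only the documented exit codes and that `datacheck` is on PATH:

```python
import subprocess

# Exit-code meanings, per the table above.
EXIT_MEANINGS = {
    0: "all rules passed",
    1: "error-severity rule failed",
    2: "configuration error",
    3: "data loading error",
    4: "unexpected error",
}

def run_gate(config: str = ".datacheck.yaml") -> int:
    """Run the gate and report why it stopped."""
    proc = subprocess.run(["datacheck", "validate", "-c", config])
    meaning = EXIT_MEANINGS.get(proc.returncode, "unknown exit code")
    print(f"datacheck exited {proc.returncode}: {meaning}")
    return proc.returncode
```

Codes 2-4 signal a problem with the run itself rather than with the data, so a driver may want to alert on them differently than on code 1.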
Rules can have severity: error (default), severity: warning, or severity: info. Only error-severity failures cause exit code 1 and stop the pipeline.
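For instance, a freshness rule can be demoted to a warning so it reports without blocking the pipeline. A sketch - the column names are illustrative, and the exact placement of the `severity` key is our assumption:

```yaml
checks:
  - name: amount_check        # severity defaults to error - failure exits 1
    column: amount
    rules:
      not_null: true
  - name: freshness_check
    column: updated_at
    severity: warning         # reported, but does not stop the pipeline
    rules:
      no_future_timestamps: true
```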
Results appear as PR annotations and in the GitHub Security tab, via SARIF 2.1.0 output:
```yaml
# .github/workflows/data-quality.yml
name: Data Quality Gate

on: [push, pull_request]

permissions:
  contents: read
  security-events: write  # Required for SARIF upload

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: squrtech/datacheck-action@v1
        with:
          config: .datacheck.yaml
```

Or, without the GitHub Action, generate the SARIF file and upload it directly:
```yaml
- name: Install DataCheck
  run: pip install datacheck-cli
- name: Run data quality gate
  run: datacheck validate -c .datacheck.yaml --format sarif --output results.sarif
- name: Upload SARIF to GitHub Security tab
  uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: results.sarif
```

Use the built-in Airflow operators to gate DAG tasks on data quality:
```python
from airflow_provider_datacheck.operators.datacheck import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/config/orders.datacheck.yaml",
    source_name="production_db",
    table="orders",
    min_pass_rate=100.0,  # Fail if any rule fails
    fail_on_error=True,
)
```

The operator raises `AirflowException` when validation fails, halting the DAG at the gate.
Works with any CI system that respects exit codes:

```bash
pip install datacheck-cli
datacheck validate -c .datacheck.yaml
# exits 1 if any error-severity rule fails
```

Capture a schema baseline and compare future data against it. Detects column additions, removals, type changes, and nullable changes. Use `--fail-on-breaking` to exit 1 on breaking changes. The data source can be provided directly, read from your config, or loaded from a named source.
```bash
# Auto-discover config or use named source
datacheck schema capture                      # Save current schema as baseline
datacheck schema compare                      # Compare - reports changes, exit 0
datacheck schema compare --fail-on-breaking   # Compare - exit 1 on breaking changes

# Direct file path
datacheck schema capture data.csv
datacheck schema compare data.csv --fail-on-breaking

# Named source
datacheck schema capture --source production_db --sources-file sources.yaml

# Other schema commands
datacheck schema show      # Display saved baseline
datacheck schema list      # List saved baselines
datacheck schema history   # View capture history
```

`schema compare` options:
| Option | Short | Description |
|---|---|---|
| `[DATA_SOURCE]` | | Positional: file path or connection string |
| `--config` | `-c` | Path to config file |
| `--source` | | Named source from `sources.yaml` |
| `--sources-file` | | Path to sources YAML file |
| `--table` | `-t` | Database table name |
| `--baseline` | `-b` | Name of baseline to compare against (default: baseline) |
| `--baseline-dir` | | Directory containing baselines (default: `.datacheck/schemas`) |
| `--rename-threshold` | | Similarity threshold for rename detection (default: 0.8) |
| `--fail-on-breaking` | | Exit 1 if breaking changes are detected |
| `--format` | `-f` | Output format: terminal (default) or json |
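The breaking/non-breaking distinction can be pictured with a small sketch - here assuming (our assumption, not DataCheck's documented policy) that removals and type changes are breaking while additions are not:

```python
# Illustrative schema-drift classification. Schemas are modeled as
# {column_name: type_name} dicts; DataCheck's actual rules may differ.
def diff_schemas(baseline: dict, current: dict) -> dict:
    changes = {"added": [], "removed": [], "type_changed": []}
    for col in current.keys() - baseline.keys():
        changes["added"].append(col)            # new column: non-breaking (assumed)
    for col in baseline.keys() - current.keys():
        changes["removed"].append(col)          # dropped column: breaking
    for col in baseline.keys() & current.keys():
        if baseline[col] != current[col]:
            changes["type_changed"].append(col)  # type change: breaking
    return changes

def has_breaking_changes(changes: dict) -> bool:
    return bool(changes["removed"] or changes["type_changed"])
```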
Use DataCheck programmatically within your pipelines:
```python
from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

if not summary.all_passed:
    raise ValueError("Data quality gate failed - halting pipeline")
```

| Category | Rules |
|---|---|
| Null & Uniqueness | `not_null`, `unique`, `unique_combination` |
| Numeric | `min`, `max`, `range`, `boolean` |
| String & Pattern | `regex`, `allowed_values`, `length`, `min_length`, `max_length`, `type` |
| Temporal | `max_age`, `timestamp_range` (or `date_range`), `no_future_timestamps`, `date_format_valid` (or `date_format`) |
| Cross-Column | `unique_combination`, `foreign_key_exists` (Python API), `sum_equals` |
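As an illustration of combining rules from several of these categories in one check block (column names and values here are hypothetical, and the exact value syntax for each rule may differ):

```yaml
checks:
  - name: status_check
    column: status
    rules:
      not_null: true
      allowed_values: [pending, paid, shipped, cancelled]
  - name: email_check
    column: email
    rules:
      regex: "^[^@]+@[^@]+\\.[^@]+$"
      max_length: 254
```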
What's coming next:

- Data Contracts format - `--format datacontract`, aligned with the datacontract.com open spec.
- dbt integration - generate DataCheck rules directly from your dbt schema YAML.
- Streaming validation - chunk-based ingestion for 100M+ row datasets without loading into memory.
```bash
git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install
```

See CONTRIBUTING.md for guidelines.
Apache License 2.0 - see LICENSE for details.