Thank you for considering contributing! Whether you are adding a new test or tailoring existing ones to your data and validation needs, your contributions help make the framework more flexible, robust, and useful for a wide range of datasets.
Before contributing, ensure your environment is set up as described in Getting Started.
- Create a new branch
  In a terminal at the project root:
  ```
  git checkout -b branch_name
  ```
- Install required libraries
  In a terminal, run:
  ```
  pip install -r requirements.txt
  ```
- Create a test by following the steps in Adding New Tests.
- Verify your tests work as expected by following Testing your Changes.

Note: If you are only testing locally, you can stop here. The remaining steps are only required if you want to contribute your test to the framework.

- Stage your changes
  Only add the files you want included in the pull request.
  In a terminal at the project root:
  ```
  git add dimensions/<dimension>/x#.py  # e.g., git add dimensions/accuracy/a1.py
  ```
- Commit and push your changes
  In a terminal at the project root:
  ```
  git commit -m "Added new test X# for [dimension]"
  git push -u origin branch_name
  ```
- Open a pull request to merge your changes into the main repository.
  Describe the test and include examples if helpful.

TODO: Set a standard for merging changes to main repo
The framework organizes tests into dimensions, where tests that evaluate similar aspects of data quality are conceptually grouped as metrics. Each dimension is stored in its own folder, containing all files needed to define, manage, and load its tests.
- `dimension_reference.py`: Manages and loads all tests in a dimension, collects their metadata, and provides `run_tests()` to execute selected or all tests with their specific parameters.
- Test file (e.g., `a1.py`, `c2.py`): Defines a single test's parameters, logic, and metadata. Each test returns a score and optionally a CSV used for reporting.
- `test_template.py`: Provides a template for contributors to add a new test, including placeholders for parameters, logic, and metadata.
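To make the loader's role concrete, here is a minimal sketch of how a per-dimension loader might collect and run tests. Everything except the `run_tests()` name is an illustrative assumption, not the framework's actual code:

```python
class DimensionReference:
    """Minimal sketch of a per-dimension test loader (illustrative only)."""

    def __init__(self):
        # Maps a test name (e.g., 'A1') to a zero-arg callable returning a score.
        self._tests = {}

    def register(self, name, test_fn):
        self._tests[name] = test_fn

    def run_tests(self, selected=None):
        # Run only the named tests, or every registered test when selected is None.
        names = selected if selected is not None else list(self._tests)
        return {name: self._tests[name]() for name in names}
```

Calling `run_tests(['A1'])` would then execute only test `A1`, mirroring how the notebook selects individual tests.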
Add a new test or customize an existing one by copying the test template and filling in the `# TODO` sections.
- Test names must follow the `X#` naming convention (e.g., `A1`).
- Any new library added for a test is to be added to the `requirements.txt` file with the package version specified.
- Identify the dimension your test belongs to. See the Tests Reference Table for reference.
- Navigate to the corresponding dimension folder under `dimensions/`.
- Copy the test template `test_template.py` and rename it to your test name (e.g., `a5.py`).
- Edit the template following the `# TODO` comments:
  - Define the test name and test-specific parameters.
  - Assign test-specific attributes to `self` variables.
  - Set `self.threshold` (use `None` if not applicable).
  - Set `self.selected_columns` (use `None` if your test does not target specific columns).
  - Implement your test logic in `run_test()`.
  - Optional: Define test metadata and parameter types in `create_metadata()` to run your test in the UI tool. Skip this step if using only the notebook.
    - Set each parameter's type using `ParameterType.[TYPE]` (see available types here).
- Import any required operation modules from `utils/` at the top of your test file (e.g., `from utils import item_operations, column_operations, table_operations`).
  - See Operations for a list of available operations.
  - See Creating Custom Operations to add custom operations.
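Put together, a filled-in template might look roughly like the following minimal sketch. The class name, the completeness-style check, and the pandas-based `run_test()` signature are assumptions for illustration; follow the real `# TODO` comments in `test_template.py`:

```python
import pandas as pd


class A5:
    """Hypothetical test A5: share of non-null values in the selected columns."""

    def __init__(self, threshold=0.9, selected_columns=None):
        self.threshold = threshold                # None if no pass/fail cutoff applies
        self.selected_columns = selected_columns  # None means "use all columns"

    def run_test(self, df):
        cols = self.selected_columns or list(df.columns)
        # Score = mean fraction of non-null cells across the chosen columns.
        score = float(df[cols].notna().mean().mean())
        passed = self.threshold is None or score >= self.threshold
        return score, passed
```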
- Open the notebook
  Open the Data Quality Complete notebook and choose your dataset file (CSV or XLSX):
  - In Setup, set `DATA_FILE_PATH` in the last code cell.
  - Dataset requirements:
    - The data must be on the first sheet in the Excel document.
    - The first row must be the column names.
    - The test won't run if the Excel file is open.
- Register your new test under its dimension in the notebook:
  - Set test-specific parameters in the `test_params` dictionary.
    - These should match the parameters defined in the `__init__` header in your template file.
  - Set `run_tests()` to include only your test.
    - Example: `run_tests(['C3'])`
- Restart the kernel
  Go to Kernel (top left-hand corner) > Restart Kernel.
- Run the required sections
  Run a selected cell using Shift+Enter, or go to Run (top left-hand corner) > Run Selected Cell.
  Run:
  - The Setup section
  - Your test's dimension section
  - The Determine Overall Data Quality section
- View results
  - For the calculated Data Quality, see the output of the last cell in the notebook.
  - For individual test and dimension scores, see the output below each code cell for the given dimension.
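The registration step above might look like the following hypothetical notebook cell; the parameter names (`threshold`, `selected_columns`) are assumptions and must match your test's actual `__init__` signature:

```python
# Hypothetical notebook cell (parameter names are illustrative):
test_params = {
    'C3': {
        'threshold': 0.95,
        'selected_columns': ['email'],
    }
}

# run_tests(['C3'])  # would run only your new test with the parameters above
```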
To run all tests in the framework (including your new test), see Run Tests with the Notebook in the main repository.
Before re-launching the UI tool:

- Ensure `create_metadata()` is updated with test metadata and parameter types.
- Stop any previous UI session
  In a terminal running the UI, press CTRL + C.
- Re-launch the UI tool:
  ```
  streamlit run ui_tool/dq_ui.py
  ```
This FAQ is designed to quickly resolve the most common contributor issues.
Q1: My test is not showing up in the notebook
- Ensure your test file is in the correct folder under `dimensions/<dimension>/`.
- Check that the file name follows the naming convention (e.g., `a1.py`, `c3.py`).
- Restart the notebook kernel after adding a new test.
Q2: My test is not running or is failing

- Ensure your test is included in `run_tests()`.
  - Example: `run_tests(['C3'])` to run a single test.
- Check that the `test_params` dictionary matches the parameters defined in your test's `__init__` method.
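A quick way to see why mismatched parameter names fail: constructing a test with `**test_params` raises a `TypeError` when the dictionary keys don't match the `__init__` signature. The test class and parameter names below are hypothetical:

```python
class C3:
    """Hypothetical test with two constructor parameters."""

    def __init__(self, threshold, selected_columns):
        self.threshold = threshold
        self.selected_columns = selected_columns


good_params = {'threshold': 0.9, 'selected_columns': ['email']}
bad_params = {'thresh': 0.9, 'selected_columns': ['email']}  # misspelled key

C3(**good_params)  # works
try:
    C3(**bad_params)
except TypeError as e:
    print('Parameter mismatch:', e)
```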
Q3: I’m getting import or module errors
- From the project root, confirm all required libraries are installed:
  ```
  pip install -r requirements.txt
  ```
- Ensure your environment meets the prerequisites:
  - Python 3.10 or later
  - Git installed
  - Jupyter Notebook or JupyterLab installed