Add percent deletion validation check #1864

Merged
vish-cs merged 1 commit into datacommonsorg:master from vish-cs:diff
Feb 5, 2026

@vish-cs vish-cs commented Feb 5, 2026

Adds a new validation check, DELETED_RECORDS_PERCENT, which uses the differ output to check whether the percentage of deleted observation nodes (relative to the count of observations in the previous data) is below a specified threshold. To support this, the differ now dumps its summary stats to a file, differ_summary.json.
The check is enabled by default for all imports with a threshold value of 10, and the DELETED_RECORDS_COUNT check is disabled.
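
As a rough illustration of the arithmetic described above, the check can be sketched as follows. This is not the code from the PR: the function names `deleted_percent` and `check_deleted_percent` are hypothetical, and both the handling of an empty previous dataset and the inclusive threshold boundary are assumptions.

```python
def deleted_percent(deleted_count: int, previous_size: int) -> float:
    """Deleted observations as a percentage of the previous data size."""
    if previous_size <= 0:
        # Assumption: with no previous data there is nothing to delete.
        return 0.0
    return 100.0 * deleted_count / previous_size


def check_deleted_percent(deleted_count: int,
                          previous_size: int,
                          threshold: float = 10.0) -> bool:
    """Returns True when the deleted percentage is within the threshold.

    The default of 10 mirrors the default threshold stated in the PR
    description; treating the boundary as passing is an assumption.
    """
    return deleted_percent(deleted_count, previous_size) <= threshold
```

For example, 5 deletions out of 100 previous observations would pass with the default threshold, while 11 out of 100 would fail.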

@gemini-code-assist

Summary of Changes

Hello @vish-cs, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data import validation system by introducing a percentage-based check for deleted records. The primary goal is to prevent accidental or erroneous large-scale data deletions by setting a configurable threshold. This involves updating the data differencing tool to produce more comprehensive summary data in JSON format, and modifying the validation runner to consume this new output, thereby making the import process more robust and secure against data loss.

Highlights

  • New Validation Rule: Introduced a new validation rule, DELETED_RECORDS_PERCENT, which checks if the percentage of deleted records in an import operation exceeds a configurable threshold. This enhances data integrity by flagging potentially large-scale unintended deletions.
  • Enhanced Differ Output: The import_differ tool now generates a differ_summary.json file alongside existing CSV outputs. This JSON file contains crucial summary statistics, including the current_data_size, which is essential for calculating deletion percentages.
  • Flexible Differ Output Handling: The validation runner (runner.py) has been updated to intelligently handle the differ_output parameter, accepting either a directory (where it expects both CSV and JSON summary files) or a single CSV file. This provides greater flexibility in how differ results are processed.
  • Updated Data Paths and Logic: Adjustments were made in import_executor.py to modify GCS output paths for generated MCF files to /genmcf and to prioritize looking for previous data in genmcf paths, with a fallback to validation paths for backward compatibility.
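
The directory-or-file handling of differ_output described in the highlights could look roughly like the sketch below. The helper name `resolve_differ_output` is hypothetical, only local paths are handled (GCS paths are out of scope here), and the file names are the ones mentioned in this summary's changelog rather than verified against the code.

```python
import os
from typing import Optional, Tuple


def resolve_differ_output(differ_output: str) -> Tuple[str, Optional[str]]:
    """Maps a differ_output value to (csv_path, summary_json_path).

    If differ_output is a directory, both files are expected inside it.
    Otherwise it is treated as a single CSV file with no JSON summary.
    """
    if os.path.isdir(differ_output):
        csv_path = os.path.join(differ_output, 'obs_diff_summary.csv')
        json_path = os.path.join(differ_output, 'differ_summary.json')
        return csv_path, json_path
    # Single data file: no summary is available alongside it.
    return differ_output, None
```

Returning `None` for the summary path leaves it to the caller to decide how a percentage check should behave when no summary is available.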

Changelog

  • import-automation/executor/app/executor/import_executor.py
    • Updated GCS output path for generated MCF files from /validation to /genmcf.
    • Modified logic to prioritize genmcf path for previous data, with a fallback to validation path.
    • Adjusted differ_output_file assignment to use validation_output_path.
  • tools/import_differ/differ_utils.py
    • Added json import.
    • Implemented write_json_data utility function for writing JSON output to local or GCS paths.
  • tools/import_differ/import_differ.py
    • Introduced calculation and output of a diff_summary dictionary containing data sizes and diff sizes.
    • Writes diff_summary to differ_summary.csv and differ_summary.json.
  • tools/import_validation/README.md
    • Documented the new DELETED_RECORDS_PERCENT validation rule.
  • tools/import_validation/runner.py
    • Added differ_summary to data sources in the ValidationRunner.
    • Mapped DELETED_RECORDS_PERCENT to its validation function.
    • Refactored _initialize_data_sources to handle differ_output as either a directory (expecting obs_diff_summary.csv and differ_summary.json) or a single CSV file.
    • Updated differ_output flag description to indicate it can be a directory or data file.
    • Modified validation execution to pass differ_summary to the new percentage validation.
  • tools/import_validation/runner_test.py
    • Added test_runner_deleted_records_percent to test the new validation rule's integration and correct data parsing.
  • tools/import_validation/validation_config.json
    • Updated the default validation rule from DELETED_RECORDS_COUNT to DELETED_RECORDS_PERCENT with a threshold of 10.
  • tools/import_validation/validator.py
    • Implemented validate_deleted_records_percent function to calculate and validate the percentage of deleted records based on differ output and summary data.
  • tools/import_validation/validator_test.py
    • Added TestDeletedRecordsPercentValidation class with comprehensive unit tests for the new percentage validation logic, covering various scenarios including thresholds, empty data, and missing summary information.
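
To make the summary-file step in the changelog more concrete, here is a minimal sketch of building a summary dictionary and writing it as JSON. Everything except the key `current_data_size` is an assumption: the other key names, the helper names `build_diff_summary` and `write_json_local`, and the local-only write (the PR's write_json_data is described as also supporting GCS paths, which this sketch does not attempt).

```python
import json


def build_diff_summary(previous_size: int, current_size: int,
                       deleted: int, added: int, modified: int) -> dict:
    """Collects data sizes and diff sizes into one dictionary.

    Only 'current_data_size' is named in the PR summary; the remaining
    keys are hypothetical placeholders.
    """
    return {
        'previous_data_size': previous_size,
        'current_data_size': current_size,
        'deleted': deleted,
        'added': added,
        'modified': modified,
    }


def write_json_local(data: dict, path: str) -> None:
    """Writes the dictionary as indented JSON to a local file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2)
```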

Activity

  • The pull request was created by vish-cs to introduce a new validation check for the percentage of deleted records during data imports.
  • The changes involve modifying core import and validation tools, adding new utility functions for JSON output, and updating documentation and configuration files.
  • New unit tests have been added to ensure the correctness and robustness of the newly implemented percentage-based deletion validation logic.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new validation check, DELETED_RECORDS_PERCENT, to verify that the percentage of deleted data points is within a specified threshold, with updates to the import_differ tool, ValidationRunner, validator.py, unit tests, and documentation. A critical security vulnerability has been identified: the use of unsanitized input from the repository's manifest file (import_name) in constructing GCS paths and looking up previous versions introduces path traversal and information leakage vulnerabilities. This must be addressed by sanitizing the import_name parameter. Furthermore, there is a critical issue in the implementation of the new percentage validation logic that will lead to incorrect results, which also requires attention.
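
One way the import_name concern raised in this review could be addressed is an allow-list check before the value is used in a path. The sketch below is only an illustration: the helper `is_safe_import_name` is hypothetical, and it assumes an import name is a single path component made of letters, digits, underscores, and hyphens, which may be stricter than what the repository's manifests actually allow.

```python
import re

# Assumption: a valid name is one path component with no separators or dots.
_SAFE_IMPORT_NAME = re.compile(r'[A-Za-z0-9_-]+')


def is_safe_import_name(import_name: str) -> bool:
    """Returns True only for names that cannot alter a path's structure."""
    return _SAFE_IMPORT_NAME.fullmatch(import_name) is not None
```

A caller would reject the import (rather than try to repair the name) when this returns False, so that values such as `../other_import` never reach path construction.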

@vish-cs vish-cs merged commit 6b625a3 into datacommonsorg:master Feb 5, 2026
9 checks passed

2 participants