Add percent deletion validation check #1864
Summary of Changes
This pull request significantly enhances the data import validation system by introducing a percentage-based check for deleted records. The primary goal is to prevent accidental or erroneous large-scale data deletions by setting a configurable threshold. This involves updating the data differencing tool to produce more comprehensive summary data in JSON format, and modifying the validation runner to consume this new output, thereby making the import process more robust and secure against data loss.
Code Review
This pull request introduces a new validation check, DELETED_RECORDS_PERCENT, to verify that the percentage of deleted data points is within a specified threshold, with updates to the import_differ tool, ValidationRunner, validator.py, unit tests, and documentation. A critical security vulnerability has been identified: the use of unsanitized input from the repository's manifest file (import_name) in constructing GCS paths and looking up previous versions introduces path traversal and information leakage vulnerabilities. This must be addressed by sanitizing the import_name parameter. Furthermore, there is a critical issue in the implementation of the new percentage validation logic that will lead to incorrect results, which also requires attention.
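One way to address the path traversal concern is to validate import_name against an allow-list before it is used in any GCS path. The sketch below is only an illustration: the helper names `sanitize_import_name` and `build_gcs_prefix`, the path layout, and the permitted character set are all assumptions, not code from this pull request, and the real naming rules for imports may need a wider character set.

```python
import re

# Hypothetical allow-list: letters, digits, underscore and hyphen only.
# Anything else (slashes, dots, colons, whitespace) is rejected outright.
_SAFE_IMPORT_NAME = re.compile(r"[A-Za-z0-9_-]+")


def sanitize_import_name(import_name: str) -> str:
    """Return import_name unchanged if it is safe, otherwise raise ValueError."""
    if not _SAFE_IMPORT_NAME.fullmatch(import_name):
        raise ValueError(f"Unsafe import name: {import_name!r}")
    return import_name


def build_gcs_prefix(bucket: str, import_name: str) -> str:
    """Build a GCS prefix (hypothetical layout) from a validated import name."""
    return f"gs://{bucket}/{sanitize_import_name(import_name)}/"
```

Rejecting unexpected input, rather than trying to strip `..` segments out of it, keeps the check simple and avoids bypasses through nested or encoded sequences.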
Adds a new validation check, DELETED_RECORDS_PERCENT, which uses the differ output to verify that the percentage of deleted observation nodes (relative to the count of observations in the previous data) is less than a specified threshold. To support this, the differ writes its stats to a file, differ_summary.json.
The check is enabled by default for all imports with a threshold value of 10, and the DELETED_RECORDS_COUNT check is disabled.
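As a rough illustration of the check described above, the percentage can be computed from two counts in the differ summary. The key names `deleted` and `previous_total` are hypothetical; the actual schema of differ_summary.json is not shown here, and the handling of an empty previous dataset is an assumption.

```python
def deleted_percent(summary: dict) -> float:
    """Percentage of previous observations that were deleted.

    Assumes hypothetical keys 'deleted' and 'previous_total' in the summary.
    """
    deleted = summary["deleted"]
    previous_total = summary["previous_total"]
    if previous_total == 0:
        # Assumption: with no previous data there is nothing to delete.
        return 0.0
    return 100.0 * deleted / previous_total


def check_deleted_percent(summary: dict, threshold: float = 10.0) -> bool:
    """Pass only if the deleted percentage is strictly below the threshold."""
    return deleted_percent(summary) < threshold
```

For example, a summary of `{"deleted": 5, "previous_total": 100}` gives 5.0 percent and passes with the default threshold, while 10 deletions out of 100 would fail because the comparison is strict.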