Add initial Data Commons Platform database simple import support #469
base: master
dwnoble
commented
Jan 31, 2026
- Added scaffold DataCommonsPlatformDb class and stubs in simple/stats/db.py.
- Add environment variable configuration and factory logic for backend selection.
- Add unit tests for configuration parsing in db_test.py.
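As a rough illustration of the environment-variable configuration described above, the sketch below builds the config dict that the PR's unit test asserts on. The constant values are taken from the strings in that test; the function body, the optional `env` parameter, and the choice to return `None` when the variable is unset are assumptions made for illustration, not the code in this PR.

```python
import os

# Values mirror the strings asserted in the PR's unit test.
FIELD_DB_TYPE = "type"
FIELD_DB_PARAMS = "params"
TYPE_DATACOMMONS_PLATFORM = "datacommons_platform"
DATACOMMONS_PLATFORM_URL = "datacommons_platform_url"
# Hypothetical constant name for the environment variable mentioned in the review.
ENV_DATACOMMONS_PLATFORM_URL = "DATACOMMONS_PLATFORM_URL"


def get_datacommons_platform_config_from_env(env=None):
  """Returns a DB config dict if the platform URL is set, else None.

  The `env` parameter is a hypothetical addition that lets callers pass a
  plain dict instead of patching os.environ.
  """
  env = os.environ if env is None else env
  dcp_url = env.get(ENV_DATACOMMONS_PLATFORM_URL)
  if not dcp_url:
    # Assumption: an unset or empty variable means "backend not configured".
    return None
  return {
      FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
      FIELD_DB_PARAMS: {
          DATACOMMONS_PLATFORM_URL: dcp_url,
      },
  }
```

A factory could then check this function's result first and fall back to the other backends when it returns `None`; how the actual factory orders its checks is not shown in this conversation.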
Summary of Changes
Hello @dwnoble, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request lays the groundwork for integrating with the Data Commons Platform by adding a new database backend. It establishes the necessary class structure, introduces environment-based configuration, and updates the system's database selection mechanism to support this new platform. The changes enable future development for data import and interaction with the Data Commons Platform.
Code Review
This pull request introduces initial support for a new database backend, the Data Commons Platform, including a new DB type, configuration, a stub Db interface implementation, and factory logic, along with unit tests. A security review identified a potential information disclosure vulnerability where the platform URL, which may contain sensitive credentials, is logged in its entirety, potentially exposing secrets in application logs. Additionally, there are areas for improvement regarding correctness and maintainability, such as an incorrect method signature, an unnecessary method, and hardcoded strings in a test.
    logging.info("TODO: Writing %s observations to [%s]", len(observations_df),
                 self.url)
The DataCommonsPlatformDb class logs the full URL configured via the DATACOMMONS_PLATFORM_URL environment variable. This can lead to information disclosure as URLs may contain sensitive credentials. It is recommended to sanitize the URL or log only the host portion. Additionally, please note that the signature of insert_observations (lines 386-390) is inconsistent with the base class Db and other implementations like SqlDb, expecting list[Observation] instead of pd.DataFrame. This will cause a runtime error.
Suggested change:
    logging.info("TODO: Writing %s observations to Data Commons Platform", len(observations_df))
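If keeping some destination detail in the log is useful, the other option named in the review (logging only the host portion) could be sketched as below. The helper name `host_for_logging` and the example URL are hypothetical; `urlsplit` is from the Python standard library.

```python
import logging
from urllib.parse import urlsplit


def host_for_logging(url: str) -> str:
  """Returns only the hostname of `url`, dropping userinfo, path, and query.

  Hypothetical helper, not part of this PR.
  """
  host = urlsplit(url).hostname
  # hostname is None when the string has no network location.
  return host or "<unknown host>"


# Example with a made-up URL that embeds credentials and an API key.
logging.info("TODO: Writing observations to [%s]",
             host_for_logging("https://user:secret@dcp.example.com/api?key=abc"))
```

Only the hostname reaches the log, so credentials in the userinfo, path, or query string are never written out.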
    def insert_triples(self, triples: list[Triple]):
      # TODO: Implement triple insertion into Data Commons Platform.
      logging.info("TODO: Writing %s triples to [%s]", len(triples), self.url)
The DataCommonsPlatformDb class logs the full URL configured via the DATACOMMONS_PLATFORM_URL environment variable. URLs for platforms often contain sensitive information such as API keys, tokens, or credentials in the query parameters or as part of the path. Logging the full URL without sanitization can lead to the exposure of these secrets in the application logs.
Recommendation: Sanitize the URL before logging it, or log only the host portion of the URL.
Suggested change:
    logging.info("TODO: Writing %s triples to Data Commons Platform", len(triples))

    def commit(self):
      # Not applicable for Data Commons Platform.
      pass
    self.assertEqual(get_datacommons_platform_config_from_env(), {
        "type": "datacommons_platform",
        "params": {
            "datacommons_platform_url": "https://test_url"
        }
    })
For better maintainability and to avoid magic strings, it's recommended to use the constants defined in stats.db for the dictionary keys and values in this assertion.
You'll need to add the following imports at the top of the file:
    from stats.db import (
        DATACOMMONS_PLATFORM_URL,
        FIELD_DB_PARAMS,
        FIELD_DB_TYPE,
        TYPE_DATACOMMONS_PLATFORM,
    )

Suggested change:
    self.assertEqual(get_datacommons_platform_config_from_env(), {
        FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
        FIELD_DB_PARAMS: {
            DATACOMMONS_PLATFORM_URL: "https://test_url"
        }
    })
clincoln8
left a comment
🎉
    return {
        FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
        FIELD_DB_PARAMS: {
            DATACOMMONS_PLATFORM_URL: dcp_url,
Long-term discussion point: we should align on how to refer to "instances" of the platform and their API URLs. I assume this URL would point to the DCP API server; do we just want to call that the "dcp url", or use a more specific name so that it doesn't get confused with a future "admin website url" or a potentially user-facing website URL?
Do we want to refer to it as a "platform url" or more of an "instance url"?
I think this matters more once we start writing external documentation and settle on the wording used when talking to DCP owners, but it'd be nice to align early so the code can be consistent. Noted in b/481115789