@dwnoble dwnoble commented Jan 31, 2026

  • Add a scaffold DataCommonsPlatformDb class and stubs in simple/stats/db.py.
  • Add environment variable configuration and factory logic for backend selection.
  • Add unit tests for configuration parsing in db_test.py.

@dwnoble dwnoble requested a review from clincoln8 January 31, 2026 00:45
@gemini-code-assist

Summary of Changes

Hello @dwnoble, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for integrating with the Data Commons Platform by adding a new database backend. It establishes the necessary class structure, introduces environment-based configuration, and updates the system's database selection mechanism to support this new platform. The changes enable future development for data import and interaction with the Data Commons Platform.

Highlights

  • New Database Backend: Introduced a new DataCommonsPlatformDb class to provide initial support for interacting with the Data Commons Platform, including stub methods for data insertion and selection.
  • Environment Variable Configuration: Added new environment variables (USE_DATACOMMONS_PLATFORM, DATACOMMONS_PLATFORM_URL) and corresponding logic to configure the Data Commons Platform backend.
  • Database Factory Integration: Updated the database creation factory (create_and_update_db) and configuration retrieval logic (_get_db_config) to recognize and prioritize the new Data Commons Platform backend.
  • Unit Tests: Included a new unit test to validate the correct parsing of Data Commons Platform configuration from environment variables.
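
The environment-variable configuration described above might look roughly like the following. This is a minimal sketch, not the PR's implementation: the function name and the dictionary keys and values are taken from the assertion in db_test.py quoted later in this conversation, while the rule for enabling the flag (the string "true") and the error raised on a missing URL are assumptions.

```python
import os
from typing import Optional

# Environment variable names as listed in the PR summary.
ENV_USE_DATACOMMONS_PLATFORM = "USE_DATACOMMONS_PLATFORM"
ENV_DATACOMMONS_PLATFORM_URL = "DATACOMMONS_PLATFORM_URL"

# Constant names appear in the review; the string values are inferred from
# the literal dictionary in the db_test.py assertion.
FIELD_DB_TYPE = "type"
FIELD_DB_PARAMS = "params"
TYPE_DATACOMMONS_PLATFORM = "datacommons_platform"
DATACOMMONS_PLATFORM_URL = "datacommons_platform_url"


def get_datacommons_platform_config_from_env() -> Optional[dict]:
    """Returns a platform DB config, or None if the backend is not enabled."""
    # Assumption: the backend is enabled by the case-insensitive value "true".
    if os.getenv(ENV_USE_DATACOMMONS_PLATFORM, "").lower() != "true":
        return None
    dcp_url = os.getenv(ENV_DATACOMMONS_PLATFORM_URL)
    # Assumption: enabling the backend without a URL is a configuration error.
    if not dcp_url:
        raise ValueError(f"{ENV_DATACOMMONS_PLATFORM_URL} is not set")
    return {
        FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
        FIELD_DB_PARAMS: {
            DATACOMMONS_PLATFORM_URL: dcp_url,
        },
    }
```

A factory that prioritizes this backend could call the function first and fall back to its other backends only when it returns None.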

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces initial support for a new database backend, the Data Commons Platform, including a new DB type, configuration, a stub Db interface implementation, and factory logic, along with unit tests. A security review identified a potential information disclosure vulnerability where the platform URL, which may contain sensitive credentials, is logged in its entirety, potentially exposing secrets in application logs. Additionally, there are areas for improvement regarding correctness and maintainability, such as an incorrect method signature, an unnecessary method, and hardcoded strings in a test.

Comment on lines +389 to +390
    logging.info("TODO: Writing %s observations to [%s]", len(observations_df),
                 self.url)

security-medium medium

The DataCommonsPlatformDb class logs the full URL configured via the DATACOMMONS_PLATFORM_URL environment variable. This can lead to information disclosure as URLs may contain sensitive credentials. It is recommended to sanitize the URL or log only the host portion. Additionally, please note that the signature of insert_observations (lines 386-390) is inconsistent with the base class Db and other implementations like SqlDb, expecting list[Observation] instead of pd.DataFrame. This will cause a runtime error.

    logging.info("TODO: Writing %s observations to Data Commons Platform", len(observations_df))
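
A signature mismatch of the kind mentioned above can be caught before runtime by comparing each override against the base class. The sketch below is illustrative only: the two classes are hypothetical stand-ins (the real Db and DataCommonsPlatformDb signatures are not shown in this conversation), and signature_mismatches is a helper invented for this example.

```python
import inspect


class Db:
    """Hypothetical stand-in for the base class; parameters are illustrative."""

    def insert_triples(self, triples):
        pass

    def insert_observations(self, observations, input_file_name):
        pass


class DataCommonsPlatformDb(Db):
    """Hypothetical override whose parameters drift from the base class."""

    def insert_triples(self, triples):
        pass

    def insert_observations(self, observations_df):
        pass


def signature_mismatches(base: type, impl: type) -> list[str]:
    """Lists public methods of `base` whose parameter names differ in `impl`."""
    mismatches = []
    for name, base_fn in inspect.getmembers(base, inspect.isfunction):
        if name.startswith("_"):
            continue
        base_params = list(inspect.signature(base_fn).parameters)
        impl_params = list(inspect.signature(getattr(impl, name)).parameters)
        if base_params != impl_params:
            mismatches.append(name)
    return mismatches


print(signature_mismatches(Db, DataCommonsPlatformDb))
```

The check compares parameter names only; comparing annotations as well would also flag a method that keeps the name but changes the expected type.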


    def insert_triples(self, triples: list[Triple]):
        # TODO: Implement triple insertion into Data Commons Platform.
        logging.info("TODO: Writing %s triples to [%s]", len(triples), self.url)

security-medium medium

The DataCommonsPlatformDb class logs the full URL configured via the DATACOMMONS_PLATFORM_URL environment variable. URLs for platforms often contain sensitive information such as API keys, tokens, or credentials in the query parameters or as part of the path. Logging the full URL without sanitization can lead to the exposure of these secrets in the application logs.

Recommendation: Sanitize the URL before logging it, or log only the host portion of the URL.

    logging.info("TODO: Writing %s triples to Data Commons Platform", len(triples))
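
If the destination should still be visible in logs, the alternative recommended above (logging only the host portion) can be sketched with the standard library. The helper name host_for_logging and the example URL are hypothetical, not part of the PR.

```python
from urllib.parse import urlsplit


def host_for_logging(url: str) -> str:
    """Returns only the host (and port, if any) of `url` for use in log lines.

    Userinfo, path, query string, and fragment are dropped, since any of
    them may carry credentials or tokens.
    """
    parts = urlsplit(url)
    host = parts.hostname or "<unknown host>"
    try:
        port = parts.port
    except ValueError:
        # urlsplit raises ValueError when the port is not a valid number.
        port = None
    return f"{host}:{port}" if port else host


# Hypothetical URL with credentials in both the userinfo and the query string.
print(host_for_logging("https://user:secret@dcp.example.com:8443/api?key=abc"))
```

Inside the class, the log call would then pass host_for_logging(self.url) instead of self.url.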

Comment on lines +401 to +403
    def commit(self):
        # Not applicable for Data Commons Platform.
        pass

medium

The commit method is not part of the Db abstract base class interface. To maintain a clean and consistent API, it's better to remove methods that are not part of the public interface defined by the base class, especially if they are just no-ops.
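
Methods that fall outside the base interface, as described above, can be listed mechanically. This is a small sketch with hypothetical stand-in classes; extra_public_methods is a helper invented for this example, and the real Db interface may declare different methods.

```python
class Db:
    """Hypothetical stand-in for the base interface."""

    def insert_triples(self, triples):
        pass


class DataCommonsPlatformDb(Db):
    """Hypothetical implementation with one method the base does not declare."""

    def insert_triples(self, triples):
        pass

    def commit(self):
        pass


def extra_public_methods(base: type, impl: type) -> list[str]:
    """Lists public callables defined on `impl` that `base` does not declare."""
    base_names = set(dir(base))
    return sorted(
        name for name, value in vars(impl).items()
        if callable(value) and not name.startswith("_")
        and name not in base_names
    )


print(extra_public_methods(Db, DataCommonsPlatformDb))
```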

Comment on lines +310 to +315
    self.assertEqual(get_datacommons_platform_config_from_env(), {
        "type": "datacommons_platform",
        "params": {
            "datacommons_platform_url": "https://test_url"
        }
    })

medium

For better maintainability and to avoid magic strings, it's recommended to use the constants defined in stats.db for the dictionary keys and values in this assertion.

You'll need to add the following imports at the top of the file:

from stats.db import (
    DATACOMMONS_PLATFORM_URL,
    FIELD_DB_PARAMS,
    FIELD_DB_TYPE,
    TYPE_DATACOMMONS_PLATFORM,
)

Suggested change

-   self.assertEqual(get_datacommons_platform_config_from_env(), {
-       "type": "datacommons_platform",
-       "params": {
-           "datacommons_platform_url": "https://test_url"
-       }
-   })
+   self.assertEqual(get_datacommons_platform_config_from_env(), {
+       FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
+       FIELD_DB_PARAMS: {
+           DATACOMMONS_PLATFORM_URL: "https://test_url"
+       }
+   })

@clincoln8 clincoln8 left a comment

🎉

    return {
        FIELD_DB_TYPE: TYPE_DATACOMMONS_PLATFORM,
        FIELD_DB_PARAMS: {
            DATACOMMONS_PLATFORM_URL: dcp_url,

@clincoln8 clincoln8 Feb 3, 2026

Long-term discussion point: we should align on how to refer to "instances" of the platform and their API URLs. I assume this URL points to the DCP API server; do we just want to call that the "DCP URL", or use a more specific name so that it doesn't get confused with a future "admin website URL" or a potential user-facing website URL?
Do we want to refer to it as a "platform URL" or more of an "instance URL"?
I think this matters more once we start writing external documentation and settling on the wording used when talking to DCP owners, but it'd be nice to align early so the code can be consistent. Noted in b/481115789.
