@rohitkumarbhagat

  • Update enhanced TMCF, earthengine utils, RFF raster, and EPA facilities scripts to use V2 client/wrappers and repo-root imports.
  • Add/adjust tests to cover new V2 behavior and mocked API responses where appropriate.
  • Expand dc_api_wrapper capabilities (multi-prop fetch + API key helper) and add coverage.
  • Add requests-mock dependency for new test coverage.

@gemini-code-assist

Summary of Changes

Hello @rohitkumarbhagat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring effort to migrate several Data Commons-related scripts to leverage the more modern and flexible Data Commons V2 client. The changes aim to enhance API interaction capabilities, improve code maintainability through standardized imports, and bolster testing with comprehensive mock-based strategies. This migration ensures that the affected scripts are aligned with the latest API standards and are more resilient to external service changes.

Highlights

  • DC V2 Client Migration: Several scripts, including those for Earth Engine utilities, RFF raster processing, enhanced TMCF, and EPA facilities, have been updated to utilize the Data Commons V2 client and wrappers for API interactions, moving away from the older datacommons library.
  • Enhanced dc_api_wrapper: The util/dc_api_wrapper.py module has been improved to support fetching multiple properties for a node using dc_api_get_node_property and now includes a dedicated helper function, get_dc_api_key, for retrieving the Data Commons API key.
  • Improved Test Coverage and Dependencies: New test cases have been added or adjusted across various modules to validate the new V2 client behavior and API key handling. The requests-mock library has been introduced as a new dependency to facilitate robust mocking of API calls during testing.
  • Standardized Imports: Scripts now consistently use repo-root imports, improving module resolution and code maintainability.
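The repo-root import pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `find_repo_root` and `add_repo_root_to_path` are hypothetical, and using `requirements_all.txt` as the marker file assumes that file sits at the repository root.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "requirements_all.txt") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Both the function name and the marker file are assumptions made for
    this sketch; the PR itself only says scripts use a REPO_ROOT.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


def add_repo_root_to_path(start: Path,
                          marker: str = "requirements_all.txt") -> Path:
    """Put the repo root on sys.path so `util.*` and `scripts.*` resolve."""
    root = find_repo_root(start, marker)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

A script would typically call `add_repo_root_to_path(Path(__file__))` before importing from `util.dc_api_wrapper`, so the import works regardless of the current working directory.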

Changelog

  • requirements_all.txt
    • Added requests-mock dependency to support new testing methodologies.
  • scripts/earthengine/utils.py
    • Migrated place_id_to_lat_lng function to use dc_api_get_node_property (V2 client).
    • Updated _DC_API_ROOT to http://api.datacommons.org.
    • Removed direct datacommons import and added Path import.
  • scripts/earthengine/utils_test.py
    • Added PlaceUtilsTest class with test_place_id_to_lat_lng_dc_api to test the V2 API integration using mock.patch.
    • Added Path and mock imports.
  • scripts/rff/preprocess_raster.py
    • Migrated get_county_geoid to use get_datacommons_client and dc_api_batched_wrapper (V2 client).
    • Updated import paths to use REPO_ROOT for util.dc_api_wrapper and scripts.rff.util.
    • Removed direct datacommons import and added Path import.
  • scripts/rff/preprocess_raster_test.py
    • Added a new test file to cover preprocess_raster.py functions, including mocked V2 API calls for get_county_geoid.
    • Introduced FakeNodeEndpoint and FakeClient for mocking API responses.
  • scripts/us_census/enhanced_tmcf/process_etmcf.py
    • Migrated _get_places_not_found to use dc_api_get_node_property (V2 client).
    • Updated import paths to use REPO_ROOT for util.dc_api_wrapper.
    • Removed direct datacommons import and added sys and Path imports.
  • scripts/us_census/enhanced_tmcf/process_etmcf_test.py
    • Updated imports to use REPO_ROOT and explicit imports from scripts.us_census.enhanced_tmcf.process_etmcf.
    • Added mock import and test_get_places_not_found_uses_v2_wrapper.
    • Updated existing tests to use mock.patch.object for dc_api_get_node_property.
  • scripts/us_epa/parent_company/download_existing_facilities.py
    • Rewrote facility download logic to use the V2 SPARQL endpoint with requests and get_dc_api_key.
    • Removed datacommons and pathlib imports, added sys, Path, and requests.
    • Refactored main into download_existing_facilities and a new main function.
  • scripts/us_epa/parent_company/download_existing_facilities_test.py
    • Added a new test file for download_existing_facilities.py using requests_mock to simulate V2 SPARQL API responses.
    • Updated imports to use REPO_ROOT.
  • scripts/us_epa/parent_company/process_parent_company.py
    • Removed unused datacommons and json imports.
  • scripts/us_epa/parent_company/process_parent_company_test.py
    • Updated imports to use REPO_ROOT and explicit imports from scripts.us_epa.parent_company.process_parent_company.
  • tools/statvar_importer/mcf_filter.py
    • Removed unused datacommons import.
  • tools/statvar_importer/place/place_resolver.py
    • Removed unused datacommons import.
  • util/dc_api_wrapper.py
    • Modified dc_api_get_node_property to accept a list of properties for V2 API calls, enabling multi-property fetching.
    • Refactored get_datacommons_client to extract get_dc_api_key as a separate function.
    • Added Union type hint.
  • util/dc_api_wrapper_test.py
    • Added test_dc_api_get_node_property_multi_v2 to verify the new multi-property fetching capability.
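The two util/dc_api_wrapper.py changes listed above (accepting a list of properties, and a separate API key helper) might look something like the sketch below. It is an illustration under stated assumptions, not the wrapper's actual code: `normalize_properties` and `lookup_api_key` are hypothetical names, and the `dc_api_key` config key and `DC_API_KEY` environment variable are assumed for the example.

```python
import os
from typing import List, Mapping, Optional, Union


def normalize_properties(prop: Union[str, List[str]]) -> List[str]:
    """Accept a single property name or a list and always return a list.

    Hypothetical helper: shows how a Union[str, list] argument can be
    normalized before a multi-property request is built.
    """
    if isinstance(prop, str):
        return [prop]
    return list(prop)


def lookup_api_key(config: Optional[Mapping[str, str]] = None,
                   env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return an API key from config first, then from the environment.

    The key names 'dc_api_key' and 'DC_API_KEY' are assumptions.
    """
    config = config or {}
    env = os.environ if env is None else env
    return config.get("dc_api_key") or env.get("DC_API_KEY")
```

Passing `env` explicitly keeps the lookup easy to test without touching the real process environment.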

Activity

  • The author, rohitkumarbhagat, initiated this pull request to migrate several scripts to use the Data Commons V2 client and improve testing.
  • The pull request includes updates to core utility functions, script logic, and corresponding test files.
  • New dependencies have been introduced to facilitate robust mocking of API calls during testing.
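To illustrate the mock-based testing approach, here is a rough sketch of a fake client in the spirit of the FakeNodeEndpoint and FakeClient classes named in the changelog. The class bodies and the `list_child_dcids` function are assumptions written for this example; only the `fetch_place_children` keyword arguments mirror what the diff in this PR shows.

```python
class FakeNodeEndpoint:
    """Stands in for `client.node`; returns canned data, makes no API call."""

    def __init__(self, children):
        self._children = children
        self.calls = []  # records each call so a test can assert on it

    def fetch_place_children(self, place_dcids, children_type, as_dict=True):
        self.calls.append((tuple(place_dcids), children_type))
        return {dcid: self._children.get(dcid, []) for dcid in place_dcids}


class FakeClient:
    """Minimal client exposing only the `node` endpoint used here."""

    def __init__(self, children):
        self.node = FakeNodeEndpoint(children)


def list_child_dcids(client, parent, child_type):
    """Hypothetical code under test: collect child DCIDs, skipping blanks."""
    result = client.node.fetch_place_children(
        place_dcids=[parent], children_type=child_type, as_dict=True)
    return [n["dcid"] for n in result.get(parent, []) if n.get("dcid")]
```

A test can then build `FakeClient({"country/USA": [{"dcid": "geoId/06001"}]})`, pass it to the function, and check both the returned list and the recorded calls.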

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request migrates several scripts to use the Data Commons V2 client and API wrappers, which is a significant and positive update. The changes are well-structured, and the addition of new tests using requests-mock and updates to existing tests to cover the new V2 behavior are commendable. The dc_api_wrapper is also enhanced with multi-property fetch capabilities and better API key handling.

I have a couple of suggestions to improve maintainability by reducing code duplication in the new logic for handling V2 API responses in scripts/earthengine/utils.py and scripts/rff/preprocess_raster.py.

Comment on lines +385 to +394
if isinstance(lat_value, list):
    lat_value = lat_value[0]
if isinstance(lng_value, list):
    lng_value = lng_value[0]
if isinstance(lat_value, str):
    lat_value = lat_value.split(',')[0].strip().strip('"')
if isinstance(lng_value, str):
    lng_value = lng_value.split(',')[0].strip().strip('"')
lat = str_get_numeric_value(lat_value)
lng = str_get_numeric_value(lng_value)

medium

The logic for parsing latitude and longitude values is duplicated. This can be refactored to improve readability and maintainability by avoiding repetition.

Suggested change
-if isinstance(lat_value, list):
-    lat_value = lat_value[0]
-if isinstance(lng_value, list):
-    lng_value = lng_value[0]
-if isinstance(lat_value, str):
-    lat_value = lat_value.split(',')[0].strip().strip('"')
-if isinstance(lng_value, str):
-    lng_value = lng_value.split(',')[0].strip().strip('"')
-lat = str_get_numeric_value(lat_value)
-lng = str_get_numeric_value(lng_value)
+coords = []
+for v in [lat_value, lng_value]:
+    if isinstance(v, list):
+        v = v[0]
+    if isinstance(v, str):
+        v = v.split(',')[0].strip().strip('"')
+    coords.append(str_get_numeric_value(v))
+lat, lng = coords
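For reference, the same normalization can also be written as a small standalone helper. The sketch below is not part of the PR: `first_scalar`, `to_float`, and `parse_lat_lng` are hypothetical names, `to_float` is only a stand-in for `str_get_numeric_value` (whose real behavior may differ), and the empty-list guard is an addition that the suggested change above does not include.

```python
def first_scalar(value):
    """Reduce a list or a comma-separated, quoted string to its first item."""
    if isinstance(value, list):
        # Guard against an empty list (not in the suggested change).
        value = value[0] if value else None
    if isinstance(value, str):
        value = value.split(',')[0].strip().strip('"')
    return value


def to_float(value):
    """Hypothetical stand-in for str_get_numeric_value; None if not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def parse_lat_lng(lat_value, lng_value):
    """Apply the same normalization to both coordinates."""
    lat, lng = (to_float(first_scalar(v)) for v in (lat_value, lng_value))
    return lat, lng
```

Keeping the per-value logic in one function means a new response shape only needs to be handled in one place.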

Comment on lines 52 to 110
 def get_county_geoid(lat, lon):
-    counties = dc.get_places_in(['country/USA'], 'County')['country/USA']
-    counties_simp = dc.get_property_values(counties, 'geoJsonCoordinatesDP1')
+    config = {'dc_api_use_cache': True}
+    client = get_datacommons_client(config)
+    counties_result = client.node.fetch_place_children(
+        place_dcids=['country/USA'],
+        children_type='County',
+        as_dict=True,
+    )
+    counties = [
+        node.get('dcid')
+        for node in counties_result.get('country/USA', [])
+        if node.get('dcid')
+    ]
+    counties_simp = dc_api_batched_wrapper(
+        function=client.node.fetch_property_values,
+        dcids=counties,
+        args={'properties': 'geoJsonCoordinatesDP1'},
+        dcid_arg_kw='node_dcids',
+        config=config,
+    )
     point = geometry.Point(lon, lat)
-    for p, gj in counties_simp.items():
-        if len(gj) == 0:
-            gj = dc.get_property_values([p], 'geoJsonCoordinates')[p]
-            if len(gj) == 0:  # property not defined for one county in alaska
-                continue
-        if geometry.shape(json.loads(gj[0])).contains(point):
-            return p
+    counties_missing_dp1 = []
+    for county in counties:
+        node_data = counties_simp.get(county, {})
+        nodes = node_data.get('arcs', {}).get('geoJsonCoordinatesDP1',
+                                              {}).get('nodes', [])
+        geojson = None
+        if nodes:
+            first_node = nodes[0]
+            geojson = first_node.get('value') if isinstance(
+                first_node, dict) else first_node.value
+        if not geojson:
+            counties_missing_dp1.append(county)
+            continue
+        if geometry.shape(json.loads(geojson)).contains(point):
+            return county
+    fallback = {}
+    if counties_missing_dp1:
+        fallback = dc_api_batched_wrapper(
+            function=client.node.fetch_property_values,
+            dcids=counties_missing_dp1,
+            args={'properties': 'geoJsonCoordinates'},
+            dcid_arg_kw='node_dcids',
+            config=config,
+        )
+    for county in counties_missing_dp1:
+        node_data = fallback.get(county, {})
+        nodes = node_data.get('arcs', {}).get('geoJsonCoordinates',
+                                              {}).get('nodes', [])
+        geojson = None
+        if nodes:
+            first_node = nodes[0]
+            geojson = first_node.get('value') if isinstance(
+                first_node, dict) else first_node.value
+        if not geojson:  # property not defined for one county in alaska
+            continue
+        if geometry.shape(json.loads(geojson)).contains(point):
+            return county
     return None

medium

The logic for finding a county by checking its GeoJSON is repeated for geoJsonCoordinatesDP1 and the fallback geoJsonCoordinates. This code duplication makes the function harder to read and maintain. Consider extracting the repeated logic into a local helper function within get_county_geoid to improve clarity and reduce redundancy. The helper could take the list of counties, the geojson data map, and the property name as arguments, and return the found county.
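One possible shape for such a helper is sketched below. It is only an illustration of the suggestion, not code from the PR: the names `first_value` and `find_containing_county` are hypothetical, and the containment check is passed in as a callable so the sketch does not depend on shapely. In the real function that callable would presumably wrap `geometry.shape(...).contains(point)`.

```python
import json
from typing import Any, Callable, Iterable, List, Mapping, Optional, Tuple


def first_value(node_data: Any, prop: str) -> Optional[str]:
    """Pull the first value for `prop` out of a V2-style node response."""
    nodes = (node_data or {}).get('arcs', {}).get(prop, {}).get('nodes', [])
    if not nodes:
        return None
    first = nodes[0]
    # Responses may hold plain dicts or objects with a `.value` attribute.
    return first.get('value') if isinstance(first, dict) else first.value


def find_containing_county(
    counties: Iterable[str],
    data: Mapping[str, Any],
    prop: str,
    contains: Callable[[dict], bool],
) -> Tuple[Optional[str], List[str]]:
    """Return (matching county or None, counties that had no `prop` value)."""
    missing: List[str] = []
    for county in counties:
        geojson = first_value(data.get(county), prop)
        if not geojson:
            missing.append(county)
            continue
        if contains(json.loads(geojson)):
            return county, missing
    return None, missing
```

With this shape, get_county_geoid could call the helper once for geoJsonCoordinatesDP1, return early on a match, and otherwise call it again for geoJsonCoordinates using only the counties reported as missing.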
