Migrate multiple scripts to DC V2 clients and tighten tests #1861
base: master
Add test coverage for API call
Add V2 multi-property support in dc_api_get_node_property. Update tests for new V2 response handling.
…ny/process_parent_company.py
Summary of Changes
Hello @rohitkumarbhagat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request undertakes a significant refactoring effort to migrate several Data Commons-related scripts to leverage the more modern and flexible Data Commons V2 client. The changes aim to enhance API interaction capabilities, improve code maintainability through standardized imports, and bolster testing with comprehensive mock-based strategies. This migration ensures that the affected scripts are aligned with the latest API standards and are more resilient to external service changes.
Code Review
This pull request migrates several scripts to use the Data Commons V2 client and API wrappers, which is a significant and positive update. The changes are well-structured, and the addition of new tests using requests-mock and updates to existing tests to cover the new V2 behavior are commendable. The dc_api_wrapper is also enhanced with multi-property fetch capabilities and better API key handling.
I have a couple of suggestions to improve maintainability by reducing code duplication in the new logic for handling V2 API responses in scripts/earthengine/utils.py and scripts/rff/preprocess_raster.py.
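As a rough illustration of the V2 response handling discussed in this review, the sketch below pulls values for several properties out of a single node's response. It assumes the nested `arcs` → property → `nodes` → `value` shape that the migrated code in this PR reads; the helper name `extract_property_values` and the sample data are hypothetical and are not part of `dc_api_wrapper`.

```python
from typing import Any, Dict, List


def extract_property_values(node_data: Dict[str, Any],
                            properties: List[str]) -> Dict[str, List[Any]]:
    """Collects the 'value' of every node listed under each requested property.

    Assumes node_data looks like {'arcs': {prop: {'nodes': [{'value': ...}]}}}.
    Properties that are absent map to an empty list.
    """
    arcs = node_data.get('arcs', {})
    result = {}
    for prop in properties:
        nodes = arcs.get(prop, {}).get('nodes', [])
        result[prop] = [
            # Nodes may be plain dicts or objects exposing a .value attribute.
            n.get('value') if isinstance(n, dict) else getattr(n, 'value', None)
            for n in nodes
        ]
    return result


# Hand-written sample in the assumed shape, not a real API response.
sample = {
    'arcs': {
        'latitude': {'nodes': [{'value': '37.42'}]},
        'longitude': {'nodes': [{'value': '-122.08'}]},
    }
}
values = extract_property_values(sample, ['latitude', 'longitude', 'name'])
```

Returning an empty list for a missing property keeps callers from having to distinguish between "property absent" and "property present with no nodes".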
    if isinstance(lat_value, list):
        lat_value = lat_value[0]
    if isinstance(lng_value, list):
        lng_value = lng_value[0]
    if isinstance(lat_value, str):
        lat_value = lat_value.split(',')[0].strip().strip('"')
    if isinstance(lng_value, str):
        lng_value = lng_value.split(',')[0].strip().strip('"')
    lat = str_get_numeric_value(lat_value)
    lng = str_get_numeric_value(lng_value)
The logic for parsing latitude and longitude values is duplicated. This can be refactored to improve readability and maintainability by avoiding repetition.
    - if isinstance(lat_value, list):
    -     lat_value = lat_value[0]
    - if isinstance(lng_value, list):
    -     lng_value = lng_value[0]
    - if isinstance(lat_value, str):
    -     lat_value = lat_value.split(',')[0].strip().strip('"')
    - if isinstance(lng_value, str):
    -     lng_value = lng_value.split(',')[0].strip().strip('"')
    - lat = str_get_numeric_value(lat_value)
    - lng = str_get_numeric_value(lng_value)
    + coords = []
    + for v in [lat_value, lng_value]:
    +     if isinstance(v, list):
    +         v = v[0]
    +     if isinstance(v, str):
    +         v = v.split(',')[0].strip().strip('"')
    +     coords.append(str_get_numeric_value(v))
    + lat, lng = coords
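The same normalization could also live in a small named helper, which makes it straightforward to unit test on its own. The following is a minimal sketch, not the repository's code: `first_numeric` is a hypothetical name, `to_number` is a simplified stand-in for `str_get_numeric_value` (whose real behavior may differ), and the empty-list guard is an addition that the original snippet does not have.

```python
import re
from typing import Any, Optional


def to_number(text: Any) -> Optional[float]:
    """Hypothetical stand-in for str_get_numeric_value: first number found."""
    if isinstance(text, (int, float)):
        return float(text)
    match = re.search(r'-?\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else None


def first_numeric(value: Any) -> Optional[float]:
    """Normalizes a list or comma-separated string to its first numeric value."""
    if isinstance(value, list):
        if not value:  # Added guard; the original would raise IndexError here.
            return None
        value = value[0]
    if isinstance(value, str):
        # Keep only the first comma-separated item and drop surrounding quotes.
        value = value.split(',')[0].strip().strip('"')
    return to_number(value)


lat, lng = (first_numeric(v) for v in (['"37.42", "37.40"'], '-122.08'))
```

With a helper like this, the call site reduces to one line per coordinate, and edge cases such as empty lists can be covered by direct tests.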
      def get_county_geoid(lat, lon):
    -     counties = dc.get_places_in(['country/USA'], 'County')['country/USA']
    -     counties_simp = dc.get_property_values(counties, 'geoJsonCoordinatesDP1')
    +     config = {'dc_api_use_cache': True}
    +     client = get_datacommons_client(config)
    +     counties_result = client.node.fetch_place_children(
    +         place_dcids=['country/USA'],
    +         children_type='County',
    +         as_dict=True,
    +     )
    +     counties = [
    +         node.get('dcid')
    +         for node in counties_result.get('country/USA', [])
    +         if node.get('dcid')
    +     ]
    +     counties_simp = dc_api_batched_wrapper(
    +         function=client.node.fetch_property_values,
    +         dcids=counties,
    +         args={'properties': 'geoJsonCoordinatesDP1'},
    +         dcid_arg_kw='node_dcids',
    +         config=config,
    +     )
          point = geometry.Point(lon, lat)
    -     for p, gj in counties_simp.items():
    -         if len(gj) == 0:
    -             gj = dc.get_property_values([p], 'geoJsonCoordinates')[p]
    -             if len(gj) == 0:  # property not defined for one county in alaska
    -                 continue
    -         if geometry.shape(json.loads(gj[0])).contains(point):
    -             return p
    +     counties_missing_dp1 = []
    +     for county in counties:
    +         node_data = counties_simp.get(county, {})
    +         nodes = node_data.get('arcs', {}).get('geoJsonCoordinatesDP1',
    +                                               {}).get('nodes', [])
    +         geojson = None
    +         if nodes:
    +             first_node = nodes[0]
    +             geojson = first_node.get('value') if isinstance(
    +                 first_node, dict) else first_node.value
    +         if not geojson:
    +             counties_missing_dp1.append(county)
    +             continue
    +         if geometry.shape(json.loads(geojson)).contains(point):
    +             return county
    +     fallback = {}
    +     if counties_missing_dp1:
    +         fallback = dc_api_batched_wrapper(
    +             function=client.node.fetch_property_values,
    +             dcids=counties_missing_dp1,
    +             args={'properties': 'geoJsonCoordinates'},
    +             dcid_arg_kw='node_dcids',
    +             config=config,
    +         )
    +     for county in counties_missing_dp1:
    +         node_data = fallback.get(county, {})
    +         nodes = node_data.get('arcs', {}).get('geoJsonCoordinates',
    +                                               {}).get('nodes', [])
    +         geojson = None
    +         if nodes:
    +             first_node = nodes[0]
    +             geojson = first_node.get('value') if isinstance(
    +                 first_node, dict) else first_node.value
    +         if not geojson:  # property not defined for one county in alaska
    +             continue
    +         if geometry.shape(json.loads(geojson)).contains(point):
    +             return county
          return None
The logic for finding a county by checking its GeoJSON is repeated for geoJsonCoordinatesDP1 and the fallback geoJsonCoordinates. This code duplication makes the function harder to read and maintain. Consider extracting the repeated logic into a local helper function within get_county_geoid to improve clarity and reduce redundancy. The helper could take the list of counties, the geojson data map, and the property name as arguments, and return the found county.
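One possible shape for such a helper is sketched below. It is an illustration under assumptions rather than the change that was merged: the name `find_county`, the tuple return value (match plus the counties lacking the property, so the caller can build the fallback request), and the injected `contains` predicate are all hypothetical. Passing the geometry test in as a callable keeps the sketch free of third-party imports.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def find_county(
    counties: List[str],
    data: Dict[str, Any],
    prop: str,
    contains: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Returns (first county whose GeoJSON passes `contains`, or None,
    counties seen so far that have no value for `prop`)."""
    missing = []
    for county in counties:
        arcs = data.get(county, {}).get('arcs', {})
        nodes = arcs.get(prop, {}).get('nodes', [])
        geojson = None
        if nodes:
            first = nodes[0]
            # Nodes may be plain dicts or objects exposing a .value attribute.
            geojson = first.get('value') if isinstance(first,
                                                       dict) else first.value
        if not geojson:
            missing.append(county)
            continue
        if contains(geojson):
            return county, missing
    return None, missing


# Toy data and predicate for illustration only; not real GeoJSON.
dp1 = {
    'geoId/01': {'arcs': {'geoJsonCoordinatesDP1': {'nodes': [{'value': 'A'}]}}},
    'geoId/02': {},
}
full = {'geoId/02': {'arcs': {'geoJsonCoordinates': {'nodes': [{'value': 'B'}]}}}}
is_match = lambda gj: gj == 'B'

match, missing = find_county(['geoId/01', 'geoId/02'], dp1,
                             'geoJsonCoordinatesDP1', is_match)
fallback_match, _ = find_county(missing, full, 'geoJsonCoordinates', is_match)
```

Inside `get_county_geoid`, the predicate would presumably wrap the existing check, for example `lambda gj: geometry.shape(json.loads(gj)).contains(point)`, and the second call would run only after fetching `geoJsonCoordinates` for the counties in `missing`.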
dc_api_wrapper capabilities (multi-prop fetch + API key helper) and add coverage.
requests-mock dependency for new test coverage.