feat: BigQuery loading and GCS integration for Python pipeline #3
Conversation
We also need to support running the pipeline locally as well as within GCP. Locally we need to authenticate against GCP somehow; when running inside a container on GCP, tools like gcloud are presumably already available. This needs to be considered when writing the client code for Google Cloud Storage and BigQuery.
- Add gcp/bigquery.py with load_table() and load_pipeline_tables() matching the R pipeline's ingest_data(), with clustering fields
- Add gcp/storage.py with download_tracker_files() and upload_output(), replacing the R pipeline's gsutil CLI calls
- Add CLI commands: upload-tables, download-trackers, upload-output
- Add 18 unit tests for GCP modules (all mocked)
- Update .env.example with GCP auth documentation

Co-authored-by: pmayd <9614291+pmayd@users.noreply.github.com>
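A `load_table()` with clustering fields, as described in the commit message, could look roughly like the sketch below. The table-to-clustering-fields mapping and the column names are illustrative assumptions; the real configuration lives in `src/a4d/gcp/bigquery.py`:

```python
"""Sketch of a clustered BigQuery load; names are illustrative."""
from pathlib import Path

# Illustrative clustering config (assumption): the actual field names
# come from the repo, not from this sketch.
CLUSTERING_FIELDS: dict[str, list[str]] = {
    "patient_data_monthly": ["country_code", "clinic_code"],
    "patient_data_annual": ["country_code", "clinic_code"],
    "patient_data_static": ["country_code", "clinic_code"],
    "logs": ["table_name"],
}


def load_table(client, dataset: str, table_name: str, parquet_path: Path):
    from google.cloud import bigquery  # needs google-cloud-bigquery

    # LoadJobConfig accepts clustering_fields directly; BigQuery then
    # clusters the destination table on those columns.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        clustering_fields=CLUSTERING_FIELDS.get(table_name),
    )
    with open(parquet_path, "rb") as fh:
        job = client.load_table_from_file(
            fh, f"{dataset}.{table_name}", job_config=job_config
        )
    return job.result()  # blocks until the load job finishes
```

Keeping the clustering config in a plain dict makes it trivial to unit-test without touching BigQuery, which matches the PR's all-mocked test approach.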
- Add __all__ to gcp/__init__.py for an explicit public API
- Add a comment explaining the table_logs.parquet naming convention
- Extract a _get_job_config() helper in tests to reduce duplication

Co-authored-by: pmayd <9614291+pmayd@users.noreply.github.com>
I allowed astral.sh for agents, you can try again @copilot
Pull request overview
This PR implements GCP integration for the Python pipeline migration, enabling BigQuery table loading and Google Cloud Storage operations that were previously handled by R scripts using CLI tools (bq and gsutil).
Changes:
- Added `a4d.gcp.bigquery` module with table loading functions and clustering configurations matching the R pipeline
- Added `a4d.gcp.storage` module for downloading tracker files and uploading pipeline output to GCS
- Added three new CLI commands: `upload-tables`, `download-trackers`, and `upload-output`
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.
Summary per file

| File | Description |
|---|---|
| src/a4d/gcp/bigquery.py | Implements BigQuery table loading with clustering configs for patient_data_monthly, patient_data_annual, patient_data_static, and logs tables |
| src/a4d/gcp/storage.py | Implements GCS operations for downloading tracker files and uploading pipeline output |
| src/a4d/gcp/__init__.py | Exports the public API for the GCP module |
| src/a4d/cli.py | Adds three new commands for BigQuery upload and GCS interactions |
| .env.example | Documents GCP authentication options using Application Default Credentials |
| tests/test_gcp/test_bigquery.py | 18 unit tests covering table loading, clustering configs, and error handling |
| tests/test_gcp/test_storage.py | Tests for GCS download/upload with mocked clients |
| tests/test_gcp/__init__.py | Empty init file for the test package |
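The GCS side of the review, replacing the R pipeline's `gsutil` calls, could be sketched as below. The path-mapping helper is kept pure so it can be unit-tested without a network, in line with the mocked tests listed above; the function and bucket-layout names are illustrative assumptions, not the repo's actual code:

```python
"""Sketch of GCS helpers replacing `gsutil cp`; names are illustrative."""
from pathlib import Path


def local_path_for_blob(blob_name: str, prefix: str, dest_dir: Path) -> Path:
    # Map e.g. "trackers/2024/file.xlsx" under prefix "trackers/" to
    # dest_dir / "2024/file.xlsx". Pure logic: easy to test without GCS.
    relative = blob_name[len(prefix):] if blob_name.startswith(prefix) else blob_name
    return dest_dir / relative


def download_tracker_files(bucket_name: str, prefix: str, dest_dir: Path) -> list[Path]:
    from google.cloud import storage  # needs google-cloud-storage

    client = storage.Client()  # ADC, same story as the BigQuery client
    downloaded: list[Path] = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        target = local_path_for_blob(blob.name, prefix, dest_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))  # replaces `gsutil cp`
        downloaded.append(target)
    return downloaded
```

Splitting path computation from I/O is the design choice that makes the all-mocked test strategy cheap: only the thin loop over `list_blobs` needs a mock.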
@copilot Fix the CI errors; the tests will fail for now, but fix the ruff linter warnings.
- E501: Break long lines (comments, f-strings, docstrings) to ≤100 chars
- E501: Add noqa: E501 for file path literals in scripts/
- N806: Rename EXCEL_ERRORS -> excel_errors, FLOAT_REL_TOL -> float_rel_tol, FLOAT_ABS_TOL -> float_abs_tol (variables inside functions)
- B904: Add 'from e' to raise statements in except blocks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
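The B904 fix in the last bullet is about exception chaining: a bare `raise` inside an `except` block hides the root cause, while `from e` preserves it in the traceback. A minimal before/after illustration (the function name is made up for the example):

```python
"""B904: chain the original exception so the traceback keeps the cause."""


def parse_year(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as e:
        # Before the fix: `raise ValueError(...)` with no `from e` -> B904.
        # After: the original ValueError survives as __cause__.
        raise ValueError(f"invalid year in tracker filename: {raw!r}") from e
```

With the chain in place, a caller (or a test) can inspect `err.__cause__` to see the underlying parse failure instead of only the re-raised message.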
Fixed all 109 ruff linting errors. All 239 tests pass, 0 ruff errors remain.
This PR adds:
- GCS client code (src/a4d/gcp/storage.py)
- BigQuery client code (src/a4d/gcp/bigquery.py)
- src/a4d/gcp/__init__.py with re-exports and __all__
- upload-tables, download-trackers, upload-output CLI commands
- .env.example with GCP authentication documentation

Original prompt