
feat: BigQuery loading and GCS integration for Python pipeline #3

Merged
pmayd merged 4 commits into migration from copilot/plan-data-migration-steps
Feb 24, 2026

Conversation


Copilot AI commented Feb 20, 2026

  • Implement GCS storage client (src/a4d/gcp/storage.py)
  • Implement BigQuery loading client (src/a4d/gcp/bigquery.py)
  • Update src/a4d/gcp/__init__.py with re-exports and __all__
  • Add upload-tables, download-trackers, upload-output CLI commands
  • Add 18 unit tests for GCP modules (all mocked, all passing)
  • Update .env.example with GCP authentication documentation
  • Fix all 109 ruff linter warnings across the entire codebase (0 remaining)
  • All 239 tests passing
Original prompt

We are in the middle of a migration from R to Python. We are actually still in the testing phase, but testing is currently not possible without data. I would therefore like to plan the next steps and finish the pipeline. We should already have created the tables, so the next goal would be to load the data into BigQuery and then also test the pipeline in Google Cloud. What would be necessary for that, i.e. which services would need to be set up?



@pmayd
Collaborator

pmayd commented Feb 20, 2026

We also need to support running the pipeline both locally and within GCP. Locally we need some way to authenticate against GCP; when running inside a container on GCP, tools like gcloud are presumably already available. This needs to be considered when writing the client code for Google Cloud Storage and BigQuery.
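For reference, Application Default Credentials (ADC) resolve exactly this dual-mode concern. The sketch below mirrors the documented ADC lookup order in plain Python so the decision is visible; it is illustrative only, and the real resolution happens inside google.auth.default():

```python
import os


def describe_adc_source() -> str:
    """Report which Application Default Credentials source the
    google-auth library would likely pick up (illustrative sketch).

    Mirrors the documented ADC lookup order without importing
    the google-auth library itself.
    """
    # 1. An explicit service account key file wins.
    key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"service account key file: {key_path}"
    # 2. On a developer machine, `gcloud auth application-default login`
    #    writes user credentials to a well-known file.
    adc_file = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if os.path.exists(adc_file):
        return f"gcloud user credentials: {adc_file}"
    # 3. Inside GCP (Cloud Run, GCE, ...), the metadata server provides
    #    credentials automatically; no local file is needed.
    return "GCP metadata server (when running inside Google Cloud)"
```

Because the client libraries walk this chain themselves, the same client code works unchanged on a laptop and in a GCP container.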

Copilot AI and others added 2 commits February 20, 2026 12:45
- Add gcp/bigquery.py with load_table() and load_pipeline_tables()
  matching R pipeline's ingest_data() with clustering fields
- Add gcp/storage.py with download_tracker_files() and upload_output()
  replacing R pipeline's gsutil CLI calls
- Add CLI commands: upload-tables, download-trackers, upload-output
- Add 18 unit tests for GCP modules (all mocked)
- Update .env.example with GCP auth documentation

Co-authored-by: pmayd <9614291+pmayd@users.noreply.github.com>
- Add __all__ to gcp/__init__.py for explicit public API
- Add comment explaining table_logs.parquet naming convention
- Extract _get_job_config() helper in tests to reduce duplication

Co-authored-by: pmayd <9614291+pmayd@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Plan next steps for data migration to BigQuery" to "feat: BigQuery loading and GCS integration for Python pipeline" Feb 20, 2026
Copilot AI requested a review from pmayd February 20, 2026 12:50
@pmayd
Collaborator

pmayd commented Feb 20, 2026

I allowed astral.sh for agents, you can try again @copilot


Copilot AI left a comment


Pull request overview

This PR implements GCP integration for the Python pipeline migration, enabling BigQuery table loading and Google Cloud Storage operations that were previously handled by R scripts using CLI tools (bq and gsutil).

Changes:

  • Added a4d.gcp.bigquery module with table loading functions and clustering configurations matching the R pipeline
  • Added a4d.gcp.storage module for downloading tracker files and uploading pipeline output to GCS
  • Added three new CLI commands: upload-tables, download-trackers, and upload-output
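The PR page does not show which CLI framework src/a4d/cli.py uses, so here is a minimal argparse sketch of the three subcommands; the command names come from the PR description, while the flag names are assumptions for illustration:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch only: the real src/a4d/cli.py may use another framework
    # (e.g. Typer or Click); the --dataset/--bucket flags are invented.
    parser = argparse.ArgumentParser(prog="a4d")
    sub = parser.add_subparsers(dest="command", required=True)

    upload_tables = sub.add_parser(
        "upload-tables", help="Load pipeline tables into BigQuery"
    )
    upload_tables.add_argument("--dataset", required=True)

    download = sub.add_parser(
        "download-trackers", help="Download tracker files from GCS"
    )
    download.add_argument("--bucket", required=True)

    upload_out = sub.add_parser(
        "upload-output", help="Upload pipeline output to GCS"
    )
    upload_out.add_argument("--bucket", required=True)
    return parser
```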

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.

Summary per file:

src/a4d/gcp/bigquery.py: Implements BigQuery table loading with clustering configs for patient_data_monthly, patient_data_annual, patient_data_static, and logs tables
src/a4d/gcp/storage.py: Implements GCS operations for downloading tracker files and uploading pipeline output
src/a4d/gcp/__init__.py: Exports the public API for the GCP module
src/a4d/cli.py: Adds three new commands for BigQuery upload and GCS interactions
.env.example: Documents GCP authentication options using Application Default Credentials
tests/test_gcp/test_bigquery.py: 18 unit tests covering table loading, clustering configs, and error handling
tests/test_gcp/test_storage.py: Tests for GCS download/upload with mocked clients
tests/test_gcp/__init__.py: Empty init file for the test package
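The clustering fields themselves live in src/a4d/gcp/bigquery.py and are not visible on this page. A rough sketch of how such a per-table load configuration might be structured follows; the field names and the job_config_kwargs helper are hypothetical, not the module's actual API:

```python
# Hypothetical clustering-field mapping; the real fields are defined in
# src/a4d/gcp/bigquery.py and may differ.
CLUSTERING_FIELDS = {
    "patient_data_monthly": ["country_code", "clinic_code"],
    "patient_data_annual": ["country_code", "clinic_code"],
    "patient_data_static": ["country_code", "clinic_code"],
    "logs": ["file_name"],
}


def job_config_kwargs(table_name: str) -> dict:
    """Return the keyword arguments one would pass to
    bigquery.LoadJobConfig for a given table (sketch only)."""
    kwargs = {
        "source_format": "PARQUET",
        "write_disposition": "WRITE_TRUNCATE",
    }
    fields = CLUSTERING_FIELDS.get(table_name)
    if fields:
        # BigQuery allows up to four clustering fields per table.
        kwargs["clustering_fields"] = fields
    return kwargs
```

Keeping the mapping in one dict matches the R pipeline's ingest_data() approach of configuring clustering per table in a single place.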


@pmayd pmayd marked this pull request as ready for review February 20, 2026 15:10
@pmayd
Collaborator

pmayd commented Feb 20, 2026

Fix the CI errors; the tests will fail for now, but fix the ruff linter warnings.

@pmayd
Collaborator

pmayd commented Feb 20, 2026

@copilot Fix the CI errors; the tests will fail for now, but fix the ruff linter warnings.

- E501: Break long lines (comments, f-strings, docstrings) to ≤100 chars
- E501: Add noqa: E501 for file path literals in scripts/
- N806: Rename EXCEL_ERRORS -> excel_errors, FLOAT_REL_TOL -> float_rel_tol,
  FLOAT_ABS_TOL -> float_abs_tol (variables inside functions)
- B904: Add 'from e' to raise statements in except blocks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
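The B904 and N806 fixes described in the commit message amount to patterns like the following sketch (the functions here are invented for illustration, not taken from the codebase):

```python
# B904: inside an `except` block, chain the original exception with
# `from e` so the traceback preserves the root cause.
def parse_port(value: str) -> int:
    try:
        return int(value)
    except ValueError as e:
        raise RuntimeError(f"invalid port: {value!r}") from e


# N806: variables inside functions are lowercase_snake_case;
# UPPER_CASE is reserved for module-level constants.
def close_enough(a: float, b: float) -> bool:
    float_rel_tol = 1e-9  # was FLOAT_REL_TOL before the rename
    return abs(a - b) <= float_rel_tol * max(abs(a), abs(b))
```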
Author

Copilot AI commented Feb 20, 2026

@copilot Fix the CI errors; the tests will fail for now, but fix the ruff linter warnings.

Fixed all 109 ruff linting errors in commit ac2fa23. This includes:

  • 46 E501 (line too long) - wrapped comments, f-strings, and docstrings; added # noqa: E501 for file path literals in scripts/
  • Auto-fixed UP035, F841, I001, PT006, B007, PT018 via ruff --fix
  • 3 N806 (variable naming) - renamed uppercase variables to lowercase inside functions
  • 2 B904 (raise from) - added from e to raise statements in except blocks

All 239 tests pass, 0 ruff errors remain.

We also need to check and allow running the pipeline locally as well as within GCP, so locally we need to somehow authenticate against GCP I guess and when running inside a container on GCP I guess th...

The GCP client code (gcp/bigquery.py and gcp/storage.py) already uses Application Default Credentials (ADC), which handles both scenarios:

  • Locally: Authenticate with gcloud auth application-default login - no service account key file needed
  • In GCP (Cloud Run/GCE): Authentication is automatic via the metadata server - no extra setup
  • CI/CD or VMs: Set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json environment variable

The .env.example documents these options. No gcloud CLI is needed at runtime - the Python client libraries (google-cloud-bigquery, google-cloud-storage) handle auth natively through ADC.

Copilot AI requested a review from pmayd February 20, 2026 15:23
@pmayd pmayd merged commit 7e06dbc into migration Feb 24, 2026