141 commits
bdab0e3
Add migration strategy and technical documentation
pmayd Oct 19, 2025
4e06793
Initialize Python project structure (Phase 0)
pmayd Oct 19, 2025
5be3f78
Add GitHub Actions CI with Astral toolchain
pmayd Oct 19, 2025
de25d6d
Consolidate migration documentation
pmayd Oct 19, 2025
611dcd4
Add root CLAUDE.md as navigation hub
pmayd Oct 19, 2025
7c45b79
update pyproject.toml and add proper dev group
pmayd Oct 20, 2025
7f641fb
Add justfile and update README with development commands
pmayd Oct 20, 2025
c35a072
Implement column synonyms mapper for data extraction
pmayd Oct 20, 2025
98e0419
Add shared utilities and province validation with case-insensitive ma…
pmayd Oct 20, 2025
361a898
Refactor: Reorganize into reference/ package for better structure
pmayd Oct 20, 2025
469c738
add examples to modules, fix docxtring indent
pmayd Oct 20, 2025
77af9df
Optimize patient extraction with single-pass read-only loading
pmayd Oct 22, 2025
05d24aa
update tests
pmayd Oct 22, 2025
3b06496
1. Patient Data Extraction Module (src/a4d/extract/patient.py)
pmayd Oct 23, 2025
a7f1a88
further updates in the migration; step 2
pmayd Oct 25, 2025
ca3cad6
fix error in R pipeline with some 2024 trackers not correctly parsing…
pmayd Oct 25, 2025
0b0a376
add tqdm
pmayd Oct 27, 2025
061748a
next improvements
pmayd Oct 27, 2025
1b944ca
add log table summary to cli
pmayd Nov 4, 2025
02ffd36
Fix age calculation to match R pipeline behavior
pmayd Nov 4, 2025
cdd4be8
Add date validation and FBG text conversion to patient cleaning
pmayd Nov 4, 2025
ff5d755
remove old scripts
pmayd Nov 6, 2025
1617013
Implement flexible date parsing for legacy tracker formats
pmayd Nov 6, 2025
911f6e5
Simplify comparison script to use fixed base paths
pmayd Nov 6, 2025
8e6c91f
Fix date parser to handle month-year without separator
pmayd Nov 6, 2025
1d963da
Update VALIDATION_TRACKING: mark 2018 tracker as validated
pmayd Nov 6, 2025
ce4caf3
Fix extraction and cleaning bugs for 2021 Phattalung Hospital tracker
pmayd Nov 8, 2025
b62047d
Fix extraction to handle rows with missing row numbers
pmayd Nov 8, 2025
b0f4f37
Update validation tracking: 2022 Surat Thani Hospital fixed
pmayd Nov 8, 2025
d121f76
Fix extraction bugs: handle None max_row and filter Excel errors
pmayd Nov 9, 2025
3f1ba18
Update validation tracking: 2024 Sultanah Bahiyah fixed
pmayd Nov 9, 2025
426cae4
Add patient_id normalization for transfer patients
pmayd Nov 9, 2025
2355603
Add unit tests for patient cleaning preprocessing functions
pmayd Nov 9, 2025
989715b
Fix extraction to filter numeric zero patient IDs (0, 0.0)
pmayd Nov 9, 2025
0ba9d6e
Update validation tracking: investigation complete
pmayd Nov 9, 2025
228ebfc
Add comprehensive R vs Python validation test suite
pmayd Nov 10, 2025
f763c7b
Refactor R validation tests and fix HbA1c exceeds default value
pmayd Nov 11, 2025
0475663
Add patient-level exceptions for R extraction errors
pmayd Nov 11, 2025
c696630
Add file-specific column exceptions and identify R Unicode bug
pmayd Nov 11, 2025
2ed8ee0
Clarify Unicode character handling difference between R and Python
pmayd Nov 11, 2025
218886b
Treat null and empty string as equivalent in R vs Python validation
pmayd Nov 11, 2025
a9a74db
Add exceptions for Kantha Bopha II Hospital tracker R extraction errors
pmayd Nov 12, 2025
f2e8d4e
Implement province validation in Python to match R behavior
pmayd Nov 12, 2025
c727c32
Remove Kantha Bopha II province validation exception
pmayd Nov 12, 2025
63d9f38
Revert: Keep province exception - R validation is still broken
pmayd Nov 12, 2025
71b86c1
Implement sex synonym mapping to match R behavior
pmayd Nov 12, 2025
68b48b1
Implement BMI calculation to match R pipeline
pmayd Nov 12, 2025
55ad1ca
Complete migration of R data transformation functions
pmayd Nov 12, 2025
0eccaf9
Fix extract tests to use numeric values in column A
pmayd Nov 12, 2025
42f4abb
update exceptions for end to end tests
pmayd Nov 13, 2025
dc91f18
add a delete parameter to ingest_Data
pmayd Nov 14, 2025
5ec67b8
Add exception for 2017 Mandalay tracker missing status values
pmayd Nov 14, 2025
7ed5e36
Add exception for 2019 CDA tracker missing status value
pmayd Nov 15, 2025
97d34de
Update required column test to alert when exceptions are fixed
pmayd Nov 15, 2025
d88ef45
Add exception for 2019 Preah Kossamak tracker missing status
pmayd Nov 15, 2025
870c505
Add exception for 2019 Vietnam National Children's Hospital tracker
pmayd Nov 15, 2025
622d500
Fix header merge bug causing status column loss in 2021 Kantha Bopha …
pmayd Nov 15, 2025
87a7169
new limits for fbg, hba1c, bmi, age
pmayd Nov 15, 2025
c96611b
Fix validation error for rows with missing patient_id
pmayd Nov 15, 2025
9c3f236
Improve worker log file naming with timestamp and PID
pmayd Nov 15, 2025
52283fd
Fix patient_id normalization to handle hyphens before removing transf…
pmayd Nov 16, 2025
1948d13
Add exception for missing status in 2021 Mandalay tracker
pmayd Nov 16, 2025
dbb94a4
Add exception for missing status in 2021 Preah Kossamak tracker
pmayd Nov 16, 2025
52fa688
Add exceptions for missing status in 2022 Chiang Mai tracker
pmayd Nov 16, 2025
3fe851a
Add exception for missing status in 2022 Chulalongkorn tracker
pmayd Nov 16, 2025
6d43474
Add exception for missing status in 2022 Kantha Bopha tracker
pmayd Nov 16, 2025
60147ca
Add exception for missing status in 2022 Likas tracker
pmayd Nov 16, 2025
c4975ce
Add exception for missing status in 2022 Mandalay tracker
pmayd Nov 16, 2025
f38d5b5
Add exception for missing status in 2022 Penang tracker
pmayd Nov 16, 2025
118c580
Add exception for missing status in 2022 Putrajaya DC tracker
pmayd Nov 16, 2025
a47cdc1
Add exception for missing status in 2022 Sarawak DC tracker
pmayd Nov 16, 2025
e28c695
Add exception for missing status in 2022 Surat Thani tracker
pmayd Nov 16, 2025
e5e050b
Add exception for missing status in 2022 Udon Thani tracker
pmayd Nov 16, 2025
6371894
Add exception for missing status in 2023 Mahosot tracker
pmayd Nov 16, 2025
43d84d8
Add exception for missing status in 2023 Nakornping tracker
pmayd Nov 16, 2025
59507bb
Add exception for missing status in 2023 Surat Thani tracker
pmayd Nov 16, 2025
7cc0764
format
pmayd Dec 8, 2025
5f1fc33
exception for status in 2024_Likas Women & Children's Hospital A4D Tr…
pmayd Dec 8, 2025
32fbcad
exception for status in 2024_Yangon General Hospital A4D Tracker
pmayd Dec 8, 2025
20c90c6
add vscode settings
pmayd Dec 8, 2025
489a3b4
remove -m option from addopts to not exclude tests from test explorer…
pmayd Dec 8, 2025
f391249
ruff check
pmayd Dec 9, 2025
a9ce967
fix header merge forward-fill for group headers like observations_cat…
Dec 27, 2025
4b1c6ab
use Excel merge metadata for header forward-fill instead of heuristics
Dec 27, 2025
0cffe77
fix openpyxl warning filter to match submodules
Dec 27, 2025
7c91db1
perf: replace Excel merge metadata with synonym validation (6.6x faster)
Dec 27, 2025
424f9e7
fix: convert height from cm to m before BMI calculation
Dec 27, 2025
a812ecc
fix: date parsing truncated dd-Mon-yyyy format (11 chars) to 10
Dec 28, 2025
58700fd
test: add R extraction error exceptions for Mandalay trackers
Dec 28, 2025
7018884
test: add R extraction error exception for NPH tracker
Dec 28, 2025
1a8c7b8
fix: calculate age/t1d_diagnosis_age from DOB, handle Excel errors
Dec 28, 2025
a5b9948
Initial plan
Copilot Feb 20, 2026
1b7998d
feat: add BigQuery loading and GCS integration modules
Copilot Feb 20, 2026
b2ccea1
refactor: address code review feedback
Copilot Feb 20, 2026
ac2fa23
Fix all 63 ruff linting errors in a4d-python
Copilot Feb 20, 2026
7e06dbc
Merge pull request #3 from CorrelAid/copilot/plan-data-migration-steps
pmayd Feb 24, 2026
580b186
Initial plan
Copilot Feb 24, 2026
4eac0a8
Add Dockerfile, .dockerignore, deploy.sh, and cloud-run config for se…
Copilot Feb 24, 2026
b57b483
Deploy Python pipeline as Cloud Run Job: fix Dockerfile, add run-pipe…
Copilot Feb 24, 2026
c488a8c
Fix PR diff: sync migration branch files, leaving only the 4 new depl…
Copilot Feb 24, 2026
f3910a2
Merge branch 'migration' of https://github.com/CorrelAid/a4d into cop…
Copilot Feb 25, 2026
e6241f8
Upgrade Python from 3.11 to 3.13 in Dockerfile and CI workflow
Copilot Feb 25, 2026
ac1181f
Remove unused duckdb dependency and upgrade to Python 3.14
Copilot Feb 25, 2026
ef6a1b9
Merge pull request #4 from CorrelAid/copilot/setup-serverless-pipeline
pmayd Feb 25, 2026
38595b4
Initial plan
Copilot Feb 25, 2026
f5db33c
Fix patient pipeline: stale column names, Python version, product ing…
Copilot Feb 25, 2026
72c7be9
Remove longitudinal data table creation and all related code
Copilot Feb 25, 2026
8d188e6
Merge pull request #5 from CorrelAid/copilot/finalize-patient-data-mi…
pmayd Feb 25, 2026
7fb63fa
Finalize patient pipeline: fix bugs, remove stale code, add CLI tests
Feb 25, 2026
cccaea8
format code
Feb 26, 2026
0a7c3c5
Add clean output, CLI step output, and setup guide
Feb 26, 2026
7f88230
rename just commands, fix error in tests, fix max_workers not working…
Feb 26, 2026
b0085ce
Improve CLI UX, GCS performance, and GCP setup
Feb 27, 2026
a119270
Add backup-bq and fix logs-job in justfile
Feb 28, 2026
945d230
Audit and update docs: remove obsolete files, update to reflect curre…
Mar 1, 2026
fc5973f
Move all GCP resources to asia-southeast2 (Jakarta) for data residency
Mar 1, 2026
aa70d9e
Fix uv not found in Docker and update justfile region to Jakarta
Mar 1, 2026
767b433
Fix Docker build: add UV_PYTHON_DOWNLOADS and copy README.md
Mar 1, 2026
2012fe6
Fix Docker build: split uv sync to avoid editable install before src/…
Mar 1, 2026
6b22d19
Add gcloud command to list images in Artifact Registry after docker-push
Mar 1, 2026
81b77bf
Add local Docker test instructions and docker-smoke just command
Mar 1, 2026
dce1dd7
Add --provenance=false to docker-build to suppress attestation manifests
Mar 1, 2026
92f9ac6
Add git-SHA tagging strategy and rollback command to justfile
Mar 1, 2026
40293dc
Add docker-list command to justfile
Mar 1, 2026
7236446
Reorder and group justfile into logical sections
Mar 1, 2026
9e8089d
Fix E402: move warnings.filterwarnings after imports in cli.py
Mar 1, 2026
f96d4e0
Fix ty type errors: PolarsDataType annotation and tqdm isinstance check
Mar 1, 2026
657e3f6
Update Cloud Run Job spec: bump to 8 vCPU / 8 GiB, add jobs list command
Mar 1, 2026
80ee1e3
Code cleanup: remove unused imports, fix forward reference, reformat
Mar 1, 2026
3a6faae
Add --platform=linux/amd64 to docker-build for Cloud Run compatibility
Mar 1, 2026
8824cc3
Fix CI test failures by excluding slow and integration tests
Mar 1, 2026
b2c91a6
Fix reference_data path resolution in Docker/Cloud Run
Mar 1, 2026
bfa8c85
Switch parallel processing from ProcessPoolExecutor to ThreadPoolExec…
Mar 1, 2026
fb078ff
Update SETUP.md: fix BigQuery IAM grant command
Mar 1, 2026
113890f
Restore clean console output for parallel pipeline runs
Mar 1, 2026
c306479
Revert to ProcessPoolExecutor; add docker-clean command
Mar 1, 2026
38efab6
Fix GCS upload: timestamped prefix, upload only tables and logs
Mar 1, 2026
5560b38
Fix BigQuery loading: delete-before-recreate and update logs clustering
Mar 1, 2026
c18fb70
Add clinic_data_static table creation
Mar 1, 2026
7e345dd
Add error_code classification to pipeline logs
Mar 1, 2026
f4fb7b5
Fix CLI test runner for CI environment
Mar 1, 2026
13 changes: 13 additions & 0 deletions .dockerignore
@@ -0,0 +1,13 @@
.git
.github
.Rproj.user
.Rhistory
.RData
*.Rproj
a4d-python/.pytest_cache
a4d-python/.ruff_cache
a4d-python/htmlcov
a4d-python/.coverage
a4d-python/profiling/*.prof
data/
secrets/
52 changes: 52 additions & 0 deletions .github/workflows/python-ci.yml
@@ -0,0 +1,52 @@
name: Python CI

on:
  push:
    branches: [migration]
    paths:
      - 'a4d-python/**'
      - '.github/workflows/python-ci.yml'
  pull_request:
    branches: [main, develop, migration]
    paths:
      - 'a4d-python/**'
      - '.github/workflows/python-ci.yml'

jobs:
  test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: a4d-python

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v2
        with:
          enable-cache: true

      - name: Set up Python
        run: uv python install 3.14

      - name: Install dependencies
        run: uv sync --all-extras

      - name: Run ruff linting
        run: uv run ruff check .

      - name: Run ruff formatting check
        run: uv run ruff format --check .

      - name: Run type checking with ty
        run: uv run ty check src/

      - name: Run tests
        run: uv run pytest -m "not slow and not integration" --cov --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./a4d-python/coverage.xml
          flags: python
8 changes: 7 additions & 1 deletion .gitignore
@@ -10,4 +10,10 @@
rsconnect

data/output
data/mapping_table.csv
data/mapping_table.csv

# Serena (MCP server state)
.serena/

# Secrets (GCP service accounts, etc.)
secrets/
29 changes: 29 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,29 @@
{
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.cwd": "${workspaceFolder}/a4d-python",
    "python.testing.pytestArgs": [
        "${workspaceFolder}/a4d-python/tests"
    ],
    "python.defaultInterpreterPath": "${workspaceFolder}/a4d-python/.venv/bin/python",
    "workbench.colorCustomizations": {
        "activityBar.activeBackground": "#ab307e",
        "activityBar.background": "#ab307e",
        "activityBar.foreground": "#e7e7e7",
        "activityBar.inactiveForeground": "#e7e7e799",
        "activityBarBadge.background": "#25320e",
        "activityBarBadge.foreground": "#e7e7e7",
        "commandCenter.border": "#e7e7e799",
        "sash.hoverBorder": "#ab307e",
        "statusBar.background": "#832561",
        "statusBar.foreground": "#e7e7e7",
        "statusBarItem.hoverBackground": "#ab307e",
        "statusBarItem.remoteBackground": "#832561",
        "statusBarItem.remoteForeground": "#e7e7e7",
        "titleBar.activeBackground": "#832561",
        "titleBar.activeForeground": "#e7e7e7",
        "titleBar.inactiveBackground": "#83256199",
        "titleBar.inactiveForeground": "#e7e7e799"
    },
    "peacock.color": "#832561"
}
61 changes: 61 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,61 @@
# CLAUDE.md

This repository contains **two projects**:

## 1. R Pipeline (Production - Legacy)

**Location**: Root directory
**Status**: Production (being phased out)

The original R implementation of the A4D medical tracker data processing pipeline.

**Key Files**:
- `R/` - R package code
- `scripts/R/` - Pipeline scripts
- `reference_data/` - Shared YAML configurations

**Commands**: See README.md for R-specific commands

---

## 2. Python Pipeline (Active Development)

**Location**: `a4d-python/`
**Status**: Active migration
**Branch**: `migration`

New Python implementation with better performance and incremental processing.

**Documentation**: [a4d-python/docs/CLAUDE.md](a4d-python/docs/CLAUDE.md)

**Quick Start**:
```bash
cd a4d-python
uv sync
uv run pytest
```

**Migration Guide**: [a4d-python/docs/migration/MIGRATION_GUIDE.md](a4d-python/docs/migration/MIGRATION_GUIDE.md)

---

## Working on This Repository

**If working on R code**: Stay in root, use R commands

**If working on Python migration**:
```bash
cd a4d-python
# See a4d-python/docs/CLAUDE.md for Python-specific guidance
```

## Shared Resources

Both projects use the same reference data:
- `reference_data/synonyms/` - Column name mappings
- `reference_data/data_cleaning.yaml` - Validation rules
- `reference_data/provinces/` - Allowed provinces

**Do not modify these** without testing both R and Python pipelines.
- Always check your implementation against the original R pipeline and confirm that the logic is the same.
- Limit comments to explaining why a design decision was made or to giving important context for the migration; do not comment obvious code.
9 changes: 9 additions & 0 deletions R/script2_helper_patient_data_fix.R
@@ -176,6 +176,15 @@ parse_dates <- function(date) {
return(lubridate::NA_Date_)
}

# Handle Excel serial numbers (e.g., "45341.0", "39920.0")
# Excel stores dates as days since 1899-12-30
numeric_date <- suppressWarnings(as.numeric(date))
if (!is.na(numeric_date) && numeric_date > 1 && numeric_date < 100000) {
# This is likely an Excel serial number
excel_origin <- as.Date("1899-12-30")
return(excel_origin + as.integer(numeric_date))
}

parsed_date <- suppressWarnings(lubridate::as_date(date))

if (is.na(parsed_date)) {
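
For comparison with the R change above, the same Excel serial number rule can be sketched in Python. This is only an illustration of the logic in the hunk (origin 1899-12-30, accepted range strictly between 1 and 100000, fractional part truncated); the function name `parse_excel_serial` is hypothetical and is not taken from the Python pipeline.

```python
from datetime import date, timedelta
from typing import Optional

# Excel's day zero, as used in the R hunk above.
EXCEL_ORIGIN = date(1899, 12, 30)


def parse_excel_serial(value: str) -> Optional[date]:
    """Return a date if `value` looks like an Excel serial number, else None.

    Mirrors the R check: the value must be numeric and lie strictly
    between 1 and 100000; the fractional part is dropped.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not (1 < number < 100000):
        return None
    return EXCEL_ORIGIN + timedelta(days=int(number))
```

Returning `None` for anything else lets the caller fall through to regular date parsing, as the R function does after this block.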
90 changes: 0 additions & 90 deletions R/script3_create_table_patient_data_changes_only.R

This file was deleted.

25 changes: 25 additions & 0 deletions a4d-python/.env.example
@@ -0,0 +1,25 @@
# Environment Configuration
A4D_ENVIRONMENT=development

# GCP Configuration
A4D_PROJECT_ID=a4dphase2
A4D_DATASET=tracker
A4D_DOWNLOAD_BUCKET=a4dphase2_upload
A4D_UPLOAD_BUCKET=a4dphase2_output

# GCP Authentication (optional - uses Application Default Credentials if not set)
# For local development: run `gcloud auth application-default login`
# For CI/CD or VM: set path to service account key file
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Paths
A4D_DATA_ROOT=/path/to/tracker/files
A4D_OUTPUT_DIR=output

# Processing Settings
A4D_MAX_WORKERS=4

# Error Values (matching R pipeline)
A4D_ERROR_VAL_NUMERIC=999999
A4D_ERROR_VAL_CHARACTER=Undefined
A4D_ERROR_VAL_DATE=9999-12-31
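
A minimal sketch of how `A4D_`-prefixed variables like these might be read into typed settings, using only the standard library. The project may well use a dedicated settings library instead; the `PipelineSettings` class and `load_settings` function are hypothetical, and the defaults simply repeat the values in the example file above.

```python
import os
from dataclasses import dataclass
from datetime import date
from typing import Mapping


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical typed view of a subset of the A4D_* variables."""

    environment: str
    project_id: str
    dataset: str
    max_workers: int
    error_val_numeric: float
    error_val_character: str
    error_val_date: date


def load_settings(env: Mapping[str, str] = os.environ) -> PipelineSettings:
    """Read A4D_* variables from `env`, falling back to the example defaults."""

    def get(name: str, default: str) -> str:
        return env.get(f"A4D_{name}", default)

    return PipelineSettings(
        environment=get("ENVIRONMENT", "development"),
        project_id=get("PROJECT_ID", "a4dphase2"),
        dataset=get("DATASET", "tracker"),
        max_workers=int(get("MAX_WORKERS", "4")),
        # Sentinel error values, parsed into the type they stand in for.
        error_val_numeric=float(get("ERROR_VAL_NUMERIC", "999999")),
        error_val_character=get("ERROR_VAL_CHARACTER", "Undefined"),
        error_val_date=date.fromisoformat(get("ERROR_VAL_DATE", "9999-12-31")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"A4D_MAX_WORKERS": "8"})`.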
67 changes: 67 additions & 0 deletions a4d-python/.gitignore
@@ -0,0 +1,67 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv/
venv/
ENV/
env/

# uv
.uv/

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Type checking
.mypy_cache/
.dmypy.json
dmypy.json

# IDEs
.vscode/
.idea/
*.swp
*.swo
*~

# Environment
.env
.env.local

# Logs
*.log
logs/

# Data (sensitive)
data/
output/
*.parquet
*.xlsx
!reference_data/

# OS
.DS_Store
Thumbs.db