Background
The PIN geometry transformation steps that produce the `pin_geometry` table in `pin.R` (`ptaxsim/data-raw/pin/pin.R`, lines 146 to 191 at `eb1e3e8`) use a huge amount of memory (~100 GB) and run far too long on the Data team server (~2 days without completing). Here's the current block of code on the main branch:
```r
# pin_geometry -----------------------------------------------------------------

# Grab the maximum year of agency data so we can limit the scope of the parcel
# file query
agency_df <- read_parquet(file.path(remote_bucket, "agency", "part-0.parquet"))
max_year <- max(as.integer(agency_df$year))

# Grab parcel shapes from S3. These files are originally from the data portal
# and Cook Central
remote_bucket_geometry <- "s3://ccao-data-warehouse-us-east-1/spatial/parcel"
pin_geometry_df_full <- arrow::open_dataset(remote_bucket_geometry) %>%
  filter(year >= 2006 & year <= max_year) %>%
  select(year, x = lon, y = lat, pin10, geometry) %>%
  geoarrow_collect_sf()

# For each PIN10, keep only records where the shape/area of the PIN have changed
# and record the start and end year for each unique shape/area
pin_geometry_df_raw <- pin_geometry_df_full %>%
  mutate(area = st_area(geometry)) %>%
  group_by(pin10) %>%
  arrange(pin10, year) %>%
  mutate(across(c(area, x, y), lag, .names = "lag_{.col}")) %>%
  mutate(
    diff_area = !(abs(area - lag_area) < units::set_units(0.001, "m^2")),
    diff_cent = !(abs(x - lag_x) < 0.00001 & abs(y - lag_y) < 0.00001),
    pin_group = cumsum(
      (diff_area & diff_cent) |
        (is.na(diff_area) & is.na(diff_cent))
    )
  ) %>%
  group_by(pin10, pin_group) %>%
  mutate(
    start_year = min(year),
    end_year = max(year)
  ) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(pin10, start_year, end_year, longitude = x, latitude = y, geometry) %>%
  arrange(pin10, start_year)

# Write the raw PIN geometry to S3
geoarrow::write_geoparquet(
  pin_geometry_df_raw,
  sink = remote_path_pin_geometry_raw,
  compression = "zstd"
)
```
We need to figure out how to get this running in a reasonable amount of time (say, 12 hours or less).
Existing work
To make debugging easier, I pulled this logic out into its own script in the branch `jeancochrane/debug-pin-geometry-performance`.
Ideas
Some ideas for how we might improve performance:
- Debug the transformation step to figure out whether there are ways of performing the transformation in a less resource-intensive way
  - This would be the ideal approach, but I'm skeptical that there's much low-hanging fruit left, since Dan wrote this code and he's skilled at performant spatial transformations
- Refactor the logic to use the previous script output so that we don't have to redo the entire transformation pipeline every year
  - Not ideal, since it means the artifacts would not be fully reproducible without pre-existing output, but it could be a useful band-aid to get us moving while we figure out a better long-term solution
- Set up some infra to run this on a beefy rented AWS server, where it can take a long time to execute without blocking our day-to-day development
  - e.g. we could reuse the `build-and-run-batch-job` reusable workflow in a non-model project
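On the first idea, one hypothesis worth testing before writing it off: grouped dplyr verbs on an `sf` object carry the geometry list-column through every step, so the `group_by()`/`lag()` chain may be shuffling millions of polygon geometries in memory. A minimal sketch of doing the attribute work on a plain data frame via `st_drop_geometry()` and joining the geometry back only for surviving rows (toy point data and simplified change logic; all names here are illustrative, not from `pin.R`):

```r
library(sf)
library(dplyr)

# Toy stand-in for pin_geometry_df_full: two PIN10s over three years;
# PIN "A" moves in 2008, PIN "B" never moves
pts <- st_as_sf(
  data.frame(
    pin10 = rep(c("A", "B"), each = 3),
    year  = rep(2006:2008, 2),
    x     = c(1, 1, 2, 5, 5, 5),
    y     = c(1, 1, 2, 5, 5, 5)
  ),
  coords = c("x", "y"),
  remove = FALSE
) %>%
  mutate(row_id = row_number())

# Do the grouped lag/dedup work on a plain tibble so dplyr doesn't drag
# the geometry list-column through every grouped verb
surviving <- pts %>%
  st_drop_geometry() %>%
  group_by(pin10) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    changed = is.na(lag(x)) | abs(x - lag(x)) > 1e-5 | abs(y - lag(y)) > 1e-5
  ) %>%
  filter(changed) %>%
  ungroup()

# Join the geometry back only for the rows that survived deduplication
pin_geometry_dedup <- semi_join(pts, surviving, by = "row_id")

nrow(pin_geometry_dedup)  # 3: A@2006, A@2008, B@2006
```

Whether this actually moves the needle depends on where the real script spends its memory, so it would be worth profiling a sample of PINs first.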
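On the second idea, the incremental refactor could look roughly like this: read back the previous run's artifact, transform only the parcel years it doesn't already cover, and append. This is a sketch on toy tibbles, not the real artifact schema; in the actual script the previous output would be read back from the same path the script already writes to.

```r
library(dplyr)

# Toy stand-in for the previous run's output artifact
prev_output <- tibble(
  pin10      = c("0000000001", "0000000002"),
  start_year = c(2006, 2006),
  end_year   = c(2021, 2021)
)

# Toy stand-in for the latest parcel data
new_parcels <- tibble(
  pin10 = c("0000000001", "0000000002"),
  year  = c(2022, 2022)
)

# Only run the expensive transformation on years the previous artifact
# doesn't already cover, instead of re-processing 2006 onward every time
last_covered <- max(prev_output$end_year)
to_process   <- filter(new_parcels, year > last_covered)

# From here: transform just to_process, extend end_year for PINs whose
# shape is unchanged, and append new rows for PINs whose shape changed
```

The reproducibility caveat above still applies: a fresh environment with no pre-existing artifact would still need one full run to bootstrap.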