Get PIN geometry transformations running again #66

@jeancochrane

Description

Background

The PIN geometry transformation steps that produce the pin_geometry table in pin.R are using huge amounts of memory (~100 GB) and running far too long on the Data team server (~2 days without completing). Here's the current block of code on the main branch:

ptaxsim/data-raw/pin/pin.R

Lines 146 to 191 in eb1e3e8

# pin_geometry -----------------------------------------------------------------
# Grab the maximum year of agency data so we can limit the scope of the parcel
# file query
agency_df <- read_parquet(file.path(remote_bucket, "agency", "part-0.parquet"))
max_year <- max(as.integer(agency_df$year))

# Grab parcel shapes from S3. These files are originally from the data portal
# and Cook Central
remote_bucket_geometry <- "s3://ccao-data-warehouse-us-east-1/spatial/parcel"
pin_geometry_df_full <- arrow::open_dataset(remote_bucket_geometry) %>%
  filter(year >= 2006 & year <= max_year) %>%
  select(year, x = lon, y = lat, pin10, geometry) %>%
  geoarrow_collect_sf()

# For each PIN10, keep only records where the shape/area of the PIN have changed
# and record the start and end year for each unique shape/area
pin_geometry_df_raw <- pin_geometry_df_full %>%
  mutate(area = st_area(geometry)) %>%
  group_by(pin10) %>%
  arrange(pin10, year) %>%
  mutate(across(c(area, x, y), lag, .names = "lag_{.col}")) %>%
  mutate(
    diff_area = !(abs(area - lag_area) < units::set_units(0.001, "m^2")),
    diff_cent = !(abs(x - lag_x) < 0.00001 & abs(y - lag_y) < 0.00001),
    pin_group = cumsum(
      (diff_area & diff_cent) |
        (is.na(diff_area) & is.na(diff_cent))
    )
  ) %>%
  group_by(pin10, pin_group) %>%
  mutate(
    start_year = min(year),
    end_year = max(year)
  ) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(pin10, start_year, end_year, longitude = x, latitude = y, geometry) %>%
  arrange(pin10, start_year)

# Write the raw PIN geometry to S3
geoarrow::write_geoparquet(
  pin_geometry_df_raw,
  sink = remote_path_pin_geometry_raw,
  compression = "zstd"
)

We need to figure out how to get this running in a reasonable amount of time (say, 12 hours or less).
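For context on what the dedup logic is doing: pin_group increments whenever a parcel's area and centroid both change, or when a pin10 first appears (so its lagged values are NA). A toy example with made-up values (the tibble, areas, and coordinates here are illustrative, not real parcel data; the unit handling from st_area is dropped for simplicity):

toy <- tibble::tibble(
  pin10 = c("0101010010", "0101010010", "0101010010"),
  year  = c(2006, 2007, 2008),
  area  = c(100, 100, 250),  # shape changes in 2008
  x     = c(1.0, 1.0, 1.5),
  y     = c(2.0, 2.0, 2.5)
)

toy %>%
  dplyr::group_by(pin10) %>%
  dplyr::arrange(pin10, year) %>%
  dplyr::mutate(dplyr::across(c(area, x, y), lag, .names = "lag_{.col}")) %>%
  dplyr::mutate(
    diff_area = !(abs(area - lag_area) < 0.001),
    diff_cent = !(abs(x - lag_x) < 0.00001 & abs(y - lag_y) < 0.00001),
    pin_group = cumsum(
      (diff_area & diff_cent) | (is.na(diff_area) & is.na(diff_cent))
    )
  )
# pin_group comes out 1, 1, 2: the 2006-2007 rows collapse into one record,
# and the 2008 shape change starts a new one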

Existing work

To make debugging easier, I pulled this logic out into its own script in the branch jeancochrane/debug-pin-geometry-performance.

Ideas

Some ideas for how we might improve performance:

  • Debug the transformation step to figure out whether there are ways of performing the transformation in a less resource-intensive way
    • This would be the ideal approach, but I'm skeptical that there's much low-hanging fruit left, since Dan wrote this code and he is skilled at writing performant spatial transformations
  • Refactor the logic to reuse the previous script output so that we don't have to redo the entire transformation pipeline every year
    • Not ideal, since it means the artifacts would not be fully reproducible without pre-existing output, but it could be a useful stopgap to get us moving while we figure out a better long-term solution
  • Set up some infra to run this on a beefy rented AWS server, where it can take a long time to execute without blocking our day-to-day development
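One rough sketch of the first idea: since the grouped dedup logic only ever compares rows sharing a pin10, the dataset could be collected and processed in chunks keyed on a pin10 prefix, capping peak memory at one chunk's worth of parcels. This is a hedged sketch, not a tested implementation; dedup_pin_geometry() is a hypothetical helper wrapping the existing lag/cumsum logic, and whether Arrow can push the substr() filter down before collection would need to be verified:

dedup_by_prefix <- function(prefix, max_year) {
  arrow::open_dataset(remote_bucket_geometry) %>%
    dplyr::filter(year >= 2006 & year <= max_year) %>%
    # Ideally this filter is evaluated by Arrow before collecting into R,
    # so only one prefix's parcels are ever in memory at once
    dplyr::filter(substr(pin10, 1, 2) == prefix) %>%
    dplyr::select(year, x = lon, y = lat, pin10, geometry) %>%
    geoarrow_collect_sf() %>%
    dedup_pin_geometry()  # hypothetical: the existing lag/cumsum dedup, factored out
}

# Two-digit prefixes are just one possible chunking key; any partition that
# keeps all rows for a given pin10 in the same chunk would work
prefixes <- sprintf("%02d", 1:33)
pin_geometry_df_raw <- purrr::map_dfr(prefixes, dedup_by_prefix, max_year = max_year)

If Arrow can't push the substr() predicate down, a similar effect could be had by partitioning the S3 dataset on a prefix column up front, or by chunking on year instead and restructuring the dedup accordingly.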
