Lower memory usage for pin geometry calc #67

Draft
wrridgeway wants to merge 9 commits into 2024-data-update from jeancochrane/debug-pin-geometry-performance

wrridgeway commented Feb 26, 2026

We go from 1,469,467 to 1,470,359 rows (a difference of 892). Here is what the new data looks like compared to the old data, minus the geometry column:

| skim_type | skim_variable | source | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | pin10 | new | 0 | 1 | 10 | 10 | 0 | 1457959 | 0 | | | | | | | | |
| character | pin10 | old | 0 | 1 | 10 | 10 | 0 | 1457163 | 0 | | | | | | | | |
| numeric | start_year | new | 0 | 1 | | | | | | 2006.310948 | 1.852083596 | 2006.0 | 2006.0 | 2006.0 | 2006.0 | 2024.0 | ▇▁▁▁▁ |
| numeric | start_year | old | 0 | 1 | | | | | | 2006.30021 | 1.800622516 | 2006.0 | 2006.0 | 2006.0 | 2006.0 | 2023.0 | ▇▁▁▁▁ |
| numeric | end_year | new | 0 | 1 | | | | | | 2022.955853 | 2.291021408 | 2006.0 | 2023.0 | 2023.0 | 2024.0 | 2024.0 | ▁▁▁▁▇ |
| numeric | end_year | old | 0 | 1 | | | | | | 2022.623576 | 2.186314394 | 2006.0 | 2023.0 | 2023.0 | 2023.0 | 2023.0 | ▁▁▁▁▇ |
| numeric | longitude | new | 0 | 1 | | | | | | -87.76344617 | 0.140398164 | -88.2635061 | -87.82816859 | -87.73445373 | -87.66719229 | -87.5245681 | ▁▁▃▇▅ |
| numeric | longitude | old | 0 | 1 | | | | | | -87.76329732 | 0.140232267 | -88.2635061 | -87.82795164 | -87.73439089 | -87.66715148 | -87.5245681 | ▁▁▃▇▅ |
| numeric | latitude | new | 0 | 1 | | | | | | 41.84223347 | 0.167653113 | 41.46975644 | 41.71743355 | 41.85172987 | 41.9808197 | 42.15406191 | ▃▆▇▇▆ |
| numeric | latitude | old | 0 | 1 | | | | | | 41.84211685 | 0.167632348 | 41.46975644 | 41.71734915 | 41.85154599 | 41.98052128 | 42.15406191 | ▃▆▇▇▆ |

fill(town_code, .direction = "updown", .by = pin10) %>%
# For remaining missing town codes, replace with 99 to make looping through
# them below easier.
mutate(town_code = replace_na(town_code, 99))

It's annoying to loop through NA values. I tried using both group_map() and split() to avoid this, but they seemed to take far longer than just walking through the possible values and using filter().
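A minimal sketch of why the sentinel helps, using a hypothetical toy table (`pins`) rather than the real PIN data: `filter(town_code == NA)` matches no rows, so rows with missing codes would be skipped silently unless they get a real value to loop over. The toy values and the `chunk_sizes` name are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data standing in for the full PIN table
pins <- tibble(
  pin10 = c("a", "a", "b", "c"),
  town_code = c(10, NA, NA, 20)
)

# Fill town codes within each pin10, then give the leftovers a sentinel value
pins_filled <- pins %>%
  group_by(pin10) %>%
  fill(town_code, .direction = "updown") %>%
  ungroup() %>%
  mutate(town_code = replace_na(town_code, 99))

# Walk through the distinct codes and filter() for each one
town_codes <- sort(unique(pins_filled$town_code))
chunk_sizes <- map_int(town_codes, function(code) {
  nrow(filter(pins_filled, town_code == code))
})
```

With the sentinel in place, every row belongs to exactly one chunk, so the chunk sizes add up to the number of rows in the table.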

}, .progress = TRUE)

# Remove the full dataset to free up memory for processing
rm(pin_geometry_df_full)

This isn't necessary, but collecting the data and filling it takes a lot of time/memory and this lets us avoid having to redo it if anything goes wrong.

) %>%
arrange(pin10, start_year)
}, .progress = TRUE) %>%
bind_rows()

wrridgeway commented Feb 26, 2026

We're just moving this work inside of a map function to let us load less data into memory. I've chunked everything by town_code since we only need to look back in time within pin10, not across pin10.
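The chunking pattern described here can be sketched roughly as follows. This is an illustration on hypothetical toy data, not the code in this PR: the `shape` column, the `changed` flag, and the `process_chunk()` helper are all invented for the example, and it assumes every `pin10` sits inside a single `town_code` so that no pin's history is split across chunks.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: one row per pin10 per year
pins <- tibble(
  pin10 = c("a", "a", "b", "b", "c"),
  town_code = c(10, 10, 10, 10, 20),
  year = c(2006, 2007, 2006, 2007, 2006),
  shape = c("x", "x", "y", "z", "w")
)

# Process one town at a time; the look-back only ever compares rows
# that share a pin10, so each chunk can be handled independently
process_chunk <- function(code) {
  pins %>%
    filter(town_code == code) %>%
    group_by(pin10) %>%
    arrange(year, .by_group = TRUE) %>%
    mutate(changed = shape != lag(shape, default = first(shape))) %>%
    ungroup()
}

result <- sort(unique(pins$town_code)) %>%
  map(process_chunk) %>%
  bind_rows() %>%
  arrange(pin10, year)
```

In a real pipeline the `filter()` would presumably run before the data is pulled into memory, so only one town's rows are loaded at a time. Here everything is already in memory, so the sketch only shows the shape of the loop.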

wrridgeway self-assigned this Feb 26, 2026
wrridgeway linked an issue Feb 26, 2026 that may be closed by this pull request
wrridgeway changed the base branch from master to 2024-data-update February 26, 2026 21:02

Development

Successfully merging this pull request may close these issues.

Get PIN geometry transformations running again

2 participants