Remove year groupings from Socrata upload script where possible #922

@jeancochrane

Adapted from #894 (comment):

It strikes me that the "remove missing years" operation is distinct enough from the year-upload loop that it would be clearer to move it into its own step. Unfortunately, that's not currently an easy change, because the year-upload loop is tightly coupled to the logic that uploads rows, including deleted rows (currently on lines 379-393). As a result, we have to resort to a weird hack in this block where we only perform the "remove missing years" operation in the last iteration of the year-upload loop.

We could try to fix this by factoring out the row-uploading logic and running it after the year loop completes, such that the year loop's only job is to collate data into a combined input_data dataframe that covers all years, which we could then combine with missing_years and upload all at once. However, to me, all of this points to a larger issue with the code: If we're already chunking our uploads by 10k rows anyway, then why are we also chunking our data by year in the first place?
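
A rough sketch of what that intermediate refactor could look like, using plain lists of dicts in place of dataframes so it runs on its own. `fetch_year`, `deleted_rows_for`, and `upload_chunk` are hypothetical stand-ins, not functions from the actual script; only the 10k chunk size comes from the discussion above.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
CHUNK_SIZE = 10_000  # chunk size mentioned above


def collate(
    years: Iterable[int], fetch_year: Callable[[int], List[Row]]
) -> List[Row]:
    """Year loop whose only job is to build one combined input_data."""
    input_data: List[Row] = []
    for year in years:
        input_data.extend(fetch_year(year))
    return input_data


def upload_in_chunks(
    rows: List[Row],
    upload_chunk: Callable[[List[Row]], None],
    size: int = CHUNK_SIZE,
) -> int:
    """Upload rows in fixed-size chunks; returns the number of chunks sent."""
    sent = 0
    for start in range(0, len(rows), size):
        upload_chunk(rows[start:start + size])
        sent += 1
    return sent


def run(
    years: Iterable[int],
    missing_years: Iterable[int],
    fetch_year: Callable[[int], List[Row]],
    deleted_rows_for: Callable[[int], List[Row]],  # hypothetical helper
    upload_chunk: Callable[[List[Row]], None],
    size: int = CHUNK_SIZE,
) -> int:
    input_data = collate(years, fetch_year)
    # "Remove missing years" becomes its own step after the year loop,
    # rather than a special case in the loop's last iteration.
    for year in missing_years:
        input_data.extend(deleted_rows_for(year))
    return upload_in_chunks(input_data, upload_chunk, size)
```

With this shape, the upload step never sees a year; it only sees rows.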

As an example of why I think this question is important: I spent way too much time reviewing this code relative to the complexity of the change, and I spent most of that time tracing the origin of the various lists of years that the code uses so that I could confirm whether functions like check_deleted() run inside year loops. That cognitive complexity makes it hard to have confidence in our changes, even small ones like this. In my opinion, as long as the 10k row chunks are working fine for upload, the only part of the code that should care about years is the extract step that filters the Athena query to limit the input data to the years that the user wants to upload. Outside of that extract step, I think we should avoid grouping rows or queries by year in our transformation/upload steps wherever possible.
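
To illustrate, here is a minimal sketch of an extract step that is the only year-aware piece of the pipeline. The `build_query` name, the table argument, and the integer `year` column are assumptions for illustration; the real Athena tables may name or type that column differently.

```python
from typing import Iterable, Optional


def build_query(table: str, years: Optional[Iterable[int]] = None) -> str:
    """Build the extract query; this is the only place years appear."""
    query = f"SELECT * FROM {table}"
    # Deduplicate and sort so the same request always yields the same SQL.
    # int() also rejects anything that is not a plain year value.
    unique_years = sorted({int(y) for y in (years or [])})
    if unique_years:
        year_list = ", ".join(str(y) for y in unique_years)
        query += f" WHERE year IN ({year_list})"
    # No years requested means no year filter at all.
    return query
```

Everything downstream of this query would then operate on the full result set and split it only by row count.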
