Skip to content

IN-1734 redact file names#11

Merged
ehanson8 merged 2 commits intomainfrom
IN-1734-redact-file-names
Apr 9, 2026
Merged

IN-1734 redact file names#11
ehanson8 merged 2 commits intomainfrom
IN-1734-redact-file-names

Conversation

@ehanson8
Copy link
Copy Markdown
Contributor

@ehanson8 ehanson8 commented Apr 9, 2026

Purpose and background context

The file redaction changes are the primary purpose of this PR but I would like you to weigh in on whether the added date message is this the most effective way to manage a lack of data without crashing the notebook.

How can a reviewer manually see the effects of these changes?

  1. Set the following values in your .env file
DIGITIZED_BAG_IDS=""
S3_INVENTORY_LOCATIONS="<shared-on-slack>"
  1. Set Prod credentials
  2. Run the following command:
make edit-notebook
  1. Review the cell with the label # Verify S3 Inventory data is available for the selected date. Prod data currently starts at 2025-10-09, select a date before that to see the error message and then select a date after to see the populated tables load.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Why these changes are being introduced:
* File names need to be redacted for all data points

How this addresses that need:
* Replace file name and file path with extension and bagname where they appear
* Update uuid parsing from S3 key
* Add loading message indicating whether S3 inventory data was found for a given date
* Update dependencies

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1734
@ehanson8 ehanson8 requested a review from a team as a code owner April 9, 2026 15:15
@ehanson8 ehanson8 requested review from ghukill and removed request for a team April 9, 2026 15:15
Copy link
Copy Markdown

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, looks good to me!

I did leave some comments about a possible inversion and use of mo.stop(). And, some ponderings about stopping other cells from firing.

Comment thread notebook.py Outdated
Comment on lines +317 to +327
# Verify S3 Inventory data is available for the selected date
logger.info(f"Collecting parquet file URIs for date: {selected_date}")

if selected_date not in symlink_dict:
date_message = (
f"No S3 Inventory data found for {selected_date}, select a different date"
)
else:
date_message = f"S3 Inventory data found for {selected_date}, loading data tables and visualizations..."
mo.md(date_message)
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was interesting! I had to consule a notebook I had worked on awhile ago, alma tag export, to re-familiarize with mo.stop() which I think could be a good fit here.

To me, the happy path is that data exists for that date and the rest of the notebook loads. When that happens, I'm not sure you need any actual notebook output.

It's only when there is not data for the date, that output feels helpful.

An alternate pattern could be to stop any more work by this checker/validation/integrity cell if data is found, else the default behavior is to notify:

# Verify S3 Inventory data is available for the selected date
logger.info(f"Collecting parquet file URIs for date: {selected_date}")

if selected_date in symlink_dict:
    mo.stop()

mo.md(f"No S3 Inventory data found for {selected_date}, select a different date")

At least personally, this feels weird at first... and then it starts to feel really normal for cells that aren't meant to do anything unless something is true. In this case, the "something is true" = no data found for the date, so we want to notify the user.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure at the moment, without a deeper dive into the notebook, but this might set you up for stopping all downstream cells entirely from firing.

Like, you could set some kind of flag where other cells would also not fire if that flag was set.

In the run view, it's fine, there is nothing visible. But in the edit view, when data isn't found for the date, you get cascading errors because the next cell fails.

If interested, happy to brainstorm how to avoid that. The alma notebook has a possible pattern where you run mo.stop() in a cell if something isn't right. You could theoretically add this to all downstream cells... but that feels a little cumbersome.

In the alma notebook it was only 2-3 cells where a check is performed, looking for something in the cell above. But here, I'm wondering if there is a more idiomatic marimo approach like, "stop everything if X is true."

* Update notebook's flow control with stop function
* Add spinners for long running processes
@ehanson8
Copy link
Copy Markdown
Contributor Author

ehanson8 commented Apr 9, 2026

@ghukill Pushed a new commit based on our conversation, thanks again for the huddle!

@coveralls
Copy link
Copy Markdown

Coverage Report for CI Build 24207062828

Coverage decreased (-0.1%) to 16.312%

Details

  • Coverage decreased (-0.1%) from the base build.
  • Patch coverage: 23 uncovered changes across 1 file (2 of 25 lines covered, 8.0%).
  • 3 coverage regressions across 1 file.

Uncovered Changes

File Changed Covered %
notebook.py 25 2 8.0%

Coverage Regressions

3 previously-covered lines in 1 file lost coverage.

File Lines Losing Coverage Coverage
notebook.py 3 15.71%

Coverage Stats

Coverage Status
Relevant Lines: 282
Covered Lines: 46
Line Coverage: 16.31%
Coverage Strength: 0.16 hits per line

💛 - Coveralls

@ghukill ghukill self-requested a review April 9, 2026 18:47
Copy link
Copy Markdown

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I learned a lot about marimo during this review.

100% optional, approval stands, I did notice that when the notebook first loads, it relies on the built-in hourglass in the upper-left to indicate loading before the spinners kick on.

If you wanted to, I'm betting there is some early data pull that could have a spinner as well? It must happen before the date picker is loaded.

Totally optional. Looking real good as-is.

Comment thread notebook.py
Comment on lines +305 to +308
mo.stop(
selected_date not in symlink_dict,
mo.md(f"No S3 Inventory data found for {selected_date}, select a different date"),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 - after we did that marimo research about how and when to use mo.stop(...), this feels like the right approach.

And maybe most importantly, how descendent cells are set; where the selected_date here makes all other cells descendent, and thus also stop with this mo.stop().

@ehanson8
Copy link
Copy Markdown
Contributor Author

ehanson8 commented Apr 9, 2026

100% optional, approval stands, I did notice that when the notebook first loads, it relies on the built-in hourglass in the upper-left to indicate loading before the spinners kick on.

If you wanted to, I'm betting there is some early data pull that could have a spinner as well? It must happen before the date picker is loaded.

@ghukill I'll hold off on that change for now but keep it in mind for future PRs!

@ehanson8 ehanson8 merged commit c11a203 into main Apr 9, 2026
2 checks passed
@ehanson8 ehanson8 deleted the IN-1734-redact-file-names branch April 9, 2026 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants