Feature/metadata and remote expansion by Hbeilinson · Pull Request #41 · hinxcode/digital-collections-explorer

Hbeilinson · 2025-12-08T23:45:53Z

Description

I made three changes, all specifically to the photograph part of the app. These were:

Allowing the app to run on photographs stored in S3, without having to locally store all of the raw images.
Adding a date search filter.
Adding an option to filter photographs by file path before running the embedding search.

Motivation and Context

The first of these changes allows the app to scale to larger datasets of photographs. For use cases where there are over a million photos, it will be helpful to be able to run the app without having to store all of the photos locally.

The next two are to enable more specific photograph searching. This is particularly useful for contexts where a user might know about a specific photo they're looking for, but not know where to find it. By filtering based on date or file name they can get closer to finding the photo they want, and then layer the embedding search on top of that.
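A rough sketch of the "filter first, then run the embedding search" idea, using only the standard library. The record layout (`path`, `date`, `embedding`), the glob-style path match, and the function names are assumptions for illustration, not the code in this PR:

```python
import math
from datetime import date
from fnmatch import fnmatchcase


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, items, path_glob=None, start=None, end=None, top_k=5):
    """Narrow the candidates by metadata, then rank the survivors by similarity."""
    candidates = []
    for item in items:
        # Hypothetical path filter: a case-sensitive glob over the stored path.
        if path_glob and not fnmatchcase(item["path"], path_glob):
            continue
        # Hypothetical date filter: an inclusive range on the origin date.
        if start and item["date"] < start:
            continue
        if end and item["date"] > end:
            continue
        candidates.append(item)
    ranked = sorted(
        candidates,
        key=lambda it: cosine(query_vec, it["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data; real embeddings would come from the embedding model.
photos = [
    {"path": "1969/moon/a.jpg", "date": date(1969, 7, 20), "embedding": [1.0, 0.0]},
    {"path": "1969/moon/b.jpg", "date": date(1969, 7, 21), "embedding": [0.6, 0.8]},
    {"path": "1972/misc/c.jpg", "date": date(1972, 1, 1), "embedding": [1.0, 0.0]},
]
hits = search([1.0, 0.0], photos, path_glob="1969/*")
```

Filtering before ranking means similarity is only computed for photos that already match the metadata constraints.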

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring (no functional changes)
Performance improvement
Research contribution (new models, evaluation methods, etc.)
Other (please describe):

Component(s) Affected

Changes Made

Updated the generate_embeddings script to be able to download files from S3.
Updated the generate_embeddings script to store the origin date of a photograph in the metadata file.
Updated the backend to fetch full photographs from S3 when they are not stored locally.
Updated the backend to provide an API for date search.
Updated the backend to support a file name filter on text search.
Updated the photograph frontend to add a date search option.
Updated the photograph frontend to include a filter bar below the text search, currently only including the file path filter.
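For illustration, the "serve from local disk if present, otherwise fetch from S3" behaviour can be sketched as below. `resolve_image`, `cache_dir`, and `fetch_remote` are hypothetical names, not taken from this PR. The remote fetch is passed in as a callable so the sketch runs without AWS access; with boto3 it could wrap the S3 client's `download_file(Bucket, Key, Filename)` call.

```python
import tempfile
from pathlib import Path
from typing import Callable


def resolve_image(
    rel_path: str,
    raw_data_dir: Path,
    cache_dir: Path,
    fetch_remote: Callable[[str, Path], None],
) -> Path:
    """Return a local path for rel_path, fetching it remotely only when needed."""
    local = raw_data_dir / rel_path
    if local.is_file():
        return local  # Already stored locally; no remote call.
    cached = cache_dir / rel_path
    if not cached.is_file():
        cached.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: fetch_remote(key, dest) writes the object to dest,
        # e.g. by calling boto3's client.download_file(bucket, key, str(dest)).
        fetch_remote(rel_path, cached)
    return cached


def fake_fetch(key: str, dest: Path) -> None:
    """Stand-in for an S3 download, used only for this demo."""
    dest.write_bytes(b"remote:" + key.encode())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw, cache = root / "raw", root / "cache"
    (raw / "a").mkdir(parents=True)
    (raw / "a" / "local.jpg").write_bytes(b"local")
    local_hit = resolve_image("a/local.jpg", raw, cache, fake_fetch).read_bytes()
    remote_hit = resolve_image("b/remote.jpg", raw, cache, fake_fetch).read_bytes()
```

Keeping fetched files in a separate cache directory means `raw_data_dir` is not modified, and repeated requests for the same photo do not trigger repeated downloads.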

Testing

How Has This Been Tested?

I ran manual tests on each aspect that I described above.

Screenshots (if applicable)

Before	After

N/A

Checklist

Code Quality

My code follows the project's coding standards
I have run black . and isort . on Python code
I have run npm run lint on frontend code (if applicable)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings or errors

Testing

I have added tests that prove my fix is effective or that my feature works
I didn't see unit tests.
New and existing unit tests pass locally with my changes
I didn't see unit tests.
I have tested this locally with actual data

Documentation

I have updated the documentation accordingly
I have updated the README if needed
I have added docstrings to new functions/classes
I have updated config.json documentation if config changes were made

Dependencies

I have updated requirements.txt (if Python dependencies changed)
I have updated package.json (if Node dependencies changed)
I have documented any new configuration options

Research (if applicable)

I have included references to relevant papers or research
I have shared evaluation results or benchmarks
I have included information about datasets used
I have documented model training procedures

Breaking Changes

None / (describe breaking changes)

Additional Notes

Reviewers Checklist (for maintainers)


hinxcode

Thanks for this great work! The S3 support and date filtering are solid additions. I had a few thoughts on the implementation:

S3 file handling

I took a look at process_remote_files and noticed it’s still downloading files to local disk and then cleaning them up afterward. One alternative I’d prefer is to keep generate_embeddings.py focused on embedding generation, and provide a new helper script (can be written in any language) that downloads the collection from S3 into raw_data_dir up front. That way:
(1) We avoid adding more parameters/configuration for bucket names, S3 credentials, etc.
(2) We can rely on proven tooling like s5cmd for fast, reliable syncs, with optional cleanup handled by the helper script.
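A minimal sketch of such a helper, assuming s5cmd is installed and picks up credentials from the standard AWS environment or credentials file. The function names, bucket, prefix, and directory below are placeholders, and the exact `s5cmd sync` options worth using should be checked against the s5cmd documentation:

```python
import shutil
import subprocess
from typing import List


def build_sync_command(bucket: str, prefix: str, raw_data_dir: str) -> List[str]:
    """Build an `s5cmd sync` command that mirrors an S3 prefix into raw_data_dir."""
    prefix = prefix.strip("/")
    source = f"s3://{bucket}/{prefix}/*" if prefix else f"s3://{bucket}/*"
    return ["s5cmd", "sync", source, raw_data_dir.rstrip("/") + "/"]


def sync_collection(bucket: str, prefix: str, raw_data_dir: str, dry_run: bool = True) -> List[str]:
    """Run the sync, or just return the command when dry_run is True."""
    cmd = build_sync_command(bucket, prefix, raw_data_dir)
    if dry_run:
        return cmd
    if shutil.which("s5cmd") is None:
        raise RuntimeError("s5cmd not found on PATH")
    subprocess.run(cmd, check=True)  # Raises CalledProcessError on failure.
    return cmd


# Placeholder values; nothing is downloaded because dry_run defaults to True.
cmd = sync_collection("example-bucket", "photos/", "data/raw")
print(" ".join(cmd))
```

With this split, generate_embeddings.py only ever sees files under raw_data_dir, and bucket names or credentials never enter its configuration.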

Date filtering UX

Instead of making date search its own tab, could we make date a filter that works alongside text and image search? In practice, I can imagine researchers and practitioners wanting to limit results to a certain time range in both modes. We should also keep the codebase flexible so that future contributors can extend the same pattern and add more filters.
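One possible shape for this, as a sketch only: a small registry of named filters that both text and image search could apply to the candidate set before ranking. The registry, the filter names, and the request format are all hypothetical, not existing code in this repository.

```python
from datetime import date
from typing import Callable, Dict, List

# Maps a filter name to a factory that turns request params into a predicate.
FILTERS: Dict[str, Callable[[dict], Callable[[dict], bool]]] = {}


def register(name: str):
    """Decorator that adds a filter factory to the registry under `name`."""
    def wrap(factory):
        FILTERS[name] = factory
        return factory
    return wrap


@register("date_range")
def date_range(params: dict):
    start = date.fromisoformat(params["start"])
    end = date.fromisoformat(params["end"])
    # Records without a known date are excluded when a date range is requested.
    return lambda rec: rec.get("date") is not None and start <= rec["date"] <= end


@register("path_contains")
def path_contains(params: dict):
    needle = params["value"].lower()
    return lambda rec: needle in str(rec["path"]).lower()


def apply_filters(records: List[dict], spec: Dict[str, dict]) -> List[dict]:
    """Keep only the records that satisfy every requested filter."""
    predicates = [FILTERS[name](params) for name, params in spec.items()]
    return [r for r in records if all(p(r) for p in predicates)]


records = [
    {"path": "Box1/a.jpg", "date": date(1950, 5, 1)},
    {"path": "Box2/b.jpg", "date": date(1960, 1, 1)},
    {"path": "box1/c.jpg", "date": None},
]
spec = {
    "date_range": {"start": "1950-01-01", "end": "1955-12-31"},
    "path_contains": {"value": "box1"},
}
kept = apply_filters(records, spec)
```

Adding a new filter would then only require registering another factory; the text and image search code paths would not need to change.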

Hbeilinson and others added 6 commits October 10, 2025 22:22

Added the capability to run embeddings on files stored in AWS S3

8935f3d

Merged in updates from main

7edd27f

Sample download from S3 working through full app

b2f8572

feat: added capacity to access photographs stored on S3, as well as t…

686e6ac

…wo forms of metadata search

fix: fixed aws access in backend

f443287

completed merge

3f6d808

hinxcode reviewed Dec 21, 2025

Hbeilinson wants to merge 6 commits into hinxcode:main from Hbeilinson:feature/metadata-and-remote-expansion
