Feature/metadata and remote expansion #41
…wo forms of metadata search
hinxcode
left a comment
Thanks for this great work! The S3 support and date filtering are solid additions. I had a few thoughts on the implementation:
S3 file handling
I took a look at process_remote_files and noticed it’s still downloading files to local disk and then cleaning them up afterward. One alternative I’d prefer is to keep generate_embeddings.py focused on embedding generation, and provide a new helper script (can be written in any language) that downloads the collection from S3 into raw_data_dir up front. That way:
(1) We avoid adding more parameters/configuration for bucket names, S3 credentials, etc.
(2) We can rely on proven tooling like s5cmd for fast, reliable syncs, with optional cleanup handled by the helper script.
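A helper along these lines could be quite small. The following is only a sketch of the idea, assuming s5cmd is installed and picks up AWS credentials from the usual environment variables or credentials file; the function names and the example bucket URI are hypothetical and not part of this PR.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_sync_command(s3_uri: str, raw_data_dir: str) -> list[str]:
    """Build an s5cmd invocation that mirrors an S3 prefix into raw_data_dir."""
    # The wildcard asks s5cmd to sync everything under the prefix.
    source = s3_uri.rstrip("/") + "/*"
    # The trailing slash marks the destination as a directory.
    destination = str(Path(raw_data_dir)) + "/"
    return ["s5cmd", "sync", source, destination]


def sync_collection(s3_uri: str, raw_data_dir: str) -> None:
    """Download the collection up front so embedding generation only reads local files."""
    if shutil.which("s5cmd") is None:
        raise RuntimeError("s5cmd was not found on PATH")
    Path(raw_data_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_sync_command(s3_uri, raw_data_dir), check=True)
```

Calling sync_collection("s3://example-bucket/photos", "raw_data") before running generate_embeddings.py would leave the embedding script unchanged, since it only ever sees raw_data_dir.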
Date filtering UX
Instead of making date search its own tab, could we make date a filter that works alongside text and image search? In practice, I can imagine researchers and practitioners wanting to limit results to a certain time range in both search modes. Treating date as a filter may also keep the codebase flexible, leaving room for future contributors to extend the same pattern with additional filters.
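To make the suggestion concrete, here is a rough sketch of date as one composable filter applied before the embedding ranking. Everything in it (the Photo record, date_range, search, and the plain-Python cosine similarity) is hypothetical and only illustrates the pattern; it does not reflect how the app currently stores or ranks embeddings.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from typing import Callable, Sequence


@dataclass
class Photo:
    name: str
    taken_on: date
    embedding: Sequence[float]


# A filter is any predicate over a photo, so new filters follow the same shape.
Filter = Callable[[Photo], bool]


def date_range(start: date, end: date) -> Filter:
    """Keep photos taken between start and end, inclusive."""
    return lambda photo: start <= photo.taken_on <= end


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_embedding, photos, filters=(), top_k=5):
    """Apply every filter first, then rank the survivors by similarity."""
    candidates = [p for p in photos if all(f(p) for f in filters)]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(query_embedding, p.embedding),
        reverse=True,
    )
    return ranked[:top_k]
```

Because the query embedding can come from either a text or an image encoder, the same search call would serve both modes, and a file name filter could be added as another predicate without touching the ranking code.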
Description
I made three changes, all specifically to the photograph part of the app. These were:
Motivation and Context
The first of these changes allows the app to scale to larger datasets of photographs. For use cases where there are over a million photos, it will be helpful to be able to run the app without having to store all of the photos locally.
The next two are to enable more specific photograph searching. This is particularly useful for contexts where a user might know about a specific photo they're looking for, but not know where to find it. By filtering based on date or file name they can get closer to finding the photo they want, and then layer the embedding search on top of that.
Type of Change
Component(s) Affected
Changes Made
Testing
How Has This Been Tested?
I ran manual tests on each aspect that I described above.
Screenshots (if applicable)
Checklist
Code Quality
black . and isort . on Python code
npm run lint on frontend code (if applicable)
Testing
I didn't see unit tests.
Documentation
config.json documentation if config changes were made
Dependencies
requirements.txt (if Python dependencies changed)
package.json (if Node dependencies changed)
Research (if applicable)
Breaking Changes
None / (describe breaking changes)
Additional Notes
Reviewers Checklist (for maintainers)