feat: implement core data pipeline and image segmentation #5
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable, and review-management commands are available.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds a new `ciao.data` package with image path iteration, ImageNet-style preprocessing, replacement-image generators, and vectorized square/hex segmentation; fixes a malformed mypy option and enables explicit package bases; removes a transitive networkx dependency.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Loader as Image Loader
    participant Preproc as Preprocessor
    participant Segment as Segmenter
    participant Replacer as Replacer
    User->>Loader: iter_image_paths(config)
    Loader-->>User: yields Path(s)
    User->>Preproc: load_and_preprocess_image(path)
    Preproc-->>User: tensor [3,224,224] on device
    User->>Segment: create_segmentation(tensor, type, size)
    Segment-->>User: segment map + adjacency bitmasks
    User->>Replacer: get_replacement_image(tensor, strategy)
    Replacer-->>User: replacement tensor
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Poem
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained data module.
Highlights
Changelog
Activity
Code Review
This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.
Pull request overview
This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.
Changes:
- Added `ciao/data` package with loader, preprocessing, and segmentation utilities.
- Implemented square + hexagonal segmentation and adjacency bitmask encoding.
- Removed the unused `networkx` dependency from project metadata / lockfile.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `ciao/data/segmentation.py` | New segmentation implementation (square + hex) and adjacency encoding utilities. |
| `ciao/data/preprocessing.py` | New ImageNet-style preprocessing + image loading helper. |
| `ciao/data/loader.py` | New Hydra-config-driven image path iterator (single image or directory). |
| `ciao/data/__init__.py` | Exposes the new data utilities as package exports. |
| `pyproject.toml` | Drops networkx from dependencies. |
| `uv.lock` | Lockfile updated to reflect removal of networkx. |
| `.mypy.ini` | Enables `explicit_package_bases` and fixes formatting of `disable_error_code`. |
Force-pushed from 36f96f0 to de00a02.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ciao/data/loader.py`:
- Around line 26-50: The loader currently treats image_path and batch_path with
if/elif so a config that sets both will silently prefer image_path; update the
validation in loader (before the existing image_path/batch_path branches) to
detect when both config.data.get("image_path") and config.data.get("batch_path")
are provided and raise a clear ValueError indicating they are mutually
exclusive; keep existing behavior for single image (image_path -> Path + is_file
check + yield) and batch mode (batch_path -> Path + is_dir check + rglob +
suffix in IMAGE_EXTENSIONS) unchanged.
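The suggested guard could look like the following minimal sketch. The function and config-key names are taken from the review comment; the body of `IMAGE_EXTENSIONS` and the exact error types are assumptions, not the repository's actual code.

```python
from collections.abc import Iterator
from pathlib import Path

# Assumed extension set; the real one lives in ciao/data/loader.py.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def iter_image_paths(config) -> Iterator[Path]:
    """Yield image paths from a Hydra-style config (sketch)."""
    image_path = config.data.get("image_path")
    batch_path = config.data.get("batch_path")

    # New validation: the two modes are mutually exclusive.
    if image_path and batch_path:
        raise ValueError(
            "config.data.image_path and config.data.batch_path are "
            "mutually exclusive; set exactly one of them."
        )

    if image_path:
        path = Path(image_path)
        if not path.is_file():
            raise FileNotFoundError(path)
        yield path
    elif batch_path:
        directory = Path(batch_path)
        if not directory.is_dir():
            raise NotADirectoryError(directory)
        for path in sorted(directory.rglob("*")):
            if path.suffix.lower() in IMAGE_EXTENSIONS:
                yield path
```

Note that because this is a generator, the `ValueError` is only raised once iteration starts, which is worth keeping in mind when writing tests for it.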
In `@ciao/data/preprocessing.py`:
- Line 38: The cast call on the preprocess result uses an unquoted type
expression which trips Ruff TC006; update the usage in preprocessing.py by
changing cast(torch.Tensor, preprocess(image)) to cast("torch.Tensor",
preprocess(image)) (i.e., quote the type argument) so the type string literal is
used for the tensor assignment produced by preprocess.
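The fix is mechanical; a minimal illustration (variable names hypothetical). `cast()` is a runtime no-op either way, so quoting the type argument changes only how linters and type checkers see it:

```python
from typing import cast

value: object = 3
# Before (trips Ruff TC006): n = cast(int, value)
# After: quote the type argument; runtime behavior is identical.
n = cast("int", value)
```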
In `@ciao/data/replacement.py`:
- Around line 117-119: The plotted tensor is still on CUDA and normalized,
causing TypeError and incorrect colors; update the code around
calculate_image_mean_color and normalized_mean to move the tensor to CPU,
unnormalize it back to 0-1 RGB using the same preprocessing mean/std (or the
inverse transform), then convert to a NumPy HWC array before plotting; e.g.,
call .cpu() on the tensor, apply the inverse normalization (using the
preprocessing mean/std used by load_and_preprocess_image), then permute to
(H,W,C) and .numpy() before passing to plt.imshow so plt receives a CPU numpy
array with actual RGB values.
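A sketch of such a helper, assuming the standard torchvision ImageNet statistics were used during preprocessing (the function name is hypothetical; only the CPU-move/unnormalize/permute pattern is the point):

```python
import torch

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def to_displayable(tensor: torch.Tensor):
    """Move a normalized [3, H, W] tensor (possibly on CUDA) to CPU,
    undo ImageNet normalization, and return an HWC array in [0, 1]
    suitable for plt.imshow. Mean/std must match preprocessing."""
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    img = tensor.detach().cpu() * std + mean  # inverse of Normalize
    return img.clamp(0.0, 1.0).permute(1, 2, 0).numpy()
```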
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47596251-88b7-44c5-a69e-2aaed07dca59
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- .mypy.ini
- ciao/data/__init__.py
- ciao/data/loader.py
- ciao/data/preprocessing.py
- ciao/data/replacement.py
- ciao/data/segmentation.py
- pyproject.toml
💤 Files with no reviewable changes (1)
- pyproject.toml
🧹 Nitpick comments (1)
ciao/data/preprocessing.py (1)
15-15: Consider centralizing ImageNet normalization constants. These same constants are duplicated in `ciao/data/replacement.py` (lines 1-3 as `IMAGENET_MEAN` and `IMAGENET_STD`). If one is updated without the other, the unnormalization/renormalization logic in `calculate_image_mean_color` will produce incorrect results. Consider defining these constants in a shared location (e.g., a `constants.py` module) and importing from there.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ciao/data/preprocessing.py` at line 15: the ImageNet mean/std literals used in `transforms.Normalize` are duplicated as `IMAGENET_MEAN` and `IMAGENET_STD` in `ciao/data/replacement.py` and should be centralized: create shared constants in a common module (`constants.py`), import them into `preprocessing.py` and `replacement.py`, and replace the literal list in `transforms.Normalize` and any uses in `calculate_image_mean_color` with the imported constants so both modules use a single source of truth.
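The proposed shared module is tiny; a sketch of what it might contain, using the standard ImageNet statistics that torchvision's pretrained models expect (the module path `ciao/data/constants.py` is the reviewer's suggestion, not existing code):

```python
# ciao/data/constants.py (hypothetical new module)
# Standard ImageNet normalization statistics, as used by torchvision
# pretrained models. preprocessing.py and replacement.py would both
# import from here instead of keeping private copies.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
```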
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c404dff-784b-42d1-835a-e2216cd30576
📒 Files selected for processing (1)
ciao/data/preprocessing.py
The representation of segments as an array as large as the image is very inefficient. I am not sure whether I should recommend fixing it, because it could make the code inefficient in other parts that I do not see now. Another possible representation is a table with id, x_coord, y_coord, and neighbors. This could also be beneficial when computing neighbors, because it could be reduced to simple rotations using the IDs.
For some reason, I can't reply to this comment directly. The reason for this representation is to allow fast tensor masking (…).
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ciao/data/loader.py`:
- Around line 36-57: iter_image_paths currently yields non-image files in
single-image mode and can yield directories in batch mode; update it to validate
extensions against IMAGE_EXTENSIONS in both modes and only yield actual files:
in the single-image branch (function iter_image_paths) check
Path(image_path_value).suffix.lower() is in IMAGE_EXTENSIONS and raise a clear
ValueError (or similar validation error) if not, and in the batch branch change
the generator loop to filter by path.is_file() and path.suffix.lower() in
IMAGE_EXTENSIONS before yielding to avoid returning directories or non-image
files (this will ensure callers like load_and_preprocess_image receive only
valid image file paths).
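The shared check for both branches can be factored into one predicate; a minimal sketch (helper name hypothetical, extension set assumed):

```python
from pathlib import Path

# Assumed extension set; the real one lives in ciao/data/loader.py.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def is_image_file(path: Path) -> bool:
    """True only for existing regular files with a recognized image
    extension (case-insensitive). Used by both loader modes so that
    directories and non-image files are never yielded."""
    return path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
```

In single-image mode the loader would raise a `ValueError` when `is_image_file` is false; in batch mode it would filter the `rglob` results through it before yielding.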
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 807955a8-6b1f-45ab-ba9f-f24ba56c6199
📒 Files selected for processing (5)
- README.md
- ciao/data/__init__.py
- ciao/data/loader.py
- ciao/data/replacement.py
- ciao/data/segmentation.py
✅ Files skipped from review due to trivial changes (1)
- README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- ciao/data/__init__.py
- ciao/data/replacement.py
- ciao/data/segmentation.py
Context:
This PR introduces the completely self-contained `data/` module.
What was changed:
- `loader.py`: Image loading utilities.
- `preprocessing.py`: Image tensor transformations.
- `segmentation.py`: Hexagonal and square image segmentation logic.
- `replacement.py`: Image masking strategies (mean color, blur, interlacing, solid color).
Related Task:
XAI-29
Summary by CodeRabbit
New Features
Chores
Documentation