
feat: implement core data pipeline and image segmentation#5

Merged
vejtek merged 13 commits into master from feat/data-pipeline
Mar 13, 2026

Conversation


@dhalmazna dhalmazna commented Mar 4, 2026

Context:
This PR introduces the completely self-contained data/ module.

What was changed:

  • loader.py: Image loading utilities.
  • preprocessing.py: Image tensor transformations.
  • segmentation.py: Hexagonal and square image segmentation logic.
  • replacement.py: Image masking strategies (mean color, blur, interlacing, solid color).

Related Task:
XAI-29

Summary by CodeRabbit

  • New Features

    • Single-file and batch image loading with supported-format detection
    • ImageNet-style preprocessing for model inputs
    • Image segmentation (square & hexagonal grids) with adjacency bitmasks
    • Multiple masking/replacement strategies: mean color, blur, interlacing, solid color
  • Chores

    • Re-exported core data utilities for simpler imports
    • Fixed mypy configuration and enabled explicit package bases
  • Documentation

    • Updated README to describe new data utilities and loaders
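To make the square-grid idea above concrete, here is a minimal NumPy sketch; the function name and exact segment-id layout are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def make_square_segments(height: int, width: int, cell: int) -> np.ndarray:
    """Assign every pixel a segment id on a square grid of cell-sized tiles."""
    ys = np.arange(height) // cell   # grid row index for each pixel row
    xs = np.arange(width) // cell    # grid column index for each pixel column
    n_cols = -(-width // cell)       # ceil(width / cell): tiles per row
    return ys[:, None] * n_cols + xs[None, :]

# 4x6 image with 2x2 tiles -> 2x3 grid of segments, ids 0..5
segments = make_square_segments(4, 6, 2)
```

A flat id map like this supports fast tensor masking (e.g. `np.isin(segments, ids)`), which matches the masking use case discussed later in the thread.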

Copilot AI review requested due to automatic review settings March 4, 2026 16:36

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 712c215f-ceca-4683-9e7e-564a7ba1ba25

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac5c40 and 452935c.

📒 Files selected for processing (1)
  • ciao/data/loader.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ciao/data/loader.py

📝 Walkthrough

Walkthrough

Adds a new ciao.data package with image path iteration, ImageNet-style preprocessing, replacement-image generators, and vectorized square/hex segmentation; fixes a malformed mypy option and enables explicit package bases; removes a transitive networkx dependency.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration: `.mypy.ini`, `pyproject.toml` | Fix the malformed mypy option (`-disable_error_code` run together with its value, now `disable_error_code = no-any-return`), add `explicit_package_bases = True`, and remove the transitive networkx dependency. |
| Package init: `ciao/data/__init__.py` | New package initializer re-exporting data utilities and defining `__all__`. |
| Image I/O: `ciao/data/loader.py` | Add `IMAGE_EXTENSIONS` and `iter_image_paths(config: DictConfig)` supporting single-image and recursive batch modes with validation and explicit errors. |
| Preprocessing: `ciao/data/preprocessing.py` | Add ImageNet-style preprocess transforms and `load_and_preprocess_image(image_path, device)` returning a `[3, 224, 224]` tensor on the target device. |
| Replacement utilities: `ciao/data/replacement.py` | Add ImageNet mean/std constants, `calculate_image_mean_color`, and `get_replacement_image` strategies (mean_color, interlacing, blur, solid_color) with device- and shape-preserving behavior. |
| Segmentation: `ciao/data/segmentation.py` | Add vectorized hex rounding, square/hex grid creators, adjacency builders, adjacency bitmask conversion, and `create_segmentation` producing segment maps and adjacency bitmasks. |
| Docs: `README.md` | Update the data directory description to reflect the renamed loader and newly added preprocessing/segmentation utilities. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User
  participant Loader as Image Loader
  participant Preproc as Preprocessor
  participant Segment as Segmenter
  participant Replacer as Replacer

  User->>Loader: iter_image_paths(config)
  Loader-->>User: yields Path(s)
  User->>Preproc: load_and_preprocess_image(path)
  Preproc-->>User: tensor [3,224,224] on device
  User->>Segment: create_segmentation(tensor, type, size)
  Segment-->>User: segment map + adjacency bitmasks
  User->>Replacer: get_replacement_image(tensor, strategy)
  Replacer-->>User: replacement tensor
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through files and nibble every byte,
I crop and blur and patch with quiet delight,
Hex grids and means in tidy rows,
Paths yielded swift where the wild code grows—
A rabbit's whisper, turning data bright.

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main changes: implementing a core data pipeline with image segmentation functionality, which aligns with the four new modules (loader, preprocessing, segmentation, replacement) and the configuration fixes. |
| Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained data module. This module provides robust utilities for loading images, applying standard preprocessing steps, and performing advanced image segmentation using both square and hexagonal grids. These changes are crucial for setting up the project's image analysis capabilities and simplify future development by centralizing data handling logic.

Highlights

  • New data module introduced: A new data module has been added, encapsulating image loading, preprocessing, and segmentation functionalities.
  • Image Loading Utilities: Implemented loader.py to provide flexible image loading from single paths or batch directories, supporting common image formats.
  • Image Preprocessing: Developed preprocessing.py to handle standard ImageNet-style transformations, including resizing, cropping, and normalization for image tensors.
  • Image Segmentation: Introduced segmentation.py which supports both hexagonal and square grid segmentation, along with generating adjacency lists for spatial relationships.
  • Dependency Management: The networkx dependency has been removed from pyproject.toml and uv.lock, streamlining the project's dependencies.
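The preprocessing described above (resize, center crop, tensor conversion, ImageNet normalization) presumably uses torchvision transforms; the same arithmetic can be sketched in plain NumPy to show what the pipeline does to pixel values. The crop size and statistics mirror the summary, while the resize step is omitted for brevity:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img_uint8: np.ndarray, size: int = 224) -> np.ndarray:
    """uint8 HWC image -> normalized float CHW array, like ToTensor + Normalize."""
    x = center_crop(img_uint8, size).astype(np.float64) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD                       # per-channel normalize
    return x.transpose(2, 0, 1)                                  # HWC -> CHW

out = preprocess(np.full((256, 256, 3), 255, dtype=np.uint8))
```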
Changelog
  • .mypy.ini
    • Added explicit_package_bases = True to mypy configuration.
  • ciao/data/__init__.py
    • Added new file to initialize the data module.
    • Exported get_image_loader, load_and_preprocess_image, and create_segmentation for module-level access.
  • ciao/data/loader.py
    • Added new file containing get_image_loader function.
    • Implemented logic to load images from a single path or iterate through images in a batch directory.
    • Defined supported image extensions.
  • ciao/data/preprocessing.py
    • Added new file containing load_and_preprocess_image function.
    • Defined ImageNet preprocessing transforms (resize, center crop, ToTensor, Normalize).
    • Implemented image loading with PIL and tensor conversion to specified device.
  • ciao/data/segmentation.py
    • Added new file implementing image segmentation logic.
    • Included functions for vectorized hexagonal rounding (_hex_round_vectorized).
    • Provided utilities to convert adjacency bitmasks to lists and vice-versa.
    • Implemented _build_square_adjacency_list and _build_fast_adjacency_list for square and hexagonal grids respectively.
    • Developed _create_square_grid and _create_hexagonal_grid to generate segment IDs and adjacency lists.
    • Exposed create_segmentation as the main entry point for generating segmentations.
  • pyproject.toml
    • Removed networkx from project dependencies.
  • uv.lock
    • Removed networkx from the lock file's dependencies and requires-dist sections.
Activity
  • No specific activity (comments, reviews, approvals) was found in the provided context.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.


Copilot AI left a comment


Pull request overview

This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.

Changes:

  • Added ciao/data package with loader, preprocessing, and segmentation utilities.
  • Implemented square + hexagonal segmentation and adjacency bitmask encoding.
  • Removed the unused networkx dependency from project metadata / lockfile.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ciao/data/segmentation.py New segmentation implementation (square + hex) and adjacency encoding utilities.
ciao/data/preprocessing.py New ImageNet-style preprocessing + image loading helper.
ciao/data/loader.py New Hydra-config-driven image path iterator (single image or directory).
ciao/data/__init__.py Exposes the new data utilities as package exports.
pyproject.toml Drops networkx from dependencies.
uv.lock Lockfile updated to reflect removal of networkx.
.mypy.ini Enables explicit_package_bases and fixes formatting of disable_error_code.
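Piecing together the review summaries, the corrected `.mypy.ini` presumably ends up looking something like this; the section layout and any other options present are assumptions:

```ini
[mypy]
# Previously garbled as "-disable_error_codedisable_error_code = no-any-return"
disable_error_code = no-any-return
# Newly enabled in this PR
explicit_package_bases = True
```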


@dhalmazna dhalmazna self-assigned this Mar 4, 2026
@dhalmazna dhalmazna requested a review from vejtek March 4, 2026 17:23
Base automatically changed from chore/project-setup to master March 6, 2026 14:58
@vejtek vejtek requested a review from a team March 6, 2026 14:58
@dhalmazna dhalmazna force-pushed the feat/data-pipeline branch from 36f96f0 to de00a02 on March 10, 2026 12:39

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ciao/data/loader.py`:
- Around line 26-50: The loader currently treats image_path and batch_path with
if/elif so a config that sets both will silently prefer image_path; update the
validation in loader (before the existing image_path/batch_path branches) to
detect when both config.data.get("image_path") and config.data.get("batch_path")
are provided and raise a clear ValueError indicating they are mutually
exclusive; keep existing behavior for single image (image_path -> Path + is_file
check + yield) and batch mode (batch_path -> Path + is_dir check + rglob +
suffix in IMAGE_EXTENSIONS) unchanged.

In `@ciao/data/preprocessing.py`:
- Line 38: The cast call on the preprocess result uses an unquoted type
expression which trips Ruff TC006; update the usage in preprocessing.py by
changing cast(torch.Tensor, preprocess(image)) to cast("torch.Tensor",
preprocess(image)) (i.e., quote the type argument) so the type string literal is
used for the tensor assignment produced by preprocess.

In `@ciao/data/replacement.py`:
- Around line 117-119: The plotted tensor is still on CUDA and normalized,
causing TypeError and incorrect colors; update the code around
calculate_image_mean_color and normalized_mean to move the tensor to CPU,
unnormalize it back to 0-1 RGB using the same preprocessing mean/std (or the
inverse transform), then convert to a NumPy HWC array before plotting; e.g.,
call .cpu() on the tensor, apply the inverse normalization (using the
preprocessing mean/std used by load_and_preprocess_image), then permute to
(H,W,C) and .numpy() before passing to plt.imshow so plt receives a CPU numpy
array with actual RGB values.
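The fix described above amounts to undoing the normalization before plotting. Here is a NumPy sketch of that inverse step; with a torch tensor you would first call `.cpu().numpy()`, and the constants are the standard ImageNet statistics the PR reportedly uses:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def to_displayable(chw: np.ndarray) -> np.ndarray:
    """Invert (x - mean) / std and return an HWC array in [0, 1] for plt.imshow."""
    rgb = chw * IMAGENET_STD + IMAGENET_MEAN   # undo the normalization
    return np.clip(rgb, 0.0, 1.0).transpose(1, 2, 0)

# Round-trip: a mid-gray image survives normalize -> unnormalize unchanged
normalized = (np.full((3, 4, 4), 0.5) - IMAGENET_MEAN) / IMAGENET_STD
disp = to_displayable(normalized)
```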

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47596251-88b7-44c5-a69e-2aaed07dca59

📥 Commits

Reviewing files that changed from the base of the PR and between e6bee2f and de00a02.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • .mypy.ini
  • ciao/data/__init__.py
  • ciao/data/loader.py
  • ciao/data/preprocessing.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py
  • pyproject.toml
💤 Files with no reviewable changes (1)
  • pyproject.toml


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
ciao/data/preprocessing.py (1)

15-15: Consider centralizing ImageNet normalization constants.

These same constants are duplicated in ciao/data/replacement.py (lines 1-3 as IMAGENET_MEAN and IMAGENET_STD). If one is updated without the other, the unnormalization/renormalization logic in calculate_image_mean_color will produce incorrect results.

Consider defining these constants in a shared location (e.g., a constants.py module) and importing from there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ciao/data/preprocessing.py` at line 15, The ImageNet mean/std literals used
in transforms.Normalize (in ciao/data/preprocessing.py) are duplicated as
IMAGENET_MEAN and IMAGENET_STD in ciao/data/replacement.py and must be
centralized: create a shared constant (e.g., IMAGENET_MEAN and IMAGENET_STD) in
a common module (constants.py), import those constants into preprocessing.py and
replacement.py, and replace the literal list in transforms.Normalize and any
uses in calculate_image_mean_color with the imported constants so both modules
use the single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ciao/data/preprocessing.py`:
- Line 15: The ImageNet mean/std literals used in transforms.Normalize (in
ciao/data/preprocessing.py) are duplicated as IMAGENET_MEAN and IMAGENET_STD in
ciao/data/replacement.py and must be centralized: create a shared constant
(e.g., IMAGENET_MEAN and IMAGENET_STD) in a common module (constants.py), import
those constants into preprocessing.py and replacement.py, and replace the
literal list in transforms.Normalize and any uses in calculate_image_mean_color
with the imported constants so both modules use the single source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c404dff-784b-42d1-835a-e2216cd30576

📥 Commits

Reviewing files that changed from the base of the PR and between de00a02 and 60deed3.

📒 Files selected for processing (1)
  • ciao/data/preprocessing.py


Adames4 commented Mar 10, 2026

The representation of segments as an array as large as the image is very inefficient. I am not sure whether I should recommend fixing it, because it could make the code inefficient in other parts that I do not see now.

Another possible representation is a table with id, x_coord, y_coord, and neighbors. This could also be beneficial when computing neighbors, because it could be reduced to simple rotations using the IDs.

@dhalmazna
Collaborator Author

> The representation of segments as an array as large as the image is very inefficient. I am not sure whether I should recommend fixing it, because it could make the code inefficient in other parts that I do not see now.
>
> Another possible representation is a table with id, x_coord, y_coord, and neighbors. This could also be beneficial when computing neighbors, because it could be reduced to simple rotations using the IDs.

@Adames4

For some reason, I can't reply to this comment directly.

The reason for this representation is to allow fast tensor masking (mask = torch.isin(segments, ids)) when calculating scores. This structure isn't used for anything else. For finding neighbors, I use adj_masks, which is pretty efficient since I represent segments (and sets of segments) as bitsets (e.g. "adj_masks[5] will be an integer with bits set at positions of hex 5's neighbors"). I couldn't find a way to do both efficiently using just a single data structure, which is why it's split up like this.
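The bitset idea described here can be sketched in a few lines of plain Python; the helper names below are illustrative, not the PR's actual API:

```python
def adjacency_to_bitmasks(neighbors: dict[int, list[int]]) -> dict[int, int]:
    """Encode each segment's neighbor set as an integer with one bit per segment id."""
    masks: dict[int, int] = {}
    for seg, adj in neighbors.items():
        mask = 0
        for n in adj:
            mask |= 1 << n           # set the bit at the neighbor's id
        masks[seg] = mask
    return masks

def are_adjacent(masks: dict[int, int], a: int, b: int) -> bool:
    return bool(masks[a] >> b & 1)   # test b's bit in a's mask

adj_masks = adjacency_to_bitmasks({0: [1, 3], 1: [0, 2], 2: [1], 3: [0]})
```

Unions and intersections of neighbor sets then reduce to `|` and `&` on integers, which is why this pairs well with the flat segment map used for `torch.isin` masking.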


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ciao/data/loader.py`:
- Around line 36-57: iter_image_paths currently yields non-image files in
single-image mode and can yield directories in batch mode; update it to validate
extensions against IMAGE_EXTENSIONS in both modes and only yield actual files:
in the single-image branch (function iter_image_paths) check
Path(image_path_value).suffix.lower() is in IMAGE_EXTENSIONS and raise a clear
ValueError (or similar validation error) if not, and in the batch branch change
the generator loop to filter by path.is_file() and path.suffix.lower() in
IMAGE_EXTENSIONS before yielding to avoid returning directories or non-image
files (this will ensure callers like load_and_preprocess_image receive only
valid image file paths).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 807955a8-6b1f-45ab-ba9f-f24ba56c6199

📥 Commits

Reviewing files that changed from the base of the PR and between 60a25b3 and 6ac5c40.

📒 Files selected for processing (5)
  • README.md
  • ciao/data/__init__.py
  • ciao/data/loader.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • ciao/data/__init__.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py

@dhalmazna dhalmazna requested a review from Adames4 March 11, 2026 09:08
@vejtek vejtek merged commit b85a5a3 into master Mar 13, 2026
3 checks passed
@vejtek vejtek deleted the feat/data-pipeline branch March 13, 2026 10:28

4 participants