
feat: implement core data pipeline and image segmentation#5

Merged
vejtek merged 13 commits into master from feat/data-pipeline
Mar 13, 2026

Conversation


@dhalmazna dhalmazna commented Mar 4, 2026

Context:
This PR introduces the completely self-contained data/ module.

What was changed:

  • loader.py: Image loading utilities.
  • preprocessing.py: Image tensor transformations.
  • segmentation.py: Hexagonal and square image segmentation logic.
  • replacement.py: Image masking strategies (mean color, blur, interlacing, solid color).

Related Task:
XAI-29

Summary by CodeRabbit

  • New Features

    • Single-file and batch image loading with supported-format detection
    • ImageNet-style preprocessing for model inputs
    • Image segmentation (square & hexagonal grids) with adjacency bitmasks
    • Multiple masking/replacement strategies: mean color, blur, interlacing, solid color
  • Chores

    • Re-exported core data utilities for simpler imports
    • Fixed mypy configuration and enabled explicit package bases
  • Documentation

    • Updated README to describe new data utilities and loaders
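To make the square-grid idea above concrete, here is a minimal NumPy sketch; the function name and exact segment-id layout are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def make_square_segments(height: int, width: int, cell: int) -> np.ndarray:
    """Assign every pixel a segment id on a square grid of cell-sized tiles."""
    ys = np.arange(height) // cell   # grid row index for each pixel row
    xs = np.arange(width) // cell    # grid column index for each pixel column
    n_cols = -(-width // cell)       # ceil(width / cell): tiles per row
    return ys[:, None] * n_cols + xs[None, :]

# 4x6 image with 2x2 tiles -> 2x3 grid of segments, ids 0..5
segments = make_square_segments(4, 6, 2)
```

A flat id map like this supports fast tensor masking (e.g. `np.isin(segments, ids)`), which matches the masking use case discussed later in the thread.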

Copilot AI review requested due to automatic review settings March 4, 2026 16:36

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 712c215f-ceca-4683-9e7e-564a7ba1ba25

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac5c40 and 452935c.

📒 Files selected for processing (1)
  • ciao/data/loader.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ciao/data/loader.py

📝 Walkthrough

Walkthrough

Adds a new ciao.data package with image path iteration, ImageNet-style preprocessing, replacement-image generators, and vectorized square/hex segmentation; fixes a malformed mypy option and enables explicit package bases; removes a transitive networkx dependency.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration: `.mypy.ini`, `pyproject.toml` | Fix the malformed mypy option (`-disable_error_code` run together with its value, now `disable_error_code = no-any-return`), add `explicit_package_bases = True`, and remove the transitive networkx dependency. |
| Package init: `ciao/data/__init__.py` | New package initializer re-exporting data utilities and defining `__all__`. |
| Image I/O: `ciao/data/loader.py` | Add `IMAGE_EXTENSIONS` and `iter_image_paths(config: DictConfig)` supporting single-image and recursive batch modes with validation and explicit errors. |
| Preprocessing: `ciao/data/preprocessing.py` | Add ImageNet-style preprocess transforms and `load_and_preprocess_image(image_path, device)` returning a `[3, 224, 224]` tensor on the target device. |
| Replacement utilities: `ciao/data/replacement.py` | Add ImageNet mean/std constants, `calculate_image_mean_color`, and `get_replacement_image` strategies (mean_color, interlacing, blur, solid_color) with device- and shape-preserving behavior. |
| Segmentation: `ciao/data/segmentation.py` | Add vectorized hex rounding, square/hex grid creators, adjacency builders, adjacency bitmask conversion, and `create_segmentation` producing segment maps and adjacency bitmasks. |
| Docs: `README.md` | Update the data directory description to reflect the renamed loader and newly added preprocessing/segmentation utilities. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User
  participant Loader as Image Loader
  participant Preproc as Preprocessor
  participant Segment as Segmenter
  participant Replacer as Replacer

  User->>Loader: iter_image_paths(config)
  Loader-->>User: yields Path(s)
  User->>Preproc: load_and_preprocess_image(path)
  Preproc-->>User: tensor [3,224,224] on device
  User->>Segment: create_segmentation(tensor, type, size)
  Segment-->>User: segment map + adjacency bitmasks
  User->>Replacer: get_replacement_image(tensor, strategy)
  Replacer-->>User: replacement tensor
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through files and nibble every byte,
I crop and blur and patch with quiet delight,
Hex grids and means in tidy rows,
Paths yielded swift where the wild code grows—
A rabbit's whisper, turning data bright.

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main changes: implementing a core data pipeline with image segmentation functionality, which aligns with the four new modules (loader, preprocessing, segmentation, replacement) and the configuration fixes. |
| Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained data module. This module provides robust utilities for loading images, applying standard preprocessing steps, and performing advanced image segmentation using both square and hexagonal grids. These changes are crucial for setting up the project's image analysis capabilities and simplify future development by centralizing data handling logic.

Highlights

  • New data module introduced: A new data module has been added, encapsulating image loading, preprocessing, and segmentation functionalities.
  • Image Loading Utilities: Implemented loader.py to provide flexible image loading from single paths or batch directories, supporting common image formats.
  • Image Preprocessing: Developed preprocessing.py to handle standard ImageNet-style transformations, including resizing, cropping, and normalization for image tensors.
  • Image Segmentation: Introduced segmentation.py which supports both hexagonal and square grid segmentation, along with generating adjacency lists for spatial relationships.
  • Dependency Management: The networkx dependency has been removed from pyproject.toml and uv.lock, streamlining the project's dependencies.
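The preprocessing described above (resize, center crop, tensor conversion, ImageNet normalization) presumably uses torchvision transforms; the same arithmetic can be sketched in plain NumPy to show what the pipeline does to pixel values. The crop size and statistics mirror the summary, while the resize step is omitted for brevity:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img_uint8: np.ndarray, size: int = 224) -> np.ndarray:
    """uint8 HWC image -> normalized float CHW array, like ToTensor + Normalize."""
    x = center_crop(img_uint8, size).astype(np.float64) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD                       # per-channel normalize
    return x.transpose(2, 0, 1)                                  # HWC -> CHW

out = preprocess(np.full((256, 256, 3), 255, dtype=np.uint8))
```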
Changelog
  • .mypy.ini
    • Added explicit_package_bases = True to mypy configuration.
  • ciao/data/__init__.py
    • Added new file to initialize the data module.
    • Exported get_image_loader, load_and_preprocess_image, and create_segmentation for module-level access.
  • ciao/data/loader.py
    • Added new file containing get_image_loader function.
    • Implemented logic to load images from a single path or iterate through images in a batch directory.
    • Defined supported image extensions.
  • ciao/data/preprocessing.py
    • Added new file containing load_and_preprocess_image function.
    • Defined ImageNet preprocessing transforms (resize, center crop, ToTensor, Normalize).
    • Implemented image loading with PIL and tensor conversion to specified device.
  • ciao/data/segmentation.py
    • Added new file implementing image segmentation logic.
    • Included functions for vectorized hexagonal rounding (_hex_round_vectorized).
    • Provided utilities to convert adjacency bitmasks to lists and vice-versa.
    • Implemented _build_square_adjacency_list and _build_fast_adjacency_list for square and hexagonal grids respectively.
    • Developed _create_square_grid and _create_hexagonal_grid to generate segment IDs and adjacency lists.
    • Exposed create_segmentation as the main entry point for generating segmentations.
  • pyproject.toml
    • Removed networkx from project dependencies.
  • uv.lock
    • Removed networkx from the lock file's dependencies and requires-dist sections.
Activity
  • No specific activity (comments, reviews, approvals) was found in the provided context.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.


Copilot AI left a comment


Pull request overview

This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.

Changes:

  • Added ciao/data package with loader, preprocessing, and segmentation utilities.
  • Implemented square + hexagonal segmentation and adjacency bitmask encoding.
  • Removed the unused networkx dependency from project metadata / lockfile.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ciao/data/segmentation.py New segmentation implementation (square + hex) and adjacency encoding utilities.
ciao/data/preprocessing.py New ImageNet-style preprocessing + image loading helper.
ciao/data/loader.py New Hydra-config-driven image path iterator (single image or directory).
ciao/data/__init__.py Exposes the new data utilities as package exports.
pyproject.toml Drops networkx from dependencies.
uv.lock Lockfile updated to reflect removal of networkx.
.mypy.ini Enables explicit_package_bases and fixes formatting of disable_error_code.
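Piecing together the review summaries, the corrected `.mypy.ini` presumably ends up looking something like this; the section layout and any other options present are assumptions:

```ini
[mypy]
# Previously garbled as "-disable_error_codedisable_error_code = no-any-return"
disable_error_code = no-any-return
# Newly enabled in this PR
explicit_package_bases = True
```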


@dhalmazna dhalmazna self-assigned this Mar 4, 2026
@dhalmazna dhalmazna requested a review from vejtek March 4, 2026 17:23
Base automatically changed from chore/project-setup to master March 6, 2026 14:58
@vejtek vejtek requested a review from a team March 6, 2026 14:58
@dhalmazna dhalmazna force-pushed the feat/data-pipeline branch from 36f96f0 to de00a02 on March 10, 2026 12:39

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ciao/data/loader.py`:
- Around line 26-50: The loader currently treats image_path and batch_path with
if/elif so a config that sets both will silently prefer image_path; update the
validation in loader (before the existing image_path/batch_path branches) to
detect when both config.data.get("image_path") and config.data.get("batch_path")
are provided and raise a clear ValueError indicating they are mutually
exclusive; keep existing behavior for single image (image_path -> Path + is_file
check + yield) and batch mode (batch_path -> Path + is_dir check + rglob +
suffix in IMAGE_EXTENSIONS) unchanged.

In `@ciao/data/preprocessing.py`:
- Line 38: The cast call on the preprocess result uses an unquoted type
expression which trips Ruff TC006; update the usage in preprocessing.py by
changing cast(torch.Tensor, preprocess(image)) to cast("torch.Tensor",
preprocess(image)) (i.e., quote the type argument) so the type string literal is
used for the tensor assignment produced by preprocess.

In `@ciao/data/replacement.py`:
- Around line 117-119: The plotted tensor is still on CUDA and normalized,
causing TypeError and incorrect colors; update the code around
calculate_image_mean_color and normalized_mean to move the tensor to CPU,
unnormalize it back to 0-1 RGB using the same preprocessing mean/std (or the
inverse transform), then convert to a NumPy HWC array before plotting; e.g.,
call .cpu() on the tensor, apply the inverse normalization (using the
preprocessing mean/std used by load_and_preprocess_image), then permute to
(H,W,C) and .numpy() before passing to plt.imshow so plt receives a CPU numpy
array with actual RGB values.
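The fix described above amounts to undoing the normalization before plotting. Here is a NumPy sketch of that inverse step; with a torch tensor you would first call `.cpu().numpy()`, and the constants are the standard ImageNet statistics the PR reportedly uses:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def to_displayable(chw: np.ndarray) -> np.ndarray:
    """Invert (x - mean) / std and return an HWC array in [0, 1] for plt.imshow."""
    rgb = chw * IMAGENET_STD + IMAGENET_MEAN   # undo the normalization
    return np.clip(rgb, 0.0, 1.0).transpose(1, 2, 0)

# Round-trip: a mid-gray image survives normalize -> unnormalize unchanged
normalized = (np.full((3, 4, 4), 0.5) - IMAGENET_MEAN) / IMAGENET_STD
disp = to_displayable(normalized)
```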

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47596251-88b7-44c5-a69e-2aaed07dca59

📥 Commits

Reviewing files that changed from the base of the PR and between e6bee2f and de00a02.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • .mypy.ini
  • ciao/data/__init__.py
  • ciao/data/loader.py
  • ciao/data/preprocessing.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py
  • pyproject.toml
💤 Files with no reviewable changes (1)
  • pyproject.toml


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
ciao/data/preprocessing.py (1)

15-15: Consider centralizing ImageNet normalization constants.

These same constants are duplicated in ciao/data/replacement.py (lines 1-3 as IMAGENET_MEAN and IMAGENET_STD). If one is updated without the other, the unnormalization/renormalization logic in calculate_image_mean_color will produce incorrect results.

Consider defining these constants in a shared location (e.g., a constants.py module) and importing from there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ciao/data/preprocessing.py` at line 15, The ImageNet mean/std literals used
in transforms.Normalize (in ciao/data/preprocessing.py) are duplicated as
IMAGENET_MEAN and IMAGENET_STD in ciao/data/replacement.py and must be
centralized: create a shared constant (e.g., IMAGENET_MEAN and IMAGENET_STD) in
a common module (constants.py), import those constants into preprocessing.py and
replacement.py, and replace the literal list in transforms.Normalize and any
uses in calculate_image_mean_color with the imported constants so both modules
use the single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ciao/data/preprocessing.py`:
- Line 15: The ImageNet mean/std literals used in transforms.Normalize (in
ciao/data/preprocessing.py) are duplicated as IMAGENET_MEAN and IMAGENET_STD in
ciao/data/replacement.py and must be centralized: create a shared constant
(e.g., IMAGENET_MEAN and IMAGENET_STD) in a common module (constants.py), import
those constants into preprocessing.py and replacement.py, and replace the
literal list in transforms.Normalize and any uses in calculate_image_mean_color
with the imported constants so both modules use the single source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c404dff-784b-42d1-835a-e2216cd30576

📥 Commits

Reviewing files that changed from the base of the PR and between de00a02 and 60deed3.

📒 Files selected for processing (1)
  • ciao/data/preprocessing.py


Adames4 commented Mar 10, 2026

The representation of segments as an array as large as the image is very inefficient. I am not sure whether I should recommend fixing it, because it could make the code inefficient in other parts that I do not see now.

Another possible representation is a table with id, x_coord, y_coord, and neighbors. This could also be beneficial when computing neighbors, because it could be reduced to simple rotations using the IDs.

@dhalmazna
Collaborator Author

> The representation of segments as an array as large as the image is very inefficient. I am not sure whether I should recommend fixing it, because it could make the code inefficient in other parts that I do not see now.
>
> Another possible representation is a table with id, x_coord, y_coord, and neighbors. This could also be beneficial when computing neighbors, because it could be reduced to simple rotations using the IDs.

@Adames4

For some reason, I can't reply to this comment directly.

The reason for this representation is to allow fast tensor masking (mask = torch.isin(segments, ids)) when calculating scores. This structure isn't used for anything else. For finding neighbors, I use adj_masks, which is pretty efficient since I represent segments (and sets of segments) as bitsets (e.g. "adj_masks[5] will be an integer with bits set at positions of hex 5's neighbors"). I couldn't find a way to do both efficiently using just a single data structure, which is why it's split up like this.
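The bitset idea described here can be sketched in a few lines of plain Python; the helper names below are illustrative, not the PR's actual API:

```python
def adjacency_to_bitmasks(neighbors: dict[int, list[int]]) -> dict[int, int]:
    """Encode each segment's neighbor set as an integer with one bit per segment id."""
    masks: dict[int, int] = {}
    for seg, adj in neighbors.items():
        mask = 0
        for n in adj:
            mask |= 1 << n           # set the bit at the neighbor's id
        masks[seg] = mask
    return masks

def are_adjacent(masks: dict[int, int], a: int, b: int) -> bool:
    return bool(masks[a] >> b & 1)   # test b's bit in a's mask

adj_masks = adjacency_to_bitmasks({0: [1, 3], 1: [0, 2], 2: [1], 3: [0]})
```

Unions and intersections of neighbor sets then reduce to `|` and `&` on integers, which is why this pairs well with the flat segment map used for `torch.isin` masking.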


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ciao/data/loader.py`:
- Around line 36-57: iter_image_paths currently yields non-image files in
single-image mode and can yield directories in batch mode; update it to validate
extensions against IMAGE_EXTENSIONS in both modes and only yield actual files:
in the single-image branch (function iter_image_paths) check
Path(image_path_value).suffix.lower() is in IMAGE_EXTENSIONS and raise a clear
ValueError (or similar validation error) if not, and in the batch branch change
the generator loop to filter by path.is_file() and path.suffix.lower() in
IMAGE_EXTENSIONS before yielding to avoid returning directories or non-image
files (this will ensure callers like load_and_preprocess_image receive only
valid image file paths).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 807955a8-6b1f-45ab-ba9f-f24ba56c6199

📥 Commits

Reviewing files that changed from the base of the PR and between 60a25b3 and 6ac5c40.

📒 Files selected for processing (5)
  • README.md
  • ciao/data/__init__.py
  • ciao/data/loader.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • ciao/data/__init__.py
  • ciao/data/replacement.py
  • ciao/data/segmentation.py

@dhalmazna dhalmazna requested a review from Adames4 March 11, 2026 09:08
@vejtek vejtek merged commit b85a5a3 into master Mar 13, 2026
3 checks passed
@vejtek vejtek deleted the feat/data-pipeline branch March 13, 2026 10:28

4 participants