Conversation
📝 Walkthrough

This PR establishes the foundational infrastructure for a stain normalization project, adding comprehensive project configuration, environment setup, and two new preprocessing scripts to automate parallel whole-slide image processing, tissue masking, and tiling workflows.

Changes
Sequence Diagrams

sequenceDiagram
participant Main as Main Workflow
participant Discover as Slide Discovery
participant Split as Train/Val/Test Split
participant Handler as Ray Remote Handler
participant Source as OpenSlide Tile Source
participant Mask as Tissue Mask Filter
participant CSV as CSV Export
participant MLflow as MLflow Logger
Main->>Discover: Scan SLIDES_PATH for .mrxs files
Discover-->>Main: Return slide paths
Main->>Split: Split into train/val/test
Split-->>Main: Return split groups
Main->>Handler: Process each split (parallel via Ray)
Handler->>Source: Load slide and generate tiles
Source-->>Handler: Return tiles + metadata
Handler->>Mask: Apply tissue mask filter
Mask-->>Handler: Return filtered tiles
Handler-->>Main: Return TiledSlideMetadata
Main->>CSV: Export per-split metadata to CSV
Main->>MLflow: Log train, val, test datasets
MLflow-->>Main: Datasets logged
Main-->>Main: Workflow complete
sequenceDiagram
participant Main as Main Workflow
participant Scan as Slide Scanner
participant Remote as Ray Remote process_slide
participant OpenSlide as OpenSlide
participant Pyvips as Pyvips
participant FileSystem as File System
Main->>Scan: Discover .mrxs slides in SLIDES_PATH
Scan-->>Main: Return slide paths
Main->>Remote: Submit parallel processing jobs
Remote->>OpenSlide: Open slide at specified level
OpenSlide-->>Remote: Return resolution info (xres, yres)
Remote->>Pyvips: Load slide region at level
Pyvips-->>Remote: Return image data
Remote->>Remote: Compute tissue mask
Remote->>FileSystem: Create destination directory
Remote->>FileSystem: Write mask as BigTIFF
FileSystem-->>Remote: File saved
Remote-->>Main: Mask generation complete
Main-->>Main: All slides processed
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request establishes the foundational preprocessing pipeline for whole-slide images, enabling the automated generation of tissue masks and the subsequent tiling of these images into structured datasets. It also sets up robust project dependency management and integrates MLflow for experiment tracking, significantly streamlining the initial data preparation phase for machine learning model development.

Highlights
Changelog
Activity
Code Review
This pull request introduces preprocessing scripts for dataset preparation and sets up the project structure with a pyproject.toml and .gitignore. The scripts are a good foundation, but I've identified several areas for improvement. My feedback focuses on removing hardcoded paths to enhance portability, fixing a variable redefinition bug, improving path manipulation logic for robustness, and cleaning up leftover debugging code. I've also suggested adjustments to pyproject.toml for more flexible dependency management.
```python
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```
The variable TISSUE_MASKS_PATH is defined on line 14 and then immediately redefined on line 15 with a different value. This is likely an error and can lead to unexpected behavior. Please remove the incorrect definition.
```diff
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"
```
```python
def handler(slide_path: Path) -> TiledSlideMetadata:
    slide, tiles = source(slide_path)

    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")
```
Using string slicing [:-5] to change the file extension is not robust as it assumes all slide files have a 4-character extension (e.g., .mrxs). A more reliable approach is to use pathlib's methods to handle file stems and suffixes.
```diff
-    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")
+    tissue_mask_path = Path(TISSUE_MASKS_PATH, f"{slide_path.stem}.tiff")
```
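The fragility of the slicing approach is easy to demonstrate in isolation (the file names below are made up for illustration):

```python
from pathlib import Path

# Slicing assumes a 4-character extension: fine for ".mrxs", wrong otherwise.
mrxs = Path("slides/case_01.mrxs")
svs = Path("slides/case_02.svs")

print(mrxs.name[:-5] + ".tiff")  # "case_01.tiff" — happens to work
print(svs.name[:-5] + ".tiff")   # "case_0.tiff"  — silently drops a character

# with_suffix handles any extension length.
print(mrxs.with_suffix(".tiff").name)  # "case_01.tiff"
print(svs.with_suffix(".tiff").name)   # "case_02.tiff"
```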
```python
SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"
MASK_DEST = "./mask/tissue_masks"
LEVEL = 3
```

```python
    slide = pyvips.Image.new_from_file(slide_path, level=LEVEL)
    mask = tissue_mask(slide, xres)
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
```
```python
def main() -> None:
    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
    # process_slide(slides[0])
```

```python
from sklearn.model_selection import train_test_split
```

```python
SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"
```
The script uses several hardcoded paths, such as SLIDES_PATH, TISSUE_MASKS_PATH (line 15), and the output CSV paths (lines 72-79). This reduces portability and makes the script harder to use in different environments or with different datasets. Consider using command-line arguments (e.g., with argparse) to make these paths configurable.
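A minimal sketch of the suggested argparse approach — the flag names and defaults below are illustrative assumptions, not part of the PR:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    # Illustrative CLI: flag names and defaults are assumptions, not the PR's API.
    parser = argparse.ArgumentParser(description="Tile slides and export datasets.")
    parser.add_argument("--slides-path", type=Path, default=Path("./slides"),
                        help="Directory scanned recursively for .mrxs slides.")
    parser.add_argument("--masks-path", type=Path, default=Path("./mask/tissue_masks"),
                        help="Directory containing precomputed tissue masks.")
    parser.add_argument("--output-dir", type=Path, default=Path("./data/datasets"),
                        help="Where the per-split CSV files are written.")
    return parser.parse_args(argv)

args = parse_args(["--slides-path", "/mnt/data/slides"])
print(args.slides_path)  # /mnt/data/slides
```

main() would then read `args.slides_path` and friends instead of module-level constants.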
```python
    val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
    test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)
```
The train_test_split function from sklearn already returns lists, so the explicit calls to list() on val_slides and test_slides are redundant. Removing them will make the code cleaner.
```diff
-    val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
-    test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)
+    val_slides_df, val_tiles_df = tiling(slides=val_slides, handler=handler)
+    test_slides_df, test_tiles_df = tiling(slides=test_slides, handler=handler)
```
```python
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )
```
The if __name__ == "__main__": block contains commented-out code and hardcoded paths for debugging. This should be cleaned up to only call the main() function. Any testing or debugging logic should be moved to a separate script or function.
```diff
 if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+    main()
```
```toml
name = "Stain Normalization"
version = "0.1.0"
authors = [{name = "Adam Lopatka"}]
requires-python = "==3.12.5"
```
Pinning the Python version to an exact patch release (==3.12.5) can make it difficult to set up the project environment. It's generally better to allow for a range of compatible versions, for example >=3.12, to improve flexibility.
```diff
-requires-python = "==3.12.5"
+requires-python = ">=3.12"
```
Force-pushed c6c3e65 to 7d68d80 (Compare)
Force-pushed 7d68d80 to 89c3d5a (Compare)
Actionable comments posted: 7
🧹 Nitpick comments (1)
.gitignore (1)
181-183: Ignore the generated mask TIFFs too.
`preprocessing/mask_generator.py` writes to `./mask/tissue_masks`, but that output tree is still tracked here. Running the new preprocessing step will leave large generated artifacts in git status.

🧹 Small ignore update
```diff
 # Project
 /data
+/mask/tissue_masks/
 /demo_*
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 181 - 183, Add an ignore rule for the generated mask TIFFs so git stops tracking the output from preprocessing/mask_generator.py: update .gitignore to exclude the mask output directory (e.g., mask/tissue_masks/) and TIFF files (e.g., *.tif and/or *.tiff) so that files created under ./mask/tissue_masks by the mask_generator are not shown in git status.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preprocessing/mask_generator.py`:
- Around line 32-34: The mask filename is not unique because mask_path is built
from Path(slide_path).stem; change mask generation to preserve the slide's
source-relative path under MASK_DEST instead of using only the stem: compute the
slide's relative path (e.g., rel = Path(slide_path).relative_to(SOURCE_ROOT) or
equivalent), replace its suffix with ".tiff" and join with MASK_DEST to form
mask_path, ensure mask_path.parent.mkdir(...), and call write_big_tiff(mask,
path=mask_path, mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to
resolve masks using the same source-relative path scheme so mask lookup and
slide pairing remain consistent.
In `@preprocessing/tiler.py`:
- Around line 100-111: The __main__ block currently bypasses the CLI entry point
by hard-coding a str slide path and calling tiling directly; restore the
intended behavior by invoking main() from the __main__ guard (so `pdm run tiler`
executes the dataset-building flow), remove the hard-coded slides variable, and
if you must call tiling directly ensure you pass Path objects to handler
(convert the slide strings with pathlib.Path) and still call save_mlflow_dataset
with the produced train_slides_df and train_tiles_df; update the __main__ block
to call main() (or, if keeping a direct call, convert slide paths to Path and
preserve the mlflow/save_mlflow_dataset steps).
- Around line 14-15: TISSUE_MASKS_PATH in preprocessing/tiler.py is being
overridden with a developer-local path; remove the hard-coded override so the
module uses the original TISSUE_MASKS_PATH definition (or alternatively read
from an environment/config variable) to keep producer
(preprocessing/mask_generator.py) and consumer paths aligned; update references
to the TISSUE_MASKS_PATH symbol accordingly and ensure no second assignment
replaces the intended shared path.
- Around line 72-79: The CSV writes in preprocessing/tiler.py call
train_slides_df.to_csv / train_tiles_df.to_csv / val_slides_df.to_csv /
val_tiles_df.to_csv / test_slides_df.to_csv / test_tiles_df.to_csv but never
ensure the ./data/datasets directory exists; fix by creating the directory
(e.g., using os.makedirs or Path(...).mkdir with parents=True, exist_ok=True)
once before the first to_csv call—add this directory creation at the start of
the block that writes these CSVs (or at the start of main()) so all six writes
succeed on a clean checkout.
- Around line 65-66: The two-stage splitting using train_test_split can fail for
tiny datasets; update main() around the slides split to handle len(slides) < 3:
if len(slides) == 1 set train_slides = slides and val_slides = test_slides = []
(or otherwise ensure no further splitting), if len(slides) == 2 perform a single
split to produce one train and one test (e.g., train_slides and test_slides) and
set val_slides = [], otherwise keep the existing two-step splitting (slides ->
test_slides and remaining -> train/val). Use the existing variable names slides,
train_slides, val_slides, test_slides and the train_test_split calls so the rest
of the pipeline remains unchanged.
In `@pyproject.toml`:
- Around line 8-21: Add the missing direct dependencies to pyproject.toml so
preprocessing scripts don't rely on transitive installs: add pyvips, ray,
mlflow, and scikit-learn to the [project.dependencies] list (the imports
referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
- Around line 34-37: The mypy job (mypy_check in pyproject.toml) fails due to
missing type stubs for pyvips and sklearn.model_selection; update the mypy
configuration to ignore missing imports for those modules so the check passes:
add a mypy config section (either in pyproject.toml under [tool.mypy] or a
standalone mypy.ini) that sets ignore_missing_imports = True for pyvips and
sklearn.model_selection (or use per-module settings using [tool.mypy-PACKAGE] /
[mypy-PACKAGE] for "pyvips" and "sklearn.model_selection"); this will unblock
the mypy run for imports used in preprocessing/mask_generator.py and
preprocessing/tiler.py while keeping the existing mypy_check = "mypy ." target.
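Both pyproject.toml findings above can be sketched together. The version floors and the exact override-table placement below are illustrative assumptions, not values taken from the PR:

```toml
[project]
dependencies = [
    "pyvips>=2.2",        # assumed version floors; adjust to the project's needs
    "ray>=2.9",
    "mlflow>=2.10",
    "scikit-learn>=1.4",
]

# Per-module mypy override: silences missing-stub errors only for these imports.
[[tool.mypy.overrides]]
module = ["pyvips", "pyvips.*", "sklearn.*"]
ignore_missing_imports = true
```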
---
Nitpick comments:
In @.gitignore:
- Around line 181-183: Add an ignore rule for the generated mask TIFFs so git
stops tracking the output from preprocessing/mask_generator.py: update
.gitignore to exclude the mask output directory (e.g., mask/tissue_masks/) and
TIFF files (e.g., *.tif and/or *.tiff) so that files created under
./mask/tissue_masks by the mask_generator are not shown in git status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cf66e25c-e3bc-4247-8696-b54dbeec00f2
⛔ Files ignored due to path filters (2)
`pdm.lock` is excluded by `!**/*.lock`
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
`.gitignore`, `preprocessing/mask_generator.py`, `preprocessing/tiler.py`, `pyproject.toml`
```python
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
    mask_path.parent.mkdir(exist_ok=True, parents=True)
    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
```
Keep mask filenames unique across the recursive slide tree.
Line 32 uses only stem, so two slides with the same basename in different folders will overwrite the same mask. preprocessing/tiler.py later resolves masks using the same flattened name, so the wrong mask can be paired back to a slide.
🗂️ Preserve the source-relative path in the mask output
```diff
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)
```

Mirror the same relative-path scheme in `preprocessing/tiler.py` when resolving masks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
-    mask_path.parent.mkdir(exist_ok=True, parents=True)
-    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)
+    mask_path.parent.mkdir(exist_ok=True, parents=True)
+    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/mask_generator.py` around lines 32 - 34, The mask filename is
not unique because mask_path is built from Path(slide_path).stem; change mask
generation to preserve the slide's source-relative path under MASK_DEST instead
of using only the stem: compute the slide's relative path (e.g., rel =
Path(slide_path).relative_to(SOURCE_ROOT) or equivalent), replace its suffix
with ".tiff" and join with MASK_DEST to form mask_path, ensure
mask_path.parent.mkdir(...), and call write_big_tiff(mask, path=mask_path,
mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to resolve masks
using the same source-relative path scheme so mask lookup and slide pairing
remain consistent.
```python
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```
Drop the developer-local mask path override.
Line 14 matches the output of preprocessing/mask_generator.py, but Line 15 immediately replaces it with /home/jovyan/staining/demo_data/masks. The main workflow will ignore freshly generated masks and only work on one workstation.
📁 Keep the producer/consumer paths aligned
```diff
 TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```

📝 Committable suggestion
```diff
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 14 - 15, TISSUE_MASKS_PATH in
preprocessing/tiler.py is being overridden with a developer-local path; remove
the hard-coded override so the module uses the original TISSUE_MASKS_PATH
definition (or alternatively read from an environment/config variable) to keep
producer (preprocessing/mask_generator.py) and consumer paths aligned; update
references to the TISSUE_MASKS_PATH symbol accordingly and ensure no second
assignment replaces the intended shared path.
```python
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
```
🧩 Analysis chain
🏁 Script executed:
cd . && find . -type f -name "tiler.py" | head -20

Repository: RationAI/stain-normalization
Length of output: 96
🏁 Script executed:
cd . && git ls-files | grep -i tiler

Repository: RationAI/stain-normalization
Length of output: 94
🏁 Script executed:
cd . && git ls-files | grep -i preprocessing

Repository: RationAI/stain-normalization
Length of output: 126
🏁 Script executed:
cd . && git ls-files | head -30

Repository: RationAI/stain-normalization
Length of output: 221
🏁 Script executed:
cd . && git log --oneline -20

Repository: RationAI/stain-normalization
Length of output: 103
🏁 Script executed:
cd . && find . -name "*.py" -path "*/preprocessing*" -o -name "*tiler*" 2>/dev/null | head -20

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && ls -la

Repository: RationAI/stain-normalization
Length of output: 796
🏁 Script executed:
cd . && find . -type f -name "*.py" | grep -E "(tiler|preprocessing)" | head -20

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && find . -type f -name "*.py" | head -30

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && tree -L 2 -I '__pycache__|*.egg-info|.git' 2>/dev/null || find . -maxdepth 2 -type d | head -30

Repository: RationAI/stain-normalization
Length of output: 242
🏁 Script executed:
cd . && ls -R | grep -E "(tiler|preprocessing)" | head -20

Repository: RationAI/stain-normalization
Length of output: 111
🏁 Script executed:
cd . && git show --stat | head -50

Repository: RationAI/stain-normalization
Length of output: 729
🏁 Script executed:
cd . && rg "train_test_split" -l

Repository: RationAI/stain-normalization
Length of output: 94
🏁 Script executed:
cd . && rg "def main\(\)" -A 10 | head -50

Repository: RationAI/stain-normalization
Length of output: 1322
🏁 Script executed:
cd . && git ls-files 2>/dev/null | head -50

Repository: RationAI/stain-normalization
Length of output: 221
🏁 Script executed:
cd . && pwd && ls -la | head -20

Repository: RationAI/stain-normalization
Length of output: 815
🏁 Script executed:
python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test 1: Single slide (1 element)
print("Test 1: Single slide dataset")
try:
    slides = ["slide1.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 2: Two slides
print("\nTest 2: Two slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 3: Three slides
print("\nTest 3: Three slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs", "slide3.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 4: Five slides (test first split then second)
print("\nTest 4: Five slides with sequential splits")
try:
    slides = ["s1", "s2", "s3", "s4", "s5"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - train_val: {slides}, test: {test_slides}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  After second split - train: {train_slides}, val: {val_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")
EOF

Repository: RationAI/stain-normalization
Length of output: 191
🏁 Script executed:
python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test exact scenario from review with edge cases
print("Simulating the review's exact code pattern:\n")

# Scenario 1: Single slide
print("Scenario 1: Single slide")
try:
    slides = ["only_slide.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 2: Two slides
print("\nScenario 2: Two slides")
try:
    slides = ["s1.mrxs", "s2.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 3: Three slides
print("\nScenario 3: Three slides")
try:
    slides = ["s1.mrxs", "s2.mrxs", "s3.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - remaining slides: {len(slides)}, test: {len(test_slides)}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {len(train_slides)}, val: {len(val_slides)}, test: {len(test_slides)}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")
EOF

Repository: RationAI/stain-normalization
Length of output: 191
🏁 Script executed:
cat -n preprocessing/tiler.py

Repository: RationAI/stain-normalization
Length of output: 4589
🏁 Script executed:
cat README.md

Repository: RationAI/stain-normalization
Length of output: 93
🏁 Script executed:
git log --all --oneline | head -20

Repository: RationAI/stain-normalization
Length of output: 181
🏁 Script executed:
git show 89c3d5a --no-patch --format=fuller

Repository: RationAI/stain-normalization
Length of output: 328
🏁 Script executed:
cat pyproject.toml

Repository: RationAI/stain-normalization
Length of output: 1673
🏁 Script executed:
rg "len\(slides\)" preprocessing/

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
cat -n preprocessing/tiler.py | head -80

Repository: RationAI/stain-normalization
Length of output: 3237
🏁 Script executed:
python3 << 'EOF'
# Simulate train_test_split behavior without sklearn
# For test_size=0.2: ceil(n * 0.2) samples go to test
# For test_size=0.1: ceil(n * 0.1) samples go to validation
import math

scenarios = [
    ("1 slide", 1),
    ("2 slides", 2),
    ("3 slides", 3),
    ("4 slides", 4),
    ("5 slides", 5),
    ("10 slides", 10),
]

for name, n in scenarios:
    # First split: test_size=0.2
    test_count_1 = max(1, math.ceil(n * 0.2))
    train_val_count = n - test_count_1
    # Second split on remaining: test_size=0.1
    if train_val_count > 0:
        val_count = max(1, math.ceil(train_val_count * 0.1)) if train_val_count > 1 else 0
        train_count = train_val_count - val_count
    else:
        train_count = 0
        val_count = 0
    print(f"{name:12} -> train: {train_count}, val: {val_count}, test: {test_count_1}")
    if train_val_count < 1:
        print(f"  ⚠️ First split leaves 0 samples")
    elif val_count == 0:
        print(f"  ⚠️ Second split produces empty validation set")
EOF

Repository: RationAI/stain-normalization
Length of output: 431
Handle the single-slide dataset before splitting.
The code does not guard against tiny datasets. With a single slide, train_test_split raises a ValueError. With 2 slides, the second split produces an empty validation set. The codebase itself references single-slide workflows (line 102), but the main() function lacks protection for datasets with fewer than 3 slides.
Guard for tiny datasets
```diff
 def main() -> None:
     slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
-    slides, test_slides = train_test_split(slides, test_size=0.2)
-    train_slides, val_slides = train_test_split(slides, test_size=0.1)
+    if len(slides) < 3:
+        train_slides, val_slides, test_slides = slides, [], []
+    else:
+        train_val_slides, test_slides = train_test_split(
+            slides, test_size=0.2, random_state=42
+        )
+        train_slides, val_slides = train_test_split(
+            train_val_slides, test_size=0.1, random_state=42
+        )
```

📝 Committable suggestion
```diff
-    slides, test_slides = train_test_split(slides, test_size=0.2)
-    train_slides, val_slides = train_test_split(slides, test_size=0.1)
+def main() -> None:
+    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
+    if len(slides) < 3:
+        train_slides, val_slides, test_slides = slides, [], []
+    else:
+        train_val_slides, test_slides = train_test_split(
+            slides, test_size=0.2, random_state=42
+        )
+        train_slides, val_slides = train_test_split(
+            train_val_slides, test_size=0.1, random_state=42
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 65 - 66, The two-stage splitting using
train_test_split can fail for tiny datasets; update main() around the slides
split to handle len(slides) < 3: if len(slides) == 1 set train_slides = slides
and val_slides = test_slides = [] (or otherwise ensure no further splitting), if
len(slides) == 2 perform a single split to produce one train and one test (e.g.,
train_slides and test_slides) and set val_slides = [], otherwise keep the
existing two-step splitting (slides -> test_slides and remaining -> train/val).
Use the existing variable names slides, train_slides, val_slides, test_slides
and the train_test_split calls so the rest of the pipeline remains unchanged.
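The guard's branching can be illustrated independently of sklearn — the `simple_split` helper below is a deterministic stand-in for `train_test_split` (no shuffling, no randomness), used only to make the edge cases visible:

```python
def simple_split(items, test_size):
    # Stand-in for sklearn's train_test_split: deterministic tail split,
    # kept only to demonstrate the guard's branching.
    n_test = max(1, round(len(items) * test_size))
    return items[:-n_test], items[-n_test:]

def split_slides(slides):
    # Guard tiny datasets before attempting the two-stage split.
    if len(slides) == 1:
        return slides, [], []          # train only
    if len(slides) == 2:
        train, test = simple_split(slides, 0.5)
        return train, [], test         # no validation set
    rest, test = simple_split(slides, 0.2)
    train, val = simple_split(rest, 0.1)
    return train, val, test

print(split_slides(["a"]))       # (['a'], [], [])
print(split_slides(["a", "b"]))  # (['a'], [], ['b'])
```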
```python
    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)

    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)

    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)
```
Create ./data/datasets before the first to_csv.
None of these writes ensures the parent directory exists, so main() will fail on a clean checkout before any CSV is produced.
📦 Ensure the output directory exists once
```diff
+    output_dir = Path("./data/datasets")
+    output_dir.mkdir(parents=True, exist_ok=True)
+
-    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
-    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)
+    train_slides_df.to_csv(output_dir / "train_slides.csv", index=False)
+    train_tiles_df.to_csv(output_dir / "train_tiles.csv", index=False)
-    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
-    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)
+    val_slides_df.to_csv(output_dir / "val_slides.csv", index=False)
+    val_tiles_df.to_csv(output_dir / "val_tiles.csv", index=False)
-    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
-    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)
+    test_slides_df.to_csv(output_dir / "test_slides.csv", index=False)
+    test_tiles_df.to_csv(output_dir / "test_tiles.csv", index=False)
```

🤖 Prompt for AI Agents
+ test_tiles_df.to_csv(output_dir / "test_tiles.csv", index=False)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 72 - 79, The CSV writes in
preprocessing/tiler.py call train_slides_df.to_csv / train_tiles_df.to_csv /
val_slides_df.to_csv / val_tiles_df.to_csv / test_slides_df.to_csv /
test_tiles_df.to_csv but never ensure the ./data/datasets directory exists; fix
by creating the directory (e.g., using os.makedirs or Path(...).mkdir with
parents=True, exist_ok=True) once before the first to_csv call—add this
directory creation at the start of the block that writes these CSVs (or at the
start of main()) so all six writes succeed on a clean checkout.
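The idempotent directory creation the comment asks for is a one-liner with pathlib; the sketch below uses a throwaway temp directory and the stdlib csv module (rather than pandas) purely to keep the demo self-contained:

```python
import csv
import tempfile
from pathlib import Path

# Simulate a clean checkout with a throwaway root directory.
root = Path(tempfile.mkdtemp())
output_dir = root / "data" / "datasets"

# parents=True creates intermediate dirs; exist_ok=True makes reruns safe.
output_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

# A write that would otherwise fail with FileNotFoundError now succeeds.
with open(output_dir / "train_slides.csv", "w", newline="") as f:
    csv.writer(f).writerow(["slide_path", "split"])

print((output_dir / "train_slides.csv").exists())  # True
```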
```python
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )
```
Run main() from the CLI entry point.
pyproject.toml wires pdm run tiler to this file, but the __main__ block currently skips main(), hard-codes one local slide path, and passes str values into handler(slide_path: Path). As committed, the entry point does not execute the dataset-building flow added above.
🚪 Make the script entry point run the intended workflow
```diff
 if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+    main()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if __name__ == "__main__":
    main()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 100 - 111, The __main__ block currently
bypasses the CLI entry point by hard-coding a str slide path and calling tiling
directly; restore the intended behavior by invoking main() from the __main__
guard (so `pdm run tiler` executes the dataset-building flow), remove the
hard-coded slides variable, and if you must call tiling directly ensure you pass
Path objects to handler (convert the slide strings with pathlib.Path) and still
call save_mlflow_dataset with the produced train_slides_df and train_tiles_df;
update the __main__ block to call main() (or, if keeping a direct call, convert
slide paths to Path and preserve the mlflow/save_mlflow_dataset steps).
```toml
dependencies = [
    "lightning>=2.0.0",
    "albumentations>=1.4.14",
    "torchmetrics>=1.4.14",
    "torchvision>=0.17.2",
    "torch>=2.2.2",
    "tqdm>=4.66.5",
    "rationai-mlkit @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git",
    "rationai-masks @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git",
    "rationai-tiling @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/tiling.git",
    "scikit-image>=0.25.2",
    "openslide-bin>=4.0.0.6",
    "rationai-staining @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/staining.git",
]
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Direct imports in the new preprocessing scripts:"
rg -n '^(import (mlflow|pyvips|ray)\b|from sklearn\.model_selection import train_test_split\b)' preprocessing
echo
echo "Matching declared project dependencies:"
rg -n '^\s*"(mlflow|pyvips|ray|scikit-learn)([<>=@ ].*)?"' pyproject.toml || true
echo
echo "UV override section:"
rg -n -A2 -B1 'override-dependencies' pyproject.toml || true
```

Repository: RationAI/stain-normalization
Length of output: 480
Add missing direct dependencies for packages imported by preprocessing scripts.
`preprocessing/mask_generator.py` imports `pyvips` and `ray`. `preprocessing/tiler.py` imports `mlflow`, `ray`, and `sklearn`. None of these packages are declared in `[project.dependencies]`, making the scripts depend on transitive dependencies that may not be guaranteed. The `override-dependencies` entry on line 50 does not install `mlflow`; it only pins its version if already present in the dependency tree.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 8 - 21, Add the missing direct dependencies to
pyproject.toml so preprocessing scripts don't rely on transitive installs: add
pyvips, ray, mlflow, and scikit-learn to the [project.dependencies] list (the
imports referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
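A hedged sketch of the suggested additions; the version floors below are assumptions for illustration, not taken from the repository's lockfile:

```toml
# Illustrative additions to [project.dependencies] in pyproject.toml.
# Version constraints are placeholder assumptions.
dependencies = [
    # ...existing entries...
    "pyvips>=2.2",        # preprocessing/mask_generator.py
    "ray>=2.9",           # both preprocessing scripts
    "mlflow>=2.0",        # preprocessing/tiler.py
    "scikit-learn>=1.4",  # sklearn.model_selection in preprocessing/tiler.py
]
```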
```toml
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's check the pyproject.toml file
cat -n pyproject.toml
```

Repository: RationAI/stain-normalization
Length of output: 2023
🏁 Script executed:
```bash
# Check for imports of pyvips in the codebase
rg "import pyvips|from pyvips" --type py
```

Repository: RationAI/stain-normalization
Length of output: 117
🏁 Script executed:
```bash
# Check for imports of sklearn.model_selection in the codebase
rg "from sklearn\.model_selection|import sklearn\.model_selection" --type py
```

Repository: RationAI/stain-normalization
Length of output: 147
🏁 Script executed:
```bash
# Check if there's a current mypy configuration in pyproject.toml
rg "\[tool\.mypy\]" pyproject.toml
```

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
```bash
# Check for any existing type: ignore comments in the codebase
rg "type: ignore|TYPE_CHECKING" --type py | head -20
```

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
```bash
# Check if there are .pyi files or py.typed markers for these packages
find . -name "*.pyi" -o -name "py.typed" | head -20
```

Repository: RationAI/stain-normalization
Length of output: 54
🌐 Web query:
pyvips type stubs py.typed mypy
💡 Result:
- `pyvips` ships a stub file (`pyvips/__init__.pyi`) and its README says it is intended to work as a PEP 561 typed package. [1]
- For mypy (and other type checkers) to use inline/bundled stubs, the installed runtime package must include an (empty) `py.typed` marker file in the package directory (PEP 561). [2]
- If you're seeing mypy errors like "missing library stubs or py.typed marker" for `pyvips`, the fix is: add `pyvips/py.typed` and ensure it's included in the built distribution (wheel/sdist). Per PEP 561, stub-only distributions named `*-stubs` don't need `py.typed` (alternative approach: publish `pyvips-stubs`). [2]
- If you can't change the installed package, generate your own stubs and point mypy at them (e.g., `stubgen` + `mypy_path`). [3]

Sources: [1] [2] [3]
🌐 Web query:
scikit-learn sklearn type stubs mypy import-untyped
💡 Result:
mypy reports `[import-untyped]` when you import an installed third-party library that doesn't ship inline types (`py.typed`) and you don't have a PEP 561 stub package installed. [2][3]

Fix for scikit-learn (sklearn)

- Install the stub package: `pip install scikit-learn-stubs`. This publishes the `sklearn` stubs (from Microsoft's python-type-stubs) to PyPI. [1]
- Re-run `mypy`.

If you can't (or don't want to) add stubs

- Ignore just `sklearn` imports:

  ```ini
  [mypy-sklearn.*]
  ignore_missing_imports = True
  ```

- Or tell mypy to follow/analyze untyped imports (may increase noise/time):

  ```ini
  [mypy]
  follow_untyped_imports = True
  ```

  (Also available as `--follow-untyped-imports`.) [4]

Sources: scikit-learn stubs on PyPI [1], mypy docs/changelog on import-untyped behavior [2], PEP 561 on distributing stubs [3], mypy guidance on follow_untyped_imports [4].
mypy_check will fail without mypy configuration.
Both `pyvips` (from `preprocessing/mask_generator.py`) and `sklearn.model_selection` (from `preprocessing/tiler.py`) are imported but lack type information that mypy can use. Running `mypy .` without configuration will fail on `import-untyped` errors. Configure mypy to ignore missing imports for these modules:
Minimal mypy configuration to unblock the check
```diff
+[tool.mypy]
+
+[[tool.mypy.overrides]]
+module = ["pyvips", "sklearn.*"]
+ignore_missing_imports = true
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```toml
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."

[tool.mypy]

[[tool.mypy.overrides]]
module = ["pyvips", "sklearn.*"]
ignore_missing_imports = true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 34 - 37, The mypy job (mypy_check in
pyproject.toml) fails due to missing type stubs for pyvips and
sklearn.model_selection; update the mypy configuration to ignore missing imports
for those modules so the check passes: add a mypy config section (either in
pyproject.toml under [tool.mypy] or a standalone mypy.ini) that sets
ignore_missing_imports = True for pyvips and sklearn.model_selection (or use
per-module settings using [tool.mypy-PACKAGE] / [mypy-PACKAGE] for "pyvips" and
"sklearn.model_selection"); this will unblock the mypy run for imports used in
preprocessing/mask_generator.py and preprocessing/tiler.py while keeping the
existing mypy_check = "mypy ." target.
Just usage of the RationAI libs to prepare an initial dataset for training; after the data reorganization this will use the official dataset. I have it set up from when I was creating a single-slide dataset for testing (the Gemini problems).
Summary by CodeRabbit
New Features
Chores