feat: add preprocessing #3

Closed
LAdam-ix wants to merge 1 commit into main from feature/ml-preprocessing

Conversation


@LAdam-ix LAdam-ix commented Mar 11, 2026

Just uses the rationai libs to prepare the initial dataset for training, after the data reorganization using the official dataset. It is currently set up the way I had it when creating a single-slide dataset for testing (the Gemini problems).

Summary by CodeRabbit

  • New Features

    • Added automated tissue mask generation pipeline to preprocess whole-slide images.
    • Added image tiling workflow with intelligent tissue-based filtering and experiment tracking integration for managing dataset splits and versioning.
  • Chores

    • Initialized project configuration with dependencies and build tooling.
    • Added comprehensive development environment setup files.

@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

This PR establishes the foundational infrastructure for a stain normalization project, adding comprehensive project configuration, environment setup, and two new preprocessing scripts to automate parallel whole-slide image processing, tissue masking, and tiling workflows.

Changes

Cohort / File(s): Project Configuration (.gitignore, pyproject.toml)
Summary: Adds project setup files: comprehensive Python gitignore rules and a PDM-based project configuration with dependencies (OpenSlide, pyvips, Ray, MLflow) plus CLI scripts for mask generation, tiling, training, validation, and testing.

Cohort / File(s): Preprocessing Pipeline (preprocessing/mask_generator.py, preprocessing/tiler.py)
Summary: Introduces two preprocessing scripts: mask_generator.py orchestrates parallel slide processing via Ray to generate tissue masks using OpenSlide and pyvips; tiler.py implements the complete tiling workflow with tissue masking (TissueMask class), train/val/test splitting, a remote Ray handler, CSV export, and MLflow dataset logging.

Sequence Diagrams

sequenceDiagram
    participant Main as Main Workflow
    participant Discover as Slide Discovery
    participant Split as Train/Val/Test Split
    participant Handler as Ray Remote Handler
    participant Source as OpenSlide Tile Source
    participant Mask as Tissue Mask Filter
    participant CSV as CSV Export
    participant MLflow as MLflow Logger

    Main->>Discover: Scan SLIDES_PATH for .mrxs files
    Discover-->>Main: Return slide paths
    Main->>Split: Split into train/val/test
    Split-->>Main: Return split groups
    Main->>Handler: Process each split (parallel via Ray)
    Handler->>Source: Load slide and generate tiles
    Source-->>Handler: Return tiles + metadata
    Handler->>Mask: Apply tissue mask filter
    Mask-->>Handler: Return filtered tiles
    Handler-->>Main: Return TiledSlideMetadata
    Main->>CSV: Export per-split metadata to CSV
    Main->>MLflow: Log train, val, test datasets
    MLflow-->>Main: Datasets logged
    Main-->>Main: Workflow complete
sequenceDiagram
    participant Main as Main Workflow
    participant Scan as Slide Scanner
    participant Remote as Ray Remote process_slide
    participant OpenSlide as OpenSlide
    participant Pyvips as Pyvips
    participant FileSystem as File System

    Main->>Scan: Discover .mrxs slides in SLIDES_PATH
    Scan-->>Main: Return slide paths
    Main->>Remote: Submit parallel processing jobs
    Remote->>OpenSlide: Open slide at specified level
    OpenSlide-->>Remote: Return resolution info (xres, yres)
    Remote->>Pyvips: Load slide region at level
    Pyvips-->>Remote: Return image data
    Remote->>Remote: Compute tissue mask
    Remote->>FileSystem: Create destination directory
    Remote->>FileSystem: Write mask as BigTIFF
    FileSystem-->>Remote: File saved
    Remote-->>Main: Mask generation complete
    Main-->>Main: All slides processed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With mask and tile, the slides now dance,
Ray distributes work at speed's advance,
Tissues filtered, MLflow logs the way,
Configuration blooms—hooray, hooray!
A rabbit's preprocessing ballet! 🎪✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title 'feat: add preprocessing' directly reflects the core purpose of the PR: adding preprocessing functionality. The changes introduce two new preprocessing modules (mask_generator.py, tiler.py) plus project configuration, all essential preprocessing infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/ml-preprocessing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational preprocessing pipeline for whole slide images, enabling the automated generation of tissue masks and the subsequent tiling of these images into structured datasets. It also sets up robust project dependency management and integrates MLflow for experiment tracking, significantly streamlining the initial data preparation phase for machine learning model development.

Highlights

  • New Preprocessing Scripts: Introduced mask_generator.py and tiler.py for automated tissue mask generation and whole slide image tiling, leveraging rationai libraries.
  • Dependency Management: Added pyproject.toml to define project metadata, manage dependencies, and configure PDM scripts for various development and data processing tasks.
  • MLflow Integration: Integrated MLflow for tracking and saving processed datasets (slides and tiles) for training, validation, and testing, ensuring reproducibility.
  • Git Ignore Configuration: Included a comprehensive .gitignore file to exclude common development artifacts, environments, caches, and IDE-specific files from version control.
Changelog
  • .gitignore
    • Added a new .gitignore file with comprehensive exclusions for Python projects, including build artifacts, environments, caches, and IDE-specific files.
  • preprocessing/mask_generator.py
    • Added a new Python script to generate tissue masks from whole slide images.
    • Implemented parallel processing of slides using ray.
    • Utilized pyvips and openslide for image handling and spatial resolution extraction.
    • Defined constants for slide path, mask destination, and processing level.
  • preprocessing/tiler.py
    • Added a new Python script for tiling whole slide images and applying tissue masks.
    • Integrated rationai.tiling for tile generation and rationai.tiling.modules.masks.PyvipsMask for tissue masking.
    • Implemented data splitting into train, validation, and test sets using sklearn.model_selection.train_test_split.
    • Configured MLflow for experiment tracking and saving processed datasets (slides and tiles).
    • Defined constants for slide path, tissue masks path, slide MPP, and minimum tissue percentage.
  • pyproject.toml
    • Added a new pyproject.toml file to define project metadata, including name, version, author, Python requirements, and license.
    • Listed core dependencies, such as lightning, albumentations, torch, and several rationai libraries.
    • Specified development dependencies like mypy, pre-commit, and ruff.
    • Configured PDM scripts for mask generation, tiling, training, validation, testing, prediction, linting, formatting, and type checking.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces preprocessing scripts for dataset preparation and sets up the project structure with a pyproject.toml and .gitignore. The scripts are a good foundation, but I've identified several areas for improvement. My feedback focuses on removing hardcoded paths to enhance portability, fixing a variable redefinition bug, improving path manipulation logic for robustness, and cleaning up leftover debugging code. I've also suggested adjustments to pyproject.toml for more flexible dependency management.

Comment on lines +14 to +15
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"


high

The variable TISSUE_MASKS_PATH is defined on line 14 and then immediately redefined on line 15 with a different value. This is likely an error and can lead to unexpected behavior. Please remove the incorrect definition.

Suggested change
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"

def handler(slide_path: Path) -> TiledSlideMetadata:
    slide, tiles = source(slide_path)

    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")


high

Using string slicing [:-5] to change the file extension is not robust as it assumes all slide files have a 4-character extension (e.g., .mrxs). A more reliable approach is to use pathlib's methods to handle file stems and suffixes.

Suggested change
-    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")
+    tissue_mask_path = Path(TISSUE_MASKS_PATH, f"{slide_path.stem}.tiff")
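As a standalone illustration (the slide paths below are made up), the slicing approach silently breaks for extensions of other lengths, while pathlib does not:

```python
from pathlib import Path

mrxs = Path("/data/slides/case_01.mrxs")
tif = Path("/data/slides/case_01.tif")  # shorter extension

# String slicing assumes a 4-character extension plus the dot
assert mrxs.name[:-5] + ".tiff" == "case_01.tiff"  # happens to work
assert tif.name[:-5] + ".tiff" == "case_0.tiff"    # silently wrong

# pathlib strips whatever suffix is present
assert f"{mrxs.stem}.tiff" == "case_01.tiff"
assert f"{tif.stem}.tiff" == "case_01.tiff"
assert mrxs.with_suffix(".tiff").name == "case_01.tiff"
```

Path.with_suffix is the most direct fit when the whole path, not just the filename, should keep its extension swapped.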

Comment on lines +16 to +18
SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"
MASK_DEST = "./mask/tissue_masks"
LEVEL = 3


medium

The script uses hardcoded paths and parameters (SLIDES_PATH, MASK_DEST, LEVEL). This makes it difficult to reuse and run in different environments. It's better to make these configurable via command-line arguments using a library like argparse or click.
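A minimal argparse sketch of what that could look like (flag names and defaults are illustrative, not from the PR):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    # Hypothetical CLI replacing the hardcoded SLIDES_PATH / MASK_DEST / LEVEL constants
    parser = argparse.ArgumentParser(
        description="Generate tissue masks for whole-slide images."
    )
    parser.add_argument("--slides-path", type=Path, required=True,
                        help="Directory scanned recursively for .mrxs slides")
    parser.add_argument("--mask-dest", type=Path, default=Path("mask/tissue_masks"),
                        help="Output directory for generated masks")
    parser.add_argument("--level", type=int, default=3,
                        help="Slide pyramid level to process")
    return parser.parse_args(argv)

args = parse_args(["--slides-path", "/mnt/data/slides", "--level", "2"])
```

main() would then read args.slides_path and friends instead of the module-level constants.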


    slide = pyvips.Image.new_from_file(slide_path, level=LEVEL)
    mask = tissue_mask(slide, xres)
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")


medium

The slide_path variable is already a pathlib.Path object, so there's no need to instantiate it again with Path(). You can directly access its attributes like .stem.

Suggested change
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
+    mask_path = Path(MASK_DEST, f"{slide_path.stem}.tiff")


def main() -> None:
    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
    # process_slide(slides[0])


medium

This line appears to be leftover debugging code. It should be removed to keep the codebase clean.

from sklearn.model_selection import train_test_split


SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"


medium

The script uses several hardcoded paths, such as SLIDES_PATH, TISSUE_MASKS_PATH (line 15), and the output CSV paths (lines 72-79). This reduces portability and makes the script harder to use in different environments or with different datasets. Consider using command-line arguments (e.g., with argparse) to make these paths configurable.

Comment on lines +69 to +70
val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)


medium

The train_test_split function from sklearn already returns lists, so the explicit calls to list() on val_slides and test_slides are redundant. Removing them will make the code cleaner.

Suggested change
-    val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
-    test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)
+    val_slides_df, val_tiles_df = tiling(slides=val_slides, handler=handler)
+    test_slides_df, test_tiles_df = tiling(slides=test_slides, handler=handler)

Comment on lines +100 to +111
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )


medium

The if __name__ == "__main__": block contains commented-out code and hardcoded paths for debugging. This should be cleaned up to only call the main() function. Any testing or debugging logic should be moved to a separate script or function.

Suggested change
-if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+if __name__ == "__main__":
+    main()

name = "Stain Normalization"
version = "0.1.0"
authors = [{name = "Adam Lopatka"}]
requires-python = "==3.12.5"


medium

Pinning the Python version to an exact patch release (==3.12.5) can make it difficult to set up the project environment. It's generally better to allow for a range of compatible versions, for example >=3.12, to improve flexibility.

Suggested change
-requires-python = "==3.12.5"
+requires-python = ">=3.12"

@LAdam-ix LAdam-ix force-pushed the feature/ml-preprocessing branch 3 times, most recently from c6c3e65 to 7d68d80 (March 11, 2026 06:11)
@LAdam-ix LAdam-ix force-pushed the feature/ml-preprocessing branch from 7d68d80 to 89c3d5a (March 11, 2026 06:19)
@LAdam-ix LAdam-ix self-assigned this Mar 11, 2026
@LAdam-ix LAdam-ix marked this pull request as ready for review March 11, 2026 06:40
@LAdam-ix LAdam-ix requested review from a team, Adames4, JakubPekar and matejpekar and removed request for Adames4 and JakubPekar March 11, 2026 06:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
.gitignore (1)

181-183: Ignore the generated mask TIFFs too.

preprocessing/mask_generator.py writes to ./mask/tissue_masks, but that output tree is still tracked here. Running the new preprocessing step will leave large generated artifacts in git status.

🧹 Small ignore update
 # Project
 /data
+/mask/tissue_masks/
 /demo_*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 181 - 183, Add an ignore rule for the generated mask
TIFFs so git stops tracking the output from preprocessing/mask_generator.py:
update .gitignore to exclude the mask output directory (e.g.,
mask/tissue_masks/) and TIFF files (e.g., *.tif and/or *.tiff) so that files
created under ./mask/tissue_masks by the mask_generator are not shown in git
status.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@preprocessing/mask_generator.py`:
- Around line 32-34: The mask filename is not unique because mask_path is built
from Path(slide_path).stem; change mask generation to preserve the slide's
source-relative path under MASK_DEST instead of using only the stem: compute the
slide's relative path (e.g., rel = Path(slide_path).relative_to(SOURCE_ROOT) or
equivalent), replace its suffix with ".tiff" and join with MASK_DEST to form
mask_path, ensure mask_path.parent.mkdir(...), and call write_big_tiff(mask,
path=mask_path, mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to
resolve masks using the same source-relative path scheme so mask lookup and
slide pairing remain consistent.

In `@preprocessing/tiler.py`:
- Around line 100-111: The __main__ block currently bypasses the CLI entry point
by hard-coding a str slide path and calling tiling directly; restore the
intended behavior by invoking main() from the __main__ guard (so `pdm run tiler`
executes the dataset-building flow), remove the hard-coded slides variable, and
if you must call tiling directly ensure you pass Path objects to handler
(convert the slide strings with pathlib.Path) and still call save_mlflow_dataset
with the produced train_slides_df and train_tiles_df; update the __main__ block
to call main() (or, if keeping a direct call, convert slide paths to Path and
preserve the mlflow/save_mlflow_dataset steps).
- Around line 14-15: TISSUE_MASKS_PATH in preprocessing/tiler.py is being
overridden with a developer-local path; remove the hard-coded override so the
module uses the original TISSUE_MASKS_PATH definition (or alternatively read
from an environment/config variable) to keep producer
(preprocessing/mask_generator.py) and consumer paths aligned; update references
to the TISSUE_MASKS_PATH symbol accordingly and ensure no second assignment
replaces the intended shared path.
- Around line 72-79: The CSV writes in preprocessing/tiler.py call
train_slides_df.to_csv / train_tiles_df.to_csv / val_slides_df.to_csv /
val_tiles_df.to_csv / test_slides_df.to_csv / test_tiles_df.to_csv but never
ensure the ./data/datasets directory exists; fix by creating the directory
(e.g., using os.makedirs or Path(...).mkdir with parents=True, exist_ok=True)
once before the first to_csv call—add this directory creation at the start of
the block that writes these CSVs (or at the start of main()) so all six writes
succeed on a clean checkout.
- Around line 65-66: The two-stage splitting using train_test_split can fail for
tiny datasets; update main() around the slides split to handle len(slides) < 3:
if len(slides) == 1 set train_slides = slides and val_slides = test_slides = []
(or otherwise ensure no further splitting), if len(slides) == 2 perform a single
split to produce one train and one test (e.g., train_slides and test_slides) and
set val_slides = [], otherwise keep the existing two-step splitting (slides ->
test_slides and remaining -> train/val). Use the existing variable names slides,
train_slides, val_slides, test_slides and the train_test_split calls so the rest
of the pipeline remains unchanged.

In `@pyproject.toml`:
- Around line 8-21: Add the missing direct dependencies to pyproject.toml so
preprocessing scripts don't rely on transitive installs: add pyvips, ray,
mlflow, and scikit-learn to the [project.dependencies] list (the imports
referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
- Around line 34-37: The mypy job (mypy_check in pyproject.toml) fails due to
missing type stubs for pyvips and sklearn.model_selection; update the mypy
configuration to ignore missing imports for those modules so the check passes:
add a mypy config section (either in pyproject.toml under [tool.mypy] or a
standalone mypy.ini) that sets ignore_missing_imports = True for pyvips and
sklearn.model_selection (or use per-module settings using [tool.mypy-PACKAGE] /
[mypy-PACKAGE] for "pyvips" and "sklearn.model_selection"); this will unblock
the mypy run for imports used in preprocessing/mask_generator.py and
preprocessing/tiler.py while keeping the existing mypy_check = "mypy ." target.
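The per-module override described above can be written in pyproject.toml roughly like this (a sketch; the exact module globs may need adjusting to the real imports):

```toml
[[tool.mypy.overrides]]
# Suppress "missing library stubs" errors only for the untyped imports
# used by preprocessing/mask_generator.py and preprocessing/tiler.py.
module = [
    "pyvips",
    "pyvips.*",
    "sklearn.*",
]
ignore_missing_imports = true
```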

---

Nitpick comments:
In @.gitignore:
- Around line 181-183: Add an ignore rule for the generated mask TIFFs so git
stops tracking the output from preprocessing/mask_generator.py: update
.gitignore to exclude the mask output directory (e.g., mask/tissue_masks/) and
TIFF files (e.g., *.tif and/or *.tiff) so that files created under
./mask/tissue_masks by the mask_generator are not shown in git status.
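The directory-creation fix suggested for the CSV exports amounts to one mkdir call before the first write; a minimal sketch (using a temporary directory here instead of the script's ./data/datasets path, with an illustrative stand-in for the DataFrame writes):

```python
import csv
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
datasets_dir = base / "data" / "datasets"

# Create the whole tree once, before any to_csv-style writes;
# exist_ok=True makes the script safe to re-run.
datasets_dir.mkdir(parents=True, exist_ok=True)

# Stand-in for e.g. train_slides_df.to_csv(datasets_dir / "train_slides.csv")
with open(datasets_dir / "train_slides.csv", "w", newline="") as f:
    csv.writer(f).writerow(["slide_path"])
```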

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf66e25c-e3bc-4247-8696-b54dbeec00f2

📥 Commits

Reviewing files that changed from the base of the PR and between 0ce77d2 and 89c3d5a.

⛔ Files ignored due to path filters (2)
  • pdm.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • .gitignore
  • preprocessing/mask_generator.py
  • preprocessing/tiler.py
  • pyproject.toml

Comment on lines +32 to +34
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
    mask_path.parent.mkdir(exist_ok=True, parents=True)
    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)

⚠️ Potential issue | 🟠 Major

Keep mask filenames unique across the recursive slide tree.

Line 32 uses only stem, so two slides with the same basename in different folders will overwrite the same mask. preprocessing/tiler.py later resolves masks using the same flattened name, so the wrong mask can be paired back to a slide.

🗂️ Preserve the source-relative path in the mask output
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)

Mirror the same relative-path scheme in preprocessing/tiler.py when resolving masks.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
-    mask_path.parent.mkdir(exist_ok=True, parents=True)
-    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)
+    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/mask_generator.py` around lines 32 - 34, The mask filename is
not unique because mask_path is built from Path(slide_path).stem; change mask
generation to preserve the slide's source-relative path under MASK_DEST instead
of using only the stem: compute the slide's relative path (e.g., rel =
Path(slide_path).relative_to(SOURCE_ROOT) or equivalent), replace its suffix
with ".tiff" and join with MASK_DEST to form mask_path, ensure
mask_path.parent.mkdir(...), and call write_big_tiff(mask, path=mask_path,
mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to resolve masks
using the same source-relative path scheme so mask lookup and slide pairing
remain consistent.

Comment on lines +14 to +15
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"

⚠️ Potential issue | 🔴 Critical

Drop the developer-local mask path override.

Line 14 matches the output of preprocessing/mask_generator.py, but Line 15 immediately replaces it with /home/jovyan/staining/demo_data/masks. The main workflow will ignore freshly generated masks and only work on one workstation.

📁 Keep the producer/consumer paths aligned
 TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/tiler.py` around lines 14 - 15, TISSUE_MASKS_PATH in
preprocessing/tiler.py is being overridden with a developer-local path; remove
the hard-coded override so the module uses the original TISSUE_MASKS_PATH
definition (or alternatively read from an environment/config variable) to keep
producer (preprocessing/mask_generator.py) and consumer paths aligned; update
references to the TISSUE_MASKS_PATH symbol accordingly and ensure no second
assignment replaces the intended shared path.
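Reading the path from the environment with a sensible default could look like this (the variable name STAIN_TISSUE_MASKS_DIR is a hypothetical choice, not from the PR):

```python
import os

# Default matches the output location of preprocessing/mask_generator.py;
# a workstation-specific override goes in the environment, not the source.
TISSUE_MASKS_PATH = os.environ.get("STAIN_TISSUE_MASKS_DIR", "./mask/tissue_masks")
```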

Comment on lines +65 to +66
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
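A stdlib-only sketch of the guard this review describes (safe_split is a hypothetical helper; the script itself uses sklearn.model_selection.train_test_split):

```python
import random

def safe_split(slides, test_frac=0.2, val_frac=0.1, seed=0):
    """Train/val/test split that degrades gracefully for 1- or 2-slide datasets."""
    slides = list(slides)
    if len(slides) == 1:
        return slides, [], []              # everything goes to train
    if len(slides) == 2:
        return slides[:1], [], slides[1:]  # one train, no val, one test
    shuffled = slides[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_frac))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = max(1, round(len(rest) * val_frac))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

The max(1, ...) clamps mirror what train_test_split would raise an error on: a split that would leave a requested partition empty.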

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd . && find . -type f -name "tiler.py" | head -20

Repository: RationAI/stain-normalization

Length of output: 96


🏁 Script executed:

cd . && git ls-files | grep -i tiler

Length of output: 94


🏁 Script executed:

cd . && git ls-files | grep -i preprocessing

Length of output: 126


🏁 Script executed:

cd . && git ls-files | head -30

Length of output: 221


🏁 Script executed:

cd . && git log --oneline -20

Length of output: 103


🏁 Script executed:

cd . && find . -name "*.py" -path "*/preprocessing*" -o -name "*tiler*" 2>/dev/null | head -20

Length of output: 130


🏁 Script executed:

cd . && ls -la

Length of output: 796


🏁 Script executed:

cd . && find . -type f -name "*.py" | grep -E "(tiler|preprocessing)" | head -20

Length of output: 130


🏁 Script executed:

cd . && find . -type f -name "*.py" | head -30

Length of output: 130


🏁 Script executed:

cd . && tree -L 2 -I '__pycache__|*.egg-info|.git' 2>/dev/null || find . -maxdepth 2 -type d | head -30

Length of output: 242


🏁 Script executed:

cd . && ls -R | grep -E "(tiler|preprocessing)" | head -20

Length of output: 111


🏁 Script executed:

cd . && git show --stat | head -50

Length of output: 729


🏁 Script executed:

cd . && rg "train_test_split" -l

Length of output: 94


🏁 Script executed:

cd . && rg "def main\(\)" -A 10 | head -50

Length of output: 1322


🏁 Script executed:

cd . && git ls-files 2>/dev/null | head -50

Length of output: 221


🏁 Script executed:

cd . && pwd && ls -la | head -20

Length of output: 815


🏁 Script executed:

python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test 1: Single slide (1 element)
print("Test 1: Single slide dataset")
try:
    slides = ["slide1.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 2: Two slides
print("\nTest 2: Two slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 3: Three slides
print("\nTest 3: Three slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs", "slide3.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 4: Five slides (test first split then second)
print("\nTest 4: Five slides with sequential splits")
try:
    slides = ["s1", "s2", "s3", "s4", "s5"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - train_val: {slides}, test: {test_slides}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  After second split - train: {train_slides}, val: {val_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

EOF

Length of output: 191


🏁 Script executed:

python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test exact scenario from review with edge cases
print("Simulating the review's exact code pattern:\n")

# Scenario 1: Single slide
print("Scenario 1: Single slide")
try:
    slides = ["only_slide.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 2: Two slides
print("\nScenario 2: Two slides")
try:
    slides = ["s1.mrxs", "s2.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 3: Three slides
print("\nScenario 3: Three slides")
try:
    slides = ["s1.mrxs", "s2.mrxs", "s3.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - remaining slides: {len(slides)}, test: {len(test_slides)}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {len(train_slides)}, val: {len(val_slides)}, test: {len(test_slides)}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

EOF

Length of output: 191


🏁 Script executed:

cat -n preprocessing/tiler.py

Length of output: 4589


🏁 Script executed:

cat README.md

Length of output: 93


🏁 Script executed:

git log --all --oneline | head -20

Length of output: 181


🏁 Script executed:

git show 89c3d5a --no-patch --format=fuller

Length of output: 328


🏁 Script executed:

cat pyproject.toml

Length of output: 1673


🏁 Script executed:

rg "len\(slides\)" preprocessing/

Length of output: 54


🏁 Script executed:

cat -n preprocessing/tiler.py | head -80

Length of output: 3237


🏁 Script executed:

python3 << 'EOF'
# Simulate train_test_split behavior without sklearn
# For test_size=0.2: ceil(n * 0.2) samples go to test
# For test_size=0.1: ceil(n * 0.1) samples go to validation

import math

scenarios = [
    ("1 slide", 1),
    ("2 slides", 2),
    ("3 slides", 3),
    ("4 slides", 4),
    ("5 slides", 5),
    ("10 slides", 10),
]

for name, n in scenarios:
    # First split: test_size=0.2
    test_count_1 = max(1, math.ceil(n * 0.2))
    train_val_count = n - test_count_1
    
    # Second split on remaining: test_size=0.1
    if train_val_count > 0:
        val_count = max(1, math.ceil(train_val_count * 0.1)) if train_val_count > 1 else 0
        train_count = train_val_count - val_count
    else:
        train_count = 0
        val_count = 0
    
    print(f"{name:12} -> train: {train_count}, val: {val_count}, test: {test_count_1}")
    if train_val_count < 1:
        print(f"              ⚠️  First split leaves 0 samples")
    elif val_count == 0:
        print(f"              ⚠️  Second split produces empty validation set")

EOF

Length of output: 431


Handle the single-slide dataset before splitting.

The code does not guard against tiny datasets. With a single slide, train_test_split raises a ValueError because the train side of the split would be empty. With two slides, the second split fails as well: sklearn rounds the held-out fraction up to one sample, leaving an empty train side. The codebase itself references single-slide workflows (line 102), yet main() lacks protection for datasets with fewer than 3 slides.

Guard for tiny datasets
 def main() -> None:
     slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
 
-    slides, test_slides = train_test_split(slides, test_size=0.2)
-    train_slides, val_slides = train_test_split(slides, test_size=0.1)
+    if len(slides) < 3:
+        train_slides, val_slides, test_slides = slides, [], []
+    else:
+        train_val_slides, test_slides = train_test_split(
+            slides, test_size=0.2, random_state=42
+        )
+        train_slides, val_slides = train_test_split(
+            train_val_slides, test_size=0.1, random_state=42
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)

def main() -> None:
    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))

    if len(slides) < 3:
        train_slides, val_slides, test_slides = slides, [], []
    else:
        train_val_slides, test_slides = train_test_split(
            slides, test_size=0.2, random_state=42
        )
        train_slides, val_slides = train_test_split(
            train_val_slides, test_size=0.1, random_state=42
        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/tiler.py` around lines 65 - 66, The two-stage splitting using
train_test_split can fail for tiny datasets; update main() around the slides
split to handle len(slides) < 3: if len(slides) == 1 set train_slides = slides
and val_slides = test_slides = [] (or otherwise ensure no further splitting), if
len(slides) == 2 perform a single split to produce one train and one test (e.g.,
train_slides and test_slides) and set val_slides = [], otherwise keep the
existing two-step splitting (slides -> test_slides and remaining -> train/val).
Use the existing variable names slides, train_slides, val_slides, test_slides
and the train_test_split calls so the rest of the pipeline remains unchanged.
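The failure mode is visible from the split arithmetic alone. A stdlib-only sketch (mirroring the simulation script above, and assuming sklearn's documented rule that the held-out side receives ceil(n * test_size) samples):

```python
import math

def split_sizes(n: int, test_size: float = 0.2, val_size: float = 0.1) -> tuple[int, int, int]:
    # sklearn assigns ceil(n * test_size) samples to the held-out side.
    test = math.ceil(n * test_size)
    rest = n - test              # what the second split receives
    val = math.ceil(rest * val_size)
    train = rest - val
    return train, val, test

for n in (1, 2, 3, 5, 10):
    train, val, test = split_sizes(n)
    note = "  <- empty partition; train_test_split raises here" if 0 in (train, val, test) else ""
    print(f"{n:2d} slides -> train={train}, val={val}, test={test}{note}")
```

With 1 or 2 slides, one partition comes out empty, which is exactly the case sklearn rejects with a ValueError; from 3 slides up, every partition is non-empty.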

Comment on lines +72 to +79
    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)

    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)

    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)

⚠️ Potential issue | 🟠 Major

Create ./data/datasets before the first to_csv.

None of these writes ensures the parent directory exists, so main() will fail on a clean checkout before any CSV is produced.

📦 Ensure the output directory exists once
+    output_dir = Path("./data/datasets")
+    output_dir.mkdir(parents=True, exist_ok=True)
+
-    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
-    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)
+    train_slides_df.to_csv(output_dir / "train_slides.csv", index=False)
+    train_tiles_df.to_csv(output_dir / "train_tiles.csv", index=False)
 
-    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
-    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)
+    val_slides_df.to_csv(output_dir / "val_slides.csv", index=False)
+    val_tiles_df.to_csv(output_dir / "val_tiles.csv", index=False)
 
-    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
-    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)
+    test_slides_df.to_csv(output_dir / "test_slides.csv", index=False)
+    test_tiles_df.to_csv(output_dir / "test_tiles.csv", index=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/tiler.py` around lines 72 - 79, The CSV writes in
preprocessing/tiler.py call train_slides_df.to_csv / train_tiles_df.to_csv /
val_slides_df.to_csv / val_tiles_df.to_csv / test_slides_df.to_csv /
test_tiles_df.to_csv but never ensure the ./data/datasets directory exists; fix
by creating the directory (e.g., using os.makedirs or Path(...).mkdir with
parents=True, exist_ok=True) once before the first to_csv call—add this
directory creation at the start of the block that writes these CSVs (or at the
start of main()) so all six writes succeed on a clean checkout.
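A minimal stdlib illustration of the fix (using the csv module and a temporary directory instead of pandas and the real ./data/datasets path):

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "data" / "datasets"
    # Without this mkdir, the open() below fails with FileNotFoundError
    # on a fresh tree -- the failure the review describes.
    output_dir.mkdir(parents=True, exist_ok=True)

    with open(output_dir / "train_slides.csv", "w", newline="") as f:
        csv.writer(f).writerow(["slide_path"])

    created = (output_dir / "train_slides.csv").exists()

print(created)
```

Because exist_ok=True, the call is idempotent, so it is safe to place once at the top of main() even across repeated runs.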

Comment on lines +100 to +111
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )

⚠️ Potential issue | 🔴 Critical

Run main() from the CLI entry point.

pyproject.toml wires pdm run tiler to this file, but the __main__ block currently skips main(), hard-codes one local slide path, and passes str values into handler(slide_path: Path). As committed, the entry point does not execute the dataset-building flow added above.

🚪 Make the script entry point run the intended workflow
 if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+    main()
📝 Committable suggestion


Suggested change
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )

if __name__ == "__main__":
    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/tiler.py` around lines 100 - 111, The __main__ block currently
bypasses the CLI entry point by hard-coding a str slide path and calling tiling
directly; restore the intended behavior by invoking main() from the __main__
guard (so `pdm run tiler` executes the dataset-building flow), remove the
hard-coded slides variable, and if you must call tiling directly ensure you pass
Path objects to handler (convert the slide strings with pathlib.Path) and still
call save_mlflow_dataset with the produced train_slides_df and train_tiles_df;
update the __main__ block to call main() (or, if keeping a direct call, convert
slide paths to Path and preserve the mlflow/save_mlflow_dataset steps).

Comment on lines +8 to +21
dependencies = [
"lightning>=2.0.0",
"albumentations>=1.4.14",
"torchmetrics>=1.4.14",
"torchvision>=0.17.2",
"torch>=2.2.2",
"tqdm>=4.66.5",
"rationai-mlkit @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git",
"rationai-masks @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git",
"rationai-tiling @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/tiling.git",
"scikit-image>=0.25.2",
"openslide-bin>=4.0.0.6",
"rationai-staining @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/staining.git",
]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Direct imports in the new preprocessing scripts:"
rg -n '^(import (mlflow|pyvips|ray)\b|from sklearn\.model_selection import train_test_split\b)' preprocessing

echo
echo "Matching declared project dependencies:"
rg -n '^\s*"(mlflow|pyvips|ray|scikit-learn)([<>=@ ].*)?"' pyproject.toml || true

echo
echo "UV override section:"
rg -n -A2 -B1 'override-dependencies' pyproject.toml || true

Length of output: 480


Add missing direct dependencies for packages imported by preprocessing scripts.

preprocessing/mask_generator.py imports pyvips and ray. preprocessing/tiler.py imports mlflow, ray, and sklearn. None of these packages is declared in [project.dependencies], leaving the scripts to rely on transitive installs that are not guaranteed to be present. The override-dependencies entry on line 50 does not install mlflow; it only pins its version if the package is already present in the dependency tree.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 8 - 21, Add the missing direct dependencies to
pyproject.toml so preprocessing scripts don't rely on transitive installs: add
pyvips, ray, mlflow, and scikit-learn to the [project.dependencies] list (the
imports referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
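One possible shape for the additions (version floors are deliberately omitted here; real constraints would need to be chosen and tested against the pipeline):

```toml
dependencies = [
    # ...existing entries...
    "pyvips",
    "ray",
    "mlflow",
    "scikit-learn",
]
```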

Comment on lines +34 to +37
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the pyproject.toml file
cat -n pyproject.toml

Length of output: 2023


🏁 Script executed:

# Check for imports of pyvips in the codebase
rg "import pyvips|from pyvips" --type py

Length of output: 117


🏁 Script executed:

# Check for imports of sklearn.model_selection in the codebase
rg "from sklearn\.model_selection|import sklearn\.model_selection" --type py

Length of output: 147


🏁 Script executed:

# Check if there's a current mypy configuration in pyproject.toml
rg "\[tool\.mypy\]" pyproject.toml

Length of output: 54


🏁 Script executed:

# Check for any existing type: ignore comments in the codebase
rg "type: ignore|TYPE_CHECKING" --type py | head -20

Length of output: 54


🏁 Script executed:

# Check if there are .pyi files or py.typed markers for these packages
find . -name "*.pyi" -o -name "py.typed" | head -20

Length of output: 54


🌐 Web query:

pyvips type stubs py.typed mypy

💡 Result:

  • pyvips ships a stub file (pyvips/__init__.pyi) and its README says it’s intended to work as a PEP 561 typed package. [1]
  • For mypy (and other type checkers) to use inline/bundled stubs, the installed runtime package must include an (empty) py.typed marker file in the package directory (PEP 561). [2]
  • If you’re seeing mypy errors like “missing library stubs or py.typed marker” for pyvips, the fix is: add pyvips/py.typed and ensure it’s included in the built distribution (wheel/sdist). Per PEP 561, stub-only distributions named *-stubs don’t need py.typed (alternative approach: publish pyvips-stubs). [2]
  • If you can’t change the installed package, generate your own stubs and point mypy at them (e.g., stubgen + mypy_path). [3]

Sources: [1] [2] [3]


🌐 Web query:

scikit-learn sklearn type stubs mypy import-untyped

💡 Result:

mypy reports [import-untyped] when you import an installed third‑party library that doesn’t ship inline types (py.typed) and you don’t have a PEP 561 stub package installed. [2][3]

Fix for scikit-learn (sklearn)

  1. Install the stub package:

    pip install scikit-learn-stubs

    This publishes the sklearn stubs (from Microsoft’s python-type-stubs) to PyPI. [1]

  2. Re-run mypy.

If you can’t (or don’t want to) add stubs

  • Ignore just sklearn imports:
    [mypy-sklearn.*]
    ignore_missing_imports = True
  • Or tell mypy to follow/analyze untyped imports (may increase noise/time):
    [mypy]
    follow_untyped_imports = True
    (Also available as --follow-untyped-imports.) [4]

Sources: scikit-learn stubs on PyPI [1], mypy docs/changelog on import-untyped behavior [2], PEP 561 on distributing stubs [3], mypy guidance on follow_untyped_imports [4].


mypy_check will fail without mypy configuration.

Both pyvips (from preprocessing/mask_generator.py) and sklearn.model_selection (from preprocessing/tiler.py) are imported but lack type information that mypy can use. Running mypy . without configuration will fail on import-untyped errors. Configure mypy to ignore missing imports for these modules:

Minimal mypy configuration to unblock the check
+[tool.mypy]
+
+[[tool.mypy.overrides]]
+module = ["pyvips", "sklearn.*"]
+ignore_missing_imports = true
📝 Committable suggestion


Suggested change
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."

[tool.mypy]

[[tool.mypy.overrides]]
module = ["pyvips", "sklearn.*"]
ignore_missing_imports = true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 34 - 37, The mypy job (mypy_check in
pyproject.toml) fails due to missing type stubs for pyvips and
sklearn.model_selection; update the mypy configuration to ignore missing imports
for those modules so the check passes: add a mypy config section (either in
pyproject.toml under [tool.mypy] or a standalone mypy.ini) that sets
ignore_missing_imports = True for pyvips and sklearn.model_selection (or use
per-module settings using [tool.mypy-PACKAGE] / [mypy-PACKAGE] for "pyvips" and
"sklearn.model_selection"); this will unblock the mypy run for imports used in
preprocessing/mask_generator.py and preprocessing/tiler.py while keeping the
existing mypy_check = "mypy ." target.

@LAdam-ix LAdam-ix closed this Mar 11, 2026