Conversation
📝 Walkthrough

This PR establishes the foundational infrastructure for a stain normalization project, adding comprehensive project configuration, environment setup, and two new preprocessing scripts to automate parallel whole-slide image processing, tissue masking, and tiling workflows.

Changes
Sequence Diagrams

sequenceDiagram
participant Main as Main Workflow
participant Discover as Slide Discovery
participant Split as Train/Val/Test Split
participant Handler as Ray Remote Handler
participant Source as OpenSlide Tile Source
participant Mask as Tissue Mask Filter
participant CSV as CSV Export
participant MLflow as MLflow Logger
Main->>Discover: Scan SLIDES_PATH for .mrxs files
Discover-->>Main: Return slide paths
Main->>Split: Split into train/val/test
Split-->>Main: Return split groups
Main->>Handler: Process each split (parallel via Ray)
Handler->>Source: Load slide and generate tiles
Source-->>Handler: Return tiles + metadata
Handler->>Mask: Apply tissue mask filter
Mask-->>Handler: Return filtered tiles
Handler-->>Main: Return TiledSlideMetadata
Main->>CSV: Export per-split metadata to CSV
Main->>MLflow: Log train, val, test datasets
MLflow-->>Main: Datasets logged
Main-->>Main: Workflow complete
sequenceDiagram
participant Main as Main Workflow
participant Scan as Slide Scanner
participant Remote as Ray Remote process_slide
participant OpenSlide as OpenSlide
participant Pyvips as Pyvips
participant FileSystem as File System
Main->>Scan: Discover .mrxs slides in SLIDES_PATH
Scan-->>Main: Return slide paths
Main->>Remote: Submit parallel processing jobs
Remote->>OpenSlide: Open slide at specified level
OpenSlide-->>Remote: Return resolution info (xres, yres)
Remote->>Pyvips: Load slide region at level
Pyvips-->>Remote: Return image data
Remote->>Remote: Compute tissue mask
Remote->>FileSystem: Create destination directory
Remote->>FileSystem: Write mask as BigTIFF
FileSystem-->>Remote: File saved
Remote-->>Main: Mask generation complete
Main-->>Main: All slides processed
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request establishes the foundational preprocessing pipeline for whole-slide images, enabling the automated generation of tissue masks and the subsequent tiling of these images into structured datasets. It also sets up robust project dependency management and integrates MLflow for experiment tracking, significantly streamlining the initial data preparation phase for machine learning model development.

Highlights
Changelog
Activity
Code Review
This pull request introduces preprocessing scripts for dataset preparation and sets up the project structure with a pyproject.toml and .gitignore. The scripts are a good foundation, but I've identified several areas for improvement. My feedback focuses on removing hardcoded paths to enhance portability, fixing a variable redefinition bug, improving path manipulation logic for robustness, and cleaning up leftover debugging code. I've also suggested adjustments to pyproject.toml for more flexible dependency management.
```python
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```
The variable TISSUE_MASKS_PATH is defined on line 14 and then immediately redefined on line 15 with a different value. This is likely an error and can lead to unexpected behavior. Please remove the incorrect definition.
```diff
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"
```
```python
def handler(slide_path: Path) -> TiledSlideMetadata:
    slide, tiles = source(slide_path)

    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")
```
Using string slicing [:-5] to change the file extension is not robust as it assumes all slide files have a 4-character extension (e.g., .mrxs). A more reliable approach is to use pathlib's methods to handle file stems and suffixes.
```diff
-    tissue_mask_path = Path(TISSUE_MASKS_PATH, slide_path.name[:-5] + ".tiff")
+    tissue_mask_path = Path(TISSUE_MASKS_PATH, f"{slide_path.stem}.tiff")
```
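The fragility of the slicing approach is easy to demonstrate in isolation (the file names below are made up for illustration):

```python
from pathlib import Path

# Slicing assumes a 4-character extension: fine for ".mrxs", wrong otherwise.
mrxs = Path("slides/case_01.mrxs")
svs = Path("slides/case_02.svs")

print(mrxs.name[:-5] + ".tiff")  # "case_01.tiff" — happens to work
print(svs.name[:-5] + ".tiff")   # "case_0.tiff"  — silently drops a character

# with_suffix handles any extension length.
print(mrxs.with_suffix(".tiff").name)  # "case_01.tiff"
print(svs.with_suffix(".tiff").name)   # "case_02.tiff"
```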
```python
SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"
MASK_DEST = "./mask/tissue_masks"
LEVEL = 3
```

```python
    slide = pyvips.Image.new_from_file(slide_path, level=LEVEL)
    mask = tissue_mask(slide, xres)
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
```
```python
def main() -> None:
    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
    # process_slide(slides[0])
```

```python
from sklearn.model_selection import train_test_split
```

```python
SLIDES_PATH = "/mnt/data/MOU/prostate/tile_level_annotations/"
```
The script uses several hardcoded paths, such as SLIDES_PATH, TISSUE_MASKS_PATH (line 15), and the output CSV paths (lines 72-79). This reduces portability and makes the script harder to use in different environments or with different datasets. Consider using command-line arguments (e.g., with argparse) to make these paths configurable.
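A minimal sketch of the suggested argparse approach — the flag names and defaults below are illustrative assumptions, not part of the PR:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    # Illustrative CLI: flag names and defaults are assumptions, not the PR's API.
    parser = argparse.ArgumentParser(description="Tile slides and export datasets.")
    parser.add_argument("--slides-path", type=Path, default=Path("./slides"),
                        help="Directory scanned recursively for .mrxs slides.")
    parser.add_argument("--masks-path", type=Path, default=Path("./mask/tissue_masks"),
                        help="Directory containing precomputed tissue masks.")
    parser.add_argument("--output-dir", type=Path, default=Path("./data/datasets"),
                        help="Where the per-split CSV files are written.")
    return parser.parse_args(argv)

args = parse_args(["--slides-path", "/mnt/data/slides"])
print(args.slides_path)  # /mnt/data/slides
```

main() would then read `args.slides_path` and friends instead of module-level constants.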
```python
    val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
    test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)
```
The train_test_split function from sklearn already returns lists, so the explicit calls to list() on val_slides and test_slides are redundant. Removing them will make the code cleaner.
```diff
-    val_slides_df, val_tiles_df = tiling(slides=list(val_slides), handler=handler)
-    test_slides_df, test_tiles_df = tiling(slides=list(test_slides), handler=handler)
+    val_slides_df, val_tiles_df = tiling(slides=val_slides, handler=handler)
+    test_slides_df, test_tiles_df = tiling(slides=test_slides, handler=handler)
```
```python
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )
```
The if __name__ == "__main__": block contains commented-out code and hardcoded paths for debugging. This should be cleaned up to only call the main() function. Any testing or debugging logic should be moved to a separate script or function.
```diff
 if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+    main()
```
```toml
name = "Stain Normalization"
version = "0.1.0"
authors = [{name = "Adam Lopatka"}]
requires-python = "==3.12.5"
```
Pinning the Python version to an exact patch release (==3.12.5) can make it difficult to set up the project environment. It's generally better to allow for a range of compatible versions, for example >=3.12, to improve flexibility.
```diff
-requires-python = "==3.12.5"
+requires-python = ">=3.12"
```
Force-pushed c6c3e65 to 7d68d80 (Compare)
Force-pushed 7d68d80 to 89c3d5a (Compare)
Actionable comments posted: 7
🧹 Nitpick comments (1)
.gitignore (1)
181-183: Ignore the generated mask TIFFs too.
`preprocessing/mask_generator.py` writes to `./mask/tissue_masks`, but that output tree is still tracked here. Running the new preprocessing step will leave large generated artifacts in git status.

🧹 Small ignore update
```diff
 # Project
 /data
+/mask/tissue_masks/
 /demo_*
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 181 - 183, Add an ignore rule for the generated mask TIFFs so git stops tracking the output from preprocessing/mask_generator.py: update .gitignore to exclude the mask output directory (e.g., mask/tissue_masks/) and TIFF files (e.g., *.tif and/or *.tiff) so that files created under ./mask/tissue_masks by the mask_generator are not shown in git status.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preprocessing/mask_generator.py`:
- Around line 32-34: The mask filename is not unique because mask_path is built
from Path(slide_path).stem; change mask generation to preserve the slide's
source-relative path under MASK_DEST instead of using only the stem: compute the
slide's relative path (e.g., rel = Path(slide_path).relative_to(SOURCE_ROOT) or
equivalent), replace its suffix with ".tiff" and join with MASK_DEST to form
mask_path, ensure mask_path.parent.mkdir(...), and call write_big_tiff(mask,
path=mask_path, mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to
resolve masks using the same source-relative path scheme so mask lookup and
slide pairing remain consistent.
In `@preprocessing/tiler.py`:
- Around line 100-111: The __main__ block currently bypasses the CLI entry point
by hard-coding a str slide path and calling tiling directly; restore the
intended behavior by invoking main() from the __main__ guard (so `pdm run tiler`
executes the dataset-building flow), remove the hard-coded slides variable, and
if you must call tiling directly ensure you pass Path objects to handler
(convert the slide strings with pathlib.Path) and still call save_mlflow_dataset
with the produced train_slides_df and train_tiles_df; update the __main__ block
to call main() (or, if keeping a direct call, convert slide paths to Path and
preserve the mlflow/save_mlflow_dataset steps).
- Around line 14-15: TISSUE_MASKS_PATH in preprocessing/tiler.py is being
overridden with a developer-local path; remove the hard-coded override so the
module uses the original TISSUE_MASKS_PATH definition (or alternatively read
from an environment/config variable) to keep producer
(preprocessing/mask_generator.py) and consumer paths aligned; update references
to the TISSUE_MASKS_PATH symbol accordingly and ensure no second assignment
replaces the intended shared path.
- Around line 72-79: The CSV writes in preprocessing/tiler.py call
train_slides_df.to_csv / train_tiles_df.to_csv / val_slides_df.to_csv /
val_tiles_df.to_csv / test_slides_df.to_csv / test_tiles_df.to_csv but never
ensure the ./data/datasets directory exists; fix by creating the directory
(e.g., using os.makedirs or Path(...).mkdir with parents=True, exist_ok=True)
once before the first to_csv call—add this directory creation at the start of
the block that writes these CSVs (or at the start of main()) so all six writes
succeed on a clean checkout.
- Around line 65-66: The two-stage splitting using train_test_split can fail for
tiny datasets; update main() around the slides split to handle len(slides) < 3:
if len(slides) == 1 set train_slides = slides and val_slides = test_slides = []
(or otherwise ensure no further splitting), if len(slides) == 2 perform a single
split to produce one train and one test (e.g., train_slides and test_slides) and
set val_slides = [], otherwise keep the existing two-step splitting (slides ->
test_slides and remaining -> train/val). Use the existing variable names slides,
train_slides, val_slides, test_slides and the train_test_split calls so the rest
of the pipeline remains unchanged.
In `@pyproject.toml`:
- Around line 8-21: Add the missing direct dependencies to pyproject.toml so
preprocessing scripts don't rely on transitive installs: add pyvips, ray,
mlflow, and scikit-learn to the [project.dependencies] list (the imports
referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
- Around line 34-37: The mypy job (mypy_check in pyproject.toml) fails due to
missing type stubs for pyvips and sklearn.model_selection; update the mypy
configuration to ignore missing imports for those modules so the check passes:
add a mypy config section (either in pyproject.toml under [tool.mypy] or a
standalone mypy.ini) that sets ignore_missing_imports = True for pyvips and
sklearn.model_selection (or use per-module settings using [tool.mypy-PACKAGE] /
[mypy-PACKAGE] for "pyvips" and "sklearn.model_selection"); this will unblock
the mypy run for imports used in preprocessing/mask_generator.py and
preprocessing/tiler.py while keeping the existing mypy_check = "mypy ." target.
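Both pyproject.toml findings above can be sketched together. The version floors and the exact override-table placement below are illustrative assumptions, not values taken from the PR:

```toml
[project]
dependencies = [
    "pyvips>=2.2",        # assumed version floors; adjust to the project's needs
    "ray>=2.9",
    "mlflow>=2.10",
    "scikit-learn>=1.4",
]

# Per-module mypy override: silences missing-stub errors only for these imports.
[[tool.mypy.overrides]]
module = ["pyvips", "pyvips.*", "sklearn.*"]
ignore_missing_imports = true
```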
---
Nitpick comments:
In @.gitignore:
- Around line 181-183: Add an ignore rule for the generated mask TIFFs so git
stops tracking the output from preprocessing/mask_generator.py: update
.gitignore to exclude the mask output directory (e.g., mask/tissue_masks/) and
TIFF files (e.g., *.tif and/or *.tiff) so that files created under
./mask/tissue_masks by the mask_generator are not shown in git status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cf66e25c-e3bc-4247-8696-b54dbeec00f2
⛔ Files ignored due to path filters (2)
`pdm.lock` is excluded by `!**/*.lock`
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
`.gitignore`, `preprocessing/mask_generator.py`, `preprocessing/tiler.py`, `pyproject.toml`
```python
    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
    mask_path.parent.mkdir(exist_ok=True, parents=True)
    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
```
Keep mask filenames unique across the recursive slide tree.
Line 32 uses only stem, so two slides with the same basename in different folders will overwrite the same mask. preprocessing/tiler.py later resolves masks using the same flattened name, so the wrong mask can be paired back to a slide.
🗂️ Preserve the source-relative path in the mask output
```diff
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)
```

Mirror the same relative-path scheme in `preprocessing/tiler.py` when resolving masks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    mask_path = Path(MASK_DEST, f"{Path(slide_path).stem}.tiff")
-    mask_path.parent.mkdir(exist_ok=True, parents=True)
-    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
+    relative_mask_path = slide_path.relative_to(Path(SLIDES_PATH)).with_suffix(".tiff")
+    mask_path = Path(MASK_DEST, relative_mask_path)
+    mask_path.parent.mkdir(exist_ok=True, parents=True)
+    write_big_tiff(mask, path=mask_path, mpp_x=xres, mpp_y=yres)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/mask_generator.py` around lines 32 - 34, The mask filename is
not unique because mask_path is built from Path(slide_path).stem; change mask
generation to preserve the slide's source-relative path under MASK_DEST instead
of using only the stem: compute the slide's relative path (e.g., rel =
Path(slide_path).relative_to(SOURCE_ROOT) or equivalent), replace its suffix
with ".tiff" and join with MASK_DEST to form mask_path, ensure
mask_path.parent.mkdir(...), and call write_big_tiff(mask, path=mask_path,
mpp_x=xres, mpp_y=yres); also update preprocessing/tiler.py to resolve masks
using the same source-relative path scheme so mask lookup and slide pairing
remain consistent.
```python
TISSUE_MASKS_PATH = "./mask/tissue_masks"
TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```
Drop the developer-local mask path override.
Line 14 matches the output of preprocessing/mask_generator.py, but Line 15 immediately replaces it with /home/jovyan/staining/demo_data/masks. The main workflow will ignore freshly generated masks and only work on one workstation.
📁 Keep the producer/consumer paths aligned
```diff
 TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
```

📝 Committable suggestion
```diff
-TISSUE_MASKS_PATH = "./mask/tissue_masks"
-TISSUE_MASKS_PATH = "/home/jovyan/staining/demo_data/masks"
+TISSUE_MASKS_PATH = "./mask/tissue_masks"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 14 - 15, TISSUE_MASKS_PATH in
preprocessing/tiler.py is being overridden with a developer-local path; remove
the hard-coded override so the module uses the original TISSUE_MASKS_PATH
definition (or alternatively read from an environment/config variable) to keep
producer (preprocessing/mask_generator.py) and consumer paths aligned; update
references to the TISSUE_MASKS_PATH symbol accordingly and ensure no second
assignment replaces the intended shared path.
```python
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
```
🧩 Analysis chain
🏁 Script executed:
cd . && find . -type f -name "tiler.py" | head -20

Repository: RationAI/stain-normalization
Length of output: 96
🏁 Script executed:
cd . && git ls-files | grep -i tiler

Repository: RationAI/stain-normalization
Length of output: 94
🏁 Script executed:
cd . && git ls-files | grep -i preprocessing

Repository: RationAI/stain-normalization
Length of output: 126
🏁 Script executed:
cd . && git ls-files | head -30

Repository: RationAI/stain-normalization
Length of output: 221
🏁 Script executed:
cd . && git log --oneline -20

Repository: RationAI/stain-normalization
Length of output: 103
🏁 Script executed:
cd . && find . -name "*.py" -path "*/preprocessing*" -o -name "*tiler*" 2>/dev/null | head -20

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && ls -la

Repository: RationAI/stain-normalization
Length of output: 796
🏁 Script executed:
cd . && find . -type f -name "*.py" | grep -E "(tiler|preprocessing)" | head -20

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && find . -type f -name "*.py" | head -30

Repository: RationAI/stain-normalization
Length of output: 130
🏁 Script executed:
cd . && tree -L 2 -I '__pycache__|*.egg-info|.git' 2>/dev/null || find . -maxdepth 2 -type d | head -30

Repository: RationAI/stain-normalization
Length of output: 242
🏁 Script executed:
cd . && ls -R | grep -E "(tiler|preprocessing)" | head -20

Repository: RationAI/stain-normalization
Length of output: 111
🏁 Script executed:
cd . && git show --stat | head -50

Repository: RationAI/stain-normalization
Length of output: 729
🏁 Script executed:
cd . && rg "train_test_split" -l

Repository: RationAI/stain-normalization
Length of output: 94
🏁 Script executed:
cd . && rg "def main\(\)" -A 10 | head -50

Repository: RationAI/stain-normalization
Length of output: 1322
🏁 Script executed:
cd . && git ls-files 2>/dev/null | head -50

Repository: RationAI/stain-normalization
Length of output: 221
🏁 Script executed:
cd . && pwd && ls -la | head -20

Repository: RationAI/stain-normalization
Length of output: 815
🏁 Script executed:
python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test 1: Single slide (1 element)
print("Test 1: Single slide dataset")
try:
    slides = ["slide1.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 2: Two slides
print("\nTest 2: Two slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 3: Three slides
print("\nTest 3: Three slides dataset")
try:
    slides = ["slide1.mrxs", "slide2.mrxs", "slide3.mrxs"]
    result = train_test_split(slides, test_size=0.2)
    print(f"  Result: {result}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Test 4: Five slides (test first split then second)
print("\nTest 4: Five slides with sequential splits")
try:
    slides = ["s1", "s2", "s3", "s4", "s5"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - train_val: {slides}, test: {test_slides}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  After second split - train: {train_slides}, val: {val_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")
EOF

Repository: RationAI/stain-normalization
Length of output: 191
🏁 Script executed:
python3 << 'EOF'
from sklearn.model_selection import train_test_split

# Test exact scenario from review with edge cases
print("Simulating the review's exact code pattern:\n")

# Scenario 1: Single slide
print("Scenario 1: Single slide")
try:
    slides = ["only_slide.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 2: Two slides
print("\nScenario 2: Two slides")
try:
    slides = ["s1.mrxs", "s2.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {train_slides}, val: {val_slides}, test: {test_slides}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")

# Scenario 3: Three slides
print("\nScenario 3: Three slides")
try:
    slides = ["s1.mrxs", "s2.mrxs", "s3.mrxs"]
    slides, test_slides = train_test_split(slides, test_size=0.2)
    print(f"  After first split - remaining slides: {len(slides)}, test: {len(test_slides)}")
    train_slides, val_slides = train_test_split(slides, test_size=0.1)
    print(f"  Success - train: {len(train_slides)}, val: {len(val_slides)}, test: {len(test_slides)}")
except Exception as e:
    print(f"  Error: {type(e).__name__}: {e}")
EOF

Repository: RationAI/stain-normalization
Length of output: 191
🏁 Script executed:
cat -n preprocessing/tiler.py

Repository: RationAI/stain-normalization
Length of output: 4589
🏁 Script executed:
cat README.md

Repository: RationAI/stain-normalization
Length of output: 93
🏁 Script executed:
git log --all --oneline | head -20

Repository: RationAI/stain-normalization
Length of output: 181
🏁 Script executed:
git show 89c3d5a --no-patch --format=fuller

Repository: RationAI/stain-normalization
Length of output: 328
🏁 Script executed:
cat pyproject.toml

Repository: RationAI/stain-normalization
Length of output: 1673
🏁 Script executed:
rg "len\(slides\)" preprocessing/

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
cat -n preprocessing/tiler.py | head -80

Repository: RationAI/stain-normalization
Length of output: 3237
🏁 Script executed:
python3 << 'EOF'
# Simulate train_test_split behavior without sklearn
# For test_size=0.2: ceil(n * 0.2) samples go to test
# For test_size=0.1: ceil(n * 0.1) samples go to validation
import math

scenarios = [
    ("1 slide", 1),
    ("2 slides", 2),
    ("3 slides", 3),
    ("4 slides", 4),
    ("5 slides", 5),
    ("10 slides", 10),
]

for name, n in scenarios:
    # First split: test_size=0.2
    test_count_1 = max(1, math.ceil(n * 0.2))
    train_val_count = n - test_count_1
    # Second split on remaining: test_size=0.1
    if train_val_count > 0:
        val_count = max(1, math.ceil(train_val_count * 0.1)) if train_val_count > 1 else 0
        train_count = train_val_count - val_count
    else:
        train_count = 0
        val_count = 0
    print(f"{name:12} -> train: {train_count}, val: {val_count}, test: {test_count_1}")
    if train_val_count < 1:
        print(f"  ⚠️ First split leaves 0 samples")
    elif val_count == 0:
        print(f"  ⚠️ Second split produces empty validation set")
EOF

Repository: RationAI/stain-normalization
Length of output: 431
Handle the single-slide dataset before splitting.
The code does not guard against tiny datasets. With a single slide, train_test_split raises a ValueError. With 2 slides, the second split produces an empty validation set. The codebase itself references single-slide workflows (line 102), but the main() function lacks protection for datasets with fewer than 3 slides.
Guard for tiny datasets
```diff
 def main() -> None:
     slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
-    slides, test_slides = train_test_split(slides, test_size=0.2)
-    train_slides, val_slides = train_test_split(slides, test_size=0.1)
+    if len(slides) < 3:
+        train_slides, val_slides, test_slides = slides, [], []
+    else:
+        train_val_slides, test_slides = train_test_split(
+            slides, test_size=0.2, random_state=42
+        )
+        train_slides, val_slides = train_test_split(
+            train_val_slides, test_size=0.1, random_state=42
+        )
```

📝 Committable suggestion
```diff
-    slides, test_slides = train_test_split(slides, test_size=0.2)
-    train_slides, val_slides = train_test_split(slides, test_size=0.1)
+def main() -> None:
+    slides = list(Path(SLIDES_PATH).rglob("*.mrxs"))
+    if len(slides) < 3:
+        train_slides, val_slides, test_slides = slides, [], []
+    else:
+        train_val_slides, test_slides = train_test_split(
+            slides, test_size=0.2, random_state=42
+        )
+        train_slides, val_slides = train_test_split(
+            train_val_slides, test_size=0.1, random_state=42
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 65 - 66, The two-stage splitting using
train_test_split can fail for tiny datasets; update main() around the slides
split to handle len(slides) < 3: if len(slides) == 1 set train_slides = slides
and val_slides = test_slides = [] (or otherwise ensure no further splitting), if
len(slides) == 2 perform a single split to produce one train and one test (e.g.,
train_slides and test_slides) and set val_slides = [], otherwise keep the
existing two-step splitting (slides -> test_slides and remaining -> train/val).
Use the existing variable names slides, train_slides, val_slides, test_slides
and the train_test_split calls so the rest of the pipeline remains unchanged.
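The guard's branching can be illustrated independently of sklearn — the `simple_split` helper below is a deterministic stand-in for `train_test_split` (no shuffling, no randomness), used only to make the edge cases visible:

```python
def simple_split(items, test_size):
    # Stand-in for sklearn's train_test_split: deterministic tail split,
    # kept only to demonstrate the guard's branching.
    n_test = max(1, round(len(items) * test_size))
    return items[:-n_test], items[-n_test:]

def split_slides(slides):
    # Guard tiny datasets before attempting the two-stage split.
    if len(slides) == 1:
        return slides, [], []          # train only
    if len(slides) == 2:
        train, test = simple_split(slides, 0.5)
        return train, [], test         # no validation set
    rest, test = simple_split(slides, 0.2)
    train, val = simple_split(rest, 0.1)
    return train, val, test

print(split_slides(["a"]))       # (['a'], [], [])
print(split_slides(["a", "b"]))  # (['a'], [], ['b'])
```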
```python
    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)

    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)

    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)
```
Create ./data/datasets before the first to_csv.
None of these writes ensures the parent directory exists, so main() will fail on a clean checkout before any CSV is produced.
📦 Ensure the output directory exists once
```diff
+    output_dir = Path("./data/datasets")
+    output_dir.mkdir(parents=True, exist_ok=True)
+
-    train_slides_df.to_csv("./data/datasets/train_slides.csv", index=False)
-    train_tiles_df.to_csv("./data/datasets/train_tiles.csv", index=False)
+    train_slides_df.to_csv(output_dir / "train_slides.csv", index=False)
+    train_tiles_df.to_csv(output_dir / "train_tiles.csv", index=False)
-    val_slides_df.to_csv("./data/datasets/val_slides.csv", index=False)
-    val_tiles_df.to_csv("./data/datasets/val_tiles.csv", index=False)
+    val_slides_df.to_csv(output_dir / "val_slides.csv", index=False)
+    val_tiles_df.to_csv(output_dir / "val_tiles.csv", index=False)
-    test_slides_df.to_csv("./data/datasets/test_slides.csv", index=False)
-    test_tiles_df.to_csv("./data/datasets/test_tiles.csv", index=False)
+    test_slides_df.to_csv(output_dir / "test_slides.csv", index=False)
+    test_tiles_df.to_csv(output_dir / "test_tiles.csv", index=False)
```

🤖 Prompt for AI Agents
+ test_tiles_df.to_csv(output_dir / "test_tiles.csv", index=False)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 72 - 79, The CSV writes in
preprocessing/tiler.py call train_slides_df.to_csv / train_tiles_df.to_csv /
val_slides_df.to_csv / val_tiles_df.to_csv / test_slides_df.to_csv /
test_tiles_df.to_csv but never ensure the ./data/datasets directory exists; fix
by creating the directory (e.g., using os.makedirs or Path(...).mkdir with
parents=True, exist_ok=True) once before the first to_csv call—add this
directory creation at the start of the block that writes these CSVs (or at the
start of main()) so all six writes succeed on a clean checkout.
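The idempotent directory creation the comment asks for is a one-liner with pathlib; the sketch below uses a throwaway temp directory and the stdlib csv module (rather than pandas) purely to keep the demo self-contained:

```python
import csv
import tempfile
from pathlib import Path

# Simulate a clean checkout with a throwaway root directory.
root = Path(tempfile.mkdtemp())
output_dir = root / "data" / "datasets"

# parents=True creates intermediate dirs; exist_ok=True makes reruns safe.
output_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

# A write that would otherwise fail with FileNotFoundError now succeeds.
with open(output_dir / "train_slides.csv", "w", newline="") as f:
    csv.writer(f).writerow(["slide_path", "split"])

print((output_dir / "train_slides.csv").exists())  # True
```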
```python
if __name__ == "__main__":
    # main()
    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

    mlflow.set_experiment(experiment_name="Stain-Normalization")
    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
        save_mlflow_dataset(
            slides=train_slides_df,
            tiles=train_tiles_df,
            dataset_name="P-2016_0077-08-1_hed",
        )
```
Run main() from the CLI entry point.
pyproject.toml wires pdm run tiler to this file, but the __main__ block currently skips main(), hard-codes one local slide path, and passes str values into handler(slide_path: Path). As committed, the entry point does not execute the dataset-building flow added above.
🚪 Make the script entry point run the intended workflow
```diff
 if __name__ == "__main__":
-    # main()
-    slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-    train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-
-    mlflow.set_experiment(experiment_name="Stain-Normalization")
-    with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-        save_mlflow_dataset(
-            slides=train_slides_df,
-            tiles=train_tiles_df,
-            dataset_name="P-2016_0077-08-1_hed",
-        )
+    main()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if __name__ == "__main__":
    main()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@preprocessing/tiler.py` around lines 100 - 111, The __main__ block currently
bypasses the CLI entry point by hard-coding a str slide path and calling tiling
directly; restore the intended behavior by invoking main() from the __main__
guard (so `pdm run tiler` executes the dataset-building flow), remove the
hard-coded slides variable, and if you must call tiling directly ensure you pass
Path objects to handler (convert the slide strings with pathlib.Path) and still
call save_mlflow_dataset with the produced train_slides_df and train_tiles_df;
update the __main__ block to call main() (or, if keeping a direct call, convert
slide paths to Path and preserve the mlflow/save_mlflow_dataset steps).
```toml
dependencies = [
    "lightning>=2.0.0",
    "albumentations>=1.4.14",
    "torchmetrics>=1.4.14",
    "torchvision>=0.17.2",
    "torch>=2.2.2",
    "tqdm>=4.66.5",
    "rationai-mlkit @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git",
    "rationai-masks @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git",
    "rationai-tiling @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/tiling.git",
    "scikit-image>=0.25.2",
    "openslide-bin>=4.0.0.6",
    "rationai-staining @ git+https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/staining.git",
]
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Direct imports in the new preprocessing scripts:"
rg -n '^(import (mlflow|pyvips|ray)\b|from sklearn\.model_selection import train_test_split\b)' preprocessing
echo
echo "Matching declared project dependencies:"
rg -n '^\s*"(mlflow|pyvips|ray|scikit-learn)([<>=@ ].*)?"' pyproject.toml || true
echo
echo "UV override section:"
rg -n -A2 -B1 'override-dependencies' pyproject.toml || true
```

Repository: RationAI/stain-normalization
Length of output: 480
Add missing direct dependencies for packages imported by preprocessing scripts.
`preprocessing/mask_generator.py` imports `pyvips` and `ray`. `preprocessing/tiler.py` imports `mlflow`, `ray`, and `sklearn`. None of these packages are declared in `[project.dependencies]`, making the scripts depend on transitive dependencies that may not be guaranteed. The `override-dependencies` entry on line 50 does not install `mlflow`; it only pins its version if already present in the dependency tree.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 8 - 21, Add the missing direct dependencies to
pyproject.toml so preprocessing scripts don't rely on transitive installs: add
pyvips, ray, mlflow, and scikit-learn to the [project.dependencies] list (the
imports referenced are in preprocessing/mask_generator.py -> pyvips, ray and
preprocessing/tiler.py -> mlflow, ray, sklearn); do not rely on
override-dependencies (which only pins existing packages), ensure exact package
names (pyvips, ray, mlflow, scikit-learn) and appropriate version constraints
are included in the dependencies array.
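A hedged sketch of the suggested additions; the version floors below are assumptions for illustration, not taken from the repository's lockfile:

```toml
# Illustrative additions to [project.dependencies] in pyproject.toml.
# Version constraints are placeholder assumptions.
dependencies = [
    # ...existing entries...
    "pyvips>=2.2",        # preprocessing/mask_generator.py
    "ray>=2.9",           # both preprocessing scripts
    "mlflow>=2.0",        # preprocessing/tiler.py
    "scikit-learn>=1.4",  # sklearn.model_selection in preprocessing/tiler.py
]
```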
```toml
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's check the pyproject.toml file
cat -n pyproject.toml
```

Repository: RationAI/stain-normalization
Length of output: 2023
🏁 Script executed:
```bash
# Check for imports of pyvips in the codebase
rg "import pyvips|from pyvips" --type py
```

Repository: RationAI/stain-normalization
Length of output: 117
🏁 Script executed:
```bash
# Check for imports of sklearn.model_selection in the codebase
rg "from sklearn\.model_selection|import sklearn\.model_selection" --type py
```

Repository: RationAI/stain-normalization
Length of output: 147
🏁 Script executed:
```bash
# Check if there's a current mypy configuration in pyproject.toml
rg "\[tool\.mypy\]" pyproject.toml
```

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
```bash
# Check for any existing type: ignore comments in the codebase
rg "type: ignore|TYPE_CHECKING" --type py | head -20
```

Repository: RationAI/stain-normalization
Length of output: 54
🏁 Script executed:
```bash
# Check if there are .pyi files or py.typed markers for these packages
find . -name "*.pyi" -o -name "py.typed" | head -20
```

Repository: RationAI/stain-normalization
Length of output: 54
🌐 Web query:
pyvips type stubs py.typed mypy
💡 Result:
- `pyvips` ships a stub file (`pyvips/__init__.pyi`) and its README says it is intended to work as a PEP 561 typed package. [1]
- For mypy (and other type checkers) to use inline/bundled stubs, the installed runtime package must include an (empty) `py.typed` marker file in the package directory (PEP 561). [2]
- If you're seeing mypy errors like "missing library stubs or py.typed marker" for `pyvips`, the fix is: add `pyvips/py.typed` and ensure it's included in the built distribution (wheel/sdist). Per PEP 561, stub-only distributions named `*-stubs` don't need `py.typed` (alternative approach: publish `pyvips-stubs`). [2]
- If you can't change the installed package, generate your own stubs and point mypy at them (e.g., `stubgen` + `mypy_path`). [3]

Sources: [1] [2] [3]
🌐 Web query:
scikit-learn sklearn type stubs mypy import-untyped
💡 Result:
mypy reports `[import-untyped]` when you import an installed third-party library that doesn't ship inline types (`py.typed`) and you don't have a PEP 561 stub package installed. [2][3]

Fix for scikit-learn (sklearn)

- Install the stub package: `pip install scikit-learn-stubs`. This publishes the `sklearn` stubs (from Microsoft's python-type-stubs) to PyPI. [1]
- Re-run `mypy`.

If you can't (or don't want to) add stubs

- Ignore just `sklearn` imports:

  ```ini
  [mypy-sklearn.*]
  ignore_missing_imports = True
  ```

- Or tell mypy to follow/analyze untyped imports (may increase noise/time):

  ```ini
  [mypy]
  follow_untyped_imports = True
  ```

  (Also available as `--follow-untyped-imports`.) [4]

Sources: scikit-learn stubs on PyPI [1], mypy docs/changelog on import-untyped behavior [2], PEP 561 on distributing stubs [3], mypy guidance on follow_untyped_imports [4].
mypy_check will fail without mypy configuration.
Both `pyvips` (from `preprocessing/mask_generator.py`) and `sklearn.model_selection` (from `preprocessing/tiler.py`) are imported but lack type information that mypy can use. Running `mypy .` without configuration will fail on `import-untyped` errors. Configure mypy to ignore missing imports for these modules:
Minimal mypy configuration to unblock the check
```diff
+[tool.mypy]
+
+[[tool.mypy.overrides]]
+module = ["pyvips", "sklearn.*"]
+ignore_missing_imports = true
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```toml
l = { composite = ["lint", "format", "mypy_check"] }
lint = "ruff check --fix"
format = "ruff format"
mypy_check = "mypy ."

[tool.mypy]

[[tool.mypy.overrides]]
module = ["pyvips", "sklearn.*"]
ignore_missing_imports = true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 34 - 37, The mypy job (mypy_check in
pyproject.toml) fails due to missing type stubs for pyvips and
sklearn.model_selection; update the mypy configuration to ignore missing imports
for those modules so the check passes: add a mypy config section (either in
pyproject.toml under [tool.mypy] or a standalone mypy.ini) that sets
ignore_missing_imports = True for pyvips and sklearn.model_selection (or use
per-module settings using [tool.mypy-PACKAGE] / [mypy-PACKAGE] for "pyvips" and
"sklearn.model_selection"); this will unblock the mypy run for imports used in
preprocessing/mask_generator.py and preprocessing/tiler.py while keeping the
existing mypy_check = "mypy ." target.
Just usage of the RationAI libs to prepare an initial dataset for training; after the data reorganization this will use the official dataset. I have it set up from when I was creating a single-slide dataset for testing (the Gemini problems).
Summary by CodeRabbit
New Features
Chores