Merged
79 commits
6e1d4b9
chore: replace pdm with uv
Adames4 Jan 25, 2026
d7a1ef3
feat: configs
Adames4 Jan 25, 2026
34abac4
feat: dataset creation
Adames4 Jan 25, 2026
80eca85
fix: configs
Adames4 Jan 25, 2026
97335ef
fix: configs
Adames4 Jan 25, 2026
727c729
feat: add scripts
Adames4 Jan 25, 2026
fa9f581
fix: invalid job name
Adames4 Jan 25, 2026
fd978b3
fix: set slide_id as index and ensure nancy is an integer
Adames4 Jan 25, 2026
2869916
chore: dependencies
Adames4 Jan 25, 2026
725e17d
feat: processed data
Adames4 Jan 25, 2026
f7026b3
feat: tissue masks
Adames4 Jan 25, 2026
ed49c42
feat: script
Adames4 Jan 25, 2026
87113e1
feat: script
Adames4 Jan 25, 2026
de98fef
fix: job name
Adames4 Jan 25, 2026
929d298
fix: exclude .git folder from ray env
Adames4 Jan 25, 2026
81d4df3
fix: exclude .venv folder from ray env
Adames4 Jan 25, 2026
3f0321f
fix: typo
Adames4 Jan 25, 2026
2ecde13
feat: job script
Adames4 Jan 25, 2026
6cae5d0
feat: config
Adames4 Jan 25, 2026
616ed48
feat: job script
Adames4 Jan 25, 2026
700f916
chore: dependencies
Adames4 Jan 25, 2026
65e86db
feat: quality control
Adames4 Jan 25, 2026
59965a4
feat: add dataset configuration files for ftn, ikem, and knl_patos
Adames4 Jan 25, 2026
234d611
fix: output dir
Adames4 Jan 25, 2026
4930434
fix: typo
Adames4 Jan 25, 2026
180afe7
chore: Merge branch 'feature/tissue-masks' into feature/tiling
Adames4 Jan 26, 2026
456d4ae
chore: add tiling libs
Adames4 Jan 27, 2026
e88bfd8
fix: naming
Adames4 Jan 27, 2026
b91d36d
feat: confs
Adames4 Jan 27, 2026
413db4f
feat: tiling
Adames4 Jan 27, 2026
f5b4045
fix: confs
Adames4 Jan 27, 2026
2a578d1
fix: typo
Adames4 Jan 27, 2026
9262f3e
fix: typo
Adames4 Jan 27, 2026
b852be9
fix: typo
Adames4 Jan 27, 2026
bd8ca9f
fix: dataset index
Adames4 Jan 27, 2026
08d3320
feat: update tiling to latest ratiopath
Adames4 Jan 27, 2026
3ab6d5a
fix: use tile overlay overlap as udfexpr
Adames4 Jan 27, 2026
06e957e
chore: ratiopath from github
Adames4 Jan 28, 2026
de0fd75
fix: WIP
Adames4 Jan 28, 2026
e56216b
fix: None in overlap
Adames4 Jan 28, 2026
95cc81c
feat: confs
Adames4 Jan 28, 2026
754698f
chore: dependencies
Adames4 Jan 28, 2026
178f294
feat: tile = stride
Adames4 Jan 31, 2026
e4d2499
fix: typo
Adames4 Jan 31, 2026
acfbf19
fix: glob over changing dir
Adames4 Jan 31, 2026
659f505
fix: finish the run
Adames4 Jan 31, 2026
45ac3b3
fix: rever last commit
Adames4 Jan 31, 2026
3c7a5fc
feat: conf
Adames4 Jan 31, 2026
774f24a
fix: splits
Adames4 Jan 31, 2026
076ba93
feat: tweaking resources
Adames4 Jan 31, 2026
2788656
feat: confs
Adames4 Jan 31, 2026
25bec6e
chore: dependecies
Adames4 Feb 1, 2026
5faa224
fix: add paths
Adames4 Feb 1, 2026
eb96f56
fix: conf
Adames4 Feb 5, 2026
89a4582
fix: typo
Adames4 Feb 5, 2026
18c556d
chore: dependencies
Adames4 Feb 5, 2026
35cfb5c
fix: group splitting
Adames4 Feb 6, 2026
bd0c6ac
fix: PR comments
Adames4 Feb 11, 2026
9ffea96
chore: dependencies
Adames4 Feb 11, 2026
0db7d08
chore: Merge branch 'feature/dataset' into feature/tissue-masks
Adames4 Feb 11, 2026
b833059
feat: refactor configs
Adames4 Feb 11, 2026
3300341
fix: PR comments
Adames4 Feb 11, 2026
c98608e
chore: Merge branch 'feature/dataset' into feature/quality-control
Adames4 Feb 11, 2026
bc71668
feat: configs
Adames4 Feb 11, 2026
6647d0d
chore: Merge branch 'feature/tissue-masks' into feature/tiling
Adames4 Feb 11, 2026
2dfb32b
chore: Merge branch 'feature/quality-control' into feature/tiling
Adames4 Feb 11, 2026
30d4494
feat: configs
Adames4 Feb 12, 2026
88eb5dc
fix: conf
Adames4 Feb 12, 2026
4269a3a
fix: PR
Adames4 Feb 12, 2026
a9615ed
fix: repo
Adames4 Feb 12, 2026
3b7f95c
chore: Merge branch 'feature/quality-control' into feature/tiling
Adames4 Feb 12, 2026
659910e
fix: repo
Adames4 Feb 12, 2026
cd09f80
chore: Merge branch 'master' into feature/tiling
Adames4 Mar 7, 2026
2007e83
feat: update tiling
Adames4 Mar 7, 2026
6c5ec71
fix: typo
Adames4 Mar 8, 2026
4c01ac2
fix: typo
Adames4 Mar 8, 2026
31e78c2
fix: imports
Adames4 Mar 8, 2026
caaa6aa
fix: add ray to with
Adames4 Mar 10, 2026
486ca49
fix: mypy
Adames4 Mar 10, 2026
3 changes: 2 additions & 1 deletion configs/dataset/processed/ftn.yaml
@@ -2,4 +2,5 @@ defaults:
   - /dataset/raw/ftn@_here_
   - _self_

-uri: mlflow-artifacts:/86/d2b2f1835fc647e2ba3639ce606f4768/artifacts/dataset.csv
+mlflow_uris:
+  dataset: mlflow-artifacts:/86/d2b2f1835fc647e2ba3639ce606f4768/artifacts/dataset.csv
3 changes: 2 additions & 1 deletion configs/dataset/processed/ikem.yaml
@@ -2,4 +2,5 @@ defaults:
   - /dataset/raw/ikem@_here_
   - _self_

-uri: mlflow-artifacts:/86/7c6e7cc142494d45b6513185318d4462/artifacts/dataset.csv
+mlflow_uris:
+  dataset: mlflow-artifacts:/86/7c6e7cc142494d45b6513185318d4462/artifacts/dataset.csv
3 changes: 2 additions & 1 deletion configs/dataset/processed/knl_patos.yaml
@@ -2,4 +2,5 @@ defaults:
   - /dataset/raw/knl_patos@_here_
   - _self_

-uri: mlflow-artifacts:/86/f690f64ded624da9a7150a7a92385aec/artifacts/dataset.csv
+mlflow_uris:
+  dataset: mlflow-artifacts:/86/f690f64ded624da9a7150a7a92385aec/artifacts/dataset.csv
11 changes: 11 additions & 0 deletions configs/dataset/processed_w_masks/ftn.yaml
@@ -0,0 +1,11 @@
defaults:
  - /dataset/processed/ftn@_here_
  - _self_

mlflow_uris:
  tissue_mask: mlflow-artifacts:/86/17149d1de7014112aba3a252a76d10bc/artifacts/tissue_masks
  qc_mask: mlflow-artifacts:/86/d2301fc279c94682a639583731c2fded/artifacts
  splits:
    train: mlflow-artifacts:/86/a0fa337bf26146dab42062237285737f/artifacts/train.csv
    test_preliminary: mlflow-artifacts:/86/a0fa337bf26146dab42062237285737f/artifacts/test_preliminary.csv
    test_final: mlflow-artifacts:/86/a0fa337bf26146dab42062237285737f/artifacts/test_final.csv
11 changes: 11 additions & 0 deletions configs/dataset/processed_w_masks/ikem.yaml
@@ -0,0 +1,11 @@
defaults:
  - /dataset/processed/ikem@_here_
  - _self_

mlflow_uris:
  tissue_mask: mlflow-artifacts:/86/94481ac59246471fb874bfb4dccb5e67/artifacts/tissue_masks
  qc_mask: mlflow-artifacts:/86/65e794b652ab4369aad2e7dbe60eddca/artifacts
  splits:
    train: mlflow-artifacts:/86/9a2d8f975bc24dceb5271c5699560a8f/artifacts/train.csv
    test_preliminary: mlflow-artifacts:/86/9a2d8f975bc24dceb5271c5699560a8f/artifacts/test_preliminary.csv
    test_final: mlflow-artifacts:/86/9a2d8f975bc24dceb5271c5699560a8f/artifacts/test_final.csv
10 changes: 10 additions & 0 deletions configs/dataset/processed_w_masks/knl_patos.yaml
@@ -0,0 +1,10 @@
defaults:
  - /dataset/processed/knl_patos@_here_
  - _self_

mlflow_uris:
  tissue_mask: mlflow-artifacts:/86/6cdfa5ce4f7242a9b2bb394bfd5ed705/artifacts/tissue_masks
  qc_mask: mlflow-artifacts:/86/7cc5586efbca4ecd8f7ac2847b4ee199/artifacts
  splits:
    test_preliminary: mlflow-artifacts:/86/4d517fd564c741ec8e14679a340ebcb0/artifacts/test_preliminary.csv
    test_final: mlflow-artifacts:/86/4d517fd564c741ec8e14679a340ebcb0/artifacts/test_final.csv
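A minimal sketch of how the `defaults` chain in the `processed_w_masks` configs is expected to compose, assuming Hydra's usual behaviour of merging nested mappings recursively with later entries winning. It simulates the merge with plain dictionaries rather than calling Hydra; `deep_merge` and the `<run>` placeholder URIs are hypothetical and only illustrate why `mlflow_uris.dataset` should remain available next to the new mask and split keys.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `override` into a copy of `base`; later values win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Placeholder URIs (assumption); the real values live in the YAML files.
processed = {
    "mlflow_uris": {"dataset": "mlflow-artifacts:/<run>/artifacts/dataset.csv"}
}
with_masks = {
    "mlflow_uris": {
        "tissue_mask": "mlflow-artifacts:/<run>/artifacts/tissue_masks",
        "qc_mask": "mlflow-artifacts:/<run>/artifacts",
        "splits": {
            "test_final": "mlflow-artifacts:/<run>/artifacts/test_final.csv"
        },
    }
}

# `processed` plays the role of the inherited config, `with_masks` of `_self_`.
config = deep_merge(processed, with_masks)
print(sorted(config["mlflow_uris"]))
# ['dataset', 'qc_mask', 'splits', 'tissue_mask']
```

Grouping every URI under one `mlflow_uris` mapping is what lets the child config add keys without restating the dataset URI.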
15 changes: 15 additions & 0 deletions configs/preprocessing/tiling.yaml
@@ -0,0 +1,15 @@
# @package _global_

mpp: 1.55 # level 2
tile_extent: 224
stride: 112
tissue_threshold: 0.5

metadata:
  run_name: "🧱 Tiling: ${dataset.institution}"
  description: Tile extraction for ${dataset.institution} institution with tile extent ${tile_extent}
  hyperparams:
    mpp: ${mpp}
    tile_extent: ${tile_extent}
    stride: ${stride}
    tissue_threshold: ${tissue_threshold}
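To make the `tile_extent: 224` / `stride: 112` pair concrete, here is a rough sketch of the tile origins such a grid produces along one axis. `grid_positions` is a hypothetical helper written for illustration, not the `ratiopath` implementation; in particular, dropping tiles that do not fit entirely is an assumption, and `grid_tiles` may treat the trailing edge differently.

```python
def grid_positions(slide_extent: int, tile_extent: int, stride: int) -> list[int]:
    """Top-left coordinates along one axis, keeping only tiles that fit entirely."""
    if slide_extent < tile_extent:
        return []
    return list(range(0, slide_extent - tile_extent + 1, stride))


tile_extent, stride, mpp = 224, 112, 1.55  # values from tiling.yaml

# The 1000-pixel slide extent is an arbitrary example value.
xs = grid_positions(slide_extent=1000, tile_extent=tile_extent, stride=stride)
overlap = 1 - stride / tile_extent

print(xs)                           # [0, 112, 224, 336, 448, 560, 672]
print(overlap)                      # 0.5
print(round(tile_extent * mpp, 1))  # 347.2 (tile side in micrometres)
```

With a stride of half the tile extent, neighbouring tiles share 50% of their area along each axis.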
12 changes: 3 additions & 9 deletions preprocessing/quality_control.py
@@ -6,9 +6,9 @@
 from typing import TypedDict

 import hydra
-import mlflow.artifacts
 import pandas as pd
 import rationai
+from mlflow.artifacts import download_artifacts
 from omegaconf import DictConfig
 from rationai.mlkit import autolog, with_cli_args
 from rationai.mlkit.lightning.loggers import MLFlowLogger
@@ -91,25 +91,19 @@ async def qc_main(
     logger.log_artifacts(local_dir=str(output_path))


-def download_dataset(uri: str) -> pd.DataFrame:
-    path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
-    df = pd.read_csv(path)
-    return df
-
-
 @with_cli_args(["+preprocessing=quality_control"])
 @hydra.main(config_path="../configs", config_name="preprocessing", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
-    df = download_dataset(config.dataset.uri)
+    dataset = pd.read_csv(download_artifacts(config.dataset.mlflow_uris.dataset))

     output_path = Path(config.output_dir)
     output_path.mkdir(parents=True, exist_ok=True)

     asyncio.run(
         qc_main(
             output_path=output_path,
-            slides=df["path"].to_list(),
+            slides=dataset["path"].to_list(),
             logger=logger,
             request_timeout=config.request_timeout,
             max_concurrent=config.max_concurrent,
2 changes: 1 addition & 1 deletion preprocessing/split_dataset.py
@@ -64,7 +64,7 @@ def add_folds(train: pd.DataFrame, n_folds: int, random_state: int) -> pd.DataFr
 @hydra.main(config_path="../configs", config_name="preprocessing", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
-    dataset = pd.read_csv(download_artifacts(config.dataset.uri))
+    dataset = pd.read_csv(download_artifacts(config.dataset.mlflow_uris.dataset))

     train, test_preliminary, test_final = split_dataset(
         dataset, config.splits, config.random_state
242 changes: 242 additions & 0 deletions preprocessing/tiling.py
@@ -0,0 +1,242 @@
from pathlib import Path
from typing import Any, TypedDict, cast

import hydra
import mlflow.artifacts
import pandas as pd
import ray
from omegaconf import DictConfig
from rationai.mlkit import autolog, with_cli_args
from rationai.mlkit.lightning.loggers import MLFlowLogger
from rationai.tiling.writers import save_mlflow_dataset
from ratiopath.ray import read_slides
from ratiopath.tiling import grid_tiles, tile_overlay_overlap
from ratiopath.tiling.utils import row_hash
from ray.data.expressions import col
from shapely import Polygon
from shapely.geometry import box


QC_BLUR_MEAN_COLUMN = "mean_coverage(Piqe)"
QC_ARTIFACTS_MEAN_COLUMN = "mean_coverage(ResidualArtifactsAndCoverage)"
QC_SUBFOLDERS = {"blur": "blur_per_pixel", "artifacts": "artifacts_per_pixel"}


class _RayCpuResources(TypedDict):
    num_cpus: float


class _RayMemResources(TypedDict):
    memory: int


LO_CPU: _RayCpuResources = {"num_cpus": 0.1}
HI_CPU: _RayCpuResources = {"num_cpus": 0.2}
LO_MEM: _RayMemResources = {"memory": 128 * 1024**2}
HI_MEM: _RayMemResources = {"memory": 256 * 1024**2}


def add_nancy_index(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]:
    row["nancy_index"] = df.loc[Path(row["path"]).stem, "nancy"]
    return row


def qc_agg(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]:
    qc_df = cast("pd.Series", df.loc[Path(row["path"]).stem])

    row["blur_mean"] = qc_df[QC_BLUR_MEAN_COLUMN]
    row["artifacts_mean"] = qc_df[QC_ARTIFACTS_MEAN_COLUMN]

    return row


def add_fold(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]:
    row["fold"] = df.loc[Path(row["path"]).stem, "fold"]
    return row


def add_mask_paths(
    row: dict[str, Any], qc_folder: Path, tissue_folder: Path
) -> dict[str, Any]:
    stem = Path(row["path"]).stem
    row["tissue_mask_path"] = str(tissue_folder / f"{stem}.tiff")
    for key, subfolder in QC_SUBFOLDERS.items():
        row[f"{key}_mask_path"] = str(qc_folder / subfolder / f"{stem}.tiff")
    return row


def create_tissue_roi(tile_extent: int) -> Polygon:
    offset = tile_extent // 4
    size = tile_extent // 2
    return box(offset, offset, offset + size, offset + size)


def create_qc_roi(tile_extent: int) -> Polygon:
    return box(0, 0, tile_extent, tile_extent)


def tile(row: dict[str, Any]) -> list[dict[str, Any]]:
    return [
        {
            "tile_x": x,
            "tile_y": y,
            "path": row["path"],
            "slide_id": row["id"],
            "level": row["level"],
            "tile_extent_x": row["tile_extent_x"],
            "tile_extent_y": row["tile_extent_y"],
            "mpp_x": row["mpp_x"],
            "mpp_y": row["mpp_y"],
            "tissue_mask_path": row["tissue_mask_path"],
            "blur_mask_path": row["blur_mask_path"],
            "artifacts_mask_path": row["artifacts_mask_path"],
        }
        for x, y in grid_tiles(
            slide_extent=(row["extent_x"], row["extent_y"]),
            tile_extent=(row["tile_extent_x"], row["tile_extent_y"]),
            stride=(row["stride_x"], row["stride_y"]),
        )
    ]


def extract_coverages(row: dict[str, Any], *cols: str) -> dict[str, Any]:
    for c in cols:
        overlap = row[f"{c}_overlap"]
        zero_overlap = overlap.get("0", 0)
        if zero_overlap is None:
            row[c] = 1.0
        else:
            row[c] = 1.0 - zero_overlap
    return row


def filter_tissue(row: dict[str, Any], threshold: float) -> bool:
    return row["tissue"] >= threshold


def select(row: dict[str, Any]) -> dict[str, Any]:
    return {
        "slide_id": row["slide_id"],
        "x": row["tile_x"],
        "y": row["tile_y"],
        "tissue": row["tissue"],
        "blur": row["blur"],
        "artifacts": row["artifacts"],
    }


def tiling(
    df: pd.DataFrame,
    qc_folder: Path,
    tissue_folder: Path,
    tile_extent: int,
    stride: int,
    mpp: float,
    tissue_threshold: float,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    qc_df = pd.read_csv(qc_folder / "qc_metrics.csv", index_col="slide_name")
    paths = df["path"].tolist()

    slides = (
        read_slides(paths, tile_extent=tile_extent, stride=stride, mpp=mpp)
        .map(row_hash, **LO_CPU, **LO_MEM)
        .map(add_nancy_index, fn_args=(df,), **LO_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]
        .map(qc_agg, fn_args=(qc_df,), **HI_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]
    )

    if "fold" in df.columns:
        slides = slides.map(add_fold, fn_args=(df,), **LO_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]

    tissue_roi = create_tissue_roi(tile_extent)
    qc_roi = create_qc_roi(tile_extent)

    tiles = (
        slides.map(
            add_mask_paths,  # pyright: ignore[reportArgumentType]
            fn_args=(qc_folder, tissue_folder),
            **LO_CPU,
            **LO_MEM,
        )
        .flat_map(tile, **HI_CPU, **LO_MEM)
        .repartition(target_num_rows_per_block=4096)
        .with_column(
            "tissue_overlap",
            tile_overlay_overlap(
                tissue_roi,
                col("tissue_mask_path"),
                col("tile_x"),
                col("tile_y"),
                col("mpp_x"),
                col("mpp_y"),
            ),  # pyright: ignore[reportCallIssue]
            **HI_CPU,
            **HI_MEM,
        )
        .map(extract_coverages, fn_args=("tissue",), **LO_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]
        .filter(filter_tissue, fn_args=(tissue_threshold,), **LO_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]
        .with_column(
            "blur_overlap",
            tile_overlay_overlap(
                qc_roi,
                col("blur_mask_path"),
                col("tile_x"),
                col("tile_y"),
                col("mpp_x"),
                col("mpp_y"),
            ),  # pyright: ignore[reportCallIssue]
            **HI_CPU,
            **HI_MEM,
        )
        .with_column(
            "artifacts_overlap",
            tile_overlay_overlap(
                qc_roi,
                col("artifacts_mask_path"),
                col("tile_x"),
                col("tile_y"),
                col("mpp_x"),
                col("mpp_y"),
            ),  # pyright: ignore[reportCallIssue]
            **HI_CPU,
            **HI_MEM,
        )
        .map(extract_coverages, fn_args=("blur", "artifacts"), **LO_CPU, **LO_MEM)  # pyright: ignore[reportArgumentType]
        .map(select, **LO_CPU, **LO_MEM)
    )

    return slides.to_pandas(), tiles.to_pandas()


@with_cli_args(["+preprocessing=tiling"])
@hydra.main(config_path="../configs", config_name="preprocessing", version_base=None)
@autolog
def main(config: DictConfig, logger: MLFlowLogger) -> None:
    qc_folder = Path(
        mlflow.artifacts.download_artifacts(config.dataset.mlflow_uris.qc_mask)
    )
    tissue_folder = Path(
        mlflow.artifacts.download_artifacts(config.dataset.mlflow_uris.tissue_mask)
    )

    for name, split_uri in config.dataset.mlflow_uris.splits.items():
        split = pd.read_csv(
            mlflow.artifacts.download_artifacts(split_uri), index_col="slide_id"
        )

        df_slides, df_tiles = tiling(
            split,
            qc_folder=qc_folder,
            tissue_folder=tissue_folder,
            tile_extent=config.tile_extent,
            stride=config.stride,
            mpp=config.mpp,
            tissue_threshold=config.tissue_threshold,
        )
        save_mlflow_dataset(
            df_slides, df_tiles, f"{name} - {config.dataset.institution}"
        )


if __name__ == "__main__":
    with ray.init(runtime_env={"excludes": [".git", ".venv"]}):
        main()
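The coverage and filtering logic in `preprocessing/tiling.py` can be exercised without Ray or slide files. The sketch below is a standalone approximation: it assumes each `*_overlap` value is a mapping from mask pixel value (as a string) to the fraction of the ROI it covers, which is inferred from the `overlap.get("0", 0)` call rather than from the `ratiopath` documentation. `coverage_from_overlap`, `tissue_roi_bounds`, the `"255"` key and the sample rows are all hypothetical.

```python
from typing import Optional


def coverage_from_overlap(overlap: dict[str, Optional[float]]) -> float:
    """Fraction of the ROI covered by non-zero mask pixels."""
    zero_fraction = overlap.get("0", 0)
    if zero_fraction is None:
        # Mirrors the PR: a missing (None) background fraction counts as full coverage.
        return 1.0
    return 1.0 - zero_fraction


def tissue_roi_bounds(tile_extent: int) -> tuple[int, int, int, int]:
    """(minx, miny, maxx, maxy) of the central square used for the tissue check."""
    offset = tile_extent // 4
    size = tile_extent // 2
    return (offset, offset, offset + size, offset + size)


# Made-up overlap results for four tiles.
rows = [
    {"tile": "a", "tissue_overlap": {"0": 0.25, "255": 0.75}},
    {"tile": "b", "tissue_overlap": {"0": 0.5, "255": 0.5}},
    {"tile": "c", "tissue_overlap": {"255": 1.0}},
    {"tile": "d", "tissue_overlap": {"0": 1.0}},
]
threshold = 0.5  # tissue_threshold from tiling.yaml

kept = [
    row["tile"]
    for row in rows
    if coverage_from_overlap(row["tissue_overlap"]) >= threshold
]

print(tissue_roi_bounds(224))  # (56, 56, 168, 168)
print(kept)                    # ['a', 'b', 'c']
```

Because the tissue filter runs before the blur and artifact overlays are computed, tiles such as `d` never reach the two more expensive QC lookups.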
2 changes: 1 addition & 1 deletion preprocessing/tissue_masks.py
@@ -35,7 +35,7 @@ def process_slide(slide_path: str, level: int, output_path: Path) -> None:
 @hydra.main(config_path="../configs", config_name="preprocessing", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
-    dataset = pd.read_csv(download_artifacts(config.dataset.uri))
+    dataset = pd.read_csv(download_artifacts(config.dataset.mlflow_uris.dataset))

     with TemporaryDirectory() as output_dir:
         process_items(
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -16,6 +16,7 @@ dependencies = [
     "rationai-sdk",
     "rationai-mlkit",
     "rationai-masks",
+    "rationai-tiling",
     "ray>=2.52.1",
     "torch>=2.9.0",
     "torchmetrics>=1.8.2",
@@ -29,5 +30,6 @@ job = ["rationai-kube-jobs"]
 [tool.uv.sources]
 rationai-mlkit = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git" }
 rationai-masks = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git" }
+rationai-tiling = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/tiling.git" }
 rationai-kube-jobs = { git = "ssh://git@gitlab.ics.muni.cz/rationai/infrastructure/kube-jobs" }
 rationai-sdk = { git = "https://gitlab.ics.muni.cz/rationai/infrastructure/rationai-sdk-python.git" }