diff --git a/README.md b/README.md
index ccf6db4..1526580 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ A one sentence summary of purpose and methodology. Used for creating an
overview tables.
Repository:
-[openproblems-bio/task_template](https://github.com/openproblems-bio/task_template)
+[openproblems-bio/task_spatial_segmentation](https://github.com/openproblems-bio/task_spatial_segmentation)
## Description
@@ -28,34 +28,34 @@ should convince readers of the significance and relevance of your task.
## Authors & contributors
-| Name | Roles | Linkedin | Twitter | Email | Github | Orcid |
-|:---|:---|:---|:---|:---|:---|:---|
-| John Doe | author, maintainer | johndoe | johndoe | john@doe.me | johndoe | 0000-0000-0000-0000 |
+| name | roles |
+|:---------|:-------------------|
+| John Doe | author, maintainer |
## API
``` mermaid
flowchart TB
- file_common_ist("Common iST Dataset")
- comp_data_processor[/"Data processor"/]
- file_spatial_dataset("Raw iST Dataset")
- file_scrnaseq_reference("scRNA-seq Reference")
- comp_control_method[/"Control Method"/]
- comp_method[/"Method"/]
- comp_metric[/"Metric"/]
- file_prediction("Predicted data")
- file_score("Score")
- file_common_scrnaseq("Common SC Dataset")
+ file_common_ist("Common iST Dataset")
+ comp_data_processor[/"Data processor"/]
+ file_scrnaseq_reference("scRNA-seq Reference")
+ file_spatial_dataset("Raw iST Dataset")
+ comp_control_method[/"Control Method"/]
+ comp_metric[/"Metric"/]
+ comp_method[/"Method"/]
+ file_prediction("Predicted data")
+ file_score("Score")
+ file_common_scrnaseq("Common SC Dataset")
file_common_ist---comp_data_processor
- comp_data_processor-->file_spatial_dataset
comp_data_processor-->file_scrnaseq_reference
- file_spatial_dataset---comp_control_method
- file_spatial_dataset---comp_method
+ comp_data_processor-->file_spatial_dataset
file_scrnaseq_reference---comp_control_method
file_scrnaseq_reference---comp_metric
+ file_spatial_dataset---comp_control_method
+ file_spatial_dataset---comp_method
comp_control_method-->file_prediction
- comp_method-->file_prediction
comp_metric-->file_score
+ comp_method-->file_prediction
file_prediction---comp_metric
file_common_scrnaseq---comp_data_processor
```
@@ -76,91 +76,12 @@ Format:
- SpatialData object
- images: 'image', 'image_3D', 'he_image'
- labels: 'cell_labels', 'nucleus_labels'
- points: 'transcripts'
- shapes: 'cell_boundaries', 'nucleus_boundaries'
- tables: 'metadata'
- coordinate_systems: 'global'
-
Data structure:
-*images*
-
-| Name | Description |
-|:-----------|:------------------------------------|
-| `image` | The raw image data. |
-| `image_3D` | (*Optional*) The raw 3D image data. |
-| `he_image` | (*Optional*) H&E image data. |
-
-*labels*
-
-| Name | Description |
-|:-----------------|:---------------------------------------|
-| `cell_labels` | (*Optional*) Cell segmentation labels. |
-| `nucleus_labels` | (*Optional*) Cell segmentation labels. |
-
-*points*
-
-`transcripts`: Point cloud data of transcripts.
-
-| Column | Type | Description |
-|:---|:---|:---|
-| `x` | `float` | x-coordinate of the point. |
-| `y` | `float` | y-coordinate of the point. |
-| `z` | `float` | (*Optional*) z-coordinate of the point. |
-| `feature_name` | `categorical` | Name of the feature. |
-| `cell_id` | `integer` | (*Optional*) Unique identifier of the cell. |
-| `nucleus_id` | `integer` | (*Optional*) Unique identifier of the nucleus. |
-| `cell_type` | `string` | (*Optional*) Cell type of the cell. |
-| `qv` | `float` | (*Optional*) Quality value of the point. |
-| `transcript_id` | `long` | Unique identifier of the transcript. |
-| `overlaps_nucleus` | `boolean` | (*Optional*) Whether the point overlaps with a nucleus. |
-
-*shapes*
-
-`cell_boundaries`: Cell boundaries.
-
-| Column | Type | Description |
-|:-----------|:---------|:-------------------------------|
-| `geometry` | `object` | Geometry of the cell boundary. |
-
-`nucleus_boundaries`: Nucleus boundaries.
-
-| Column | Type | Description |
-|:-----------|:---------|:----------------------------------|
-| `geometry` | `object` | Geometry of the nucleus boundary. |
-
-*tables*
-
-`metadata`: Metadata of spatial dataset.
-
-| Slot | Type | Description |
-|:---|:---|:---|
-| `obs["cell_id"]` | `string` | A unique identifier for the cell. |
-| `var["gene_ids"]` | `string` | Unique identifier for the gene. |
-| `var["feature_types"]` | `string` | Type of the feature. |
-| `obsm["spatial"]` | `double` | Spatial coordinates of the cell. |
-| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
-| `uns["dataset_name"]` | `string` | A human-readable name for the dataset. |
-| `uns["dataset_url"]` | `string` | Link to the original source of the dataset. |
-| `uns["dataset_reference"]` | `string` | Bibtex reference of the paper in which the dataset was published. |
-| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
-| `uns["dataset_description"]` | `string` | Long description of the dataset. |
-| `uns["dataset_organism"]` | `string` | The organism of the sample in the dataset. |
-| `uns["segmentation_id"]` | `string` | A unique identifier for the segmentation. |
-
-*coordinate_systems*
-
-| Name | Description |
-|:---------|:------------------------------------|
-| `global` | Coordinate system of the replicate. |
-
## Component type: Data processor
@@ -176,110 +97,7 @@ Arguments:
| `--input_sp` | `file` | An unprocessed spatial imaging dataset stored as a zarr file. |
| `--input_sc` | `file` | An unprocessed dataset as output by a dataset loader. |
| `--output_spatial_dataset` | `file` | (*Output*) A spatial transcriptomics dataset, preprocessed for this benchmark. |
-| `--output_scrnaseq_reference` | `file` | (*Output*) A single-cell reference dataset, preprocessed for this benchmark. |
-
-
-
-## File format: Raw iST Dataset
-
-A spatial transcriptomics dataset, preprocessed for this benchmark.
-
-Example file:
-`resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr`
-
-Description:
-
-This dataset contains preprocessed images, labels, points, shapes, and
-tables for spatial transcriptomics data.
-
-Format:
-
-
-
- SpatialData object
- images: 'image', 'image_3D', 'he_image'
- labels: 'cell_labels', 'nucleus_labels'
- points: 'transcripts'
- shapes: 'cell_boundaries', 'nucleus_boundaries'
- tables: 'metadata'
- coordinate_systems: 'global'
-
-
-
-Data structure:
-
-
-
-*images*
-
-| Name | Description |
-|:-----------|:------------------------------------|
-| `image` | The raw image data. |
-| `image_3D` | (*Optional*) The raw 3D image data. |
-| `he_image` | (*Optional*) H&E image data. |
-
-*labels*
-
-| Name | Description |
-|:-----------------|:---------------------------------------|
-| `cell_labels` | (*Optional*) Cell segmentation labels. |
-| `nucleus_labels` | (*Optional*) Cell segmentation labels. |
-
-*points*
-
-`transcripts`: Point cloud data of transcripts.
-
-| Column | Type | Description |
-|:---|:---|:---|
-| `x` | `float` | x-coordinate of the point. |
-| `y` | `float` | y-coordinate of the point. |
-| `z` | `float` | (*Optional*) z-coordinate of the point. |
-| `feature_name` | `categorical` | Name of the feature. |
-| `cell_id` | `integer` | (*Optional*) Unique identifier of the cell. |
-| `nucleus_id` | `integer` | (*Optional*) Unique identifier of the nucleus. |
-| `cell_type` | `string` | (*Optional*) Cell type of the cell. |
-| `qv` | `float` | (*Optional*) Quality value of the point. |
-| `transcript_id` | `long` | Unique identifier of the transcript. |
-| `overlaps_nucleus` | `boolean` | (*Optional*) Whether the point overlaps with a nucleus. |
-
-*shapes*
-
-`cell_boundaries`: Cell boundaries.
-
-| Column | Type | Description |
-|:-----------|:---------|:-------------------------------|
-| `geometry` | `object` | Geometry of the cell boundary. |
-
-`nucleus_boundaries`: Nucleus boundaries.
-
-| Column | Type | Description |
-|:-----------|:---------|:----------------------------------|
-| `geometry` | `object` | Geometry of the nucleus boundary. |
-
-*tables*
-
-`metadata`: Metadata of spatial dataset.
-
-| Slot | Type | Description |
-|:---|:---|:---|
-| `obs["cell_id"]` | `string` | A unique identifier for the cell. |
-| `var["gene_ids"]` | `string` | Unique identifier for the gene. |
-| `var["feature_types"]` | `string` | Type of the feature. |
-| `obsm["spatial"]` | `double` | Spatial coordinates of the cell. |
-| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
-| `uns["dataset_name"]` | `string` | A human-readable name for the dataset. |
-| `uns["dataset_url"]` | `string` | Link to the original source of the dataset. |
-| `uns["dataset_reference"]` | `string` | Bibtex reference of the paper in which the dataset was published. |
-| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
-| `uns["dataset_description"]` | `string` | Long description of the dataset. |
-| `uns["dataset_organism"]` | `string` | The organism of the sample in the dataset. |
-| `uns["segmentation_id"]` | `string` | A unique identifier for the segmentation. |
-
-*coordinate_systems*
-
-| Name | Description |
-|:---------|:------------------------------------|
-| `global` | Coordinate system of the replicate. |
+| `--output_scrnaseq` | `file` | (*Output*) A single-cell reference dataset, preprocessed for this benchmark. |
@@ -288,7 +106,7 @@ Data structure:
A single-cell reference dataset, preprocessed for this benchmark.
Example file:
-`resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad`
+`resources_test/task_spatial_segmentation/mouse_brain_combined/scrnaseq_reference.h5ad`
Description:
@@ -364,6 +182,30 @@ Data structure:
+## File format: Raw iST Dataset
+
+A spatial transcriptomics dataset, preprocessed for this benchmark.
+
+Example file:
+`resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr`
+
+Description:
+
+This dataset contains preprocessed images, labels, points, shapes, and
+tables for spatial transcriptomics data.
+
+Format:
+
+
+
+
+
+Data structure:
+
+
+
+
+
## Component type: Control Method
Quality control methods for verifying the pipeline.
@@ -380,9 +222,9 @@ Arguments:
-## Component type: Method
+## Component type: Metric
-A method.
+A task template metric.
Arguments:
@@ -390,14 +232,15 @@ Arguments:
| Name | Type | Description |
|:---|:---|:---|
-| `--input` | `file` | A spatial transcriptomics dataset, preprocessed for this benchmark. |
-| `--output` | `file` | (*Output*) A predicted dataset as output by a method. |
+| `--input_prediction` | `file` | A predicted dataset as output by a method. |
+| `--input_scrnaseq_reference` | `file` | A single-cell reference dataset, preprocessed for this benchmark. |
+| `--output` | `file` | (*Output*) File indicating the score of a metric. |
-## Component type: Metric
+## Component type: Method
-A task template metric.
+A method.
Arguments:
@@ -405,9 +248,8 @@ Arguments:
| Name | Type | Description |
|:---|:---|:---|
-| `--input_prediction` | `file` | A predicted dataset as output by a method. |
-| `--input_scrnaseq_reference` | `file` | A single-cell reference dataset, preprocessed for this benchmark. |
-| `--output` | `file` | (*Output*) File indicating the score of a metric. |
+| `--input` | `file` | A spatial transcriptomics dataset, preprocessed for this benchmark. |
+| `--output` | `file` | (*Output*) A predicted dataset as output by a method. |
@@ -422,31 +264,12 @@ Format:
- SpatialData object
- labels: 'segmentation'
- tables: 'table'
-
Data structure:
-*labels*
-
-| Name | Description |
-|:---------------|:--------------------------|
-| `segmentation` | Segmentation of the data. |
-
-*tables*
-
-`table`: AnnData table.
-
-| Slot | Type | Description |
-|:-----------------|:---------|:------------|
-| `obs["cell_id"]` | `string` | Cell ID. |
-| `obs["region"]` | `string` | Region. |
-
## File format: Score
@@ -562,3 +385,4 @@ Data structure:
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
+
diff --git a/_viash.yaml b/_viash.yaml
index 31ad320..a0130fe 100644
--- a/_viash.yaml
+++ b/_viash.yaml
@@ -11,8 +11,8 @@ license: MIT
keywords: [single-cell, openproblems, benchmark]
# Step 3: Update the `task_template` to the name of the task from step 1.
links:
- issue_tracker: https://github.com/openproblems-bio/task_template/issues
- repository: https://github.com/openproblems-bio/task_template
+ issue_tracker: https://github.com/openproblems-bio/task_spatial_segmentation/issues
+ repository: https://github.com/openproblems-bio/task_spatial_segmentation
docker_registry: ghcr.io
diff --git a/scripts/create_resources/resources.sh b/scripts/create_resources/resources.sh
index 57f4d68..52ee226 100755
--- a/scripts/create_resources/resources.sh
+++ b/scripts/create_resources/resources.sh
@@ -18,7 +18,7 @@ cat > /tmp/params.yaml << 'HERE'
input_states: s3://openproblems-data/resources/datasets/**/state.yaml
rename_keys: 'input:output_dataset'
output_state: '$id/state.yaml'
-settings: '{"output_train": "$id/train.h5ad", "output_test": "$id/test.h5ad", "output_solution": "$id/solution.h5ad"}'
+settings: '{"output_spatial_dataset": "$id/output_spatial_dataset.zarr", "output_scrnaseq": "$id/output_scrnaseq.h5ad"}'
publish_dir: s3://openproblems-data/resources/task_template/datasets/
HERE
diff --git a/scripts/create_resources/test_resources.sh b/scripts/create_resources/test_resources.sh
index 9cb372a..b11d437 100755
--- a/scripts/create_resources/test_resources.sh
+++ b/scripts/create_resources/test_resources.sh
@@ -13,41 +13,49 @@ cd "$REPO_ROOT"
set -e
+DATASET_ID=mouse_brain_combined
+
RAW_DATA=resources_test/common
-DATASET_DIR=resources_test/task_template
+DATASET_DIR=resources_test/task_spatial_segmentation/$DATASET_ID
mkdir -p $DATASET_DIR
# process dataset
viash run src/data_processors/process_dataset/config.vsh.yaml -- \
- --input $RAW_DATA/cxg_mouse_pancreas_atlas/dataset.h5ad \
- --output_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
- --output_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \
- --output_solution $DATASET_DIR/cxg_mouse_pancreas_atlas/solution.h5ad
+ --input_sp $RAW_DATA/2023_10x_mouse_brain_xenium_rep1/dataset.zarr \
+ --input_sc $RAW_DATA/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad \
+ --output_spatial_dataset $DATASET_DIR/spatial_dataset.zarr \
+ --output_scrnaseq_reference $DATASET_DIR/scrnaseq_reference.h5ad \
+ --dataset_id $DATASET_ID \
+ --dataset_name "Test data mouse brain combined 2023 tenx Xenium replicate 1 2023 Yao scRNAseq" \
+ --dataset_url "https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard;https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246717" \
+ --dataset_reference "https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard;10.1038/s41586-023-06812-z" \
+ --dataset_summary "Demonstration of gene expression profiling for fresh frozen mouse brain on the Xenium platform using the pre-designed Mouse Brain Gene Expression Panel (v1);A high-resolution scRNAseq atlas of cell types in the whole mouse brain" \
+ --dataset_description "Demonstration of gene expression profiling for fresh frozen mouse brain on the Xenium platform using the pre-designed Mouse Brain Gene Expression Panel (v1). Replicate results demonstrate the high reproducibility of data generated by the platform. 10x Genomics obtained tissue from a C57BL/6 mouse from Charles River Laboratories. Three adjacent 10µm sections were placed on the same slide. Tissues were prepared following the demonstrated protocols Xenium In Situ for Fresh Frozen Tissues - Tissue Preparation Guide (CG000579) and Xenium In Situ for Fresh Frozen Tissues - Fixation & Permeabilization (CG000581).;See dataset_reference for more information. Note that we only took the 10xv2 data from the dataset." \
+ --dataset_organism "mus_musculus"
# run one method
-viash run src/methods/logistic_regression/config.vsh.yaml -- \
- --input_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
- --input_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \
- --output $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad
+viash run src/methods/cellpose/config.vsh.yaml -- \
+ --input $DATASET_DIR/spatial_dataset.zarr \
+ --output $DATASET_DIR/prediction.h5ad
# run one metric
-viash run src/metrics/accuracy/config.vsh.yaml -- \
- --input_prediction $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad \
- --input_solution $DATASET_DIR/cxg_mouse_pancreas_atlas/solution.h5ad \
- --output $DATASET_DIR/cxg_mouse_pancreas_atlas/score.h5ad
+# TODO: implement this!
+# viash run src/metrics/ari/config.vsh.yaml -- \
+# --input_prediction $DATASET_DIR/prediction.h5ad \
+# --input_scrnaseq_reference $DATASET_DIR/scrnaseq_reference.h5ad \
+# --output $DATASET_DIR/score.h5ad
# write manual state.yaml. this is not actually necessary but you never know it might be useful
-cat > $DATASET_DIR/cxg_mouse_pancreas_atlas/state.yaml << HERE
-id: cxg_mouse_pancreas_atlas
-train: !file train.h5ad
-test: !file test.h5ad
-solution: !file solution.h5ad
-prediction: !file prediction.h5ad
-score: !file score.h5ad
+cat > $DATASET_DIR/state.yaml << HERE
+id: $DATASET_ID
+spatial_dataset: spatial_dataset.zarr
+scrnaseq_reference: scrnaseq_reference.h5ad
+prediction: prediction.h5ad
+score: score.h5ad
HERE
# only run this if you have access to the openproblems-data bucket
aws s3 sync --profile op \
- "$DATASET_DIR" s3://openproblems-data/resources_test/task_template \
+ "$DATASET_DIR" s3://openproblems-data/resources_test/task_spatial_segmentation/mouse_brain_combined/ \
--delete --dryrun
diff --git a/scripts/create_test_resources/README.md b/scripts/create_test_resources/README.md
deleted file mode 100644
index 46bb116..0000000
--- a/scripts/create_test_resources/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
-Here we generate a small test dataset, used for `viash test`. Note that the file structure here is a bit simplified compared to `scripts/create_resources` as we only have one dataset.
-
-Copy the data from the `task_ist_preprocessing` test resources: `mouse_brain_combined.sh`
diff --git a/scripts/create_test_resources/mouse_brain_combined.sh b/scripts/create_test_resources/mouse_brain_combined.sh
deleted file mode 100755
index 05e4de9..0000000
--- a/scripts/create_test_resources/mouse_brain_combined.sh
+++ /dev/null
@@ -1,32 +0,0 @@
-#!/bin/bash
-
-# get the root of the directory
-REPO_ROOT=$(git rev-parse --show-toplevel)
-
-# ensure that the command below is run from the root of the repository
-cd "$REPO_ROOT"
-
-set -e
-
-if [ ! -d resources_test/task_spatial_segmentation/mouse_brain_combined ]; then
- mkdir -p resources_test/task_spatial_segmentation/mouse_brain_combined
-fi
-
-# these files were generated by https://github.com/openproblems-bio/task_ist_preprocessing/tree/main/scripts/create_test_resources
-# we can just copy them for now
-
-aws s3 sync --profile op \
- s3://openproblems-data/resources_test/common/2023_10x_mouse_brain_xenium_rep1/dataset.zarr \
- resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr
-
-aws s3 cp --profile op \
- s3://openproblems-data/resources_test/common/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad \
- resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad
-
-# ...additional preprocessing if needed ...
-
-# sync to s3
-aws s3 sync --profile op \
- "resources_test/task_spatial_segmentation/mouse_brain_combined/" \
- "s3://openproblems-data/resources_test/task_spatial_segmentation/mouse_brain_combined/" \
- --delete --dryrun
diff --git a/scripts/run_benchmark/run_full_local.sh b/scripts/run_benchmark/run_full_local.sh
index f8c1585..26bba56 100755
--- a/scripts/run_benchmark/run_full_local.sh
+++ b/scripts/run_benchmark/run_full_local.sh
@@ -31,7 +31,7 @@ publish_dir="resources/results/${RUN_ID}"
# write the parameters to file
cat > /tmp/params.yaml << HERE
input_states: resources/datasets/**/state.yaml
-rename_keys: 'input_train:output_train;input_test:output_test;input_solution:output_solution'
+rename_keys: 'input_spatial_dataset:output_spatial_dataset;input_scrnaseq_reference:output_scrnaseq_reference'
output_state: "state.yaml"
publish_dir: "$publish_dir"
HERE
diff --git a/scripts/run_benchmark/run_full_seqeracloud.sh b/scripts/run_benchmark/run_full_seqeracloud.sh
index 87d133c..3c31e74 100755
--- a/scripts/run_benchmark/run_full_seqeracloud.sh
+++ b/scripts/run_benchmark/run_full_seqeracloud.sh
@@ -23,7 +23,7 @@ publish_dir="s3://openproblems-data/resources/task_template/results/${RUN_ID}"
# write the parameters to file
cat > /tmp/params.yaml << HERE
input_states: s3://openproblems-data/resources/task_template/datasets/**/state.yaml
-rename_keys: 'input_train:output_train;input_test:output_test;input_solution:output_solution'
+rename_keys: 'input_spatial_dataset:output_spatial_dataset;input_scrnaseq_reference:output_scrnaseq_reference'
output_state: "state.yaml"
publish_dir: "$publish_dir"
HERE
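Review note: the `rename_keys` value in the two benchmark scripts above is a list of `new_key:old_key` pairs. The sketch below shows how such a string could be split, assuming `;` is the pair separator (the separator used by the line this change removes). `parse_rename_keys` is a hypothetical helper written for illustration; it is not part of the workflow or of Viash.

```python
def parse_rename_keys(value: str, sep: str = ";") -> dict[str, str]:
    """Split 'new:old<sep>new:old' into a {new_key: old_key} mapping."""
    mapping: dict[str, str] = {}
    for pair in value.split(sep):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing separator
        parts = pair.split(":")
        if len(parts) != 2:
            # A pair joined with the wrong separator ends up here.
            raise ValueError(f"expected 'new_key:old_key', got {pair!r}")
        new_key, old_key = parts
        mapping[new_key] = old_key
    return mapping


keys = parse_rename_keys(
    "input_spatial_dataset:output_spatial_dataset;"
    "input_scrnaseq_reference:output_scrnaseq_reference"
)
print(keys)
```

Under this assumption, a string whose pairs are joined with `,` is not split at all and is rejected as a single malformed pair, so the separator should match whatever the workflow's argument definition expects.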
diff --git a/src/api/comp_data_processor.yaml b/src/api/comp_data_processor.yaml
index 22c77aa..ecd3f9c 100644
--- a/src/api/comp_data_processor.yaml
+++ b/src/api/comp_data_processor.yaml
@@ -27,6 +27,52 @@ argument_groups:
__merge__: file_scrnaseq_reference.yaml
direction: output
required: true
+ - name: Combined Dataset Metadata
+ description: Metadata for the combined dataset that will be stored.
+ arguments:
+ - type: string
+ name: --dataset_id
+ description: "A unique identifier for the dataset"
+ required: true
+ info:
+ test_default: "mouse_brain_combined"
+ - name: --dataset_name
+ type: string
+ description: A human-readable name for the dataset.
+ required: true
+ info:
+ test_default: "Mouse brain combined dataset"
+ - type: string
+ name: --dataset_url
+ description: Link to the original source of the dataset.
+ required: true
+ info:
+ test_default: "https://example.com/mouse_brain_combined"
+ - name: --dataset_reference
+ type: string
+ description: Bibtex reference of the paper in which the dataset was published.
+ required: true
+ multiple: true
+ info:
+ test_default: ["https://example.com/mouse_brain_combined_paper", "10.1234/example.doi"]
+ - name: --dataset_summary
+ type: string
+ description: Short description of the dataset.
+ required: true
+ info:
+ test_default: "Combined dataset for mouse brain spatial transcriptomics"
+ - name: --dataset_description
+ type: string
+ description: Long description of the dataset.
+ required: true
+ info:
+ test_default: "This is a combined dataset for mouse brain spatial transcriptomics."
+ - name: --dataset_organism
+ type: string
+ description: The organism of the sample in the dataset.
+ required: true
+ info:
+ test_default: "Mus musculus"
test_resources:
- path: /resources_test/common/2023_10x_mouse_brain_xenium_rep1
dest: resources_test/common/2023_10x_mouse_brain_xenium_rep1
diff --git a/src/api/file_prediction.yaml b/src/api/file_prediction.yaml
index 23850bb..b1fc443 100644
--- a/src/api/file_prediction.yaml
+++ b/src/api/file_prediction.yaml
@@ -25,3 +25,12 @@ info:
name: region
description: Region
required: true
+ uns:
+ - type: string
+ name: dataset_id
+ description: "A unique identifier for the dataset"
+ required: true
+ - type: string
+ name: method_id
+ description: "A unique identifier for the method"
+ required: true
diff --git a/src/api/file_scrnaseq_reference.yaml b/src/api/file_scrnaseq_reference.yaml
index 06d8491..9b855fd 100644
--- a/src/api/file_scrnaseq_reference.yaml
+++ b/src/api/file_scrnaseq_reference.yaml
@@ -1,9 +1,110 @@
type: file
-example: "resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad"
+example: "resources_test/task_spatial_segmentation/mouse_brain_combined/scrnaseq_reference.h5ad"
# TODO: revert to the original example once file exists
# example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.h5ad"
label: "scRNA-seq Reference"
summary: A single-cell reference dataset, preprocessed for this benchmark.
description: |
This dataset contains preprocessed counts and metadata for single-cell RNA-seq data.
-__merge__: file_common_scrnaseq.yaml
\ No newline at end of file
+info:
+ format:
+ type: h5ad
+ layers:
+ - type: integer
+ name: counts
+ description: Raw counts
+ required: true
+
+ - type: double
+ name: normalized
+ description: Normalized expression values
+ required: true
+
+ - type: double
+ name: normalized_log
+ description: Log1p normalized expression values
+ required: true
+
+ - type: double
+ name: normalized_log_scaled
+ description: Log1p normalized expression values scaled to unit variance and zero mean
+ required: true
+
+ obs:
+ - type: string
+ name: cell_type
+ description: Classification of the cell type based on its characteristics and function within the tissue or organism.
+ required: true
+
+ var:
+ - type: string
+ name: feature_id
+ description: Unique identifier for the feature, usually an Ensembl gene ID.
+ # TODO: make this required once openproblems_v1 dataloader supports it
+ required: false
+
+ - type: string
+ name: feature_name
+ description: A human-readable name for the feature, usually a gene symbol.
+ # TODO: make this required once the dataloader supports it
+ required: true
+
+ - type: boolean
+ name: hvg
+ description: Whether or not the feature is considered to be a 'highly variable gene'
+ required: true
+
+ obsp:
+ - type: double
+ name: knn_distances
+ description: K nearest neighbors distance matrix.
+ required: true
+
+ - type: double
+ name: knn_connectivities
+ description: K nearest neighbors connectivities matrix.
+ required: true
+
+ obsm:
+ - type: double
+ name: X_pca
+ description: The resulting PCA embedding.
+ required: true
+
+ varm:
+ - type: double
+ name: pca_loadings
+ description: The PCA loadings matrix.
+ required: true
+
+ uns:
+ - type: string
+ name: dataset_id
+ description: A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.
+ required: true
+ - name: dataset_name
+ type: string
+ description: A human-readable name for the dataset.
+ required: true
+ - type: string
+ name: dataset_url
+ description: Link to the original source of the dataset.
+ required: false
+ - name: dataset_reference
+ type: string
+ description: Bibtex reference of the paper in which the dataset was published.
+ required: false
+ multiple: true
+ - name: dataset_summary
+ type: string
+ description: Short description of the dataset.
+ required: true
+ - name: dataset_description
+ type: string
+ description: Long description of the dataset.
+ required: true
+ - name: dataset_organism
+ type: string
+ description: The organism of the sample in the dataset.
+ required: false
+ multiple: true
diff --git a/src/api/file_spatial_dataset.yaml b/src/api/file_spatial_dataset.yaml
index 5668a3f..4c5253e 100644
--- a/src/api/file_spatial_dataset.yaml
+++ b/src/api/file_spatial_dataset.yaml
@@ -1,9 +1,153 @@
type: file
-example: "resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr"
+example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr"
# TODO: revert to the original example once file exists
# example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr"
label: "Raw iST Dataset"
summary: A spatial transcriptomics dataset, preprocessed for this benchmark.
description: |
This dataset contains preprocessed images, labels, points, shapes, and tables for spatial transcriptomics data.
-__merge__: file_common_ist.yaml
+info:
+ format:
+ type: spatialdata_zarr
+ images:
+ - type: object
+ name: morphology_mip
+ description: The raw image data
+ required: true
+ labels:
+ - type: object
+ name: "cell_labels"
+ description: Cell segmentation labels
+ required: false
+ - type: object
+ name: "nucleus_labels"
+ description: Nucleus segmentation labels
+ required: false
+ points:
+ - type: dataframe
+ name: transcripts
+ description: Point cloud data of transcripts
+ required: true
+ columns:
+ - type: float
+ name: "x"
+ required: true
+ description: x-coordinate of the point
+ - type: float
+ name: "y"
+ required: true
+ description: y-coordinate of the point
+ - type: float
+ name: "z"
+ required: false
+ description: z-coordinate of the point
+ - type: categorical
+ name: feature_name
+ required: true
+ description: Name of the feature
+ - type: integer
+ name: "cell_id"
+ required: false
+ description: Unique identifier of the cell
+ - type: integer
+ name: "nucleus_id"
+ required: false
+ description: Unique identifier of the nucleus
+ - type: string
+ name: "cell_type"
+ required: false
+ description: Cell type of the cell
+ - type: float
+ name: qv
+ required: false
+ description: Quality value of the point
+ - type: long
+ name: transcript_id
+ required: true
+ description: Unique identifier of the transcript
+ - type: boolean
+ name: overlaps_nucleus
+ required: false
+ description: Whether the point overlaps with a nucleus
+ shapes:
+ - type: dataframe
+ name: "cell_boundaries"
+ description: Cell boundaries
+ required: false
+ columns:
+ - type: object
+ name: "geometry"
+ required: true
+ description: Geometry of the cell boundary
+ - type: dataframe
+ name: "nucleus_boundaries"
+ description: Nucleus boundaries
+ required: false
+ columns:
+ - type: object
+ name: "geometry"
+ required: true
+ description: Geometry of the nucleus boundary
+ tables:
+ - type: anndata
+ name: "table"
+ description: Metadata of spatial dataset
+ required: true
+ uns:
+ - type: string
+ name: dataset_id
+ required: true
+ description: A unique identifier for the dataset
+ - type: string
+ name: dataset_name
+ required: true
+ description: A human-readable name for the dataset
+ - type: string
+ name: dataset_url
+ required: true
+ description: Link to the original source of the dataset
+ - type: string
+ name: dataset_reference
+ required: true
+ description: Bibtex reference of the paper in which the dataset was published
+ - type: string
+ name: dataset_summary
+ required: true
+ description: Short description of the dataset
+ - type: string
+ name: dataset_description
+ required: true
+ description: Long description of the dataset
+ - type: string
+ name: dataset_organism
+ required: true
+ description: The organism of the sample in the dataset
+ - type: string
+ name: segmentation_id
+ required: true
+ multiple: true
+ description: A unique identifier for the segmentation
+ obs:
+ - type: string
+ name: cell_id
+ required: true
+ description: A unique identifier for the cell
+ var:
+ - type: string
+ name: gene_ids
+ required: true
+ description: Unique identifier for the gene
+ - type: string
+ name: feature_types
+ required: true
+ description: Type of the feature
+ obsm:
+ - type: double
+ name: spatial
+ required: true
+ description: Spatial coordinates of the cell
+ coordinate_systems:
+ - type: object
+ name: global
+ description: Coordinate system of the replicate
+ required: true
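Review note: the spec above marks each element and column as `required: true` or `required: false`. The sketch below shows how those flags could be checked against the names actually present in a file. It is an illustration only: `missing_required` is a hypothetical helper, the spec is re-typed by hand as plain dictionaries from a subset of the `transcripts` columns, and the validation performed by the OpenProblems tooling is not shown in this diff.

```python
def missing_required(spec: list[dict], present: set[str]) -> list[str]:
    """Return the names of required spec entries that are not in `present`."""
    return [
        entry["name"]
        for entry in spec
        if entry.get("required", False) and entry["name"] not in present
    ]


# Subset of the `transcripts` columns declared in file_spatial_dataset.yaml.
transcript_columns = [
    {"name": "x", "required": True},
    {"name": "y", "required": True},
    {"name": "z", "required": False},
    {"name": "feature_name", "required": True},
    {"name": "transcript_id", "required": True},
]

# Column names found in a (made-up) points table.
found = {"x", "y", "feature_name"}
missing = missing_required(transcript_columns, found)
print(missing)
```

Optional entries such as `z` never appear in the result, so a dataset without a z-coordinate still passes this check.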
diff --git a/src/base/setup_spatialdata_partial.yaml b/src/base/setup_spatialdata_partial.yaml
index d2b72a2..a552792 100644
--- a/src/base/setup_spatialdata_partial.yaml
+++ b/src/base/setup_spatialdata_partial.yaml
@@ -1,3 +1,5 @@
setup:
- type: python
- pypi: ["spatialdata", "anndata>=0.12.0", "zarr>=3.0.0"]
+ # spatialdata>=0.7.3a1 is required as a workaround for a bug in spatialdata<=0.7.2
+ # See: https://github.com/scverse/spatialdata/issues/1090
+ pypi: ["spatialdata>=0.7.3a1", "anndata>=0.12.0", "zarr>=3.0.0"]
diff --git a/src/data_processors/process_dataset/config.vsh.yaml b/src/data_processors/process_dataset/config.vsh.yaml
index 0047ae1..0ea6508 100644
--- a/src/data_processors/process_dataset/config.vsh.yaml
+++ b/src/data_processors/process_dataset/config.vsh.yaml
@@ -1,31 +1,37 @@
__merge__: ../../api/comp_data_processor.yaml
+
name: process_dataset
-arguments:
- - name: "--method"
- type: "string"
- description: "The process method to assign train/test."
- choices: ["batch", "random"]
- default: "batch"
- - name: "--obs_label"
- type: "string"
- description: "Which .obs slot to use as label."
- default: "cell_type"
- - name: "--obs_batch"
- type: "string"
- description: "Which .obs slot to use as batch covariate."
- default: "batch"
- - name: "--seed"
- type: "integer"
- description: "A seed for the subsampling."
- example: 123
+
+argument_groups:
+ - name: "Processing parameters"
+ arguments:
+ - name: "--seed"
+ type: "integer"
+ description: "A seed for the subsampling."
+ example: 123
+ - name: "--span"
+ type: double
+ description: The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.
+ default: 0.3
+ - name: "--n_top_genes"
+ type: integer
+ description: Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3'.
+ default: 3000
+
resources:
- type: python_script
path: script.py
- - path: /common/helper_functions/subset_h5ad_by_format.py
engines:
- type: docker
+ #image: openproblems/base_pytorch_nvidia:1 # TODO: ideally get gpu image to work
image: openproblems/base_python:1
+ setup:
+ - type: python
+ packages: [scikit-learn, scikit-misc]
+ __merge__:
+ - /src/base/setup_spatialdata_partial.yaml
+ - type: native
runners:
- type: executable
diff --git a/src/data_processors/process_dataset/script.py b/src/data_processors/process_dataset/script.py
index 7cca2bd..b04cb33 100644
--- a/src/data_processors/process_dataset/script.py
+++ b/src/data_processors/process_dataset/script.py
@@ -1,27 +1,58 @@
-import sys
import random
-import numpy as np
import anndata as ad
-import openproblems as op
+import spatialdata as sd
+import scanpy as sc
## VIASH START
par = {
- 'input_sp': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr',
- 'input_sc': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad',
- 'output_spatial_dataset': 'output_spatial_dataset.zarr',
- 'output_scrnaseq_reference': 'output_scrnaseq_reference.h5ad',
-}
-meta = {
- 'resources_dir': 'target/executable/data_processors/process_dataset',
- 'config': 'target/executable/data_processors/process_dataset/.config.vsh.yaml'
+ 'input_sp': 'resources_test/common/2023_10x_mouse_brain_xenium_rep1/dataset.zarr',
+ 'input_sc': 'resources_test/common/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad',
+ 'output_spatial_dataset': 'resources_test/task_spatial_segmentation/mouse_brain_combined/output_spatial_dataset.zarr',
+ 'output_scrnaseq_reference': 'resources_test/task_spatial_segmentation/mouse_brain_combined/output_scrnaseq_reference.h5ad',
+ 'span': 0.3,
+ 'seed': 123,
+ 'n_top_genes': 3000,
+ 'dataset_id': 'mouse_brain_combined',
+ 'dataset_name': 'Mouse brain combined dataset',
+ 'dataset_url': '',
+ 'dataset_summary': '',
+ 'dataset_description': '',
+ 'dataset_reference': [],
+ 'dataset_organism': 'Mus musculus',
}
## VIASH END
-# import helper functions
-sys.path.append(meta['resources_dir'])
-from subset_h5ad_by_format import subset_h5ad_by_format
+def sc_processing(adata):
+    if "counts" not in adata.layers and adata.X is not None:
+        print(">> Save raw counts in .layers", flush=True)
+        adata.layers["counts"] = adata.X.copy()
+
+    if "normalized" not in adata.layers:
+        print(">> Perform standard normalization", flush=True)
+        adata.layers["normalized"] = adata.layers["counts"].copy()
+        sc.pp.normalize_total(adata, layer="normalized", inplace=True)
+
+    if "normalized_log" not in adata.layers:
+        print(">> Perform log1p transformation", flush=True)
+        adata.layers["normalized_log"] = adata.layers["normalized"].copy()
+        sc.pp.log1p(adata, layer="normalized_log")
+
+    if "normalized_log_scaled" not in adata.layers:
+        print(">> Scale to zero mean and unit variance", flush=True)
+        adata.layers["normalized_log_scaled"] = adata.layers["normalized_log"].copy()
+        sc.pp.scale(adata, layer="normalized_log_scaled")
+
+    if "hvg" not in adata.var:
+        print(">> Compute highly variable genes", flush=True)
+        sc.pp.highly_variable_genes(
+            adata,
+            flavor="seurat_v3",
+            layer="counts",
+            span=par['span'],
+            n_top_genes=par['n_top_genes']
+        )
+        adata.var.rename(columns={"highly_variable": "hvg"}, inplace=True)
-config = op.project.read_viash_config(meta["config"])
# set seed if need be
if par["seed"]:
@@ -29,54 +60,46 @@
random.seed(par["seed"])
print(">> Load data", flush=True)
-adata = ad.read_h5ad(par["input"])
-print("input:", adata)
-
-print(f">> Process data using {par['method']} method")
-if par["method"] == "batch":
- batch_info = adata.obs[par["obs_batch"]]
- batch_categories = batch_info.dtype.categories
- test_batches = random.sample(list(batch_categories), 1)
- is_test = [ x in test_batches for x in batch_info ]
-elif par["method"] == "random":
- train_ix = np.random.choice(adata.n_obs, round(adata.n_obs * 0.8), replace=False)
- is_test = [ not x in train_ix for x in range(0, adata.n_obs) ]
-
-# subset the different adatas
-print(">> Figuring which data needs to be copied to which output file", flush=True)
-# use par arguments to look for label and batch value in different slots
-slot_mapping = {
- "obs": {
- "label": par["obs_label"],
- "batch": par["obs_batch"],
- }
-}
+sc_data = ad.read_h5ad(par["input_sc"])
+print(f"single cell data: {sc_data}")
+
+print(">> Processing sc_data", flush=True)
+sc_processing(sc_data)
-print(">> Creating train data", flush=True)
-output_train = subset_h5ad_by_format(
- adata[[not x for x in is_test]],
- config,
- "output_train",
- slot_mapping
-)
-
-print(">> Creating test data", flush=True)
-output_test = subset_h5ad_by_format(
- adata[is_test],
- config,
- "output_test",
- slot_mapping
-)
-
-print(">> Creating solution data", flush=True)
-output_solution = subset_h5ad_by_format(
- adata[is_test],
- config,
- "output_solution",
- slot_mapping
-)
+print(">> Override dataset metadata in .uns", flush=True)
+sc_data.uns["orig_dataset_id"] = sc_data.uns.get("dataset_id", None)
+for key in ["dataset_id", "dataset_name", "dataset_url", "dataset_summary", "dataset_description", "dataset_reference", "dataset_organism"]:
+    sc_data.uns[key] = par[key]
print(">> Writing data", flush=True)
-output_train.write_h5ad(par["output_train"])
-output_test.write_h5ad(par["output_test"])
-output_solution.write_h5ad(par["output_solution"])
+sc_data.write_h5ad(par["output_scrnaseq_reference"], compression="gzip")
+
+# read input_sp
+print(">> Read spatial data", flush=True)
+sp_data = sd.read_zarr(par["input_sp"])
+print(f"spatial data: {sp_data}")
+
+print(">> Processing spatial data", flush=True)
+sp_data_table = sp_data.tables['table']
+print(f"single cell part of spatial data: {sp_data_table}")
+sc_processing(sp_data_table)
+
+if "cell_area" not in sp_data_table.obs:
+    print(">> Perform scanpy qc for cell area", flush=True)
+    sc.pp.calculate_qc_metrics(sp_data_table, layer="counts", inplace=True)
+
+for x in ["transcript_counts", "n_genes_by_counts"]:
+    # Only normalize when both the metric and the cell area are available
+    if f"ca_normalized_{x}" not in sp_data_table.obs and x in sp_data_table.obs and "cell_area" in sp_data_table.obs:
+        print(f">> Perform cell area normalization for {x}", flush=True)
+        sp_data_table.obs[f"ca_normalized_{x}"] = sp_data_table.obs[x] / sp_data_table.obs["cell_area"]
+
+print(">> Override dataset metadata in .uns", flush=True)
+sp_data_table.uns["orig_dataset_id"] = sp_data_table.uns.get("dataset_id", None)
+for key in ["dataset_id", "dataset_name", "dataset_url", "dataset_summary", "dataset_description", "dataset_reference", "dataset_organism"]:
+    sp_data_table.uns[key] = par[key]
+
+print(f"spatial data: {sp_data}")
+print(f"spatial data tables['table']: {sp_data.tables['table']}")
+
+print(">> Writing spatial data", flush=True)
+sp_data.write(par["output_spatial_dataset"], overwrite=True)
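Review note: the layer names and log messages in `sc_processing` describe a three-step chain (total-count normalization, then `log1p`, then scaling to zero mean and unit variance). The NumPy sketch below illustrates what each layer is expected to contain. It is an illustration under stated assumptions, not scanpy's implementation: `layer_chain` is a hypothetical helper, the normalization target is assumed to be the median of the per-cell totals, and the scaling assumes the sample standard deviation (`ddof=1`).

```python
import numpy as np


def layer_chain(counts):
    """Build the layer sequence counts -> normalized -> log1p -> scaled."""
    counts = np.asarray(counts, dtype=float)

    # Total-count normalization: rescale every cell (row) to the same total.
    totals = counts.sum(axis=1, keepdims=True)
    target = np.median(totals)  # assumption: median of per-cell totals
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
    normalized = counts / safe_totals * target

    # Log transform of the normalized values.
    normalized_log = np.log1p(normalized)

    # Per-gene (column) scaling to zero mean and unit variance.
    mean = normalized_log.mean(axis=0)
    std = normalized_log.std(axis=0, ddof=1)
    std = np.where(std > 0, std, 1.0)  # leave constant genes unscaled
    scaled = (normalized_log - mean) / std

    return {
        "counts": counts,
        "normalized": normalized,
        "normalized_log": normalized_log,
        "normalized_log_scaled": scaled,
    }


# Three cells, two genes; the third cell has ten times the sequencing depth.
layers = layer_chain([[1, 3], [2, 2], [10, 30]])
```

In this toy input the first and third cells have the same expression profile at different depths, so they become identical after the first step. Applying the same normalization three times in a row would not produce the log-transformed or scaled layers.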
diff --git a/src/methods/cellpose/config.vsh.yaml b/src/methods/cellpose/config.vsh.yaml
index 46be884..47c6cec 100644
--- a/src/methods/cellpose/config.vsh.yaml
+++ b/src/methods/cellpose/config.vsh.yaml
@@ -1,11 +1,11 @@
name: cellpose
label: "Cellpose"
# TODO: update the summary, description and links
-summary: "Output of the segmantation methot cellpose"
-description: "Output of the segmantation methot cellpose"
+summary: "Cellpose-SAM: cell and nucleus segmentation with superhuman generalization."
+description: "cellpose is an anatomical segmentation algorithm written in Python 3."
links: # these should point to the documentation of the method
- documentation: "https://github.com/openproblems-bio/task_ist_preprocessing"
- repository: "https://github.com/openproblems-bio/task_ist_preprocessing"
+ documentation: "https://cellpose.readthedocs.io/en/latest/"
+ repository: "https://github.com/mouseland/cellpose"
references:
doi: "10.1038/s41592-020-01018-x"
diff --git a/src/methods/cellpose/script.py b/src/methods/cellpose/script.py
index 4949b8d..6ebae72 100644
--- a/src/methods/cellpose/script.py
+++ b/src/methods/cellpose/script.py
@@ -1,6 +1,7 @@
-import dask.array as da
+import anndata as ad
import numpy as np
import os
+import pandas as pd
import shutil
import spatialdata as sd
import xarray as xr
@@ -9,14 +10,28 @@
## VIASH START
par = {
- 'input': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr',
- 'output': 'resources_test/task_spatial_segmentation/mouse_brain_combined/prediction.h5ad'
+ 'input': 'resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr',
+ 'output': 'prediction.zarr'
}
meta = {
'name': 'cellpose'
}
## VIASH END
+# TODO: move to helper file
+def convert_to_lower_dtype(arr):
+    max_val = arr.max()
+    if max_val <= np.iinfo(np.uint8).max:
+        new_dtype = np.uint8
+    elif max_val <= np.iinfo(np.uint16).max:
+        new_dtype = np.uint16
+    elif max_val <= np.iinfo(np.uint32).max:
+        new_dtype = np.uint32
+    else:
+        new_dtype = np.uint64
+
+    return arr.astype(new_dtype)
+
print('Reading input', flush=True)
sdata = sd.read_zarr(par["input"])
image = sdata['morphology_mip']['scale0'].image.compute().to_numpy()
@@ -30,21 +45,27 @@
masks, _, _ = model.eval(image[0], progress=True, **eval_params)
print('Cellpose segmentation finished, post-processing results', flush=True)
-# Convert to smallest sufficient unsigned int dtype
-max_val = masks.max()
-for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
- if max_val <= np.iinfo(dtype).max:
- masks = masks.astype(dtype)
- break
+masks = convert_to_lower_dtype(masks)
print('Segmentation done, preparing output', flush=True)
sd_output = sd.SpatialData()
-# Wrap masks as a single-chunk dask array with flat chunk shape for zarr v3 compat
-dask_masks = da.from_array(masks, chunks=masks.shape)
-data_array = xr.DataArray(dask_masks, name='segmentation', dims=('y', 'x'))
+data_array = xr.DataArray(masks, name='segmentation', dims=('y', 'x'))
parsed = Labels2DModel.parse(data_array, transformations=transformation)
sd_output.labels['segmentation'] = parsed
+cell_ids = np.setdiff1d(np.unique(masks), [0])  # exclude background (0) without assuming it is present
+table = ad.AnnData(
+ obs=pd.DataFrame(
+ {'cell_id': cell_ids.astype(str), 'region': 'segmentation'},
+ index=cell_ids.astype(str),
+ ),
+ uns={
+ 'dataset_id': sdata.tables['table'].uns['dataset_id'],
+ 'method_id': meta['name']
+ }
+)
+sd_output.tables['table'] = table
+
print('Saving output', flush=True)
if os.path.exists(par["output"]):
shutil.rmtree(par["output"])