diff --git a/README.md b/README.md index ccf6db4..1526580 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ A one sentence summary of purpose and methodology. Used for creating an overview tables. Repository: -[openproblems-bio/task_template](https://github.com/openproblems-bio/task_template) +[openproblems-bio/task_spatial_segmentation](https://github.com/openproblems-bio/task_spatial_segmentation) ## Description @@ -28,34 +28,34 @@ should convince readers of the significance and relevance of your task. ## Authors & contributors -| Name | Roles | Linkedin | Twitter | Email | Github | Orcid | -|:---|:---|:---|:---|:---|:---|:---| -| John Doe | author, maintainer | johndoe | johndoe | john@doe.me | johndoe | 0000-0000-0000-0000 | +| name | roles | +|:---------|:-------------------| +| John Doe | author, maintainer | ## API ``` mermaid flowchart TB - file_common_ist("Common iST Dataset") - comp_data_processor[/"Data processor"/] - file_spatial_dataset("Raw iST Dataset") - file_scrnaseq_reference("scRNA-seq Reference") - comp_control_method[/"Control Method"/] - comp_method[/"Method"/] - comp_metric[/"Metric"/] - file_prediction("Predicted data") - file_score("Score") - file_common_scrnaseq("Common SC Dataset") + file_common_ist("Common iST Dataset") + comp_data_processor[/"Data processor"/] + file_scrnaseq_reference("scRNA-seq Reference") + file_spatial_dataset("Raw iST Dataset") + comp_control_method[/"Control Method"/] + comp_metric[/"Metric"/] + comp_method[/"Method"/] + file_prediction("Predicted data") + file_score("Score") + file_common_scrnaseq("Common SC Dataset") file_common_ist---comp_data_processor - comp_data_processor-->file_spatial_dataset comp_data_processor-->file_scrnaseq_reference - file_spatial_dataset---comp_control_method - file_spatial_dataset---comp_method + comp_data_processor-->file_spatial_dataset file_scrnaseq_reference---comp_control_method file_scrnaseq_reference---comp_metric + file_spatial_dataset---comp_control_method + 
file_spatial_dataset---comp_method comp_control_method-->file_prediction - comp_method-->file_prediction comp_metric-->file_score + comp_method-->file_prediction file_prediction---comp_metric file_common_scrnaseq---comp_data_processor ``` @@ -76,91 +76,12 @@ Format:
- SpatialData object - images: 'image', 'image_3D', 'he_image' - labels: 'cell_labels', 'nucleus_labels' - points: 'transcripts' - shapes: 'cell_boundaries', 'nucleus_boundaries' - tables: 'metadata' - coordinate_systems: 'global' -
Data structure:
-*images* - -| Name | Description | -|:-----------|:------------------------------------| -| `image` | The raw image data. | -| `image_3D` | (*Optional*) The raw 3D image data. | -| `he_image` | (*Optional*) H&E image data. | - -*labels* - -| Name | Description | -|:-----------------|:---------------------------------------| -| `cell_labels` | (*Optional*) Cell segmentation labels. | -| `nucleus_labels` | (*Optional*) Cell segmentation labels. | - -*points* - -`transcripts`: Point cloud data of transcripts. - -| Column | Type | Description | -|:---|:---|:---| -| `x` | `float` | x-coordinate of the point. | -| `y` | `float` | y-coordinate of the point. | -| `z` | `float` | (*Optional*) z-coordinate of the point. | -| `feature_name` | `categorical` | Name of the feature. | -| `cell_id` | `integer` | (*Optional*) Unique identifier of the cell. | -| `nucleus_id` | `integer` | (*Optional*) Unique identifier of the nucleus. | -| `cell_type` | `string` | (*Optional*) Cell type of the cell. | -| `qv` | `float` | (*Optional*) Quality value of the point. | -| `transcript_id` | `long` | Unique identifier of the transcript. | -| `overlaps_nucleus` | `boolean` | (*Optional*) Whether the point overlaps with a nucleus. | - -*shapes* - -`cell_boundaries`: Cell boundaries. - -| Column | Type | Description | -|:-----------|:---------|:-------------------------------| -| `geometry` | `object` | Geometry of the cell boundary. | - -`nucleus_boundaries`: Nucleus boundaries. - -| Column | Type | Description | -|:-----------|:---------|:----------------------------------| -| `geometry` | `object` | Geometry of the nucleus boundary. | - -*tables* - -`metadata`: Metadata of spatial dataset. - -| Slot | Type | Description | -|:---|:---|:---| -| `obs["cell_id"]` | `string` | A unique identifier for the cell. | -| `var["gene_ids"]` | `string` | Unique identifier for the gene. | -| `var["feature_types"]` | `string` | Type of the feature. 
| -| `obsm["spatial"]` | `double` | Spatial coordinates of the cell. | -| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | -| `uns["dataset_name"]` | `string` | A human-readable name for the dataset. | -| `uns["dataset_url"]` | `string` | Link to the original source of the dataset. | -| `uns["dataset_reference"]` | `string` | Bibtex reference of the paper in which the dataset was published. | -| `uns["dataset_summary"]` | `string` | Short description of the dataset. | -| `uns["dataset_description"]` | `string` | Long description of the dataset. | -| `uns["dataset_organism"]` | `string` | The organism of the sample in the dataset. | -| `uns["segmentation_id"]` | `string` | A unique identifier for the segmentation. | - -*coordinate_systems* - -| Name | Description | -|:---------|:------------------------------------| -| `global` | Coordinate system of the replicate. | -
## Component type: Data processor @@ -176,110 +97,7 @@ Arguments: | `--input_sp` | `file` | An unprocessed spatial imaging dataset stored as a zarr file. | | `--input_sc` | `file` | An unprocessed dataset as output by a dataset loader. | | `--output_spatial_dataset` | `file` | (*Output*) A spatial transcriptomics dataset, preprocessed for this benchmark. | -| `--output_scrnaseq_reference` | `file` | (*Output*) A single-cell reference dataset, preprocessed for this benchmark. | - - - -## File format: Raw iST Dataset - -A spatial transcriptomics dataset, preprocessed for this benchmark. - -Example file: -`resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr` - -Description: - -This dataset contains preprocessed images, labels, points, shapes, and -tables for spatial transcriptomics data. - -Format: - -
- - SpatialData object - images: 'image', 'image_3D', 'he_image' - labels: 'cell_labels', 'nucleus_labels' - points: 'transcripts' - shapes: 'cell_boundaries', 'nucleus_boundaries' - tables: 'metadata' - coordinate_systems: 'global' - -
- -Data structure: - -
- -*images* - -| Name | Description | -|:-----------|:------------------------------------| -| `image` | The raw image data. | -| `image_3D` | (*Optional*) The raw 3D image data. | -| `he_image` | (*Optional*) H&E image data. | - -*labels* - -| Name | Description | -|:-----------------|:---------------------------------------| -| `cell_labels` | (*Optional*) Cell segmentation labels. | -| `nucleus_labels` | (*Optional*) Cell segmentation labels. | - -*points* - -`transcripts`: Point cloud data of transcripts. - -| Column | Type | Description | -|:---|:---|:---| -| `x` | `float` | x-coordinate of the point. | -| `y` | `float` | y-coordinate of the point. | -| `z` | `float` | (*Optional*) z-coordinate of the point. | -| `feature_name` | `categorical` | Name of the feature. | -| `cell_id` | `integer` | (*Optional*) Unique identifier of the cell. | -| `nucleus_id` | `integer` | (*Optional*) Unique identifier of the nucleus. | -| `cell_type` | `string` | (*Optional*) Cell type of the cell. | -| `qv` | `float` | (*Optional*) Quality value of the point. | -| `transcript_id` | `long` | Unique identifier of the transcript. | -| `overlaps_nucleus` | `boolean` | (*Optional*) Whether the point overlaps with a nucleus. | - -*shapes* - -`cell_boundaries`: Cell boundaries. - -| Column | Type | Description | -|:-----------|:---------|:-------------------------------| -| `geometry` | `object` | Geometry of the cell boundary. | - -`nucleus_boundaries`: Nucleus boundaries. - -| Column | Type | Description | -|:-----------|:---------|:----------------------------------| -| `geometry` | `object` | Geometry of the nucleus boundary. | - -*tables* - -`metadata`: Metadata of spatial dataset. - -| Slot | Type | Description | -|:---|:---|:---| -| `obs["cell_id"]` | `string` | A unique identifier for the cell. | -| `var["gene_ids"]` | `string` | Unique identifier for the gene. | -| `var["feature_types"]` | `string` | Type of the feature. 
| -| `obsm["spatial"]` | `double` | Spatial coordinates of the cell. | -| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. | -| `uns["dataset_name"]` | `string` | A human-readable name for the dataset. | -| `uns["dataset_url"]` | `string` | Link to the original source of the dataset. | -| `uns["dataset_reference"]` | `string` | Bibtex reference of the paper in which the dataset was published. | -| `uns["dataset_summary"]` | `string` | Short description of the dataset. | -| `uns["dataset_description"]` | `string` | Long description of the dataset. | -| `uns["dataset_organism"]` | `string` | The organism of the sample in the dataset. | -| `uns["segmentation_id"]` | `string` | A unique identifier for the segmentation. | - -*coordinate_systems* - -| Name | Description | -|:---------|:------------------------------------| -| `global` | Coordinate system of the replicate. | +| `--output_scrnaseq` | `file` | (*Output*) A single-cell reference dataset, preprocessed for this benchmark. |
@@ -288,7 +106,7 @@ Data structure: A single-cell reference dataset, preprocessed for this benchmark. Example file: -`resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad` +`resources_test/task_spatial_segmentation/mouse_brain_combined/scrnaseq_reference.h5ad` Description: @@ -364,6 +182,30 @@ Data structure: +## File format: Raw iST Dataset + +A spatial transcriptomics dataset, preprocessed for this benchmark. + +Example file: +`resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr` + +Description: + +This dataset contains preprocessed images, labels, points, shapes, and +tables for spatial transcriptomics data. + +Format: + +
+ +
+ +Data structure: + +
+ +
+ ## Component type: Control Method Quality control methods for verifying the pipeline. @@ -380,9 +222,9 @@ Arguments: -## Component type: Method +## Component type: Metric -A method. +A task template metric. Arguments: @@ -390,14 +232,15 @@ Arguments: | Name | Type | Description | |:---|:---|:---| -| `--input` | `file` | A spatial transcriptomics dataset, preprocessed for this benchmark. | -| `--output` | `file` | (*Output*) A predicted dataset as output by a method. | +| `--input_prediction` | `file` | A predicted dataset as output by a method. | +| `--input_scrnaseq_reference` | `file` | A single-cell reference dataset, preprocessed for this benchmark. | +| `--output` | `file` | (*Output*) File indicating the score of a metric. | -## Component type: Metric +## Component type: Method -A task template metric. +A method. Arguments: @@ -405,9 +248,8 @@ Arguments: | Name | Type | Description | |:---|:---|:---| -| `--input_prediction` | `file` | A predicted dataset as output by a method. | -| `--input_scrnaseq_reference` | `file` | A single-cell reference dataset, preprocessed for this benchmark. | -| `--output` | `file` | (*Output*) File indicating the score of a metric. | +| `--input` | `file` | A spatial transcriptomics dataset, preprocessed for this benchmark. | +| `--output` | `file` | (*Output*) A predicted dataset as output by a method. | @@ -422,31 +264,12 @@ Format:
- SpatialData object - labels: 'segmentation' - tables: 'table' -
Data structure:
-*labels* - -| Name | Description | -|:---------------|:--------------------------| -| `segmentation` | Segmentation of the data. | - -*tables* - -`table`: AnnData table. - -| Slot | Type | Description | -|:-----------------|:---------|:------------| -| `obs["cell_id"]` | `string` | Cell ID. | -| `obs["region"]` | `string` | Region. | -
## File format: Score @@ -562,3 +385,4 @@ Data structure: | `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. | + diff --git a/_viash.yaml b/_viash.yaml index 31ad320..a0130fe 100644 --- a/_viash.yaml +++ b/_viash.yaml @@ -11,8 +11,8 @@ license: MIT keywords: [single-cell, openproblems, benchmark] # Step 3: Update the `task_template` to the name of the task from step 1. links: - issue_tracker: https://github.com/openproblems-bio/task_template/issues - repository: https://github.com/openproblems-bio/task_template + issue_tracker: https://github.com/openproblems-bio/task_spatial_segmentation/issues + repository: https://github.com/openproblems-bio/task_spatial_segmentation docker_registry: ghcr.io diff --git a/scripts/create_resources/resources.sh b/scripts/create_resources/resources.sh index 57f4d68..52ee226 100755 --- a/scripts/create_resources/resources.sh +++ b/scripts/create_resources/resources.sh @@ -18,7 +18,7 @@ cat > /tmp/params.yaml << 'HERE' input_states: s3://openproblems-data/resources/datasets/**/state.yaml rename_keys: 'input:output_dataset' output_state: '$id/state.yaml' -settings: '{"output_train": "$id/train.h5ad", "output_test": "$id/test.h5ad", "output_solution": "$id/solution.h5ad"}' +settings: '{"output_spatial_dataset": "$id/output_spatial_dataset.zarr", "output_scrnaseq": "$id/output_scrnaseq.h5ad"}' publish_dir: s3://openproblems-data/resources/task_template/datasets/ HERE diff --git a/scripts/create_resources/test_resources.sh b/scripts/create_resources/test_resources.sh index 9cb372a..b11d437 100755 --- a/scripts/create_resources/test_resources.sh +++ b/scripts/create_resources/test_resources.sh @@ -13,41 +13,49 @@ cd "$REPO_ROOT" set -e +DATASET_ID=mouse_brain_combined + RAW_DATA=resources_test/common -DATASET_DIR=resources_test/task_template +DATASET_DIR=resources_test/task_spatial_segmentation/$DATASET_ID mkdir -p $DATASET_DIR # process dataset viash run 
src/data_processors/process_dataset/config.vsh.yaml -- \ - --input $RAW_DATA/cxg_mouse_pancreas_atlas/dataset.h5ad \ - --output_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \ - --output_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \ - --output_solution $DATASET_DIR/cxg_mouse_pancreas_atlas/solution.h5ad + --input_sp $RAW_DATA/2023_10x_mouse_brain_xenium_rep1/dataset.zarr \ + --input_sc $RAW_DATA/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad \ + --output_spatial_dataset $DATASET_DIR/spatial_dataset.zarr \ + --output_scrnaseq_reference $DATASET_DIR/scrnaseq_reference.h5ad \ + --dataset_id mouse_brain_combined \ + --dataset_name "Test data mouse brain combined 2023 tenx Xenium replicate 1 2023 Yao scRNAseq" \ + --dataset_url "https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard;https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246717" \ + --dataset_reference "https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard;10.1038/s41586-023-06812-z" \ + --dataset_summary "Demonstration of gene expression profiling for fresh frozen mouse brain on the Xenium platform using the pre-designed Mouse Brain Gene Expression Panel (v1);A high-resolution scRNAseq atlas of cell types in the whole mouse brain" \ + --dataset_description "Demonstration of gene expression profiling for fresh frozen mouse brain on the Xenium platform using the pre-designed Mouse Brain Gene Expression Panel (v1). Replicate results demonstrate the high reproducibility of data generated by the platform. 10x Genomics obtained tissue from a C57BL/6 mouse from Charles River Laboratories. Three adjacent 10µm sections were placed on the same slide. Tissues were prepared following the demonstrated protocols Xenium In Situ for Fresh Frozen Tissues - Tissue Preparation Guide (CG000579) and Xenium In Situ for Fresh Frozen Tissues - Fixation & Permeabilization (CG000581).;See dataset_reference for more information. 
Note that we only took the 10xv2 data from the dataset." \ + --dataset_organism "mus_musculus" # run one method -viash run src/methods/logistic_regression/config.vsh.yaml -- \ - --input_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \ - --input_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \ - --output $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad +viash run src/methods/cellpose/config.vsh.yaml -- \ + --input $DATASET_DIR/spatial_dataset.zarr \ + --output $DATASET_DIR/prediction.h5ad # run one metric -viash run src/metrics/accuracy/config.vsh.yaml -- \ - --input_prediction $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad \ - --input_solution $DATASET_DIR/cxg_mouse_pancreas_atlas/solution.h5ad \ - --output $DATASET_DIR/cxg_mouse_pancreas_atlas/score.h5ad +# TODO: implement this! +# viash run src/metrics/ari/config.vsh.yaml -- \ +# --input_prediction $DATASET_DIR/prediction.h5ad \ +# --input_scrnaseq_reference $DATASET_DIR/scrnaseq_reference.h5ad \ +# --output $DATASET_DIR/score.h5ad # write manual state.yaml. 
this is not actually necessary but you never know it might be useful -cat > $DATASET_DIR/cxg_mouse_pancreas_atlas/state.yaml << HERE -id: cxg_mouse_pancreas_atlas -train: !file train.h5ad -test: !file test.h5ad -solution: !file solution.h5ad -prediction: !file prediction.h5ad -score: !file score.h5ad +cat > $DATASET_DIR/state.yaml << HERE +id: $DATASET_ID +spatial_dataset: spatial_dataset.zarr +scrnaseq_reference: scrnaseq_reference.h5ad +prediction: prediction.h5ad +score: score.h5ad HERE # only run this if you have access to the openproblems-data bucket aws s3 sync --profile op \ - "$DATASET_DIR" s3://openproblems-data/resources_test/task_template \ + "$DATASET_DIR" s3://openproblems-data/resources_test/task_spatial_segmentation/mouse_brain_combined/ \ --delete --dryrun diff --git a/scripts/create_test_resources/README.md b/scripts/create_test_resources/README.md deleted file mode 100644 index 46bb116..0000000 --- a/scripts/create_test_resources/README.md +++ /dev/null @@ -1,3 +0,0 @@ -Here we generate a small test dataset, used for `viash test`. Note that the file structure here is a bit simplified compared to `scripts/create_resources` as we only have one dataset. - -Copy the data from the `task_ist_preprocessing` test resources: `mouse_brain_combined.sh` diff --git a/scripts/create_test_resources/mouse_brain_combined.sh b/scripts/create_test_resources/mouse_brain_combined.sh deleted file mode 100755 index 05e4de9..0000000 --- a/scripts/create_test_resources/mouse_brain_combined.sh +++ /dev/null @@ -1,32 +0,0 @@ -#!/bin/bash - -# get the root of the directory -REPO_ROOT=$(git rev-parse --show-toplevel) - -# ensure that the command below is run from the root of the repository -cd "$REPO_ROOT" - -set -e - -if [ ! 
-d resources_test/task_spatial_segmentation/mouse_brain_combined ]; then - mkdir -p resources_test/task_spatial_segmentation/mouse_brain_combined -fi - -# these files were generated by https://github.com/openproblems-bio/task_ist_preprocessing/tree/main/scripts/create_test_resources -# we can just copy them for now - -aws s3 sync --profile op \ - s3://openproblems-data/resources_test/common/2023_10x_mouse_brain_xenium_rep1/dataset.zarr \ - resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr - -aws s3 cp --profile op \ - s3://openproblems-data/resources_test/common/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad \ - resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad - -# ...additional preprocessing if needed ... - -# sync to s3 -aws s3 sync --profile op \ - "resources_test/task_spatial_segmentation/mouse_brain_combined/" \ - "s3://openproblems-data/resources_test/task_spatial_segmentation/mouse_brain_combined/" \ - --delete --dryrun diff --git a/scripts/run_benchmark/run_full_local.sh b/scripts/run_benchmark/run_full_local.sh index f8c1585..26bba56 100755 --- a/scripts/run_benchmark/run_full_local.sh +++ b/scripts/run_benchmark/run_full_local.sh @@ -31,7 +31,7 @@ publish_dir="resources/results/${RUN_ID}" # write the parameters to file cat > /tmp/params.yaml << HERE input_states: resources/datasets/**/state.yaml -rename_keys: 'input_train:output_train;input_test:output_test;input_solution:output_solution' +rename_keys: 'input_spatial_dataset:output_spatial_dataset,input_scrnaseq_reference:output_scrnaseq_reference' output_state: "state.yaml" publish_dir: "$publish_dir" HERE diff --git a/scripts/run_benchmark/run_full_seqeracloud.sh b/scripts/run_benchmark/run_full_seqeracloud.sh index 87d133c..3c31e74 100755 --- a/scripts/run_benchmark/run_full_seqeracloud.sh +++ b/scripts/run_benchmark/run_full_seqeracloud.sh @@ -23,7 +23,7 @@ publish_dir="s3://openproblems-data/resources/task_template/results/${RUN_ID}" # 
write the parameters to file cat > /tmp/params.yaml << HERE input_states: s3://openproblems-data/resources/task_template/datasets/**/state.yaml -rename_keys: 'input_train:output_train;input_test:output_test;input_solution:output_solution' +rename_keys: 'input_spatial_dataset:output_spatial_dataset,input_scrnaseq_reference:output_scrnaseq_reference' output_state: "state.yaml" publish_dir: "$publish_dir" HERE diff --git a/src/api/comp_data_processor.yaml b/src/api/comp_data_processor.yaml index 22c77aa..ecd3f9c 100644 --- a/src/api/comp_data_processor.yaml +++ b/src/api/comp_data_processor.yaml @@ -27,6 +27,52 @@ argument_groups: __merge__: file_scrnaseq_reference.yaml direction: output required: true + - name: Combined Dataset Metadata + description: Metadata for the combined dataset that will be stored. + arguments: + - type: string + name: --dataset_id + description: "A unique identifier for the dataset" + required: true + info: + test_default: "mouse_brain_combined" + - name: --dataset_name + type: string + description: Nicely formatted name. + required: true + info: + test_default: "Mouse brain combined dataset" + - type: string + name: --dataset_url + description: Link to the original source of the dataset. + required: true + info: + test_default: "https://example.com/mouse_brain_combined" + - name: --dataset_reference + type: string + description: Bibtex reference of the paper in which the dataset was published. + required: true + multiple: true + info: + test_default: ["https://example.com/mouse_brain_combined_paper", "10.1234/example.doi"] + - name: --dataset_summary + type: string + description: Short description of the dataset. + required: true + info: + test_default: "Combined dataset for mouse brain spatial transcriptomics" + - name: --dataset_description + type: string + description: Long description of the dataset. + required: true + info: + test_default: "This is a combined dataset for mouse brain spatial transcriptomics." 
+ - name: --dataset_organism + type: string + description: The organism of the sample in the dataset. + required: true + info: + test_default: "Mus musculus" test_resources: - path: /resources_test/common/2023_10x_mouse_brain_xenium_rep1 dest: resources_test/common/2023_10x_mouse_brain_xenium_rep1 diff --git a/src/api/file_prediction.yaml b/src/api/file_prediction.yaml index 23850bb..b1fc443 100644 --- a/src/api/file_prediction.yaml +++ b/src/api/file_prediction.yaml @@ -25,3 +25,12 @@ info: name: region description: Region required: true + uns: + - type: string + name: dataset_id + description: "A unique identifier for the dataset" + required: true + - type: string + name: method_id + description: "A unique identifier for the method" + required: true diff --git a/src/api/file_scrnaseq_reference.yaml b/src/api/file_scrnaseq_reference.yaml index 06d8491..9b855fd 100644 --- a/src/api/file_scrnaseq_reference.yaml +++ b/src/api/file_scrnaseq_reference.yaml @@ -1,9 +1,110 @@ type: file -example: "resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad" +example: "resources_test/task_spatial_segmentation/mouse_brain_combined/scrnaseq_reference.h5ad" # TODO: revert to the original example once file exists # example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.h5ad" label: "scRNA-seq Reference" summary: A single-cell reference dataset, preprocessed for this benchmark. description: | This dataset contains preprocessed counts and metadata for single-cell RNA-seq data. 
-__merge__: file_common_scrnaseq.yaml \ No newline at end of file +info: + format: + type: h5ad + layers: + - type: integer + name: counts + description: Raw counts + required: true + + - type: double + name: normalized + description: Normalized expression values + required: true + + - type: double + name: normalized_log + description: Log1p normalized expression values + required: true + + - type: double + name: normalized_log_scaled + description: Log1p normalized expression values scaled to unit variance and zero mean + required: true + + obs: + - type: string + name: cell_type + description: Classification of the cell type based on its characteristics and function within the tissue or organism. + required: true + + var: + - type: string + name: feature_id + description: Unique identifier for the feature, usually an ENSEMBL gene id. + # TODO: make this required once openproblems_v1 dataloader supports it + required: false + + - type: string + name: feature_name + description: A human-readable name for the feature, usually a gene symbol. + # TODO: make this required once the dataloader supports it + required: true + + - type: boolean + name: hvg + description: Whether or not the feature is considered to be a 'highly variable gene' + required: true + + obsp: + - type: double + name: knn_distances + description: K nearest neighbors distance matrix. + required: true + + - type: double + name: knn_connectivities + description: K nearest neighbors connectivities matrix. + required: true + + obsm: + - type: double + name: X_pca + description: The resulting PCA embedding. + required: true + + varm: + - type: double + name: pca_loadings + description: The PCA loadings matrix. + required: true + + uns: + - type: string + name: dataset_id + description: A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived. 
+ required: true + - name: dataset_name + type: string + description: A human-readable name for the dataset. + required: true + - type: string + name: dataset_url + description: Link to the original source of the dataset. + required: false + - name: dataset_reference + type: string + description: Bibtex reference of the paper in which the dataset was published. + required: false + multiple: true + - name: dataset_summary + type: string + description: Short description of the dataset. + required: true + - name: dataset_description + type: string + description: Long description of the dataset. + required: true + - name: dataset_organism + type: string + description: The organism of the sample in the dataset. + required: false + multiple: true diff --git a/src/api/file_spatial_dataset.yaml b/src/api/file_spatial_dataset.yaml index 5668a3f..4c5253e 100644 --- a/src/api/file_spatial_dataset.yaml +++ b/src/api/file_spatial_dataset.yaml @@ -1,9 +1,153 @@ type: file -example: "resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr" +example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr" # TODO: revert to the original example once file exists # example: "resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr" label: "Raw iST Dataset" summary: A spatial transcriptomics dataset, preprocessed for this benchmark. description: | This dataset contains preprocessed images, labels, points, shapes, and tables for spatial transcriptomics data. 
-__merge__: file_common_ist.yaml +info: + format: + type: spatialdata_zarr + images: + - type: object + name: morphology_mip + description: The raw image data + required: true + labels: + - type: object + name: "cell_labels" + description: Cell segmentation labels + required: false + - type: object + name: "nucleus_labels" + description: Nucleus segmentation labels + required: false + points: + - type: dataframe + name: transcripts + description: Point cloud data of transcripts + required: true + columns: + - type: float + name: "x" + required: true + description: x-coordinate of the point + - type: float + name: "y" + required: true + description: y-coordinate of the point + - type: float + name: "z" + required: false + description: z-coordinate of the point + - type: categorical + name: feature_name + required: true + description: Name of the feature + - type: integer + name: "cell_id" + required: false + description: Unique identifier of the cell + - type: integer + name: "nucleus_id" + required: false + description: Unique identifier of the nucleus + - type: string + name: "cell_type" + required: false + description: Cell type of the cell + - type: float + name: qv + required: false + description: Quality value of the point + - type: long + name: transcript_id + required: true + description: Unique identifier of the transcript + - type: boolean + name: overlaps_nucleus + required: false + description: Whether the point overlaps with a nucleus + shapes: + - type: dataframe + name: "cell_boundaries" + description: Cell boundaries + required: false + columns: + - type: object + name: "geometry" + required: true + description: Geometry of the cell boundary + - type: dataframe + name: "nucleus_boundaries" + description: Nucleus boundaries + required: false + columns: + - type: object + name: "geometry" + required: true + description: Geometry of the nucleus boundary + tables: + - type: anndata + name: "table" + description: Metadata of spatial dataset + required: true 
+ uns: + - type: string + name: dataset_id + required: true + description: A unique identifier for the dataset + - type: string + name: dataset_name + required: true + description: A human-readable name for the dataset + - type: string + name: dataset_url + required: true + description: Link to the original source of the dataset + - type: string + name: dataset_reference + required: true + description: Bibtex reference of the paper in which the dataset was published + - type: string + name: dataset_summary + required: true + description: Short description of the dataset + - type: string + name: dataset_description + required: true + description: Long description of the dataset + - type: string + name: dataset_organism + required: true + description: The organism of the sample in the dataset + - type: string + name: segmentation_id + required: true + multiple: true + description: A unique identifier for the segmentation + obs: + - type: string + name: cell_id + required: true + description: A unique identifier for the cell + var: + - type: string + name: gene_ids + required: true + description: Unique identifier for the gene + - type: string + name: feature_types + required: true + description: Type of the feature + obsm: + - type: double + name: spatial + required: true + description: Spatial coordinates of the cell + coordinate_systems: + - type: object + name: global + description: Coordinate system of the replicate + required: true diff --git a/src/base/setup_spatialdata_partial.yaml b/src/base/setup_spatialdata_partial.yaml index d2b72a2..a552792 100644 --- a/src/base/setup_spatialdata_partial.yaml +++ b/src/base/setup_spatialdata_partial.yaml @@ -1,3 +1,5 @@ setup: - type: python - pypi: ["spatialdata", "anndata>=0.12.0", "zarr>=3.0.0"] + # spatialdata>=0.7.3a1 is required as a workaround for a bug in spatialdata<=0.7.2 + # See: https://github.com/scverse/spatialdata/issues/1090 + pypi: ["spatialdata>=0.7.3a1", "anndata>=0.12.0", "zarr>=3.0.0"] diff --git 
a/src/data_processors/process_dataset/config.vsh.yaml b/src/data_processors/process_dataset/config.vsh.yaml index 0047ae1..0ea6508 100644 --- a/src/data_processors/process_dataset/config.vsh.yaml +++ b/src/data_processors/process_dataset/config.vsh.yaml @@ -1,31 +1,37 @@ __merge__: ../../api/comp_data_processor.yaml + name: process_dataset -arguments: - - name: "--method" - type: "string" - description: "The process method to assign train/test." - choices: ["batch", "random"] - default: "batch" - - name: "--obs_label" - type: "string" - description: "Which .obs slot to use as label." - default: "cell_type" - - name: "--obs_batch" - type: "string" - description: "Which .obs slot to use as batch covariate." - default: "batch" - - name: "--seed" - type: "integer" - description: "A seed for the subsampling." - example: 123 + +argument_groups: + - name: "Processing parameters" + arguments: + - name: "--seed" + type: "integer" + description: "A seed for the subsampling." + example: 123 + - name: "--span" + type: double + description: The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'. + default: 0.3 + - name: "--n_top_genes" + type: integer + description: Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3'. 
+ default: 3000 + resources: - type: python_script path: script.py - - path: /common/helper_functions/subset_h5ad_by_format.py engines: - type: docker + #image: openproblems/base_pytorch_nvidia:1 # TODO: ideally get gpu image to work image: openproblems/base_python:1 + setup: + - type: python + packages: [scikit-learn, scikit-misc] + __merge__: + - /src/base/setup_spatialdata_partial.yaml + - type: native runners: - type: executable diff --git a/src/data_processors/process_dataset/script.py b/src/data_processors/process_dataset/script.py index 7cca2bd..b04cb33 100644 --- a/src/data_processors/process_dataset/script.py +++ b/src/data_processors/process_dataset/script.py @@ -1,27 +1,58 @@ -import sys import random -import numpy as np import anndata as ad -import openproblems as op +import spatialdata as sd +import scanpy as sc ## VIASH START par = { - 'input_sp': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr', - 'input_sc': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_scrnaseq.h5ad', - 'output_spatial_dataset': 'output_spatial_dataset.zarr', - 'output_scrnaseq_reference': 'output_scrnaseq_reference.h5ad', -} -meta = { - 'resources_dir': 'target/executable/data_processors/process_dataset', - 'config': 'target/executable/data_processors/process_dataset/.config.vsh.yaml' + 'input_sp': 'resources_test/common/2023_10x_mouse_brain_xenium_rep1/dataset.zarr', + 'input_sc': 'resources_test/common/2023_yao_mouse_brain_scrnaseq_10xv2/dataset.h5ad', + 'output_spatial_dataset': 'resources_test/task_spatial_segmentation/mouse_brain_combined/output_spatial_dataset.zarr', + 'output_scrnaseq_reference': 'resources_test/task_spatial_segmentation/mouse_brain_combined/output_scrnaseq_reference.h5ad', + 'span': 0.3, + 'seed': 123, + 'n_top_genes': 3000, + 'dataset_id': 'mouse_brain_combined', + 'dataset_name': 'Mouse brain combined dataset', + 'dataset_url': '', + 'dataset_summary': '', + 'dataset_description': '', + 
'dataset_reference': [], + 'dataset_organism': 'Mus musculus', } ## VIASH END -# import helper functions -sys.path.append(meta['resources_dir']) -from subset_h5ad_by_format import subset_h5ad_by_format +def sc_processing(adata): + if "counts" not in adata.layers and adata.X is not None: + print(">> Save raw counts in .layer", flush=True) + adata.layers["counts"] = adata.X.copy() + + if "normalized" not in adata.layers: + print(">> Perform standard normalization", flush=True) + adata.layers["normalized"] = adata.layers["counts"].copy() + sc.pp.normalize_total(adata, layer="normalized", inplace=True) + + if "normalized_log" not in adata.layers: + print(">> Perform log1p normalization", flush=True) + adata.layers["normalized_log"] = adata.layers["normalized"].copy() + sc.pp.log1p(adata, layer="normalized_log") + + if "normalized_log_scaled" not in adata.layers: + print(">> Perform 0 mean and standard variance normalization", flush=True) + adata.layers["normalized_log_scaled"] = adata.layers["normalized_log"].copy() + sc.pp.scale(adata, layer="normalized_log_scaled") + + if "hvg" not in adata.var: + print(">> Compute highly variable genes", flush=True) + sc.pp.highly_variable_genes( + adata, + flavor="seurat_v3", + layer="counts", + span=par['span'], + n_top_genes=par['n_top_genes'] + ) + adata.var.rename(columns={"highly_variable": "hvg"}, inplace=True) -config = op.project.read_viash_config(meta["config"]) # set seed if need be if par["seed"]: @@ -29,54 +60,46 @@ random.seed(par["seed"]) print(">> Load data", flush=True) -adata = ad.read_h5ad(par["input"]) -print("input:", adata) - -print(f">> Process data using {par['method']} method") -if par["method"] == "batch": - batch_info = adata.obs[par["obs_batch"]] - batch_categories = batch_info.dtype.categories - test_batches = random.sample(list(batch_categories), 1) - is_test = [ x in test_batches for x in batch_info ] -elif par["method"] == "random": - train_ix = 
np.random.choice(adata.n_obs, round(adata.n_obs * 0.8), replace=False) - is_test = [ not x in train_ix for x in range(0, adata.n_obs) ] - -# subset the different adatas -print(">> Figuring which data needs to be copied to which output file", flush=True) -# use par arguments to look for label and batch value in different slots -slot_mapping = { - "obs": { - "label": par["obs_label"], - "batch": par["obs_batch"], - } -} +sc_data = ad.read_h5ad(par["input_sc"]) +print(f"single cell data: {sc_data}") + +print(">> Processing sc_data", flush=True) +sc_processing(sc_data) -print(">> Creating train data", flush=True) -output_train = subset_h5ad_by_format( - adata[[not x for x in is_test]], - config, - "output_train", - slot_mapping -) - -print(">> Creating test data", flush=True) -output_test = subset_h5ad_by_format( - adata[is_test], - config, - "output_test", - slot_mapping -) - -print(">> Creating solution data", flush=True) -output_solution = subset_h5ad_by_format( - adata[is_test], - config, - "output_solution", - slot_mapping -) +print(">> Override dataset metadata in .uns", flush=True) +sc_data.uns["orig_dataset_id"] = sc_data.uns.get("dataset_id", None) +for key in ["dataset_id", "dataset_name", "dataset_url", "dataset_summary", "dataset_description", "dataset_reference", "dataset_organism"]: + sc_data.uns[key] = par[key] print(">> Writing data", flush=True) -output_train.write_h5ad(par["output_train"]) -output_test.write_h5ad(par["output_test"]) -output_solution.write_h5ad(par["output_solution"]) +sc_data.write_h5ad(par["output_scrnaseq_reference"], compression="gzip") + +# read input_sp +print(">> Read spatial data", flush=True) +sp_data = sd.read_zarr(par["input_sp"]) +print(f"spatial data: {sp_data}") + +print(">> Processing spatial data", flush=True) +sp_data_table = sp_data.tables['table'] +print(f"single cell part of spatial data: {sp_data_table}") +sc_processing(sp_data_table) + +if "cell_area" not in sp_data_table.obs: + print(">> Perform scanpy qc for 
cell area", flush=True) + sc.pp.calculate_qc_metrics(sp_data_table, layer="counts", inplace=True) + +for x in ["transcript_counts", "n_genes_by_counts"]: + if f"ca_normalized_{x}" not in sp_data_table.obs and x in sp_data_table.obs: + print(f">> Perform cell area normalization for {x}", flush=True) + sp_data_table.obs[f'ca_normalized_{x}'] = sp_data_table.obs[f"{x}"] / sp_data_table.obs["cell_area"] + +print(">> Override dataset metadata in .uns", flush=True) +sp_data_table.uns["orig_dataset_id"] = sp_data_table.uns.get("dataset_id", None) +for key in ["dataset_id", "dataset_name", "dataset_url", "dataset_summary", "dataset_description", "dataset_reference", "dataset_organism"]: + sp_data_table.uns[key] = par[key] + +print(f"spatial data: {sp_data}") +print(f"spatial data tables['table']: {sp_data.tables['table']}") + +print(">> Writing spatial data", flush=True) +sp_data.write(par["output_spatial_dataset"], overwrite=True) diff --git a/src/methods/cellpose/config.vsh.yaml b/src/methods/cellpose/config.vsh.yaml index 46be884..47c6cec 100644 --- a/src/methods/cellpose/config.vsh.yaml +++ b/src/methods/cellpose/config.vsh.yaml @@ -1,11 +1,11 @@ name: cellpose label: "Cellpose" # TODO: update the summary, description and links -summary: "Output of the segmantation methot cellpose" -description: "Output of the segmantation methot cellpose" +summary: "Cellpose-SAM: cell and nucleus segmentation with superhuman generalization." +description: "cellpose is an anatomical segmentation algorithm written in Python 3." 
links: # these should point to the documentation of the method - documentation: "https://github.com/openproblems-bio/task_ist_preprocessing" - repository: "https://github.com/openproblems-bio/task_ist_preprocessing" + documentation: "https://cellpose.readthedocs.io/en/latest/" + repository: "https://github.com/mouseland/cellpose" references: doi: "10.1038/s41592-020-01018-x" diff --git a/src/methods/cellpose/script.py b/src/methods/cellpose/script.py index 4949b8d..6ebae72 100644 --- a/src/methods/cellpose/script.py +++ b/src/methods/cellpose/script.py @@ -1,6 +1,7 @@ -import dask.array as da +import anndata as ad import numpy as np import os +import pandas as pd import shutil import spatialdata as sd import xarray as xr @@ -9,14 +10,28 @@ ## VIASH START par = { - 'input': 'resources_test/task_spatial_segmentation/mouse_brain_combined/common_ist.zarr', - 'output': 'resources_test/task_spatial_segmentation/mouse_brain_combined/prediction.h5ad' + 'input': 'resources_test/task_spatial_segmentation/mouse_brain_combined/spatial_dataset.zarr', + 'output': 'prediction.zarr' } meta = { 'name': 'cellpose' } ## VIASH END +# TODO: move to helper file +def convert_to_lower_dtype(arr): + max_val = arr.max() + if max_val <= np.iinfo(np.uint8).max: + new_dtype = np.uint8 + elif max_val <= np.iinfo(np.uint16).max: + new_dtype = np.uint16 + elif max_val <= np.iinfo(np.uint32).max: + new_dtype = np.uint32 + else: + new_dtype = np.uint64 + + return arr.astype(new_dtype) + print('Reading input', flush=True) sdata = sd.read_zarr(par["input"]) image = sdata['morphology_mip']['scale0'].image.compute().to_numpy() @@ -30,21 +45,27 @@ masks, _, _ = model.eval(image[0], progress=True, **eval_params) print('Cellpose segmentation finished, post-processing results', flush=True) -# Convert to smallest sufficient unsigned int dtype -max_val = masks.max() -for dtype in (np.uint8, np.uint16, np.uint32, np.uint64): - if max_val <= np.iinfo(dtype).max: - masks = masks.astype(dtype) - break +masks = 
convert_to_lower_dtype(masks) print('Segmentation done, preparing output', flush=True) sd_output = sd.SpatialData() -# Wrap masks as a single-chunk dask array with flat chunk shape for zarr v3 compat -dask_masks = da.from_array(masks, chunks=masks.shape) -data_array = xr.DataArray(dask_masks, name='segmentation', dims=('y', 'x')) +data_array = xr.DataArray(masks, name='segmentation', dims=('y', 'x')) parsed = Labels2DModel.parse(data_array, transformations=transformation) sd_output.labels['segmentation'] = parsed +cell_ids = np.unique(masks)[1:] # exclude background (0) +table = ad.AnnData( + obs=pd.DataFrame( + {'cell_id': cell_ids.astype(str), 'region': 'segmentation'}, + index=cell_ids.astype(str), + ), + uns={ + 'dataset_id': sdata.tables['table'].uns['dataset_id'], + 'method_id': meta['name'] + } +) +sd_output.tables['table'] = table + print('Saving output', flush=True) if os.path.exists(par["output"]): shutil.rmtree(par["output"])