Metadata for structural chemical probing experiments for RNAcentral.
This repository stores metadata YAML files for chemical probing datasets (for example in SHAPE/ and DMS/) that are validated with LinkML via GitHub Actions. Once a YAML file is accepted, the pipeline downloads FASTQ files using nf-core/fetchngs and creates a final samplesheet.csv that can be used as input for nf-core/rnastructurome.
To add a new dataset to this repository:
- Clone this repository to your local machine.
- Create a new branch from master with a descriptive name including “Add” (e.g. Add-new-shape-dataset).
- Create a new YAML file (see section below) in the appropriate directory (for example SHAPE/ or DMS/) and populate it according to the schema requirements.
- Open a pull request with that new YAML file.
- Wait for the GitHub Actions checks to validate the YAML.
- If the checks pass, someone from RNAcentral will review and merge the pull request.
- If the checks fail, inspect the GitHub Actions logs, fix the reported issue in the YAML, and update the pull request.
- Start from the template: use the example file (rnastruct00001.yaml) as a guide. Your YAML should follow the same structure. If your dataset includes multiple organisms, create one YAML file per organism (e.g. one for Homo sapiens, one for Mus musculus).
- Choose a dataset id that is the next consecutive number after the latest one in the repo (e.g. rnastruct00010). Check both DMS/ and SHAPE/ to find the latest id number.
- You must also include the organism's Latin name (e.g. Homo sapiens), the method (SHAPE or DMS variants) and principle (RT-stop or MaP) of the experiment, and a publication DOI, and fill out the raw_data section.
- Each sample listed under run_accessions should include a biologically meaningful and distinguishable sample_name, along with cell_line, condition (one of untreated, treated, or denatured), and replicate (just a number). The sample accession id must be supported by nf-core/fetchngs (e.g. SRA, ENA, DDBJ, GEO; see the fetchngs documentation for the full list).
- If including an OBI id, use a valid term from the Ontology for Biomedical Investigations (obi-ontology/obi). If the experimental context is provided, it must be one of in_vivo, in_vitro, or denatured.
- All other fields are optional and can be set to null if not available.
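The id rule above (the next consecutive number across both DMS/ and SHAPE/) can be sketched as a small helper. This is an illustration only, not part of the repository: the function name `next_dataset_id` is hypothetical, and it assumes each file is named after its dataset id with a five-digit, zero-padded number, as in rnastruct00001.yaml.

```python
import re
from pathlib import Path

# Assumption: file stems follow the rnastruct00001 pattern (five digits).
ID_PATTERN = re.compile(r"^rnastruct(\d{5})$")


def next_dataset_id(repo_root, subdirs=("SHAPE", "DMS")):
    """Return the id that follows the highest one found in the given subdirectories."""
    highest = 0
    for subdir in subdirs:
        directory = Path(repo_root) / subdir
        if not directory.is_dir():
            continue  # skip directories that do not exist in this checkout
        for path in directory.glob("*.yaml"):
            match = ID_PATTERN.match(path.stem)
            if match:  # ignore YAML files that do not follow the naming pattern
                highest = max(highest, int(match.group(1)))
    return f"rnastruct{highest + 1:05d}"
```

Calling `next_dataset_id(".")` from the repository root would return the id to use for a new file, under the naming assumption stated above.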
Install uv then run:
uv sync --dev
To run the tests:
uv run pytest
The validator (linkml-validate against schema/rnastruct.schema.yaml) makes sure the minimum required fields for running the pipeline end-to-end are present.
The required fields are:
- dataset_id, which must match the rnastruct00001 naming convention
- organism in Latin name format
- experiment.method, which must contain SHAPE or DMS
- experiment.principle, which must be RT-stop or MaP
- publication.doi
- raw_data.repository, which must be one of SRA, ENA, GEO, or DDBJ
- raw_data.accession
- raw_data.run_accessions, where each item must include accession, sample_name, cell_line, condition, and replicate
All other fields are optional and, if not known, can be null.
The optional field experiment.context, when provided, must use one or more of: in_vivo, in_vitro, or denatured.