examples/README.md: 50 additions & 7 deletions
@@ -1,7 +1,5 @@
1
1
## GenomicsDB simple query tool
2
2
3
-
Note that there is `run.sh` bash script for ease of use and if you do not want to invoke the genomicsdb_query CLI directly.
4
-
5
3
`genomicsdb_query` is a simple GenomicsDB query tool that takes a workspace and genomic intervals of the form `<CONTIG>:<START>-<END>`. At a minimum, an interval must specify the contig; start and end are optional, e.g. `chr1:100-1000`, `chr1:100` and `chr1` are all valid. Start defaults to 1 and end defaults to the length of the contig when not specified.
6
4
7
5
Assumption: the workspace must already exist and must have been created with the `vcf2genomicsdb` tool or with `gatk GenomicsDBImport`.
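
The interval defaults described above can be sketched in a few lines of Python. This is only an illustration and not the parser used by `genomicsdb_query`: the names `Interval` and `parse_interval` and the `contig_lengths` mapping are hypothetical, and the sketch assumes contig names contain no `:` or whitespace.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper for illustration; not part of genomicsdb_query.
# Assumes contig names contain neither ':' nor whitespace.
_INTERVAL_RE = re.compile(
    r"^(?P<contig>[^:\s]+)(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?$"
)


class Interval(NamedTuple):
    contig: str
    start: int
    end: Optional[int]  # None means "up to the end of the contig"


def parse_interval(text: str, contig_lengths: Optional[dict] = None) -> Interval:
    """Parse <CONTIG>[:<START>[-<END>]], applying the documented defaults."""
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a <CONTIG>:<START>-<END> interval: {text!r}")
    contig = match["contig"]
    # Start defaults to 1 when not specified.
    start = int(match["start"]) if match["start"] else 1
    if match["end"]:
        end = int(match["end"])
    elif contig_lengths and contig in contig_lengths:
        # End defaults to the contig length when it is known to the caller.
        end = contig_lengths[contig]
    else:
        end = None
    if end is not None and end < start:
        raise ValueError(f"end precedes start in interval: {text!r}")
    return Interval(contig, start, end)


print(parse_interval("chr1:100-1000"))
print(parse_interval("chr1:100"))
print(parse_interval("chr1", {"chr1": 5000}))
```

Returning `None` for an unknown end keeps the sketch independent of any workspace; a real tool would resolve the contig length from the vid mapping.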
@@ -10,7 +8,7 @@ Assumption : The workspace should have been created with the `vcf2genomicsdb` to
GenomicsDB simple query with samples/intervals/filter as inputs
11
+
GenomicsDB simple query with samples/intervals/attributes/filter as inputs
14
12
15
13
options:
16
14
-h, --help show this help message and exit
@@ -25,8 +23,9 @@ options:
25
23
-l LOADER, --loader LOADER
26
24
Optional - URL to loader file. Defaults to loader.json in workspace
27
25
--list-samples List samples ingested into the workspace and exit
28
-
--list-contigs List contigs for the ingested samples in the workspace and exit
29
-
--list-partitions List interval partitions for the ingested samples in the workspace and exit
26
+
--list-contigs List contigs configured in vid mapping for the workspace and exit
27
+
--list-fields List genomic fields configured in vid mapping for the workspace and exit
28
+
--list-partitions List interval partitions(genomicsdb arrays in the workspace) for the given intervals(-i/--interval or -I/--interval-list) or all the intervals for the workspace and exit
30
29
-i INTERVAL, --interval INTERVAL
31
30
genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
32
31
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
@@ -49,16 +48,25 @@ options:
49
48
Note:
50
49
1. -s/--sample and -S/--sample-list are mutually exclusive
51
50
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
51
+
-a ATTRIBUTES, --attributes ATTRIBUTES
52
+
Optional - comma separated list of genomic attributes or fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to GT
52
53
-f FILTER, --filter FILTER
53
54
Optional - genomic filter expression for the query, e.g. 'ISHOMREF' or 'ISHET' or 'REF == "G" && resolve(GT, REF, ALT) &= "T/T" && ALT |= "T"'
55
+
-n NPROC, --nproc NPROC
56
+
Optional - number of processing units for multiprocessing(default: 8). Run nproc from command line to print the number of processing units available to a process for the user
57
+
--chunk-size CHUNK_SIZE
58
+
Optional - hint to split number of samples for multiprocessing used in conjunction with -n/--nproc and when -s/-S/--sample/--sample-list is not specified (default: 10240)
Optional - used in conjunction with -t/--output-type arrow as hint for buffering parquet files(default: 64MB)
60
65
-o OUTPUT, --output OUTPUT
61
-
a prefix filename to csv outputs from the tool. The filenames will be suffixed with the interval and .csv/.json (default: query_output)
66
+
a prefix filename to outputs from the tool. The filenames will be suffixed with the interval and .csv/.json/... (default: query_output)
67
+
-d, --dryrun displays the query that will be run without actually executing the query (default: False)
68
+
-b, --bypass-intersecting-intervals-phase
69
+
iterate only once bypassing the intersecting intervals phase (default: False)
62
70
```
63
71
64
72
Run `genomicsdb_query` with `-w` and `--list-samples`/`--list-contigs` to find the valid samples and contigs over which a query can operate. These can then be passed to the `-s/--sample` and `-i/--interval` options to run the actual query.
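
The two-step workflow above (discover, then query) could be scripted roughly as follows. This is a sketch that only builds the command lines; the helper names `list_command` and `query_command` and the sample name `HG00096` are hypothetical, and it assumes `-s` may be repeated in the same way the help text documents for `-i`.

```python
import shlex


def list_command(workspace, what):
    """Build the discovery command; `what` is 'samples' or 'contigs'."""
    if what not in ("samples", "contigs"):
        raise ValueError(f"unsupported listing: {what!r}")
    return ["genomicsdb_query", "-w", workspace, f"--list-{what}"]


def query_command(workspace, samples=(), intervals=()):
    """Build the query command from the chosen samples and intervals."""
    # The help text requires samples and/or intervals to be specified.
    if not samples and not intervals:
        raise ValueError("specify at least one sample or interval")
    cmd = ["genomicsdb_query", "-w", workspace]
    for interval in intervals:
        cmd += ["-i", interval]
    # Assumption: -s can be given once per sample, like -i.
    for sample in samples:
        cmd += ["-s", sample]
    return cmd


print(shlex.join(list_command("my_workspace", "samples")))
print(shlex.join(query_command("my_workspace", ["HG00096"], ["chr1:1-10000"])))
```

Each list can be handed to `subprocess.run(cmd, check=True)` to execute it.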
Locally caching artifacts from cloud URLs is optional for GenomicsDB metadata, and it improves performance for metadata/artifacts that are accessed multiple times. A separate caching tool, `genomicsdb_cache`, takes the workspace as input, optionally the callset/vidmap/loader.json files, and optionally the intervals via the -i/--interval or -I/--interval-list option. Caching is intended to be done once, before the first query for those intervals. To access the locally cached GenomicsDB metadata, set the env variable `TILEDB_CACHE` to `1` and explicitly pass `-c callset.json -v vidmap.json -l loader.json` to the `genomicsdb_query` command.
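
As a rough sketch of the query side of this workflow, the snippet below prepares an environment with `TILEDB_CACHE=1` and a `genomicsdb_query` command that points at the cached JSON files. The helper name `cached_query_invocation` and the `cache_dir` parameter are hypothetical; the sketch assumes the cached files are named `callset.json`, `vidmap.json` and `loader.json` as in the text above and sit in the current directory unless `cache_dir` says otherwise.

```python
import os


def cached_query_invocation(workspace, intervals, cache_dir=""):
    """Return (cmd, env) for querying with locally cached metadata."""
    # Copy the current environment and switch on the cache.
    env = dict(os.environ, TILEDB_CACHE="1")
    cmd = [
        "genomicsdb_query",
        "-w", workspace,
        # Pass the cached artifacts explicitly, as the text recommends.
        "-c", os.path.join(cache_dir, "callset.json"),
        "-v", os.path.join(cache_dir, "vidmap.json"),
        "-l", os.path.join(cache_dir, "loader.json"),
    ]
    for interval in intervals:
        cmd += ["-i", interval]
    return cmd, env


cmd, env = cached_query_invocation("s3://my_bucket/my_workspace", ["chr1:1-10000"])
print(" ".join(cmd))
```

The pair can be passed to `subprocess.run(cmd, env=env, check=True)` once `genomicsdb_cache` has been run for the same workspace.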
Cache GenomicsDB metadata and generated callset/vidmap/loader json artifacts for workspace cloud URLs
117
+
118
+
options:
119
+
-h, --help show this help message and exit
120
+
--version print GenomicsDB native library version and exit
121
+
-w WORKSPACE, --workspace WORKSPACE
122
+
URL to GenomicsDB workspace
123
+
e.g. -w my_workspace or -w az://my_container/my_workspace or -w s3://my_bucket/my_workspace or -w gs://my_bucket/my_workspace
124
+
-v VIDMAP, --vidmap VIDMAP
125
+
Optional - URL to vid mapping file. Defaults to vidmap.json in workspace
126
+
-c CALLSET, --callset CALLSET
127
+
Optional - URL to callset mapping file. Defaults to callset.json in workspace
128
+
-l LOADER, --loader LOADER
129
+
Optional - URL to loader file. Defaults to loader.json in workspace
130
+
-i INTERVAL, --interval INTERVAL
131
+
Optional - genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
132
+
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
133
+
Note:
134
+
1. -i/--interval and -I/--interval-list are mutually exclusive
135
+
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
136
+
-I INTERVAL_LIST, --interval-list INTERVAL_LIST
137
+
Optional - genomic intervals listed in a file over which to operate.
138
+
The intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line.
139
+
Note:
140
+
1. -i/--interval and -I/--interval-list are mutually exclusive
141
+
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
142
+
```
143
+
100
144
101
-
For ease of use, open run.sh and change the `WORKSPACE`, `INTERVALS` and other commented out variables to what is desired before invoking it. Variables `VIDMAP_FILE` and `LOADER_FILE` need to be set only if they are not in the workspace. run.sh calls genomicsdb_query, the tool does the querying of the workspace for the intervals specified and outputs one csv file per input interval.
description="Cache GenomicsDB generated json artifacts for workspace cloud URLs",
52
+
description="Cache GenomicsDB metadata and generated callset/vidmap/loader json artifacts for workspace cloud URLs",
53
53
formatter_class=argparse.RawTextHelpFormatter,
54
54
usage="%(prog)s [options]",
55
55
)
@@ -86,13 +86,13 @@ def main():
86
86
"--interval",
87
87
action="append",
88
88
required=False,
89
-
help="genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.\nThis argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
89
+
help="Optional - genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.\nThis argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
90
90
)
91
91
parser.add_argument(
92
92
"-I",
93
93
"--interval-list",
94
94
required=False,
95
-
help="genomic intervals listed in a file over which to operate.\nThe intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
95
+
help="Optional - genomic intervals listed in a file over which to operate.\nThe intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa