Commit 37fe8b5

author
Nalini Ganapati
committed
Cleanup README in examples
1 parent a4975c7 commit 37fe8b5

3 files changed

Lines changed: 53 additions & 136 deletions

examples/README.md

Lines changed: 50 additions & 7 deletions
@@ -1,7 +1,5 @@
11
## GenomicsDB simple query tool
22

3-
Note that there is `run.sh` bash script for ease of use and if you do not want to invoke the genomicsdb_query CLI directly.
4-
53
Simple GenomicsDB query tool `genomicsdb_query`, given a workspace and genomic intervals of the form `<CONTIG>:<START>-<END>`. The intervals at a minimum need to have the contig specified, start and end are optional. e.g chr1:100-1000, chr1:100 and chr1 are all valid. Start defaults to 1 if not specified and end defaults to the length of the contig if not specified.
64

75
Assumption : The workspace should have been created with the `vcf2genomicsdb` tool or with `gatk GenomicsDBImport` and should exist.
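
The interval rules stated in the README text above (contig required, start defaulting to 1, end defaulting to the contig length) can be illustrated with a small parser. This is only a sketch: `parse_interval` and the `contig_lengths` mapping are hypothetical names, not part of `genomicsdb_query`, and the sketch assumes contig names contain no `:` character.

```python
from typing import Mapping, Tuple


def parse_interval(interval: str, contig_lengths: Mapping[str, int]) -> Tuple[str, int, int]:
    """Parse '<CONTIG>[:<START>[-<END>]]' into (contig, start, end).

    Hypothetical helper for illustration; not the tool's own parser.
    """
    # Everything before the first ':' is the contig (assumption: no ':' in contig names).
    contig, _, span = interval.partition(":")
    if contig not in contig_lengths:
        raise ValueError(f"unknown contig: {contig!r}")

    # START and END are both optional; an empty string means "not specified".
    start_text, _, end_text = span.partition("-")
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else contig_lengths[contig]

    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {interval!r}")
    return contig, start, end
```

With a contig length of 5000 for `chr1`, the three forms mentioned in the README (`chr1:100-1000`, `chr1:100` and `chr1`) all parse, with the missing bounds filled in from the defaults.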
@@ -10,7 +8,7 @@ Assumption : The workspace should have been created with the `vcf2genomicsdb` to
108
~/GenomicsDB-Python/examples: ./genomicsdb_query --help
119
usage: query [options]
1210
13-
GenomicsDB simple query with samples/intervals/filter as inputs
11+
GenomicsDB simple query with samples/intervals/attributes/filter as inputs
1412
1513
options:
1614
-h, --help show this help message and exit
@@ -25,8 +23,9 @@ options:
2523
-l LOADER, --loader LOADER
2624
Optional - URL to loader file. Defaults to loader.json in workspace
2725
--list-samples List samples ingested into the workspace and exit
28-
--list-contigs List contigs for the ingested samples in the workspace and exit
29-
--list-partitions List interval partitions for the ingested samples in the workspace and exit
26+
--list-contigs List contigs configured in vid mapping for the workspace and exit
27+
--list-fields List genomic fields configured in vid mapping for the workspace and exit
28+
--list-partitions List interval partitions(genomicsdb arrays in the workspace) for the given intervals(-i/--interval or -I/--interval-list) or all the intervals for the workspace and exit
3029
-i INTERVAL, --interval INTERVAL
3130
genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
3231
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
@@ -49,16 +48,25 @@ options:
4948
Note:
5049
1. -s/--sample and -S/--sample-list are mutually exclusive
5150
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
51+
-a ATTRIBUTES, --attributes ATTRIBUTES
52+
Optional - comma separated list of genomic attributes or fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to GT
5253
-f FILTER, --filter FILTER
5354
Optional - genomic filter expression for the query, e.g. 'ISHOMREF' or 'ISHET' or 'REF == "G" && resolve(GT, REF, ALT) &= "T/T" && ALT |= "T"'
55+
-n NPROC, --nproc NPROC
56+
Optional - number of processing units for multiprocessing(default: 8). Run nproc from command line to print the number of processing units available to a process for the user
57+
--chunk-size CHUNK_SIZE
58+
Optional - hint to split number of samples for multiprocessing used in conjunction with -n/--nproc and when -s/-S/--sample/--sample-list is not specified (default: 10240)
5459
-t {csv,json,arrow}, --output-type {csv,json,arrow}
5560
Optional - specify type of output for the query (default: csv)
5661
-j {all,all-by-calls,samples-with-num-calls,samples,num-calls}, --json-output-type {all,all-by-calls,samples-with-num-calls,samples,num-calls}
5762
Optional - used in conjunction with -t/--output-type json (default: samples-with-num-calls)
5863
-z MAX_ARROW_BYTE_SIZE, --max-arrow-byte-size MAX_ARROW_BYTE_SIZE
5964
Optional - used in conjunction with -t/--output-type arrow as hint for buffering parquet files(default: 64MB)
6065
-o OUTPUT, --output OUTPUT
61-
a prefix filename to csv outputs from the tool. The filenames will be suffixed with the interval and .csv/.json (default: query_output)
66+
a prefix filename to outputs from the tool. The filenames will be suffixed with the interval and .csv/.json/... (default: query_output)
67+
-d, --dryrun displays the query that will be run without actually executing the query (default: False)
68+
-b, --bypass-intersecting-intervals-phase
69+
iterate only once bypassing the intersecting intervals phase (default: False)
6270
```
6371

6472
Run `genomicsdb_query` with the -w and --list-samples/--list-contigs to figure out legitimate samples and contigs over which the query can operate. These can be used with the --samples/--intervals options later to run the actual query.
@@ -97,5 +105,40 @@ query_output_1-100-100000.csv query_output_1-100001.csv query_output_2.csv
97105
98106
```
99107

108+
### Caching for enhanced performance
109+
110+
Locally caching artifacts from cloud URLs is optional for GenomicsDB metadata and helps with performance for metadata/artifacts which can be accessed multiple times. There is a separate caching tool `genomicsdb_cache` which takes as inputs the workspace, optionally callset/vidmap/loader.json and also optionally the intervals or intervals with the -i/--interval/-I/--interval-list option. This is envisioned to be done once before the first start of the queries for the interval. Set the env variable `TILEDB_CACHE` to `1` and explicitly use `-c callset.json -v vidmap.json -l loader.json` with the `genomicsdb_query` command to access locally cached GenomicsDB metadata.
111+
112+
```
113+
~/GenomicsDB-Python/examples: ./genomicsdb_cache -h
114+
usage: cache [options]
115+
116+
Cache GenomicsDB metadata and generated callset/vidmap/loader json artifacts for workspace cloud URLs
117+
118+
options:
119+
-h, --help show this help message and exit
120+
--version print GenomicsDB native library version and exit
121+
-w WORKSPACE, --workspace WORKSPACE
122+
URL to GenomicsDB workspace
123+
e.g. -w my_workspace or -w az://my_container/my_workspace or -w s3://my_bucket/my_workspace or -w gs://my_bucket/my_workspace
124+
-v VIDMAP, --vidmap VIDMAP
125+
Optional - URL to vid mapping file. Defaults to vidmap.json in workspace
126+
-c CALLSET, --callset CALLSET
127+
Optional - URL to callset mapping file. Defaults to callset.json in workspace
128+
-l LOADER, --loader LOADER
129+
Optional - URL to loader file. Defaults to loader.json in workspace
130+
-i INTERVAL, --interval INTERVAL
131+
Optional - genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
132+
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
133+
Note:
134+
1. -i/--interval and -I/--interval-list are mutually exclusive
135+
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
136+
-I INTERVAL_LIST, --interval-list INTERVAL_LIST
137+
Optional - genomic intervals listed in a file over which to operate.
138+
The intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line.
139+
Note:
140+
1. -i/--interval and -I/--interval-list are mutually exclusive
141+
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
142+
```
143+
100144

101-
For ease of use, open run.sh and change the `WORKSPACE`, `INTERVALS` and other commented out variables to what is desired before invoking it. Variables `VIDMAP_FILE` and `LOADER_FILE` need to be set only if they are not in the workspace. run.sh calls genomicsdb_query, the tool does the querying of the workspace for the intervals specified and outputs one csv file per input interval.
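
The caching workflow described in the new README section (set `TILEDB_CACHE` to `1` and pass `-c callset.json -v vidmap.json -l loader.json` explicitly) could be scripted roughly as follows. This is a sketch under assumptions: `cached_query_invocation` is a hypothetical helper, the JSON artifacts are assumed to sit in the current directory after a prior `genomicsdb_cache` run, and the function only builds the command and environment without executing anything.

```python
import os
from typing import Dict, List, Sequence, Tuple


def cached_query_invocation(workspace: str, intervals: Sequence[str]) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for a query against locally cached metadata.

    Hypothetical helper; the flags are the ones named in the README text.
    """
    # Copy the current environment and switch on the cache, as the README instructs.
    env = dict(os.environ, TILEDB_CACHE="1")

    # Point the query at the cached artifacts explicitly (assumed to be in the cwd).
    cmd = [
        "./genomicsdb_query",
        "-w", workspace,
        "-c", "callset.json",
        "-v", "vidmap.json",
        "-l", "loader.json",
    ]
    # -i may be repeated, once per interval.
    for interval in intervals:
        cmd += ["-i", interval]
    return cmd, env
```

The returned pair can be handed to `subprocess.run(cmd, env=env, check=True)` once `genomicsdb_cache` has been run for the same workspace and intervals.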

examples/genomicsdb_cache

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,7 @@ def get_arrays(interval, contigs_map, partitions):
4949
def main():
5050
parser = argparse.ArgumentParser(
5151
prog="cache",
52-
description="Cache GenomicsDB generated json artifacts for workspace cloud URLs",
52+
description="Cache GenomicsDB metadata and generated callset/vidmap/loader json artifacts for workspace cloud URLs",
5353
formatter_class=argparse.RawTextHelpFormatter,
5454
usage="%(prog)s [options]",
5555
)
@@ -86,13 +86,13 @@ def main():
8686
"--interval",
8787
action="append",
8888
required=False,
89-
help="genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.\nThis argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
89+
help="Optional - genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.\nThis argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
9090
)
9191
parser.add_argument(
9292
"-I",
9393
"--interval-list",
9494
required=False,
95-
help="genomic intervals listed in a file over which to operate.\nThe intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
95+
help="Optional - genomic intervals listed in a file over which to operate.\nThe intervals should be specified in the <CONTIG>:<START>-<END> format, with START and END optional one interval per line. \nNote: \n\t1. -i/--interval and -I/--interval-list are mutually exclusive \n\t2. either samples and/or intervals using -i/-I/-s/-S options has to be specified", # noqa
9696
)
9797

9898
args = parser.parse_args()
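
The help strings above state that `-i/--interval` and `-I/--interval-list` are mutually exclusive, but this diff does not show how the script enforces that. One possible approach, sketched here with the standard library's `add_mutually_exclusive_group`, lets `argparse` reject the combination itself. `build_parser` is a hypothetical name, and marking `-w` as required is an assumption made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser sketch; not the actual genomicsdb_cache parser."""
    parser = argparse.ArgumentParser(prog="cache", usage="%(prog)s [options]")
    # Assumption: the workspace is mandatory.
    parser.add_argument("-w", "--workspace", required=True, help="URL to GenomicsDB workspace")

    # argparse exits with a usage error if options from the same group are combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-i", "--interval", action="append", help="may be repeated")
    group.add_argument("-I", "--interval-list", help="file with one interval per line")
    return parser
```

Repeating `-i` is still allowed because the conflict check only applies between different options in the group; combining `-i` with `-I` makes `parse_args` exit with an error.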

examples/run.sh

Lines changed: 0 additions & 126 deletions
This file was deleted.
