This repository generates a database intended to facilitate search for GWAS records in the OpenGWAS database. This is achieved by first computing ontology mappings of the traits specified in the OpenGWAS metadata, and then combining the mappings with tabular representations of ontology relationships such that users can search over traits by leveraging the ontology class hierarchy.
| Resource | Version |
|---|---|
| SearchDB | 0.10.0 |
| EFO | 3.57.0 |
| UBERON | 2023-07-25 |
| Metadata | 2023-09-12T13:50:43 |
The database can be built by executing the module src/build_opengwas_db.py as specified below. There is an optional parameter to specify an NCBI API Key for faster querying of PubMed articles.
python3 build_opengwas_db.py <NCBI_API_Key>This generates the SQLite3 database opengwas_search.db that contains:
- The original OpenGWAS metadata table with all traits and associated OpenGWAS DB record identifiers.
- text2term-generated mappings of OpenGWAS traits to Experimental Factor Ontology (EFO) terms.
- Tables that contain details of EFO terms—their labels, identifiers, synonyms, associated anatomical locations, and mapping counts—and the asserted and inferred hierarchical (SubclassOf) relationships between EFO terms (extracted from a SemanticSQL EFO build).
opengwas_metadatacontains the original OpenGWAS metadata table.opengwas_referencescontains details obtained from PubMed about the articles in thepmidcolumn of the metadata table.opengwas_mappingscontains the text2term-generated mappings of the phenotypes in thetraitcolumn of the metadata table to EFO.efo_labelscontains the following details:- all terms in EFO, represented by their CURIEs (
Subjectcolumn). - term labels (
Objectcolumn). - term IRIs (
IRIcolumn). - disease locations associated with each term, if available (
DiseaseLocationcolumn). - count of how many metadata points are directly mapped to those ontology terms (
Directcolumn). - count of how many metadata points are indirectly mapped to those terms via a more specific term in the hierarchy (
Inheritedcolumn).
- all terms in EFO, represented by their CURIEs (
efo_edgesandefo_entailed_edgescontain, respectively, the asserted and entailed hierarchical (IS-A/SubClassOf) relationships between terms in EFO.efo_synonymscontains the potentially multiple synonyms (in theObjectcolumn) of each EFO term (given in theSubjectcolumn).
src/example_query.py contains a simple function to query the generated database for OpenGWAS records related to a user-given trait. Executing this script will perform example queries for three traits and print the results.
For example, when searching for OpenGWAS records about pancreas disease, our approach returns the results:
A similar, keyword-based search in OpenGWAS would not have found these.
Furthermore, users can search for records mapped to any more specific type of 'pancreas disease'— under the hood this is done by including subclasses of the ontology term for 'pancreas disease' in the search, obtaining the results:
The metadata are obtained directly from OpenGWAS using the ieugwaspy package—a Python interface to the OpenGWAS database API. The metadata preprocessing consists of removing all EQTL records—by discarding records whose id contains eqtl-a, which is the prefix for all such records.
The inputs to text2term are the metadata table and the EFO ontology, and the tool is configured to:
- include only mappings with a score above a threshold (
min_score=0.6in a [0,1] scale where 1=exact match). - compute only the highest scored mapping for each trait in the metadata (
max_mappings=1). - use the TFIDF mapper (
mapper=Mapper.TFIDF), which computes TFIDF-based vector representations of traits and then uses cosine distance to determine how close each trait is to each ontology term. - exclude terms that have been marked as deprecated (
excl_deprecated=True) such that we only map to terms that are current and expected to be in EFO's future releases.
EFO specifies relationships between terms that exist in external ontologies such as MONDO, ChEBI, etc. Since our goal is to map phenotypes to appropriate ontology terms, when they exist, we also configured text2term to:
- only map to terms from ontologies that describe phenotypes: EFO itself, the Monarch Disease Ontology (MONDO), the Human Phenotype Ontology (HPO), the Orphanet Rare Disease Ontology (ORDO), and the Human Disease Ontology (DOID). This is done using the parameter
base_iriswhich limits search to terms in the given namespace(s).
text2term configuration:
min_score=0.6, # minimum acceptable mapping score
max_mappings=1, # maximum number of mappings per input phenotype
mapper=Mapper.TFIDF, # use the (default) TF-IDF-based mapper to compare strings
excl_deprecated=True, # exclude deprecated ontology terms
base_iris=("http://www.ebi.ac.uk/efo/",
"http://purl.obolibrary.org/obo/MONDO",
"http://purl.obolibrary.org/obo/HP",
"http://www.orpha.net/ORDO",
"http://purl.obolibrary.org/obo/DOID")

