Commit bda694c

Merge pull request #22 from CMA-Lab/refactor
Tested locally and it seems to check out! Moving to main branch.
2 parents 99019ef + 145f142 commit bda694c

35 files changed

Lines changed: 1994 additions & 1606 deletions

CHANGELOG.md

Lines changed: 58 additions & 0 deletions
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Calendar Versioning](https://calver.org/) with the format `MAJOR.YY.0W[_MINOR][-Modifier]`. The major version increases when the database schema changes. Minor tags are added for multiple releases in the same week, starting from `2` (the `1` is implicit). Modifiers are added for pre-releases (e.g. `beta` or `alpha`).
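The version format above can be checked mechanically. A minimal sketch (the regex and the helper name are ours, not part of Daedalus):

```python
import re

# MAJOR.YY.0W[_MINOR][-Modifier], e.g. "0.23.15-beta" or "1.23.07_2"
CALVER_RE = re.compile(
    r"^(?P<major>\d+)\.(?P<yy>\d{2})\.(?P<week>\d{2})"
    r"(?:_(?P<minor>\d+))?(?:-(?P<modifier>[A-Za-z]+))?$"
)

def parse_calver(version: str) -> dict:
    """Split a CalVer string into its named parts, or raise ValueError."""
    match = CALVER_RE.match(version)
    if match is None:
        raise ValueError(f"Not a valid CalVer string: {version!r}")
    return match.groupdict()
```

For example, `parse_calver("0.23.15-beta")` yields major `0`, year `23`, week `15`, no minor tag, and the `beta` modifier.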

## [0.23.15-beta] - First release

This is the first release of the database. The DB features data from 7 different databases, all joined up for ease of consumption. We include:

- [ENSEMBL](https://www.ensembl.org/index.html) gene IDs and information, making the backbone of the database IDs;
- [HGNC](https://www.genenames.org/) for up-to-date, official gene names and gene grouping;
- [IUPHAR](https://www.guidetopharmacology.org/) for target (in our case, transporters) and ligand (i.e. drugs/internal compounds) interactions, as well as gene grouping, ion channel conductances, and more;
- [COSMIC](https://cancer.sanger.ac.uk/cosmic) for mutational information;
- [SLC tables](http://slc.bioparadigms.org/) for solute carrier information, such as their class and carried solute;
- [TCDB](https://www.tcdb.org/) for transporter classification information.

We apply manual patches to the data where expert information is lacking from the above databases.

The database is released as a `.sqlite` file at each release.

I highlight the latest changes:

### Changes

- [a03ab0b] **Major refactoring of Daedalus**
  - The previous chain of `IF-ELSE` statements used to run or skip some parsers (for debugging purposes) was terrible. Now, a new class handles running them properly.
  - The `parsers.py` file was getting too long for comfort. It was broken up into chunks and ported to multiple files in `./parsers/`.
  - A new `./constants` module holds all of the constants that were strewn about, with the exception of some constants that are very parser-specific.
  - A lot of things were removed from the module's `__init__.py`, since they did not belong there.
  - The argparser was finally actually finished.
  - If the COSMIC username/password combo is not specified, the COSMIC data will not be downloaded (at the user's risk).
  - New CLI parameters `run` and `skip` allow easier selective running of the different parsers, so that we don't accidentally commit breaking changes anymore (a.k.a. `SKIP_ALL = True`).
  - We use `importlib.resources` everywhere now, without having to use wobbly relative paths. This should make us ready to convert to a proper package.
  - The `tests/` folder is now outside of `./daedalus/`. It is probably completely broken now, but it was useless anyway.
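The new runner class itself is not shown in this diff. A minimal sketch of the idea — replacing scattered `IF-ELSE` guards with declarative `run`/`skip` filters — where all names are ours, not the project's:

```python
from typing import Callable, Iterable, Optional

class ParserRunner:
    """Hold named parser callables and run them, honoring run/skip filters."""

    def __init__(self) -> None:
        self._parsers: "dict[str, Callable[[], None]]" = {}

    def register(self, name: str, parser: Callable[[], None]) -> None:
        self._parsers[name] = parser

    def run(
        self,
        run_only: Optional[Iterable[str]] = None,
        skip: Optional[Iterable[str]] = None,
    ) -> "list[str]":
        """Execute the selected parsers; return the names actually run."""
        wanted = set(run_only) if run_only is not None else None
        to_skip = set(skip) if skip is not None else set()
        executed = []
        for name, parser in self._parsers.items():
            # Skip anything not explicitly requested, or explicitly excluded
            if wanted is not None and name not in wanted:
                continue
            if name in to_skip:
                continue
            parser()
            executed.append(name)
        return executed
```

The point of the design is that debugging subsets of parsers becomes a CLI argument instead of a code edit that risks being committed.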
- [8f29fbe] **Many Daedalus logic changes**
  - Changed BioMart's `XML`s to be more efficient. This should reduce download times a bit.
  - Allowed BioMart to download column names too, making manual column names useless. The names that BioMart gives us are simply standardized back to the same format we have used until now.
    - This means that all of the column names around were updated to the new naming.
  - Changed the format for the BioMart data from `CSV` to `TSV`. It seems that the CSV parser does not escape commas in the data. How fun! This makes the TSV option the only feasible one.
  - Moved the `entrez` entry to the top of the BioMart list, so that the retriever has to download little data before crashing (easier debugging!).
  - Moved the logic for saving a pickle of the data to the `ResourceCache` class, from the bad hack in `make_database`.
  - Made the downloads single-threaded instead of multithreaded. Why were they multithreaded in the first place? I have no idea.
  - The parsers that fail are now skipped gracefully, before dumping all failures at once and aborting. This should make large-scale failures easier to debug, since the parsers do not depend on each other to run (they only write to the database; they cannot read from it).
  - Wrote comments here and there.
  - Added delays after the warnings when using `--overwrite` and `--regen-cache`, so that one can `CTRL-C` when mistakes are made.
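The CSV problem is easy to reproduce: if a field contains a comma and the upstream data is not quoted, column boundaries are lost, while a tab separator stays unambiguous. A toy illustration (the record below is made up, not real BioMart output):

```python
# Two-column record: a gene ID and a free-text description containing a comma.
csv_line = "ENSG00000000003.16,tetraspanin 6, a membrane protein"
tsv_line = "ENSG00000000003.16\ttetraspanin 6, a membrane protein"

# Naive comma splitting breaks the two fields into three pieces...
assert len(csv_line.split(",")) == 3
# ...while splitting on tabs keeps the description (commas and all) intact.
assert len(tsv_line.split("\t")) == 2
```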
- [960da8f] **Added a project changelog**
  - We will follow CalVer `MAJOR.YY.0W[_MINOR][-Modifier]` from now on.

src/db_rebuilder/daedalus/__init__.py

Lines changed: 2 additions & 13 deletions
```diff
@@ -1,21 +1,10 @@
 import logging
 from logging import StreamHandler
-from pathlib import Path

 from colorama import Back, Fore, Style

-OUT_ANCHOR: Path = Path("/app/out")
-
-__all__ = ["OUTANCHOR"]
-__version__ = "0.1.0"
-
-DB_PATH = OUT_ANCHOR / f"MTPDB_v{__version__}.sqlite"
-
-if DB_PATH.exists():
-    raise Exception(f"Target DB already exists @{DB_PATH}. Aborting")
-
-
-SCHEMA = "BEGIN;\n{}\nEND;".format(Path("/app/schema.sql").read_text())
+__all__ = ["DB_NAME", "SCHEMA"]
+__version__ = "0.23.15-beta"


 class ColorFormatter(logging.Formatter):
```
src/db_rebuilder/daedalus/constants/__init__.py

Lines changed: 66 additions & 0 deletions (new file)

```python
"""Constants that are used throughout the program"""

# Re-export constants, so they can all be accessed from here
from daedalus import __version__
from daedalus.constants.url_hardpoints import (
    BIOMART,
    BIOMART_XML_REQUESTS,
    COSMIC,
    HUGO,
    IUPHAR_COMPILED,
    IUPHAR_DB,
    SLC_TABLES,
    TCDB,
)

__all__ = [
    "BIOMART",
    "BIOMART_XML_REQUESTS",
    "TCDB",
    "COSMIC",
    "IUPHAR_DB",
    "IUPHAR_COMPILED",
    "HUGO",
    "SLC_TABLES",
    "DESCRIPTION",
    "NAME",
    "EPILOG",
    "DB_NAME",
    "CACHE_NAME",
    "THESAURUS_FILE",
]

## TODO: It could be beneficial to bundle all of these constants into
# just one box and re-export just that.

DESCRIPTION = """
>>> DAEDALUS <<<

This program builds the MTP-Db from information retrieved from online databases.
The rationale is that if the databases update, we also update accordingly.
We also add a pinch of manual curation to fill in the gaps of knowledge from the
online databases.

Some of the parsing steps from the remote databases to the local DB are
heuristic in nature, and therefore might give imperfect information.
Feel free to open issues on GitHub @ https://github.com/CMA-Lab/MTP-DB/issues
if you find any incorrect or missing information.
"""
"""A short description of Daedalus"""

NAME = "Daedalus, the MTP-Db rebuilder"
"""The name of the program, to be shown by Argparser"""

EPILOG = (
    "For more usage information, please refer to https://github.com/CMA-Lab/MTP-DB/"
)
"""Message shown by argparser at the bottom of the usage info"""

DB_NAME = f"MTPDB_v{__version__}.sqlite"
"""Name of the DB file to save as output"""

CACHE_NAME = "MTPDB_datacache.pickle"
"""Name of the cache file to use to stash the downloaded data"""

THESAURUS_FILE = "thesaurus.csv"
"""Name of the local thesaurus file"""
```
src/db_rebuilder/daedalus/constants/url_hardpoints.py

Lines changed: 29 additions & 51 deletions

```diff
@@ -1,80 +1,56 @@
 BIOMART = "http://www.ensembl.org/biomart/martservice"
 """The Url used by Biomart to accept requests"""
+
 BIOMART_XML_REQUESTS = {
-    "IDs+desc": {
-        "query": """<?xml version="1.0" encoding="UTF-8"?>
+    "entrez": """<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE Query>
-<Query virtualSchemaName = "default" formatter = "CSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
+<Query virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "1" datasetConfigVersion = "0.6" >

 <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
 <Filter name = "biotype" value = "protein_coding"/>
 <Attribute name = "ensembl_gene_id_version" />
-<Attribute name = "ensembl_transcript_id_version" />
-<Attribute name = "description" />
-<Attribute name = "external_gene_name" />
-<Attribute name = "ensembl_peptide_id_version" />
-<Attribute name = "entrezgene_id" />
-<Attribute name = "pdb" />
-<Attribute name = "refseq_mrna" />
+<Attribute name = "entrezgene_id" />
 </Dataset>
 </Query>""",
-        "colnames": [
-            "ensembl_gene_id_version",
-            "ensembl_transcript_id_version",
-            "description",
-            "external_gene_name",
-            "ensembl_peptide_id_version",
-            "entrezgene_id",
-            "pdb",
-            "refseq_mrna",
-        ],
-    },
-    "hugo_symbols": {
-        "query": """<?xml version="1.0" encoding="UTF-8"?>
+    "IDs": """<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE Query>
-<Query virtualSchemaName = "default" formatter = "CSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
+<Query virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "1" datasetConfigVersion = "0.6" >

 <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
 <Filter name = "biotype" value = "protein_coding"/>
-<Attribute name = "hgnc_id" />
-<Attribute name = "hgnc_symbol" />
 <Attribute name = "ensembl_gene_id_version" />
+<Attribute name = "ensembl_transcript_id_version" />
 </Dataset>
 </Query>""",
-        "colnames": ["hgnc_id", "hgnc_symbol", "ensembl_gene_id_version"],
-    },
-    "IDs": {
-        "query": """<?xml version="1.0" encoding="UTF-8"?>
+    "proteins": """<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE Query>
-<Query virtualSchemaName = "default" formatter = "CSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
+<Query virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "1" datasetConfigVersion = "0.6" >

 <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
 <Filter name = "biotype" value = "protein_coding"/>
-<Attribute name = "ensembl_gene_id" />
-<Attribute name = "ensembl_transcript_id" />
-<Attribute name = "ensembl_peptide_id" />
-<Attribute name = "version" />
-<Attribute name = "transcript_version" />
-<Attribute name = "peptide_version" />
+<Attribute name = "ensembl_transcript_id_version" />
+<Attribute name = "ensembl_peptide_id_version" />
+<Attribute name = "pdb" />
 <Attribute name = "refseq_mrna" />
-<Attribute name = "refseq_peptide" />
+<Attribute name = "refseq_peptide" />
+</Dataset>
+</Query>""",
+    "gene_names": """<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE Query>
+<Query virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "1" datasetConfigVersion = "0.6" >
+
+<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
+<Filter name = "biotype" value = "protein_coding"/>
+<Attribute name = "hgnc_id" />
+<Attribute name = "hgnc_symbol" />
+<Attribute name = "description" />
+<Attribute name = "ensembl_gene_id_version" />
 </Dataset>
 </Query>""",
-        "colnames": [
-            "ensembl_gene_id",
-            "ensembl_transcript_id",
-            "ensembl_peptide_id",
-            "version",
-            "transcript_version",
-            "peptide_version",
-            "refseq_mrna",
-            "refseq_peptide",
-        ],
-    },
 }
 """Hardpoints with Biomart data.

-In the form of 'table_name': {'query': xlm_query, 'colnames': [list of colnames]}
+In the form of 'table_name': 'xml_query'
 """

 TCDB = {
@@ -104,6 +80,7 @@

 IUPHAR_DB = "https://www.guidetopharmacology.org/DATA/public_iuphardb_v2022.2.zip"
 """URL to the download of the full IUPHAR database"""
+
 IUPHAR_COMPILED = {
     "targets+families": "https://www.guidetopharmacology.org/DATA/targets_and_families.csv",
     "ligands": "https://www.guidetopharmacology.org/DATA/ligands.csv",
@@ -112,7 +89,7 @@
 """URLs to the compiled IUPHAR data from their downloads page"""

 HUGO = {
-    "nomenclature": "http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/hgnc_complete_set_2021-03-01.txt",
+    "nomenclature": "https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/hgnc_complete_set_2023-04-01.txt",
     "groups": {
         # I could download json files, but most of the data is flat anyway, so...
         "endpoint": "https://www.genenames.org/cgi-bin/genegroup/download?id={id}&type=branch",
@@ -138,3 +115,4 @@
 """Hugo downloads as found on their download pages"""

 SLC_TABLES = "http://slc.bioparadigms.org/"
+"""URL to the SLC tables that have data regarding solute carriers"""
```

src/db_rebuilder/daedalus/errors.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,6 +5,6 @@ class CacheKeyError(Exception):


 class Abort(Exception):
-    """The program cannot continue, but the error was logged."""
+    """The program cannot continue, but the error was caught, logged, and we can exit gracefully."""

     pass
```
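The changelog describes the pattern behind this docstring change: failing parsers are skipped gracefully, with all failures dumped at once before aborting, since parsers only write to the database and never read from it. A rough sketch of that flow; the runner function and its names are hypothetical, not the project's actual code (the `Abort` class here is redefined only to keep the sketch self-contained):

```python
import logging

log = logging.getLogger(__name__)

class Abort(Exception):
    """The program cannot continue, but the error was caught, logged, and we can exit gracefully."""

def run_all_parsers(parsers: dict) -> None:
    """Run every parser, collecting failures instead of dying on the first.

    Parsers are independent of one another (they only write to the DB),
    so one failure does not invalidate the rest; all errors are logged
    and a single Abort is raised at the end.
    """
    failures: "dict[str, Exception]" = {}
    for name, parser in parsers.items():
        try:
            parser()
        except Exception as exc:  # collected, re-raised collectively below
            log.error("Parser %s failed: %s", name, exc)
            failures[name] = exc
    if failures:
        raise Abort(f"{len(failures)} parser(s) failed: {', '.join(failures)}")
```

This is what makes large-scale failures easier to debug: one run surfaces every broken parser, not just the first.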
File renamed without changes.

src/db_rebuilder/daedalus/manual_data/atp_driven_ABC_data.csv renamed to src/db_rebuilder/daedalus/local_data/atp_driven_ABC_data.csv

File renamed without changes.

src/db_rebuilder/schema.sql renamed to src/db_rebuilder/daedalus/local_data/schema.sql

Lines changed: 7 additions & 7 deletions
```diff
@@ -1,20 +1,20 @@
 CREATE TABLE gene_ids (
-    ensg_version TEXT UNIQUE NOT NULL, -- from biomart > IDs+desc > ensembl_gene_id_version
-    ensg TEXT PRIMARY KEY, -- from biomart > IDs+desc > ensembl_gene_id_version
-    ensg_version_leaf INT NOT NULL -- from biomart > IDs+desc > ensembl_gene_id_version
+    ensg_version TEXT UNIQUE NOT NULL, -- from biomart > IDs+desc > gene_stable_id_version
+    ensg TEXT PRIMARY KEY, -- from biomart > IDs+desc > gene_stable_id_version
+    ensg_version_leaf INT NOT NULL -- from biomart > IDs+desc > gene_stable_id_version
 );

 CREATE TABLE transcript_ids (
-    ensg TEXT NOT NULL, -- from biomart > IDs+desc > ensembl_gene_id_version
-    enst TEXT PRIMARY KEY, -- from biomart > IDs+desc > ensembl_transcript_id_version
+    ensg TEXT NOT NULL, -- from biomart > IDs+desc > gene_stable_id_version
+    enst TEXT PRIMARY KEY, -- from biomart > IDs+desc > transcript_stable_id_version
     enst_version TEXT UNIQUE NOT NULL, -- same as enst
     enst_version_leaf INT NOT NULL, -- same as enst
     is_canonical_isoform INT NOT NULL -- bool
 );

 CREATE TABLE mrna_refseq (
     -- These cannot be unique, as some refseq IDs are missing
-    enst TEXT NOT NULL, -- from biomart > IDs+desc > ensembl_transcript_id_version
+    enst TEXT NOT NULL, -- from biomart > IDs+desc > transcript_stable_id_version
     refseq_transcript_id TEXT -- from biomart > IDs+desc > refseq_mrna
     -- refseq_transcript_id_version INT -- MISSING?? No version for refseq?
     -- refseq_transcrpit_id_version_leaf INT -- See aboveref
@@ -29,7 +29,7 @@ CREATE TABLE protein_ids (
 );

 CREATE TABLE gene_names (
-    ensg TEXT, -- from biomart > IDs+desc > ensembl_gene_id_version
+    ensg TEXT, -- from biomart > IDs+desc > gene_stable_id_version
     hugo_gene_id TEXT, -- from biomart > hugo_symbols > hgnc_id
     hugo_gene_symbol TEXT, -- from biomart > hugo_symbols > hugo_gene symbol
     -- (double check with the description field below)
```
File renamed without changes.
