T38DrugDB - Medicinally Relevant Compounds in the ZINC20

The database lives at:

Zenodo (Large data files)
github (Executables and README)
CSBJ (Original Publication)

Overview

T38DrugDB provides:

34,345,597 compounds in PDBQT and MOL2 formats
Filtered dataset based on LogP, molecular weight, and medicinal chemistry guidelines
Random subsampling tools for flexible dataset creation

Usage

This directory contains a subset of the ZINC20 database filtered for physicochemical and structural properties observed in medicinal compounds. The purpose of this database is to aid drug discovery, especially concerning virtual screening via autodock-GPU. However, any other chemoinformatics tool for drug discovery that relies on MOL2 or PDBQT files may benefit from this subset.
The main component are the PDBQT_filtered and MOL2_filtered subdirectories that contain >34,000,000 compounds in PDBQT/MOL2 format deposited in several .zip archives. The two subdirectories can be downloaded from Zenodo. They required further compression into .tar.gz to comply with Zenodo's upload size limit and need to be decompressed to yield the aforementioned .zip archives for intended usage with the supplied python scripts.
The intended use of this database is to cover an extensive part of medicinally relevant chemical space for drug discovery projects. For this purpose, the full database can be used or it can be randomly subsampled via fetch_N_random_PDBQTs.py and fetch_N_random_MOL2s.py (make sure all required Python modules are installed).
The .zip format was chosen to keep the database portable; however, this causes performance bottlenecks when trying to access the compounds. Using fetch_N_random_PDBQTs.py to subsample around 100,000 compounds may take 10 minutes, depending on the hardware. Therefore, tinyDB was created, which contains a tiny subset with around 200,000 compounds and fewer .zip archives while also containing fetch_N_random_*.py with identical functionality. Any testing should be conducted in this subdirectory for faster results. The file example.sh extracts 100 random PDBQT and MOL2 compounds into tempPDBQT/ and tempMOL2/ within seconds and is the quickest way to showcase the database's functionality.

Quickstart

Test functionality with a Python script:

# Download only tinyDB/
cd tinyDB
python fetch_N_random_PDBQTs.py -i tinyT38DrugDB.csv -N 100 -d ./tempPDBQT/

Or use the provided example script to extract 100 random compounds into tempPDBQT/ and tempMOL2/ within seconds.

./example.sh

Use the Jupyter notebook for more detailed exploration:

jupyter notebook fetch_N_random_PDBQTs.ipynb

Compounds can also be extracted manually via their ZINC ID and corresponding tranche:

# For PDBQT format,
unzip -p PDBQT_filtered/HFABMN.zip ZINC000006949878.pdbqt > test.pdbqt
# For MOL2 format 
unzip -p MOL2_filtered/FFADMN.zip ZINC000620995125.mol2 > test.mol2

Directory Structure

T38DrugDB/
├── README.md                 # THIS file
├── T38drugDB.csv             # Compound Table (34M+ rows, 3 cols)
├── fetch_N_random_PDBQTs.py  # PDBQT sampling script
├── fetch_N_random_MOL2s.py   # MOL2 sampling script
├── PDBQT_filtered/           # Main database (PDBQT format 35 GB; *.tar.gz: 16 GB)
│   └── [trancheID].zip       # Compressed compound batches
├── MOL2_filtered/            # Main database (MOL2 format 75 GB; *.tar.gz: 31 GB)
│   └── [trancheID].zip       # Compressed compound batches
└── tinyDB/                          # TEST SUBSET (1.2 GB)
    ├── example.sh                   # Quick test script
    ├── fetch_N_random_PDBQTs.py     # see above
    ├── fetch_N_random_MOL2s.py      # see above
    ├── fetch_N_random_PDBQTs.ipynb  # interactive subsampling
    ├── tinyPDBQT_filtered/          # see above PDBQT_filtered/
    │   └── [trancheID].zip          # see above
    └── tinyMOL2_filtered/           # see above MOL2_filtered/
        └── [trancheID].zip          # see above

Brief File Descriptions

T38drugDB.csv
- only available at Zenodo
- list of all SMILES and ZINCIDs contained in T38DrugDB.zip
- auxiliary file for subsampling compounds from the database
- 34,345,597 lines
fetch_N_random_PDBQTs.py and fetch_N_random_MOL2s.py
- Python executable to subsample the database randomly
- the file T38drugDB.csv is used for subsampling
- files are extracted from the zip files in PDBQT_filtered or MOL2_filtered
- files are extracted into a directory specified as a command line argument to the python script (see python fetch_N_random_PDBQTs.py --help)
- It can take up to 10 minutes for a subsample ranging in the hundreds of thousands compounds
PDBQT_filtered/ and MOL2_filtered/
- only available at Zenodo
- uploaded as .tar.gz and need to be extracted before usage
- subdirectories with 35 GB and 75 GB of data
- contains downloaded compounds from the ZINC20 database
- filtered via the ZINC20 webpage and according to the Handbook of Medicinal Chemistry
- further info on filtering see Filter Criteria Applied to ZINC20 Database
- compounds are stored in zip archives individually
tinyDB/
- lightweight database for testing and quick downloading
- contains only 1.2 GB instead of 110 GB of Data
example.sh
- Quick demonstration script
- executes the two Python scripts to extract 100 random compounds in both formats
- also requires installing the required Python modules
tinyDB/fetch_N_random_PDBQTs.ipynb
- jupyter notebook for interactive testing
- this file subsamples the PDBQT database randomly
- the file tinyT38drugDB.csv is used for subsampling
- files are extracted from the zip files in tinyPDBQT_filtered
- output directory specified as an argument in the parser.parse_args() function

Optimization Tips

Download only tinyDB/ first for development and testing
Download and extract PDBQT_filtered.tar.gz and MOL2_filtered.tar.gz after you verified tinyDB/ works with your approach/workflow
Consider writing custom parallel processing scripts for extensive extractions

Filter Criteria Applied to ZINC20 Database:

property	range (including)
LogP	1 to 3
Mass	250 to 450 Da
NOCount	2 to 9
NHOHCount	0 to 3
TPSA	19 to 70 A^2
NumRotatableBonds	1 to 8
NumAromaticCarbocycles	0 to 2
NumAromaticHeterocycles	0 to 1
NumAromaticRings	1 to 3
FractionCSP3	0.2 to 0.6

Citation

If you use T38DrugDB in your research, please cite:

Luis Vollmers, Shu-Yu Chen, Martin Zacharias. In Silico Analysis of Potential Stabilizer Binding Sites at Protein–RNA Interfaces. Comput Struct Biotechnol J. 2026;35:0016. DOI:10.34133/csbj.0016

License

This work is licensed under a Creative Commons Attribution 4.0 International License. See creativecommons.org/licenses/by/4.0/ for further information.

Contact

For questions, issues, or contributions:

luis.vollmers@tum.de
zacharias@tum.de
Publication Link: https://doi.org/10.34133/csbj.0016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

T38DrugDB - Medicinally Relevant Compounds in the ZINC20

Overview

Usage

Quickstart

Directory Structure

Brief File Descriptions

Optimization Tips

Filter Criteria Applied to ZINC20 Database:

Citation

License

Contact

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
tinyDB		tinyDB
.gitignore		.gitignore
README.md		README.md
fetch_N_random_MOL2s.py		fetch_N_random_MOL2s.py
fetch_N_random_PDBQTs.py		fetch_N_random_PDBQTs.py

Folders and files

Latest commit

History

Repository files navigation

T38DrugDB - Medicinally Relevant Compounds in the ZINC20

Overview

Usage

Quickstart

Directory Structure

Brief File Descriptions

Optimization Tips

Filter Criteria Applied to ZINC20 Database:

Citation

License

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages