LLNL 3D Protein-Ligand Dataset for Anti-viral Screening against SARS-CoV-2
- Cite This Work
Kim, Hyojin; Jones, Derek; Zhang, Xiaohua; Kirshner, Dan; Lightstone, Felice; Allen, Jonathan (2020). LLNL 3D Protein-Ligand Dataset for Anti-viral Screening against SARS-CoV-2. In Lawrence Livermore National Laboratory (LLNL) Open Data Initiative. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0KW5DK5
This dataset contains protein-ligand complexes in a 3D representation for anti-viral drug screening against SARS-CoV-2. It is part of the Lawrence Livermore National Laboratory Covid-19 Therapeutic Design database, but is specifically designed to facilitate machine learning and other data science tasks with regard to both efficacy (protein-ligand binding affinity) and safety. The complex dataset, called "ml-hdf", comprises ligands and four potential binding pockets of the SARS-CoV-2 protein targets in a 3D atomic representation. The ligands include Food and Drug Administration (FDA) approved drugs and "Other-world-approved" drugs that have been approved for use by the EU, Canada and Japan. The compounds were docked against two binding pockets from the Spike protein (spike, spike1) and two conformations of the main protease (protease, protease2).
- Scope And Content
There are four directories, each of which contains ligands docked against a particular binding pocket. The cut-off for the binding pocket regions is 8 angstroms, similar to the PDBbind database.
- spike: complex data for one binding site of the spike protein (6m0j), stabilized by disulfide Cys480-Cys488. This is an allosteric binding site.
- spike1: complex data for another binding site of the spike protein (6m0j) in the proximity of the beta-turn formed by residues 501-505. This is the receptor binding domain to ACE2.
- protease: complex data for the binding site of one conformation of the main protease (6lu7)
- protease2: complex data for the binding site of another conformation of the main protease (6y84)
In each subdirectory, there are about 90 ml-hdf files, each of which contains about 100 complex poses (10 docking poses per complex). The poses have been down-selected using AutoDock Vina with the AMBER molecular simulation package.
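As a quick check after downloading, the directory layout described above can be inventoried with a short script. This is a minimal sketch, not part of the dataset tooling: the function name `list_ml_hdf_files` is hypothetical, and the `*.hdf` file pattern is an assumption, so adjust it to the actual file names on disk.

```python
from pathlib import Path

# Directory names as listed in the dataset description.
POCKET_DIRS = ("spike", "spike1", "protease", "protease2")


def list_ml_hdf_files(root, pattern="*.hdf"):
    """Map each binding-pocket directory to a sorted list of its files.

    ASSUMPTION: the ml-hdf files match `pattern`; the real extension may differ.
    Missing directories map to an empty list.
    """
    root = Path(root)
    found = {}
    for name in POCKET_DIRS:
        pocket_dir = root / name
        found[name] = sorted(pocket_dir.glob(pattern)) if pocket_dir.is_dir() else []
    return found
```

Calling `list_ml_hdf_files("path/to/dataset")` and printing the length of each list should give roughly 90 files per directory if the download is complete.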
- Technical Details
The ML-HDF files use the HDF5 format. Refer to the readme file for the internal data structure of the ml-hdf files. The Ligand ID is a combination of an internal id used in ConveyorLC and the ZINC database id (e.g., 134_ZINC000003807172). Each ligand has up to 10 docking poses (1, 2, 3, ..., 10). Each docking pose has a dataset named "data", a 2D array that holds the 3D atomic representation of the binding pocket region with the docked ligand atoms. The dimension of the array is [N, 22], where N is the total number of atoms of the protein-ligand complex and 22 is the length of the feature vector for each atom. The atom features include the following information:
--3D coordinate (x, y, z) of the atom
--Number of heavy atom bonds (heavy valence)
--Number of bonds with other heteroatoms (hetero valence)
For more detail about the atom features, please refer to the article "Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference" at https://arxiv.org/pdf/2005.07704.pdf
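The ligand ID convention and the [N, 22] array shape described above can be handled with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the ID is assumed to split at its first underscore, and the x, y, z coordinates are assumed to occupy the first three columns. The readme and the cited article are the authority on the actual column order.

```python
import numpy as np

N_FEATURES = 22  # per-atom feature vector length stated in the description


def split_ligand_id(ligand_id):
    """Split e.g. "134_ZINC000003807172" into ("134", "ZINC000003807172").

    ASSUMPTION: the internal ConveyorLC id contains no underscore.
    """
    internal_id, _, zinc_id = ligand_id.partition("_")
    return internal_id, zinc_id


def split_pose_array(data):
    """Return (coords, features) for one pose array of shape [N, 22].

    ASSUMPTION: columns 0-2 hold x, y, z and columns 3-21 hold the
    remaining atom features.
    """
    data = np.asarray(data)
    if data.ndim != 2 or data.shape[1] != N_FEATURES:
        raise ValueError(f"expected shape [N, {N_FEATURES}], got {data.shape}")
    return data[:, :3], data[:, 3:]
```

Validating the shape up front makes a layout mismatch fail loudly instead of silently producing misaligned features.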
How to Read the ML-HDF Files
The ML-HDF files can be read using the h5py package in Python or the HDF5 C++ API. See the readme document for more detail.
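A minimal h5py sketch of reading one file follows. It assumes the layout suggested by the Technical Details, namely one group per ligand ID, one sub-group per pose number, and a dataset named "data" inside each pose. That nesting is an inference, not a confirmed specification, and the function names are hypothetical, so check the readme before relying on it.

```python
def iter_poses(root):
    """Yield (ligand_id, pose_id, dataset) for every pose with a "data" entry.

    `root` is an open h5py.File, or any nested mapping with the same layout.
    ASSUMPTION: the nesting is ligand id -> pose id -> "data".
    """
    for ligand_id, poses in root.items():
        if not hasattr(poses, "items"):
            continue  # skip top-level entries that are not groups
        for pose_id, pose in poses.items():
            if hasattr(pose, "keys") and "data" in pose:
                yield ligand_id, pose_id, pose["data"]


def print_pose_shapes(path):
    """Print the array shape of each pose in one ml-hdf file."""
    import h5py  # imported here so iter_poses can be used without h5py

    with h5py.File(path, "r") as f:
        for ligand_id, pose_id, dataset in iter_poses(f):
            # .shape comes from metadata; dataset[()] would load the full array
            print(ligand_id, pose_id, dataset.shape)
```

Each shape printed by `print_pose_shapes` should be (N, 22) if the assumed layout holds. If nothing is printed, the file is probably organized differently from this sketch.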
The ground truth labels (binding affinity, toxicity, etc.) are not available for this dataset. However, the LLNL Covid-19 Data Portal (https://covid19drugscreen.llnl.gov) provides, as references, physics-based scoring results including Molecular Mechanics/Generalized Born-Solvent Accessible Surface Area (MM/GBSA) rescoring (CDT4mmgbsa), machine learning fusion model predictions using 3D CNN and 3D Spatial Graph CNN (FAST) (https://github.com/llnl/fast), and molecular dynamics (MD) simulation results with the single-point MM/GBSA calculation averaged over multiple time steps. The website also provides safety and pharmacokinetic property prediction results generated with the ATOM Modeling Pipeline (AMPL) and Maestro workflow manager. For more information about the original database and the scoring outputs, see the main LLNL Covid-19 Data Portal (https://covid19drugscreen.llnl.gov).
LLNL’s Laboratory Directed Research and Development (LDRD), tracking # 20-ERD-065 and 20-ERD-062.
AHA CRADA: funded by the American Heart Association Center for Accelerated Drug Discovery under a collaborative research and development agreement (CRADA TC02274).
- Related Publications
Is Supplement To:
Jones, Derek; Kim, Hyojin; Zhang, Xiaohua; Zemla, Adam; Stevenson, Garrett; Bennett, William D.; Kirshner, Dan; Wong, Sergio; Lightstone, Felice; Allen, Jonathan E. (2020). Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference. arXiv preprint arXiv:2005.07704. https://arxiv.org/pdf/2005.07704.pdf
Derek Jones: https://orcid.org/0000-0002-9510-6662
Hyojin Kim: https://orcid.org/0000-0001-7032-0999
- Related Resource
- Rights Holder
- Lawrence Livermore National Laboratory
Under copyright (US)
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
- Digital Object Made Available By
Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)