# LLNL 3D Protein-Ligand Dataset for Anti-viral Screening against SARS-CoV-2 
This dataset contains protein-ligand complexes in a 3D representation for anti-viral drug screening against SARS-CoV-2. It is part of the [Lawrence Livermore National Laboratory Covid-19 database](https://covid19drugscreen.llnl.gov), but is specifically designed to facilitate machine learning and other data science tasks that address both efficacy (protein-ligand binding affinity) and safety. The complex dataset, called "ml-hdf", consists of ligands docked into four potential binding pockets of SARS-CoV-2 protein targets, stored in a 3D atomic representation. The ligands include U.S. Food and Drug Administration (FDA) approved drugs and "Other-world-approved" drugs that have been approved for use in the EU, Canada, and Japan. The compounds were docked against two binding pockets of the Spike protein (spike, spike1) and two conformations of the main protease (protease, protease2).
   

## Directory Structure

There are four directories, each containing ligands docked against a particular binding pocket. The cut-off for the binding pocket regions is 8 angstroms, similar to the PDBbind database.
- spike: complex data for one binding site of the spike protein (PDB ID 6m0j), stabilized by the Cys480-Cys488 disulfide bond. This is an allosteric binding site.
- spike1: complex data for another binding site of the spike protein (PDB ID 6m0j), in the proximity of the beta-turn formed by residues 501-505. This is the receptor binding domain for ACE2.
- protease: complex data for the binding site of one conformation of the main protease (PDB ID 6lu7).
- protease2: complex data for the binding site of another conformation of the main protease (PDB ID 6y84).

Each subdirectory contains about 90 ml-hdf files, each of which holds about 100 complex poses (10 docking poses per complex). The poses were down-selected using AutoDock Vina with the AMBER molecular simulation package.
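
As a minimal sketch of how this layout might be traversed, the snippet below counts the files found in each of the four pocket directories. The helper name `count_files_per_pocket` and its `dataset_root` argument are hypothetical, and no file extension is assumed, so every regular file in a directory is counted.

```python
from pathlib import Path

# Directory names as listed in this README.
POCKETS = ("spike", "spike1", "protease", "protease2")


def count_files_per_pocket(dataset_root):
    """Return {pocket_name: number of regular files} for each pocket directory.

    Pocket directories that do not exist are reported with a count of 0.
    """
    root = Path(dataset_root)
    counts = {}
    for pocket in POCKETS:
        pocket_dir = root / pocket
        if pocket_dir.is_dir():
            counts[pocket] = sum(1 for p in pocket_dir.iterdir() if p.is_file())
        else:
            counts[pocket] = 0
    return counts
```

Calling `count_files_per_pocket(".")` from the directory that holds the four folders should report roughly 90 files per pocket if the download is complete.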
   

## ML-HDF Structure

The ML-HDF files use the HDF5 format. The internal data structure of an ml-hdf file is as follows: 
   
- Ligand ID
   - pybel
      - processed
         - docking
             - 1
             - 2
             - 3
             ...
             - 10
                 - data
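
The hierarchy above can be inspected without loading any arrays. The following is a sketch only: the helper name `list_poses` is hypothetical, and it assumes every pose group is named with an integer string and contains a dataset called "data", as the tree suggests.

```python
import h5py


def list_poses(ml_hdf):
    """Return {ligand_id: [(pose_id, data_shape), ...]} for an open ml-hdf file.

    Only dataset shapes are read, not the arrays themselves.
    """
    inventory = {}
    for lig_id, lig_group in ml_hdf.items():
        docking = lig_group["pybel/processed/docking"]
        # Pose groups are named "1", "2", ...; sort them numerically.
        pose_ids = sorted(docking.keys(), key=int)
        inventory[lig_id] = [(pid, docking[pid]["data"].shape) for pid in pose_ids]
    return inventory
```

A typical call would be `with h5py.File(file_path, "r") as f: print(list_poses(f))`, where `file_path` points to one ml-hdf file.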
   
The Ligand ID combines an internal ID used in ConveyorLC with the [ZINC database ID](https://zinc.docking.org) (e.g., 134_ZINC000003807172). Each ligand has up to 10 docking poses (1, 2, 3, ..., 10). Each docking pose holds a dataset named "data", a 2D array containing the 3D atomic representation of the binding pocket region together with the docked ligand atoms. The array has shape [N, 22], where N is the total number of atoms in the protein-ligand complex and 22 is the length of the feature vector for each atom. The atom features include the following information:   
- 3D coordinate (x, y, z) of the atom   
- Atomic number   
- Atom hybridization   
- Number of heavy atom bonds (heavy valence)   
- Number of bonds with other heteroatoms (hetero valence)   
- Structural properties   
- Partial charge   

For more detail about the atom features, see [arXiv:2005.07704](https://arxiv.org/pdf/2005.07704.pdf).   
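
Before feeding a pose into a model, it can help to verify the [N, 22] layout described above. The sketch below uses hypothetical helper names (`split_pose_array`, `center_coordinates`) and assumes only that the first three columns are the x, y, z coordinates and the remaining 19 columns are feature values. Centering the coordinates is shown as one common preprocessing choice, not something this dataset prescribes.

```python
import numpy as np

N_COLUMNS = 22  # 3 coordinates + 19 feature values, per the [N, 22] layout


def split_pose_array(pose_data):
    """Validate a [N, 22] pose array and split it into coordinates and features."""
    pose = np.asarray(pose_data, dtype=np.float64)
    if pose.ndim != 2 or pose.shape[1] != N_COLUMNS:
        raise ValueError(f"expected shape [N, {N_COLUMNS}], got {pose.shape}")
    atom_xyz = pose[:, :3]   # x, y, z
    atom_feat = pose[:, 3:]  # the remaining 19 per-atom feature values
    return atom_xyz, atom_feat


def center_coordinates(atom_xyz):
    """Translate coordinates so that their geometric center is at the origin."""
    return atom_xyz - atom_xyz.mean(axis=0)
```

Raising an error on an unexpected shape keeps a malformed or truncated pose from silently reaching downstream code.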
   

## How to Read 

The ml-hdf files can be read using the [h5py package](https://www.h5py.org) in Python or the [HDF5 C++ API](https://support.hdfgroup.org/HDF5/doc/cpplus_RM/index.html). Below is sample Python code that reads an ml-hdf file:
   
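    import h5py

    file_path = "path/to/ml_hdf_file"  # placeholder: path to one ml-hdf file

    with h5py.File(file_path, "r") as ml_hdf:
        for lig_id in ml_hdf.keys():  # iterate over all ligand IDs in the file
            complex_data = ml_hdf[lig_id]["pybel"]["processed"]["docking"]
            for pose_id in range(1, 11):
                pose_key = str(pose_id)  # HDF5 group names are strings
                if pose_key not in complex_data:
                    continue  # a ligand may have fewer than 10 poses
                pose_data = complex_data[pose_key]["data"][:]  # [N, 22] array
                atom_xyz = pose_data[:, 0:3]  # atom 3D coordinates
                atom_feat = pose_data[:, 3:]  # remaining atom features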
   

## Model Evaluation

Ground truth labels (binding affinity, toxicity, etc.) are not available for this dataset. However, the [LLNL Covid-19 Data Portal](https://covid19drugscreen.llnl.gov) provides the following as references: physics-based scoring results, including Molecular Mechanics/Generalized Born-Solvent Accessible Surface Area (MM/GBSA) rescoring (CDT4mmgbsa); machine learning fusion model predictions using a 3D CNN and a 3D Spatial Graph CNN ([FAST](https://github.com/llnl/fast)); and molecular dynamics (MD) simulation results with the single-point MM/GBSA calculation averaged over multiple time steps. The website also provides safety and pharmacokinetic property predictions generated with the ATOM Modeling Pipeline (AMPL) and the Maestro workflow manager. For more information about the original database and the scoring outputs, see the main [LLNL Covid-19 Data Portal](https://covid19drugscreen.llnl.gov).
   

## Contact

Contact [Hyojin Kim](mailto:hkim@llnl.gov) or [Jonathan Allen](mailto:allen99@llnl.gov) with further questions.

LLNL-MI-813372