Prediction of Enzyme Classification using Protein Sequence Embeddings
Scripts (Run TAPE)
Scope And Content | Scripts to run TAPE.
Technical Details | name: tape
Scripts (Run ESM)
Scope And Content | Scripts to run ESM.
Technical Details | name: esm
Input data
Scope And Content | Enzyme and non-enzyme protein sequence strings.
Output data
Scope And Content | Protein sequence embeddings derived from DSE_MAS_group4_scripts.zip.
- Cite This Work
- Baldino, Breanne; Dohkani, Tahamtan; Pinto, Matteo; Sundaresan, Ambika; Yu, Cindy; Rose, Peter (2021). Prediction of Enzyme Classification using Protein Sequence Embeddings. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0736QSX
- Description
- Biologists work with a multitude of protein sequences represented as strings of letters. Because a protein's amino acid sequence can be treated as text, we can apply Natural Language Processing (NLP) algorithms to predict enzyme classifications, which are indicative of both protein structure and function. Our goal is to propose a multi-level classification solution that predicts the class of a given enzyme. Our method uses BERT (Bidirectional Encoder Representations from Transformers) models to create embeddings, or feature vectors, from a protein sequence, and a variety of machine learning models to predict the class and subclass of the enzyme.
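The two-stage idea in the description (predict the enzyme class from an embedding, then the subclass within that class) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the 3-dimensional vectors are made-up stand-ins for real sequence embeddings, the EC labels are arbitrary examples, and a nearest-centroid rule stands in for whichever machine learning models the project actually used. The names NearestCentroid and TwoLevelECClassifier are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


class NearestCentroid:
    """Toy stand-in for any classifier that maps a vector to a label."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for vec, label in zip(X, y):
            groups[label].append(vec)
        self.centroids_ = {label: centroid(vs) for label, vs in groups.items()}
        return self

    def predict_one(self, vec):
        # Pick the label whose centroid is closest in Euclidean distance.
        return min(self.centroids_, key=lambda lab: dist(vec, self.centroids_[lab]))


class TwoLevelECClassifier:
    """Predict the EC class first, then the subclass within that class."""

    def fit(self, X, ec_subclasses):
        # An EC subclass label such as "3.5" belongs to class "3".
        classes = [label.split(".")[0] for label in ec_subclasses]
        self.class_model_ = NearestCentroid().fit(X, classes)
        self.subclass_models_ = {}
        for c in set(classes):
            idx = [i for i, k in enumerate(classes) if k == c]
            self.subclass_models_[c] = NearestCentroid().fit(
                [X[i] for i in idx], [ec_subclasses[i] for i in idx]
            )
        return self

    def predict_one(self, vec):
        c = self.class_model_.predict_one(vec)
        return c, self.subclass_models_[c].predict_one(vec)


# Made-up 3-dimensional "embeddings" (assumption); real embeddings are far larger.
X = [
    (0.0, 0.0, 1.0), (0.1, 0.0, 0.9),  # labelled 1.1
    (0.0, 1.0, 1.0), (0.1, 0.9, 1.0),  # labelled 1.2
    (5.0, 0.0, 0.0), (5.1, 0.1, 0.0),  # labelled 3.4
    (5.0, 1.0, 0.0), (4.9, 0.9, 0.0),  # labelled 3.5
]
y = ["1.1", "1.1", "1.2", "1.2", "3.4", "3.4", "3.5", "3.5"]

model = TwoLevelECClassifier().fit(X, y)
prediction = model.predict_one((4.8, 0.95, 0.1))
print(prediction)
```

Training one subclass model per class keeps each second-stage decision restricted to subclasses that are consistent with the first-stage prediction; the trade-off is that a wrong class prediction cannot be corrected at the subclass level.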
- Creation Date
- 2021-01-01 to 2021-05-04
- Date Issued
- 2021
- Language
- English
- Related Resources
- DEEPre: sequence-based enzyme EC number prediction by deep learning: https://academic.oup.com/bioinformatics/article/34/5/760/4562505
- ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2368-y
- GitHub Repository of ESM-1b: https://github.com/facebookresearch/esm
- GitHub Repository of TAPE: https://github.com/songlab-cal/tape
Source data
Previous version
- License
- Creative Commons Attribution 4.0 International Public License
- Rights Holder
- Baldino, Breanne; Dohkani, Tahamtan; Pinto, Matteo; Sundaresan, Ambika; Yu, Cindy
- Copyright
- Under copyright (US)
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
- Digital Object Made Available By
- Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)
- Last Modified
- 2022-10-03