# Protein Embedding Analysis

Deep learning transformer models such as BERT have been widely successful across a variety of natural language tasks. 
Recently, BERT has been applied to protein sequences and has shown some success in protein prediction tasks relevant to 
biologists, such as secondary structure, fluorescence, and stability. To continue the investigation into BERT, we 
examined a new prediction task known as subcellular location, first described in DeepLoc (2017). Using BERT embeddings 
from a Berkeley research project titled Tasks Assessing Protein Embeddings (TAPE) as features for downstream modeling, 
we achieved 67% test set accuracy with a support vector classifier (SVC) on the 10-class classification task, and 89% 
with a Keras DNN on the binary classification task (membrane-bound vs. water-soluble proteins). Next, we built a 
containerized Flask app with Docker that can be deployed to EC2 and run on a GPU. This service embeds protein sequences 
using pretrained models and provides an interface for visualizing the embedding space with PCA and Plotly.

The code and data for our project are available in two GitHub repositories:

[TAPE Fork](https://github.com/rdedhia/tape): A fork of the Tasks Assessing Protein Embeddings (TAPE) GitHub
repository, with the addition of the subcellular location task, notebooks for data analysis and modeling, a Google
Colab notebook for embedding sequences using a GPU, and both input and output data files for subcellular location.

[Docker TAPE](https://github.com/rdedhia/docker-tape): A GPU-enabled Docker deployment for a Flask app that builds
on top of the embedding capabilities in TAPE. The Docker/Flask application provides two main capabilities:

1. Generating embeddings from protein sequences with one of TAPE's pretrained models by wrapping the `tape-embed` 
command of the TAPE CLI. These embeddings can be used as features for downstream prediction tasks. Embedding sequences 
is compute intensive and can be accelerated with a GPU. For this reason, the Dockerfile builds on an 
NVIDIA base image and is GPU compatible. We also include instructions for using AWS CloudFormation to set up a 
GPU-equipped EC2 instance in Amazon Web Services (AWS) to run the Docker container.
2. Visualizing the embeddings and labels for a given dataset with Principal Component Analysis (PCA) in 2 or 3 
dimensions, through an interactive Plotly plot.
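
To make the visualization step concrete, here is a minimal sketch of projecting embeddings onto their first 2 or 3 principal components. It is not the Flask app's actual code. It computes PCA directly with NumPy's SVD, and the function name `pca_project`, the random input data, and the 768-dimensional embedding size are assumptions made for this example.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project rows of `embeddings` onto the top principal components.

    Returns (coords, explained_variance_ratio). Hypothetical helper.
    """
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA requires mean-centered data
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return coords, explained[:n_components]


# Random stand-in for real embeddings and binary labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))  # assumed embedding size
labels = rng.integers(0, 2, size=50)

coords, ratio = pca_project(embeddings, n_components=3)
print(coords.shape, ratio)
```

The resulting `coords` columns, together with `labels` for coloring, could then be passed to a plotting function such as `plotly.express.scatter` or `plotly.express.scatter_3d` to get an interactive view.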