# Video Game Sentiment to Popularity

"Video Game Reviews Sentiment to Popularity" is a project that investigates whether video game reviews from aggregator websites can be used to predict, via classification, whether a game reaches a fixed threshold of owners on Steam. By converting review documents to vectors and then fitting them to a classification learner, the project achieved a moderate level of accuracy.

# Structure

├── LICENSE
├── .vscode
│   └── launch.json      <- Launches the flask test server for the dashboard.
├── README.md            <- The top-level README for developers using this project.
├── files
│   ├── models           <- Data dumps of the dashboard pipeline and the GridSearchCVs from
│   │                       hyperparameter tuning. Includes JSONs of the best hyperparameters
│   │                       for each GridSearchCV.
│   ├── raw              <- Data from the data pump. Minimal modifications.
│   └── client_id.json   <- Holds the OpenCritic API key, the SQL access credentials and
│                           pathing, and the model selection for the dashboard backend.
│
├── dashboard            <- Source code for the web dashboard.
│   ├── backend          <- Python flask server for initializing a model and processing
│   │   │                   user inputs into predictions.
│   │   ├── api
│   │   │   ├── Doc2VecTransformer.py  <- Modeler for converting tagged documents to
│   │   │   │                             vectors within an sklearn pipeline.
│   │   │   └── ReviewHandler
│   │   │
│   │   ├── static
│   │   └── app.py       <- Initializes the flask server.
│   │
│   └── frontend         <- React.js, HTML, and CSS files for the end-user GUI.
│
├── environment.yml      <- The requirements file for reproducing the analysis environment,
│                           e.g. generated with `conda env create -f environment.yml`.
├── notebooks            <- Jupyter notebooks. Naming convention is a function, an underscore
│   │                       `_`, and the principal data source affected, e.g.
│   │                       `extract_steam.ipynb`. For models, it is `model_`, followed by the
│   │                       specific evaluation being performed, e.g. `model_hyperparameters.ipynb`.
│   │
│   ├── clean            <- Cleans data for insertion into the SQL database.
│   │   └── clean.ipynb
│   │
│   ├── explore          <- Scripts to create exploratory and results-oriented visualizations.
│   │   └── explore.ipynb
│   │
│   ├── extract          <- Scripts to download or generate data.
│   │   ├── extract_opencritic.ipynb
│   │   ├── extract_steam.ipynb
│   │   └── extract_steamspy.ipynb
│   │
│   └── models           <- Scripts to train models and then use trained models to make
│       │                   predictions.
│       ├── model_aggregation.ipynb
│       ├── model_data_prep.ipynb
│       ├── model_hyperparameters.ipynb     <- Used to generate model parameters for the dashboard.
│       └── model_scaling_robustness.ipynb
│
└── sql                  <- Scripts for creating tables on PSQL.
    ├── reviews.sql      <- Creates the table for review data.
    └── sales.sql        <- Creates the table for game name, Steam app id, and approximate
                            ownership count on Steam.

# Importing inputs.zip and outputs.zip

1. From inputs.zip, unzip the following folder and use it to replace the folder at the given relative repository path:
   1. `raw` to `repo/files/raw`. This skips modeling step 2.
2. From outputs.zip, unzip the following folders and place them at the given relative repository paths, replacing the existing folders entirely:
   1. `models` to `repo/files/models`. This skips modeling step 4.
   2. `static` to `repo/dashboard/backend/static`. This provides a pre-trained model for the "tree_grid" option (see below) and precludes training a new one from the database.
   3. `build` to `repo/dashboard/frontend/build`. This provides the frontend webpage and skips the "Build the dashboard frontend" step of the environment setup.

# How to set up the environment

The project requires a Windows machine to run.

1. Install and configure Visual Studio Code.
   1. Add the Python and Jupyter notebook extensions.
2. Install and configure Anaconda (Python).
   1. After installation, open the PowerShell terminal installed with Anaconda.
   2. Change directories to the repository.
   3. Enter `conda env create -f environment.yml` to install the Anaconda environment used to run this project.
      If packages cannot be found, remove the version info (`==[version num]`) from the affected package's entry in the file.
   4. Activate the Anaconda environment with `conda activate group6b`.
3. Create a PostgreSQL database. Create an account with read and write abilities. Perform the item below with an account with table-creation abilities.
   - Run `reviews.sql` and `sales.sql` inside the server so that the tables appear in the default location and can be queried without specifying a schema. The modeling and cleaning notebooks will not work otherwise.
4. Configure the `client_id.json` file.
   1. Set the SQL server details to match the server the project will upload cleaned data to and query it from. The login credentials must be for an account with both read and write privileges.
   2. Set `X-RapidAPI-Key` to your own unique API key. A Mega subscription ($50/month) is required for `extract_opencritic.ipynb` to function properly.
   3. Set the model to one of several model types. This determines which model the dashboard backend uses. This choice may be postponed until after hyperparameter tuning, to decide which type performs best.
      - `log_grid` - Logistic Regression Classifier
      - `tree_grid` - Random Forest Classifier
      - `net_grid` - Multi-Layer Perceptron Classifier
      - `knn_grid` - K-Nearest Neighbors Classifier
5. Install the latest recommended stable build of Node.js.
6. Build the dashboard frontend.
   1. Open the PowerShell terminal in Anaconda.
   2. Change directory to `dashboard/frontend`.
   3. Install the necessary JS libraries with `npm install`.
   4. Build the dashboard with `npm run build`.

# How to build the model hyperparameters for the dashboard

1. Open JupyterLab in Anaconda Navigator. Open the "notebooks" folder.
2. Open the "extract" folder and run the files in the given order to create a cache of raw data.
   1. Execute all cells in `extract_steam.ipynb` to retrieve a sample list of ~100,000 games on Steam.
   2.
      Execute all cells in `extract_opencritic.ipynb` to retrieve reviews for approximate matches of the previously retrieved Steam games. If any of the steps takes more than an hour, you can interrupt the kernel and modify the `session_count` variables to try to achieve better parallelization. If an exception is raised, the affected cell caches its output, and re-runs will resume from where they left off. Expect this step to take more than one hour to complete.
   3. Execute all cells in `extract_steamspy.ipynb` to retrieve approximate ownership data for the Steam games that have reviews.
3. Open the "clean" folder and execute `clean.ipynb`. This performs the first round of data cleaning before uploading the data to the PSQL database.
4. Open the "models" folder and run `model_hyperparameters.ipynb`. This generates `.skops` dumps of the GridSearchCVs, so fitting is not needed on re-runs, and `.json` files of the best hyperparameters for each model. This step may take 3-6 hours to complete on a desktop PC.

# For other findings

1. For data prep findings, run `model_data_prep.ipynb`.
2. For scaling and robustness findings, run `model_scaling_robustness.ipynb`.
3. For aggregation findings, run `model_aggregation.ipynb`.
4. For additional exploration, open the "explore" folder and run `explore.ipynb`.

# How to run the dashboard

The following is performed on a Windows machine.

1. Using Anaconda Navigator to open Visual Studio Code, open the folder for the repository. Be sure it is running in the group6b environment.
2. Press CTRL + F5 to launch the flask server. As part of initialization, the server constructs a model from the database data and the best hyperparameters generated previously.
   - Alternatively, open the "Run and Debug" tab on the sidebar and click the "Start Debugging" button in the panel that opens.
   - The code uses the repository root as the working directory. You cannot launch `app.py` from directly within its own directory.
3.
   The terminal in VS Code should indicate that the Flask server is running on http://127.0.0.1:5000.
4. Open the above link.
5. The dashboard should be on display.
6. To obtain a prediction for a single review document, copy the review text to the clipboard, then paste it into the on-screen textbox.
   - Alternatively, switch the radio button to "multi-review file" to extract multiple reviews from a header-less CSV file in which each line is a single review, then use the file selector to choose a target file on your system.
7. Click the "Predict" button.
8. A yes/no prediction of whether the game will be a bestseller (>500,000 Steam owners) is displayed, alongside the prediction's probability according to the classifier.
9. To terminate the program, close the browser, then press CTRL + C in the VS Code terminal to quit the Flask server.

# Notable References

- https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524
- https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
- https://towardsdatascience.com/build-deploy-a-react-flask-app-47a89a5d17d9
- https://panjeh.medium.com/scikit-learn-hyperparameter-optimization-for-mlpclassifier-4d670413042b
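# Appendix: pipeline sketch

The core approach described in the overview (vectorizing review documents, then fitting a classifier that predicts whether a game clears the ownership threshold) can be sketched as follows. This is a toy illustration rather than the project's code: scikit-learn's `TfidfVectorizer` stands in for the repository's `Doc2VecTransformer`, `LogisticRegression` corresponds to the "log_grid" model option, and the reviews and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy review documents and labels (1 = reached the ownership threshold, 0 = did not).
reviews = [
    "A masterpiece with tight controls and a gripping story.",
    "Buggy, bland, and a waste of money.",
    "Gorgeous art direction and satisfying combat.",
    "Crashes constantly and the developers abandoned it.",
]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),    # stand-in for the project's Doc2VecTransformer
    ("classifier", LogisticRegression()), # analogous to the "log_grid" option
])
pipeline.fit(reviews, labels)

# Probability that a new review document belongs to a threshold-clearing game.
proba = pipeline.predict_proba(["Tight controls and a gripping story."])[0]
print(f"P(reaches threshold) = {proba[1]:.2f}")
```

Swapping the classifier step loosely mirrors the dashboard's model selection: substituting `RandomForestClassifier`, `MLPClassifier`, or `KNeighborsClassifier` would correspond to the "tree_grid", "net_grid", and "knn_grid" options.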