Readme: Environment Setup Instructions for SD County Housing Analysis - Part 2

This repository is used for setting up the database and training the ML models.

# 1. Setting up the Database

## python version and modules

Anaconda with Python 3.6.4 is used.

```
conda install sqlalchemy psycopg2
conda install -c anaconda seaborn
pip install python-dotenv
pip install bayesian-optimization
```

## database server

* PostgreSQL 10+
* extensions installed:

```
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;  -- needed for postgis_tiger_geocoder
CREATE EXTENSION address_standardizer;
CREATE EXTENSION address_standardizer_data_us;
CREATE EXTENSION postgis_topology;
CREATE EXTENSION postgis_tiger_geocoder;
```

# 2. Data Setup and Import

Using the Postgres data import utility is the easiest way to set up the database. From `Data.zip`, extract the `DataSumpC4` folder. It contains Postgres exports from 3 schemas:

1. postgres
2. postgresdb
3. postgis

Create these schemas in your instance and import each of them separately using the default settings.

To go through the entire process instead, follow the steps below. We have 8 data sources:

1. county
2. sandag
3. greatschool
4. addresses_to_geocode.csv: dumped from TIGER geocoding
5. School Digger
6. Fed (mortgage rates)
7. Crime
8. Employment

## database server setup

* PostgreSQL 9.6 or above
* extensions installed:

```
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;  -- needed for postgis_tiger_geocoder
CREATE EXTENSION address_standardizer;
CREATE EXTENSION address_standardizer_data_us;
CREATE EXTENSION postgis_topology;
CREATE EXTENSION postgis_tiger_geocoder;
```

## local tools

* Install a PostgreSQL client, e.g. `sudo yum -y install postgresql96`
* Install `shp2pgsql`. The tool is used to import shapefiles for the geographic data.

## load data

* county and addresses_to_geocode
  * Run `load_county.sql` in any PostgreSQL client (e.g. psql, pgAdmin). Note that the `COPY` command assumes the data file resides on the server rather than on the local host. If your host is a different machine from the server, run the `\copy` command in psql instead.
* sandag
  * Run the commands in `load_sandag.bat`. Be sure `shp2pgsql` is installed.
* greatschool
  * Run `load_greatschool.sql` to import the school ratings; the School Digger API is further used to clean up the GreatSchools data.
  * Run `process_school.sql`, which generates the school features used for modeling.

## preprocess.sql

This script creates various virtual/materialized views for data processing. Be sure to execute it before running the Python notebooks.

## materialize.sql

This script materializes certain views and creates extra indices to speed up the queries used by the visualization and the application.
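Putting the steps above together, the following is a minimal sketch of the SQL load sequence. The script names come from this readme; the connection settings (standard `PGHOST`/`PGUSER`/`PGDATABASE` environment variables) and the way the `load_sandag.bat` commands are run on Linux are assumptions, not taken from the repository:

```
#!/usr/bin/env bash
# Hypothetical end-to-end load, assuming psql picks up connection details from
# the standard PG* environment variables (PGHOST, PGUSER, PGDATABASE, ...).
set -euo pipefail

psql -v ON_ERROR_STOP=1 -f load_county.sql       # county + addresses_to_geocode (COPY or \copy)

# sandag: load_sandag.bat is a Windows batch file; on Linux, run its shp2pgsql
# commands by hand, piping each one into psql.

psql -v ON_ERROR_STOP=1 -f load_greatschool.sql  # school ratings (cleaned up via the School Digger API)
psql -v ON_ERROR_STOP=1 -f process_school.sql    # school features for modeling

psql -v ON_ERROR_STOP=1 -f preprocess.sql        # views required by the Python notebooks
psql -v ON_ERROR_STOP=1 -f materialize.sql       # materialized views + extra indices
```

The remaining sources (mortgage rates, crime, employment, foreclosures, etc.) are loaded through the notebooks under `notebooks/load_data` in the repository described in the next section.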
# 3. Some more data import and setup of the source

```
git clone https://github.com/vvural/Housing.git
cd mas19
tree .
```

The repository layout:

```
.
├── README.md --> repository readme
├── analyzer.ipynb --> notebook used for analyzing the results of the model regression
├── baseline_prime.ipynb --> notebook used for testing new modeling techniques with the baseline features
├── correlations.ipynb --> notebook used for extracting feature correlations
├── data --> folder that contains various small data sets as CSV files
│   ├── SD_CPI_NSA.csv
│   ├── SD_ECO_COND_IDX.csv
│   ├── SD_HPI_ADJ_YEARLY.csv
│   ├── SD_HPI_NSA.csv
│   ├── SD_HPI_NSA_YEARLY.csv
│   ├── SD_HPI_SA.csv
│   ├── SD_HPI_SA_YEARLY.csv
│   └── historicalweeklydata.csv
├── notebooks --> folder that contains notebooks for loading data into the DB and experimenting with ML models
│   ├── CalcDailyMortgageRates.ipynb --> notebook used for calculating daily mortgage rates
│   ├── PriceIndex.ipynb --> notebook used for calculating the Housing Price Index
│   ├── experiments --> folder that contains notebooks for various experiments
│   │   ├── feature_importance.ipynb
│   │   ├── serdar_baseline.ipynb
│   │   ├── serdar_clustering.ipynb
│   │   ├── serdar_clustering_v2.ipynb
│   │   ├── serdar_clustering_v3.ipynb
│   │   ├── serdar_features.ipynb
│   │   └── serdar_stacking.ipynb
│   └── load_data --> folder that contains notebooks for loading data into the DB
│       ├── Load\ foreclosures.ipynb
│       ├── Load\ school\ district\ rating.ipynb
│       ├── Load_GDP_delinquency.ipynb
│       ├── Load_coastline.ipynb
│       ├── Load_employment.ipynb
│       ├── Mortgage.ipynb
│       └── crime.ipynb
├── plotter.ipynb --> notebook used for plotting various graphs
├── readme.txt --> this file
├── reducer.ipynb --> notebook used for gathering (reducing) the outputs from the massively parallel hyperparameter tuning jobs
├── runner.ipynb --> main notebook used for running the model regression
├── src --> folder that contains the modularized Python code; file names indicate what each module does
│   ├── algorithm.py
│   ├── clustering.py
│   ├── data_source.py
│   ├── features.py
│   ├── model_manage.py
│   ├── model_regression.py
│   ├── model_regression_v2.py
│   ├── model_regression_v2_bak.py
│   ├── multi_segment_regressor.py
│   ├── plot_utils.py
│   ├── preprocessing.py
│   └── utils.py
├── tuner.ipynb --> notebook used for testing hyperparameter tuner jobs
├── tuner.py --> the Python module that each Kubernetes node runs to start hyperparameter tuning
├── tuner.sh --> shell script that invokes `tuner.py` when the node is created as a `Job`
├── v0.ipynb --> notebook (version 0.1) that runs the baseline + XGBoost model regression
├── v0.py --> modularized version of `v0.ipynb` for versioning via the `sacred` library
└── yaml --> folder that contains the YAML files for massively parallel hyperparameter tuning
    └── tuner
        ├── tuner_1.yaml
        ├── tuner_10.yaml
        ├── tuner_11.yaml
        ├── tuner_12.yaml
        ├── tuner_13.yaml
        ├── tuner_14.yaml
        ├── tuner_15.yaml
        ├── tuner_2.yaml
        ├── tuner_3.yaml
        ├── tuner_4.yaml
        ├── tuner_5.yaml
        ├── tuner_6.yaml
        ├── tuner_7.yaml
        ├── tuner_8.yaml
        ├── tuner_9.yaml
        ├── tuner_create.sh
        ├── tuner_delete.sh
        └── tuner_tester.yaml
```
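The hyperparameter tuning jobs run on Kubernetes: each node starts `tuner.py` via `tuner.sh`, and the manifests under `yaml/tuner/` are created and removed by `tuner_create.sh` / `tuner_delete.sh`. Those scripts are not reproduced in this readme; the following is only a rough sketch, assuming they apply and delete the Job manifests with `kubectl` against an already configured cluster:

```
#!/usr/bin/env bash
# Hypothetical sketch of what tuner_create.sh / tuner_delete.sh are assumed to do:
# submit each tuner_<n>.yaml as a Kubernetes Job, then clean up once the
# tuning outputs have been gathered.
set -euo pipefail
cd yaml/tuner

# Launch the massively parallel tuning Jobs (this glob also picks up tuner_tester.yaml).
for manifest in tuner_*.yaml; do
    kubectl apply -f "$manifest"
done

# Watch progress.
kubectl get jobs

# Tear everything down after the results have been collected.
for manifest in tuner_*.yaml; do
    kubectl delete -f "$manifest" --ignore-not-found
done
```

The outputs of the parallel jobs are then gathered with `reducer.ipynb` and analyzed with `analyzer.ipynb`, as noted in the tree above.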