Readme: Environment Setup Instructions for SD County Housing Analysis - Part 2

This repository is used for setting up the database and training the ML models.

# 1. Setting up the Database

## python version and modules

Anaconda with Python 3.6.4 is used.

```
conda install sqlalchemy psycopg2
conda install -c anaconda seaborn
pip install python-dotenv
pip install bayesian-optimization
```

## database server

* PostgreSQL 10+
* extensions installed:

```
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;  -- needed for postgis_tiger_geocoder
CREATE EXTENSION address_standardizer;
CREATE EXTENSION address_standardizer_data_us;
CREATE EXTENSION postgis_topology;
CREATE EXTENSION postgis_tiger_geocoder;
```

# 2. Data Setup and Import

Using the Postgres data import utility is the easiest way to set up the database. From `Data.zip`, extract the `DataSumpC4` folder. It contains Postgres exports from 3 schemas:

1. postgres
2. postgresdb
3. postgis

Create these schemas in your instance and import each of them separately using the default settings.

To go through the entire process instead, follow the steps below. We have 8 data sources:

1. county
2. sandag
3. greatschool
4. addresses_to_geocode.csv: dumped from TIGER geocoding
5. School Digger
6. Fed (mortgage rates)
7. Crime
8. Employment

## database server setup

* PostgreSQL 9.6 or above
* extensions installed:

```
CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;  -- needed for postgis_tiger_geocoder
CREATE EXTENSION address_standardizer;
CREATE EXTENSION address_standardizer_data_us;
CREATE EXTENSION postgis_topology;
CREATE EXTENSION postgis_tiger_geocoder;
```

## local tools

* Install a PostgreSQL client, e.g. `sudo yum -y install postgresql96`
* Install `shp2pgsql`. The tool is used to import shapefiles for the geographic data.

## load data

* county and addresses_to_geocode
  * Run `load_county.sql` in any PostgreSQL client (e.g. psql, pgAdmin). Note that the `COPY` command assumes the data file resides on the server rather than on the local host. If your host is a different machine from the server, run the `\copy` command in psql instead.
* sandag
  * Run the commands in `load_sandag.bat`. Be sure `shp2pgsql` is installed.
* greatschool
  * Run `load_greatschool.sql` to import the school ratings; the School Digger API is further used to clean up the GreatSchools data.
  * Run `process_school.sql`, which generates the school features used for modeling.

## preprocess.sql

This script creates various virtual/materialized views for data processing. Be sure to execute it before running the Python notebooks.

## materialize.sql

This script materializes certain views and creates extra indices to speed up the queries used by the visualization and the application.
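Putting the steps above together, the following is a minimal sketch of the SQL load sequence. The script names come from this readme; the connection settings (standard `PGHOST`/`PGUSER`/`PGDATABASE` environment variables) and the way the `load_sandag.bat` commands are run on Linux are assumptions, not taken from the repository:

```
#!/usr/bin/env bash
# Hypothetical end-to-end load, assuming psql picks up connection details from
# the standard PG* environment variables (PGHOST, PGUSER, PGDATABASE, ...).
set -euo pipefail

psql -v ON_ERROR_STOP=1 -f load_county.sql       # county + addresses_to_geocode (COPY or \copy)

# sandag: load_sandag.bat is a Windows batch file; on Linux, run its shp2pgsql
# commands by hand, piping each one into psql.

psql -v ON_ERROR_STOP=1 -f load_greatschool.sql  # school ratings (cleaned up via the School Digger API)
psql -v ON_ERROR_STOP=1 -f process_school.sql    # school features for modeling

psql -v ON_ERROR_STOP=1 -f preprocess.sql        # views required by the Python notebooks
psql -v ON_ERROR_STOP=1 -f materialize.sql       # materialized views + extra indices
```

The remaining sources (mortgage rates, crime, employment, foreclosures, etc.) are loaded through the notebooks under `notebooks/load_data` in the repository described in the next section.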
# 3. Some more data import and setup of the source

```
git clone https://github.com/vvural/Housing.git
cd mas19
tree .
```

The repository layout:

```
.
├── README.md --> repository readme
├── analyzer.ipynb --> notebook used for analyzing the results of the model regression
├── baseline_prime.ipynb --> notebook used for testing new modeling techniques with the baseline features
├── correlations.ipynb --> notebook used for extracting feature correlations
├── data --> folder that contains various small data sets as CSV files
│   ├── SD_CPI_NSA.csv
│   ├── SD_ECO_COND_IDX.csv
│   ├── SD_HPI_ADJ_YEARLY.csv
│   ├── SD_HPI_NSA.csv
│   ├── SD_HPI_NSA_YEARLY.csv
│   ├── SD_HPI_SA.csv
│   ├── SD_HPI_SA_YEARLY.csv
│   └── historicalweeklydata.csv
├── notebooks --> folder that contains notebooks for loading data into the DB and experimenting with ML models
│   ├── CalcDailyMortgageRates.ipynb --> notebook used for calculating daily mortgage rates
│   ├── PriceIndex.ipynb --> notebook used for calculating the Housing Price Index
│   ├── experiments --> folder that contains notebooks for various experiments
│   │   ├── feature_importance.ipynb
│   │   ├── serdar_baseline.ipynb
│   │   ├── serdar_clustering.ipynb
│   │   ├── serdar_clustering_v2.ipynb
│   │   ├── serdar_clustering_v3.ipynb
│   │   ├── serdar_features.ipynb
│   │   └── serdar_stacking.ipynb
│   └── load_data --> folder that contains notebooks for loading data into the DB
│       ├── Load\ foreclosures.ipynb
│       ├── Load\ school\ district\ rating.ipynb
│       ├── Load_GDP_delinquency.ipynb
│       ├── Load_coastline.ipynb
│       ├── Load_employment.ipynb
│       ├── Mortgage.ipynb
│       └── crime.ipynb
├── plotter.ipynb --> notebook used for plotting various graphs
├── readme.txt --> this file
├── reducer.ipynb --> notebook used for gathering (reducing) the outputs from the massively parallel hyperparameter tuning jobs
├── runner.ipynb --> main notebook used for running the model regression
├── src --> folder that contains the modularized Python code; file names indicate what each module does
│   ├── algorithm.py
│   ├── clustering.py
│   ├── data_source.py
│   ├── features.py
│   ├── model_manage.py
│   ├── model_regression.py
│   ├── model_regression_v2.py
│   ├── model_regression_v2_bak.py
│   ├── multi_segment_regressor.py
│   ├── plot_utils.py
│   ├── preprocessing.py
│   └── utils.py
├── tuner.ipynb --> notebook used for testing hyperparameter tuner jobs
├── tuner.py --> the Python module that each Kubernetes node runs to start hyperparameter tuning
├── tuner.sh --> shell script that invokes `tuner.py` when the node is created as a `Job`
├── v0.ipynb --> notebook (version 0.1) that runs the baseline + XGBoost model regression
├── v0.py --> modularized version of `v0.ipynb` for versioning via the `sacred` library
└── yaml --> folder that contains the YAML files for massively parallel hyperparameter tuning
    └── tuner
        ├── tuner_1.yaml
        ├── tuner_10.yaml
        ├── tuner_11.yaml
        ├── tuner_12.yaml
        ├── tuner_13.yaml
        ├── tuner_14.yaml
        ├── tuner_15.yaml
        ├── tuner_2.yaml
        ├── tuner_3.yaml
        ├── tuner_4.yaml
        ├── tuner_5.yaml
        ├── tuner_6.yaml
        ├── tuner_7.yaml
        ├── tuner_8.yaml
        ├── tuner_9.yaml
        ├── tuner_create.sh
        ├── tuner_delete.sh
        └── tuner_tester.yaml
```
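The hyperparameter tuning jobs run on Kubernetes: each node starts `tuner.py` via `tuner.sh`, and the manifests under `yaml/tuner/` are created and removed by `tuner_create.sh` / `tuner_delete.sh`. Those scripts are not reproduced in this readme; the following is only a rough sketch, assuming they apply and delete the Job manifests with `kubectl` against an already configured cluster:

```
#!/usr/bin/env bash
# Hypothetical sketch of what tuner_create.sh / tuner_delete.sh are assumed to do:
# submit each tuner_<n>.yaml as a Kubernetes Job, then clean up once the
# tuning outputs have been gathered.
set -euo pipefail
cd yaml/tuner

# Launch the massively parallel tuning Jobs (this glob also picks up tuner_tester.yaml).
for manifest in tuner_*.yaml; do
    kubectl apply -f "$manifest"
done

# Watch progress.
kubectl get jobs

# Tear everything down after the results have been collected.
for manifest in tuner_*.yaml; do
    kubectl delete -f "$manifest" --ignore-not-found
done
```

The outputs of the parallel jobs are then gathered with `reducer.ipynb` and analyzed with `analyzer.ipynb`, as noted in the tree above.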