Machine Learning Engineer Nanodegree

Capstone Project

📑   P6: Sberbank Russian Housing Market


Import Libraries

In [2]:
hide_code = ''
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import scipy

import seaborn as sns
import matplotlib.pylab as plt

from random import random
import warnings
from IPython.display import display, HTML, SVG

from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

import keras as ks
from keras.models import Sequential, load_model, Model
from keras.optimizers import SGD, RMSprop
from keras.layers import Dense, Dropout, LSTM, GlobalAveragePooling1D
from keras.layers import Activation, Flatten, Input, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Conv2D, MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint
from keras.wrappers.scikit_learn import KerasRegressor
from keras.utils.vis_utils import model_to_dot
Using TensorFlow backend.

Create a set of functions

In [3]:
# Fit the Regressor
def regression(regressor, x_train, x_test, y_train):
    reg = regressor.fit(x_train, y_train)
    y_train_reg = reg.predict(x_train)
    y_test_reg = reg.predict(x_test)
    return y_train_reg, y_test_reg

# Plot the Neural network fitting history
def history_plot(fit_history):
    plt.figure(figsize=(18, 12))
    # Upper panel: loss on the training and validation data
    plt.subplot(211)
    plt.plot(fit_history.history['loss'], color='#348ABD', label='train')
    plt.plot(fit_history.history['val_loss'], color='#228B22', label='test')
    plt.legend()
    plt.title('Loss Function')
    # Lower panel: mean absolute error on the same data
    plt.subplot(212)
    plt.plot(fit_history.history['mean_absolute_error'], color='#348ABD', label='train')
    plt.plot(fit_history.history['val_mean_absolute_error'], color='#228B22', label='test')
    plt.legend()
    plt.title('Mean Absolute Error')

# Get values of the metrics
def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    separator1, separator2 = '<_>'*18, '-'*10
    print(separator1, '\n', regressor, '\n'+separator1)
    print("EV score. Train: ", explained_variance_score(y_train, y_train_reg))
    print("EV score. Test: ", explained_variance_score(y_test, y_test_reg))
    print("R2 score. Train: ", r2_score(y_train, y_train_reg))
    print("R2 score. Test: ", r2_score(y_test, y_test_reg))
    print("MSE score. Train: ", mean_squared_error(y_train, y_train_reg))
    print("MSE score. Test: ", mean_squared_error(y_test, y_test_reg))
    print("MAE score. Train: ", mean_absolute_error(y_train, y_train_reg))
    print("MAE score. Test: ", mean_absolute_error(y_test, y_test_reg))
    print("MdAE score. Train: ", median_absolute_error(y_train, y_train_reg))
    print("MdAE score. Test: ", median_absolute_error(y_test, y_test_reg))

# Get values of the metrics for a single set of predictions
def scores2(regressor, target, target_predict):
    separator1, separator2 = '<_>'*18, '-'*10
    print(separator1, '\n', regressor, '\n'+separator1)
    print("EV score:", explained_variance_score(target, target_predict))
    print("R2 score:", r2_score(target, target_predict))
    print("MSE score:", mean_squared_error(target, target_predict))
    print("MAE score:", mean_absolute_error(target, target_predict))
    print("MdAE score:", median_absolute_error(target, target_predict))
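
The metrics printed by the helper functions above can be checked by hand on a tiny example. The sketch below uses three made-up target values and predictions (they are not taken from the housing data) and the same scikit-learn functions that the notebook imports.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, r2_score,
                             mean_squared_error, mean_absolute_error,
                             median_absolute_error)

# Hypothetical toy values, chosen so that the metrics are easy to verify by hand
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

toy_metrics = {
    'EV': explained_variance_score(y_true, y_pred),
    'R2': r2_score(y_true, y_pred),
    'MSE': mean_squared_error(y_true, y_pred),
    'MAE': mean_absolute_error(y_true, y_pred),
    'MdAE': median_absolute_error(y_true, y_pred),
}
for name, value in toy_metrics.items():
    print(name, round(value, 4))
```

The errors here are 1, 0 and -2, so MAE and MdAE are both 1 and MSE is 5/3. The EV and R2 scores differ because the explained variance ignores the mean of the errors while R2 does not.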

Capstone Proposal Overview

This capstone proposal applies what we have learned throughout the Nanodegree program to a problem of our own choice, to be solved with machine learning algorithms and techniques. A project proposal encompasses seven key points:

  • The project's domain background: the field of research where the project is derived;
  • A problem statement: a problem being investigated for which a solution will be defined;
  • The datasets and inputs: data or inputs being used for the problem;
  • A solution statement: the solution proposed for the problem given;
  • A benchmark model: some simple or historical model or result to compare the defined solution to;
  • A set of evaluation metrics: functional representations for how the solution can be measured;
  • An outline of the project design: how the solution will be developed and results obtained.

The full project report with the results will be completed and published as well.

Domain Background

Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses. Sberbank, Russia’s oldest and largest bank, helps their customers by making predictions about realty prices so renters, developers, and lenders are more confident when they sign a lease or purchase a building.

Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as the number of bedrooms and the location are enough to make pricing predictions complicated. Adding an unstable economy to the mix means Sberbank and their customers need more than simple regression models in their arsenal.

Problem Statement

Sberbank is challenging programmers to develop algorithms which use a broad spectrum of features to predict realty prices. The algorithms rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.

My approach is to select the indicators most correlated with the target variable and to apply ensemble algorithms, which have repeatedly shown successful results in the study of price trends in real estate. Boosting and bagging methods combine several models at once in order to improve the prediction accuracy on learning problems with a numerical target variable.
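
As a minimal sketch of the two ensemble families mentioned above, the following block fits a bagging and a boosting regressor from scikit-learn. The data are synthetic and only stand in for the housing features; the sample size, the coefficients and the hyperparameters are arbitrary assumptions, not the settings used later in the project.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (not the Sberbank dataset)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + 5.0 * np.sin(X[:, 1]) + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Bagging averages trees fitted on bootstrap samples;
# boosting adds trees one by one, each correcting the previous errors
models = [
    ('bagging', BaggingRegressor(n_estimators=50, random_state=1)),
    ('boosting', GradientBoostingRegressor(n_estimators=100, random_state=1)),
]

ensemble_r2 = {}
for name, model in models:
    model.fit(X_train, y_train)
    ensemble_r2[name] = r2_score(y_test, model.predict(X_test))
    print(name, 'R2 on the held-out part:', round(ensemble_r2[name], 3))
```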

Then I am going to explore different types of neural networks for regression predictions and try to reach the same level of model performance as with the ensemble methods.

Datasets and Inputs

The basis for the investigation is a large number of economic indicators for pricing and prices themselves (train.csv and test.csv). Macroeconomic variables are collected in a separate file for transaction dates (macro.csv). In addition, the detailed description of variables is provided (data_dictionary.txt).

For practical reasons, I have not analyzed all the data and have chosen the following independent variables:

  1. the dollar rate, which traditionally affects the Russian real estate market;
  2. the distance in km from the Kremlin (the closer to the center of the city, the more expensive);
  3. indicators characterizing the availability of urban infrastructure nearby (schools, medical and sports centers, supermarkets, etc.);
  4. indicators of a particular living space (number of rooms, floor, etc.);
  5. proximity to transport nodes (for example, to the metro);
  6. indicators of population density and employment in the region of housing accommodation.

All these economic indicators have a strong influence on price formation and can be used as a basic set for regression analysis. Examples of numerical variables: the distance to the metro, the distance to the school, the dollar rate at the transaction moment, the area of the living space. Examples of categorical variables: neighborhoods, the nearest metro station, the number of rooms.

The goal of the project is to predict the price of housing using the chosen set of numerical and categorical variables. The predicted target is not discrete, and all the values of this dependent variable are given for the training set. Therefore it is necessary to apply regression algorithms of supervised learning.

Data Description (data_dictionary.txt)

In [4]:
# Display the description file
HTML('''<div id="data">
<p><iframe src="data_dictionary.txt" frameborder="3" height="300" width="99%"></iframe></p>
</div>''')

Load and Display the Data

In [5]:
# Load the dataset
macro = pd.read_csv('macro.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [6]:
# Display the data tables
100 101 102 103 104 105 106 107 108 109
oil_urals 82.87 82.87 82.87 82.87 82.87 82.87 82.87 82.87 82.87 82.87
gdp_quart 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8
gdp_quart_growth 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
cpi 319.8 319.8 319.8 319.8 319.8 319.8 319.8 319.8 319.8 319.8
ppi 350.2 350.2 350.2 350.2 350.2 350.2 350.2 350.2 350.2 350.2
gdp_deflator NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
balance_trade 16.604 16.604 16.604 16.604 16.604 16.604 16.604 16.604 16.604 16.604
balance_trade_growth 14.1 14.1 14.1 14.1 14.1 14.1 14.1 14.1 14.1 14.1
usdrub 29.1525 29.0261 29.1 28.9194 29.0239 29.092 29.092 29.092 29.1835 29.1398
eurrub 39.2564 39.4051 39.5008 39.5233 39.3691 39.2524 39.2524 39.2524 39.3214 39.1532
brent 84.83 84.77 84.72 86.15 87.17 85.99 85.99 85.99 84.23 84.8
net_capital_export NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gdp_annual 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2
gdp_annual_growth -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086
In [7]:
# Display the data tables
200 201 202 203 204 205 206 207
timestamp 2011-10-25 2011-10-25 2011-10-25 2011-10-25 2011-10-26 2011-10-26 2011-10-26 2011-10-26
full_sq 38 33 30 76 44 35 72 32
life_sq 19 14 18 51 29 21 45 18
floor 15 8 3 2 8 5 10 6
max_floor NaN NaN NaN NaN NaN NaN NaN NaN
material NaN NaN NaN NaN NaN NaN NaN NaN
build_year NaN NaN NaN NaN NaN NaN NaN NaN
num_room NaN NaN NaN NaN NaN NaN NaN NaN
kitch_sq NaN NaN NaN NaN NaN NaN NaN NaN
state NaN NaN NaN NaN NaN NaN NaN NaN
product_type Investment Investment Investment Investment Investment Investment Investment Investment
sub_area Horoshevskoe Juzhnoe Butovo Marfino Juzhnoportovoe Vostochnoe Izmajlovo Lefortovo Krylatskoe Chertanovo Juzhnoe
area_m 8.56843e+06 2.61551e+07 2.1044e+06 4.57959e+06 3.8e+06 8.99364e+06 1.21645e+07 9.28244e+06
raion_popul 56535 178264 26943 71715 76308 89971 78507 143661

Solution Statement

Selection of Features

In [8]:
# Create lists of the features
X_list_num = ['timestamp',
              'full_sq', 'num_room', 'area_m', 
              'kremlin_km', 'big_road2_km', 'big_road1_km',
              'workplaces_km',
              'stadium_km', 'swim_pool_km', 'fitness_km', 
              'detention_facility_km', 'cemetery_km',
              'radiation_km', 'oil_chemistry_km',
              'theater_km', 'exhibition_km', 'museum_km', 
              'park_km', 'public_healthcare_km',  
              'metro_min_walk', 'metro_km_avto',
              'bus_terminal_avto_km', 'public_transport_station_min_walk',
              'railroad_station_walk_min', 'railroad_station_avto_km',
              'kindergarten_km', 'school_km', 'preschool_km',
              'university_km', 'additional_education_km',
              'shopping_centers_km', 'big_market_km',
              'ekder_all', 'work_all', 'young_all']

X_list_cat = ['sub_area', 'ID_metro', 
              'office_raion', 'sport_objects_raion',
              'raion_popul', 'healthcare_centers_raion',
              'school_education_centers_raion', 
              'preschool_education_centers_raion']

target_train = train['price_doc']
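
One way to check such a choice of numerical features is to rank them by the absolute correlation with the target. The block below is only a sketch under stated assumptions: the toy table, its coefficients and the `noise` column are invented for illustration and borrow just the column names from the dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data: the price depends on two features and not on 'noise'
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'full_sq': rng.uniform(20, 120, n),
    'kremlin_km': rng.uniform(1, 40, n),
    'noise': rng.normal(0, 1, n),
})
toy['price_doc'] = (100000 * toy['full_sq']
                    - 150000 * toy['kremlin_km']
                    + rng.normal(0, 500000, n))

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (toy.corr()['price_doc']
                    .drop('price_doc')
                    .abs()
                    .sort_values(ascending=False))
print(corr_with_target)
```

Pearson correlation captures only linear relations, so a feature with a low coefficient can still be useful for tree-based models.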
In [9]:
# Create the distribution plot for the target
plt.style.use('seaborn-whitegrid')
f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18, 6))

sns.distplot(target_train, bins=200, color='#228B22', ax=ax1)

sns.distplot(np.log(target_train), bins=200, color='#228B22', ax=ax2)
ax2.set_xlabel("Logarithm of the variable 'Prices'")

plt.suptitle('Sberbank Russian Housing Data');
In [10]:
# Create the table of descriptive statistics
print ("Sberbank Russian Housing Dataset Statistics: \n")
print ("Number of houses = ", len(target_train))
print ("Number of features = ", len(list(train[X_list_num+X_list_cat].keys())))
print ("Minimum house price = ", np.min(target_train))
print ("Maximum house price = ", np.max(target_train))
print ("Mean house price = ", "%.2f" % np.mean(target_train))
print ("Median house price = ", "%.2f" % np.median(target_train))
print ("Standard deviation of house prices =", "%.2f" % np.std(target_train))
Sberbank Russian Housing Dataset Statistics: 

Number of houses =  30471
Number of features =  44
Minimum house price =  100000
Maximum house price =  111111112
Mean house price =  7123035.28
Median house price =  6274411.00
Standard deviation of house prices = 4780032.89
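
The mean price above exceeds the median, which points to a right-skewed distribution and motivates plotting the logarithm of the target. The sketch below illustrates the effect on hypothetical prices drawn from a log-normal distribution; the parameters are assumptions and the numbers are not the Sberbank data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed "prices" drawn from a log-normal distribution
rng = np.random.RandomState(0)
toy_prices = pd.Series(rng.lognormal(mean=15.6, sigma=0.6, size=5000))

# Sample skewness before and after the log transform
skew_raw = toy_prices.skew()
skew_log = np.log(toy_prices).skew()

print('mean > median:', toy_prices.mean() > toy_prices.median())
print('skewness of the raw values:', round(skew_raw, 2))
print('skewness of the log values:', round(skew_log, 2))
```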

Fill in Missing Values

In [11]:
# Find out the number of missing values
timestamp                               0
full_sq                                 0
num_room                             9572
area_m                                  0
kremlin_km                              0
big_road2_km                            0
big_road1_km                            0
workplaces_km                           0
stadium_km                              0
swim_pool_km                            0
fitness_km                              0
detention_facility_km                   0
cemetery_km                             0
radiation_km                            0
oil_chemistry_km                        0
theater_km                              0
exhibition_km                           0
museum_km                               0
park_km                                 0
public_healthcare_km                    0
metro_min_walk                         25
metro_km_avto                           0
bus_terminal_avto_km                    0
public_transport_station_min_walk       0
railroad_station_walk_min              25
railroad_station_avto_km                0
kindergarten_km                         0
school_km                               0
preschool_km                            0
university_km                           0
additional_education_km                 0
shopping_centers_km                     0
big_market_km                           0
ekder_all                               0
work_all                                0
young_all                               0
dtype: int64
In [12]:
# Find out the number of missing values
timestamp                             0
full_sq                               0
num_room                              0
area_m                                0
kremlin_km                            0
big_road2_km                          0
big_road1_km                          0
workplaces_km                         0
stadium_km                            0
swim_pool_km                          0
fitness_km                            0
detention_facility_km                 0
cemetery_km                           0
radiation_km                          0
oil_chemistry_km                      0
theater_km                            0
exhibition_km                         0
museum_km                             0
park_km                               0
public_healthcare_km                  0
metro_min_walk                       34
metro_km_avto                         0
bus_terminal_avto_km                  0
public_transport_station_min_walk     0
railroad_station_walk_min            34
railroad_station_avto_km              0
kindergarten_km                       0
school_km                             0
preschool_km                          0
university_km                         0
additional_education_km               0
shopping_centers_km                   0
big_market_km                         0
ekder_all                             0
work_all                              0
young_all                             0
dtype: int64
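
Tables like the two above can be produced with the pandas calls `isnull()` and `sum()`. The following sketch shows the idea on a small made-up frame; it is an illustration and not necessarily the exact code of these cells.

```python
import numpy as np
import pandas as pd

# Made-up rows that only reuse a few column names from the dataset
toy = pd.DataFrame({
    'full_sq': [38, 33, 30, 76],
    'num_room': [np.nan, 2, np.nan, 3],
    'metro_min_walk': [13.5, np.nan, 7.2, 4.0],
})

# isnull() marks the missing cells, sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```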
In [13]:
# Create dataframes for sets of features
df_train = pd.DataFrame(train, columns=X_list_num)
df_train_cat = pd.DataFrame(train, columns=X_list_num+X_list_cat)

df_test = pd.DataFrame(test, columns=X_list_num)
df_test_cat = pd.DataFrame(test, columns=X_list_num+X_list_cat)

df_train['prices'] = target_train
df_train_cat['prices'] = target_train

# Delete rows with a lot of missing values
df_train = df_train.dropna(subset=['num_room'])
df_train_cat = df_train_cat.dropna(subset=['num_room'])

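# Fill in missing values by interpolation (the default linear method of pandas)
df_train['metro_min_walk'] = \
df_train_cat['metro_min_walk'] = \
df_train['metro_min_walk'].interpolate()

df_train['railroad_station_walk_min'] = \
df_train_cat['railroad_station_walk_min'] = \
df_train['railroad_station_walk_min'].interpolate()

df_test['metro_min_walk'] = \
df_test_cat['metro_min_walk'] = \
df_test['metro_min_walk'].interpolate()

df_test['railroad_station_walk_min'] = \
df_test_cat['railroad_station_walk_min'] = \
df_test['railroad_station_walk_min'].interpolate()

# Display the number of rows in the final training set
len(df_train)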
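
The two cleaning steps of this cell, dropping rows without `num_room` and interpolating the few remaining gaps, can be sketched on a toy frame. The values are invented, and the use of `Series.interpolate()` with its default linear method is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'num_room': [2, np.nan, 1, 3, 2],
    'metro_min_walk': [10.0, 5.0, np.nan, 20.0, 30.0],
})

# Step 1: drop the rows where the number of rooms is unknown
toy = toy.dropna(subset=['num_room']).copy()

# Step 2: fill the remaining gap from the neighbouring rows
toy['metro_min_walk'] = toy['metro_min_walk'].interpolate()
print(toy)
```

Linear interpolation takes the values from the neighbouring rows, so it is a convenient fill for a handful of gaps rather than an estimate based on the location of the apartment.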

Categorical and Macro Features

Add the Macro Feature

In [14]:
# Create a dictionary 'Date => Currency rate'
usdrub_pairs = dict(zip(list(macro['timestamp']), list(macro['usdrub'])))
# salary_pairs = dict(zip(list(macro['timestamp']), list(macro['salary'])))

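# Replace the dates by currency rates in the training and testing sets
for df in [df_train, df_train_cat, df_test, df_test_cat]:
    df['timestamp'] = df['timestamp'].map(usdrub_pairs)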

df_train.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
df_train_cat.rename(columns={'timestamp' : 'usdrub'}, inplace=True)

df_test.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
df_test_cat.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
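
The replacement of transaction dates by the dollar rate can be sketched on two tiny tables. The rates below are made up, and `Series.map` is one possible way to apply the dictionary.

```python
import pandas as pd

# Made-up macro table and sales table sharing the 'timestamp' key
toy_macro = pd.DataFrame({
    'timestamp': ['2011-10-25', '2011-10-26'],
    'usdrub': [30.5, 30.7],
})
toy_sales = pd.DataFrame({
    'timestamp': ['2011-10-25', '2011-10-25', '2011-10-26'],
    'full_sq': [38, 33, 44],
})

# Dictionary 'date => currency rate', then look up the rate for every sale
rate_by_date = dict(zip(toy_macro['timestamp'], toy_macro['usdrub']))
toy_sales['timestamp'] = toy_sales['timestamp'].map(rate_by_date)
toy_sales = toy_sales.rename(columns={'timestamp': 'usdrub'})
print(toy_sales)
```

Dates that are absent from the dictionary become NaN with `map`, which makes gaps in the macro table easy to detect.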

Preprocess Categorical Features

In [15]:
# Display categorical features
separator = '<_>'*38
for df in [df_train_cat, df_test_cat]:
    print ('\n', separator)
    print('\nsub area')
    print('Number of categories:', len(set(df['sub_area'])))
    print(set(df['sub_area']))

    print('\nID metro')
    print('Number of categories:', len(set(df['ID_metro'])))
    print(set(df['ID_metro']))

    print('\noffice raion')
    print('Number of categories:', len(set(df['office_raion'])))
    print(set(df['office_raion']))

    print('\nsport objects raion')
    print('Number of categories:', len(set(df['sport_objects_raion'])))
    print(set(df['sport_objects_raion']))

    print('\nraion popul')
    print('Number of categories:', len(set(df['raion_popul'])))
    print(set(df['raion_popul']))

    print('\nhealthcare centers raion')
    print('Number of categories:', len(set(df['healthcare_centers_raion'])))
    print(set(df['healthcare_centers_raion']))

    print('\nschool education centers raion')
    print('Number of categories:', len(set(df['school_education_centers_raion'])))
    print(set(df['school_education_centers_raion']))

    print('\npreschool education centers raion')
    print('Number of categories:', len(set(df['preschool_education_centers_raion'])))
    print(set(df['preschool_education_centers_raion']))

sub area
Number of categories: 146
{'Mitino', 'Severnoe Medvedkovo', 'Nagatinskij Zaton', 'Savelovskoe', 'Novogireevo', "Tekstil'shhiki", 'Jaroslavskoe', 'Timirjazevskoe', 'Horoshevo-Mnevniki', 'Poselenie Klenovskoe', 'Poselenie Novofedorovskoe', 'Ljublino', 'Severnoe Butovo', 'Nekrasovka', 'Golovinskoe', "Mar'ina Roshha", 'Orehovo-Borisovo Juzhnoe', 'Poselenie Shherbinka', "Moskvorech'e-Saburovo", 'Izmajlovo', 'Kurkino', 'Kuncevo', 'Molzhaninovskoe', 'Orehovo-Borisovo Severnoe', 'Koptevo', 'Gagarinskoe', 'Levoberezhnoe', 'Severnoe Tushino', 'Poselenie Sosenskoe', 'Poselenie Rogovskoe', 'Vojkovskoe', 'Zapadnoe Degunino', 'Prospekt Vernadskogo', 'Krylatskoe', "Mar'ino", 'Sokol', 'Mozhajskoe', 'Basmannoe', 'Kosino-Uhtomskoe', 'Severnoe Izmajlovo', 'Juzhnoe Medvedkovo', 'Pokrovskoe Streshnevo', 'Poselenie Voronovskoe', 'Taganskoe', 'Troickij okrug', 'Rjazanskij', 'Lefortovo', 'Caricyno', "Kon'kovo", "Sokol'niki", 'Poselenie Moskovskij', 'Novokosino', 'Meshhanskoe', 'Severnoe', "Altuf'evskoe", 'Poselenie Kokoshkino', 'Poselenie Filimonkovskoe', 'Birjulevo Zapadnoe', 'Butyrskoe', 'Begovoe', 'Chertanovo Severnoe', 'Dmitrovskoe', 'Ostankinskoe', 'Filevskij Park', 'Kapotnja', 'Poselenie Marushkinskoe', 'Ramenki', 'Marfino', 'Zjablikovo', 'Ivanovskoe', 'Poselenie Krasnopahorskoe', 'Arbat', 'Nizhegorodskoe', 'Vostochnoe Degunino', 'Ochakovo-Matveevskoe', 'Vnukovo', "Zamoskvorech'e", 'Poselenie Pervomajskoe', 'Poselenie Desjonovskoe', 'Jakimanka', 'Akademicheskoe', 'Vyhino-Zhulebino', 'Nagatino-Sadovniki', 'Poselenie Mihajlovo-Jarcevskoe', 'Novo-Peredelkino', 'Jasenevo', "Chertanovo Central'noe", 'Metrogorodok', 'Zjuzino', 'Sviblovo', 'Fili Davydkovo', 'Alekseevskoe', 'Horoshevskoe', 'Solncevo', "Kuz'minki", 'Beskudnikovskoe', 'Poselenie Mosrentgen', 'Brateevo', 'Staroe Krjukovo', 'Birjulevo Vostochnoe', "Krasnosel'skoe", 'Juzhnoe Butovo', 'Preobrazhenskoe', 'Poselenie Voskresenskoe', 'Lianozovo', 'Pechatniki', 'Hovrino', 'Strogino', 'Savelki', 'Vostochnoe', 'Matushkino', 'Teplyj Stan', 'Silino', 'Hamovniki', 
'Juzhnoe Tushino', 'Lomonosovskoe', 'Obruchevskoe', 'Cheremushki', 'Rostokino', 'Poselenie Shhapovskoe', 'Tverskoe', "Gol'janovo", 'Poselenie Rjazanovskoe', 'Presnenskoe', 'Vostochnoe Izmajlovo', 'Nagornoe', 'Shhukino', 'Poselenie Kievskij', 'Ajeroport', 'Krjukovo', 'Dorogomilovo', 'Juzhnoportovoe', 'Perovo', 'Losinoostrovskoe', 'Poselenie Vnukovskoe', 'Bibirevo', 'Sokolinaja Gora', 'Danilovskoe', 'Troparevo-Nikulino', 'Babushkinskoe', 'Otradnoe', 'Veshnjaki', 'Donskoe', 'Chertanovo Juzhnoe', 'Bogorodskoe', 'Kotlovka'}

ID metro
Number of categories: 219
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223}

office raion
Number of categories: 30
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 141, 14, 16, 19, 20, 23, 24, 27, 37, 39, 45, 48, 56, 59, 73, 84, 87, 93}

sport objects raion
Number of categories: 24
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 23, 24, 25, 29}

raion popul
Number of categories: 146
{90114, 116742, 6161, 28179, 76308, 80917, 68630, 53786, 94236, 41504, 71715, 8227, 12327, 83502, 21040, 77878, 139322, 118843, 81980, 57405, 13890, 78418, 178264, 85083, 112221, 101982, 4199, 5740, 61039, 75377, 123000, 37502, 142462, 67710, 2693, 108171, 57995, 47245, 57999, 115352, 153248, 87713, 118945, 21155, 112804, 145576, 78507, 247469, 7341, 130229, 125111, 129207, 38075, 102590, 105663, 96959, 145088, 86206, 8384, 56535, 79576, 85721, 102618, 156377, 85219, 113897, 174831, 132349, 51455, 111874, 111374, 57107, 43795, 78616, 12061, 155427, 55590, 178473, 143661, 73007, 48439, 36154, 21819, 64317, 26943, 103746, 102726, 32071, 101708, 9553, 157010, 17236, 55125, 4949, 27992, 130396, 165727, 94561, 94564, 7538, 89971, 28537, 89467, 76156, 17790, 76670, 2942, 83844, 123280, 166803, 80791, 60315, 175518, 4001, 142243, 64931, 83369, 125354, 102828, 37807, 111023, 155572, 65972, 73148, 31167, 39873, 3521, 72131, 85956, 106445, 7122, 26578, 61396, 219609, 78810, 104410, 91100, 81887, 19940, 100846, 122862, 32241, 104434, 2546, 122873, 76284}

healthcare centers raion
Number of categories: 7
{0, 1, 2, 3, 4, 5, 6}

school education centers raion
Number of categories: 14
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14}

preschool education centers raion
Number of categories: 13
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13}


sub area
Number of categories: 145
{'Mitino', 'Nagatinskij Zaton', 'Severnoe Medvedkovo', 'Savelovskoe', 'Novogireevo', 'Jaroslavskoe', "Tekstil'shhiki", 'Timirjazevskoe', 'Horoshevo-Mnevniki', 'Poselenie Novofedorovskoe', 'Ljublino', 'Severnoe Butovo', 'Nekrasovka', 'Golovinskoe', "Mar'ina Roshha", 'Orehovo-Borisovo Juzhnoe', 'Poselenie Shherbinka', "Moskvorech'e-Saburovo", 'Izmajlovo', 'Kurkino', 'Kuncevo', 'Molzhaninovskoe', 'Orehovo-Borisovo Severnoe', 'Koptevo', 'Gagarinskoe', 'Levoberezhnoe', 'Severnoe Tushino', 'Poselenie Sosenskoe', 'Poselenie Rogovskoe', 'Vojkovskoe', 'Zapadnoe Degunino', 'Prospekt Vernadskogo', 'Krylatskoe', "Mar'ino", 'Sokol', 'Mozhajskoe', 'Basmannoe', 'Kosino-Uhtomskoe', 'Severnoe Izmajlovo', 'Juzhnoe Medvedkovo', 'Pokrovskoe Streshnevo', 'Poselenie Voronovskoe', 'Taganskoe', 'Troickij okrug', 'Rjazanskij', 'Lefortovo', 'Caricyno', "Kon'kovo", "Sokol'niki", 'Poselenie Moskovskij', 'Novokosino', 'Meshhanskoe', 'Severnoe', "Altuf'evskoe", 'Poselenie Kokoshkino', 'Poselenie Filimonkovskoe', 'Birjulevo Zapadnoe', 'Butyrskoe', 'Begovoe', 'Chertanovo Severnoe', 'Dmitrovskoe', 'Ostankinskoe', 'Filevskij Park', 'Kapotnja', 'Poselenie Marushkinskoe', 'Ramenki', 'Marfino', 'Zjablikovo', 'Ivanovskoe', 'Poselenie Krasnopahorskoe', 'Arbat', 'Nizhegorodskoe', 'Vostochnoe Degunino', 'Ochakovo-Matveevskoe', 'Vnukovo', "Zamoskvorech'e", 'Poselenie Pervomajskoe', 'Poselenie Desjonovskoe', 'Jakimanka', 'Akademicheskoe', 'Vyhino-Zhulebino', 'Nagatino-Sadovniki', 'Poselenie Mihajlovo-Jarcevskoe', 'Novo-Peredelkino', 'Jasenevo', "Chertanovo Central'noe", 'Metrogorodok', 'Zjuzino', 'Sviblovo', 'Alekseevskoe', 'Fili Davydkovo', 'Horoshevskoe', 'Solncevo', 'Beskudnikovskoe', "Kuz'minki", 'Poselenie Mosrentgen', 'Brateevo', 'Staroe Krjukovo', 'Birjulevo Vostochnoe', "Krasnosel'skoe", 'Juzhnoe Butovo', 'Preobrazhenskoe', 'Poselenie Voskresenskoe', 'Lianozovo', 'Pechatniki', 'Hovrino', 'Strogino', 'Savelki', 'Vostochnoe', 'Teplyj Stan', 'Matushkino', 'Silino', 'Hamovniki', 'Juzhnoe Tushino', 
'Lomonosovskoe', 'Obruchevskoe', 'Cheremushki', 'Rostokino', 'Poselenie Shhapovskoe', 'Tverskoe', "Gol'janovo", 'Poselenie Rjazanovskoe', 'Presnenskoe', 'Vostochnoe Izmajlovo', 'Nagornoe', 'Shhukino', 'Kotlovka', 'Poselenie Kievskij', 'Ajeroport', 'Krjukovo', 'Dorogomilovo', 'Juzhnoportovoe', 'Perovo', 'Losinoostrovskoe', 'Poselenie Vnukovskoe', 'Bibirevo', 'Danilovskoe', 'Sokolinaja Gora', 'Troparevo-Nikulino', 'Babushkinskoe', 'Otradnoe', 'Veshnjaki', 'Donskoe', 'Bogorodskoe', 'Chertanovo Juzhnoe'}

ID metro
Number of categories: 212
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 161, 162, 163, 164, 165, 166, 167, 168, 170, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 193, 194, 195, 196, 197, 199, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 215, 216, 219, 220, 221, 222, 224}

office raion
Number of categories: 30
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 141, 14, 16, 19, 20, 23, 24, 27, 37, 39, 45, 48, 56, 59, 73, 84, 87, 93}

sport objects raion
Number of categories: 24
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 23, 24, 25, 29}

raion popul
Number of categories: 145
{90114, 116742, 6161, 28179, 76308, 80917, 68630, 53786, 94236, 41504, 71715, 8227, 12327, 83502, 21040, 77878, 139322, 118843, 81980, 57405, 13890, 78418, 178264, 85083, 112221, 101982, 4199, 5740, 61039, 75377, 123000, 37502, 67710, 142462, 2693, 57995, 108171, 47245, 57999, 115352, 153248, 118945, 87713, 21155, 112804, 145576, 78507, 247469, 7341, 130229, 129207, 125111, 38075, 86206, 96959, 145088, 102590, 105663, 8384, 56535, 79576, 156377, 102618, 85721, 85219, 113897, 174831, 132349, 51455, 111874, 111374, 57107, 43795, 78616, 12061, 155427, 55590, 178473, 143661, 73007, 48439, 36154, 21819, 64317, 26943, 103746, 102726, 32071, 101708, 9553, 157010, 17236, 55125, 4949, 27992, 130396, 165727, 94561, 94564, 7538, 89971, 28537, 89467, 76156, 76670, 17790, 83844, 123280, 166803, 80791, 60315, 175518, 4001, 64931, 142243, 83369, 125354, 102828, 111023, 37807, 155572, 65972, 73148, 31167, 39873, 3521, 72131, 85956, 106445, 7122, 26578, 61396, 219609, 78810, 104410, 91100, 81887, 19940, 122862, 100846, 32241, 2546, 104434, 122873, 76284}

healthcare centers raion
Number of categories: 7
{0, 1, 2, 3, 4, 5, 6}

school education centers raion
Number of categories: 14
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14}

preschool education centers raion
Number of categories: 13
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13}
In [16]:
# Find the missing category in the testing set
for feature in X_list_cat:
    for element in list(set(df_test_cat[feature])):
        if element not in list(set(df_train_cat[feature])): 
            print (feature, element)
ID_metro 224
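
The same check can be written with a set difference per feature. A minimal sketch on invented values:

```python
import pandas as pd

# Invented categorical columns for a training and a testing set
toy_train = pd.DataFrame({
    'ID_metro': [1, 2, 3, 223],
    'sub_area': ['Arbat', 'Sokol', 'Arbat', 'Mitino'],
})
toy_test = pd.DataFrame({
    'ID_metro': [1, 224, 2],
    'sub_area': ['Sokol', 'Arbat', 'Arbat'],
})

# For every feature keep the categories that occur only in the testing set
unseen = {}
for feature in toy_train.columns:
    difference = set(toy_test[feature]) - set(toy_train[feature])
    if difference:
        unseen[feature] = sorted(difference)
print(unseen)
```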
In [17]:
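# Replace categorical values of 'ID_metro' by discrete numbers
ID_metro_cat = pd.factorize(df_train_cat['ID_metro'])
df_train_cat['ID_metro'] = ID_metro_cat[0]

# Map every original value to its code; the value found only in the testing set gets the next free code
ID_metro_pairs = dict(zip(list(ID_metro_cat[1]), range(len(ID_metro_cat[1]))))
ID_metro_pairs[224] = 219

df_test_cat['ID_metro'] = df_test_cat['ID_metro'].map(ID_metro_pairs)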
In [18]:
# Replace categorical values of other categorical features by discrete numbers
for feature in X_list_cat:
    if feature != 'ID_metro':
        feature_cat = pd.factorize(df_train_cat[feature])
        df_train_cat[feature] = feature_cat[0]
        # Apply the codes learned on the training set to the testing set
        feature_pairs = dict(zip(list(feature_cat[1]), range(len(feature_cat[1]))))
        df_test_cat[feature] = df_test_cat[feature].map(feature_pairs)
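
The encoding scheme of the last two cells can be sketched on a toy column: `pd.factorize` assigns codes in order of first appearance on the training values, the resulting dictionary is reused for the testing values, and a category that occurs only in the testing set receives the next free code. The values below are invented, and the use of `Series.map` is an assumption of this sketch.

```python
import pandas as pd

toy_train = pd.Series([45, 7, 45, 223, 7])
toy_test = pd.Series([7, 224, 45])

# Codes are assigned in order of first appearance in the training values
codes, uniques = pd.factorize(toy_train)
pairs = dict(zip(uniques, range(len(uniques))))

# Reserve the next free code for the category unseen in the training set
pairs[224] = len(uniques)

train_encoded = pd.Series(codes)
test_encoded = toy_test.map(pairs)
print(pairs)
print(train_encoded.tolist(), test_encoded.tolist())
```

Such integer codes impose an arbitrary order on the categories, which tree-based models tolerate better than linear models do.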
In [19]:
# Display the result of preprocessing for categorical features 
for df in [df_train_cat, df_test_cat]:
    print ('\n', separator)
    print('\nsub area')
    print('Number of categories:', len(set(df['sub_area'])))
    print(set(df['sub_area']))

    print('\nID metro')
    print('Number of categories:', len(set(df['ID_metro'])))
    print(set(df['ID_metro']))

    print('\noffice raion')
    print('Number of categories:', len(set(df['office_raion'])))
    print(set(df['office_raion']))

    print('\nsport objects raion')
    print('Number of categories:', len(set(df['sport_objects_raion'])))
    print(set(df['sport_objects_raion']))

    print('\nraion popul')
    print('Number of categories:', len(set(df['raion_popul'])))
    print(set(df['raion_popul']))

    print('\nhealthcare centers raion')
    print('Number of categories:', len(set(df['healthcare_centers_raion'])))
    print(set(df['healthcare_centers_raion']))

    print('\nschool education centers raion')
    print('Number of categories:', len(set(df['school_education_centers_raion'])))
    print(set(df['school_education_centers_raion']))

    print('\npreschool education centers raion')
    print('Number of categories:', len(set(df['preschool_education_centers_raion'])))
    print(set(df['preschool_education_centers_raion']))

sub area
Number of categories: 146
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145}

ID metro
Number of categories: 219
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218}

office raion
Number of categories: 30
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29}

sport objects raion
Number of categories: 24
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}

raion popul
Number of categories: 146
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145}

healthcare centers raion
Number of categories: 7
{0, 1, 2, 3, 4, 5, 6}

school education centers raion
Number of categories: 14
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}

preschool education centers raion
Number of categories: 13
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}


sub area
Number of categories: 145
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145}

ID metro
Number of categories: 212
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 180, 181, 182, 184, 185, 186, 187, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 206, 207, 208, 209, 210, 211, 212, 213, 215, 218, 219}

office raion
Number of categories: 30
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29}

sport objects raion
Number of categories: 24
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}

raion popul
Number of categories: 145
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145}

healthcare centers raion
Number of categories: 7
{0, 1, 2, 3, 4, 5, 6}

school education centers raion
Number of categories: 14
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}

preschool education centers raion
Number of categories: 13
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
In [20]:
# Apply one hot encoding for the training set
df_train_cat1 = df_train_cat
encode = OneHotEncoder(sparse=False)

for column in X_list_cat:
    encode.fit(df_train_cat[[column]])
    transform = encode.transform(df_train_cat[[column]])
    # The encoder orders its output columns by sorted category value
    transform = pd.DataFrame(transform, 
                             columns=[(column+"_"+str(i)) for i in np.sort(df_train_cat[column].unique())])
    transform = transform.set_index(df_train_cat.index.values)
    df_train_cat1 = pd.concat([df_train_cat1, transform], axis=1)
    df_train_cat1 = df_train_cat1.drop(column, axis=1)
In [21]:
# Apply one hot encoding for the testing set
df_test_cat1 = df_test_cat
encode = OneHotEncoder(sparse=False)

for column in X_list_cat:
    encode.fit(df_test_cat[[column]])
    transform = encode.transform(df_test_cat[[column]])
    # The encoder orders its output columns by sorted category value
    transform = pd.DataFrame(transform, 
                             columns=[(column+"_"+str(i)) for i in np.sort(df_test_cat[column].unique())])
    transform = transform.set_index(df_test_cat.index.values)
    df_test_cat1 = pd.concat([df_test_cat1, transform], axis=1)
    df_test_cat1 = df_test_cat1.drop(column, axis=1)

Check Encoding

In [22]:
# Display the example of encoded values
df_train_cat1.iloc[:, 623:636][:3].values
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [23]:
# Check these values without one hot encoding
df_train_cat['preschool_education_centers_raion'][:3]
7672    0
8056    1
8111    2
Name: preschool_education_centers_raion, dtype: int64

Add Missing Columns with Zero Values

In [24]:
# Display the number of features in the training and testing datasets
print('Shape of the train data frame:', df_train_cat1.shape)
print('Shape of the test data frame:', df_test_cat1.shape)
Shape of the train data frame: (20899, 636)
Shape of the test data frame: (7662, 626)
In [25]:
print("Features in the train data, but not in the test data:")
for element in list(df_train_cat1):
    if element not in list(df_test_cat1):
        print(element)
Features in the train data, but not in the test data:
In [26]:
print("Features in the test data, but not in the train data:")
for element in list(df_test_cat1):
    if element not in list(df_train_cat1):
        print(element)
Features in the test data, but not in the train data:
In [27]:
# Fill the missing columns in the training and testing datasets with zeros
for column in ['sub_area_136', 'ID_metro_188', 'ID_metro_205', 'ID_metro_216', 'ID_metro_214',
               'ID_metro_183', 'ID_metro_179', 'ID_metro_153', 'ID_metro_217', 'raion_popul_136']:
    df_test_cat1[column] = 0
df_train_cat1['ID_metro_219'] = 0

print('Columns with zero values were added.\n')
print('Shape of the train data frame:', df_train_cat1.shape)
print('Shape of the test data frame:', df_test_cat1.shape)
Columns with zero values were added.

Shape of the train data frame: (20899, 637)
Shape of the test data frame: (7662, 636)
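Columns added by assignment are appended at the end of a frame, so after this step the two frames can hold the same columns in a different order, which matters once they are converted to arrays. One way to add the absent columns and fix the order in a single step is `DataFrame.reindex` with `fill_value=0`. The sketch below uses toy frames; `train_enc`, `test_enc`, and `feature_columns` are hypothetical names.

```python
import pandas as pd

# Toy encoded frames: each one lacks a column that the other has
train_enc = pd.DataFrame({'ID_metro_0': [1, 0], 'ID_metro_1': [0, 1], 'prices': [5.0, 7.0]})
test_enc = pd.DataFrame({'ID_metro_1': [1, 0], 'ID_metro_219': [0, 1]})

# Union of the feature columns, in one fixed order
feature_columns = sorted((set(train_enc.columns) | set(test_enc.columns)) - {'prices'})

# reindex adds the absent columns filled with zeros and reorders the rest
train_aligned = train_enc.reindex(columns=feature_columns + ['prices'], fill_value=0)
test_aligned = test_enc.reindex(columns=feature_columns, fill_value=0)

print(train_aligned.shape, test_aligned.shape)
```

After alignment, column k of the training features and column k of the testing features describe the same category.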

Display Correlation

In [28]:
# Display the feature correlation with the target
pearson = df_train.corr(method='pearson')
corr_with_prices = pearson.iloc[-1][:-1]
corr_with_prices[abs(corr_with_prices).argsort()[::-1]]
full_sq                              0.593829
num_room                             0.476337
kremlin_km                          -0.290126
stadium_km                          -0.238431
detention_facility_km               -0.233395
university_km                       -0.222964
theater_km                          -0.222873
workplaces_km                       -0.220889
swim_pool_km                        -0.220480
exhibition_km                       -0.212144
radiation_km                        -0.208256
museum_km                           -0.203846
park_km                             -0.201636
metro_min_walk                      -0.200058
fitness_km                          -0.197702
metro_km_avto                       -0.194751
shopping_centers_km                 -0.182459
public_healthcare_km                -0.182388
big_road2_km                        -0.178865
bus_terminal_avto_km                -0.176601
ekder_all                            0.169331
area_m                              -0.167851
school_km                           -0.158775
preschool_km                        -0.157079
additional_education_km             -0.146074
kindergarten_km                     -0.141627
work_all                             0.136761
railroad_station_walk_min           -0.135099
oil_chemistry_km                    -0.134873
railroad_station_avto_km            -0.132209
young_all                            0.131324
public_transport_station_min_walk   -0.128647
big_road1_km                        -0.098968
usdrub                               0.069506
big_market_km                       -0.069257
cemetery_km                         -0.042413
Name: prices, dtype: float64
In [29]:
# Display the most correlated features
features_list2 = corr_with_prices[abs(corr_with_prices).argsort()[::-1]][:10].index.values.tolist()
print('The most correlated with prices:\n', features_list2)
The most correlated with prices:
 ['full_sq', 'num_room', 'kremlin_km', 'stadium_km', 'detention_facility_km', 'university_km', 'theater_km', 'workplaces_km', 'swim_pool_km', 'exhibition_km']
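Ranking by absolute value matters here because most of the distance features correlate negatively with the price. A small sketch on synthetic data shows the idea; the frame `toy`, its coefficients, and the `noise` column are made up for illustration and are not taken from the Sberbank data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
n = 500
# Synthetic data: 'full_sq' raises the price, 'kremlin_km' lowers it, 'noise' is unrelated
toy = pd.DataFrame({'full_sq': rng.normal(50, 10, n),
                    'kremlin_km': rng.normal(15, 5, n),
                    'noise': rng.normal(0, 1, n)})
toy['prices'] = 3.0 * toy['full_sq'] - 4.0 * toy['kremlin_km'] + rng.normal(0, 5, n)

# Pearson correlation of every feature with the target
corr = toy.corr(method='pearson')['prices'].drop('prices')

# Order by absolute value so that strong negative correlations rank high too
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting the signed values instead would push `kremlin_km` to the bottom of the list even though it carries more information than `noise`.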
In [30]:
# Display the correlation matrix of features
plt.figure(figsize=(18, 12))
train_corr = df_train.corr()
sns.heatmap(train_corr)
plt.title("Correlation Matrix", fontsize=20);

Scale, Shuffle and Split the Data

In [31]:
# Create the feature and target arrays 
target_train = df_train['prices'].values

features_train = df_train.drop('prices', axis=1).values
features_test = df_test.values

features_train_cat = df_train_cat.drop('prices', axis=1).values
features_test_cat = df_test_cat.values

features_train_cat_enc = df_train_cat1.drop('prices', axis=1).values
features_test_cat_enc = df_test_cat1.values
In [32]:
# Split the data
print(separator, '\n\nNumeric Features')
X_train, X_test, y_train, y_test = \
train_test_split(features_train, target_train, test_size = 0.2, random_state = 1)
X_train.shape, X_test.shape

Numeric Features
((16719, 36), (4180, 36))
In [33]:
# Split the data
print(separator, '\n\nNumeric and Categorical Features')
X_train_cat, X_test_cat, y_train_cat, y_test_cat = \
train_test_split(features_train_cat, target_train, test_size = 0.2, random_state = 1)
X_train_cat.shape, X_test_cat.shape

Numeric and Categorical Features
((16719, 44), (4180, 44))
In [34]:
# Split the data
print(separator, '\n\nNumeric and Encoded Categorical Features')
X_train_cat_enc, X_test_cat_enc, y_train_cat_enc, y_test_cat_enc = \
train_test_split(features_train_cat_enc, target_train, test_size = 0.2, random_state = 1)
X_train_cat_enc.shape, X_test_cat_enc.shape

Numeric and Encoded Categorical Features
((16719, 636), (4180, 636))
In [35]:
# Scale the data
scale_X = RobustScaler()
X_train = scale_X.fit_transform(X_train)
X_test = scale_X.transform(X_test)

scale_y = RobustScaler()
y_train = scale_y.fit_transform(y_train.reshape(-1,1))
y_test = scale_y.transform(y_test.reshape(-1,1))

scale_X_cat = RobustScaler()
X_train_cat = scale_X_cat.fit_transform(X_train_cat)
X_test_cat = scale_X_cat.transform(X_test_cat)

scale_y_cat = RobustScaler()
y_train_cat = scale_y_cat.fit_transform(y_train_cat.reshape(-1,1))
y_test_cat = scale_y_cat.transform(y_test_cat.reshape(-1,1))

scale_X_cat_enc = RobustScaler()
X_train_cat_enc = scale_X_cat_enc.fit_transform(X_train_cat_enc)
X_test_cat_enc = scale_X_cat_enc.transform(X_test_cat_enc)

scale_y_cat_enc = RobustScaler()
y_train_cat_enc = scale_y_cat_enc.fit_transform(y_train_cat_enc.reshape(-1,1))
y_test_cat_enc = scale_y_cat_enc.transform(y_test_cat_enc.reshape(-1,1))
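With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, and the cell above takes both statistics from the training part only. The arithmetic can be sketched in plain NumPy for a single column; the helpers `robust_fit`, `robust_transform`, and `robust_inverse` are hypothetical names, and the toy prices are invented.

```python
import numpy as np

def robust_fit(values):
    """Return the median and the interquartile range of a 1-D array."""
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    return q50, q75 - q25

def robust_transform(values, center, scale):
    return (values - center) / scale

def robust_inverse(scaled, center, scale):
    return scaled * scale + center

# Toy prices with one outlier; the statistics come from the training part only
y_train_toy = np.array([4.0, 5.0, 6.0, 7.0, 100.0])
y_test_toy = np.array([5.5, 8.0])

center, scale = robust_fit(y_train_toy)
y_train_scaled = robust_transform(y_train_toy, center, scale)
y_test_scaled = robust_transform(y_test_toy, center, scale)

# Predictions made on the scaled target can be mapped back to price units
y_test_restored = robust_inverse(y_test_scaled, center, scale)
print(center, scale)
```

The outlier of 100 does not affect the median or the quartiles, which is the reason to prefer this scaling for skewed price data. Because the target is scaled as well, the error metrics reported later are in scaled units rather than in roubles.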

Benchmark Models

To compare prediction quality, I chose the regression ensemble algorithms that are most effective for financial indicators, along with several types of neural networks: multilayer perceptrons, convolutional networks, and recurrent networks. I also wanted to find out the highest accuracy each of these algorithms can reach, and whether the price trends predicted by the different techniques coincide.

Regressors; Scikit-Learn

Tuning Parameters

In [37]:
# Tuning parameters max_depth & n_estimators
print(separator, '\n\nNumeric Features', '\nGradient Boosting Regressor')
param_grid_gbr = {'max_depth': [3, 4, 5], 'n_estimators': range(36, 361, 36)}
gridsearch_gbr = GridSearchCV(GradientBoostingRegressor(), param_grid_gbr, n_jobs=5)\
                             .fit(X_train, y_train)
gridsearch_gbr.best_params_

Numeric Features 
Gradient Boosting Regressor
{'max_depth': 4, 'n_estimators': 360}
In [82]:
# Tuning parameters n_estimators
print('Bagging Regressor')
param_grid_br = {'n_estimators': range(36, 361, 36)}
gridsearch_br = GridSearchCV(BaggingRegressor(), param_grid_br, n_jobs=5)\
                            .fit(X_train, y_train)
gridsearch_br.best_params_
Bagging Regressor
{'n_estimators': 360}
In [35]:
# Tuning parameters max_depth & n_estimators
print(separator, '\n\nNumeric and Categorical Features', '\nGradient Boosting Regressor')
param_grid_gbr_cat = {'max_depth': [3, 4, 5], 'n_estimators': range(44, 441, 44)}
gridsearch_gbr_cat = GridSearchCV(GradientBoostingRegressor(), param_grid_gbr_cat, n_jobs=5)\
                                 .fit(X_train_cat, y_train_cat)
gridsearch_gbr_cat.best_params_

Numeric and Categorical Features 
Gradient Boosting Regressor
{'max_depth': 3, 'n_estimators': 396}
In [36]:
# Tuning parameters n_estimators
print('Bagging Regressor')
param_grid_br_cat = {'n_estimators': range(44, 441, 44)}
gridsearch_br_cat = GridSearchCV(BaggingRegressor(), param_grid_br_cat, n_jobs=5)\
                                .fit(X_train_cat, y_train_cat)
gridsearch_br_cat.best_params_
Bagging Regressor
{'n_estimators': 308}
In [40]:
# Tuning parameters max_depth & n_estimators
print(separator, '\n\nNumeric and Encoded Categorical Features', '\nGradient Boosting Regressor')
param_grid_gbr_cat_enc = {'max_depth': [3, 4, 5], 'n_estimators': [159, 318, 636]}
gridsearch_gbr_cat_enc = GridSearchCV(GradientBoostingRegressor(), param_grid_gbr_cat_enc, n_jobs=5)\
                                     .fit(X_train_cat_enc, y_train_cat_enc)
gridsearch_gbr_cat_enc.best_params_

Numeric and Encoded Categorical Features 
Gradient Boosting Regressor
{'max_depth': 4, 'n_estimators': 318}
In [44]:
# Tuning parameters n_estimators
print('Bagging Regressor')
param_grid_br_cat_enc = {'n_estimators': [159, 318, 636]}
gridsearch_br_cat_enc = GridSearchCV(BaggingRegressor(), param_grid_br_cat_enc, n_jobs=5)\
                                    .fit(X_train_cat_enc, y_train_cat_enc)
gridsearch_br_cat_enc.best_params_
Bagging Regressor
{'n_estimators': 159}
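The tuning cells above all follow one pattern: define a grid, fit `GridSearchCV`, and read the winning combination from `best_params_`. A compact, self-contained sketch of that pattern on synthetic data is given below. The arrays `X_toy` and `y_toy`, the small grid, and the fixed `random_state` are assumptions made so that the example runs in seconds; they are not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(1)
# Synthetic regression data standing in for X_train and y_train
X_toy = rng.normal(size=(120, 4))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=120)

# A deliberately small grid so that the search finishes quickly
param_grid = {'max_depth': [2, 3], 'n_estimators': [20, 40]}
search = GridSearchCV(GradientBoostingRegressor(random_state=1), param_grid, cv=3)
search.fit(X_toy, y_toy)

# best_params_ holds the combination with the highest mean cross-validated score
print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that several searches above selected the largest `n_estimators` value offered (for example 360 out of a range ending at 360), which suggests the optimum may lie beyond the grid; extending the range would be a natural follow-up.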

Fit the Regressors

In [47]:
# Fit the initial Regressors and display the results
print(separator, '\nNumeric Features')
y_train_gbr, y_test_gbr = regression(GradientBoostingRegressor(), 
                                     X_train, X_test, y_train)

y_train_br, y_test_br = regression(BaggingRegressor(), 
                                   X_train, X_test, y_train)

scores('GradientBoostingRegressor', y_train, y_test, y_train_gbr, y_test_gbr)
scores('BaggingRegressor', y_train, y_test, y_train_br, y_test_br)
Numeric Features
EV score. Train:  0.748806887679
EV score. Test:  0.708327883284
R2 score. Train:  0.748806887679
R2 score. Test:  0.70825231961
MSE score. Train:  0.45681465969
MSE score. Test:  0.583759067304
MAE score. Train:  0.402793952143
MAE score. Test:  0.424798766194
MdAE score. Train:  0.207968017311
MdAE score. Test:  0.218819865521
EV score. Train:  0.938001107068
EV score. Test:  0.707465526187
R2 score. Train:  0.937988378432
R2 score. Test:  0.707161018325
MSE score. Train:  0.112773067469
MSE score. Test:  0.585942656285
MAE score. Train:  0.167282028361
MAE score. Test:  0.4116910066
MdAE score. Train:  0.0666666666667
MdAE score. Test:  0.188687375
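For reference, the five reported metrics can be written out in a few lines of NumPy. The sketch assumes that the `scores` helper wraps the scikit-learn metric functions imported at the top of the notebook (explained variance, R2, mean squared error, mean absolute error, median absolute error); `regression_scores` is a hypothetical stand-in, not that helper.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Compute the five metrics reported in this notebook with plain NumPy."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residual = y_true - y_pred
    return {
        'EV': 1.0 - np.var(residual) / np.var(y_true),
        'R2': 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        'MSE': np.mean(residual ** 2),
        'MAE': np.mean(np.abs(residual)),
        'MdAE': np.median(np.abs(residual)),
    }

# Toy values: the predictions are biased upward by 0.5 on average
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.5, 3.0, 5.0])

toy_scores = regression_scores(y_true_toy, y_pred_toy)
for name, value in toy_scores.items():
    print(name, value)
```

Explained variance ignores a constant bias in the predictions while R2 penalizes it, so the two coincide only when the residuals average to zero. That is consistent with the identical EV and R2 training values in the first block of scores above.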
In [36]:
# Fit the tuned Regressors and display the results
print(separator, '\nNumeric Features')
y_train_gbr, y_test_gbr = regression(GradientBoostingRegressor(max_depth=4, n_estimators=360), 
                                     X_train, X_test, y_train)

y_train_br, y_test_br = regression(BaggingRegressor(n_estimators=360), 
                                   X_train, X_test, y_train)

scores('GradientBoostingRegressor', y_train, y_test, y_train_gbr, y_test_gbr)
scores('BaggingRegressor', y_train, y_test, y_train_br, y_test_br)
Numeric Features
EV score. Train:  0.86189746402
EV score. Test:  0.720854054972
R2 score. Train:  0.86189746402
R2 score. Test:  0.720764275765
MSE score. Train:  0.251150449123
MSE score. Test:  0.558723845618
MAE score. Train:  0.31458911313
MAE score. Test:  0.400128339345
MdAE score. Train:  0.174402117839
MdAE score. Test:  0.199730980341
EV score. Train:  0.955873487756
EV score. Test:  0.720733503957
R2 score. Train:  0.955847678
R2 score. Test:  0.720383275214
MSE score. Train:  0.0802945103177
MSE score. Test:  0.559486191101
MAE score. Train:  0.147217987974
MAE score. Test:  0.391817438163
MdAE score. Train:  0.0631909601337
MdAE score. Test:  0.179859818801
In [38]:
# Display parameters of the regressor
GradientBoostingRegressor(max_depth=4, n_estimators=360).get_params(deep=True)
{'alpha': 0.9,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'ls',
 'max_depth': 4,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 360,
 'presort': 'auto',
 'random_state': None,
 'subsample': 1.0,
 'verbose': 0,
 'warm_start': False}
In [37]:
# Display the feature importance
importances = GradientBoostingRegressor(max_depth=4, n_estimators=360)\
.fit(X_train, y_train).feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize = (18, 4.5))
plt.bar(range(X_train.shape[1]), importances[indices], 
        color="forestgreen", align="center", alpha=0.5)
plt.xlabel("Feature Index")
plt.ylabel("Feature Importance")
plt.xticks(range(X_train.shape[1]), indices)
plt.title("Importance of the Features; Gradient Boosting Regressor");
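The plot labels the bars with column indices, which are hard to read without the list of columns at hand. A short sketch of translating the sorted indices back to feature names follows; the importances and the names are invented values standing in for `feature_importances_` and for the column list of the training frame.

```python
import numpy as np

# Hypothetical importances and column names standing in for the fitted
# regressor's feature_importances_ and the training frame's feature columns
importances_toy = np.array([0.10, 0.55, 0.05, 0.30])
feature_names_toy = ['num_room', 'full_sq', 'usdrub', 'kremlin_km']

# Indices of the features from most to least important
order = np.argsort(importances_toy)[::-1]
ranked_names = [feature_names_toy[i] for i in order]

for name, value in zip(ranked_names, importances_toy[order]):
    print('{:<12} {:.2f}'.format(name, value))
```

The list `ranked_names` could also be passed to `plt.xticks` in place of the indices, with rotated labels to keep them legible.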
In [50]:
# Fit the initial Regressors and display the results
print(separator, '\nNumeric and Categorical Features')
y_train_cat_gbr, y_test_cat_gbr = \
regression(GradientBoostingRegressor(), 
           X_train_cat, X_test_cat, y_train_cat)

y_train_cat_br, y_test_cat_br = \
regression(BaggingRegressor(), X_train_cat, X_test_cat, y_train_cat)

scores('GradientBoostingRegressor', 
       y_train_cat, y_test_cat, y_train_cat_gbr, y_test_cat_gbr)
scores('BaggingRegressor', 
       y_train_cat, y_test_cat, y_train_cat_br, y_test_cat_br)
Numeric and Categorical Features
EV score. Train:  0.750607058773
EV score. Test:  0.707989859744
R2 score. Train:  0.750607058773
R2 score. Test:  0.707923034437
MSE score. Train:  0.453540905333
MSE score. Test:  0.584417935287
MAE score. Train:  0.402647089862
MAE score. Test:  0.425882511012
MdAE score. Train:  0.211532866065
MdAE score. Test:  0.222905790023
EV score. Train:  0.938091144693
EV score. Test:  0.686839739436
R2 score. Train:  0.938075817168
R2 score. Test:  0.686378188154
MSE score. Train:  0.112614053171
MSE score. Test:  0.627527101928
MAE score. Train:  0.168578809562
MAE score. Test:  0.412725894112
MdAE score. Train:  0.0680555555556
MdAE score. Test:  0.184698805556
In [51]:
# Fit the final Regressors and display the results
print(separator, '\nNumeric and Categorical Features')
y_train_cat_gbr, y_test_cat_gbr = \
regression(GradientBoostingRegressor(max_depth=3, n_estimators=396), 
           X_train_cat, X_test_cat, y_train_cat)

y_train_cat_br, y_test_cat_br = \
regression(BaggingRegressor(n_estimators=308), X_train_cat, X_test_cat, y_train_cat)

scores('GradientBoostingRegressor', 
       y_train_cat, y_test_cat, y_train_cat_gbr, y_test_cat_gbr)
scores('BaggingRegressor', 
       y_train_cat, y_test_cat, y_train_cat_br, y_test_cat_br)
Numeric and Categorical Features
EV score. Train:  0.819256487057
EV score. Test:  0.715316558775
R2 score. Train:  0.819256487057
R2 score. Test:  0.715233444479
MSE score. Train:  0.328696458248
MSE score. Test:  0.569790507428
MAE score. Train:  0.352419590753
MAE score. Test:  0.407416791547
MdAE score. Train:  0.190394737254
MdAE score. Test:  0.204378587242
EV score. Train:  0.955903698863
EV score. Test:  0.720982739258
R2 score. Train:  0.955874417844
R2 score. Test:  0.720582940354
MSE score. Train:  0.080245881783
MSE score. Test:  0.559086680347
MAE score. Train:  0.146869415555
MAE score. Test:  0.391819910402
MdAE score. Train:  0.0637587229437
MdAE score. Test:  0.181356573438
In [70]:
# Display the feature importance
importances_cat = GradientBoostingRegressor(max_depth=3, n_estimators=396)\
.fit(X_train_cat, y_train_cat).feature_importances_
indices_cat = np.argsort(importances_cat)[::-1]

plt.figure(figsize = (18, 4.5))
plt.bar(range(X_train_cat.shape[1]), importances_cat[indices_cat], 
        color="forestgreen", align="center", alpha=0.5)

plt.xlabel("Feature Index")
plt.ylabel("Feature Importance")
plt.xticks(range(X_train_cat.shape[1]), indices_cat)
plt.title("Importance of the Features; Gradient Boosting Regressor");
In [53]:
# Fit the initial Regressors and display the results
print(separator, '\nNumeric and Encoded Categorical Features')
y_train_cat_enc_gbr, y_test_cat_enc_gbr = \
regression(GradientBoostingRegressor(), 
           X_train_cat_enc, X_test_cat_enc, y_train_cat_enc)

y_train_cat_enc_br, y_test_cat_enc_br = \
regression(BaggingRegressor(), 
           X_train_cat_enc, X_test_cat_enc, y_train_cat_enc)

scores('GradientBoostingRegressor', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_gbr, y_test_cat_enc_gbr)
scores('BaggingRegressor', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_br, y_test_cat_enc_br)
Numeric and Encoded Categorical Features
EV score. Train:  0.74870728459
EV score. Test:  0.704832200955
R2 score. Train:  0.74870728459
R2 score. Test:  0.704783417434
MSE score. Train:  0.456995795832
MSE score. Test:  0.590700007148
MAE score. Train:  0.405233296267
MAE score. Test:  0.427447712343
MdAE score. Train:  0.209757816885
MdAE score. Test:  0.225082680612
EV score. Train:  0.936933714331
EV score. Test:  0.690181215191
R2 score. Train:  0.936896826304
R2 score. Test:  0.689722986736
MSE score. Train:  0.114758141857
MSE score. Test:  0.62083448145
MAE score. Train:  0.166932407641
MAE score. Test:  0.416220055497
MdAE score. Train:  0.0680555555556
MdAE score. Test:  0.192197222222
In [54]:
# Fit the final Regressors and display the results
print(separator, '\nNumeric and Encoded Categorical Features')
y_train_cat_enc_gbr, y_test_cat_enc_gbr = \
regression(GradientBoostingRegressor(max_depth=4, n_estimators=318), 
           X_train_cat_enc, X_test_cat_enc, y_train_cat_enc)

y_train_cat_enc_br, y_test_cat_enc_br = \
regression(BaggingRegressor(n_estimators=159), 
           X_train_cat_enc, X_test_cat_enc, y_train_cat_enc)

scores('GradientBoostingRegressor', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_gbr, y_test_cat_enc_gbr)
scores('BaggingRegressor', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_br, y_test_cat_enc_br)
Numeric and Encoded Categorical Features
EV score. Train:  0.845218755194
EV score. Test:  0.708155411672
R2 score. Train:  0.845218755194
R2 score. Test:  0.70803384448
MSE score. Train:  0.281482008082
MSE score. Test:  0.584196215042
MAE score. Train:  0.330491575879
MAE score. Test:  0.405533548791
MdAE score. Train:  0.180651597294
MdAE score. Test:  0.200045992214
EV score. Train:  0.955963327443
EV score. Test:  0.718096430232
R2 score. Train:  0.95594072386
R2 score. Test:  0.717747496305
MSE score. Train:  0.0801252990181
MSE score. Test:  0.56476013136
MAE score. Train:  0.147441797331
MAE score. Test:  0.393760981303
MdAE score. Train:  0.0636391788959
MdAE score. Test:  0.179454672912
In [71]:
# Display the feature importance
importances_cat_enc = GradientBoostingRegressor(max_depth=4, n_estimators=318)\
.fit(X_train_cat_enc, y_train_cat_enc).feature_importances_
indices_cat_enc = np.argsort(importances_cat_enc)[::-1][:50]

plt.figure(figsize = (18, 4.5))
plt.bar(range(50), importances_cat_enc[indices_cat_enc], 
        color="forestgreen", align="center", alpha=0.5)

plt.xlabel("Feature Index")
plt.ylabel("Feature Importance")
plt.xticks(range(50), indices_cat_enc)
plt.title("Importance of the Features; Gradient Boosting Regressor");

MLP Regressors

In [56]:
# Fit the initial MLPRegressor and display the results
mlpr = MLPRegressor()
mlpr.fit(X_train, y_train)

y_train_mlpr = mlpr.predict(X_train)
y_test_mlpr = mlpr.predict(X_test)

print(separator, '\nNumeric Features')
scores('MLP Regressor', y_train, y_test, y_train_mlpr, y_test_mlpr)
Numeric Features
 MLP Regressor 
EV score. Train:  0.720903125209
EV score. Test:  0.681232970735
R2 score. Train:  0.719861200081
R2 score. Test:  0.680802770479
MSE score. Train:  0.509454695506
MSE score. Test:  0.638682976819
MAE score. Train:  0.407829527795
MAE score. Test:  0.43869690106
MdAE score. Train:  0.215464420308
MdAE score. Test:  0.235717963805
In [64]:
# Fit the final MLPRegressor and display the results
mlpr = MLPRegressor(hidden_layer_sizes=(360,), max_iter=300, 
                    solver='adam', alpha=0.01)
mlpr.fit(X_train, y_train)

y_train_mlpr = mlpr.predict(X_train)
y_test_mlpr = mlpr.predict(X_test)

print(separator, '\nNumeric Features')
scores('MLP Regressor', y_train, y_test, y_train_mlpr, y_test_mlpr)
Numeric Features
 MLP Regressor 
EV score. Train:  0.733480488744
EV score. Test:  0.697043001412
R2 score. Train:  0.733225607
R2 score. Test:  0.696978627454
MSE score. Train:  0.485150458251
MSE score. Test:  0.606316641745
MAE score. Train:  0.391437898871
MAE score. Test:  0.421600121798
MdAE score. Train:  0.205449914177
MdAE score. Test:  0.220711892473
In [65]:
# Fit the initial MLPRegressor and display the results
mlpr_cat = MLPRegressor()
mlpr_cat.fit(X_train_cat, y_train_cat)

y_train_cat_mlpr = mlpr_cat.predict(X_train_cat)
y_test_cat_mlpr = mlpr_cat.predict(X_test_cat)

print(separator, '\nNumeric and Categorical Features')
scores('MLP Regressor', y_train_cat, y_test_cat, y_train_cat_mlpr, y_test_cat_mlpr)
Numeric and Categorical Features
 MLP Regressor 
EV score. Train:  0.717357216363
EV score. Test:  0.680317383966
R2 score. Train:  0.716788418603
R2 score. Test:  0.679085721546
MSE score. Train:  0.515042793095
MSE score. Test:  0.6421186267
MAE score. Train:  0.414883180666
MAE score. Test:  0.44994788786
MdAE score. Train:  0.228522171516
MdAE score. Test:  0.241082706632
In [66]:
# Fit the final MLPRegressor and display the results
mlpr_cat = MLPRegressor(hidden_layer_sizes=(396,), max_iter=300, 
                        solver='adam', alpha=0.01)
mlpr_cat.fit(X_train_cat, y_train_cat)

y_train_cat_mlpr = mlpr_cat.predict(X_train_cat)
y_test_cat_mlpr = mlpr_cat.predict(X_test_cat)

print(separator, '\nNumeric and Categorical Features')
scores('MLP Regressor', y_train_cat, y_test_cat, y_train_cat_mlpr, y_test_cat_mlpr)
Numeric and Categorical Features
 MLP Regressor 
EV score. Train:  0.775852560843
EV score. Test:  0.694231228691
R2 score. Train:  0.775619736958
R2 score. Test:  0.693806690918
MSE score. Train:  0.408053360044
MSE score. Test:  0.612663381884
MAE score. Train:  0.383164359776
MAE score. Test:  0.440601920972
MdAE score. Train:  0.215675685577
MdAE score. Test:  0.245818998678
In [67]:
# Fit the initial MLPRegressor and display the results
mlpr_cat_enc = MLPRegressor()
mlpr_cat_enc.fit(X_train_cat_enc, y_train_cat_enc)

y_train_cat_enc_mlpr = mlpr_cat_enc.predict(X_train_cat_enc)
y_test_cat_enc_mlpr = mlpr_cat_enc.predict(X_test_cat_enc)

print(separator, '\nNumeric and Encoded Categorical Features')
scores('MLP Regressor', y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_mlpr, y_test_cat_enc_mlpr)
Numeric and Encoded Categorical Features
 MLP Regressor 
EV score. Train:  0.840958635688
EV score. Test:  0.633927171961
R2 score. Train:  0.840286572782
R2 score. Test:  0.632721697436
MSE score. Train:  0.290451574203
MSE score. Test:  0.734888582695
MAE score. Train:  0.340488626907
MAE score. Test:  0.469363910819
MdAE score. Train:  0.19783045774
MdAE score. Test:  0.249264080802
In [193]:
# Fit the final MLPRegressor and display the results
mlpr_cat_enc = MLPRegressor(hidden_layer_sizes=(318,), max_iter=150, 
                            solver='lbfgs', alpha=0.01)
mlpr_cat_enc.fit(X_train_cat_enc, y_train_cat_enc)

y_train_cat_enc_mlpr = mlpr_cat_enc.predict(X_train_cat_enc)
y_test_cat_enc_mlpr = mlpr_cat_enc.predict(X_test_cat_enc)

print(separator, '\nNumeric and Encoded Categorical Features')
scores('MLP Regressor', y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_mlpr, y_test_cat_enc_mlpr)
Numeric and Encoded Categorical Features
 MLP Regressor 
EV score. Train:  0.7448113493
EV score. Test:  0.695937336893
R2 score. Train:  0.744805402803
R2 score. Test:  0.695698488237
MSE score. Train:  0.464091678295
MSE score. Test:  0.608878077277
MAE score. Train:  0.403063285162
MAE score. Test:  0.440427444491
MdAE score. Train:  0.218686414333
MdAE score. Test:  0.234226334408

Display Predictions

In [73]:
# Plot predictions of the regressors with real values
plt.figure(figsize = (18, 6))
plt.plot(y_test[1:50], color = 'black', label='Real Data')

plt.plot(y_test_gbr[1:50], label='Gradient Boosting')
plt.plot(y_test_br[1:50], label='Bagging Regressor')
plt.plot(y_test_mlpr[1:50], label='MLP Regressor')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric Features; Regressor Predictions vs Real Data");
In [74]:
# Plot predictions of the regressors with real values
plt.figure(figsize = (18, 6))
plt.plot(y_test_cat[1:50], color = 'black', label='Real Data')

plt.plot(y_test_cat_gbr[1:50], label='Gradient Boosting')
plt.plot(y_test_cat_br[1:50], label='Bagging Regressor')
plt.plot(y_test_cat_mlpr[1:50], label='MLP Regressor')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Categorical Features; Regressor Predictions vs Real Data");
In [75]:
# Plot predictions of the regressors with real values
plt.figure(figsize = (18, 6))
plt.plot(y_test_cat_enc[1:50], color = 'black', label='Real Data')

plt.plot(y_test_cat_enc_gbr[1:50], label='Gradient Boosting')
plt.plot(y_test_cat_enc_br[1:50], label='Bagging Regressor')
plt.plot(y_test_cat_enc_mlpr[1:50], label='MLP Regressor')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Encoded Categorical Features; Regressor Predictions vs Real Data");

Neural Networks; Keras


In [185]:
# Create the initial sequential model
def mlp_model():
    model = Sequential()

    model.add(Dense(32, activation='relu', input_dim=36))        
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_model = mlp_model()
# Fit the model
mlp_history = mlp_model.fit(X_train, y_train, 
                            validation_data=(X_test, y_test),
                            epochs=10, batch_size=128, verbose=0)
In [186]:
# Create predictions
y_train_mlp = mlp_model.predict(X_train)
y_test_mlp = mlp_model.predict(X_test)
# Display initial metrics
print(separator, '\nNumeric Features')
scores('MLP Initial Model', y_train, y_test, y_train_mlp, y_test_mlp)
Numeric Features
 MLP Initial Model 
EV score. Train:  0.599314578459
EV score. Test:  0.65533790525
R2 score. Train:  0.598459984311
R2 score. Test:  0.653649345575
MSE score. Train:  0.73023246507
MSE score. Test:  0.693014370216
MAE score. Train:  0.456790473657
MAE score. Test:  0.460267251551
MdAE score. Train:  0.248421524385
MdAE score. Test:  0.253359075126
In [152]:
# Create the sequential model
def mlp_model():
    model = Sequential()

    model.add(Dense(1024, activation='relu', input_dim=36))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='nadam', metrics=['mae'])
    return model

mlp_model = mlp_model()
# Create the checkpointer for saving the best results
mlp_checkpointer = ModelCheckpoint(filepath='', 
                                   verbose=2, save_best_only=True)
# Fit the model
mlp_history = mlp_model.fit(X_train, y_train, 
                            validation_data=(X_test, y_test),
                            epochs=10, batch_size=128, verbose=0, 
                            callbacks=[mlp_checkpointer])
Epoch 00000: val_loss improved from inf to 0.69692, saving model to
Epoch 00001: val_loss improved from 0.69692 to 0.61879, saving model to
Epoch 00002: val_loss did not improve
Epoch 00003: val_loss improved from 0.61879 to 0.60344, saving model to
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss did not improve
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss did not improve
In [153]:
# Plot the fitting history
In [154]:
# Load the best model results 
# Create predictions
y_train_mlp = mlp_model.predict(X_train)
y_test_mlp = mlp_model.predict(X_test)
# Save the model
mlp_model.save('mlp_model_p6.h5')
# Display metrics
print(separator, '\nNumeric Features')
scores('MLP Model', y_train, y_test, y_train_mlp, y_test_mlp)
Numeric Features
 MLP Model 
EV score. Train:  0.713253445374
EV score. Test:  0.698621876452
R2 score. Train:  0.713191087056
R2 score. Test:  0.698417133671
MSE score. Train:  0.521584826717
MSE score. Test:  0.603438329066
MAE score. Train:  0.404265474303
MAE score. Test:  0.417929379809
MdAE score. Train:  0.199384812951
MdAE score. Test:  0.205628412465
In [219]:
# Create the initial sequential model
def mlp_cat_model():
    model = Sequential()

    model.add(Dense(64, activation='relu', input_dim=44))        
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_cat_model = mlp_cat_model()
# Fit the model
mlp_cat_history = mlp_cat_model.fit(X_train_cat, y_train_cat, 
                                    validation_data=(X_test_cat, y_test_cat),
                                    epochs=10, batch_size=128, verbose=0)
In [220]:
# Create predictions
y_train_cat_mlp = mlp_cat_model.predict(X_train_cat)
y_test_cat_mlp = mlp_cat_model.predict(X_test_cat)
# Display initial metrics
print(separator, '\nNumeric Features and Categorical Features')
scores('MLP Initial Model', y_train_cat, y_test_cat, y_train_cat_mlp, y_test_cat_mlp)
Numeric Features and Categorical Features
 MLP Initial Model 
EV score. Train:  0.627807027945
EV score. Test:  0.672013259967
R2 score. Train:  0.627226028332
R2 score. Test:  0.67186556115
MSE score. Train:  0.67791912539
MSE score. Test:  0.656565473691
MAE score. Train:  0.428820285454
MAE score. Test:  0.435150219302
MdAE score. Train:  0.218085530599
MdAE score. Test:  0.219873657955
In [167]:
# Create the sequential model
def mlp_cat_model():
    model = Sequential()
    model.add(Dense(1024, activation='relu', input_dim=44))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='nadam', metrics=['mae'])
    return model

mlp_cat_model = mlp_cat_model()
# Create the checkpointer for saving the best results
mlp_cat_checkpointer = ModelCheckpoint(filepath='', 
                                       verbose=2, save_best_only=True)
# Fit the model
mlp_cat_history = mlp_cat_model.fit(X_train_cat, y_train_cat, 
                                    validation_data=(X_test_cat, y_test_cat),
                                    epochs=10, batch_size=128, verbose=0, 
                                    callbacks=[mlp_cat_checkpointer])
Epoch 00000: val_loss improved from inf to 0.64692, saving model to
Epoch 00001: val_loss did not improve
Epoch 00002: val_loss did not improve
Epoch 00003: val_loss did not improve
Epoch 00004: val_loss improved from 0.64692 to 0.60283, saving model to
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss did not improve
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss did not improve
In [168]:
# Plot the history
In [169]:
# Load the best model results
# Create predictions
y_train_cat_mlp = mlp_cat_model.predict(X_train_cat)
y_test_cat_mlp = mlp_cat_model.predict(X_test_cat)
# Save the model
mlp_cat_model.save('mlp_cat_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Categorical Features')
scores('MLP Model', 
       y_train_cat, y_test_cat, y_train_cat_mlp, y_test_cat_mlp)
Numeric and Categorical Features
 MLP Model 
EV score. Train:  0.729790225787
EV score. Test:  0.698871231371
R2 score. Train:  0.729787527567
R2 score. Test:  0.698723398616
MSE score. Train:  0.491402879234
MSE score. Test:  0.602825522346
MAE score. Train:  0.40830845428
MAE score. Test:  0.435306933961
MdAE score. Train:  0.204734190479
MdAE score. Test:  0.219043284155
In [226]:
# Create the initial sequential model
def mlp_cat_enc_model():
    model = Sequential()

    model.add(Dense(1024, activation='relu', input_dim=636))    
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_cat_enc_model = mlp_cat_enc_model()

# Fit the model
mlp_cat_enc_history = mlp_cat_enc_model.fit(X_train_cat_enc, y_train_cat_enc, 
                                            validation_data=(X_test_cat_enc, y_test_cat_enc),
                                            epochs=10, batch_size=128, verbose=0)
In [227]:
# Create predictions
y_train_cat_enc_mlp = mlp_cat_enc_model.predict(X_train_cat_enc)
y_test_cat_enc_mlp = mlp_cat_enc_model.predict(X_test_cat_enc)
# Display initial metrics
print(separator, '\nNumeric Features and Encoded Categorical Features')
scores('MLP Initial Model', y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_mlp, y_test_cat_enc_mlp)
Numeric Features and Encoded Categorical Features
 MLP Initial Model 
EV score. Train:  0.750643613743
EV score. Test:  0.669375553457
R2 score. Train:  0.75031462201
R2 score. Test:  0.66863264798
MSE score. Train:  0.454072724854
MSE score. Test:  0.663034222215
MAE score. Train:  0.393700513874
MAE score. Test:  0.446554169066
MdAE score. Train:  0.211494664219
MdAE score. Test:  0.230405584688
In [180]:
# Create the sequential model
def mlp_cat_enc_model():
    model = Sequential()
    model.add(Dense(159, activation='relu', input_dim=636))
    model.add(Dense(159, activation='relu'))
    model.add(Dense(318, activation='relu'))
    model.add(Dense(318, activation='relu'))
    model.add(Dense(636, activation='relu'))
    model.add(Dense(636, activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_cat_enc_model = mlp_cat_enc_model()
# Create the checkpointer for saving the best results
mlp_cat_enc_checkpointer = ModelCheckpoint(filepath='', 
                                           verbose=2, save_best_only=True)
# Fit the model
mlp_cat_enc_history = mlp_cat_enc_model.fit(X_train_cat_enc, y_train_cat_enc, 
                                            validation_data=(X_test_cat_enc, y_test_cat_enc),
                                            epochs=10, batch_size=128, verbose=0, 
                                            callbacks=[mlp_cat_enc_checkpointer])
Epoch 00000: val_loss improved from inf to 1.47252, saving model to
Epoch 00001: val_loss improved from 1.47252 to 1.29108, saving model to
Epoch 00002: val_loss improved from 1.29108 to 0.61749, saving model to
Epoch 00003: val_loss did not improve
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss did not improve
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss did not improve
In [181]:
# Plot the fitting history
In [182]:
# Load the best model results
# Create predictions
y_train_cat_enc_mlp = mlp_cat_enc_model.predict(X_train_cat_enc)
y_test_cat_enc_mlp = mlp_cat_enc_model.predict(X_test_cat_enc)
# Save the model
mlp_cat_enc_model.save('mlp_cat_enc_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Encoded Categorical Features')
scores('MLP Model', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_mlp, y_test_cat_enc_mlp)
Numeric and Encoded Categorical Features
 MLP Model 
EV score. Train:  0.701942081377
EV score. Test:  0.694698085302
R2 score. Train:  0.696848010352
R2 score. Test:  0.691393760571
MSE score. Train:  0.551306011958
MSE score. Test:  0.617491423592
MAE score. Train:  0.414362664906
MAE score. Test:  0.432068189213
MdAE score. Train:  0.212007418271
MdAE score. Test:  0.217624607119


In [245]:
# Create the initial sequential model
def cnn_model():
    model = Sequential()
    model.add(Conv1D(36, 3, padding='valid', activation='relu', input_shape=(36, 1)))
    model.add(Dense(64, activation='relu', kernel_initializer='normal',))    
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_model = cnn_model()

# Fit the model
cnn_history = cnn_model.fit(X_train.reshape(-1, 36, 1), y_train, 
                            epochs=30, batch_size=128, verbose=0, 
                            validation_data=(X_test.reshape(-1, 36, 1), y_test))
In [246]:
# Create predictions
y_train_cnn = cnn_model.predict(X_train.reshape(-1, 36, 1))
y_test_cnn = cnn_model.predict(X_test.reshape(-1, 36, 1))
# Display initial metrics
print(separator, '\nNumeric Features')
scores('CNN Initial Model', y_train, y_test, y_train_cnn, y_test_cnn)
Numeric Features
 CNN Initial Model 
EV score. Train:  0.722095687398
EV score. Test:  0.698734541859
R2 score. Train:  0.721670526643
R2 score. Test:  0.698589610046
MSE score. Train:  0.506164291204
MSE score. Test:  0.603093220416
MAE score. Train:  0.392529252367
MAE score. Test:  0.412265572127
MdAE score. Train:  0.18764491823
MdAE score. Test:  0.197002594597
In [251]:
# Create the sequential model
def cnn_model():
    model = Sequential()
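    model.add(Conv1D(36, 3, padding='valid', activation='relu', input_shape=(36, 1)))
    # Pooling and flattening as listed in the model summary (34 -> 17 steps, then 612 units)
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(512, activation='relu', kernel_initializer='normal',))
    # The model summary also lists a Dropout layer at this point; its rate is not shown
    model.add(Dense(1, kernel_initializer='normal'))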
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_model = cnn_model()
# Create the checkpointer for saving the best results
cnn_checkpointer = ModelCheckpoint(filepath='', 
                                   verbose=2, save_best_only=True)
# Fit the model
cnn_history = cnn_model.fit(X_train.reshape(-1, 36, 1), y_train, 
                            epochs=30, batch_size=128, verbose=0, callbacks=[cnn_checkpointer],
                            validation_data=(X_test.reshape(-1, 36, 1), y_test))
Epoch 00000: val_loss improved from inf to 0.93857, saving model to
Epoch 00001: val_loss improved from 0.93857 to 0.81140, saving model to
Epoch 00002: val_loss improved from 0.81140 to 0.70109, saving model to
Epoch 00003: val_loss improved from 0.70109 to 0.67079, saving model to
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss improved from 0.67079 to 0.60319, saving model to
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss did not improve
Epoch 00010: val_loss did not improve
Epoch 00011: val_loss improved from 0.60319 to 0.59882, saving model to
Epoch 00012: val_loss improved from 0.59882 to 0.59239, saving model to
Epoch 00013: val_loss did not improve
Epoch 00014: val_loss did not improve
Epoch 00015: val_loss did not improve
Epoch 00016: val_loss did not improve
Epoch 00017: val_loss did not improve
Epoch 00018: val_loss did not improve
Epoch 00019: val_loss did not improve
Epoch 00020: val_loss did not improve
Epoch 00021: val_loss improved from 0.59239 to 0.58513, saving model to
Epoch 00022: val_loss did not improve
Epoch 00023: val_loss did not improve
Epoch 00024: val_loss did not improve
Epoch 00025: val_loss did not improve
Epoch 00026: val_loss did not improve
Epoch 00027: val_loss improved from 0.58513 to 0.57757, saving model to
Epoch 00028: val_loss did not improve
Epoch 00029: val_loss did not improve
In [252]:
# Plot the fitting history
In [253]:
# Load the best model results
# Create predictions
y_train_cnn = cnn_model.predict(X_train.reshape(-1, 36, 1))
y_test_cnn = cnn_model.predict(X_test.reshape(-1, 36, 1))
# Save the model
cnn_model.save('cnn_model_p6.h5')
# Display metrics
print(separator, '\nNumeric Features')
scores('CNN Model', y_train, y_test, y_train_cnn, y_test_cnn)
Numeric Features
 CNN Model 
EV score. Train:  0.735439071809
EV score. Test:  0.711557145756
R2 score. Train:  0.735350946162
R2 score. Test:  0.711347161402
MSE score. Train:  0.481285359893
MSE score. Test:  0.577566586338
MAE score. Train:  0.400953623408
MAE score. Test:  0.427550065202
MdAE score. Train:  0.214423935148
MdAE score. Test:  0.22883767204
In [287]:
# Display the model description
cnn_model.summary()
Layer (type)                 Output Shape              Param #   
conv1d_7 (Conv1D)            (None, 34, 36)            144       
max_pooling1d_7 (MaxPooling1 (None, 17, 36)            0         
flatten_7 (Flatten)          (None, 612)               0         
dense_162 (Dense)            (None, 512)               313856    
dropout_20 (Dropout)         (None, 512)               0         
dense_163 (Dense)            (None, 1)                 513       
Total params: 314,513
Trainable params: 314,513
Non-trainable params: 0
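
The shapes and parameter counts in this summary can be checked by hand. The sketch below redoes the arithmetic for a 36-step, single-channel input; the helper names are hypothetical, and the pooling size of 2 is an inference from the summary (34 steps become 17), not something stated in the code above.

```python
def conv1d_output_length(steps, kernel_size):
    # 'valid' padding with stride 1 drops kernel_size - 1 positions
    return steps - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    # one kernel of kernel_size * in_channels weights plus one bias per filter
    return (kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    # one weight per input plus one bias, for each unit
    return (n_inputs + 1) * n_units

conv_len = conv1d_output_length(36, 3)   # steps after Conv1D(36, 3)
pooled_len = conv_len // 2               # assumed pool size of 2
flat_units = pooled_len * 36             # Flatten: steps * filters

conv_p = conv1d_params(3, 1, 36)
dense1_p = dense_params(flat_units, 512)
dense2_p = dense_params(512, 1)
total = conv_p + dense1_p + dense2_p

print(conv_len, pooled_len, flat_units)
print(conv_p, dense1_p, dense2_p, total)
```

Almost all of the weights sit in the first dense layer, so the width of that layer dominates the model size.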
In [254]:
# Display the example of the model architecture
SVG(model_to_dot(cnn_model).create(prog='dot', format='svg'))
[Figure: model architecture graph. conv1d_7_input (InputLayer) -> conv1d_7 (Conv1D) -> max_pooling1d_7 (MaxPooling1D) -> flatten_7 (Flatten) -> dense_162 (Dense) -> dropout_20 (Dropout) -> dense_163 (Dense)]
In [273]:
# Create the initial sequential model
def cnn_cat_model():
    model = Sequential()
    model.add(Conv1D(44, 3, padding='valid', activation='relu', input_shape=(44, 1)))

    model.add(Dense(64, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_cat_model = cnn_cat_model()

# Fit the model
cnn_cat_history = cnn_cat_model.fit(X_train_cat.reshape(-1, 44, 1), y_train_cat, 
                                    epochs=20, batch_size=128, verbose=0, 
                                    validation_data=(X_test_cat.reshape(-1, 44, 1), y_test_cat))
In [274]:
# Create predictions
y_train_cat_cnn = cnn_cat_model.predict(X_train_cat.reshape(-1, 44, 1))
y_test_cat_cnn = cnn_cat_model.predict(X_test_cat.reshape(-1, 44, 1))
# Display initial metrics
print(separator, '\nNumeric and Categorical Features')
scores('CNN Initial Model', y_train_cat, y_test_cat, y_train_cat_cnn, y_test_cat_cnn)
Numeric and Categorical Features
 CNN Initial Model 
EV score. Train:  0.707511575746
EV score. Test:  0.70148632641
R2 score. Train:  0.691620294029
R2 score. Test:  0.688982274023
MSE score. Train:  0.560813030009
MSE score. Test:  0.622316576396
MAE score. Train:  0.408317169117
MAE score. Test:  0.426873392354
MdAE score. Train:  0.197219497647
MdAE score. Test:  0.204493423326
In [279]:
# Create the sequential model
def cnn_cat_model():
    model = Sequential()
    model.add(Conv1D(44, 3, padding='valid', activation='relu', input_shape=(44, 1)))
    model.add(Dense(256, activation='relu', kernel_initializer='normal',))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_cat_model = cnn_cat_model()
# Create the checkpointer for saving the best results
cnn_cat_checkpointer = ModelCheckpoint(filepath='', 
                                       verbose=2, save_best_only=True)
# Fit the model
cnn_cat_history = cnn_cat_model.fit(X_train_cat.reshape(-1, 44, 1), y_train_cat, 
                                    epochs=20, batch_size=128, verbose=0, callbacks=[cnn_cat_checkpointer],
                                    validation_data=(X_test_cat.reshape(-1, 44, 1), y_test_cat))
Epoch 00000: val_loss improved from inf to 0.89241, saving model to
Epoch 00001: val_loss improved from 0.89241 to 0.75810, saving model to
Epoch 00002: val_loss improved from 0.75810 to 0.70918, saving model to
Epoch 00003: val_loss improved from 0.70918 to 0.68910, saving model to
Epoch 00004: val_loss improved from 0.68910 to 0.66787, saving model to
Epoch 00005: val_loss improved from 0.66787 to 0.65758, saving model to
Epoch 00006: val_loss improved from 0.65758 to 0.64506, saving model to
Epoch 00007: val_loss improved from 0.64506 to 0.62748, saving model to
Epoch 00008: val_loss improved from 0.62748 to 0.60634, saving model to
Epoch 00009: val_loss did not improve
Epoch 00010: val_loss did not improve
Epoch 00011: val_loss did not improve
Epoch 00012: val_loss improved from 0.60634 to 0.59001, saving model to
Epoch 00013: val_loss did not improve
Epoch 00014: val_loss improved from 0.59001 to 0.58855, saving model to
Epoch 00015: val_loss did not improve
Epoch 00016: val_loss did not improve
Epoch 00017: val_loss did not improve
Epoch 00018: val_loss did not improve
Epoch 00019: val_loss did not improve
In [281]:
# Plot the fitting history
In [282]:
# Load the best model results
# Create predictions
y_train_cat_cnn = cnn_cat_model.predict(X_train_cat.reshape(-1, 44, 1))
y_test_cat_cnn = cnn_cat_model.predict(X_test_cat.reshape(-1, 44, 1))
# Save the model
cnn_cat_model.save('cnn_cat_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Categorical Features')
scores('CNN Model', 
       y_train_cat, y_test_cat, y_train_cat_cnn, y_test_cat_cnn)
Numeric and Categorical Features
 CNN Model 
EV score. Train:  0.677588379835
EV score. Test:  0.706506207288
R2 score. Train:  0.677252798597
R2 score. Test:  0.705856232957
MSE score. Train:  0.586941463532
MSE score. Test:  0.588553406398
MAE score. Train:  0.426274478988
MAE score. Test:  0.438030453736
MdAE score. Train:  0.234074632479
MdAE score. Test:  0.243703509669
In [284]:
# Create the initial sequential model
def cnn_cat_enc_model():
    model = Sequential()
    model.add(Conv1D(159, 3, padding='valid', activation='relu', input_shape=(636, 1)))


    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_cat_enc_model = cnn_cat_enc_model()

# Fit the model
cnn_cat_enc_history = \
cnn_cat_enc_model.fit(X_train_cat_enc.reshape(-1, 636, 1), y_train_cat_enc, 
                      epochs=10, batch_size=128, verbose=2, 
                      validation_data=(X_test_cat_enc.reshape(-1, 636, 1), y_test_cat_enc))
Train on 16719 samples, validate on 4180 samples
Epoch 1/10
71s - loss: 1.0897 - mean_absolute_error: 0.5893 - val_loss: 0.8081 - val_mean_absolute_error: 0.4916
Epoch 2/10
63s - loss: 0.7821 - mean_absolute_error: 0.4931 - val_loss: 0.7191 - val_mean_absolute_error: 0.4629
Epoch 3/10
62s - loss: 0.6890 - mean_absolute_error: 0.4710 - val_loss: 0.7241 - val_mean_absolute_error: 0.5498
Epoch 4/10
62s - loss: 0.6555 - mean_absolute_error: 0.4517 - val_loss: 0.6712 - val_mean_absolute_error: 0.4377
Epoch 5/10
62s - loss: 0.6297 - mean_absolute_error: 0.4462 - val_loss: 0.7020 - val_mean_absolute_error: 0.4549
Epoch 6/10
62s - loss: 0.6061 - mean_absolute_error: 0.4370 - val_loss: 0.6368 - val_mean_absolute_error: 0.4443
Epoch 7/10
63s - loss: 0.5783 - mean_absolute_error: 0.4290 - val_loss: 0.7020 - val_mean_absolute_error: 0.5011
Epoch 8/10
65s - loss: 0.5532 - mean_absolute_error: 0.4264 - val_loss: 0.6264 - val_mean_absolute_error: 0.4163
Epoch 9/10
71s - loss: 0.5291 - mean_absolute_error: 0.4215 - val_loss: 0.6214 - val_mean_absolute_error: 0.4233
Epoch 10/10
68s - loss: 0.5117 - mean_absolute_error: 0.4180 - val_loss: 0.6213 - val_mean_absolute_error: 0.4363
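
The cell above logs the validation loss for every epoch, which makes it a convenient example of what `ModelCheckpoint(..., save_best_only=True)` does in the other cells: it writes the model only when `val_loss` drops below the best value seen so far. The sketch below replays that rule on the ten `val_loss` values printed above. The function `replay_save_best_only` is a hypothetical stand-in for illustration, not a Keras API, and epochs are numbered from zero as in the checkpoint messages.

```python
# val_loss values copied from the training log above (epochs 1/10 .. 10/10)
val_loss = [0.8081, 0.7191, 0.7241, 0.6712, 0.7020,
            0.6368, 0.7020, 0.6264, 0.6214, 0.6213]

def replay_save_best_only(losses):
    """Return the zero-based epochs at which a checkpoint would be written."""
    best = float('inf')
    saved_epochs = []
    for epoch, loss in enumerate(losses):
        if loss < best:          # strict improvement triggers a save
            best = loss
            saved_epochs.append(epoch)
    return saved_epochs, best

saved_epochs, best_loss = replay_save_best_only(val_loss)
print('checkpoints written at epochs:', saved_epochs)
print('best val_loss:', best_loss)
```

Because the checkpoint and the final in-memory weights can come from different epochs, the tuned cells reload the best weights before computing predictions.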
In [285]:
# Create predictions
y_train_cat_enc_cnn = cnn_cat_enc_model.predict(X_train_cat_enc.reshape(-1, 636, 1))
y_test_cat_enc_cnn = cnn_cat_enc_model.predict(X_test_cat_enc.reshape(-1, 636, 1))
# Display initial metrics
print(separator, '\nNumeric and Encoded Categorical Features')
scores('CNN Initial Model', y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_cnn, y_test_cat_enc_cnn)
Numeric and Encoded Categorical Features
 CNN Initial Model 
EV score. Train:  0.736072746558
EV score. Test:  0.68988846503
R2 score. Train:  0.734915785038
R2 score. Test:  0.689495855071
MSE score. Train:  0.482076735018
MSE score. Test:  0.62128895008
MAE score. Train:  0.396093295019
MAE score. Test:  0.436302720032
MdAE score. Train:  0.207103278372
MdAE score. Test:  0.219894897674
In [288]:
# Create the sequential model
def cnn_cat_enc_model():
    model = Sequential()
    model.add(Conv1D(159, 3, padding='valid', activation='relu', input_shape=(636, 1)))

    model.add(Dense(512, kernel_initializer='normal', activation='relu'))

    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_cat_enc_model = cnn_cat_enc_model()
# Create the checkpointer for saving the best results
cnn_cat_enc_checkpointer = ModelCheckpoint(filepath='', 
                                           verbose=2, save_best_only=True)
# Fit the model
cnn_cat_enc_history = \
cnn_cat_enc_model.fit(X_train_cat_enc.reshape(-1, 636, 1), y_train_cat_enc, 
                      epochs=10, batch_size=128, verbose=0, callbacks=[cnn_cat_enc_checkpointer],
                      validation_data=(X_test_cat_enc.reshape(-1, 636, 1), y_test_cat_enc))
Epoch 00000: val_loss improved from inf to 0.96205, saving model to
Epoch 00001: val_loss improved from 0.96205 to 0.72144, saving model to
Epoch 00002: val_loss improved from 0.72144 to 0.70226, saving model to
Epoch 00003: val_loss improved from 0.70226 to 0.68220, saving model to
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss improved from 0.68220 to 0.63491, saving model to
Epoch 00007: val_loss improved from 0.63491 to 0.60496, saving model to
Epoch 00008: val_loss improved from 0.60496 to 0.59351, saving model to
Epoch 00009: val_loss did not improve
In [289]:
# Plot the fitting history
In [290]:
# Load the best model results
# Create predictions
y_train_cat_enc_cnn = cnn_cat_enc_model.predict(X_train_cat_enc.reshape(-1, 636, 1))
y_test_cat_enc_cnn = cnn_cat_enc_model.predict(X_test_cat_enc.reshape(-1, 636, 1))
# Save the model
cnn_cat_enc_model.save('cnn_cat_enc_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Encoded Categorical Features')
scores('CNN Model', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_cnn, y_test_cat_enc_cnn)
Numeric and Encoded Categorical Features
 CNN Model 
EV score. Train:  0.669089140833
EV score. Test:  0.703576179703
R2 score. Train:  0.668252129085
R2 score. Test:  0.703377856744
MSE score. Train:  0.603309897132
MSE score. Test:  0.593512398994
MAE score. Train:  0.409488764816
MAE score. Test:  0.429525232602
MdAE score. Train:  0.211565134724
MdAE score. Test:  0.224657701949
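
To compare the tuned MLP and CNN models at a glance, the test R2 scores printed in the cells above can be collected and ranked. This is only a bookkeeping sketch: the dictionary layout is an assumption, while the numbers are copied from the outputs above.

```python
# Test R2 scores copied from the 'MLP Model' and 'CNN Model' outputs above
r2_test = {
    ('MLP', 'numeric'): 0.698417133671,
    ('MLP', 'numeric + categorical'): 0.698723398616,
    ('MLP', 'numeric + encoded categorical'): 0.691393760571,
    ('CNN', 'numeric'): 0.711347161402,
    ('CNN', 'numeric + categorical'): 0.705856232957,
    ('CNN', 'numeric + encoded categorical'): 0.703377856744,
}

# Rank from highest to lowest test R2
ranked = sorted(r2_test.items(), key=lambda item: item[1], reverse=True)
for (model, features), score in ranked:
    print('{:<4} {:<30} {:.4f}'.format(model, features, score))

best_model = ranked[0][0]
spread = ranked[0][1] - ranked[-1][1]
print('best:', best_model, '| spread:', round(spread, 4))
```

The spread between the best and worst of these six scores is about 0.02, so differences between individual runs should be read with caution.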


In [211]:
# Create the initial sequential model
def rnn_model():
    model = Sequential()
    model.add(LSTM(36, return_sequences=False, input_shape=(1, 36)))
    # Single output unit for the scalar regression target
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_model = rnn_model()
# Fit the model
rnn_history = rnn_model.fit(X_train.reshape(-1, 1, 36), y_train.reshape(-1), 
                            validation_data=(X_test.reshape(-1, 1, 36), y_test.reshape(-1)),
                            epochs=10, batch_size=128, verbose=0)
In [212]:
# Create predictions
y_train_rnn = rnn_model.predict(X_train.reshape(-1, 1, 36))
y_test_rnn = rnn_model.predict(X_test.reshape(-1, 1, 36))
# Display initial metrics
print(separator, '\nNumeric Features')
scores('RNN Initial Model', y_train, y_test, y_train_rnn, y_test_rnn)
Numeric Features
 RNN Initial Model 
EV score. Train:  0.668152902736
EV score. Test:  0.680868839573
R2 score. Train:  0.668122732062
R2 score. Test:  0.680846628688
MSE score. Train:  0.603545215913
MSE score. Test:  0.638595220757
MAE score. Train:  0.428454734993
MAE score. Test:  0.437092215793
MdAE score. Train:  0.216756235886
MdAE score. Test:  0.222291289112
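
The `reshape` calls used for the recurrent and convolutional models only reinterpret the same feature matrix. As a small sketch with made-up data (the array `X` below is a stand-in, not the notebook's `X_train`), the LSTM input treats each row as one timestep with 36 features, while the Conv1D input treats it as 36 steps with one channel.

```python
import numpy as np

n_samples, n_features = 5, 36
# Stand-in feature matrix; the notebook's X_train has the same 2-D layout
X = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# LSTM input: (samples, timesteps, features) with a single timestep
X_seq = X.reshape(-1, 1, n_features)
# Conv1D input: (samples, steps, channels) with a single channel
X_conv = X.reshape(-1, n_features, 1)

print(X_seq.shape, X_conv.shape)
# Both views hold exactly the original values
print(np.array_equal(X_seq[:, 0, :], X), np.array_equal(X_conv[:, :, 0], X))
```

With a single timestep the LSTM has no sequence to iterate over, so in this setup it acts on each row independently, much like a gated dense layer.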
In [223]:
# Create the sequential model
def rnn_model():
    model = Sequential()
    model.add(LSTM(144, return_sequences=True, input_shape=(1, 36)))
    model.add(LSTM(144, return_sequences=True))
    model.add(LSTM(144, return_sequences=False))
    # Single output unit for the scalar regression target
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_model = rnn_model()
# Create the checkpointer for saving the best results
rnn_checkpointer = ModelCheckpoint(filepath='', 
                                   verbose=2, save_best_only=True)
# Fit the model
rnn_history = rnn_model.fit(X_train.reshape(-1, 1, 36), y_train.reshape(-1), 
                            epochs=10, verbose=0, callbacks=[rnn_checkpointer],
                            validation_data=(X_test.reshape(-1, 1, 36), y_test.reshape(-1)))
Epoch 00000: val_loss improved from inf to 0.68275, saving model to
Epoch 00001: val_loss did not improve
Epoch 00002: val_loss did not improve
Epoch 00003: val_loss improved from 0.68275 to 0.65074, saving model to
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss improved from 0.65074 to 0.61933, saving model to
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss improved from 0.61933 to 0.61254, saving model to
In [224]:
# Plot the fitting history
In [225]:
# Load the best model results
# Create predictions
y_train_rnn = rnn_model.predict(X_train.reshape(-1, 1, 36))
y_test_rnn = rnn_model.predict(X_test.reshape(-1, 1, 36))
# Save the model
rnn_model.save('rnn_model_p6.h5')
# Display metrics
print(separator, '\nNumeric Features')
scores('RNN Model', y_train, y_test, y_train_rnn, y_test_rnn)
Numeric Features
 RNN Model 
EV score. Train:  0.709474013983
EV score. Test:  0.694002009861
R2 score. Train:  0.709122323837
R2 score. Test:  0.693866493495
MSE score. Train:  0.528984196341
MSE score. Test:  0.612543722675
MAE score. Train:  0.417601650996
MAE score. Test:  0.434989476276
MdAE score. Train:  0.222054972582
MdAE score. Test:  0.22910155654
In [228]:
# Create the initial sequential model
def rnn_cat_model():
    model = Sequential()
    model.add(LSTM(44, return_sequences=False, input_shape=(1, 44)))
    # Single output unit for the scalar regression target
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_cat_model = rnn_cat_model()
# Fit the model
rnn_cat_history = rnn_cat_model.fit(X_train_cat.reshape(-1, 1, 44), y_train_cat.reshape(-1), 
                                    validation_data=(X_test_cat.reshape(-1, 1, 44), y_test_cat.reshape(-1)),
                                    epochs=10, batch_size=128, verbose=0)
In [230]:
# Create predictions
y_train_cat_rnn = rnn_cat_model.predict(X_train_cat.reshape(-1, 1, 44))
y_test_cat_rnn = rnn_cat_model.predict(X_test_cat.reshape(-1, 1, 44))
# Display initial metrics
print(separator, '\nNumeric and Categorical Features')
scores('RNN Initial Model', y_train_cat, y_test_cat, y_train_cat_rnn, y_test_cat_rnn)
Numeric and Categorical Features
 RNN Initial Model 
EV score. Train:  0.678028592049
EV score. Test:  0.686336258377
R2 score. Train:  0.67802553989
R2 score. Test:  0.686282096033
MSE score. Train:  0.585536172011
MSE score. Test:  0.627719373027
MAE score. Train:  0.424694864826
MAE score. Test:  0.436153528967
MdAE score. Train:  0.219955308172
MdAE score. Test:  0.225994076048
In [104]:
# Create the sequential model
def rnn_cat_model():
    model = Sequential()
    model.add(LSTM(156, return_sequences=True, input_shape=(1, 44)))
    model.add(LSTM(624, return_sequences=False))
    # Single output unit for the scalar regression target
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_cat_model = rnn_cat_model()
# Create the checkpointer for saving the best results
rnn_cat_checkpointer = ModelCheckpoint(filepath='', 
                                       verbose=2, save_best_only=True)
# Fit the model
rnn_cat_history = rnn_cat_model.fit(X_train_cat.reshape(-1, 1, 44), y_train_cat.reshape(-1), 
                                    epochs=10, verbose=0, callbacks=[rnn_cat_checkpointer],
                                    validation_data=(X_test_cat.reshape(-1, 1, 44), y_test_cat.reshape(-1)))
Epoch 00000: val_loss improved from inf to 0.65782, saving model to
Epoch 00001: val_loss improved from 0.65782 to 0.63826, saving model to
Epoch 00002: val_loss improved from 0.63826 to 0.62570, saving model to
Epoch 00003: val_loss did not improve
Epoch 00004: val_loss did not improve
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss did not improve
Epoch 00007: val_loss did not improve
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss improved from 0.62570 to 0.61207, saving model to
In [105]:
# Plot the fitting history
In [106]:
# Load the best model results
# Create predictions
y_train_cat_rnn = rnn_cat_model.predict(X_train_cat.reshape(-1, 1, 44))
y_test_cat_rnn = rnn_cat_model.predict(X_test_cat.reshape(-1, 1, 44))
# Save the model
rnn_cat_model.save('rnn_cat_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Categorical Features')
scores('RNN Model', 
       y_train_cat, y_test_cat, y_train_cat_rnn, y_test_cat_rnn)
Numeric and Categorical Features
 RNN Model 
EV score. Train:  0.720378705729
EV score. Test:  0.694678442152
R2 score. Train:  0.719266482075
R2 score. Test:  0.694105279262
MSE score. Train:  0.510536237516
MSE score. Test:  0.612065935306
MAE score. Train:  0.404334050848
MAE score. Test:  0.425333059647
MdAE score. Train:  0.206248915328
MdAE score. Test:  0.219756308573
In [232]:
# Create the initial sequential model
def rnn_cat_enc_model():
    model = Sequential()
    model.add(LSTM(636, return_sequences=False, input_shape=(1, 636)))
    # Single output unit for the scalar regression target
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_cat_enc_model = rnn_cat_enc_model()
# Fit the model
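rnn_cat_enc_history = rnn_cat_enc_model.fit(X_train_cat_enc.reshape(-1, 1, 636), y_train_cat_enc.reshape(-1), 
                                            validation_data=(X_test_cat_enc.reshape(-1, 1, 636), y_test_cat_enc.reshape(-1)),
                                            epochs=10, batch_size=128, verbose=0)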
In [234]:
# Create predictions
y_train_cat_enc_rnn = rnn_cat_enc_model.predict(X_train_cat_enc.reshape(-1, 1, 636))
y_test_cat_enc_rnn = rnn_cat_enc_model.predict(X_test_cat_enc.reshape(-1, 1, 636))
# Display initial metrics
print(separator, '\nNumeric and Encoded Categorical Features')
scores('RNN Initial Model', y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_rnn, y_test_cat_enc_rnn)
Numeric and Encoded Categorical Features
 RNN Initial Model 
EV score. Train:  0.693131162054
EV score. Test:  0.659015602783
R2 score. Train:  0.688119822912
R2 score. Test:  0.652958817224
MSE score. Train:  0.567178915234
MSE score. Test:  0.694396051077
MAE score. Train:  0.449181869536
MAE score. Test:  0.474961223763
MdAE score. Train:  0.246776681798
MdAE score. Test:  0.262337161932
In [113]:
# Create the sequential model
def rnn_cat_enc_model():
    model = Sequential()
    model.add(LSTM(159, return_sequences=True, input_shape=(1, 636)))
    model.add(LSTM(636, return_sequences=False))   

    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_cat_enc_model = rnn_cat_enc_model()
# Create the checkpointer for saving the best results
rnn_cat_enc_checkpointer = ModelCheckpoint(filepath='', 
                                           verbose=2, save_best_only=True)
# Fit the model
rnn_cat_enc_history = \
rnn_cat_enc_model.fit(X_train_cat_enc.reshape(-1, 1, 636), y_train_cat_enc.reshape(-1), 
                      epochs=10, verbose=0, callbacks=[rnn_cat_enc_checkpointer],
                      validation_data=(X_test_cat_enc.reshape(-1, 1, 636), y_test_cat_enc.reshape(-1)))
Epoch 00000: val_loss improved from inf to 0.67299, saving model to
Epoch 00001: val_loss improved from 0.67299 to 0.60554, saving model to
Epoch 00002: val_loss did not improve
Epoch 00003: val_loss did not improve
Epoch 00004: val_loss improved from 0.60554 to 0.60045, saving model to
Epoch 00005: val_loss did not improve
Epoch 00006: val_loss improved from 0.60045 to 0.59203, saving model to
Epoch 00007: val_loss did not improve
Epoch 00008: val_loss did not improve
Epoch 00009: val_loss improved from 0.59203 to 0.58162, saving model to
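The log above follows a keep-the-best rule: the model is written to disk only when the validation loss drops below the best value seen so far. A minimal plain-Python sketch of that rule, assuming a hypothetical helper `best_epochs` and illustrative loss values that are not taken from the run above:

```python
def best_epochs(val_losses):
    """Return (epoch, loss) pairs for the epochs where the loss improved."""
    best = float('inf')
    improved = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # A checkpoint would be saved at this point
            best = loss
            improved.append((epoch, loss))
    return improved

# Hypothetical validation losses, for illustration only
val_losses = [0.70, 0.65, 0.66, 0.64, 0.67]
for epoch, loss in best_epochs(val_losses):
    print('Epoch %05d: val_loss improved to %.5f' % (epoch, loss))
```

Loading the saved file after training therefore restores the weights from the last improving epoch, not from the final one.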
In [114]:
# Plot the fitting history
In [115]:
# Load the best model results
# Create predictions
y_train_cat_enc_rnn = rnn_cat_enc_model.predict(X_train_cat_enc.reshape(-1, 1, 636))
y_test_cat_enc_rnn = rnn_cat_enc_model.predict(X_test_cat_enc.reshape(-1, 1, 636))
# Save the model
rnn_cat_enc_model.save('rnn_cat_enc_model_p6.h5')
# Display metrics
print(separator, '\nNumeric and Encoded Categorical Features')
scores('RNN Model', 
       y_train_cat_enc, y_test_cat_enc, y_train_cat_enc_rnn, y_test_cat_enc_rnn)
Numeric and Encoded Categorical Features
 RNN Model 
EV score. Train:  0.739230103807
EV score. Test:  0.70940135106
R2 score. Train:  0.739226532816
R2 score. Test:  0.709323171628
MSE score. Train:  0.474237297221
MSE score. Test:  0.581616395342
MAE score. Train:  0.383939769297
MAE score. Test:  0.414738278504
MdAE score. Train:  0.193722330332
MdAE score. Test:  0.20408428843
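The `reshape(-1, 1, 636)` calls above turn every row of the feature table into a sequence with a single time step, which gives the 3D layout `(samples, time steps, features)` that Keras recurrent layers expect. A minimal numpy sketch, with a hypothetical array `X_demo` standing in for the real features:

```python
import numpy as np

n_samples, n_features = 5, 636
# Hypothetical stand-in for a 2D feature table
X_demo = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# One time step per sample: (samples, time steps, features)
X_seq = X_demo.reshape(-1, 1, n_features)
print(X_demo.shape, '->', X_seq.shape)

# The values are unchanged; only the layout differs
print(np.array_equal(X_seq[:, 0, :], X_demo))
```

With one time step per sample the LSTM sees no temporal order, so it behaves here as a gated feed-forward layer over the feature vector.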

Display Predictions

In [107]:
# Plot predicted values and real data points
plt.figure(figsize = (18, 6))
plt.plot(y_test[1:50], color = 'black', label='Real Data')

plt.plot(y_test_mlp[1:50], label='MLP')
plt.plot(y_test_cnn[1:50], label='CNN')
plt.plot(y_test_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric Features; Neural Network Predictions vs Real Data");
In [108]:
# Plot predicted values and real data points
plt.figure(figsize = (18, 6))
plt.plot(y_test_cat[1:50], color = 'black', label='Real Data')

plt.plot(y_test_cat_mlp[1:50], label='MLP')
plt.plot(y_test_cat_cnn[1:50], label='CNN')
plt.plot(y_test_cat_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Categorical Features; Neural Network Predictions vs Real Data");
In [116]:
# Plot predicted values and real data points
plt.figure(figsize = (18, 6))
plt.plot(y_test_cat_enc[1:50], color = 'black', label='Real Data')

plt.plot(y_test_cat_enc_mlp[1:50], label='MLP')
plt.plot(y_test_cat_enc_cnn[1:50], label='CNN')
plt.plot(y_test_cat_enc_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Encoded Categorical Features; Neural Network Predictions vs Real Data");

Evaluation Metrics and Predictions

Evaluation metrics capture different properties of prediction performance: how much of the target variance the model explains (EV and R2 scores) and how far the predictions fall from the real values (MSE, MAE and MdAE). Comparing several indicators at once lets us choose the best algorithm.
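As an illustration of what the five indicators measure, here is a minimal numpy sketch for one-dimensional targets. The function name `regression_metrics` and the sample values are hypothetical; the notebook itself relies on the scikit-learn metric functions imported at the top, and this sketch only restates their standard definitions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute EV, R2, MSE, MAE and MdAE for 1D targets."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    return {
        # Share of target variance explained, ignoring a constant bias
        'EV': 1.0 - np.var(residuals) / np.var(y_true),
        # Coefficient of determination: penalizes bias as well
        'R2': 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        'MSE': np.mean(residuals ** 2),
        'MAE': np.mean(np.abs(residuals)),
        # Median absolute error: robust to a few large misses
        'MdAE': np.median(np.abs(residuals)),
    }

# Hypothetical targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print('%s score: %s' % (name, value))
```

EV and R2 coincide when the mean residual is zero, so a visible gap between the two points to a systematic offset in the predictions.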

In [109]:
# Scale the whole dataset
target_scale = RobustScaler()
s_target_train = target_scale.fit_transform(target_train.reshape(-1,1))
feature_scale = RobustScaler()
s_features_train = feature_scale.fit_transform(features_train)
s_features_test = feature_scale.transform(features_test)
feature_cat_scale = RobustScaler()
s_features_train_cat = feature_cat_scale.fit_transform(features_train_cat)
s_features_test_cat = feature_cat_scale.transform(features_test_cat)
feature_cat_enc_scale = RobustScaler()
s_features_train_cat_enc = feature_cat_enc_scale.fit_transform(features_train_cat_enc)
s_features_test_cat_enc = feature_cat_enc_scale.transform(features_test_cat_enc)
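The cell above fits each scaler on the training split and only applies it to the test split; the target scaler is inverted later to bring predictions back to the original units. A rough numpy sketch of that pattern, assuming the default `RobustScaler` behaviour of centering on the median and dividing by the interquartile range. The helper names and sample values are hypothetical.

```python
import numpy as np

def robust_fit(x):
    """Return the median and interquartile range of a 1D array."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return q50, q75 - q25

def robust_transform(x, center, scale):
    """Center on the median and divide by the interquartile range."""
    return (x - center) / scale

def robust_inverse(z, center, scale):
    """Undo robust_transform."""
    return z * scale + center

# Hypothetical values with one outlier
train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
test = np.array([2.5, 5.0])

center, scale = robust_fit(train)                 # statistics from train only
z_train = robust_transform(train, center, scale)
z_test = robust_transform(test, center, scale)    # reuse the train statistics
restored = robust_inverse(z_train, center, scale)

print(center, scale)
print(np.allclose(restored, train))
```

Because the median and the interquartile range barely react to the outlier, the bulk of the data keeps a comparable scale, which is the reason to prefer this scaler for skewed price data.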

Regressors; Scikit-Learn

Numeric Features

In [110]:
# Fit the Regressors
gbr = GradientBoostingRegressor(max_depth=4, n_estimators=360)
gbr.fit(s_features_train, s_target_train)
br = BaggingRegressor(n_estimators=360)
br.fit(s_features_train, s_target_train)
# Create predictions
s_target_train_gbr = gbr.predict(s_features_train)
s_target_test_gbr = gbr.predict(s_features_test)
s_target_train_br = br.predict(s_features_train)
s_target_test_br = br.predict(s_features_test)
s_target_train_mlpr = mlpr.predict(s_features_train)
s_target_test_mlpr = mlpr.predict(s_features_test)
# Display metrics
scores2('Gradient Boosting Regressor', s_target_train, s_target_train_gbr)
scores2('Bagging Regressor', s_target_train, s_target_train_br)
scores2('MLP Regressor', s_target_train, s_target_train_mlpr)
 Gradient Boosting Regressor 
EV score: 0.851729559483
R2 score: 0.851729559483
MSE score: 0.273663122104
MAE score: 0.324355312761
MdAE score: 0.17539487972
 Bagging Regressor 
EV score: 0.95850393243
R2 score: 0.958479485667
MSE score: 0.0766345169276
MAE score: 0.144299144987
MdAE score: 0.0632429903765
 MLP Regressor 
EV score: 0.692064614945
R2 score: 0.692053341016
MSE score: 0.568377917035
MAE score: 0.430736932351
MdAE score: 0.238839315412

Numeric and Categorical Features

In [111]:
# Fit the Regressors
gbr_cat = GradientBoostingRegressor(max_depth=3, n_estimators=396)
gbr_cat.fit(s_features_train_cat, s_target_train)
br_cat = BaggingRegressor(n_estimators=308)
br_cat.fit(s_features_train_cat, s_target_train)
# Create predictions
s_target_train_cat_gbr = gbr_cat.predict(s_features_train_cat)
s_target_test_cat_gbr = gbr_cat.predict(s_features_test_cat)
s_target_train_cat_br = br_cat.predict(s_features_train_cat)
s_target_test_cat_br = br_cat.predict(s_features_test_cat)
s_target_train_cat_mlpr = mlpr_cat.predict(s_features_train_cat)
s_target_test_cat_mlpr = mlpr_cat.predict(s_features_test_cat)
# Display metrics
scores2('Gradient Boosting Regressor', s_target_train, s_target_train_cat_gbr)
scores2('Bagging Regressor', s_target_train, s_target_train_cat_br)
scores2('MLP Regressor', s_target_train, s_target_train_cat_mlpr)
 Gradient Boosting Regressor 
EV score: 0.813227343634
R2 score: 0.813227343634
MSE score: 0.344726757987
MAE score: 0.357167721423
MdAE score: 0.190744564286
 Bagging Regressor 
EV score: 0.957835071358
R2 score: 0.957808762631
MSE score: 0.0778724721096
MAE score: 0.144510376607
MdAE score: 0.0627813967251
 MLP Regressor 
EV score: 0.712414402875
R2 score: 0.712257005057
MSE score: 0.531087963892
MAE score: 0.417796926189
MdAE score: 0.229585027349

Numeric and Encoded Categorical Features

In [112]:
# Fit the Regressors
gbr_cat_enc = GradientBoostingRegressor(max_depth=4, n_estimators=318)
gbr_cat_enc.fit(s_features_train_cat_enc, s_target_train)
br_cat_enc = BaggingRegressor(n_estimators=159)
br_cat_enc.fit(s_features_train_cat_enc, s_target_train)
# Create predictions
s_target_train_cat_enc_gbr = gbr_cat_enc.predict(s_features_train_cat_enc)
s_target_test_cat_enc_gbr = gbr_cat_enc.predict(s_features_test_cat_enc)
s_target_train_cat_enc_br = br_cat_enc.predict(s_features_train_cat_enc)
s_target_test_cat_enc_br = br_cat_enc.predict(s_features_test_cat_enc)
s_target_train_cat_enc_mlpr = mlpr_cat_enc.predict(s_features_train_cat_enc)
s_target_test_cat_enc_mlpr = mlpr_cat_enc.predict(s_features_test_cat_enc)
# Display metrics
scores2('Gradient Boosting Regressor', s_target_train, s_target_train_cat_enc_gbr)
scores2('Bagging Regressor', s_target_train, s_target_train_cat_enc_br)
scores2('MLP Regressor', s_target_train, s_target_train_cat_enc_mlpr)
 Gradient Boosting Regressor 
EV score: 0.835548897222
R2 score: 0.835548897222
MSE score: 0.303527810822
MAE score: 0.339775897345
MdAE score: 0.183338754074
 Bagging Regressor 
EV score: 0.922628968835
R2 score: 0.922477330417
MSE score: 0.143083784723
MAE score: 0.199528910545
MdAE score: 0.0930322280142
 MLP Regressor 
EV score: 0.745053861238
R2 score: 0.745050267713
MSE score: 0.470561357165
MAE score: 0.405331444166
MdAE score: 0.220279470626

Neural Networks; Keras

Numeric Features

In [117]:
# Create predictions
s_target_train_mlp = mlp_model.predict(s_features_train)
s_target_test_mlp = mlp_model.predict(s_features_test)
s_target_train_cnn = cnn_model.predict(s_features_train.reshape(-1, 36, 1))
s_target_test_cnn = cnn_model.predict(s_features_test.reshape(-1, 36, 1))
s_target_train_rnn = rnn_model.predict(s_features_train.reshape(-1, 1, 36))
s_target_test_rnn = rnn_model.predict(s_features_test.reshape(-1, 1, 36))
# Display metrics
scores2('MLP', s_target_train, s_target_train_mlp)
scores2('CNN', s_target_train, s_target_train_cnn)
scores2('RNN', s_target_train, s_target_train_rnn)
 MLP 
EV score: 0.761783651615
R2 score: 0.761563333271
MSE score: 0.440083150853
MAE score: 0.386517573074
MdAE score: 0.197635491112
 CNN 
EV score: 0.699149967226
R2 score: 0.698797590823
MSE score: 0.555930038334
MAE score: 0.422961280799
MdAE score: 0.214017019481
 RNN 
EV score: 0.677314911302
R2 score: 0.675630793764
MSE score: 0.598689053484
MAE score: 0.427292031551
MdAE score: 0.219091205484

Numeric and Categorical Features

In [118]:
# Create predictions
s_target_train_cat_mlp = mlp_cat_model.predict(s_features_train_cat)
s_target_test_cat_mlp = mlp_cat_model.predict(s_features_test_cat)
s_target_train_cat_cnn = cnn_cat_model.predict(s_features_train_cat.reshape(-1, 44, 1))
s_target_test_cat_cnn = cnn_cat_model.predict(s_features_test_cat.reshape(-1, 44, 1))
s_target_train_cat_rnn = rnn_cat_model.predict(s_features_train_cat.reshape(-1, 1, 44))
s_target_test_cat_rnn = rnn_cat_model.predict(s_features_test_cat.reshape(-1, 1, 44))
# Display metrics
scores2('MLP', s_target_train, s_target_train_cat_mlp)
scores2('CNN', s_target_train, s_target_train_cat_cnn)
scores2('RNN', s_target_train, s_target_train_cat_rnn)
 MLP 
EV score: 0.700075627798
R2 score: 0.698330432113
MSE score: 0.556792274332
MAE score: 0.40979485907
MdAE score: 0.202451247452
 CNN 
EV score: 0.722221568176
R2 score: 0.722179013791
MSE score: 0.512774887609
MAE score: 0.426531735321
MdAE score: 0.214306339162
 RNN 
EV score: 0.714754260049
R2 score: 0.71384778339
MSE score: 0.528151860352
MAE score: 0.407582791011
MdAE score: 0.208528225119

Numeric and Encoded Categorical Features

In [119]:
# Create predictions
s_target_train_cat_enc_mlp = mlp_cat_enc_model.predict(s_features_train_cat_enc)
s_target_test_cat_enc_mlp = mlp_cat_enc_model.predict(s_features_test_cat_enc)
s_target_train_cat_enc_cnn = cnn_cat_enc_model.predict(s_features_train_cat_enc.reshape(-1, 636, 1))
s_target_test_cat_enc_cnn = cnn_cat_enc_model.predict(s_features_test_cat_enc.reshape(-1, 636, 1))
s_target_train_cat_enc_rnn = rnn_cat_enc_model.predict(s_features_train_cat_enc.reshape(-1, 1, 636))
s_target_test_cat_enc_rnn = rnn_cat_enc_model.predict(s_features_test_cat_enc.reshape(-1, 1, 636))
# Display metrics
scores2('MLP', s_target_train, s_target_train_cat_enc_mlp)
scores2('CNN', s_target_train, s_target_train_cat_enc_cnn)
scores2('RNN', s_target_train, s_target_train_cat_enc_rnn)
 MLP 
EV score: 0.725615039028
R2 score: 0.723263915486
MSE score: 0.510772481843
MAE score: 0.392063855341
MdAE score: 0.192621934785
 CNN 
EV score: 0.730510512817
R2 score: 0.730088194793
MSE score: 0.498176892494
MAE score: 0.431014259236
MdAE score: 0.237569146774
 RNN 
EV score: 0.732861645547
R2 score: 0.732852866767
MSE score: 0.493074130533
MAE score: 0.390074103896
MdAE score: 0.197011383602

Display All Predictions

In [120]:
# Rescale regressor predictions
target_train_gbr = target_scale.inverse_transform(s_target_train_gbr.reshape(-1,1))
target_test_gbr = target_scale.inverse_transform(s_target_test_gbr.reshape(-1,1))
target_train_br = target_scale.inverse_transform(s_target_train_br.reshape(-1,1))
target_test_br = target_scale.inverse_transform(s_target_test_br.reshape(-1,1))
target_train_mlpr = target_scale.inverse_transform(s_target_train_mlpr.reshape(-1,1))
target_test_mlpr = target_scale.inverse_transform(s_target_test_mlpr.reshape(-1,1))
# Rescale neural network predictions
target_train_mlp = target_scale.inverse_transform(s_target_train_mlp)
target_test_mlp = target_scale.inverse_transform(s_target_test_mlp)
target_train_cnn = target_scale.inverse_transform(s_target_train_cnn)
target_test_cnn = target_scale.inverse_transform(s_target_test_cnn)
target_train_rnn = target_scale.inverse_transform(s_target_train_rnn)
target_test_rnn = target_scale.inverse_transform(s_target_test_rnn)
In [121]:
# Plot predictions and real target values
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_br[1:50], label='Bagging Regressor')
plt.plot(target_train_mlpr[1:50], label='MLP Regressor')

plt.plot(target_train_mlp[1:50], label='MLP')
plt.plot(target_train_cnn[1:50], label='CNN')
plt.plot(target_train_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric Features; Train Predictions vs Real Data");
In [122]:
# Plot test predictions 
plt.figure(figsize = (18, 6))

plt.plot(target_test_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_br[1:50], label='Bagging Regressor')
plt.plot(target_test_mlpr[1:50], label='MLP Regressor')

plt.plot(target_test_mlp[1:50], label='MLP')
plt.plot(target_test_cnn[1:50], label='CNN')
plt.plot(target_test_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted Target Values")
plt.legend()
plt.title("Numeric Features; Test Predictions");
In [123]:
# Rescale regressor predictions
target_train_cat_gbr = target_scale.inverse_transform(s_target_train_cat_gbr.reshape(-1,1))
target_test_cat_gbr = target_scale.inverse_transform(s_target_test_cat_gbr.reshape(-1,1))
target_train_cat_br = target_scale.inverse_transform(s_target_train_cat_br.reshape(-1,1))
target_test_cat_br = target_scale.inverse_transform(s_target_test_cat_br.reshape(-1,1))
target_train_cat_mlpr = target_scale.inverse_transform(s_target_train_cat_mlpr.reshape(-1,1))
target_test_cat_mlpr = target_scale.inverse_transform(s_target_test_cat_mlpr.reshape(-1,1))
# Rescale neural network predictions
target_train_cat_mlp = target_scale.inverse_transform(s_target_train_cat_mlp.reshape(-1,1))
target_test_cat_mlp = target_scale.inverse_transform(s_target_test_cat_mlp.reshape(-1,1))
target_train_cat_cnn = target_scale.inverse_transform(s_target_train_cat_cnn.reshape(-1,1))
target_test_cat_cnn = target_scale.inverse_transform(s_target_test_cat_cnn.reshape(-1,1))
target_train_cat_rnn = target_scale.inverse_transform(s_target_train_cat_rnn.reshape(-1,1))
target_test_cat_rnn = target_scale.inverse_transform(s_target_test_cat_rnn.reshape(-1,1))
In [124]:
# Plot predictions and real target values
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_cat_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_cat_br[1:50], label='Bagging Regressor')
plt.plot(target_train_cat_mlpr[1:50], label='MLP Regressor')

plt.plot(target_train_cat_mlp[1:50], label='MLP')
plt.plot(target_train_cat_cnn[1:50], label='CNN')
plt.plot(target_train_cat_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Categorical Features; Train Predictions vs Real Data");
In [125]:
# Plot test predictions
plt.figure(figsize = (18, 6))

plt.plot(target_test_cat_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_cat_br[1:50], label='Bagging Regressor')
plt.plot(target_test_cat_mlpr[1:50], label='MLP Regressor')

plt.plot(target_test_cat_mlp[1:50], label='MLP')
plt.plot(target_test_cat_cnn[1:50], label='CNN')
plt.plot(target_test_cat_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted Target Values")
plt.legend()
plt.title("Numeric and Categorical Features; Test Predictions");
In [126]:
# Rescale regressor predictions
target_train_cat_enc_gbr = target_scale.inverse_transform(s_target_train_cat_enc_gbr.reshape(-1,1))
target_test_cat_enc_gbr = target_scale.inverse_transform(s_target_test_cat_enc_gbr.reshape(-1,1))
target_train_cat_enc_br = target_scale.inverse_transform(s_target_train_cat_enc_br.reshape(-1,1))
target_test_cat_enc_br = target_scale.inverse_transform(s_target_test_cat_enc_br.reshape(-1,1))
target_train_cat_enc_mlpr = target_scale.inverse_transform(s_target_train_cat_enc_mlpr.reshape(-1,1))
target_test_cat_enc_mlpr = target_scale.inverse_transform(s_target_test_cat_enc_mlpr.reshape(-1,1))
# Rescale neural network predictions
target_train_cat_enc_mlp = target_scale.inverse_transform(s_target_train_cat_enc_mlp.reshape(-1,1))
target_test_cat_enc_mlp = target_scale.inverse_transform(s_target_test_cat_enc_mlp.reshape(-1,1))
target_train_cat_enc_cnn = target_scale.inverse_transform(s_target_train_cat_enc_cnn.reshape(-1,1))
target_test_cat_enc_cnn = target_scale.inverse_transform(s_target_test_cat_enc_cnn.reshape(-1,1))
target_train_cat_enc_rnn = target_scale.inverse_transform(s_target_train_cat_enc_rnn.reshape(-1,1))
target_test_cat_enc_rnn = target_scale.inverse_transform(s_target_test_cat_enc_rnn.reshape(-1,1))
In [127]:
# Plot predictions and real target values
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_cat_enc_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_cat_enc_br[1:50], label='Bagging Regressor')
plt.plot(target_train_cat_enc_mlpr[1:50], label='MLP Regressor')

plt.plot(target_train_cat_enc_mlp[1:50], label='MLP')
plt.plot(target_train_cat_enc_cnn[1:50], label='CNN')
plt.plot(target_train_cat_enc_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted and Real Target Values")
plt.legend()
plt.title("Numeric and Encoded Categorical Features; Train Predictions vs Real Data");
In [128]:
# Plot test predictions
plt.figure(figsize = (18, 6))

plt.plot(target_test_cat_enc_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_cat_enc_br[1:50], label='Bagging Regressor')
plt.plot(target_test_cat_enc_mlpr[1:50], label='MLP Regressor')

plt.plot(target_test_cat_enc_mlp[1:50], label='MLP')
plt.plot(target_test_cat_enc_cnn[1:50], label='CNN')
plt.plot(target_test_cat_enc_rnn[1:50], label='RNN')

plt.xlabel("Data Points")
plt.ylabel("Predicted Target Values")
plt.legend()
plt.title("Numeric and Encoded Categorical Features; Test Predictions");

Project Design

The project was built on the basis of the competition offered on the site

The competition version of this notebook is available here: .

Several popular libraries (numpy, pandas, matplotlib, scikit-learn and keras) were used to build the regression models.

The most valuable aspect of this project is the investigation of real data and the attempt to bring the predictions up to a coefficient of determination of 0.7-0.8.
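As a quick check against that threshold, the sketch below compares the test R2 scores printed for the RNN models earlier in this notebook with the lower bound of 0.7. The dictionary labels and the variable names are illustrative; only the three scores come from the outputs above.

```python
# Test R2 scores printed for the RNN models in this notebook
test_r2 = {
    'RNN, numeric and categorical': 0.694105279262,
    'RNN initial, numeric and encoded categorical': 0.652958817224,
    'RNN, numeric and encoded categorical': 0.709323171628,
}
threshold = 0.7

# List the models from best to worst and compare each with the threshold
for name, r2 in sorted(test_r2.items(), key=lambda item: item[1], reverse=True):
    status = 'reaches' if r2 >= threshold else 'is below'
    print('%s: R2 = %.3f %s %.1f' % (name, r2, status, threshold))
```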