Machine Learning Engineer Nanodegree
Capstone Project
📑 P6: Sberbank Russian Housing Market. Part 1
Links and Code Library
Resources
🕸scikit-learn. Machine Learning in Python
🕸Scipy Lecture Notes
Versions
🕸Colaboratory Version
🕸Kaggle Version
Code Library
Set of Functions
Capstone Proposal Overview
In this capstone project proposal, I state the goals for applying what we've learned throughout the Nanodegree Program to a problem of our choice using machine learning techniques.
A project proposal encompasses seven key points.
- project's domain background: the field of research from which the project is derived;
- problem statement: the problem being investigated, for which a solution will be defined;
- datasets and inputs: the data or inputs used for the problem;
- solution statement: the solution proposed for the given problem;
- benchmark model: a simple or historical model or result to compare the defined solution to;
- set of evaluation metrics: functional representations of how the solution can be measured;
- outline of the project design: how the solution will be developed and the results obtained.
The full project report with the results will also be completed and published.
Domain Background
Housing costs demand a significant investment from both consumers and developers.
And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses.
Sberbank, Russia’s oldest and largest bank, helps its customers by making predictions about realty prices so renters, developers,
and lenders are more confident when they sign a lease or purchase a building.
Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge.
Complex interactions between housing features such as the number of bedrooms and location are enough to make pricing predictions complicated.
Adding an unstable economy to the mix means Sberbank and their customers need more than simple regression models in their arsenal.
Problem Statement
Sberbank is challenging programmers to develop algorithms which use a broad spectrum of features to predict realty prices.
Algorithm applications rely on a rich dataset that includes housing data and macroeconomic patterns.
An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.
My proposed solution is to select the indicators most correlated with the target variable and
apply ensemble algorithms that have repeatedly shown successful results in the study of real estate price trends.
Boosting and bagging methods combine several models at once in order to improve the prediction accuracy on learning problems with a numerical target variable.
Then I'm going to explore different types of neural networks for regression and try to reach the same level of model performance as with the ensemble methods.
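As an illustration of the boosting and bagging idea, here is a minimal sketch with scikit-learn. It runs on synthetic data, not on the Sberbank files; the feature names (`area`, `distance_km`), the price formula, and all hyperparameters are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing data (NOT the Sberbank dataset):
# the price depends on the area, the distance to the center, and their interaction.
rng = np.random.RandomState(0)
area = rng.uniform(20, 120, size=500)
distance_km = rng.uniform(1, 40, size=500)
price = (100 * area - 50 * distance_km
         + 0.5 * area * (40 - distance_km)
         + rng.normal(0, 200, size=500))

X = np.column_stack([area, distance_km])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0)

# Boosting fits trees sequentially on the remaining errors;
# bagging averages trees fitted on bootstrap samples.
models = {
    "boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
    "bagging": BaggingRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(name, round(scores[name], 3))
```

Both regressors combine many decision trees, so they can capture feature interactions that a single linear model would miss.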
Datasets and Inputs
The basis for the investigation is a large number of economic indicators for pricing and prices themselves (train.csv and test.csv).
Macroeconomic variables for the transaction dates are collected in a separate file (macro.csv).
In addition, the detailed description of variables is provided (data_dictionary.txt).
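Because the macroeconomic values are keyed by date, they can be attached to each transaction with a join. The sketch below uses tiny in-memory tables with made-up values instead of the real files; the column names `timestamp`, `full_sq`, `price_doc`, and `usdrub` are assumptions for this example and should be checked against data_dictionary.txt.

```python
import pandas as pd

# Tiny in-memory stand-ins for train.csv and macro.csv (values are made up).
train = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-10", "2014-01-10", "2014-02-03"]),
    "full_sq": [43, 60, 35],
    "price_doc": [5_850_000, 9_100_000, 4_700_000],
})
macro = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-10", "2014-02-03", "2014-03-01"]),
    "usdrub": [33.1, 35.2, 36.0],
})

# A left join keeps every transaction and attaches the macro value for its date.
train_macro = train.merge(
    macro[["timestamp", "usdrub"]], on="timestamp", how="left")
print(train_macro)
```

A left join is used so that no transaction is dropped if a date happens to be missing from the macro table.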
For practical reasons, I have not analyzed all the data and have chosen the following independent variables:
- the dollar rate, which traditionally affects the Russian real estate market;
- the distance in km from the Kremlin (the closer to the center of the city, the more expensive);
- indicators characterizing the availability of urban infrastructure nearby (schools, medical and sports centers, supermarkets, etc.);
- indicators of a particular living space (number of rooms, floor, etc.);
- proximity to transport nodes (for example, to the metro);
- indicators of population density and employment in the region of housing accommodation.
All these economic indicators have a strong influence on price formation and can be used as a basic set for regression analysis.
Examples of numerical variables: the distance to the metro, the distance to the school, the dollar rate at the transaction moment, the area of the living space.
Examples of categorical variables: neighborhoods, the nearest metro station, the number of rooms.
The goal of the project is to predict the price of housing using the chosen set of numerical and categorical variables.
The predicted target isn't discrete, and all values of this dependent variable are given for the training set, so it's necessary to apply supervised regression algorithms.
Data Description
Load and Display the Data
Solution Statement
Selection of Features
Create lists of the features
Create the distribution plot for the target
Create the table of descriptive statistics
Fill in Missing Values
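One common way to fill gaps in numeric features is median imputation. The sketch below is an assumption about a reasonable approach, not necessarily the method used in this notebook, and it works on made-up values with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Made-up samples with gaps (hypothetical column names).
train = pd.DataFrame({
    "full_sq": [40.0, np.nan, 60.0, 80.0],
    "floor": [2.0, 5.0, np.nan, 9.0],
})
test = pd.DataFrame({
    "full_sq": [np.nan, 55.0],
    "floor": [3.0, np.nan],
})

# Compute the medians on the training set only and reuse them for the
# testing set, so no information from the test data is used.
medians = train.median()
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)
print(test_filled)
```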
Categorical and Macro Features
Add One Macro Feature
Explore numbers of categories and values for categorical features
Find the missing category in the testing set
Replace values of 'ID_metro' and other categorical features with discrete numbers
Apply one hot encoding
Add Missing Columns with Zero Values
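When a category occurs in the training set but not in the testing set, one hot encoding produces different column sets for the two. A minimal sketch of encoding and then aligning the columns with pandas follows; the column names and category labels are hypothetical, and the notebook's own code may differ.

```python
import pandas as pd

# Made-up data: the category "Arbat" is absent from the testing set.
train = pd.DataFrame({
    "sub_area": ["Arbat", "Mitino", "Arbat"],
    "num_room": [1, 2, 3],
})
test = pd.DataFrame({
    "sub_area": ["Mitino", "Mitino"],
    "num_room": [2, 1],
})

train_encoded = pd.get_dummies(train, columns=["sub_area"])
test_encoded = pd.get_dummies(test, columns=["sub_area"])

# Add the columns that exist only in the training set as all-zero columns
# and put the testing columns in the same order as the training columns.
test_encoded = test_encoded.reindex(
    columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```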
Display Correlation
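Selecting the indicators most correlated with the target can be sketched as ranking features by the absolute value of their Pearson correlation. The data below is synthetic and the column names (`full_sq`, `kremlin_km`, `noise`, `price_doc`) are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative features and one pure-noise column.
rng = np.random.RandomState(1)
n = 300
df = pd.DataFrame({
    "full_sq": rng.uniform(20, 120, n),
    "kremlin_km": rng.uniform(1, 40, n),
    "noise": rng.normal(0, 1, n),
})
df["price_doc"] = (100 * df["full_sq"] - 150 * df["kremlin_km"]
                   + rng.normal(0, 500, n))

# Correlation of every feature with the target, ranked by absolute value.
corr_with_target = df.corr()["price_doc"].drop("price_doc")
ranked = corr_with_target.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()
print(ranked)
```

Pearson correlation only measures linear association, so a feature with a strongly nonlinear effect can rank low and may still be worth keeping.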
Scale, Shuffle and Split the Data
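A minimal sketch of this step with scikit-learn, on random data. The choice of `StandardScaler`, the 80/20 split, and the random seed are assumptions for the example; the scaler is fitted on the training part only so that test statistics do not leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for the feature matrix and the target.
rng = np.random.RandomState(2)
X = rng.uniform(0, 100, size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1, 200)

# Shuffle and split in one call.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2)

# Fit the scaler on the training data, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```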
Benchmark Models
To compare prediction quality, I chose regression ensemble algorithms, which are among the most effective for financial indicators, and neural networks (for example, multilayer perceptrons).
In addition, I wanted to find out the highest accuracy that each of the presented algorithms can achieve
and whether the price trends predicted by all the techniques used will coincide.
Regressors with Dimensionality Reduction. Sklearn
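One way to combine dimensionality reduction with a regressor is a scikit-learn pipeline with PCA in front of the model. The sketch below uses synthetic data with three hidden factors; the number of components, the choice of `GradientBoostingRegressor`, and the data itself are assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 hidden factors.
rng = np.random.RandomState(3)
latent = rng.normal(size=(400, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(400, 10))
y = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=400)

# Scale, reduce 10 features to 3 principal components, then fit the regressor.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    GradientBoostingRegressor(random_state=3),
)
model.fit(X[:300], y[:300])

r2 = model.score(X[300:], y[300:])
n_components_kept = model.named_steps["pca"].n_components_
print(n_components_kept, round(r2, 3))
```

Keeping PCA inside the pipeline ensures that the projection is learned from the training rows only.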
MLP Regressors with Dimensionality Reduction. Sklearn
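The same pipeline idea applies to a multilayer perceptron. This is a sketch under assumed settings (two hidden layers of 32 units, 3 principal components, synthetic data); suitable values for the real dataset would need to be tuned.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 hidden factors.
rng = np.random.RandomState(4)
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(400, 10))
y = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=400)

# Neural networks are sensitive to feature scale, so scaling comes first.
mlp = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=4),
)
mlp.fit(X[:300], y[:300])

mlp_r2 = mlp.score(X[300:], y[300:])
print(round(mlp_r2, 3))
```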
Neural Networks. Keras
Display Predictions
Additional Code Cell