Machine Learning Engineer Nanodegree

Capstone Project

📑 P6: Sberbank Russian Housing Market. Part 1

Links and Code Library

Resources

🕸scikit-learn. Machine Learning in Python 🕸Scipy Lecture Notes

Versions

🕸Colaboratory Version 🕸Kaggle Version

Code Library

Set of Functions

Capstone Proposal Overview

The goal of this capstone project proposal is to apply what we have learned throughout the Nanodegree Program to a problem of our own choice and to solve it with machine learning techniques.
A project proposal encompasses seven key points:
- project's domain background: the field of research from which the project is derived;
- problem statement: the problem being investigated, for which a solution will be defined;
- datasets and inputs: the data or inputs used for the problem;
- solution statement: the solution proposed for the given problem;
- benchmark model: a simple or historical model or result to compare the defined solution to;
- set of evaluation metrics: functional representations of how the solution can be measured;
- outline of the project design: how the solution will be developed and the results obtained.
The full project report with the results will be completed and published as well.

Domain Background

Housing costs demand a significant investment from both consumers and developers.
And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses.
Sberbank, Russia’s oldest and largest bank, helps its customers by making predictions about realty prices so that renters, developers, and lenders are more confident when they sign a lease or purchase a building.
Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge.
Complex interactions between housing features such as the number of bedrooms and the location are enough to make pricing predictions complicated.
Adding an unstable economy to the mix means Sberbank and their customers need more than simple regression models in their arsenal.

Problem Statement

Sberbank is challenging programmers to develop algorithms which use a broad spectrum of features to predict realty prices.
Algorithm applications rely on a rich dataset that includes housing data and macroeconomic patterns.
An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.
My proposed solution is to select the indicators most correlated with the target variable and apply ensemble algorithms, which have repeatedly shown successful results in the study of real estate price trends.
Boosting and bagging methods combine several models at once in order to improve the prediction accuracy on learning problems with a numerical target variable.
Then I'm going to explore different types of neural networks for regression and try to reach the same level of model performance as with the ensemble methods.
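
The bagging and boosting idea described above can be illustrated with a minimal sketch. It assumes scikit-learn's `BaggingRegressor` and `GradientBoostingRegressor` as representatives of the two families and uses synthetic data from `make_regression` as a stand-in for the housing features; the estimators and hyperparameters are illustrative choices, not the ones fixed by this proposal.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and prices (assumption).
X, y = make_regression(n_samples=500, n_features=10, n_informative=6,
                       noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Bagging averages many trees fitted on bootstrap samples;
# boosting fits trees sequentially, each one correcting the previous errors.
models = {
    'bagging': BaggingRegressor(n_estimators=50, random_state=1),
    'boosting': GradientBoostingRegressor(n_estimators=200, random_state=1),
}

ensemble_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    ensemble_scores[name] = r2_score(y_test, model.predict(X_test))
    print(name, round(ensemble_scores[name], 3))
```

Both models are scored on the same held-out split, so their R2 values are directly comparable.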

Datasets and Inputs

The basis for the investigation is a large number of economic indicators that influence pricing, together with the prices themselves (train.csv and test.csv).
Macroeconomic variables for the transaction dates are collected in a separate file (macro.csv).
In addition, a detailed description of the variables is provided (data_dictionary.txt).
For practical reasons, I have not analyzed all the data and have chosen the following independent variables:
- the dollar rate, which traditionally affects the Russian real estate market;
- the distance in km from the Kremlin (the closer to the center of the city, the more expensive);
- indicators characterizing the availability of urban infrastructure nearby (schools, medical and sports centers, supermarkets, etc.);
- indicators of a particular living space (number of rooms, floor, etc.);
- proximity to transport nodes (for example, to the metro);
- indicators of population density and employment in the region of housing accommodation.
All these economic indicators have a strong influence on price formation and can be used as a basic set for regression analysis.
Examples of numerical variables: the distance to the metro, the distance to the school, the dollar rate at the transaction moment, the area of the living space.
Examples of categorical variables: neighborhoods, the nearest metro station, the number of rooms.
The goal of the project is to predict the price of housing using the chosen set of numerical and categorical variables.
The predicted target isn't discrete, and all values of this dependent variable are given for the training set, so it's necessary to apply regression algorithms of supervised learning.

Data Description

Load and Display the Data

Solution Statement

Selection of Features

Create lists of the features

Create the distribution plot for the target

Create the table of descriptive statistics

Fill in Missing Values
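
A minimal sketch of one way to carry out this step, assuming median imputation (the filling strategy is an assumption, as are the toy frames and the column names `num_room` and `floor`). The medians are computed on the training set only and reused for the testing set, so no information from the test data leaks into the preprocessing.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and testing sets (assumption).
train = pd.DataFrame({'num_room': [2.0, np.nan, 3.0, 1.0],
                      'floor': [5.0, 9.0, np.nan, 2.0]})
test = pd.DataFrame({'num_room': [np.nan, 2.0],
                     'floor': [3.0, np.nan]})

# Column medians from the training set; NaN values are skipped by default.
medians = train.median()

# fillna with a Series fills each column with the value under the same label.
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)

print(train_filled)
print(test_filled)
```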

Categorical and Macro Features

Add One Macro Feature
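
A minimal sketch of attaching a single macroeconomic indicator (here the dollar rate) to the transactions by date. The toy frames stand in for train.csv and macro.csv, and the column names `timestamp`, `usdrub`, `full_sq`, `cpi` and `price_doc` are assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-ins for train.csv and macro.csv (values are made up).
train = pd.DataFrame({
    'timestamp': ['2014-01-10', '2014-01-11', '2014-01-13'],
    'full_sq': [43, 60, 35],
    'price_doc': [5850000, 7100000, 4900000],
})
macro = pd.DataFrame({
    'timestamp': ['2014-01-10', '2014-01-11', '2014-01-12', '2014-01-13'],
    'usdrub': [33.1, 33.2, 33.2, 33.3],
    'cpi': [400.1, 400.1, 400.1, 400.1],
})

# Keep only the date key and the one macro column, then left-join on the
# date so that every transaction row is preserved.
usd_rate = macro[['timestamp', 'usdrub']]
train_with_usd = train.merge(usd_rate, on='timestamp', how='left')

print(train_with_usd)
```

A left join keeps the number of transactions unchanged even when the macro file contains dates without any transaction.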

Explore numbers of categories and values for categorical features

Find the missing category in the testing set

Replace values of 'ID_metro' and other categorical features by discrete numbers
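
A minimal sketch of this replacement, assuming the mapping is built from the union of the values seen in both sets so that the same station gets the same code everywhere. The toy values and the helper column `ID_metro_code` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: the testing set contains a value (224) missing from training.
train = pd.DataFrame({'ID_metro': [45, 12, 45, 200]})
test = pd.DataFrame({'ID_metro': [12, 224, 45]})

# Collect every category from both sets and number them 0, 1, 2, ...
categories = sorted(set(train['ID_metro']) | set(test['ID_metro']))
code_of = {value: code for code, value in enumerate(categories)}

# Replace the original identifiers with the consecutive codes.
train['ID_metro_code'] = train['ID_metro'].map(code_of)
test['ID_metro_code'] = test['ID_metro'].map(code_of)

print(code_of)
```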

Apply one hot encoding

Add Missing Columns with Zero Values
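
The one hot encoding and column alignment steps can be sketched together as follows. This is a minimal illustration that assumes `pandas.get_dummies` for the encoding and `DataFrame.reindex` for the alignment; the toy frames and the column names `sub_area` and `full_sq` are assumptions.

```python
import pandas as pd

# Toy stand-ins: the category 'Arbat' never occurs in the testing set.
train = pd.DataFrame({'sub_area': ['Arbat', 'Mitino', 'Arbat'],
                      'full_sq': [40, 55, 38]})
test = pd.DataFrame({'sub_area': ['Mitino', 'Mitino'],
                     'full_sq': [50, 61]})

# One hot encode the categorical column in each set separately.
train_encoded = pd.get_dummies(train, columns=['sub_area'])
test_encoded = pd.get_dummies(test, columns=['sub_area'])

# Give the testing set exactly the training columns, in the same order;
# columns absent from the testing set are created and filled with zeros.
test_encoded = test_encoded.reindex(columns=train_encoded.columns,
                                    fill_value=0)

print(test_encoded)
```

After the alignment both sets have the same feature columns, which is what a fitted model expects at prediction time.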

Display Correlation
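
A minimal sketch of ranking features by their Pearson correlation with the target. The data are synthetic, and the column names `full_sq`, `kremlin_km`, `noise_feature` and `price_doc` as well as the coefficients used to generate the price are assumptions chosen only to make the example readable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: area raises the price, distance to the centre lowers it.
full_sq = rng.uniform(30, 100, n)
kremlin_km = rng.uniform(1, 40, n)
noise_feature = rng.normal(size=n)
price = 150000 * full_sq - 80000 * kremlin_km + rng.normal(0, 500000, n)

df = pd.DataFrame({'full_sq': full_sq,
                   'kremlin_km': kremlin_km,
                   'noise_feature': noise_feature,
                   'price_doc': price})

# Correlation of every feature with the target, ordered by absolute value.
target_corr = df.corr()['price_doc'].drop('price_doc')
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```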

Scale, Shuffle and Split the Data
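
A minimal sketch of this step, assuming scikit-learn's `train_test_split` and `StandardScaler`; the random arrays, the 80/20 proportion and the choice of scaler are assumptions. The scaler is fitted on the training part only and then applied to both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Random stand-ins for the feature matrix and the price vector (assumption).
features = rng.uniform(0, 100, size=(100, 4))
target = rng.uniform(1e6, 1e7, size=100)

# Shuffle the rows and hold out 20% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=True, random_state=1)

# Learn the mean and standard deviation from the training part only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```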

Benchmark Models

To compare the prediction quality, I chose the most effective (for financial indicators) regression ensemble algorithms and neural networks (for example, multilayer perceptrons).
In addition, I want to find out the highest accuracy that each of the presented algorithms can achieve and whether the price trends predicted by all the techniques used will coincide.
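
The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming `GradientBoostingRegressor` as the ensemble and `MLPRegressor` as the multilayer perceptron; the agreement of predicted trends is measured here by the correlation between the two prediction vectors, which is my own illustrative choice of measure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(n_samples=600, n_features=8, n_informative=5,
                       noise=5.0, random_state=2)
# Standardize the target so the perceptron trains on a well-scaled output.
y = (y - y.mean()) / y.std()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=2)
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=2)

pred_gbr = gbr.fit(X_train, y_train).predict(X_test)
pred_mlp = mlp.fit(X_train, y_train).predict(X_test)

# Quality of each model on the same held-out data.
r2_gbr = r2_score(y_test, pred_gbr)
r2_mlp = r2_score(y_test, pred_mlp)

# Do the two models predict the same trend? Correlate their predictions.
trend_agreement = np.corrcoef(pred_gbr, pred_mlp)[0, 1]

print(round(r2_gbr, 3), round(r2_mlp, 3), round(trend_agreement, 3))
```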