the first version: https://olgabelitskaya.github.io/Data_Analyst_ND_Project2.html
the second version: https://olgabelitskaya.github.io/MLE_ND_P0_V0.html
In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive.
The data contains demographics and voyage information from 891 of the 2224 passengers and crew on board the ship.
The description of this dataset can be read on the Kaggle website, where the data was obtained: https://www.kaggle.com/c/titanic/data
Intro to Data Science. Udacity: https://www.udacity.com/course/intro-to-data-science--ud359
Statistics in Python. Scipy Lecture Notes: http://www.scipy-lectures.org/packages/statistics/index.html
A Visual Introduction to Machine Learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
The scikit-learn metrics: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
#########################
### IMPORT LIBRARIES ###
#########################
import numpy as np
import pandas as pd
import seaborn as sns
from time import time
import matplotlib.pyplot as plt
from IPython.display import display
# Pretty display for notebooks
import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")
from matplotlib import style
%matplotlib inline
########################
### EXTRA LIBRARIES ###
########################
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, make_scorer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
import keras as ks
from keras.models import Sequential, load_model, Model
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop
from keras.layers import Dense, Dropout, LSTM
from keras.layers import Activation, Flatten, Input, BatchNormalization
from keras.layers import Convolution1D, MaxPooling1D, Conv2D, MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.wrappers.scikit_learn import KerasRegressor
Let's extract the data from the .csv file, create a pandas DataFrame and look at the available indicators:
# Load the dataset
train_file = 'titanic_train.csv'
test_file = 'titanic_test.csv'
train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)
full_data = pd.concat([train_data, test_data], ignore_index=True)
# Success
print ("The Titanic dataset has {} data points with {} variables.".format(*full_data.shape))
print ("Examples of Data Points")
display(test_data[:3].T)
The next step is to remove the Survived feature from the data and store it separately as our prediction targets.
The passenger data and the survival outcomes are now paired: for any passenger all_features.loc[i], the survival outcome is outcomes[i].
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = train_data['Survived']
all_features = train_data.drop('Survived', axis = 1)
print ("The percentage of survived passengers: {:.2f}%".format(100*outcomes.mean()))
To measure the performance of our predictions, we need metrics to score our predictions against the true outcomes of survival.
def accuracy_score_0(truth, pred):
    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        # Calculate and return the accuracy as a percent
        return ("Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100))
    else:
        return ("Number of predictions does not match number of outcomes!")
If we were asked to make a prediction about any passenger aboard the RMS Titanic whom we knew nothing about, then the best prediction we could make would be that they did not survive. This is because we can assume that a majority of the passengers (more than 50%) did not survive the ship sinking.
The predictions_0() function below will always predict that a passenger did not survive. Let's check its accuracy.
def predictions_0(data):
    """ Model with no features. Always predicts a passenger did not survive. """
    predictions = []
    for _, passenger in data.iterrows():
        # Predict the survival of 'passenger'
        predictions.append(0)
    # Return our predictions
    return pd.Series(predictions)
# Make the predictions that a passenger did not survive
predictions = predictions_0(all_features)
# Check accuracy
accuracy_score_0(outcomes, predictions)
Now let us examine the feature Sex, the outcome Survived and their possible dependence.
Here is a summary table for these categories, including the indicators in percentages:
# Groups by sex
number_by_sex = pd.Series(train_data.groupby('Sex').count()['PassengerId'])
survived_by_sex = pd.Series(train_data.groupby('Sex').sum()['Survived'])
def percent_xy(x, y):
    return round(100.0*x/y, 2)
# Count percentages
number_by_sex_in_per = pd.Series(percent_xy(number_by_sex, len(train_data)))
survived_by_sex_in_per1 = pd.Series(percent_xy(survived_by_sex, outcomes.sum()))
survived_by_sex_in_per2 = pd.Series(percent_xy(survived_by_sex, number_by_sex))
sex_data = pd.DataFrame({'Number by sex': number_by_sex,
'Number by sex in percentages':number_by_sex_in_per,
'Survived by sex':survived_by_sex,
'Survived by sex in percentages I':survived_by_sex_in_per1,
'Survived by sex in percentages II':survived_by_sex_in_per2})
sex_data
Combining the data in one histogram clearly shows the tendency: female passengers were more likely to survive in this instance.
# plot parameters
plt.rcParams['figure.figsize'] = (13, 6)
style.use('seaborn-bright')
fig = plt.figure()
ax = fig.add_subplot(111)
ind = np.array([1,2]) # the x locations for the indicator
width = 0.2 # the width of the bars
## Bars
rects1 = ax.bar(ind, number_by_sex_in_per, width, alpha = 0.7)
rects2 = ax.bar(ind+width, survived_by_sex_in_per1, width, alpha = 0.7)
rects3 = ax.bar(ind+2*width, survived_by_sex_in_per2, width, alpha = 0.7)
# Axes and labels
ax.set_xlim(-width+1,len(ind)+width+0.6)
ax.set_ylim(0,90)
ax.set_ylabel('Values by sex in percentages')
ax.set_title('Passengers, Statistics by Sex in Percentages', fontsize=20)
xTickMarks = ['female','male']
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=0, fontsize=30)
## Legend
ax.legend((rects1[0], rects2[0], rects3[0]),
('Number by sex in percentages',
'Survived by sex in percentages I',
'Survived by sex in percentages II'));
Let's build on our previous prediction: If a passenger was female, then we will predict that they survived. Otherwise, we will predict the passenger did not survive.
def predictions_1(data):
    """ Model with one feature:
        - Predict a passenger survived if they are female. """
    predictions = []
    for _, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            predictions.append(1)
        else:
            predictions.append(0)
    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions1 = predictions_1(all_features)
# Check accuracy
accuracy_score_0(outcomes, predictions1)
Using just the Sex feature for each passenger, we are able to increase the accuracy of our predictions by a significant margin. Now, let's consider using an additional feature to see if we can further improve our predictions. We will start by looking at the Age.
# Data for male passengers
male = train_data[train_data.Sex=='male']
male.Age.hist(alpha=0.7)
male[male.Survived==1].Age.hist(alpha=0.7)
# Labels, title, legend
plt.xlabel("Age")
plt.ylabel("Number by age")
plt.title("Male Passengers, Statistics by Age", fontsize=20)
plt.legend(["Number of Male Passengers",
"Male Survived Passengers by Age"])
def predictions_2(data):
    """ Model with two features:
        - Predict a passenger survived if they are female.
        - Predict a passenger survived if they are male and younger than 10. """
    predictions = []
    for _, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            predictions.append(1)
        elif passenger['Sex'] == 'male' and passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)
    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions2 = predictions_2(all_features)
# Check accuracy
accuracy_score_0(outcomes, predictions2)
Adding the feature Age as a condition in conjunction with Sex improves the accuracy by a small margin over using the feature Sex alone. Now we can try to find a series of features and conditions to split the data on in order to reach a prediction accuracy of at least 80%. This may require multiple features and multiple levels of conditional statements; we can also use the same feature multiple times with different conditions.
Here are some experiments and the resulting function predictions_final():
# Data for male and young passengers
young_male = male[male.Age < 15]
survived_young_male = young_male[young_male.Survived==1]
# Create histogram
plt.hist((young_male.Pclass, survived_young_male.Pclass),
bins=range(1,5), rwidth=0.7, alpha=0.7, align='left')
# Ticks, labels, title, legend
plt.xticks([1,2,3])
plt.xlabel("Pclass")
plt.ylabel("Number by Pclass")
plt.title("Male Young Passengers, Statistics by Pclass", fontsize=20)
plt.legend(["Male Young Passengers", "Survived Male Young Passengers"])
# Create histogram
plt.hist((young_male.SibSp, survived_young_male.SibSp),
bins=range(0,7), rwidth=0.5, align='left', alpha=0.7)
# Labels, title, legend
plt.xlabel("SibSp")
plt.ylabel("Number by SibSp")
plt.title("Male Young Passengers, Statistics by SibSp", fontsize=20)
plt.legend(["Male Young Passengers", "Survived Young Male Passengers"])
def predictions_final(data):
    """ Model with multiple features. Makes a prediction with an accuracy of at least 80%. """
    predictions = []
    for _, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            predictions.append(1)
        elif passenger['Pclass'] in [1, 2] and (passenger['Age'] < 16 or passenger['Age'] > 75):
            predictions.append(1)
        elif passenger['Age'] < 15 and passenger['SibSp'] < 3:
            predictions.append(1)
        else:
            predictions.append(0)
    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions_final = predictions_final(all_features)
# Check accuracy
accuracy_score_0(outcomes, predictions_final)
In my opinion, the final set of features Sex, Age, SibSp and Pclass is the most informative.
As we noted, the percentage of survivors is much higher among women than among men, and this was used in our predictions.
Next, I proceed from the assumption that, for humanitarian reasons, people rescue children and elders first. Unfortunately, in this dataset this was only valid for the passengers of the first and second classes.
The latest refinement, which pushes the prediction accuracy past the 80% border: if a family had more than three children, the whole family was unlikely to survive in a catastrophic situation and an atmosphere of panic.
Let's evaluate the quality of the prediction.
print ("Predictions have an accuracy of {:.2f}%.".format(accuracy_score(outcomes, predictions_final)*100))
print ("Predictions have a recall score equal to {:.2f}%.".format(recall_score(outcomes, predictions_final)*100))
print ("Predictions have a precision score equal to {:.2f}%.".format(precision_score(outcomes, predictions_final)*100))
The evaluation terminology: accuracy is the share of correct predictions; precision is the share of predicted survivors who actually survived; recall is the share of actual survivors who were correctly predicted; the F1 score is the harmonic mean of precision and recall.
In this part of the project, I have used a manual implementation of a simple machine learning model: the decision tree, which splits a set of data into smaller and smaller groups (called nodes) by one feature at a time. The predictions become more accurate if each of the resulting subsets is more homogeneous (contains similar labels) than before.
A decision tree is just one of many models that come from supervised learning, i.e. learning a model from labeled training data, for which the desired outputs are already known, in order to make predictions about unseen or future data.
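A decision tree can also be built automatically with scikit-learn; here is a minimal sketch on hypothetical toy data (sex encoded as 0/1 plus age, values invented purely to illustrate the splitting idea):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical samples: [sex (0 = female, 1 = male), age]
X = [[0, 5], [0, 30], [1, 8], [1, 40]]
y = [1, 1, 1, 0]  # invented survival labels

# A shallow tree splits on one feature at a time
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[1, 50]]))  # an older male falls into the non-survivor leaf
```

The fitted tree reproduces a rule very similar to the handmade predictions above: females and young passengers land in "survived" leaves.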
I think this machine learning problem belongs to the classification field: we need to predict the labels for the passengers, 'yes' or 'no', for the feature 'Survived'.
For simplicity, the border between regression and classification can be described in this way: regression predicts continuous quantities (e.g. a price or an age), while classification predicts discrete labels (e.g. survived or not survived).
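The scikit-learn metrics imported above can be illustrated on made-up labels (the values below are invented, not taken from the Titanic data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # invented actual outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # invented predicted outcomes

# 6 of 8 predictions match the truth
print(accuracy_score(y_true, y_pred))    # 0.75
# 3 of 4 predicted positives are true positives
print(precision_score(y_true, y_pred))   # 0.75
# 3 of 4 actual positives were found
print(recall_score(y_true, y_pred))      # 0.75
# harmonic mean of precision and recall
print(f1_score(y_true, y_pred))          # 0.75
```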
Let's display some important information about the data training set.
# Calculate the number of passengers
n_passengers = len(train_data)
# Calculate the number of features
n_features = len(train_data.columns)
# Number of survived passengers
n_survived = len(train_data[train_data['Survived'] == 1])
# Number of non-survived passengers
n_nonsurvived = len(train_data[train_data['Survived'] == 0])
# Print the results
print ("Total number of passengers: {}".format(n_passengers))
print ("Number of features: {}".format(n_features))
print ("Number of passengers who survived: {}".format(n_survived))
print ("Number of passengers who did not survive: {}".format(n_nonsurvived))
print ("Percentage of survived passengers: {:.2f}%".format(100.0*n_survived/n_passengers))
# Show the list of columns
print ("\nFeature columns:\n{}".format(list(all_features.columns)))
print ("\nTarget column: 'Survived'")
Let's exclude the features that cannot have an influence on the target and the feature Cabin that has too many NaN values.
# Count nan values for the feature 'Cabin'
n_nan_cabin = pd.isnull(all_features['Cabin']).sum()
print ("Percentage of NaN values in 'Cabin': {:.2f}%".format(100.0*n_nan_cabin/n_passengers))
# Setup a new feature list
feature_list2 = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
pre_features = all_features[feature_list2]
test_features = test_data[feature_list2]
pre_features.head(11).T
Now I should count the missing values for the remaining variables.
print ("Number of missing values in the train set\n")
for feature in feature_list2:
    print("{}: {}".format(feature, pd.isnull(pre_features[feature]).sum()))
print('')
print ("Number of missing values in the test set\n")
for feature in feature_list2:
    print("{}: {}".format(feature, pd.isnull(test_features[feature]).sum()))
As we can see, there are several non-numeric columns that need to be converted. Some of them have simply two values (male/female, 1/0, etc.). These features can be reasonably converted into binary values.
Other columns, like Embarked, have more than two values and are known as categorical variables. The recommended way to handle such a column is to create as many columns as there are possible values and assign 1 to one of them and 0 to all others.
These generated columns are sometimes called dummy variables, and I will use the pandas.get_dummies() function to perform this transformation.
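A toy illustration of this transformation with pandas.get_dummies (the values are hypothetical):

```python
import pandas as pd

# A small made-up categorical column with the three ports of embarkation
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One new 0/1 column per distinct value
dummies = pd.get_dummies(embarked, prefix='Embarked')
print(list(dummies.columns))   # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```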
In the preprocessing function, I will also replace the missing values found in the features Age and Fare with the mean, and in the feature Embarked with the most common value.
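A toy sketch of these replacements on hypothetical values:

```python
import numpy as np
import pandas as pd

# Invented columns with missing entries
age = pd.Series([22.0, np.nan, 38.0])
embarked = pd.Series(['S', 'S', None, 'C'])

# Numeric feature: fill with the mean of the observed values
age_filled = age.fillna(age.mean())                       # mean of [22, 38] is 30
# Categorical feature: fill with the most common observed value
embarked_filled = embarked.fillna(embarked.value_counts().index[0])

print(age_filled.tolist())        # [22.0, 30.0, 38.0]
print(embarked_filled.tolist())   # ['S', 'S', 'S', 'C']
```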
def preprocess_features(X):
    ''' Preprocesses the data: fills in missing values, converts non-numeric
    binary variables into binary (0/1) variables and converts categorical
    variables into dummy variables. '''
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)
    # Precompute the replacements for missing values
    most_common = X['Embarked'].value_counts().index[0]
    age_mean = X['Age'].mean()
    fare_mean = X['Fare'].mean()
    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        # Fill missing values: the most common value for 'Embarked',
        # the mean for 'Age' and 'Fare'
        if col == 'Embarked':
            col_data = col_data.fillna(most_common)
        if col == 'Age':
            col_data = col_data.fillna(age_mean)
        if col == 'Fare':
            col_data = col_data.fillna(fare_mean)
        # If data type is non-numeric, replace all male/female values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['male', 'female'], [1, 0])
        # If data type is still non-numeric, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'Embarked' => 'Embarked_C', 'Embarked_Q' and 'Embarked_S'
            col_data = pd.get_dummies(col_data, prefix = col)
        # Collect the revised columns
        output = output.join(col_data)
    return output
Now let's apply these transformations step by step and look at the result.
pre_features2 = pre_features.copy()
test_features2 = test_features.copy()
pre_features2.head(15).T
pre_features2['Sex'].replace({'male' : 1, 'female': 0}, inplace=True)
test_features2['Sex'].replace({'male' : 1, 'female': 0}, inplace=True)
age_mean = pd.concat([pre_features2.Age, test_features2.Age], ignore_index=True).mean()
fare_mean = pd.concat([pre_features2.Fare, test_features2.Fare], ignore_index=True).mean()
most_common = pd.get_dummies(pre_features2['Embarked']).sum().sort_values(ascending=False).index[0]
most_common
pre_features2['Embarked'] = pre_features2['Embarked'].fillna(most_common)
pre_features2['Age'] = pre_features2['Age'].fillna(age_mean)
test_features2['Age'] = test_features2['Age'].fillna(age_mean)
test_features2['Fare'] = test_features2['Fare'].fillna(fare_mean)
pre_features2.loc[61]
pd.isnull(pre_features2).sum()
pd.isnull(test_features2).sum()
pre_features2 = pd.get_dummies(pre_features2, prefix=['Embarked'], columns=['Embarked']);
pd.isnull(pre_features2).sum()
test_features2 = pd.get_dummies(test_features2, prefix=['Embarked'], columns=['Embarked']);
pd.isnull(test_features2).sum()
For predictions I have chosen the following models:
Let's have a look at their applications and characteristics:
1) GradientBoostingClassifier.
2) RandomForestClassifier.
3) AdaBoostClassifier.
All these classifiers will produce good enough predictions in this case. We should produce the result as a ranking, and it is a well-known fact that classification tends to be a better paradigm for ranking than regression.
Let's initialize three helper functions which we can use for training and testing the three supervised learning models we've chosen above.
The functions are as follows:
This function will report the F1 score for the training data.
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    # Print the results
    print ("Trained model in {:.4f} seconds".format(end - start))

def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    # Print and return results
    print ("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label=1)

def train_predict(clf, X_train, y_train):
    ''' Trains and predicts using a classifier based on F1 score. '''
    # Indicate the classifier and the training set size
    print ("Training a {} using a training set size of {}. . .".\
           format(clf.__class__.__name__, len(X_train)))
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    # Print the result of prediction for the training set
    print ("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
With the predefined functions above, I will now run the train_predict function for each of the three supervised learning models. We need to train and predict with each classifier for three different training set sizes: 200, 400, and all data points.
Hence, nine different outputs will be displayed below — three for each model using the varying training set sizes.
# Initialize the three models
clf_A = AdaBoostClassifier()
clf_B = GradientBoostingClassifier()
clf_C = RandomForestClassifier()
# Set up the training set sizes
X_train_200, y_train_200 = pre_features2[:200], outcomes[:200]
X_train_400, y_train_400 = pre_features2[:400], outcomes[:400]
X_train, y_train = pre_features2, outcomes
X_test = test_features2
# Execute the 'train_predict' function for each classifier and each training set size
for clf in [clf_A, clf_B, clf_C]:
    for (X, y) in [(X_train_200, y_train_200),
                   (X_train_400, y_train_400),
                   (X_train, y_train)]:
        train_predict(clf, X, y)
The RandomForestClassifier shows the best result, so I will try to apply it to the test predictions.
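GridSearchCV and make_scorer are imported above but not used; a minimal sketch of how they could tune the RandomForestClassifier with an F1 scorer (the parameter grid values and the generated stand-in data are illustrative assumptions, not tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the project this would be pre_features2 and outcomes
X, y = make_classification(n_samples=60, random_state=0)

# Hypothetical parameter grid
params = {'n_estimators': [10, 50], 'max_depth': [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), params,
                    scoring=make_scorer(f1_score), cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The best estimator found this way (grid.best_estimator_) could then replace the default RandomForestClassifier below.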
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(test_features2)
y_pred
submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": y_pred})
# submission.to_csv('C:/users/OlgaBelitskaya/submission.csv', index=False)
pre_features_array = pre_features2.values
outcomes_array = outcomes.values
test_features_array = test_features2.values
X_train2, X_test2, y_train2, y_test2 = train_test_split(pre_features_array, outcomes_array,
test_size=0.2, random_state=1)
X_train2.shape, X_test2.shape
def loss_plot(fit_history):
    plt.figure(figsize=(14, 6))
    plt.plot(fit_history.history['loss'], label = 'train')
    plt.plot(fit_history.history['val_loss'], label = 'test')
    plt.legend()
    plt.title('Loss Function');

def acc_plot(fit_history):
    plt.figure(figsize=(14, 6))
    plt.plot(fit_history.history['acc'], label = 'train')
    plt.plot(fit_history.history['val_acc'], label = 'test')
    plt.legend()
    plt.title('Accuracy');

def mlp_model():
    model = Sequential()
    model.add(Dense(18, input_dim=9, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(9, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
mlp_model = mlp_model()
mlp_history = mlp_model.fit(X_train2, y_train2, validation_data=(X_test2, y_test2),
epochs=100, batch_size=8, verbose=0)
loss_plot(mlp_history)
acc_plot(mlp_history)
mlp_scores = mlp_model.evaluate(X_test2, y_test2, verbose=0)
print("Accuracy: %.2f%%" % (mlp_scores[1]*100))
y_pred_mlp = np.round(mlp_model.predict(test_features_array)).astype(int)
y_pred_mlp[:,0]
submission_mlp = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": y_pred_mlp[:,0]})
submission_mlp.to_csv('submission_mlp.csv', index=False)
len(submission_mlp["Survived"])
sum(submission_mlp["Survived"]==submission["Survived"])
Overestimating the meaning and applicability of machine learning is unlikely to lead to success. Supervised methods have a special importance because their predictions can be continuously checked against the results of real actions.
There are several natural ideas for applying supervised learning.
I. For every catastrophic situation, find out the exact sequence of steps and technical facilities which maximally decrease the damage. On the basis of the identified trends, it is possible to develop, and check in practice, clear guidelines to save lives and restore economic activity (for example, during and after floods). Applying scientific methods in this case means thousands of saved lives and a quick recovery of the economy. Useful features could describe the disaster (area, time period), the damage (human lives, economic indicators) and the recovery process (speed, effectiveness).
II. The same techniques could be useful in creating self-learning objects in virtual reality, in order to bring the process of their development close to that of their real counterparts, predict it and make corrections in time. Here the set of concrete features is very individual and depends on the object: for example, growth, flowering, etc. for a plant and its imitation.