📘 P2: Titanic

1. References

Datasets:

In this project I worked with the "Titanic Data" data set from the Udacity website.

https://www.udacity.com/api/nodes/5420148578/supplemental_media/titanic-datacsv/download?_ga=1.57940330.157342817.1461748645

It contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.

The following link leads to the description of this data set on the Kaggle website, where the data was obtained.

https://www.kaggle.com/c/titanic/data

Articles:

http://matplotlib.org/examples/pylab_examples/

http://people.duke.edu/~ccc14/pcfb/numpympl/MatplotlibBarPlots.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Resources:

Online Statistics Education: An Interactive Multimedia Course of Study. Project Leader: David M. Lane, Rice University.

http://onlinestatbook.com/2/index.html

2. Selection of Tools

I chose the Jupyter notebook because it lets me submit both the code I wrote and the report of my findings in the same document.

The cells below import the libraries and define the helper functions used in the report.

In [1]:
import pandas as pd
import numpy as np
import scipy
In [8]:
%pylab inline
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
Populating the interactive namespace from numpy and matplotlib
In [3]:
from scipy import stats
from pylab import plot,show
import seaborn as sns
from operator import truediv
In [4]:
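def convert_list_to_int(x):
    # Flatten a list of one-element lists, e.g. [[71], [93]] -> [71, 93].
    # Despite the name, the values are not cast to int.
    y =[]
    for element in x:
        el = element[0]
        y.append(el)
    return y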
In [5]:
def percentages_xy(x,y):
    return 100.0*x/y
In [6]:
def pieplot(x,xlabel):
    figure(1, figsize=(5,5))
    ax = axes([0.1, 0.1, 0.8, 0.8])

    labels = '1', '2', '3'
    fracs = np.array(x)
    explode=(0, 0.05, 0)

    pie(fracs, explode=explode, labels=labels,
                autopct='%1.0f%%', shadow=True, startangle=0)
    title(xlabel)
    show() 
In [7]:
def pearson_stat(x,y,ylabel):
    x1 = np.array(x)
    y1 = np.array(y) 

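    # The last value returned by linregress is the standard error of the slope;
    # the "standard deviation" column header printed below refers to that value.
    slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)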

    print '     slope                   r                    standard deviation   '
    print '                                                               '
    print '   ', slope, "      ", r_value, "        ", std_err

    line = slope*x1+intercept
    plot(x1,line,'m-',x1,y1,'o')
    pylab.xlim([x[0]-0.5,x[-1]+0.5])
    if x == pclass_list:
        pylab.xlabel('Pclass')
    elif x == index0:
        pylab.xlabel('Group by fare')
    else:
        pylab.xlabel('Age category')
    pylab.ylabel(ylabel)
    show()

3. Analysis of the data

3.1 Dataset.

Let's have a look at the data set and do some basic statistical calculations on it using code. We extract the data from the file and look at the available indicators.

In [9]:
titanic_df = pd.read_csv('/Users/olgabelitskaya/Downloads/titanic_data.csv')
titanic_df.head()
Out[9]:
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S

The most interesting variable here is "Survived". The main question is: what factors made people more likely to survive? Let us investigate them step by step.

I propose to examine whether survival depended on the passengers' age, sex and social status (class and fare).

The total number of passengers in the sample:

In [10]:
len(titanic_df)
Out[10]:
891

The number of survivors in the sample:

In [11]:
titanic_df['Survived'].sum()
Out[11]:
342

(and the same indicator as a percentage):

In [12]:
100*titanic_df['Survived'].mean()
Out[12]:
38.38383838383838
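
Because "Survived" is coded as 0 or 1, its mean is exactly the share of survivors. A minimal sketch that rebuilds a stand-in for the column from the two counts reported above (the list `survived` is hypothetical and only illustrates the arithmetic):

```python
# Stand-in for the "Survived" column: 342 ones and 891 - 342 zeros,
# using the counts reported above (the row order is irrelevant here).
total_passengers = 891
total_survived = 342
survived = [1] * total_survived + [0] * (total_passengers - total_survived)

# The mean of a 0/1 indicator equals the proportion of ones.
survival_rate = 100.0 * sum(survived) / len(survived)
print("survival rate: {:.2f}%".format(survival_rate))
```

This is why `100*titanic_df['Survived'].mean()` gives the percentage of survivors without any explicit counting.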

Note that the age, the cabin and the port of embarkation are missing for some of the passengers in this sample, as the summary below shows. This limits the quality of the analysis in some cases.

In [13]:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
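
The size of the gaps follows directly from the non-null counts in the summary. A small sketch of that arithmetic, with the counts copied by hand from the output above (on the real data frame, `titanic_df.isnull().sum()` should give the same counts directly):

```python
# Non-null counts copied from the info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing values per column, as a count and as a share of all rows.
missing = {}
for column in sorted(non_null):
    count = total_rows - non_null[column]
    missing[column] = count
    print("{:<9} missing: {:>3} ({:.1f}%)".format(
        column, count, 100.0 * count / total_rows))
```

So the age is unknown for roughly one passenger in five, and the cabin for about three quarters of the sample.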

3.2 Pclass

Let's find out what kind of classes were on the ship:

In [14]:
pclass = pd.Series(titanic_df['Pclass'])
pclass_list = list(set(pclass.values))
pclass_list
Out[14]:
[1, 2, 3]

and the number of passengers for each class:

In [15]:
number_by_pclass = titanic_df.groupby('Pclass').count()['PassengerId']

We can also express the number of passengers in each class as a percentage of all passengers:

In [16]:
number_by_pclass_percent = percentages_xy(number_by_pclass,len(titanic_df))
number_by_pclass_df = pd.DataFrame(data={'Number by Pclass': number_by_pclass,
                                         'Number by Pclass in percentages':number_by_pclass_percent})
number_by_pclass_df
Out[16]:
        Number by Pclass  Number by Pclass in percentages
Pclass
1                    216                        24.242424
2                    184                        20.650954
3                    491                        55.106622
In [18]:
pieplot(number_by_pclass,'Number by Pclass in percentages')
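
The class shares shown above can also be obtained in one step with `value_counts(normalize=True)`. The sketch below is only an illustration: the series `pclass` is a stand-in rebuilt from the three counts in the table, not the real column.

```python
import pandas as pd

# Stand-in for the "Pclass" column, rebuilt from the counts above.
pclass = pd.Series([1] * 216 + [2] * 184 + [3] * 491)

# normalize=True returns shares instead of counts; sort_index orders by class.
share_by_pclass = 100.0 * pclass.value_counts(normalize=True).sort_index()
print(share_by_pclass)
```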

And the number of survivors for all classes and the percentage of survivors for each class:

In [103]:
survived_by_pclass = titanic_df.groupby('Pclass').sum()['Survived']
survived_by_pclass_percent1 = percentages_xy(titanic_df.groupby('Pclass').sum()['Survived'],
                                           titanic_df['Survived'].sum())
survived_by_pclass_percent2 = percentages_xy(titanic_df.groupby('Pclass').sum()['Survived'],
                                             number_by_pclass)
survived_by_pclass_df = pd.DataFrame(data={'Survived by Pclass': survived_by_pclass,
                                         'Survived by Pclass in percentages I':survived_by_pclass_percent1,
                                          'Survived by Pclass in percentages II':survived_by_pclass_percent2,})
survived_by_pclass_df
Out[103]:
        Survived by Pclass  Survived by Pclass in percentages I  Survived by Pclass in percentages II
Pclass
1                      136                            39.766082                             62.962963
2                       87                            25.438596                             47.282609
3                      119                            34.795322                             24.236253

The variable "Survived by Pclass in percentages I" determines the percentage of passengers who survived in this class in relation to the total number of survivors.

The variable "Survived by Pclass in percentages II" determines the percentage of passengers who survived in this class in relation to the total number of passengers in this class.
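
The difference between the two percentages is easiest to see in the arithmetic itself. A short sketch using only the counts from the tables above (the dictionary names are mine):

```python
# Counts copied from the tables above (classes 1, 2, 3).
passengers = {1: 216, 2: 184, 3: 491}
survivors = {1: 136, 2: 87, 3: 119}
total_survivors = sum(survivors.values())  # 342

percent_1 = {}  # share of all survivors who travelled in this class
percent_2 = {}  # survival rate within the class
for pclass in sorted(passengers):
    percent_1[pclass] = 100.0 * survivors[pclass] / total_survivors
    percent_2[pclass] = 100.0 * survivors[pclass] / passengers[pclass]
    print("class {}: I = {:.2f}%, II = {:.2f}%".format(
        pclass, percent_1[pclass], percent_2[pclass]))
```

Percentages I add up to 100 and depend on how large each class is, so only percentages II can be compared between classes as survival rates.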

In [20]:
pieplot(survived_by_pclass,'Survived by Pclass in percentages I')

A tendency is immediately apparent: the higher the class (the lower the class number), the greater the percentage of passengers in that class who survived.

In this situation we can use the Pearson product-moment correlation coefficient as a measure of the strength of the linear relationship between two variables. Looking at the data, it can be assumed that between the independent variable "Pclass" and the dependent variable "Survived by Pclass in percentages II" there is a negative linear relationship.

In [100]:
plt.style.use('seaborn-pastel')
plt.rcParams['figure.figsize'] = (8, 4) 
pearson_stat(pclass_list,survived_by_pclass_df['Survived by Pclass in percentages II'] ,
                                               'Survived by Pclass in percentages II')
     slope                   r                    standard deviation   
                                                               
    -19.3633552086        -0.994024355227          2.12638158486

The plot and the calculations give Pearson's r = -0.994024355227. It is extremely close to r = -1, so there is a strong negative linear relationship between the variable "Pclass" and the variable "Survived by Pclass in percentages II".
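
To make explicit what r measures, here is a minimal sketch that computes it from the definition, using the three survival rates from the table above (rounded to six decimals, so the last digits may differ slightly from the output of `pearson_stat`):

```python
import numpy as np

# Class numbers and within-class survival rates from the table above.
x = np.array([1.0, 2.0, 3.0])
y = np.array([62.962963, 47.282609, 24.236253])

# Pearson's r: sum of products of deviations, scaled by both spreads.
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Least-squares slope of the fitted line.
slope = (dx * dy).sum() / (dx ** 2).sum()
print("r = {:.6f}, slope = {:.4f}".format(r, slope))
```

With only three points a value of r close to -1 or 1 arises easily, so the coefficient is best read as a description of these three rates rather than as proof of a linear law.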

3.3 Fare

We can guess that the lower the price of a ticket, the more passengers were on board with a ticket at this price. It is also natural to assume that the fare depends on the class. Therefore (bearing in mind section 3.2) it may be possible to detect a relationship between survival and the cost of the ticket.

3.3.1 First, we can find the average fare for passengers of each class.

In [29]:
fare_mean_by_class = titanic_df.groupby('Pclass').mean()['Fare']
fare_mean_by_class
Out[29]:
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

The calculations below give Pearson's r = -0.907494151378. It is close enough to r = -1, so there is a strong negative linear relationship between the independent variable "Pclass" and the dependent variable "Fare mean by class".

In [101]:
pearson_stat(pclass_list,fare_mean_by_class,'Fare mean by Pclass')
     slope                   r                    standard deviation   
                                                               
    -35.2395686991        -0.907494151378          16.3118400022

3.3.2 Let us now consider the general indicators of the fare (minimum, maximum and average values).

In [31]:
fare = pd.Series(titanic_df['Fare'])
In [32]:
print 'max =', max(fare)
print 'min =', min(fare)
print 'mean =', mean(fare)
max = 512.3292
min = 0.0
mean = 32.2042079686

We construct a histogram distributing the passengers by intervals of the fare.

In [99]:
plt.rcParams['figure.figsize'] = (6, 3)
titanic_df.Fare.hist()
plt.xlabel("Fare") 
plt.ylabel("Number by fare")
Out[99]:
<matplotlib.text.Text at 0x115a79d50>

With intervals of about 50 units, the histogram shows a very large spread between the number of passengers who paid the lowest fares and the number who paid the highest.

In [36]:
number_by_fare = pd.Series(titanic_df.groupby('Fare').count()['PassengerId'])

To better understand how the passengers are distributed by fare, I have also plotted the number of passengers for each distinct fare as a line. It looks like an interval of about 10 units will be a better choice.

In [102]:
plt.rcParams['figure.figsize'] = (8, 4)
number_by_fare.plot(color='steelblue', linewidth=1.5, linestyle="-")
plt.xlabel("Fare") 
plt.ylabel("Number by fare") 
Out[102]:
<matplotlib.text.Text at 0x11894d910>

For the next analysis all the passengers were divided into categories by fare, each about 10 units wide.

In [43]:
bins1 = np.linspace(fare.min(), fare.max(), 52)
groups = fare.groupby(pd.cut(fare, bins1)).count()
groups1 = groups.tolist()

index1 = [i for i in range(0,510) if i%10 == 0]

groups_df = pd.DataFrame(data={'Number in group by fare': groups1}, index=index1)
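
A note on how this binning behaves: `np.linspace` with 52 edges produces 51 equal intervals (each about 512.3292 / 51 ≈ 10.05 units wide), and `pd.cut` treats them as open on the left by default. The sketch below uses a handful of hypothetical fares, chosen only to show the effect on the smallest value:

```python
import numpy as np
import pandas as pd

# Hypothetical fares, chosen only to show the binning behaviour.
fares = pd.Series([0.0, 5.0, 10.0, 12.5, 30.0])

# Four edges give three equal-width bins: (0, 10], (10, 20], (20, 30].
edges = np.linspace(fares.min(), fares.max(), 4)

# Default: intervals are open on the left, so the minimum (0.0) gets no bin.
default_counts = fares.groupby(
    pd.cut(fares, edges), observed=False).count().tolist()

# include_lowest=True closes the first interval on the left as well.
inclusive_counts = fares.groupby(
    pd.cut(fares, edges, include_lowest=True), observed=False).count().tolist()

print(default_counts)
print(inclusive_counts)
```

With the default settings, values equal to the lowest edge are therefore left out of the groups; passing `include_lowest=True` would keep them in the first one.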

We can plot the number of passengers that fall into each category.

In [98]:
groups_df.plot(color='darkred', linewidth=1.5, linestyle="-")
plt.xlabel("Fare") 
plt.ylabel("Number in group by fare") 
Out[98]:
<matplotlib.text.Text at 0x115ab9610>

Let us exclude outliers. The first 27 groups include the vast majority of passengers; the remaining groups contain only three people.

In [64]:
groups0 = groups1[:27]
index0 = index1[:27]

number_in_rest = sum(groups1[27:])
print number_in_rest
3

Now we can check whether a dependence exists.

In [94]:
pearson_stat(index0,groups0,'Number in group by fare')
     slope                   r                    standard deviation   
                                                               
    -0.558913308913        -0.617051133707          0.142556020365

In this case Pearson's r = -0.617051133707. It is not so close to r = -1, so the negative relationship between the independent variable "Group by fare" and the dependent variable "Number in group by fare" is relatively weak.

3.3.3 Let us have a look at the variable "Fare" and the variable "Survived". Here is a data frame built for these categories.

In [66]:
survived_by_fare = titanic_df.groupby('Fare').sum()['Survived']
survived_groups_df = pd.DataFrame(data={'Survived by fare': survived_by_fare})
In [67]:
survived_groups_df.reset_index(inplace=True)

As in section 3.3.2, the passengers were divided into several dozen categories by ticket price, and the number of survivors in each category was counted. Outliers were also excluded.

In [104]:
survived_groups = survived_groups_df.groupby(pd.cut(survived_groups_df["Fare"], bins1)).sum()

survived_groups1 = survived_groups.fillna(0)

survived_groups2 = survived_groups1['Survived by fare'][:27].tolist()
survived_groups2 = [int(x) for x in survived_groups2]

fare_groups_df = pd.DataFrame(data={'Number by group': groups0, 
                                    'Survived by group': survived_groups2}, 
                              index=index0)
fare_groups_df.head()
Out[104]:
    Number by group  Survived by group
0               321                 66
10              179                 76
20              144                 64
30               57                 22
40               15                  4

The data were adjusted with the function "db.fillna(0)" in two cases:

1) no values fall into a range;

2) the number of passengers in a group was zero, so the division had no defined result.

In both cases the missing data were replaced by zero.
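
The second case can be illustrated with a few hypothetical groups (the counts below are invented for the example and are not taken from the data set). Dividing zero survivors by zero passengers gives NaN, which `fillna(0)` then turns into zero:

```python
import numpy as np
import pandas as pd

# Hypothetical group sizes and survivor counts; the last two groups are empty.
number_in_group = np.array([4, 2, 0, 0])
survived_in_group = np.array([1, 2, 0, 0])

# 0/0 yields NaN (NumPy would also emit a RuntimeWarning, silenced here).
with np.errstate(divide="ignore", invalid="ignore"):
    rates = 100.0 * np.true_divide(survived_in_group, number_in_group)

# Replace the undefined rates with zero, as described above.
rates_filled = pd.Series(rates).fillna(0).tolist()
print(rates_filled)
```

An empty group then enters any later calculation as a group with a 0% survival rate, which is worth keeping in mind when reading the correlation coefficient.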

In [70]:
m = np.array(survived_groups2)
n = np.array(groups0)

# Survival rate per fare group; empty groups give 0/0 = NaN, which is replaced by 0
survived_groups_in_percentages0 = 100.0*np.true_divide(m,n)
survived_groups_in_percentages1 = pd.Series(survived_groups_in_percentages0).fillna(0)

Let us check whether a dependence exists between "Group by fare" and "Survived in group by fare in percentages".

In [93]:
pearson_stat(index0,survived_groups_in_percentages1,'Survived in group by fare in percentages')
     slope                   r                    standard deviation   
                                                               
    -0.0689445193963        -0.157400468005          0.0865119614025

In this case Pearson's r = -0.157400468005. It is close enough to r = 0, so no linear relationship is visible between the independent variable "Group by fare" and the dependent variable "Survived in group by fare in percentages".

3.4 Sex

Now let us check the variable "Sex", the variable "Survived" and their possible dependence. Here is a data frame for these categories, including the indicators in percentages.

In [105]:
number_by_sex = titanic_df.groupby('Sex').count()['PassengerId']
number_by_sex_in_percentages = percentages_xy(titanic_df.groupby('Sex').count()['PassengerId'],
                                              len(titanic_df))
In [106]:
survived_by_sex = titanic_df.groupby('Sex').sum()['Survived']
survived_by_sex_in_percentages1=percentages_xy(titanic_df.groupby('Sex').sum()['Survived'],
                                               titanic_df['Survived'].sum())
survived_by_sex_in_percentages2 = percentages_xy(titanic_df.groupby('Sex').sum()['Survived'],
                                                 number_by_sex)

For clarity, the data are combined into one table.

In [107]:
survived_by_sex_df = pd.DataFrame(data={'Number by sex': number_by_sex,
                                        'Number by sex in percentages':number_by_sex_in_percentages,
                                        'Survived by sex':survived_by_sex,
                                        'Survived by sex in percentages I':survived_by_sex_in_percentages1,
                                        'Survived by sex in percentages II':survived_by_sex_in_percentages2})
survived_by_sex_df
Out[107]:
        Number by sex  Number by sex in percentages  Survived by sex  Survived by sex in percentages I  Survived by sex in percentages II
Sex
female            314                     35.241302              233                         68.128655                          74.203822
male              577                     64.758698              109                         31.871345                          18.890815

The variable "Survived by sex in percentages I" determines the percentage of survived passengers of this sex in relation to the total number of survivors.

The variable "Survived by sex in percentages II" determines the percentage of survived passengers of this sex in relation to the total number of passengers of the same sex.
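
Percentages II could equally be obtained in one step, since the mean of the 0/1 column within each group is the survival rate of that group. A sketch under the assumption that a stand-in table rebuilt from the counts above behaves like the real one (`toy_df` is hypothetical):

```python
import pandas as pd

# Stand-in table rebuilt from the counts above: 314 women (233 survived)
# and 577 men (109 survived). Row order does not matter for the rates.
sex = ["female"] * 314 + ["male"] * 577
survived = [1] * 233 + [0] * (314 - 233) + [1] * 109 + [0] * (577 - 109)
toy_df = pd.DataFrame({"Sex": sex, "Survived": survived})

# Mean of the 0/1 column per group = survival rate within that group.
rate_by_sex = 100.0 * toy_df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```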

In [88]:
plt.rcParams['figure.figsize'] = (8, 4)
fig = plt.figure()
ax = fig.add_subplot(111)

## Data
Number_by_Sex_in_percentages = pd.Series(survived_by_sex_df['Number by sex in percentages'])
Survived_by_Sex_in_percentages1 = pd.Series(survived_by_sex_df['Survived by sex in percentages I'])
Survived_by_Sex_in_percentages2 = pd.Series(survived_by_sex_df['Survived by sex in percentages II'])

## Necessary variables
ind = np.array([1,2])           # the x locations of the two bar groups (female, male)
width = 0.2                     # the width of the bars

## Bars
rects1 = ax.bar(ind, Number_by_Sex_in_percentages, width, color='red', alpha = 0.7)
rects2 = ax.bar(ind+width, Survived_by_Sex_in_percentages1, width, color='green', alpha = 0.7)
rects3 = ax.bar(ind+2*width, Survived_by_Sex_in_percentages2, width, color='blue', alpha = 0.7)

# Axes and labels
ax.set_xlim(-width+1,len(ind)+width+0.6)
ax.set_ylim(0,90)
ax.set_ylabel('Values by sex in percentages')
ax.set_title('Number of passengers and Survived passengers by sex in percentages')
xTickMarks = ['female','male']
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=10, fontsize=30)

## Legend
ax.legend((rects1[0], rects2[0], rects3[0]), 
          ('Number by sex in percentages', 'Survived by sex in percentages I', 'Survived by sex in percentages II') )
plt.show()

Combining the data in one bar chart lets us clearly see the tendency: female passengers were more likely to survive.

Among all passengers there were fewer women than men in percentage terms, yet women made up the majority of the survivors.

3.5 Age

Next, we have a look at the variable "Age" and the variable "Survived".

In [82]:
age = pd.Series(titanic_df['Age'])

We can consider the general indicators of the age of passengers (minimum, maximum and average values).

In [83]:
print 'max =', max(age)
print 'min =', min(age)
print 'mean =', mean(age)
max = 80.0
min = 0.42
mean = 29.6991176471

We will group the passengers on the basis of age and consider their distribution.

In [84]:
number_age = titanic_df.groupby('Age').count()['PassengerId']
In [87]:
plt.rcParams['figure.figsize'] = (8, 4)
titanic_df.Age.hist()
plt.xlabel("Age")
plt.ylabel("Number by age")
Out[87]:
<matplotlib.text.Text at 0x116ed2c10>

Let us divide passengers into age categories and estimate the number of survivors in each of them. In this analysis, we do not count the passengers whose age is unknown.

In [89]:
age = pd.cut(titanic_df['Age'], [0,13,19,35,60,85], labels=['child', 'teenager', 'young','middle-aged','old'])
df_age = pd.DataFrame(data={'Age category': age,'Survived': titanic_df['Survived']})
number_by_age = df_age.groupby('Age category').count() 
number_by_age1 = number_by_age.values.tolist()
survived_by_age = df_age.groupby('Age category').sum()
survived_by_age1 = survived_by_age.values.tolist()
survived_by_age_in_percentages = percentages_xy(survived_by_age,number_by_age)
survived_by_age_in_percentages1 = survived_by_age_in_percentages.values.tolist()
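
To show how the category boundaries work, here is a small sketch with hypothetical ages (invented for the example) and the same edges and labels as above. Each interval includes its right edge, and a missing age receives no category:

```python
import pandas as pd

# Hypothetical ages, including boundary values and one missing age.
ages = pd.Series([0.42, 13.0, 13.5, 19.0, 35.0, 36.0, 80.0, None])

# Same edges and labels as in the analysis; intervals are (0, 13], (13, 19], ...
categories = pd.cut(ages, [0, 13, 19, 35, 60, 85],
                    labels=['child', 'teenager', 'young', 'middle-aged', 'old'])

# A missing age stays NaN and is skipped by groupby and count.
result = [None if pd.isnull(c) else str(c) for c in categories]
print(result)
```

So a passenger aged exactly 13 counts as a child and one aged exactly 19 as a teenager.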

Here is a data frame built for these categories.

In [91]:
df_survived_by_age = pd.DataFrame(data={'Number by age category': convert_list_to_int(number_by_age1),
                                        'Survived by age category': convert_list_to_int(survived_by_age1),
                                        'Survived by age category in percentages': 
                                        convert_list_to_int(survived_by_age_in_percentages1)},
                                 index=['child', 'teenager', 'young','middle-aged','old'])
df_survived_by_age
Out[91]:
             Number by age category  Survived by age category  Survived by age category in percentages
child                            71                        42                                59.154930
teenager                         93                        37                                39.784946
young                           333                       128                                38.438438
middle-aged                     195                        78                                40.000000
old                              22                         5                                22.727273

We assign each age category ['child', 'teenager', 'young','middle-aged','old'] the corresponding number [1, 2, 3, 4, 5] and analyze the dependence of the variable "Survived by age category in percentages" on the variable "Age category".

In [92]:
pearson_stat([1, 2, 3, 4, 5],
             convert_list_to_int(survived_by_age_in_percentages1),
             'Survived by age category in percentages')
     slope                   r                    standard deviation   
                                                               
    -7.26402599369        -0.888941929837          2.16086549863

The calculations above give Pearson's r = -0.888941929837. It is close enough to r = -1, so there is a strong negative linear relationship between the independent variable "Age category" and the dependent variable "Survived by age category in percentages".
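
These figures can be reproduced from the table with `scipy.stats.linregress`, the same function that `pearson_stat` calls. The sketch below uses the rates rounded to six decimals; it also reads the p-value that the function returns, and shows that the value printed in the last column above is what SciPy calls `stderr`, the standard error of the estimated slope.

```python
from scipy import stats

# Age categories coded 1..5 and the survival rates from the table above.
age_code = [1, 2, 3, 4, 5]
survival_rate = [59.154930, 39.784946, 38.438438, 40.000000, 22.727273]

fit = stats.linregress(age_code, survival_rate)

# stderr is the standard error of the estimated slope, not a standard deviation.
print("slope   = {:.4f}".format(fit.slope))
print("r       = {:.6f}".format(fit.rvalue))
print("stderr  = {:.4f}".format(fit.stderr))
print("p-value = {:.4f}".format(fit.pvalue))
```

The p-value refers to the hypothesis that the slope is zero; with five points it is a rough guide at best.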

4. Conclusion

So, a strong negative linear relationship was found between the following pairs of variables:

1) the independent variable "Pclass" and the dependent variable "Survived by Pclass in percentages II" (the percentage of passengers who survived in this class in relation to the total number of passengers in this class);

2) the independent variable "Pclass" and the dependent variable "Fare mean by Pclass" (the average fare for passengers in this class);

3) the independent variable "Age category" and the dependent variable "Survived by age category in percentages" (the percentage of passengers who survived in the age category in relation to the total number of passengers in this age category).

The revealed strong connections suggest that first-class passengers and children were more likely to survive.

A weak negative correlation was found between the independent variable "Group by fare" and the dependent variable "Number in group by fare". According to the graph, the relationship is unlikely to be linear.

However, the graph does suggest a tendency: the lower the cost of the ticket, the greater the number of passengers in that fare category.

No relationship was found between the independent variable "Group by fare" and the dependent variable "Survived in group by fare in percentages" (the percentage of passengers who survived in the group in relation to the total number of passengers in this group).

It should be noted that the percentage of survivors is much higher among women than among men. Analyzing the nature of this dependence with a correlation coefficient is perhaps not meaningful here, because the variable "Sex" has only two values.

Some relationships may be due to humanitarian considerations (for example, more women and young people surviving).

Some connections may have a whole range of causes. For example, the higher percentage of survivors among first-class passengers could be explained by a larger number of life-saving appliances available to them, by the ship being damaged far away from the first-class cabins, or by class prejudice among the crew, who saved the passengers of this class first.

4.3 Limitations of the research

There were a lot of limitations in this analysis, which cast doubt on some of the identified trends.

1) We have data for only 891 people, whereas about 1300 passengers, plus the crew, were on board (2224 people in total, as noted in section 1). If the missing people belong mostly to one category (for example, only men or only first-class passengers), the conclusions about that category could be fundamentally wrong.

2) We did not run any tests of statistical significance for the differences in survival rates, so we cannot be sure whether the observed differences reflect a real pattern or could have arisen by chance.
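
One possible way to address point 2, sketched here as an assumption and not as part of the analysis above, is a chi-squared test of independence. The example applies `scipy.stats.chi2_contingency` to the survival counts by sex from section 3.4; the choice of this test is mine.

```python
from scipy.stats import chi2_contingency

# Survived / did not survive, by sex, from the table in section 3.4.
#            survived  did not survive
observed = [[233, 314 - 233],   # female
            [109, 577 - 109]]   # male

# Null hypothesis: survival is independent of sex.
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 = {:.1f}, dof = {}, p = {:.3g}".format(chi2, dof, p_value))
```

A very small p-value would indicate that a difference this large is unlikely to arise by chance within the sample; it would not remove the first limitation, since the sample itself may be unrepresentative.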