📘 P2: Titanic

1. References

Datasets:

In this project I work with the "Titanic Data" data set from the Udacity website.

https://www.udacity.com/api/nodes/5420148578/supplemental_media/titanic-datacsv/download?_ga=1.57940330.157342817.1461748645

It contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.

The description of this data set is available on the Kaggle website, where the data was obtained.

https://www.kaggle.com/c/titanic/data

Articles:

http://matplotlib.org/examples/pylab_examples/

http://people.duke.edu/~ccc14/pcfb/numpympl/MatplotlibBarPlots.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Resources:

Online Statistics Education: An Interactive Multimedia Course of Study. Project Leader: David M. Lane, Rice University.

http://onlinestatbook.com/2/index.html

2. Selection of Tools

I chose the Jupyter notebook because it lets me submit both the code I wrote and the report of my findings in the same document.

The section below loads the code libraries and defines the helper functions.

import pandas as pd
import numpy as np
import scipy

%pylab inline
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib

from scipy import stats
from pylab import plot,show
import seaborn as sns
from operator import truediv

def convert_list_to_int(x):
    y =[]
    for element in x:
        el = element[0]
        y.append(el)
    return y

def percentages_xy(x,y):
    return 100.0*x/y

def pieplot(x,xlabel):
    figure(1, figsize=(5,5))
    ax = axes([0.1, 0.1, 0.8, 0.8])

    labels = '1', '2', '3'
    fracs = np.array(x)
    explode=(0, 0.05, 0)

    pie(fracs, explode=explode, labels=labels,
                autopct='%1.0f%%', shadow=True, startangle=0)
    title(xlabel)
    show()

def pearson_stat(x,y,ylabel):
    x1 = np.array(x)
    y1 = np.array(y) 

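    slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    # Note: std_err is the standard error of the estimated slope; the printed
    # column header below labels it "standard deviation".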

    print '     slope                   r                    standard deviation   '
    print '                                                               '
    print '   ', slope, "      ", r_value, "        ", std_err

    line = slope*x1+intercept
    plot(x1,line,'m-',x1,y1,'o')
    pylab.xlim([x[0]-0.5,x[-1]+0.5])
    if x == pclass_list:
        pylab.xlabel('Pclass')
    elif x == index0:
        pylab.xlabel('Group by fare')
    else:
        pylab.xlabel('Age category')
    pylab.ylabel(ylabel)
    show()

3. Analysis of the data

3.1 Dataset

Let's take a look at the data set and compute some basic statistics on it. We load the data from the file and look at the available variables.

titanic_df = pd.read_csv('/Users/olgabelitskaya/Downloads/titanic_data.csv')
titanic_df.head()

The most interesting variable here is "Survived". The main question is: what factors made people more likely to survive? Let us investigate them step by step.

The total number of passengers in the sample:

len(titanic_df)

891

The number of survivors in the sample:

titanic_df['Survived'].sum()

342

The same indicator as a percentage:

100*titanic_df['Survived'].mean()

38.38383838383838
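As a quick check of the figures above, the survival rate can be recomputed directly from the two counts. This is a minimal sketch in Python 3 syntax (the notebook itself uses Python 2); the variable names are illustrative and not part of the notebook.

```python
# Counts taken from the notebook output above.
total_passengers = 891
survivors = 342

# Survival rate as a percentage of the sample.
survival_rate = 100.0 * survivors / total_passengers

# The mean of a 0/1 indicator column is the same proportion,
# which is why 100 * titanic_df['Survived'].mean() gives this value.
indicator = [1] * survivors + [0] * (total_passengers - survivors)
mean_of_indicator = 100.0 * sum(indicator) / len(indicator)

print(round(survival_rate, 4))
```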

Note that the data set is missing the age, the cabin and the port of embarkation for some of the passengers in this sample. This limits the analysis in some cases.

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
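The non-null counts in the summary above translate directly into the amount of missing data per column. The sketch below only redoes that arithmetic from the printed counts; the dictionary names are illustrative.

```python
# Row count and non-null counts copied from the titanic_df.info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing values per column, as a count and as a percentage of all rows.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100.0 * m / total_rows, 1) for col, m in missing.items()}

for col in non_null:
    print(col, missing[col], missing_pct[col])
```

On the real data frame, `titanic_df.isnull().sum()` should report the same counts without the manual subtraction.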

3.2 Pclass

Let's find out which passenger classes were on the ship:

pclass = pd.Series(titanic_df['Pclass'])
pclass_list = list(set(pclass.values))
pclass_list

[1, 2, 3]

and the number of passengers in each class:

number_by_pclass = titanic_df.groupby('Pclass').count()['PassengerId']

We can also compute each class's share of all passengers in the sample:

number_by_pclass_percent = percentages_xy(number_by_pclass,len(titanic_df))
number_by_pclass_df = pd.DataFrame(data={'Number by Pclass': number_by_pclass,
                                         'Number by Pclass in percentages':number_by_pclass_percent})
number_by_pclass_df

pieplot(number_by_pclass,'Number by Pclass in percentages')

Next, the number of survivors in each class and the corresponding percentages:

survived_by_pclass = titanic_df.groupby('Pclass').sum()['Survived']
survived_by_pclass_percent1 = percentages_xy(titanic_df.groupby('Pclass').sum()['Survived'],
                                           titanic_df['Survived'].sum())
survived_by_pclass_percent2 = percentages_xy(titanic_df.groupby('Pclass').sum()['Survived'],
                                             number_by_pclass)
survived_by_pclass_df = pd.DataFrame(data={'Survived by Pclass': survived_by_pclass,
                                         'Survived by Pclass in percentages I':survived_by_pclass_percent1,
                                          'Survived by Pclass in percentages II':survived_by_pclass_percent2,})
survived_by_pclass_df

The variable "Survived by Pclass in percentages I" is the number of survivors in a class as a percentage of all survivors.

The variable "Survived by Pclass in percentages II" is the number of survivors in a class as a percentage of all passengers in that class.
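The difference between the two percentages is only the denominator. The following sketch recomputes both from the per-class counts reported in the notebook's output tables; the dictionary names are illustrative.

```python
# Per-class counts taken from the notebook's output tables.
number_by_pclass = {1: 216, 2: 184, 3: 491}
survived_by_pclass = {1: 136, 2: 87, 3: 119}

total_survivors = sum(survived_by_pclass.values())

# "Percentages I": share of all survivors that came from each class.
share_of_all_survivors = {
    c: 100.0 * s / total_survivors for c, s in survived_by_pclass.items()
}

# "Percentages II": survival rate within each class.
survival_rate_in_class = {
    c: 100.0 * survived_by_pclass[c] / number_by_pclass[c] for c in number_by_pclass
}

print(share_of_all_survivors)
print(survival_rate_in_class)
```

Only the second quantity says how likely a passenger of a given class was to survive; the first also depends on how many passengers each class had.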

pieplot(survived_by_pclass,'Survived by Pclass in percentages I')

A tendency is immediately visible: the higher the passenger class (the lower the class number), the greater the percentage of passengers in that class who survived.

In this situation we can use the Pearson product-moment correlation coefficient as a measure of the strength of the linear relationship between two variables. Looking at the data, we can expect a negative linear relationship between the independent variable "Pclass" and the dependent variable "Survived by Pclass in percentages II".

plt.style.use('seaborn-pastel')
plt.rcParams['figure.figsize'] = (8, 4) 
pearson_stat(pclass_list,survived_by_pclass_df['Survived by Pclass in percentages II'] ,
                                               'Survived by Pclass in percentages II')

     slope                   r                    standard deviation   
                                                               
    -19.3633552086        -0.994024355227          2.12638158486

The plot and the calculations give Pearson's r = -0.994024355227. This is extremely close to r = -1, so there is a strong negative linear relationship between the variable "Pclass" and the variable "Survived by Pclass in percentages II". Note that this estimate rests on only three data points, one per class.
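The regression above can be reproduced from the three per-class survival rates alone. The sketch below, in Python 3 syntax, feeds the rounded values from the notebook's output table into `scipy.stats.linregress` and also looks at the p-value, which the notebook does not print.

```python
from scipy import stats

# Class numbers and within-class survival rates (percentages II)
# copied from the notebook's output table.
pclass = [1, 2, 3]
survival_rate = [62.962963, 47.282609, 24.236253]

slope, intercept, r_value, p_value, std_err = stats.linregress(pclass, survival_rate)

print("slope   =", slope)
print("r       =", r_value)
print("p-value =", p_value)
print("std_err =", std_err)
```

With only three points the two-sided p-value stays above 0.05 despite |r| being close to 1, so the coefficient is better read as a description of these three aggregates than as a significance result.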

3.3 Fare

Let us investigate possible links between the variable "Fare" and other indicators.

We can guess that the lower the ticket price, the more passengers on board hold a ticket at that price. It is also natural to assume that the fare depends on the class. Therefore (bearing in mind section 3.2) we may be able to detect a relationship between the number of survivors and the price of their tickets.

3.3.1 First, we find the average fare for passengers of each class.

fare_mean_by_class = titanic_df.groupby('Pclass').mean()['Fare']
fare_mean_by_class

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
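The same group-then-average step can be tried on a handful of rows. The sketch below uses the five passengers shown in the notebook's `titanic_df.head()` output, so its means differ from the full-sample values above. It selects the "Fare" column before aggregating; this is an assumption about what is safest, since in recent pandas versions calling `.mean()` on the whole grouped frame may fail on text columns such as "Sex".

```python
import pandas as pd

# Five rows copied from the titanic_df.head() output (subset of columns).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Select the numeric column first, then average within each class.
fare_mean_by_class = sample.groupby("Pclass")["Fare"].mean()

print(fare_mean_by_class)
```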

The calculations below give Pearson's r = -0.907494151378. This is close to r = -1, so there is a strong negative linear relationship between the independent variable "Pclass" and the dependent variable "Fare mean by Pclass".

pearson_stat(pclass_list,fare_mean_by_class,'Fare mean by Pclass')

     slope                   r                    standard deviation   
                                                               
    -35.2395686991        -0.907494151378          16.3118400022

3.3.2 Let us now consider the summary statistics of the fare (minimum, maximum and mean).

fare = pd.Series(titanic_df['Fare'])

print 'max =', max(fare)
print 'min =', min(fare)
print 'mean =', mean(fare)

max = 512.3292
min = 0.0
mean = 32.2042079686

We construct a histogram of the passengers by fare interval.

plt.rcParams['figure.figsize'] = (6, 3)
titanic_df.Fare.hist()
plt.xlabel("Fare") 
plt.ylabel("Number by fare")

<matplotlib.text.Text at 0x115a79d50>

With intervals about 50 units wide, the number of passengers differs enormously between the lowest and the highest fare intervals.

number_by_fare = pd.Series(titanic_df.groupby('Fare').count()['PassengerId'])

To better understand the distribution of passengers by fare, I have also plotted the counts as a line. It looks like an interval of about 10 units is a better choice.

plt.rcParams['figure.figsize'] = (8, 4)
number_by_fare.plot(color='steelblue', linewidth=1.5, linestyle="-")
plt.xlabel("Fare") 
plt.ylabel("Number by fare")

<matplotlib.text.Text at 0x11894d910>

For the next analysis, all passengers are divided into fare categories about 10 units wide.

bins1 = np.linspace(fare.min(), fare.max(), 52)
groups = fare.groupby(pd.cut(fare, bins1)).count()
groups1 = groups.tolist()

index1 = [i for i in range(0,510) if i%10 == 0]

groups_df = pd.DataFrame(data={'Number in group by fare': groups1}, index=index1)
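The binning step above can be examined on a few fares. The sketch below, a minimal illustration rather than a re-run of the notebook, uses the minimum and maximum fares from section 3.3.2 plus two fares from the `titanic_df.head()` output. It shows that 52 edges give 51 bins about 10.05 units wide, and that `pd.cut` with its default right-closed intervals leaves out a value equal to the lowest edge unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Minimum and maximum fare from section 3.3.2, plus two fares from head().
fare = pd.Series([0.0, 7.25, 71.2833, 512.3292])

# Same construction as in the notebook: 52 edges, hence 51 bins.
bins = np.linspace(fare.min(), fare.max(), 52)
bin_width = bins[1] - bins[0]

# Default intervals are (left, right], so a fare of exactly 0.0 gets no bin.
default_cut = pd.cut(fare, bins)
inclusive_cut = pd.cut(fare, bins, include_lowest=True)

n_default = int(default_cut.notna().sum())
n_inclusive = int(inclusive_cut.notna().sum())

print(len(bins) - 1, bin_width, n_default, n_inclusive)
```

If the full data set contains passengers with a fare of exactly 0.0, they would be absent from the grouped counts in the same way; the sketch does not check this against the real file.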

We can construct a graph showing the number of passengers that fall into each category.

groups_df.plot(color='darkred', linewidth=1.5, linestyle="-")
plt.xlabel("Fare") 
plt.ylabel("Number in group by fare")

<matplotlib.text.Text at 0x115ab9610>

Let us exclude outliers. The first 27 groups contain almost all passengers; the remaining groups contain only three people.

groups0 = groups1[:27]
index0 = index1[:27]

number_in_rest = sum(groups1[27:])
print number_in_rest

3

Now we can check whether a dependence exists.

pearson_stat(index0,groups0,'Number in group by fare')

     slope                   r                    standard deviation   
                                                               
    -0.558913308913        -0.617051133707          0.142556020365

In this case Pearson's r = -0.617051133707. This is not very close to r = -1, so the negative relationship between the independent variable "Group by fare" and the dependent variable "Number in group by fare" is relatively weak.

3.3.3 Let us look at the variable "Fare" together with the variable "Survived". Here is a data frame built for these variables.

survived_by_fare = titanic_df.groupby('Fare').sum()['Survived']
survived_groups_df = pd.DataFrame(data={'Survived by fare': survived_by_fare})

survived_groups_df.reset_index(inplace=True)

As in section 3.3.2, the passengers are divided into several dozen categories by ticket price, and the number of survivors in each category is counted. Outliers are again excluded.

survived_groups = survived_groups_df.groupby(pd.cut(survived_groups_df["Fare"], bins1)).sum()

survived_groups1 = survived_groups.fillna(0)

survived_groups2 = survived_groups1['Survived by fare'][:27].tolist()
survived_groups2 = [int(x) for x in survived_groups2]

fare_groups_df = pd.DataFrame(data={'Number by group': groups0, 
                                    'Survived by group': survived_groups2}, 
                              index=index0)
fare_groups_df.head()

The data were adjusted with the function "fillna(0)" in two cases:

1) no values fall within a range;

2) the group size is zero, so the division has no result.

In both cases the missing data were replaced by zero.

m = np.array(survived_groups2)
n = np.array(groups0)

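# Percentage of survivors in each fare group; an empty group gives 0/0 = NaN,
# which fillna(0) replaces with zero.
survived_groups_in_percentages0 = 100.0*np.true_divide(m,n)
survived_groups_in_percentages1 = pd.Series(survived_groups_in_percentages0).fillna(0)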
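The percentage per fare group involves a division that is undefined for empty groups. One way to handle this without NaN values or warnings is the `where` argument of `np.divide`. This is a sketch with made-up counts (the three groups below are hypothetical, not taken from the data).

```python
import numpy as np

# Hypothetical counts for three fare groups; the middle group is empty.
survived = np.array([3, 0, 2])
group_size = np.array([10, 0, 4])

# Start from zeros and divide only where the group is non-empty,
# so empty groups keep the value 0 instead of becoming NaN.
percentages = np.zeros(len(group_size), dtype=float)
np.divide(100.0 * survived, group_size, out=percentages, where=group_size != 0)

print(percentages)
```

Replacing empty groups by 0% means they enter the correlation as real observations with no survivors. Dropping them instead is an alternative that may give a different coefficient.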

Let us check whether "Survived in group by fare in percentages" depends on "Group by fare".

pearson_stat(index0,survived_groups_in_percentages1,'Survived in group by fare in percentages')

     slope                   r                    standard deviation   
                                                               
    -0.0689445193963        -0.157400468005          0.0865119614025

In this case Pearson's r = -0.157400468005. This is close to r = 0, so no linear relationship is found between the independent variable "Group by fare" and the dependent variable "Survived in group by fare in percentages".

3.4 Sex

Now let us examine the variable "Sex", the variable "Survived" and their possible dependence. Here is a data frame built for these variables, including the indicators as percentages.

number_by_sex = titanic_df.groupby('Sex').count()['PassengerId']
number_by_sex_in_percentages = percentages_xy(titanic_df.groupby('Sex').count()['PassengerId'],
                                              len(titanic_df))

survived_by_sex = titanic_df.groupby('Sex').sum()['Survived']
survived_by_sex_in_percentages1=percentages_xy(titanic_df.groupby('Sex').sum()['Survived'],
                                               titanic_df['Survived'].sum())
survived_by_sex_in_percentages2 = percentages_xy(titanic_df.groupby('Sex').sum()['Survived'],
                                                 number_by_sex)

For clarity, the data are combined into one table.

survived_by_sex_df = pd.DataFrame(data={'Number by sex': number_by_sex,
                                        'Number by sex in percentages':number_by_sex_in_percentages,
                                        'Survived by sex':survived_by_sex,
                                        'Survived by sex in percentages I':survived_by_sex_in_percentages1,
                                        'Survived by sex in percentages II':survived_by_sex_in_percentages2})
survived_by_sex_df

The variable "Survived by sex in percentages I" is the number of survivors of a given sex as a percentage of all survivors.

The variable "Survived by sex in percentages II" is the number of survivors of a given sex as a percentage of all passengers of that sex.

plt.rcParams['figure.figsize'] = (8, 4)
fig = plt.figure()
ax = fig.add_subplot(111)

## Data
Number_by_Sex_in_percentages = pd.Series(survived_by_sex_df['Number by sex in percentages'])
Survived_by_Sex_in_percentages1 = pd.Series(survived_by_sex_df['Survived by sex in percentages I'])
Survived_by_Sex_in_percentages2 = pd.Series(survived_by_sex_df['Survived by sex in percentages II'])

## Necessary variables
ind = np.array([1,2])           # the x locations for the two sex groups
width = 0.2                     # the width of the bars

## Bars
rects1 = ax.bar(ind, Number_by_Sex_in_percentages, width, color='red', alpha = 0.7)
rects2 = ax.bar(ind+width, Survived_by_Sex_in_percentages1, width, color='green', alpha = 0.7)
rects3 = ax.bar(ind+2*width, Survived_by_Sex_in_percentages2, width, color='blue', alpha = 0.7)

# Axes and labels
ax.set_xlim(-width+1,len(ind)+width+0.6)
ax.set_ylim(0,90)
ax.set_ylabel('Values by sex in percentages')
ax.set_title('Number of passengers and Survived passengers by sex in percentages')
xTickMarks = ['female','male']
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=10, fontsize=30)

## Legend
ax.legend((rects1[0], rects2[0], rects3[0]), 
          ('Number by sex in percentages', 'Survived by sex in percentages I', 'Survived by sex in percentages II') )
plt.show()

Combining the data in one bar chart lets us clearly see the tendency: female passengers were more likely to survive.

Women made up a smaller share of all passengers than men, yet they made up the majority of the survivors.

3.5 Age

Next, we look at the variable "Age" together with the variable "Survived".

age = pd.Series(titanic_df['Age'])

We can consider the summary statistics of the passengers' age (minimum, maximum and mean).

print 'max =', max(age)
print 'min =', min(age)
print 'mean =', mean(age)

max = 80.0
min = 0.42
mean = 29.6991176471

We group the passengers by age and consider their distribution.

number_age = titanic_df.groupby('Age').count()['PassengerId']

plt.rcParams['figure.figsize'] = (8, 4)
titanic_df.Age.hist()
plt.xlabel("Age")
plt.ylabel("Number by age")

<matplotlib.text.Text at 0x116ed2c10>

Let us divide the passengers into age categories and count the survivors in each of them. In this analysis, we do not count the passengers whose age is unknown.

age = pd.cut(titanic_df['Age'], [0,13,19,35,60,85], labels=['child', 'teenager', 'young','middle-aged','old'])
df_age = pd.DataFrame(data={'Age category': age,'Survived': titanic_df['Survived']})
number_by_age = df_age.groupby('Age category').count() 
number_by_age1 = number_by_age.values.tolist()
survived_by_age = df_age.groupby('Age category').sum()
survived_by_age1 = survived_by_age.values.tolist()
survived_by_age_in_percentages = percentages_xy(survived_by_age,number_by_age)
survived_by_age_in_percentages1 = survived_by_age_in_percentages.values.tolist()
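The age categories above come from `pd.cut` with right-closed intervals, so a passenger aged exactly 35 is still "young" and an unknown age gets no category. The sketch below applies the same edges and labels to a few ages: the minimum and maximum from the summary above, four ages from the `titanic_df.head()` output, and one missing value.

```python
import numpy as np
import pandas as pd

# Minimum age, four ages from head(), maximum age, and one unknown age.
ages = pd.Series([0.42, 22.0, 26.0, 35.0, 38.0, 80.0, np.nan])

labels = ['child', 'teenager', 'young', 'middle-aged', 'old']
age_category = pd.cut(ages, [0, 13, 19, 35, 60, 85], labels=labels)

# Convert to plain Python values; an unknown age stays without a category.
categories = [None if pd.isna(c) else str(c) for c in age_category]
young_count = int(age_category.value_counts()['young'])

print(categories)
```

Because the uncategorized rows are NaN, a later `groupby('Age category')` leaves them out, which matches the statement that passengers of unknown age are not counted.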

Here is a data frame built for these categories.

df_survived_by_age = pd.DataFrame(data={'Number by age category': convert_list_to_int(number_by_age1),
                                        'Survived by age category': convert_list_to_int(survived_by_age1),
                                        'Survived by age category in percentages': 
                                        convert_list_to_int(survived_by_age_in_percentages1)},
                                 index=['child', 'teenager', 'young','middle-aged','old'])
df_survived_by_age

We assign each age category ['child', 'teenager', 'young','middle-aged','old'] the corresponding number [1, 2, 3, 4, 5] and analyze how the variable "Survived by age category in percentages" depends on the variable "Age category".

pearson_stat([1, 2, 3, 4, 5],
             convert_list_to_int(survived_by_age_in_percentages1),
             'Survived by age category in percentages')

     slope                   r                    standard deviation   
                                                               
    -7.26402599369        -0.888941929837          2.16086549863

The calculations above give Pearson's r = -0.888941929837. This is close to r = -1, so there is a strong negative linear relationship between the independent variable "Age category" and the dependent variable "Survived by age category in percentages".

4. Conclusion

4.1 Identified trends

A strong negative linear relationship was found between the following pairs of variables:

1) the independent variable "Pclass" and the dependent variable "Survived by Pclass in percentages II" (the percentage of passengers who survived in this class in relation to the total number of passengers in this class);

2) the independent variable "Pclass" and the dependent variable "Fare mean by Pclass" (the average fare for passengers in this class);

3) the independent variable "Age category" and the dependent variable "Survived by age category in percentages" (the percentage of passengers who survived in the age category in relation to the total number of passengers in this age category).

The strong relationships found suggest that first-class passengers and children were more likely to survive.

A relatively weak negative correlation was found between the independent variable "Group by fare" and the dependent variable "Number in group by fare". According to the graph, the relationship is unlikely to be linear.

Still, the overall tendency is that the lower the ticket price, the greater the number of passengers in that fare category.

No relationship was found between the independent variable "Group by fare" and the dependent variable "Survived in group by fare in percentages" (the percentage of passengers who survived in the group in relation to the total number of passengers in this group).

It should be noted that the percentage of survivors is much higher among women than among men. A correlation analysis is probably not meaningful here because the variable "Sex" has only two values.

4.2 Reasons for trends

The reasons for a few trends are obvious enough not to need discussion: the higher classes cost more on average and carried fewer passengers than the third class.

Some relationships may reflect humanitarian priorities (such as more women and young people surviving).

Some relationships may have a whole range of causes. For example, the higher percentage of survivors among first-class passengers could be explained by a larger number of life-saving appliances available to them, by damage to the ship far from the first-class cabins, or by class prejudice among the crew, who may have rescued the passengers of this class first.

4.3 Restrictions in research

This analysis has many limitations, which cast doubt on some of the identified trends.

1) We have data for only 891 people, whereas about 1300 passengers plus the crew (2224 people in total) were on board. If the rest of the people belong predominantly to a certain category (for example, men or first-class passengers), the conclusions for that category could be wrong.

2) We did not run statistical tests for the significance of the differences in survival rates, so we cannot be sure whether the observed differences reflect a real effect or random variation.

3) Some columns have missing data, which could also affect the analysis. For example, the trends across age categories and the age distribution of the rescued passengers might change significantly.

4) There is no data at all for determining the reasons behind some trends (for example, the high survival rate of first-class passengers), so only guesses can be made.
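The second limitation could be addressed, for example, with a chi-square test of independence between sex and survival. The sketch below is one possible approach that was not part of the original analysis; it builds the 2x2 table from the counts in the notebook's "by sex" output table and uses `scipy.stats.chi2_contingency`.

```python
from scipy.stats import chi2_contingency

# Counts from the notebook's output table:
# 314 women of whom 233 survived, 577 men of whom 109 survived.
table = [
    [233, 314 - 233],  # female: survived, did not survive
    [109, 577 - 109],  # male:   survived, did not survive
]

chi2, p_value, dof, expected = chi2_contingency(table)

print("chi2    =", chi2)
print("p-value =", p_value)
print("dof     =", dof)
```

A very small p-value here would indicate that the difference in survival rates between women and men in this sample is unlikely to be due to chance alone; it says nothing about the cause.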

Output of titanic_df.head() (section 3.1):

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of number_by_pclass_df (section 3.2):

	Number by Pclass	Number by Pclass in percentages
Pclass
1	216	24.242424
2	184	20.650954
3	491	55.106622

Output of survived_by_pclass_df (section 3.2):

	Survived by Pclass	Survived by Pclass in percentages I	Survived by Pclass in percentages II
Pclass
1	136	39.766082	62.962963
2	87	25.438596	47.282609
3	119	34.795322	24.236253

Output of survived_by_sex_df (section 3.4):

	Number by sex	Number by sex in percentages	Survived by sex	Survived by sex in percentages I	Survived by sex in percentages II
Sex
female	314	35.241302	233	68.128655	74.203822
male	577	64.758698	109	31.871345	18.890815

Output of df_survived_by_age (section 3.5):

	Number by age category	Survived by age category	Survived by age category in percentages
child	71	42	59.154930
teenager	93	37	39.784946
young	333	128	38.438438
middle-aged	195	78	40.000000
old	22	5	22.727273