Exploratory data analysis (EDA) is an important pillar of data science, a important step required to complete every project regardless of type of data you are working with. Exploratory analysis gives us a sense of what additional work should be performed to quantify and extract insights from our data.

In this post I have performed Exploratory Data analysis on Titanic Dataset.

Let’s start with importing required libraries.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now I will read titanic dataset using Pandas read_csv method and explore first 5 rows of the data set.

titanic_df = pd.read_csv('titanic-data.csv')
titanic_df.head()
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Data Description

(from https://www.kaggle.com/c/titanic)

  1. survival: Survival (0 = No; 1 = Yes)
  2. pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  3. name: Name
  4. sex: Sex
  5. age: Age
  6. sibsp: Number of Siblings/Spouses Aboard
  7. parch: Number of Parents/Children Aboard
  8. ticket: Ticket Number
  9. fare: Passenger Fare
  10. cabin: Cabin
  11. embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

Now let’s see some statistical summary of the imported dataset using pandas.describe() method.

titanic_df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.

From above table I see that mean of survived column is 0.38, but since this is not complete dataset we cannot conclude on that.

Count for ‘Age’ column is 714, it means dataset has some missing values. I will have to cleanup the data before I start exploring.

Data cleanup

Now let’s get some info on datatypes in the dataset using pandas.info() method. It will give us concise summary of a DataFrame.

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

I see that there are some missing values in ‘Age’, ‘Cabin’ and ‘Embarked’ columns. I’ll not use ‘Cabin’ which is the most missing and will ignore it. There are some columns which are not required in my analysis so I will drop them. For the missing ‘Ages’ and ‘Embarked’ I will omit those rows when I use the data.

titanic_cleaned = titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titanic_cleaned.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
titanic_cleaned.describe()
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
titanic_cleaned.isnull().sum()
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

Data exploration:

As part this project I want to explore answers to following questions.

  1. How Survival is correlated to other attributes of the dataset ? Findout Pearson’s r.
  2. Did Sex play a role in Survival ?
  3. Did class played role in survival ?
  4. How fare is related to Age, Class and Port of Embarkation ?
  5. How Embarkation varied across different ports ?

Lets start with Q1.

Q1. How Survival is correlated to other attributes of the dataset ? Findout Pearson’s r.

I will compute pairwise correlation of columns(excluding NA/null values) using pandas.DataFrame.corr method. I will use ‘pearson’ standard correlation coefficient for the calculation.

titanic_cleaned.corr(method='pearson')
Survived Pclass Age SibSp Parch Fare
Survived 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000

From above correlation table we can see that Survival is inversly correlated to Pclass value. In our case since Class 1 has lower numerical value, it had better survival rate compared to other classes.

We also see that Age and Survival are slighltly correlated.

We will try to visualize these corelation below.

This brings us to Q2

Q2. Did Sex play a role in Survival ?

Lets pull a histogram of ‘Survived’ column.

#titanic_cleaned.groupby(['Survived']).hist()

sns.factorplot('Survived', data=titanic_df, kind='count')
sns.plt.title('Count of Passengers who survived')

output_18_1.png

Let’s see agewise distribution of the passenger aboard the Titanic.

#Histogram of Age of the given data set(sample)
#plt.hist(titanic_cleaned['Age'].dropna())
sns.distplot(titanic_cleaned['Age'].dropna(), bins=15, kde=False)
sns.plt.ylabel('Count')
sns.plt.title('Agewise distribution of the passenger aboard the Titanic')

output_20_1.png

Many passensgers are of age 15-40 yrs. But again this is not complete dataset.

Now I would like to see agewise distribution of passsengers for both Genders. I will do this by plotting the rows where ‘Sex’ is Male and Female respectively.

#Age wise Distribution of Male and Female passengers
sns.plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'female')].dropna(), bins=7, label='Female', histtype='stepfilled')
sns.plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'male')].dropna(), bins=7, label='Male', alpha=.7, histtype='stepfilled')
sns.plt.xlabel('Age')
sns.plt.ylabel('Count')
sns.plt.title('Age wise Distribution of Male and Female passengers')
sns.plt.legend()

output_22_1.png

There were many male passengers aboared compared to female passengers.

I will do a agewise distribution plot for passenges who Survived across both Genders by filtering out rows where ‘Survived’ = 1.

#Age wise Distribution of Male and Female survivors
sns.plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'female') & (titanic_cleaned['Survived'] == 1)].dropna(), bins=7, label='Female', histtype='stepfilled')
sns.plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'male') & (titanic_cleaned['Survived'] == 1)].dropna(), bins=7, label='Male', alpha=.7, histtype='stepfilled')
sns.plt.xlabel('Age')
sns.plt.ylabel('Count')
sns.plt.title('Age wise Distribution of Male and Female survivors')
sns.plt.legend()

output_24_1.png

From above visualization, it is evident that Women had better survival chance. One can do an Hypothesis test to verify this.

Lets take a look for youngest and oldest passenger to survive.

yougest_survive = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 1)].min()
youngest_die = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 0)].min()
oldest_survive = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 1)].max()
oldest_die = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 0)].max()

print "Yougest to survive: {} \nYoungest to die: {} \nOldest to survive: {} \nOldest to die: {}".format(yougest_survive, youngest_die, oldest_survive, oldest_die)
Yougest to survive: 0.42 
Youngest to die: 1.0 
Oldest to survive: 80.0 
Oldest to die: 74.0

Q3. Did class played role in survival ?

Next, let’s look at survival based on passenger’s class for both genders.

We can do this by grouping the dataframe with respect to Pclass, Survived and Sex.

#sns.plt.hist(titanic_cleaned.groupby(['Pclass', 'Survived', 'Sex']).size())
grouped_by_pclass = titanic_cleaned.groupby(['Pclass', 'Survived', 'Sex'])
grouped_by_pclass.size()
Pclass  Survived  Sex   
1       0         female      3
                  male       77
        1         female     91
                  male       45
2       0         female      6
                  male       91
        1         female     70
                  male       17
3       0         female     72
                  male      300
        1         female     72
                  male       47
dtype: int64
titanic_cleaned.groupby(['Pclass', 'Sex']).describe()
Age Fare Parch SibSp Survived
Pclass Sex
1 female count 85.000000 94.000000 94.000000 94.000000 94.000000
mean 34.611765 106.125798 0.457447 0.553191 0.968085
std 13.612052 74.259988 0.728305 0.665865 0.176716
min 2.000000 25.929200 0.000000 0.000000 0.000000
25% 23.000000 57.244800 0.000000 0.000000 1.000000
50% 35.000000 82.664550 0.000000 0.000000 1.000000
75% 44.000000 134.500000 1.000000 1.000000 1.000000
max 63.000000 512.329200 2.000000 3.000000 1.000000
male count 101.000000 122.000000 122.000000 122.000000 122.000000
mean 41.281386 67.226127 0.278689 0.311475 0.368852
std 15.139570 77.548021 0.658853 0.546695 0.484484
min 0.920000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 27.728100 0.000000 0.000000 0.000000
50% 40.000000 41.262500 0.000000 0.000000 0.000000
75% 51.000000 78.459375 0.000000 1.000000 1.000000
max 80.000000 512.329200 4.000000 3.000000 1.000000
2 female count 74.000000 76.000000 76.000000 76.000000 76.000000
mean 28.722973 21.970121 0.605263 0.486842 0.921053
std 12.872702 10.891796 0.833930 0.642774 0.271448
min 2.000000 10.500000 0.000000 0.000000 0.000000
25% 22.250000 13.000000 0.000000 0.000000 1.000000
50% 28.000000 22.000000 0.000000 0.000000 1.000000
75% 36.000000 26.062500 1.000000 1.000000 1.000000
max 57.000000 65.000000 3.000000 3.000000 1.000000
male count 99.000000 108.000000 108.000000 108.000000 108.000000
mean 30.740707 19.741782 0.222222 0.342593 0.157407
std 14.793894 14.922235 0.517603 0.566380 0.365882
min 0.670000 0.000000 0.000000 0.000000 0.000000
25% 23.000000 12.331250 0.000000 0.000000 0.000000
50% 30.000000 13.000000 0.000000 0.000000 0.000000
75% 36.750000 26.000000 0.000000 1.000000 0.000000
max 70.000000 73.500000 2.000000 2.000000 1.000000
3 female count 102.000000 144.000000 144.000000 144.000000 144.000000
mean 21.750000 16.118810 0.798611 0.895833 0.500000
std 12.729964 11.690314 1.237976 1.531573 0.501745
min 0.750000 6.750000 0.000000 0.000000 0.000000
25% 14.125000 7.854200 0.000000 0.000000 0.000000
50% 21.500000 12.475000 0.000000 0.000000 0.500000
75% 29.750000 20.221875 1.000000 1.000000 1.000000
max 63.000000 69.550000 6.000000 8.000000 1.000000
male count 253.000000 347.000000 347.000000 347.000000 347.000000
mean 26.507589 12.661633 0.224784 0.498559 0.135447
std 12.159514 11.681696 0.623404 1.288846 0.342694
min 0.420000 0.000000 0.000000 0.000000 0.000000
25% 20.000000 7.750000 0.000000 0.000000 0.000000
50% 25.000000 7.925000 0.000000 0.000000 0.000000
75% 33.000000 10.008300 0.000000 0.000000 0.000000
max 74.000000 69.550000 5.000000 8.000000 1.000000

I would also like to see the survival rate across all the class. I can do this by taking sum of survived passengers for each class and divide it by totla number of passenger for that class and multiplying by 100. I will use pandas groupby function to segregate passengers according to their class.

titanic_cleaned.groupby(['Pclass'])['Survived'].sum()/titanic_cleaned.groupby(['Pclass'])['Survived'].count()*100
Pclass
1    62.962963
2    47.282609
3    24.236253
Name: Survived, dtype: float64

I see Class did play role in survival of the passengers. Now let’s visualize the same.

sns.factorplot('Survived', col='Pclass', data=titanic_cleaned, kind='count', size=7, aspect=.8)
plt.subplots_adjust(top=0.9)
sns.plt.suptitle('Class wise segregation of passengers', fontsize=16)

output_33_1.png

Above visualization compares passengers who survived the tragedy and who did not, across three classes. We can also drill down further to visualize survival of passengers of both genders across 3 classes.

sns.factorplot('Survived', col='Pclass', hue='Sex', data=titanic_cleaned, kind='count', size=7, aspect=.8)
plt.subplots_adjust(top=0.9)
sns.plt.suptitle('Class and gender wise segregation of passengers', fontsize=16)

output_35_1.png

From above visualization we can see that class played important for Survival of Male and Female passengers. This brings us to my next questions.

Q5. How Embarkation varied across different ports ?

Let’s see how Fare varies with respect to Age and Port of Embarkation. I will do a scatterplot of passengers from 3 classes for Age and Fare on X and Y axis.

sns.lmplot('Age', 'Fare', data=titanic_cleaned, fit_reg=False, hue="Pclass", scatter_kws={"marker": ".", "s": 20})
sns.plt.title('Scatterplot of passengers w.r.t Fare and Age')

output_37_1.png

I can segregate the passengers according to thier Port of Embarkation and then compare Fare v/s Age across 3 classes.

sns.lmplot('Age', 'Fare', data=titanic_cleaned, fit_reg=False, hue="Pclass", col="Embarked", scatter_kws={"marker": ".", "s": 20})
plt.subplots_adjust(top=0.9)
sns.plt.suptitle('Scatterplot of passengers w.r.t Fare and Age for diff. ports', fontsize=16)

output_39_1.png

From above visualization we can see that Fare is quite uniform for Class 2 and 3 across all ages. Fare varies for Class 1 across all ages, but we cannot conclude why it varies. We need more attributes to our data points to drill down to the reason for variation. We can also observe that lot of passengers embarked from port of Southampton.

Conclusions

From my exploratory analysis of Titanic dataset we conclude that, women had higher chances of survival. We can do a t test to come up with chances(probability) of survival. I also see that Class(Socio-Economic status) of the passengers had played a role in their survival. I also compared fare across different classes and found that it varied a lot for Class 1 passenger, although I could not conclude as to why it varried diffrently for Class 1 due to insufficient data.

There were some limitation for this dataset such as missing values for some attributes of passesngers. This is not in any form a exhaustive study. More can be done on this data set.

Reference:

Websites:

Books: Python Data Science Handbook, By - Jake VanderPlas