Turnstile and Weather Variables: https://s3.amazonaws.com/uploads.hipchat.com/23756/665149/05bgLZqSsMycnkg/turnstile-weather-variables.pdf
Scipy. Statistics in Python: http://www.scipy-lectures.org/packages/statistics/index.html
Hypothesis Testing - MIT OpenCourseWare: https://ocw.mit.edu/resources/res-6-009-how-to-process-analyze-and-visualize-data-january-iap-2012/lectures-and-labs/MITRES_6_009IAP12_lab3a.pdf
Assumptions of the Mann-Whitney U test: https://statistics.laerd.com/premium-sample/mwut/mann-whitney-test-in-spss-2.php
List of statistical packages: https://en.wikipedia.org/wiki/List_of_statistical_packages
from IPython.core.display import HTML
hide_code = ''
HTML('''
<style>
body {background-color: aliceblue;}
a {color: steelblue;}
h1, h2, h3 { color: #4169E1;}
</style>
<script>code_show = true;
function code_display() {
if (code_show) {
$('div.input').each(function(id) {
if (id == 0 || $(this).html().indexOf('hide_code') > -1) {$(this).hide();}
});
$('div.output_prompt').css('opacity', 0);
} else {
$('div.input').each(function(id) {$(this).show();});
$('div.output_prompt').css('opacity', 1);
}
code_show = !code_show;
}
$(document).ready(code_display);</script>
<form action="javascript: code_display()"><input \
style="color: #4169E1; background: ghostwhite; opacity: 0.9; " \
type="submit" value="Click to display or hide code"></form>''')
hide_code
# Import libraries
import numpy as np
import pandas as pd
import time
import warnings
import matplotlib
import matplotlib.pyplot as plt
from ggplot import *
from IPython.core.display import display
# Display for notebooks
%matplotlib inline
################################
### ADD EXTRA LIBRARIES HERE ###
################################
from scipy.stats import mannwhitneyu
import seaborn as sns
from sklearn import linear_model
import statsmodels.api as sm
from sklearn.metrics import r2_score
from scipy.stats import linregress
from sklearn.preprocessing import normalize
hide_code
# Load the data from the csv file
turnstile_data = pd.read_csv('turnstile_data_master_with_weather.csv').drop('index', axis=1)
# Display the data description
turnstile_data.describe().T
hide_code
print ('Data Medians')
turnstile_data.median()
Question 1.1 Which statistical test did you use to analyze the NYC subway data? Did you use a one-tail or a two-tail P value? What is the null hypothesis? What is your p-critical value?
Answer 1.1
The Mann-Whitney U test is a good choice for comparing NYC subway ridership on rainy and non-rainy days.
The column 'ENTRIESn_hourly' will be the target and the column 'rain' the grouping feature.
I will test the null hypothesis: the distributions of NYC subway ridership are the same on rainy and non-rainy days. Another way to state this hypothesis: the difference between the ridership medians / means for rainy and non-rainy days is equal to zero.
I will use a two-tailed test to detect statistical significance in both possible directions of interest.
Let's set the p-critical value to 0.05: we will reject the null hypothesis at the 95% confidence level.
Question 1.2 Why is this statistical test applicable to the dataset? In particular, consider the assumptions that the test is making about the distribution of ridership in the two samples.
Answer 1.2
The Mann-Whitney U test is a non-parametric alternative to the t-test for checking whether two samples (and hence their medians / means) come from the same population.
I have noted that the distribution of 'ENTRIESn_hourly' is not normal, so I cannot use the t-test in this case.
The assumptions the Mann-Whitney U test makes about the samples are mild: the dependent variable should be at least ordinal ('ENTRIESn_hourly' is), the two groups should be defined by one categorical variable (here 'rain'), and the observations should be independent; normality is not required.
First, I will describe the two samples; then I will run the test and display the results.
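Before that, the non-normality claim can be checked numerically. Here is a minimal sketch (not part of the original analysis) using the D'Agostino-Pearson test from scipy.stats:
hide_code
# Hedged sketch: normality check for 'ENTRIESn_hourly'
# (assumes turnstile_data is loaded as above).
from scipy.stats import normaltest
stat, p_norm = normaltest(turnstile_data['ENTRIESn_hourly'])
print ("normaltest: statistic = {:.1f}, p = {:.3g}".format(stat, p_norm))
# A p-value near zero rejects normality and supports the choice of a
# non-parametric test over the t-test.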
hide_code
print("ENTRIESn_hourly in rainy days")
rainy_entries_hourly = turnstile_data['ENTRIESn_hourly'][turnstile_data['rain']==1]
rainy_entries_hourly.describe()
hide_code
print("ENTRIESn_hourly in non-rainy days")
non_rainy_entries_hourly = turnstile_data['ENTRIESn_hourly'][turnstile_data['rain']==0]
non_rainy_entries_hourly.describe()
hide_code
with_rain_median = np.median(rainy_entries_hourly)
without_rain_median = np.median(non_rainy_entries_hourly)
with_rain_mean = np.mean(rainy_entries_hourly)
without_rain_mean = np.mean(non_rainy_entries_hourly)
U, p = mannwhitneyu(rainy_entries_hourly, non_rainy_entries_hourly, alternative='two-sided')
print ("Mean for rainy days: {:.0f}".format(with_rain_mean))
print ("Mean for non-rainy days: {:.0f}".format(without_rain_mean))
print ("Median for rainy days: {:.0f}".format(with_rain_median))
print ("Median for non-rainy days: {:.0f}".format(without_rain_median))
print ("Number of rainy days: {:.0f}".format(len(rainy_entries_hourly)))
print ("Number for non-rainy days: {:.0f}".format(len(non_rainy_entries_hourly)))
print ("Mann-Whitney U test: U = {:.0f}, p = {:.5f}".format(U,p))
Question 1.3 What results did you get from this statistical test? These should include the following numerical values: p-values, as well as the medians / means for each of the two samples under test.
Answer 1.3
The mean values of ENTRIESn_hourly on rainy and non-rainy days are 1105 and 1090.
The median values of ENTRIESn_hourly on rainy and non-rainy days are 282 and 278.
The numbers of rainy and non-rainy observations are 44104 and 87847.
The p-value of the Mann-Whitney U test is 0.04988.
Question 1.4 What is the significance and interpretation of these results?
Answer 1.4
The median / mean values of ENTRIESn_hourly on rainy days are only slightly larger than on non-rainy days. I cannot determine whether the null hypothesis should be rejected based on the difference between each pair of values alone.
The Mann-Whitney U test gives a more informative answer about whether the null hypothesis holds. Its p-value is 0.04988, which is less than the p-critical value. Therefore, I can confirm that the null hypothesis is rejected at the 95% confidence level.
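Statistical significance alone says little about the size of the effect. As a minimal sketch (not part of the original analysis), the rank-biserial correlation can be derived from the U statistic computed above:
hide_code
# Hedged sketch: rank-biserial correlation as an effect size for the U test;
# values near 0 indicate a negligible difference despite significance.
n1, n2 = len(rainy_entries_hourly), len(non_rainy_entries_hourly)
rank_biserial = 1 - 2.0 * U / (n1 * n2)
print ("Rank-biserial correlation: {:.4f}".format(rank_biserial))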
In this section, I will use the improved dataset turnstile_weather_v2.csv. Let's load and describe it.
hide_code
# Load the data from the csv file
turnstile_data2 = pd.read_csv('turnstile_weather_v2.csv')
# Display the data example
turnstile_data2.head().T
Question 2.1 What approach did you use to compute the coefficients theta and produce predictions for ENTRIESn_hourly in your regression model:
Answer 2.1
To produce predictions I would like to try a simple ordinary least squares model (OLS from Statsmodels), stochastic gradient descent regression (SGDRegressor from Scikit-Learn), and a set of custom-built functions (normalize_features(), compute_cost(), gradient_descent(), predictions()). It will be interesting to compare the results.
hide_code
# Function for normalizing the features
def normalize_features(data):
mu, sigma = data.mean(), data.std()
if (sigma == 0).any():
raise Exception("One of the features had the same value for all samples and could not be normalized." +
"Do not include features with only a single value in the model.")
data_normalized = (data - mu) / sigma
return data_normalized, mu, sigma
# Cost function
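# Note (added sketch comment): the function below implements the cost
# J(theta) = sum((X.theta - y)^2) / (2*m), the squared-error cost with
# the conventional 1/2 factor.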
def compute_cost(features, values, theta):
sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
cost = sum_of_square_errors / (2 * len(values))
return cost
# Gradient descent function
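# Note (added sketch comment): each iteration below applies the batch update
# theta := theta + alpha * X^T (y - X.theta) / m, which follows the negative
# gradient of the cost defined above.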
def gradient_descent(features, values, theta, alpha, num_iterations):
cost_history = []
for i in range(num_iterations):
delta = (np.dot((values - np.dot(features, theta)), features)) / len(values)
theta = theta + alpha * delta
cost_history.append(compute_cost(features, values, theta))
return theta, pd.Series(cost_history)
# Prediction function based on gradient descent
def predictions(dataframe, numerical, target):
# Add selected features
features = dataframe[numerical]
# Add UNIT to features using dummy variables
dummy_units = pd.get_dummies(dataframe['UNIT'], prefix='unit')
features = features.join(dummy_units)
# Target values
values = dataframe[target]
# Normalize features
features, mu, sigma = normalize_features(features)
# Add a column of 1s (y intercept)
features['ones'] = np.ones(len(values))
# Convert features and values to numpy arrays
features_array = np.array(features)
values_array = np.array(values)
# Set values for alpha, number of iterations.
alpha = 0.1
num_iterations = 50
# Initialize theta, perform gradient descent
theta_gradient_descent = np.zeros(len(features.columns))
theta_gradient_descent, cost_history = \
gradient_descent(features_array, values_array, theta_gradient_descent, alpha, num_iterations)
predictions = np.dot(features_array, theta_gradient_descent)
return predictions, theta_gradient_descent
hide_code
# Setup the feature list
numerical2 = ['rain', 'hour', 'weekday', 'meantempi']
predict = predictions(turnstile_data2, numerical2, 'ENTRIESn_hourly')
print ("Predictions for 'ENTRIESn_hourly'")
predict[0]
The code cells below define a plotting helper and display the histogram of residuals (the differences between the observed values and the predicted values).
hide_code
def plot_residuals(data, predictions):
matplotlib.rcParams['figure.figsize'] = (18, 8)
plt.style.use('seaborn-whitegrid')
plt.figure()
(data['ENTRIESn_hourly'] - predictions).hist(bins=200, color='royalblue', edgecolor='black', alpha=0.5)
plt.title("Histogram of Residauls 'ENTRIESn_hourly', Theta Gradient Descent", fontsize=20)
return plt
# Plot the residuals for the gradient-descent predictions
plot_residuals(turnstile_data2, predict[0]);
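To quantify what the histogram shows, here is a small follow-up sketch (assuming predict[0] from the cell above) that summarizes the residuals numerically:
hide_code
# Hedged sketch: numeric summary of the residuals plotted above.
residuals = turnstile_data2['ENTRIESn_hourly'] - predict[0]
print ("Residuals: mean = {:.2f}, std = {:.2f}".format(residuals.mean(), residuals.std()))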
hide_code
def ols_predictions(data, feature_list):
# Choose features and a target
    # Choose the target
    y = data['ENTRIESn_hourly']
    # Add UNIT to the features using dummy variables
    dummy_units = pd.get_dummies(data['UNIT'], prefix='unit')
    X = data[feature_list].join(dummy_units)
# Normalize features
X, mu, sigma = normalize_features(X)
X = sm.add_constant(X)
# Fit ordinary least squares model
model = sm.OLS(y,X)
results = model.fit()
predictions = results.predict(X)
coefficients = results.params
r2 = results.rsquared
return predictions, coefficients, r2
hide_code
ols_predict = ols_predictions(turnstile_data2, numerical2)
print ("Predictions for 'ENTRIESn_hourly'")
ols_predict[0][:20]
hide_code
# SGDRegressor
def sgd_predictions(data, feature_list):
# Choose features and a target
    # Choose the target
    y = data['ENTRIESn_hourly']
    # Add UNIT to the features using dummy variables
    dummy_units = pd.get_dummies(data['UNIT'], prefix='unit')
    X = data[feature_list].join(dummy_units)
# Normalize features
X, mu, sigma = normalize_features(X)
# Fit the model
clf = linear_model.SGDRegressor()
clf.fit(X, y)
coefficients = clf.coef_
r2 = clf.score(X, y)
return clf.predict(X), coefficients, r2
hide_code
sgd_predict = sgd_predictions(turnstile_data2, numerical2)
print ("Predictions for 'ENTRIESn_hourly'")
sgd_predict[0][:20]
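Note that SGDRegressor is stochastic, so its coefficients and R2 vary slightly between runs. A hedged variant (the random_state value is an assumption, not part of the original code) makes the comparison with OLS reproducible:
hide_code
# Hedged sketch: fix the random seed so the SGD results are reproducible.
clf = linear_model.SGDRegressor(random_state=42)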
Question 2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?
Answer 2.2
I have included the following features in the model: 'rain', 'hour', 'weekday', and 'meantempi' (the list numerical2 above), plus dummy variables built from the 'UNIT' column.
Question 2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model. Your reasons might be based on intuition. For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.” Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”
Answer 2.3
Based on intuitive assumptions, uncomfortable weather conditions increase ridership, so I added the variables 'rain' and 'meantempi'. Ridership also depends on the time of day, the day of the week, and the station, so the variables 'hour', 'weekday', and 'UNIT' were added to the feature list as well.
Question 2.4 What are the parameters (also known as "coefficients" or "weights") of the non-dummy features in your linear regression model?
Answer 2.4
hide_code
print ('hour {}'.format(predict[1][1]))
print ('weekday {}'.format(predict[1][2]))
print ('meantempi {}'.format(predict[1][3]))
hide_code
ols_predict[1][2:5]
hide_code
print ('hour {}'.format(sgd_predict[1][1]))
print ('weekday {}'.format(sgd_predict[1][2]))
print ('meantempi {}'.format(sgd_predict[1][3]))
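The index-based prints above are easy to misread. A small sketch (assuming the feature ordering of numerical2) pairs the leading SGD coefficients with their names, including 'rain':
hide_code
# Hedged sketch: label the first SGD coefficients with their feature names.
# Assumes the features were built as numerical2 + UNIT dummies (see sgd_predictions).
names = ['rain', 'hour', 'weekday', 'meantempi']
print (pd.Series(sgd_predict[1][:4], index=names))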
Question 2.5 What is your model’s R2 (coefficient of determination) value?
Answer 2.5
There are many statistical libraries that compute the coefficient of determination R2. Let's try some of them in this project.
hide_code
print ("sklearn.metrics, R^2 = {}".format(r2_score(turnstile_data2['ENTRIESn_hourly'], predict[0])))
slope, intercept, r_value, p_value, std_err = linregress(turnstile_data2['ENTRIESn_hourly'], predict[0])
print ("scipy.stats, R^2 = {}".format(r_value**2))
hide_code
print ('OLS model, R^2 = {}'.format(ols_predict[2]))
hide_code
print ('SGDRegressor, R^2 = {}'.format(sgd_predict[2]))
Question 2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2 value?
Answer 2.6
R2 measures how close the data are to the fitted regression line and represents the proportion of the response-variable variation that is explained by the linear model.
For simplicity: R-squared = Explained variation / Total variation.
In my experiments, 45-48% of the variation of the dependent variable 'ENTRIESn_hourly' is explained by the models. This is a reasonable and statistically significant result, but the weather and location features alone do not seem to be enough to explain the subway ridership, which is a good reason for further experiments.
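As a sanity check on the definition above, here is a minimal sketch (assuming predict[0] from the gradient-descent model) that computes R2 directly as 1 - SS_res / SS_tot:
hide_code
# Hedged sketch: compute R^2 by hand from its definition.
y_true = turnstile_data2['ENTRIESn_hourly'].values
ss_res = ((y_true - predict[0]) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print ("Manual R^2 = {:.4f}".format(1 - ss_res / ss_tot))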
Please include two visualizations that show the relationships between two or more variables in the NYC subway data. Remember to add appropriate titles and axes labels to your plots. Also, please add a short description below each figure commenting on the key insights depicted in the figure.
Question 3.1 One visualization should contain two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.
You can combine the two histograms in a single plot or you can use two separate plots. If you decide to use two separate plots for the two histograms, please ensure that the x-axis limits for both of the plots are identical. It is much easier to compare the two in that case. For the histograms, you should have intervals representing the volume of ridership (value of ENTRIESn_hourly) on the x-axis and the frequency of occurrence on the y-axis. For example, for each interval (along the x-axis), the height of the bar will represent the number of records (rows in our data) whose ENTRIESn_hourly falls in this interval.
Remember to increase the number of bins in the histogram (by having larger number of bars). The default bin width is not sufficient to capture the variability in the two samples.
Answer 3.1 Let's plot the distribution of 'ENTRIESn_hourly'. The x-axis and y-axis ranges are limited for better visualization.
hide_code
rainy_entries_hourly = turnstile_data['ENTRIESn_hourly'][turnstile_data['rain']==1]
non_rainy_entries_hourly = turnstile_data['ENTRIESn_hourly'][turnstile_data['rain']==0]
matplotlib.rcParams['figure.figsize'] = (18, 8)
plt.style.use('seaborn-whitegrid')
plt.figure()
non_rainy_entries_hourly.hist(bins=200, color = 'lightblue', edgecolor='grey',alpha=0.7)
rainy_entries_hourly.hist(bins=200, color='royalblue', edgecolor='black', alpha=0.6)
plt.xlabel("ENTRIESn_hourly", fontsize=20)
plt.ylabel("Frequency", fontsize=20)
plt.tick_params(axis='both', labelsize=15)
plt.axis((-200,6200,0,41000))
plt.title("Histogram of 'ENTRIESn_hourly': Rainy vs Non-Rainy", fontsize=20)
plt.legend(['Non-Rainy','Rainy'], fontsize=20);
As the reader can see, the two distributions have very similar shapes but different levels of 'ENTRIESn_hourly', and the data is not normally distributed. I should note that the mean and the median differ significantly (indicating a large right skew). It's also easy to see that the sample sizes differ greatly: 44104 rainy observations versus 87847 non-rainy ones.
If I apply a logarithmic scaling to the normalized histogram, we observe very similar shapes again, and the distributions look much closer to normal.
hide_code
matplotlib.rcParams['figure.figsize'] = (18, 8)
plt.style.use('seaborn-bright')
plt.figure()
np.log(non_rainy_entries_hourly+1).hist(bins=100, color='lightblue', edgecolor='grey', alpha=0.7, density=True)
np.log(rainy_entries_hourly+1).hist(bins=100, color='royalblue', edgecolor='black', alpha=0.6, density=True)
plt.axis((0.5,11.5,0,0.21))
plt.xlabel("log(ENTRIESn_hourly)", fontsize=20)
plt.ylabel("Frequency", fontsize=20)
plt.tick_params(axis='both', labelsize=15)
plt.title("Normalized Histogram of 'ENTRIESn_hourly': Rainy vs Non-Rainy. Logarithmic Scale", fontsize=20)
plt.legend(['Non-Rainy','Rainy'], fontsize=20);
Question 3.2 One visualization can be more freeform. You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like. Some suggestions are:
Answer 3.2
hide_code
turnstile_data['entries_exits'] = turnstile_data['ENTRIESn_hourly'] - turnstile_data['EXITSn_hourly']
data = turnstile_data[['Hour','entries_exits']].groupby('Hour',as_index=False).sum()
matplotlib.rcParams['figure.figsize'] = (18, 12)
plt.style.use('seaborn-bright')
plt.figure()
plt.plot(data['Hour'], data['entries_exits'], '--', lw=3, color='royalblue')
psize=list(2000*data['entries_exits'].abs()/data['entries_exits'].abs().mean())
plt.scatter(data['Hour'], data['entries_exits'], s=psize, marker='8',
            facecolors='none', edgecolors='g', linewidth=3)
plt.axis((-1, 23.5,-1000000,5500000))
plt.xlabel('Hour', fontsize=20)
plt.ylabel('Entries - Exits', fontsize=20)
plt.title('Entries - Exits By Hour', fontsize=30);
This graph shows the difference between the number of passenger entries and exits for each hour and therefore helps to estimate how many passengers are inside the subway system at certain hours.
Please address the following questions in detail. Your answers should be 1-2 paragraphs long.
Question 4.1 From your analysis and interpretation of the data, do more people ride the NYC subway when it is raining or when it is not raining?
Answer 4.1
The Mann-Whitney U test detects a statistically significant difference in NYC subway ridership between rainy and non-rainy days. But I cannot confirm a steady trend that more people use the NYC subway on rainy days. In the next answer I will explain my doubts.
Question 4.2 What analyses lead you to this conclusion? You should use results from both your statistical tests and your linear regression to support your analysis.
Answer 4.2
There are several reasons to doubt a steady tendency in the ridership measurements. First, the relative differences between the means and the medians for rainy and non-rainy days are quite small, as computed below. Second, comparing the models' R^2 values with and without the feature 'rain' shows how little this feature adds to the predictive power.
# Relative differences of the means and the medians, in percent
print ("Means' difference: {:.2f}%".format(100.0*(1105-1090)/1105))
print ("Medians' difference: {:.2f}%".format(100.0*(282-278)/282))
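The same differences can also be computed from the unrounded sample statistics of Section 1; a small sketch:
hide_code
# Hedged sketch: relative differences from the unrounded statistics above.
print ("Means' difference: {:.2f}%".format(
    100.0 * (with_rain_mean - without_rain_mean) / with_rain_mean))
print ("Medians' difference: {:.2f}%".format(
    100.0 * (with_rain_median - without_rain_median) / with_rain_median))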
hide_code
# Feature list without 'rain'
numerical3 = ['hour', 'weekday', 'meantempi']
hide_code
print ("Predictions with the feature 'rain'")
print('')
print ("Built functions, R^2 = {}".format(r2_score(turnstile_data2['ENTRIESn_hourly'], predict[0])))
print('')
print ('OLS model, R^2 = {}'.format(ols_predict[2]))
print('')
print ('SGDRegressor, R^2 = {}'.format(sgd_predict[2]))
hide_code
print ("Predictions without the feature 'rain'")
print('')
predict3 = predictions(turnstile_data2, numerical3, 'ENTRIESn_hourly')
print ("Built functions, R^2 = {}".format(r2_score(turnstile_data2['ENTRIESn_hourly'], predict3[0])))
print ('')
ols_predict3 = ols_predictions(turnstile_data2, numerical3)
print ('OLS model, R^2 = {}'.format(ols_predict3[2]))
print('')
sgd_predict3 = sgd_predictions(turnstile_data2, numerical3)
print ('SGDRegressor, R^2 = {}'.format(sgd_predict3[2]))
Please address the following questions in detail. Your answers should be 1-2 paragraphs long.
Question 5.1 Please discuss potential shortcomings of the methods of your analysis, including:
Answer 5.1
The dataset and analysis have potential shortcomings.
Dataset.
Models.
I can illustrate my doubts with one additional visualization. It displays very clearly how close the predictions of the different models are to each other and how far away they are from the real data.
hide_code
matplotlib.rcParams['figure.figsize'] = (18, 12)
plt.style.use('seaborn-bright')
plt.figure()
plt.plot(ols_predict[0], 'o', markersize=15,
markeredgecolor='royalblue', markerfacecolor="None", markeredgewidth=2)
plt.plot(ols_predict[0], color='royalblue')
plt.plot(sgd_predict[0], 'o', markersize=15,
markeredgecolor='green', markerfacecolor="None", markeredgewidth=2)
plt.plot(sgd_predict[0], color='green')
plt.plot(turnstile_data2['ENTRIESn_hourly'], 'o', markersize=15,
markeredgecolor='black', markerfacecolor="None", markeredgewidth=2)
plt.plot(turnstile_data2['ENTRIESn_hourly'], color='black')
plt.axis((10000,10050,-2100,2100))
plt.xlabel('Serial Number of Observations', fontsize=20)
plt.ylabel('ENTRIESn_hourly', fontsize=20)
plt.title('Predictions vs Real Data for ENTRIESn_hourly', fontsize=30)
plt.legend(['OLS Model', 'OLS Model', 'SGDRegressor', 'SGDRegressor', 'Real Data', 'Real Data'], fontsize=20);
Question 5.2 (Optional) Do you have any other insight about the dataset that you would like to share with us?
Answer 5.2
I would like to display the fairly high correlations between the variable 'rain' and the other weather indicators. Such multicollinearity can reduce the accuracy of any prediction model that includes these features together.
hide_code
# Create new dataset for correlation matrix
list1 = ['maxdewpti', 'minpressurei', 'mindewpti', 'maxpressurei',
'meandewpti', 'meanpressurei', 'fog', 'rain',
'meanwindspdi', 'mintempi', 'meantempi', 'maxtempi', 'precipi']
data = turnstile_data[list1]
hide_code
new_data = data.drop('rain', axis = 1)
target = data['rain']
new_data_target = pd.concat([new_data, target], axis = 1)
# Display the correlation table for the given feature
print ("The correlation table for the choosen feature 'rain'")
pearson = new_data_target.corr(method='pearson')
corr_with_delicatessen = pearson.iloc[-1][:-1]
corr_with_delicatessen[abs(corr_with_delicatessen).argsort()[::-1]]
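As a complement, the same correlation matrix can be drawn as a heatmap with the seaborn library imported above; a minimal sketch:
hide_code
# Hedged sketch: correlation matrix of the weather features as a heatmap.
plt.figure(figsize=(12, 10))
sns.heatmap(pearson, cmap='coolwarm', center=0)
plt.title("Correlation Matrix of the Weather Features", fontsize=20);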