Machine Learning Engineer Nanodegree

Additional Project

📑  P8: Analyzing the NYC Subway Dataset

Links and Code Library

Resources

🕸 scikit-learn. Machine Learning in Python
🕸 Scipy Lecture Notes
🕸 Hypothesis Testing - MIT OpenCourseWare
🕸 Assumptions of the Mann-Whitney U test

Code Library


Set of Functions


Statistical Test

Data Extraction and Description



Test Selection

Question 1.1

Which statistical test did you use to analyze the NYC subway data? Did you use a one-tail or a two-tail P value?
What is the null hypothesis? What is your p-critical value?

Answer 1.1

The Mann-Whitney U test is a good choice for comparing NYC subway ridership on rainy and non-rainy days.
The column ENTRIESn_hourly will serve as the target and the column rain as the feature.
I will test the null hypothesis: the distributions of NYC subway ridership are the same for rainy and non-rainy days.
An equivalent formulation of this hypothesis is: the difference between the ridership medians / means for rainy and non-rainy days is equal to zero.
I will use a two-tailed test to detect statistical significance in both possible directions of interest.
Let's set the p-critical value to 0.05: we will reject the null hypothesis at the 95% confidence level.

Question 1.2

Why is this statistical test applicable to the dataset?
In particular, consider the assumptions that the test is making about the distribution of ridership in the two samples.

Answer 1.2

The Mann-Whitney U test is a non-parametric alternative for comparing the medians / means (or the distributions) of two samples to check whether they come from the same population.
I have noted that the distribution of ENTRIESn_hourly is not normal, so I cannot use the t-test in this case.
The assumptions of the Mann-Whitney U test and how this dataset meets them:
1) one dependent variable is measured at the continuous or ordinal level (ENTRIESn_hourly);
2) one independent variable consists of two categorical, independent groups (rain);
3) the observations are independent (true for this data: the samples do not affect each other);
4) if the two distributions have the same shape, the test detects differences in the medians / means of the two groups; if they have different shapes, it detects differences in the distributions of the two groups (in our case the shapes are similar but at different levels);
5) the two samples are not required to have the same number of observations (true for this data).

Test Execution

First, I will describe the two samples, then I will run the test and display the results.
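A minimal sketch of this step, assuming the data has been loaded into a pandas DataFrame with the columns ENTRIESn_hourly and rain (the file name below is an assumption; use the dataset actually loaded in the notebook):

import pandas as pd
from scipy import stats

# File name assumed for this part of the project; adjust if it differs
df = pd.read_csv('turnstile_data_master_with_weather.csv')

# Split the hourly entries into rainy and non-rainy samples
rainy = df.loc[df['rain'] == 1, 'ENTRIESn_hourly']
dry = df.loc[df['rain'] == 0, 'ENTRIESn_hourly']

# Describe the two samples: counts, means, quartiles (the 50% row is the median)
print(rainy.describe())
print(dry.describe())

# Two-sided Mann-Whitney U test, compared against the p-critical value 0.05
u_statistic, p_value = stats.mannwhitneyu(rainy, dry, alternative='two-sided')
print('U = {:.1f}, p-value = {:.4f}'.format(u_statistic, p_value))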



Question 1.3

What results did you get from this statistical test?
These should include the following numerical values: p-values, as well as the medians / means for each of the two samples under test.

Answer 1.3


Question 1.4

What is the significance and interpretation of these results?

Answer 1.4

The median / mean values of ENTRIESn_hourly on rainy days are only slightly larger than on non-rainy days.
I cannot determine whether the null hypothesis should be rejected based only on the difference between each pair of values.
The Mann-Whitney U test gives a more informative answer about whether the null hypothesis holds.
The p-value of this test is less than the p-critical value.
Therefore, I can reject the null hypothesis at the 95% confidence level.

Linear Regression

In this section, I will use the improved dataset turnstile_weather_v2.csv. Let's load and describe it.
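A sketch of loading and describing the improved dataset and of assembling the feature matrix and target used in the rest of this section (the unit column name and the use of get_dummies are assumptions about how the notebook builds its features):

import pandas as pd

# Load the improved dataset and take a first look at it
df = pd.read_csv('turnstile_weather_v2.csv')
print(df.shape)
print(df.describe())

# Target and base features for the regression models below
y = df['ENTRIESn_hourly']
features = df[['rain', 'hour', 'weekday', 'meantempi']].copy()

# One 0/1 dummy column per turnstile unit; the column is assumed to be
# named 'UNIT' in the CSV -- adjust to the actual column name if it differs
unit_dummies = pd.get_dummies(df['UNIT'], prefix='unit').astype(float)
X = features.join(unit_dummies)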

Question 2.1

What approach did you use to compute the coefficients theta and produce prediction for ENTRIESn_hourly in your regression model:
- ols using statsmodels or sklearn
- gradient descent using sklearn
- or something different?

Answer 2.1

To produce predictions, I would like to try:
- a simple ordinary least squares model (OLS, statsmodels),
- stochastic gradient descent regression (SGDRegressor, sklearn), and
- a set of hand-built functions (normalize_features(), compute_cost(), gradient_descent(), predictions()).
It will be interesting to compare the results.

Predictions by the Set of Built Functions


The code cell below produces predictions and displays the histogram of residuals (the differences between the observed values and the estimated values).
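A sketch of what the hand-built functions might look like, reusing the feature matrix X and target y assembled above (the learning rate and the number of iterations are illustrative; the exact implementations in the notebook may differ):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def normalize_features(features):
    """Scale each feature column to zero mean and unit variance."""
    means = features.mean()
    stds = features.std(ddof=0)
    return (features - means) / stds, means, stds

def compute_cost(features, values, theta):
    """Squared-error cost for the parameter vector theta."""
    m = len(values)
    errors = np.dot(features, theta) - values
    return errors.dot(errors) / (2 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    """Batch gradient descent; returns the final theta and the cost history."""
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        predicted = np.dot(features, theta)
        theta = theta - (alpha / m) * np.dot(features.T, predicted - values)
        cost_history.append(compute_cost(features, values, theta))
    return theta, pd.Series(cost_history)

def predictions(features, values, alpha=0.1, num_iterations=75):
    """Normalize, add an intercept column, fit by gradient descent, predict."""
    normalized, _, _ = normalize_features(features)
    normalized.insert(0, 'intercept', 1.0)
    array = normalized.values
    theta, cost_history = gradient_descent(array, values,
                                           np.zeros(array.shape[1]),
                                           alpha, num_iterations)
    return np.dot(array, theta), theta, cost_history

# Fit, predict and plot the histogram of residuals
predicted_gd, theta_gd, cost_history = predictions(X, y)
(y - predicted_gd).hist(bins=100)
plt.show()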

Predictions by the OLS Model
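A possible statsmodels version, reusing X and y from the loading sketch above:

import statsmodels.api as sm

# Add an explicit intercept column and fit ordinary least squares
X_ols = sm.add_constant(X)
ols_results = sm.OLS(y, X_ols).fit()

# Predictions plus a summary that includes the coefficients and R^2
ols_predictions = ols_results.predict(X_ols)
print(ols_results.summary())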


Predictions by the SGDRegressor
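A possible scikit-learn version; SGDRegressor is sensitive to feature scaling, so the inputs are standardized first (the hyperparameters shown are illustrative values for a recent scikit-learn release):

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a linear model by stochastic gradient descent
X_scaled = StandardScaler().fit_transform(X)
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X_scaled, y)
sgd_predictions = sgd.predict(X_scaled)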


Question 2.2

What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?

Answer 2.2

I have included the following features in the model: 'rain', 'hour', 'weekday', 'Unit', 'meantempi'.
Dummy features that take the values 0 or 1 to indicate the absence or presence of a categorical value (for the feature 'Unit') were used.

Question 2.3

Why did you select these features in your model?
We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model. Your reasons might be based on intuition.
For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.”
Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”

Answer 2.3

Based on intuition, uncomfortable weather conditions increase ridership, so I added the variables 'rain' and 'meantempi'.
Ridership also depends on the time of day, the day of the week, and the station, so the variables 'Unit', 'weekday', and 'hour' were added to the feature list as well.

Question 2.4

What are the parameters (also known as "coefficients" or "weights") of the non-dummy features in your linear regression model?

Answer 2.4

Coefficients by the Set of Built Functions

Coefficients by the OLS Model

Coefficients by the SGDRegressor
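A sketch of how the non-dummy coefficients could be read from each fitted model (names follow the sketches above; the gradient-descent and SGD coefficients are on the normalized / standardized scale):

# Hand-built gradient descent: theta_gd[0] is the intercept, the next four
# entries correspond to 'rain', 'hour', 'weekday', 'meantempi'
print(theta_gd[:5])

# statsmodels OLS: coefficients are indexed by feature name
print(ols_results.params[['rain', 'hour', 'weekday', 'meantempi']])

# sklearn SGDRegressor: coef_ follows the column order of X
print(dict(zip(X.columns[:4], sgd.coef_[:4])))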

Question 2.5

What is your model’s R2 (coefficients of determination) value?

Answer 2.5

There are many statistical code libraries that compute the coefficient of determination R2. Let's try some of them in this project.
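A few ways to obtain R2, as a sketch (variable names follow the earlier sketches):

from sklearn.metrics import r2_score

# 1) Straight from the definition: R^2 = 1 - SS_residual / SS_total
def r_squared(values, predicted):
    ss_res = ((values - predicted) ** 2).sum()
    ss_tot = ((values - values.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(r_squared(y, predicted_gd))

# 2) The scikit-learn helper
print(r2_score(y, sgd_predictions))

# 3) statsmodels stores R^2 on the fitted results object
print(ols_results.rsquared)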

Question 2.6

What does this R2 value mean for the goodness of fit for your regression model?
Do you think this linear model to predict ridership is appropriate for this dataset, given this R2 value?

Answer 2.6

R2 measures how close the data are to the fitted regression line and represents the percentage of the response variable variation that is explained by the linear model.
For simplicity, R-squared = Explained variation / Total variation.
In my experiments, 45-48% of the variation of the dependent variable ENTRIESn_hourly is explained by the models.
This is a reasonable and statistically significant result.
Still, I suppose that the weather and location features alone are not enough to predict subway ridership.
That is a good reason for further experiments.

Visualization

Please include two visualizations that show the relationships between two or more variables in the NYC subway data.
Remember to add appropriate titles and axes labels to your plots.
Also, please add a short description below each figure commenting on the key insights depicted in the figure.

Question 3.1

One visualization should contain two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.
You can combine the two histograms in a single plot or you can use two separate plots.
If you decide to use two separate plots for the two histograms, please ensure that the x-axis limits for both of the plots are identical.
It is much easier to compare the two in that case.
For the histograms, you should have intervals representing the volume of ridership (value of ENTRIESn_hourly) on the x-axis and the frequency of occurrence on the y-axis.
For example, each interval (along the x-axis), the height of the bar for this interval will represent the number of records (rows in our data)
that have ENTRIESn_hourly that falls in this interval.
Remember to increase the number of bins in the histogram (by having larger number of bars). The default bin width is not sufficient to capture the variability in the two samples.

Answer 3.1

Let's plot the distribution of ENTRIESn_hourly. The x-axis and y-axis ranges are limited for better visualization.
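A minimal matplotlib sketch of this plot (the bin count and axis limits are illustrative):

import matplotlib.pyplot as plt

# Rainy / non-rainy samples of hourly entries (df as loaded earlier)
rainy = df.loc[df['rain'] == 1, 'ENTRIESn_hourly']
dry = df.loc[df['rain'] == 0, 'ENTRIESn_hourly']

fig, ax = plt.subplots(figsize=(10, 5))
bins = 200  # more bins than the default to capture the variability
ax.hist(dry, bins=bins, range=(0, 6000), alpha=0.6, label='no rain')
ax.hist(rainy, bins=bins, range=(0, 6000), alpha=0.6, label='rain')
ax.set_xlim(0, 6000)
ax.set_xlabel('ENTRIESn_hourly')
ax.set_ylabel('Frequency')
ax.set_title('Hourly entries on rainy vs. non-rainy days')
ax.legend()
plt.show()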

As readers can see, the two distributions have very similar shapes but different levels of ENTRIESn_hourly, and the data is not normally distributed.
I should note that the mean and the median differ significantly (indicating a large skew).
Also, it is easy to see that the sample sizes differ greatly.
If I apply a non-linear logarithmic scale to the normalized histogram, we can observe very similar shapes that look closer to a normal distribution.

Question 3.2

One visualization can be more freeform.
You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like.
Some suggestions are:
- Ridership by time-of-day
- Ridership by day-of-week

Answer 3.2


This graph shows the difference between the number of passenger entries and exits for each hour and
therefore helps estimate how many passengers are in the subway at certain hours.
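One way this plot could be built, assuming the improved dataset with the columns ENTRIESn_hourly, EXITSn_hourly and hour:

import matplotlib.pyplot as plt

# Total entries minus total exits for each hour of the day
grouped = df.groupby('hour')[['ENTRIESn_hourly', 'EXITSn_hourly']].sum()
balance = grouped['ENTRIESn_hourly'] - grouped['EXITSn_hourly']

fig, ax = plt.subplots(figsize=(10, 5))
balance.plot(kind='bar', ax=ax)
ax.set_xlabel('Hour of day')
ax.set_ylabel('Entries minus exits')
ax.set_title('Difference between entries and exits by hour')
plt.show()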

Conclusion

Please address the following questions in detail. Your answers should be 1-2 paragraphs long.

Question 4.1

From your analysis and interpretation of the data, do more people ride the NYC subway when it is raining or when it is not raining?

Answer 4.1

The Mann-Whitney U test detects a statistically significant difference in NYC subway ridership between rainy and non-rainy days.
However, I cannot confirm a steady trend that more people use the NYC subway on rainy days.
In the next answer I will explain my doubts.

Question 4.2

What analyses lead you to this conclusion? You should use results from both your statistical tests and your linear regression to support your analysis.

Answer 4.2

There are several reasons why clearer tendencies are needed before drawing conclusions about ridership:
1) The result of the Mann-Whitney U test is very close to the border of the confidence interval, so new samples might not reproduce it.
2) The influence of the variable 'rain' is possibly combined with other weather conditions.
3) The means and the medians of the two samples do not differ much (by about 1% in both cases).
4) The normalized distributions have very similar shapes.
5) All linear models in this project that include the variable 'rain' in the feature set do not fit particularly well and have lower R2 scores than I expected.
6) As we can see below, excluding the variable 'rain' does not affect the models' R2 scores significantly.
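A sketch of the comparison referenced in point 6: refit the same OLS model with and without 'rain' and compare the R2 scores (helper names follow the earlier sketches; the actual numbers come from running the notebook):

import statsmodels.api as sm

def fit_and_score(feature_columns):
    """Fit OLS on the selected columns plus the unit dummies and return R^2."""
    X_subset = df[feature_columns].join(unit_dummies)
    results = sm.OLS(y, sm.add_constant(X_subset)).fit()
    return results.rsquared

print('R^2 with rain:   ', fit_and_score(['rain', 'hour', 'weekday', 'meantempi']))
print('R^2 without rain:', fit_and_score(['hour', 'weekday', 'meantempi']))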




Reflections

Please address the following questions in detail. Your answers should be 1-2 paragraphs long.

Question 5.1

Please discuss potential shortcomings of the methods of your analysis, including:
- dataset,
- analysis, such as the linear regression model or statistical tests.

Answer 5.1

The dataset and the analysis both have potential shortcomings.
Dataset:
1) The dataset covers only a single month. That may be too short a period for exact analysis: short-term or random events can affect subway ridership.
2) The result of the Mann-Whitney test can be biased by events within this short period, and it is very close to the border of the confidence interval.
3) I think some features that influence ridership are not present in the dataset.
Models:
1) Some features can be dependent and highly correlated, which reduces the models' accuracy.
2) Linear regression may not be a good model for predicting ridership with these features: ENTRIESn_hourly could depend non-linearly on the other variables.
3) A linear model may not be accurate enough when predicting very large values if important features are missing.
I can confirm my doubts by creating one additional visualization.
It displays very clearly how close the predictions of the different models are to each other and how far they are from the real data.
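One way to build this visualization: plot a slice of the real ENTRIESn_hourly values against the predictions of the three models (variable names follow the earlier sketches; the slice length is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

n = 300  # number of observations to display
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(np.asarray(y)[:n], label='real data', linewidth=2)
ax.plot(np.asarray(predicted_gd)[:n], label='gradient descent')
ax.plot(np.asarray(ols_predictions)[:n], label='OLS')
ax.plot(np.asarray(sgd_predictions)[:n], label='SGDRegressor')
ax.set_xlabel('Observation index')
ax.set_ylabel('ENTRIESn_hourly')
ax.set_title('Real values vs. model predictions')
ax.legend()
plt.show()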

Question 5.2 (Optional)

Do you have any other insight about the dataset that you would like to share with us?

Answer 5.2

I would like to display the relatively high correlations between the variable 'rain' and other weather-related variables.
This reduced the accuracy of many prediction models.
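A sketch of how these correlations could be displayed (the weather column names listed are assumptions about the improved dataset; adjust to the actual columns):

# Correlation of 'rain' with other weather-related variables
weather_columns = ['rain', 'precipi', 'meanprecipi', 'fog', 'tempi', 'meantempi']
print(df[weather_columns].corr()['rain'].sort_values(ascending=False))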

Additional Code Cell