mlcourse.ai – Open Machine Learning Course

Authors: Yury Kashnitskiy and Maxim Keremet. Translated and edited by Artem Trunov, and Aditya Soni. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #1. Spring 2019

Exploratory Data Analysis (EDA) of US flights
(using Pandas, Matplotlib & Seaborn)

Prior to working on the assignment, you'd better check out the corresponding course material:

Your task is to:

  1. write code and perform computations in the cells below
  2. choose answers in the webform. Solutions will be shared only with those who've filled in this form
  3. submit answers with some email and remember it! This will be your ID during the course. Specify your real full name in the form as well (no nicks allowed in the final top-100 rating). If in doubt, you can re-submit the form till the deadline for A1, no problem, but stick to only one email.

Deadline for A1: 2019 February 24, 20:59 GMT (London time)

You'll get up to 12 credits for this assignment - the web-form score will be halved and rounded.

How to get help

In ODS Slack (if you still don't have access, fill in the form mentioned on the mlcourse.ai main page), we have a channel #mlcourse_ai_news with announcements from the course team. You can discuss the course content freely in the #mlcourse_ai channel (we still have a huge Russian-speaking group, they have a separate channel #mlcourse_ai_rus).

Here's how you reply in a thread (press this dialog icon to drill down into a thread):

Please stick to special threads for your questions:

Help each other without sharing correct code and answers. Our TA Maxim @maximkeremet is there to help (only in the mentioned threads, do not write to him directly).

Lastly, you can save useful messages by pinning them, further you can find pinned items on the top, just below the channel name:

In [1]:
import numpy as np
import pandas as pd
# pip install seaborn 
import seaborn as sns
import matplotlib.pyplot as plt
  • Download the data archive (Archived ~ 114 Mb, unzipped - ~ 690 Mb, alternative link). No need to unzip - pandas can unzip on the fly.
  • Place it in the "../../data" folder, or change the path below according to your location.
  • The dataset has information about carriers and flights between US airports during the year 2008.
  • Column description is available here. Visit this site to find ex. meaning of flight cancellation codes.

Consider the following terms we use:

  • unique flight - a record (row) in the dataset
  • completed flight = flight that is not cancelled (Cancelled==0 in the dataset)
  • flight code - a combination of ['UniqueCarrier','FlightNum'], i.e. UA52
  • airport code - a three letter airport alias from 'Origin or 'Dest' columns

Reading data into memory and creating a Pandas DataFrame object

(This may take a while, be patient)

We are not going to read in the whole dataset. In order to reduce memory footprint, we instead load only needed columns and cast them to suitable data types.

In [2]:
dtype = {'DayOfWeek': np.uint8, 'DayofMonth': np.uint8, 'Month': np.uint8 , 
         'Cancelled': np.uint8, 'Year': np.uint16, 'FlightNum': np.uint16 , 
         'Distance': np.uint16, 'UniqueCarrier': str, 'CancellationCode': str, 
         'Origin': str, 'Dest': str, 'ArrDelay': np.float16, 
         'DepDelay': np.float16, 'CarrierDelay': np.float16, 
         'WeatherDelay': np.float16, 'NASDelay': np.float16, 
         'SecurityDelay': np.float16, 'LateAircraftDelay': np.float16, 
         'DepTime': np.float16}
In [3]:
%%time
# change the path if needed
path = '../../data/2008.csv.bz2'
flights_df = pd.read_csv(path, usecols=dtype.keys(), dtype=dtype)
CPU times: user 34.6 s, sys: 309 ms, total: 34.9 s
Wall time: 33.6 s

Check the number of rows and columns and print column names.

In [4]:
print(flights_df.shape)
print(flights_df.columns)
(7009728, 19)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier',
       'FlightNum', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
       'Cancelled', 'CancellationCode', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')

Print first 5 rows of the dataset.

In [5]:
flights_df.head()
Out[5]:
Year Month DayofMonth DayOfWeek DepTime UniqueCarrier FlightNum ArrDelay DepDelay Origin Dest Distance Cancelled CancellationCode CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
0 2008 1 3 4 2003.0 WN 335 -14.0 8.0 IAD TPA 810 0 NaN NaN NaN NaN NaN NaN
1 2008 1 3 4 754.0 WN 3231 2.0 19.0 IAD TPA 810 0 NaN NaN NaN NaN NaN NaN
2 2008 1 3 4 628.0 WN 448 14.0 8.0 IND BWI 515 0 NaN NaN NaN NaN NaN NaN
3 2008 1 3 4 926.0 WN 1746 -6.0 -4.0 IND BWI 515 0 NaN NaN NaN NaN NaN NaN
4 2008 1 3 4 1829.0 WN 3920 34.0 34.0 IND BWI 515 0 NaN 2.0 0.0 0.0 0.0 32.0

Transpose the frame to see all features at once.

In [6]:
flights_df.head().T
Out[6]:
0 1 2 3 4
Year 2008 2008 2008 2008 2008
Month 1 1 1 1 1
DayofMonth 3 3 3 3 3
DayOfWeek 4 4 4 4 4
DepTime 2003 754 628 926 1829
UniqueCarrier WN WN WN WN WN
FlightNum 335 3231 448 1746 3920
ArrDelay -14 2 14 -6 34
DepDelay 8 19 8 -4 34
Origin IAD IAD IND IND IND
Dest TPA TPA BWI BWI BWI
Distance 810 810 515 515 515
Cancelled 0 0 0 0 0
CancellationCode NaN NaN NaN NaN NaN
CarrierDelay NaN NaN NaN NaN 2
WeatherDelay NaN NaN NaN NaN 0
NASDelay NaN NaN NaN NaN 0
SecurityDelay NaN NaN NaN NaN 0
LateAircraftDelay NaN NaN NaN NaN 32

Examine data types of all features and total dataframe size in memory.

In [7]:
flights_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 19 columns):
Year                 uint16
Month                uint8
DayofMonth           uint8
DayOfWeek            uint8
DepTime              float16
UniqueCarrier        object
FlightNum            uint16
ArrDelay             float16
DepDelay             float16
Origin               object
Dest                 object
Distance             uint16
Cancelled            uint8
CancellationCode     object
CarrierDelay         float16
WeatherDelay         float16
NASDelay             float16
SecurityDelay        float16
LateAircraftDelay    float16
dtypes: float16(8), object(4), uint16(3), uint8(4)
memory usage: 387.7+ MB

Get basic statistics of each feature.

In [8]:
flights_df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
Year 7009728.0 2008.000000 0.000000 2008.0 2008.0 2008.0 2008.0 2008.0
Month 7009728.0 6.375130 3.406737 1.0 3.0 6.0 9.0 12.0
DayofMonth 7009728.0 15.728015 8.797068 1.0 8.0 16.0 23.0 31.0
DayOfWeek 7009728.0 3.924182 1.988259 1.0 2.0 4.0 6.0 7.0
DepTime 6873482.0 NaN NaN 1.0 928.0 1325.0 1728.0 2400.0
FlightNum 7009728.0 2224.200105 1961.715999 1.0 622.0 1571.0 3518.0 9743.0
ArrDelay 6855029.0 NaN NaN -519.0 -10.0 -2.0 12.0 2460.0
DepDelay 6873482.0 NaN NaN -534.0 -4.0 -1.0 8.0 2468.0
Distance 7009728.0 726.387029 562.101803 11.0 325.0 581.0 954.0 4962.0
Cancelled 7009728.0 0.019606 0.138643 0.0 0.0 0.0 0.0 1.0
CarrierDelay 1524735.0 NaN NaN 0.0 0.0 0.0 16.0 2436.0
WeatherDelay 1524735.0 NaN NaN 0.0 0.0 0.0 0.0 1352.0
NASDelay 1524735.0 NaN NaN 0.0 0.0 6.0 21.0 1357.0
SecurityDelay 1524735.0 NaN NaN 0.0 0.0 0.0 0.0 392.0
LateAircraftDelay 1524735.0 NaN NaN 0.0 0.0 0.0 26.0 1316.0

Count unique Carriers and plot their relative share of flights:

In [9]:
flights_df['UniqueCarrier'].nunique()
Out[9]:
20
In [10]:
flights_df.groupby('UniqueCarrier').size().plot(kind='bar');

We can also group by category/categories in order to calculate different aggregated statistics.

For example, finding top-3 flight codes, that have the largest total distance traveled in year 2008.

In [11]:
flights_df.groupby(['UniqueCarrier','FlightNum'])['Distance'].sum().sort_values(
    ascending=False).iloc[:3]
Out[11]:
UniqueCarrier  FlightNum
CO             15           1796244.0
               14           1796244.0
UA             52           1789722.0
Name: Distance, dtype: float64

Another way:

In [12]:
flights_df.groupby(['UniqueCarrier','FlightNum'])\
  .agg({'Distance': [np.mean, np.sum, 'count'],
        'Cancelled': np.sum})\
  .sort_values(('Distance', 'sum'), ascending=False)\
  .iloc[0:3]
Out[12]:
Distance Cancelled
mean sum count sum
UniqueCarrier FlightNum
CO 15 4962.000000 1796244.0 362 0
14 4962.000000 1796244.0 362 0
UA 52 2465.181818 1789722.0 726 8

Number of flights by days of week and months:

In [13]:
pd.crosstab(flights_df.Month, flights_df.DayOfWeek)
Out[13]:
DayOfWeek 1 2 3 4 5 6 7
Month
1 80807 97298 100080 102043 81940 67178 76419
2 81504 79700 80587 82158 102726 66462 76099
3 103210 81159 82307 82831 82936 86153 97494
4 82463 100785 102586 82799 82964 68304 78225
5 80626 79884 81264 102572 102878 84493 74576
6 104168 82160 82902 83617 83930 72322 99566
7 84095 103429 103315 105035 79349 72219 80489
8 82983 80895 81773 82625 103878 86155 93970
9 94300 91533 74057 75589 75881 58343 71205
10 75131 72195 91900 94123 93894 58168 70794
11 74214 72443 73653 68071 70484 76031 88376
12 92700 90568 85241 70761 74306 61708 69674

It can also be handy to color such tables in order to easily notice outliers:

In [14]:
plt.imshow(pd.crosstab(flights_df.Month, flights_df.DayOfWeek),
           cmap='seismic', interpolation='none');

Flight distance histogram:

In [15]:
flights_df.hist('Distance', bins=20);

Making a histogram of flight frequency by date.

In [16]:
flights_df['Date'] = pd.to_datetime(flights_df.rename(
    columns={'DayofMonth': 'Day'})[['Year', 'Month', 'Day']])
In [17]:
num_flights_by_date = flights_df.groupby('Date').size()
In [18]:
num_flights_by_date.plot();

Do you see a weekly pattern above? And below?

In [19]:
num_flights_by_date.rolling(window=7).mean().plot();

We'll need a new column in our dataset - departure hour, let's create it.

As we see, DepTime is distributed from 1 to 2400 (it is given in the hhmm format, check the column description again). We'll treat departure hour as DepTime // 100 (divide by 100 and apply the floor function). However, now we'll have both hour 0 and hour 24. Hour 24 sounds strange, we'll set it to be 0 instead (a typical imperfectness of real data, however, you can check that it affects only 521 rows, which is sort of not a big deal). So now values of a new column DepHour will be distributed from 0 to 23. There are some missing values, for now we won't fill in them, just ignore them.

In [20]:
flights_df['DepHour'] = flights_df['DepTime'] // 100
flights_df['DepHour'].replace(to_replace=24, value=0, inplace=True)
In [21]:
flights_df['DepHour'].describe()
Out[21]:
count    6873482.0
mean           NaN
std            0.0
min            0.0
25%            9.0
50%           13.0
75%           17.0
max           23.0
Name: DepHour, dtype: float64

Now it's your turn. Answer the questions below.

1. How many unique carriers are there in our dataset?

  • 10
  • 15
  • 20
  • 25
In [22]:
# You code here

2. We have both cancelled and completed flights in the dataset. Check if there are more completed or cancelled flights. What is the difference?

  • Cancelled overweights completed by 329 flights
  • Completed overweights cancelled by 6734860 flights
  • Cancelled overweights completed by 671 flights
  • Completed overweights cancelled by 11088967 flights
In [23]:
# You code here

3. Find a flight with the longest departure delays and a flight with the longest arrival delay. Do they have the same destination airport, and if yes, what is its code?

  • yes, ATL
  • yes, HNL
  • yes, MSP
  • no
In [24]:
# You code here

4. Find the carrier that has the greatest number of cancelled flights.

  • AA
  • MQ
  • WN
  • CO
In [25]:
# You code here

5. Let's examine departure time and consider distribution by hour (column DepHour that we've created earlier). Which hour has the highest percentage of flights?

  • 1 am
  • 5 am
  • 8 am
  • 3 pm
In [26]:
# You code here

6. OK, now let's examine cancelled flight distribution by time. Which hour has the least percentage of cancelled flights?

  • 2 am
  • 9 pm
  • 8 am
  • 3 am
In [27]:
# You code here

7. Is there any hour that didn't have any cancelled flights at all? Check all that apply.

  • 3
  • 19
  • 22
  • 4
In [28]:
# You code here

8. Find the busiest hour, or in other words, the hour when the number of departed flights reaches its maximum.

Hint: Consider only completed flights.

  • 4
  • 7
  • 8
  • 17
In [29]:
# You code here

9. Since we know the departure hour, it might be interesting to examine the average delay for corresponding hour. Are there any cases, when the planes on average departed earlier than they should have done? And if yes, at what departure hours did it happen?

Hint: Consider only completed flights.

  • no, there are no such cases
  • yes, at 5-6 am
  • yes, at 9-10 am
  • yes, at 2-4 pm
In [30]:
# You code here

10. Considering only the completed flights by the carrier, that you have found in Question 4, find the distribution of these flights by hour. At what time does the greatest number of its planes depart?

  • at noon
  • at 7 am
  • at 8 am
  • at 10 am
In [31]:
# You code here

11. Find top-10 carriers in terms of the number of completed flights (UniqueCarrier column)?

Which of the listed below is not in your top-10 list?

  • DL
  • AA
  • OO
  • EV
In [32]:
# You code here

12. Plot distributions of flight cancellation reasons (CancellationCode).

What is the most frequent reason for flight cancellation? (Use this link to translate codes into reasons)

  • Carrier
  • Weather conditions
  • National Air System
  • Security reasons
In [33]:
# You code here

13. Which route is the most frequent, in terms of the number of flights?

(Take a look at 'Origin' and 'Dest' features. Consider A->B and B->A directions as different routes)

  • New-York – Washington (JFK-IAD)
  • San-Francisco – Los-Angeles (SFO-LAX)
  • San-Jose – Dallas (SJC-DFW)
  • New-York – San-Francisco (JFK-SFO)
In [34]:
# You code here

14. Find top-5 delayed routes (count how many times they were delayed on departure). From all flights on these 5 routes, count all flights with weather conditions contributing to a delay.

Hint: consider only positive delays

  • 449
  • 539
  • 549
  • 668
In [35]:
# You code here

15. Examine the hourly distribution of departure times. Choose all correct statements:

  • Flights are normally distributed within time interval [0-23] (Search for: Normal distribution, bell curve).
  • Flights are uniformly distributed within time interval [0-23].
  • In the period from 0 am to 4 am there are considerably less flights than from 7 pm to 8 pm.
In [36]:
# You code here

16. Show how the number of flights changes through time (on the daily/weekly/monthly basis) and interpret the findings.

Choose all correct statements:

  • The number of flights during weekends is less than during weekdays (working days).
  • The lowest number of flights is on Sunday.
  • There are less flights during winter than during summer.

Hint: Look for official meteorological winter months for the Northern Hemisphere.

In [37]:
# You code here

17. Examine the distribution of cancellation reasons with time. Make a bar plot of cancellation reasons aggregated by months.

Choose all correct statements:

  • October has the lowest number of cancellations due to weather.
  • The highest number of cancellations in September is due to Security reasons.
  • April's top cancellation reason is carriers.
  • Flights cancellations due to National Air System are more frequent than those due to carriers.
In [38]:
# You code here

18. Which month has the greatest number of cancellations due to Carrier?

  • May
  • January
  • September
  • April
In [39]:
# You code here

19. Identify the carrier with the greatest number of cancellations due to carrier in the corresponding month from the previous question.

  • 9E
  • EV
  • HA
  • AA
In [40]:
# You code here

20. Examine median arrival and departure delays (in time) by carrier. Which carrier has the lowest median delay time for both arrivals and departures? Leave only non-negative values of delay times ('ArrDelay', 'DepDelay'). (Boxplots can be helpful in this exercise, as well as it might be a good idea to remove outliers in order to build nice graphs. You can exclude delay time values higher than a corresponding .95 percentile).

  • EV
  • OO
  • AA
  • AQ
In [41]:
# You code here

That's it! Now go and do 30 push-ups! :)