Reading and writing data with Pandas

files needed = (gdp_components.csv, debt.xlsx)

We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was easy, but it is almost never how we create DataFrames in the wild.

Most data live in files, often as comma-separated values or as MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will review ways to get data into (and out of) Pandas.

Reading from your computer

Let's start by getting files from our own computers. We start by loading Pandas. We are also loading the os package. os means 'operating system' and it contains functions that help us navigate the file structure of our computers.

In [1]:
import pandas as pd     # load the pandas package and call it pd
import os               # The package name is already short enough. No need to rename it. 

If you have not already, move the gdp_components.csv file to your U:\ drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Let's see.

In [2]:
gdp = pd.read_csv('gdp_components.csv')       # read_csv is a part of Pandas, so we need the pd. 
print(type(gdp))                              # What have we got here?
<class 'pandas.core.frame.DataFrame'>

This looks successful. read_csv() takes a string with the file name and creates a DataFrame. Let's take a look at the data.

In [3]:
print(gdp)
          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
0   1929-01-01    104.556    17.170     9.622     5.939     5.556
1   1930-01-01     92.160    11.428    10.273     4.444     4.121
2   1931-01-01     77.391     6.549    10.169     2.906     2.905
3   1932-01-01     59.522     1.819     8.946     1.975     1.932
4   1933-01-01     57.154     2.276     8.875     1.987     1.929
5   1934-01-01     66.800     4.296    10.721     2.561     2.239
6   1935-01-01     74.241     7.370    11.151     2.769     2.982
7   1936-01-01     84.830     9.391    13.398     3.007     3.154
8   1937-01-01     93.003    12.967    13.119     4.039     3.961
9   1938-01-01     87.352     7.944    14.170     3.811     2.845
10  1939-01-01     93.437    10.229    15.165     3.969     3.136
11  1940-01-01    102.899    14.579    15.562     4.897     3.426
12  1941-01-01    129.309    19.369    27.836     5.482     4.449
13  1942-01-01    165.952    11.762    65.440     4.375     4.627
14  1943-01-01    203.084     7.405    98.023     4.034     6.280
15  1944-01-01    224.447     9.180   108.643     4.880     6.904
16  1945-01-01    228.007    12.400    96.396     6.781     7.547
17  1946-01-01    227.535    33.128    43.027    14.156     6.974
18  1947-01-01    249.616    37.131    39.827    18.740     7.933
19  1948-01-01    274.468    50.347    43.755    15.547    10.060
20  1949-01-01    272.475    39.099    49.808    14.484     9.249
21  1950-01-01    299.827    56.530    50.515    12.350    11.612
22  1951-01-01    346.914    62.759    73.304    17.099    14.586
23  1952-01-01    367.341    57.274    89.583    16.459    15.295
24  1953-01-01    389.218    60.414    96.765    15.313    16.014
25  1954-01-01    390.549    58.070    92.467    15.836    15.432
26  1955-01-01    425.478    73.754    92.958    17.677    17.199
27  1956-01-01    449.353    77.685    98.180    21.284    18.923
28  1957-01-01    474.039    76.505   107.163    24.017    19.942
29  1958-01-01    481.229    70.947   114.138    20.560    20.022
..         ...        ...       ...       ...       ...       ...
59  1988-01-01   5236.438   936.963  1078.855   444.601   553.993
60  1989-01-01   5641.580   999.701  1151.862   504.289   591.031
61  1990-01-01   5963.144   993.448  1238.556   551.873   629.727
62  1991-01-01   6158.129   944.344  1298.951   594.931   623.544
63  1992-01-01   6520.327  1013.006  1344.500   633.053   667.791
64  1993-01-01   6858.559  1106.826  1364.922   654.799   719.973
65  1994-01-01   7287.236  1256.484  1402.274   720.937   813.424
66  1995-01-01   7639.749  1317.489  1449.431   812.810   902.572
67  1996-01-01   8073.122  1432.055  1492.848   867.589   963.966
68  1997-01-01   8577.552  1595.600  1547.133   953.803  1055.774
69  1998-01-01   9062.817  1736.671  1611.609   952.979  1115.690
70  1999-01-01   9630.663  1887.059  1720.360   992.778  1248.612
71  2000-01-01  10252.347  2038.408  1826.845  1096.255  1471.305
72  2001-01-01  10581.822  1934.842  1949.275  1024.636  1392.565
73  2002-01-01  10936.418  1930.417  2088.717   998.741  1424.143
74  2003-01-01  11458.246  2027.056  2211.208  1036.177  1539.304
75  2004-01-01  12213.730  2281.253  2338.889  1177.631  1796.706
76  2005-01-01  13036.637  2534.720  2475.992  1305.225  2026.418
77  2006-01-01  13814.609  2700.954  2624.234  1472.613  2243.538
78  2007-01-01  14451.860  2673.011  2790.844  1660.853  2379.280
79  2008-01-01  14712.845  2477.613  2981.990  1837.055  2560.143
80  2009-01-01  14448.932  1929.664  3073.512  1581.996  1978.447
81  2010-01-01  14992.052  2165.473  3154.647  1846.280  2360.183
82  2011-01-01  15542.582  2332.562  3148.372  2102.995  2682.456
83  2012-01-01  16197.007  2621.754  3137.010  2191.280  2759.851
84  2013-01-01  16784.851  2826.013  3132.409  2273.428  2764.210
85  2014-01-01  17521.747  3038.931  3167.041  2371.027  2879.284
86  2015-01-01  18219.297  3211.971  3234.210  2265.047  2786.461
87  2016-01-01  18707.189  3169.887  3290.979  2217.576  2738.146
88  2017-01-01  19485.394  3367.965  3374.444  2350.175  2928.596

[89 rows x 6 columns]

Even though Jupyter Notebook hid rows 30-58, this is still a bit obnoxious. We can use the head() and tail() methods of DataFrame to peek at just the first or last few rows.

In [4]:
print( gdp.head(4) )            # Show the first 4 rows.
         DATE     GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
0  1929-01-01  104.556  17.170   9.622   5.939   5.556
1  1930-01-01   92.160  11.428  10.273   4.444   4.121
2  1931-01-01   77.391   6.549  10.169   2.906   2.905
3  1932-01-01   59.522   1.819   8.946   1.975   1.932

If you do not pass head() or tail() an argument, it defaults to 5 rows.

In [5]:
print( gdp.tail() )
          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
84  2013-01-01  16784.851  2826.013  3132.409  2273.428  2764.210
85  2014-01-01  17521.747  3038.931  3167.041  2371.027  2879.284
86  2015-01-01  18219.297  3211.971  3234.210  2265.047  2786.461
87  2016-01-01  18707.189  3169.887  3290.979  2217.576  2738.146
88  2017-01-01  19485.394  3367.965  3374.444  2350.175  2928.596

The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?

In [6]:
gdp_new_index = gdp.set_index('DATE')   # We could use 'inplace = True' if we didn't need a copy.

print(gdp_new_index.head())
               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929

We can also set the index as we read in the file. Let's take a look at the read_csv() function.

In [7]:
pd.read_csv?

I'm seeing a lot of good stuff here: index_col, usecols, header, sep... some stuff I don't know about, too. When reading in messy files, these extra arguments may come in handy.
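For instance, here is a quick sketch of sep and header in action. The file content below is made up and served from a string via StringIO, so it runs without any file on disk:

```python
import pandas as pd
from io import StringIO                      # lets us treat a string like a file

# Made-up messy content: a junk first line, then semicolon-separated data
messy = StringIO('some junk line\nyear;gdp\n1929;104.556\n1930;92.160\n')

df = pd.read_csv(messy, sep=';', header=1)   # row 1 (zero-based) holds the names
print(df)
```

Rows above the header row are simply skipped, which is handy when a file starts with notes or titles.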

Let's give index_col a try.

In [8]:
gdp_2 = pd.read_csv('gdp_components.csv', index_col = 0)    # Treat the CSV like a DataFrame. Count columns starting with 0.
In [9]:
gdp_2.head()
Out[9]:
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929-01-01 104.556 17.170 9.622 5.939 5.556
1930-01-01 92.160 11.428 10.273 4.444 4.121
1931-01-01 77.391 6.549 10.169 2.906 2.905
1932-01-01 59.522 1.819 8.946 1.975 1.932
1933-01-01 57.154 2.276 8.875 1.987 1.929

We dumped our file into our current working directory so we could just ask for the file name gdp_components.csv in read_csv(). What is our current working directory (cwd)?

In [10]:
path_to_cwd = os.getcwd()           # getcwd() is part of the os package we imported earlier
print(path_to_cwd)
U:\2019F_Econ_690\2_Pandas

When we gave read_csv() gdp_components.csv, it looked in our cwd for the file. Let's try something more complicated. Go into the folder that holds this notebook and create a new folder called Data_Files. Make a copy of the gdp_components file and paste it into the Data_Files folder. Rename the copy gdp_components_moved.csv.

In [11]:
gdp_moved = pd.read_csv('gdp_components_moved.csv')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-11-fdc6cb8f492c> in <module>
----> 1 gdp_moved = pd.read_csv('gdp_components_moved.csv')

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    700                     skip_blank_lines=skip_blank_lines)
    701 
--> 702         return _read(filepath_or_buffer, kwds)
    703 
    704     parser_f.__name__ = name

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    427 
    428     # Create the parser.
--> 429     parser = TextFileReader(filepath_or_buffer, **kwds)
    430 
    431     if chunksize or iterator:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    893             self.options['has_index_names'] = kwds['has_index_names']
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1120     def _make_engine(self, engine='c'):
   1121         if engine == 'c':
-> 1122             self._engine = CParserWrapper(self.f, **self.options)
   1123         else:
   1124             if engine == 'python':

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1851         kwds['usecols'] = self.usecols
   1852 
-> 1853         self._reader = parsers.TextReader(src, **kwds)
   1854         self.unnamed_cols = self._reader.unnamed_cols
   1855 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File b'gdp_components_moved.csv' does not exist: b'gdp_components_moved.csv'

Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass read_csv() the path to the file. The path is the hierarchy of folders that contains the file. In my case, the path is

U:\2019F_Econ_690\2_Pandas\Data_Files

Note that there is a \ each time we list a new folder.

When we specify a file path, we need to escape the \ by using a second backslash in front of it. If you are using a Mac, you need to use the forward slash /.
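Two equivalent ways to spell the same Windows path in Python (the folder names here are just examples):

```python
# Double each backslash, or use a raw string so backslashes are taken literally
escaped = 'U:\\Data_Class\\Data_Files\\gdp.csv'
raw     = r'U:\Data_Class\Data_Files\gdp.csv'

print(escaped == raw)   # True -- both spellings produce the identical string
```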

In [12]:
gdp_moved = pd.read_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_components_moved.csv')
gdp_moved.head()
Out[12]:
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
0 1929-01-01 104.556 17.170 9.622 5.939 5.556
1 1930-01-01 92.160 11.428 10.273 4.444 4.121
2 1931-01-01 77.391 6.549 10.169 2.906 2.905
3 1932-01-01 59.522 1.819 8.946 1.975 1.932
4 1933-01-01 57.154 2.276 8.875 1.987 1.929

We could have built this path by manipulating strings, too. This approach might be useful if you needed to read in many files from the same place. (Maybe using a for loop and a list of file names?)

In [13]:
path_to_cwd = os.getcwd()
file_name = 'gdp_components_moved.csv'
path_to_data_file = path_to_cwd + '\\Data_Files\\' +  file_name  #Note the double \ characters
print(path_to_data_file)
U:\2019F_Econ_690\2_Pandas\Data_Files\gdp_components_moved.csv
In [14]:
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()
Out[14]:
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929-01-01 104.556 17.170 9.622 5.939 5.556
1930-01-01 92.160 11.428 10.273 4.444 4.121
1931-01-01 77.391 6.549 10.169 2.906 2.905
1932-01-01 59.522 1.819 8.946 1.975 1.932
1933-01-01 57.154 2.276 8.875 1.987 1.929
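The for-loop idea mentioned above might look like this sketch. The folder and file names are hypothetical, and the line that actually reads each file is commented out so the sketch runs even if the files do not exist:

```python
import os
import pandas as pd

folder = os.path.join(os.getcwd(), 'Data_Files')              # hypothetical folder
file_names = ['gdp_components_moved.csv', 'debt_moved.csv']   # hypothetical files

frames = []
for name in file_names:
    path = os.path.join(folder, name)
    print(path)
    # frames.append(pd.read_csv(path, index_col=0))   # uncomment once the files exist
```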

Practice: Reading CSVs

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

  1. Try out the to_csv() method of DataFrame. Save gdp_moved as 'gdp_moved_2.csv' in your cwd. [You can use ? if you need help.]
In [15]:
gdp_moved.to_csv('gdp_moved_2.csv')
  2. Use to_csv() again to save gdp_moved to the Data_Files folder. Name it 'gdp_moved_3.csv'.
In [18]:
gdp_moved.to_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_moved_3.csv')

Are your files in the correct places?

Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that.

  3. Use gdp_moved_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.
In [19]:
gdp_growth = pd.read_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_moved_3.csv', index_col=0)
print( gdp_growth.head(10) )
               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
1934-01-01   66.800   4.296  10.721   2.561   2.239
1935-01-01   74.241   7.370  11.151   2.769   2.982
1936-01-01   84.830   9.391  13.398   3.007   3.154
1937-01-01   93.003  12.967  13.119   4.039   3.961
1938-01-01   87.352   7.944  14.170   3.811   2.845
  4. Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'.
In [20]:
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())
                gdp   GPDIA     gov  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929

Reading Excel spreadsheets

Reading spreadsheets isn't much different from reading CSV files. But, since workbooks are more complicated than CSV files, we have a few more options to consider.

If you haven't already, copy over 'debt.xlsx' to your cwd. Let's open it in Excel and have a look at it...

There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.

In [21]:
debt = pd.read_excel('debt.xlsx')
debt
Out[21]:
FRED Graph Observations Unnamed: 1 Unnamed: 2 Unnamed: 3
0 Federal Reserve Economic Data NaN NaN NaN
1 Link: https://fred.stlouisfed.org NaN NaN NaN
2 Help: https://fred.stlouisfed.org/help-faq NaN NaN NaN
3 Economic Research Division NaN NaN NaN
4 Federal Reserve Bank of St. Louis NaN NaN NaN
5 NaN NaN NaN NaN
6 GDPA Gross Domestic Product, Billions of Dollars, A... NaN NaN
7 GFDEBTN Federal Debt: Total Public Debt, Millions of D... NaN NaN
8 DGS10 10-Year Treasury Constant Maturity Rate, Perce... NaN NaN
9 NaN NaN NaN NaN
10 Frequency: Annual NaN NaN NaN
11 observation_date GDPA GFDEBTN DGS10
12 1929-01-01 00:00:00 104.556 NaN NaN
13 1930-01-01 00:00:00 92.16 NaN NaN
14 1931-01-01 00:00:00 77.391 NaN NaN
15 1932-01-01 00:00:00 59.522 NaN NaN
16 1933-01-01 00:00:00 57.154 NaN NaN
17 1934-01-01 00:00:00 66.8 NaN NaN
18 1935-01-01 00:00:00 74.241 NaN NaN
19 1936-01-01 00:00:00 84.83 NaN NaN
20 1937-01-01 00:00:00 93.003 NaN NaN
21 1938-01-01 00:00:00 87.352 NaN NaN
22 1939-01-01 00:00:00 93.437 NaN NaN
23 1940-01-01 00:00:00 102.899 NaN NaN
24 1941-01-01 00:00:00 129.309 NaN NaN
25 1942-01-01 00:00:00 165.952 NaN NaN
26 1943-01-01 00:00:00 203.084 NaN NaN
27 1944-01-01 00:00:00 224.447 NaN NaN
28 1945-01-01 00:00:00 228.007 NaN NaN
29 1946-01-01 00:00:00 227.535 NaN NaN
... ... ... ... ...
72 1989-01-01 00:00:00 5641.58 2.83781e+06 8.49372
73 1990-01-01 00:00:00 5963.14 3.19846e+06 8.5524
74 1991-01-01 00:00:00 6158.13 3.61754e+06 7.86244
75 1992-01-01 00:00:00 6520.33 4.02689e+06 7.00884
76 1993-01-01 00:00:00 6858.56 4.38243e+06 5.86628
77 1994-01-01 00:00:00 7287.24 4.67864e+06 7.08518
78 1995-01-01 00:00:00 7639.75 4944534 6.57392
79 1996-01-01 00:00:00 8073.12 5.20671e+06 6.44353
80 1997-01-01 00:00:00 8577.55 5.41814e+06 6.35396
81 1998-01-01 00:00:00 9062.82 5.55769e+06 5.26288
82 1999-01-01 00:00:00 9630.66 5.68069e+06 5.64614
83 2000-01-01 00:00:00 10252.3 5.69893e+06 6.03028
84 2001-01-01 00:00:00 10581.8 5.81286e+06 5.02069
85 2002-01-01 00:00:00 10936.4 6191611 4.61308
86 2003-01-01 00:00:00 11458.2 6.72805e+06 4.01388
87 2004-01-01 00:00:00 12213.7 7.34515e+06 4.27132
88 2005-01-01 00:00:00 13036.6 7.92914e+06 4.28888
89 2006-01-01 00:00:00 13814.6 8494599 4.795
90 2007-01-01 00:00:00 14451.9 8.98854e+06 4.63466
91 2008-01-01 00:00:00 14712.8 9.91353e+06 3.66426
92 2009-01-01 00:00:00 14448.9 1.17233e+07 3.26412
93 2010-01-01 00:00:00 14992.1 13390438 3.21506
94 2011-01-01 00:00:00 15542.6 1.46566e+07 2.78164
95 2012-01-01 00:00:00 16197 1.59901e+07 1.80344
96 2013-01-01 00:00:00 16784.9 16851000 2.35016
97 2014-01-01 00:00:00 17521.7 17799837 2.53956
98 2015-01-01 00:00:00 18219.3 1.83442e+07 2.13829
99 2016-01-01 00:00:00 18707.2 1.95492e+07 1.83744
100 2017-01-01 00:00:00 19485.4 2.01072e+07 2.32948
101 2018-01-01 00:00:00 NaN NaN NaN

102 rows × 4 columns

In [22]:
# Use the header argument to specify the row to use as the column names. (zero-based, as usual)

debt = pd.read_excel('debt.xlsx', header = 12)

print(debt.head())
print('\n')
print(debt.tail())
  observation_date     GDPA  GFDEBTN  DGS10
0       1929-01-01  104.556      NaN    NaN
1       1930-01-01   92.160      NaN    NaN
2       1931-01-01   77.391      NaN    NaN
3       1932-01-01   59.522      NaN    NaN
4       1933-01-01   57.154      NaN    NaN


   observation_date       GDPA      GFDEBTN     DGS10
85       2014-01-01  17521.747  17799837.00  2.539560
86       2015-01-01  18219.297  18344212.75  2.138287
87       2016-01-01  18707.189  19549200.50  1.837440
88       2017-01-01  19485.394  20107155.25  2.329480
89       2018-01-01        NaN          NaN       NaN

That's looking good. Notice that Pandas added NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of read_csv(), too.

We didn't specify which sheet in the workbook to load, so Pandas took the first one. We can ask for sheets by name.

In [23]:
debt_q = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q.head())
print('\n')
print(debt_q.tail())
  observation_date  GFDEBTN  DGS10      GDP
0       1947-01-01      NaN    NaN  243.164
1       1947-04-01      NaN    NaN  245.968
2       1947-07-01      NaN    NaN  249.585
3       1947-10-01      NaN    NaN  259.745
4       1948-01-01      NaN    NaN  265.742


    observation_date     GFDEBTN     DGS10        GDP
281       2017-04-01  19844554.0  2.260952  19359.123
282       2017-07-01  20244900.0  2.241429  19588.074
283       2017-10-01  20492747.0  2.371452  19831.829
284       2018-01-01  21089643.0  2.758525  20041.047
285       2018-04-01  21195070.0  2.920625  20411.924

We can ask for just a subset of the columns when reading in a file (csv or xlsx). Use the usecols argument. This takes either integers or Excel column letters.

In [24]:
# Take the first and third columns of sheet 'quarterly'

interest_rates = pd.read_excel('debt.xlsx', header=12,  sheet_name='quarterly', usecols=[0,2])  
interest_rates.head()
Out[24]:
observation_date DGS10
0 1947-01-01 NaN
1 1947-04-01 NaN
2 1947-07-01 NaN
3 1947-10-01 NaN
4 1948-01-01 NaN

Practice: Reading Excel

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

  1. Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, GDP, and GFDEBTN. Name your new DataFrame fed_debt.
In [26]:
fed_debt = pd.read_excel('debt.xlsx', header=12,  sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head()
Out[26]:
GFDEBTN GDP
observation_date
1947-01-01 NaN 243.164
1947-04-01 NaN 245.968
1947-07-01 NaN 249.585
1947-10-01 NaN 259.745
1948-01-01 NaN 265.742
  2. Oops, I wanted to set the observation_date to the index. Go back and add that to your solution to 1.
  3. What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'.
In [27]:
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()
Out[27]:
DEBT GDP
observation_date
1947-01-01 NaN 243.164
1947-04-01 NaN 245.968
1947-07-01 NaN 249.585
1947-10-01 NaN 259.745
1948-01-01 NaN 265.742
  4. Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly.
In [29]:
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']
print(fed_debt.head())
                  DEBT      GDP  debt_ratio
observation_date                           
1947-01-01         NaN  243.164         NaN
1947-04-01         NaN  245.968         NaN
1947-07-01         NaN  249.585         NaN
1947-10-01         NaN  259.745         NaN
1948-01-01         NaN  265.742         NaN

There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data.
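A tiny illustration with made-up numbers: arithmetic propagates NaN, and summary methods skip it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # made-up data with one missing value

print(s / 2)       # the NaN stays NaN -- no error is raised
print(s.mean())    # mean() skips the NaN by default: (1.0 + 3.0)/2 = 2.0
```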

  5. Summarize the debt_ratio variable. What is its max level? Its min?
In [30]:
print(fed_debt['debt_ratio'].describe())
count    210.000000
mean       0.564994
std        0.227520
min        0.306033
25%        0.355102
50%        0.555767
75%        0.641648
max        1.052562
Name: debt_ratio, dtype: float64

Reading from the internet

We can pass the read functions URLs, too.

In [31]:
# Read in the penn world table data
url = "http://www.rug.nl/ggdc/docs/pwt81.xlsx"
pwt = pd.read_excel(url, sheet_name= "Data")
pwt.head()
Out[31]:
countrycode country currency_unit year rgdpe rgdpo pop emp avh hc ... csh_g csh_x csh_m csh_r pl_c pl_i pl_g pl_x pl_m pl_k
0 AGO Angola Kwanza 1950 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 AGO Angola Kwanza 1951 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 AGO Angola Kwanza 1952 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 AGO Angola Kwanza 1953 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 AGO Angola Kwanza 1954 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 47 columns

That took a few seconds --- this is a pretty big file.

In [33]:
# Data from McKinney's book. Each file contains baby name counts for a year. 
baby_url = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames'

# What was trendy in 1880?
old_names = pd.read_csv(baby_url + '/yob1880.txt')
old_names.head()
    
Out[33]:
Mary F 7065
0 Anna F 2604
1 Emma F 2003
2 Elizabeth F 1939
3 Minnie F 1746
4 Margaret F 1578

We've lost Mary, which looks like it was pretty popular. What happened? read_csv() used the first row of data as the column names.

We can tell Pandas that there is no header row by passing header=None.

In [38]:
old_names = pd.read_csv(baby_url + '/yob1880.txt', header=None)
old_names.head()
Out[38]:
0 1 2
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

Practice

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

Baby names

  1. Get the baby name data for 2009. Call the DataFrame new_names. Give the columns some reasonable names.
In [39]:
new_names = pd.read_csv(baby_url + '/yob2009.txt', header=None)
print(new_names.head())

new_names.rename(columns={0:'name', 1:'sex', 2:'count'}, inplace=True)
print('\n', new_names.head())
          0  1      2
0  Isabella  F  22222
1      Emma  F  17830
2    Olivia  F  17374
3    Sophia  F  16869
4       Ava  F  15826

         name sex  count
0    Isabella   F  22222
1        Emma   F  17830
2      Olivia   F  17374
3      Sophia   F  16869
4         Ava   F  15826
  2. What are the two most popular female names in 2009? You might try the sort_values() method of DataFrame.
In [41]:
print('Shape of new_names:', new_names.shape)

new_names_female=new_names[new_names['sex']=='F']
print('Shape of new_names_female:', new_names_female.shape)

new_names_female_sorted = new_names_female.sort_values(by=['count'], ascending=False)
print(new_names_female_sorted.head())
Shape of new_names: (34602, 3)
Shape of new_names_female: (20123, 3)
       name sex  count
0  Isabella   F  22222
1      Emma   F  17830
2    Olivia   F  17374
3    Sophia   F  16869
4       Ava   F  15826
  3. What are the two least popular female names in 2009?
In [43]:
print(new_names_female_sorted.tail())
           name sex  count
18190   Giannie   F      5
18191  Giavonni   F      5
18192    Gibson   F      5
18193     Gilda   F      5
20122    Zyriel   F      5

PISA Scores

  1. In a web browser, go to dx.doi.org/10.1787/888932937035. This should initiate a download of an Excel file with PISA scores across countries. This is a bit of a mess.

  2. Use the read_excel() function to create a DataFrame with mean scores in math, reading, and science. Do not set an index yet. There is some junk at the bottom of the spreadsheet. Try the skipfooter argument.

In [44]:
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,             # skip the first 18 rows
                     skipfooter=7,            # skip the last 7
                     usecols=[0,1,9,13],   # select columns of interest
                     #index_col=0,             # set the index as the first column
                     #header=[0,1]             # set the variable names
                     )

pisa
Out[44]:
Unnamed: 0 Mathematics Reading Science
0 NaN Mean score in PISA 2012 Mean score in PISA 2012 Mean score in PISA 2012
1 NaN NaN NaN NaN
2 OECD average 494.046 496.463 501.16
3 NaN NaN NaN NaN
4 Shanghai-China 612.676 569.588 580.118
5 Singapore 573.468 542.216 551.493
6 Hong Kong-China 561.241 544.6 554.937
7 Chinese Taipei 559.825 523.119 523.315
8 Korea 553.767 535.79 537.788
9 Macao-China 538.134 508.949 520.571
10 Japan 536.407 538.051 546.736
11 Liechtenstein 534.965 515.522 524.695
12 Switzerland 530.931 509.04 515.298
13 Netherlands 522.972 511.23 522.056
14 Estonia 520.546 516.294 541.405
15 Finland 518.75 524.022 545.442
16 Canada 518.079 523.117 525.461
17 Poland 517.501 518.187 525.816
18 Belgium 514.745 509.108 505.457
19 Germany 513.525 507.677 524.121
20 Viet Nam 511.338 508.219 528.423
21 Austria 505.541 489.609 505.781
22 Australia 504.151 511.804 521.495
23 Ireland 501.497 523.173 522.004
24 Slovenia 501.127 481.316 514.143
25 Denmark 500.027 496.131 498.474
26 New Zealand 499.75 512.187 515.636
27 Czech Republic 498.958 492.89 508.299
28 France 494.985 505.481 498.971
29 United Kingdom 493.934 499.323 514.129
... ... ... ... ...
39 United States 481.367 497.582 497.41
40 Lithuania 478.823 477.307 495.7
41 Sweden 478.261 483.335 484.799
42 Hungary 477.044 488.461 494.302
43 Croatia 471.131 484.565 491.357
44 Israel 466.481 485.803 470.073
45 Greece 452.973 477.197 466.722
46 Serbia 448.859 446.13 444.804
47 Turkey 447.984 475.491 463.413
48 Romania 444.554 437.599 438.768
49 Cyprus1, 2 439.696 448.953 437.676
50 Bulgaria 438.738 436.126 446.469
51 United Arab Emirates 434.007 441.704 448.37
52 Kazakhstan 431.798 392.736 424.708
53 Thailand 426.737 441.22 444
54 Chile 422.632 441.398 444.934
55 Malaysia 420.513 398.196 419.503
56 Mexico 413.281 423.554 414.92
57 Montenegro 409.627 422.111 410.098
58 Uruguay 409.292 411.349 415.843
59 Costa Rica 407 440.548 429.351
60 Albania 394.329 393.964 397.373
61 Brazil 391.46 410.122 404.711
62 Argentina 388.432 395.979 405.632
63 Tunisia 387.825 404.078 398.046
64 Jordan 385.596 399.035 409.367
65 Colombia 376.489 403.403 398.679
66 Qatar 376.448 387.504 383.643
67 Indonesia 375.114 396.12 381.911
68 Peru 368.103 384.151 373.113

69 rows × 4 columns

  3. Clean up your DataFrame. Drop rows that have NaNs.
In [45]:
pisa2 = pisa.dropna()
pisa2
Out[45]:
Unnamed: 0 Mathematics Reading Science
2 OECD average 494.046 496.463 501.16
4 Shanghai-China 612.676 569.588 580.118
5 Singapore 573.468 542.216 551.493
6 Hong Kong-China 561.241 544.6 554.937
7 Chinese Taipei 559.825 523.119 523.315
8 Korea 553.767 535.79 537.788
9 Macao-China 538.134 508.949 520.571
10 Japan 536.407 538.051 546.736
11 Liechtenstein 534.965 515.522 524.695
12 Switzerland 530.931 509.04 515.298
13 Netherlands 522.972 511.23 522.056
14 Estonia 520.546 516.294 541.405
15 Finland 518.75 524.022 545.442
16 Canada 518.079 523.117 525.461
17 Poland 517.501 518.187 525.816
18 Belgium 514.745 509.108 505.457
19 Germany 513.525 507.677 524.121
20 Viet Nam 511.338 508.219 528.423
21 Austria 505.541 489.609 505.781
22 Australia 504.151 511.804 521.495
23 Ireland 501.497 523.173 522.004
24 Slovenia 501.127 481.316 514.143
25 Denmark 500.027 496.131 498.474
26 New Zealand 499.75 512.187 515.636
27 Czech Republic 498.958 492.89 508.299
28 France 494.985 505.481 498.971
29 United Kingdom 493.934 499.323 514.129
30 Iceland 492.796 482.522 478.155
31 Latvia 490.571 488.694 502.186
32 Luxembourg 489.845 487.807 491.215
... ... ... ... ...
39 United States 481.367 497.582 497.41
40 Lithuania 478.823 477.307 495.7
41 Sweden 478.261 483.335 484.799
42 Hungary 477.044 488.461 494.302
43 Croatia 471.131 484.565 491.357
44 Israel 466.481 485.803 470.073
45 Greece 452.973 477.197 466.722
46 Serbia 448.859 446.13 444.804
47 Turkey 447.984 475.491 463.413
48 Romania 444.554 437.599 438.768
49 Cyprus1, 2 439.696 448.953 437.676
50 Bulgaria 438.738 436.126 446.469
51 United Arab Emirates 434.007 441.704 448.37
52 Kazakhstan 431.798 392.736 424.708
53 Thailand 426.737 441.22 444
54 Chile 422.632 441.398 444.934
55 Malaysia 420.513 398.196 419.503
56 Mexico 413.281 423.554 414.92
57 Montenegro 409.627 422.111 410.098
58 Uruguay 409.292 411.349 415.843
59 Costa Rica 407 440.548 429.351
60 Albania 394.329 393.964 397.373
61 Brazil 391.46 410.122 404.711
62 Argentina 388.432 395.979 405.632
63 Tunisia 387.825 404.078 398.046
64 Jordan 385.596 399.035 409.367
65 Colombia 376.489 403.403 398.679
66 Qatar 376.448 387.504 383.643
67 Indonesia 375.114 396.12 381.911
68 Peru 368.103 384.151 373.113

66 rows × 4 columns

  4. Make the country names the index.
In [46]:
pisa2.set_index('Unnamed: 0', inplace=True)
pisa2
Out[46]:
Mathematics Reading Science
Unnamed: 0
OECD average 494.046 496.463 501.16
Shanghai-China 612.676 569.588 580.118
Singapore 573.468 542.216 551.493
Hong Kong-China 561.241 544.6 554.937
Chinese Taipei 559.825 523.119 523.315
Korea 553.767 535.79 537.788
Macao-China 538.134 508.949 520.571
Japan 536.407 538.051 546.736
Liechtenstein 534.965 515.522 524.695
Switzerland 530.931 509.04 515.298
Netherlands 522.972 511.23 522.056
Estonia 520.546 516.294 541.405
Finland 518.75 524.022 545.442
Canada 518.079 523.117 525.461
Poland 517.501 518.187 525.816
Belgium 514.745 509.108 505.457
Germany 513.525 507.677 524.121
Viet Nam 511.338 508.219 528.423
Austria 505.541 489.609 505.781
Australia 504.151 511.804 521.495
Ireland 501.497 523.173 522.004
Slovenia 501.127 481.316 514.143
Denmark 500.027 496.131 498.474
New Zealand 499.75 512.187 515.636
Czech Republic 498.958 492.89 508.299
France 494.985 505.481 498.971
United Kingdom 493.934 499.323 514.129
Iceland 492.796 482.522 478.155
Latvia 490.571 488.694 502.186
Luxembourg 489.845 487.807 491.215
... ... ... ...
United States 481.367 497.582 497.41
Lithuania 478.823 477.307 495.7
Sweden 478.261 483.335 484.799
Hungary 477.044 488.461 494.302
Croatia 471.131 484.565 491.357
Israel 466.481 485.803 470.073
Greece 452.973 477.197 466.722
Serbia 448.859 446.13 444.804
Turkey 447.984 475.491 463.413
Romania 444.554 437.599 438.768
Cyprus1, 2 439.696 448.953 437.676
Bulgaria 438.738 436.126 446.469
United Arab Emirates 434.007 441.704 448.37
Kazakhstan 431.798 392.736 424.708
Thailand 426.737 441.22 444
Chile 422.632 441.398 444.934
Malaysia 420.513 398.196 419.503
Mexico 413.281 423.554 414.92
Montenegro 409.627 422.111 410.098
Uruguay 409.292 411.349 415.843
Costa Rica 407 440.548 429.351
Albania 394.329 393.964 397.373
Brazil 391.46 410.122 404.711
Argentina 388.432 395.979 405.632
Tunisia 387.825 404.078 398.046
Jordan 385.596 399.035 409.367
Colombia 376.489 403.403 398.679
Qatar 376.448 387.504 383.643
Indonesia 375.114 396.12 381.911
Peru 368.103 384.151 373.113

66 rows × 3 columns

  5. What is the ratio of the United States PISA scores to the OECD average?
In [47]:
print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics       0.974335
Reading            1.00225
Science           0.992517
dtype: object
  6. Challenging. How correlated are PISA math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corrs.xlsx'.

This is a challenging question because, depending on how you read in the data, your columns are probably of dtype 'object' and corr() won't work. Google around and see if you can convert the three columns to numbers. Then find the correlations.

In [49]:
pisa2.columns = ['math', 'read', 'sci']  # rename the columns to something short
pisa2 = pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)  # convert columns to numeric
pisa2.corr() # correlate
Out[49]:
math read sci
math 1.000000 0.959806 0.972131
read 0.959806 1.000000 0.978559
sci 0.972131 0.978559 1.000000
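The practice question also asked us to write the correlation matrix to 'pisa_corrs.xlsx'. The to_excel() method of DataFrame does that; it needs an Excel writer engine such as openpyxl installed. A sketch with a stand-in matrix (so it runs without the PISA data):

```python
import pandas as pd

# Stand-in correlation matrix with the same shape as pisa2.corr()
corrs = pd.DataFrame([[1.00, 0.96, 0.97],
                      [0.96, 1.00, 0.98],
                      [0.97, 0.98, 1.00]],
                     index=['math', 'read', 'sci'],
                     columns=['math', 'read', 'sci'])

try:
    corrs.to_excel('pisa_corrs.xlsx')   # writes one sheet, index included
except ImportError:
    print('install openpyxl (or another Excel engine) to write .xlsx files')
```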