Problem set 7. Review: Dataframe Access & Exponential Growth

Too many people spent far too much time on PS03.ipynb

Therefore we area going to back up and review how to pull numbers out of dataframes, and we are going to review exponential growth processes

Let us get started!

 

1. Preliminaries

A. Computing environment

First, we set up the computing environment with the libraries we need:

In [ ]:
# 7.1.A.1. set up the computing environment: ensure that graphs
# appear inline in the notebook & not in extra windows:

%matplotlib inline
In [ ]:
# 7.1.A.2 set up the computing environment: import other libraries

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

 

B. Reproduce our very long-run growth dataframes for the world, the global north, and the global south:

Recall our estimates of humanity's economy in the very long run. Let's repeat their construction:

In [ ]:
# 7.1.B.1. repeating the creation of the world dataframe:

long_run_growth_list = [
    [-68000, 0.1, 1200, 379.47],
    [-8000, 2.5, 1200, 1897.37],
    [-6000, 7, 900, 2381.18],
    [-3000, 15, 900, 3485.68],
    [-1000, 50, 900, 6363.96],
    [1, 170, 900, 11734.56],
    [1500, 500, 900, 20124.61],
    [1770, 750, 1100, 30124.74],
    [1870, 1300, 1300, 46872.1],
    [2020, 7600, 11842, 1032370.8]
    ]

long_run_growth_df = pd.DataFrame(
  data=np.array(long_run_growth_list), columns = ['year', 'population', 
  'income_level', 'human_ideas']
  )

long_run_growth_df['year'] = long_run_growth_df['year'].apply(np.int64)

initial_year = long_run_growth_df['year'][0:10]

span = []
g = []
h = []
n = []

for t in range(9):
    span = span + [long_run_growth_df['year'][t+1]-long_run_growth_df['year'][t]]
    h = h + [np.log(long_run_growth_df['human_ideas'][t+1]/long_run_growth_df['human_ideas'][t])/span[t]]
    g = g + [np.log(long_run_growth_df['income_level'][t+1]/long_run_growth_df['income_level'][t])/span[t]]
    n = n + [np.log(long_run_growth_df['population'][t+1]/long_run_growth_df['population'][t])/span[t]]
    


# finally, add a note to the end of each observation, reminding
# us of what was going on in human history back in each of the
# eras into which we have divided it

eras = ['at the dawn', 'agriculture & herding', 'proto-agrarian age',
        'writing', 'axial age', 'dark & middle age slowdown', 'commercial revolution',
        'industrial revolution', 'modern economic growth', 'whatever the 21st century brings']

long_run_growth_df['eras'] = eras

format_dict = {'human_ideas': '{0:,.0f}', 
    'income_level': '${0:,.0f}', 'population': '{0:,.1f}'}

print('WORLD LEVELS')
long_run_growth_df.style.format(format_dict)
In [ ]:
# 7.1.B.2. repeating the creation of the global north dataframe:

long_run_growth_list_global_north = [
    [-68000, 0.00001, 1200, 379.47, 0.0001],
    [-8000, 0.1, 1200, 1897.37, 0.0294],
    [-6000, 0.2, 900, 2012.5, 0.0294],
    [-3000, 0.5, 900, 3182, 0.0294],
    [-1000, 2, 900, 6364.1, 0.0294],
    [1, 5, 900, 10062.5, 0.0294],
    [1500, 25, 1000, 25000.4, 0.0294],
    [1770, 75, 1400, 42866.8, 0.0588],
    [1870, 175, 2800, 106928.6, 0.0882],
    [2020, 800, 50000, 3580637.4, 0.1147]
    ]

long_run_growth_global_north_df = pd.DataFrame(
  data=np.array(long_run_growth_list_global_north), columns = ['year', 'population', 
  'income_level', 'human_ideas', 'resources']
  )
long_run_growth_global_north_df['year'] = long_run_growth_global_north_df['year'].apply(np.int64)

eras = ['at the dawn', 'agriculture & herding', 'proto-agrarian age',
        'writing', 'axial age', 'dark & middle age slowdown', 'commercial revolution',
        'industrial revolution', 'modern economic growth', 'whatever the 21st century brings']

long_run_growth_global_north_df['eras'] = eras

format_dict = {'human_ideas': '{0:,.0f}', 
    'income_level': '${0:,.0f}', 'population': '{0:,.1f}','resources': '{0:,.3f}'}

print('GLOBAL NORTH LEVELS')
long_run_growth_global_north_df.style.format(format_dict)
In [ ]:
# 7.1.B.3. repeating the creation of the global south dataframe:

long_run_growth_list_global_south = [
    [-68000, 0.1, 1200, 379.47, 0.9999],
    [-8000, 2.4, 1200, 1897.37, 0.971],
    [-6000, 6.8, 900, 2395.3, 0.971],
    [-3000, 14.5, 900, 3497.9, 0.971],
    [-1000, 48, 900, 6364.1, 0.971],
    [1, 165, 900, 11799.4, 0.971],
    [1500, 475, 900, 20019.9, 0.971],
    [1770, 675, 1070, 29386.7, 0.9412],
    [1870, 1125, 1000, 36172.8, 0.9118],
    [2020, 6800, 7700, 693805.9, 0.8853]
    ]

long_run_growth_global_south_df = pd.DataFrame(
  data=np.array(long_run_growth_list_global_south), columns = ['year', 'population', 
  'income_level', 'human_ideas', 'resources']
  )
long_run_growth_global_south_df['year'] = long_run_growth_global_south_df['year'].apply(np.int64)

eras = ['at the dawn', 'agriculture & herding', 'proto-agrarian age',
        'writing', 'axial age', 'dark & middle age slowdown', 'commercial revolution',
        'industrial revolution', 'modern economic growth', 'whatever the 21st century brings']

long_run_growth_global_south_df['eras'] = eras

format_dict = {'human_ideas': '{0:,.0f}', 
    'income_level': '${0:,.0f}', 'population': '{0:,.1f}','resources': '{0:,.3f}'}

print('GLOBAL SOUTH LEVELS')
long_run_growth_global_south_df.style.format(format_dict)

 

7.2. Pulling numbers out of dataframes

Suppose we want to pull a number out of a dataframe—say the 7700 average income level in the global south in 2020. How do we do this? The next code cell provides an example:

In [ ]:
# 7.2.1. pulling out global south average income in 2020

long_run_growth_global_south_df['income_level'][9]

That is all there is to it. It is a three-step process:

  1. Tell the python interpreter what dataframe the number is in "long_run_growth_global_south_df"
  2. Tell the python interpreter the variable—the column—the number belongs to by stating the column's name as a string inside brackets: "['income_level']"
  3. Tell the python interpreter the observation—the row—the number belongs to by stating the row number inside brackets: "[9]"

You may ask: why is the column name a string that must go inside quotation marks, while the row name is a number that cannot go inside quotation marks—for doing the wrong thing with either will generate an error? There is no good answer.

Now let us get some practice with this. This should be, if I have gauged the situation right, be trivial for rather more than half the class, and easily doable for the rest.

First, in the next cell, write code to pull out the year-1500 value of income per capita for the world as a whole, assign it to the variable world_income_1500, and then print it out:

In [ ]:
# 7.2.2. pulling out world average income in 1500

world_income_1500 = ...

print(world_income_1500, "is the average income per capita in the world in 1500")

Did the code cell above print, below itself: "900.0 is the average income per capita in the world in 1500"? If not, go back and try to figure out what you did wrong. If yes, proceed.

Now can you pull out the value of the ideas stock deployed in the global north in 1870? Do so in the code cell below:

In [ ]:
# 7.2.3. pulling out global north ideas stock in 1870

global_north_ideas_1870 = ...

print(global_north_ideas_1870, "is the level of ideas used by the global north in 1870")

Did you succeed in pulling out 106928.6?

An expression like "long_run_growth_global_north_df['human_ideas'][8]" is a number—it is the number in the 8th row of the "human_ideas" column of the "long_run_growth_global_north_df" dataframe. You can do anything with it that you can do with a number. In particular, you can do calculations with it. In particular, you can divide it by another number.

So in the code cell below, do so: calculate what the value of ideas deployed in the global north was relative to those deployed in the global south in year 1 by dividing the global-north ideas value by the global-south ideas value:

In [ ]:
# 7.2.4. global north ideas divided by global south ideas in year 1

north_south_ideas_ratio_1 = ...

print(north_south_ideas_ratio_1, "is the north-south ratio of the value of ideas in year 1")

Now let us march down the dataframes: in the four code cells below, do the analogous calculations for the years 1500, 1770, 1870, and 2020 to see how the region that was to become the heart of the global north developed a deployed-technology edge over the rest of the world between the year 1 and today:

In [ ]:
# 7.2.5. global north ideas divided by global south ideas in year 1500

north_south_ideas_ratio_1500 = ...

print(north_south_ideas_ratio_1500, "is the north-south ratio of the value of ideas in 1500")
In [ ]:
# 7.2.6. global north ideas divided by global south ideas in year 1770

north_south_ideas_ratio_1770 = ...

print(north_south_ideas_ratio_1770, "is the north-south ratio of the value of ideas in 1770")
In [ ]:
# 7.2.7. global north ideas divided by global south ideas in year 1870

north_south_ideas_ratio_1870 = ...

print(north_south_ideas_ratio_1870, "is the north-south ratio of the value of ideas in 1870")
In [ ]:
# 7.2.8. global north ideas divided by global south ideas in year 2020

north_south_ideas_ratio_2020 = ...

print(north_south_ideas_ratio_2020, "is the north-south ratio of the value of ideas in 2020")

By now you should have five numbers, covering the period from year 1 to 2020: 0.85, 1.25, 1.45, 2.96, 5.16.

What would happen if you tried to divide not one number by another number, but divide one column from a dataframe by another column from another dataframe? Let's see:

In [ ]:
# 7.2.9. dividing one series by another element-by-element

n_s_ideas_ratio = long_run_growth_global_north_df['human_ideas']/long_run_growth_global_south_df['human_ideas']

print(n_s_ideas_ratio, 'is the ideas-ratio series')

We now have all of the ideas-ratio quotients from the year -68000 to today in a series, where we can refer to all of them together, or pick out one. In the next code cell, pick out from this series the ideas ratio for the year -3000:

In [ ]:
# 7.2.10. picking out the ideas-ratio value for the year -3000

n_s_ideas_ratio_minus_3000 = ...

print(n_s_ideas_ratio_minus_3000, 'is the ideas ratio for the year -3000')

In dividing two columns from different dataframes, we have lost track of the year variable—it is no longer included and attached. So let us add it back in by making a tiny dataframe consisting of our "which year this is" series and our ideas-ratio series by using pandas.concat:

In [ ]:
# 7.2.10. creating a tiny dataframe

n_s_ideas_ratio_df = pd.concat([long_run_growth_global_north_df['year'], n_s_ideas_ratio], axis=1)

n_s_ideas_ratio_df

Note that it might be clearer what we are doing if, instead of carrying the index row number 0-9 around, we used the 'year' column as an index itself:

In [ ]:
# 7.2.11. changing the index to the 'year' column

n_s_ideas_ratio_df.set_index('year', inplace=True)

n_s_ideas_ratio_df

Then, when we wanted to refer to the value for the year -8000, we would not have to remember which row of the table corresponds to the year -8000, but simply specify that in place of the row number;

In [ ]:
# 7.2.12. referring to entries by year rather than by row number

n_s_ideas_ratio_df['human_ideas'][-8000]

A problem is that we then have lost the ability to access the year column as a series of numbers, since it is now the index:

In [ ]:
# 7.2.13. generating an error message

n_s_ideas_ratio_df['year'][-8000]

We have just generated an error: the python interpreter could not find a dataframe column-variable named 'year' because we transformed it into an index.

If we want to have the year lying around in the dataframe so that we can do math with it, we have to add it back in;

In [ ]:
# 7.2.14. adding the year back in

n_s_ideas_ratio_df['year'] = [-68000, -8000, -6000, -3000, -1000, 1, 1500, 1770, 1870, 2020]

n_s_ideas_ratio_df

 

7.3. Exponential growth

A number of people ran into problems on problem set 3 calculating growth rates. So here is some math review in python—this should be, if I have gauged the situation right, be trivial for rather more than half the class, and easily doable for the rest.

Let us start with a quantity $ Y $ equal to 1 in our initial year—call it year 0: $Y_0 = 1$. Suppose that each year it grows by 1%: that when we move from year n to year n+1, we multiply the year-n value by 1.01 to get the year n+1 value: $ Y_{n+1} = (1.01)Y_n $

What then will the value of Y be in the 70th year? What is $ Y_{70} $?

Well:

$ Y_0 = 1 $
$ Y_1 = (1.01)Y_0 = 1.01 $
$ Y_2 = (1.01)(1.01)Y_0 = (1.01)^2Y_0 = 1.0201 $
$ Y_3 = (1.01)(1.01)(1.01)Y_0 = (1.01)^3Y_0 = 1.030301 $
$ Y_4 = (1.01)(1.01)(1.01)(1.01)Y_0 = (1.01)^4Y_0 = 1.04060401 $
$ Y_5 = (1.01)(1.01)(1.01)(1.01)(1.01)Y_0 = (1.01)^5Y_0 = 1.0510100501 $

and so on, multiplying by a factor 1.01 each year. This is the kind of thing computers are made for:

In [ ]:
# 7.3.1. looping to calculate Y_{70}:

Y = 1

for i in range(1,71):
    Y = 1.01*Y
    print(i, Y)

But we could have done this calculation in one line:

In [ ]:
# 7.3.2. exponentiation to calculate Y_{70}:

Y = 1.01**70

print(Y)

It takes a hair under 70 years for a quantity growing exponentially at the rate of 1% per year to double. If we write T for the doubling time and subscript it by the growth rate, we have:

$ T_{1\%} = 70 $

In general, if we have a quantity Y which starts at our baseline time—call it 0—at some value $ Y_0 $, and if it grows at a rate g, then after T years it will be:

$ Y_T = Y_0 (1 + g)^T $

And if we want, we can express this in logarithms—which turn exponentiation into multiplication, and multiplication into addition, thus:

$ ln(Y_T) = ln(Y_0) + T[ln(1+g)] $

Let's go back to doubling times.

Now suppose that we wanted to find out how long a time T it takes a quantity growing at some other rate—2% or 3% or 5% or 9%/year—to double. We could do it by trial-and-error for each growth rate. We could stare at the equation:

$ 2 = (1 + g\%)^{T_{g\%}} $

and try to figure out how to solve it.

It is much simpler to solve if we look at the logarithmic form:

$ ln(2) = \left[ln(1 + g\%)\right]{T_{g\%}} $

and then simply solve:

$ {T_{g\%}} = \frac{ln(2)}{ln(1 + g\%)} $

We know that $ ln(2) = 0.693147 $. And we also know that $ ln(1 + g\%) $ will be very close to g%. Ho close? Let's see:

In [ ]:
# 7.3.3. g% & ln(1+g%) for g = 0.01

g = 0.01

print("g    ln(1+g)              (g - ln(1+g))/g       :: description")
print(g, np.log(1+g), (g-np.log(1+g))/g, " :: comparing g% and ln(1+g%) for g=", 100*g,"%")

Very, very close, no? The difference between .01 and .00995 is only one part in 200. For g=10% the difference between g and ln(1+g) is less than one part in 20. For growth rates of more than 10%/year, you might find your calculations considerably off if you ignore the fact that $ g ≠ ln(1+g) $. But for growth rates of less than 10%/year, you won't be considerably off if you just assume that $ g = ln(1+g) $.

So what is the true doubling time for a growth rate of 1% per year? We can use the exact equation: $ {T_{g\%}} = \frac{ln(2)}{ln(1 + g\%)} $ for g = 1%/year. Do the calculation in the next code cell, remembering that 'np.log' is the pandas log function:

Now, in the next code cell, fill in the loop to print out the same three numbers—g, ln(1+g), and the proportional difference between g and ln(1+g)—for g assuming values from 1% to 2% to 3% and on up to 10%:

In [ ]:
# 7.3.4. exact doubling time for 1%/year growth rate:

exact_doubling_time_for_1_percent = ...

print(exact_doubling_time_for_1_percent, "is the exact doubling time for a 1%/year growth rate")

69.6607 years, no?

Now program up a loop to calculate exact doubling times for g equal to 1%, 2%, 3%, and so forth up to 10%/year:

In [ ]:
# 7.3.5. exact doubling times for g from 1% to 10%/year

# for i in range...
#    g = ...
#    T = ...
#    print(T, "years is the actual doubling time for an annual growth rate of", i, "%/year")

Note that for a growth rate of 8% per year the product of the growth rate and the doubling time is:

In [ ]:
# 7.3.6. doubling time multiplied by growth rate

8*np.log(2)/np.log(1+0.08)

In fact, for all ten of the growth rates, if you multiply the growth rate in percent by the doubling time you get:

In [ ]:
# 7.3.7. doubling times multiplied by growth rates

for i in range...
    g = ...
    T = ...
    print(T, "is the product of the doubling time and the growth rate for a growth rate of", i, "%/year")

Hence the rule-of-thumb called the Rule of 72. Want to know how long it takes a quantity growing exponentially to double? Take the number 72, and divide it by the growth rate in percent per year. That is, approximately, how long it takes to double.

Conversely, if you have a quantity Y that starts in the year 1500 at a value of $Y_{1500} = 1.11 $ and that is thereafter growing at an annual rate of 0.1997%/year, you can calculate its value $ Y_{2020} $ in the current year using the equation:

$ Y_T = Y_0 (1 + g\%)^T $

Where T, the number of years for which it grows, is 520; and where the growth rate g = 0.1997%/year to get:

In [ ]:
# 7.3.8. calculating the answer to f in problem set 3

Y_2020 = 1.11 * 1.001997**520

Y_2020

That was supposed to be the answer to # 3.2.C.6.f. in problem set 3—from the 1.11 in part (a), the 0.001997 growth rate calculated in part (e), and the 520 years between 1500 and 2020. Yet many fewer people got that answer than should have.

And if you have a quantity that has a value $Y_0$ in some baseline year 0 and a value $Y_T$ T years later, you can calculate its growth rate over that span using the equation, approximately:

$ g = \frac{ln(Y_T/Y_0)}{T} $

or, exactly:

$ ln(1+g) = \frac{ln(Y_T/Y_0)}{T} $

 

4. Ungraded!

Now, for a final, extra, ungraded task of this problem set, recall the evolution over time of the north-south ratio of the value of their ideas stocks that we calculated at the end of part of this problem set:

In [ ]:
# 7.3.9. recalling the north-south ideas ratio


n_s_ideas_ratio_df

Calculate and stuff into a list called "growth" the difference between the global north and the global south in the growth rates of their respective ideas stocks for each of the periods into which we have divided our history. Don't spend too much time on this—I want to know how many people can simply write down a correct answer quickly, and how many cannot:

In [ ]:
# 7.3.10. calculate the difference in ideas stocks

date = [-68000, -8000, -3000, -1000, 1, 1500, 1770, 1870, 2020]
growth = []
    
for i in range(8):
    growth = ...    
growth

(Note: Let me leave you with one last piece of information that you may find useful if it takes T years for a quantity to double, it takes only ten times as long—10T years—for it to multiply a thousandfold...)

======================================================

In [ ]:
# summary calculations cell

print(world_income_1500, "is the average income per capita in the world in 1500")
print(global_north_ideas_1870, "is the level of ideas used by the global north in 1870")
print(north_south_ideas_ratio_1, "is the north-south ratio of the value of ideas in year 1")
print(north_south_ideas_ratio_1500, "is the north-south ratio of the value of ideas in year 1500")
print(north_south_ideas_ratio_1770, "is the north-south ratio of the value of ideas in year 1770")
print(north_south_ideas_ratio_1870, "is the north-south ratio of the value of ideas in year 1870")
print(north_south_ideas_ratio_2020, "is the north-south ratio of the value of ideas in year 2020")
print(n_s_ideas_ratio, 'is the ideas-ratio series')
print(n_s_ideas_ratio_minus_3000, 'is the ideas ratio for the year -3000')
print(exact_doubling_time_for_1_percent, "is the exact doubling time for a 1%/year growth rate")
print("ideas difference growth rate list:")
growth

5. Done!

Print your finished notebook to pdf, and upload it as an answer on the problem set 7 assignment page. URL:

6. Appendix Programming Dos and Don'ts...

A Running List...

  1. Do restart your kernel and run cells up to your current working point every fifteen minutes or so. Yes, it takes a little time. But if you don't, sooner or later the machine's namespace will get confused, and then you will get confused about the state of the machine's namespace, and by assuming things about it that are false you will lose hours and hours...
     

  2. Do reload the page when restarting the kernel does not seem to do the job...
     

  3. Do edit code cells by copying them below your current version and then working on the copy: when you break everything in the current cell (as you will), you can then go back to the old cell and start fresh...
     

  4. Do exercise agile development practices: if there is a line of code that you have not tested, test it. The best way to test is to ask the machine to echo back to you the thing you have just created in its namespace to make sure that it is what you want it to be. Only after you are certain that your namespace contains what you think it does should you write the next line of code. And then you should immediately test it...
     

  5. Do take screenshots of your error messages...
     

  6. Do google your error messages: Ms. Google is your best friend here...
     

  7. Do not confuse assignment ("=") and test for equality ("=="). In general, if there is an "if" anywhere nearby, you should be testing for equality. If there is not, you should be assignment a variable in your namespace to a value. Do curse the mathematicians 500 years ago who did not realize that in the twenty-first century it would be very convenient if we had different and not confusable symbols for equals-as-assignment and equals-as-test...
     


 

Thanks to: Rachel Grossberg, Christopher Hench, Meghana Krishnakumer, Seth Lloyd, Ronald Walker...