Last edited: 2019-10-08
You should have gotten to this point via this link: http://datahub.berkeley.edu/user-redirect/interact?account=braddelong&repo=LS2019&branch=master&path=Introduction-Python-%26-Economics-delong.ipynb
This introductory notebook will familiarize you with (some of) the programming tools that will be useful to you. There are many other very good resources for a (gentle) introduction to computer programming. I especially recommend Berkeley's Data 8 website: http://data8.org/fa19/.
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.
In a notebook, each box containing text or code is called a cell.
Text cells (like this one) can be edited by double-clicking on them to make them the active cell. The formatting is then stripped, leaving an unformatted text string, and a blue or green bar appears on the left. You can then edit the text. The text in these cells is written in a simple format called Markdown. You almost surely want to learn Markdown.
After you edit a text cell, click the "run cell" button at the top that looks like ▶| to invoke the Markdown processor on the changed cell, and display the formatted version. (Try not to delete any of the instructions about what you should do.)
Notice any dollar signs in the unformatted text stream? Those tell the formatting processor that the symbols between the dollar signs make up a mathematical expression written in another not-so-simple format called LaTeX. For example:
(3.8) $ y^{*mal} = \phi y^{sub} \left( 1 + \frac{ \gamma h}{\beta}\right) $
(3.14) $ L_t^{*mal} = \left[ \left( \frac{H_t}{y^{sub}} \right) \left( \frac{s}{\delta} \right)^\theta \left( \frac{1}{\phi} \right) \left[ \frac{1}{(1+\gamma h/\delta)^\theta} \frac{1}{(1+\gamma h/\beta)} \right] \right]^\gamma $
You almost surely want to learn (some of) LaTeX as well. (Jupyter notebooks use only a small subset of LaTeX, which is a very powerful and complex programming language: it is, in fact, Turing-complete, which means that if you can write a program to compute something in any computer language on any (classical) computer, you can program it in LaTeX as well. For our purposes, full LaTeX is overkill: Markdown and the equation-processing parts of LaTeX do perfectly well.)
Understanding Check 1.1: This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the 'run cell' (▶|) button in the toolbar above. In short, you should move the previous sentence to a position after this sentence, and then click the 'run cell' (▶|) button.
Note: You almost surely also want, sometime, to learn something about one of the genius founders of computer science, mid-twentieth century British mathematician Alan Turing. Here are three very good resources:
- Charles Petzold (2008): The Annotated Turing: A Guided Tour Through Alan Turing's Historic Paper on Computability and the Turing Machine http://books.google.com/?isbn=9780470229057...
- Andrew Hodges (2012) Alan Turing: The Enigma https://books.google.com/?isbn=9781448137817...
- (2014): The Imitation Game https://en.wikipedia.org/wiki/The_Imitation_Game
Other cells contain code in the Python 3 language. You can switch a cell type from 'code' to 'Markdown' or back by selecting the appropriate option in the toolbar above. (Still other cells might contain 'raw' expressions, but we will not use those.)
To run the code in a code cell, first click on the cell to make it active. The cell should then be highlighted with a little green or blue bar to the left. Next, either press the 'run cell' button (▶|) in the toolbar above, or hold down the shift key and press return or enter.
Running a code cell will execute all of the code it contains, if this notebook is connected to a Python interpreter and so able to call the Python kernel.
Try running this cell:
print("Hello, World!")
Hello, World!
And this one:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")
👋, 🌏!
The fundamental building block of Python code is an expression: something that starts at the left end of a line and ends with a 'return' character that is not inside any unclosed left parenthesis ('('), left brace ('{'), or left bracket ('[').
Code cells can contain multiple expressions.
When you run a code cell, the successive expressions contained in the lines of code are executed in the order in which they appear. Every print expression prints a line.
Run the next cell and notice the order of the output:
print("First this line,")
print("then the whole 🌍")
print("and then this one.")
First this line,
then the whole 🌍
and then this one.
Understanding Check 1.2: Change the cell above so that it prints out:
First print out this line,
and then this one,
and then, finally, the whole 🌏.
Look at the upper right corner of this window tab. You should see an open circle and, immediately to its left, the words 'Python 3'. (If it says something else, like 'R', click on the word or phrase, select 'Python 3' in the popup that appears, and click 'select'.)
If the circle is open and empty, the Python kernel is idle and ready to execute code. If the circle is filled in, the kernel is busy: wait a little while. If it does not become clear, either reselect 'Python 3' or click the 'Restart Kernel' item in the 'Kernel' menu at the top of this window.
Do not be scared should you see a "Kernel Restarting" message! Your data and work will still be saved. Once you see "Kernel Ready" in a light blue box on the top right of the notebook, you'll be ready to work again.
After a kernel restart, however, you will need to rerun any cells in the notebook above the cell you are currently working on if they import programming modules, load data, or carry out calculations with variables.
Next to every code cell, you'll see some text that says "In [...]". Before you run the cell, you'll see "In [ ]". When the cell is running, you'll see In [*]. If you see an asterisk (*) next to a cell that doesn't go away, it's likely that the code inside the cell is taking too long to run, and it might be a good time to interrupt the kernel. When a cell is finished running, you'll see a number inside the brackets, like so: In [1]. The number corresponds to the order in which you run the cells; so, the first cell you run will show a 1 when it's finished running, the second will show a 2, and so on.
Sometimes your kernel may seem stuck, your notebook may become very slow and unresponsive, or your kernel may lose its connection. If this happens, try interrupting or restarting the kernel as described above.
There will be some questions for you in these notebooks.
For free response questions, write your answers in the provided markdown cell that starts with ANSWER:. Do not change the heading, and write your entire answer in that one cell.
For questions that are to be answered numerically, there is a code cell that starts with:
# ANSWER
and has a line in which there is a variable (like "X") currently set to underscores so:
X = ___
Replace those underscores with your final answer. It is okay to make other computations in that cell and others, so long as you set the variable to your answer.
You will use Jupyter notebooks for your own projects or documents. In order for you to make your own notebook, you'll need to create your own cells for text and code.
To add a cell, click the + button in the menu bar. It'll start out as a text cell. You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".
Understanding Check 1.3: Add a code cell below this one. Write code in it that prints out:
A whole new cell! ♪🌏♪
(That musical note symbol is like the Earth symbol. Its long-form name is \N{EIGHTH NOTE}.)
Run your cell to verify that it works.
You will make errors. And the computer will tell you when you do. Making programming errors is not a problem. Not taking steps to correct the errors you make will be.
Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:
Whenever you write code, you will make mistakes. When you run a code cell that has errors, Python will produce error messages to tell you what you did wrong. You will know it is an error because it is in a pink box to call your attention to it.
Errors are okay. Even—especially—experienced programmers make many errors. Perhaps the best programmers are those who make—and then correct—the most.
When you make an error, you just have to find the source of the problem, fix it, and move on.
We have made an error in the next cell. Run it and see what happens:
print("This line is missing something."
File "<ipython-input-1-0fbe4427aee1>", line 1
    print("This line is missing something."
                                           ^
SyntaxError: unexpected EOF while parsing
You should see something like this (minus our annotations):
The computer tells you that this is a SyntaxError: it is missing something that the computer requires in order to interpret the expression. 'EOF' means "end of file". 'unexpected EOF' means that the computer found itself confronted with the end of the cell before everything needed to make a valid expression had been presented to it. Where it needed to find a ')', it found instead the end of the file that the notebook submitted to the Python kernel when you issued the 'run cell' command.
You do not need to know all of the vast ocean of programming-language terminology in order to program effectively. When you see a cryptic message like this, you can often fix it (fix 'the bug') without having to figure out exactly what the message means.
(If the fix is not immediately obvious to you, feel free to ask a friend or somebody else in the class. There is a saying in the practice of debugging programs: "with enough eyeballs, all bugs are shallow". It means that somewhere there is an eyeball attached to a brain that already knows how to solve that bug immediately; you only have to find that eyeball and get it to look at the faulty code.)
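A minimal sketch of the same failure outside a notebook cell: the built-in compile() function parses a string of source code without running it, so the incomplete line can be caught and inspected. The names broken_source and fixed_source are only for illustration, and the exact wording of the error message depends on your Python version (newer versions report the unclosed parenthesis rather than an unexpected EOF).

```python
# Hypothetical illustration: parse an incomplete line without running it.
broken_source = 'print("This line is missing something."'

try:
    compile(broken_source, "<example>", "exec")
    caught = None
except SyntaxError as err:
    caught = err

# The parser rejects the text because the '(' is never closed.
print(type(caught).__name__)

# Adding the missing ')' gives text that parses without error.
fixed_source = broken_source + ")"
code_object = compile(fixed_source, "<example>", "exec")
```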
Note: in the toolbar, there is the option to click "Cell > Run All", which will run all the code cells in this notebook in order until it hits an error. When it hits an error, it stops the 'run all' process.
Understanding Check 1.4: Try to fix the code above so that you can run the cell and see the intended message instead of an error.
print("This line is missing something.")
Now that you know something about our computing environment, it is time to move into understanding Python proper. First, however, run the code cell below to ensure all the libraries needed for this notebook are installed:
!pip install numpy
!pip install pandas
!pip install matplotlib
Requirement already satisfied: numpy in /Users/delong/anaconda3/lib/python3.6/site-packages You are using pip version 9.0.2, however version 19.2.3 is available. You should consider upgrading via the 'pip install --upgrade pip' command. Requirement already satisfied: pandas in /Users/delong/anaconda3/lib/python3.6/site-packages Requirement already satisfied: python-dateutil>=2 in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas) Requirement already satisfied: pytz>=2011k in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas) Requirement already satisfied: numpy>=1.7.0 in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas) Requirement already satisfied: six>=1.5 in /Users/delong/anaconda3/lib/python3.6/site-packages (from python-dateutil>=2->pandas) You are using pip version 9.0.2, however version 19.2.3 is available. You should consider upgrading via the 'pip install --upgrade pip' command. Requirement already satisfied: matplotlib in /Users/delong/anaconda3/lib/python3.6/site-packages Requirement already satisfied: numpy>=1.7.1 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: six>=1.10 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: python-dateutil in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: pytz in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: cycler>=0.10 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=1.5.6 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib) You are using pip version 9.0.2, however version 19.2.3 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
The departure point for all programming is the concept of the expression. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. Here are some examples of very basic expressions:
# Examples of basic expressions:
print(2 + 2) # addition
print('me' + ' and I') # string concatenation
print("me" + str(2)) # you can print a number with a string if
# you cast the number and so change it into
# type string with 'str()'...
print(12 ** 2) # exponentiation
4
me and I
me2
144
If instead we had:
# Examples of basic expressions:
(2 + 2) # addition
('me' + ' and I') # string concatenation
("me" + str(2)) # you can print a number with a string if
# you cast the number and so change it into
# type string with 'str()'...
(12 ** 2) # exponentiation
144
You will notice that the last expression and only the last in a cell gets printed out below the cell if it has a value. If you want the computer to print more things to your screen, you need to explicitly tell it to print() whatever is inside the parentheses.
An expression can be as simple as a number object:
3.25000
3.25
or it can be an arithmetic calculation:
24 * 24 - 15/3 + 15**3
3946.0
A great many basic arithmetic expressions are built into Python.
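As a worked sketch of how Python evaluates the calculation above: exponentiation binds tightest, then multiplication and division, then addition and subtraction. The intermediate names below are assumptions introduced only to show the steps.

```python
# Break 24 * 24 - 15/3 + 15**3 into the steps Python takes.
power = 15 ** 3        # exponentiation first: 3375
product = 24 * 24      # then multiplication: 576
quotient = 15 / 3      # '/' always gives a float: 5.0
total = product - quotient + power

print(total)                              # 3946.0
print(total == 24 * 24 - 15/3 + 15**3)    # True

# A few other built-in arithmetic operations:
print(17 // 5)     # floor division: 3
print(17 % 5)      # remainder: 2
print(abs(-3.5))   # absolute value: 3.5
```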
A variable is an object in the computer's memory (in this case, an integer object and a float object) with a name. We can store a quantity in that object, and then refer to it by its name and use it later. In Python we do this via expressions that are called assignment statements: an equals sign, with the name of the variable we are assigning a value to on the left-hand side of the equals sign, and what we want the value to be on the right-hand side.
In the example below, a and b are variables:
a = 4
b = 10/5
Notice that when you create a variable object, unlike what you previously saw, nothing is printed out. Our previous expressions did calculations and then, if the expression was the last in the cell, reported the calculated value. The '=' sign in an assignment expression redirects that value to the object named before the '=' sign. An assignment expression thus has no value to report.
Once you have assigned a value to a variable, that value is then bound to the variable name. In order to refer to and use that variable and its value, simply type the variable object name that we stored the value under:
c=8/2
c
4.0
Variables are stored within the notebook's environment: once you have stored a variable value in one cell, that value carries over and can be used in all subsequently executed cells—until the kernel restarts, after which you have to rerun all the previously executed code cells to restore the computer's memory environment to its previous working state:
# Notice that 'a' retains its value from the previous
# code cell above—as long as the kernel was not restarted:
print(a)
a + b
4
6.0
Understanding Check 2.1: See if you can write a series of expressions that creates two new variables called x and y and assigns them values of 10.5 and 7.2. Then assign their product to the variable combo and print it:
# Fill in the missing lines to complete the expressions.
# x = 10.5
# y = 7.2
# combo = x * y
# print(combo)
#...
75.60000000000001
Computers have turned out to be useful for vastly more tasks than simply calculations. Data manipulation plays an especially key role. For data manipulation, the most important concept is the type of object called a list (plus its more advanced counterpart, a numpy array).
A list object is an ordered collection of other objects, of sub-objects. Lists allow us to store and access groups of variables and other objects for easy access and analysis.
Check out this documentation for an in-depth look at the capabilities of lists. (Yes, a list can contain itself as a sub-object; no, there is no way to create a list of all lists that do not contain themselves.)
To initialize a list, use brackets. Putting objects separated by commas in between the brackets adds them to the list:
# lists...
lst = [] # an empty list
print(lst)
lst = [1, 3, 6, # assigning a new list to the name lst
'lists', 'are',
'fun',
4] # note how one expression stretches across
# four lines
print(lst)
[]
[1, 3, 6, 'lists', 'are', 'fun', 4]
To access a value in a list object, count from left to right, starting at zero. Then write the name of the list, followed by the number of the subobject you wish to access in brackets:
# Elements are selected thus:
example = lst[2] # to select the '6' in the object 'lst'
print(example)
6
There are some subtleties in how Python treats lists. When you assign a list to a variable object, the variable is then a pointer to the list object. What does this mean? It means this:
a = [1,2,3] # assign an original list object to variable a
b = a # assign a to variable b; b now points to list a
b[0] = 4 # now we assign a new value to the first subobject
# of b: we assign the value '4' to it
# What now is the value of a[0]? Is it '1'—our original assignment?
# Or is it '4'—did the reassignment of b[0] also carry over to a[0]?
# In Python, a[0] is now equal to '4'
print(a[0])
4
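If you want an independent list rather than a second name for the same one, you have to ask for a copy. A small sketch (the names original, alias, and independent are illustrative only):

```python
original = [1, 2, 3]
alias = original              # a second name for the same list object
independent = list(original)  # a new list with the same elements

alias[0] = 4                  # changes the one shared list
independent[1] = 99           # changes only the copy

print(original)               # [4, 2, 3]
print(independent)            # [1, 99, 3]
print(alias is original)      # True: same object
print(independent is original)  # False: a different object
```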
In Python we can use the '+' sign not just to add numbers, but to add an element to a list—these are two very different senses of the word "add":
# adding an element to a list
example_list = [1]
example_list = example_list + [2]
print("example_list is:", example_list)
example_number = 1
example_number = example_number + 2
print("example_number is:", example_number)
example_list is: [1, 2] example_number is: 3
I think that this use of '+' for two different things, addition and concatenation, is a design flaw in Python. But that ship has long ago sailed. You will need to check, when reading and writing Python, whether each '+' is "add" in the sense of "add two numbers", or "add" in the sense of "add an extra object to a collection".
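One way to see that the two meanings of '+' are separate: Python refuses to mix them. A short sketch, with illustrative names:

```python
numbers = [1, 2]

# list + list concatenates into a new list
print(numbers + [3])    # [1, 2, 3]

# list + number is neither addition nor concatenation,
# so Python raises a TypeError
try:
    numbers + 3
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)         # False

# append() adds one element to the existing list in place
numbers.append(3)
print(numbers)          # [1, 2, 3]
```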
As you can see from above, lists do not have to be made up of elements of the same kind. Indices do not have to be taken one at a time, either. Instead, we can take a slice of indices and return the elements at those indices as a separate list.
# This line will store the 1st (inclusive),
# i.e., not including the 0th element!,
# through 4th (exclusive) elements of lst
# as a new list called lst_2:
lst_2 = lst[1:4]
lst_2
[3, 6, 'lists']
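A few more slices of the same list, as a sketch. The list is repeated here so that the example runs on its own; omitted bounds, negative indices, and a step value are all standard Python slicing behavior.

```python
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]

print(lst[1:4])   # [3, 6, 'lists']: indices 1, 2, and 3
print(lst[:3])    # [1, 3, 6]: the start defaults to 0
print(lst[4:])    # ['are', 'fun', 4]: the end defaults to the end of the list
print(lst[-1])    # 4: negative indices count from the right
print(lst[::2])   # [1, 6, 'are', 4]: every second element
```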
Why does Python use zero-based indexing? Why is the first element of the list 'lst' the 0th, 'lst[0] = 1', rather than the 1st, 'lst[1] = 3', element? This may confuse you, and it may be easier to remember how Python works if you think of Python lists like days of the week. Suppose we have the list:
days_of_the_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
What is two days from now, Monday? Wednesday. What is one day from now? Tuesday. Today is Monday—and Monday is not one day from now. Python works the same way.
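The same idea in code, as a sketch: the index is the number of days you move forward from Monday. The day names are written as strings so that the list is valid Python, and the name today is introduced only for illustration.

```python
days_of_the_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

today = 0                             # Monday is zero days from Monday
print(days_of_the_week[today])        # Monday
print(days_of_the_week[today + 1])    # Tuesday
print(days_of_the_week[today + 2])    # Wednesday
print(len(days_of_the_week))          # 5 elements, indexed 0 through 4
```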
Understanding Check 2.2: Slicing Lists: Build a list of length 10 containing whatever elements you'd like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it.
# Fill in the ellipses to complete the question.
# lst=...
# lst2=lst[...:...]
# a=lst2[...]
print(a)
6
Lists can also be operated on with a few built-in analysis functions. These include min and max, among others. Lists can also be concatenated together. Find some examples below.
# MOAR List Examples:
a_list = [1, 6, 4, 8, 13, 2] # a list containing six integers
b_list = [4, 5, 2, 14, 9, 11] # another list containing six integers.
print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))
c_list = a_list + b_list # concatenate a_list and b_list
print('Concatenated:', c_list)
Max of a_list: 13
Min of b_list: 2
Concatenated: [1, 6, 4, 8, 13, 2, 4, 5, 2, 14, 9, 11]
Python list objects are very flexible and powerful, but they are slow. And even though our computer hardware is immensely powerful, it is not quite powerful enough, sometimes, for the uses we wish to make of it in the environments we want to work in. Therefore there is an add-on library to Python, numpy, for "numerical Python", to work faster. And there is an object type in numpy, the array. A numpy array is a kind of list in which we know that all of the subobjects (elements) of the array will be numbers, and so the computer can operate on them much more quickly.
We tell the computer that we want to use the numpy library with an import statement:
import numpy as np
It is conventional to use the abbreviation 'np' to call on numpy. Whenever, in any Python program, you see an 'np.', you can safely assume that somewhere earlier in the computer's workflow there was a:
import numpy as np
Now let's take a look at some things we can do with numpy array objects:
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:')
print(example_array_2)
# Double the values in example_array and print the new array.
double_array = example_array*2
print('Doubled Array:')
print(double_array)
Undoubled Array:
[0 1 2 3 4 5 6 7 8 9]
Doubled Array:
[ 0 2 4 6 8 10 12 14 16 18]
This behavior differs from that of a list. See below what happens if you multiply a list.
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2
[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. Other mathematical operations also have... interesting... behaviors with lists. Beware, but also explore.
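If you do want to double each element of a plain list, one common approach is a list comprehension. A sketch with illustrative names (repeated and doubled are not defined elsewhere in this notebook):

```python
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

repeated = example_list * 2                # two copies, end to end
doubled = [2 * x for x in example_list]    # each element times two

print(len(repeated))   # 18
print(doubled)         # [2, 4, 6, 8, 10, 12, 14, 16, 18]
```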
The computer does not think and work like you do. You can very easily get wrong ideas into your head about what the computer will do in response to the lines of code you type and run, and then fixing your code will become next to impossible unless you work in small, discrete steps. Therefore: write a line of code; run it; check it to make sure it did what you thought it would do; fix it; rerun it; and check it again.
We will sometimes use the "ok" grader to help you see whether you have coded correctly.
Go ahead and attempt Understanding Check 2.2. Running the cell directly after it will test whether you have assigned seconds_in_a_decade correctly. If you haven't, this test will tell you the correct answer. Resist the urge to just copy it, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)
Understanding Check 2.2: Assign the name seconds_in_a_decade to the number of seconds between midnight January 1, 2010 and midnight January 1, 2020. Note that there are two leap years in this span of a decade. A non-leap year has 365 days and a leap year has 366 days.
# Change the next line
# so that it computes the number of seconds in a decade
# and assigns that number the name, seconds_in_a_decade.
seconds_in_a_decade = ...
# seconds_in_a_decade = 60*60*24*(365*10+1+1)
# We've put this line in this cell
# so that it will print the value you've given to seconds_in_a_decade when you run it.
# You don't need to change this.
seconds_in_a_decade
ok.grade("q22");
You may have noticed lines like these in the code cells so far:
# Change the next line
# so that it computes the number of seconds in a decade
# and assigns that number the name, seconds_in_a_decade.
These are comments. Comments do not make anything happen in Python: Python ignores anything on a line after a #.
Comments are there to communicate something about the code to you, the human reader. Comments are essential if anybody else is to understand your code, and in this case "anybody else" includes you a season, a month, a week, and possibly a day from now.
Do restart your kernel and run cells up to your current working point every fifteen minutes or so. Yes, it takes a little time. But if you don't, sooner or later the machine's namespace will get confused, and then you will get confused about the state of the machine's namespace, and by assuming things about it that are false you will lose hours and hours...
Do reload the page when restarting the kernel does not seem to do the job...
Do edit code cells by copying them below your current version and then working on the copy: when you break everything in the current cell (as you will), you can then go back to the old cell and start fresh...
Do exercise "agile" development practices: if there is a line of code that you have not tested, test it. The best way to test is to ask the machine to echo back to you the thing you have just created in its namespace to make sure that it is what you want it to be. Only after you are certain that your namespace contains what you think it does should you write the next line of code. And then you should immediately test it...
Do take screenshots of your error messages...
Do google your error messages: Ms. Google is your best friend here...
Do not confuse assignment ("=") and test for equality ("=="). In general, if there is an "if" anywhere nearby, you should be testing for equality. If there is not, you should be assigning a value to a variable in your namespace. Do curse the mathematicians 500 years ago who did not realize that in the twenty-first century it would be very convenient if we had different and not confusable symbols for equals-as-assignment and equals-as-test...
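A short sketch of the difference, using illustrative names:

```python
x = 5            # assignment: bind the name x to the value 5
print(x == 5)    # test for equality: True
print(x == 6)    # test for equality: False

if x == 5:       # an 'if' needs a test, so use '=='
    message = "x is five"
else:
    message = "x is something else"
print(message)   # x is five
```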
Do expect things to go wrong: it's not the end of the world, or even a threat to your soul. Back up to a stage where things were working as expected, and then try to figure out why things diverged from your expectations. This happens to everybody.
Here, for example, we have Gandalf the Grey, Python νB, confronting unexpected behavior from a Python pandas.DataFrame. Yes, he is going to have to upgrade his entire system and reboot. But in the end he will be fine:
Note: On the elective affinity between computer programming and sorcery: https://www.bradford-delong.com/2017/08/live-from-cyberspace-the-elective-affinity-between-fantasy-and-computer-programming-paul-dourish-the-original-hac.html
At the bottom-most level, a computer is a collection of electronic circuits in which there is either power (understood as a '1') or no power (understood as a '0'). These circuits are grouped into sets of 8 (a 'byte'), each of which is then understood as an 8-digit binary number, something between 0 and 255 (decimal) inclusive. In modern computers these bytes are gathered in groups of 8 into 'words'. The computer then takes these words and adds, negates, and moves them within its systems in accordance with a pattern set by the interaction of the computer's hardware design and the word currently in the computer's instruction register.
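To see the arithmetic of a byte, Python's built-in int(), bin(), and format() functions convert between binary strings and integers. A quick sketch (the value 77 is an arbitrary example):

```python
largest_byte = int("11111111", 2)   # eight 1-bits read as a binary number
print(largest_byte)                 # 255
print(2 ** 8)                       # 256 distinct values in one byte
print(bin(77))                      # 0b1001101
print(format(77, "08b"))            # 01001101, padded to a full byte
```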
But working directly with a computer—programming "on the bare metal"—is impossible for humans. At the bare-metal level, a computer can do things like:
So we build programming languages that translate our ideas about what the computer should do into a pattern of machine-language instructions that the computer can load into its instruction register and work with. And it is essential that such programming languages allow us to work not with raw 8-bit or 64-bit binary numbers but instead with objects. Early programming languages set up only four kinds of objects: you could ask the interpreter or compiler that took your program and translated it into machine language to set things up so that some of the numbers in the computer memory could represent integers, others could represent numbers with a fractional part, still others could represent strings of symbols like letters, and still others could represent sets of _operations_—combinations of the basic instruction operations.
Note: You probably want to learn a little—but only a little—about how what you write in your code cells is translated into dancing electrons. I recommend reading Charles Petzold (1999): Code: The Hidden Language of Computer Hardware and Software http://books.google.com/?isbn=9780735605053, starting at the beginning and continuing until you get bored...
But it very soon became clear (in fact, at the start of computer science it was obvious) that having only these four kinds of objects was inadequate. So slightly later computer languages introduced other types of objects. We have already seen lists and arrays. But the most important of the early additional objects was the _loop_: a way of conveniently specifying that a similar calculation was to be done over and over again. Then came the _function_: an easy-to-type and easy-to-remember way of setting up a calculation that has a few inputs and one output, and that you are going to want to do over and over again for different input values. Ultimately, computer science developed the idea that a modern computer language should be object-oriented: it should allow you (in fact, attempt to force you) to define and use your own kinds of _objects_ to have whatever features and properties you found convenient.
Python is a very modern computer language.
But, first, let us back up to loops. Loops are super useful. If each line of code we wrote were only executed once, programming would be very unproductive. But by repeated execution of sets of statements on slightly different sets of data, we can begin to unlock the potential of computers.
Python has two major types of loops: for loops and while loops. A for loop does a thing over and over again for different values of a something, whatever it is, that is named immediately after the keyword for. A while loop does a thing over and again while something is true, and stops when that something becomes false.
The while loop repeatedly performs operations until a conditional (something that is either True or False) is no longer satisfied. You will hear such a conditional thing called a boolean expression.
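A minimal sketch of a boolean expression controlling a while loop; the name countdown is illustrative only:

```python
countdown = 3
print(countdown > 0)    # True: a boolean expression

while countdown > 0:    # keep going as long as the test is True
    print(countdown)    # prints 3, then 2, then 1
    countdown = countdown - 1

print(countdown > 0)    # False: the loop has stopped
```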
Thirteenth-century Italian mathematician Leonardo Bonacci of Pisa—now called Fibonacci—wrote the Book of Calculation: the first European-language (Latin) book showing what Hindu-Arabic numerals could do—how much easier than was possible using Roman numerals it would be to do money-changing, interest-owed, measure-conversion, and other calculations. In his book, he gave an example of how you could calculate the growth of a hypothetical population of rabbits, which followed what we now call the very useful Fibonacci sequence, in which the first two numbers are 1 and 1, and each subsequent number is the sum of the preceding two numbers.
The code cell below uses a while loop to calculate the nth Fibonacci number:
# using a while loop to calculate the nth Fibonacci number
#
# compute the nth Fibonacci number:
n = 14 # set which sequence number to compute
i = 1
previous_number = 0
current_number = 1
while i < n+1:
    print(i, current_number)
    next_number = previous_number + current_number
    previous_number = current_number
    current_number = next_number
    i = i + 1
1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55
11 89
12 144
13 233
14 377
That—the fourteenth Fibonacci number—was as far as Leonardo of Pisa got in his calculations in his Book of Calculation.
The program does the calculation he does, but much more quickly. It starts with the zeroth Fibonacci number (0) and the first number (1) set as the values of the variables previous_number and current_number, respectively, with the index of the sequence i set equal to 1, and with the desired Fibonacci number to be calculated set to n.
Then the computer runs the while loop: while the index i is less than n+1, the program prints the current value of the index i and the current_number, sets the next_number to the sum of the current_number and the previous_number, sets the previous_number equal to the current_number, sets the current_number equal to the next_number, adds one to the index i, and then goes back to the top of the loop to check whether i is still less than n+1. If it is not, it exits the loop, and the program cell's calculations come to an end.
If you had told Fibonacci back in the thirteenth century that in less than 800 years every student could easily learn how to construct a golem that in a fraction of a second could calculate not the 14th of his numbers (377) but the 1000th:
4346655768693745643568852767504062580256466051
73717804024817290895365554179490518904038798400
79255169295922593080322634775209689623239873322
47116164299644090653318793829896964992851600370
4476137795166849228875
what would his reaction have been?
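For the curious, here is one way such a calculation might look: a minimal sketch that reuses the while loop from the cell above with n set to 1000 (dropping the print inside the loop) and times it with the standard library's time.perf_counter. The names `start` and `elapsed` are illustrative, and the elapsed time will vary from machine to machine.

```python
import time

n = 1000                      # which Fibonacci number to compute
start = time.perf_counter()   # read the clock before the loop

i = 1
previous_number = 0
current_number = 1            # the first Fibonacci number
while i < n:                  # each pass moves current_number one step along
    next_number = previous_number + current_number
    previous_number = current_number
    current_number = next_number
    i = i + 1

elapsed = time.perf_counter() - start
print(current_number)
print(len(str(current_number)), "digits, computed in", elapsed, "seconds")
```

Python integers grow as large as needed, so no special handling is required for a number of this size.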
For loop objects are essential in traversing a list and performing an analogous set of calculations at each element. For example, the great German mathematician and philosopher Gottfried Wilhelm Leibniz (1646-1716) discovered a wonderful formula for 𝜋 as an infinite sum of simple fractions like this:
$ \pi = 4 - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \frac{4}{11} + ... $
We can use for loop objects to, first, calculate the terms of the series; and then to sum them:
# using Leibnitz's series for π to calculate an approximate value
n=1000 # for how many terms do we wish to calculate the series?
#
# set up the terms of Leibnitz's series
series_terms = [] # start with an empty list to hold the series terms
for i in range(1,n+1): # for each of the first n terms of the series
    series_terms = series_terms + [4*(-1)**(i+1)*1/(2*i-1)] # calculate the series term
#
# use the terms to calculate an approximation to π
π_approximation = 0 # start with zero
for term in series_terms: # for each of the series terms in our list
    π_approximation = π_approximation + term # add the series term to our
                                              # approximation to π
print(π_approximation) # print our approximation to π
3.140592653839794
In both of the for loops in the above cell, the most important line is the "for ... in ..." line. This sets the structure. It tells the computer to step through every element of the object named after the "in", perform the indicated operations on each one, and then move on. Once Python has stepped through every element, the computer exits the loop; after the second loop it prints "π_approximation".
Note that the names "i" and "term" are arbitrary: they are simply labels for whichever element the loop is currently working on, and they could be named anything. For example:
π_approximation = 0 # start with zero
for rudolph_the_red_nosed_reindeer in series_terms:
    π_approximation = π_approximation + rudolph_the_red_nosed_reindeer
print(π_approximation) # print our approximation to π
3.140592653839794
works exactly the same.
Understanding Check 2.3: In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following:
Hint: To check if an integer x is a multiple of y, use the modulus operator %. Typing x % y will return the remainder when x is divided by y. Therefore, (x % y != 0) will return True when y does not divide x, and False when it does.
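The behavior described in the hint can be tried out on a few numbers first. This is only an illustration of the % operator, not a solution to the exercise; the list `example_values` and the name `multiples_of_7` are made up for this sketch.

```python
# x % y gives the remainder when x is divided by y
print(12 % 3)        # 0, so 12 is a multiple of 3
print(31 % 5)        # 1, so 31 is not a multiple of 5
print(31 % 5 != 0)   # True: 5 does not divide 31
print(50 % 5 != 0)   # False: 5 divides 50

# collect the multiples of 7 from a short list of illustrative values
example_values = [14, 20, 21, 30]
multiples_of_7 = []
for value in example_values:
    if value % 7 == 0:
        multiples_of_7 = multiples_of_7 + [value]
print(multiples_of_7)
```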
import numpy as np
# Make use of iterators, range, length, while
# loops, and indices to complete this question.
question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])
# for i in ...:
# while ...:
# question_3[i]+=1
# print(question_3)
[ 15 35 50 0 25 30 20 105 45 15 80]
The loop objects in the previous section are messy. They contain several expressions, and extend over several lines. It would be nice to figure a way to gather all these pieces together, and make it more transparent just why the object exists and what it is good for.
A programming language, after all, is much more than a means for instructing a computer to perform tasks. It is a way for us to organize our thoughts about the computer and what it is doing: programs must be written for people to read as well as for computers to execute. And the most powerful way to enable people to read is for the programming language to provide tools by which compound elements can be built, combined, and then named and used as a single unit: as an easily understood and referenced object.
Function objects are thus useful when you want to repeat a series of steps in a calculation for multiple different sets of input values, but don't want to type out the steps over and over again. A function object provides you with an easy-to-type and easy-to-remember way of setting up such a calculation, and then referring to it and adding it to your program cells whenever you wish.
Many functions are built into Python already; for example, you've already made use of len() to retrieve the number of elements in a list.
You can also write your own functions. At this point you already have the skills to begin to do so. It is good to get into the habit of writing functions: the major benefit of using functions is that it makes your code much easier for humans to read, and the human who will have the most trouble reading your code is yourself three months from now, when you are trying to study for the final exam.
Functions generally take a set of parameters (also called inputs), which define the objects they will use when they are run. For example, the len() function takes a list or array as its parameter, and returns the length of that list.
All of the loops we wrote above can be better presented when encapsulated in functions:
# a function to calculate the nth Fibonacci number
def Fibonacci(n):
    """
    compute the nth Fibonacci number
    function input: n, the index of the
    Fibonacci number to be computed
    function output: F_n, the nth Fibonacci
    number
    """
    i = 1
    previous_number = 0
    current_number = 1
    while i < n+1:
        next_number = previous_number + current_number
        previous_number = current_number
        current_number = next_number
        i = i + 1
    return previous_number
Once you have defined this function, you can then, whenever you want, simply calculate the 14th Fibonacci number and store it in a variable—called "result", say—by simply invoking:
result = Fibonacci(14)
result
377
Or, if Python did not already have a perfectly good value of π built in, a function to calculate π using the Leibniz approximation might be useful:
# a function to calculate the n-term Leibnitz approximation to π
def Leibnitz_π(n):
    """
    compute the n-term Leibnitz approximation to π
    function input: n, the number of terms to be
    calculated in the approximation
    function output: π_{Ln}
    """
    series_terms = []
    for i in range(1,n+1):
        series_terms = series_terms + [4*(-1)**(i+1)*1/(2*i-1)]
    π_approximation = 0
    for term in series_terms:
        π_approximation = π_approximation + term
    return π_approximation
result = Leibnitz_π(10000)
result
3.1414926535900345
But Python does, in the "math" library:
import math
math.pi
3.141592653589793
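To see how slowly Leibniz's series converges, one can compare its partial sums against math.pi. The sketch below is self-contained and uses a hypothetical helper named `leibniz_partial_sum` (it is not one of the notebook's functions) that adds the terms directly instead of first storing them in a list:

```python
import math

def leibniz_partial_sum(n):
    """sum of the first n terms of Leibniz's series for π"""
    total = 0.0
    for i in range(1, n + 1):
        total = total + 4 * (-1)**(i + 1) / (2*i - 1)
    return total

# compare partial sums of increasing length with the library value
for n in [10, 100, 1000, 10000]:
    approximation = leibniz_partial_sum(n)
    error = abs(math.pi - approximation)
    print(n, approximation, error)
```

The error shrinks roughly in proportion to 1/n, so each additional correct digit costs about ten times as many terms. That is why the series is a nice teaching example rather than a practical way to compute π.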
Or if we needed a function to test whether a number is prime:
# prime number test function
def is_multiple(m,n):
    """
    is m a multiple of n?
    """
    if (m%n == 0):
        return True
    else:
        return False
def is_it_prime(n):
    """
    tests the number n for primality
    """
    if (n < 2): # 0, 1, and negative numbers are not prime
        return False
    for i in range(2, n//2 + 1): # no divisor of n other than n itself exceeds n/2
        if (is_multiple(n, i)):
            return False
    return True # no divisor was found, so n is prime
is_it_prime(9)
False
Note: The function above uses if statements. Read more about the if statement here: https://www.tutorialspoint.com/python/python_if_else.htm.
Remember: The principal reason to use functions—and other objects—is to aid in your understanding, not the computer's. The computer does not care: for it, it is all patterns of ones and zeroes. Thus wherever you can make your program easier to read by explicitly using a function, do so.
For example, there are lots of times when economists want to calculate the marginal utility of spending on something—say, the marginal utility of spending on consumption for a consumer whose attitude toward risk is captured by a constant relative risk aversion parameter γ. The calculation of marginal utility as a function of the current level of consumption c and the CRRA parameter γ is straightforward:
marg_util = c**(-γ)
unless $ \gamma = 1 $, in which case:
marg_util = 1/c
So you wind up writing code like:
if γ == 1:
    marg_util = 1/c
else:
    marg_util = c**(-γ)
So why not put it into a function called marginal_utility, so that you will have one line instead of 4, and to remind yourself of what is going on each time you read through your code?
# marginal utility function
def marginal_utility(c, γ):
    """
    the marginal utility of increasing spending on consumption
    for a consumer with a constant-relative risk aversion
    parameter γ and a current level of consumption spending c
    """
    if γ == 1:
        return 1 / c
    else:
        return c**(-γ)
c = 5
γ = 2
result = marginal_utility(c, γ)
result
0.04
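Once the calculation lives in a function, it is easy to evaluate it at many points. The sketch below repeats the function so that it runs on its own, writing the parameter as `gamma` in plain letters in place of γ; the list `consumption_levels` is an illustrative choice, not data from the course.

```python
def marginal_utility(c, gamma):
    """CRRA marginal utility, following the same rule as the function above"""
    if gamma == 1:
        return 1 / c
    else:
        return c**(-gamma)

# marginal utility at several consumption levels, for gamma = 1 and gamma = 2
consumption_levels = [1, 2, 5, 10]
for c in consumption_levels:
    print(c, marginal_utility(c, 1), marginal_utility(c, 2))
```

In each column the numbers fall as c rises (diminishing marginal utility), and they fall faster when the risk aversion parameter is larger.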
Python is a modern kind of computer language called "object oriented". That means this: in Python, everything is an object that someone has defined and that you can redefine, and you are free to define your own objects as you wish.
OK. What does that last paragraph mean, and why does it matter?
In Python, an object is a collection of data and instructions held in computer memory that consists of:
You might think that a Python variable is simply a named box in which you can store a number. And, indeed the variable has a name and does have its value as its contents. But Python knows that a variable is also an object that understands a great many ways you would like to use it. These predefined ways to use it are called "methods".
Here are some of the methods that come with a variable object:
y = 3
print(y) # value
print(y + y) # add
print(y * y) # multiply
print(y - y) # subtract
print(y/y) # divide
print(y**y) # exponentiate
print(y.__abs__()) # absolute value
print(y.__bool__()) # is it "true" (i.e., not zero)?
print(y.__lt__(5)) # is it less than 5?
print(y.real) # real part
print(y.imag) # imaginary part
print(y.conjugate()) # complex conjugate
3 6 9 0 1.0 27 3 True True 3 0 3
x = 3 + 2j
print(x) # value
print(x.real) # real part
print(x.imag) # imaginary part
print(x.conjugate()) # complex conjugate
print(x + x) # add
print(x * x) # multiply
print(x - x) # subtract
print(x/x) # divide
print(x**x) # exponentiate
print(x.__abs__()) # absolute value
print(x.__bool__()) # is x "true" (i.e., not zero)?
print(y.__lt__(5)) # is y less than 5? (complex numbers like x have no ordering)
(3+2j) 3.0 2.0 (3-2j) (6+4j) (5+12j) 0j (1+0j) (-5.409738793917678-13.410442370412747j) 3.605551275463989 True True
You can call the dir(y) command to see all the methods that Python has associated with the variable y. And suppose that you want more methods attached to one of your objects? Then you can extend the class of object it is, and define your own.
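The full output of dir is long, because it includes the many names that begin and end with double underscores. A small sketch, assuming you only want the remaining "public" names (the list name `public_names` is illustrative):

```python
y = 3

# keep only the names that do not begin with an underscore
public_names = []
for name in dir(y):
    if not name.startswith("_"):
        public_names.append(name)
print(public_names)

# the same idea works for any object, for example a character string
print("upper" in dir("UC Davis"))   # True: strings have an .upper method
```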
Python has tools for you to figure out what methods an object has. Consider a list object. Does it know how to use the "append" method to add another object to it? Let's see:
university_list = ['UC San Diego', 'UC Riverside', 'UC Irvine', 'UC Los Angeles',
'UC Santa Barbara', 'UC Merced', 'UC Santa Cruz',
'UC San Francisco', 'UC Davis']
print("university_list is", len(university_list), "items long")
print("does university list understand the .append method?", callable(university_list.append))
university_list is 9 items long does university list understand the .append method? True
So let's add the missing UC campus—the one that Buffy Summers graduated from:
university_list.append('UC Sunnydale')
university_list
['UC San Diego', 'UC Riverside', 'UC Irvine', 'UC Los Angeles', 'UC Santa Barbara', 'UC Merced', 'UC Santa Cruz', 'UC San Francisco', 'UC Davis', 'UC Sunnydale']
"university list" is now a longer list than it was:
print("university_list is", len(university_list), "items long")
university_list is 10 items long
The element of "university list" at index 9 (the tenth element, since counting starts at zero) is the character string "UC Sunnydale". And that is an object that has its methods:
university_list[9].upper()
'UC SUNNYDALE'
university_list[9].lower()
'uc sunnydale'
Imagine now you want to write a program with consumers, who can:
A natural solution in Python would be to create consumers as objects with
Python makes it easy to do this, by providing you with class definitions that allow you to build objects according to your own specifications.
A class definition is a blueprint for a particular class of objects (e.g., lists, strings or complex numbers).
It describes what kind of data the class stores, and what methods it has for acting on these data. An object or instance is a realization of the class, created from the blueprint. Each instance has its own unique data.
Methods set out in the class definition act on this (and other) data. In Python, the data and methods of an object are collectively referred to as attributes.
Attributes are accessed via “dotted attribute notation”:
object_name.data
object_name.method_name()
x = [1, 5, 4]
x.sort()
x.__class__
x is an object or instance, created from the definition for Python lists, but with its own particular data.
x.sort() and x.__class__ are two attributes of x.
dir(x) can be used to view all the attributes of x.
We’ll build a Consumer class with
- a wealth attribute that stores the consumer’s wealth (data)
- an earn method, where earn(y) increments the consumer’s wealth by y
- a spend method, where spend(x) either decreases wealth by x or returns an error if insufficient funds exist
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
class Consumer:
    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w
    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y
    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth
If you look at the Consumer class definition again you’ll see the word self throughout the code.
The rules with self are that
1. Any instance data should be referenced with the prefix self, e.g., the earn method references self.wealth rather than just wealth
2. Any method defined within the class should have self as its first argument, e.g., def earn(self, y) rather than just def earn(y)
3. Any method referenced within the class should be called as self.method_name
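To see the class in action, here is a short usage sketch. It repeats the Consumer definition so that it runs on its own; the instance names `c1` and `c2` and the dollar amounts are illustrative.

```python
class Consumer:
    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w

    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y

    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth

c1 = Consumer(10)   # create an instance with initial wealth 10
c1.spend(5)
print(c1.wealth)    # 5
c1.earn(15)
c1.spend(100)       # too much: prints the message and leaves wealth unchanged
print(c1.wealth)    # 20

c2 = Consumer(8)    # a second instance keeps its own, separate wealth
c2.spend(3)
print(c1.wealth, c2.wealth)
```

Notice that when a method is called as c1.spend(5), Python passes the instance c1 in as self automatically, which is why the call has one fewer argument than the definition.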
For our next example, let’s write a simple class to implement the Solow growth model. The Solow growth model is a neoclassical growth model where the amount of capital stock per capita $k_{t}$ evolves according to the rule
(1) $ k_{t+1} = \frac{s z k_t^{\alpha} + (1 - \delta) k_t}{1 + n} $
Here
- $s$ is an exogenously given savings rate
- $z$ is a productivity parameter
- $\alpha$ is capital’s share of income
- $n$ is the population growth rate
- $\delta$ is the depreciation rate
The steady state of the model is the $k$ that solves (1) when $k_{t+1} = k_t = k$.
Some points of interest in the code are:
- An instance keeps its current capital stock in the variable self.k.
- The h method implements the right-hand side of (1).
- The update method uses h to update capital as per (1). Notice how inside update the reference to the local method h is self.h.
- The methods steady_state and generate_sequence are fairly self-explanatory.
class Solow:
    r"""
    Implements the Solow growth model with the update rule
        k_{t+1} = [(s z k^α_t) + (1 - δ)k_t] /(1 + n)
    """
    def __init__(self, n=0.05, # population growth rate
                 s=0.25, # savings rate
                 δ=0.1, # depreciation rate
                 α=0.3, # capital's share of income
                 z=2.0, # productivity
                 k=1.0): # current capital stock
        self.n, self.s, self.δ, self.α, self.z = n, s, δ, α, z
        self.k = k
    def h(self):
        "Evaluate the h function"
        # Unpack parameters (get rid of self to simplify notation)
        n, s, δ, α, z = self.n, self.s, self.δ, self.α, self.z
        # Apply the update rule
        return (s * z * self.k**α + (1 - δ) * self.k) / (1 + n)
    def update(self):
        "Update the current state (i.e., the capital stock)."
        self.k = self.h()
    def steady_state(self):
        "Compute the steady state value of capital."
        # Unpack parameters (get rid of self to simplify notation)
        n, s, δ, α, z = self.n, self.s, self.δ, self.α, self.z
        # Compute and return steady state
        return ((s * z) / (n + δ))**(1 / (1 - α))
    def generate_sequence(self, t):
        "Generate and return a time series of length t"
        path = []
        for i in range(t):
            path.append(self.k)
            self.update()
        return path
Here’s a little program that uses the class to compute time series from two different initial conditions. The common steady state is also plotted for comparison.
s1 = Solow()
s2 = Solow(k=8.0)
T = 60
fig, ax = plt.subplots(figsize=(9, 6))
# Plot the common steady state value of capital
ax.plot([s1.steady_state()]*T, 'k-', label='steady state')
# Plot time series for each economy
for s in s1, s2:
    lb = f'capital series from initial state {s.k}'
    ax.plot(s.generate_sequence(T), 'o-', lw=2, alpha=0.6, label=lb)
ax.legend()
plt.show()
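The convergence that the plot shows can also be checked numerically, without any plotting. The following is a minimal sketch, not a replacement for the class: it copies the default parameter values from the Solow class above into plain variables (written `delta` and `alpha` in place of δ and α), and the function name `next_capital` and the choice of 500 iterations are assumptions made for illustration.

```python
# parameter values copied from the defaults of the Solow class above
n, s, delta, alpha, z = 0.05, 0.25, 0.1, 0.3, 2.0

def next_capital(k):
    "right-hand side of the update rule (1)"
    return (s * z * k**alpha + (1 - delta) * k) / (1 + n)

# steady state: the k that the update rule maps to itself
k_star = ((s * z) / (n + delta))**(1 / (1 - alpha))

# iterate the rule from the two initial conditions used in the plot
for k_initial in [1.0, 8.0]:
    k = k_initial
    for t in range(500):
        k = next_capital(k)
    print(k_initial, k, k_star)
```

Whether capital starts below the steady state (k = 1) or above it (k = 8), repeated application of the update rule brings it to the same value.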
Congratulations, you have finished your first assignment for Econ 101B! Run the cell below to submit all of your work. Make sure to check on OK to make sure that it has uploaded.
We will be managing our data using "dataframe" objects from the Pandas library, one of the most widely used Python libraries in data science. We need to import "pandas", and it is conventional to abbreviate it "pd":
import numpy as np
import pandas as pd
The rows and columns of a pandas dataframe are essentially a collection of lists stacked on top of or next to each other. For example, to store the top 10 movies and their ratings, I could create 10 lists. Each list would contain a movie's rating, its title, and its release date, and each list would be one row of the dataframe's data table:
top_10_movies_df = pd.DataFrame(data=np.array(
[[9.2, 'The Shawshank Redemption', 1994],
[9.2, 'The Godfather', 1972],
[9.0, 'The Godfather: Part II', 1974],
[8.9, 'Pulp Fiction', 1994],
[8.9, "Schindler's List", 1993],
[8.9, 'The Lord of the Rings: The Return of the King', 2003],
[8.9, '12 Angry Men', 1957],
[8.9, 'The Dark Knight', 2008],
[8.9, 'Il buono, il brutto, il cattivo', 1966],
[8.8, 'The Lord of the Rings: The Fellowship of the Ring',2001]]),
columns=["Rating", "Movie", "Date"])
top_10_movies_df
| | Rating | Movie | Date |
---|---|---|---|
0 | 9.2 | The Shawshank Redemption | 1994 |
1 | 9.2 | The Godfather | 1972 |
2 | 9.0 | The Godfather: Part II | 1974 |
3 | 8.9 | Pulp Fiction | 1994 |
4 | 8.9 | Schindler's List | 1993 |
5 | 8.9 | The Lord of the Rings: The Return of the King | 2003 |
6 | 8.9 | 12 Angry Men | 1957 |
7 | 8.9 | The Dark Knight | 2008 |
8 | 8.9 | Il buono, il brutto, il cattivo | 1966 |
9 | 8.8 | The Lord of the Rings: The Fellowship of the Ring | 2001 |
Alternatively, we could use an object of class "dictionary" to store our data. You can think of a list as a way of associating each value with a key which is its place in the list. Thus:
Godfather_II_list = ['The Godfather: Part II', 1974, 9.0]
print("0:", Godfather_II_list[0])
print("1:", Godfather_II_list[1])
print("2:", Godfather_II_list[2])
0: The Godfather: Part II 1: 1974 2: 9.0
A dictionary is like a list, but instead of its keys being the numbers 0, 1, 2, 3..., the dictionary's keys can be anything you define:
Godfather_II_dict = {'title': 'The Godfather: Part II', 'date': 1974, 'rating': 9.0}
print("title:", Godfather_II_dict['title'])
print("date:", Godfather_II_dict['date'])
print("rating:", Godfather_II_dict['rating'])
title: The Godfather: Part II date: 1974 rating: 9.0
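A few more things a dictionary can do are sketched below. The variable name `movie` and the extra 'director' key are illustrative additions for this example only.

```python
movie = {'title': 'The Godfather: Part II', 'date': 1974, 'rating': 9.0}

# add a new key-value pair simply by assigning to a new key
movie['director'] = 'Francis Ford Coppola'

# test whether a key is present
print('rating' in movie)   # True
print('genre' in movie)    # False

# step through all of the key-value pairs with a for loop
for key, value in movie.items():
    print(key, "->", value)

print(len(movie))          # the number of key-value pairs
```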
In our top 10 movies example, we could create a dictionary that contains three keys: "Rating" as the key to a list of the ratings, "Movie" as a key to the movie titles, and "Date" as a key to the dates:
top_10_movies_dict = {"Rating" : [9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8],
"Movie" : ['The Shawshank Redemption (1994)',
'The Godfather',
'The Godfather: Part II',
'Pulp Fiction',
"Schindler's List",
'The Lord of the Rings: The Return of the King',
'12 Angry Men',
'The Dark Knight',
'Il buono, il brutto, il cattivo',
'The Lord of the Rings: The Fellowship of the Ring'],
"Date" : [1994, 1972, 1974, 1994, 1993, 2003, 1957, 2008, 1966, 2001]
}
Now, we can use this dictionary to create a table with columns Rating, Movie, and Date:
top_10_movies_df2 = pd.DataFrame(data=top_10_movies_dict, columns=["Rating", "Movie", "Date"])
top_10_movies_df2
| | Rating | Movie | Date |
---|---|---|---|
0 | 9.2 | The Shawshank Redemption (1994) | 1994 |
1 | 9.2 | The Godfather | 1972 |
2 | 9.0 | The Godfather: Part II | 1974 |
3 | 8.9 | Pulp Fiction | 1994 |
4 | 8.9 | Schindler's List | 1993 |
5 | 8.9 | The Lord of the Rings: The Return of the King | 2003 |
6 | 8.9 | 12 Angry Men | 1957 |
7 | 8.9 | The Dark Knight | 2008 |
8 | 8.9 | Il buono, il brutto, il cattivo | 1966 |
9 | 8.8 | The Lord of the Rings: The Fellowship of the Ring | 2001 |
Both ways produced the same data table, the same dataframe. The list method created the table by using the lists to make up the rows of the table. The dictionary method took the dictionary keys and used them to make up the columns of the table.
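One detail may be worth checking when choosing between the two constructions: wrapping rows that mix numbers and text in np.array converts every entry to text, whereas a plain list of rows or a dictionary of columns lets pandas keep the numbers as numbers. The sketch below uses a shortened two-movie table and illustrative names (`rows`, `df_from_array`, `df_from_rows`, `df_from_dict`):

```python
import numpy as np
import pandas as pd

rows = [[9.2, 'The Godfather', 1972],
        [9.0, 'The Godfather: Part II', 1974]]
columns = ["Rating", "Movie", "Date"]

# three ways of building the same small table
df_from_array = pd.DataFrame(data=np.array(rows), columns=columns)
df_from_rows = pd.DataFrame(data=rows, columns=columns)
df_from_dict = pd.DataFrame(data={"Rating": [9.2, 9.0],
                                  "Movie": ['The Godfather', 'The Godfather: Part II'],
                                  "Date": [1972, 1974]})

# in the np.array version the rating is stored as text, not as a number
print(type(df_from_array["Rating"].iloc[0]))

# the other two versions keep numeric columns numeric
print(df_from_rows.dtypes)
print(df_from_rows.equals(df_from_dict))
```

The tables look identical when displayed, but arithmetic such as averaging the ratings only works directly on the versions whose Rating column is numeric.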
Luckily for you, most data tables in this course will be premade by somebody else, and precleaned. Perhaps the most common file type that is used for economic data is the comma-separated-values (.csv) file. If properly cleaned, such files are easy to read in as pandas dataframes. We will use the "pd.read_csv" function, which here takes as its one input parameter the url of the csv file you want to turn into a dataframe:
import pandas as pd
# Run this cell to read in the table
nipa_quarterly_accounts_df = pd.read_csv("https://delong.typepad.com/files/quarterly_accounts.csv")
What has this command done? Let's use the "head" method attached to objects in the dataframe class to see:
nipa_quarterly_accounts_df.head()
| | Year | Quarter | Real GDI | Real GDP | Nominal GDP |
---|---|---|---|---|---|
0 | 1947 | Q1 | 1912.5 | 1934.5 | 243.1 |
1 | 1947 | Q2 | 1910.9 | 1932.3 | 246.3 |
2 | 1947 | Q3 | 1914.0 | 1930.3 | 250.1 |
3 | 1947 | Q4 | 1932.0 | 1960.7 | 260.3 |
4 | 1948 | Q1 | 1984.4 | 1989.5 | 266.2 |
Oftentimes, tables will contain a lot of extraneous data that muddles our data tables, making it more difficult to quickly and accurately obtain the data we need. To correct for this, we can select out columns or rows that we need by indexing our dataframes.
The easiest way to index into a table is with square bracket notation. Suppose you wanted to obtain all of the Real GDP data from the table. Using a single pair of square brackets, you could index the table for "Real GDP":
# Run this cell and see what it outputs
nipa_quarterly_accounts_df["Real GDP"]
0 1934.5 1 1932.3 2 1930.3 3 1960.7 4 1989.5 5 2021.9 6 2033.2 7 2035.3 8 2007.5 9 2000.8 10 2022.8 11 2004.7 12 2084.6 13 2147.6 14 2230.4 15 2273.4 16 2304.5 17 2344.5 18 2392.8 19 2398.1 20 2423.5 21 2428.5 22 2446.1 23 2526.4 24 2573.4 25 2593.5 26 2578.9 27 2539.8 28 2528.0 29 2530.7 ... 251 14541.9 252 14604.8 253 14745.9 254 14845.5 255 14939.0 256 14881.3 257 14989.6 258 15021.1 259 15190.3 260 15291.0 261 15362.4 262 15380.8 263 15384.3 264 15491.9 265 15521.6 266 15641.3 267 15793.9 268 15757.6 269 15935.8 270 16139.5 271 16220.2 272 16350.0 273 16460.9 274 16527.6 275 16547.6 276 16571.6 277 16663.5 278 16778.1 279 16851.4 280 16903.2 Name: Real GDP, Length: 281, dtype: float64
Notice how the above cell returns a column (a pandas Series, which behaves much like an array) of all the real GDP values in their original order. Now, if you wanted to get the first real GDP value from this column, you could index it with another pair of square brackets:
nipa_quarterly_accounts_df["Real GDP"][0]
1934.5
Pandas columns have many of the same properties as numpy arrays. Keep in mind that pandas dataframes, as well as many other data structures, are zero-indexed, meaning indexes start at 0 and end at the number of elements minus one.
If you wanted to create a new data table with select columns from the original table, you can index with double brackets.
## Note: .head() returns the first five rows of the table
nipa_quarterly_accounts_df[["Year", "Quarter", "Real GDP", "Real GDI"]].head()
| | Year | Quarter | Real GDP | Real GDI |
---|---|---|---|---|
0 | 1947 | Q1 | 1934.5 | 1912.5 |
1 | 1947 | Q2 | 1932.3 | 1910.9 |
2 | 1947 | Q3 | 1930.3 | 1914.0 |
3 | 1947 | Q4 | 1960.7 | 1932.0 |
4 | 1948 | Q1 | 1989.5 | 1984.4 |
You can get rid of columns you don't need using .drop():
nipa_quarterly_accounts_df.drop("Nominal GDP", axis=1).head()
| | Year | Quarter | Real GDI | Real GDP |
---|---|---|---|---|
0 | 1947 | Q1 | 1912.5 | 1934.5 |
1 | 1947 | Q2 | 1910.9 | 1932.3 |
2 | 1947 | Q3 | 1914.0 | 1930.3 |
3 | 1947 | Q4 | 1932.0 | 1960.7 |
4 | 1948 | Q1 | 1984.4 | 1989.5 |
Finally, you can use square bracket notation to select rows by their positions. For example, if I wanted the rows numbered 20 through 30 of nipa_quarterly_accounts_df (the slice 20:31 stops just before row 31):
nipa_quarterly_accounts_df[20:31]
| | Year | Quarter | Real GDI | Real GDP | Nominal GDP |
---|---|---|---|---|---|
20 | 1952 | Q1 | 2398.3 | 2423.5 | 360.2 |
21 | 1952 | Q2 | 2412.6 | 2428.5 | 361.4 |
22 | 1952 | Q3 | 2435.0 | 2446.1 | 368.1 |
23 | 1952 | Q4 | 2509.5 | 2526.4 | 381.2 |
24 | 1953 | Q1 | 2554.3 | 2573.4 | 388.5 |
25 | 1953 | Q2 | 2572.2 | 2593.5 | 392.3 |
26 | 1953 | Q3 | 2555.7 | 2578.9 | 391.7 |
27 | 1953 | Q4 | 2504.1 | 2539.8 | 386.5 |
28 | 1954 | Q1 | 2510.1 | 2528.0 | 385.9 |
29 | 1954 | Q2 | 2514.5 | 2530.7 | 386.7 |
30 | 1954 | Q3 | 2537.1 | 2559.4 | 391.6 |
Indexing rows based on indices is only useful when you know the specific set of rows that you need, and you can only really get a range of entries. Working with data often involves huge datasets, making it inefficient and sometimes impossible to know exactly what indices to be looking at. On top of that, most data analysis concerns itself with looking for patterns or specific conditions in the data, which is impossible to look for with simple index based sorting.
Thankfully, you can also use square bracket notation to filter out data based on a condition. Suppose we only wanted real GDP and nominal GDP data from the year 2000 onward:
nipa_quarterly_accounts_df[nipa_quarterly_accounts_df["Year"] >= 2000][["Real GDP", "Nominal GDP"]]
| | Real GDP | Nominal GDP |
---|---|---|
212 | 12359.1 | 10031.0 |
213 | 12592.5 | 10278.3 |
214 | 12607.7 | 10357.4 |
215 | 12679.3 | 10472.3 |
216 | 12643.3 | 10508.1 |
217 | 12710.3 | 10638.4 |
218 | 12670.1 | 10639.5 |
219 | 12705.3 | 10701.3 |
220 | 12822.3 | 10834.4 |
221 | 12893.0 | 10934.8 |
222 | 12955.8 | 11037.1 |
223 | 12964.0 | 11103.8 |
224 | 13031.2 | 11230.1 |
225 | 13152.1 | 11370.7 |
226 | 13372.4 | 11625.1 |
227 | 13528.7 | 11816.8 |
228 | 13606.5 | 11988.4 |
229 | 13706.2 | 12181.4 |
230 | 13830.8 | 12367.7 |
231 | 13950.4 | 12562.2 |
232 | 14099.1 | 12813.7 |
233 | 14172.7 | 12974.1 |
234 | 14291.8 | 13205.4 |
235 | 14373.4 | 13381.6 |
236 | 14546.1 | 13648.9 |
237 | 14589.6 | 13799.8 |
238 | 14602.6 | 13908.5 |
239 | 14716.9 | 14066.4 |
240 | 14726.0 | 14233.2 |
241 | 14838.7 | 14422.3 |
... | ... | ... |
251 | 14541.9 | 14566.5 |
252 | 14604.8 | 14681.1 |
253 | 14745.9 | 14888.6 |
254 | 14845.5 | 15057.7 |
255 | 14939.0 | 15230.2 |
256 | 14881.3 | 15238.4 |
257 | 14989.6 | 15460.9 |
258 | 15021.1 | 15587.1 |
259 | 15190.3 | 15785.3 |
260 | 15291.0 | 15973.9 |
261 | 15362.4 | 16121.9 |
262 | 15380.8 | 16227.9 |
263 | 15384.3 | 16297.3 |
264 | 15491.9 | 16475.4 |
265 | 15521.6 | 16541.4 |
266 | 15641.3 | 16749.3 |
267 | 15793.9 | 16999.9 |
268 | 15757.6 | 17031.3 |
269 | 15935.8 | 17320.9 |
270 | 16139.5 | 17622.3 |
271 | 16220.2 | 17735.9 |
272 | 16350.0 | 17874.7 |
273 | 16460.9 | 18093.2 |
274 | 16527.6 | 18227.7 |
275 | 16547.6 | 18287.2 |
276 | 16571.6 | 18325.2 |
277 | 16663.5 | 18538.0 |
278 | 16778.1 | 18729.1 |
279 | 16851.4 | 18905.5 |
280 | 16903.2 | 19057.7 |
69 rows × 2 columns
The nipa_quarterly_accounts_df table is being indexed by the condition nipa_quarterly_accounts_df["Year"] >= 2000, which returns a table containing only the rows that have a "Year" greater than or equal to $2000$. We then index this table with the double bracket notation from the previous section to get only the real GDP and nominal GDP columns.
Suppose now we wanted a table with data from the first quarter only, and where either real GDP was less than 5000 or nominal GDP was greater than 15,000:
nipa_quarterly_accounts_df[(nipa_quarterly_accounts_df["Quarter"] ==
"Q1") & ((nipa_quarterly_accounts_df["Real GDP"] < 5000) |
(nipa_quarterly_accounts_df["Nominal GDP"] > 15000))]
| | Year | Quarter | Real GDI | Real GDP | Nominal GDP |
---|---|---|---|---|---|
0 | 1947 | Q1 | 1912.5 | 1934.5 | 243.1 |
4 | 1948 | Q1 | 1984.4 | 1989.5 | 266.2 |
8 | 1949 | Q1 | 2001.5 | 2007.5 | 275.4 |
12 | 1950 | Q1 | 2060.1 | 2084.6 | 281.2 |
16 | 1951 | Q1 | 2281.0 | 2304.5 | 336.4 |
20 | 1952 | Q1 | 2398.3 | 2423.5 | 360.2 |
24 | 1953 | Q1 | 2554.3 | 2573.4 | 388.5 |
28 | 1954 | Q1 | 2510.1 | 2528.0 | 385.9 |
32 | 1955 | Q1 | 2661.6 | 2683.8 | 413.8 |
36 | 1956 | Q1 | 2775.4 | 2770.0 | 440.5 |
40 | 1957 | Q1 | 2862.0 | 2854.5 | 470.6 |
44 | 1958 | Q1 | 2779.9 | 2772.7 | 468.4 |
48 | 1959 | Q1 | 2976.5 | 2976.6 | 511.1 |
52 | 1960 | Q1 | 3121.9 | 3123.2 | 543.3 |
56 | 1961 | Q1 | 3109.9 | 3102.3 | 545.9 |
60 | 1962 | Q1 | 3328.6 | 3336.8 | 595.2 |
64 | 1963 | Q1 | 3469.1 | 3456.1 | 622.7 |
68 | 1964 | Q1 | 3658.6 | 3672.7 | 671.1 |
72 | 1965 | Q1 | 3885.5 | 3873.5 | 719.2 |
76 | 1966 | Q1 | 4167.8 | 4201.9 | 797.3 |
80 | 1967 | Q1 | 4286.5 | 4324.9 | 846.0 |
84 | 1968 | Q1 | 4465.6 | 4490.6 | 911.1 |
88 | 1969 | Q1 | 4665.4 | 4691.6 | 995.4 |
92 | 1970 | Q1 | 4690.4 | 4707.1 | 1053.5 |
96 | 1971 | Q1 | 4778.0 | 4834.3 | 1137.8 |
256 | 2011 | Q1 | 14924.4 | 14881.3 | 15238.4 |
260 | 2012 | Q1 | 15500.4 | 15291.0 | 15973.9 |
264 | 2013 | Q1 | 15642.7 | 15491.9 | 16475.4 |
268 | 2014 | Q1 | 15912.8 | 15757.6 | 17031.3 |
272 | 2015 | Q1 | 16599.6 | 16350.0 | 17874.7 |
276 | 2016 | Q1 | 16776.1 | 16571.6 | 18325.2 |
280 | 2017 | Q1 | 16992.1 | 16903.2 | 19057.7 |
Many different conditions can be included in a filter, and you can use the & (and) and | (or) operators to connect them together. Make sure to put parentheses around each condition!
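It can help to build each condition separately and look at it before combining. The sketch below uses a tiny made-up table (the name `toy_df` and its numbers are illustrative, not NIPA data) so that the result of each filter is easy to verify by eye:

```python
import pandas as pd

# a tiny illustrative table with made-up numbers
toy_df = pd.DataFrame({"Quarter": ["Q1", "Q2", "Q1", "Q2"],
                       "Value": [10, 20, 30, 40]})

# each condition is itself a column of True/False values, one per row
is_q1 = toy_df["Quarter"] == "Q1"
is_large = toy_df["Value"] > 25
print(is_q1)

print(toy_df[is_q1 & is_large])   # rows where both conditions hold
print(toy_df[is_q1 | is_large])   # rows where at least one condition holds
print(toy_df[~is_q1])             # ~ negates a condition
```

Naming the conditions also sidesteps the parentheses problem: & and | bind more tightly than comparisons such as == and >, which is why unparenthesized conditions written inline give errors or unexpected results.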
Another way to reorganize data to make it more convenient is to sort the data by the values in a specific column. For example, if we wanted to find the highest real GDP since 1947, we could sort the table for real GDP:
nipa_quarterly_accounts_df.sort_values("Real GDP")
| | Year | Quarter | Real GDI | Real GDP | Nominal GDP |
---|---|---|---|---|---|
2 | 1947 | Q3 | 1914.0 | 1930.3 | 250.1 |
1 | 1947 | Q2 | 1910.9 | 1932.3 | 246.3 |
0 | 1947 | Q1 | 1912.5 | 1934.5 | 243.1 |
3 | 1947 | Q4 | 1932.0 | 1960.7 | 260.3 |
4 | 1948 | Q1 | 1984.4 | 1989.5 | 266.2 |
9 | 1949 | Q2 | 1995.9 | 2000.8 | 271.7 |
11 | 1949 | Q4 | 1979.6 | 2004.7 | 271.0 |
8 | 1949 | Q1 | 2001.5 | 2007.5 | 275.4 |
5 | 1948 | Q2 | 2030.2 | 2021.9 | 272.9 |
10 | 1949 | Q3 | 2007.9 | 2022.8 | 273.3 |
6 | 1948 | Q3 | 2031.5 | 2033.2 | 279.5 |
7 | 1948 | Q4 | 2041.6 | 2035.3 | 280.7 |
12 | 1950 | Q1 | 2060.1 | 2084.6 | 281.2 |
13 | 1950 | Q2 | 2144.4 | 2147.6 | 290.7 |
14 | 1950 | Q3 | 2225.9 | 2230.4 | 308.5 |
15 | 1950 | Q4 | 2268.9 | 2273.4 | 320.3 |
16 | 1951 | Q1 | 2281.0 | 2304.5 | 336.4 |
17 | 1951 | Q2 | 2321.3 | 2344.5 | 344.5 |
18 | 1951 | Q3 | 2362.0 | 2392.8 | 351.8 |
19 | 1951 | Q4 | 2382.7 | 2398.1 | 356.6 |
20 | 1952 | Q1 | 2398.3 | 2423.5 | 360.2 |
21 | 1952 | Q2 | 2412.6 | 2428.5 | 361.4 |
22 | 1952 | Q3 | 2435.0 | 2446.1 | 368.1 |
23 | 1952 | Q4 | 2509.5 | 2526.4 | 381.2 |
28 | 1954 | Q1 | 2510.1 | 2528.0 | 385.9 |
29 | 1954 | Q2 | 2514.5 | 2530.7 | 386.7 |
27 | 1953 | Q4 | 2504.1 | 2539.8 | 386.5 |
30 | 1954 | Q3 | 2537.1 | 2559.4 | 391.6 |
24 | 1953 | Q1 | 2554.3 | 2573.4 | 388.5 |
26 | 1953 | Q3 | 2555.7 | 2578.9 | 391.7 |
... | ... | ... | ... | ... | ... |
244 | 2008 | Q1 | 14842.2 | 14889.5 | 14668.4 |
246 | 2008 | Q3 | 14767.0 | 14891.6 | 14843.0 |
242 | 2007 | Q3 | 14822.4 | 14938.5 | 14569.7 |
255 | 2010 | Q4 | 14904.9 | 14939.0 | 15230.2 |
245 | 2008 | Q2 | 14832.4 | 14963.4 | 14813.0 |
257 | 2011 | Q2 | 14996.1 | 14989.6 | 15460.9 |
243 | 2007 | Q4 | 14816.6 | 14991.8 | 14685.3 |
258 | 2011 | Q3 | 15093.1 | 15021.1 | 15587.1 |
259 | 2011 | Q4 | 15217.0 | 15190.3 | 15785.3 |
260 | 2012 | Q1 | 15500.4 | 15291.0 | 15973.9 |
261 | 2012 | Q2 | 15522.8 | 15362.4 | 16121.9 |
262 | 2012 | Q3 | 15517.1 | 15380.8 | 16227.9 |
263 | 2012 | Q4 | 15650.6 | 15384.3 | 16297.3 |
264 | 2013 | Q1 | 15642.7 | 15491.9 | 16475.4 |
265 | 2013 | Q2 | 15719.8 | 15521.6 | 16541.4 |
266 | 2013 | Q3 | 15752.0 | 15641.3 | 16749.3 |
268 | 2014 | Q1 | 15912.8 | 15757.6 | 17031.3 |
267 | 2013 | Q4 | 15851.3 | 15793.9 | 16999.9 |
269 | 2014 | Q2 | 16136.1 | 15935.8 | 17320.9 |
270 | 2014 | Q3 | 16327.9 | 16139.5 | 17622.3 |
271 | 2014 | Q4 | 16520.8 | 16220.2 | 17735.9 |
272 | 2015 | Q1 | 16599.6 | 16350.0 | 17874.7 |
273 | 2015 | Q2 | 16700.6 | 16460.9 | 18093.2 |
274 | 2015 | Q3 | 16726.7 | 16527.6 | 18227.7 |
275 | 2015 | Q4 | 16789.8 | 16547.6 | 18287.2 |
276 | 2016 | Q1 | 16776.1 | 16571.6 | 18325.2 |
277 | 2016 | Q2 | 16783.0 | 16663.5 | 18538.0 |
278 | 2016 | Q3 | 16953.0 | 16778.1 | 18729.1 |
279 | 2016 | Q4 | 16882.1 | 16851.4 | 18905.5 |
280 | 2017 | Q1 | 16992.1 | 16903.2 | 19057.7 |
281 rows × 5 columns
But wait! The table is sorted in increasing order. This is because `sort_values` sorts in ascending order by default. To sort from highest to lowest instead, pass the optional parameter `ascending=False`:
nipa_quarterly_accounts_df.sort_values("Real GDP", ascending=False)
 | Year | Quarter | Real GDI | Real GDP | Nominal GDP |
---|---|---|---|---|---|
280 | 2017 | Q1 | 16992.1 | 16903.2 | 19057.7 |
279 | 2016 | Q4 | 16882.1 | 16851.4 | 18905.5 |
278 | 2016 | Q3 | 16953.0 | 16778.1 | 18729.1 |
277 | 2016 | Q2 | 16783.0 | 16663.5 | 18538.0 |
276 | 2016 | Q1 | 16776.1 | 16571.6 | 18325.2 |
275 | 2015 | Q4 | 16789.8 | 16547.6 | 18287.2 |
274 | 2015 | Q3 | 16726.7 | 16527.6 | 18227.7 |
273 | 2015 | Q2 | 16700.6 | 16460.9 | 18093.2 |
272 | 2015 | Q1 | 16599.6 | 16350.0 | 17874.7 |
271 | 2014 | Q4 | 16520.8 | 16220.2 | 17735.9 |
270 | 2014 | Q3 | 16327.9 | 16139.5 | 17622.3 |
269 | 2014 | Q2 | 16136.1 | 15935.8 | 17320.9 |
267 | 2013 | Q4 | 15851.3 | 15793.9 | 16999.9 |
268 | 2014 | Q1 | 15912.8 | 15757.6 | 17031.3 |
266 | 2013 | Q3 | 15752.0 | 15641.3 | 16749.3 |
265 | 2013 | Q2 | 15719.8 | 15521.6 | 16541.4 |
264 | 2013 | Q1 | 15642.7 | 15491.9 | 16475.4 |
263 | 2012 | Q4 | 15650.6 | 15384.3 | 16297.3 |
262 | 2012 | Q3 | 15517.1 | 15380.8 | 16227.9 |
261 | 2012 | Q2 | 15522.8 | 15362.4 | 16121.9 |
260 | 2012 | Q1 | 15500.4 | 15291.0 | 15973.9 |
259 | 2011 | Q4 | 15217.0 | 15190.3 | 15785.3 |
258 | 2011 | Q3 | 15093.1 | 15021.1 | 15587.1 |
243 | 2007 | Q4 | 14816.6 | 14991.8 | 14685.3 |
257 | 2011 | Q2 | 14996.1 | 14989.6 | 15460.9 |
245 | 2008 | Q2 | 14832.4 | 14963.4 | 14813.0 |
255 | 2010 | Q4 | 14904.9 | 14939.0 | 15230.2 |
242 | 2007 | Q3 | 14822.4 | 14938.5 | 14569.7 |
246 | 2008 | Q3 | 14767.0 | 14891.6 | 14843.0 |
244 | 2008 | Q1 | 14842.2 | 14889.5 | 14668.4 |
... | ... | ... | ... | ... | ... |
26 | 1953 | Q3 | 2555.7 | 2578.9 | 391.7 |
24 | 1953 | Q1 | 2554.3 | 2573.4 | 388.5 |
30 | 1954 | Q3 | 2537.1 | 2559.4 | 391.6 |
27 | 1953 | Q4 | 2504.1 | 2539.8 | 386.5 |
29 | 1954 | Q2 | 2514.5 | 2530.7 | 386.7 |
28 | 1954 | Q1 | 2510.1 | 2528.0 | 385.9 |
23 | 1952 | Q4 | 2509.5 | 2526.4 | 381.2 |
22 | 1952 | Q3 | 2435.0 | 2446.1 | 368.1 |
21 | 1952 | Q2 | 2412.6 | 2428.5 | 361.4 |
20 | 1952 | Q1 | 2398.3 | 2423.5 | 360.2 |
19 | 1951 | Q4 | 2382.7 | 2398.1 | 356.6 |
18 | 1951 | Q3 | 2362.0 | 2392.8 | 351.8 |
17 | 1951 | Q2 | 2321.3 | 2344.5 | 344.5 |
16 | 1951 | Q1 | 2281.0 | 2304.5 | 336.4 |
15 | 1950 | Q4 | 2268.9 | 2273.4 | 320.3 |
14 | 1950 | Q3 | 2225.9 | 2230.4 | 308.5 |
13 | 1950 | Q2 | 2144.4 | 2147.6 | 290.7 |
12 | 1950 | Q1 | 2060.1 | 2084.6 | 281.2 |
7 | 1948 | Q4 | 2041.6 | 2035.3 | 280.7 |
6 | 1948 | Q3 | 2031.5 | 2033.2 | 279.5 |
10 | 1949 | Q3 | 2007.9 | 2022.8 | 273.3 |
5 | 1948 | Q2 | 2030.2 | 2021.9 | 272.9 |
8 | 1949 | Q1 | 2001.5 | 2007.5 | 275.4 |
11 | 1949 | Q4 | 1979.6 | 2004.7 | 271.0 |
9 | 1949 | Q2 | 1995.9 | 2000.8 | 271.7 |
4 | 1948 | Q1 | 1984.4 | 1989.5 | 266.2 |
3 | 1947 | Q4 | 1932.0 | 1960.7 | 260.3 |
0 | 1947 | Q1 | 1912.5 | 1934.5 | 243.1 |
1 | 1947 | Q2 | 1910.9 | 1932.3 | 246.3 |
2 | 1947 | Q3 | 1914.0 | 1930.3 | 250.1 |
281 rows × 5 columns
Now we can clearly see that the highest real GDP in the table was attained in the first quarter of 2017, with a value of 16903.2.
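If all you want is the single highest row, you do not have to scan the whole sorted table. Here is a minimal sketch of three ways to pull it out. The name `sample_df` is made up for this illustration, and it holds only four rows copied from the table above, not the full dataset.

```python
import pandas as pd

# Stand-in data: four rows copied from the table above (not the full dataset).
sample_df = pd.DataFrame({
    "Year": [1947, 2007, 2016, 2017],
    "Quarter": ["Q3", "Q4", "Q4", "Q1"],
    "Real GDP": [1930.3, 14991.8, 16851.4, 16903.2],
})

# Option 1: sort in descending order and keep only the first row.
top_by_sort = sample_df.sort_values("Real GDP", ascending=False).head(1)

# Option 2: nlargest returns the n rows with the largest values in a column.
top_by_nlargest = sample_df.nlargest(1, "Real GDP")

# Option 3: idxmax gives the index label of the maximum; loc fetches that row.
top_row = sample_df.loc[sample_df["Real GDP"].idxmax()]

print(top_row["Year"], top_row["Quarter"], top_row["Real GDP"])
```

All three should pick out the same row. Sorting is handy when you also want to see the runners-up, while `idxmax` is the most direct when you only need the top entry.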
Here are a few useful functions when dealing with numeric data columns.
To find the minimum value in a column, call `min()` on a column of the table.
nipa_quarterly_accounts_df["Real GDP"].min()
To find the maximum value, call `max()`.
nipa_quarterly_accounts_df["Nominal GDP"].max()
And to find the average value of a column, use `mean()`.
nipa_quarterly_accounts_df["Real GDI"].mean()
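To see what these calls return, here is a small sketch you can run on its own. The name `sample_df` is made up for this illustration, and it holds only the four 1947 rows from the table above, so the numbers will differ from what the full table gives. The last line uses `agg`, which is one way (not the only one) to compute several summaries at once.

```python
import pandas as pd

# Stand-in data: the four 1947 quarters from the table above (not the full dataset).
sample_df = pd.DataFrame({
    "Real GDI": [1912.5, 1910.9, 1914.0, 1932.0],
    "Real GDP": [1934.5, 1932.3, 1930.3, 1960.7],
    "Nominal GDP": [243.1, 246.3, 250.1, 260.3],
})

# Each call collapses one column to a single number.
lowest_real_gdp = sample_df["Real GDP"].min()
highest_nominal_gdp = sample_df["Nominal GDP"].max()
average_real_gdi = sample_df["Real GDI"].mean()

# agg computes several summaries for every column and returns a small table,
# with one row per summary and one column per original column.
summary = sample_df.agg(["min", "max", "mean"])
print(summary)
```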
Now that you can read in data and manipulate it, you are ready to learn how to visualize data. To begin, run the cells below to import the packages we will be using.
%matplotlib inline
import matplotlib.pyplot as plt
We will be using US unemployment data from FRED to show what we can do with data. The statements below read the CSV file into a pandas DataFrame and display its first five rows.
import pandas as pd
unemployment_data = pd.read_csv("https://delong.typepad.com/detailed_unemployment.csv")
unemployment_data.head()
 | date | total_unemployed | more_than_15_weeks | not_in_labor_searched_for_work | multi_jobs | leavers | losers | housing_price_index |
---|---|---|---|---|---|---|---|---|
0 | 11/1/10 | 16.9 | 8696 | 2531 | 6708 | 5.7 | 63.0 | 186.07 |
1 | 12/1/10 | 16.6 | 8549 | 2609 | 6899 | 6.4 | 61.2 | 183.27 |
2 | 1/1/11 | 16.2 | 8393 | 2800 | 6816 | 6.5 | 60.1 | 181.35 |
3 | 2/1/11 | 16.0 | 8175 | 2730 | 6741 | 6.4 | 60.2 | 179.66 |
4 | 3/1/11 | 15.9 | 8166 | 2434 | 6735 | 6.4 | 60.3 | 178.84 |
One of the advantages of pandas is its built-in plotting methods. We can simply call `.plot()` on a dataframe to plot columns against one another. All that we have to do is specify which column to plot on which axis. Note that `read_csv` does not convert the `date` column to dates unless asked to (for example, with its `parse_dates` argument), so pandas plots the rows in the order they appear in the file, which here is chronological.
Note: `total_unemployed` is a percentage, not a fraction. Divide it by 100 to get the unemployment rate as a fraction.
unemployment_data.plot(x='date', y='total_unemployed')
[Figure: line plot of total_unemployed against date]
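If you want pandas to treat the `date` column as real dates rather than text, one option is to convert it yourself before plotting. The sketch below also applies the divide-by-100 step from the note above. It assumes the dates are written month/day/two-digit-year, as the first five rows suggest. The names `sample` and `unemployment_fraction` are made up for this illustration, and the data are just those five rows.

```python
import pandas as pd

# Stand-in data: the first five rows of the unemployment table shown above.
sample = pd.DataFrame({
    "date": ["11/1/10", "12/1/10", "1/1/11", "2/1/11", "3/1/11"],
    "total_unemployed": [16.9, 16.6, 16.2, 16.0, 15.9],
})

# Assumption: dates are month/day/two-digit-year, so "11/1/10" is 1 November 2010.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%y")

# Convert the percentage into a fraction (16.9 percent becomes 0.169).
sample["unemployment_fraction"] = sample["total_unemployed"] / 100

# With real dates, sorting by the column is guaranteed to be chronological.
sample = sample.sort_values("date")
print(sample.dtypes)
```

After this, calling `sample.plot(x='date', y='unemployment_fraction')` in a notebook should give a line plot with a proper time axis.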
The base package for most plotting in Python is `matplotlib`. Below we will look at how to plot with it. First we will extract the columns that we are interested in, then plot them in a scatter plot. Note that `plt` is the conventional alias for `matplotlib.pyplot`.
total_unemployed = unemployment_data['total_unemployed']
not_labor = unemployment_data['not_in_labor_searched_for_work']
# Plot the data by passing the x-axis values first, then the y-axis values
plt.scatter(total_unemployed, not_labor)
# We can then go on to customize the plot with axis labels
plt.xlabel("Percent Unemployed")
plt.ylabel("Total Not In Labor, Searched for Work")
[Figure: scatter plot of Percent Unemployed (x-axis) against Total Not In Labor, Searched for Work (y-axis)]
Though matplotlib is sometimes considered an "ugly" plotting tool, it is powerful. It is highly customizable and is the foundation for most Python plotting libraries. Check out the documentation to get a sense of all of the things you can do with it, which extend far beyond scatter and line plots. An arguably more attractive package is seaborn, which we will go over in future notebooks.
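To give a small taste of that customizability, here is a sketch of the same kind of scatter plot drawn with matplotlib's object-oriented interface, where you work with a figure and an axes object directly. The data are just the first five rows of the unemployment table. The colour, marker, title, and figure size are arbitrary choices for illustration. The `matplotlib.use("Agg")` line is an assumption that you are running this as a plain script with no display; leave it out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: plain script with no display; omit in a notebook
import matplotlib.pyplot as plt

# Stand-in data: the first five rows of the unemployment table shown earlier.
total_unemployed = [16.9, 16.6, 16.2, 16.0, 15.9]
not_labor = [2531, 2609, 2800, 2730, 2434]

# plt.subplots() returns a figure (the whole canvas) and an axes (one plot on it).
fig, ax = plt.subplots(figsize=(6, 4))

# Customize the points themselves: colour and marker shape.
points = ax.scatter(total_unemployed, not_labor, color="darkred", marker="x")

# Customize the labelling and add a background grid.
ax.set_xlabel("Percent Unemployed")
ax.set_ylabel("Total Not In Labor, Searched for Work")
ax.set_title("First five rows of the data")
ax.grid(True)
```

Working through `ax` rather than `plt` does the same job here, but it tends to scale better once a figure contains more than one plot.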
Understanding Check 3.1: Try plotting the total percent of people unemployed vs those unemployed for more than 15 weeks.
total_unemployed = unemployment_data['total_unemployed']
unemp_15_weeks = unemployment_data['more_than_15_weeks']
plt.scatter(total_unemployed, unemp_15_weeks)
plt.xlabel('total percent of people unemployed')
plt.ylabel('people unemployed for more than 15 weeks')
# note: plt.show() is the equivalent of print, but for graphs
plt.show()
Some materials in this notebook were taken from Data 8, CS 61A, and DS Modules lessons.