Dictionaries

The dictionary is a very useful data structure in Python. The easiest way to conceptualize a dictionary is that it's like a list, except you don't look up values in a dictionary by their index in a sequence---you look them up using a "key," or a unique identifier for that value.

We're going to focus here just on learning how to get data out of dictionaries, not how to build new dictionaries from existing data. We're also going to omit some of the nitty-gritty details about how dictionaries work internally. You'll learn a lot of those details in later courses, but for now it means that some of what I'm going to tell you will seem weird and magical. Be prepared!

Why dictionaries?

For our purposes, the benefit of having data that can be parsed into dictionaries, as opposed to lists, is that dictionary keys tend to be mnemonic. That is, a dictionary key will usually tell you something about what its value is. (Contrast this with parsing, say, CSV data, where we have to count fields in the header row and translate that count into the index that we want.)

Lists and dictionaries work together and are used extensively to represent all different kinds of data. Often, when we get data from a remote source, or when we choose how to represent data internally, we'll use both in tandem. The most common form this will take is representing a table, or a database, as a list of records that are themselves represented as dictionaries (mapping the name of the column to the value for that column). We'll see an example of this when we access the New York Times API, below.
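Here's a rough sketch of that "list of records" shape, using a couple of made-up rows. (The curly-bracket syntax is explained in the next section, so don't worry if it looks unfamiliar. The column names name and state are my own choices for illustration. I've put parentheses around what we print so the code should run in both Python 2 and Python 3.)

```python
# A tiny "table": each row is a dictionary mapping column name -> value.
# The column names ('name', 'state') are illustrative assumptions.
rows = [
    {'name': 'Barack Obama', 'state': 'Hawaii'},
    {'name': 'Bill Clinton', 'state': 'Arkansas'},
]

# First pick a row by its position in the list, then a column by its key.
second_state = rows[1]['state']
print(second_state)
print(len(rows))
```

The list gives us the order of the rows; the dictionaries give each field a readable name.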

What dictionaries look like

Dictionaries are written with curly brackets, surrounding a series of comma-separated pairs of keys and values, with a colon between each key and its value. Here's a very simple dictionary, with one key, Barack Obama, associated with a value, Hawaii:

In [137]:
{'Barack Obama': 'Hawaii'}
Out[137]:
{'Barack Obama': 'Hawaii'}

Here's another dictionary, with two more entries:

In [138]:
{'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}
Out[138]:
{'Barack Obama': 'Hawaii',
 'Bill Clinton': 'Arkansas',
 'George W. Bush': 'Texas'}

As you can see, we're building a simple dictionary that associates the names of presidents to the home states of those presidents. (This is my version of JOURNALISM.)

The association of a key with a value is sometimes called a mapping. (In fact, in other programming languages like Java, the dictionary data structure is called a "Map.") So, in the above dictionary for example, we might say that the key Bill Clinton maps to the value Arkansas.

A dictionary is just like any other Python value. It has a type:

In [139]:
print type({'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'})
<type 'dict'>

And you can assign a dictionary to a variable:

In [140]:
president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}
print type(president_states)
<type 'dict'>

Keys and values in dictionaries can be of any data type, not just strings. Here's a dictionary, for example, that maps integers to lists of floating point numbers:

In [141]:
print {17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]}
{17: [1.6, 2.45], 42: [11.6, 19.4], 101: [0.123, 4.89]}

HEAD-SPINNING OPTIONAL ASIDE: Actually, "any type" above is a simplification: values can be of any type, but keys must be hashable---see the Python glossary for more information. In practice, this limitation means you can't use lists (or dictionaries themselves) as keys in dictionaries. There are ways of getting around this, though!
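If you're curious, here's a minimal sketch of one such workaround: using a tuple (a list-like value written with parentheses) where you might have wanted a list. The data is made up for illustration, and this is only one of several possible approaches.

```python
# A tuple is hashable (as long as its contents are), so it can be a key.
# The (last name, first name) layout here is just an illustrative choice.
home_states = {('Obama', 'Barack'): 'Hawaii', ('Clinton', 'Bill'): 'Arkansas'}
print(home_states[('Clinton', 'Bill')])

# A list is not hashable, so trying to use one as a key raises a TypeError.
try:
    bad = {['Obama', 'Barack']: 'Hawaii'}
    list_key_worked = True
except TypeError:
    list_key_worked = False
print(list_key_worked)
```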

A dictionary can also be empty, containing no key/value pairs at all:

In [142]:
print {}
{}

Getting values out of dictionaries

The primary operation that we'll perform on dictionaries is writing an expression that evaluates to the value for a particular key. We do that with the same syntax we used to get a value at a particular index from a list. Except with dictionaries, instead of using a number, we use one of the keys that we had specified for the value when making the dictionary. For example, if we wanted to know what Bill Clinton's home state was, or, more precisely, what the value for the key Bill Clinton is, we would write this expression:

In [143]:
print {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}['Bill Clinton']
Arkansas

... or, using a variable that has previously been assigned to a dictionary:

In [144]:
print president_states['George W. Bush']
Texas

If we put a key in those brackets that does not exist in the dictionary, we get an error similar to the one we get when trying to access an element beyond the end of a list:

In [145]:
print president_states['Benjamin Franklin']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-145-c1f7c5039d80> in <module>()
----> 1 print president_states['Benjamin Franklin']

KeyError: 'Benjamin Franklin'
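If you'd rather not trigger a KeyError, there are a couple of standard ways to check first. This is a small sketch rather than the only way to do it: the in operator asks whether a key is present, and the dictionary's .get() method returns a fallback value when the key is missing. (The fallback string 'unknown' is my own choice.)

```python
president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}

# `in` tests whether a key is present, without raising an error.
has_franklin = 'Benjamin Franklin' in president_states
print(has_franklin)

# .get() returns the value for the key, or the second argument if the key is absent.
franklin_state = president_states.get('Benjamin Franklin', 'unknown')
clinton_state = president_states.get('Bill Clinton', 'unknown')
print(franklin_state)
print(clinton_state)
```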

As you might suspect, the thing you put inside the brackets doesn't have to be a string; it can be any Python expression, as long as it evaluates to something that is a key in the dictionary:

In [146]:
president = 'Barack Obama'
print president_states[president]
Hawaii

You can get a list of all the keys in a dictionary using the dictionary's .keys() method:

In [147]:
print {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}.keys()
['Bill Clinton', 'George W. Bush', 'Barack Obama']

And a list of all the values with the .values() method:

In [148]:
print {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}.values()
['Arkansas', 'Texas', 'Hawaii']

If you want a list of all key/value pairs, you can call the .items() method:

In [149]:
print {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}.items()
[('Bill Clinton', 'Arkansas'), ('George W. Bush', 'Texas'), ('Barack Obama', 'Hawaii')]

(The weird list-like things here that use parentheses instead of brackets are called tuples---we'll discuss those at a later date.)
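Because the order of keys isn't something you can rely on, a common pattern is to sort the keys before looping over them. Here's a sketch of that idea, assuming you've already seen for loops and the built-in sorted() function; the sentence format is just for illustration.

```python
president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}

sentences = []
# sorted() gives us the keys in alphabetical order, whatever order the dictionary uses.
for name in sorted(president_states.keys()):
    sentence = name + ' is from ' + president_states[name]
    sentences.append(sentence)
    print(sentence)
```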

Dictionaries can contain lists and other dictionaries

As mentioned above, a dictionary can itself contain lists and dictionaries as values (and those lists and dictionaries can themselves contain other lists and dictionaries, etc. etc. until your computer runs out of memory). The syntax for getting a value out of a list inside of a dictionary looks very similar to the syntax for getting a value out of a list of lists:

In [150]:
print {'cheeses': ['cheddar', 'edam', 'emmental']}['cheeses'][1]
edam

To explain in a bit more detail, observe here what the following expression evaluates to:

In [151]:
print {'cheeses': ['cheddar', 'edam', 'emmental']}['cheeses']
['cheddar', 'edam', 'emmental']

It follows that putting a square bracket index at the end of that expression would evaluate to a single item inside the list.

BONUS EXERCISE: Devise a dictionary that has within it another dictionary for a value. Write the expression to get the value for a key inside of the inner dictionary.

Adding key/value pairs to a dictionary

Once you've assigned a dictionary to a variable, like so:

In [152]:
president_states = {'Barack Obama': 'Hawaii', 'George W. Bush': 'Texas', 'Bill Clinton': 'Arkansas'}

... you can add another key/value pair to the dictionary by assigning a value to a new index, like so:

In [153]:
president_states['Ronald Reagan'] = 'California'

Printing the dictionary shows that there's a new key/value pair in there:

In [154]:
print president_states
{'Bill Clinton': 'Arkansas', 'George W. Bush': 'Texas', 'Barack Obama': 'Hawaii', 'Ronald Reagan': 'California'}

Dictionaries are unordered

You may have noticed something in the previous examples, which is that sometimes the order in which we wrote our key/value pairs in our dictionaries is NOT the same order that those key/value pairs come out in when we evaluate the dictionary as an expression or use the .keys() and .values() methods. That's because dictionaries in Python are unordered. A dictionary consists of a number of key/value pairs, but that's it---Python has no concept of which pairs come "before" or "after" the other pairs in the dictionary. (This describes the Python 2 used in these examples; Python 3.7 and later remember the order in which keys were inserted.)

Here's a more forceful demonstration:

In [155]:
print {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10}
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9, 'h': 8, 'j': 10}

Chances are that when you run the code in the above cell, you'll get back a different ordering than the ordering you'd originally written. If you restart the IPython process (or your server), you might get an ordering back that is different from that.

A better way of phrasing the idea that dictionaries are unordered is to say instead that "two dictionaries are considered the same if they have the same keys mapped to the same values."
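You can check that claim directly with the == operator. In this quick sketch (with made-up values), the two dictionaries are written in different orders but compare as equal, whereas two lists with the same items in a different order do not.

```python
first = {'a': 1, 'b': 2, 'c': 3}
second = {'c': 3, 'a': 1, 'b': 2}

# Same keys mapped to the same values, so the dictionaries are equal.
dicts_equal = (first == second)
print(dicts_equal)

# Lists, by contrast, care about order.
lists_equal = ([1, 2, 3] == [3, 1, 2])
print(lists_equal)
```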

Dictionary keys are unique

Another important fact about dictionaries is that you can't put the same key into one dictionary twice. If you try to write out a dictionary that has the same key used more than once, Python will silently keep only the last of those key/value pairs and discard the rest. For example:

In [156]:
print {'a': 1, 'a': 2, 'a': 3}
{'a': 3}

Similarly, if we attempt to set the value for a key that already exists in the dictionary (using =), we won't add a second key/value pair for that key---we'll just overwrite the existing value:

In [157]:
test_dict = {'a': 1, 'b': 2}
print test_dict['a']
test_dict['a'] = 100
print test_dict['a']
1
100

In the case where a key needs to map to multiple values, we might instead see a data structure in which the key maps to another kind of data structure that itself can contain multiple values, like a list:

In [158]:
print {'a': [1, 2, 3]}
{'a': [1, 2, 3]}
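With that arrangement, "adding another value for the key" really means adding an item to the list that the key maps to. A small sketch, assuming you remember the list .append() method (the variable name is my own):

```python
multi = {'a': [1, 2, 3]}

# multi['a'] evaluates to the list, so we can append to it directly.
multi['a'].append(4)
print(multi['a'])

# The dictionary still has exactly one key; only the list got longer.
print(len(multi))
```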

Getting data from the web

At this point, we have enough programming scaffolding in place to start talking about how to access Web APIs. A web API is some collection of data, made available on the web, provided in a format easy for computers to parse. But in order to write programs to access web APIs, I need to talk about a few other things first.

URLs

A URL ("uniform resource locator") uniquely identifies a document on the web, and provides instructions for how to access it. It's the thing you type into your web browser's address bar. It's what you cut-and-paste when you want to e-mail an article to a friend. Most of what we do on the web---whether we're using a web browser or writing a program that accesses the web---boils down to manipulating URLs.

So it's important for us to understand the structure of URLs, so we can take them apart and put them back together (both in our heads and programmatically). URLs have a conventional structure that is specified in Internet standards documentation, and many of the web APIs we'll be accessing assume knowledge of this structure. So let's take the following URL:

http://www.example.com/foo/bar?arg1=baz&arg2=quux

... and break it down into parts, so we have a common vocabulary.

Part                   Name
http                   scheme
www.example.com        host
/foo/bar               path
?arg1=baz&arg2=quux    query string

All of these parts are required, except for the query string, which is optional. Explanations:

  • The scheme determines what protocol will be used to access this resource. For our purposes, this will almost always be http (HyperText Transfer Protocol) or https (HTTP, but over an encrypted connection).
  • The host specifies which server on the Internet we're going to talk to in order to retrieve the document we want.
  • The path names a resource on the server, often using slashes (/) to represent hierarchical relationships between resources. (Sometimes this corresponds to actual files on the server, but just as often it does not.)
  • The query string is a means to tell the server how we want the document delivered. (More examples of this soon.)
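Python's standard library can take a URL apart into these same pieces. Here's a sketch using urlparse; note that the library calls the host "netloc" and gives you the query string without its leading question mark. The try/except around the import is only there because the function lives in different modules in Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

parts = urlparse('http://www.example.com/foo/bar?arg1=baz&arg2=quux')
print(parts.scheme)   # http
print(parts.netloc)   # www.example.com
print(parts.path)     # /foo/bar
print(parts.query)    # arg1=baz&arg2=quux

# parse_qs turns the query string into a dictionary; each value is a list,
# because the same argument name may appear more than once in a URL.
query = parse_qs(parts.query)
print(query['arg1'])
```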

Most of the work you'll do in learning how to use a web API is learning how to construct and manipulate URLs. A quick glance through the documentation for, e.g., the New York Times API reveals that the bulk of the documentation is just a big list of URLs, with information on how to adjust those URLs to get the information you want.
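Going in the other direction, you can build a URL from its parts instead of typing the whole thing by hand. This is one possible sketch using the standard library's urlencode function, which also takes care of characters (like spaces) that can't appear as-is in a query string. The base URL and argument names are the example ones from above, not a real API.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

base = 'http://www.example.com/foo/bar'
# A list of (name, value) pairs keeps the arguments in the order we wrote them.
params = [('arg1', 'baz'), ('arg2', 'quux')]
url = base + '?' + urlencode(params)
print(url)

# Spaces in a value are encoded for us (here, as a plus sign).
encoded = urlencode([('query', 'monty python')])
print(encoded)
```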

HTML, JSON and web APIs

The most common format for documents on the web is HTML (HyperText Markup Language). Web browsers like HTML because they know how to render it as human-readable documents---in fact, that's what web browsers are for: turning HTML from the web into visually splendid and exciting multimedia affairs.

HTML was specifically designed to be a tool for creating web pages, and it excels at that, but it's not so great for describing structured data. Another popular format---and the format we'll be learning how to work with this week---is JSON (JavaScript Object Notation). Like HTML, JSON is a format for exchanging structured data between two computer programs. Unlike HTML, JSON is primarily intended to communicate content, rather than layout.

Roughly speaking, whenever a web site exposes a URL for human readers, the document at that URL is in HTML. Whenever a web site exposes a URL for programmatic use, the document at that URL is in JSON. (There are other formats commonly used for computer-readable documents, like XML. But let's keep it simple for now.) As an example, Facebook has a human-readable version of a fan page for the Python programming language, available at the following URL:

https://www.facebook.com/pythonlang

But Facebook also has a version of this fan page designed to be easily readable by computers. This is the URL, and it returns a document in JSON format:

https://graph.facebook.com/pythonlang

Every web site makes available a number of URLs that return human-readable documents; many web sites (like Twitter) also make available URLs that return documents intended to be read by computer programs. Often---as is the case with Facebook, or with sites like Metafilter that make their content available through RSS feeds---these are just two views into the same data.

You can think of a web API as the set of URLs, and rules for manipulating URLs, that a web site makes available and that are also intended to be read by computer programs. (API stands for "application programming interface"; a "web API" is an interface that enables you to program applications that use the web site's data.)

Fetching web documents with urllib

Python has a built-in library called urllib, which allows you to make requests to web servers in order to retrieve web documents. You give it a URL, and it gives back a string that contains the content of the document located at that URL. We used this earlier to fetch CSV files.

Here's an example of how to use urllib. Our task here is to fetch the document at the URL given above, for the computer-readable version of Facebook's Python fan page.

In [159]:
import urllib

doc_str = urllib.urlopen("https://graph.facebook.com/pythonlang").read()
print doc_str
{"id":"7899581788","about":"programming, the way Guido indented it","can_post":false,"category":"Product\/service","checkins":0,"company_overview":"Python is a dynamic object-oriented programming language that can be used for many kinds of software development. It offers strong support for integration with other languages and tools, comes with extensive standard libraries, and can be learned in a few days. Many Python programmers report substantial productivity gains and feel the language encourages the development of higher quality, more maintainable code.","cover":{"cover_id":"10150985230661789","offset_x":0,"offset_y":0,"source":"https:\/\/scontent-b.xx.fbcdn.net\/hphotos-xfa1\/t1.0-9\/s720x720\/306109_10150985230661789_1338503376_n.jpg"},"founded":"February 1991 by Guido van Rossum","has_added_app":false,"is_community_page":false,"is_published":true,"likes":107247,"link":"https:\/\/www.facebook.com\/pythonlang","name":"Python","talking_about_count":307,"username":"pythonlang","website":"www.python.org","were_here_count":0}

Oh hey wow that's pretty rad! Don't worry for now about decomposing the urllib.urlopen() line---the important part is merely that you can put a string containing a URL in that first pair of parentheses and the whole expression will evaluate to a string containing the contents of the document fetched from that URL.
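A caveat if you're running a newer Python: urllib.urlopen() exists in Python 2, but in Python 3 the same function lives at urllib.request.urlopen(). Here's a sketch of a small helper that works with either. The name fetch_text is my own invention, and the helper assumes the document is encoded as UTF-8, which is common for JSON but not guaranteed.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2

def fetch_text(url):
    """Fetch the document at url and return its contents as text.

    Hypothetical helper; assumes the response body is UTF-8.
    """
    response = urlopen(url)
    try:
        return response.read().decode('utf-8')
    finally:
        # Always release the network connection, even if reading fails.
        response.close()
```

You would call it like fetch_text("http://www.example.com/"), with whatever URL you want to retrieve between the quotes.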

Interpreting JSON

So we assigned the result of fetching that document from the Facebook API to a variable called doc_str. When we printed it out, it looked like... a dictionary? It has the structure of a dictionary, at least: looks like we have keys associated with values, comma-separated with colons in between. So how would we get the value for the about key from this dictionary? It can't be as easy as it looks, right?!

In [160]:
print doc_str["about"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-160-d66f848e69c4> in <module>()
----> 1 print doc_str["about"]

TypeError: string indices must be integers, not str

Nope. It looks like a dictionary, but it's actually...

In [161]:
print type(doc_str)
<type 'str'>

... a string! In fact, almost all of the data that gets returned from web APIs will arrive in the form of a string. Even though the string here looks like a Python dictionary, it's actually a string in JSON format. So what we need is some kind of Python library that will allow us to write an expression that translates a JSON string into an actual Python data structure. Once we have that data structure, we can start writing other expressions to do with the data what we please.

Fortunately, just such a library exists! Here's how to use it.

In [162]:
import json

doc_dict = json.loads(doc_str)

The json.loads function takes some expression that evaluates to a string as its parameter (between the parentheses), and evaluates to whatever Python data structure is represented in the string. Snazzy. Let's check the type now:
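To see the translation on a smaller scale, here's a sketch that parses a short JSON string I trimmed down by hand from the document above. JSON objects become dictionaries, JSON numbers become Python numbers, and JSON's lowercase true and false become Python's True and False.

```python
import json

# A hand-shortened version of the document above (not a live response).
sample_str = '{"name": "Python", "likes": 107247, "can_post": false, "cover": {"offset_x": 0}}'
sample = json.loads(sample_str)

print(sample['name'])
print(sample['likes'] + 1)                # a real number, so arithmetic works
print(sample['can_post'])                 # JSON false became Python False
print(isinstance(sample['cover'], dict))  # a nested JSON object is a nested dictionary
```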

In [163]:
print type(doc_dict)
<type 'dict'>

A dictionary! Okay. Let's get the about value:

In [164]:
print doc_dict["about"]
programming, the way Guido indented it

Got it. In fact, we can do all of the things with this dictionary that we can do with any Python dictionary:

In [165]:
print doc_dict.keys()
[u'category', u'username', u'about', u'talking_about_count', u'name', u'company_overview', u'has_added_app', u'can_post', u'cover', u'website', u'founded', u'link', u'likes', u'were_here_count', u'is_community_page', u'checkins', u'id', u'is_published']

NOTE: Ignore the u in front of all of those strings---that's just Python's way of telling us that those are technically Unicode strings. For our purposes, Unicode strings behave exactly the same as any other kind of string.

Fun with the New York Times API

Many web sites and organizations offer web APIs. We're going to go over how one API in particular works, or at least a subset of a particular web API---the New York Times API. The idea is that by introducing you to this one API, you'll learn the tools necessary to sign up for, query, and interpret APIs from other providers as well.

Signing up for an API key

Before you can use the New York Times API, you need to sign up for an API key. Do so by going to the NY Times API application registration site and following the instructions. (You may need to sign up for a New York Times account. If you signed up for the New York Times ages ago, you may need to ensure that the e-mail address on record for your account still points to an account you still have access to.)

You'll see a form that looks like this:

[Image: nytimesapi01, the API application registration form]

The name and website of your "application" aren't important---just fill in whatever you want. Check at least the "Article Search API" and the "Campaign Finance API" boxes in the "which Web APIs this application will use" section, and check the "I agree to the terms of service" box. Then click "Register application."

You should momentarily receive several e-mails, each with a subject line like "Your NYT Campaign Finance API key," one for each API you requested access to. Each e-mail will contain a string of letters and numbers that looks like this:

098f6bcd4621d373cade4e832627b4f6:0:12345678

That's your "key" for that API. Whenever you make a request to that API, you'll need to include your key in the request. The exact methodology for including the key will be explained below. (Note: the key above is just something I made up; it's not a valid key; don't try using it in actual requests.)

Using the API tool

We can start exploring how the New York Times API works, and what kind of data it provides, and what that data looks like using the API tool. When you click on that link, you'll see a screen that looks something like this:

[Image: nytimesapi02, the API tool screen]

Here are the important moving parts:

  • The drop-down labelled "APIs" selects which data source the API tool will be making requests to. You can click on the documentation link just below the drop-down for more information about what the rest of the fields in the left-hand column mean.
  • The "Requests" drop-down selects among various different parts of the selected data source. (There's only one, Query, for the Article Search API, but other sources have more options.)
  • The "Fields" section has a list of fields that can be supplied for the request. Make sure the "Response Format" field is set to JSON. (It might default to XML for some APIs. You don't want XML.)

There's no reason you should know what these fields mean without reading the documentation first. But it's okay to guess, and it's okay to play!

Once you've selected an API and a request, and filled in the fields, click "Make Request." You'll see values pop up in two more fields:

  • The "Request URL" field shows you what URL your Python program would need to use in order to make a request to the selected API with the fields you've given. (When you're using this URL in your Python programs, the only part you'll need to change is the ####---replace that with your own API key for the API you're using.)
  • The "Request Results" field shows the data that your request would return.

The API tool is an invaluable way to "try out" your requests before writing Python to programmatically access the API. The ability to construct queries from fields you input in the tool means you don't have to spend a lot of time learning how each part of the API works---you can just copy the URL from the "Request URL" field right into your program. Most APIs make a similar tool available, so when you're trying to learn a new API, keep a look out.

Making API requests in Python

Okay, enough beating around the bush. Let's actually make a request to the New York Times API from our notebook. The first step is to build a request URL in the API tool. Here's a URL I built that searches for articles containing the word python:

http://api.nytimes.com/svc/search/v1/article?format=json&query=python&rank=oldest&api-key=####

Second, let's make a cell that has my API key in it, assigned to a variable called api_key. If you run this cell, the value in api_key will be available in subsequent cells.

In [166]:
api_key = "paste your api key here and run this cell"

Now, let's make the request using urllib.urlopen(). After we've assigned the result to a variable (response_dict), we print out the dictionary's keys, just to make sure the data looks like what we want it to look like:

In [167]:
import urllib
import json

url = "http://api.nytimes.com/svc/search/v1/article?format=json&query=python&rank=oldest&api-key=" + api_key
response_str = urllib.urlopen(url).read()
response_dict = json.loads(response_str)
print response_dict.keys()
[u'tokens', u'total', u'results', u'offset']

Note: Because we're dealing with an external server, it might take a while for this code to finish! While the code is executing, IPython Notebook shows a little [*] asterisk next to the code snippet.

If you've formatted the URL incorrectly, or if there's another problem (such as your computer not having a connection to the network, or the API being unavailable), you might get an error like this:

In [168]:
url = "http://api.nytimes.com/svc/search/v1/blarticle"
error_str = urllib.urlopen(url).read()
error_dict = json.loads(error_str)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-168-a27b25754cc9> in <module>()
      1 url = "http://api.nytimes.com/svc/search/v1/blarticle"
      2 error_str = urllib.urlopen(url).read()
----> 3 error_dict = json.loads(error_str)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    363 
    364         """
--> 365         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    366         end = _w(s, end).end()
    367         if end != len(s):

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    381             obj, end = self.scan_once(s, idx)
    382         except StopIteration:
--> 383             raise ValueError("No JSON object could be decoded")
    384         return obj, end

ValueError: No JSON object could be decoded

We won't be focusing here on making our web API clients robust enough to handle errors gracefully, but for more information about the error, you can try printing out the string you received as a response, before you attempted to parse it as JSON:

In [169]:
print error_str
<h1>596 Service Not Found</h1>

This is the string that the server returned in response to the malformed request. With some APIs, this response will be helpful; in this case it isn't (it tells us that something "isn't found" but doesn't give us a good hint as to how we might fix the problem).
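If you do want a little protection against this kind of failure, one possible approach is to catch the ValueError and fall back to showing the raw response. Here's a sketch of that. The function name parse_json_or_none is my own, and it uses try/except and def, which we haven't covered in detail. One limitation: a response consisting of the JSON value null would also come back as None.

```python
import json

def parse_json_or_none(text):
    """Return the parsed JSON value, or None if text isn't valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # json.loads raises ValueError (or a subclass of it) on malformed input.
        return None

error_str = '<h1>596 Service Not Found</h1>'
error_result = parse_json_or_none(error_str)
if error_result is None:
    print('Not JSON. The server said: ' + error_str)

good_result = parse_json_or_none('{"total": 1526}')
print(good_result['total'])
```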

Working with responses

Now we've gotten a response from the API, and we've parsed it into a Python data structure that we know how to use (a dictionary). But now what do we do with it? First off, let's look at the data that we have and try to characterize its structure, from both a syntactic (what are the parts and what's it made of?) and semantic (what does it mean?) perspective.

One of IPython Notebook's nice features is that if you make a code cell with a variable or expression on its own in a single line, it will do its best to format the value of that expression in a nice, readable way. Let's do this for the response_dict variable that we created in a cell above.

In [170]:
response_dict
Out[170]:
{u'offset': u'0',
 u'results': [{u'body': u"-------------------------------------------------------------------- Douglas Hofstadter is an associate professor of computer science at Indiana University. His ''Godel, Escher, Bach'' won the Pulitzer Prize. MATHEMATICSAND HUMOR By John Allen Paulos. Illustrated. 116 pp. Chicago: University of Chicago Press. $12.95. By DOUGLAS HOFSTADTER What",
   u'date': u'19810118',
   u'title': u'ARE NUMBERS FUNNY?',
   u'url': u'http://www.nytimes.com/1981/01/18/books/are-numbers-funny.html'},
  {u'body': u"year series that will encompass all 37 of the master dramatist's works for the stage. The first two years were overseen by Cedric Messina; the last two are scheduled to be produced by Shaun Sutton, head of drama for the British Broadcasting Corporation. Under Mr. Messina, the series hewed to performance and staging traditions established over the",
   u'byline': u"By JOHN J. O'CONNOR",
   u'date': u'19810126',
   u'title': u"TV: JONATHAN MILLER'S 'SHREW' ON PBS",
   u'url': u'http://www.nytimes.com/1981/01/26/theater/tv-jonathan-miller-s-shrew-on-pbs.html'},
  {u'body': u"William H. Honan is the editor of The Times's Arts and Leisure section. By William H. Honan His Honor, the Mayor of New York, was dedicating a new shopping center in Brooklyn not long ago. He seemed to have the crowd with him as he approached his peroration when suddenly a black member of the racially mixed audience called out: ''We want John",
   u'date': u'19810201',
   u'title': u'ED KOCH: THE MAN BEHIND THE MAYOR',
   u'url': u'http://www.nytimes.com/1981/02/01/magazine/ed-koch-the-man-behind-the-mayor.html'},
  {u'body': u"TALK about too little and too late! ''Has 'Washington' Legs?,'' a comedy by the British writer Charles Wood, is a savage attack on the Hollywood movie industry, the Bicentennial hoopla of five years ago, and our romantic, schoolbook illusions about the American Revolution. While he was at it, why didn't Mr. Wood also include a few jokes about Bo",
   u'byline': u'By FRANK RICH',
   u'date': u'19810204',
   u'title': u'THEATER: A BRITISH 1776',
   u'url': u'http://www.nytimes.com/1981/02/04/theater/theater-a-british-1776.html'},
  {u'body': u"The recent ''evolution'' trial in California prompted a friend to think about her own education in evolution. She recalls that during tenth-grade science class, the nun would scornfully ask: ''Would you want to believe your great-great-great-grandfather was an ape?'' Gosh no, she and the other 15-year-olds said. It wasn't until college that she got",
   u'date': u'19810318',
   u'title': u'PYRAMIDS,SERPENTS; Devolution',
   u'url': u'http://www.nytimes.com/1981/03/18/opinion/pyramidsserpents-devolution.html'},
  {u'body': u'Everyone in the United Kingdom is very happy that Prince Charles is to marry Lady Diana Spencer. The Queen is happy, the Duke of Edinburgh is happy, Margaret Thatcher is happy, the press is happy, the Tourist Board is happy and the Bank of England is happy. The agricultural workers are happy and the sewage workers are happy and the keeper of the',
   u'byline': u"By Michael Palin; Michael Palin is a member of the Monty Python team, and author, most recently, of ''More Ripping Yarns,'' stories from the television series he wrote with Terry Jones.",
   u'date': u'19810322',
   u'title': u'MARITAL TIPS FOR CHARLES',
   u'url': u'http://www.nytimes.com/1981/03/22/opinion/marital-tips-for-charles.html'},
  {u'body': u"Critics and reviewers make their livings in search of perfection, but because very few things are perfect there's frequently the impulse to see things as being better than they are, to compromise. Some do it out of laziness or optimism or perhaps to help the needy, and many do it out of a real conviction that the second-rate is, indeed, first-.",
   u'byline': u'By Vincent Canby',
   u'date': u'19810405',
   u'title': u'CULLING GEMS FROM FLAWED MOVIES',
   u'url': u'http://www.nytimes.com/1981/04/05/movies/culling-gems-from-flawed-movies.html'},
  {u'body': u"THE Arthurian legends will not die for good reason. They are essential myths. Yet some films made of them are funnier (''Monty Python and the Holy Grail'') and more stimulating (Robert Bresson's ''Lancelot of the Lake'' and Eric Rohmer's ''Perceval'') than others, including the foolish screen version of the musical ''Camelot'' and now John",
   u'byline': u'By VINCENT CANBY',
   u'date': u'19810410',
   u'title': u"BOORMAN'S 'EXCALIBUR'",
   u'url': u'http://www.nytimes.com/1981/04/10/movies/boorman-s-excalibur.html'},
  {u'body': u'-------------------------------------------------------------------- Michael Billington, who is London theater critic for The Guardian, also maintains an interest in television. By MICHAEL BILLINGTON LONDON Jonathan Powell is a television producer with a golden touch. Indeed, if he were a corporation, he might be suspected of monopolistic',
   u'date': u'19810412',
   u'title': u'A BRITON WHO SPECIALIZES IN THE CLASSICS',
   u'url': u'http://www.nytimes.com/1981/04/12/arts/a-briton-who-specializes-in-the-classics.html'},
  {u'body': u"WASHINGTON When Keith Hay decided to go out for Chinese food last year, he went all the way to China. He was accompanied by 14 other welltraveled Americans of various ages, including two in their 70's, a teen-age boy and his parents and several single people. The object: to learn about and to eat Chinese food. Mr. Hay, who manages environmental",
   u'date': u'19810415',
   u'title': u"PIGEONS' FEET A LA CARTE",
   u'url': u'http://www.nytimes.com/1981/04/15/garden/pigeons-feet-a-la-carte.html'}],
 u'tokens': [u'python'],
 u'total': 1526}

Okay, this is still a little bit unreadable. I'm going to do my best to parse it out.

NOTE: nearly every API has its own idiosyncratic way of structuring its responses. Part of the point of API documentation is to let programmers know how the response is structured and what the response means. There's no substitute for studying the API documentation, but with a bit of practice, you can usually heuristically pinpoint which parts of a response are interesting based merely on what's visible in the response.

The Python data structure here is a dictionary. (Again, ignore the funky u in front of all of the strings for now.) We know that from looking at it, but also because of what happens when we evaluate this expression:

In [171]:
print type(response_dict)
<type 'dict'>

The dictionary appears to have four keys. Here are my guesses about what their values mean, based on my experience with other APIs:

  • We're doing a database search of some kind, and although we got a good number of results back from the API, we probably didn't get ALL of them. For that reason, I'm guessing that the total key maps to a value that indicates how many total documents were matched in our search.
  • Following along with that, the offset key probably maps to a value that indicates where in the search results we are currently. I.e., in this case, the offset is zero, so we're at the beginning of the list.
  • The tokens key seems to contain a list with one item in it, and that item is the thing we were searching for. So I'm guessing that this key maps to a value that tells us what search terms were used in the search.
  • The results key... well, this looks like it's the "payload" of our response, i.e., the data we were actually looking for. Its value appears to be a list of dictionaries.
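One way to check guesses like these is to ask the dictionary for its keys and look at a few values directly. Here's a minimal sketch using a small hand-made stand-in for the response (the name sample_response and its abbreviated contents are made up for illustration, so this runs without an API key):

```python
# A hypothetical, heavily abbreviated stand-in for response_dict.
sample_response = {
    'offset': 0,
    'results': [
        {'date': '19810118', 'title': 'ARE NUMBERS FUNNY?'},
    ],
    'tokens': ['python'],
    'total': 1526,
}

# sorted() on a dictionary evaluates to a list of its keys, in alphabetical order.
key_names = sorted(sample_response)
print(key_names)

# Values are looked up with square brackets, using the key instead of an index.
print(sample_response['total'])
print(sample_response['tokens'])
```

Running the same expressions against the real response_dict should show the same four key names, though the values will depend on the search.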

Based on this analysis, I'm going to home in on the value for the results key as something to play with. Let's confirm our suspicion that the value for this key is a list:

In [172]:
print type(response_dict['results'])
<type 'list'>

Okay, so what kind of value is the first item in that list?

In [173]:
print type(response_dict['results'][0])
<type 'dict'>

A dictionary! So we have a list of dictionaries. Let's have IPython Notebook display the first result and we'll see what we've got.

In [174]:
response_dict['results'][0]
Out[174]:
{u'body': u"-------------------------------------------------------------------- Douglas Hofstadter is an associate professor of computer science at Indiana University. His ''Godel, Escher, Bach'' won the Pulitzer Prize. MATHEMATICSAND HUMOR By John Allen Paulos. Illustrated. 116 pp. Chicago: University of Chicago Press. $12.95. By DOUGLAS HOFSTADTER What",
 u'date': u'19810118',
 u'title': u'ARE NUMBERS FUNNY?',
 u'url': u'http://www.nytimes.com/1981/01/18/books/are-numbers-funny.html'}

This dictionary is easier to decipher. It's easy to imagine that body contains a snippet of the body text of the article, date has the date that the article was published, title is the title of the article, and url is a URL pointing to the article itself on nytimes.com.
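Not every result has the same keys: some of the results in the response above have a byline key, while this one doesn't. Looking up a missing key with square brackets raises a KeyError. One way around that, sketched here with two abbreviated stand-in records (the name sample_results and the fallback string are made up for illustration), is the dictionary's get() method, which evaluates to a default value when the key is absent:

```python
# Two abbreviated records: the first has no 'byline' key, the second does.
sample_results = [
    {'title': 'ARE NUMBERS FUNNY?', 'date': '19810118'},
    {'title': 'CULLING GEMS FROM FLAWED MOVIES', 'date': '19810405',
     'byline': 'By Vincent Canby'},
]

# get() takes the key and a fallback value to use when the key is missing.
bylines = [article.get('byline', '(no byline)') for article in sample_results]
print(bylines)
```

The same expression should work on response_dict['results'], since each item in that list is a dictionary too.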

Now that we know we have a list, we can do some list-like things with it. We can, for example, see how many articles were returned in the response:

In [175]:
print len(response_dict['results'])
10

How about writing a list comprehension to make a list of all of the titles of the articles that were returned?

In [176]:
[article['title'] for article in response_dict['results']]
Out[176]:
[u'ARE NUMBERS FUNNY?',
 u"TV: JONATHAN MILLER'S 'SHREW' ON PBS",
 u'ED KOCH: THE MAN BEHIND THE MAYOR',
 u'THEATER: A BRITISH 1776',
 u'PYRAMIDS,SERPENTS; Devolution',
 u'MARITAL TIPS FOR CHARLES',
 u'CULLING GEMS FROM FLAWED MOVIES',
 u"BOORMAN'S 'EXCALIBUR'",
 u'A BRITON WHO SPECIALIZES IN THE CLASSICS',
 u"PIGEONS' FEET A LA CARTE"]

Or the first twenty characters of each article's body summary:

In [177]:
[article['body'][:20] for article in response_dict['results']]
Out[177]:
[u'--------------------',
 u'year series that wil',
 u'William H. Honan is ',
 u'TALK about too littl',
 u"The recent ''evoluti",
 u'Everyone in the Unit',
 u'Critics and reviewer',
 u'THE Arthurian legend',
 u'--------------------',
 u'WASHINGTON When Keit']

Building query strings dynamically

In the previous section, we used a URL that we had copied directly from the API tool. This is great as far as it goes, but we may want to write code that will be able to construct URLs from scratch---say, for example, if we wanted to perform a series of requests for several related resources, without having to copy URLs, or a series of requests that depend on data we've retrieved from some other source.

Let's review what a query string looks like. Here's the example query string from the section above about URL structure:

?arg1=baz&arg2=quux

At first glance, it looks like garbage. For another perspective, let's look at the query string from one of our requests to the New York Times API:

?format=json&query=python&rank=oldest&api-key=####

(I've written the api-key here as #### so as to not give away my credentials.) A little bit of the structure becomes more apparent here! It looks like we have a series of key/value pairs, separated from each other by ampersands (&): format=json, query=python, rank=oldest, etc. Within each pair, the key and the value are separated by an equal sign (=).

INTERPRETIVE QUESTION: What kind of data structure does this resemble? It's a data structure that we talked about earlier today. Think hard. Yes, it's a dictionary!
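To make the resemblance concrete, here's a rough sketch that pulls the example query string apart by hand and stores the pieces in a dictionary. The variable names are made up for illustration, and the sketch ignores the special-character encoding that real query strings can contain:

```python
query_string = "?format=json&query=python&rank=oldest&api-key=####"

params = {}
# Skip the leading "?", then split on "&" to get the individual pairs.
for pair in query_string[1:].split("&"):
    # Split each pair on its first "=" into a key and a value.
    key, value = pair.split("=", 1)
    params[key] = value

print(params['query'])
print(sorted(params))
```

Each key in the query string ends up as a key in the dictionary, with the text after the equal sign as its value.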

What we'd like to have, then, is some kind of way to write an expression that turns a Python dictionary into a string formatted correctly to include in a query string. It turns out that the process of doing this is kind of tricky, so Python provides a function to do the hard work for us. That function is urllib.urlencode(). If you give it a dictionary as a parameter, it evaluates to a string containing the contents of that dictionary, formatted as a URL query string. An example:

In [178]:
import urllib

print urllib.urlencode({'format': 'json', 'query': 'python', 'rank': 'oldest', 'api-key': '12345'})
query=python&api-key=12345&rank=oldest&format=json
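Part of what makes this "kind of tricky" is that some characters, like spaces and ampersands, can't appear as-is inside a query string value. The sketch below shows what urlencode() does with them. The search phrase is made up for illustration, and the try/except import is only there so that the example also runs on Python 3, where the function lives in urllib.parse:

```python
try:
    from urllib import urlencode        # Python 2, as used in these notes
except ImportError:
    from urllib.parse import urlencode  # Python 3 keeps the function here

# The value contains spaces and an ampersand, which need to be escaped.
encoded = urlencode({'query': 'monty python & the holy grail'})
print(encoded)
```

The spaces come out as plus signs and the ampersand as %26, so the server won't mistake the ampersand in the search phrase for the separator between two pairs.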

If we include that query string at the end of the base URL (taking care to put a ? between them), we get the full URL for making our request to the API:

In [179]:
base_url = "http://api.nytimes.com/svc/search/v1/article?"
query_str = urllib.urlencode({'format': 'json', 'query': 'python', 'rank': 'oldest', 'api-key': '12345'})
request_url = base_url + query_str
print request_url
http://api.nytimes.com/svc/search/v1/article?query=python&api-key=12345&rank=oldest&format=json

Now we're ready to do some damage. Join me, friends.

Full example: counting search results

Note: In this example we're using a for loop, which hasn't been discussed previously in these notes, but has been covered in Foundations. Review your notes from that class if you need a refresher.

Suppose we've set ourselves the task of determining which fruit is the most popular fruit, based on how many times the name of the fruit has occurred in the New York Times. (This is also my version of JOURNALISM.) In order to do this, we're going to go through the process of taking a list like this:

In [180]:
topics = ["apple", "banana", "cherry", "coconut", "durian", "lemon", "mango", "orange", "peach", "pear"]

... and turning it into a dictionary that has these strings as its keys, and values for each of these keys corresponding to the number of articles found in the New York Times Article Search API for that string. We're trying, in other words, to get a data structure that looks like this:

{'apple': 4172,
 'banana': 19734,
 'cherry': 73358,
 'coconut': 37516,
 'durian': 96198,
 'lemon': 9808,
 'mango': 43265,
 'orange': 82419,
 'peach': 25389,
 'pear': 31081}

(We'll probably get different numbers, though.)

Our methodology:

  • Create an empty dictionary.
  • Loop over each of the strings in the topics list.
  • Inside the loop: create a query string; perform an API request; store the value of the total key in the response dictionary as the value for the topic key (i.e., apple, banana, cherry, etc.).
  • Display the dictionary that we made.

Here's the code! Note: it may take some time for this code to complete.

In [181]:
import urllib
import json
import time

topics = ["apple", "banana", "cherry", "coconut", "durian", "lemon", "mango", "orange", "peach", "pear"]
topic_count = {}
for topic in topics:
    print topic
    query_str = urllib.urlencode({'query': topic, 'format': 'json', 'api-key': api_key})
    request_url = "http://api.nytimes.com/svc/search/v1/article?" + query_str
    response_str = urllib.urlopen(request_url).read()
    response_dict = json.loads(response_str)
    topic_count[topic] = response_dict['total']
    time.sleep(0.5)
    
print topic_count
apple
banana
cherry
coconut
durian
lemon
mango
orange
peach
pear
{'coconut': 5041, 'lemon': 14941, 'apple': 29782, 'peach': 4810, 'cherry': 13490, 'pear': 8211, 'banana': 7122, 'mango': 2990, 'orange': 47275, 'durian': 76}

Wait, what are these import time and time.sleep(0.5) things doing here? I'll tell you. The import time line makes Python's time module available, and the time.sleep(0.5) line tells Python to wait 0.5 seconds before proceeding. This is common practice when you're making multiple requests to an API in quick succession, to avoid overwhelming the server or going over your rate limit.
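With the counts in hand, we can answer the question we started with. Here's one possible sketch, using the counts printed above pasted in as a literal dictionary so that it runs without making any API requests. The names most_popular and ranked are made up for illustration:

```python
topic_count = {'coconut': 5041, 'lemon': 14941, 'apple': 29782, 'peach': 4810,
               'cherry': 13490, 'pear': 8211, 'banana': 7122, 'mango': 2990,
               'orange': 47275, 'durian': 76}

# max() over a dictionary normally compares the keys themselves;
# key=topic_count.get makes it compare the count stored under each key.
most_popular = max(topic_count, key=topic_count.get)
print(most_popular)

# The same trick with sorted() ranks the topics from most to least mentioned.
ranked = sorted(topic_count, key=topic_count.get, reverse=True)
print(ranked[:3])
```

With these particular counts, orange comes out on top, followed by apple and lemon. Your own numbers may rank the fruits differently.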

BONUS CHALLENGE: Write an expression that prints out the total number of articles for all topics searched for in this program.

Conclusion

In this session, you've learned how to use the dictionary, a very powerful and ubiquitous data structure. You learned the basics of how to access a web API, and how to convert the "raw data" returned from such an API (in JSON format) to an actual Python data structure. You learned how to make URL query strings "on the fly" by constructing dictionaries with the desired keys and values. Finally, you put it all together and made a short program that does multiple API queries, combining the results from those queries into a dictionary. What will you accomplish next?!