Dealing with JSON

By Allison Parrish

JSON (JavaScript Object Notation) is a popular way of formatting data so that it can be shared between different computer systems. The idea is that you might have a data structure in one application, and you want to be able to send that data structure to another application. In order to do this, we need three things: (1) a common format that both applications understand (like JSON); (2) a way to take an in-memory data structure on the source machine and convert it to that format---this is called "serialization"; and (3) a way to change the "serialized" data back into an in-memory data structure on the target machine.

Python has a library called json that does the work in (2) and (3) for us. The json library has two important functions: dumps ("dump string"), which converts a Python data structure to a JSON string, and loads ("load string"), which converts a JSON string back to a Python data structure. Here's an example:

In [11]:
import json
In [12]:
mouse = {"name": "gerald", "length": 22.5, "favorite_food": "gouda", "age": 2}
mouse_json = json.dumps(mouse)
type(mouse_json)
Out[12]:
str
In [13]:
mouse_json
Out[13]:
'{"name": "gerald", "length": 22.5, "favorite_food": "gouda", "age": 2}'

As you can see, the literal notation for Python objects (i.e., the way we write them in our programs) has a strong resemblance to the way that same data looks when encoded as JSON. There are a number of differences (e.g., JSON uses null instead of None and true/false instead of True/False; JSON strings, including dictionary keys, are always double-quoted; the escape sequences allowed in JSON strings differ from those in Python), but for the most part the formatted data should look very familiar. The json library can take most built-in Python data structures and turn them into JSON---dictionaries, lists, strings, ints, floats---even nested data structures, like dictionaries with lists as values.
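To make those differences concrete, here is a small sketch that encodes a dictionary containing a few of the values that change shape on the way to JSON. The extra keys (nickname, is_hungry, favorite_foods, quote) are made up for illustration and are not part of the mouse example above.

```python
import json

# Hypothetical data, chosen to show values that look different in JSON.
demo_mouse = {
    "name": "gerald",
    "nickname": None,                      # None is written as null
    "is_hungry": True,                     # True is written as true
    "favorite_foods": ("gouda", "brie"),   # a tuple is written as a JSON array
    "quote": 'he said "squeak"',           # inner double quotes get escaped
}

encoded = json.dumps(demo_mouse)
print(encoded)
```

Note that the tuple comes out as a JSON array, so decoding this string again would give you a list, not a tuple.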

The great thing about JSON (as illustrated above) is that JSON-encoded data is just a string. I could copy this JSON data into an e-mail and send it to you, without having to worry about formatting, and you could paste it back into Python to get back the original data structure. (Or I could make a web application that encodes data structures as JSON, and you could read them with another computer program.)
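The "paste it back into Python" step is the loads function mentioned earlier. As a minimal sketch of the full round trip, using the same mouse dictionary (the name restored is just an illustrative variable):

```python
import json

mouse = {"name": "gerald", "length": 22.5, "favorite_food": "gouda", "age": 2}

# Serialize: Python dictionary -> JSON string.
mouse_json = json.dumps(mouse)

# Deserialize: JSON string -> Python dictionary.
restored = json.loads(mouse_json)

print(type(restored))
print(restored == mouse)
print(restored["favorite_food"])
```

For simple data like this (strings, numbers, dictionaries), the value you get back compares equal to the one you started with.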

JSON data from Corpora project

JSON data is often made available as files with a .json extension. You can download these files and use them in your program. One good source of fun data in JSON format is Darius Kazemi's Corpora project, which makes available an eclectic collection of "static corpora (plural of 'corpus') that are potentially useful in the creation of weird internet stuff." For example, there's a list of common English nouns, a list of color names with corresponding RGB values, and even a list of guitar manufacturers.

These files are great sources for making experimental generative text projects. The simplest way to use one in your notebook is to download it from GitHub and put it in the same directory as your notebook file. To download one of the files, navigate to it in the repository (click on the data directory and navigate through the categories until you find something that looks interesting), then click on the button that reads "Raw." This will show the JSON data directly in your browser, without the surrounding formatting of the GitHub web page. Then use your browser to save the file (usually File > Save Page As...). Save the file in the same directory as your notebook. You should see it pop up in Jupyter Notebook's Home tab.

Random zoos

As a quick example, I'm going to make a random zoo with this list of common animals. I've already downloaded this file as common.json in the same directory as this notebook. To "deserialize" it into a Python data structure, I'll use the open() function and the loads() function from the json library like so:

In [17]:
animal_data = json.loads(open("common.json").read())

Here's what the resulting data structure looks like:

In [18]:
animal_data
Out[18]:
{'animals': ['aardvark',
  'alligator',
  'alpaca',
  'antelope',
  'ape',
  'armadillo',
  'baboon',
  'badger',
  'bat',
  'bear',
  'beaver',
  'bison',
  'boar',
  'buffalo',
  'bull',
  'camel',
  'canary',
  'capybara',
  'cat',
  'chameleon',
  'cheetah',
  'chimpanzee',
  'chinchilla',
  'chipmunk',
  'cougar',
  'cow',
  'coyote',
  'crocodile',
  'crow',
  'deer',
  'dingo',
  'dog',
  'donkey',
  'dromedary',
  'elephant',
  'elk',
  'ewe',
  'ferret',
  'finch',
  'fish',
  'fox',
  'frog',
  'gazelle',
  'gila monster',
  'giraffe',
  'gnu',
  'goat',
  'gopher',
  'gorilla',
  'grizzly bear',
  'ground hog',
  'guinea pig',
  'hamster',
  'hedgehog',
  'hippopotamus',
  'hog',
  'horse',
  'hyena',
  'ibex',
  'iguana',
  'impala',
  'jackal',
  'jaguar',
  'kangaroo',
  'koala',
  'lamb',
  'lemur',
  'leopard',
  'lion',
  'lizard',
  'llama',
  'lynx',
  'mandrill',
  'marmoset',
  'mink',
  'mole',
  'mongoose',
  'monkey',
  'moose',
  'mountain goat',
  'mouse',
  'mule',
  'muskrat',
  'mustang',
  'mynah bird',
  'newt',
  'ocelot',
  'opossum',
  'orangutan',
  'oryx',
  'otter',
  'ox',
  'panda',
  'panther',
  'parakeet',
  'parrot',
  'pig',
  'platypus',
  'polar bear',
  'porcupine',
  'porpoise',
  'prairie dog',
  'puma',
  'rabbit',
  'raccoon',
  'ram',
  'rat',
  'reindeer',
  'reptile',
  'rhinoceros',
  'salamander',
  'seal',
  'sheep',
  'shrew',
  'silver fox',
  'skunk',
  'sloth',
  'snake',
  'squirrel',
  'tapir',
  'tiger',
  'toad',
  'turtle',
  'walrus',
  'warthog',
  'weasel',
  'whale',
  'wildcat',
  'wolf',
  'wolverine',
  'wombat',
  'woodchuck',
  'yak',
  'zebra']}
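The loading step can also be written with a with block and json.load (no "s"), which reads straight from an open file and closes it when done. This is only a sketch of that alternative: it writes a tiny stand-in file to a temporary directory so that it runs on its own, and the three-animal sample is made up rather than the real common.json.

```python
import json
import os
import tempfile

# Stand-in for a downloaded file; the real common.json is much longer.
sample = {"animals": ["aardvark", "alligator", "alpaca"]}
path = os.path.join(tempfile.mkdtemp(), "common.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)  # dump writes JSON to a file object

# load reads JSON from a file object; the with block closes the file for us.
with open(path, encoding="utf-8") as f:
    demo_data = json.load(f)

print(demo_data["animals"])
```

In your own notebook you would skip the first half and just open the file you downloaded.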

Something to know about the Corpora Project (and a lot of JSON data in general, really) is that there's no standard way of arranging the data. In order to work with this data, you'll have to look at the data structure and figure out how to write expressions that access the data you want. In this case, the animal_data value is a dictionary:

In [19]:
type(animal_data)
Out[19]:
dict
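When a file is too big to eyeball, it can help to summarize just the top level before digging in. The helper below is a hypothetical convenience function (describe is not part of the json library), shown with made-up sample data:

```python
def describe(data):
    """Return a short summary of the top level of a decoded JSON value."""
    if isinstance(data, dict):
        # Map each key to the name of its value's type.
        return {key: type(value).__name__ for key, value in data.items()}
    if isinstance(data, list):
        return "list of %d items" % len(data)
    return type(data).__name__

# Hypothetical sample shaped like a typical Corpora file.
sample = {"description": "A list of animals.", "animals": ["aardvark", "alligator"]}
print(describe(sample))
print(describe(sample["animals"]))
```

Calling it on a value, then on one of that value's keys, is a quick way to work out which expressions you need to reach the data you want.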

That dictionary has a single key, animals, whose value is a list of strings:

In [20]:
animal_data['animals']
Out[20]:
['aardvark',
 'alligator',
 'alpaca',
 'antelope',
 'ape',
 'armadillo',
 'baboon',
 'badger',
 'bat',
 'bear',
 'beaver',
 'bison',
 'boar',
 'buffalo',
 'bull',
 'camel',
 'canary',
 'capybara',
 'cat',
 'chameleon',
 'cheetah',
 'chimpanzee',
 'chinchilla',
 'chipmunk',
 'cougar',
 'cow',
 'coyote',
 'crocodile',
 'crow',
 'deer',
 'dingo',
 'dog',
 'donkey',
 'dromedary',
 'elephant',
 'elk',
 'ewe',
 'ferret',
 'finch',
 'fish',
 'fox',
 'frog',
 'gazelle',
 'gila monster',
 'giraffe',
 'gnu',
 'goat',
 'gopher',
 'gorilla',
 'grizzly bear',
 'ground hog',
 'guinea pig',
 'hamster',
 'hedgehog',
 'hippopotamus',
 'hog',
 'horse',
 'hyena',
 'ibex',
 'iguana',
 'impala',
 'jackal',
 'jaguar',
 'kangaroo',
 'koala',
 'lamb',
 'lemur',
 'leopard',
 'lion',
 'lizard',
 'llama',
 'lynx',
 'mandrill',
 'marmoset',
 'mink',
 'mole',
 'mongoose',
 'monkey',
 'moose',
 'mountain goat',
 'mouse',
 'mule',
 'muskrat',
 'mustang',
 'mynah bird',
 'newt',
 'ocelot',
 'opossum',
 'orangutan',
 'oryx',
 'otter',
 'ox',
 'panda',
 'panther',
 'parakeet',
 'parrot',
 'pig',
 'platypus',
 'polar bear',
 'porcupine',
 'porpoise',
 'prairie dog',
 'puma',
 'rabbit',
 'raccoon',
 'ram',
 'rat',
 'reindeer',
 'reptile',
 'rhinoceros',
 'salamander',
 'seal',
 'sheep',
 'shrew',
 'silver fox',
 'skunk',
 'sloth',
 'snake',
 'squirrel',
 'tapir',
 'tiger',
 'toad',
 'turtle',
 'walrus',
 'warthog',
 'weasel',
 'whale',
 'wildcat',
 'wolf',
 'wolverine',
 'wombat',
 'woodchuck',
 'yak',
 'zebra']

I'm just going to assign that list to a separate variable, so it's a little easier to work with:

In [21]:
animals = animal_data['animals']

Now, we'll make a zoo by selecting a random subset of these animals:

In [24]:
import random
my_zoo = random.sample(animals, 10)
print("In my zoo, we have the following kinds of animals:")
for item in my_zoo:
    print("* " + item)
In my zoo, we have the following kinds of animals:
* zebra
* boar
* orangutan
* elephant
* mustang
* salamander
* chinchilla
* mouse
* dingo
* woodchuck
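A note on why no animal appears twice in that zoo: random.sample picks without replacement, whereas calling random.choice repeatedly can return the same item more than once. The sketch below contrasts the two using a short made-up list and a seeded random.Random object (the seed value 42 is arbitrary) so that the result is repeatable between runs.

```python
import random

# Hypothetical short list standing in for the full list of animals.
demo_animals = ["aardvark", "alligator", "alpaca", "antelope", "ape", "armadillo"]

# A seeded generator gives the same "random" picks every time the cell runs.
rng = random.Random(42)

zoo = rng.sample(demo_animals, 4)                        # four distinct animals
visitors = [rng.choice(demo_animals) for _ in range(4)]  # repeats are possible

print(zoo)
print(visitors)
```

Keep in mind that sample raises a ValueError if you ask for more items than the list contains.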

Let's add some personality to our zoo using this list of moods. After having downloaded the raw JSON, you can grab the list of moods from the file like so:

In [25]:
mood_data = json.loads(open("moods.json").read())
moods = mood_data["moods"] # I had to look at the JSON data itself to determine that this was the correct key!

Now let's generate a new text combining the two:

In [29]:
print("Current zoo report")
print("------------------")
for i in range(10):
    print("The " + random.choice(animals) + " is " + random.choice(moods))
Current zoo report
------------------
The pig is invisible
The silver fox is content
The guinea pig is cynical
The sloth is lousy
The dog is defensive
The koala is impatient
The moose is appreciative
The reindeer is depressed
The lizard is replaced
The dingo is reverent

More sophisticated data structures

Not all JSON files have the same structure. Some of the files in the Corpora Project in particular aren't just dictionaries that have a list of strings. Let's say that we want to write tiny random narratives about the rivers of the world. Start with this list of rivers. You might want to just be able to grab one at random, but doing so will be a bit more complicated than what we did above because of the way the data is structured. Here's what the data looks like:

{
  "description": "A list of rivers.",
  "source": "http://en.wikipedia.org/wiki/List_of_rivers_by_length",
  "rivers": [
    {
      "name": "Nile",
      "confluences": ["Kagera"],
      "outflow": "Mediterranean"
    },
    {
      "name": "Kagera",
      "confluences": ["Nile"],
      "outflow": "Mediterranean"
    },
    {
      "name": "Amazon",
      "confluences": ["Ucayali","Apurímac"],
      "outflow": "Atlantic Ocean"
    },
    [...many more entries...]
  ]
}

Take a look at this data and try to characterize how it's structured. The entire thing is a dictionary; there's a description key and a source key which aren't of tremendous interest to us. The value for the rivers key is where most of the useful information resides, so let's take a look at that value: it's a list! And each element of the list is itself a dictionary. Let's load the data and put our theory to the test:

In [33]:
river_data = json.loads(open("rivers.json").read())
rivers = river_data["rivers"]
type(rivers)
Out[33]:
list

Okay, so we have the list of rivers. Here's the first item in the list (which we can hope is representative of the rest):

In [32]:
rivers[0]
Out[32]:
{'confluences': ['Kagera'], 'name': 'Nile', 'outflow': 'Mediterranean'}

To get the name key for this river dictionary:

In [34]:
rivers[0]['name']
Out[34]:
'Nile'

To get a list of just the river names, write a list comprehension that grabs just the name key of each dictionary in the list:

In [35]:
river_names = [item['name'] for item in rivers]
In [36]:
river_names
Out[36]:
['Nile',
 'Kagera',
 'Amazon',
 'Ucayali',
 'Apurímac',
 'Yangtze',
 'Mississippi',
 'Missouri',
 'Jefferson',
 'Yenisei',
 'Angara',
 'Selenge',
 'Huang He',
 'Ob',
 'Irtysh',
 'Paraná',
 'Río de la Plata',
 'Congo',
 'Chambeshi',
 'Amur',
 'Argun',
 'Lena',
 'Mekong',
 'Mackenzie',
 'Slave',
 'Peace',
 'Finlay',
 'Niger',
 'Murray',
 'Darling',
 'Tocantins',
 'Araguaia',
 'Volga',
 'Shatt al-Arab',
 'Euphrates',
 'Madeira',
 'Mamoré',
 'Grande',
 'Caine',
 'Rocha',
 'Purús',
 'Yukon',
 'Indus',
 'São Francisco',
 'Syr Darya',
 'Naryn',
 'Salween',
 'Saint Lawrence',
 'Niagara',
 'Detroit',
 'Saint Clair',
 'Saint Marys',
 'Saint Louis',
 'Rio Grande',
 'Lower Tunguska',
 'Brahmaputra',
 'Tsangpo',
 'Danube',
 'Breg',
 'Zambezi',
 'Vilyuy',
 'Araguaia',
 'Ganges',
 'Hooghly',
 'Padma',
 'Amu Darya',
 'Panj',
 'Japurá',
 'Nelson',
 'Saskatchewan',
 'Paraguay',
 'Kolyma',
 'Pilcomayo',
 'Upper Ob',
 'Katun',
 'Ishim',
 'Juruá',
 'Ural',
 'Arkansas',
 'Colorado',
 'Olenyok',
 'Dnieper',
 'Aldan',
 'Ubangi',
 'Uele',
 'Negro',
 'Columbia',
 'Pearl',
 'Zhu Jiang',
 'Red',
 'Ayeyarwady',
 'Kasai',
 'Ohio',
 'Allegheny',
 'Orinoco',
 'Tarim',
 'Xingu',
 'Orange',
 'Northern Salado',
 'Vitim',
 'Tigris',
 'Songhua',
 'Tapajós',
 'Don',
 'Stony Tunguska',
 'Pechora',
 'Kama',
 'Limpopo',
 'Guaporé',
 'Indigirka',
 'Snake',
 'Senegal',
 'Uruguay',
 'Murrumbidgee',
 'Blue Nile',
 'Churchill',
 'Khatanga',
 'Okavango',
 'Volta',
 'Beni',
 'Platte',
 'Tobol',
 'Jubba',
 'Shebelle',
 'Içá',
 'Magdalena',
 'Han',
 'Oka',
 'Pecos',
 'Upper Yenisei',
 'Little Yenisei',
 'Godavari',
 'Guapay',
 'Belaya',
 'Cooper',
 'Barcoo',
 'Marañón',
 'Dniester',
 'Benue',
 'Ili',
 'Warburton',
 'Georgina',
 'Sutlej',
 'Yamuna',
 'Vyatka',
 'Fraser',
 'Mtkvari',
 'Grande',
 'Brazos',
 'Cauca',
 'Liao',
 'Yalong',
 'Iguaçu',
 'Olyokma',
 'Northern Dvina',
 'Sukhona',
 'Krishna',
 'Iriri',
 'Narmada',
 'Lomami',
 'Ottawa',
 'Lerma',
 'Rio Grande de Santiago',
 'Elbe',
 'Vltava',
 'Zeya',
 'Juruena',
 'Upper Mississippi',
 'Rhine',
 'Athabasca',
 'Canadian',
 'North Saskatchewan',
 'Vaal',
 'Shire',
 'Nen',
 'Kızıl',
 'Green',
 'Milk',
 'Chindwin',
 'Sankuru',
 'Wu',
 'James',
 'Kapuas',
 'Desna',
 'Helmand',
 'Madre de Dios',
 'Tietê',
 'Vychegda',
 'Sepik',
 'Cimarron',
 'Anadyr',
 'Paraíba do Sul',
 'Jialing River',
 'Liard',
 'Cumberland',
 'White',
 'Huallaga',
 'Kwango',
 'Draa',
 'Gambia',
 'Chenab',
 'Yellowstone',
 'Ghaghara',
 'Huai',
 'Aras',
 'Seversky Donets',
 'Bermejo',
 'Fly',
 'Guaviare',
 'Kuskokwim',
 'Tennessee',
 'Vistula',
 'Aruwimi',
 'Daugava',
 'Gila',
 'Loire',
 'Essequibo',
 'Khoper',
 'Tagus']
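The same list comprehension pattern can filter on one key or reach into the nested confluences lists. Here is a small sketch that uses only the three river entries shown in the excerpt above, so it runs without the full file; the variable names are illustrative.

```python
# The first three entries from the rivers data shown earlier.
sample_rivers = [
    {"name": "Nile", "confluences": ["Kagera"], "outflow": "Mediterranean"},
    {"name": "Kagera", "confluences": ["Nile"], "outflow": "Mediterranean"},
    {"name": "Amazon", "confluences": ["Ucayali", "Apurímac"], "outflow": "Atlantic Ocean"},
]

# Keep only the names of rivers with a particular outflow.
mediterranean = [item["name"] for item in sample_rivers
                 if item["outflow"] == "Mediterranean"]

# Flatten every river's confluences into a single list.
all_confluences = [c for item in sample_rivers for c in item["confluences"]]

print(mediterranean)
print(all_confluences)
```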

Excellent! Now we can write a little random story generator:

In [47]:
my_river = random.choice(river_names)
animal1 = random.choice(animals)
animal2 = random.choice(animals)
print("I was floating down the " + my_river + " in a boat with my friend the " + animal1 + ".")
print("We came across a lonely-looking " + animal2 + " and said hello. THE END")
I was floating down the Niagara in a boat with my friend the kangaroo.
We came across a lonely-looking panda and said hello. THE END

JSON from elsewhere

Another way to get data in JSON format is to request it directly from a web server. Many web sites make their data available in both HTML and JSON format. The HTML you fetch with a web browser; the JSON you can fetch with a computer program. (The JSON-formatted version of a web site's data is sometimes called a "web API," where "API" stands for "application programming interface"—a version of the web site that you can access by writing computer programs.)

There are a lot of persnickety particulars about using web APIs, but the main gist is you need to (a) find an API that you want to work with; (b) learn how that API works, both in terms of how to structure the URLs when making requests, and how the data is structured in JSON; and (c) use Python to make the request to the desired URL.

This list of "Awesome JSON Datasets" makes our lives a little bit easier, as it has links directly to URLs that we can try out. For example, the following URL can be used to access a web API that gives information on how many people are in space right now:

http://api.open-notify.org/astros.json

If you paste that URL into your browser's address bar, you'll see the JSON data. (Sometimes browsers will add some pretty formatting or other features to this view, but the data actually returned from the server is just JSON.) Something like this:

{"number": 6, "people": [{"craft": "ISS", "name": "Alexander Misurkin"}, {"craft": "ISS", "name": "Mark Vande Hei"}, {"craft": "ISS", "name": "Joe Acaba"}, {"craft": "ISS", "name": "Anton Shkaplerov"}, {"craft": "ISS", "name": "Scott Tingle"}, {"craft": "ISS", "name": "Norishige Kanai"}], "message": "success"}
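A one-line blob like that is hard to read. If your browser doesn't format it for you, one option is to pass indent to json.dumps and let Python do the pretty formatting. The sketch below uses a shortened, hand-copied version of the response so that it runs offline; the variable names are illustrative.

```python
import json

# A shortened copy of the response above (one person instead of six).
raw = '{"number": 1, "people": [{"craft": "ISS", "name": "Joe Acaba"}], "message": "success"}'

astros = json.loads(raw)

# indent=2 puts each key and list item on its own line, indented two spaces.
pretty = json.dumps(astros, indent=2)
print(pretty)
```

The indented string is still valid JSON, so it decodes to exactly the same data.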

Of course, you could just save that file to disk and load it into your program the same way that we did with the Corpora Project files above. But if you wanted to make the request to grab the data and load it up in your program in one fell swoop, you can do that! You'll need to use the requests library to do this. (This library is included with Anaconda by default, but you can also install it in other Python environments by typing pip install requests.)

Fetching JSON from most web APIs is just two lines of code with requests. First, import the library:

In [52]:
import requests

Then call requests.get(url).json(), replacing url with the URL that you want to load. This expression will evaluate to the data from the JSON, already deserialized into a Python data structure. Handy!

In [53]:
data = requests.get("http://api.open-notify.org/astros.json").json()
In [54]:
data
Out[54]:
{'message': 'success',
 'number': 6,
 'people': [{'craft': 'ISS', 'name': 'Alexander Misurkin'},
  {'craft': 'ISS', 'name': 'Mark Vande Hei'},
  {'craft': 'ISS', 'name': 'Joe Acaba'},
  {'craft': 'ISS', 'name': 'Anton Shkaplerov'},
  {'craft': 'ISS', 'name': 'Scott Tingle'},
  {'craft': 'ISS', 'name': 'Norishige Kanai'}]}
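If requests isn't available, a similar fetch can be sketched with only the standard library. This is an alternative under stated assumptions, not the approach used above: fetch_json and count_by_craft are hypothetical helper names, and the sample payload is a shortened copy of the response so the example runs without a network connection.

```python
import json
import urllib.request

def fetch_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON (standard library only)."""
    # urlopen raises an exception on network failures and HTTP error statuses.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def count_by_craft(payload):
    """Count how many people are aboard each craft in an astros-style payload."""
    counts = {}
    for person in payload.get("people", []):
        craft = person.get("craft", "unknown")
        counts[craft] = counts.get(craft, 0) + 1
    return counts

# Offline sample shaped like the response shown above.
sample = json.loads(
    '{"number": 2, "people": [{"craft": "ISS", "name": "Joe Acaba"}, '
    '{"craft": "ISS", "name": "Scott Tingle"}], "message": "success"}'
)
print(count_by_craft(sample))

# To try it against the live API instead:
# count_by_craft(fetch_json("http://api.open-notify.org/astros.json"))
```

Using .get with a default means the counting step won't crash if the server ever returns a payload without a people key.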

Now we can write a tiny generative story about these fine spacepeople:

In [60]:
for item in data['people']:
    print(item['name'] + " is " + random.choice(moods) + " and in space.")
Alexander Misurkin is scolded and in space.
Mark Vande Hei is open and in space.
Joe Acaba is warmhearted and in space.
Anton Shkaplerov is intense and in space.
Scott Tingle is tactful and in space.
Norishige Kanai is conventional and in space.

Nice!