#!/usr/bin/env python # coding: utf-8 # Application Programming Interfaces (APIs) are one of the standard ways of interacting with data and software services on the internet. Learning how to use them from your programs is one of the fundamental steps in becoming a fluent developer. Here we will explore one API in particular, the one for the Digital Public Library of America (DPLA). But, first, what exactly is an API? # Imagine the following scenario: you have just accomplished a big task in putting the entire run of your university's literary journal online. People can explore the full text of each issue, and they can also download the images for your texts. Hooray! As we just learned in the lesson on web scraping, an interested digital humanist could use just this information to pull down your materials. They might scrape each page for the full texts, titles, and dates of your journal run, and put together their own little corpus for analysis. But that's a lot of work. Web scraping seems fun at first, but the novelty quickly wears off. We wouldn't want to scrape _every_ resource from the web. Surely there must be a better way, and there is! What if we were to package all that data up in a more usable form for our users to consume with their programs? That's where APIs come in. # APIs are a way of exchanging information and services between pieces of software. In this case, we're theorizing an API that would provide data. When a user comes to our journal site, we might imagine them saying, "hey - could you give me all the journals published in the 1950s?" 
And then our fledgling API would respond with something like the following: # [{"ArticleID":[42901,42902,42903,42904,42905,42906,42907,42908,42909,42910,42911,42912],"ID":1524,"Issue":1,"IssueLabel":"1","Season":"Spring","Volume":1,"Year":1950,"YearLabel":"1950"},{"ArticleID":[42913,42914,42915,42916,42917,42918,42919,42920,42921,42922],"ID":1525,"Issue":2,"IssueLabel":"2","Season":"Summer","Volume":1,"Year":1950,"YearLabel":"1950"},{"ArticleID":[42923,42924,42925,42926,42927,42928,42929,42930,42931,42932],"ID":1526,"Issue":3,"IssueLabel":"3","Season":"Winter","Volume":1,"Year":1950,"YearLabel":"1950"},……] # As we've discussed all along, computers are pretty bad at inferring things, so our API neatly structures a way for your program to interface with an application that we've made (API - get it?) more easily. The results give us a list of all the articles in each issue, as well as relevant metadata for the issue. In this case, we learn the year, season, volume, and issue number. With this information, we could make several other API requests for particular articles. But data isn't the only thing you can get from APIs - they can also do things for us! Have you ever used a social media account to log in to a different website - say, using Facebook to log into the New York Times website? Behind the scenes, the NY Times is using the Facebook API to authenticate you and prove that you're a real user. APIs let you do an awful lot, and they let you build on the work that others have done. # But this isn't a lesson about how to build APIs - we're going to talk about how to use them. There are a couple of different ways in which we can do this: from scratch or with a wrapper. In the former, we go through all the different steps of putting together a request for information from the DPLA API. In the latter, we use someone else's code to do the heavy lifting for us. First, we'll do things the easy way by working with DPyLA, a Python wrapper for the DPLA API. 
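# To make this concrete, here is a minimal sketch of how a client program might consume a response like the one above. The response body is a trimmed, hypothetical version of the journal data; Python's built-in json module turns the raw text into ordinary lists and dictionaries.

```python
import json

# A trimmed, hypothetical version of the journal API response above.
# json.loads turns the raw text of a response into Python lists and dicts.
response_body = '[{"ArticleID": [42901, 42902], "ID": 1524, "Issue": 1, "Season": "Spring", "Volume": 1, "Year": 1950}]'
issues = json.loads(response_body)

# Once parsed, the data works like any other Python structure.
print(issues[0]['Season'], issues[0]['Year'])  # Spring 1950
```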
Let's pull in the relevant Python pieces. Notice the $, which indicates that we're working in the command line and not Python. We'll need to install the package first. # $ pip install DPLA # Now that we've got the DPLA wrapper installed, we'll import it into our Python script. Remember, the "from X.Y import Z" form keeps us from writing X.Y.Z every time. In this case, it keeps us from writing dpla.api.DPLA when we would rather just write DPLA. # In[1]: from dpla.api import DPLA # APIs generally require you to prove that you are an authentic user (not a bot), and, in some cases, that you have permission to access their interface. You generally do this by authenticating with the service using credentials that you have registered with them. DPLA lets you register by sending a request through the terminal. Below, change "YOUR_EMAIL@example.com" to an email address of your choice. # $ curl -v -XPOST https://api.dp.la/v2/api_key/YOUR_EMAIL@example.com # After running the command you should get an email with your API key. You'll then need to include this API key in every request you send to the DPLA API. For the sake of not sharing my own API key, I won't write it here. In fact, Python has a handy way of making sure that we don't share our login details in situations just like this. What we'll do is store our key locally as an environment variable, hidden away from GitHub. Python will read in that variable from our system, store the key, and have access to it in a safe way. This process is sometimes called **sanitizing**, because you're cleaning your code to make sure that sensitive information is hidden. Run the following terminal command: # $ export API_KEY=YOUR_API_KEY_HERE # Now our API key is stored locally, so we'll pull it into Python. To do that, we will pull in 'os', a Python module for interacting with your operating system, including its environment variables. 
# In[2]: import os my_api_key = os.getenv('API_KEY') # Now you should have your own API key stored, and we can use it to make requests. The DPyLA wrapper makes this easy. First we open a connection to the DPLA API. Notice how we're calling it with our stored my_api_key. # In[3]: dpla_connection = DPLA(my_api_key) # If you follow along with the [documentation for the wrapper on GitHub](https://github.com/bibliotechy/DPyLA), you can actually see that the wrapper gives us a handy way of requesting our own API key for the first time. We could have done this instead of calling a command from the terminal to get that email sent to us. This is the line of code the documentation gives us for doing so: # In[4]: DPLA.new_key("your.email.address@here.com") # But we did this from the command line instead. This is a good first indication of the ways that wrappers can make your lives easier. They provide easy shortcuts for things that we would otherwise have to do from scratch. Now that we're all set up with the API, we can use our dpla_connection object to get information from the DPLA! Let's do a quick search for something. # In[6]: result = dpla_connection.search('austen') print(type(result)) # Python's built-in type() function tells us what we're dealing with - notice that the API did not return us a list of items as you might expect. Instead, it's returned a Results object. This means that we can do all sorts of things to what we've gotten back, and simply dumping out the list of the search results is only one such choice. To see all the different commands that we might call on this object, you can call one of the built-in commands. Or, if you're working in an interactive interpreter like IPython, you can type "result." and hit tab to see options. # In[7]: print(str(result.__dict__)[:1000]) # The __dict__ attribute shows us a range of options. We can get the number of search results, the list of all items, and a couple of other bits about the particular connection we've opened up. 
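# The same introspection works on any Python object, not just DPyLA results. A toy class (hypothetical, purely for illustration) shows what a __dict__ dump looks like:

```python
# A hypothetical stand-in for the wrapper's Results object,
# just to show what __dict__ reveals about an instance.
class FakeResults:
    def __init__(self):
        self.count = 3
        self.items = ['a', 'b', 'c']

result = FakeResults()

# __dict__ maps every instance attribute to its value.
print(result.__dict__)  # {'count': 3, 'items': ['a', 'b', 'c']}
```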
You can actually use these same tricks - __dict__ and dot + tabbing - to explore virtually every other object that you will encounter in Python. They give you information about the objects that you're working with, which is half the battle in any Python situation. But for now let's get some more information about the API results we see. We'll take a look at the first object here. # In[8]: item = result.items[0] item # We get a _lot_ of information from a source like this - far more than you probably wanted to know about this individual object. This is what makes APIs both useful and tricky to work with. They often want to set up users with everything they could possibly need, but they can't know what it is that users will be interested in. So very often they err on the side of completeness, which can sometimes make it difficult to parse the results. One difficult piece here is that the information is hierarchical - the data is organized a bit like a tree. So you have to respect that hierarchy by unfolding it as it expects. The first line below does not work, but the second does. Can you see why? # In[9]: item['stateLocatedIn'] # In[10]: item['sourceResource']['stateLocatedIn'] # There is no top-level key for 'stateLocatedIn'. That data is actually organized under 'sourceResource', so we have to tell the script exactly where we want to look. We can confirm this by walking down the tree towards the data we're interested in. # In[11]: result.items[0]['sourceResource'] # This confirms what we suggested before - things are nested in ways that can be difficult to parse. Imagine saying, "to find Y, first you have to look under X" rather than saying "look at Y. Why can't you find it?" We can get more information by querying 'item.keys()'. We're dealing with a dictionary object, so we can use all the normal dictionary commands. 
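# Here is a minimal sketch of that failure mode, using a made-up record with the same kind of nesting as a DPLA item (the field values are invented):

```python
# A made-up record shaped like a DPLA item, to show why the
# top-level lookup fails while the nested one succeeds.
item = {
    'id': 'abc123',
    'sourceResource': {
        'title': 'Pride and Prejudice',
        'stateLocatedIn': [{'name': 'New York'}],
    },
}

# item['stateLocatedIn'] would raise a KeyError here: the key only
# exists one level down, under 'sourceResource'.
print(item['sourceResource']['stateLocatedIn'])  # [{'name': 'New York'}]

# .get() is a gentler probe - it returns a default instead of
# raising when a key is missing.
print(item.get('stateLocatedIn', 'no such top-level key'))
```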
# In[12]: item.keys() # Another helpful tool is [http://jsonviewer.stack.hu/](http://jsonviewer.stack.hu/), which can help you format things a little to make JSON more workable. Let's loop over the first ten items here to get some interesting information about them. Notice here that 'item' is our name for the individual object that we are looking at in each iteration of the loop. # In[13]: for item in result.items[:10]: print(item['sourceResource']['stateLocatedIn']) # Several New York objects, and then an error. Let's look at the fifth object to see what is wrong with it. # In[14]: result.items[4]['sourceResource'] # The first several results all appear to be held by libraries, while the fifth result is an electronic resource. It makes sense that this resource would not be held in a particular state. Let's make our results a bit more nuanced so as to account for these edge cases. # In[15]: for item in result.items[:10]: if 'stateLocatedIn' in item['sourceResource']: print(item['sourceResource']['stateLocatedIn']) else: print(item['sourceResource']['format']) # Above we checked to see if the 'sourceResource' dictionary has a particular key, which allows us to skip over electronic resources. And notice how we have a couple of different formats for state names already! The first four list the full state name, while the last item lists an abbreviation. This can get very tricky very quickly, and it points to why data cleaning is one of the most important tasks you do as a programming humanist. If we were interested in working across these states, but they are formatted inconsistently, we would have to clean them up. # APIs frequently limit the number of requests you can make to their service during a particular time period. For example, Twitter limits the number of requests you can make to 15 per 15 minutes. 
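# One lightweight way to handle the inconsistent state names the loop just surfaced is a hand-made lookup table. This is only a sketch: the abbreviation mapping below is hypothetical and would need to be built out for real data.

```python
# A hypothetical normalization pass for inconsistent state names.
# The lookup table maps abbreviations to their full names.
ABBREVIATIONS = {'NY': 'New York', 'TX': 'Texas'}

def normalize_state(name):
    # Fall back to the name unchanged if it isn't a known abbreviation.
    return ABBREVIATIONS.get(name, name)

print(normalize_state('NY'))        # New York
print(normalize_state('New York'))  # New York
```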
This ensures that you don't accidentally blow up their system with requests while you're learning, but it also helps to ensure that people are using their service for legitimate scripts rather than incessant spam bots. # That's the easy way to do things. When you're interested in using the data provided by a service, you should always look to see if they have an API. And whenever there is an API, it is worth looking to see whether there is also a wrapper for you to use. There are often different wrappers for different programming languages, and Python is a pretty common and popular language. So you'll often find someone else's work that you can build on. # Before we move on, I want to give just a taste of how to do things the hard way, if you didn't have a wrapper for this particular API. There are a few things you need to know: # # * API Endpoint: the base URL that will be responding to your requests. Think of APIs like data that lives at particular URLs. If you've ever looked at the URL for a page you're on and seen something like shoppingbaseurl.com/?q=shoes&type=mens&cost=expensive, you're using something similar to an API. Basically, using an API entails constructing a URL that points to exactly the data you want. The API consists of the base URL that gets you to the root/heart/doorway of the API, and then you give params to nuance your search/request. # * Search Parameters (Params for Short): the particular things you send with your request to get back the information you want. Remember how Python dictionaries used key: value pairs? We'll do the same thing here. In the case of the previous example, you have three params: a search query, a type, and a cost. If this were a Python dictionary, we might write that as {'q': 'shoes', 'type': 'mens', 'cost': 'expensive'}. # * API key: we've already covered this, but the API key is what authenticates you with the API you're using so that they will allow you to use it. 
Sometimes, you simply pass your key as an additional search parameter. Other times, you might have to authenticate with a separate service (like OAuth). # # First, let's import the Python libraries that we'll need: # In[16]: import requests # 'requests' is a library that allows you to make requests to an API. Now we'll store our API endpoint so that we know where we will be making requests to. In this case, we can find our API endpoint by looking at DPLA's great [documentation](https://dp.la/info/developers/codex/api-basics/). Not every API provides you with such great guides to their work, so thanks to the wonderful people at DPLA for making this information available! # In[17]: endpoint = 'https://api.dp.la/v2/items' # Now we will set up our search params. If we're still working in the same terminal session, we should have our API key stored in 'my_api_key'. And it's important to note that not just any search parameters will work. The API documentation specifies which pieces are allowed. If we sent over 'how_great_is_ethan' as one of the parameters, it would not function. # In[18]: params = { 'api_key': my_api_key, 'q': 'Austin, Texas', } # I've set up a basic search here for information about Austin, Texas, and I've given the params my personal API key to authenticate me. I've also specified, through the endpoint, that I want to get items back. The handy thing about the 'requests' library is that it will mash all this together for us into a valid URL. # In[19]: requested_the_hard_way = requests.get(endpoint, params) requested_the_hard_way.status_code requested_the_hard_way.url # You might be more familiar with a 404 code, which is what you get if you go to a webpage that doesn't exist. A 200 code is one that we don't often see, because it means that things went OK! For the sake of comparison, I've pulled in our old connection that used the wrapper and renamed it alongside our request done from scratch. # # Why is this harder? 
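# To see what "mashing together" looks like under the hood, here is a sketch that builds the same kind of query string by hand with Python's standard library. The key value is a placeholder, not a real API key.

```python
from urllib.parse import urlencode

# Building the request URL by hand, the way requests does for us.
# 'YOUR_KEY_HERE' is a placeholder, not a real API key.
endpoint = 'https://api.dp.la/v2/items'
params = {'api_key': 'YOUR_KEY_HERE', 'q': 'Austin, Texas'}

# urlencode escapes special characters: the comma becomes %2C
# and the space becomes +.
url = endpoint + '?' + urlencode(params)
print(url)  # https://api.dp.la/v2/items?api_key=YOUR_KEY_HERE&q=Austin%2C+Texas
```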
# # For one, we had to find out the correct API endpoint and parameters. Sometimes this is easier said than done. For example, my first attempt at using the API actually failed - I assumed that 'items' was part of the parameters rather than the endpoint, because you can actually get different types of things beyond items - collections, for example. Let's take a look at the things we could do with our 'requested_the_hard_way' object: # ``` # res.apparent_encoding res.elapsed res.is_redirect res.ok res.status_code # res.close( res.encoding res.iter_content( res.raise_for_status( res.text # res.connection res.headers res.iter_lines( res.raw res.url # res.content res.history res.json( res.reason # res.cookies res.is_permanent_redirect res.links res.request # ``` # As we can see, there's nothing here that actually relates to the DPLA yet - all we have are actions that relate to the API - ways to check what kind of data we asked for, where we asked for it, and how. The data we got in response is here, though, in two places - .json and .text. The latter gives a long string version of the response, while .json gives us a formatted version of the data. JSON stands for JavaScript Object Notation, and we can interact with it in ways similar to a Python dictionary. # In[20]: requested_the_hard_way.json()['count'] # Our search got 81195 results (at the time of writing - your results might be different)! Let's take a closer look. # In[21]: requested_the_hard_way.json()['docs'][0] # The API limits our results to ten per page, so in order to get an analysis of the full sweep of the search results, we'll need to iterate over each page and add the results together, like so. # # We use the math.ceil function to round up the number we get. We know we have 500 results per page in this more expanded version of the API call. By dividing the total number of results by 500, we'll get how many pages we need to iterate over. We round up because that last page will have only a few results on it. 
To do this over the full collection of results, we would go all the way up to the total count - 144 pages. We'll just go 20 pages in to save bandwidth and time. Surely 10,000 results is enough to do something interesting. # In[19]: import math total_hard_way_results = [] params = { 'api_key': my_api_key, 'q': 'Austin, Texas', 'page_size': '500' } total_results = requested_the_hard_way.json()['count'] number_of_pages = math.ceil(total_results / 500) print(number_of_pages) list_of_page_numbers = range(1, 21, 1) for page_number in list_of_page_numbers: print(page_number) params['page'] = page_number for result in requests.get(endpoint, params).json()['docs']: total_hard_way_results.append(result) print(len(total_hard_way_results)) # In[20]: state_results = {} state_results['other_format'] = 0 for item in total_hard_way_results: if 'stateLocatedIn' in item['sourceResource']: if item['sourceResource']['stateLocatedIn'][0]['name'] in state_results: state_results[item['sourceResource']['stateLocatedIn'][0]['name']] += 1 else: state_results[item['sourceResource']['stateLocatedIn'][0]['name']] = 1 elif 'format' in item['sourceResource'] and item['sourceResource']['format'] != 'Text': state_results['other_format'] += 1 else: pass print(state_results) # We see a number of interesting things here. For one, we see that Texas has the largest number of holdings about Texas. This makes sense. But look how low those numbers are. We can start to get a sense of how difficult it is to work with this API data. We have 10,000 items we're looking at, but only a small percentage of them have physical locations. We captured all of these other materials in the other_format key. This key, were we to open it up, contains images, electronic resources, and more. But where are all the other search results coming from? More information would be required, but this tells us that the DPLA is just that - digital. 
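# As an aside, the hand-rolled tally above can be condensed with collections.Counter from the standard library, which initializes missing keys to zero for us. A sketch with made-up records shaped like the DPLA items:

```python
from collections import Counter

# Made-up records with the same shape as the DPLA items above.
items = [
    {'sourceResource': {'stateLocatedIn': [{'name': 'Texas'}]}},
    {'sourceResource': {'stateLocatedIn': [{'name': 'Texas'}]}},
    {'sourceResource': {'format': 'Image'}},
]

state_results = Counter()
for item in items:
    source = item['sourceResource']
    if 'stateLocatedIn' in source:
        # Counter starts missing keys at 0, so no 'in' check is needed.
        state_results[source['stateLocatedIn'][0]['name']] += 1
    else:
        state_results['other_format'] += 1

print(state_results)  # Counter({'Texas': 2, 'other_format': 1})
```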
While it might aggregate materials for some physical holdings, it primarily deals in materials that live electronically. # ## Exercises # # 1. Write your own script to use the DPLA API for a search item using the Python wrapper. # 2. Do the same search, but from scratch rather than using the wrapper. # 3. Manipulate the DPLA metadata to do something interesting with the results. # # ## Potential Answers # In[42]: # 1. Write your own script to use the DPLA API for a search item using the Python wrapper. # Let's find a dog-related item! We'll dig through the JSON to find the URL for the original item. from dpla.api import DPLA import os my_api_key = os.getenv('API_KEY') dpla_connection = DPLA(my_api_key) result = dpla_connection.search('dogs') item = result.items[0] item # item['@id'] item['originalRecord']['metadata']['mods:mods']['mods:identifier'][1]['#text'] # In[48]: # 2. Do the same search, but from scratch rather than using the wrapper. import requests endpoint = 'https://api.dp.la/v2/items' params = { 'api_key': my_api_key, 'q': 'dogs', } requested_the_hard_way = requests.get(endpoint, params) requested_the_hard_way.status_code requested_the_hard_way.url item = requested_the_hard_way.json()['docs'][0] item['originalRecord']['metadata']['mods:mods']['mods:identifier'][1]['#text'] # In[84]: # 3. Manipulate the DPLA metadata to do something interesting with the results. # I'm going to use the wrapper here - so the template is exercise 1. # Let's look at the first 1000 dog-related objects and look at the other subjects associated # with those objects. We'll get duplicate items, so let's take that large list and turn it # into a set. 
from dpla.api import DPLA import os my_api_key = os.getenv('API_KEY') dpla_connection = DPLA(my_api_key) result = dpla_connection.search('dogs', page_size=1000) length = result.count dog_categories = [] for item in result.items: if 'subject' in item['sourceResource']: for name in item['sourceResource']['subject']: dog_categories.append(name['name']) print(set(dog_categories))
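# The set() conversion at the end is doing the deduplication work. A quick sketch with made-up subject names:

```python
# Made-up subject names, to show why we collect into a list first
# and convert to a set at the end: sets discard duplicates.
dog_categories = ['Dogs', 'Pets', 'Dogs', 'Animals', 'Pets']

unique_categories = set(dog_categories)
print(unique_categories)  # e.g. {'Dogs', 'Pets', 'Animals'} - set display order varies
```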