Organizations host their APIs on Web servers. When you type www.google.com in your browser's address bar, your computer is actually asking the www.google.com server for a Web page, which it then returns to your browser.
APIs work much the same way, except instead of your Web browser asking for a Web page, your program asks for data. The API usually returns this data in JavaScript Object Notation (JSON) format.
We make an API request to the Web server we want to get data from. The server then replies and sends the data back to us. In Python, we use the requests library to do this.
There are many different types of requests. The most common is a GET request, which we use to retrieve data.
It's almost always preferable to set up the parameters as a dictionary, because the requests library we mentioned earlier takes care of certain issues, like properly formatting the query parameters.
An application program interface (API) is a set of methods and tools that allow different applications to interact with each other.
APIs are hosted on web servers.
Programmers use APIs to retrieve data as it becomes available, which allows the client to quickly and efficiently retrieve data that changes frequently.
JavaScript Object Notation (JSON) format is the primary format for sending and receiving data through APIs. JSON encodes data structures like lists and dictionaries as strings to ensure that machines can read them easily.
The JSON library has two main methods:
- dumps - Takes in a Python object, and converts it to a string.
- loads - Takes in a JSON string, and converts it to a Python object.
We use the requests library to communicate with the web server and retrieve the data.
An endpoint is a server route for retrieving specific data from an API.
Web servers return status codes every time they receive an API request.
Status codes that are relevant to GET requests: 200 - Everything went okay, and the server returned a result.
import requests
# Set up the parameters we want to pass to the API.
# This is the latitude and longitude of New York City.
parameters = {"lat": 40.71, "lon": -74}
# Make a get request with the parameters.
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
# Print the content of the response (the data the server returned)
print(response.content)
b'{\n "message": "success", \n "request": {\n "altitude": 100, \n "datetime": 1551394100, \n "latitude": 40.71, \n "longitude": -74.0, \n "passes": 5\n }, \n "response": [\n {\n "duration": 187, \n "risetime": 1551422380\n }, \n {\n "duration": 619, \n "risetime": 1551427906\n }, \n {\n "duration": 625, \n "risetime": 1551433696\n }, \n {\n "duration": 555, \n "risetime": 1551439571\n }, \n {\n "duration": 571, \n "risetime": 1551445420\n }\n ]\n}\n'
# This gets the same data as above, but encodes the parameters directly in the URL instead of passing a dictionary.
response = requests.get("http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74")
response.content
b'{\n "message": "success", \n "request": {\n "altitude": 100, \n "datetime": 1551394100, \n "latitude": 40.71, \n "longitude": -74.0, \n "passes": 5\n }, \n "response": [\n {\n "duration": 187, \n "risetime": 1551422380\n }, \n {\n "duration": 619, \n "risetime": 1551427906\n }, \n {\n "duration": 625, \n "risetime": 1551433696\n }, \n {\n "duration": 555, \n "risetime": 1551439571\n }, \n {\n "duration": 571, \n "risetime": 1551445420\n }\n ]\n}\n'
- This is a simple API to return the current location of the ISS. It returns the current latitude and longitude of the space station with a unix timestamp for the time the location was valid. This API takes no inputs.
import requests
# Make a get request to get the latest position of the ISS from the OpenNotify API.
response = requests.get("http://api.open-notify.org/iss-now.json")
# Check the status code of the response.
status_code = response.status_code
print(status_code)
# The content of the response is a JSON-formatted bytes string.
response.content
200
b'{"iss_position": {"latitude": "-18.7799", "longitude": "-176.0867"}, "message": "success", "timestamp": 1553020897}'
OpenNotify has several API endpoints. An endpoint is a server route for retrieving specific data from an API. For example, the /comments endpoint on the reddit API might retrieve information about comments, while the /users endpoint might retrieve data about users.
The first endpoint we'll look at on OpenNotify is the iss-now.json endpoint. This endpoint gets the current latitude and longitude position of the ISS. A data set wouldn't be a great fit for this task because the information changes often, and involves some calculation on the server.
requests.get('http://api.open-notify.org/iss-pass').status_code
# iss-pass (without .json) isn't a valid endpoint, so the server returned a 404.
404
requests.get('http://api.open-notify.org/iss-pass.json').status_code
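# iss-pass.json is a valid endpoint, but it requires lat and lon parameters; without them the server returns a 400 (bad request).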
400
- 200 - Everything went okay, and the server returned a result (if any).
- 301 - The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint's name has changed.
- 401 - The server thinks you're not authenticated. This happens when you don't send the right credentials to access an API.
- 400 - The server thinks you made a bad request. This can happen when you don't send the information the API requires to process your request, among other things.
- 403 - The resource you're trying to access is forbidden; you don't have the right permissions to see it.
- 404 - The server didn't find the resource you tried to access.
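As a minimal sketch (not from the original notes), here is how a script might branch on these codes, using the OpenNotify endpoint from above, which returns 400 when the required parameters are missing:

response = requests.get("http://api.open-notify.org/iss-pass.json")
if response.status_code == 200:
    print("Success:", response.content)
elif response.status_code == 400:
    # We didn't send the lat/lon parameters the endpoint requires.
    print("Bad request - check that the required parameters were sent.")
else:
    print("Request failed with status code", response.status_code)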
import requests
parameters = {"lat": 37.78, "lon": -122.41}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
content = response.content
print(content)
b'{\n "message": "success", \n "request": {\n "altitude": 100, \n "datetime": 1547597625, \n "latitude": 37.78, \n "longitude": -122.41, \n "passes": 5\n }, \n "response": [\n {\n "duration": 268, \n "risetime": 1547608870\n }, \n {\n "duration": 630, \n "risetime": 1547614433\n }, \n {\n "duration": 603, \n "risetime": 1547620244\n }, \n {\n "duration": 490, \n "risetime": 1547626151\n }, \n {\n "duration": 508, \n "risetime": 1547632013\n }\n ]\n}\n'
requests.get('http://api.open-notify.org/astros.json').content
b'{"people": [{"name": "Oleg Kononenko", "craft": "ISS"}, {"name": "David Saint-Jacques", "craft": "ISS"}, {"name": "Anne McClain", "craft": "ISS"}], "number": 3, "message": "success"}'
# Make a list of fast food chains.
best_food_chains = ["Taco Bell", "Shake Shack", "Chipotle"]
# Import the JSON library.
import json
# Use json.dumps to convert best_food_chains to a string.
best_food_chains_string = json.dumps(best_food_chains)
print(type(best_food_chains_string))
print(best_food_chains_string)
<class 'str'> ["Taco Bell", "Shake Shack", "Chipotle"]
# Convert best_food_chains_string back to a list.
print(type(json.loads(best_food_chains_string)))
json.loads(best_food_chains_string)
<class 'list'>
['Taco Bell', 'Shake Shack', 'Chipotle']
# Make the same request we did earlier.
parameters = {"lat": 37.78, "lon": -122.41}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
print(response)
# Get the response data as a Python object. Verify that it's a dictionary.
json_data_convert = response.json()
print(type(json_data_convert))
print(json_data_convert)
first_pass_duration = json_data_convert["response"][0]["duration"]
first_pass_duration
<Response [200]> <class 'dict'> {'message': 'success', 'request': {'altitude': 100, 'datetime': 1547603598, 'latitude': 37.78, 'longitude': -122.41, 'passes': 5}, 'response': [{'duration': 268, 'risetime': 1547608870}, {'duration': 630, 'risetime': 1547614433}, {'duration': 603, 'risetime': 1547620244}, {'duration': 490, 'risetime': 1547626151}, {'duration': 508, 'risetime': 1547632013}]}
268
# The server sends more than a status code and the data when it generates a response.
# It also sends metadata containing information on how it generated the data and how to decode it.
# This information appears in the response headers. We can access it using the .headers property that responses have.
# The headers will appear as a dictionary.
# For the OpenNotify API, the format is JSON, which is why we could decode it with response.json() earlier.
response.headers
{'Server': 'nginx/1.10.3', 'Date': 'Wed, 16 Jan 2019 00:45:38 GMT', 'Content-Type': 'application/json', 'Content-Length': '521', 'Connection': 'keep-alive', 'Via': '1.1 vegur'}
response.headers["content-type"]
'application/json'
# Call the astros.json endpoint, which returns the astronauts currently in space.
response2 = requests.get('http://api.open-notify.org/astros.json')
print(response2)
json_data = response2.content
print(json_data)
json_convert_python = response2.json()
print(json_convert_python)
number = json_convert_python["number"]
print(number)
<Response [200]> b'{"people": [{"name": "Oleg Kononenko", "craft": "ISS"}, {"name": "David Saint-Jacques", "craft": "ISS"}, {"name": "Anne McClain", "craft": "ISS"}], "number": 3, "message": "success"}' {'people': [{'name': 'Oleg Kononenko', 'craft': 'ISS'}, {'name': 'David Saint-Jacques', 'craft': 'ISS'}, {'name': 'Anne McClain', 'craft': 'ISS'}], 'number': 3, 'message': 'success'} 3
APIs also use authentication to perform rate limiting. Developers typically use APIs to build interesting applications or services. In order to ensure that it remains available and responsive for all users, an API will prevent you from making too many requests in too short a time.
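As a rough sketch of one common way to cope with rate limiting: many (but not all) APIs return a 429 status code when you exceed the limit, often with a Retry-After header saying how many seconds to wait. The helper below illustrates that convention only; it isn't tied to any particular API.

import time
import requests

def get_with_backoff(url, headers=None, params=None, retries=3):
    # Retry a GET request when the server signals rate limiting (HTTP 429).
    for _ in range(retries):
        response = requests.get(url, headers=headers, params=params)
        if response.status_code != 429:
            return response
        # Many APIs include a Retry-After header telling us how many seconds to wait.
        wait = int(response.headers.get("Retry-After", 5))
        time.sleep(wait)
    return response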
The token is a string that the API can read and associate with your account.
Using a token is preferable to a username and password for a few reasons:
Typically, you'll be accessing an API from a script. If you put your username and password in the script and someone manages to get their hands on it, they can take over your account. In contrast, you can revoke an access token to cancel an unauthorized person's access if there's a security breach. Access tokens can have scopes and specific permissions. For instance, you can make a token that has permission to write to your GitHub repositories and make new ones. Or, you can make a token that can only read from your repositories. Using read-access-only tokens in potentially insecure or shared scripts gives you more control over security.
You'll need to pass your token to the GitHub API through an Authorization header. Just like the server sends headers in response to our request, we can send the server headers when we make a request. Headers contain metadata about the request. We can use Python's requests library to make a dictionary of headers, and then pass it into our request.
# Create a dictionary of headers containing our Authorization header.
headers = {"Authorization": "token cb525bb24ab50f54b020629cd848d021f364931d"}
# Make a GET request to the GitHub API with our headers.
# This API endpoint will give us details about Vik Paruchuri.
response3 = requests.get("https://api.github.com/users/VikParuchuri", headers=headers)
# Print the content of the response as a Python object. This endpoint returns public profile details for the GitHub user VikParuchuri.
print(response3.json())
response4 = requests.get("https://api.github.com/users/VikParuchuri/orgs", headers=headers)
orgs = response4.json()
print(orgs)
{'login': 'VikParuchuri', 'id': 913340, 'node_id': 'MDQ6VXNlcjkxMzM0MA==', 'avatar_url': 'https://avatars2.githubusercontent.com/u/913340?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/VikParuchuri', 'html_url': 'https://github.com/VikParuchuri', 'followers_url': 'https://api.github.com/users/VikParuchuri/followers', 'following_url': 'https://api.github.com/users/VikParuchuri/following{/other_user}', 'gists_url': 'https://api.github.com/users/VikParuchuri/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/VikParuchuri/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/VikParuchuri/subscriptions', 'organizations_url': 'https://api.github.com/users/VikParuchuri/orgs', 'repos_url': 'https://api.github.com/users/VikParuchuri/repos', 'events_url': 'https://api.github.com/users/VikParuchuri/events{/privacy}', 'received_events_url': 'https://api.github.com/users/VikParuchuri/received_events', 'type': 'User', 'site_admin': False, 'name': 'Vik Paruchuri', 'company': '@dataquestio ', 'blog': 'https://www.dataquest.io', 'location': 'San Francisco, CA', 'email': 'vik.paruchuri@gmail.com', 'hireable': None, 'bio': None, 'public_repos': 63, 'public_gists': 9, 'followers': 568, 'following': 10, 'created_at': '2011-07-13T18:18:07Z', 'updated_at': '2019-02-22T21:25:22Z'} [{'login': 'dataquestio', 'id': 11148054, 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjExMTQ4MDU0', 'url': 'https://api.github.com/orgs/dataquestio', 'repos_url': 'https://api.github.com/orgs/dataquestio/repos', 'events_url': 'https://api.github.com/orgs/dataquestio/events', 'hooks_url': 'https://api.github.com/orgs/dataquestio/hooks', 'issues_url': 'https://api.github.com/orgs/dataquestio/issues', 'members_url': 'https://api.github.com/orgs/dataquestio/members{/member}', 'public_members_url': 'https://api.github.com/orgs/dataquestio/public_members{/member}', 'avatar_url': 'https://avatars3.githubusercontent.com/u/11148054?v=4', 'description': 'Learn data science online'}]
dataquestio = requests.get('https://api.github.com/orgs/dataquestio', headers=headers).json()
hello_world = requests.get('https://api.github.com/repos/octocat/Hello-World', headers=headers).json()
Since we've authenticated with our token, the system knows who we are, and can show us some relevant information without us having to specify our username.
Making a GET request to https://api.github.com/user will give us information about the user the authentication token is for.
There are other endpoints that behave like this. They automatically provide information or allow us to take actions as the authenticated user.
user = requests.get("https://api.github.com/user", headers=headers).json()
user
{'login': 'lutang123', 'id': 45894161, 'node_id': 'MDQ6VXNlcjQ1ODk0MTYx', 'avatar_url': 'https://avatars1.githubusercontent.com/u/45894161?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/lutang123', 'html_url': 'https://github.com/lutang123', 'followers_url': 'https://api.github.com/users/lutang123/followers', 'following_url': 'https://api.github.com/users/lutang123/following{/other_user}', 'gists_url': 'https://api.github.com/users/lutang123/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/lutang123/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/lutang123/subscriptions', 'organizations_url': 'https://api.github.com/users/lutang123/orgs', 'repos_url': 'https://api.github.com/users/lutang123/repos', 'events_url': 'https://api.github.com/users/lutang123/events{/privacy}', 'received_events_url': 'https://api.github.com/users/lutang123/received_events', 'type': 'User', 'site_admin': False, 'name': 'Lu Tang', 'company': 'Canadian Disability Resources Society', 'blog': 'https://www.linkedin.com/in/lutang123/', 'location': 'Vancouver, Canada', 'email': None, 'hireable': True, 'bio': 'Keep learning, and learning is fun!', 'public_repos': 14, 'public_gists': 0, 'followers': 1, 'following': 0, 'created_at': '2018-12-15T09:27:51Z', 'updated_at': '2019-03-11T19:41:49Z', 'private_gists': 0, 'total_private_repos': 0, 'owned_private_repos': 0, 'disk_usage': 27039, 'collaborators': 0, 'two_factor_authentication': False, 'plan': {'name': 'free', 'space': 976562499, 'collaborators': 0, 'private_repos': 10000}}
It's typical for API providers to implement pagination. This means that the API provider will only return a certain number of records per page.
params = {"per_page": 50, "page": 1}
response = requests.get("https://api.github.com/users/VikParuchuri/starred", headers=headers, params=params)
page1_repos = response.json()
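To collect every record, we can keep incrementing the page parameter until the API stops returning results. A sketch, assuming (as with this GitHub endpoint) that an out-of-range page returns an empty list:

# Keep requesting pages until the API returns an empty page of results.
all_repos = []
page = 1
while True:
    params = {"per_page": 50, "page": page}
    response = requests.get("https://api.github.com/users/VikParuchuri/starred", headers=headers, params=params)
    page_repos = response.json()
    if not page_repos:
        break
    all_repos += page_repos
    page += 1
print(len(all_repos))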
We use GET requests to retrieve information from a server (hence the name GET). There are a few other types of API requests.
For example, we use POST requests to send information (instead of retrieving it), and to create objects on the API's server. With the GitHub API, we can use POST requests to create new repositories.
Different API endpoints choose what types of requests they will accept. Not all endpoints will accept a POST request, and not all will accept a GET request. You'll have to consult the API's documentation to figure out which endpoints accept which types of requests.
We can make POST requests using requests.post. POST requests almost always include data, because we need to send the data the server will use to create the new object.
We pass in the data in a way that's very similar to what we do with query parameters and GET requests:
# Create the data we'll pass into the API endpoint.
# While this endpoint only requires the "name" key, there are other optional keys.
payload = {"name": "test2"}
# We need to pass in our authentication headers!
response = requests.post("https://api.github.com/user/repos", headers=headers, json=payload)
print(response.status_code)
201
The code above will create a new repository named test2 under the account of the currently authenticated user. It will convert the payload dictionary to JSON, and pass it along with the POST request.
Check out GitHub's API documentation for repositories to see a full list of what data we can pass in with this POST request. Here are just a couple data points:
- name -- Required, the name of the repository
- description -- Optional, the description of the repository
A successful POST request will usually return a 201 status code indicating that it was able to create the object on the server. Sometimes, the API will return the JSON representation of the new object as the content of the response.
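Besides GET and POST, APIs commonly support PATCH (update part of an existing object), PUT (replace an object entirely), and DELETE (remove it). As a sketch only, assuming your token has the right scopes and substituting your own username for the placeholder below, here's how we could update and then delete the repository we just created with the GitHub API:

# Hypothetical sketch: update, then delete, the "test2" repository created above.
payload = {"name": "test2", "description": "A test repository"}
# PATCH modifies an existing object in place; GitHub returns 200 on success.
response = requests.patch("https://api.github.com/repos/<your-username>/test2", headers=headers, json=payload)
print(response.status_code)
# DELETE removes the object; GitHub returns 204 (no content) on success.
response = requests.delete("https://api.github.com/repos/<your-username>/test2", headers=headers)
print(response.status_code)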
The reddit API requires authentication. We will authenticate with a token using OAuth.
Note that we'll need to use the string bearer instead of the string token we used with the GitHub API. That's because we're using OAuth this time.
We'll also need to add a User-Agent header, which will identify us as Dataquest to the API:
{"Authorization": "bearer 13426216-4U1ckno9J5AiK72VRbpEeBaMSKk", "User-Agent": "Dataquest/1.0"}
The requests library also provides requests.patch(), requests.put(), and requests.delete() for the other request types, as sketched above.
headers={"Authorization": "bearer 13426216-4U1ckno9J5AiK72VRbpEeBaMSKk", "User-Agent": "Dataquest/1.0"}
params={'t':'day'}
response=requests.get('https://oauth.reddit.com/r/python/top', headers=headers, params=params)
python_top=response.json()
python_top
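# The example token above is not valid here, which is why the response below is a 401 (unauthorized) error.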
{'message': 'Unauthorized', 'error': 401}
headers = {"Authorization": "bearer 13426216-4U1ckno9J5AiK72VRbpEeBaMSKk", "User-Agent": "Dataquest/1.0"}
response = requests.get("https://oauth.reddit.com/r/python/comments/4b7w9u", headers=headers)
comments = response.json()
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
print(content)
print(1)
from bs4 import BeautifulSoup
# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')
print(parser)
print(2)
# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body
print(body)
print(3)
# Get the p tag from the body.
p = body.p
print(p)
print(4)
# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)
print(5)
head = parser.head
title = head.title
title_text = title.text
print(title_text)
print(6)
b'<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n <body>\n <p>Here is some simple content for this page.</p>\n </body>\n</html>' 1 <!DOCTYPE html> <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html> 2 <body> <p>Here is some simple content for this page.</p> </body> 3 <p>Here is some simple content for this page.</p> 4 Here is some simple content for this page. 5 A simple example page 6
response = requests.get('https://www.ted.com/playlists/171/the_most_popular_talks_of_all')
content = response.content
from bs4 import BeautifulSoup
parser = BeautifulSoup(content, 'html.parser')
body = parser.body
p = body.p
print(p.text)
Are schools killing creativity? What makes a great leader? How can I find happiness? These 25 talks are the ones that you and your fellow TED fans just can't stop sharing.
Rather than navigating with tag attributes as above, it's usually better to be more explicit by using the find_all method. This method will find all occurrences of a tag in the current element, and return a list.
If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, it behaves the same way as passing in the tag type as an attribute.
parser = BeautifulSoup(content, 'html.parser')
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")
# Get a list of all p tags within the body.
p = body[0].find_all("p")
# Get the text.
print(p[0].text)
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text
title_text
Are schools killing creativity? What makes a great leader? How can I find happiness? These 25 talks are the ones that you and your fellow TED fans just can't stop sharing.
'The most popular talks of all time | TED Talks'
HTML allows elements to have IDs. Because they are unique, we can use an ID to refer to a specific element.
HTML uses the div tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a Web page's footer, sidebar, and horizontal menu.
There are two paragraphs on the page; the first is nested inside a div. Luckily, the paragraphs have IDs. This means we can access them easily, even though they're nested.
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')
# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph)
print(type(first_paragraph))
print(first_paragraph.text)
print(type(first_paragraph.text))
second_paragraph_text = parser.find_all("p", id="second")[0].text
print(second_paragraph_text)
<p id="first"> First paragraph. </p> <class 'bs4.element.Tag'> First paragraph. <class 'str'> Second paragraph.
In HTML, elements can also have classes. Classes aren't globally unique. In other words, many different elements belong to the same class, usually because they share a common purpose or characteristic.
For example, you may want to create three dividers to display three of your photographs. You can create a common look and feel for these dividers, such as a border and caption style.
This is where classes come into play. You could create a class called "gallery," define a style for it once using CSS (which we'll talk about soon), and then apply that class to all of the dividers you'll use to display photos. One element can even have multiple classes.
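BeautifulSoup's find_all accepts a class_ keyword argument for matching by class (class_ rather than class, because class is a reserved word in Python). A minimal sketch, assuming a sample page like simple_classes.html below exists and contains paragraphs that carry an inner-text class:

# Assumes this sample page exists and has <p> tags with an "inner-text" class.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
parser = BeautifulSoup(response.content, 'html.parser')
# Get all paragraph tags that have the class "inner-text".
inner_paragraphs = parser.find_all("p", class_="inner-text")
for paragraph in inner_paragraphs:
    print(paragraph.text)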
Cascading Style Sheets, or CSS, is a language for adding styles to HTML pages. You may have noticed that our simple HTML pages from the past few screens didn't have any styling; all of the paragraphs had black text and the same font size. Most Web pages use CSS to display a lot more than basic black text.
CSS uses selectors to add styles to the elements and classes of elements you specify. You can use selectors to add background colors, text colors, borders, padding, and many other style choices to the elements on HTML pages.
We can use BeautifulSoup's .select method to work with CSS selectors.
# Get the Super Bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')
# Find the number of turnovers the Seahawks committed.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
print(seahawks_turnovers)
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)
<td>1</td> 1
patriots_total_plays_count = parser.select("#total-plays")[0].select("td")[2].text
print(patriots_total_plays_count)
seahawks_total_yards_count = parser.select("#total-yards")[0].select("td")[1].text
print(seahawks_total_yards_count)
72 396
We've covered the basics of HTML and how to select elements, which are key foundational blocks.
You may be wondering why Web scraping is useful, given that in most of our examples, we could easily have found the answer by looking at the page. The real power of Web scraping lies in getting information from a large amount of pages very quickly.
Let's say we wanted to find the total number of yards each NFL team gained in every single NFL game over an entire season. We could do this manually, but it would take days of boring drudgery. We could write a script to automate this in a couple of hours instead, and have a lot more fun doing it.
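To make that concrete, here's the shape such a script could take. The URLs below are placeholders, and the #total-yards selector is simply borrowed from the Super Bowl example above rather than taken from a real box-score site:

# Hypothetical sketch: loop over many box-score pages and pull one value from each.
game_urls = [
    "http://example.com/nfl/game1.html",  # placeholder URLs - substitute real pages
    "http://example.com/nfl/game2.html",
]
total_yards = []
for url in game_urls:
    response = requests.get(url)
    parser = BeautifulSoup(response.content, 'html.parser')
    # Assumes each page has a #total-yards row like the Super Bowl example above.
    yards = parser.select("#total-yards")[0].select("td")[1].text
    total_yards.append(int(yards))
print(sum(total_yards))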