#!/usr/bin/env python
# coding: utf-8

# # Intro to Scraping

# Very often there is data on the internet that we would just love to use for our purposes as digital humanists. But, perhaps because it is humanities data, the people publishing it online might not have made it available in a format that is easy for you to use. In a perfect world, everyone would make available clearly described dumps of their data in formats that machines can read. In reality, a lot of the time people just put things on a web page and call it a day. Web scraping refers to the act of using a computer program to pull down the content of a web page (or, often, many web pages). Scraping is very powerful - once you get the hang of it, your potential objects of study expand dramatically, as you'll no longer be limited to the data that others package up for you. You can start building your own corpora using real-world information.
#
# This lesson will call on your knowledge of HTML and CSS, which we covered earlier in the week. If you need a refresher, don't hesitate to ask! A little bit goes a long way when it comes to scraping. To get started, we'll import the packages we need. But first we have to install them:

# $ pip3 install bs4
#
# $ pip3 install lxml

# Now we'll import the packages in Python:

# In[2]:


from bs4 import BeautifulSoup
from urllib import request


# Each of these takes care of certain aspects of the process. The main one to know here is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which is the Python library that allows us to process the HTML we've pulled down from the web. The name comes from "Alice in Wonderland," which is a fun fact you can throw around at parties. We'll need a base link to scrape from. I've set up a number of texts at the following GitHub repository:

# In[3]:


url = "https://github.com/humanitiesprogramming/scraping-corpus"


# Now that we have that link saved as a variable, we can call it up again later.

# In[4]:


print(url)


# We can also modify the URL if we want to use it as a base and work with some variation on it:

# In[5]:


print(url + "/subdomain")


# We will use that URL to grab the basic HTML for a number of pages underneath it in the page structure. But first we need to go out and figure out what those links would be. Going to [the page](https://github.com/humanitiesprogramming/scraping-corpus) makes it pretty clear that there are a number of links that we want to grab, each of which pertains to a particular text. We could just copy and paste all those links ourselves to make a to-do list:
#
# * link one
# * link two
# * link three
#
# and so on, and then pull in the contents from each page. But we can also get the list of links for the pages we want to scrape by scraping them as well! This is usually a quicker way of grabbing the contents of a large number of pages on a site.
#
# The following code uses the "request" module (part of Python's urllib library) to go out and visit that webpage. The following two lines say, "Take the link stored at the variable 'url'. Visit it, read back to me what you find, and store that result in a new variable named 'html'."

# In[6]:


html = request.urlopen(url).read()
print(html[0:2000])


# Wait. Hold up - why are we scraping from GitHub instead of Project Gutenberg? Project Gutenberg does not allow automated scraping of their website:
#
# "Any perceived use of automated tools to access this website will result in a temporary or permanent block of your IP address."
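# A quick aside about that printout before we continue: the leading b' in the output is Python telling you that request.urlopen(url).read() hands back raw *bytes* rather than a text string. Beautiful Soup will happily accept the bytes as-is, but if you want to read the HTML as ordinary text yourself, you can decode it first. A minimal sketch, assuming the page is encoded as UTF-8 (which GitHub's pages declare):

# In[ ]:


# Decode the raw bytes into a string; errors='replace' keeps any stray bytes
# from crashing the decode.
html_text = html.decode('utf-8', errors='replace')
print(html_text[0:500])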
# There are good reasons for Project Gutenberg's policy, as outlined later in the lesson. So, instead, I have collected a corpus of Project Gutenberg texts and loaded them into a GitHub repository for you to practice on.

# So far we just have a whole bunch of HTML. We'll need to turn that into something that Beautiful Soup can actually work with.

# In[8]:


soup = BeautifulSoup(html, 'lxml')


# This line says, "take the HTML that you've pulled down and get ready to do Beautiful Soup things to it." Think of it this way: you have a certain number of things that you can do in your car:
#
# * Drive
# * Fill it with gas
# * Change the tires
#
# But you can only really do those things once you actually get in your car. You couldn't change your tires if you were riding a horse. Horses don't have wheels. In programming speak, we're saying "turn that HTML into a Beautiful Soup **object**." Saying something is an object is a way of saying "I expect this data to have certain characteristics and be able to do certain things." In this case, BeautifulSoup gives us a series of ways to manipulate the HTML using HTML and CSS structural elements. We can do things like:
#
# * Get all the links

# In[9]:


print(soup.find_all('a')[0:10])


# We can also say, get me all the text:

# In[8]:


print(soup.text[0:2000])


# It might not be very clear, but that's just the text of the webpage as one long string with all the HTML stripped out. Here is a slightly prettier version that strips out all the '\n' characters (those are just a way for Python to note that there should be a line break at that point in the string):

# In[9]:


print(soup.text.replace('\n', ' ')[0:1000])


# All that whitespace shows up because we're grabbing it from the *entire* page. We can either strip the whitespace out, or we can make a more nuanced request. Instead of getting all the page text first, we can say, "first get me only the HTML for the links on this page. Then give me the text for just those smaller chunks."

# In[10]:


print(soup.find_all('a').text)


# Wait, what happened there? Python gave us an error. This is because we got confused about what kind of object we were looking at. The error message says, "This thing you've given me doesn't support the method or attribute '.text'." Let's work backwards to see what we actually get from soup.find_all('a'):

# In[11]:


print(soup.find_all('a')[0:10])


# That looks as expected. To see what's going on, let's look at it another way. The following line will tell us what kind of object we're looking at:

# In[10]:


print(type(soup.find_all('a')).__name__)


# Ah! We're getting somewhere. We're looking at a ResultSet, not a BeautifulSoup object. And ResultSets let us do different things to them. In fact, a ResultSet gives us a list of Tag objects, and those still respond to a lot of the same things as BeautifulSoup objects. Check it:

# In[13]:


print(type(soup.find_all('a')[0]).__name__)


# In[14]:


print(soup.find_all('a')[0].text)


# How many links are there on this page anyway? We can find out by checking the length of this ResultSet:

# In[15]:


print(len(soup.find_all('a')))


# Here we go. soup.find_all() returns something roughly equivalent to a list. And you can do certain things to lists - you can find out how long they are, you can sort them, you can do things to each item. But you can't pull out the text of the list itself. That's something that a BeautifulSoup object (or an individual Tag) can do. We were trying to change the tires of our horse.
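# The fix is to ask each individual Tag for its text rather than asking the ResultSet as a whole. A minimal sketch, reusing the soup object from above and collecting the results with a list comprehension:

# In[ ]:


# Each Tag supports .text; gather the text of every link into an ordinary list
link_texts = [tag.text for tag in soup.find_all('a')]
print(link_texts[0:10])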
# If all of this is confusing, it can be boiled down to this:
#
# * You need to know what kinds of stuff you're dealing with.
# * You need to know what you can do with that stuff.
# * You need to know how to convert one format of stuff to another format of stuff.
#
# So much of programming consists of navigating different types of data.
#
# To return to our example:
#
# We could go through each element in that list and get the text for each individual item. The following lines do just that, but they also add a little formatting on either side to make the output more readable. And we'll strip out whitespace again.

# In[16]:


for item in soup.find_all('a')[0:10]:
    print('=======')
    print(item.text.replace('\n', ''))


# Now we're getting somewhere. Beautiful Soup can work through the data we pull down from a link; we'll just have to be careful that we know what kinds of objects we are working with. So let's pull down only the links that we care about by being a bit more specific.

# In[17]:


for link in soup.select("td.content a"):
    print(link.text)


# The "td.content a" bit uses CSS syntax to walk the structure of the HTML document and get to what we want. I know that I need those particular selectors because I have examined the HTML for the page to see how it is organized. You can do this by going to your webpage and inspecting the element that you want by right-clicking on it. This particular code says, "find the 'td' tags that have a class of 'content' and then give me the 'a' tags inside. Once we have all that, print out the text of those 'a' tags." If you haven't worked with CSS before, you can find a good tutorial for CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp). Rather than just printing the text of those links, this time we will collect the links themselves and store them in a list for us to scrape.

# In[18]:


links_html = soup.select('td.content a')
urls = []
for link in links_html:
    url = link['href']
    urls.append(url)

print(urls)


# Getting closer to some usable URLs. We just need to add the base of the website to each one. So here is the same piece of code, reworked slightly. We'll also modify each URL a little because of the way that GitHub formats its URLs. We want to get something like [this](https://raw.githubusercontent.com/walshbr/ohio-five-workshop/master/cli-tutorial.md) instead of [this](https://github.com/walshbr/ohio-five-workshop/blob/master/cli-tutorial.md), which is what we were getting. The former is stripped of all the GitHub formatting.

# In[19]:


links_html = soup.select('td.content a')
urls = []
for link in links_html:
    url = link['href'].replace('blob/', '')
    urls.append("https://raw.githubusercontent.com" + url)

print(urls)


# Bingo! Since we know how to go through a list and run code on each item, we can scrape each of these pages and combine the results into a dataset for us to use. Let's scrape each of them. We'll be re-using code from above. See if you can remember what each piece is doing:

# In[20]:


corpus_texts = []
for url in urls:
    print(url)
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    text = soup.text.replace('\n', '')
    corpus_texts.append(text)

print(corpus_texts)


# The variable corpus_texts is now a list containing ten different novels. We've got a nice little collection of data, and we can do some other things with it.

# Note: scraping sources through a script like this can raise a lot of questions. Do the people running the site allow you to do so? Some websites explicitly detail whether or not you can in their terms of service.
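# Many sites also publish a machine-readable robots.txt file stating what automated visitors are allowed to fetch, and Python's standard library can check it for you. Here is a minimal sketch; it points at GitHub's robots.txt purely because that is our practice site:

# In[ ]:


from urllib import robotparser

# Parse the site's robots.txt and ask whether a generic crawler ('*') may fetch our corpus page
rp = robotparser.RobotFileParser()
rp.set_url("https://github.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://github.com/humanitiesprogramming/scraping-corpus"))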
# Project Gutenberg, for example, explicitly tells you that you *cannot* scrape their website. Doing so anyway potentially opens you up to legal repercussions. Even if a site does not explicitly forbid scraping, it can still feel ethically suspect. A recent example of this is when a researcher scraped all publicly available OKCupid user data. While it is true that these users made their personal information publicly available, they probably did not intend for their lives to be exposed to this level of scrutiny. When getting ready to scrape data, it's usually a good idea to ask a series of questions:

# * Was this data meant to be public?
# * Am I harming anyone by pulling down this data?
# * Is this data associated with anyone's identity in a way that they might object to?
# * Is it worth it?
# * Can I get the data in some other way?
# * Is my scraping going to harm the website in some way?

# Related to this last point - even if your answers to all of these questions seem fine, you still need to be careful. Scraping a website can very often look like a [DDoS attack](https://en.wikipedia.org/wiki/Denial-of-service_attack). If you, say, try to scrape 10,000 links from Project Gutenberg, those 10,000 hits on Project Gutenberg's site could cause issues for their system. To get around this, it's often good practice to purposely slow down your scraper so that it more closely mimics the behavior of a human user. Rather than scraping multiple links per second, the following snippet tells the scraper to rest a random interval of up to 6 seconds between downloads:

# In[ ]:


import time
import random

max_sleep = 6  # rest for up to six seconds between downloads


def download(url, sleep=True):
    if sleep:
        time.sleep(random.random() * max_sleep)
    html = request.urlopen(url).read().decode('utf8', errors='replace')
    return BeautifulSoup(html, 'lxml')


# Every time you call the download() function, then, it sleeps for a random amount of time before fetching the page.

# If you're really concerned, it is usually a good idea to contact the people whose site you want to work with and ask if they mind you scraping it. Sometimes they might make their data available in a more usable way. And your institution's IRB review panel can help.

# Exercises:
#
# This page contains a press report about the Jack the Ripper murders: http://www.casebook.org/press_reports/alderley_and_wilmslow_advertiser/881019.html
#
# 1. Use Beautiful Soup to scrape the text of the press report.
#
# 2. That link is contained on this page, which links to all press reports for that journal: http://www.casebook.org/press_reports/alderley_and_wilmslow_advertiser/. Scrape just the links for all of the reports there. At the end you will want a list of URLs that have been cleaned so that they will resolve in your browser.
#
# 3. If you were planning to scrape all of the press reports on this site, what might be your approach? What would the steps be? Can you foresee any problems with this (technical or otherwise)? How might you circumvent them?
#
# 4. Use this link for scraping: http://humanitiesprogramming.github.io/. Scrape down:
#
# * The bios of Brandon and Ethan.
# * The links to all our Rails exercises.
# * The texts for our Rails exercises.

# In[22]:


# 1. Potential answer

from bs4 import BeautifulSoup
from urllib import request

url = 'http://www.casebook.org/press_reports/alderley_and_wilmslow_advertiser/881019.html'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
raw_text = soup.select('div#content')
clean_text = raw_text[0].text
print(clean_text[:1000])


# In[28]:


# 2.
# Potential answer

from bs4 import BeautifulSoup
from urllib import request

url = 'http://www.casebook.org/press_reports/alderley_and_wilmslow_advertiser/'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
raw_links = soup.select('#content a')
cleaner_links = [link['href'] for link in raw_links]
clean_links = ['http://www.casebook.org' + link for link in cleaner_links]
print(clean_links)


# 3. Potential answer - Scrape all the links, then scrape the links within those links so as to build a body of pages to scrape. Then I would look at what particular element I need from each page. A first technical problem is that the scraping would need to take place over more than one session - it would take more time than could be done at once. So we would need to incorporate a mechanism that lets us start and stop the process without having to start over. We could do this by keeping a running tally of the remaining links in a list, removing each link from the list after it is scraped, and writing that list to a file. When starting or stopping, the script would sync with that list of links (a sketch of this mechanism appears at the end of the notebook). There would also be a legal question - are we allowed to scrape the site, or do we need to ask permission from someone first?

# In[33]:


# 4. Potential answer
# a. The bios of Brandon and Ethan.
# b. The links to all our Rails exercises.
# c. The texts for our Rails exercises.

from bs4 import BeautifulSoup
from urllib import request

url = 'http://humanitiesprogramming.github.io/'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')

# answer a
bios = [bio.text for bio in soup.select('.justified')]
print('=====')
print(bios)
print('=====')

# answer b - note: pretty hard.
# Getting the right selector and working with the list elements is tricky.
url = 'https://humanitiesprogramming.github.io/exercises/'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
rails_links = [link for link in soup.select('.col-lg-8.col-lg-offset-2 ul')][-1]
rails_links = [link['href'] for link in rails_links.select('li a')]
print('=====')
print(rails_links)
print('=====')

# answer c - note: this is quite hard.
# Getting the right selector and working with the list elements is tricky.
texts = []
for url in rails_links:
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, 'lxml')
    raw_text = soup.select('.col-lg-8.col-lg-offset-2.col-xs-12')
    clean_text = ''
    for piece in raw_text:
        clean_text += piece.text
    texts.append(clean_text)

print('=====')
print(texts[0])
print('=====')


# In[ ]:
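# To make the answer to exercise 3 a bit more concrete, here is a minimal,
# purely illustrative sketch of the start/stop mechanism it describes: keep
# the to-do list of links in a text file, drop each link once it has been
# scraped, and rewrite the file after every page so the script can pick up
# where it left off. The file name 'remaining.txt' and the empty starting
# list are placeholders - in practice you would seed the list with the
# report URLs gathered in exercise 2.

import os
import time
import random

from bs4 import BeautifulSoup
from urllib import request

todo_file = 'remaining.txt'

if os.path.exists(todo_file):
    # Resume from a previous session
    with open(todo_file) as f:
        remaining = [line.strip() for line in f if line.strip()]
else:
    remaining = []  # seed this with the full list of links to scrape

for url in list(remaining):
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, 'lxml')
    # ... pull out and save whatever element you need from each page ...
    remaining.remove(url)
    with open(todo_file, 'w') as f:  # record progress after every page
        f.write('\n'.join(remaining))
    time.sleep(random.random() * 6)  # be polite: rest up to six seconds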