Scraping websites with the help of Selenium
Vadim Voskresenskii (slack: Vadimvoskresenskiy)
Today we will study how to work with a very useful and impressive framework that helps us scrape websites with dynamic data requests. This framework is called Selenium, and we can work with it efficiently from Python. The idea behind Selenium is very simple: it allows web developers to test their applications before launching them. With its help, they can emulate the work of a browser and check how different elements of their application behave from a user's point of view.
But apart from giving web developers the possibility to check their applications, Selenium is also useful for data analysts who want to get data from websites with sophisticated internal structures. You have probably faced situations when you try to collect data with Beautiful Soup or another Python package and cannot get it, because you need to wait some time until the data is uploaded to the website from a server. Unfortunately, your script does not know about this feature of the website and tries to get the data at once; instead of the desirable data, you get a blank list. Also, data analysts sometimes want to collect data from websites where you first need to put some information into text fields or click some buttons. Certainly, you cannot do such things with Beautiful Soup. This tutorial will show you how to tackle such issues with the help of Selenium.
The plan of the workshop is as follows: set everything up, get to know the main Selenium functions, and then scrape Airbnb step by step.
First, we need to know how to launch Selenium. That's very simple!
With the help of pip, you can install selenium:
pip install selenium
After that, you need to install a driver on your computer, which will allow Selenium to interact with a browser.
My advice: choose Firefox for working with Selenium. Initially, I started with Chrome and found that it sometimes cannot find elements on a webpage that definitely exist. At the same time, Firefox had no issues finding these elements. I did not check other browsers, though.
Regardless of the browser you select, the algorithm for working with drivers is very similar. First, you download the driver (geckodriver for Mozilla Firefox can be found here). Then, you make the executable file (geckodriver.exe) available to the system (on Windows, you add the path to the executable file to PATH). That's it. Now you can work with Selenium.
If we installed everything correctly, we can check how Selenium works. For that, let's import the needed modules and try to open a website.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime
from dateutil import relativedelta
import numpy as np
import time
import re
driver = webdriver.Firefox()
driver.get("https://mlcourse.ai/roadmap")
If everything is fine, you will see the Firefox browser magically open the webpage of our course.
Before scraping, let's get started by looking at the main functions we are going to use in this tutorial.
Our approach to collecting data is very simple. First, we find the HTML element we want to interact with and, second, we interact with it by sending keys (the browser thinks that a real user is pressing buttons on her keyboard) or by clicking buttons.
An HTML element can be identified in different ways. Here are the most important functions for us:
driver.find_element_by_id
With the help of this function, we can find an element by its id. An id is supposed to be unique within a webpage, which makes this the most reliable locator when it is available.
driver.find_element_by_xpath
XPath is a path to the HTML element we need. Sometimes elements on one page can share the same path, so we need to be careful with this approach. Still, in most cases XPath is the easiest way to get a very specific element on a webpage.
driver.find_element_by_link_text
The most dangerous approach is searching for an element by its text (the text you see on the webpage). As you can guess, it can only be used when exactly one element is represented by this text.
OK, we found an element. How do we interact with it? For this, there are a few other functions.
element.send_keys("text")
With the help of this function, we can send text to the website. For instance, we can sign in or type the name of a book we want to buy on Amazon.
click()
If we are working with a button, we can click on it.
In this workshop, we will scrape data from Airbnb. Airbnb is a website for travelers that sometimes lets you find a cheaper place to stay than websites like Booking.com. On Airbnb, you search not for hotels or hostels but for apartments offered by hosts living in the city you want to visit. Airbnb is built on the principles of the sharing economy, where trust between hosts and guests is supported by a review system.
Our task is the following. Imagine that you and your friend want to travel to London and stay there from the 15th of March 2019 to the 23rd of May (completely random dates). We do not want to visit the website every day hoping to find the best offer. Instead, we want to write a function that will regularly collect offers from hosts for us, along with some characteristics of the apartments and their prices. But as you can see, the Airbnb website is well made and has a lot of interactive elements: apart from search fields, there are calendars and special buttons for choosing the number of guests and children. Using Beautiful Soup alone, we cannot collect all the data we need, so we certainly need Selenium. Let's start!
So, we are now on the main page of Airbnb, and we need to choose a city, country, dates, and number of guests. Obviously, we start with the place we are going to visit. The best way is to put a city and a country into the search field. To identify the HTML element of the search field, browsers have a special function called "Inspect element": right-click on the element you want and press Q on the keyboard (look at the picture below).
If you click this button, you will see which HTML element is responsible for this part of the website.
In our case, the id of this element is following:
id="Koan-magic-carpet-koan-search-bar__input"
After sending the keys, Airbnb asks us to choose one of the suggested options. Usually, the first option is the one we need.
For the first option we have this id:
id="Koan-magic-carpet-koan-search-bar__option-0"
And in this case, instead of sending keys, we need to click a button.
Hint: add time.sleep(2) between the calls, as sometimes we need to wait a bit until all the information is loaded.
driver = webdriver.Firefox()
driver.get("https://ru.airbnb.com/") # if you are not from Russia, you need to write driver.get("https://airbnb.com/")
driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__input").send_keys("London, United Kingdom")
time.sleep(2)
driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__option-0").click()
Great! We chose the city!
Now we need to do something more challenging: choose our check-in and check-out dates. Let's start with the check-in. The id of the check-in field is checkin_input, so we write:
driver.find_element_by_id("checkin_input").click()
After clicking this button, a calendar appears. The Airbnb interface does not allow us to bypass the calendar and send keys, so we need another way to choose the desired date. I remind you that our plan is to check in on the 15th of March. As you can see, the calendar opens on the current month (in this case, December), and there are special buttons for switching months (look at the picture below). So, we need to click the right month-switching button three times to reach March. To click this button, we need its XPath. For that, we again inspect the element, then right-click the found element and choose Copy -> XPath. After that, paste the XPath into the script.
Here are the lines of code:
for i in np.arange(0,3):
    driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
Now we need to choose the 15th of March on the calendar. Let's look at the XPath of this date:
"//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[3]/td[6]
We remember that the final goal of this tutorial is to write a function that will automatically work for any date. How can we do that if each date has its own unique XPath? Very simple! Some of the numbers in the XPath are meaningful for us. This part of the XPath, tr[3]/td[6], tells us that the needed date is located in the third row (week) and the sixth column (day of the week) of the calendar table (look at the following picture).
If you work with the Russian version of Airbnb (as I do), take into account that our weeks start on Monday, not Sunday, so in our case the element would have the ending:
tr[3]/td[5]
The lines of code are very similar to the previous one:
driver.find_element_by_xpath("// [@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[3]/td[5]") # for Russian Airbnb
driver.find_element_by_xpath("// [@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[3]/td[6]") # for English version
Now we do the same for the check-out field. But take into account that the calendar now starts from March, not December, which means we need to click the month-switching button only twice. And the 23rd of May is the fourth day of the fourth week (in the English version of the website, it is the fifth day of the fourth week).
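To make this row-and-column arithmetic concrete, here is a small sketch (standard library only) that computes the tr/td indices for an arbitrary date. The helper name calendar_cell is hypothetical, and the sketch assumes the calendar grid matches the one described above; like the final function below, it relies on ISO week numbers, so it inherits the same caveat around year boundaries.

```python
import datetime

def calendar_cell(date_str, monday_first=True):
    """Return (week_row, weekday_column) of a date inside its month's
    calendar table, i.e. the numbers that go into tr[...]/td[...]."""
    d = datetime.datetime.strptime(date_str, "%d-%m-%Y")
    first = d.replace(day=1)
    # row: how many ISO weeks the date lies past the week of the 1st
    row = d.isocalendar()[1] - first.isocalendar()[1] + 1
    if monday_first:  # Russian version: weeks start on Monday
        col = d.weekday() + 1
    else:             # English version: weeks start on Sunday
        col = 1 if d.weekday() == 6 else d.weekday() + 2
    return row, col

print(calendar_cell("15-03-2019"))                      # (3, 5) -> tr[3]/td[5]
print(calendar_cell("15-03-2019", monday_first=False))  # (3, 6) -> tr[3]/td[6]
```

This is exactly the computation we will reuse when we generalize the scraper to arbitrary dates.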
Here's the final code for choosing dates:
driver = webdriver.Firefox()
driver.get("https://ru.airbnb.com/")
driver.find_element_by_id("checkin_input").click()
time.sleep(2)
for i in np.arange(0,3):
    driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
time.sleep(2)
driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[3]/td[5]").click()
time.sleep(2)
driver.find_element_by_id("checkout_input").click()
time.sleep(2)
for i in np.arange(0,2):
    driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
time.sleep(2)
driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[4]/td[4]").click()
Cool! We chose the dates.
You are traveling with your friend, so we need to increase the number of guests. First, click the special guests button, and then click the "+" button twice to increase the number of adults (look at the picture below). You can do the same for children if needed. Don't forget to click the "guests" button again to close the list of options and clear the way to the "Search" button.
Here's the code for guests:
driver = webdriver.Firefox()
driver.get("https://ru.airbnb.com/")
time.sleep(5) # it's better to wait a bit longer
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
time.sleep(2)
for i in np.arange(0,2):
    driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/div/div/div/div[1]/div/div/div/div[2]/div/div[3]/button").click()
time.sleep(2)
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
We chose the place, the dates, and the number of guests, so we can start searching for apartments. For that, we click the "Search" button, wait until the new page is loaded, and then click the "show all" option (look at the picture) to move to the apartments.
Here's the code for moving between pages.
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[4] /div/button").click()
time.sleep(5)
driver.find_element_by_xpath("/html/body/div[4]/div/main/div/div[2]/div/div/div/div/div[3]/div/div/div[2]/button").click()
After we get all the options, we scrape the content of the page with the help of Beautiful Soup and save all the apartment urls into a predefined list (here's a great tutorial on how to work with Beautiful Soup).
Here, we load all the content of the final page into the object soup:
rooms_london = []
soup=BeautifulSoup(driver.page_source, 'lxml')
All the urls have an identical pattern, so we apply the same regular expression to find them and append them to the list.
for a in soup.find_all('a', href=re.compile("rooms/[0-9]+")):
    rooms_london.append(a['href'])
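The pattern itself can be checked on a static snippet without launching a browser at all. The HTML below is a made-up, heavily trimmed stand-in for the real search-results markup:

```python
import re

# hypothetical, trimmed stand-in for the search-results page
html = '''
<a href="/rooms/18556441?check_in=2019-03-15">offer A</a>
<a href="/help/article/1">help</a>
<a href="/rooms/27688254?check_in=2019-03-15">offer B</a>
'''

# same idea as the soup.find_all call above: keep only /rooms/<digits> links
room_links = re.findall(r'href="(/rooms/[0-9]+[^"]*)"', html)
print(room_links)
```

Only the two /rooms/... links survive; the help link is filtered out, just as Beautiful Soup's href=re.compile(...) filter does on the live page.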
Now we can go through all the pages with apartments and add new urls to our list. To switch between pages, we press the buttons (2, 3, 4, etc.) located at the bottom of the page (look at the picture below).
The best way to click these buttons is the function driver.find_element_by_link_text. For instance, to click on page 2, we write the following line of code:
driver.find_element_by_link_text("2").click()
Here's the final code, which collects all the apartment urls from the first two pages:
rooms_london = []
driver = webdriver.Firefox()
driver.get("https://ru.airbnb.com/")
driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__input").send_keys("London, United Kingdom")
time.sleep(2)
driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__option-0").click()
time.sleep(2)
driver.find_element_by_id("checkin_input").click()
for i in np.arange(0,3):
    driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
time.sleep(2)
driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[3]/td[5]").click()
time.sleep(2)
driver.find_element_by_id("checkout_input").click()
for i in np.arange(0,2):
    driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
time.sleep(2)
driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[4]/td[4]").click()
time.sleep(5)
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
time.sleep(2)
for i in np.arange(0,2):
    driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/div/div/div/div[1]/div/div/div/div[2]/div/div[3]/button").click()
time.sleep(2)
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[4]/div/button").click()
time.sleep(5)
driver.find_element_by_xpath("/html/body/div[4]/div/main/div/div[2]/div/div/div/div/div[3]/div/div/div[2]/button").click()
time.sleep(2)
soup=BeautifulSoup(driver.page_source, 'lxml')
for a in soup.find_all('a', href=re.compile("rooms/[0-9]+")):
    rooms_london.append(a['href'])
time.sleep(2)
driver.find_element_by_link_text("2").click()
time.sleep(2)
soup=BeautifulSoup(driver.page_source, 'lxml')
for a in soup.find_all('a', href=re.compile("rooms/[0-9]+")):
    rooms_london.append(a['href'])
Finally, we have 36 unique urls:
len(set(rooms_london))
36
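A small aside: set() deduplicates but does not preserve the order in which the offers appeared, which is why the extraction loop below uses pd.Series(...).unique() instead. With only the standard library, an order-preserving dedupe looks like this (the urls are shortened dummies for illustration):

```python
# shortened dummy urls for illustration
rooms_london = ['/rooms/1', '/rooms/2', '/rooms/1', '/rooms/3', '/rooms/2']

# dict keys preserve insertion order (Python 3.7+), so this deduplicates
# while keeping the first occurrence of each url
unique_rooms = list(dict.fromkeys(rooms_london))
print(unique_rooms)  # ['/rooms/1', '/rooms/2', '/rooms/3']
```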
Let's look at one of the urls. Did we collect what we wanted?
set(rooms_london).pop()
'/rooms/18556441?location=London%2C%20United%20Kingdom&adults=2&children=0&infants=0&guests=2&toddlers=0&check_in=2019-03-15&check_out=2019-05-23'
Yes, it seems so.
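We can also verify the query string programmatically rather than by eye: urlparse and parse_qs from the standard library split the url and decode the percent-encoded parameters for us.

```python
from urllib.parse import urlparse, parse_qs

# the url collected above
url = ('/rooms/18556441?location=London%2C%20United%20Kingdom&adults=2'
       '&children=0&infants=0&guests=2&toddlers=0'
       '&check_in=2019-03-15&check_out=2019-05-23')

params = parse_qs(urlparse(url).query)
print(params['location'])  # ['London, United Kingdom'] -- already decoded
print(params['check_in'])  # ['2019-03-15']
```

The decoded parameters match our query exactly: London, 2 guests, and the right check-in and check-out dates.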
In the next part of the code, we will extract the data we need from these urls. For this script, we again need Selenium, as not all the elements of the web pages are loaded from the server at once and we need to wait a bit.
Let's do that for 5 rooms. In a loop, we pass the url of a room to the Selenium driver, wait until all the elements are loaded, and extract the title of the offer, the details mentioned by the host, and the price. All these elements are saved into dictionaries, which are appended to a predefined list.
driver = webdriver.Firefox()
rooms_info = []
for i in pd.Series(rooms_london).unique()[1:6]:
    url = "https://ru.airbnb.com" + i
    driver.get(url)
    time.sleep(2)
    room = {} # create a blank dictionary
    soup = BeautifulSoup(driver.page_source, 'lxml')
    summary = soup.find(id="summary").get_text()
    summary = re.sub("[^A-Za-z]", " ", summary) # remove all non-English characters
    room["title"] = re.sub("\\s+", " ", summary).strip() # remove extra whitespaces
    details = soup.find(id="details").get_text()
    details = re.sub("[^A-Za-z]", " ", details).strip()
    room["details"] = re.sub("translated by Google", " ", details).strip()
    book = soup.find(id="book_it_form").get_text()
    book = re.sub("\s", "", book) # remove whitespaces
    room["price"] = re.sub("Итого", " ", pd.Series(book).str.extract("(Итого€[0-9]{3,5})")[0][0]).strip() # extract final price (Итого == Total)
    room["url"] = url
    rooms_info.append(room)
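The trickiest step above is the price extraction, so it is worth checking the regular expression in isolation. The book string below is a hypothetical example of what the booking form text looks like after the whitespace is stripped (Итого means Total):

```python
import re

# hypothetical booking-form text after whitespace removal
book = "€1016x2ночи€2032Итого€2032Сервисныйсбор€150"

# same idea as above: grab "Итого" plus 3-5 digits, then drop the word itself
m = re.search("Итого€[0-9]{3,5}", book)
price = re.sub("Итого", " ", m.group(0)).strip()
print(price)  # €2032
```

Note that the pattern anchors on the word Итого, so it picks the final total rather than the nightly rate or the service fee.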
Now we can transform our list of dictionaries into a dataframe and check the results.
df = pd.DataFrame.from_dict(rooms_info)
df
| | details | price | title | url |
|---|---|---|---|---|
| 0 | Findmyguy boutique spaces features elegant acc... | €2032 | England Lifestyle choice seize this moment Chen | https://ru.airbnb.com/rooms/29149919?location=... |
| 1 | bakerloo ... | €5119 | England Beautiful Victorian apartment Queens P... | https://ru.airbnb.com/rooms/27688254?location=... |
| 2 | BRIGHT AIRY ... | €1999 | England Master Bedroom Dence House Sean | https://ru.airbnb.com/rooms/17729904?location=... |
| 3 | Whitchapel ... | €2322 | England Spacious double bedroom with large com... | https://ru.airbnb.com/rooms/20584648?location=... |
| 4 | Cafe s Costa Coffe Baskin Robins ... | €3034 | England Double Room in Modern Apartment close ... | https://ru.airbnb.com/rooms/14786978?location=... |
Looks good!
Now we can write the final function covering all the steps. In this section of the tutorial, we will write a function that works for any query and gives the user the needed information on apartments. The function takes the following arguments: city, country, check-in date, check-out date, number of guests, and the number of pages the script needs to go through.
def airbnb_scrape(city, country, check_in, check_out, guests, pages):
    location = city + ", " + country # join city and country
    cin = datetime.datetime.strptime(check_in, '%d-%m-%Y') # transform string into datetime format
    cout = datetime.datetime.strptime(check_out, '%d-%m-%Y')
    diff_cin_cout = relativedelta.relativedelta(cout, cin).months # difference between check_in and check_out in months
    now = datetime.datetime.now() # today's date
    diff_cin_now = relativedelta.relativedelta(cin, now).months # difference between check-in and today's date
    first_day_in = datetime.datetime.strptime("01" + check_in[2:], '%d-%m-%Y') # first day of the check-in month; we need it to define the number of the week
    first_day_out = datetime.datetime.strptime("01" + check_out[2:], '%d-%m-%Y') # first day of the check-out month
    weekday_in = cin.weekday() + 1 # day of week for check_in
    weekday_out = cout.weekday() + 1 # day of week for check_out
    week_in = (cin.isocalendar()[1] - first_day_in.isocalendar()[1]) + 1 # number of the week for check_in
    week_out = (cout.isocalendar()[1] - first_day_out.isocalendar()[1]) + 1 # number of the week for check_out
    # here, we build the XPaths from the week and weekday numbers
    week_day_xp_in = "//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[" + str(week_in) + "]/td[" + str(weekday_in) + "]"
    week_day_xp_out = "//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[2]/div/table/tbody/tr[" + str(week_out) + "]/td[" + str(weekday_out) + "]"
    rooms = []
    rooms_info = []
    # the following script you already know;
    # in some places I added conditional statements depending on the choices of the user
    driver = webdriver.Firefox()
    driver.get("https://ru.airbnb.com/")
    driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__input").send_keys(location)
    time.sleep(2)
    driver.find_element_by_id("Koan-magic-carpet-koan-search-bar__option-0").click()
    time.sleep(2)
    driver.find_element_by_id("checkin_input").click()
    if diff_cin_now > 0:
        for i in np.arange(0, diff_cin_now + 1):
            driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
        time.sleep(2)
        driver.find_element_by_xpath(week_day_xp_in).click()
    else:
        driver.find_element_by_xpath(week_day_xp_in).click()
    driver.find_element_by_id("checkout_input").click()
    if diff_cin_cout > 0:
        for i in np.arange(0, diff_cin_cout):
            driver.find_element_by_xpath("//*[@id='MagicCarpetSearchBar']/div[2]/div/div/div[2]/div/div/div/div/div/div[2]/div[1]/div[2]").click()
        time.sleep(2)
        driver.find_element_by_xpath(week_day_xp_out).click()
    else:
        driver.find_element_by_xpath(week_day_xp_out).click()
    if guests > 1:
        time.sleep(5)
        driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
        for i in np.arange(0, guests):
            driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/div/div/div/div[1]/div/div/div/div[2]/div/div[3]/button").click()
        time.sleep(2)
        driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[3]/div[2]/button").click()
    driver.find_element_by_xpath("/html/body/div[4]/div/main/section/div/div/div[2]/div/div/div/div[1]/div[3]/div/form/div[4]/div/button").click()
    time.sleep(5)
    driver.find_element_by_xpath("/html/body/div[4]/div/main/div/div[2]/div/div/div/div/div[3]/div/div/div[2]/button").click()
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for a in soup.find_all('a', href=re.compile("rooms/[0-9]+")):
        rooms.append(a['href'])
    time.sleep(2)
    if pages > 1:
        for i in np.arange(2, pages + 1):
            driver.find_element_by_link_text(str(i)).click()
            time.sleep(2)
            soup = BeautifulSoup(driver.page_source, 'lxml')
            for a in soup.find_all('a', href=re.compile("rooms/[0-9]+")):
                rooms.append(a['href'])
    for i in pd.Series(rooms).unique():
        room = {}
        url = "https://ru.airbnb.com" + i
        driver.get(url)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        summary = soup.find(id="summary").get_text()
        summary = re.sub("[^A-Za-z]", " ", summary) # remove all non-English characters
        room["title"] = re.sub("\\s+", " ", summary).strip() # remove extra whitespaces
        details = soup.find(id="details").get_text()
        details = re.sub("[^A-Za-z]", " ", details).strip()
        room["details"] = re.sub("translated by Google", " ", details).strip()
        book = soup.find(id="book_it_form").get_text()
        book = re.sub("\s", "", book)
        room["price"] = re.sub("Итого", " ", pd.Series(book).str.extract("(Итого€[0-9]{3,5})")[0][0]).strip() # extract final price (Итого == Total)
        room["url"] = url
        rooms_info.append(room)
    df = pd.DataFrame.from_dict(rooms_info)
    return df
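A quick sanity check of the date arithmetic used above: relativedelta(cout, cin).months gives the whole-month component of the difference, which drives the number of month-switching clicks. For trips shorter than a year, the same number can be reproduced with the standard library alone (months_between is a hypothetical helper for illustration, not part of dateutil):

```python
import datetime

def months_between(d1, d2):
    # whole months from d1 to d2, assuming the gap is under a year;
    # mirrors relativedelta.relativedelta(d2, d1).months in that range
    months = (d2.year - d1.year) * 12 + (d2.month - d1.month)
    if d2.day < d1.day:  # the last month is not yet complete
        months -= 1
    return months

cin = datetime.datetime.strptime("15-03-2019", "%d-%m-%Y")
cout = datetime.datetime.strptime("23-05-2019", "%d-%m-%Y")
print(months_between(cin, cout))  # 2
```

For our London trip this gives 2, i.e. two extra clicks on the calendar's next-month button between check-in and check-out.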
Let's check how our function works on another query. You and two of your friends are going to Berlin from the 2nd of April to the 9th of April. You want to go through the first two pages to check the offers on Airbnb.
berlin_df = airbnb_scrape("Berlin", "Germany", "02-04-2019", "09-04-2019", 3, 2)
As in the previous case, we collected 36 offers.
berlin_df.shape
(36, 4)
Here's our data
berlin_df.head()
| | details | price | title | url |
|---|---|---|---|---|
| 0 | U Bahn ... | €775 | beautiful flat very central Hubert | https://ru.airbnb.com/rooms/2662541?location=B... |
| 1 | S Bahnstation Berlin S dkreuz Tempelh... | €682 | Bright City Apartment with Garden and two Bedr... | https://ru.airbnb.com/rooms/15801211?location=... |
| 2 | S Bahn ... | €458 | Kleine Parterre Wohnung f r drei Leute in Neuk... | https://ru.airbnb.com/rooms/19553067?location=... |
| 3 | Eisenacher Str corner Nollendorfstra e ... | €448 | City Apartment Berlin near Nollendorfplatz Gis... | https://ru.airbnb.com/rooms/19742792?location=... |
| 4 | Bright clean stylish functional and comfort... | €564 | Beautiful home in Kreuzk lln Daniel | https://ru.airbnb.com/rooms/921223?location=Be... |
Let's look at urls. Did we collect what we wanted?
print(berlin_df.url[0])
https://ru.airbnb.com/rooms/2662541?location=Berlin%2C%20Germany&adults=3&children=0&infants=0&guests=3&toddlers=0&check_in=2019-04-02&check_out=2019-04-09
Yes! This url certainly satisfies our query!
So, that's it for this tutorial. I really enjoyed working with Selenium and advise you to explore what it can do as well.
Thanks for your attention!