In this Jupyter notebook I explain how to scrape content from a website using the BeautifulSoup and Requests libraries.
Please note that a website may have policies and rules governing the use of its data, so read its data usage policy before you scrape it.
For this article I scrape data from the www.citypopulation.de website, which publishes population statistics for different countries.
Data use policy: http://citypopulation.de/termsofuse.html (DATA -> Population Data)
!python --version
Python 3.7.0
!pip --version
pip 19.3.1 from c:\python37\lib\site-packages\pip (python 3.7)
# Importing libraries
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
print('Requests version: {}'.format(requests.__version__))
print('BeautifulSoup version: {}'.format(bs4.__version__))
print('Pandas version: {}'.format(pd.__version__))
Requests version: 2.19.1
BeautifulSoup version: 4.7.1
Pandas version: 0.23.4
# URLs to scrape
# This dictionary maps each island to its URL.
# We will use it to scrape one URL at a time.
urls = {
    'north': 'http://citypopulation.de/en/newzealand/northisland',
    'south': 'http://citypopulation.de/en/newzealand/southisland/'
}
# Using requests to get the information
output = requests.get(urls['north'])
print(output)
<Response [200]>
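A request can also hang or come back with an error status. The sketch below is one hedged way to guard against both; fetch_html is a hypothetical helper name and the 10-second timeout is an assumed value, neither comes from the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not part of the notebook)."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html(urls['north']) would then return the same string as output.text, or raise an exception instead of silently continuing with an error page.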
# What's in the output?
# Let's show the first 200 characters
output.text[:200]
'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8">\r\n<meta name="description" content="North Island (New Zealand): Regions & Settlements with population statistics, charts and maps."'
print(len(output.text))
print(type(output.text))
233654
<class 'str'>
bs_output = BeautifulSoup(markup=output.text, features="html.parser")
len(bs_output.contents)
7
type(bs_output.contents)
list
bs_output.contents[:2]
['html', '\n']
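The beauty of BeautifulSoup's parser is that you can interact with each element of the HTML, including tags, classes and id values.
You might wonder what the difference is between requests' output.text and bs4's bs_output.contents. output.text is one plain string, whereas bs_output is a parsed tree whose contents is a list of nodes that you can search by tag and attribute.
Example below: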
bs_output.find_all('a')[:5]
[<a href="/">Home</a>, <a href="/Oceania.html" itemprop="url"><span itemprop="name">Oceania</span></a>, <a href="/en/newzealand/" itemprop="url"><span itemprop="name">New Zealand</span></a>, <a href="javascript:cp.changePageLang('en','de')"><img alt="" src="/images/icons/de.svg" title="Deutsch"/></a>, <a href="javascript:openMap()"><img alt="Show Map" id="smap" src="/images/smaps/newzealand-cities.png" title="Show Map"/></a>]
I've printed the first 5 items of the list returned by BeautifulSoup's find_all function. I passed 'a' as the tag name to find all <a> (link) elements in bs_output. Likewise, you can extract and explore any other HTML tags and their contents.
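To make the tag and attribute access concrete, here is a minimal sketch on a tiny HTML snippet. The snippet and its class name are made up for illustration and are not taken from the website.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration
html = """
<div id="nav">
  <a href="/">Home</a>
  <a href="/en/newzealand/" class="country">New Zealand</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; each tag exposes its text and attributes
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

# find returns only the first match; attrs filters on attribute values
country = soup.find(name="a", attrs={"class": "country"})

print(links)
print(country["href"])
```

The same find and find_all calls work on bs_output, only with many more tags to search through.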
# Our data is in the <table> tag with id='ts'
#
table_output = bs_output.find(name='table', attrs={'id': 'ts'})
table_output.contents[:2] # prints a list of tag elements
['\n', <thead> <tr id="tsh"><th class="rname" data-coltype="name" onclick="javascript:sort('ts',0,false)"><a href="javascript:sort('ts',0,false)">Name</a></th> <th class="rstatus" data-coltype="status" onclick="javascript:sort('ts',1,false)"><a href="javascript:sort('ts',1,false)">Status</a></th><th class="radm rarea" data-coltype="adm" onclick="javascript:sort('ts',2,false)"><a href="javascript:sort('ts',2,false)">Region</a></th><th class="rpop prio5" data-coldate="1996-06-30" data-colhead="E 1996-06-30" data-coltype="pop" onclick="javascript:sort('ts',3,true)"><a href="javascript:sort('ts',3,true)">Population</a><br/><span class="unit">Estimate<br/>1996-06-30</span></th><th class="rpop prio4" data-coldate="2001-06-30" data-colhead="E 2001-06-30" data-coltype="pop" onclick="javascript:sort('ts',4,true)"><a href="javascript:sort('ts',4,true)">Population</a><br/><span class="unit">Estimate<br/>2001-06-30</span></th><th class="rpop prio3" data-coldate="2006-06-30" data-colhead="E 2006-06-30" data-coltype="pop" onclick="javascript:sort('ts',5,true)"><a href="javascript:sort('ts',5,true)">Population</a><br/><span class="unit">Estimate<br/>2006-06-30</span></th><th class="rpop prio2" data-coldate="2013-06-30" data-colhead="E 2013-06-30" data-coltype="pop" onclick="javascript:sort('ts',6,true)"><a href="javascript:sort('ts',6,true)">Population</a><br/><span class="unit">Estimate<br/>2013-06-30</span></th><th class="rpop prio1" data-coldate="2018-06-30" data-colhead="E 2018-06-30" data-coltype="pop" onclick="javascript:sort('ts',7,true)"><a href="javascript:sort('ts',7,true)">Population</a><br/><span class="unit">Estimate<br/>2018-06-30</span></th><th class="sc" data-coltype="other"> </th></tr> </thead>]
# Extracting column names from the <th> tags
[x.text for x in table_output.find_all('th')] # outputs a list of values of the <th> elements in the table output
['Name', 'Status', 'Region', 'PopulationEstimate1996-06-30', 'PopulationEstimate2001-06-30', 'PopulationEstimate2006-06-30', 'PopulationEstimate2013-06-30', 'PopulationEstimate2018-06-30', '\xa0']
# Store the column names in table_columns, dropping the last (empty) header cell
#
table_columns = [x.get_text() for x in table_output.find_all('th')][:-1]
table_columns
['Name', 'Status', 'Region', 'PopulationEstimate1996-06-30', 'PopulationEstimate2001-06-30', 'PopulationEstimate2006-06-30', 'PopulationEstimate2013-06-30', 'PopulationEstimate2018-06-30']
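The names come out as 'PopulationEstimate2018-06-30' because get_text() joins the pieces of text inside each <th> with nothing in between. If more readable names are wanted, a separator can be passed. The sketch below uses one header cell from the table markup shown earlier, with its attributes trimmed and the href replaced by a placeholder.

```python
from bs4 import BeautifulSoup

# One header cell based on the <thead> output above (attributes trimmed, href is a placeholder)
th_html = (
    '<th class="rpop prio1"><a href="#">Population</a><br/>'
    '<span class="unit">Estimate<br/>2018-06-30</span></th>'
)
th = BeautifulSoup(th_html, "html.parser").find("th")

# Default: the text pieces are concatenated without any separator
joined = th.get_text()

# separator inserts a space between pieces, strip trims whitespace around each one
spaced = th.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```

The rest of this notebook keeps the default names, so this is only an optional variation.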
# Extracting table output which is in <tbody> tag
#
table_body = table_output.find_all('tbody')
north_island_output = []
for item in table_body:
    rows = item.find_all('tr') # extracts <tr> elements in <tbody>
    for row in rows:
        td = row.find_all('td') # extracts <td> elements in each row i.e. <tr>
        td_values = [val.text for val in td] # extracts value of each <td>
        north_island_output.append(td_values) # appends values to the list
north_island_output[1]
['Algies Bay', 'Rural Settlement', 'Auckland', '550', '690', '800', '870', '980', '→']
north_island_output = []
for item in table_body:
    rows = item.find_all('tr') # extracts <tr> elements in <tbody>
    for row in rows:
        td = row.find_all('td') # extracts <td> elements in each row i.e. <tr>
        td_values = [val.text for val in td] # extracts value of each <td>
        north_island_output.append(td_values[:-1]) # appends values to the list, also excludes the last value that is not needed
north_island_output[1]
['Algies Bay', 'Rural Settlement', 'Auckland', '550', '690', '800', '870', '980']
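As a variation on slicing off the last value, each row can be paired with the column names. This is a different technique from the loop above, shown only as a sketch: zip stops at the shorter of its inputs, so the trailing arrow cell falls away on its own. The row markup and the shortened column list are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one <tr> of the real table
row_html = "<tr><td>Algies Bay</td><td>Rural Settlement</td><td>980</td><td>→</td></tr>"

# Shortened, illustrative header list without the blank last cell
columns = ["Name", "Status", "PopulationEstimate2018-06-30"]

row = BeautifulSoup(row_html, "html.parser").find("tr")
cells = [td.get_text() for td in row.find_all("td")]

# zip pairs names with cells and ignores the extra fourth cell
record = dict(zip(columns, cells))
print(record)
```

A list of such dictionaries can be passed straight to pd.DataFrame, which takes the column names from the keys.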
table_columns
['Name', 'Status', 'Region', 'PopulationEstimate1996-06-30', 'PopulationEstimate2001-06-30', 'PopulationEstimate2006-06-30', 'PopulationEstimate2013-06-30', 'PopulationEstimate2018-06-30']
north_data = pd.DataFrame(
    data=north_island_output,
    columns=table_columns
)
north_data.head()
| | Name | Status | Region | PopulationEstimate1996-06-30 | PopulationEstimate2001-06-30 | PopulationEstimate2006-06-30 | PopulationEstimate2013-06-30 | PopulationEstimate2018-06-30 |
---|---|---|---|---|---|---|---|---|
0 | Ahipara | Rural Settlement | Northland | 930 | 1,050 | 1,120 | 1,130 | 1,180 |
1 | Algies Bay | Rural Settlement | Auckland | 550 | 690 | 800 | 870 | 980 |
2 | Arapuni | Rural Settlement | Waikato | 290 | 260 | 230 | 250 | 260 |
3 | Ashhurst | Small Urban Area | Manawatu-Wanganui | 2,530 | 2,520 | 2,510 | 2,750 | 2,990 |
4 | Athenree | Rural Settlement | Bay of Plenty | 510 | 530 | 630 | 700 | 740 |
# Importing libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Variables
south_island_output = []
north_island_output = []
table_columns = []
# URLs for North and South islands
urls = {
    'north': 'http://citypopulation.de/en/newzealand/northisland',
    'south': 'http://citypopulation.de/en/newzealand/southisland/'
}
# Function that downloads the data
def download_data():
    """
    Function extracts td values by looping each child element of the parent.
    Two empty lists south_island_output and north_island_output are initialised.
    A urls dictionary object with north island and south island urls is also initialised.
    Pseudo code:
    - for each item in the dictionary
        - connect to the url
        - if success (response code == 200), then loop through the page data
        - for each row_item in body (loop - look for <tr> element):
            - for each row in the row_item (loop and look for <td> element):
                - for each <td> element, extract the text value
                - finally append those text values into output list
    """
    for url in urls:
        print(url, urls[url])
        ## response
        response = requests.get(urls[url])
        if response.status_code == 200:
            print('Response code is 200. Success!')
            try:
                ## web scraping
                soup = BeautifulSoup(response.text, "html.parser")
                table = soup.find(name='table', attrs={'id': 'ts'})
                table_columns.append([x.get_text() for x in table.find_all('th')][:-1])
                body = table.find_all('tbody')
                for item in body:
                    rows = item.find_all('tr')
                    for row in rows:
                        td = row.find_all('td')
                        td_values = [val.text for val in td]
                        if url == 'north':
                            north_island_output.append(td_values[:-1]) # excluding last column that has an arrow as a value
                        else:
                            south_island_output.append(td_values[:-1]) # excluding last column that has an arrow as a value
            except Exception as ex:
                print(str(ex))
        else:
            print('Oops! {0}'.format(response.status_code))
download_data()
north http://citypopulation.de/en/newzealand/northisland
Response code is 200. Success!
south http://citypopulation.de/en/newzealand/southisland/
Response code is 200. Success!
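download_data fills module-level lists, which works in a notebook but makes the parsing hard to reuse or test. One possible alternative, sketched below, is a function that takes the HTML text and returns a DataFrame. table_to_frame is a hypothetical name, and the sample HTML is made up to mimic the layout of the real table.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html, table_id="ts"):
    """Parse the table with the given id into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(name="table", attrs={"id": table_id})
    if table is None:
        raise ValueError("no table with id={!r}".format(table_id))
    # The last header cell and the last cell of each row are dropped, as in download_data
    columns = [th.get_text() for th in table.find_all("th")][:-1]
    rows = [
        [td.get_text() for td in tr.find_all("td")][:-1]
        for tbody in table.find_all("tbody")
        for tr in tbody.find_all("tr")
    ]
    return pd.DataFrame(data=rows, columns=columns)


# Made-up sample standing in for response.text
sample = """
<table id="ts">
  <thead><tr><th>Name</th><th>Region</th><th>Population</th><th>&nbsp;</th></tr></thead>
  <tbody>
    <tr><td>Ahaura</td><td>West Coast</td><td>80</td><td>→</td></tr>
    <tr><td>Akaroa</td><td>Canterbury</td><td>630</td><td>→</td></tr>
  </tbody>
</table>
"""
frame = table_to_frame(sample)
print(frame)
```

With this shape, each island's DataFrame could be built by passing the text of its response to the function, without any shared lists.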
table_columns[0]
['Name', 'Status', 'Region', 'PopulationEstimate1996-06-30', 'PopulationEstimate2001-06-30', 'PopulationEstimate2006-06-30', 'PopulationEstimate2013-06-30', 'PopulationEstimate2018-06-30']
# North island dataframe
north_island_data = pd.DataFrame(data=north_island_output, columns=table_columns[0])
# South island dataframe
south_island_data = pd.DataFrame(data=south_island_output, columns=table_columns[0])
north_island_data.head()
| | Name | Status | Region | PopulationEstimate1996-06-30 | PopulationEstimate2001-06-30 | PopulationEstimate2006-06-30 | PopulationEstimate2013-06-30 | PopulationEstimate2018-06-30 |
---|---|---|---|---|---|---|---|---|
0 | Ahipara | Rural Settlement | Northland | 930 | 1,050 | 1,120 | 1,130 | 1,180 |
1 | Algies Bay | Rural Settlement | Auckland | 550 | 690 | 800 | 870 | 980 |
2 | Arapuni | Rural Settlement | Waikato | 290 | 260 | 230 | 250 | 260 |
3 | Ashhurst | Small Urban Area | Manawatu-Wanganui | 2,530 | 2,520 | 2,510 | 2,750 | 2,990 |
4 | Athenree | Rural Settlement | Bay of Plenty | 510 | 530 | 630 | 700 | 740 |
south_island_data.head()
| | Name | Status | Region | PopulationEstimate1996-06-30 | PopulationEstimate2001-06-30 | PopulationEstimate2006-06-30 | PopulationEstimate2013-06-30 | PopulationEstimate2018-06-30 |
---|---|---|---|---|---|---|---|---|
0 | Ahaura | Rural Settlement | West Coast | 120 | 140 | 110 | 100 | 80 |
1 | Akaroa | Rural Settlement | Canterbury | 680 | 610 | 620 | 670 | 630 |
2 | Alexandra | Small Urban Area | Otago | 4,690 | 4,480 | 4,940 | 4,920 | 5,510 |
3 | Allanton | Rural Settlement | Otago | 220 | 240 | 260 | 260 | 290 |
4 | Amberley | Small Urban Area | Canterbury | 1,050 | 1,160 | 1,340 | 1,620 | 1,800 |
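# .copy() makes independent DataFrames, so the column assignments further down avoid pandas' SettingWithCopyWarning
north_island_data = north_island_data[['Name', 'Status', 'Region', 'PopulationEstimate2018-06-30']].copy()
south_island_data = south_island_data[['Name', 'Status', 'Region', 'PopulationEstimate2018-06-30']].copy()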
north_island_data.head(2)
| | Name | Status | Region | PopulationEstimate2018-06-30 |
---|---|---|---|---|
0 | Ahipara | Rural Settlement | Northland | 1,180 |
1 | Algies Bay | Rural Settlement | Auckland | 980 |
south_island_data.head(2)
| | Name | Status | Region | PopulationEstimate2018-06-30 |
---|---|---|---|---|
0 | Ahaura | Rural Settlement | West Coast | 80 |
1 | Akaroa | Rural Settlement | Canterbury | 630 |
north_island_data.shape
(350, 4)
The population values contain "," as a thousands separator. Let's remove the commas and convert the column from string to integer type.
north_island_data.dtypes
Name                            object
Status                          object
Region                          object
PopulationEstimate2018-06-30    object
dtype: object
north_island_data['PopulationEstimate2018-06-30'] = north_island_data['PopulationEstimate2018-06-30'].str.replace(',', '')
north_island_data.head()
| | Name | Status | Region | PopulationEstimate2018-06-30 |
---|---|---|---|---|
0 | Ahipara | Rural Settlement | Northland | 1180 |
1 | Algies Bay | Rural Settlement | Auckland | 980 |
2 | Arapuni | Rural Settlement | Waikato | 260 |
3 | Ashhurst | Small Urban Area | Manawatu-Wanganui | 2990 |
4 | Athenree | Rural Settlement | Bay of Plenty | 740 |
north_island_data['PopulationEstimate2018-06-30'] = pd.to_numeric(north_island_data['PopulationEstimate2018-06-30'])
north_island_data.dtypes
Name                            object
Status                          object
Region                          object
PopulationEstimate2018-06-30     int64
dtype: object
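The comma removal and the numeric conversion can also be chained into one expression. The sketch below uses three values from the table above. The "..." placeholder in the second part is a hypothetical example of a non-numeric cell and does not appear in the scraped data shown here.

```python
import pandas as pd

# Three population strings taken from the North Island table
raw = pd.Series(["1,180", "980", "2,990"], name="PopulationEstimate2018-06-30")

# regex=False treats "," as a literal character rather than a pattern
clean = pd.to_numeric(raw.str.replace(",", "", regex=False))
print(clean.dtype)

# Assumption: if a cell held a placeholder instead of a number,
# errors="coerce" would turn it into NaN instead of raising
with_gap = pd.to_numeric(pd.Series(["1180", "..."]), errors="coerce")
print(with_gap)
```

Note that a column containing NaN is stored as float64, so the integer dtype is only kept when every value converts cleanly.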
north_island_data = north_island_data.rename(columns={'PopulationEstimate2018-06-30': 'PopulationEstimate2018'})
north_island_data.head(2)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
0 | Ahipara | Rural Settlement | Northland | 1180 |
1 | Algies Bay | Rural Settlement | Auckland | 980 |
north_island_data.sort_values(by='PopulationEstimate2018', ascending=False).head(10)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
6 | Auckland | Main Urban Area | Auckland | 1467800 |
333 | Wellington | Main Urban Area | Wellington | 215400 |
46 | Hamilton | Main Urban Area | Waikato | 169300 |
267 | Tauranga | Main Urban Area | Bay of Plenty | 135000 |
105 | Lower Hutt | Main Urban Area | Wellington | 104900 |
198 | Palmerston North | Large Urban Area | Manawatu-Wanganui | 80300 |
141 | Napier | Large Urban Area | Hawke's Bay | 62800 |
219 | Porirua | Large Urban Area | Wellington | 55500 |
143 | New Plymouth | Large Urban Area | Taranaki | 55300 |
241 | Rotorua | Large Urban Area | Bay of Plenty | 54500 |
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
top_10 = north_island_data.sort_values(by='PopulationEstimate2018', ascending=False).head(10)
plt.figure(figsize=(17, 6))
sns.barplot(x='Name', y='PopulationEstimate2018', data=top_10)
plt.title("Top 10 most populated places in North Island NZ")
plt.show()
north_island_data.sort_values(by='PopulationEstimate2018', ascending=True).head(10)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
20 | Castlepoint | Rural Settlement | Wellington | 50 |
328 | Waitotara | Rural Settlement | Taranaki | 60 |
234 | Raurimu | Rural Settlement | Manawatu-Wanganui | 70 |
309 | Waiinu Beach | Rural Settlement | Taranaki | 70 |
340 | Whangapoua | Rural Settlement | Waikato | 70 |
5 | Atiamuri | Rural Settlement | Waikato | 70 |
184 | Ormondville | Rural Settlement | Manawatu-Wanganui | 70 |
8 | Baddeleys Beach - Campbells Beach | Rural Settlement | Auckland | 70 |
324 | Waitangi | Rural Settlement | Northland | 80 |
230 | Rainbows End | Rural Settlement | Auckland | 80 |
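The sort_values(...).head(10) pattern used for both rankings has a shorter equivalent in pandas: nlargest and nsmallest. A minimal sketch on four rows copied from the tables above:

```python
import pandas as pd

# Four rows taken from the North Island top and bottom tables
sample = pd.DataFrame({
    "Name": ["Auckland", "Castlepoint", "Wellington", "Waitotara"],
    "PopulationEstimate2018": [1467800, 50, 215400, 60],
})

# Rows with the n highest values, already in descending order
top_2 = sample.nlargest(2, "PopulationEstimate2018")

# Rows with the n lowest values, already in ascending order
bottom_2 = sample.nsmallest(2, "PopulationEstimate2018")

print(top_2)
print(bottom_2)
```

Both methods keep the original index labels, just like the sorted output above.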
bottom_10 = north_island_data.sort_values(by='PopulationEstimate2018', ascending=True).head(10)
plt.figure(figsize=(27, 6))
sns.barplot(x='Name', y='PopulationEstimate2018', data=bottom_10)
plt.title("10 least populated places in North Island NZ")
plt.show()
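Long place names are the reason this figure needs to be so wide. A possible alternative, sketched here with plain matplotlib instead of seaborn, is a horizontal bar chart, which gives each label a full row. The three rows are copied from the table above, and the figure size and title are assumed values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the bottom-10 table above
bottom = pd.DataFrame({
    "Name": ["Castlepoint", "Waitotara", "Baddeleys Beach - Campbells Beach"],
    "PopulationEstimate2018": [50, 60, 70],
})

fig, ax = plt.subplots(figsize=(8, 3))
# barh puts the names on the y-axis, where long labels have room
ax.barh(bottom["Name"], bottom["PopulationEstimate2018"])
ax.set_xlabel("PopulationEstimate2018")
ax.set_title("Least populated places in North Island NZ (sample)")
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a script, plt.show() or fig.savefig would be needed.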
region_totals = north_island_data.groupby('Region')['PopulationEstimate2018'].agg(['sum', 'count'])
region_totals
Region | sum | count |
---|---|---|
Auckland | 1612730 | 59 |
Bay of Plenty | 259400 | 38 |
Gisborne | 39320 | 9 |
Hawke's Bay | 140620 | 21 |
Manawatu-Wanganui | 199140 | 50 |
Northland | 109850 | 58 |
Taranaki | 93240 | 22 |
Waikato | 353070 | 77 |
Wellington | 498300 | 16 |
plt.figure(figsize=(17, 6))
sns.barplot(x=region_totals.index, y='sum', data=region_totals)
plt.show()
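To see what the groupby and agg call computes, here is a small sketch on five rows copied from the North Island tables above. The sort_values step is an addition for readability and is not part of the original cell.

```python
import pandas as pd

# Five rows taken from the North Island tables
sample = pd.DataFrame({
    "Name": ["Auckland", "Algies Bay", "Rainbows End", "Ahipara", "Waitangi"],
    "Region": ["Auckland", "Auckland", "Auckland", "Northland", "Northland"],
    "PopulationEstimate2018": [1467800, 980, 80, 1180, 80],
})

totals = (
    sample.groupby("Region")["PopulationEstimate2018"]
    .agg(["sum", "count"])                   # one row per region: total population and number of places
    .sort_values(by="sum", ascending=False)  # largest region first
)
print(totals)
```

The result is indexed by Region, which is why the plot above can use region_totals.index for the x-axis.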
south_island_data.shape
(227, 4)
The population values contain "," as a thousands separator. Let's remove the commas and convert the column from string to integer type.
south_island_data.dtypes
Name                            object
Status                          object
Region                          object
PopulationEstimate2018-06-30    object
dtype: object
south_island_data['PopulationEstimate2018-06-30'] = south_island_data['PopulationEstimate2018-06-30'].str.replace(',', '')
south_island_data.head()
| | Name | Status | Region | PopulationEstimate2018-06-30 |
---|---|---|---|---|
0 | Ahaura | Rural Settlement | West Coast | 80 |
1 | Akaroa | Rural Settlement | Canterbury | 630 |
2 | Alexandra | Small Urban Area | Otago | 5510 |
3 | Allanton | Rural Settlement | Otago | 290 |
4 | Amberley | Small Urban Area | Canterbury | 1800 |
south_island_data['PopulationEstimate2018-06-30'] = pd.to_numeric(south_island_data['PopulationEstimate2018-06-30'])
south_island_data.dtypes
Name                            object
Status                          object
Region                          object
PopulationEstimate2018-06-30     int64
dtype: object
south_island_data = south_island_data.rename(columns={'PopulationEstimate2018-06-30': 'PopulationEstimate2018'})
south_island_data.head(2)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
0 | Ahaura | Rural Settlement | West Coast | 80 |
1 | Akaroa | Rural Settlement | Canterbury | 630 |
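The cleaning steps for the South Island repeat those for the North Island: remove the commas, convert to numbers, rename the column. They could be collected in one function, as in the sketch below. clean_population is a hypothetical name, and the sample rows are copied from the South Island table.

```python
import pandas as pd


def clean_population(frame, column="PopulationEstimate2018-06-30",
                     new_name="PopulationEstimate2018"):
    """Return a copy with the population column as integers under a shorter name (hypothetical helper)."""
    result = frame.copy()  # leave the caller's DataFrame untouched
    result[column] = pd.to_numeric(result[column].str.replace(",", "", regex=False))
    return result.rename(columns={column: new_name})


# Two rows copied from the South Island table
sample = pd.DataFrame({
    "Name": ["Ahaura", "Alexandra"],
    "PopulationEstimate2018-06-30": ["80", "5,510"],
})
cleaned = clean_population(sample)
print(cleaned)
```

Both island DataFrames could then be cleaned with one call each instead of three repeated statements.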
south_island_data.sort_values(by='PopulationEstimate2018', ascending=False).head(10)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
28 | Christchurch | Main Urban Area | Canterbury | 377200 |
40 | Dunedin | Main Urban Area | Otago | 104500 |
125 | Nelson | Large Urban Area | Nelson | 49300 |
73 | Invercargill | Large Urban Area | Southland | 48700 |
195 | Timaru | Medium Urban Area | Canterbury | 28300 |
19 | Blenheim | Medium Urban Area | Marlborough | 26400 |
11 | Ashburton | Medium Urban Area | Canterbury | 19600 |
158 | Rangiora | Medium Urban Area | Canterbury | 18400 |
166 | Rolleston | Medium Urban Area | Canterbury | 16350 |
154 | Queenstown | Medium Urban Area | Otago | 15650 |
top_10 = south_island_data.sort_values(by='PopulationEstimate2018', ascending=False).head(10)
plt.figure(figsize=(17, 6))
sns.barplot(x='Name', y='PopulationEstimate2018', data=top_10)
plt.title("Top 10 most populated places in South Island NZ")
plt.show()
south_island_data.sort_values(by='PopulationEstimate2018', ascending=True).head(10)
| | Name | Status | Region | PopulationEstimate2018 |
---|---|---|---|---|
59 | Haast | Rural Settlement | West Coast | 50 |
9 | Arthur's Pass | Rural Settlement | Canterbury | 60 |
127 | Ngakuta Bay | Rural Settlement | Marlborough | 60 |
150 | Pounawea | Rural Settlement | Otago | 60 |
112 | Milford Huts | Rural Settlement | Canterbury | 60 |
153 | Purau | Rural Settlement | Canterbury | 60 |
116 | Moana | Rural Settlement | West Coast | 70 |
174 | Selwyn Huts | Rural Settlement | Canterbury | 70 |
103 | Makikihi | Rural Settlement | Canterbury | 80 |
209 | Waipopo | Rural Settlement | Canterbury | 80 |
bottom_10 = south_island_data.sort_values(by='PopulationEstimate2018', ascending=True).head(10)
plt.figure(figsize=(27, 6))
sns.barplot(x='Name', y='PopulationEstimate2018', data=bottom_10)
plt.title("10 least populated places in South Island NZ")
plt.show()
region_totals = south_island_data.groupby('Region')['PopulationEstimate2018'].agg(['sum', 'count'])
region_totals
Region | sum | count |
---|---|---|
Canterbury | 544370 | 87 |
Marlborough | 37490 | 17 |
Nelson | 49300 | 1 |
Otago | 200050 | 58 |
Southland | 72450 | 23 |
Tasman | 35640 | 19 |
West Coast | 23000 | 22 |
plt.figure(figsize=(17, 6))
sns.barplot(x=region_totals.index, y='sum', data=region_totals)
plt.show()