Have you ever found yourself in this situation: you've grasped the basics of Python, you know how `for` loops work, you know what regexes are, and you've solved dozens of practice problems. Then you come up with a nice idea for a personal project and suddenly discover that you don't actually know how to put Python to work. How do you combine all your knowledge? Where do you start?
In this project, I will address this problem and try to write down my thinking and decision-making processes. It's worth noting that the project is already complete, so it does not show the intermediate steps I took while writing the final code. However, wherever possible, I will point out the best practices I used to arrive at the final code.
Note: this project assumes that you have some basic knowledge of Python, JSON, and the `requests` and `BeautifulSoup` libraries.
In this project, we are going to reproduce the full workflow of creating a dataset: from web scraping and API calls to data cleaning. In particular, we are going to scrape the metadata of a collection of books from the Blackwell's Bookshop website; we will then create a dataset, clean it, and export it to a `csv` file. We will extract the following metadata:
To achieve this goal we will:

- use the `BeautifulSoup` library to scrape the content of the HTML pages of the website
- use the `requests-cache` library to store the data locally

Let's first of all import the necessary libraries. And here is the first problem you may encounter: which packages will you need in your project? My advice is not to concentrate too much on figuring out which libraries and functions you will use, but to initially import just the skeleton of the packages you will need.
For example, you will certainly scrape data, so you will need the `BeautifulSoup` package. You will also work with dataframes, so import `pandas`. In the beginning, I had no idea that the Blackwell's Bookshop website had an API, so I did not think about working with the `json` library.
import json
import re
from datetime import timedelta
from random import randint
from time import sleep
import numpy as np
import pandas as pd
import requests as rq
import requests_cache
from bs4 import BeautifulSoup
from IPython.display import clear_output
Before we dive into scraping this website we will inspect the URL. If we go to the "Art and Design" category on the Blackwell website we'll see that the URL is the following:
https://blackwells.co.uk/bookshop/category/_artanddesign/
This URL consists of:

- `https://blackwells.co.uk/`, which includes a protocol (`https://`) and the base URL, `blackwells.co.uk`
- three subdirectories, `bookshop`, `category` and `_artanddesign`, whose names are self-explanatory

What we will be interested in is the last subdirectory, found in the `category` directory, because it changes based on which category we are currently in. For example, if we go to the Biography category, the category's name in the URL will change to `_biography`.
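This URL pattern can be sketched as a simple format string (a quick illustration, not part of the scraper itself):

```python
# Build a category URL from the pattern described above.
BASE = "https://blackwells.co.uk/bookshop/category/"


def category_url(category):
    # The last subdirectory is an underscore plus the category name
    return "{}_{}".format(BASE, category)


print(category_url("artanddesign"))  # https://blackwells.co.uk/bookshop/category/_artanddesign
print(category_url("biography"))     # https://blackwells.co.uk/bookshop/category/_biography
```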
Now follow me. Seems easy, right? But how did I figure it out? It may seem obvious once you're done, or when you have a lot of experience, but for someone who has just started coding (independently) it can be a real pain.
And here is the thing: we humans like categorizing things. We like patterns. So, in many cases, you will have to search for a pattern, something that repeats over and over again. In Python, you will use a lot of `while` and `for` loops to repeat the same code multiple times; that's how you can free up your time when facing repetitive tasks (just write a script!).
Now you should clearly see the pattern on the Blackwell's website: we have categories, and in each category, every page has exactly the same structure: many books with (almost) the same metadata.
Before going on, here are a couple of useful pieces of advice:
Hence, I created a function that accepts two arguments:

- `payload`, the query parameters of our request to the server
- `category`, a book's category

This function will return the `Response` object that we will parse with the `BeautifulSoup` library to get the HTML text of the page.
# Return the Response object from a page in a category
# Payload is the page number
def blackwell_get_category_page(payload, category):
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0"
}
url = "https://blackwells.co.uk/bookshop/category/_{}".format(category)
response = rq.get(url, params=payload, headers=headers)
return response
Now let's move to inspecting the HTML structure.
If we inspect an element with a book's metadata, we'll see that the entire container with this data sits between `<div>` tags and has the CSS class `product-info` (more patterns!). To find this out, you can inspect a book's element with your browser (just right-click and select "Inspect Element" if you are using Firefox). Another way is to go to the page source and search for any book title, author, and so on. Since many browsers adapt HTML to be human-readable, it's much easier to find the information there.
So, we can get each of these containers on each page and extract the data we need. However, we have some problems...
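As a quick sketch (the HTML fragment below is invented for illustration, not taken from the real site), this is how `find_all` picks up every `product-info` container:

```python
from bs4 import BeautifulSoup

# A minimal made-up fragment with two product-info containers,
# mimicking the structure we see on the real category pages
html = """
<div class="product-info"><h4><a class="product-name" href="/a">Book A</a></h4></div>
<div class="product-info"><h4><a class="product-name" href="/b">Book B</a></h4></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find every <div> with the product-info CSS class
containers = soup.find_all("div", class_="product-info")
print(len(containers))  # 2
```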
In the raw HTML, many authors' names and titles are truncated (returning, for example, 'Wildlife Photographer of the Ye...'): they are simply not written out in full. To extract full titles and authors we can use regular expressions: in fact, for each book, we have a link leading to the book's page of this form:

/bookshop/product/Steps-Towards-a-Small-Theory-of-the-Visible-by-John-Berger-author/9780241472873

We can get this information directly from the link, since it follows a predictable pattern that we can match with a regular expression: the book's title comes after `/product/` and the author's name after `-by-`.
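Here is a hedged sketch of that extraction, using the example link above (it assumes the title itself never contains `-by-`; a production pattern may need to be more careful):

```python
import re

# The example link from the text
link = "/bookshop/product/Steps-Towards-a-Small-Theory-of-the-Visible-by-John-Berger-author/9780241472873"

# Title sits between /product/ and -by-; author between -by- and -author/
match = re.search(r"/product/(.+)-by-(.+)-author/", link)
title = match.group(1).replace("-", " ")
author = match.group(2).replace("-", " ")

print(title)   # Steps Towards a Small Theory of the Visible
print(author)  # John Berger
```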
Finally, the last number is the ISBN, or International Standard Book Number, a unique code for every book in the world (and here is another pattern!). We can extract this code from the book's URL that we find in the `product-info` container. We will use it later when accessing the website's API to get the data for another dataset, which we will merge with the one we are creating now.
In this section we will scrape the following metadata:
To extract the ISBN we will split each link by the forward slash (`/`) and take the last element of the resulting list. Why so? If you look at the links to the books' pages, you'll see that the ISBN is always the last sequence of characters (which is another pattern).
Important note: if you look to the right of a category page, you'll see that we can apply different filters and get into subcategories of each category (which in turn can have their own subcategories). We won't scrape those books and will just concentrate on the featured ones. It should still be enough for a good statistical analysis.
We will handle multiple requests to many pages in each category so we have to make sure that we do not overwhelm the server: to do so our script will sleep between 2 and 10 seconds (chosen randomly) after scraping the data from a page before proceeding to the next one.
We will also use the `requests_cache` library, which creates a local `sqlite` database where it stores responses from the server. This accelerates things if we have to rerun our script: with a local cache, the data is retrieved from the local database rather than from the server, which is much quicker.
How did I learn about the `requests_cache` library? Just read! Read a lot of articles about the things you are going to accomplish. I first learned about this library from a DataQuest article while preparing to work with APIs for the first time.
OK, we know that the metadata is inside `<div>` containers, but we want to extract it from every single page in all categories. How do we work that out?
Let's go to the next few pages in a category and look at how the URL changes... The query parameter `offset` increases by 48 (which makes sense, since we have 48 books on every page), so the first page in each category has this value at 0, the second page at 48, and so on (can you see the pattern?). We do not know the exact number of pages in each category (it should be 11, but let's assume that some categories may have fewer), so we have to make sure that we find the `product-info` CSS class, increase the offset only in that case, and break the `while` loop otherwise.
Why use a `while` loop? If we write `while True`, it creates an infinite loop that we can break when a certain condition is met (in this case, when there are no more pages to iterate over).
But how do we know that there are no more pages? Just put a really big number (like 9600) in the offset to get to an $n$th page. You will see that the page still exists, but there are no books on it, and therefore no `product-info` containers.
Now that we've resolved our doubts, let's start coding. Or not?
Before starting to code try to think about:
It may seem tedious and unproductive, but it won't take you too much time, and you will be much quicker at implementing your plan in Python. And now we are ready to start!
First of all, we will extract the category names to loop through later. If we go to this page and look at its source, we'll see that links to the categories are found between `<li>` tags with the CSS class `category-list__item`.

That means we can find all elements between these tags and with this CSS class, extract the links to the category pages, and use a regex (a pattern) to extract the category names. We will pass the names to the `blackwell_get_category_page` function, which will return a response for each category page.
The important thing to notice in the code below is the error handling: if the status code of a response is not 200 (which means "OK"), the function returns the error status code. Here we are going to run the script only once, but imagine you were going to schedule it to run every month! In that case, you should add a piece of code that retries the execution. You can read more about this strategy here.

Final note: you have to adapt your code to your plan. For example, I am going to run this code just once (or rerun it manually once in a while), so I do not need retry strategies.
# The function returns category names
def extract_category_names():
url = "https://blackwells.co.uk/bookshop/category/_top"
response = rq.get(url)
# If an error occurs, return the error status code
if response.status_code != 200:
        return response.status_code
# Otherwise extract each category element
else:
soup = BeautifulSoup(response.text, "html.parser")
cats = soup.find_all("li", class_="category-list__item")
# Initialize a list to hold category links
category_links = []
# Extract category links
for cat in cats:
category_links.append(cat.find("a").get("href"))
# Initialize a list to hold category names
category_names = []
# Search category names and extract them
for link in category_links:
category_names.append(
                re.search(r"/category/_(\w+)", link).group(1)
) # Use the group method to extract the first capturing
# group in the matched string
        # Finally, return the names of the categories
return category_names
book_categories = extract_category_names()
# Check if the returned list is correct
book_categories[:5]
['artanddesign', 'biography', 'business', 'childrens', 'computing']
Well, now that we have a list of all the categories, we can write a function that loops through all category pages and extracts the data under the `product-info` class. Does it seem familiar? Head back to the section "How to Iterate through Multiple Pages?", where I described the workflow of what we are going to do. Why was it at the beginning? Because we have to do some preliminary research before we start coding: study the HTML and the URL, and answer the questions from the "Let's Start" section. Only then start to code!

You most certainly won't be able to predict everything that's going to happen. Many ideas come while you code, or just before you start coding, so do not be discouraged if you come up with a last-minute idea or encounter a problem while coding! Just make sure you do not get overwhelmed by tons of thoughts while coding, so do good preparation.
Let's now think of what we are going to do:
So here are the steps we have to follow (follow them along while reading the code):

- Create a dictionary whose values will be `ResultSet` objects of the `BeautifulSoup` library
- Set the `offset_num` variable to 0. Recall that in the URL the `offset` query parameter changes by 48 when we go to the next page. Therefore, to loop through all pages, we will update this variable at each iteration and pass it to the `blackwell_get_category_page` function as the first argument (`payload`)
- Loop through the `book_categories` list. We will pass each name as the second argument to the `blackwell_get_category_page` function (`category`)
- Use a `while` loop that will `break` when there is no next page (in other words, when there are no books on that page)
- `print` the offset number (page number) and the category the function is looping through on each iteration
- `break` the loop and set the offset number back to 0 before looping through the next category
- Append each `ResultSet` to the dictionary that we will return

Before attempting to run such a big function, try to test it with smaller inputs. Did it return what you wanted? If so, you can introduce the `for`/`while` loops.
def extract_product_info(book_categories):
# Install a local database for responses, it expires after 7 days
requests_cache.install_cache(
"blackwell_category_cache", expire_after=timedelta(days=7)
)
# Dictionary with HTML of each book's metadata
categories_dict = {}
# Set the offset number to 0
offset_num = 0
# Iterate over each category name
for cat in book_categories:
# While there is the next page keep running the loop
while True:
payload = {"offset": offset_num}
response = blackwell_get_category_page(payload, cat)
# Print which page of which category we request
print("Requesting offset {} for category {}".format(offset_num, cat))
# Clear output
clear_output(wait=True)
# If there is an error in response, break
if response.status_code != 200:
print(response.text)
break
# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Check if there is a next page
next_page = soup.find_all("div", class_="product-info")
if next_page:
# Increase offset number to get to the next page
offset_num += 48
# Reset the offset number before going to the next category
# And break the loop if there is no next page before going to the next category
else:
offset_num = 0
break
# Populate the dictionary where keys are category names and values are `ResultSet` of `product-info`
categories_dict.setdefault(cat, []).append(next_page)
# Check if responses are in the local database, otherwise sleep 2-10 sec
if not getattr(response, "from_cache", False):
sleep(randint(2, 10))
return categories_dict
product_info = extract_product_info(book_categories)
Requesting offset 528 for category travel
It will take quite a while for this function to run, so be patient.
Once it has finished running, let's check the results.
print(
"The type of each value in the dictionary is {}".format(
type(product_info["artanddesign"][0])
)
)
print()
print(
"An example of HTML of book's metadata:\n\n{}".format(
product_info["artanddesign"][0][0]
)
)
print()
print(
"The type of each returned HTML is {}".format(
type(product_info["artanddesign"][0][0])
)
)
The type of each value in the dictionary is <class 'bs4.element.ResultSet'> An example of HTML of book's metadata: <div class="product-info"> <h4><a class="product-name" href="/bookshop/product/Secrets-of-a-Devon-Wood-by-Jo-Brown-author/9781780724379" itemprop="url">Secrets of a Devon Wood</a></h4> <p class="product-author">Jo Brown (author)</p> <p class="product-format"> <span>Hardback </span> <br/>Published <span>08 Oct 2020</span> </p> <div class="product-price"> <p class="product-price--discount u-mb-sans">Save 2,92€</p> <ul class="list--inline"> <li class="product-price--old"><strike>16,55€</strike></li> <li class="product-price--current"> 13,63€ </li> </ul> <span class="is-hidden">In Stock</span> </div> </div> The type of each returned HTML is <class 'bs4.element.Tag'>
As you can see, the type of each value in the dictionary is a `bs4.element.ResultSet` object, and each element inside it has the `Tag` type. You can read more about these objects in the official documentation.
We've successfully extracted the information about each book in each category, obtaining a dictionary with categories as keys. The last thing left is to create the first dataset, which will contain the first part of the metadata. Let's study the returned dictionary `product_info`.
# The length of dictionary values (which are lists) in the first 5 book categories
for cat in book_categories[:5]:
print(
"The length of the list of the category '{}' is {}".format(
cat, len(product_info[cat])
)
)
The length of the list of the category 'artanddesign' is 11 The length of the list of the category 'biography' is 11 The length of the list of the category 'business' is 11 The length of the list of the category 'childrens' is 11 The length of the list of the category 'computing' is 11
You may notice that the length corresponds to the number of pages in each category.
for n in range(5):
print(
"The length of the ResultSet in the 'artanddesign' category on page {} is: {}".format(
n, len(product_info["artanddesign"][n])
)
)
The length of the ResultSet in the 'artanddesign' category on page 0 is: 48 The length of the ResultSet in the 'artanddesign' category on page 1 is: 48 The length of the ResultSet in the 'artanddesign' category on page 2 is: 48 The length of the ResultSet in the 'artanddesign' category on page 3 is: 48 The length of the ResultSet in the 'artanddesign' category on page 4 is: 48
As you can see, each page has 48 books (which equals the length of the `ResultSet`). Recall that each `ResultSet` holds the HTML of the `product-info` containers with the books' metadata.
To extract the metadata from each result set, we can write three `for` loops to iterate over each key:value pair, each `ResultSet`, and each element inside it, and extract the following metadata:
We will need the `range` function to update the number of the page and of the book. We could assume that each category has exactly 11 pages, each with exactly 48 books. However, the code would fail if there were fewer or more than 11 pages or 48 books, so let's make it more robust by passing `len` into `range` to adapt the code to the real data we have.
Most of the code is straightforward, because we just need to find the necessary tags and CSS classes and get the text between the tags. However, in the case of `product_format` we have two nested `span` tags, so we will have to get inside these tags and extract the text. We can achieve this by finding the `p` tag with the `product-format` class, then finding all `span` tags, which returns a list where the first element is the book type and the second is the publication date.
Don't forget about the categories: at each iteration, we will need to append them to a list to keep track of which category a book belongs to.
We will also write a supplementary function, `get_list_value`, to handle the absence of some publication dates inside the `product-info` class (basically an analogue of the dictionary's `get` method for lists).
I discovered the necessity of this last function when I tried to run the code and got an error. As you can see, you cannot predict everything you will need to code, so do not be discouraged if something does not work on the first attempt.
# Returns a list value if it exists, NaN otherwise
def get_list_value(lst, idx):
try:
return lst[idx].get_text()
    except IndexError:
return np.nan
# Creates the dataframe
def create_first_df(product_info):
links = []
isbns = []
current_prices = []
discounts = []
book_types = []
publication_dates = []
categories = []
    # Iterate over each category
    for k, v in product_info.items():
        # Iterate over each page and save the current ResultSet
        for page in range(len(v)):
            curr_result_set = v[page]
            # Iterate over each book in the ResultSet and extract metadata
            for i in range(len(curr_result_set)):
                # Extract link
                link = curr_result_set[i].find("a", class_="product-name").get("href")
                links.append(link)
                # Extract the ISBN by splitting the link on "/"
                isbns.append(link.split("/")[-1])
                # Extract current prices; if there is no price, append NaN
                current_price = curr_result_set[i].find(
                    "li", class_="product-price--current"
                )
                if current_price:
                    current_prices.append(current_price.get_text(strip=True))
                else:
                    current_prices.append(np.nan)
                # Extract discounts; if there is no discount, append NaN
                discount = curr_result_set[i].find(
                    "p", class_="product-price--discount u-mb-sans"
                )
                if discount:
                    discounts.append(discount.get_text())
                else:
                    discounts.append(np.nan)
                # Extract product format (book type and publication date)
                product_format = curr_result_set[i].find("p", class_="product-format")
                # If there is a book type and a publication date
                if product_format:
                    spans = product_format.find_all("span")
                    book_types.append(get_list_value(spans, 0))
                    publication_dates.append(get_list_value(spans, 1))
                # Otherwise append NaN
                else:
                    book_types.append(np.nan)
                    publication_dates.append(np.nan)
                # Append the category name
                categories.append(k)
# Create the dataframe
df = pd.DataFrame(
{
"isbn": isbns,
"publication_date": publication_dates,
"price": current_prices,
"discount": discounts,
"type": book_types,
"category": categories,
"link_book_page": links,
}
)
return df
first_df = create_first_df(product_info)
# Display the first 5 rows of the first dataframe
first_df.head()
| | isbn | publication_date | price | discount | type | category | link_book_page |
|---|---|---|---|---|---|---|---|
| 0 | 9781780724379 | 08 Oct 2020 | 13,63€ | Save 2,92€ | Hardback | artanddesign | /bookshop/product/Secrets-of-a-Devon-Wood-by-J... |
| 1 | 9780141992150 | 05 Nov 2020 | 11,03€ | NaN | Paperback | artanddesign | /bookshop/product/Poor-by-Caleb-Femi-author/97... |
| 2 | 9781250114297 | 06 Oct 2020 | 32,97€ | NaN | Hardback | artanddesign | /bookshop/product/Humans-by-Brandon-Stanton-au... |
| 3 | 9781846149481 | 05 Nov 2020 | 19,11€ | Save 2,97€ | Hardback | artanddesign | /bookshop/product/Gamish-by-Edward-Ross-author... |
| 4 | 9780241472859 | 24 Sep 2020 | 4,25€ | Save 2,36€ | Paperback | artanddesign | /bookshop/product/The-Narrative-of-Trajans-Col... |
Perfect! Now that we've got the first dataframe, we can move on to extracting other metadata from the API, but before doing that, we will do some data cleaning. In particular, we will check whether there are any duplicates in the `isbn` and `link_book_page` columns, since these two should be unique. It's fair to suppose that the other columns have duplicates, but that does not affect the data's correctness (there can, of course, be duplicated publication dates).
# Check ISBN and link_book_page for duplicates
for col in ["isbn", "link_book_page"]:
print(
"Number of duplicate entries in the column '{}' is {}".format(
col, first_df[col].duplicated().sum()
)
)
Number of duplicate entries in the column 'isbn' is 3211 Number of duplicate entries in the column 'link_book_page' is 3211
We have more than 3000 duplicates in these two columns, and they are real duplicates (not just random errors in the ISBN) because the links have to be unique. Before we proceed, we have to remove all of them to make sure that every ISBN is unique.
# Drop duplicates in ISBN and link_book_page
first_df = first_df.drop_duplicates(subset=["isbn", "link_book_page"]).reset_index(
drop=True
)
# Check if there any duplicates
for col in ["isbn", "link_book_page"]:
print(
"Number of duplicate entries in the column '{}' is {}".format(
col, first_df[col].duplicated().sum()
)
)
print()
# Print the length of the dataset after removing duplicates
print(
"Length of the dataset after removing the duplicates is {}".format(
first_df.shape[0]
)
)
Number of duplicate entries in the column 'isbn' is 0 Number of duplicate entries in the column 'link_book_page' is 0 Length of the dataset after removing the duplicates is 11573
Fine! We still have almost 12000 books in the dataset.
Finally, let's set the ISBN as the index (we will use it to merge the datasets).
# Set ISBN as index
first_df.set_index("isbn", inplace=True)
Now that we have some general metadata for almost 12000 books (and the most important one is ISBN) we can get more specific data about each book:
We will do this by accessing the bookshop's API, which has plenty of useful endpoints, such as getting the metadata for a book by its ISBN (and we already have the ISBNs!).
Before we dive into extracting the data, we have to study the JSON structure. Let's first of all write a function that prettifies the returned JSON. This function will print a JSON object in a human-readable format.
# Prettifies JSON file
def prettify_json(json_obj):
text = json.dumps(json_obj, indent=4)
print(text)
And now let's write a function that returns a JSON response for a specified ISBN.
# Return a response for an ISBN
def blackwell_get_book_response(isbn):
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0"
}
url = "http://shopfeeds.blackwell.co.uk/jsp/restful/bookdetails/{}".format(
isbn
) # The URL from the API documentation
response = rq.get(url, headers=headers)
return response
Before we proceed we have to study the structure of the returned JSON file. For example, we will use the book "Good Economics for Hard Times" by Abhijit V. Banerjee and Esther Duflo.
# Run this cell to return a long JSON-formatted object
good_economics = blackwell_get_book_response("9780141986197")
prettify_json(good_economics.json())
{ "isbn": "9780141986197", "name": "Good Economics for Hard Times", "author": "Abhijit V. Banerjee (author), Esther Duflo (author)", "publisher": "Penguin Books Ltd", "publishedDate": "03 Sep 2020", "published_country": "United Kingdom", "imprint": "Penguin Books", "language": "English", "place_of_publication": "UK", "no_of_pages": "x, 402", "height": "129", "width": "196", "depth": "28", "rights_afs": "AD AE AF AG AI AL AM AO AQ AR AT AU AW AX AZ BA BB BE BF BG BH BI BJ BL BM BN BO BQ BR BS BV BW BY BZ CC CD CF CG CH CI CK CL CM CN CO CR CU CV CW CX CY CZ DE DJ DK DM DO DZ EC EE EG EH ER ES ET FI FJ FK FO FR GA GB GD GE GF GG GH GI GL GM GN GP GQ GR GS GT GW GY HK HM HN HR HT HU ID IE IL IM IO IQ IR IS IT JE JM JO JP KE KG KH KI KM KN KP KR KW KY KZ LA LB LC LI LR LS LT LU LV LY MA MC MD ME MF MG MK ML MM MN MO MQ MR MS MT MU MW MX MY MZ NA NC NE NF NG NI NL NO NR NU NZ OM PA PE PF PG PL PM PN PS PT PY QA RE RO RS RU RW SA SB SC SD SE SG SH SI SJ SK SL SM SN SO SR SS ST SV SX SY SZ TC TD TF TG TH TJ TK TL TM TN TO TR TT TV TW TZ UA UG UY UZ VA VC VE VG VN VU WF WS YE YT ZA ZM ZW", "subtitle": "Better Answers to Our Biggest Problems", "series_title": "Penguin Economics", "has_image": "true", "weight": "306", "dewey": "330", "bisac": "", "visible": "1", "dewey_edition": "23", "shopstock": "H2:E0:T0:K1:C0:C2:Q3:R4:V0", "audienceString": "", "spine": "28", "cnttype": "Book", "formatString": "Paperback", "short_blurb": "The experience of the last decade has not been kind to the image of economists: asleep at the wheel (perhaps with the foot on the gas pedal) in the run-up to the great recession, squabbling about how to get out of it, tone-deaf in discussions of the plight of Greece or the Euro area; they seem to have lost the ability to provide reliable guidance on the great problems of the day. In this ambitious, provocative book Abhijit V. 
Banerjee and Esther Duflo show how traditional western-centric thinking has failed to explain what is happening to people in a newly globalised world: in short Good Economics has been done badly.", "long_blurb": "<p><b>FROM THE WINNERS OF THE 2019 NOBEL PRIZE IN ECONOMICS<br></b><br><b>'Wonderfully refreshing . . . A must read' Thomas Piketty </b><br><br>In this revolutionary book, prize-winning economists Abhijit V. Banerjee and Esther Duflo show how economics, when done right, can help us solve the thorniest social and political problems of our day. From immigration to inequality, slowing growth to accelerating climate change, we have the resources to address the challenges we face but we are so often blinded by ideology.<br><br>Original, provocative and urgent,<i> Good Economics for Hard Times</i> offers the new thinking that we need. It builds on cutting-edge research in economics - and years of exploring the most effective solutions to alleviate extreme poverty - to make a persuasive case for an intelligent interventionism and a society built on compassion and respect. A much-needed antidote to polarized discourse, this book shines a light to help us appreciate and understand our precariously balanced world.</p>", "synopsisBlurb": "<p><b>FROM THE WINNERS OF THE 2019 NOBEL PRIZE IN ECONOMICS<br></b><br><b>'Wonderfully refreshing . . . A must read' Thomas Piketty </b><br><br>In this revolutionary book, prize-winning economists Abhijit V. Banerjee and Esther Duflo show how economics, when done right, can help us solve the thorniest social and political problems of our day. From immigration to inequality, slowing growth to accelerating climate change, we have the resources to address the challenges we face but we are so often blinded by ideology.<br><br>Original, provocative and urgent,<i> Good Economics for Hard Times</i> offers the new thinking that we need. 
It builds on cutting-edge research in economics - and years of exploring the most effective solutions to alleviate extreme poverty - to make a persuasive case for an intelligent interventionism and a society built on compassion and respect. A much-needed antidote to polarized discourse, this book shines a light to help us appreciate and understand our precariously balanced world.</p>", "sJacketPath": "/jacket/s/9780141986197.jpg", "mJacketPath": "/jacket/m/9780141986197.jpg", "lJacketPath": "/jacket/l/9780141986197.jpg", "price": "9.43", "actualPrice": "9.43", "originalPrice": "9.43", "bigBuyButton": "<a href=\"null/bookshop/shopping_cart.jsp?action=add&amp;prodid=9780141986197\" class=\"bigbutton inter\">Add to basket</a>", "buyButton": "<a href=\"null/bookshop/shopping_cart.jsp?action=add&amp;prodid=9780141986197\" class=\"button inter\">Add to basket</a>", "dispatchDelay": "2", "isSecondHand": false, "dispatch": "Usually dispatched within 48 hours", "freeQty": 0, "actualPricePounds": "?9.43", "originalPricePounds": "?9.43", "pricePounds": "?9.43", "totalPrice": "0.00", "totalPricePounds": "?0.00", "pubstatus": 1, "gbpprice": "9.43", "ipop": "ip", "inStock": true, "salesRank": "618", "saveString": "RRP &#x00A3;10.99 - <strong>save &#x00A3;1.56</strong>", "discount": "1.56", "priceBeforeDiscount": "&#x00A3;10.99", "priceAfterDiscount": "&#x00A3;9.43", "priceDifference": "&#x00A3;1.56", "listPrice": "10.99", "code": "CORS", "hasDiscount": true, "qtyAvail": 24, "totalQtyAvail": 502, "blurbReview": "Excellent, important, disarmingly down to earth . . . they seek to shed much-needed light upon the distortions that bad economics bring to public debates while methodically deconstructing their false assumptions.<br /><i>Observer</i><br /><br />Not all economists wear ties and think like bankers. 
In their wonderfully refreshing book, Banerjee and Duflo delve into impressive areas of new research questioning conventional views about issues ranging from<b> </b>trade to top income taxation and mobility, and offer their own powerful vision of how we can grapple with them. A must-read.<br /><i>Thomas Piketty, author of Capital in the Twenty-First Century</i><br /><br />Compelling, useful, relevant ... Banerjee and Duflo use extensive data to zoom out and show us a wider view of these human dynamics<br /><i>Bill Gates</i><br /><br />Excellent ... Few have grappled as energetically with the complexity of real life as Esther Duflo and Abhijit Banerjee, or got their boots as dirty in the process ... Readers will be captivated<br /><i>The Economist</i><br /><br />A canard-slaying, unconventional take on economics ... invigorating ... a treasure trove of facts and findings about the biggest economic issues of the day<br /><i>The Times</i><br /><br />A magnificent achievement, and the perfect book for our time. Banerjee and Duflo brilliantly illuminate the largest issues of the day, including immigration, trade, climate change, and inequality.<br /><i>Cass R. Sunstein, author of How Change Happens</i><br /><br />Banerjee and Duflo are masters of this terrain . . . 
Their book is as stimulating as it gets", "productUrlItSelf": "/bookshop/product/Good-Economics-for-Hard-Times-by-Abhijit-V-Banerjee-author-Esther-Duflo-author/9780141986197", "availableForSale": true, "navigationList": { "SHOPPING_CART_URL": { "displayName": "SHOPPING_CART_URL", "urlPath": "null/bookshop/shopping_cart.jsp?action=add&prodid=9780141986197", "hasSubNavigation": false, "subNavigation": [], "visible": true }, "WISHLIST_BUTTON_KEY": { "displayName": "Add to wishlist", "urlPath": "null/bookshop/bookmark.jsp?page=9780141986197", "hasSubNavigation": false, "subNavigation": [], "visible": true }, "RESERVE_IN_STORE": { "displayName": "Reserve at Shop", "urlPath": "/bookshop/shops/shop-reserve.jsp?isbn=9780141986197&selectShop=V0&Reserve+at+shop=Reserve+at+shop", "hasSubNavigation": false, "subNavigation": [], "visible": true }, "BUY_BUTTON": { "displayName": "Add to basket", "urlPath": "", "hasSubNavigation": false, "subNavigation": [], "visible": true } }, "productAddToBasketUrl": "/bookshop/shopping_cart.jsp?action=add&prodid=9780141986197", "tax": "0.0", "sku": "", "skuId": "", "zone": "", "item_state": "0", "sellerCountry": "", "sellerCountryCode": "", "secondHand": false, "percentageDiscount": false, "ljacketPath": "/jacket/l/9780141986197.jpg", "mjacketPath": "/jacket/m/9780141986197.jpg", "sjacketPath": "/jacket/s/9780141986197.jpg" }
We've got a lot of metadata but the structure is pretty simple: it's just a dictionary of dictionaries.
First of all, we will write a function that returns a dictionary containing the data we need. We will also have to handle missing keys in the JSON (for example, the `edition` field is absent for many books), so we will specify `np.nan` as the default value when a key is not found, using the dictionary method `get`. I discovered this by trial and error: at some point, my function raised an error because a field was absent for a book, so I had to handle it. I hadn't known about the issue beforehand, but I resolved it once it arose.
Notice that we use lists as values in the dictionary: that's because we will create dataframes from these dictionaries. It is better to use non-scalar values (like lists), otherwise `pandas` will raise `ValueError: If using all scalar values, you must pass an index`.
If you encounter a problem while writing code, do not worry, it happens all the time. Just head to StackOverflow, where you will find a solution in 99% of cases (which you then have to understand). However, try to understand the error and read more about it: understanding errors is as important as understanding solutions. In this case, you can read more about scalar values in Python here. After reading this, you will understand that scalar values have no indexes, but `pandas` requires them to create a dataframe, so it throws the `ValueError` if you do not pass an index. We get around this problem by transforming scalar values into non-scalar ones (lists).
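If this error is new to you, here is a minimal reproduction (with values borrowed from the book above) of why scalar values fail and lists succeed:

```python
import pandas as pd

# Passing only scalar values: pandas cannot infer an index, so it raises
try:
    pd.DataFrame({"isbn": "9780141986197", "name": "Good Economics for Hard Times"})
except ValueError as err:
    print(err)  # If using all scalar values, you must pass an index

# Wrapping each value in a list gives every column a length-1 index
df = pd.DataFrame({"isbn": ["9780141986197"], "name": ["Good Economics for Hard Times"]})
print(df.shape)  # (1, 2)
```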
# This function takes a JSON object as input and returns a dictionary
# where keys are metadata names and values are the corresponding metadata values, wrapped in lists
def get_book_metadata(json_file) -> dict:
    # Dictionary to store book metadata
    book_metadata = {}
    # Loop over the necessary fields and add them to the dictionary
    # If there is no value for a field, add NaN
    for ele in [
        "isbn",
        "name",
        "subtitle",
        "edition",
        "author",
        "gbpprice",
        "discount",
        "isSecondHand",
        "publisher",
        "published_country",
        "imprint",
        "language",
        "no_of_pages",
        "height",
        "width",
        "spine",
        "weight",
        "salesRank",
        "short_blurb",
        "long_blurb",
        "blurbReview",
    ]:
        book_metadata[ele] = [json_file.get(ele, np.nan)]
    return book_metadata
# Print Good Economics for Hard Times metadata
get_book_metadata(good_economics.json())
{'isbn': ['9780141986197'], 'name': ['Good Economics for Hard Times'], 'subtitle': ['Better Answers to Our Biggest Problems'], 'edition': [nan], 'author': ['Abhijit V. Banerjee (author), Esther Duflo (author)'], 'gbpprice': ['9.43'], 'discount': ['1.56'], 'isSecondHand': [False], 'publisher': ['Penguin Books Ltd'], 'published_country': ['United Kingdom'], 'imprint': ['Penguin Books'], 'language': ['English'], 'no_of_pages': ['x, 402'], 'height': ['129'], 'width': ['196'], 'spine': ['28'], 'weight': ['306'], 'salesRank': ['618'], 'short_blurb': ['The experience of the last decade has not been kind to the image of economists: asleep at the wheel (perhaps with the foot on the gas pedal) in the run-up to the great recession, squabbling about how to get out of it, tone-deaf in discussions of the plight of Greece or the Euro area; they seem to have lost the ability to provide reliable guidance on the great problems of the day. In this ambitious, provocative book Abhijit V. Banerjee and Esther Duflo show how traditional western-centric thinking has failed to explain what is happening to people in a newly globalised world: in short Good Economics has been done badly.'], 'long_blurb': ["<p><b>FROM THE WINNERS OF THE 2019 NOBEL PRIZE IN ECONOMICS<br></b><br><b>'Wonderfully refreshing . . . A must read' Thomas Piketty </b><br><br>In this revolutionary book, prize-winning economists Abhijit V. Banerjee and Esther Duflo show how economics, when done right, can help us solve the thorniest social and political problems of our day. From immigration to inequality, slowing growth to accelerating climate change, we have the resources to address the challenges we face but we are so often blinded by ideology.<br><br>Original, provocative and urgent,<i> Good Economics for Hard Times</i> offers the new thinking that we need. 
It builds on cutting-edge research in economics - and years of exploring the most effective solutions to alleviate extreme poverty - to make a persuasive case for an intelligent interventionism and a society built on compassion and respect. A much-needed antidote to polarized discourse, this book shines a light to help us appreciate and understand our precariously balanced world.</p>"], 'blurbReview': ['Excellent, important, disarmingly down to earth . . . they seek to shed much-needed light upon the distortions that bad economics bring to public debates while methodically deconstructing their false assumptions.<br /><i>Observer</i><br /><br />Not all economists wear ties and think like bankers. In their wonderfully refreshing book, Banerjee and Duflo delve into impressive areas of new research questioning conventional views about issues ranging from<b> </b>trade to top income taxation and mobility, and offer their own powerful vision of how we can grapple with them. A must-read.<br /><i>Thomas Piketty, author of Capital in the Twenty-First Century</i><br /><br />Compelling, useful, relevant ... Banerjee and Duflo use extensive data to zoom out and show us a wider view of these human dynamics<br /><i>Bill Gates</i><br /><br />Excellent ... Few have grappled as energetically with the complexity of real life as Esther Duflo and Abhijit Banerjee, or got their boots as dirty in the process ... Readers will be captivated<br /><i>The Economist</i><br /><br />A canard-slaying, unconventional take on economics ... invigorating ... a treasure trove of facts and findings about the biggest economic issues of the day<br /><i>The Times</i><br /><br />A magnificent achievement, and the perfect book for our time. Banerjee and Duflo brilliantly illuminate the largest issues of the day, including immigration, trade, climate change, and inequality.<br /><i>Cass R. Sunstein, author of How Change Happens</i><br /><br />Banerjee and Duflo are masters of this terrain . . . 
Their book is as stimulating as it gets']}
What we can do now is loop through all the ISBNs in the first dataset, call the `get_book_metadata` function on each response, turn the result into a dataframe and append each dataframe to the list `frames`.
Afterwards, we call the `concat` function to concatenate all the dataframes from the list, which creates the second dataframe containing the remaining metadata.
def create_second_df():
    requests_cache.install_cache(
        "blackwell_books_cache", expire_after=timedelta(days=7)
    )
    # The ISBNs are the indexes of the first dataframe
    isbns = first_df.index
    # Initialize the book counter to keep track of progress while requesting book metadata
    nr_book = 0
    # Initialize a list to store responses
    responses = []
    for isbn in isbns:
        response = blackwell_get_book_response(isbn)
        # Keep track of which book we are requesting
        print(
            "Requesting the book having ISBN {}, number {}/{}".format(
                first_df.index[nr_book], nr_book + 1, len(first_df)
            )
        )
        clear_output(wait=True)
        # Increment the book number by 1
        nr_book += 1
        responses.append(response)
        # Make at most 4 requests per second if responses are not saved in the local cache
        if not getattr(response, "from_cache", False):
            sleep(0.25)
    # Initialize a list to store dataframes
    frames = []
    # Loop through the responses, transform each into a dataframe and append it to the list
    for response in responses:
        frames.append(pd.DataFrame(get_book_metadata(response.json())))
    # Return the concatenated frames
    return pd.concat(frames)
second_df = create_second_df()
Requesting the book having ISBN 9781909911468, number 11573/11573
After waiting quite a while, we finally have the second dataframe. Let's take a quick look at it to verify that it was created correctly.
# Show the first 3 rows
second_df.head(3)
isbn | name | subtitle | edition | author | gbpprice | discount | isSecondHand | publisher | published_country | ... | language | no_of_pages | height | width | spine | weight | salesRank | short_blurb | long_blurb | blurbReview | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9781780724379 | Secrets of a Devon Wood | A Nature Journal | NaN | Jo Brown (author) | 10.40 | 4.59 | False | Short Books | England | ... | English | 96 | 242 | 218 | 18 | 390 | 108 | Walking one day in the woods behind her cottag... | <p><b>"Things of such magnitude deserve respec... | <b>Exquisite drawings and thoughtful annotatio... |
0 | 9780141992150 | Poor | NaN | NaN | Caleb Femi (author) | 9.99 | 0.00 | False | Penguin Books Ltd | United Kingdom | ... | English | 140 | 130 | 197 | 16 | 258 | 167 | What is it like to grow up in a place where th... | <p><b>A <i>NEW STATESMAN</i> AND <i>FINANCIAL ... | <b>It's rare for a book of poems to repeat... |
0 | 9781250114297 | Humans | NaN | First edition | Brandon Stanton (author) | 22.37 | 4.62 | False | St. Martin's Press | New York (State) | ... | NaN | cm | 214 | 261 | 36 | 1628 | 63 | "Brandon Stanton's new book, Humans ... shows ... | NaN | NaN |
3 rows × 21 columns
Now let's set the ISBN column as the index (we are going to join the datasets on ISBN).
# Set ISBN as index
second_df.set_index("isbn", inplace=True)
We also have to make sure that there are no duplicates among ISBNs.
print(
    "The number of duplicates among ISBNs is {}".format(
        second_df.index.duplicated().sum()
    )
)
The number of duplicates among ISBNs is 0
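If `duplicated` is new to you, here is how it behaves on a tiny toy index with one repeated ISBN: every occurrence after the first is flagged.

```python
import pandas as pd

# A toy index where the first ISBN appears twice
idx = pd.Index(["9781780724379", "9780141992150", "9781780724379"])

# duplicated() marks repeats after the first occurrence; sum() counts them
print(idx.duplicated().sum())  # 1
```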
Everything seems okay, so the next step is to join the dataframes to create the final dataset (which we will then clean a little). Since both dataframes cover the same books, the length of the final dataset should equal the lengths of the datasets we join.
We will use the `join` method since we are joining the datasets by their indexes (ISBNs), although we could use the `merge` method, which can combine datasets on a specified column. Note that we use the `lsuffix` and `rsuffix` parameters to distinguish between the two discount columns (in euros in the first dataset, in pounds in the second). In any case, consult the documentation to discover the many useful parameters you can use in your project.
We will also export the dataframe to a `csv` file before attempting any cleaning, to provide users with the raw data.
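To sketch why the suffixes matter, here is a toy join of two frames that share both an index and a column name (the values are made up):

```python
import pandas as pd

# Two toy frames that share an index (ISBN) and a "discount" column,
# mirroring the clash between our two datasets
left = pd.DataFrame({"discount": ["Save 1,27€"]}, index=["9780751573954"])
right = pd.DataFrame({"discount": ["1.12"]}, index=["9780751573954"])

# Joining without suffixes raises a ValueError because of the column clash
try:
    left.join(right)
except ValueError as err:
    print(err)

# With suffixes, both columns survive under distinct names
joined = left.join(right, lsuffix="_euro", rsuffix="_gbp")
print(joined.columns.tolist())  # ['discount_euro', 'discount_gbp']
```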
# Join datasets by ISBN
final_df = first_df.join(second_df, lsuffix="_euro", rsuffix="_gbp")
# Export the final dataframe to a csv file
final_df.to_csv("blackwell_shop_raw.csv")
# Show the first three rows of the final dataset
final_df.head(3)
publication_date | price | discount_euro | type | category | link_book_page | name | subtitle | edition | author | ... | language | no_of_pages | height | width | spine | weight | salesRank | short_blurb | long_blurb | blurbReview | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
isbn | |||||||||||||||||||||
9781780724379 | 08 Oct 2020 | 13,63€ | Save 2,92€ | Hardback | artanddesign | /bookshop/product/Secrets-of-a-Devon-Wood-by-J... | Secrets of a Devon Wood | A Nature Journal | NaN | Jo Brown (author) | ... | English | 96 | 242 | 218 | 18 | 390 | 108 | Walking one day in the woods behind her cottag... | <p><b>"Things of such magnitude deserve respec... | <b>Exquisite drawings and thoughtful annotatio... |
9780141992150 | 05 Nov 2020 | 11,03€ | NaN | Paperback | artanddesign | /bookshop/product/Poor-by-Caleb-Femi-author/97... | Poor | NaN | NaN | Caleb Femi (author) | ... | English | 140 | 130 | 197 | 16 | 258 | 167 | What is it like to grow up in a place where th... | <p><b>A <i>NEW STATESMAN</i> AND <i>FINANCIAL ... | <b>It's rare for a book of poems to repeat... |
9781250114297 | 06 Oct 2020 | 32,97€ | NaN | Hardback | artanddesign | /bookshop/product/Humans-by-Brandon-Stanton-au... | Humans | NaN | First edition | Brandon Stanton (author) | ... | NaN | cm | 214 | 261 | 36 | 1628 | 63 | "Brandon Stanton's new book, Humans ... shows ... | NaN | NaN |
3 rows × 26 columns
Now we need to check two things: that the length of the final dataset equals the lengths of the two datasets we joined, and that its number of columns equals the sum of their columns.
print(
    "Is the length of the final dataset equal to the lengths of the first two dataset? {}".format(
        first_df.shape[0] == second_df.shape[0] == final_df.shape[0]
    )
)
print(
    "Is the total number of columns in the final dataset the sum of the columns of the first and the second datasets? {}".format(
        first_df.shape[1] + second_df.shape[1] == final_df.shape[1]
    )
)
Is the length of the final dataset equal to the lengths of the first two dataset? True
Is the total number of columns in the final dataset the sum of the columns of the first and the second datasets? True
Fine! Our dataframe was created correctly.
The next step is to clean the dataset a bit before sharing it. Here is what we can do:

- transform `object` types to `int`/`float` types where possible (like discounts, number of pages, etc.)
- rename the `price` column to `euro_price` and the `gbpprice` column to `gbp_price`

We could, of course, also rename categories to be more readable or transform publication dates into `datetime` objects, but let's leave that to whoever is going to do the data analysis. We could also remove the HTML tags from the last three columns, but doing so may destroy information, so we should be very careful; let's leave this decision to a data analyst as well (the HTML tags should not interfere with something like sentiment analysis).
Let's start by selecting which columns we want to transform into integers or floats.
# Show columns of the final dataset
final_df.columns
Index(['publication_date', 'price', 'discount_euro', 'type', 'category', 'link_book_page', 'name', 'subtitle', 'edition', 'author', 'gbpprice', 'discount_gbp', 'isSecondHand', 'publisher', 'published_country', 'imprint', 'language', 'no_of_pages', 'height', 'width', 'spine', 'weight', 'salesRank', 'short_blurb', 'long_blurb', 'blurbReview'], dtype='object')
We see that the numeric columns are `height`, `width`, `spine`, `weight` and `salesRank`, plus the price and discount columns.
In the case of the prices and discounts we will need to do some transformations first, so let's leave them for now. Also, the `no_of_pages` column contains data that is hard to parse, like "269 , 16 unnumbered of plates", so we won't do anything about it.
We will use the `pandas` function `to_numeric`, which robustly converts strings to numbers. However, it's important to make sure that the transformation does not introduce missing values before we export the dataset to a `csv` file. Let's compute the number of null values before and after the transformation.
# List of columns to be transformed to numbers
cols_to_numeric = ["height", "width", "spine", "weight", "salesRank"]
# Print out the number of null values before the transformation
print(
    "The number of null values in each column before the transformation is \n{}".format(
        final_df[cols_to_numeric].isnull().sum()
    )
)
The number of null values in each column before the transformation is 
height        4
width         5
spine        82
weight        0
salesRank     0
dtype: int64
# Transform values in selected columns to numbers
for col in cols_to_numeric:
    final_df[col] = pd.to_numeric(final_df[col])
print(
    "The number of null values in each column after the transformation is \n{}".format(
        final_df[cols_to_numeric].isnull().sum()
    )
)
The number of null values in each column after the transformation is 
height          4
width           5
spine          82
weight          0
salesRank    2109
dtype: int64
We've introduced 2109 missing values into the `salesRank` column! Let's revert to the original dataframe and study this column. We may suspect that it contains some non-alphanumeric characters (like punctuation) that cause the `pd.to_numeric` function to introduce null values.
# Revert to the original dataframe
final_df = first_df.join(second_df, lsuffix="_euro", rsuffix="_gbp")
# Check if the salesRank column has non-alphabetic characters
final_df[~final_df["salesRank"].str.isalnum()]["salesRank"]
isbn
9781913107161    
9780300253801    
9780520304840    
9780500971062    
9781426221422    
                ..
9780241959909    
9780140455076    
9781840182316    
9781912177059    
9781909911468    
Name: salesRank, Length: 2109, dtype: object
We have 2109 non-alphanumeric values, many of which seem to be blank. Let's check this hypothesis by finding all empty or whitespace-only values and counting them.
print(
    "The number of values with blank spaces in the 'salesRank' columns is {}".format(
        final_df[final_df["salesRank"].str.contains(r"^\s*$", regex=True)].shape[0]
    )
)
The number of values with blank spaces in the 'salesRank' columns is 2109
The numbers are the same, so we can conclude that the `pd.to_numeric` function replaces these blank values with nulls and we do not lose any information. Now we can proceed with the transformation.
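You can reproduce this behaviour in isolation: `pd.to_numeric` quietly maps empty strings to `NaN` instead of raising an error (the values below are made up, mirroring what happened to `salesRank`):

```python
import pandas as pd

# A toy Series with one valid rank and one blank value
s = pd.Series(["618", "", "42"])

# Empty strings become NaN; valid strings become numbers
converted = pd.to_numeric(s)
print(converted.isnull().sum())  # 1
```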
# Transform values in the selected columns to numbers
for col in cols_to_numeric:
    final_df[col] = pd.to_numeric(final_df[col])
# Print types of transformed columns
print("Types of transformed columns:")
print()
print(final_df[cols_to_numeric].dtypes)
Types of transformed columns:

height       float64
width        float64
spine        float64
weight         int64
salesRank    float64
dtype: object
Perfect! Now the values are either of the float64
or int64
type. We can now go on with cleaning the price and discounts columns. First of all, let's rename the price columns to distinguish between prices in euros and pounds.
# Rename price columns
final_df.rename({"price": "euro_price", "gbpprice": "gbp_price"}, axis=1, inplace=True)
Let's have a quick look at the `euro_price` column using the `sample` method from `pandas`, which returns a random sample of values.
# Return a sample of 10 euro prices
final_df["euro_price"].sample(10)
isbn
9780552779838     11,02€
9781506717661     16,54€
9789811531057    135,88€
9781138931367     36,65€
9781529034554      7,85€
9781787702073      8,74€
9780571314430     28,71€
9780198853923     36,22€
9781912891245     14,91€
9780751566659      7,59€
Name: euro_price, dtype: object
At first glance, we have to remove the euro symbol and change commas to dots to be able to transform the prices into numeric values. Let's try it out!
# Replace commas with dots and remove euro symbol
final_df["euro_price"] = (
    final_df["euro_price"].str.replace(",", ".").str.replace("€", "")
)
# Transform euro prices to numeric values
final_df["euro_price"] = pd.to_numeric(final_df["euro_price"])
# Print out the type of the euro_price column
final_df["euro_price"].dtypes
dtype('float64')
Fine! The type of the `euro_price` column is `float64`. As an exercise, check whether we introduced any missing values into the column during the transformation.
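One way the exercise could be approached is to run the same cleaning on a copy and compare null counts before and after; here is a minimal sketch on hypothetical raw values (one of which is genuinely missing):

```python
import pandas as pd

# Hypothetical raw euro_price values, including one missing entry
prices = pd.Series(["11,02€", "16,54€", None])

# Apply the same cleaning steps as above
cleaned = pd.to_numeric(prices.str.replace(",", ".").str.replace("€", ""))

# The transformation is safe if it introduces no new null values
print(prices.isnull().sum() == cleaned.isnull().sum())  # True
```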
Let's move on to the `gbp_price` column.
final_df["gbp_price"].sample(10)
isbn
9781108826716     7.39
9780262044127    30.33
9780262044080    13.99
9781471172854    15.28
9781621138082    16.32
9780521797948    19.22
9781784724269     9.25
9781472278401    10.40
9780201882957    23.71
9781138850903    38.45
Name: gbp_price, dtype: object
Well, here we can try to directly transform the values to numbers.
# Transform pound prices to numeric values
final_df["gbp_price"] = pd.to_numeric(final_df["gbp_price"])
# Print out the type of the gbp_price column
final_df["gbp_price"].dtypes
dtype('float64')
This column is also of `float64` type. Again, do not forget to check whether any missing values were introduced.
Now let's go on with discounts.
final_df["discount_euro"].sample(10)
isbn
9780241343265           NaN
9780751573954    Save 1,27€
9781913097073    Save 1,57€
9780199646593    Save 2,65€
9781108480918    Save 3,78€
9780141980546    Save 1,55€
9780300254778           NaN
9781138953949    Save 1,96€
9780140021967    Save 1,60€
9781138808089    Save 0,04€
Name: discount_euro, dtype: object
We have to remove the `Save` string and the euro symbol, and replace commas with dots before transforming the discounts to numbers.
final_df["discount_euro"].isnull().sum()
3050
# Remove Save and €, replace commas with dots
final_df["discount_euro"] = (
    final_df["discount_euro"]
    .str.replace("Save ", "")
    .str.replace(",", ".")
    .str.replace("€", "")
)
If we try to transform the values now, we'll get a `ValueError` because some discounts are expressed as percentages (like 30%). We can impute these discounts from the discounted price and the percentage. Note that the euro prices in the dataframe are already discounted.
We can subset the data and then merge it back into the original dataframe. This allows us to concentrate on a specific chunk of information and not get lost while performing the math operations.
The formula used to compute the discounts is $discount\_euro = (euro\_price \times (1 + discount\_pct)) - euro\_price$, where $discount\_pct$ is the percentage expressed as a decimal.
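As a quick worked example of the formula, take a hypothetical book whose discounted price is 4.62€ and whose discount tag reads 30%:

```python
# Hypothetical book: discounted price of 4.62€ with a "30%" discount tag
euro_price = 4.62
discount_pct = 0.30  # "30%" after stripping "%" and dividing by 100

# Apply the formula
discount_euro = (euro_price * (1 + discount_pct)) - euro_price
print(round(discount_euro, 3))  # 1.386
```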
# Subset of the data with discounts in percents
subset = final_df.loc[final_df["discount_euro"].str.contains("%", na=False)].copy()
# Remove percent symbols
subset["discount_euro"] = subset["discount_euro"].str.replace("%", "")
# Transform discounts in decimal numbers
subset["discount_euro"] = pd.to_numeric(subset["discount_euro"]) / 100
# Apply the formula to compute discounts in euros
subset["discount_euro"] = (
    subset["euro_price"] * (1 + subset["discount_euro"])
) - subset["euro_price"]
# Check if everything is OK
subset[["euro_price", "discount_euro"]]
euro_price | discount_euro | |
---|---|---|
isbn | ||
9780198724643 | 4.62 | 1.386 |
9780198786191 | 7.72 | 2.316 |
Everything is fine, so we can combine this subset with the original dataframe. We will select the indexes of the subset and simply overwrite the discounts corresponding to those ISBNs in the original dataset (so we don't even need the `join` method, which would require us to specify suffixes).
# Change euro discounts from percents to numbers
final_df.loc[subset.index, "discount_euro"] = subset["discount_euro"]
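The label-based update pattern above can be sketched on a toy frame (all values are made up):

```python
import pandas as pd

# A parent frame with two percentage discounts and one already-clean value
parent = pd.DataFrame(
    {"discount_euro": ["30%", "1.27", "50%"]}, index=["a", "b", "c"]
)

# Subset only the percentage rows and pretend we computed their euro values
subset = parent.loc[parent["discount_euro"].str.contains("%")].copy()
subset["discount_euro"] = ["1.39", "2.31"]

# .loc aligns by index labels, so only rows "a" and "c" are overwritten
parent.loc[subset.index, "discount_euro"] = subset["discount_euro"]
print(parent["discount_euro"].tolist())  # ['1.39', '1.27', '2.31']
```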
The last remaining transformation is of the discounts in pounds. Let's study the values.
# Print out discounts in pounds
final_df["discount_gbp"]
isbn
9781780724379     4.59
9780141992150     0.00
9781250114297     4.62
9781846149481     6.01
9780241472859     2.31
                 ...  
9781250038821     3.00
9781910240755     3.00
9781910702390     6.27
9781912177059     2.83
9781909911468    10.68
Name: discount_gbp, Length: 11573, dtype: object
It should be straightforward: just directly transform them to numbers.
# Transform discounts in pounds to numbers
final_df["discount_gbp"] = pd.to_numeric(final_df["discount_gbp"])
# Check the type
final_df["discount_gbp"].dtypes
dtype('float64')
Fine! All the numbers are floats now. As an exercise, do not forget to check whether any missing values were introduced.
final_df.head(2)
publication_date | euro_price | discount_euro | type | category | link_book_page | name | subtitle | edition | author | ... | language | no_of_pages | height | width | spine | weight | salesRank | short_blurb | long_blurb | blurbReview | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
isbn | |||||||||||||||||||||
9781780724379 | 08 Oct 2020 | 13.63 | 2.92 | Hardback | artanddesign | /bookshop/product/Secrets-of-a-Devon-Wood-by-J... | Secrets of a Devon Wood | A Nature Journal | NaN | Jo Brown (author) | ... | English | 96 | 242.0 | 218.0 | 18.0 | 390 | 108.0 | Walking one day in the woods behind her cottag... | <p><b>"Things of such magnitude deserve respec... | <b>Exquisite drawings and thoughtful annotatio... |
9780141992150 | 05 Nov 2020 | 11.03 | NaN | Paperback | artanddesign | /bookshop/product/Poor-by-Caleb-Femi-author/97... | Poor | NaN | NaN | Caleb Femi (author) | ... | English | 140 | 130.0 | 197.0 | 16.0 | 258 | 167.0 | What is it like to grow up in a place where th... | <p><b>A <i>NEW STATESMAN</i> AND <i>FINANCIAL ... | <b>It's rare for a book of poems to repeat... |
2 rows × 26 columns
Finally, we export the cleaned dataframe to a `csv` file.
# Export to csv
final_df.to_csv("blackwell_shop_cleaned.csv")
Let's wrap up what we did:

- used the `BeautifulSoup` library to scrape the content of the HTML pages of the website
- used the website's API, together with the `requests-cache` library, to store the data locally
- created the dataset, cleaned it and exported it to a `csv` file

In this project, I addressed the problem of not being able to apply your knowledge in real-world situations, and I hope I was able to give you some useful tips on how to organize your thinking process when working independently. Let's summarise the main points:

- do not try to figure out every library you will need upfront: import a skeleton of packages and refine as you go
- handle errors as they arise, and take the time to understand the errors themselves, not only their solutions
- verify every step of your work (lengths, column counts, null values) before moving on
Good luck in your data science journey and happy coding!
Some of the interesting tasks to perform on this dataset: