The coronavirus took the entire world by surprise, changing everyone's daily routine. City dwellers no longer spent their free time outside, going to cafes and malls; more people were home, reading books. That attracted the attention of startups that rushed to develop new apps for book lovers.
I've been given a database of one of the services competing in this market. It contains data on books, publishers, authors, and customer ratings and reviews of books. This information will be used to generate a value proposition for a new product.
books:
authors:
publishers:
ratings:
reviews:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
db_config = {'user': 'praktikum_student', # user name
'pwd': 'Sdf4$2;d-d30pp', # password
'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',
'port': 6432, # connection port
'db': 'data-analyst-final-project-db'} # the name of the data base
connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
db_config['pwd'],
db_config['host'],
db_config['port'],
db_config['db'])
engine = create_engine(connection_string, connect_args={'sslmode':'require'})
# write queries
tables = ['books', 'authors', 'publishers', 'ratings', 'reviews']
for table in tables:
query = 'SELECT * FROM ' + table + ' LIMIT 5'
display(pd.io.sql.read_sql(query, con = engine))
book_id | author_id | title | num_pages | publication_date | publisher_id | |
---|---|---|---|---|---|---|
0 | 1 | 546 | 'Salem's Lot | 594 | 2005-11-01 | 93 |
1 | 2 | 465 | 1 000 Places to See Before You Die | 992 | 2003-05-22 | 336 |
2 | 3 | 407 | 13 Little Blue Envelopes (Little Blue Envelope... | 322 | 2010-12-21 | 135 |
3 | 4 | 82 | 1491: New Revelations of the Americas Before C... | 541 | 2006-10-10 | 309 |
4 | 5 | 125 | 1776 | 386 | 2006-07-04 | 268 |
author_id | author | |
---|---|---|
0 | 1 | A.S. Byatt |
1 | 2 | Aesop/Laura Harris/Laura Gibbs |
2 | 3 | Agatha Christie |
3 | 4 | Alan Brennert |
4 | 5 | Alan Moore/David Lloyd |
publisher_id | publisher | |
---|---|---|
0 | 1 | Ace |
1 | 2 | Ace Book |
2 | 3 | Ace Books |
3 | 4 | Ace Hardcover |
4 | 5 | Addison Wesley Publishing Company |
rating_id | book_id | username | rating | |
---|---|---|---|---|
0 | 1 | 1 | ryanfranco | 4 |
1 | 2 | 1 | grantpatricia | 2 |
2 | 3 | 1 | brandtandrea | 5 |
3 | 4 | 2 | lorichen | 3 |
4 | 5 | 2 | mariokeller | 2 |
review_id | book_id | username | text | |
---|---|---|---|---|
0 | 1 | 1 | brandtandrea | Mention society tell send professor analysis. ... |
1 | 2 | 1 | ryanfranco | Foot glass pretty audience hit themselves. Amo... |
2 | 3 | 2 | lorichen | Listen treat keep worry. Miss husband tax but ... |
3 | 4 | 3 | johnsonamanda | Finally month interesting blue could nature cu... |
4 | 5 | 3 | scotttamara | Nation purpose heavy give wait song will. List... |
# count number of books released after 1-1-2000
query_t1 = """SELECT
COUNT (DISTINCT book_id) as cnt
FROM
books
WHERE
publication_date > '2000-01-01'"""
pd.io.sql.read_sql(query_t1, con = engine)
cnt | |
---|---|
0 | 819 |
819 out of the 1000 books in the table were released after the year 2000. This represents roughly 82%.
# count number of user reviews & avg book rating for each book
query_t2 = """SELECT
books.title,
subquery.avg_rating,
subquery.review_cnt
FROM
(SELECT
reviews.book_id as book_id,
COUNT (reviews.review_id) AS review_cnt,
AVG (ratings.rating) AS avg_rating
FROM
reviews
INNER JOIN ratings ON ratings.book_id = reviews.book_id
GROUP BY
reviews.book_id
ORDER BY
review_cnt DESC,
avg_rating DESC) AS subquery
INNER JOIN books ON subquery.book_id = books.book_id
ORDER BY
review_cnt DESC,
avg_rating DESC
LIMIT 10"""
pd.io.sql.read_sql(query_t2, con = engine)
title | avg_rating | review_cnt | |
---|---|---|---|
0 | Twilight (Twilight #1) | 3.662500 | 1120 |
1 | The Hobbit or There and Back Again | 4.125000 | 528 |
2 | The Catcher in the Rye | 3.825581 | 516 |
3 | Harry Potter and the Prisoner of Azkaban (Harr... | 4.414634 | 492 |
4 | Harry Potter and the Chamber of Secrets (Harry... | 4.287500 | 480 |
5 | Angels & Demons (Robert Langdon #1) | 3.678571 | 420 |
6 | Harry Potter and the Order of the Phoenix (Har... | 4.186667 | 375 |
7 | The Lightning Thief (Percy Jackson and the Oly... | 4.080645 | 372 |
8 | The Fellowship of the Ring (The Lord of the Ri... | 4.391892 | 370 |
9 | Animal Farm | 3.729730 | 370 |
Twilight received the greatest number of reviews by a long shot. It wasnt as popular as The Hobbit or any of the Harry Potter series, however. These books all received higher ratings.
# top publishers with books over 50 pages
query_t3 = """SELECT
COUNT (books.book_id) AS cnt,
publishers.publisher AS publisher
FROM
books
INNER JOIN publishers ON publishers.publisher_id = books.publisher_id
WHERE
books.num_pages > 50
GROUP BY
publishers.publisher
ORDER BY
cnt DESC
LIMIT 5"""
pd.io.sql.read_sql(query_t3, con = engine)
cnt | publisher | |
---|---|---|
0 | 42 | Penguin Books |
1 | 31 | Vintage |
2 | 25 | Grand Central Publishing |
3 | 24 | Penguin Classics |
4 | 19 | Bantam |
The top publishers are Penguin Books, Vintage, Grand Central Publishing, Pengin Classics, and Ballantine Books. If we group Penguin with Penguin Classics, then they are by far the largest publisher by volume of books.
# author with highest average rating
query_t4 = """SELECT
authors.author,
AVG (subquery2.avg_rating) as final_avg
FROM
(SELECT
books.title,
books.author_id,
subquery1.avg_rating
FROM
(SELECT
book_id,
COUNT (rating_id) AS rating_cnt,
AVG (rating) AS avg_rating
FROM
ratings
GROUP BY
book_id
HAVING
COUNT (rating_id) > 50) AS subquery1
INNER JOIN books ON books.book_id = subquery1.book_id) AS subquery2
INNER JOIN authors ON authors.author_id = subquery2.author_id
GROUP BY
author
ORDER BY
final_avg DESC
LIMIT 5"""
pd.io.sql.read_sql(query_t4, con = engine)
author | final_avg | |
---|---|---|
0 | J.K. Rowling/Mary GrandPré | 4.283844 |
1 | Markus Zusak/Cao Xuân Việt Khương | 4.264151 |
2 | J.R.R. Tolkien | 4.258446 |
3 | Louisa May Alcott | 4.192308 |
4 | Rick Riordan | 4.080645 |
J.K. Rowling has the highest average rating out of all books with greater than 50 ratings. Not far behind are Markus Zusak and J.R.R. Tolkien.
query_t5 = """SELECT
AVG (subquery2.review_cnt) AS avg_review_cnt
FROM
(SELECT
COUNT (reviews.review_id) as review_cnt,
subquery1.username
FROM
(SELECT
username,
COUNT (rating_id) AS rating_cnt
FROM
ratings
GROUP BY
username
HAVING
COUNT (rating_id) > 50) AS subquery1
INNER JOIN reviews ON reviews.username = subquery1.username
GROUP BY
subquery1.username) AS subquery2"""
pd.io.sql.read_sql(query_t5, con = engine)
avg_review_cnt | |
---|---|
0 | 24.333333 |
Users who rated more than 50 books left on average 24.33 text reviews.