We'll be working with a modified version of a database called Chinook. The Chinook database contains information about a fictional digital music shop - kind of like a mini-iTunes store.
The Chinook database contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees, customers, and the customers purchases. This information is contained in eleven tables.
We'll use the following code to connect our Jupyter Notebook to our database file:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
To run SQL queries in Jupyter Notebook, we have to add %%sql on its own line to the start of our query.
Let's start by getting familiar with our data.
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table", "view");
* sqlite:///chinook.db Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
country_or_other | view |
The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:
Artist Name : Genre
Regal : Hip-Hop
Red Tone : Punk
Meteor and the Girls : Pop
Slim Jim Bites : Blues
The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.
To do this, we need track, genre, invoice_line, invoice, and customer tables (refer schema). We first find out the most popular genre across all countries and then limit the result using country = 'USA'.
%%sql
SELECT
g.name genre_name,
SUM(il.quantity) tracks_sold
FROM track t
INNER JOIN invoice_line il ON il.track_id = t.track_id
INNER JOIN genre g ON g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC;
* sqlite:///chinook.db Done.
genre_name | tracks_sold |
---|---|
Rock | 2635 |
Metal | 619 |
Alternative & Punk | 492 |
Latin | 167 |
R&B/Soul | 159 |
Blues | 124 |
Jazz | 121 |
Alternative | 117 |
Easy Listening | 74 |
Pop | 63 |
Electronica/Dance | 55 |
Classical | 47 |
Reggae | 35 |
Hip Hop/Rap | 33 |
Heavy Metal | 8 |
Soundtrack | 5 |
TV Shows | 2 |
Drama | 1 |
%%sql
SELECT
g.name genre_name,
SUM(il.quantity) tracks_sold
FROM track t
INNER JOIN invoice_line il ON il.track_id = t.track_id
INNER JOIN genre g ON g.genre_id = t.genre_id
INNER JOIN invoice i ON i.invoice_id = il.invoice_id
INNER JOIN customer c ON c.customer_id = i.customer_id
WHERE c.country = "USA"
GROUP BY 1
ORDER BY 2 DESC;
* sqlite:///chinook.db Done.
genre_name | tracks_sold |
---|---|
Rock | 561 |
Alternative & Punk | 130 |
Metal | 124 |
R&B/Soul | 53 |
Blues | 36 |
Alternative | 35 |
Pop | 22 |
Latin | 22 |
Hip Hop/Rap | 20 |
Jazz | 14 |
Easy Listening | 13 |
Reggae | 6 |
Electronica/Dance | 5 |
Classical | 4 |
Heavy Metal | 3 |
Soundtrack | 2 |
TV Shows | 1 |
Based on these results Red Tone, Slim Jim Bites, Meteor and the Girls should be purchased for the store. Also note that Rock seems to be the most popular genre, so we should look out for more albums of the same to increase sales.
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. You have been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.
To do this, we need employee, customer, and invoice tables (refer schema). We would be analyzing performance based on the total dollar amount.
%%sql
SELECT
e.first_name||' '||e.last_name employee_name,
ROUND(SUM(i.total),2) total_dollar_amount
FROM employee e
INNER JOIN customer c ON c.support_rep_id = e.employee_id
INNER JOIN invoice i ON c.customer_id = i.customer_id
WHERE e.title = "Sales Support Agent"
GROUP BY 1
ORDER BY 2 DESC;
* sqlite:///chinook.db Done.
employee_name | total_dollar_amount |
---|---|
Jane Peacock | 1731.51 |
Margaret Park | 1584.0 |
Steve Johnson | 1393.92 |
We check the employee table to see if we can find out any reason for the observed results.
%%sql
SELECT *
FROM employee
WHERE title LIKE "%support%";
* sqlite:///chinook.db Done.
employee_id | last_name | first_name | title | reports_to | birthdate | hire_date | address | city | state | country | postal_code | phone | fax | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | jane@chinookcorp.com |
4 | Park | Margaret | Sales Support Agent | 2 | 1947-09-19 00:00:00 | 2017-05-03 00:00:00 | 683 10 Street SW | Calgary | AB | Canada | T2P 5G3 | +1 (403) 263-4423 | +1 (403) 263-4289 | margaret@chinookcorp.com |
5 | Johnson | Steve | Sales Support Agent | 2 | 1965-03-03 00:00:00 | 2017-10-17 00:00:00 | 7727B 41 Ave | Calgary | AB | Canada | T3B 1Y7 | 1 (780) 836-9987 | 1 (780) 836-9543 | steve@chinookcorp.com |
%%sql
WITH support_rep_sales AS
(
SELECT
e.employee_id,
e.first_name||' '||e.last_name employee_name,
ROUND(SUM(i.total),2) total_dollar_amount
FROM employee e
INNER JOIN customer c ON c.support_rep_id = e.employee_id
INNER JOIN invoice i ON c.customer_id = i.customer_id
WHERE e.title = "Sales Support Agent"
GROUP BY 1
)
SELECT
srs.employee_name,
srs.total_dollar_amount,
e.hire_date,
e.birthdate
FROM employee e
INNER JOIN support_rep_sales srs ON e.employee_id = srs.employee_id
ORDER BY 2 DESC;
* sqlite:///chinook.db Done.
employee_name | total_dollar_amount | hire_date | birthdate |
---|---|---|---|
Jane Peacock | 1731.51 | 2017-04-01 00:00:00 | 1973-08-29 00:00:00 |
Margaret Park | 1584.0 | 2017-05-03 00:00:00 | 1947-09-19 00:00:00 |
Steve Johnson | 1393.92 | 2017-10-17 00:00:00 | 1965-03-03 00:00:00 |
The store has 3 Sales Support Agents, all of them from Canada. All agents seem to have the same amount of experience with the company as seen in their hiring dates. The difference is sales roughly corresponds with the difference in hiring dates. The agent with the highest sales is the youngest, which could possibly reflect in the ability to connect with a younger customer market.
Your next task is to analyze the sales data for customers from each different country. You have been given guidance to use the country value from the customers table, and ignore the country from the billing address in the invoice table.
In particular, you have been directed to calculate data, for each country, on the:
To do this, we need customer and invoice tables (refer schema). We will create a temporary table that groups all countries with 1 customer as 'Other'.
%%sql
SELECT
c.country,
COUNT(DISTINCT(c.customer_id)) AS Total_Customers,
COUNT(i.invoice_id) AS Total_Orders
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
GROUP BY 1;
* sqlite:///chinook.db Done.
country | Total_Customers | Total_Orders |
---|---|---|
Argentina | 1 | 5 |
Australia | 1 | 10 |
Austria | 1 | 9 |
Belgium | 1 | 7 |
Brazil | 5 | 61 |
Canada | 8 | 76 |
Chile | 1 | 13 |
Czech Republic | 2 | 30 |
Denmark | 1 | 10 |
Finland | 1 | 11 |
France | 5 | 50 |
Germany | 4 | 41 |
Hungary | 1 | 10 |
India | 2 | 21 |
Ireland | 1 | 13 |
Italy | 1 | 9 |
Netherlands | 1 | 10 |
Norway | 1 | 9 |
Poland | 1 | 10 |
Portugal | 2 | 29 |
Spain | 1 | 11 |
Sweden | 1 | 10 |
USA | 13 | 131 |
United Kingdom | 3 | 28 |
%%sql
CREATE VIEW country_or_other AS
SELECT
CASE
WHEN (
SELECT COUNT(*)
FROM customer
WHERE country = c.country
) = 1 THEN "Other"
ELSE c.country
END country,
COUNT(DISTINCT(c.customer_id)) total_customers,
COUNT(i.invoice_id) total_orders,
ROUND(SUM(i.total),2) total_sales
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
GROUP BY 1;
* sqlite:///chinook.db Done.
[]
%%sql
SELECT * FROM country_or_other;
* sqlite:///chinook.db Done.
country | total_customers | total_orders | total_sales |
---|---|---|---|
Brazil | 5 | 61 | 427.68 |
Canada | 8 | 76 | 535.59 |
Czech Republic | 2 | 30 | 273.24 |
France | 5 | 50 | 389.07 |
Germany | 4 | 41 | 334.62 |
India | 2 | 21 | 183.15 |
Other | 15 | 147 | 1094.94 |
Portugal | 2 | 29 | 185.13 |
USA | 13 | 131 | 1040.49 |
United Kingdom | 3 | 28 | 245.52 |
We add another piece of code to ensure that 'Other' is always at the end of the result.
%%sql
SELECT
country,
total_customers,
total_orders,
total_sales,
avg_sales_per_customer,
avg_order_value
FROM
(
SELECT
co.*,
ROUND(CAST(total_sales AS FLOAT)/total_customers,2) avg_sales_per_customer,
ROUND(CAST(total_sales AS FLOAT)/total_orders,2) avg_order_value,
CASE
WHEN co.country = "Other" THEN 1
ELSE 0
END AS sort
FROM country_or_other co
GROUP BY country
ORDER BY sort ASC, total_customers DESC
);
* sqlite:///chinook.db Done.
country | total_customers | total_orders | total_sales | avg_sales_per_customer | avg_order_value |
---|---|---|---|---|---|
USA | 13 | 131 | 1040.49 | 80.04 | 7.94 |
Canada | 8 | 76 | 535.59 | 66.95 | 7.05 |
Brazil | 5 | 61 | 427.68 | 85.54 | 7.01 |
France | 5 | 50 | 389.07 | 77.81 | 7.78 |
Germany | 4 | 41 | 334.62 | 83.66 | 8.16 |
United Kingdom | 3 | 28 | 245.52 | 81.84 | 8.77 |
Czech Republic | 2 | 30 | 273.24 | 136.62 | 9.11 |
India | 2 | 21 | 183.15 | 91.58 | 8.72 |
Portugal | 2 | 29 | 185.13 | 92.57 | 6.38 |
Other | 15 | 147 | 1094.94 | 73.0 | 7.45 |
We can observe that, USA is Chinook's largest market, followed by Canada and Brazil. However, we see a clear difference between average sales in all of its countries (around 80 - 90 USD per customer on average) and average sales in Czech Republic of 136 USD per customer. This indicates that in this country average revenue per customer is significantly higher and targeted marketing could potentially be more profitable per customer compared to its other markets. Average order value tends to be approximately 8 USD per order; again, we notice that in the Czech Republic this value is slightly higher than in other countries at 9 USD per order.
The Chinook store is setup in a way that allows customer to make purchases in one of the two ways:
The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.
Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.
We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.
In this instance, we have two edge cases to consider:
In the first case, since our analysis is concerned with maximizing revenue we can safely ignore albums consisting of only a few tracks. The company has previously done analysis to confirm that the second case does not happen often, so we can ignore this case also.
To do this, we need the invoice_line, album, and track tables (refer schema). We are going to check whether each invoice has all the tracks from an album.
%%sql
SELECT * FROM invoice_line
LIMIT 10;
* sqlite:///chinook.db Done.
invoice_line_id | invoice_id | track_id | unit_price | quantity |
---|---|---|---|---|
1 | 1 | 1158 | 0.99 | 1 |
2 | 1 | 1159 | 0.99 | 1 |
3 | 1 | 1160 | 0.99 | 1 |
4 | 1 | 1161 | 0.99 | 1 |
5 | 1 | 1162 | 0.99 | 1 |
6 | 1 | 1163 | 0.99 | 1 |
7 | 1 | 1164 | 0.99 | 1 |
8 | 1 | 1165 | 0.99 | 1 |
9 | 1 | 1166 | 0.99 | 1 |
10 | 1 | 1167 | 0.99 | 1 |
One invoice_id is mapped to several track_id.
We create temporary tables- album_tracks that has album_id and track_id from the tracks included in that album, and invoice_tracks_albums that has invoice_id, track_id and album_id for the tracks that are sold. We then proceed to do something as shown below with these two tables in invoice_full_dictionary. Lastly in invoice_album we have the invoice_id of an order and whether it has a full album or not.
%%sql
WITH album_tracks AS
(
SELECT
a.album_id,
t.track_id
FROM album a
INNER JOIN track t ON a.album_id = t.album_id
),
invoice_tracks_album AS
(
SELECT
i.invoice_id,
at.album_id,
at.track_id
FROM invoice_line i
INNER JOIN album_tracks at ON at.track_id = i.track_id
),
invoice_full_dictionary AS
(
SELECT
ita.invoice_id,
CASE
WHEN
(
SELECT track_id
FROM (
SELECT itain.track_id track_id
FROM invoice_tracks_album itain
WHERE itain.invoice_id = ita.invoice_id
)
EXCEPT
SELECT track_id
FROM (
SELECT at.track_id track_id
FROM album_tracks at
WHERE at.album_id = ita.album_id
)
) IS NULL
AND
(
SELECT track_id
FROM (
SELECT at.track_id track_id
FROM album_tracks at
WHERE at.album_id = ita.album_id
)
EXCEPT
SELECT track_id
FROM (
SELECT itain.track_id track_id
FROM invoice_tracks_album itain
WHERE itain.invoice_id = ita.invoice_id
)
) IS NULL
THEN 1
ELSE 0
END full_album
FROM invoice_tracks_album ita
),
invoice_album AS
(
SELECT
invoice_id,
MAX(full_album) full_album
FROM invoice_full_dictionary
GROUP BY 1
)
SELECT
COUNT(invoice_id) total_invoices,
SUM(full_album) total_full_album,
ROUND(CAST(SUM(full_album) AS FLOAT)/COUNT(invoice_id) * 100, 2) AS perc_full_album
FROM invoice_album;
* sqlite:///chinook.db Done.
total_invoices | total_full_album | perc_full_album |
---|---|---|
614 | 114 | 18.57 |
We can note that 18.5% of invoices are related to the purchase of full albums. Rest 80 percent purchases are made by customers seeking specific tracks. The company should change their strategy to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.
We try to find out which artists are used in most of the playlists. To do this, we need playlist_track, track, album, and artist tables (refer schema).
%%sql
SELECT DISTINCT name
FROM playlist;
* sqlite:///chinook.db Done.
name |
---|
Music |
Movies |
TV Shows |
Audiobooks |
90’s Music |
Music Videos |
Brazilian Music |
Classical |
Classical 101 - Deep Cuts |
Classical 101 - Next Steps |
Classical 101 - The Basics |
Grunge |
Heavy Metal Classic |
On-The-Go 1 |
%%sql
WITH pop_album AS
(
SELECT
al.artist_id artist_id,
COUNT(DISTINCT playlist_id) no_of_playlists
FROM album al
INNER JOIN track t ON al.album_id = t.album_id
INNER JOIN playlist_track pt ON t.track_id = pt.track_id
GROUP BY 1
)
SELECT
ar.name artist_name,
pa.no_of_playlists
FROM artist ar
INNER JOIN pop_album pa ON pa.artist_id = ar.artist_id
ORDER BY 2 DESC
LIMIT 10;
* sqlite:///chinook.db Done.
artist_name | no_of_playlists |
---|---|
Eugene Ormandy | 7 |
English Concert & Trevor Pinnock | 6 |
Academy of St. Martin in the Fields & Sir Neville Marriner | 6 |
The King's Singers | 6 |
Berliner Philharmoniker & Herbert Von Karajan | 6 |
Alberto Turco & Nova Schola Gregoriana | 5 |
Richard Marlow & The Choir of Trinity College, Cambridge | 5 |
Wilhelm Kempff | 5 |
Yo-Yo Ma | 5 |
Scholars Baroque Ensemble | 5 |
Eugene Ormandy is the most popular artist in playlists. The different kinds of playlists represent the versatility in the kind of music produced by an artist.
Let's see if the most popular artist in playlists is the best selling artist for Chinook. To do this, we would need invoice_line, track, album, and artist tables (refer schema).
%%sql
WITH invoice_info AS
(
SELECT
al.artist_id,
COUNT(il.track_id) no_of_tracks_sold
FROM album al
INNER JOIN track t ON al.album_id = t.album_id
INNER JOIN invoice_line il ON t.track_id = il.track_id
GROUP BY 1
)
SELECT
ar.name artist_name,
i.no_of_tracks_sold
FROM artist ar
INNER JOIN invoice_info i ON ar.artist_id = i.artist_id
ORDER BY 2 DESC
LIMIT 10;
* sqlite:///chinook.db Done.
artist_name | no_of_tracks_sold |
---|---|
Queen | 192 |
Jimi Hendrix | 187 |
Nirvana | 130 |
Red Hot Chili Peppers | 130 |
Pearl Jam | 129 |
AC/DC | 124 |
Guns N' Roses | 124 |
Foo Fighters | 121 |
The Rolling Stones | 117 |
Metallica | 106 |
The sales are mostly dominated by rock bands, which is not surprising based on query 1. Popularity is a metric, and can be interpreted in different ways based on the information need. As far as sales are concerned, genre seems to affect the sales more than versatility/ playlists.
The management at Chinook wants to understand what share of tracks in the store are bought by customers to be able to manage the inventory better.
To do this, we need the track and invoice_line tables (refer schema). invoice_line contains the tracks that are sold while track contains all the tracks in the store.
%%sql
SELECT
COUNT(t.track_id) total_tracks,
COUNT(DISTINCT i.track_id) tracks_purchased,
ROUND(COUNT(DISTINCT i.track_id)/CAST(COUNT(t.track_id) AS FLOAT)*100,2) perc_purchased,
ROUND(100 - COUNT(DISTINCT i.track_id)/CAST(COUNT(t.track_id) AS FLOAT)*100,2) perc_not_purchased
FROM track t
INNER JOIN invoice_line i ON i.track_id = t.track_id;
* sqlite:///chinook.db Done.
total_tracks | tracks_purchased | perc_purchased | perc_not_purchased |
---|---|---|---|
4757 | 1806 | 37.97 | 62.03 |
~38% of unique tracks from the inventory are purchased by the customers, while ~62% are not purchased. The analysis here is done in terms of variety of tracks rather than the volume of sale - say a track is purchased 100 times in comparision to another one which is purchased only once, but both of them are purchased by customers. The remaining 62% tracks can be analyzed further to understand any possible trends, leading to their unpopularity among customers.
Do protected vs non-protected media types have an effect on popularity?
To do this, we need media_type, track, and invoice_line tables (refer schema). From the previous query we know that there are 1806 distinct tracks sold by the store.
%%sql
SELECT * FROM media_type;
* sqlite:///chinook.db Done.
media_type_id | name |
---|---|
1 | MPEG audio file |
2 | Protected AAC audio file |
3 | Protected MPEG-4 video file |
4 | Purchased AAC audio file |
5 | AAC audio file |
%%sql
WITH track_media AS
(
SELECT
CASE
WHEN m.name LIKE "%protected%"
THEN "Yes"
ELSE "No"
END protected,
t.track_id
FROM media_type m
INNER JOIN track t ON t.media_type_id = m.media_type_id
)
SELECT
tm.protected,
COUNT(DISTINCT i.track_id) tracks_sold,
ROUND(CAST(COUNT(DISTINCT i.track_id) AS FLOAT)/1806*100,2) perc_sold
FROM invoice_line i
LEFT JOIN track_media tm ON i.track_id = tm.track_id
GROUP BY 1;
* sqlite:///chinook.db Done.
protected | tracks_sold | perc_sold |
---|---|---|
No | 1652 | 91.47 |
Yes | 154 | 8.53 |
Out of all the unique tracks sold (1806 tracks) by the store, ~91% comprises of unprotected media.