In this guided project, we're going to practice using our SQL skills to answer business questions.
We'll continue to use the Chinook database that we've used in earlier lessons of this course. The Chinook database is provided as a SQLite database file called chinook.db. A copy of the database schema is below - you'll need to come back to this step often to consult the schema as you write your queries.
We'll use the code below to connect our Jupyter Notebook to our database file:
%%capture
%load_ext sql
%sql sqlite:///chinook.db
'Connected: None@chinook.db'
We will familiarize ourselves with the data by querying the database to get a list of all tables and views in our database:
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table","view")
Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
The Chinook record store has just signed a deal with a new record label, and we have been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:
Artist Name | Genre |
---|---|
Regal | Hip-Hop |
Red Tone | Punk |
Meteor and the Girls | Pop |
Slim Jim Bites | Blues |
The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.
We will write a query to find out which genres sell the most tracks in the USA a follow
%%sql
WITH usa_tracks_sold AS
(
SELECT
il.*
FROM invoice_line AS il
INNER JOIN invoice AS i ON i.invoice_id = il.invoice_id
INNER JOIN customer AS c ON c.customer_id = i.customer_id
WHERE c.country = "USA"
)
SELECT
g.name AS genre,
CAST(COUNT(uts.invoice_line_id) AS FLOAT)/(SELECT COUNT(*) FROM usa_tracks_sold) AS percentage_sold
FROM usa_tracks_sold AS uts
INNER JOIN track t on t.track_id = uts.track_id
INNER JOIN genre g on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Done.
genre | percentage_sold |
---|---|
Rock | 0.5337773549000951 |
Alternative & Punk | 0.12369172216936251 |
Metal | 0.11798287345385347 |
R&B/Soul | 0.05042816365366318 |
Blues | 0.03425309229305423 |
Alternative | 0.03330161750713606 |
Latin | 0.02093244529019981 |
Pop | 0.02093244529019981 |
Hip Hop/Rap | 0.019029495718363463 |
Jazz | 0.013320647002854425 |
Based on the sales of tracks across different genres in the USA we computed above, the most popular genres are Rock, Alternative & Punk and Metal. Therefore, we should purchase the new albums by the following artists:
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. Our task is to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.
%%sql
WITH customer_support_rep_sales AS
(
SELECT
i.customer_id,
c.support_rep_id,
SUM(i.total) AS total
FROM invoice AS i
INNER JOIN customer AS c on c.customer_id = i.customer_id
GROUP BY 1,2
)
SELECT
e.first_name || " " || e.last_name employee,
e.hire_date,
SUM(csrs.total) AS total_sales
FROM customer_support_rep_sales AS csrs
INNER JOIN employee AS e on e.employee_id = csrs.support_rep_id
GROUP BY 1;
Done.
employee | hire_date | total_sales |
---|---|---|
Jane Peacock | 2017-04-01 00:00:00 | 1731.5099999999998 |
Margaret Park | 2017-05-03 00:00:00 | 1584.0000000000002 |
Steve Johnson | 2017-10-17 00:00:00 | 1393.92 |
The top sales support agents are Jane Peacock, Margaret Park and Steve Johnson. While there is $\$337.59$ (about 20%) difference in sales between Jane (the top sales agent) and Steve (the bottom sales agent), this difference roughly corresponds to the differences in their hiring dates.
Our next task is to analyze the sales data for customers from each different country. We will use the country value from the customers table, and ignore the country from the billing address in the invoice table.
In particular, we are directed to calculate data, for each country, on the:
Because there are a number of countries with only one customer, we will group these customers as "Other" in our analysis. We will use the following 'trick' to force the ordering of "Other" to last in your analysis.
If there is a particular value that we would like to force to the top or bottom of results, we can put what would normally be your most outer query in a subquery with a case statement that adds a numeric column, and then in the outer query sort by that column.
%%sql
(sqlite3.OperationalError) near "AS": syntax error [SQL: WITH country_other AS] (Background on this error at: http://sqlalche.me/e/13/e3q8)
%%sql
WITH country_or_other AS
(
SELECT
CASE
WHEN (SELECT COUNT(*)
FROM CUSTOMER
WHERE country=c.country
) = 1 THEN "Other"
ELSE c.country
END AS country,
c.customer_id,
il.*
FROM invoice_line AS il
INNER JOIN invoice AS i ON i.invoice_id = il.invoice_id
INNER JOIN customer c ON c.customer_id = i.customer_id
)
SELECT
country,
customers,
total_sales,
average_order,
customer_lifetime_value
FROM
(
SELECT
country,
count(distinct customer_id) AS customers,
SUM(unit_price) As total_sales,
SUM(unit_price)/count(distinct customer_id) AS customer_lifetime_value,
SUM(unit_price)/count(distinct invoice_id) AS average_order,
CASE
WHEN country = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_or_other
GROUP BY country
ORDER BY sort ASC, total_sales DESC
);
Done.
country | customers | total_sales | average_order | customer_lifetime_value |
---|---|---|---|---|
USA | 13 | 1040.490000000008 | 7.942671755725252 | 80.03769230769292 |
Canada | 8 | 535.5900000000034 | 7.047236842105309 | 66.94875000000043 |
Brazil | 5 | 427.68000000000245 | 7.011147540983647 | 85.53600000000048 |
France | 5 | 389.0700000000021 | 7.781400000000042 | 77.81400000000042 |
Germany | 4 | 334.6200000000016 | 8.161463414634186 | 83.6550000000004 |
Czech Republic | 2 | 273.24000000000103 | 9.108000000000034 | 136.62000000000052 |
United Kingdom | 3 | 245.52000000000078 | 8.768571428571457 | 81.84000000000026 |
Portugal | 2 | 185.13000000000022 | 6.3837931034482835 | 92.56500000000011 |
India | 2 | 183.1500000000002 | 8.72142857142858 | 91.5750000000001 |
Other | 15 | 1094.9400000000085 | 7.448571428571486 | 72.99600000000056 |
Based on our findings, there might be opportunity in the following countries:
The Chinook store is setup in a way that allows customer to make purchases in one of the two ways:
The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.
Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.
We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.
%%sql
WITH invoice_first_track AS
(
SELECT
il.invoice_id invoice_id,
MIN(il.track_id) first_track_id
FROM invoice_line il
GROUP BY 1
)
SELECT
album_purchase,
COUNT(invoice_id) number_of_invoices,
CAST(count(invoice_id) AS FLOAT) / (
SELECT COUNT(*) FROM invoice
) percent
FROM
(
SELECT
ifs.*,
CASE
WHEN
(
SELECT t.track_id FROM track t
WHERE t.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
EXCEPT
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
) IS NULL
AND
(
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
EXCEPT
SELECT t.track_id FROM track t
WHERE t.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
) IS NULL
THEN "yes"
ELSE "no"
END AS "album_purchase"
FROM invoice_first_track ifs
)
GROUP BY album_purchase;
Done.
album_purchase | number_of_invoices | percent |
---|---|---|
no | 500 | 0.8143322475570033 |
yes | 114 | 0.18566775244299674 |
Only 18.6% of the customers purchased the album. According to the data we recommend against purchasing only selected tracks from albums from record componies, since there is potential to lose one fifth of the revenue.