Chinook is a fictional digital music store. They have a database containing information on their music, employees, and customers. Taking the role of a hired data analysts, we will assist the Chinook team with answering some questions about their sales and their employees.
The database (.db) file and schema diagram (.png) can be downloaded here
To start querying with SQL from Jupyter Notebook, we have to first run the magic commands below. As seen, our SQL flavor is SQLite and the name of our database is chinook.db
.
%%capture
%load_ext sql
%sql sqlite:///chinook.db
Let's familiarize ourselves with the tables present in our database. Note that for every succeeding code cell, we need to place the magic command %%sql
at the beginning in order for our query to be executed.
%%sql
SELECT
name,
type
FROM sqlite_master
WHERE type IN ('table');
* sqlite:///chinook.db Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
We have 11 tables in the database. The schema diagram below shows how the tables are related to one another. This will help us in constructing our queries. The cells highlighted in yellow are the primary keys of the tables. Each are linked to a corresponding cell in another table which is either its primary key or foreign key.
from IPython.display import Image
Image(filename='chinook schema.png', width = (726), height = (699))
NOTE: For the queries in this project, a breakdown of the full query will be done to better understand the logic behind it. At the beginning of each breakdown, a link will be provided if you wish to skip to the full query.
The Chinook team has decided to add more music albums in their inventory. These albums will be sold exclusively to customers from the USA. They have been provided with four artists, whose albums they can add, that are listed below. The corresponding genre of each artist's albums are also shown.
Artist Name | Genre |
---|---|
Regal | Hip-Hop |
Red Tone | Punk |
Meteor and the Girls | Pop |
Slim Jim Bites | Blues |
The team has decided to select only three from among the artists. They want us to determine the top three selling genres in the USA. They will use this as criteria for selecting the three artists whose albums they will add.
CLICK HERE to skip to the full query.
To begin constructing our query, we need to know which tables from the database we are going to use. Based on the requirements of the team, we will use the following tables:
Let's get an overview of their contents.
%%sql
SELECT * FROM genre
ORDER BY RANDOM()
LIMIT 5;
* sqlite:///chinook.db Done.
genre_id | name |
---|---|
9 | Pop |
23 | Alternative |
5 | Rock And Roll |
11 | Bossa Nova |
12 | Easy Listening |
%%sql
SELECT * FROM track
ORDER BY RANDOM()
LIMIT 5;
* sqlite:///chinook.db Done.
track_id | name | album_id | media_type_id | genre_id | composer | milliseconds | bytes | unit_price |
---|---|---|---|---|---|---|---|---|
1797 | It's a Mistake | 147 | 1 | 1 | None | 273371 | 8979965 | 0.99 |
717 | Eu Sou Neguinha (Ao Vivo) | 56 | 1 | 7 | None | 251768 | 8376000 | 0.99 |
751 | Love Child | 58 | 1 | 1 | Bolin/Coverdale | 188160 | 6173806 | 0.99 |
3466 | Rehab (Hot Chip Remix) | 321 | 2 | 14 | None | 418293 | 6670600 | 0.99 |
1679 | Dezesseis | 139 | 1 | 7 | Renato Russo | 323918 | 10573515 | 0.99 |
%%sql
SELECT * FROM invoice_line
LIMIT 15;
* sqlite:///chinook.db Done.
invoice_line_id | invoice_id | track_id | unit_price | quantity |
---|---|---|---|---|
1 | 1 | 1158 | 0.99 | 1 |
2 | 1 | 1159 | 0.99 | 1 |
3 | 1 | 1160 | 0.99 | 1 |
4 | 1 | 1161 | 0.99 | 1 |
5 | 1 | 1162 | 0.99 | 1 |
6 | 1 | 1163 | 0.99 | 1 |
7 | 1 | 1164 | 0.99 | 1 |
8 | 1 | 1165 | 0.99 | 1 |
9 | 1 | 1166 | 0.99 | 1 |
10 | 1 | 1167 | 0.99 | 1 |
11 | 1 | 1168 | 0.99 | 1 |
12 | 1 | 1169 | 0.99 | 1 |
13 | 1 | 1170 | 0.99 | 1 |
14 | 1 | 1171 | 0.99 | 1 |
15 | 1 | 1172 | 0.99 | 1 |
%%sql
SELECT * FROM invoice
LIMIT 5;
* sqlite:///chinook.db Done.
invoice_id | customer_id | invoice_date | billing_address | billing_city | billing_state | billing_country | billing_postal_code | total |
---|---|---|---|---|---|---|---|---|
1 | 18 | 2017-01-03 00:00:00 | 627 Broadway | New York | NY | USA | 10012-2612 | 15.84 |
2 | 30 | 2017-01-03 00:00:00 | 230 Elgin Street | Ottawa | ON | Canada | K2P 1L7 | 9.9 |
3 | 40 | 2017-01-05 00:00:00 | 8, Rue Hanovre | Paris | None | France | 75002 | 1.98 |
4 | 18 | 2017-01-06 00:00:00 | 627 Broadway | New York | NY | USA | 10012-2612 | 7.92 |
5 | 27 | 2017-01-07 00:00:00 | 1033 N Park Ave | Tucson | AZ | USA | 85719 | 16.83 |
The difference between invoice and invoice_line is that each row of the latter contains purchase information for a single track while the former contains information on a customer's single purchase. A single purchase may contain a single track, multiple tracks, entire albums, or any combination. The invoice table also contains the billing country for each purchase made. We can confirm these with the tables shown above.
The first step we can do is to create a Common Table Expression (CTE) or "temporary view" that will contain all of the tracks that were sold with USA as the billing country in the invoice.
WITH
tracks_sold_USA AS
(
SELECT
g.name genre,
t.name song_title,
i.billing_country
FROM genre g
LEFT JOIN track t ON t.genre_id = g.genre_id
LEFT JOIN invoice_line il ON t.track_id = il.track_id
LEFT JOIN invoice i ON i.invoice_id = il.invoice_id
WHERE billing_country = 'USA'
)
In the cell above, we simply combined the four tables that we need with Left Joins, using the keys as shown in the schema diagram. We made sure to take into account every single purchase made from the store. We then filtered the result set (or the output) to only include three columns and the rows where billing_country
is equal to USA. The query above is used as a CTE in our full query using the WITH
clause. It is labeled as tracks_sold_USA
SELECT
genre,
COUNT(song_title) songs_sold,
ROUND(CAST(COUNT(song_title) AS FLOAT) / (
SELECT
COUNT(*)
FROM tracks_sold_USA
) * 100, 4) perc_share
FROM tracks_sold_USA
GROUP BY 1
ORDER BY 2 DESC;
The main query above will output the list of genres in order of the number of songs each has sold in the USA. The rows of the tracks_sold_USA
are grouped by genre. For each genre, the number of songs sold is counted. The values are stored in a column, songs_sold
. The corresponding percentage shares of these values are computed by dividing each by the total number of songs listed in tracks_sold_USA
. This total value is computed using a nested query as can be seen.
%%sql
WITH
tracks_sold_USA AS
(
SELECT
g.name genre,
t.name song_title,
i.billing_country
FROM genre g
LEFT JOIN track t ON t.genre_id = g.genre_id
LEFT JOIN invoice_line il ON t.track_id = il.track_id
LEFT JOIN invoice i ON i.invoice_id = il.invoice_id
WHERE billing_country = 'USA'
)
SELECT
genre,
COUNT(song_title) songs_sold,
ROUND(CAST(COUNT(song_title) AS FLOAT) / (
SELECT
COUNT(*)
FROM tracks_sold_USA
) * 100, 4) perc_share
FROM tracks_sold_USA
GROUP BY 1
ORDER BY 2 DESC;
* sqlite:///chinook.db Done.
genre | songs_sold | perc_share |
---|---|---|
Rock | 561 | 53.3777 |
Alternative & Punk | 130 | 12.3692 |
Metal | 124 | 11.7983 |
R&B/Soul | 53 | 5.0428 |
Blues | 36 | 3.4253 |
Alternative | 35 | 3.3302 |
Pop | 22 | 2.0932 |
Latin | 22 | 2.0932 |
Hip Hop/Rap | 20 | 1.9029 |
Jazz | 14 | 1.3321 |
Easy Listening | 13 | 1.2369 |
Reggae | 6 | 0.5709 |
Electronica/Dance | 5 | 0.4757 |
Classical | 4 | 0.3806 |
Heavy Metal | 3 | 0.2854 |
Soundtrack | 2 | 0.1903 |
TV Shows | 1 | 0.0951 |
Based on the results of our full query, the Chinook team should choose the following artists:
Together, the three genres only make up for around 18% of the total share of songs sold in the USA. The team should be on the lookout for Rock albums as this genre accounts for 53%.
After their meeting on employee performance assessment, the managers of the store request us to help them determine their top employee for sales. They suggest that we use the total dollar amount of sales of each sales support agent as a metric.
CLICK HERE to skip to the full query.
Given the scenario, we will use the following tables:
Let's explore the first two tables.
%%sql
SELECT
*
FROM employee
* sqlite:///chinook.db Done.
employee_id | last_name | first_name | title | reports_to | birthdate | hire_date | address | city | state | country | postal_code | phone | fax | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Adams | Andrew | General Manager | None | 1962-02-18 00:00:00 | 2016-08-14 00:00:00 | 11120 Jasper Ave NW | Edmonton | AB | Canada | T5K 2N1 | +1 (780) 428-9482 | +1 (780) 428-3457 | andrew@chinookcorp.com |
2 | Edwards | Nancy | Sales Manager | 1 | 1958-12-08 00:00:00 | 2016-05-01 00:00:00 | 825 8 Ave SW | Calgary | AB | Canada | T2P 2T3 | +1 (403) 262-3443 | +1 (403) 262-3322 | nancy@chinookcorp.com |
3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | jane@chinookcorp.com |
4 | Park | Margaret | Sales Support Agent | 2 | 1947-09-19 00:00:00 | 2017-05-03 00:00:00 | 683 10 Street SW | Calgary | AB | Canada | T2P 5G3 | +1 (403) 263-4423 | +1 (403) 263-4289 | margaret@chinookcorp.com |
5 | Johnson | Steve | Sales Support Agent | 2 | 1965-03-03 00:00:00 | 2017-10-17 00:00:00 | 7727B 41 Ave | Calgary | AB | Canada | T3B 1Y7 | 1 (780) 836-9987 | 1 (780) 836-9543 | steve@chinookcorp.com |
6 | Mitchell | Michael | IT Manager | 1 | 1973-07-01 00:00:00 | 2016-10-17 00:00:00 | 5827 Bowness Road NW | Calgary | AB | Canada | T3B 0C5 | +1 (403) 246-9887 | +1 (403) 246-9899 | michael@chinookcorp.com |
7 | King | Robert | IT Staff | 6 | 1970-05-29 00:00:00 | 2017-01-02 00:00:00 | 590 Columbia Boulevard West | Lethbridge | AB | Canada | T1K 5N8 | +1 (403) 456-9986 | +1 (403) 456-8485 | robert@chinookcorp.com |
8 | Callahan | Laura | IT Staff | 6 | 1968-01-09 00:00:00 | 2017-03-04 00:00:00 | 923 7 ST NW | Lethbridge | AB | Canada | T1H 1Y8 | +1 (403) 467-3351 | +1 (403) 467-8772 | laura@chinookcorp.com |
We see that there are only three employees whose total sales we want to determine. What we can do in our final query is to concatenate the last_name
and first_name
columns to display each employee's full name in a single column.
We can also notice that there is a column, hire_date
. We should include this in our result set to see if this could have a potential influence in an employee's sales.
%%sql
SELECT * FROM customer
LIMIT 5;
* sqlite:///chinook.db Done.
customer_id | first_name | last_name | company | address | city | state | country | postal_code | phone | fax | support_rep_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
4 | Bjørn | Hansen | None | Ullevålsveien 14 | Oslo | None | Norway | 0171 | +47 22 44 22 22 | None | bjorn.hansen@yahoo.no | 4 |
5 | František | Wichterlová | JetBrains s.r.o. | Klanova 9/506 | Prague | None | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | frantisekw@jetbrains.com | 4 |
Our query does not have to be complex to answer the managers' question. First is to combine our three tables using Left Joins. We can take into account all purchases of all customers by using invoice
as the last Left Join. We then filter our result set where the title
column of employee
is equal to Sales Support Agent
. Grouping by employee name (full name), we create a new column containing the total sales per employee. This is done by taking the sum of the total
column from the invoice
table. The total
column contains the amount paid per purchase.
Our final query can be found in the next section below.
%%sql
SELECT
e.first_name ||' '|| e.last_name employee_name,
e.title,
hire_date,
ROUND(SUM(total), 2) total_sales_per_employee
FROM employee e
LEFT JOIN customer c ON c.support_rep_id = e.employee_id
LEFT JOIN invoice i ON i.customer_id = c.customer_id
WHERE e.title = 'Sales Support Agent'
GROUP BY employee_name;
* sqlite:///chinook.db Done.
employee_name | title | hire_date | total_sales_per_employee |
---|---|---|---|
Jane Peacock | Sales Support Agent | 2017-04-01 00:00:00 | 1731.51 |
Margaret Park | Sales Support Agent | 2017-05-03 00:00:00 | 1584.0 |
Steve Johnson | Sales Support Agent | 2017-10-17 00:00:00 | 1393.92 |
We can see that Jane Peacock has the highest total sales at 1731.51 USD. Looking at the hire_date
column, though, we notice that she was the first to be hired among the three Sales Support Agents. The other two, who were hired at later dates, have correspondingly lower total sales. Though not conclusive in itself, it is possible that the hire dates influenced the total sales that each employee was able to make.
The team wants to find out how the store is performing among different countries. To do this, we have been tasked to analyze customers' sales data from each country. We have been told use the country
column from the customer
table instead of billing_country
from the invoice
table. For each country, we need to include the following information in our results:
We are also told to group countries with only one customer as "Other". The result set should be in descending order of total sales. The "Other" group should be at the bottom of the list regardless of total sales.
CLICK HERE to skip to the full query.
We will only need three tables for this query:
However, due to the nature of the request, the query will be more complex than our previous ones. Here's how our thought process would go when constructing it:
For step one, we can create a CTE just like when we determined the top genres in the USA. Here's how it will look like:
WITH
country_or_other AS
(
SELECT
c.customer_id,
CASE
WHEN
(
SELECT COUNT(*)
FROM customer
WHERE country = c.country
) = 1 THEN 'Other'
ELSE country
END AS country_grouped,
il.*
FROM customer c
LEFT JOIN invoice i ON i.customer_id = c.customer_id
LEFT JOIN invoice_line il ON il.invoice_id = i.invoice_id
ORDER BY invoice_line_id
)
It looks a bit complicated at first due to the CASE
statement. The rest of the query just combines our tables, selects certains columns, and orders the temporary result set by invoice_line_id
. Let's take a closer look at the CASE
statement:
CASE
WHEN
(
SELECT COUNT(*)
FROM customer
WHERE country = c.country
) = 1 THEN 'Other'
ELSE country
END AS country_grouped
The CASE
statement simply returns an output based on a certain condition. In our case, it outputs either "Other" or a country's name. After our three tables are combined, the CASE
statement acts on the country
column of each row. For each row, nested query counts the number of occurrences of the country's name in the original customer
table. If the occurrence is one, "Other" is returned. If otherwise, the country's name is returned. The results are stored in a new column, country_grouped
.
For step two, we can create another CTE:
country_ordered AS
(
SELECT
*,
CASE
WHEN country_grouped = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_or_other
)
The above CTE simply takes all the columns from the country_or_other
CTE and creates an additional one. This new column assigns a label to the row based on the country_grouped
column. If a row belongs to the "Other" group, a label of 1
is assigned, 0
if otherwise. Sorting the final result set by this column first will allow the "Other" group to be always at the bottom, regardless of total sales. This columns is aptly named sort
.
Since we will use multiple CTEs, the WITH
statement need only appear once before the first CTE. A comma needs to separate the two. These are seen in the full query.
Our main query can now be constructed to address steps three and four:
SELECT
country_grouped,
COUNT(DISTINCT(customer_id)) customers_per_country,
ROUND(SUM(unit_price), 2) total_sales_country,
ROUND(SUM(unit_price) / COUNT(DISTINCT(customer_id)), 2) ave_sales_customer,
ROUND(SUM(unit_price) / COUNT(DISTINCT(invoice_id)), 2) ave_order_value
FROM country_ordered
GROUP BY country_grouped
ORDER BY sort, 3 DESC;
From country_ordered
, we select the country_grouped
column and group the result set by this. We create four new ones containing the required information. The last three columns are rounded to two decimal places to make the output more readable. Note that in the ORDER BY
statement, the result set is first ordered by the sort
column, followed by the total_sales_country
column.
We can combine everything in the full query below.
%%sql
WITH
country_or_other AS
(
SELECT
c.customer_id,
CASE
WHEN
(
SELECT COUNT(*)
FROM customer
WHERE country = c.country
) = 1 THEN 'Other'
ELSE country
END AS country_grouped,
il.*
FROM customer c
LEFT JOIN invoice i ON i.customer_id = c.customer_id
LEFT JOIN invoice_line il ON il.invoice_id = i.invoice_id
ORDER BY invoice_line_id
),
country_ordered AS
(
SELECT
*,
CASE
WHEN country_grouped = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_or_other
)
SELECT
country_grouped,
COUNT(DISTINCT(customer_id)) customers_per_country,
ROUND(SUM(unit_price), 2) total_sales_country,
ROUND(SUM(unit_price) / COUNT(DISTINCT(customer_id)), 2) ave_sales_customer,
ROUND(SUM(unit_price) / COUNT(DISTINCT(invoice_id)), 2) ave_order_value
FROM country_ordered
GROUP BY country_grouped
ORDER BY sort, 3 DESC;
* sqlite:///chinook.db Done.
country_grouped | customers_per_country | total_sales_country | ave_sales_customer | ave_order_value |
---|---|---|---|---|
USA | 13 | 1040.49 | 80.04 | 7.94 |
Canada | 8 | 535.59 | 66.95 | 7.05 |
Brazil | 5 | 427.68 | 85.54 | 7.01 |
France | 5 | 389.07 | 77.81 | 7.78 |
Germany | 4 | 334.62 | 83.66 | 8.16 |
Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
Portugal | 2 | 185.13 | 92.57 | 6.38 |
India | 2 | 183.15 | 91.58 | 8.72 |
Other | 15 | 1094.94 | 73.0 | 7.45 |
Above, we can see that our query executed correctly. Sales data of customers from each country is summarized above in a clear manner. We see that the USA has the highest number of customers and the highest total sales. As for customers who spend the most per purchase, we can see that these come from the Czech Republic.
If the Chinook team wishes to ramp up their marketing in other countries, the Czech Republic, United Kingdom, and India are possible countries we could recommend to them. This is because the customers from these countries have the highest expenditures per purchase as compared to the rest. We would advise them, though, to gather more data as the sample sizes shown are quite small and may not be representative of other potential customers from these countries.
The Chinook team is discussing how to conduct their future inventory purchases. They want to find out if it will be a good choice to purchase only individual tracks and to stop purchasing whole albums. To help them in their decision, we need to determine how many of the number total purchases fall in either category.
There are some caveats to this that make our task tricky. One is that there are some albums that have only track. Another caveat is that, in a single purchase (single invoice), a customer may have purchased an entire album plus some individual tracks. The two caveats may be combined in that the album in combined purchase is a single track album. Taking these into account will complicate our query by having to classify that purchase as both an "Album" and an "Individual" purchase. It is also possible that a customer (for some reason) may have bought the tracks of an entire album in separate purchases or invoices.
Fortunately, we have been told not to worry about these due to either their non-existence or their negligibility in the dataset.
CLICK HERE to skip to the full query.
The following tables will be used in our query:
With the advice given to us, we can simplify our query to make the following classifications:
Similar to our previous queries, we will create a CTE to handle our classification. If you skip to the full query, it is evident that this is our longest CTE. This is due to the lengthy CASE
statement used for classification. Let's take a look at it below without the CASE
and WITH
statements.
%%sql
SELECT
il.invoice_id,
invoice_line_id,
t.track_id,
album_id,
ROUND(SUM(il.unit_price),2) purchase_price
FROM invoice_line il
LEFT JOIN track t ON t.track_id = il.track_id
GROUP BY invoice_id
LIMIT 10;
* sqlite:///chinook.db Done.
invoice_id | invoice_line_id | track_id | album_id | purchase_price |
---|---|---|---|---|
1 | 1 | 1158 | 91 | 15.84 |
2 | 17 | 3476 | 322 | 9.9 |
3 | 27 | 2516 | 203 | 1.98 |
4 | 29 | 3448 | 314 | 7.92 |
5 | 37 | 1986 | 163 | 16.83 |
6 | 54 | 1045 | 83 | 1.98 |
7 | 56 | 3477 | 322 | 10.89 |
8 | 67 | 2231 | 183 | 9.9 |
9 | 77 | 206 | 21 | 8.91 |
10 | 86 | 2663 | 215 | 1.98 |
We combine invoice_line
and track
. Grouping by the invoice_id
column, we only perform a summary operation on unit_price
to get the price of each purchase. For the other three columns, the retained values are actually the first values for each group (recall that an invoice can contain multiple tracks). Only the first track is retained if we do not carry out summary operations like SUM()
or MAX()
. We will only need these first values for our classification as we will see below.
CASE
WHEN
(
/* FIRST PART */
SELECT track_id
FROM track
WHERE album_id = t.album_id /* Selects whole album */
EXCEPT
SELECT track_id
FROM invoice_line
WHERE invoice_id = il.invoice_id /* Selects all tracks per purchase*/
) IS NULL
AND
/* SECOND PART */
(
SELECT track_id
FROM invoice_line
WHERE invoice_id = il.invoice_id /* Selects all tracks per purchase*/
EXCEPT
SELECT track_id
FROM track
WHERE album_id = t.album_id /* Selects whole album */
) IS NULL
THEN 'Album'
ELSE 'Individual'
We can see that there are actually just two subqueries repeated twice:
/*SUBQUERY 1*/
SELECT track_id
FROM track
WHERE album_id = t.album_id /* Selects whole album */
/*SUBQUERY 2*/
SELECT track_id
FROM invoice_line
WHERE invoice_id = il.invoice_id /* Selects all tracks per purchase*/
Acting on each row (refer to the last table) or on each purchase (invoice_id
), SUBQUERY 1 selects the whole album to which the track belongs to. SUBQUERY 2 selects all the tracks present in the current purchase.
Taking the AND
statement as a division, the first and second parts each compare both subqueries with the EXCEPT
statement. The EXCEPT
statement selects rows that occur in the first subquery but do not occur in the second subquery. A NULL
value is returned if all rows from the first subquery are also in the second subquery. This is why the positions of SUBQUERY 1 and SUBQUERY 2 are interchanged in both parts of the CASE
statement.
Let's say a customer purchases multiple individual tracks from various albums. When the CASE
statement acts on this purchase, SUBQUERY 1 selects the whole album of the first track in the invoice. SUBQUERY 2 selects all the tracks in the invoice. If the first track comes from a multi-track album, the first part of the CASE
statement will return NOT NULL
since there will be tracks not present in SUBQUERY 2. If the first track comes from a single-track album, NULL
will be returned since all rows (only one in this case) also appear in SUBQUERY 2. If there was not a second part to the CASE
statement, this would be classified as an Album purchase. The second part, which interchanges the positions of both subqueries, prevents this from happening by applying the same logic. Since not all rows from SUBQUERY 2 appear in SUBQUERY 1, NOT NULL
will be returned.
The CASE
statement above returns Album only if both parts return NULL
. Otherwise, it returns Individual. Below are the first 10 rows of the table after classification:
invoice_id | invoice_line_id | track_id | album_id | purchase_price | purchase_type |
---|---|---|---|---|---|
1 | 1 | 1158 | 91 | 15.84 | Album |
2 | 17 | 3476 | 322 | 9.9 | Individual |
3 | 27 | 2516 | 203 | 1.98 | Individual |
4 | 29 | 3448 | 314 | 7.92 | Individual |
5 | 37 | 1986 | 163 | 16.83 | Album |
6 | 54 | 1045 | 83 | 1.98 | Individual |
7 | 56 | 3477 | 322 | 10.89 | Individual |
8 | 67 | 2231 | 183 | 9.9 | Individual |
9 | 77 | 206 | 21 | 8.91 | Individual |
10 | 86 | 2663 | 215 | 1.98 | Individual |
We can construct our main query to calculate how many of the purchases fall in either type. We will also calculate the percentages of sales for each type.
SELECT
purchase_type,
COUNT(invoice_id) invoice_count,
ROUND(CAST(COUNT(invoice_id) AS FLOAT)/ (
SELECT COUNT(*)
FROM invoice
), 2) perc_of_purchases,
ROUND(SUM(purchase_price) / (
SELECT SUM(unit_price)
FROM invoice_line
), 2) perc_of_sales
FROM album_or_individual aoi
GROUP BY 1
%%sql
WITH album_or_individual AS
(
SELECT
il.invoice_id,
invoice_line_id,
t.track_id,
album_id,
ROUND(SUM(il.unit_price),2) purchase_price,
CASE
WHEN
(
SELECT track_id
FROM track
WHERE album_id = t.album_id /* Selects whole album */
EXCEPT
SELECT track_id
FROM invoice_line
WHERE invoice_id = il.invoice_id /* Selects all tracks per purchase*/
) IS NULL
AND
(
SELECT track_id
FROM invoice_line
WHERE invoice_id = il.invoice_id /* Selects all tracks per purchase*/
EXCEPT
SELECT track_id
FROM track
WHERE album_id = t.album_id /* Selects whole album */
) IS NULL
THEN 'Album'
ELSE 'Individual'
END AS purchase_type
FROM invoice_line il
LEFT JOIN track t ON t.track_id = il.track_id
GROUP BY invoice_id
)
SELECT
purchase_type,
COUNT(invoice_id) invoice_count,
ROUND(CAST(COUNT(invoice_id) AS FLOAT)/ (
SELECT COUNT(*)
FROM invoice
), 2) perc_of_purchases,
ROUND(SUM(purchase_price) / (
SELECT SUM(unit_price)
FROM invoice_line
), 2) perc_of_sales
FROM album_or_individual aoi
GROUP BY 1
* sqlite:///chinook.db Done.
purchase_type | invoice_count | perc_of_purchases | perc_of_sales |
---|---|---|---|
Album | 114 | 0.19 | 0.31 |
Individual | 500 | 0.81 | 0.69 |
Based on the contents of our result set, we can see that the majority of purchases made are those composed of individual tracks. Whole album purchases account for almost 1/5 or almost 20% of customer purchases. In spite of the lower percentage of album purchases, the Chinook team should NOT decide to buy individual tracks excusively. 20% is still a significant portion of their purchases and choosing to not buy whole albums can decrease their sales significantly. This potential effect is affirmed by the 31% figure of sales that can be attributed to album purchases.
Top Genres in the USA
Top Performing Employees
Analyzing Sales by Country
Albums vs. Individual Tracks