Before we start with the analysis let's see that Relational Database we will be using for our analysis :
Let's first connect to the database and get all the tables :
%%capture
%load_ext sql
%sql sqlite:///chinook.db
'Connected: None@chinook.db'
Let's get all the tables in the database :
%%sql
SELECT name, type
FROM sqlite_master
WHERE type IN ('table','view') ;
Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
track_album | view |
alb_min_track_id | view |
blah | view |
blah_1 | view |
tracks_available | view |
Now that we have all the tables lets see a few rows from each of these tables :
%%sql
SELECT *
FROM album
LIMIT 3 ;
Done.
album_id | title | artist_id |
---|---|---|
1 | For Those About To Rock We Salute You | 1 |
2 | Balls to the Wall | 2 |
3 | Restless and Wild | 2 |
%%sql
SELECT *
FROM album
LIMIT 3 ;
Done.
album_id | title | artist_id |
---|---|---|
1 | For Those About To Rock We Salute You | 1 |
2 | Balls to the Wall | 2 |
3 | Restless and Wild | 2 |
%%sql
SELECT *
FROM artist
LIMIT 3 ;
Done.
artist_id | name |
---|---|
1 | AC/DC |
2 | Accept |
3 | Aerosmith |
%%sql
SELECT *
FROM customer
LIMIT 3 ;
Done.
customer_id | first_name | last_name | company | address | city | state | country | postal_code | phone | fax | support_rep_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
%%sql
SELECT *
FROM employee
LIMIT 3 ;
Done.
employee_id | last_name | first_name | title | reports_to | birthdate | hire_date | address | city | state | country | postal_code | phone | fax | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Adams | Andrew | General Manager | None | 1962-02-18 00:00:00 | 2016-08-14 00:00:00 | 11120 Jasper Ave NW | Edmonton | AB | Canada | T5K 2N1 | +1 (780) 428-9482 | +1 (780) 428-3457 | andrew@chinookcorp.com |
2 | Edwards | Nancy | Sales Manager | 1 | 1958-12-08 00:00:00 | 2016-05-01 00:00:00 | 825 8 Ave SW | Calgary | AB | Canada | T2P 2T3 | +1 (403) 262-3443 | +1 (403) 262-3322 | nancy@chinookcorp.com |
3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | jane@chinookcorp.com |
%%sql
SELECT *
FROM genre
LIMIT 3 ;
Done.
genre_id | name |
---|---|
1 | Rock |
2 | Jazz |
3 | Metal |
%%sql
SELECT *
FROM invoice
LIMIT 3 ;
Done.
invoice_id | customer_id | invoice_date | billing_address | billing_city | billing_state | billing_country | billing_postal_code | total |
---|---|---|---|---|---|---|---|---|
1 | 18 | 2017-01-03 00:00:00 | 627 Broadway | New York | NY | USA | 10012-2612 | 15.84 |
2 | 30 | 2017-01-03 00:00:00 | 230 Elgin Street | Ottawa | ON | Canada | K2P 1L7 | 9.9 |
3 | 40 | 2017-01-05 00:00:00 | 8, Rue Hanovre | Paris | None | France | 75002 | 1.98 |
%%sql
SELECT *
FROM invoice_line
LIMIT 3 ;
Done.
invoice_line_id | invoice_id | track_id | unit_price | quantity |
---|---|---|---|---|
1 | 1 | 1158 | 0.99 | 1 |
2 | 1 | 1159 | 0.99 | 1 |
3 | 1 | 1160 | 0.99 | 1 |
%%sql
SELECT *
FROM media_type
LIMIT 3 ;
Done.
media_type_id | name |
---|---|
1 | MPEG audio file |
2 | Protected AAC audio file |
3 | Protected MPEG-4 video file |
%%sql
SELECT *
FROM playlist
LIMIT 3 ;
Done.
playlist_id | name |
---|---|
1 | Music |
2 | Movies |
3 | TV Shows |
%%sql
SELECT *
FROM playlist_track
LIMIT 3 ;
Done.
playlist_id | track_id |
---|---|
1 | 3402 |
1 | 3389 |
1 | 3390 |
%%sql
SELECT *
FROM track
LIMIT 3 ;
Done.
track_id | name | album_id | media_type_id | genre_id | composer | milliseconds | bytes | unit_price |
---|---|---|---|---|---|---|---|---|
1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman | 230619 | 3990994 | 0.99 |
*Which Genre sells the best?
*
The Chinook record store has just signed a deal with a new record label, and we've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:
Artist Name | Genre |
---|---|
Regal | Hip-Hop |
Red Tone | Punk |
Meteor and the Girls | Pop |
Slim Jim Bites | Blues |
The record label specializes in artists from the USA, so we're interested in finding out which genres sell the best in the USA.
To find out which genres sell the most tracks in the USA, and making a recommendation for the three artists whose albums we should purchase for the store.
Steps for our analysis :
Genre
table to the tracks
tracks
by connecting the resulting the table from our first query and joining in turn with the invoice
table via invoice_line
Step 1 :
In order to determine the type of join to use we need to see the number of unique genre_id
in tracks
as well as genre
table so that we don't miss out on any listed genres.
%%sql
SELECT COUNT(name) AS 'unique_genre_id_from_genre'
FROM genre ;
Done.
unique_genre_id_from_genre |
---|
25 |
%%sql
SELECT COUNT(DISTINCT(genre_id)) AS 'unique_genre_id_from_track'
FROM track;
Done.
unique_genre_id_from_track |
---|
25 |
Since we see that the number of genre_ids
are same, we can use inner-join to get the track-genre
and make a view for the same :
%%sql
CREATE VIEW track_genre AS
SELECT t.* , g.name genre_name
FROM track t
INNER JOIN genre g ON g.genre_id = t.genre_id ;
Done.
[]
Step 2 : Let's get the total number of tracks sold and create another view with billing_country
as USA
in invoice
and left-join on invoice_line
:
%%sql
SELECT COUNT(quantity) AS 'total_tracks_sold'
FROM invoice_line ;
Done.
total_tracks_sold |
---|
4757 |
%%sql
CREATE VIEW usa_tracks AS
SELECT *
FROM invoice inv
INNER JOIN invoice_line inv_l ON inv.invoice_id = inv_l.invoice_id
WHERE billing_country = 'USA' ;
Done.
[]
Step 3 : Now that we have all the basic tables, let's join the obtained tables to get the values we desire :
%%sql
CREATE VIEW main_table AS
SELECT *
FROM track_genre t_g
INNER JOIN usa_tracks u_t ON t_g.track_id = u_t.track_id ;
Done.
[]
%%sql
WITH
total_songs AS ( SELECT SUM(m_t.quantity)
FROM main_table )
SELECT m_t.genre_name AS 'genre_name',
SUM(m_t.quantity) AS 'absolute_sales',
ROUND((CAST (SUM(m_t.quantity) AS FLOAT) / CAST((SELECT SUM(main_table.quantity) FROM main_table)
AS FLOAT)) * 100, 3)
AS 'pct_sold (%)'
FROM main_table AS m_t
GROUP BY m_t.genre_name
ORDER BY absolute_sales DESC;
Done.
genre_name | absolute_sales | pct_sold (%) |
---|---|---|
Rock | 561 | 53.378 |
Alternative & Punk | 130 | 12.369 |
Metal | 124 | 11.798 |
R&B/Soul | 53 | 5.043 |
Blues | 36 | 3.425 |
Alternative | 35 | 3.33 |
Latin | 22 | 2.093 |
Pop | 22 | 2.093 |
Hip Hop/Rap | 20 | 1.903 |
Jazz | 14 | 1.332 |
Easy Listening | 13 | 1.237 |
Reggae | 6 | 0.571 |
Electronica/Dance | 5 | 0.476 |
Classical | 4 | 0.381 |
Heavy Metal | 3 | 0.285 |
Soundtrack | 2 | 0.19 |
TV Shows | 1 | 0.095 |
Using the list of the top sold genres in the USA we can see which albums to be added to the store on the basis of their genre:
Genre_rank_USA | Genre_name | Sold_percent (%) | Artist Name | Recommendation |
---|---|---|---|---|
2 | Alternative & Punk |
12.36 % | Red Tone | Select |
5 | Blues |
3.45 % | Slim Jim Bites | Select |
8 | Pop |
2.09 % | Meteor and the Girls | Select |
9 | Hip Hop/Rap |
1.90 % | Regal | Hold |
*Analysing Employee Sales' Performance
*
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. We have been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.
To find the total dollar amount of sales assigned to each sales support agent within the company and finding attributes for that employee that is relevant to the analysis.
Steps for our analysis :
invoice_line
table to the invoice
customers
to get the ' No. of Tracks purchased by customers ' .employee
table we can get the support_rep
assigned to each customer who assisted them with their sales.Step 1 :
Getting all the information about the tracks sold using by joining the invoice_line
and invoice
tables :
%%sql
CREATE VIEW tracks_sold AS
SELECT *
FROM invoice inv
INNER JOIN invoice_line inv_l ON inv.invoice_id = inv_l.invoice_id ;
Done.
[]
Step 2 :
Getting the customer data by joining the table obtained in step 1 to the customer
table :
%%sql
CREATE VIEW customer_tracks AS
SELECT *
FROM tracks_sold t_s
INNER JOIN customer cust ON t_s.customer_id = cust.customer_id ;
Done.
[]
Step 3 :
Getting the final employee data by joining the table obtained to the employee
table :
%%sql
CREATE VIEW employee_sales AS
SELECT cust_tr.*,
emp.* ,
emp.first_name AS emp_first_name,
emp.last_name AS emp_last_name
FROM customer_tracks cust_tr
INNER JOIN employee emp ON emp.employee_id = cust_tr.support_rep_id ;
Done.
[]
Now that we have our final table let's see all the sales made by each employee and compare it with the number of days they have worked :
%%sql
SELECT emp_first_name || ' ' || emp_last_name AS Name,
employee_id,
hire_date,
ROUND((SELECT JULIANDAY('now')) - JULIANDAY(emp_s.hire_date)) AS days_worked_till_now,
SUM(quantity) AS tracks_sold,
ROUND (CAST (SUM(quantity) AS FLOAT) / ROUND((SELECT JULIANDAY('now')) -
JULIANDAY(emp_s.hire_date)),3)
AS average_sales_per_day
FROM employee_sales AS emp_s
GROUP BY employee_id
ORDER BY average_sales_per_day DESC ;
Done.
Name | employee_id | hire_date | days_worked_till_now | tracks_sold | average_sales_per_day |
---|---|---|---|---|---|
Jane Peacock | 3 | 2017-04-01 00:00:00 | 1497.0 | 1749 | 1.168 |
Margaret Park | 4 | 2017-05-03 00:00:00 | 1465.0 | 1600 | 1.092 |
Steve Johnson | 5 | 2017-10-17 00:00:00 | 1298.0 | 1408 | 1.085 |
Now that we have average metrics for the sales, let's also calculate the number of days each employee took to make his/her first sale :
%%sql
SELECT employee_id,
emp_first_name || ' ' || emp_last_name AS Name,
ROUND((SELECT JULIANDAY(invoice_date)) - JULIANDAY(emp_s.hire_date)) AS days_to_sale,
hire_date,
invoice_date
FROM employee_sales AS emp_s
WHERE employee_id = 5
ORDER BY invoice_date
LIMIT 5;
Done.
employee_id | Name | days_to_sale | hire_date | invoice_date |
---|---|---|---|---|
5 | Steve Johnson | -280.0 | 2017-10-17 00:00:00 | 2017-01-10 00:00:00 |
5 | Steve Johnson | -280.0 | 2017-10-17 00:00:00 | 2017-01-10 00:00:00 |
5 | Steve Johnson | -272.0 | 2017-10-17 00:00:00 | 2017-01-18 00:00:00 |
5 | Steve Johnson | -272.0 | 2017-10-17 00:00:00 | 2017-01-18 00:00:00 |
5 | Steve Johnson | -267.0 | 2017-10-17 00:00:00 | 2017-01-23 00:00:00 |
%%sql
SELECT employee_id,
emp_first_name || ' ' || emp_last_name AS Name,
ROUND((SELECT JULIANDAY(invoice_date)) - JULIANDAY(emp_s.hire_date)) AS days_to_sale,
hire_date,
invoice_date
FROM employee_sales AS emp_s
WHERE employee_id = 4
ORDER BY invoice_date
LIMIT 5;
Done.
employee_id | Name | days_to_sale | hire_date | invoice_date |
---|---|---|---|---|
4 | Margaret Park | -118.0 | 2017-05-03 00:00:00 | 2017-01-05 00:00:00 |
4 | Margaret Park | -118.0 | 2017-05-03 00:00:00 | 2017-01-05 00:00:00 |
4 | Margaret Park | -116.0 | 2017-05-03 00:00:00 | 2017-01-07 00:00:00 |
4 | Margaret Park | -116.0 | 2017-05-03 00:00:00 | 2017-01-07 00:00:00 |
4 | Margaret Park | -116.0 | 2017-05-03 00:00:00 | 2017-01-07 00:00:00 |
%%sql
SELECT employee_id,
emp_first_name || ' ' || emp_last_name AS Name,
ROUND((SELECT JULIANDAY(invoice_date)) - JULIANDAY(emp_s.hire_date)) AS days_to_sale,
hire_date,
invoice_date
FROM employee_sales AS emp_s
WHERE employee_id = 3
ORDER BY invoice_date
LIMIT 5 ;
Done.
employee_id | Name | days_to_sale | hire_date | invoice_date |
---|---|---|---|---|
3 | Jane Peacock | -88.0 | 2017-04-01 00:00:00 | 2017-01-03 00:00:00 |
3 | Jane Peacock | -88.0 | 2017-04-01 00:00:00 | 2017-01-03 00:00:00 |
3 | Jane Peacock | -88.0 | 2017-04-01 00:00:00 | 2017-01-03 00:00:00 |
3 | Jane Peacock | -88.0 | 2017-04-01 00:00:00 | 2017-01-03 00:00:00 |
3 | Jane Peacock | -88.0 | 2017-04-01 00:00:00 | 2017-01-03 00:00:00 |
We see that there is a major discrepancy between the registered hire_date and the first registered invoice_date for the sale. This shows that the invoice_date has been registered before the employee made the sale.
Let's consider the invoices after their hire date and see how many days it took for the employees to make their first sale :
%%sql
SELECT employee_id,
emp_first_name || ' ' || emp_last_name AS name,
hire_date,
invoice_date,
MIN(ROUND((SELECT JULIANDAY(invoice_date)) - JULIANDAY(emp_s.hire_date)))
AS min_days_to_first_sale
FROM employee_sales AS emp_s
WHERE ROUND((SELECT JULIANDAY(invoice_date)) - JULIANDAY(emp_s.hire_date)) > 0
GROUP BY employee_id
ORDER BY min_days_to_first_sale ;
Done.
employee_id | name | hire_date | invoice_date | min_days_to_first_sale |
---|---|---|---|---|
3 | Jane Peacock | 2017-04-01 00:00:00 | 2017-04-02 00:00:00 | 1.0 |
5 | Steve Johnson | 2017-10-17 00:00:00 | 2017-10-21 00:00:00 | 4.0 |
4 | Margaret Park | 2017-05-03 00:00:00 | 2017-05-18 00:00:00 | 15.0 |
Best performing employee on the basis of average sales per day and minimum days to make their first sale is :
Employee_id | Employee_name | Avg_sales_per_day | Days_to_make_first_sale | Total_tracks_sold | |
---|---|---|---|---|---|
3 | Jane Peacock |
1.17 | 1.0 | 1749 |
*Calculating Sales metrics by Country
*
To analyze the sales data for customers from different countries to calculate :
To collate data on purchases from different countries.
Steps for our analysis :
invoice_line
,invoice
and cusotmers
In the earlier analyses we have used Views to seperate out our steps and make the joining of tables easier. Let's now use Subqueries efficienlty in order to perform all the tasks in one go.
We will use the first subquery to primarily group the countries which have only one customer into the other
group. Also, we will calcuate some aggregates so that we can sort using those aggregates in our main outer query.
In our main outer query we will use CASE
again to pick out the other
group of countries and force it to the bottom of the list as they are not as important for our analysis.
%%sql
WITH country_grouped AS
(
SELECT *,
SUM(quantity) AS total_sales,
COUNT(DISTINCT(invoice_id)) number_of_orders,
COUNT(DISTINCT(customer_id)) number_customers,
ROUND(SUM(total),2) total_sales_value,
CASE
WHEN COUNT(DISTINCT(customer_id)) = 1 THEN 'Other'
ELSE customer_tracks.country
END AS country_group
FROM customer_tracks
GROUP BY country
)
SELECT country_group,
SUM(number_customers) AS total_customers,
total_sales,
number_of_orders,
ROUND(CAST (total_sales AS FLOAT) / CAST (number_of_orders AS FLOAT), 2 )
AS avg_sales_per_order,
total_sales_value,
ROUND(CAST (total_sales_value AS FLOAT) / CAST (SUM(number_customers) AS FLOAT), 2 )
AS avg_customer_value,
ROUND(CAST (total_sales_value AS FLOAT) / CAST (number_of_orders AS FLOAT), 2 )
AS avg_order_value
FROM
(
SELECT *,
CASE
WHEN c_g.country_group = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_grouped c_g
)
GROUP BY country_group
ORDER BY sort ASC, number_of_orders DESC ;
Done.
country_group | total_customers | total_sales | number_of_orders | avg_sales_per_order | total_sales_value | avg_customer_value | avg_order_value |
---|---|---|---|---|---|---|---|
USA | 13 | 1051 | 131 | 8.02 | 10405.89 | 800.45 | 79.43 |
Canada | 8 | 541 | 76 | 7.12 | 5489.55 | 686.19 | 72.23 |
Brazil | 5 | 432 | 61 | 7.08 | 4059.0 | 811.8 | 66.54 |
France | 5 | 393 | 50 | 7.86 | 3972.87 | 794.57 | 79.46 |
Germany | 4 | 338 | 41 | 8.24 | 3441.24 | 860.31 | 83.93 |
Czech Republic | 2 | 276 | 30 | 9.2 | 3183.84 | 1591.92 | 106.13 |
Portugal | 2 | 187 | 29 | 6.45 | 1755.27 | 877.63 | 60.53 |
United Kingdom | 3 | 248 | 28 | 8.86 | 2498.76 | 832.92 | 89.24 |
India | 2 | 185 | 21 | 8.81 | 1887.93 | 943.97 | 89.9 |
Other | 15 | 76 | 10 | 7.6 | 754.38 | 50.29 | 75.44 |
Other
: The other category comprises of countries with only 1 customer each.
Though USA
pulls out ahead in terms of total_sales
and total_customers
we see that Czech Republic
and India
have a higher average_customer_value
, which shows the potential those countries have for the company.
*Albums VS Individual Tracks
*
The Chinook store is setup in a way that allows customer to make purchases in one of the two ways:
The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately. We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.
Edge cases to avoid :
Categorize each invoice as either an album purchase or not, and calculate the following summary statistics:
Steps for our analysis :
Creating a view inv_min_track_id
which will contain the track_id
of the first track in per invoice.
Starting from the inner-most subquery :
minimun_track_id
per invoice earlier, select their album_ids and use their album_ids to get all the tracks in that album.invoice_id
we get all the tracks sold in that invoice.EXCEPT
clause to check if both the tables are same by using a combination IS NULL-AND
clauses and use the CASE
function to catergorise the albums to album_purchased
if the above tables are same.%%sql
CREATE VIEW inv_min_track_id AS
SELECT *, MIN(track_id) AS min_id_per_invoice
FROM invoice_line il
GROUP BY invoice_id ;
Done.
[]
%%sql
SELECT
SUM(quantity) total_invoices,
category,
ROUND(((CAST (SUM(quantity) AS FLOAT) / (SELECT COUNT(DISTINCT(invoice_id)) FROM invoice_line))
*100) ,2)
AS 'percent_distribution (%)'
FROM (
SELECT *,
CASE
WHEN
( SELECT track_id FROM track t
WHERE t.album_id = (SELECT tr2.album_id FROM track tr2
WHERE tr2.track_id = imt.min_id_per_invoice)
EXCEPT
SELECT track_id FROM invoice_line il
WHERE il.invoice_id = imt.invoice_id
) IS NULL
AND
( SELECT track_id FROM invoice_line il
WHERE il.invoice_id = imt.invoice_id
EXCEPT
SELECT track_id FROM track t
WHERE t.album_id = (SELECT tr2.album_id FROM track tr2
WHERE tr2.track_id = imt.min_id_per_invoice)
)IS NULL
THEN 'album_purchased'
ELSE 'album_not_purchased'
END AS category
FROM inv_min_track_id imt
)
GROUP BY category;
Done.
total_invoices | category | percent_distribution (%) |
---|---|---|
500 | album_not_purchased | 81.43 |
114 | album_purchased | 18.57 |
18.57 %
of the invoices are where the customers have purchased the entire albums togther. Since it is a significant amount, my recommendation would be to** NOT use the new stratergy which includes purchasing only the most popular tracks from each album from record companies, instead of purchasing every track from an album.**
To go through the playsists and select the most played Artist, along with the number of playlists he/she/they feature in.
Steps for our analysis :
artist
table to the track
table via album
table.playlist
table to get the ' track_ids in each playlist.'%%sql
WITH track_artist AS
(
SELECT t.track_id, t.album_id,art.artist_id, art.name artist_name
FROM track t
INNER JOIN album alb ON alb.album_id = t.album_id
INNER JOIN artist art ON art.artist_id = alb.artist_id
),
final_table AS
(
SELECT ta.artist_name, COUNT(DISTINCT(p.playlist_id)) number_of_playlists
FROM track_artist ta
INNER JOIN playlist_track pt ON ta.track_id = pt.track_id
INNER JOIN playlist p ON pt.playlist_id = p.playlist_id
GROUP BY ta.artist_name
ORDER BY number_of_playlists DESC
)
SELECT * FROM final_table
LIMIT 5 ;
Done.
artist_name | number_of_playlists |
---|---|
Eugene Ormandy | 7 |
Academy of St. Martin in the Fields & Sir Neville Marriner | 6 |
Berliner Philharmoniker & Herbert Von Karajan | 6 |
English Concert & Trevor Pinnock | 6 |
The King's Singers | 6 |
According to our analysis the Most featured Artist in the Customer's plalist is :
Artist_name | number_of_playlists |
---|---|
Eugene Ormandy |
7 |
7
playlists with 5
artists featuring in 6
playlists.To see which tracks have been purchased by the customers.
Steps for our analysis :
available tracks
listed in the database using the tracks
table.sold tracks
from the invoice_line
table using by getting all the distinct track_ids mentioned in the table.unsold tracks
from the avaiable_tracks
, and categorize them using the CASE function as NO
under the tracks_purchased
column.tracks_purchased
we can then calculate the required aggregates to get the number_of_tracks
and their percentage_of_total_tracks.
%%sql
WITH
tracks_available AS
(
SELECT DISTINCT(track_id)
FROM track t
ORDER BY track_id
),
tracks_sold AS
(
SELECT DISTINCT(track_id)
FROM invoice_line il
ORDER BY track_id
),
unsold_tracks AS
(
SELECT DISTINCT(track_id) FROM tracks_available
EXCEPT
SELECT DISTINCT(track_id) FROM tracks_sold
)
SELECT
CASE
WHEN t.track_id IN (SELECT DISTINCT(track_id) FROM unsold_tracks)
THEN 'No'
ELSE 'Yes'
END AS 'tracks_purchased',
COUNT(track_id) number_of_tracks,
ROUND((CAST (COUNT(track_id) AS FLOAT) / CAST ((SELECT COUNT(track_id) FROM track) As FLOAT)
*100),3)
AS 'percentage_of_total_tracks (%)'
FROM track t
GROUP BY tracks_purchased
ORDER BY number_of_tracks DESC;
Done.
tracks_purchased | number_of_tracks | percentage_of_total_tracks (%) |
---|---|---|
Yes | 1806 | 51.556 |
No | 1697 | 48.444 |
According to our analysis we see that around 48.4 % (almost half the amount of total available tracks) of the entire tracks
library is unsold.
*Does Protected Media gets more Sales?
*
Do protected vs non-protected media types have an effect on popularity?
To see if the 'Media Type' has an effect on the sales of the tracks.
Steps for our analysis :
media_type
table into protected
and unprotected
using the ** LIKE- %** wildcards.track
table and then to the invoice_line
to categorize the sold tracks
by their media type.percentage_sales
.%%sql
WITH media_type_new AS
(
SELECT *,
CASE
WHEN md.name LIKE '%protected%'
THEN 'media_protected'
ELSE 'media_unprotected'
END AS media_type_col
FROM media_type md
)
SELECT COUNT(DISTINCT(il.track_id)) tracks_sold,
md.media_type_col media_type,
ROUND((CAST(COUNT(DISTINCT(il.track_id)) AS FLOAT) / CAST ((SELECT COUNT(DISTINCT(track_id))
FROM invoice_line) As FLOAT) *100),3)
AS 'percentage_sales (%)'
FROM invoice_line il
LEFT JOIN track t ON il.track_id = t.track_id
INNER JOIN media_type_new md ON t.media_type_id = md.media_type_id
GROUP BY md.media_type_col
ORDER BY tracks_sold DESC;
Done.
tracks_sold | media_type | percentage_sales (%) |
---|---|---|
1652 | media_unprotected | 91.473 |
154 | media_protected | 8.527 |
According to our analysis we see that around ** 91.4 % of the sold tracks
are unprotected, hence media_type seems to have very less effect on the sales of the tracks. **
*Does the available range of songs affect sales?
*
Is the range of tracks in the store reflective of their sales popularity?
To see if the Range of tracks has an effect on the sales/popularity of the tracks.
Steps for our analysis :
range of tracks
by itself has no mathematical significance, we can use the list of Genres
to assess the number of available songs in each genre compared to their sales.Genre
table to the track
table to get the genre of each of the available tracks.sold vs unsold
by each genre.%%sql
WITH
tracks_available AS
(
SELECT DISTINCT(track_id)
FROM track t
ORDER BY track_id
),
tracks_sold AS
(
SELECT DISTINCT(track_id)
FROM invoice_line il
ORDER BY track_id
),
unsold_tracks AS
(
SELECT DISTINCT(track_id) FROM tracks_available
EXCEPT
SELECT DISTINCT(track_id) FROM tracks_sold
),
labeled_tracks AS
(
SELECT *,
CASE
WHEN t.track_id IN (SELECT DISTINCT(track_id) FROM unsold_tracks)
THEN '0'
ELSE '1'
END AS 'tracks_purchased'
FROM track t
)
SELECT g.name genre_name,
COUNT(lt.track_id) number_of_tracks_available,
ROUND((CAST (COUNT(lt.track_id) AS FLOAT) / CAST ((SELECT COUNT(track_id) FROM track) As FLOAT)
*100),3)
AS 'tracks_available_pct',
SUM(tracks_purchased) number_of_tracks_sold,
ROUND((CAST (SUM(tracks_purchased) AS FLOAT) / CAST ((SELECT COUNT(DISTINCT(track_id))
FROM invoice_line) As FLOAT)*100),3)
AS 'sales_by_genre_pct'
FROM genre g
LEFT JOIN labeled_tracks lt ON g.genre_id = lt.genre_id
GROUP BY g.genre_id
ORDER BY sales_by_genre_pct DESC;
Done.
genre_name | number_of_tracks_available | tracks_available_pct | number_of_tracks_sold | sales_by_genre_pct |
---|---|---|---|---|
Rock | 1297 | 37.025 | 915 | 50.664 |
Metal | 374 | 10.677 | 238 | 13.178 |
Alternative & Punk | 332 | 9.478 | 176 | 9.745 |
Latin | 579 | 16.529 | 119 | 6.589 |
Jazz | 130 | 3.711 | 61 | 3.378 |
Blues | 81 | 2.312 | 56 | 3.101 |
R&B/Soul | 61 | 1.741 | 55 | 3.045 |
Alternative | 40 | 1.142 | 34 | 1.883 |
Electronica/Dance | 30 | 0.856 | 29 | 1.606 |
Pop | 48 | 1.37 | 25 | 1.384 |
Easy Listening | 24 | 0.685 | 24 | 1.329 |
Reggae | 58 | 1.656 | 22 | 1.218 |
Hip Hop/Rap | 35 | 0.999 | 21 | 1.163 |
Classical | 74 | 2.112 | 16 | 0.886 |
Heavy Metal | 28 | 0.799 | 7 | 0.388 |
Soundtrack | 43 | 1.228 | 5 | 0.277 |
TV Shows | 93 | 2.655 | 2 | 0.111 |
Drama | 64 | 1.827 | 1 | 0.055 |
Rock And Roll | 12 | 0.343 | 0 | 0.0 |
Bossa Nova | 15 | 0.428 | 0 | 0.0 |
World | 28 | 0.799 | 0 | 0.0 |
Science Fiction | 13 | 0.371 | 0 | 0.0 |
Sci Fi & Fantasy | 26 | 0.742 | 0 | 0.0 |
Comedy | 17 | 0.485 | 0 | 0.0 |
Opera | 1 | 0.029 | 0 | 0.0 |
To be clear, according to our chosen metric the relation : ' More the number of songs available, more should be the sales percentage of that genre ' , should hold true.
Observation :
For Rock
:
37 %
of the total number of available tracks.50 %
of the total sales.For Latin
:
16.5 %
of the total number of available tracks.6.5 %
of the total sales with only 119 tracks sold out of 579.For Easy Listening
:
0.68 %
of the total number of available tracks.1.3 %
of the total sales with 24 out of 24 tracks soldHence, The range of tracks listed in the store are not indicative of their sales popularity.
To answer all our business questions together :
BQ 1. Recommend the artists to be added to the store by genre? :
Red Tone
: Alternative & PunkJim Bites
: Blues Slim Jim BitesMeteor and the Girls
: PopBQ 2. Best performing sales employee :
Jane Peacock
: Avergae sale of 1.17 track per dayBQ 3. Sales metrics by country :
USA
: Highest number of sales - 1051
Czech Republic
: Highest average order value - USD 106.13
BQ 4. Sales of full album vs Individual tracks :
Full albums
: Account for 18 %
of the total sales Individual tracks
: Account for 81 %
of the total sales BQ 5. Artist featuring in Most playlists :
Eugene Ormandy
: 7 playlistsBQ 6. How many tracks have been purchased vs not purchased? :
Sold tracks
: With 1806 tracks, it accounts for 51.5%
of the total available tracks Unsold tracks
: With 1697 tracks, it accounts for 48.4 %
of the total available tracks BQ 7. Do protected vs non-protected media types have an effect on popularity? :
Unprotected media
: With 1652 sold tracks i.e 91.4 %
of the total sales BQ 8. Does the available range of songs affect sales?? :
Rock
having 37 %
of the total tracks and accounting for 50 %
of the sales VSLatin
having 16.5 %
of the total tracks but only 6.5 %
of the total sales,
The available range is not an indicator of sales.