In this project we're working with a modified version of a database called Chinook
. The Chinook database
contains information about a fictional digital music shop - kind of like a mini-iTunes store.
The Chinook database
contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees
, customers
, and the customers purchases
. This information is contained in eleven tables.
*We'll run queries and provide solutions to 4 Business Questions.*
The Chinook record store has just signed a deal with a new record label, and we've been tasked with selecting the first three albums that will be added to the store, from a list of four.
All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:
Artist Name | Genre |
---|---|
Regal | Hip-Hop |
Red Tone | Punk |
Meteor and the Girls | Pop |
Slim Jim Bites | Blues |
*The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.*
SQLite
, pandas
and matplotlib
modules, and use the magic command %matplotlib inline
to make sure any plots render in the notebook.run_query()
function, that takes a SQL query as an argument and returns a pandas dataframe of that query.run_command()
function that takes a SQL command as an argument and executes it using the sqlite module.show_tables()
function that calls the run_query() function to return a list of all tables and views in the database.show_tables()
function.import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
def run_query(q):
with sqlite3.connect('chinook.db') as conn:
return pd.read_sql(q, conn)
def run_command(c):
with sqlite3.connect('chinook.db') as conn:
conn.isolation_level = None
conn.execute(c)
def show_tables():
q = '''
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table", "view");'''
return run_query(q)
show_tables()
name | type | |
---|---|---|
0 | album | table |
1 | artist | table |
2 | customer | table |
3 | employee | table |
4 | genre | table |
5 | invoice | table |
6 | invoice_line | table |
7 | media_type | table |
8 | playlist | table |
9 | playlist_track | table |
10 | track | table |
q1 = '''
WITH tracks_sold_usa AS
(
SELECT il.*
FROM invoice_line il
INNER JOIN invoice i ON i.invoice_id = il.invoice_id
INNER JOIN customer c ON c.customer_id = i.customer_id
WHERE c.country = 'USA'
)
SELECT
g.name,
COUNT(tsu.invoice_line_id) 'tracks_sold',
CAST(COUNT(tsu.invoice_line_id) AS FLOAT)/ (SELECT COUNT(*) FROM tracks_sold_usa) 'tracks_sold_perc'
FROM tracks_sold_usa tsu
INNER JOIN track t ON tsu.track_id = t.track_id
INNER JOIN genre g ON g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''
Genres with Amount of Tracks Sold
run_query(q1)
name | tracks_sold | tracks_sold_perc | |
---|---|---|---|
0 | Rock | 561 | 0.533777 |
1 | Alternative & Punk | 130 | 0.123692 |
2 | Metal | 124 | 0.117983 |
3 | R&B/Soul | 53 | 0.050428 |
4 | Blues | 36 | 0.034253 |
5 | Alternative | 35 | 0.033302 |
6 | Latin | 22 | 0.020932 |
7 | Pop | 22 | 0.020932 |
8 | Hip Hop/Rap | 20 | 0.019029 |
9 | Jazz | 14 | 0.013321 |
Plotting Genre vs Track Percentage
genre_albums_sold = run_query(q1)
genre_albums_sold.set_index("name", drop=True, inplace=True)
genre_albums_sold.plot.barh(
title = "Top Selling Genres in USA",
xlim = (0, 625),
colormap = plt.cm.Accent
)
plt.ylabel('')
for key, value in enumerate(list(genre_albums_sold.index)):
score = genre_albums_sold.loc[value, 'tracks_sold']
label = (genre_albums_sold.loc[value, 'tracks_sold_perc'] * 100).astype(int).astype(str) + '%'
plt.annotate(str(label), (score+10, key - 0.15))
plt.show()
*RECOMMENDATION:*
The recommended artists according to popularity of their genres (in Descreasing Order) are:
However, the total sales of these genres combined is just 17%.
We should instead look out for artists in Rock genre, since it alone sells for 53%
Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase.
*We want to analyze Employee Sales Performance.
For this we will have to analyze the purchases of customers belonging to each employee.*
q2 = '''
SELECT
e.first_name || " " || e.last_name 'Employee Name',
e.hire_date 'Hiring Date',
SUM(total) 'Sales'
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
INNER JOIN employee e ON e.employee_id = c.support_rep_id
GROUP BY c.support_rep_id
'''
Employees with Hiring Dates and Sales
run_query(q2)
Employee Name | Hiring Date | Sales | |
---|---|---|---|
0 | Jane Peacock | 2017-04-01 00:00:00 | 1731.51 |
1 | Margaret Park | 2017-05-03 00:00:00 | 1584.00 |
2 | Steve Johnson | 2017-10-17 00:00:00 | 1393.92 |
Plotting Employees vs Sales Data
employee_sales = run_query(q2)
employee_sales.set_index("Employee Name", drop=True, inplace=True)
employee_sales.plot.barh(
colormap = plt.cm.Accent,
xlim = (0, 2100)
)
for key, value in enumerate(list(employee_sales.index)):
plt.annotate(str(employee_sales.loc[value, 'Sales'].round(2)), (employee_sales.loc[value,'Sales'] +50, key))
plt.ylabel('')
plt.show()
*RECOMMENDATION:*
Here Mr. Jean Peacock has a margin of around 20% compared to the lowest performer (Mr. Steve Johnson).
However, we may also note that the difference in Hiring Date corresponds to the Sales difference.
q3 = '''
WITH country_or_other AS
(
SELECT
CASE
WHEN
(SELECT
COUNT(*)
FROM customer
WHERE country = c.country
) = 1 THEN 'Other'
ELSE c.country
END AS country,
c.customer_id,
il.*
FROM invoice_line il
INNER JOIN invoice i ON i.invoice_id = il.invoice_id
INNER JOIN customer c ON c.customer_id = i.customer_id
)
SELECT
Country,
Customers,
Total_sales,
Average_customer_life,
Average_order
FROM
(
SELECT
country 'Country',
COUNT(DISTINCT customer_id) 'Customers',
SUM(unit_price) 'Total_sales',
SUM(unit_price) / COUNT(DISTINCT customer_id) 'Average_customer_life',
SUM(unit_price) / COUNT(DISTINCT invoice_id) 'Average_order',
CASE
WHEN country = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_or_other
GROUP BY country
ORDER BY sort ASC, total_sales DESC
);
'''
run_query(q3)
Country | Customers | Total_sales | Average_customer_life | Average_order | |
---|---|---|---|---|---|
0 | USA | 13 | 1040.49 | 80.037692 | 7.942672 |
1 | Canada | 8 | 535.59 | 66.948750 | 7.047237 |
2 | Brazil | 5 | 427.68 | 85.536000 | 7.011148 |
3 | France | 5 | 389.07 | 77.814000 | 7.781400 |
4 | Germany | 4 | 334.62 | 83.655000 | 8.161463 |
5 | Czech Republic | 2 | 273.24 | 136.620000 | 9.108000 |
6 | United Kingdom | 3 | 245.52 | 81.840000 | 8.768571 |
7 | Portugal | 2 | 185.13 | 92.565000 | 6.383793 |
8 | India | 2 | 183.15 | 91.575000 | 8.721429 |
9 | Other | 15 | 1094.94 | 72.996000 | 7.448571 |
from numpy import arange
import matplotlib
# Required Dataframe
country_sales = run_query(q3)
country_sales.set_index('Country', inplace=True, drop=True)
# Cols is array containing names of columns to be plotted
cols = list(country_sales.columns)
# Required Figure Container
fig = plt.figure(figsize = (18, 12))
fig.subplots_adjust(hspace=.5, wspace=.3)
# Bar Positions (Left) for Bar Plots
positions = arange(country_sales.shape[0])
# Colors Array for Bar Plots
colors = matplotlib.cm.get_cmap('viridis')
arr1 = np.append(arange(0, 0.9, 0.1), [0.99])
rgba = list(colors(i) for i in arr1)
#################################
#Top Left (Total_sales)(PieChart)
ax1 = fig.add_subplot(2, 2, 1)
# Column to be plotted copied from Dataframe
customers_in_country = country_sales[cols[1]].copy().rename('')
# Explode is used for Removing a part of Pie a littel bit to highlight
explode = list(0.1 if i == 'Other' else 0 for i in customers_in_country.index)
# Plotting Data
ax1 = customers_in_country.plot(
kind='pie',
colormap = plt.cm.viridis,
startangle = +90,
explode = explode,
title = 'Total Sales per Country'
)
ax1.set_ylabel('')
#########################################################
# Top Right (Pct Sales vs Pct Customers) (Grouped Bar Plot)
ax2 = fig.add_subplot(2, 2, 2)
# Columns to be Plotted
cust_vs_sales_cols = cols[:2]
cust_vs_sales = country_sales[cust_vs_sales_cols].copy()
cust_vs_sales.index.name = ''
for i in cust_vs_sales_cols:
cust_vs_sales[i] /= cust_vs_sales[i].sum() / 100
# Plotting Data
ax2.bar(
[x - 0.15 for x in positions],
cust_vs_sales['Customers'],
0.4,
# color = "#FFB7A2"
color='#355F8D'
)
ax2.bar(
[0.25 + x for x in positions],
cust_vs_sales['Total_sales'],
0.4,
# color = '#96BBE6'
color = '#FDE736'
)
ax2.set_title('$\%$ Sales vs. $\%$ Customers')
ax2.legend(list(cust_vs_sales.columns), loc='upper center')
###########################################################
# Bottom Left (Average Order (Pct Diff From Mean)) (Bar Plot)
ax3 = fig.add_subplot(2, 2, 3)
avg_cust_order = country_sales['Average_order'].mean()
diff_from_avg = country_sales['Average_order'] - avg_cust_order
ax3.bar(
positions,
diff_from_avg,
color = rgba
)
ax3.axhline(0, c=(0/255, 0/255, 0/255), alpha=0.2)
ax3.set_title('Average Order\n($\%$ Diff From Mean)')
# ##########################################################
# Bottom Right(Customer Lifetime Value (Dollars)) (Bar Plot)
ax4 = fig.add_subplot(2, 2, 4)
ax4.bar(
positions,
country_sales['Average_customer_life'],
color = rgba
)
ax4.set_title('Customer Lifetime Value (Dollars)')
##########################################################
# Improving Plot Aesthetics
axes_objects = [ax2, ax3, ax4]
for i in axes_objects:
i.tick_params(top=False, right=False, left=False, bottom=False)
i.set_xticks(arange(cust_vs_sales.shape[0]))
i.set_xticklabels(country_sales.index, rotation = 70)
for k, v in i.spines.items():
if (k == 'right' or k == 'top'):
v.set_visible(False)
plt.savefig('Comparison.png')
plt.show()
*Based on the data, there may be opportunity in the following countries:*
*Czech Republic
United Kingdom
India
*
*It's worth keeping in mind that because the amount of data from each of these countries is relatively low. Because of this, we should be cautious spending too much money on new marketing campaigns, as the sample size is not large enough to give us high confidence. A better approach would be to run small campaigns in these countries, collecting and analyzing the new customers to make sure that these trends hold with new customers.*
In the Chinook store
, customers are not allowed to purchase a whole album, and then add individual tracks to that same purchase. When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.
Management is considering a change in the purchasing strategy to save money. The new strategy is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.
*We want to analyze what percentage of purchases are individual tracks vs whole albums. It will help management to understand the effect this decision might have on overall revenue.*
q4 = '''
WITH invoice_first_track AS
(
SELECT
invoice_id,
MIN(track_id) 'first_track_id'
FROM invoice_line
GROUP BY 1
)
SELECT
ifs.*,
CASE
WHEN
(
SELECT t.track_id FROM track t
WHERE t.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
EXCEPT
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
) IS NULL
AND
(
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
EXCEPT
SELECT t.track_id FROM track t
WHERE t.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
) IS NULL
THEN "yes"
ELSE "no"
END AS "album_purchase"
FROM invoice_first_track ifs
'''
run_query(q4)
invoice_id | first_track_id | album_purchase | |
---|---|---|---|
0 | 1 | 1158 | yes |
1 | 2 | 201 | no |
2 | 3 | 2516 | no |
3 | 4 | 748 | no |
4 | 5 | 1986 | yes |
5 | 6 | 30 | no |
6 | 7 | 42 | no |
7 | 8 | 81 | no |
8 | 9 | 196 | no |
9 | 10 | 2663 | no |
10 | 11 | 610 | no |
11 | 12 | 92 | no |
12 | 13 | 2553 | no |
13 | 14 | 541 | no |
14 | 15 | 807 | no |
15 | 16 | 16 | no |
16 | 17 | 55 | no |
17 | 18 | 1027 | no |
18 | 19 | 105 | no |
19 | 20 | 60 | no |
20 | 21 | 13 | no |
21 | 22 | 383 | no |
22 | 23 | 1 | yes |
23 | 24 | 1146 | yes |
24 | 25 | 32 | no |
25 | 26 | 85 | no |
26 | 27 | 1280 | no |
27 | 28 | 36 | no |
28 | 29 | 479 | no |
29 | 30 | 33 | no |
... | ... | ... | ... |
584 | 585 | 49 | no |
585 | 586 | 20 | no |
586 | 587 | 94 | no |
587 | 588 | 18 | no |
588 | 589 | 2271 | yes |
589 | 590 | 1362 | yes |
590 | 591 | 162 | no |
591 | 592 | 465 | no |
592 | 593 | 57 | no |
593 | 594 | 476 | no |
594 | 595 | 46 | no |
595 | 596 | 12 | no |
596 | 597 | 46 | no |
597 | 598 | 1000 | no |
598 | 599 | 19 | no |
599 | 600 | 1006 | no |
600 | 601 | 14 | no |
601 | 602 | 164 | no |
602 | 603 | 481 | no |
603 | 604 | 1755 | yes |
604 | 605 | 1128 | no |
605 | 606 | 2003 | yes |
606 | 607 | 30 | no |
607 | 608 | 3060 | no |
608 | 609 | 1636 | no |
609 | 610 | 814 | no |
610 | 611 | 57 | no |
611 | 612 | 2204 | yes |
612 | 613 | 1126 | no |
613 | 614 | 2650 | no |
614 rows × 3 columns
*Statistics based on above data*
q6 = '''
WITH invoice_first_track AS
(
SELECT
invoice_id,
MIN(track_id) 'first_track_id'
FROM invoice_line
GROUP BY 1
)
SELECT
album_purchase,
COUNT(invoice_id) number_of_invoices,
CAST(COUNT(invoice_id) AS FLOAT)*100 / (SELECT COUNT(*) FROM invoice) 'percent'
FROM
(
SELECT
ifs.*,
CASE
WHEN
(
SELECT t1.track_id from track t1
WHERE t1.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
EXCEPT
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
) IS NULL
AND
(
SELECT il2.track_id FROM invoice_line il2
WHERE il2.invoice_id = ifs.invoice_id
EXCEPT
SELECT t1.track_id from track t1
WHERE t1.album_id = (
SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id
)
) IS NULL
THEN 'yes'
ELSE 'no'
END AS 'album_purchase'
FROM invoice_first_track ifs
)
GROUP BY 1
'''
run_query(q6)
album_purchase | number_of_invoices | percent | |
---|---|---|---|
0 | no | 500 | 81.433225 |
1 | yes | 114 | 18.566775 |
*Here, we can we observe that there are around 18.5% purchases are being done as complete albums.
So we may recommend against going for only individual tracks. Since, around 1/5th of the revenue is being generated by Albums. However, more analysis needs to be done in order to provide a sound recommendation.*