Answering Business Questions using SQL¶

Introduction and Schema Diagram¶

In this project we're working with a modified version of a database called Chinook. The Chinook database contains information about a fictional digital music shop - kind of like a mini-iTunes store.
The Chinook database contains information about the artists, songs, and albums from the music shop, as well as information on the shop's employees, customers, and the customers purchases. This information is contained in eleven tables.
*We'll run queries and provide solutions to 4 Business Questions.*

Scenario 1 : Selecting Albums to Purchase¶

The Chinook record store has just signed a deal with a new record label, and we've been tasked with selecting the first three albums that will be added to the store, from a list of four.

All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:

Artist Name	Genre
Regal	Hip-Hop
Red Tone	Punk
Meteor and the Girls	Pop
Slim Jim Bites	Blues

Business Question¶

*The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.*

Creating Helper Functions¶

First, we'll import SQLite, pandas and matplotlib modules, and use the magic command %matplotlib inline to make sure any plots render in the notebook.
Next, we will create three helper functions:
- A run_query() function, that takes a SQL query as an argument and returns a pandas dataframe of that query.
- A run_command() function that takes a SQL command as an argument and executes it using the sqlite module.
- A show_tables() function that calls the run_query() function to return a list of all tables and views in the database.
Next, we will run the show_tables() function.

In [1]:

import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

def run_query(q):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(q, conn)

def run_command(c):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(c)
        
def show_tables():
    q = '''
    SELECT 
        name, 
        type
    FROM sqlite_master
    WHERE type IN ("table", "view");'''
    return run_query(q)

show_tables()    

Out[1]:

	name	type
0	album	table
1	artist	table
2	customer	table
3	employee	table
4	genre	table
5	invoice	table
6	invoice_line	table
7	media_type	table
8	playlist	table
9	playlist_track	table
10	track	table

Finding Best-Selling Genres¶

We will write a query that returns each genre, with the number of tracks sold in absolute numbers and in percentages.
And, then create a plot to show this data.

In [2]:

q1 = '''
WITH tracks_sold_usa AS 
    (
    SELECT il.*
    FROM invoice_line il
    INNER JOIN invoice i ON i.invoice_id = il.invoice_id
    INNER JOIN customer c ON c.customer_id = i.customer_id
    WHERE c.country = 'USA'
    )
    
SELECT 
    g.name,
    COUNT(tsu.invoice_line_id) 'tracks_sold',
    CAST(COUNT(tsu.invoice_line_id) AS FLOAT)/ (SELECT COUNT(*) FROM tracks_sold_usa) 'tracks_sold_perc'
FROM tracks_sold_usa tsu
INNER JOIN track t ON tsu.track_id = t.track_id 
INNER JOIN genre g ON g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''

Genres with Amount of Tracks Sold

In [3]:

run_query(q1)

Out[3]:

	name	tracks_sold	tracks_sold_perc
0	Rock	561	0.533777
1	Alternative & Punk	130	0.123692
2	Metal	124	0.117983
3	R&B/Soul	53	0.050428
4	Blues	36	0.034253
5	Alternative	35	0.033302
6	Latin	22	0.020932
7	Pop	22	0.020932
8	Hip Hop/Rap	20	0.019029
9	Jazz	14	0.013321

Plotting Genre vs Track Percentage

In [4]:

genre_albums_sold = run_query(q1)
genre_albums_sold.set_index("name", drop=True, inplace=True)

genre_albums_sold.plot.barh(
    title = "Top Selling Genres in USA",
    xlim = (0, 625),
    colormap = plt.cm.Accent
)

plt.ylabel('')

for key, value in enumerate(list(genre_albums_sold.index)):
    score = genre_albums_sold.loc[value, 'tracks_sold']
    label = (genre_albums_sold.loc[value, 'tracks_sold_perc'] * 100).astype(int).astype(str) + '%'
    plt.annotate(str(label), (score+10, key - 0.15))

plt.show()

*RECOMMENDATION:*

The recommended artists according to popularity of their genres (in Descreasing Order) are:

Red Stone (Punk) (12%)
Slim Jim Bites (Blues) (3%)
Meteor and the Girls (Pop) (2%)

However, the total sales of these genres combined is just 17%.
We should instead look out for artists in Rock genre, since it alone sells for 53%

Scenario 2: Analysing Employee Sales Performance¶

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase.

Business Question¶

*We want to analyze Employee Sales Performance.
For this we will have to analyze the purchases of customers belonging to each employee.*

Finding Top Sales Employees¶

We will write a query that finds the total dollar amount of sales assigned to each sales support agent within the company.
We would need to add some extra attributes for the employees (e.g Date of joining, etc). Since they are relevant to the analysis.

In [5]:

q2 = '''
SELECT 
    e.first_name || " " || e.last_name 'Employee Name',
    e.hire_date 'Hiring Date',
    SUM(total) 'Sales'
FROM invoice i 
INNER JOIN customer c ON c.customer_id = i.customer_id
INNER JOIN employee e ON e.employee_id = c.support_rep_id
GROUP BY c.support_rep_id
'''

Employees with Hiring Dates and Sales

In [6]:

run_query(q2)

Out[6]:

	Employee Name	Hiring Date	Sales
0	Jane Peacock	2017-04-01 00:00:00	1731.51
1	Margaret Park	2017-05-03 00:00:00	1584.00
2	Steve Johnson	2017-10-17 00:00:00	1393.92

Plotting Employees vs Sales Data

In [7]:

employee_sales = run_query(q2)
employee_sales.set_index("Employee Name", drop=True, inplace=True)
employee_sales.plot.barh(
    colormap = plt.cm.Accent,
    xlim = (0, 2100)
)

for key, value in enumerate(list(employee_sales.index)):
    plt.annotate(str(employee_sales.loc[value, 'Sales'].round(2)), (employee_sales.loc[value,'Sales'] +50, key))

plt.ylabel('')
plt.show()

*RECOMMENDATION:*

Here Mr. Jean Peacock has a margin of around 20% compared to the lowest performer (Mr. Steve Johnson).
However, we may also note that the difference in Hiring Date corresponds to the Sales difference.

Scenario 3: Analysing Sales By Country¶

Business Question¶

*We want to analyze Sales By country. In particular, we will calculate:*

*total number of customers*
*total value of sales*
*average value of sales per customer*
*average order value*

Table - Total No. of Customers, Sales, Average Sales, Average Orders Per Country¶

We will write a query that collates data on purchases from different countries
Whenever a country has only one customer, we'll collect them into an "Other" group.
The results will be sorted by the total sales from highest to lowest, with the "Other" group at the very bottom.

In [8]:

q3 = '''
WITH country_or_other AS
    (
     SELECT 
         CASE
             WHEN
             (SELECT 
                 COUNT(*)
             FROM customer
             WHERE country = c.country
             ) = 1 THEN 'Other'
             ELSE c.country
         END AS country,
         c.customer_id,
         il.*
     FROM invoice_line il
     INNER JOIN invoice i ON i.invoice_id = il.invoice_id
     INNER JOIN customer c ON c.customer_id = i.customer_id
    )
    
SELECT
    Country,
    Customers,
    Total_sales,
    Average_customer_life,
    Average_order
FROM 
    (
     SELECT 
         country 'Country',
         COUNT(DISTINCT customer_id) 'Customers',
         SUM(unit_price) 'Total_sales',
         SUM(unit_price) / COUNT(DISTINCT customer_id) 'Average_customer_life',
         SUM(unit_price) / COUNT(DISTINCT invoice_id) 'Average_order',
         CASE 
             WHEN country = 'Other' THEN 1
             ELSE 0
         END AS sort
     FROM country_or_other
     GROUP BY country
     ORDER BY sort ASC, total_sales DESC
    );
'''

run_query(q3)

Out[8]:

	Country	Customers	Total_sales	Average_customer_life	Average_order
0	USA	13	1040.49	80.037692	7.942672
1	Canada	8	535.59	66.948750	7.047237
2	Brazil	5	427.68	85.536000	7.011148
3	France	5	389.07	77.814000	7.781400
4	Germany	4	334.62	83.655000	8.161463
5	Czech Republic	2	273.24	136.620000	9.108000
6	United Kingdom	3	245.52	81.840000	8.768571
7	Portugal	2	185.13	92.565000	6.383793
8	India	2	183.15	91.575000	8.721429
9	Other	15	1094.94	72.996000	7.448571

Plotting Sales By Country¶

For each dimension, we will create a visualization which demonstrates the data collated in previous step.

In [22]:

from numpy import arange
import matplotlib

# Required Dataframe
country_sales = run_query(q3) 
country_sales.set_index('Country', inplace=True, drop=True)

# Cols is array containing names of columns to be plotted
cols = list(country_sales.columns)

# Required Figure Container
fig = plt.figure(figsize = (18, 12))
fig.subplots_adjust(hspace=.5, wspace=.3)

# Bar Positions (Left) for Bar Plots
positions = arange(country_sales.shape[0])

# Colors Array for Bar Plots
colors = matplotlib.cm.get_cmap('viridis')
arr1 = np.append(arange(0, 0.9, 0.1), [0.99])
rgba = list(colors(i) for i in arr1)

#################################
#Top Left (Total_sales)(PieChart)
ax1 = fig.add_subplot(2, 2, 1)

# Column to be plotted copied from Dataframe
customers_in_country = country_sales[cols[1]].copy().rename('')

# Explode is used for Removing a part of Pie a littel bit to highlight 
explode = list(0.1 if i == 'Other' else 0 for i in customers_in_country.index)

# Plotting Data
ax1 = customers_in_country.plot(
    kind='pie',
    colormap = plt.cm.viridis,
    startangle = +90,
    explode = explode,
    title = 'Total Sales per Country'
    )
ax1.set_ylabel('')

#########################################################
# Top Right (Pct Sales vs Pct Customers) (Grouped Bar Plot)
ax2 = fig.add_subplot(2, 2, 2)

# Columns to be Plotted
cust_vs_sales_cols = cols[:2]
cust_vs_sales = country_sales[cust_vs_sales_cols].copy()
cust_vs_sales.index.name = ''

for i in cust_vs_sales_cols:
    cust_vs_sales[i] /= cust_vs_sales[i].sum() / 100
        
# Plotting Data
ax2.bar(
    [x - 0.15 for x in positions],
    cust_vs_sales['Customers'],
    0.4,
#     color = "#FFB7A2"
    color='#355F8D'

)
ax2.bar(
    [0.25 + x for x in positions],
    cust_vs_sales['Total_sales'],
    0.4,
#     color = '#96BBE6'
    color = '#FDE736'
)

ax2.set_title('$\%$ Sales vs. $\%$ Customers')
ax2.legend(list(cust_vs_sales.columns), loc='upper center')

###########################################################
# Bottom Left (Average Order (Pct Diff From Mean)) (Bar Plot)
ax3 = fig.add_subplot(2, 2, 3)
avg_cust_order = country_sales['Average_order'].mean()
diff_from_avg = country_sales['Average_order'] - avg_cust_order

ax3.bar(
    positions,
    diff_from_avg,
    color = rgba
)
ax3.axhline(0, c=(0/255, 0/255, 0/255), alpha=0.2)
ax3.set_title('Average Order\n($\%$ Diff From Mean)')

# ##########################################################
# Bottom Right(Customer Lifetime Value (Dollars)) (Bar Plot)
ax4 = fig.add_subplot(2, 2, 4)
ax4.bar(
    positions,
    country_sales['Average_customer_life'],
    color = rgba
)
ax4.set_title('Customer Lifetime Value (Dollars)')

##########################################################
# Improving Plot Aesthetics
axes_objects = [ax2, ax3, ax4]
for i in axes_objects:
    i.tick_params(top=False, right=False, left=False, bottom=False)
    i.set_xticks(arange(cust_vs_sales.shape[0]))
    i.set_xticklabels(country_sales.index, rotation = 70)
    for k, v in i.spines.items():
        if (k == 'right' or k == 'top'):
            v.set_visible(False)
            
plt.savefig('Comparison.png')
plt.show()

RECOMMENDATION¶

*Based on the data, there may be opportunity in the following countries:*

*Czech Republic

United Kingdom
India
*

*It's worth keeping in mind that because the amount of data from each of these countries is relatively low. Because of this, we should be cautious spending too much money on new marketing campaigns, as the sample size is not large enough to give us high confidence. A better approach would be to run small campaigns in these countries, collecting and analyzing the new customers to make sure that these trends hold with new customers.*

Scenario 4: Analyze purchases of Albums vs Individual Tracks¶

In the Chinook store, customers are not allowed to purchase a whole album, and then add individual tracks to that same purchase. When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.

Management is considering a change in the purchasing strategy to save money. The new strategy is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

Business Question¶

*We want to analyze what percentage of purchases are individual tracks vs whole albums. It will help management to understand the effect this decision might have on overall revenue.*

Finding Number and Percentage of invoices in Categories (Album vs Tracks)¶

We will write two queries:
- First, we will categorize each invoice as either an album purchase or not.
- Second, we will calculate the following summary statistics:
  - Number of invoices
  - Percentage of invoices

In [14]:

q4 = '''
WITH invoice_first_track AS 
    (
    SELECT 
        invoice_id,
        MIN(track_id) 'first_track_id'
    FROM invoice_line
    GROUP BY 1
    )
    
SELECT
        ifs.*,
        CASE
            WHEN
                 (
                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 

                  EXCEPT 

                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id
                 ) IS NULL
             AND
                 (
                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id

                  EXCEPT 

                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 
                 ) IS NULL
             THEN "yes"
             ELSE "no"
         END AS "album_purchase"
     FROM invoice_first_track ifs
'''

run_query(q4)

Out[14]:

	invoice_id	first_track_id	album_purchase
0	1	1158	yes
1	2	201	no
2	3	2516	no
3	4	748	no
4	5	1986	yes
5	6	30	no
6	7	42	no
7	8	81	no
8	9	196	no
9	10	2663	no
10	11	610	no
11	12	92	no
12	13	2553	no
13	14	541	no
14	15	807	no
15	16	16	no
16	17	55	no
17	18	1027	no
18	19	105	no
19	20	60	no
20	21	13	no
21	22	383	no
22	23	1	yes
23	24	1146	yes
24	25	32	no
25	26	85	no
26	27	1280	no
27	28	36	no
28	29	479	no
29	30	33	no
...	...	...	...
584	585	49	no
585	586	20	no
586	587	94	no
587	588	18	no
588	589	2271	yes
589	590	1362	yes
590	591	162	no
591	592	465	no
592	593	57	no
593	594	476	no
594	595	46	no
595	596	12	no
596	597	46	no
597	598	1000	no
598	599	19	no
599	600	1006	no
600	601	14	no
601	602	164	no
602	603	481	no
603	604	1755	yes
604	605	1128	no
605	606	2003	yes
606	607	30	no
607	608	3060	no
608	609	1636	no
609	610	814	no
610	611	57	no
611	612	2204	yes
612	613	1126	no
613	614	2650	no

614 rows × 3 columns

*Statistics based on above data*

In [15]:

q6 = '''
WITH invoice_first_track AS
    (
    SELECT 
        invoice_id,
        MIN(track_id) 'first_track_id' 
    FROM invoice_line
    GROUP BY 1
    )

SELECT 
    album_purchase,
    COUNT(invoice_id) number_of_invoices,
    CAST(COUNT(invoice_id) AS FLOAT)*100 / (SELECT COUNT(*) FROM invoice) 'percent'
FROM 
    (
    SELECT 
        ifs.*,
        CASE
            WHEN
                (
                SELECT t1.track_id from track t1
                WHERE t1.album_id = (
                                    SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ifs.first_track_id
                                    )
                EXCEPT
                
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ifs.invoice_id
                ) IS NULL
            AND
                (
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ifs.invoice_id
                
                EXCEPT
                
                SELECT t1.track_id from track t1
                WHERE t1.album_id = (
                                    SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ifs.first_track_id
                                    )
                ) IS NULL            
            THEN 'yes'
            ELSE 'no'
        END AS 'album_purchase'
    FROM invoice_first_track ifs
    )
GROUP BY 1
'''

run_query(q6)

Out[15]:

	album_purchase	number_of_invoices	percent
0	no	500	81.433225
1	yes	114	18.566775

RECOMMENDATION¶

*Here, we can we observe that there are around 18.5% purchases are being done as complete albums.

So we may recommend against going for only individual tracks. Since, around 1/5th of the revenue is being generated by Albums. However, more analysis needs to be done in order to provide a sound recommendation.*