#!/usr/bin/env python # coding: utf-8 # # Answering Business Questions Using SQL # *** # ## Introduction # # The aim of this project is to explore a modified version of the [Chinook](https://github.com/lerocha/chinook-database) database using SQL and answer some business questions. The Chinook database represents a fictional digital media shop, based on real data from an iTunes Library and manually generated data. The database is provided as a SQLite database file called `chinook.db`. # # Here's a schema diagram for the Chinook database: # ![Chinook schema diagram](https://s3.amazonaws.com/dq-content/189/chinook-schema.svg) # ## Connecting to the Database and Data Overview # In[1]: get_ipython().run_cell_magic('capture', '', '%load_ext sql\n%sql sqlite:///chinook.db\n') # In[2]: get_ipython().run_cell_magic('sql', '', 'SELECT name, type\n FROM sqlite_master\n WHERE type IN ("table","view");\n') # The database consists of 11 tables containing information about artists, albums, media tracks, playlists, invoices, customers, and shop employees. # ## Selecting Albums to Purchase # We'll start by selecting the first three albums (from a list of four) of a new record label that will be added to the store. All 4 albums are by artists that currently don't have any tracks in the store. The available data is represented by the artist names, and the genre of music they produce: # # | Artist Name | Genre | # | -------------------- | ------- | # | Regal | Hip-Hop | # | Red Tone | Punk | # | Meteor and the Girls | Pop | # | Slim Jim Bites | Blues | # # The record label specializes in artists from the USA, so the new albums will be advertised in the USA. Hence we're interested in finding out which genres sell there the best (the most tracks). # # For this purpose, we'll write a query that returns TOP10 genres with the number and \% of tracks sold in the USA, and put them in descending order. The necessary information can be obtained from the `genre`, `track`, `invoice`, and `invoice_line` tables. Then, we'll display the output of this query graphically, using Python's pandas and matplotlib libraries. # In[3]: import pandas as pd import numpy as np import sqlite3 import matplotlib as mpl import matplotlib.pyplot as plt get_ipython().run_line_magic('matplotlib', 'inline') query = """ WITH total_count AS (SELECT COUNT(t.track_id) total_count FROM invoice i INNER JOIN invoice_line il ON il.invoice_id = i.invoice_id LEFT JOIN track t ON il.track_id = t.track_id INNER JOIN genre g ON t.genre_id = g.genre_id WHERE i.billing_country = "USA") SELECT g.name genre, COUNT(t.track_id) num_tracks, ROUND(COUNT(t.track_id) * 100.00/(SELECT total_count FROM total_count), 0) pct FROM invoice i INNER JOIN invoice_line il ON il.invoice_id = i.invoice_id LEFT JOIN track t ON il.track_id = t.track_id INNER JOIN genre g ON t.genre_id = g.genre_id WHERE i.billing_country = "USA" GROUP BY 1 ORDER BY 2 DESC LIMIT 10; """ # Reading SQL results into a pandas DataFrame conn = sqlite3.connect("chinook.db") df = pd.read_sql_query(query, conn) df # In[4]: # Resetting the index df = df.set_index("genre", drop=True) # Plotting sold tracks by genre in % fig, ax = plt.subplots(figsize=(15,11)) df["pct"].sort_values().plot.barh(color="slateblue", xlim=(0, 60), rot=0) ax.set_title("Sold Tracks by Genre (in %)", fontsize=55) ax.set_ylabel(None) ax.set_xlabel("%", fontsize=40) ax.tick_params(axis="both", labelsize=35, left = False) ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["left"].set_visible(False) # The _most popular genres in the USA_ are **Rock**, **Alternative & Punk**, and **Metal**, followed with a big gap by all the others. Since our choice is limited by Hip-Hop, Punk, Pop, and Blues genres, and since we have to choose 3 out of 4 albums, we can say that only one of the albums to be advertised is related to these popular genres (**Punk**). Then, we should select the album of **Blues** genre (on the 5th place by popularity) and the one of **Pop** (on the 7th place): # # | Artist Name | Genre | # | -------------------- | ------- | # | Red Tone | Punk | # | Slim Jim Bites | Blues | # | Meteor and the Girls | Pop | # # ## Analyzing Employee Sales Performance # Each customer in the Chinook store, when they first make a purchase, gets assigned to a sales support agent within the company. We're going to analyze the purchases of customers belonging to each sales support agent, in order to compare their sales performance. # # Let's find the total dollar amounts of sales assigned to each sales support agent, together with their country, birthdate, hire date, and number and % of customers. The tables to be used here are `employee`, `customer`, and `invoice`. Again, we'll represent this information also in graphical form, using a grouped bar chart. Since we're going to use this kind of charts further in the project, we'll create a function for it. # In[5]: query = """ WITH summary_agents AS (SELECT e.first_name || " " || e.last_name employee, e.country, e.birthdate, e.hire_date, COUNT(c.customer_id) customers, ROUND(SUM(i.total), 0) total_sales FROM employee e LEFT JOIN customer c ON e.employee_id = c.support_rep_id INNER JOIN invoice i ON c.customer_id = i.customer_id WHERE e.title = "Sales Support Agent" GROUP BY 1 ORDER BY 6 DESC) SELECT employee, country, birthdate, hire_date, customers, ROUND(customers * 100.0/(SELECT SUM(customers) FROM summary_agents), 0) pct_customers, total_sales, ROUND(total_sales * 100.0/(SELECT SUM(total_sales) FROM summary_agents), 0) pct_sales FROM summary_agents; """ # Reading SQL results into a pandas DataFrame conn = sqlite3.connect("chinook.db") df = pd.read_sql_query(query, conn).set_index("employee", drop=True) # Defining a function for creating grouped bar charts def plot_grouped_bar_chart(fig_width, fig_height, column1, column2, label1, label2, title, ylabel, ylim_lower, ylim_upper, loc, rotation, bbox_to_anchor): # Converting series to lists column1_list = column1.to_list() column2_list = column2.to_list() # Creating labels from the index labels = df.index.to_list() # Specifying the label locations and the width of the bars x = np.arange(len(labels)) width = 0.35 # Plotting the data fig, ax = plt.subplots(figsize=(10,6)) ax.bar(x - width/2, column1_list, width, label=label1) ax.bar(x + width/2, column2_list, width, label=label2) ax.set_title(title, fontsize=30) ax.set_ylabel(ylabel, fontsize=20) ax.set_ylim(ylim_lower, ylim_upper) ax.set_xticks(x) ax.set_xticklabels(labels, fontsize=17, rotation=rotation) ax.tick_params(bottom = False, labelsize=17) ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["bottom"].set_visible(False) ax.legend(loc=loc, bbox_to_anchor=bbox_to_anchor, fontsize=16, frameon=False) # Creating a grouped bar chart for % of customers and sales by employee plot_grouped_bar_chart(fig_width=10, fig_height=6, column1=df["pct_customers"], column2=df["pct_sales"], label1="Customers", label2="Sales", title="Employee Sales Performance", ylabel="%", ylim_lower=20, ylim_upper=40, loc=0, rotation=0, bbox_to_anchor=None) df # There are 3 sales support agents among the employees, they are all from Canada and were employed in 2017 (April, May and October). The youngest of them is Jane Peacock, the oldest - Margaret Park. # # On one hand, we see a direct correlation between the hire date of a sales support agent and the total amount of sales. However, we also see that the first 2 employees have only 1 month difference in their hire date; moreover, Margaret Park, being hired later, had 2 more customers. Despite that, they have 3% difference in their total amounts of sales, whereas Steve Johnson, being hired more than 6 months after Jane Peacock, has only 7% less sales than him, and 4% less customers. # # Given that, Jane Peacock demonstrates the best sales performance, while Margaret Park - the worst. # ## Analyzing Sales by Country # Let's now collate data on purchases for customers from different countries, including total number and \% of customers, total value and \% of sales, average value of sales per customer, average order value. Because in our database there are some countries with only one customer, we'll collect those countries into an "Other" group. The results will be sorted by the total sales in descending order, with the "Other" group at the bottom of the table, and then displayed as a grouped bar chart. All the data necessary for this analysis is contained in the tables `customer` and `invoice`. # In[6]: query = """ WITH sales_per_customer AS (SELECT c.country, c.customer_id, ROUND(SUM(i.total), 0) total, COUNT(i.invoice_id) num_sales FROM customer c INNER JOIN invoice i ON i.customer_id=c.customer_id GROUP BY 1, 2), all_countries AS (SELECT c.country, COUNT(c.customer_id) customers, ROUND(SUM(spc.total), 0) total_sales, ROUND(AVG(spc.num_sales), 1) sales_per_customer, ROUND(AVG(spc.total), 0) avg_order FROM customer c INNER JOIN sales_per_customer spc ON spc.customer_id = c.customer_id GROUP BY 1), only_1_customer AS (SELECT COUNT(country) country, COUNT(customers) customers, SUM(total_sales) total_sales, ROUND(AVG(sales_per_customer), 0) sales_per_customer, ROUND(AVG(avg_order), 0) avg_order FROM all_countries WHERE customers=1), other_countries AS (SELECT CASE WHEN country = (SELECT country FROM only_1_customer) THEN "Other" END AS country, customers, total_sales, sales_per_customer, avg_order FROM only_1_customer), more_than_1_customer AS (SELECT * FROM all_countries WHERE customers > 1 ORDER BY 1), united_unsorted AS (SELECT * FROM other_countries UNION SELECT * FROM more_than_1_customer) SELECT country, customers, ROUND(customers * 100.0/(SELECT SUM(customers) FROM united_unsorted), 0) pct_customers, total_sales, ROUND(total_sales * 100.0/(SELECT SUM(total_sales) FROM united_unsorted), 0) pct_sales, sales_per_customer, avg_order FROM (SELECT united_unsorted.*, CASE WHEN united_unsorted.country = "Other" THEN 1 ELSE 0 END AS sort FROM united_unsorted) ORDER BY sort ASC, 4 DESC; """ # Reading SQL results into a pandas DataFrame conn = sqlite3.connect("chinook.db") df = pd.read_sql_query(query, conn).set_index("country", drop=True) # Creating a grouped bar chart for % of customers and sales by country plot_grouped_bar_chart(fig_width=10, fig_height=7, column1=df["pct_customers"], column2=df["pct_sales"], label1="Customers", label2="Sales", title="Sales & Customers by Country", ylabel="%", ylim_lower=0, ylim_upper=30, loc=0, rotation=90, bbox_to_anchor=None) df # The sales by country are evidently dominated by the **USA**, in terms of both the total sales and the number of customers (if not to count the "Other" group). Besides, we can notice a strong positive correlation between the number of customers and the total sales. Interesting that Czech Republic, having only 2 customers, shows the highest average value of sales per customer and, consequently, the highest average order value. # ## Albums vs. Individual Tracks # In the Chinook store, customers can make purchases in one of the two ways: # - purchase a whole album, # - purchase a collection of one or more individual tracks. # # The store doesn't let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately. # # Management are currently considering changing their purchasing strategy so that to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album. Our task is to find out what percentage of purchases are individual tracks vs. whole albums. This data will further be used to understand the effect this new strategy might have on overall revenue. # # Here we have two edge cases to consider: # # - Albums that have only one or two tracks are likely to be purchased by customers as part of a collection of individual tracks. # - Customers may decide to manually select every track from an album, and then add a few individual tracks from other albums to their purchase. # # In the first case, since our analysis is concerned with maximizing revenue we can safely ignore albums consisting of only a few tracks. The company has previously done analysis to confirm that the second case doesn't happen often, so we can ignore this case also. # # In order to answer our question, we're going to identify whether each invoice has all the tracks from an album. We can do this by getting the list of tracks from an invoice and comparing it to the list of tracks from an album. We can find the album to compare the purchase to by looking up the album that one of the purchased tracks belongs to. It doesn't matter which track we pick, since if it's an album purchase, that album will be the same for all tracks, as long as we decided to ignore the second edge case. # # In this way, we'll categorize each invoice as a purchase of either a whole album or individual tracks, and calculate the summary statistics for both groups (number and \% of invoices). For this purpose, we'll use the following tables: `invoice_line`, `track`, and `album`. Next, we'll represent the results graphically. # In[7]: query = """ WITH invoice_album AS (SELECT il.invoice_id, t.album_id FROM invoice_line il LEFT JOIN track t ON t.track_id = il.track_id GROUP BY 1), album_vs_tracks AS (SELECT ia.*, CASE WHEN (SELECT il.track_id FROM invoice_line il WHERE il.invoice_id = ia.invoice_id EXCEPT SELECT t.track_id FROM track t WHERE t.album_id = ia.album_id) IS NULL AND (SELECT t.track_id FROM track t WHERE t.album_id = ia.album_id EXCEPT SELECT il.track_id FROM invoice_line il WHERE il.invoice_id = ia.invoice_id) IS NULL THEN "Albums" ELSE "Tracks" END AS purchase FROM invoice_album ia) SELECT purchase, COUNT(*) num_invoices, ROUND(COUNT(*) * 100.0 /(SELECT COUNT(*) FROM invoice), 0) pct FROM album_vs_tracks GROUP BY 1 ORDER BY 2 DESC; """ # Reading SQL results into a pandas DataFrame conn = sqlite3.connect("chinook.db") df = pd.read_sql_query(query, conn).set_index("purchase", drop=True) print("\n") print(df) print("\n") # Creating labels from the index labels = df.index.tolist() # Selecting data to plot sizes = df["pct"] # Plotting the data using a pie chart fig, ax = plt.subplots() ax.pie(sizes, autopct="%1.0f%%", startangle=90) ax.axis("equal") # Equal aspect to draw the pie as a circle ax.set_title("Track vs. Album \n Purchases", fontsize=27, y=0.96) ax.legend(fontsize=15, labels=labels, bbox_to_anchor=(0.773,0.325), frameon=False) mpl.rcParams["font.size"] = 18 plt.show() # Hence, customers tend to buy **individual tracks** 4 times more often than the whole albums. This means that the new strategy in consideration (purchasing only the most popular tracks from each album from record companies instead of purchasing the whole albums) will definitely help to maximize the overall revenue. # ## The Most Popular Artists # Next, we're going to find out which artists are used in the most playlists. This information can give us valuable insights on further purchasing strategies of the Chinook store. The tables to be used are `artist`, `album`, `track`, and `playlist_track`. # In[8]: get_ipython().run_cell_magic('sql', '', 'SELECT ar.name artist_name,\n COUNT(DISTINCT(pt.playlist_id)) num_playlists\n FROM artist ar\n INNER JOIN album al \n ON al.artist_id = ar.artist_id \n \n INNER JOIN track t \n ON t.album_id = al.album_id\n \n INNER JOIN playlist_track pt \n ON pt.track_id = t.track_id \n GROUP BY 1\n ORDER BY 2 DESC\n LIMIT 5;\n') # We got a table of the TOP5 artists, with [Eugene Ormandy](https://en.wikipedia.org/wiki/Eugene_Ormandy) at the first place (present in 7 playlists), and followed by 4 other artists (present in 6 playlists). Let's have a look to which genres the tracks of these artists are related. # In[9]: get_ipython().run_cell_magic('sql', '', 'SELECT DISTINCT(g.name) genre\n FROM genre g\n INNER JOIN track t \n ON t.genre_id = g.genre_id \n \n INNER JOIN album al \n ON al.album_id = t.album_id\n \n INNER JOIN artist ar \n ON ar.artist_id = al.artist_id \n WHERE ar.name = "Eugene Ormandy" \n OR ar.name = "The King\'s Singers"\n OR ar.name = "English Concert & Trevor Pinnock"\n OR ar.name = "Berliner Philharmoniker & Herbert Von Karajan" \n OR ar.name = "Academy of St. Martin in the Fields & Sir Neville Marriner";\n') # From one of our previous queries (calculating the number and \% of tracks of each genre sold in the USA), we remember that Classical genre, with all due respect, is far away from being the most sold one. Hence, in spite of these artists being present in the most playlists, a strategy for the Chinook store to buy more tracks of theirs _cannot be considered as potentially profitable_ and should be disregarded. # ## Purchased vs Not Purchased Tracks # Let's now calculate how many unique tracks were sold. # In[10]: get_ipython().run_cell_magic('sql', '', 'WITH all_vs_sold AS (SELECT COUNT(DISTINCT(t.track_id)) all_tracks, \n COUNT(DISTINCT(il.track_id)) sold_tracks\n FROM track t\n LEFT JOIN invoice_line il \n ON il.track_id = t.track_id)\n\nSELECT all_vs_sold.*,\n ROUND(sold_tracks * 100.0 / all_tracks, 0) pct_sold\n FROM all_vs_sold;\n') # Almost half of all the unique tracks available in the Chinook store were never bought, probably being of unpopular genre or unpopular artists. Possible solutions here may include a new _advertising campaign_ (especialy in the USA, as the biggest segment of market), _sales promotion_, additional _training and motivation_ of the sales support agents. # ## Popularity of Protected vs Non-Protected Media Types # If a track has a protected media type, it means that it has some associated rules for using its content. It can be encrypted using some form of digital rights management encryption. To play such a file, playback must occur inside the protected media path. In addition, the user might have to acquire rights to the content. # # Let's start by looking at the unique values of media types and whether there are any protected types among them. # In[11]: get_ipython().run_cell_magic('sql', '', 'SELECT DISTINCT(name) AS media_type\n FROM media_type;\n') # Two of the five media types are protected. # # Now, we're going to calculate the summary statistics for all protected vs. non-protected media types, and create some charts for them. # In[12]: query = """ WITH non_protected AS (SELECT COUNT(mt.name) to_be_renamed, COUNT(DISTINCT(t.track_id)) unique_tracks, COUNT(DISTINCT(il.track_id)) sold_unique, COUNT(il.track_id) sold FROM media_type mt INNER JOIN track t ON mt.media_type_id = t.media_type_id LEFT JOIN invoice_line il ON il.track_id = t.track_id WHERE mt.name != "Protected AAC audio file" AND mt.name != "Protected MPEG-4 video file"), protected AS (SELECT COUNT(mt.name) to_be_renamed, COUNT(DISTINCT(t.track_id)) unique_tracks, COUNT(DISTINCT(il.track_id)) sold_unique, COUNT(il.track_id) sold FROM media_type mt INNER JOIN track t ON mt.media_type_id = t.media_type_id LEFT JOIN invoice_line il ON il.track_id = t.track_id WHERE mt.name = "Protected AAC audio file" OR mt.name = "Protected MPEG-4 video file"), united AS (SELECT CASE WHEN to_be_renamed != 0 THEN "Non-protected" END AS media_type, unique_tracks, sold_unique, ROUND(sold_unique * 100.0 / unique_tracks, 0) pct_sold_unique, sold FROM non_protected UNION SELECT CASE WHEN to_be_renamed != 0 THEN "Protected" END AS media_type, unique_tracks, sold_unique, ROUND(sold_unique * 100.0 / unique_tracks, 0) pct_sold_unique, sold FROM protected) SELECT media_type, unique_tracks, ROUND(unique_tracks * 100.0 / (SELECT SUM(unique_tracks) FROM united), 0) pct_unique, sold_unique, pct_sold_unique, sold, ROUND(sold * 100.0 / (SELECT SUM(sold) FROM united), 0) pct_sold FROM united; """ # Reading SQL results into a pandas DataFrame conn = sqlite3.connect("chinook.db") df = pd.read_sql_query(query, conn).set_index("media_type", drop=True) # Printing the dataframe print("\n") print(df.iloc[:, :4]) print("\n") print(df.iloc[:, 4:]) print("\n") # Creating a grouped bar chart for protected vs. non-protected tracks plot_grouped_bar_chart(fig_width=10, fig_height=6, column1=df["pct_unique"], column2=df["pct_sold"], label1="Unique", label2="Sold", title="Protected vs. Non-Protected Tracks: \n Assortment & Purchases", ylabel="%", ylim_lower=0, ylim_upper=100, loc=0, rotation=0, bbox_to_anchor=None) plt.show() print("\n") # Creatting a new dataframe for pie charts df1 = df[["pct_sold_unique"]].copy() df1[["pct_not_sold_unique"]] = 100- df1[["pct_sold_unique"]] df1 = df1.transpose() # Creating labels labels = ["Sold", "Not sold"] # Selecting data to plot sizes_1 = df1["Non-protected"] sizes_2 = df1["Protected"] # Plotting the data using pie charts fig,ax = plt.subplots(figsize=(12,6)) fig.suptitle("Unique Tracks", fontsize=27, fontweight="bold", y=0.98) ax1 = fig.add_subplot(1, 2, 1) ax2 = fig.add_subplot(1, 2, 2) ax1.pie(sizes_1, autopct="%1.0f%%", startangle=90, colors=("limegreen", "orangered")) ax1.set_title("Non-protected", fontsize=27, y=0.92) ax2.pie(sizes_2, autopct="%1.0f%%", startangle=90, colors=("limegreen", "orangered")) ax2.set_title("Protected", fontsize=27, y=0.92) ax.axis("equal") # Equal aspect to draw the pies as a circle ax.set_xticklabels("", fontsize=0) ax.set_yticklabels("", fontsize=0) ax.tick_params(axis="both", left = False, bottom=False) for key,spine in ax.spines.items(): spine.set_visible(False) ax1.legend(fontsize=18, labels=labels, bbox_to_anchor=(0.85,0.25), frameon=False) plt.show() # From this table, we can make the following observations: # - Only 13\% of all the unique tracks available in the Chinook store are of protected media types. # - From all the unique tracks of protected media types, only 34\% were sold, while from those of non-protected ones - 54\%. # - Among all the tracks that were sold, those of protected media types amounts only to 9%. # # In a nutshell, the **tracks of protected media types are much less popular** than those of non-protected, both in terms of the store assortment and, especially, in terms of sales. It can be easily explained by a cumbersome way of their using, which is not really justified when it comes to buying media products for entertainment. Hence, as a future strategy, it is not recommended to the store to purchase such tracks from record companies. # ## Conclusion # In this project, we have analyzed a modified version of the Chinook database for finding the answers to our business questions. Below are the main results of this work (emphasized are the factors and suggestions for maximizing overall revenue): # - From the 4 albums in consideration, the following 3 ones should be selected for purchasing and advertising in the USA, based on the genre popularity rating in this country: **Red Tone** (Punk), **Slim Jim Bites** (Blues), **Meteor and the Girls** (Pop). # - Of the 3 sales support agents, **Jane Peacock** demonstrates the most effective sales performance, while Margaret Park - the least. The main contributing factor is the employed service; other factors are the number of clients and the employee's age. # - Of all the countries, the **USA** represents the major part of the market, in terms of both the total sales and the number of customers. # - Customers tend to buy **individual tracks** much more often than the whole albums. Hence, it's recommended to the Chinook store to purchase only the most popular tracks from each album from record companies. # - The tracks of the artists present in the most playlists are all of Classical genre, which is not popular among customers. Purchasing them cannot be considered as a potentially profitable startegy for the Chinook store. # - Almost half of all the unique tracks of the Chinook store were never bought. Possible solutions here may include a new **advertising campaign**, **sales promotion**, additional **training and motivation** of sales support agents. # - The tracks of protected media types are much less popular than those of **non-protected**, both in terms of the store assortment and, especially, in terms of sales. Hence the store should avoid purchasing such tracks.