Recently, I worked on a case study in which I needed to write accurate SQL queries that would answer practical business questions. I wanted to ensure that my work would be correct, but since I'm a student and don't have access to a database, I wrote up this notebook so I could generate data, add the data to a sqlite3
database, and practice queries on it. While working on this, I found Corey Schafer's SQLite tutorial video to be extremely useful.
Disclaimer: Please note that this data is all generated, and does not represent real business data from any firm I have interviewed with.
Now, on to the fun part!
Our two questions are as follows:
Given an “Orders” table with the columns provided, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.
Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer.
import sqlite3
import pandas as pd
import numpy.random as r
# Establish a sqlite connection (this creates orders.db if it doesn't already exist)
conn = sqlite3.connect('orders.db')
# Create a cursor, which we use to execute statements
c = conn.cursor()
# Outline our SQL statement that creates the table `orders`
create_statement = """CREATE TABLE orders (
order_item_id integer,
order_id integer,
order_date text,
product_id integer,
quantity integer,
unit_price real,
product_group varchar)"""
# If the orders table already exists, this try-except statement will drop it and create a new one
try:
c.execute(create_statement)
except sqlite3.OperationalError:
c.execute('DROP TABLE orders')
conn.commit()
c.execute(create_statement)
conn.commit()
# Create a test entry: order_1
order_1 = (881,
4,
"2018-10-20",
9,
2,
4.99,
"cleanser")
Best practice here is to use ?s when inserting arbitrary values; most other string formatting solutions are vulnerable to SQL injections.
c.execute('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)', order_1)
conn.commit()
Let's grab our one entry using a WHERE statement. We'll find it by date:
sql = """SELECT * FROM orders
WHERE order_date LIKE ?"""
args = ['2018' + "%"] # Read as: "begins with 2018"
c.execute(sql, args)
conn.commit()
print(c.fetchone())
(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')
numpy
's random
module doesn't include a function for generating dates, so I wrote one up quickly. It works by selecting only from days between 1 and 29, to avoid adding in rules for which days can be drawn based on the month. Year was restricted to 2015-2019 to favor the exercises.
def generate_dates(n):
"""
Generates n dates. Includes all months, but only includes days from 1-29
and years from 2016-2019.
"""
dates_list = []
for num in range(n):
dates_list.append('{}-{}-{}'.format(str(r.randint(2015, 2020)), # year
str(r.randint(1, 12)), # month
str(r.randint(1, 30)))) # day
return dates_list
# Example dates:
generate_dates(5)
['2015-8-29', '2018-1-9', '2015-5-19', '2017-7-7', '2016-1-7']
def generate_data(n=10, seed=10):
"""
Generates a pandas DataFrame with n (default 10) data points for this particular table.
Random number seed is set to 10 by default.
"""
r.seed(seed) # set seed; default 10
order_id = r.randint(1, 100, n) # Order ID
order_item_id = r.randint(100, 1000, n) # Order item ID
order_date = generate_dates(n) # # Order date
product_id = r.randint(1, 51, n) # Product ID
quantity = r.randint(1, 5, n) # Quantity
unit_price = r.choice([4.99, 7.99, 12.99, 19.99, 29.99], n) # Unit price
product_group = r.choice(['cleanser', 'moisturizer', 'toner', 'treatment'], n) # Product group
# Create dict
d = {'order_id': order_id,
'order_item_id': order_item_id,
'order_date': order_date,
'product_id': product_id,
'quantity': quantity,
'unit_price': unit_price,
'product_group': product_group}
# Create the DataFrame from the dict
df = pd.DataFrame(data = d)
return df
# Generate 500 data entries
fake_data = generate_data(n=500, seed=6)
fake_data.head()
order_id | order_item_id | order_date | product_id | quantity | unit_price | product_group | |
---|---|---|---|---|---|---|---|
0 | 11 | 398 | 2017-4-11 | 33 | 4 | 4.99 | cleanser |
1 | 74 | 386 | 2016-5-22 | 44 | 2 | 4.99 | toner |
2 | 85 | 656 | 2019-1-19 | 46 | 1 | 7.99 | treatment |
3 | 80 | 860 | 2015-1-26 | 40 | 3 | 7.99 | treatment |
4 | 81 | 261 | 2015-3-21 | 41 | 3 | 12.99 | toner |
# Add fake data to orders table
fake_data.to_sql('orders', conn, if_exists='append', index=False)
# First 10 entries of the table
sql = 'SELECT * FROM orders LIMIT 10'
c.execute(sql)
conn.commit()
for entry in c.fetchall():
print(entry)
(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser') (398, 11, '2017-4-11', 33, 4, 4.99, 'cleanser') (386, 74, '2016-5-22', 44, 2, 4.99, 'toner') (656, 85, '2019-1-19', 46, 1, 7.99, 'treatment') (860, 80, '2015-1-26', 40, 3, 7.99, 'treatment') (261, 81, '2015-3-21', 41, 3, 12.99, 'toner') (937, 63, '2017-3-26', 37, 4, 4.99, 'cleanser') (960, 26, '2017-7-10', 5, 1, 29.99, 'cleanser') (977, 2, '2015-4-24', 35, 4, 7.99, 'moisturizer') (291, 76, '2019-1-17', 23, 3, 12.99, 'moisturizer')
Given an “Orders” table with the above columns, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.
# Query needs to:
# Restrict data to July 2015 orders only
# Group by product group
# Sum the quantity ordered across all orders
# Sum the unit price * quantity across all orders
sql = """SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue
FROM orders
WHERE order_date LIKE ?
GROUP BY product_group"""
args = ['2015-7' + '%']
c.execute(sql, args)
for entry in c.fetchall():
print(entry)
('cleanser', 4, 44.96) ('moisturizer', 9, 193.90999999999997) ('toner', 14, 277.86) ('treatment', 12, 155.88)
The reason we need to use question marks is because Python has special instructions for when it sees a %
symbol (causing a syntax error when passing arguments directly with sqlite3
), so it's simpler to pass arguments separately rather than embedding them in the query.
With the date embedded into the query, it'd look like this:
SELECT
FROM orders
WHERE order_date LIKE 2015-7%
GROUP BY product_group```
# Use the pandas DataFrame to verify that the quantities match
july_2015 = fake_data.loc[fake_data['order_date'].str.contains('2015-7')]
july_2015.head(4)
order_id | order_item_id | order_date | product_id | quantity | unit_price | product_group | |
---|---|---|---|---|---|---|---|
20 | 32 | 637 | 2015-7-5 | 32 | 1 | 29.99 | cleanser |
72 | 70 | 680 | 2015-7-23 | 22 | 1 | 19.99 | moisturizer |
149 | 28 | 636 | 2015-7-16 | 44 | 1 | 4.99 | treatment |
184 | 89 | 254 | 2015-7-17 | 29 | 2 | 19.99 | toner |
# Sanity check
july_2015.groupby('product_group')['quantity'].sum()
product_group cleanser 4 moisturizer 9 toner 14 treatment 12 Name: quantity, dtype: int64
Using the “Orders” table, which products (identified by product_id
) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer.
sql = """SELECT product_id
FROM (SELECT product_id, COUNT(*) FROM orders
AS sold_2_2017
WHERE order_date LIKE ?
GROUP BY product_id
HAVING COUNT(*) >= 2)
WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE ?)"""
args = [('2017' + '%'), ('2018' + '%')]
c.execute(sql, args)
for entry in c.fetchall():
print(entry)
(1,) (2,) (4,) (5,) (7,) (12,) (15,) (16,) (21,) (22,) (25,) (26,) (27,) (30,) (31,) (32,) (33,) (38,) (41,) (42,) (44,) (45,) (46,) (47,) (48,) (50,)
To break down the above query: What we're doing is selecting products that were also sold in 2018 from a nested table of products that were sold twice in 2017.
Given the date structure "YYYY-MM-DD", it would look like this as a regular SQL query:
SELECT
FROM (SELECT product_id, COUNT(*) FROM orders
AS sold_2_2017
WHERE order_date LIKE 2017%
GROUP BY product_id
HAVING COUNT(*) >= 2)
WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE 2018%)```
# Close up the connection
c.close()