Practicing SQL Queries using sqlite3 in Python¶

Recently, I worked on a case study in which I needed to write accurate SQL queries that would answer practical business questions. I wanted to ensure that my work would be correct, but since I'm a student and don't have access to a database, I wrote up this notebook so I could generate data, add the data to a sqlite3 database, and practice queries on it. While working on this, I found Corey Schafer's SQLite tutorial video to be extremely useful.

Disclaimer: Please note that this data is all generated, and does not represent real business data from any firm I have interviewed with.

Now, on to the fun part!

Our two questions are as follows:

Given an “Orders” table with the columns provided, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.
Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer.

In [1]:

import sqlite3
import pandas as pd
import numpy.random as r

In [2]:

# Establish a sqlite connection (this creates orders.db if it doesn't already exist)
conn = sqlite3.connect('orders.db')

# Create a cursor, which we use to execute statements
c = conn.cursor()

In [3]:

# Outline our SQL statement that creates the table `orders`
create_statement = """CREATE TABLE orders (
                    order_item_id integer,
                    order_id integer,
                    order_date text,
                    product_id integer,
                    quantity integer,
                    unit_price real,
                    product_group varchar)"""

# If the orders table already exists, this try-except statement will drop it and create a new one
try:
    c.execute(create_statement)
except sqlite3.OperationalError: 
    c.execute('DROP TABLE orders')
    conn.commit()
    c.execute(create_statement)

conn.commit()

In [4]:

# Create a test entry: order_1
order_1 = (881,
           4,
          "2018-10-20",
          9,
          2,
          4.99,
          "cleanser")

Best practice here is to use ?s when inserting arbitrary values; most other string formatting solutions are vulnerable to SQL injections.

In [5]:

c.execute('INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)', order_1)

conn.commit()

Let's grab our one entry using a WHERE statement. We'll find it by date:

In [6]:

sql = """SELECT * FROM orders
            WHERE order_date LIKE ?"""

args = ['2018' + "%"] # Read as: "begins with 2018"

c.execute(sql, args)
conn.commit()

print(c.fetchone())

(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')

numpy's random module doesn't include a function for generating dates, so I wrote one up quickly. It works by selecting only from days between 1 and 29, to avoid adding in rules for which days can be drawn based on the month. Year was restricted to 2015-2019 to favor the exercises.

In [7]:

def generate_dates(n):
    """
    Generates n dates. Includes all months, but only includes days from 1-29 
    and years from 2016-2019.
    """
    
    dates_list = []
    
    for num in range(n):
        dates_list.append('{}-{}-{}'.format(str(r.randint(2015, 2020)), # year
                                            str(r.randint(1, 12)), # month
                                            str(r.randint(1, 30)))) # day
      
    return dates_list
    
# Example dates:
generate_dates(5)

Out[7]:

['2015-8-29', '2018-1-9', '2015-5-19', '2017-7-7', '2016-1-7']

In [8]:

def generate_data(n=10, seed=10):
    """
    Generates a pandas DataFrame with n (default 10) data points for this particular table. 
    Random number seed is set to 10 by default.
    """
    
    r.seed(seed) # set seed; default 10
    
    order_id = r.randint(1, 100, n) # Order ID
    order_item_id = r.randint(100, 1000, n) # Order item ID
    order_date = generate_dates(n) # # Order date
    product_id = r.randint(1, 51, n) # Product ID
    quantity = r.randint(1, 5, n) # Quantity
    unit_price = r.choice([4.99, 7.99, 12.99, 19.99, 29.99], n) # Unit price
    product_group = r.choice(['cleanser', 'moisturizer', 'toner', 'treatment'], n) # Product group
    
    # Create dict
    d = {'order_id': order_id, 
         'order_item_id': order_item_id,
         'order_date': order_date,
         'product_id': product_id,
         'quantity': quantity,
         'unit_price': unit_price,
         'product_group': product_group}
    
    # Create the DataFrame from the dict
    df = pd.DataFrame(data = d)
    
    return df

In [9]:

# Generate 500 data entries
fake_data = generate_data(n=500, seed=6)

fake_data.head()

Out[9]:

	order_id	order_item_id	order_date	product_id	quantity	unit_price	product_group
0	11	398	2017-4-11	33	4	4.99	cleanser
1	74	386	2016-5-22	44	2	4.99	toner
2	85	656	2019-1-19	46	1	7.99	treatment
3	80	860	2015-1-26	40	3	7.99	treatment
4	81	261	2015-3-21	41	3	12.99	toner

In [10]:

# Add fake data to orders table
fake_data.to_sql('orders', conn, if_exists='append', index=False)

In [11]:

# First 10 entries of the table
sql = 'SELECT * FROM orders LIMIT 10'

c.execute(sql)
conn.commit()

for entry in c.fetchall():
    print(entry)

(881, 4, '2018-10-20', 9, 2, 4.99, 'cleanser')
(398, 11, '2017-4-11', 33, 4, 4.99, 'cleanser')
(386, 74, '2016-5-22', 44, 2, 4.99, 'toner')
(656, 85, '2019-1-19', 46, 1, 7.99, 'treatment')
(860, 80, '2015-1-26', 40, 3, 7.99, 'treatment')
(261, 81, '2015-3-21', 41, 3, 12.99, 'toner')
(937, 63, '2017-3-26', 37, 4, 4.99, 'cleanser')
(960, 26, '2017-7-10', 5, 1, 29.99, 'cleanser')
(977, 2, '2015-4-24', 35, 4, 7.99, 'moisturizer')
(291, 76, '2019-1-17', 23, 3, 12.99, 'moisturizer')

Question 1¶

Given an “Orders” table with the above columns, write a SQL query that shows the total quantity ordered & total revenue per product group for the month of July 2015.

In [12]:

# Query needs to:
# Restrict data to July 2015 orders only
# Group by product group 
# Sum the quantity ordered across all orders
# Sum the unit price * quantity across all orders

sql = """SELECT product_group, SUM(quantity), SUM(quantity * unit_price) AS revenue
            FROM orders
            WHERE order_date LIKE ?
            GROUP BY product_group"""

args = ['2015-7' + '%']

c.execute(sql, args)
for entry in c.fetchall():
    print(entry)

('cleanser', 4, 44.96)
('moisturizer', 9, 193.90999999999997)
('toner', 14, 277.86)
('treatment', 12, 155.88)

The reason we need to use question marks is because Python has special instructions for when it sees a % symbol (causing a syntax error when passing arguments directly with sqlite3), so it's simpler to pass arguments separately rather than embedding them in the query.

With the date embedded into the query, it'd look like this:

SELECT
    FROM orders
    WHERE order_date LIKE 2015-7%
    GROUP BY product_group```

In [13]:

# Use the pandas DataFrame to verify that the quantities match
july_2015 = fake_data.loc[fake_data['order_date'].str.contains('2015-7')]
july_2015.head(4)

Out[13]:

	order_id	order_item_id	order_date	product_id	quantity	unit_price	product_group
20	32	637	2015-7-5	32	1	29.99	cleanser
72	70	680	2015-7-23	22	1	19.99	moisturizer
149	28	636	2015-7-16	44	1	4.99	treatment
184	89	254	2015-7-17	29	2	19.99	toner

In [14]:

# Sanity check
july_2015.groupby('product_group')['quantity'].sum()

Out[14]:

product_group
cleanser        4
moisturizer     9
toner          14
treatment      12
Name: quantity, dtype: int64

Question 2¶

Using the “Orders” table, which products (identified by product_id) were sold at least twice in 2017 and once in 2018? Write a SQL query that will return the answer.

In [15]:

sql = """SELECT product_id
             FROM (SELECT product_id, COUNT(*) FROM orders
                AS sold_2_2017
                WHERE order_date LIKE ?
                GROUP BY product_id 
                HAVING COUNT(*) >= 2)
            WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE ?)"""

args = [('2017' + '%'), ('2018' + '%')]

c.execute(sql, args)
for entry in c.fetchall():
    print(entry)

(1,)
(2,)
(4,)
(5,)
(7,)
(12,)
(15,)
(16,)
(21,)
(22,)
(25,)
(26,)
(27,)
(30,)
(31,)
(32,)
(33,)
(38,)
(41,)
(42,)
(44,)
(45,)
(46,)
(47,)
(48,)
(50,)

To break down the above query: What we're doing is selecting products that were also sold in 2018 from a nested table of products that were sold twice in 2017.

Given the date structure "YYYY-MM-DD", it would look like this as a regular SQL query:

SELECT
     FROM (SELECT product_id, COUNT(*) FROM orders
        AS sold_2_2017
        WHERE order_date LIKE 2017%
        GROUP BY product_id 
        HAVING COUNT(*) >= 2)
    WHERE product_id IN (SELECT product_id FROM orders WHERE order_date LIKE 2018%)```

In [16]:

# Close up the connection
c.close()