Python Homework with Chipotle Data - Explained¶

Original version written by Alex Sherman

Part 1¶

Read in the file with csv.reader() and store it in an object called 'file_nested_list'.
Hint: This is a TSV file, and csv.reader() needs to be told how to handle it.

In [1]:

# Change the working directory to the 'data' directory
%cd ../data

c:\Users\Kevin\Desktop\DAT8\data

In [2]:

# To use csv.reader, we must import the csv module
import csv

# The csv.reader has a delimeter parameter, which we set to '\t' to indicate that the file is tab-separated
with open('chipotle.tsv', mode='rU') as f:   # We temporarily refer to the file by the variable name f for file
    file_nested_list = [row for row in csv.reader(f, delimiter='\t')]   # Create a list by looping through each line in f

Why use csv.reader?¶

As stated in the CSV file reading and writing documentation:

There is no "CSV standard", so the format is operationally defined by the many applications which

read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.

In other words, depending on the source, there may be intricacies in the data format. These are not always easy to distinguish - for instance, non-visible new line characters. The csv.reader module is built to handle these intricacies, and thus provides an efficient way to load data.

This is why we prefer: file_nested_list = [row for row in csv.reader(f, delimiter='\t')]

Instead of: file_nested_list = [row.split('\t') for row in f]

Part 2¶

Separate 'file_nested_list' into the 'header' and the 'data'.

In [3]:

header = file_nested_list[0]
data = file_nested_list[1:]

Part 3¶

Calculate the average price of an order.
Hint: Examine the data to see if the 'quantity' column is relevant to this calculation.
Hint: Think carefully about the simplest way to do this!

We want to find the average price of an order. This means we need the sum of the price of all orders and the total number of orders.

Calculating the sum of the price of all orders¶

In [4]:

# After exploring our data for a minute, we find two orders for the same item - Chicken Bowl - differing by the quantity
print header
print data[4]
print data[5]

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
['2', '2', 'Chicken Bowl', '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]', '$16.98 ']
['3', '1', 'Chicken Bowl', '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]', '$10.98 ']

We see that the item_price field reflects the quantity ordered. Thus, to calculate the total value of all orders, we can safely ignore the quantity column because the item_price takes quantity into account.

In [5]:

# We want the sum of all the order prices - the last item in each list. Here are two ways we could get this data:

# Option 1
prices = [row[4] for row in data]   # slice to position four

# Option 2
prices = [row[-1] for row in data]   # slice to the last position

# Let's look at the first five results:
prices[0:5]

Out[5]:

['$2.39 ', '$3.39 ', '$3.39 ', '$2.39 ', '$16.98 ']

In [6]:

# Each item in the list is a sting.  We can tell this because the results above are wrapped in quotes.
# To confirm, let's explicity check the type of the first item in the list:
type(prices[0])

Out[6]:

str

In [7]:

# Since we want to do a calculation, we need to change the type from string to float. 
# To do this, we first need to remove the $. Here are two different ways to accomplish this:

# Option 1
prices = [row[4][1:] for row in data]   # remove the dollar sign by slicing

# Option 2
prices = [row[4].replace('$', '') for row in data]   # remove the dollar sign by replacing '$' with an empty string

# Let's look at the first five results:
prices[0:5]

Out[7]:

['2.39 ', '3.39 ', '3.39 ', '2.39 ', '16.98 ']

In [8]:

# Now we can convert our results to floats
prices = [float(row[4][1:]) for row in data]

# Let's look at the first five results and check the type of the first item:
print prices[0:5]
print type(prices[0])

[2.39, 3.39, 3.39, 2.39, 16.98]
<type 'float'>

In [9]:

# Finally, we calculate our total order sum with the built-in sum function
total_order_sum = sum([float(row[4][1:]) for row in data]) 
total_order_sum

Out[9]:

34500.16000000046

Calculating the total number of orders¶

In [10]:

# We can look at the first and last items in the list
print header
print data[0]
print data[-1]

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 ']
['1834', '1', 'Chicken Salad Bowl', '[Fresh Tomato Salsa, [Fajita Vegetables, Pinto Beans, Lettuce]]', '$8.75 ']

It seems that there are 1834 orders. You could assume this since that's the maximum order_id, but it is best to check, as we are not certain that the data is clean. If the data was not sorted by order or if there was a missing order, then 1834 might not be correct.

So, let's confirm this assumption:

In [11]:

# First, let's build a list of the order_ids
order_ids = [row[0] for row in data]

# Let's look at the first ten results
order_ids[0:10]

Out[11]:

['1', '1', '1', '1', '2', '3', '3', '4', '4', '5']

In [12]:

# We only want to count each order once. We can get the distinct order values with the set function:
set(order_ids[0:10])

Out[12]:

{'1', '2', '3', '4', '5'}

In [13]:

# Only keep unique order_ids
unique_order_ids = set(order_ids)

# Use the len function to determine the number of unique order_ids
num_orders = len(unique_order_ids)
num_orders

Out[13]:

Calculating the average price¶

In [14]:

# Finally, we answer the question by calculating the average
average_order_price = total_order_sum / num_orders
average_order_price

Out[14]:

18.811428571428824

In [15]:

# Let's recap by looking at the final code:
total_order_sum = sum([float(row[4][1:]) for row in data])
num_orders = len(set([row[0] for row in data]))
average_order_price = round(total_order_sum / num_orders, 2)   # Let's round our result to 2 decimal places
average_order_price

Out[15]:

18.81

Part 4¶

Create a list (or set) of all unique sodas and soft drinks that they sell.
Note: Just look for 'Canned Soda' and 'Canned Soft Drink', and ignore other drinks like 'Izze'.

In [16]:

# First let's look at all of the items
distinct_items = set([row[2] for row in data])
distinct_items 

Out[16]:

{'6 Pack Soft Drink',
 'Barbacoa Bowl',
 'Barbacoa Burrito',
 'Barbacoa Crispy Tacos',
 'Barbacoa Salad Bowl',
 'Barbacoa Soft Tacos',
 'Bottled Water',
 'Bowl',
 'Burrito',
 'Canned Soda',
 'Canned Soft Drink',
 'Carnitas Bowl',
 'Carnitas Burrito',
 'Carnitas Crispy Tacos',
 'Carnitas Salad',
 'Carnitas Salad Bowl',
 'Carnitas Soft Tacos',
 'Chicken Bowl',
 'Chicken Burrito',
 'Chicken Crispy Tacos',
 'Chicken Salad',
 'Chicken Salad Bowl',
 'Chicken Soft Tacos',
 'Chips',
 'Chips and Fresh Tomato Salsa',
 'Chips and Guacamole',
 'Chips and Mild Fresh Tomato Salsa',
 'Chips and Roasted Chili Corn Salsa',
 'Chips and Roasted Chili-Corn Salsa',
 'Chips and Tomatillo Green Chili Salsa',
 'Chips and Tomatillo Red Chili Salsa',
 'Chips and Tomatillo-Green Chili Salsa',
 'Chips and Tomatillo-Red Chili Salsa',
 'Crispy Tacos',
 'Izze',
 'Nantucket Nectar',
 'Salad',
 'Side of Chips',
 'Steak Bowl',
 'Steak Burrito',
 'Steak Crispy Tacos',
 'Steak Salad',
 'Steak Salad Bowl',
 'Steak Soft Tacos',
 'Veggie Bowl',
 'Veggie Burrito',
 'Veggie Crispy Tacos',
 'Veggie Salad',
 'Veggie Salad Bowl',
 'Veggie Soft Tacos'}

Our first goal is to reduce the dataset to only soda and soft drink orders.

It appears that the only items that use the word 'Canned' are 'Canned Soda' and 'Canned Soft Drink.'

This means we only need to use one filter criteria: Look for rows with the word 'Canned'

In [17]:

# Create a list only including soda and soft drink orders
soda_orders = []
for row in data:
    if 'Canned' in row[2]:
        soda_orders.append(row)

# Let's look at the first five results:
soda_orders[0:5]

Out[17]:

[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],
 ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],
 ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],
 ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],
 ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]

In [18]:

# This can also be done using a list comprehension with an 'if' condition
soda_orders = [row for row in data if 'Canned' in row[2]]

Just out of interest, let's look at two other ways we could have filtered the data:

In [19]:

soda_orders = [row for row in data if 'Canned Soda' in row[2] or 'Canned Soft Drink' in row[2]]
soda_orders[0:5]

Out[19]:

[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],
 ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],
 ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],
 ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],
 ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]

In [20]:

soda_orders = [row for row in data if 'Canned Soda' == row[2] or 'Canned Soft Drink' == row[2]]
soda_orders[0:5]

Out[20]:

[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],
 ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],
 ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],
 ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],
 ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]

In [21]:

# We only want the choice_description (e.g. Sprite, Mountain Dew). This is the fourth item in the list.
# Since Python uses 0-based indexing, we get this by using row[3] as the first argument in our list comprehension:
sodas = [row[3] for row in data if 'Canned' in row[2]]

# Let's look at the first five results
sodas[0:5]

Out[21]:

['[Sprite]', '[Dr. Pepper]', '[Mountain Dew]', '[Sprite]', '[Dr. Pepper]']

The results above may look like 5 lists inside of a larger list. Let's assume that's the case, and try to get the first Sprite:

In [22]:

sodas[0][0]

Out[22]:

'['

What is going on?

The raw data for choice_description includues brackets (e.g. [Sprite]). We loaded this data in as a string, so while it looks like we have lists inside lists, the result is actually just one list. This is indicated by the quotes wrapping each item in the list, which means the list contains strings.

In [23]:

# Print the first list element
print sodas[0]

# Show that it's a string
print type(sodas[0])

# It is 8 characters long, including the brackets
print len(sodas[0])

[Sprite]
<type 'str'>
8

In [24]:

# Let's strip the brackets at the start and end of each soda name, using [1:-1] to remove the first and last characters
sodas = [row[3][1:-1] for row in data if 'Canned' in row[2]]

# Let's look at the first five results
sodas[0:5]

Out[24]:

['Sprite', 'Dr. Pepper', 'Mountain Dew', 'Sprite', 'Dr. Pepper']

In [25]:

# Almost done - we just need to get rid of duplicate values
unique_sodas = set([row[3][1:-1] for row in data if 'Canned' in row[2]])   # Success in one line of code!
unique_sodas

Out[25]:

{'Coca Cola',
 'Coke',
 'Diet Coke',
 'Diet Dr. Pepper',
 'Dr. Pepper',
 'Lemonade',
 'Mountain Dew',
 'Nestea',
 'Sprite'}

Just for reference, how would this look if we did not use a list comprehension?

In [26]:

# build a list of all sodas
sodas = []
for row in data:
    if 'Canned' in row[2]:
        sodas.append(row[3][1:-1])   # strip the brackets

# create a set of unique sodas
unique_sodas = set(sodas)

Part 5¶

Calculate the average number of toppings per burrito.
Note: Let's ignore the 'quantity' column to simplify this task.
Hint: Think carefully about the easiest way to count the number of toppings!

To calculate the average number of toppings, we simply need to divide the total number of burritos by the total number of toppings.

Calculating the total number of burritos¶

In [27]:

# keep a running total
burrito_count = 0

# loop through the data, looking for lines containing 'Burrito'
for row in data:
    if 'Burrito' in row[2]:
        burrito_count = burrito_count + 1

Like many programming languages, Python allows you to use x += 1 as a replacement for x = x + 1. Let's use that instead:

In [28]:

# keep a running total
burrito_count = 0

# loop through the data, looking for lines containing 'Burrito'
for row in data:
    if 'Burrito' in row[2]:
        burrito_count += 1   # this is the only line that changed

In [29]:

burrito_count

Out[29]:

The count is 1172, which seems reasonable given the total number of orders (1834).

Calculating the total number of toppings¶

In [30]:

# Let's look at a single burrito order
data[7]

Out[30]:

['4',
 '1',
 'Steak Burrito',
 '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
 '$11.75 ']

In [31]:

# There appear to be 8 toppings:
data[7][3]

Out[31]:

'[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]'

With all of this formatting within the string, what's the easiest way to count the number of toppings?

Start by asking yourself: How did you count the number of toppings? You probably looked for commas!

In [32]:

# Use the string method 'count' to count the number of commas
data[7][3].count(',')

Out[32]:

And of course, if there are 7 commas, that means there are 8 toppings.

So, let's revise our original loop:

In [33]:

# keep a running total of burritos and toppings
burrito_count = 0
topping_count = 0

# calculate number of toppings by counting the commas and adding 1
for row in data:
    if 'Burrito' in row[2]:
        burrito_count += 1
        topping_count += (row[3].count(',') + 1)

print burrito_count
print topping_count

1172
6323

Calculating the average number of toppings¶

In [34]:

# calculate the average topping count and round to 2 digits
round(topping_count / float(burrito_count), 2)

Out[34]:

5.4

Just for reference, how would this look if we used list comprehensions?

In [35]:

burrito_count = sum(1 for row in data if 'Burrito' in row[2])
topping_count = sum([row[3].count(',') + 1 for row in data if 'Burrito' in row[2]])
round(topping_count / float(burrito_count), 2)

Out[35]:

5.4

Part 6¶

Create a dictionary in which the keys represent chip orders and the values represent the total number of orders.
Expected output: {'Chips and Roasted Chili-Corn Salsa': 18, ... }
Note: Please take the 'quantity' column into account!
Optional: Learn how to use 'defaultdict' to simplify your code.

Building a dictionary of names¶

Let's pretend I have a list of four names, and I want to make a dictionary in which the key is the name, and the value is the count of that name.

In [36]:

# This is my list of names
names = ['Ben', 'Victor', 'Laura', 'Victor']

I want to create a dictionary that looks like this:

{'Ben':1, 'Laura':1, 'Victor':2}

How would I do that? Here's my first attempt:

In [37]:

# create empty dictionary
name_count = {}

# loop through list of names
for name in names:
    # set the name as the key and 1 as the value
    name_count[name] = 1

name_count

Out[37]:

{'Ben': 1, 'Laura': 1, 'Victor': 1}

Well, that creates a dictionary, but it didn't count Victor twice.

Let's try again:

In [38]:

name_count = {}

for name in names:
    # increment the value
    name_count[name] += 1

name_count

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-72c83b89be69> in <module>()
      3 for name in names:
      4     # increment the value
----> 5     name_count[name] += 1
      6 
      7 name_count

KeyError: 'Ben'

That doesn't work because the dictionary starts out empty, and you can't tell Python to "increment the Ben value by 1" unless the Ben value starts at 0.

Let's try to fix that:

In [39]:

name_count = {}

for name in names:
    # initially set every name to 0
    name_count[name] = 0

for name in names:
    # increment the value
    name_count[name] += 1

name_count

Out[39]:

{'Ben': 1, 'Laura': 1, 'Victor': 2}

By looping through the list twice, we fixed the problem. But that's kind of clunky.

Here's what we really want to do:

In [40]:

name_count = {}

for name in names:
    
    # check if the key is already present in the dictionary
    if name not in name_count:
        name_count[name] = 1   # this is a new key, so create key/value pair
    else:
        name_count[name] += 1   # this is an existing key, so add to the value

name_count

Out[40]:

{'Ben': 1, 'Laura': 1, 'Victor': 2}

Applying this technique to chip orders¶

In [41]:

# Reminder on the header
header

Out[41]:

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [42]:

# Find all the chip orders
chip_orders = [row for row in data if 'Chips' in row[2]]

# Look at the first five
chip_orders[:5]

Out[42]:

[['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
 ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
 ['5', '1', 'Chips and Guacamole', 'NULL', '$4.45 '],
 ['7', '1', 'Chips and Guacamole', 'NULL', '$4.45 ']]

In [43]:

# The chip quantities are easily accessible
chip_quantities = [row[1] for row in data if 'Chips' in row[2]]

# Look at the first five
chip_quantities[:5]

Out[43]:

['1', '1', '1', '1', '1']

Let's put this all together!

In [44]:

# start with an empty dictionary
chips = {}

# if chip order is not in dictionary, then add a new key/value pair
# if chip order is already in dictionary, then update the value for that key
for row in data:
    if 'Chips' in row[2]:
        if row[2] not in chips:
            chips[row[2]] = int(row[1])     # this is a new key, so create key/value pair
        else:
            chips[row[2]] += int(row[1])    # this is an existing key, so add to the value

chips

Out[44]:

{'Chips': 230,
 'Chips and Fresh Tomato Salsa': 130,
 'Chips and Guacamole': 506,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Roasted Chili Corn Salsa': 23,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Tomatillo Green Chili Salsa': 45,
 'Chips and Tomatillo Red Chili Salsa': 50,
 'Chips and Tomatillo-Green Chili Salsa': 33,
 'Chips and Tomatillo-Red Chili Salsa': 25,
 'Side of Chips': 110}

Using defaultdict instead¶

defaultdict simplifies this task, because it saves you the trouble of checking whether a key already exists.

Here's a simple example using the names data:

In [45]:

# This is a tiny variation of our code that previously raised an error

# Create an empty dictionary that will eventually contain integers (and thus the default value is 0)
from collections import defaultdict
name_count = defaultdict(int)

# We no longer have to check if the key is present
for name in names:
    name_count[name] += 1

name_count

Out[45]:

defaultdict(int, {'Ben': 1, 'Laura': 1, 'Victor': 2})

In [46]:

# It will print nicely if we convert it to a regular dictionary
dict(name_count)

Out[46]:

{'Ben': 1, 'Laura': 1, 'Victor': 2}

In [47]:

# Apply this to the chip orders
dchips = defaultdict(int)

for row in data:
    if 'Chips' in row[2]:
        dchips[row[2]] += int(row[1])

dict(dchips)

Out[47]:

{'Chips': 230,
 'Chips and Fresh Tomato Salsa': 130,
 'Chips and Guacamole': 506,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Roasted Chili Corn Salsa': 23,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Tomatillo Green Chili Salsa': 45,
 'Chips and Tomatillo Red Chili Salsa': 50,
 'Chips and Tomatillo-Green Chili Salsa': 33,
 'Chips and Tomatillo-Red Chili Salsa': 25,
 'Side of Chips': 110}