#!/usr/bin/env python
# coding: utf-8
# *This notebook explores the 0/1 knapsack problem as formulated in lectures 1 and 2 of MIT's 6.00.2x course.*
#
#
#
# # The 0/1 Knapsack Problem
#
# The 0/1 knapsack problem arises whenever you want to maximize some value by selecting a subset of items while obeying a constraint. For example, a robber is deciding which items to steal: he can't take everything (too heavy), so he wants to maximize the value of what he carries. Another formulation has an individual on a calorie-restricted diet: she wants to maximize the enjoyment from the food she eats while staying at or below a set calorie limit.
#
# As an aside, it's called the "0/1" knapsack problem because it is discrete: the robber either takes an item or does not, and food is either consumed or left untouched. The "continuous" knapsack problem is significantly easier to solve because items can be split: take whatever is worth the most per unit of weight first, right up to the limit. For example, if the robber comes across a store of gold dust, he can just fill his bag as high as it can go.
#
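# The greedy idea for the continuous variant can be sketched as follows. This is a minimal illustration, not part of the course material: `fractional_knapsack` and the sample loot are hypothetical names and made-up numbers, and every weight is assumed to be positive.
# In[ ]:
def fractional_knapsack(items, capacity):
    """(list of (name, value, weight), number) -> (float, list)
    Greedy solution to the continuous knapsack problem: take items in order of
    value per unit weight, splitting the last item if it does not fit whole."""
    remaining = capacity
    total = 0.0
    taken = []  # (name, fraction taken) pairs
    by_density = sorted(items, key=lambda item: item[1] / item[2], reverse=True)
    for name, value, weight in by_density:
        if remaining <= 0:
            break
        # Take the whole item if it fits, otherwise only the fraction that does
        fraction = min(1.0, remaining / weight)
        total += value * fraction
        remaining -= weight * fraction
        taken.append((name, fraction))
    return total, taken


# Made-up loot: (name, value, weight)
loot = [("gold dust", 60, 10), ("silver dust", 100, 20), ("copper dust", 120, 30)]
best_value, picks = fractional_knapsack(loot, 50)
print(best_value, picks)
# Splitting the last item is exactly what the 0/1 version forbids, which is why the same greedy rule is no longer guaranteed to be optimal there.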
# ## Diet Scenario
#
# I will look at the diet scenario as it is the one covered by the course.
#
# > Dave is on a calorie-restricting diet that limits him to 750 calories per meal. He arrives at a restaurant and is trying to decide what to order. He assigns pleasure values to each food and makes note of their cost (in calories).
#
# **Simplification**: each item on the menu can only be ordered once.
#
# ### Data
#
# Provided by the course.
#
# I'm going to try out [pandas](http://pandas.pydata.org/) for this.
# In[1]:
import pandas as pd
# In[2]:
# Read data from csv file
data = pd.read_csv("food-value-calories.csv")
# In[3]:
# Display the data
# head() displays the first 5 rows by default. To view all of them, pass the total number of rows (for a table this small, evaluating `data` on its own also shows everything).
data.head(len(data))
# In[4]:
# Compute some descriptive statistics
data.describe()
# In[5]:
# Plot (because why not)
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
# Set style
import matplotlib
matplotlib.style.use('ggplot')
# In[6]:
data.plot(kind="bar", x="Food")
# ## Tackling the Problem
#
# The objective is to find the set of menu items with the highest total value whose calories add up to at most 750. For instance, Dave could pick:
#
# - wine
#
# - burger
#
# - donut
#
# This order has a total value of 234 and a total cost of 476 calories. It is a valid choice, but not an optimal one.
#
# ## Finding an Optimal Solution
#
# Finding an optimal solution by brute force is conceptually straightforward:
#
# 1. Gather all possible sets of items
#
# 2. Eliminate invalid sets (i.e. sets with calorie counts larger than 750)
#
# 3. Sort the sets by value
#
# 4. The first set in the sorted list of sets is the optimal solution
#
# ### Implementation
#
# I am not familiar enough with pandas to use it properly, so I'm going to convert the data into a regular Python list.
# In[7]:
menu = data.values.tolist()
menu
# In[8]:
# Set constant
CALORIE_LIMIT = 750
# In[9]:
def power_set(set_):
    """Binary powerset algorithm."""
    # named `subsets` so the local list does not shadow the function name
    subsets = []
    power_cardinality = 2**len(set_)
    # the number of binary digits needed
    digit_count = len(set_)
    # setting up the formatting
    format_spec = '0' + str(digit_count) + 'b'
    for n in range(power_cardinality):
        subset = []
        binary_n = format(n, format_spec)
        # for every character in a binary number
        for i, char in enumerate(binary_n):
            if char == '1':
                # when char is 1, the element in set_ with matching index is present in the subset
                subset.append(set_[i])
        subsets.append(subset)
    return subsets
# In[10]:
# The powerset is the list of all possible sets of items
menu_power = power_set(menu)
menu_power
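# As an aside, the standard library can produce the same subsets without the binary-string bookkeeping. This is a minimal sketch using `itertools`; the name `power_set_itertools` is hypothetical and not from the course. The subsets come out grouped by size rather than in binary-counting order, but the collection is the same.
# In[ ]:
from itertools import chain, combinations


def power_set_itertools(items):
    """Return every subset of items as a list of lists, smallest subsets first."""
    items = list(items)
    # combinations(items, r) yields every subset of size r; chain them for r = 0..n
    all_sizes = (combinations(items, r) for r in range(len(items) + 1))
    return [list(subset) for subset in chain.from_iterable(all_sizes)]


demo_subsets = power_set_itertools(["a", "b", "c"])
print(len(demo_subsets))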
# To help eliminate invalid sets, I'll write a function which checks whether the total calorie count of a given set is within the limit.
# In[11]:
def valid_choice(choice):
    """(list) -> bool
    Given a list of chosen foods, return True if their total cost does not exceed CALORIE_LIMIT."""
    total_cost = 0
    for food in choice:
        total_cost += food[2]
    # "<=" because a meal of exactly CALORIE_LIMIT calories is still allowed
    return total_cost <= CALORIE_LIMIT
# In[12]:
# Collect valid sets
valid_choices = []
for choice in menu_power:
    if valid_choice(choice):
        valid_choices.append(choice)
valid_choices
# To help sort the valid choices by value, I'll write a function to calculate the total value of a given choice.
# In[13]:
def total_value(choice):
    """(list) -> int
    Given a list of foods, returns the sum of their values."""
    total = 0
    for food in choice:
        total += food[1]
    return total
# In[14]:
sorted_choices = sorted(valid_choices, key=total_value, reverse=True)
print("The optimal menu choice is", sorted_choices[0])
# In[15]:
print("Total value:", total_value(sorted_choices[0]))
# In[16]:
total_cost = 0
for food in sorted_choices[0]:
    total_cost += food[2]
print("Total cost:", total_cost)
# This was fun, but my implementation isn't the best. In the course, the data is converted into objects (e.g. there is a Food class). Hmm...I wonder what the best way to carry out this kind of analysis is? Maybe I'll look into this. *Moving on...*
#
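# One lightweight way to get named fields without leaving plain Python is a named tuple. This is only a sketch of the idea and is not the course's `Food` class: the class below, `menu_totals`, and the sample rows are all hypothetical. It assumes rows shaped like the ones in `menu`, i.e. (name, value, calories).
# In[ ]:
from typing import NamedTuple


class Food(NamedTuple):
    """One menu item; fields mirror the columns used above (name, value, calories)."""
    name: str
    value: int
    calories: int


def menu_totals(choice):
    """Return (total value, total calories) for a list of Food records."""
    return sum(f.value for f in choice), sum(f.calories for f in choice)


# Made-up rows, not the course data
demo_foods = [Food(*row) for row in [["soup", 40, 120], ["salad", 30, 80]]]
print(menu_totals(demo_foods))
# Because a named tuple is still a tuple, index-based code such as `food[1]` keeps working alongside `food.value`.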
# ## Greedy Algorithms
#
# Finding the optimal solution this way is computationally expensive: a set of $n$ items has $2^n$ subsets, so just building the power set takes exponential time. This dataset is small enough that I can be as inefficient as I want, but the approach does not scale. Greedy algorithms offer a way to determine a "good" (but not necessarily optimal) solution in a lot less time.
#
# A greedy algorithm for the 0/1 knapsack problem is:
#
# ```
# while knapsack is not full:
#     put "best" available item into it
# ```
#
# The definition of "best" is up for debate. It could mean:
#
# - highest value
#
# - lowest cost
#
# - highest ratio of value to cost (value/cost)
#
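# The three definitions of "best" differ only in how the items are ranked, so each can be written as a sort key. A minimal sketch, with hypothetical names (`greedy_pick`, `criteria_keys`) and made-up items; it assumes (name, value, cost) rows with positive costs.
# In[ ]:
def greedy_pick(items, limit, key):
    """Take items in descending order of key(item) whenever they still fit within limit.
    Returns (names, total value, total cost)."""
    names, value, cost = [], 0, 0
    for name, item_value, item_cost in sorted(items, key=key, reverse=True):
        if cost + item_cost <= limit:
            names.append(name)
            value += item_value
            cost += item_cost
    return names, value, cost


# Larger key means "better", so lowest cost is expressed as a negated cost
criteria_keys = {
    "value": lambda item: item[1],
    "cost": lambda item: -item[2],
    "ratio": lambda item: item[1] / item[2],
}

# Made-up items, not the course data
demo_items = [["a", 10, 5], ["b", 6, 2], ["c", 7, 4]]
for label, key in criteria_keys.items():
    print(label, greedy_pick(demo_items, 6, key))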
# ### Implementation
#
# Fairly simple:
#
# 1. Sort the data set by the criterion we think is "best"
#
# 2. Loop through the sorted set, taking items until reaching the limit
# In[17]:
menu
# In[18]:
# Sorted from highest to lowest value
by_value = sorted(menu, key=lambda food: food[1], reverse=True)
by_value
# In[19]:
# Sorted from smallest to largest cost
by_cost = sorted(menu, key=lambda food: food[2])
by_cost
# In[20]:
# Sorted from greatest to least value/cost (i.e. best "bang for your buck")
by_ratio = sorted(menu, key=lambda food: food[1]/food[2], reverse=True)
by_ratio
# In[21]:
def order_to_limit(menu, criteria):
    """(list, str) -> None
    Given a menu, orders as many items as possible (until reaching CALORIE_LIMIT).
    Prints the results."""
    cost = 0
    value = 0
    ordered = []
    for food in menu:
        f_cost = food[2]
        f_value = food[1]
        f_name = food[0]
        # If the calorie cost + calories already consumed does not exceed the limit
        # ("<=" so that a meal of exactly CALORIE_LIMIT calories is allowed)
        if (cost + f_cost) <= CALORIE_LIMIT:
            # Order the food
            cost += f_cost
            value += f_value
            ordered.append(f_name)
    print("With {} criteria:".format(criteria))
    print(" food ordered:", ordered)
    print(" total calories:", cost)
    print(" total value:", value)
# In[22]:
# Run some computations, print the results
order_to_limit(by_value, "VALUE")
print()
order_to_limit(by_cost, "COST")
print()
order_to_limit(by_ratio, "RATIO")
# For comparison,
#
# ```
# Optimal solution:
# food ordered: ['wine', 'beer', 'pizza', 'cola']
# total calories: 685
# total value: 353
# ```
#
#
# Though none of the greedy results matched the optimal solution, they were close. They are also far more efficient: sorting the menu dominates at $O(n \log n)$, and the pass that picks items is only $O(n)$.
#
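# A tiny instance makes the gap between greedy and optimal easy to see. This is a sketch with made-up items (not the course data) and hypothetical helper names; the brute-force search plays the role of the power-set approach from earlier.
# In[ ]:
from itertools import combinations


def best_by_brute_force(items, limit):
    """Return (value, names) of the highest-value subset whose cost is within limit."""
    best = (0, [])
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            cost = sum(c for _, _, c in combo)
            value = sum(v for _, v, _ in combo)
            if cost <= limit and value > best[0]:
                best = (value, [n for n, _, _ in combo])
    return best


def greedy_by_ratio(items, limit):
    """Return (value, names) picked greedily by value/cost ratio."""
    value, cost, names = 0, 0, []
    for n, v, c in sorted(items, key=lambda item: item[1] / item[2], reverse=True):
        if cost + c <= limit:
            value, cost = value + v, cost + c
            names.append(n)
    return value, names


# (name, value, cost) with a limit of 50
toy = [("x", 60, 10), ("y", 100, 20), ("z", 120, 30)]
toy_optimal = best_by_brute_force(toy, 50)
toy_greedy = greedy_by_ratio(toy, 50)
print("optimal:", toy_optimal)
print("greedy: ", toy_greedy)
# Here the item with the best ratio ("x") uses up room that the optimal pair ("y" and "z") needs, so the ratio rule ends up with a lower total.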
# **tl;dr**: Dave should order wine, beer, pizza, and a cola.
#
# > **DAVE:** "Three drinks and a pizza isn't a meal."
# >
# > **SCIENTIST:** "FOOL! You should have been more specific with your constraints!"
#
#
# # Extra Bits - Plotting the Search Space
#
# Remember the set of all possible *valid* orders? I think it would be neat to see visually.
#
# ## The Data
# In[23]:
# From earlier
valid_choices
# ## The Plan
#
# **x-axis**: The set ID (a meaningless number corresponding to an individual set in the list of sets).
#
# **y-axis**: The total pleasure value of that set.
#
# **Chart type**: Line
#
# ## Data Prep
#
# So, I think I can achieve my goal by creating a list that holds the total value of each valid choice, in order, then converting it to a pandas Series, and finally creating the plot.
# In[24]:
summed_values = []
for s in valid_choices:
    summed_values.append(total_value(s))
# In[25]:
summed_values = pd.Series(summed_values)
summed_values
# In[26]:
value_plot = summed_values.plot()
value_plot.set(xlabel="Order #", ylabel="Total pleasure value")
# Neat!
#
# ## More Nonsense
#
# To get more experience with these tools, I want to do a bit more. I'm going to add total calorie cost to the plot.
# In[27]:
def total_cost(choice):
    """(list) -> int
    Given a list of foods, returns the sum of their costs."""
    total = 0
    for food in choice:
        total += food[2]
    return total
# In[28]:
summed_costs = []
for s in valid_choices:
    summed_costs.append(total_cost(s))
summed_costs
# In[29]:
cost_value_data = pd.DataFrame({'Total pleasure': summed_values, 'Total cost': summed_costs})
cost_value_data
# In[30]:
cvp = cost_value_data.plot()
# The y-axis now carries both series, so label it for both
cvp.set(xlabel="Order #", ylabel="Total (pleasure value or calories)")