import numpy as np
import pandas as pd
Imagine you have 4 apples with these attributes:
apples_docs = [
"red round",
"red round",
"green sour round",
"green round",
]
and 3 bananas with these attributes:
bananas_docs = [
"yellow skinny",
"yellow skinny",
"green skinny"
]
Split into list of lists:
apples = [a.split() for a in apples_docs]
bananas = [b.split() for b in bananas_docs]
Q. What is the sorted set of all attributes (assign to vocabulary variable $V$)?
(Let's ignore the unknown word issue in our vectors and in our computations.)
You can compute like this:
Va = set(np.concatenate(apples))
Vb = set(np.concatenate(bananas))
V = sorted(Va.union(Vb))
Q. What is the word vector for the "red round" apple?
The column values are 1 if the word is mentioned otherwise 0. Assume the sorted column order.
Q. What is the word vector for the "green sour round" apple?
The column values are 1 if the word is mentioned otherwise 0. Assume the sorted column order.
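One way to check both word-vector answers is sketched below. The helper name `word_vector` is hypothetical (it is not part of the notebook), and the vocabulary is written out by hand on the assumption that it matches the sorted $V$ computed earlier.

```python
# Vocabulary in sorted order (assumed to match V computed above)
V = ["green", "red", "round", "skinny", "sour", "yellow"]

def word_vector(doc: str, vocab: list) -> list:
    """Hypothetical helper: 0/1 vector marking which vocab words appear in doc."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

red_round = word_vector("red round", V)
green_sour_round = word_vector("green sour round", V)
print(red_round)         # [0, 1, 1, 0, 0, 0]
print(green_sour_round)  # [1, 0, 1, 0, 1, 0]
```

Each position lines up with one column of the sorted vocabulary, so the second and third slots (red, round) are the ones set for the "red round" apple.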
Now let's look at all the fruit vectors together with the fruit target column (0 = apple, 1 = banana):
data = np.zeros((7,len(V)))
for i,row in enumerate(apples+bananas):
    for w in row:
        data[i,V.index(w)] = 1
df = pd.DataFrame(data,columns=V,dtype=int)
df['fruit'] = [0,0,0,0,1,1,1]
df
|   | green | red | round | skinny | sour | yellow | fruit |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
Q. What are good estimates of $P(apple)$ (assign to P_apple) and $P(banana)$ (assign to P_banana)?
(Define those variables.)
P_apple = 4/7
P_banana = 3/7
Q. What are good estimates of $P(red|apple)$ and $P(red|banana)$?
Count the number of times red appears in the apple docs and divide by the total number of words in the apple docs; then do the same for the banana docs.
Q. What are good estimates of $P(green|apple)$ and $P(green|banana)$?
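The counting recipe for red and green can be sketched as follows. This is only an illustration of the unsmoothed estimate (word count divided by total words in the class); the helper `unsmoothed_estimates` is a hypothetical name, not something defined in the notebook.

```python
from collections import Counter

apples_docs = ["red round", "red round", "green sour round", "green round"]
bananas_docs = ["yellow skinny", "yellow skinny", "green skinny"]

def unsmoothed_estimates(docs):
    """Hypothetical helper: P(w|class) = count(w in class) / total words in class."""
    words = [w for d in docs for w in d.split()]
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}, total

p_apple_words, n_apple = unsmoothed_estimates(apples_docs)      # n_apple is 9
p_banana_words, n_banana = unsmoothed_estimates(bananas_docs)   # n_banana is 6

# Words never seen in a class get probability 0 without smoothing
for w in ["red", "green"]:
    print(w, p_apple_words.get(w, 0.0), p_banana_words.get(w, 0.0))
```

With these documents, red appears 2 times out of 9 apple words and 0 times out of 6 banana words; green appears 2 out of 9 and 1 out of 6.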
Q. If $P(skinny|apple)=0$, what is our smoothed estimate?
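A minimal sketch of the smoothed estimate, assuming add-one (Laplace) smoothing with the vocabulary size added to the denominator, which is what the expected outputs in this notebook are consistent with. The function name `laplace` is hypothetical.

```python
def laplace(count: int, total_words: int, vocab_size: int) -> float:
    """Add-one smoothing: (count + 1) / (total words in class + |V|)."""
    return (count + 1) / (total_words + vocab_size)

# skinny never appears in the 9 apple words; the vocabulary has 6 words
p_skinny_apple = laplace(0, 9, 6)
print(round(p_skinny_apple, 6))  # 0.066667
```

The zero count becomes 1/15, so a single unseen word no longer forces the whole product for a class to zero.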
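Now, do that using vector operations to get the smoothed estimates P_w_apple from the apple records.
Recall that df[df.fruit==0] gets you just the apple records. Your $P(w|apple)$ results should be as follows (the fruit row is just the label column passing through the same arithmetic, so ignore it):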
green 0.200000
red 0.200000
round 0.333333
skinny 0.066667
sour 0.133333
yellow 0.066667
fruit 0.066667
w_counts_apple = df[df.fruit==0].sum(axis=0)  # per-word counts over the apple rows
P_w_apple = (w_counts_apple+1) / (9+len(V))   # 9 = total number of words in the apple docs
P_w_apple
Do the same thing to get P_w_banana from the banana records. You should get:
green 0.166667
red 0.083333
round 0.083333
skinny 0.333333
sour 0.083333
yellow 0.250000
fruit 0.333333
w_counts_banana = df[df.fruit==1].sum(axis=0)  # per-word counts over the banana rows
P_w_banana = (w_counts_banana+1) / (6+len(V))  # 6 = total number of words in the banana docs
P_w_banana
Q. Given P_w_apple, what is P_apple_redround, the "probability" that "red round" is an apple?
(We haven't normalized the scores (per our friend Bayes), so they aren't technically probabilities.) Just compute the score we'd use for classification per the lecture. Hint: P_w_apple['skinny'] gives the estimate of $P(skinny|apple)$.
Q. Given P_w_banana, what is P_banana_redround, the "probability" that "red round" is a banana?
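The two "red round" scores can be worked out by hand as sketched here. To keep the snippet self-contained, the smoothed estimates are written as plain dicts holding only the two words needed, with values taken from the expected outputs above; in the notebook itself P_w_apple and P_w_banana are pandas Series indexed the same way.

```python
P_apple, P_banana = 4/7, 3/7

# Smoothed estimates for the two words in the document (stand-ins for the Series)
P_w_apple = {"red": 3/15, "round": 5/15}    # 0.200000, 0.333333
P_w_banana = {"red": 1/12, "round": 1/12}   # 0.083333, 0.083333

# Score = class prior times the product of the word likelihoods
P_apple_redround = P_apple * P_w_apple["red"] * P_w_apple["round"]
P_banana_redround = P_banana * P_w_banana["red"] * P_w_banana["round"]
print(f"{P_apple_redround:.6f} {P_banana_redround:.6f}")  # 0.038095 0.002976
```

The apple score is larger, so "red round" is classified as an apple.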
Here's how to easily compute the probability of each word in V given class apple and class banana:
[P_w_apple[w] for w in V]
[P_w_banana[w] for w in V]
Now, define a function for computing the likelihood of the document at index $d \in [0,6]$:
$$ c^* = \underset{c}{\operatorname{argmax}} ~ P(c) \prod_{w \in V} P(w | c)^{n_w(d)} $$
You have these pieces: P_apple, P_w_apple, and $n_w(d)$ is just the value in df[w][d].
def likelihood_apple(d:int):
    return ...
def likelihood_banana(d:int):
    return ...
def likelihood_apple(d:int):
    # np.prod replaces np.product, which was removed in NumPy 2.0
    return P_apple * np.prod([P_w_apple[w]**df[w][d] for w in V])
def likelihood_banana(d:int):
    return P_banana * np.prod([P_w_banana[w]**df[w][d] for w in V])
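As an optional alternative, the same scores can be computed for every document and both classes at once with NumPy broadcasting. This is only a sketch: the names `X`, `y`, `smoothed`, `P_w`, `priors`, and `scores` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

V = ["green", "red", "round", "skinny", "sour", "yellow"]
docs = ["red round", "red round", "green sour round", "green round",
        "yellow skinny", "yellow skinny", "green skinny"]
y = np.array([0, 0, 0, 0, 1, 1, 1])  # 0 = apple, 1 = banana

# Binary document-term matrix: X[d, j] = 1 if word V[j] appears in docs[d]
X = np.array([[1 if w in d.split() else 0 for w in V] for d in docs])

def smoothed(Xc):
    """Add-one smoothed P(w|c) from the rows of one class."""
    counts = Xc.sum(axis=0)
    return (counts + 1) / (counts.sum() + len(V))

P_w = np.vstack([smoothed(X[y == 0]), smoothed(X[y == 1])])  # shape (2, 6)
priors = np.array([4/7, 3/7])

# scores[d, c] = P(c) * prod over w of P(w|c) ** n_w(d)
scores = priors * np.prod(P_w[None, :, :] ** X[:, None, :], axis=2)
pred = scores.argmax(axis=1)  # index of the higher-scoring class per document
print(pred)
```

Raising each $P(w|c)$ to a 0/1 exponent leaves absent words contributing a factor of 1, which is the same trick the per-document functions rely on.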
Run the loop below to make predictions for each document. The output should be:
red round : 0.038095, 0.002976 => apple
red round : 0.038095, 0.002976 => apple
green sour round : 0.005079, 0.000496 => apple
green round : 0.038095, 0.005952 => apple
yellow skinny : 0.002540, 0.035714 => banana
yellow skinny : 0.002540, 0.035714 => banana
green skinny : 0.007619, 0.023810 => banana
docs = apples_docs+bananas_docs
for d in range(len(df)):
    a = likelihood_apple(d)
    b = likelihood_banana(d)
    print(f"{docs[d]:17s}: {a:4f}, {b:4f} => {'apple' if a>b else 'banana'}")
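The scores printed by the loop are unnormalized. If actual posterior probabilities are wanted, one option is to divide each score by the sum of the scores over both classes, as sketched here for the "red round" document. The variable names below are illustrative only.

```python
# Unnormalized scores for "red round": prior times smoothed word likelihoods
a = (4/7) * (3/15) * (5/15)   # apple score
b = (3/7) * (1/12) * (1/12)   # banana score

# Dividing by the total gives posteriors that sum to 1 across the two classes
evidence = a + b
post_apple = a / evidence
post_banana = b / evidence
print(f"P(apple | red round)  = {post_apple:.4f}")
print(f"P(banana | red round) = {post_banana:.4f}")
```

Normalizing does not change which class wins, which is why the classifier can skip it.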