Predicting the class of a text, say "humor":
$P(C = \textrm{humor} \mid X_a = \textrm{False}, \ldots)$
By Bayes' rule: $P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$
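A toy worked example of that arithmetic (all the probabilities below are invented, just to show the mechanics):

```python
# Toy Bayes-rule computation: P(humor | X) = P(X | humor) P(humor) / P(X).
# All numbers here are invented for illustration.
p_humor = 0.3             # prior P(C = humor)
p_x_given_humor = 0.10    # likelihood P(X | C = humor)
p_x_given_not = 0.02      # likelihood P(X | C = not humor)

# P(X) via the law of total probability over the two classes.
p_x = p_x_given_humor * p_humor + p_x_given_not * (1 - p_humor)

print(p_x_given_humor * p_humor / p_x)  # posterior P(humor | X) ~ 0.682
```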
(aside: We have generally been doing supervised learning.)
Given the joint $p(y, x)$, you can recover the conditional $p(y \mid x)$; but given only $p(y \mid x)$, you cannot recover $p(y, x)$, because you've thrown away $p(x)$. Generative models estimate the joint; discriminative models estimate the conditional directly.
In a discriminative model, we never estimate P(X). If P(X) is particularly messy, it's easier to use a discriminative model so we don't have to model the Xs at all.
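A small numerical illustration of that asymmetry (the joint table below is made up):

```python
import numpy as np

# A made-up joint distribution p(y, x) over binary y (rows) and x (cols).
joint = np.array([[0.10, 0.30],
                  [0.40, 0.20]])

# Joint -> conditional: divide each column by its marginal p(x).
p_x = joint.sum(axis=0)   # [0.5, 0.5]
cond = joint / p_x        # p(y | x)
print(cond)

# Going back requires p(x), which the conditional alone has discarded:
# any choice of p(x) combined with this same conditional gives a
# different joint.
other_joint = cond * np.array([0.9, 0.1])
print(other_joint)        # != joint, yet it has the same p(y | x)
```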
Imagine plotting X vs. Y, where Y = 1 if true and 0 if false: you get two horizontal bands of points at 0 and 1. Fitting a linear regression gives you a bad probability estimator, since predictions can fall below 0 or above 1. Instead you can fit a smoothing function $f(x) = 1/(1 + e^{-x})$, which approaches 0 as $x \to -\infty$ and 1 as $x \to \infty$; no matter what $x$ is, $f(x)$ stays between 0 and 1.
This is the logistic function.
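A quick sketch (plain Python, nothing beyond the standard library) showing that $f$ stays in $(0, 1)$ even for extreme inputs:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# No matter how extreme x gets, the output stays strictly between 0 and 1,
# unlike a linear fit, which would run off past the [0, 1] range.
for x in [-10, -1, 0, 1, 10]:
    print(f"f({x:3d}) = {logistic(x):.5f}")
# f(-10) ~ 0.00005, f(0) = 0.5, f(10) ~ 0.99995
```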
$P(Y = 1 \mid x,w) = \frac{1}{1+\exp(-\sum_j w_j x_j)} = \frac{1}{1+\exp(-w^\intercal x)}$
With labels $y \in \{-1, +1\}$, both cases collapse into one formula: $P(Y = y \mid x,w) = \frac{1}{1+\exp(-y\,w^\intercal x)}$
The log-odds are then linear in $x$: $\log \frac{P(Y=1\mid x,w)}{P(Y=-1\mid x,w)} = w^\intercal x$
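A minimal numerical check of the log-odds identity (numpy assumed; the weights and point are made up for illustration):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights
x = np.array([0.5, 1.5])    # hypothetical feature vector

p_pos = 1.0 / (1.0 + np.exp(-w @ x))   # P(Y = +1 | x, w)
p_neg = 1.0 / (1.0 + np.exp( w @ x))   # P(Y = -1 | x, w), i.e. y = -1

print(np.log(p_pos / p_neg))  # -0.5
print(w @ x)                  # -0.5: the log-odds equal w^T x
```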
In logistic regression, each weight $w_j$ tells you how much a unit increase in feature $x_j$ drives the log-odds (and hence the logistic curve) up or down.
Adding interactions:
Can you add an interaction term (say, a new feature $x_1 x_2$) and still have a logistic regression? Yes: the model only has to be linear in the weights, not in the raw features.
If you include transformations and interactions, LR can be extremely powerful.
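For instance, XOR is not linearly separable in the raw features, but adding the interaction $x_1 x_2$ makes it separable. A sketch, assuming scikit-learn is available (the large C just approximates an unregularized fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# XOR: no line through the raw (x1, x2) plane separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Add the interaction term x1*x2 as a third feature; the model is
# still linear in the weights, so it is still a logistic regression.
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

clf = LogisticRegression(C=1e6).fit(X_int, y)
print(clf.predict(X_int))  # [0 1 1 0] -- XOR is now learnable
```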
Why would you use NB instead of LR? NB's conditional-independence assumption leaves it far fewer effective parameters to fit, so it's useful when you don't have enough data; with plenty of data, LR's weaker assumptions usually pay off.
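One way to see this for yourself is to compare the two at different training-set sizes. A sketch, again assuming scikit-learn (the synthetic data and the sizes are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem; 10,000 held-out test points.
X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
X_test, y_test = X[:10000], y[:10000]
X_pool, y_pool = X[10000:], y[10000:]

# Typically NB reaches its (lower) asymptote quickly on small samples,
# while LR overtakes it once there is enough data.
for n in [25, 100, 400, 2000]:
    nb = GaussianNB().fit(X_pool[:n], y_pool[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    print(n, nb.score(X_test, y_test), lr.score(X_test, y_test))
```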
see notes on website
Geometric intuition:
Representing a line: the line $y = x$ can be written implicitly as $0 = x - y$, i.e., the set of points where a linear expression equals zero. In logistic regression, the decision boundary is exactly such a set: $w^\intercal x = 0$, where $P(Y = 1 \mid x, w) = 1/2$.
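A small sketch of that intuition (numpy; the example points are made up): points are classified by which side of the line $0 = x - y$ they fall on, and probability 1/2 sits exactly on the line.

```python
import numpy as np

w = np.array([1.0, -1.0])  # encodes the line 0 = x - y, i.e. w^T (x, y) = 0

def p_pos(point):
    """P(Y = 1 | point, w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-w @ point))

print(p_pos(np.array([2.0, 0.0])))  # above 0.5: the x > y side of the line
print(p_pos(np.array([0.0, 2.0])))  # below 0.5: the x < y side
print(p_pos(np.array([3.0, 3.0])))  # exactly 0.5: on the line x = y
```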