Predicting the class of a text, say "humor":
$P(C = \textrm{humor} \mid X_a = \textrm{False}, \ldots)$
By Bayes' rule: $P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$
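A toy worked example of that arithmetic (all the probabilities below are invented, just to show the mechanics):

```python
# Toy Bayes-rule computation: P(humor | X) = P(X | humor) P(humor) / P(X).
# All numbers here are invented for illustration.
p_humor = 0.3             # prior P(C = humor)
p_x_given_humor = 0.10    # likelihood P(X | C = humor)
p_x_given_not = 0.02      # likelihood P(X | C = not humor)

# P(X) via the law of total probability over the two classes.
p_x = p_x_given_humor * p_humor + p_x_given_not * (1 - p_humor)

print(p_x_given_humor * p_humor / p_x)  # posterior P(humor | X) ~ 0.682
```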
(aside: We have generally been doing supervised learning.)
Given the joint $p(y, x)$, you can recover the conditional $p(y \mid x)$; but given only $p(y \mid x)$, you cannot recover $p(y, x)$, because you've thrown away $p(x)$. Generative models estimate the joint; discriminative models estimate the conditional directly.
In a discriminative model, we never estimate P(X). If P(X) is particularly messy, it's easier to use a discriminative model so we don't have to model the Xs at all.
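A small numerical illustration of that asymmetry (the joint table below is made up):

```python
import numpy as np

# A made-up joint distribution p(y, x) over binary y (rows) and x (cols).
joint = np.array([[0.10, 0.30],
                  [0.40, 0.20]])

# Joint -> conditional: divide each column by its marginal p(x).
p_x = joint.sum(axis=0)   # [0.5, 0.5]
cond = joint / p_x        # p(y | x)
print(cond)

# Going back requires p(x), which the conditional alone has discarded:
# any choice of p(x) combined with this same conditional gives a
# different joint.
other_joint = cond * np.array([0.9, 0.1])
print(other_joint)        # != joint, yet it has the same p(y | x)
```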
Imagine plotting X vs. Y, where Y = 1 if true and 0 if false: you get two horizontal bands of points at 0 and 1. Fitting a linear regression gives you a bad probability estimator, since predictions can fall below 0 or above 1. Instead you can fit a smoothing function $f(x) = 1/(1 + e^{-x})$, which approaches 0 as $x \to -\infty$ and 1 as $x \to \infty$; no matter what $x$ is, $f(x)$ stays between 0 and 1.
This is the logistic function.
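A quick sketch (plain Python, nothing beyond the standard library) showing that $f$ stays in $(0, 1)$ even for extreme inputs:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# No matter how extreme x gets, the output stays strictly between 0 and 1,
# unlike a linear fit, which would run off past the [0, 1] range.
for x in [-10, -1, 0, 1, 10]:
    print(f"f({x:3d}) = {logistic(x):.5f}")
# f(-10) ~ 0.00005, f(0) = 0.5, f(10) ~ 0.99995
```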
$P(Y = 1 \mid x,w) = \frac{1}{1+\exp(-\sum_j w_j x_j)} = \frac{1}{1+\exp(-w^\intercal x)}$
With labels $y \in \{-1, +1\}$, both cases collapse into one formula: $P(Y = y \mid x,w) = \frac{1}{1+\exp(-y\,w^\intercal x)}$
The log-odds are then linear in $x$: $\log \frac{P(Y=1\mid x,w)}{P(Y=-1\mid x,w)} = w^\intercal x$
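A minimal numerical check of the log-odds identity (numpy assumed; the weights and point are made up for illustration):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights
x = np.array([0.5, 1.5])    # hypothetical feature vector

p_pos = 1.0 / (1.0 + np.exp(-w @ x))   # P(Y = +1 | x, w)
p_neg = 1.0 / (1.0 + np.exp( w @ x))   # P(Y = -1 | x, w), i.e. y = -1

print(np.log(p_pos / p_neg))  # -0.5
print(w @ x)                  # -0.5: the log-odds equal w^T x
```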
In logistic regression, each weight $w_j$ tells you how much a unit increase in feature $x_j$ drives the log-odds (and hence the logistic curve) up or down.
Adding interactions:
Can you add an interaction term (say, a new feature $x_1 x_2$) and still have a logistic regression? Yes: the model only has to be linear in the weights, not in the raw features.
If you include transformations and interactions, LR can be extremely powerful.
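For instance, XOR is not linearly separable in the raw features, but adding the interaction $x_1 x_2$ makes it separable. A sketch, assuming scikit-learn is available (the large C just approximates an unregularized fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# XOR: no line through the raw (x1, x2) plane separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Add the interaction term x1*x2 as a third feature; the model is
# still linear in the weights, so it is still a logistic regression.
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

clf = LogisticRegression(C=1e6).fit(X_int, y)
print(clf.predict(X_int))  # [0 1 1 0] -- XOR is now learnable
```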
Why would you use NB instead of LR? NB's conditional-independence assumption leaves it far fewer effective parameters to fit, so it's useful when you don't have enough data; with plenty of data, LR's weaker assumptions usually pay off.
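One way to see this for yourself is to compare the two at different training-set sizes. A sketch, again assuming scikit-learn (the synthetic data and the sizes are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem; 10,000 held-out test points.
X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
X_test, y_test = X[:10000], y[:10000]
X_pool, y_pool = X[10000:], y[10000:]

# Typically NB reaches its (lower) asymptote quickly on small samples,
# while LR overtakes it once there is enough data.
for n in [25, 100, 400, 2000]:
    nb = GaussianNB().fit(X_pool[:n], y_pool[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    print(n, nb.score(X_test, y_test), lr.score(X_test, y_test))
```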
see notes on website
Geometric intuition:
Representing a line: the line $y = x$ can be written implicitly as $0 = x - y$, i.e., the set of points where a linear expression equals zero. In logistic regression, the decision boundary is exactly such a set: $w^\intercal x = 0$, where $P(Y = 1 \mid x, w) = 1/2$.
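A small sketch of that intuition (numpy; the example points are made up): points are classified by which side of the line $0 = x - y$ they fall on, and probability 1/2 sits exactly on the line.

```python
import numpy as np

w = np.array([1.0, -1.0])  # encodes the line 0 = x - y, i.e. w^T (x, y) = 0

def p_pos(point):
    """P(Y = 1 | point, w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-w @ point))

print(p_pos(np.array([2.0, 0.0])))  # above 0.5: the x > y side of the line
print(p_pos(np.array([0.0, 2.0])))  # below 0.5: the x < y side
print(p_pos(np.array([3.0, 3.0])))  # exactly 0.5: on the line x = y
```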