For feature extraction we used scikit-learn's tf-idf vectorizer (`TfidfVectorizer`), which combines a count vectorizer with idf weighting. The count vectorizer measures term frequency (tf), i.e. how often a word appears in a title. Doing this for the following sentences produces the matrix below.
| | the | dog | jumped | over | fence | cat | chased | white | brown | who | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Title 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
Title 3 | 3 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 1 | 1 |
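The counting step can be sketched in plain Python. This is only an illustration of how a term-count matrix is built, not scikit-learn's implementation: the two titles are hypothetical stand-ins (not the titles behind the table above), and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter

# Hypothetical titles for illustration; not the titles used for the table above.
titles = [
    "the dog jumped over the fence",
    "the cat chased the dog",
]

# Assumption: lowercase and split on whitespace (real tokenizers do more).
tokenized = [title.lower().split() for title in titles]

# Build the vocabulary in order of first appearance so columns are stable.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per title, one column per vocabulary word, each cell a raw count.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has the same layout as a row of the table: one count per vocabulary word, with zeros for words that do not occur in that title.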
The downside of using tf alone is that the words that appear most often tend to dominate the vector. To overcome this we use term frequency-inverse document frequency (tf-idf). Idf is a measure of whether a term is common or rare across all documents [Side note 2]. It is the log of one plus the number of documents (N) divided by the number of documents the term appears in (n_t). The one is present so that the expression doesn't evaluate to zero when a term appears in every document.
\begin{equation*} idf(t, D) = \log(1 + \frac{N}{n_t}) \end{equation*}
Essentially, tf-idf creates a word vector in which each word is weighted by its occurrence not only in the title it came from but also across the entire group of titles (the corpus). Tf-idf is calculated by the following formula:
t = term, d = single title, D = all titles
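\begin{equation*} tfidf(t,d,D) = tf(t,d)\cdot idf(t, D) \end{equation*}
Below is the workflow for calculating tf-idf for the term "cat" in the above titles. Here tf is the count of the term divided by the total number of words in the title (6, 4 and 13 words for titles 1, 2 and 3).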
\begin{equation*} tf("cat",d_1) = \frac{0}{6} = 0 \end{equation*}
\begin{equation*} tf("cat",d_2) = \frac{1}{4} = 0.250 \end{equation*}
\begin{equation*} tf("cat",d_3) = \frac{3}{13} \approx 0.231 \end{equation*}
\begin{equation*} idf("cat",D) = \log_{10}(1 + \frac{3}{2}) \approx 0.4 \end{equation*}
\begin{equation*} tfidf("cat", d_1) = tf("cat", d_1) \times idf("cat", D) = 0 \times 0.4 = 0 \end{equation*}
\begin{equation*} tfidf("cat", d_2) = tf("cat", d_2) \times idf("cat", D) = 0.250 \times 0.4 = 0.1 \end{equation*}
\begin{equation*} tfidf("cat", d_3) = tf("cat", d_3) \times idf("cat", D) = 0.231 \times 0.4 = 0.0924 \end{equation*}
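The same calculation can be checked with a few lines of Python. This is a minimal sketch of the formulas in this section, assuming a base-10 logarithm (which is what yields the value of about 0.4) and using helper names `tf` and `idf` chosen here for illustration. It is not what scikit-learn computes: `TfidfVectorizer` uses a differently smoothed idf and, by default, L2-normalizes each row, so its numbers will differ.

```python
import math

# Counts of "cat" and total word counts per title, read from the table above.
cat_counts = [0, 1, 3]
title_lengths = [6, 4, 13]

def tf(term_count, total_terms):
    """Term frequency: count of the term divided by the words in the title."""
    return term_count / total_terms

def idf(num_docs, docs_with_term):
    """Inverse document frequency as defined in this section (base-10 log)."""
    return math.log10(1 + num_docs / docs_with_term)

num_docs = len(cat_counts)
docs_with_cat = sum(1 for count in cat_counts if count > 0)  # titles containing "cat"

idf_cat = idf(num_docs, docs_with_cat)
tfidf_cat = [tf(count, length) * idf_cat
             for count, length in zip(cat_counts, title_lengths)]

print(round(idf_cat, 3))
print([round(value, 4) for value in tfidf_cat])
```

Because the code keeps the unrounded idf (about 0.398) instead of 0.4, the value for title 3 comes out near 0.0918 rather than 0.0924; the difference is only rounding.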