In this exercise, we will implement a more advanced retrieval model. In contrast to the last exercise, where we explored vector space retrieval using TF-IDF vectors, in this notebook we take a look at a generative (language) model for retrieval.
First, recall some important concepts from the lecture:
The representation $\pmb{d}$ of a document $d$ is a probability distribution over the index terms $T$, where the probability of the $i$-th term in $T$ indicates how likely that term is to occur in documents generated by the topic's language model that also generated $\pmb{d}$.
A query $q$ is represented as a sequence $\pmb{q}$ of index terms.
The relevance function is formulated as a query likelihood model: $\rho(d,q) = P(\pmb{d}|\pmb{q})$
To implement a scoring model, we first import existing code from the previous exercises. We will make use of the Shakespeare document collection, the preprocessing component and the index. We supply a reference solution you can import, but feel free to continue using your own code!
Exercise: using the code from previous exercises, construct an index on top of the Shakespeare document collection.
from modules.shakespeare import corpus
from modules.preprocessing import PreprocessorSpacy as Preprocessor
from modules.indexing import Index
preprocessor = Preprocessor()
# Preprocess every document body; keep (title, token list) pairs.
corpus_preprocessed = list(zip(
    [doc[0] for doc in corpus],
    map(preprocessor.preprocess, [doc[1] for doc in corpus])
))
index = Index(corpus_preprocessed)
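The exercises below rely on three accessors of the index: get_document_ids(), get_index_terms() and get_term_frequency(term, doc_id). If you bring your own implementation, a quick sanity check along these lines confirms it is compatible (the term "king" is just an arbitrary example and need not occur in the first document):
doc_ids = list(index.get_document_ids())
terms = list(index.get_index_terms())
print(len(doc_ids), "documents,", len(terms), "index terms")
print("tf('king', first document):", index.get_term_frequency("king", doc_ids[0]))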
Let $\pmb{d}$ denote the language model of document $d$, and $\pmb{q}$ the sequence of query terms from query $q$. Then the query likelihood model is derived as follows, utilizing Bayes' rule:
$$ \rho = P(\pmb{d} | \pmb{q}) = \frac{P(\pmb{q}|\pmb{d})\cdot P(\pmb{d})}{P(\pmb{q})} $$Here, $P(\pmb{q})$ can be omitted without affecting the relative ranking of all documents (i.e. omitting it influences all scores equally), and $P(\pmb{d})$ is assumed uniform, cancelling its influence.
The query representation $\pmb{q}$ can be written as a sequence $t_1, ..., t_{|q|}$ of index terms, which (under the assumption that terms are independent of each other) allows us to rewrite the equation as follows:
$$ \rho = P(\pmb{q} | \pmb{d}) = P(t_1, ..., t_{|q|} | \pmb{d}) = \prod_{i=1}^{|q|} P(t_i | \pmb{d})$$This means that we can divide our implementation into four steps: estimating the term probability $P(t|\pmb{d})$ from a single document, estimating the term probability $P(t|\pmb{D})$ from the whole collection, combining both estimates via smoothing, and aggregating the smoothed term probabilities over all query terms into the relevance score $\rho$.
We can use maximum likelihood estimation to calculate the term probability:
$$ P(t | \pmb{d}) = \frac{\text{tf}(t,d)}{|d|}$$However, this creates a problem: if a term $t$ is contained in the query, but does not occur in a document (i.e. $\text{tf}(t,d) = 0$), $P(t|\pmb{d}) = 0$. Since the individual term probabilities are multiplied, this results in zero relevance for all documents that only partially contain the query!
Solution: we also calculate the probability of the term based on the language model of the corpus, $D$, and combine both probabilities.
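To see the effect in isolation, here is a minimal sketch with made-up numbers (the terms, frequencies and collection probabilities are purely illustrative, and the mixture shown here anticipates the smoothing introduced below):
from math import prod

# Toy example: a two-term query against a document that contains only one of the terms.
tf = {"king": 4, "queen": 0}   # hypothetical term frequencies in the document
doc_len = 100                  # hypothetical document length |d|

p_mle = [tf[t] / doc_len for t in ("king", "queen")]
print(prod(p_mle))             # 0.0 -- the missing term zeroes out the whole product

# Mixing in a (hypothetical) collection probability keeps the score non-zero.
p_collection = {"king": 0.01, "queen": 0.002}
omega = 0.2
p_smoothed = [(1 - omega) * tf[t] / doc_len + omega * p_collection[t] for t in ("king", "queen")]
print(prod(p_smoothed))        # small, but greater than zero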
As the first part of our relevance function, we implement the probability of a term based on a single document's language model, $P(t|\pmb{d})$.
$$ P(t | \pmb{d}) = \frac{\text{tf}(t,d)}{|d|}$$Exercise: implement a function to calculate the term probability given a document.
def document_probability(index, term, doc_id):
    """
    Calculates the conditional probability of a term given a document.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    """
    frequency = index.get_term_frequency(term, doc_id)
    # Document length |d| = sum of all term frequencies in the document.
    doc_sum = 0
    for t in index.get_index_terms():
        doc_sum += index.get_term_frequency(t, doc_id)
    if doc_sum == 0:
        return 0
    else:
        return frequency / doc_sum
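A quick usage sketch (the document and terms are picked arbitrarily from the index; note that this naive implementation recomputes the document length for every call, so it is not fast):
# Print the document-model probability of a few index terms for one document.
some_doc = next(iter(index.get_document_ids()))
for term in list(index.get_index_terms())[:5]:
    print(term, document_probability(index, term, some_doc))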
As the second part of our relevance function, we implement the probability of a term based on the collection language model, $P(t|\pmb{D})$.
$$ P(t | \pmb{D}) = \frac{\sum_{d \in D}\text{tf}(t,d)}{\sum_{d \in D}|d|}$$Exercise: implement a function to calculate the term probability given the document collection.
def collection_probability(index, term):
    """
    Calculates the conditional probability of a term given the document collection.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    """
    frequencies = []
    doc_sums = []
    for doc_id in index.get_document_ids():
        # Term frequency of the term in this document ...
        frequencies.append(index.get_term_frequency(term, doc_id))
        # ... and the document length (sum of all term frequencies).
        doc_sum = 0
        for t in index.get_index_terms():
            doc_sum += index.get_term_frequency(t, doc_id)
        doc_sums.append(doc_sum)
    return sum(frequencies) / sum(doc_sums)
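As a small illustration of why the collection model helps, we can look for a term that does not occur in some document: its document probability is zero, while its collection probability is small but positive (document and term are picked programmatically; the nested loops in collection_probability make this a bit slow):
doc_ids = list(index.get_document_ids())
some_doc = doc_ids[0]
# Find an index term that does not occur in this particular document.
missing = next((t for t in index.get_index_terms()
                if index.get_term_frequency(t, some_doc) == 0), None)
if missing is not None:
    print("term:", missing)
    print("P(t|d) =", document_probability(index, missing, some_doc))  # 0.0
    print("P(t|D) =", collection_probability(index, missing))          # > 0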
Now that we can calculate both the document probability and the collection probability of a term $t$, we need a way of combining them into a single value, indicating the overall probability of the term. To start out, we implement Jelinek-Mercer smoothing, which is a linear interpolation between both probabilities:
$$P(t|\pmb{d}) = (1- \omega) \cdot P(t | \pmb{d}) + \omega\cdot P(t|\pmb{D})$$where $\omega$ is a parameter that controls the relative influence of both terms on the result.
Note: we call the weighting term $\omega$ in this exercise. In the lecture slides, this is denoted as $\lambda$, but since lambda is a reserved keyword in the Python language, we opt to rename it to avoid confusion.
Exercise: implement the overall term probability with Jelinek-Mercer smoothing.
def jm_term_probability(index, term, doc_id, omega):
    """
    Calculates the conditional probability of a term given a document using Jelinek-Mercer smoothing.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    :param omega: weighting factor for the linear interpolation
    """
    p1 = document_probability(index, term, doc_id)
    p2 = collection_probability(index, term)
    # Linear interpolation between the document and collection language models.
    return (1 - omega) * p1 + omega * p2
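A brief usage sketch: varying $\omega$ slides the estimate from the pure document model ($\omega = 0$) to the pure collection model ($\omega = 1$). The term and document are chosen arbitrarily here:
some_doc = next(iter(index.get_document_ids()))
some_term = next(iter(index.get_index_terms()))
for omega in (0.0, 0.1, 0.5, 1.0):
    print(omega, jm_term_probability(index, some_term, some_doc, omega))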
Instead of interpolating with a fixed weight, we can take a more fine-grained approach and base the interpolation factor on the length of the document: the longer the document, the more confidence is placed on the document probability, under the assumption that more data yields a better estimate.
This substitutes the term $\omega$ with the following term:
$$\omega = \frac{\alpha}{|d| + \alpha}$$where $\alpha$ is a parameter that marks the document length at which document and collection probability receive equal weight: for $|d| = \alpha$, we get $\omega = 0.5$; shorter documents put more weight on the collection probability, longer documents on the document probability.
Note: this method is called Dirichlet smoothing, since from a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter $\alpha$ as a prior.
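To get a feeling for this weighting, the following snippet tabulates $\omega$ for a few (hypothetical) document lengths with $\alpha = 1500$: short documents lean heavily on the collection model, long documents on the document model.
# Dirichlet weight omega = alpha / (|d| + alpha) for hypothetical document lengths.
alpha = 1500
for doc_len in (100, 1000, 1500, 10000, 100000):
    print(doc_len, alpha / (doc_len + alpha))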
Exercise: implement a weight function to calculate $\omega$ for a given document and $\alpha$ value.
def weight(index, doc_id, alpha):
    """
    Calculates the Dirichlet smoothing weighting factor for a given document and alpha value.
    :param index: index to get frequency data from
    :param doc_id: document to calculate the weight factor for
    :param alpha: alpha-prior for the Dirichlet smoothing
    """
    # Document length |d| = sum of all term frequencies in the document.
    doc_len = 0
    for term in index.get_index_terms():
        doc_len += index.get_term_frequency(term, doc_id)
    return alpha / (doc_len + alpha)
Exercise: implement the overall term probability with Dirichlet smoothing.
def dirichlet_term_probability(index, term, doc_id, alpha):
    """
    Calculates the conditional probability of a term given a document using Dirichlet smoothing.
    :param index: index to get frequency data from
    :param term: term to calculate the probability for
    :param doc_id: document to calculate the probability for
    :param alpha: alpha-prior for the Dirichlet smoothing
    """
    # The interpolation weight now depends on the length of the document.
    omega = weight(index, doc_id, alpha)
    p1 = document_probability(index, term, doc_id)
    p2 = collection_probability(index, term)
    return (1 - omega) * p1 + omega * p2
Given the probability function for a single term $t$ with Jelinek-Mercer smoothing, we can now implement the full relevance scoring function. However, we will use a slightly different formulation of the total probability by taking its logarithm:
$$ \rho = \log\prod_{i=1}^{|q|} P(t_i | \pmb{d}) = \sum_{i=1}^{|q|} \log P(t_i | \pmb{d})$$The reason behind this is that the individual term probabilities can get extremely small. If we multiply them, we might run into an underflow: the number becomes too small to be represented by a 64-bit float. Therefore, utilizing the logarithm product rule, we instead calculate the sum of the logarithms of the individual probabilities. Note that taking the logarithm yields negative relevance scores (all probabilities are at most 1); recall that only the relative ranking among documents matters, not the scores themselves.
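A minimal demonstration of the underflow problem (the probability value is arbitrary; only its order of magnitude matters):
from math import log

p = 1e-20          # an artificially small term probability
product = 1.0
log_sum = 0.0
for _ in range(20):
    product *= p   # the running product eventually underflows to 0.0
    log_sum += log(p)
print(product)     # 0.0 -- not representable as a 64-bit float
print(log_sum)     # about -921, perfectly representable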
Exercise: implement the relevance function $\rho$ with Jelinek-Mercer smoothing. Use the sum of log probabilities as defined above.
from math import log

def jm_score(index, query, doc_id, omega):
    """
    Calculates the relevance of a document given a query using Jelinek-Mercer smoothing.
    :param index: index to get relevance data from
    :param query: query to calculate the relevance for
    :param doc_id: document to calculate the relevance for
    :param omega: omega parameter for Jelinek-Mercer smoothing
    """
    # Sum of log term probabilities (avoids underflow of the product).
    rho = 0
    for term in query:
        rho += log(jm_term_probability(index, term, doc_id, omega))
    return rho
Similarly, we can implement $\rho$ with Dirichlet smoothing, once again applying the logarithm.
Exercise: implement the relevance function $\rho$ with Dirichlet smoothing. Use the sum of log probabilities as defined above.
from math import log
def dirichlet_score(index, query, doc_id, alpha):
    """
    Calculates the relevance of a document given a query using Dirichlet smoothing.
    :param index: index to get relevance data from
    :param query: query to calculate the relevance for
    :param doc_id: document to calculate the relevance for
    :param alpha: alpha parameter for Dirichlet smoothing
    """
    # Sum of log term probabilities (avoids underflow of the product).
    rho = 0
    for term in query:
        rho += log(dirichlet_term_probability(index, term, doc_id, alpha))
    return rho
We now have all the components needed to answer queries. Write a wrapper function that takes a text and returns the (top $k$) results for the query. It should construct a query representation, calculate the relevance score between the query and every document, and return a list of (doc_id, score) tuples ordered descending by score.
Exercise: implement a query function for Jelinek-Mercer smoothing. Parameters that work well are $\omega = 0.1$ and $\omega = 0.7$ for short and long queries, respectively.
def jm_query(index, preprocessor, text, omega=0.1, topK=-1):
    """
    Queries a given text against the given index using a Jelinek-Mercer smoothed language model.
    :param index: the index data to query against
    :param preprocessor: preprocessor instance to process the query with
    :param text: query text
    :param omega: weight parameter for Jelinek-Mercer smoothing
    :param topK: number of top results to return (-1 returns all results)
    :return: list of (doc_id, score) tuples descending by score for all documents in the collection (truncated to the best topK results if topK > 0)
    """
    query = preprocessor.preprocess(text)
    scores = {}
    for doc_id in index.get_document_ids():
        scores[doc_id] = jm_score(index, query, doc_id, omega=omega)
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranking if topK < 0 else ranking[:topK]
jm_query(index, preprocessor, "proceed", topK=5)
[('Coriolanus', -1.203513772874222), ('Comedy of Errors', -2.1835784414003485), ('Troilus and Cressida', -8.890135107818841), ('Henry IV, Part II', -8.890135107818841), ('Much Ado About Nothing', -8.890135107818841)]
Exercise: implement a query function for Dirichlet smoothing. Typical parameters are $1000 < \alpha < 2000$.
def dirichlet_query(index, preprocessor, text, alpha=1000, topK=-1):
    """
    Queries a given text against the given index using a Dirichlet smoothed language model.
    :param index: the index data to query against
    :param preprocessor: preprocessor instance to process the query with
    :param text: query text
    :param alpha: alpha parameter for Dirichlet smoothing
    :param topK: number of top results to return (-1 returns all results)
    :return: list of (doc_id, score) tuples descending by score for all documents in the collection (truncated to the best topK results if topK > 0)
    """
    query = preprocessor.preprocess(text)
    scores = {}
    for doc_id in index.get_document_ids():
        scores[doc_id] = dirichlet_score(index, query, doc_id, alpha=alpha)
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranking if topK < 0 else ranking[:topK]
dirichlet_query(index, preprocessor, "proceed", topK=5)
[('Coriolanus', -6.044738931143362), ('Comedy of Errors', -6.049711591812738), ('Hamlet', -6.587550014824796), ('Measure for Measure', -6.5885495151578795), ('Tempest', -6.5885495151578795)]
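As a closing usage example, both query functions can be run side by side on the same query to compare the two rankings (the query string is just an illustrative example, and the naive implementations above make this somewhat slow):
# Compare Jelinek-Mercer and Dirichlet rankings for the same query text.
query_text = "love"
print(jm_query(index, preprocessor, query_text, omega=0.1, topK=3))
print(dirichlet_query(index, preprocessor, query_text, alpha=1500, topK=3))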