In this exercise, we will implement a basic retrieval model. While we already answered some questions using term statistics in the indexing exercise, this lab introduces the concept of a relevance function: a way of quantifying the relevance between a query (expressing a user's information need) and a document (potentially fulfilling that need).
First, recall some important concepts from the lecture: indexing, term weighting, document representations, queries, and query representations.
Gather and store auxiliary information about documents, required for fast online query processing (i.e., to score and rank documents when a query is submitted). Such information may include term frequencies, document frequencies, document lengths, ... The data structure used is a key-value store (i.e., a hash map).
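As a minimal sketch of such a key-value store (using a toy, already-tokenized corpus invented for illustration; the actual `Index` class from the exercises offers a richer interface):

```python
from collections import defaultdict

# Toy corpus: doc_id -> already-tokenized text (hypothetical example data)
docs = {
    "d1": ["to", "be", "or", "not", "to", "be"],
    "d2": ["to", "sleep", "perchance", "to", "dream"],
}

# Key-value store mapping each term to {doc_id: term frequency}
postings = defaultdict(dict)
for doc_id, tokens in docs.items():
    for token in tokens:
        postings[token][doc_id] = postings[token].get(doc_id, 0) + 1

# Document frequency: in how many documents does each term occur?
document_frequency = {term: len(entry) for term, entry in postings.items()}

print(postings["to"])            # {'d1': 2, 'd2': 2}
print(document_frequency["be"])  # 1
```

With this structure, term frequencies, document frequencies, and document lengths can all be looked up in constant time at query time.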
For each index term of a document, calculate a weight indicating its importance with respect to the document, allowing for document scoring with respect to a query. Common term weighting schemes are term-frequency (tf), term-frequency-inverse-document-frequency (tfidf), BM25, ...
The representation of a document $d$ is a $|T|$-dimensional vector, where the $i$-th vector component of $d$ corresponds to a term weight $w_i$ of term $t_i \in T$, indicating its importance for $d$.
A query string expressing a user's information need, either in natural language or in a specialized query language.
A query representation $q$ is constructed like a document representation, but for a user-supplied query instead of a document.
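To make these definitions concrete, here is a hypothetical example over a three-term vocabulary $T$, mapping both a document and a query into the same $|T|$-dimensional space (raw term counts stand in for proper term weights):

```python
# Hypothetical vocabulary, kept in a fixed (sorted) order so that
# dimension i always corresponds to the same term t_i
vocabulary = sorted(["love", "night", "verona"])

def represent(tokens):
    # Map a token list to a |T|-dimensional vector of raw term counts
    return [tokens.count(term) for term in vocabulary]

doc_vec = represent(["love", "love", "verona"])
query_vec = represent(["love", "night"])

print(doc_vec)    # [2, 0, 1]
print(query_vec)  # [1, 1, 0]
```

Because both vectors live in the same space, they can later be compared with a similarity function.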
To implement a scoring model, we first import existing code from the previous exercises. We will make use of the Shakespeare document collection, the preprocessing component and the index. We supply a reference solution you can import, but feel free to continue using your own code!
Exercise: using the code from previous exercises, construct an index on top of the Shakespeare document collection.
from modules.shakespeare import corpus
from modules.preprocessing import PreprocessorSpacy as Preprocessor
from modules.indexing import Index
preprocessor = Preprocessor()
corpus_preprocessed = list(zip(
    [doc[0] for doc in corpus],
    map(preprocessor.preprocess, [doc[1] for doc in corpus])
))
corpus_preprocessed[5]
index = Index(corpus_preprocessed)
To compute document representations, we first need to implement term weighting. For the purposes of this exercise, we choose the term-frequency-inverse-document-frequency (TFIDF) weighting scheme, given by the following equation: $ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $, where $tf(t,d)$ is the normalized term frequency, denoting the importance of a term $t$ in document $d$, and $ idf(t, D) = \log\frac{|D|}{df(t,D)}$ is the inverse document frequency, denoting the importance of term $t$ across the collection $D$, with $df(t,D)$ the number of documents containing $t$.
Note that different approaches to normalizing the term frequency exist. For this exercise, we use the following normalization:
$$1 + \log(\text{tf}(t,d)) \quad \text{for } \text{tf}(t,d) > 0$$

Exercise: implement the tfidf term weighting function.
from math import log
def tfidf(term_frequency: int, document_frequency: int, document_count: int) -> float:
    """
    Term-frequency-inverse-document-frequency, which calculates the tfidf score for a given term frequency, document frequency and document count.
    :param term_frequency: The frequency of the term in the document.
    :param document_frequency: The number of documents the term appears in.
    :param document_count: The total number of documents in the collection.
    :returns: The tfidf score of the term, 0 if the term is not present in the document.
    """
    if term_frequency == 0:
        return 0
    # Log-normalized term frequency (only defined for tf > 0)
    tf = 1 + log(term_frequency)
    # Inverse document frequency
    idf = log(document_count / document_frequency)
    return tf * idf
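As a quick sanity check of the formula (the numbers here are made up purely for illustration): for a term occurring 5 times in a document and appearing in 2 out of 100 documents, we can trace the two factors by hand.

```python
from math import log

# Made-up example values: tf = 5, df = 2, |D| = 100
tf_weight = 1 + log(5)  # log-normalized term frequency, about 2.61
idf = log(100 / 2)      # inverse document frequency, about 3.91
print(tf_weight * idf)  # about 10.21

# A term occurring in every document carries no discriminative weight
assert log(100 / 100) == 0.0
```

Note how the idf factor drives the weight: frequent-but-common terms score low, while terms concentrated in few documents score high.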
The representation $d$ of a document $D$ is a $|T|$-dimensional vector, where the $i$-th vector component of $d$ corresponds to a term weight $w_i$ of term $t_i \in T$, indicating its importance for $D$.
We will use the previously defined TFIDF score as term weight: $w_i = tfidf(t_i, D)$.
Exercise: implement a function to generate a TFIDF-vector representation for a given document id, based on the information available in the index. For the purposes of this exercise, implement vectors as lists. Make sure to keep dimensionality consistent between documents, i.e. the same vector dimension should always correspond to the same index term!
def get_document_representation(doc_id, index):
    """
    Returns the tfidf vector for a given doc_id and index.
    :param doc_id: doc_id to construct the vector for
    :param index: index to construct the vector from
    :return: the tfidf vector for this document
    """
    vec = []
    # Iterate over the sorted vocabulary so that every document vector
    # assigns the same dimension to the same index term
    for term in sorted(index.get_index_terms()):
        vec.append(tfidf(
            index.get_term_frequency(term, doc_id),
            index.get_document_frequency(term),
            index.get_document_count()
        ))
    return vec
To retrieve similar documents to a given query, we need to project the query into the same vector space as our document representations.
The query representation is constructed in a similar way to document representations, using the term data stored in the index. However, we have to calculate the term frequencies manually, since the query is a raw text string, not an indexed document. Keep in mind that the query needs to be processed in exactly the same way as the documents, i.e. have the same preprocessing applied.
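Counting the query's term frequencies can also be done with `collections.Counter`; a minimal self-contained sketch (the token list here stands in for a hypothetical, already-preprocessed query):

```python
from collections import Counter

# Hypothetical preprocessed query tokens
query_terms = ["love", "love", "verona"]

term_frequencies = Counter(query_terms)
print(term_frequencies["love"])   # 2
print(term_frequencies["night"])  # 0 (Counter returns 0 for missing terms)
```

A `Counter` behaves like a dictionary but returns 0 for absent keys, which conveniently matches the "0 if not present" convention of the tfidf function.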
Exercise: implement a function to generate a TFIDF-vector representation for a text string, based on the information available in the index. For the purposes of this exercise, implement vectors as lists. Make sure to keep dimensionality consistent between documents, i.e. the same vector dimension should always correspond to the same index term!
def get_query_representation(text, index):
    """
    Returns the tfidf vector for an arbitrary string.
    :param text: string to construct the vector for
    :param index: index to construct the vector from
    :return: the tfidf vector for the string
    """
    # Apply the same preprocessing as for the indexed documents
    query_terms = preprocessor.preprocess(text)
    term_frequencies = {term: query_terms.count(term) for term in query_terms}
    vec = []
    for term in sorted(index.get_index_terms()):
        vec.append(tfidf(
            term_frequencies.get(term, 0),
            index.get_document_frequency(term),
            index.get_document_count()
        ))
    return vec
Now that we have a way to embed both documents and the query into a common vector space, we can rank documents according to a similarity function with respect to the query.
To retrieve documents, we need to build a vector space, i.e. construct the representation vector for every document in our collection.
Exercise: construct a vector space for all documents in the index using their document representations. Use a {doc_id: vector, ...} dictionary as data structure.
vectors = {}
for doc_id in index.get_document_ids():
    vectors[doc_id] = get_document_representation(doc_id, index)
To calculate the similarity score between the query and each document, we need to define a similarity function. We will use cosine similarity.
Exercise: implement cosine similarity. (Be mindful of the difference between distance and similarity!)
from math import sqrt

def score(vec_a, vec_b):
    def dot(vec_a, vec_b):
        # Dot product of two vectors
        return sum(map(lambda x: x[0] * x[1], zip(vec_a, vec_b)))

    def norm(vec):
        # Euclidean norm of a vector
        return sqrt(sum(map(lambda x: x ** 2, vec)))

    def cosine_sim(vec_a, vec_b):
        # Cosine similarity of two vectors; 0 if either vector is all-zero
        try:
            return dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))
        except ZeroDivisionError:
            return 0

    return cosine_sim(vec_a, vec_b)
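A quick hand-check of cosine similarity on toy vectors (written self-contained, independent of the function above): identical vectors score 1, orthogonal vectors score 0, and $[1,1]$ vs. $[1,0]$ scores $1/\sqrt{2} \approx 0.707$.

```python
from math import sqrt, isclose

def cosine(a, b):
    # Self-contained cosine similarity, just for this sanity check
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

assert isclose(cosine([1, 2, 3], [1, 2, 3]), 1.0)   # identical direction
assert cosine([1, 0], [0, 1]) == 0.0                # orthogonal
assert isclose(cosine([1, 1], [1, 0]), 1 / sqrt(2)) # 45 degrees apart
```

Because cosine similarity only depends on the angle between vectors, documents of very different lengths remain comparable.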
We now have all the components needed to answer queries. Write a wrapper function that takes a text and returns the (top $k$) results for the query. It constructs a query representation, calculates the pairwise similarity between the query and all documents, and returns a list of (doc_id, score) tuples, sorted descending by score.
Exercise: write a function to query the vector space.
def query(text, vectors, index, topK=-1):
    """
    Queries a given text against a given vector space. Embeds the query text and calculates the cosine similarity to every document in the vector space.
    :param text: query text
    :param vectors: dictionary of document vectors to query against
    :param index: the index data to query against
    :param topK: number of top results to return; -1 returns all results
    :return: list of (doc_id, score) tuples descending by score for all documents in the vector space
    """
    query_vec = get_query_representation(text, index)
    scores = {}
    for doc_id, doc_vec in vectors.items():
        scores[doc_id] = score(doc_vec, query_vec)
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Slicing with topK=-1 would silently drop the last result,
    # so only slice when a positive topK is given
    return ranking[:topK] if topK > 0 else ranking
You can now try out different queries. There are two example searches below, but try searching for anything you like. Take a look at both the ranking results and the original documents; try to trace why documents appear in the order they do, given their TFIDF similarity.
query("two households verona", vectors, index, topK=5)
[('Romeo and Juliet', 0.19370888618699222), ('Henry IV, Part II', 0.04532408825317537), ('Henry VI, Part I', 0.0), ('Timon of Athens', 0.0), ('Antony and Cleopatra', 0.0)]
query("love", vectors, index, topK=5)
[('Passionate Pilgrim', 0.28253619196580304), ('Two Gentlemen of Verona', 0.22471931574655718), ('Twelfth Night', 0.13726372664629935), ('Timon of Athens', 0.11891302011429873), ('Romeo and Juliet', 0.07777035257918241)]