#!/usr/bin/env python # coding: utf-8 # # Information Retrieval Lab: Indexing # ## Lecture Recall # # First, recall some important concepts from the lecture: *index*, *document*, and *vocabulary*. # # ###### Index # > An index is a systematic guide designed to indicate topics or features of # documentary units as index terms in order to facilitate their retrieval. # The function of an index is to provide users with an effective means for locating documentary units relevant to their information needs in answer to queries. # # ###### Document # > A document is a medium on or in which a message is encoded; the combination of message and medium. A documentary unit is a document, document segment, or collection of documents to which entries in an index refer. # # ###### Vocabulary # > A vocabulary is a collection of words and phrases. # The terminology is a collection of index terms used in an index. # An index term, or term, is (a derivation of) a word or phrase selected from a vocabulary, used to represent a topic or feature of a document in an index. # ## Document Collection # To make the data a bit more interesting than the toy sentences from the last exercise, consider the following list of opening lines from Shakespeare plays. # In[1]: shakespeare = [ ("All's Well That Ends Well", "In Delivering My Son from Me, I Bury a Second Husband."), ("Antony and Cleopatra", "Nay, but this dotage of our general's O'erflows the measure: those his goodly eyes, That o'er the files and musters of the war Have glow'd like plated Mars, now bend, now turn, The office and devotion of their view Upon a tawny front: his captain's heart, Which in the scuffles of great fights hath burst The buckles on his breast, reneges all temper, And is become the bellows and the fan To cool a gipsy's lust. "), ("As You Like It", "As I remember, Adam, it was upon this fashion bequeathed me by will but poor a thousand crowns, and, as thou say'st, charged my brother, on his blessing, to breed me well; and there begins my sadness. My brother Jaques he keeps at school, and report speaks goldenly of his profit. For my part, he keeps me rustically at home, or, to speak more properly, stays me here at home unkept; for call you that keeping for a gentleman of my birth that differs not from the stalling of an ox? His horses are bred better; for, besides that they are fair with their feeding, they are taught their manage, and to that end riders dearly hir'd; but I, his brother, gain nothing under him but growth; for the which his animals on his dunghills are as much bound to him as I. Besides this nothing that he so plentifully gives me, the something that nature gave me his countenance seems to take from me. He lets me feed with his hinds, bars me the place of a brother, and as much as in him lies, mines my gentility with my education. This is it, Adam, that grieves me; and the spirit of my father, which I think is within me, begins to mutiny against this servitude. I will no longer endure it, though yet I know no wise remedy how to avoid it."), ("Comedy of Errors", "Proceed, Solinus, to procure my fall And by the doom of death end woes and all."), ("Coriolanus", "Before We Proceed Any Further, Hear Me Speak."), ("Cymbeline", "You do not meet a man but frowns: our bloods No more obey the heavens than our courtiers Still seem as does the king."), ("Hamlet", "Who's There?"), ("Henry IV, Part I", "So shaken as we are, so wan with care, Find we a time for frighted peace to pant, And breathe short-winded accents of new broils To be commenced in strands afar remote. No more the thirsty entrance of this soil Shall daub her lips with her own children's blood; Nor more shall trenching war channel her fields, Nor bruise her flowerets with the armed hoofs Of hostile paces: those opposed eyes, Which, like the meteors of a troubled heaven, All of one nature, of one substance bred, Did lately meet in the intestine shock And furious close of civil butchery Shall now, in mutual well-beseeming ranks, March all one way and be no more opposed Against acquaintance, kindred and allies: The edge of war, like an ill-sheathed knife, No more shall cut his master."), ("Henry IV, Part II", "Open your ears; for which of you will stop The vent of hearing when loud Rumour speaks? I, from the orient to the drooping west, Making the wind my post-horse, still unfold The acts commenced on this ball of earth. Upon my tongues continual slanders ride, The which in every language I pronounce, Stuffing the ears of men with false reports. I speak of peace while covert emnity, Under the smile of safety, wounds the world; And who but Rumour, who but only I, Make fearful musters and prepar'd defence, Whiles the big year, swoln with some other grief, Is thought with child by the stern tyrant war, And no such matter? Rumour is a pipe Blown by surmises, jealousies, conjectures, And of so easy and so plain a stop That the blunt monster with uncounted heads, The still-discordant wav'ring multitude, Can play upon it. But what need I thus My well-known body to anatomize Among my household? Why is Rumour here? I run before King Harry's victory, Who, in a bloody field by Shrewsbury, Hath beaten down young Hotspur and his troops, Quenching the flame of bold rebellion Even with the rebels' blood. But what mean I To speak so true at first? My office is To noise abroad that Harry Monmouth fell Under the wrath of noble Hotspur's sword, And that the King before the Douglas' rage Stoop'd his anointed head as low as death. This have I rumour'd through the peasant towns Between that royal field of Shrewsbury And this worm-eaten hold of ragged stone, Where Hotspur's father, old Northumberland, Lies crafty-sick. The posts come tiring on, And not a man of them brings other news Than they have learnt of me. From Rumour's tongues They bring smooth comforts false, worse than true wrongs."), ("Henry VIII", "I come no more to make you laugh: things now, That bear a weighty and a serious brow, Sad, high, and working, full of state and woe, Such noble scenes as draw the eye to flow, We now present. Those that can pity, here May, if they think it well, let fall a tear; The subject will deserve it."), ("Henry VI, Part I", "Hung be the heavens with black, yield day to night! Comets, importing change of times and states, Brandish your crystal tresses in the sky, And with them scourge the bad revolting stars That have consented unto Henry's death! King Henry the Fifth, too famous to live long! England ne'er lost a king of so much worth."), ("Henry VI, Part II", "As by Your High Imperial Majesty, I Had in Charge at My Depart for France"), ("Henry VI, Part III", "As by your high imperial majesty I had in charge at my depart for France, As procurator to your excellence, To marry Princess Margaret for your grace, So, in the famous ancient city, Tours, In presence of the Kings of France and Sicil, The Dukes of Orleans, Calaber, Bretagne and Alencon, Seven earls, twelve barons and twenty reverend bishops, I have perform'd my task and was espoused: And humbly now upon my bended knee, In sight of England and her lordly peers, Deliver up my title in the queen To your most gracious hands, that are the substance Of that great shadow I did represent; The happiest gift that ever marquess gave, The fairest queen that ever king received."), ("Henry V", "O for a Muse of fire, that would ascend The brightest heaven of invention, A kingdom for a stage, princes to act And monarchs to behold the swelling scene! Then should the warlike Harry, like himself, Assume the port of Mars; and at his heels, Leash'd in like hounds, should famine, sword and fire Crouch for employment. But pardon, and gentles all, The flat unraised spirits that have dared On this unworthy scaffold to bring forth So great an object: can this cockpit hold The vasty fields of France? or may we cram Within this wooden O the very casques That did affright the air at Agincourt? O, pardon! since a crooked figure may Attest in little place a million; And let us, ciphers to this great accompt, On your imaginary forces work. Suppose within the girdle of these walls Are now confined two mighty monarchies, Whose high upreared and abutting fronts The perilous narrow ocean parts asunder: Piece out our imperfections with your thoughts; Into a thousand parts divide on man, And make imaginary puissance; Think when we talk of horses, that you see them Printing their proud hoofs i' the receiving earth; For 'tis your thoughts that now must deck our kings, Carry them here and there; jumping o'er times, Turning the accomplishment of many years Into an hour-glass: for the which supply, Admit me Chorus to this history; Who prologue-like your humble patience pray, Gently to hear, kindly to judge, our play."), ("Julius Caesar", "Hence! home, you idle creatures get you home: Is this a holiday? what! know you not, Being mechanical, you ought not walk Upon a labouring day without the sign Of your profession? Speak, what trade art thou?"), ("King John", "Now, Say, Chatillon, What Would France With Us?"), ("King Lear","I thought the King had more affected the Duke of Albany than Cornwall."), ("Lover's Complaint", "From off a hill whose concave womb reworded A plaintful story from a sistering vale, My spirits to attend this double voice accorded, And down I laid to list the sad-tuned tale; Ere long espied a fickle maid full pale, Tearing of papers, breaking rings a-twain, Storming her world with sorrow's wind and rain."), ("Love's Labour's Lost", "Let fame, that all hunt after in their lives, Live register'd upon our brazen tombs And then grace us in the disgrace of death; When, spite of cormorant devouring Time, The endeavor of this present breath may buy That honour which shall bate his scythe's keen edge And make us heirs of all eternity. Therefore, brave conquerors,--for so you are, That war against your own affections And the huge army of the world's desires,-- Our late edict shall strongly stand in force: Navarre shall be the wonder of the world; Our court shall be a little Academe, Still and contemplative in living art. You three, Biron, Dumain, and Longaville, Have sworn for three years' term to live with me My fellow-scholars, and to keep those statutes That are recorded in this schedule here: Your oaths are pass'd; and now subscribe your names, That his own hand may strike his honour down That violates the smallest branch herein: If you are arm'd to do as sworn to do, Subscribe to your deep oaths, and keep it too."), ("Macbeth","When shall we three meet again In thunder, lightning, or in rain?"), ("Measure for Measure","Escalus."), ("The Merchant of Venice","In sooth, I know not why I am so sad: It wearies me; you say it wearies you; But how I caught it, found it, or came by it, What stuff 'tis made of, whereof it is born, I am to learn; And such a want-wit sadness makes of me, That I have much ado to know myself."), ("The Merry Wives of Windsor","Sir Hugh, persuade me not; I will make a Star- chamber matter of it: if he were twenty Sir John Falstaffs, he shall not abuse Robert Shallow, esquire."), ("A Midsummer Night's Dream","Now, fair Hippolyta, our nuptial hour Draws on apace; four happy days bring in Another moon: but, O, methinks, how slow This old moon wanes! she lingers my desires, Like to a step-dame or a dowager Long withering out a young man revenue."), ("Much Ado About Nothing","I Learn in This Letter That Don Peter of Arragon, Comes This Night to Messina."), ("Othello","Tush! never tell me; I take it much unkindly That thou, Iago, who hast had my purse As if the strings were thine, shouldst know of this."), ("Passionate Pilgrim","When my love swears that she is made of truth, I do believe her, though I know she lies, That she might think me some untutor'd youth, Unskilful in the world's false forgeries. Thus vainly thinking that she thinks me young, Although I know my years be past the best, I smiling credit her false-speaking tongue, Outfacing faults in love with love's ill rest. But wherefore says my love that she is young? And wherefore say not I that I am old? O, love's best habit is a soothing tongue, And age, in love, loves not to have years told. Therefore I'll lie with love, and love with me, Since that our faults in love thus smother'd be."), ("Pericles","To sing a song that old was sung, From ashes ancient Gower is come; Assuming man's infirmities, To glad your ear, and please your eyes. It hath been sung at festivals, On ember-eves and holy-ales; And lords and ladies in their lives Have read it for restoratives: The purchase is to make men glorious; Et bonum quo antiquius, eo melius."), ("Phoenix and the Turtle","Let the bird of loudest lay, On the sole Arabian tree, Herald sad and trumpet be, To whose sound chaste wings obey."), ("Richard III","Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures. Grim-visaged war hath smooth'd his wrinkled front; And now, instead of mounting barded steeds To fright the souls of fearful adversaries, He capers nimbly in a lady's chamber To the lascivious pleasing of a lute. But I, that am not shaped for sportive tricks, Nor made to court an amorous looking-glass; I, that am rudely stamp'd, and want love's majesty To strut before a wanton ambling nymph; I, that am curtail'd of this fair proportion, Cheated of feature by dissembling nature, Deformed, unfinish'd, sent before my time Into this breathing world, scarce half made up, And that so lamely and unfashionable That dogs bark at me as I halt by them; Why, I, in this weak piping time of peace, Have no delight to pass away the time, Unless to spy my shadow in the sun And descant on mine own deformity: And therefore, since I cannot prove a lover, To entertain these fair well-spoken days, I am determined to prove a villain And hate the idle pleasures of these days. Plots have I laid, inductions dangerous, By drunken prophecies, libels and dreams, To set my brother Clarence and the king In deadly hate the one against the other: And if King Edward be as true and just As I am subtle, false and treacherous, This day should Clarence closely be mew'd up, About a prophecy, which says that 'G' Of Edward's heirs the murderer shall be. Dive, thoughts, down to my soul: here Clarence comes."), ("Richard II","Old John of Gaunt, time-honour'd Lancaster, Hast thou, according to thy oath and band, Brought hither Henry Hereford thy bold son, Here to make good the boisterous late appeal, Which then our leisure would not let us hear, Against the Duke of Norfolk, Thomas Mowbray?"), ("Romeo and Juliet","Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean. From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life; Whole misadventured piteous overthrows Do with their death bury their parents' strife. The fearful passage of their death-mark'd love, And the continuance of their parents' rage, Which, but their children's end, nought could remove, Is now the two hours' traffic of our stage; The which if you with patient ears attend, What here shall miss, our toil shall strive to mend."), ("Taming of the Shrew","I'll Pheeze You, in Faith."), ("Tempest","Boatswain!"), ("Timon of Athens","Noble patricians, patrons of my right, Defend the justice of my cause with arms, And, countrymen, my loving followers, Plead my successive title with your swords: I am his first-born son, that was the last That wore the imperial diadem of Rome; Then let my father's honours live in me, Nor wrong mine age with this indignity."), ("Titus Andronicus","Noble Patricians, Patrons of My Right, Defend the Justice of My Cause With Arms"), ("Troilus and Cressida","In Troy, there lies the scene. From isles of Greece The princes orgulous, their high blood chafed, Have to the port of Athens sent their ships, Fraught with the ministers and instruments Of cruel war: sixty and nine, that wore Their crownets regal, from the Athenian bay Put forth toward Phrygia; and their vow is made To ransack Troy, within whose strong immures The ravish'd Helen, Menelaus' queen, With wanton Paris sleeps; and that's the quarrel. To Tenedos they come; And the deep-drawing barks do there disgorge Their warlike fraughtage: now on Dardan plains The fresh and yet unbruised Greeks do pitch Their brave pavilions: Priam's six-gated city, Dardan, and Tymbria, Helias, Chetas, Troien, And Antenorides, with massy staples And corresponsive and fulfilling bolts, Sperr up the sons of Troy. Now expectation, tickling skittish spirits, On one and other side, Trojan and Greek, Sets all on hazard: and hither am I come A prologue arm'd, but not in confidence Of author's pen or actor's voice, but suited In like conditions as our argument, To tell you, fair beholders, that our play Leaps o'er the vaunt and firstlings of those broils, Beginning in the middle, starting thence away To what may be digested in a play. Like or find fault; do as your pleasures are: Now good or bad, 'tis but the chance of war."), ("Twelfth Night","If music be the food of love, play on; Give me excess of it, that, surfeiting, The appetite may sicken, and so die. That strain again! it had a dying fall: O, it came o'er my ear like the sweet sound, That breathes upon a bank of violets, Stealing and giving odour! Enough; no more: 'Tis not so sweet now as it was before. O spirit of love! how quick and fresh art thou, That, notwithstanding thy capacity Receiveth as the sea, nought enters there, Of what validity and pitch soe'er, But falls into abatement and low price, Even in a minute: so full of shapes is fancy That it alone is high fantastical."), ("Two Gentlemen of Verona","Cease to persuade, my loving Proteus: Home-keeping youth have ever homely wits. Were't not affection chains thy tender days To the sweet glances of thy honour'd love, I rather would entreat thy company To see the wonders of the world abroad, Than, living dully sluggardized at home, Wear out thy youth with shapeless idleness. But since thou lovest, love still and thrive therein, Even as I would when I to love begin."), ("Venus and Adonis","Vilia Miretur Vulgus"), ("The Winter's Tale","If you shall chance, Camillo, to visit Bohemia, on the like occasion whereon my services are now on foot, you shall see, as I have said, great difference betwixt our Bohemia and your Sicilia.") ] # Before we begin indexing, we want to transform our document collection, such that each document is represented by the words it contains (bag-of-words model). # # We included the reference solution of the last week wrapped in the `Preprocessor` class in the `preprocessing.py` file. You can import and use it in this exercise. However, feel free to write such a class yourself and import your own code! # **Exercise:** Transform the documents so that each document is represented as `{"id": title, "terms": [...]}` dictionary. Extract the term set using the preprocessing functions from the last exercise. We recommend using the `spaCy` solution. # In[2]: from modules.preprocessing import PreprocessorSpacy prep = PreprocessorSpacy() documents = [] for play, opener in shakespeare: documents.append({"id": play, "terms": prep.preprocess(opener)}) # In[3]: documents # ## The Term-Document Matrix # # Before we can construct an Index, we need an intermediate data structure, the term-document matrix $T\times D$, based on the set of all tokens that occur in our data, our *index terms* $T$, and the *document set* $D$. This matrix denotes which terms $t \in T$ occur in each document $d \in D$. # **Exercise:** construct the index term set from the document collection. # In[4]: index_terms = {term for document in documents for term in document["terms"]} # **Exercise:** based on the index terms, build the complete term-document matrix $T\times D$. Use the following data structure: `{ "doc_id": {"term": Bool, ...}, ...}`, where the boolean value indicates if the specified term appears in the document or not. # In[5]: td_matrix = {} for document in documents: row = {} for term in index_terms: row[term] = term in document["terms"] td_matrix[document["id"]] = row # We can make this matrix much more useful if we count the number of occurences, instead of relying on a simple binary representation. # **Exercise:** using the data structure from above, replace the boolean value with the count of occurences for each term/document combination. You can omit zero-values. # In[6]: td_matrix = {} for document in documents: row = {} for term in index_terms: row[term] = document["terms"].count(term) td_matrix[document["id"]] = row # **Questions:** # - Why is it useful to count the number of occurences? # - What can you observe about the distribution of counts in the $T\times D$ matrix? # - What limitations does using a boolean instead of a count place on the possible ways of retrieving information from the matrix? # ## The Inverted Index # An inverted index is a data structure that encodes a term-document matrix for space-efficient storing and time-efficient retrieval. The name "inverted" stems from the basic idea behind it: instead of pointing from a document to the terms it includes, it "inverts" the matrix and now points from terms to the documents they occur in. # **Exercise:** build an inverted index from the term-document matrix. Use the following data structure: `{"term": {"doc_id": boolean, ...}, ...}`. The inverted index is a *sparse* representation, i.e. you do not need to include term/document combinations where `occurences` is 0. # In[7]: index = {} for term in index_terms: docs = {} for doc_id in td_matrix.keys(): if td_matrix[doc_id][term] > 0: docs[doc_id] = True index[term] = docs # Similarly to how we made the term-document matrix more useful when counting the number of occurences for each term, our inverted index gets more useful if we also incorporate that information here. # We introduce a *term weight* $w$. This weight will contain the term frequency $\text{tf}(t,d)$ of that term for the current document. # **Exercise:** modify your inverted index to reflect the number of occurences of each term. # In[8]: index = {} for term in index_terms: docs = {} for doc_id in td_matrix.keys(): if td_matrix[doc_id][term] > 0: docs[doc_id] = td_matrix[doc_id][term] index[term] = docs # ## Introducing term weights # Two important pieces of information about each term are still missing: # - the *total term frequency* $ttf(t, D)$, which equals the total number of occurences of the term (i.e. the sum of the term frequencies). # - the *document frequency* $\text{df}(t)$, the number of documents that a term appears in at least once # # **Exercise:** add the *total term frequency* to the inverted index. # # **Question:** why don't we explicitly add the *document frequency* to the index as well? # In[9]: index = {} for term in index_terms: docs = {} for doc_id in td_matrix.keys(): if td_matrix[doc_id][term] > 0: docs[doc_id] = td_matrix[doc_id][term] index[term] = (sum(docs.values()), docs) # ## Computing Term Statistics # # With the inverted index finished, we now retrieve documents using term statistics. # For that, we should have an easy way to compute term statistics for a selected term or document. # # **Exercise:** write helper functions that return the correct value from our index for each of the term statistics defined below. # *Term Frequency*: frequency (i.e. number of occurences) for a specified term and document. # In[10]: def get_term_frequency(index, term, doc_id): # Returns the term frequency for a specified term and document try: return index[term][1][doc_id] except KeyError: # If the term does not exist in this document, return 0 return 0 # *Total Term Frequency*: total number of occurrences of a term across all documents. # In[11]: def get_total_term_frequency(index, term): # Returns the total number of occurrences of a term across all documents try: return index[term][0] except KeyError: # If the term does not exist, return 0 return 0 # *Document Frequency*: the number of documents that contain the given term at least once. # In[12]: def get_document_frequency(index, term): # Returns the number of documents that contain the given term at least once try: return len(index[term][1]) except KeyError: # If the term does not exist, return 0 return 0 # *Document Count*: the number of documents in the index. # In[13]: def get_document_count(index): # Returns the number of documents in the index return len(get_document_ids(index)) # *Term Count*: the number of terms in the index. # In[14]: def get_term_count(index): # Returns the number of terms in the index return len(index.keys()) # Additionally, we want two helper functions to easily access the set of indexed terms and document IDs. # *Index Terms*: a list of all indexed terms. # In[15]: def get_index_terms(index): # Returns a list of all indexed terms return list(index.keys()) # *Document IDs*: a list of IDs for all documents in the index. # In[16]: def get_document_ids(index): # Returns a list of IDs for all documents in the index ids = [] for v in index.values(): ids.extend(list(v[1].keys())) return set(ids) # ## Answering Simple Queries # # Using our simple index, we can begin answering queries, i.e. *retrieving information*. # # **Exercise**: Answer the following questions using the term statistic functions you defined. # *Which plays contain the term `king` at least once?* # In[17]: results = [] for doc_id in get_document_ids(index): frequency = get_term_frequency(index, "king", doc_id) if frequency > 0: results.append((doc_id, frequency)) results # *What term(s) appears most often overall? How often?* # In[18]: result = ([], 0) for term in get_index_terms(index): frequency = get_total_term_frequency(index, term) if frequency == result[1]: result = (result[0] + [term], frequency) elif frequency > result[1]: result = ([term], frequency) result # *Are there words that occur more than once in one document?* # In[19]: results = [] for doc_id in get_document_ids(index): for term in get_index_terms(index): frequency = get_term_frequency(index, term, doc_id) if frequency > 1: results.append((term, frequency)) results # **Question:** is there a result above that surprises you? If yes, think about why it may be different from your expectation.