Stats507 Homework 3, February 6th, 2019

Israel Diego

israeldi@umich.edu

This notebook shows solutions to Homework 3 for Stats507.

Table of Contents

Problem 1: Counting Word Bigrams
Problem 2: More Fun with Vectors

Problem 1: Counting Word Bigrams

Time Spent: 1 hour

In your previous homework, you wrote a function for counting character bigrams. Now, let's write a function for counting word bigrams. That is, for each pair of words, say, cat and dog, we want to count how many times the word "cat" occurred immediately before the word "dog". We will represent this bigram by a tuple, ('cat', 'dog'). For our purposes, we will ignore all spaces, newlines, punctuation and capitalization in our counting. For example, the poem fragment

Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.

includes the bigrams ('half', 'a') and ('a', 'league') three times each, the bigram ('league', 'half') twice, and the bigram ('in', 'the') only once.
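To make these counts concrete, here is a minimal sketch that tokenizes the poem fragment and tallies adjacent pairs with collections.Counter. The names poem, words and bigrams are illustrative assumptions and not part of the assignment; this is one possible way to check the example, not the required solution.

```python
import string
from collections import Counter

poem = """Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred."""

# split() with no arguments splits on any whitespace, including newlines
words = [w.strip(string.punctuation).lower() for w in poem.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

# Pair each word with its successor and count the pairs
bigrams = Counter(zip(words, words[1:]))

print(bigrams[('half', 'a')])       # 3
print(bigrams[('league', 'half')])  # 2
print(bigrams[('in', 'the')])       # 1
```

Because the whole text is split at once, the ('league', 'half') pair that spans the first line break is counted like any other.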

Write a function count_bigrams_in_file that takes a filename as its only argument. Your function should read from the given file and return a dictionary whose keys are bigrams (in the tuple form above) and whose values are the counts for those bigrams. Again, your function should ignore punctuation, spaces, newlines and capitalization. The strings in your key tuples should be lower-case. Your function should use a try/except statement to raise an error with an appropriate message in the event that the given file cannot be opened, and a different error in the event that the provided argument isn't a string at all.

Hint: you will find the Python function str.strip(), along with the string constants defined in the string documentation (https://docs.python.org/3/library/string.html), useful in removing punctuation.

Hint: be careful to check that your function handles newlines correctly. For example, in the poem above, one of the ('league', 'half') bigrams spans a newline, but should be counted nonetheless.

Note: be careful that your function does not accidentally count the empty string as a word (this is a common bug if you aren't careful about splitting the input text). Solutions that merely delete "bad" keys from the dictionary at the end will not receive full credit, as all edge cases can be handled by correctly splitting the input.
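The empty-string pitfall mentioned in the note is easy to reproduce. The short sketch below (the sample strings are made up for illustration) shows that str.strip() only trims characters from the two ends of a token, and that a token consisting entirely of punctuation collapses to ''.

```python
import string

# strip() removes the given characters from both ends only
print('league,'.strip(string.punctuation))   # league
print("don't".strip(string.punctuation))     # don't (interior apostrophe is kept)

# A token made entirely of punctuation collapses to the empty string
print(repr('--'.strip(string.punctuation)))  # ''

# Filtering while splitting keeps '' out of the word list
tokens = 'Well -- yes, indeed!'.split()
words = [t.strip(string.punctuation).lower() for t in tokens]
words = [w for w in words if w]
print(words)  # ['well', 'yes', 'indeed']
```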

In [34]:

import string

def bigram_hist(words):
    # Count each pair of adjacent words
    bigramDict = {}

    for i in range(len(words) - 1):
        key = (words[i], words[i + 1])
        bigramDict[key] = bigramDict.get(key, 0) + 1

    return bigramDict

def count_bigrams_in_file(file):
    # Check file name is a string
    if not isinstance(file, str):
        raise TypeError('Argument file name is not of type string!')

    extractedWords = []

    try:
        # The with-block closes the file even if reading fails
        with open(file) as f:
            for line in f:
                for word in line.split():
                    word = word.strip(string.punctuation).lower()
                    # Tokens made only of punctuation (e.g. '--') strip to '';
                    # skip them so the empty string is never counted as a word
                    if word:
                        extractedWords.append(word)
    except OSError as err:
        # Raise instead of printing, so the caller cannot miss the failure
        raise OSError('File: ' + file + ' could not be opened!') from err

    return bigram_hist(extractedWords)

Download the file WandP.txt from the course webpage: http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/WandP.txt. This is an ASCII copy of all of Tolstoi's novel War and Peace. Run your function on this file, and pickle the resulting dictionary in a file called mb.bigrams.pickle. Please include this file in your submission, along with WandP.txt, so that we can run your notebook directly from your submission.

In [35]:

import pickle

# Use the file name requested in the problem statement
with open('mb.bigrams.pickle', 'wb') as f:
    pickle.dump(count_bigrams_in_file('WandP.txt'), f)
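A quick way to confirm that a pickled dictionary is usable is to load it back and compare it with the original. The sketch below does this on a small stand-in dictionary written to a temporary file; the names counts, path and restored are illustrative, and the real submission file is not touched.

```python
import os
import pickle
import tempfile

# Small stand-in for the real bigram dictionary
counts = {('half', 'a'): 3, ('league', 'half'): 2}

# Write to a throwaway file so nothing in the submission is overwritten
path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
with open(path, 'wb') as f:
    pickle.dump(counts, f)

# Pickle files must be opened in binary mode for reading as well
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == counts)  # True
```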

We say that word A is collocated with word B in a text if words A and B occur immediately one after another (in either order). That is, words A and B are collocated if and only if either of the tuples (A, B) or (B, A) is present in the text. Write a function collocations that takes a filename as its only argument and returns a dictionary. Your function should read from the given file (raising an appropriate error if the file cannot be opened or if the argument isn't a string at all) and return a dictionary whose keys are all the strings appearing in the file (again ignoring case and stripping away all spaces, newlines and punctuation) and whose value for word A is a Python set containing all the words collocated with A. Again using the poem fragment above as an example, the string 'league' should appear as a key, and should have as its value the set {'a', 'half', 'onward'}, while the string 'in' should have the set {'all', 'the'} as its value.

Hint: we didn't discuss Python sets in lecture, because they are essentially just dictionaries without values. See the documentation at https://docs.python.org/3/tutorial/datastructures.html#sets for more information.
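The two example sets quoted above can be reproduced with a few lines. This is a rough sketch on the poem fragment only, using collections.defaultdict; the names poem, words and neighbors are assumptions made for illustration.

```python
import string
from collections import defaultdict

poem = """Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred."""

words = [w.strip(string.punctuation).lower() for w in poem.split()]
words = [w for w in words if w]

# Each adjacent pair makes the two words neighbors of each other
neighbors = defaultdict(set)
for first, second in zip(words, words[1:]):
    neighbors[first].add(second)
    neighbors[second].add(first)

print(sorted(neighbors['league']))  # ['a', 'half', 'onward']
print(sorted(neighbors['in']))      # ['all', 'the']
```

Note that a text containing a single word produces no pairs, so in this sketch that word would not appear as a key at all; a full solution still needs to add it with an empty set.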

In [36]:

def collocations(file):
    # Check file name is a string
    if not isinstance(file, str):
        raise TypeError('Argument file name is not of type string!')

    # Extract all words from file, skipping tokens that are pure punctuation
    extractedWords = []
    try:
        with open(file) as f:
            for line in f:
                for word in line.split():
                    word = word.strip(string.punctuation).lower()
                    if word:
                        extractedWords.append(word)
    except OSError as err:
        # Raise instead of printing, so the caller cannot miss the failure
        raise OSError('File: ' + file + ' could not be opened!') from err

    # Every word is a key, even if it has no neighbors (single-word file)
    collocDict = {word: set() for word in extractedWords}

    # Each adjacent pair adds the two words to each other's sets
    for prevWord, nextWord in zip(extractedWords, extractedWords[1:]):
        collocDict[prevWord].add(nextWord)
        collocDict[nextWord].add(prevWord)

    return collocDict

Run your function on the file WandP.txt and pickle the resulting dictionary in a file called mb.colloc.pickle. Please include this file in your submission.

In [37]:

# Use the file name requested in the problem statement
with open('mb.colloc.pickle', 'wb') as f:
    pickle.dump(collocations('WandP.txt'), f)

Problem 2: More Fun with Vectors

Time Spent: 2 hours

In this exercise, we'll encounter our old friend the vector yet again, this time taking an object-oriented approach.

Define a class Vector. Every vector should have a dimension (a non-negative integer) and a list or tuple of its entries. The initializer for your class should take the dimension as its first argument and a list or tuple of numbers (ints or floats), representing the vector's entries, as its second argument. Choose sensible default behavior for the case where the user supplies only a dimension and no entries. The initializer should raise a sensible error in the case where the dimension is invalid (i.e., wrong type or a negative number), and should also raise an error in the event that the dimension and the number of supplied entries disagree.

Did you choose to make the vector's entries a tuple or a list (there is no wrong answer here, although I would say one is better than the other in this context)? Defend your choice.

I stored the vector entries as a tuple, because tuples are immutable and this protects a vector from being modified after it is created. Any vector operation should return a new object rather than change the entries in place.

Are the dimension and entries class attributes or instance attributes? Why is this the right design choice?

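I made the dimension and the entries instance attributes, because every Vector object needs its own dimension and its own entries. Class attributes would be shared by all instances, which makes no sense for data that differs from one vector to the next.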
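The difference between the two kinds of attribute can be seen in a small experiment. The classes SharedEntries and OwnEntries below are hypothetical and exist only for this illustration.

```python
class SharedEntries:
    entries = []           # class attribute: one list shared by every instance

class OwnEntries:
    def __init__(self):
        self.entries = []  # instance attribute: a fresh list per object

a, b = SharedEntries(), SharedEntries()
a.entries.append(1)
print(b.entries)  # [1]  (b sees the change made through a)

c, d = OwnEntries(), OwnEntries()
c.entries.append(1)
print(d.entries)  # []   (d is unaffected)
```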

Implement the necessary operator(s) to support comparison (equality, less than, less or equal to, greater than, etc) of Vector objects. We will say that two Vector objects are equivalent if they have the same coordinates. Otherwise, comparison should be analogous to tuples in Python, so that comparison is done on the first coordinate first, then the second coordinate, then the third, and so on. So, for example, the two-dimensional vector (2, 4) is ordered before (less than) (2, 5). Attempting to compare two vectors of different dimensions should result in an error.
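Python's built-in tuple comparison already follows the coordinate-by-coordinate ordering described here, which is why a tuple of entries can do most of the work. The sketch below also shows the one case that needs an explicit check: tuples of different lengths compare without raising an error. The variable names are illustrative.

```python
# Comparison proceeds coordinate by coordinate, left to right
first_differs_late = (2, 4) < (2, 5)    # decided by the second coordinate
first_differs_early = (3, 0) > (2, 9)   # decided by the first coordinate
same = (2, 4) == (2, 4)

# Tuples of different lengths still compare silently, so a Vector class
# has to check the dimensions itself before delegating to tuples
mixed_lengths = (2, 4) < (2, 4, 0)

print(first_differs_late, first_differs_early, same, mixed_lengths)
# True True True True
```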

Implement a method Vector.dot that takes a single Vector as its argument and returns the inner product of the caller with the given Vector object. Your method should raise an appropriate error in the event that the argument is not of the correct type or in the event that the dimensions of the two vectors do not agree.

We would also like our Vector class to support scalar multiplication. Left- or right- multiplication by a scalar, e.g., 2*v or v*2, where v is a Vector object, should result in a new Vector object with its entries all scaled by the given scalar. We will also follow R and numpy (which you will learn in a few weeks), and use * to denote entrywise vector-vector multiplication, so that for Vector objects v and w, v*w results in a new Vector object, with the $i$-th entry of v*w equal to the $i$-th entry of v multiplied by the $i$-th entry of w. Implement the appropriate operators to support this multiplication operation. Many languages have a convention for dealing with multiplication of vectors that differ in their dimension, but we will punt on this matter. Your method should raise an appropriate error in the event that v and w disagree in their dimensions.
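Supporting both 2*v and v*2 relies on two special methods: Python first tries the left operand's __mul__, and if that returns NotImplemented it falls back to the right operand's __rmul__. The sketch below shows this dispatch on a hypothetical two-entry class Pair, which is not the Vector class from this homework and handles scalars only.

```python
class Pair:
    """Hypothetical two-entry container, used only to show operator dispatch."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __mul__(self, k):
        # Called for pair * k
        if not isinstance(k, (int, float)):
            return NotImplemented
        return Pair(self.a * k, self.b * k)

    def __rmul__(self, k):
        # Called for k * pair, after int.__mul__ returns NotImplemented
        return self * k

p = Pair(1, 2)
left = 3 * p
right = p * 3
print((left.a, left.b), (right.a, right.b))  # (3, 6) (3, 6)
```

Returning NotImplemented for unsupported operands causes Python itself to raise a TypeError; raising the error directly, as the assignment asks, is an equally reasonable choice.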

For a real number $0\leq p\leq\infty$, and a vector $v\in\mathbb{R}^{d}$, the $p$-norm of $v$, written $\Vert v\Vert_{p}$, is given by

\begin{align*} \Vert v\Vert_{p}= & \begin{cases} \sum_{i=1}^{d}1_{v_{i}\neq0} & \textrm{if }p=0\\ \left(\sum_{i=1}^{d}\lvert v_{i}\rvert^{p}\right)^{1/p} & \textrm{if }0<p<\infty\\ \max_{i=1,2,\ldots,d}\lvert v_{i}\rvert & \textrm{if }p=\infty \end{cases} \end{align*}

Strictly speaking, this is only a norm for $p\geq1$, but that's beside the point. Implement a method Vector.norm that takes a single int or float p as an argument and returns the p-norm of the calling Vector object. Your method should work whether p is an integer or float. Your method should raise a sensible error in the event that $p$ is negative. Hint: see https://docs.python.org/3/library/functions.html#float for documentation on representing positive infinity in Python.
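As a worked example of the three cases, take $v=(3,-4,0)$: it has two nonzero entries, its absolute values sum to 7, its Euclidean length is 5, and its largest absolute entry is 4. The sketch below checks these values with a hypothetical standalone helper p_norm, which is an assumption for illustration and not the Vector.norm method itself.

```python
import math

def p_norm(entries, p):
    """Hypothetical standalone helper covering the three cases of the p-norm."""
    if p < 0:
        raise ValueError('p must be non-negative')
    if p == 0:
        # Count the nonzero entries
        return float(sum(1 for x in entries if x != 0))
    if p == math.inf:
        # Largest absolute entry
        return float(max(abs(x) for x in entries))
    # General case: (sum of |x|^p)^(1/p)
    return float(sum(abs(x) ** p for x in entries) ** (1 / p))

v = (3, -4, 0)
print(p_norm(v, 0))             # 2.0
print(p_norm(v, 1))             # 7.0
print(p_norm(v, 2))             # 5.0
print(p_norm(v, float('inf')))  # 4.0
```

Here float('inf') and math.inf are the same value, so either spelling can be used for the infinite case.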

In [39]:

class Vector:
    # 1. Constructor for Vector Class
    def __init__(self, dimension, entries = None):
        # Error Checking
        if not isinstance(dimension, int):
            raise TypeError('Dimension should be an int!')
            
        if dimension < 0:
            raise ValueError('Dimension cannot be negative!')
        
        if entries is not None:
            if not isinstance(entries, (list, tuple)):
                raise TypeError('Input elements should be a list or a tuple!')

            if not all(isinstance(elmt, (int, float)) for elmt in entries):
                raise TypeError('Elements of list/tuple should be of type int/float!')

            if dimension != len(entries):
                raise ValueError('Dimension and number of elements in entries disagree!')
        
        # Initialize Instance Attributes
        self.dimension = dimension
        if entries is None:
            # Default: the zero vector of the given dimension
            self.entries = (0,) * dimension
        else:
            self.entries = tuple(entries)
    
    # 2. The entries are stored as a tuple, which is immutable, to protect vectors
    #    from being modified. Operations return a new object instead.
    
    # 3. The dimension and the entries are instance attributes, because every
    #    object needs its own copies of them.
    
    # 4. Defining operators for Vector Class
    # Shared check: comparison is only defined between Vectors of equal dimension
    def _check_comparable(self, other):
        if not isinstance(other, Vector):
            raise TypeError('Can only compare a Vector with another Vector!')
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')

    # Equality '=='
    def __eq__(self, other):
        self._check_comparable(other)
        return self.entries == other.entries

    # Not equal '!='
    def __ne__(self, other):
        self._check_comparable(other)
        return self.entries != other.entries

    # Less than '<'
    def __lt__(self, other):
        self._check_comparable(other)
        return self.entries < other.entries

    # Greater than '>'
    def __gt__(self, other):
        self._check_comparable(other)
        return self.entries > other.entries

    # Less than or equal '<='
    def __le__(self, other):
        self._check_comparable(other)
        return self.entries <= other.entries

    # Greater than or equal '>='
    def __ge__(self, other):
        self._check_comparable(other)
        return self.entries >= other.entries
    
    # 5. Vector.dot method
    def dot(self, other):
        if not isinstance(other, Vector):
            raise TypeError('Should supply argument of type Vector!')
        
        if (self.dimension != other.dimension):
            raise ValueError('Vectors should have the same dimension!')
        
        # Sum of the products of matching entries
        return(float(sum(x * y for x, y in zip(self.entries, other.entries))))
    
    # 6. Vector and Scalar multiplication
    def __mul__(self, other):
        if not isinstance(other, (int, float, Vector)):
            raise TypeError('Should only multiply with scalars or Vectors!')

        if isinstance(other, Vector):
            if (self.dimension != other.dimension):
                raise ValueError('Vectors should have the same dimension!')
            # Entrywise product, returned as a new Vector
            return Vector(self.dimension, [x * y for x, y in zip(self.entries, other.entries)])

        # Scalar case: scale every entry, returned as a new Vector
        return Vector(self.dimension, [other * x for x in self.entries])

    # Right multiplication (e.g. 2 * v) gives the same result as v * 2
    def __rmul__(self, other):
        return self.__mul__(other)
        
    # 7. Vector.norm method
    def norm(self, p):
        if not isinstance(p, (int, float)):
            raise TypeError('Input should be of type int/float!')
        
        if p < 0:
            raise ValueError('Input should be non-negative!')
        
        # p = 0: number of nonzero entries
        if p == 0:
            return(float(sum(1 for x in self.entries if x != 0)))
        
        # p = infinity: largest absolute entry (0 for a vector of dimension 0)
        if p == float('inf'):
            return(float(max((abs(x) for x in self.entries), default = 0)))
        
        # 0 < p < infinity
        return(float(sum(abs(x)**p for x in self.entries)**(1 / p)))