This notebook shows solutions to homework 3 for Stats507.
In your previous homework, you wrote a function for counting character bigrams. Now, let's write a function for counting word bigrams. That is, for each pair of words, say, cat and dog, we want to count how many times the word "cat" occurred immediately before the word "dog". We will represent this bigram by a tuple, ('cat', 'dog'). For our purposes, we will ignore all spaces, newlines, punctuation and capitalization in our counting. So, as an example, the fragment of poem,
Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.
includes the bigrams ('half', 'a') and ('a', 'league') three times each and the bigram ('league', 'half') twice, while the bigram ('in', 'the') appears only once.
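The counts quoted for the poem fragment can be checked with a short sketch. This is an illustration on an in-memory string, not the file-based solution; the names poem, words and bigram_counts are introduced here only for the example.

```python
import string
from collections import Counter

poem = """Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred."""

# split() with no arguments splits on any whitespace, including newlines,
# so bigrams that span a line break are kept.
words = [w.strip(string.punctuation).lower() for w in poem.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

# Pair each word with its successor and count the pairs.
bigram_counts = Counter(zip(words, words[1:]))

print(bigram_counts[('half', 'a')])       # 3
print(bigram_counts[('a', 'league')])     # 3
print(bigram_counts[('league', 'half')])  # 2
print(bigram_counts[('in', 'the')])       # 1
```

Here one of the two ('league', 'half') bigrams crosses the first line break, which is why the text is split as a whole rather than line by line.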
Write a function count_bigrams_in_file that takes a filename as its only argument. Your function should read from the given file, and return a dictionary whose keys are bigrams (given in the tuple form above), and whose values are the counts for those bigrams. Again, your function should ignore punctuation, spaces, newlines and capitalization. The strings in your key tuples should be lower-case. Your function should use a try-except statement to raise an error with an appropriate message to alert the user in the event that the given file cannot be opened, and a different error in the event that the provided argument isn't a string at all.
Hint: you will find the Python function str.strip(), along with the string constants defined in the string documentation (https://docs.python.org/3/library/string.html), useful in removing punctuation.
Hint: be careful to check that your function handles newlines correctly. For example, in the poem above, one of the ('league', 'half') bigrams spans a newline, but should be counted nonetheless.
Note: be careful that your function does not accidentally count the empty string as a word (this is a common bug if you aren't careful about splitting the input text). Solutions that merely delete "bad" keys from the dictionary at the end will not receive full credit, as all edge cases can be handled by correctly splitting the input.
import string
def bigram_hist(words):
    # Count each adjacent pair (words[i], words[i + 1]) in the word list
    bigramDict = {}
    for i in range(len(words) - 1):
        key = (words[i], words[i + 1])
        bigramDict[key] = bigramDict.get(key, 0) + 1
    return bigramDict
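def count_bigrams_in_file(file):
    # Check that the file name is a string
    if not isinstance(file, str):
        raise TypeError('Argument file name is not of type string!')
    extractedWords = []
    try:
        with open(file) as f:
            for line in f:
                for word in line.split():
                    cleaned = word.strip(string.punctuation).lower()
                    # Skip tokens that were pure punctuation (e.g. '--'),
                    # so the empty string is never counted as a word
                    if cleaned:
                        extractedWords.append(cleaned)
    except OSError as err:
        # Raise, rather than print, so the caller is alerted to the failure
        raise OSError('File: ' + file + ' could not be opened!') from err
    return bigram_hist(extractedWords)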
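The note about empty strings is easy to reproduce. The sketch below uses a made-up sentence (an assumption for illustration, not text from the novel) to show how a token consisting only of punctuation collapses to the empty string after str.strip, and how filtering at split time avoids it.

```python
import string

line = "Well -- that is, he said -- no."

# Naive approach: strip punctuation from every whitespace-separated token.
naive = [tok.strip(string.punctuation).lower() for tok in line.split()]

# Filtered approach: keep only tokens that are non-empty after stripping.
filtered = [w for w in naive if w]

print(naive)     # ['well', '', 'that', 'is', 'he', 'said', '', 'no']
print(filtered)  # ['well', 'that', 'is', 'he', 'said', 'no']
```

With the naive list, bigrams such as ('well', '') would end up as dictionary keys. Filtering while the words are extracted means no bad keys need to be deleted afterwards.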
Download the file WandP.txt from the course webpage (http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/WandP.txt). This is an ASCII copy of all of Tolstoi's novel War and Peace. Run your function on this file, and pickle the resulting dictionary in a file called mb.bigrams.pickle. Please include this file in your submission, along with WandP.txt, so that we can run your notebook directly from your submission.
import pickle
# Pickle the bigram counts under the file name the assignment asks for
with open('mb.bigrams.pickle', 'wb') as f:
    pickle.dump(count_bigrams_in_file('WandP.txt'), f)
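One way to confirm that a pickled dictionary is usable is to load it back and compare it with the original. The sketch below does this with a small illustrative dictionary and a hypothetical file name in a temporary directory, so it does not touch the submission files.

```python
import os
import pickle
import tempfile

# Illustrative data only; tuple keys pickle without any special handling.
counts = {('half', 'a'): 3, ('league', 'half'): 2}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.bigrams.pickle')

# Pickle files must be opened in binary mode for both writing and reading.
with open(path, 'wb') as f:
    pickle.dump(counts, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == counts)  # True
```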
We say that two words A and B are collocated if at least one of the bigrams (A, B) or (B, A) is present in the text. Write a function collocations that takes a filename as its only argument and returns a dictionary. Your function should read from the given file (raising an appropriate error if the file cannot be opened or if the argument isn't a string at all) and return a dictionary whose keys are all the strings appearing in the file (again ignoring case and stripping away all spaces, newlines and punctuation) and in which the value for word A is a Python set containing all the words collocated with A. Again using the poem fragment above as an example, the string 'league' should appear as a key, and should have as its value the set {'a', 'half', 'onward'}, while the string 'in' should have the set {'all', 'the'} as its value.
Hint: we didn't discuss Python sets in lecture, because they are essentially just dictionaries without values. See the documentation at https://docs.python.org/3/tutorial/datastructures.html#sets for more information.
def collocations(file):
    # Check that the file name is a string
    if not isinstance(file, str):
        raise TypeError('Argument file name is not of type string!')
    # Extract all words from file, skipping tokens that were pure punctuation
    extractedWords = []
    try:
        with open(file) as f:
            for line in f:
                for word in line.split():
                    cleaned = word.strip(string.punctuation).lower()
                    if cleaned:
                        extractedWords.append(cleaned)
    except OSError as err:
        # Raise, rather than print, so the caller is alerted to the failure
        raise OSError('File: ' + file + ' could not be opened!') from err
    # Every word is a key, even if it has no neighbours (one-word file)
    collocDict = {word: set() for word in extractedWords}
    # Each adjacent pair contributes a collocation in both directions
    for i in range(len(extractedWords) - 1):
        first, second = extractedWords[i], extractedWords[i + 1]
        collocDict[first].add(second)
        collocDict[second].add(first)
    return collocDict
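The expected sets for the poem fragment can be verified with a compact sketch that works on an in-memory string. It uses collections.defaultdict as one possible way to build the sets; the names poem, words and neighbours are introduced only for this example.

```python
import string
from collections import defaultdict

poem = """Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred."""

words = [w.strip(string.punctuation).lower() for w in poem.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

# Each adjacent pair makes the two words collocated with one another.
neighbours = defaultdict(set)
for first, second in zip(words, words[1:]):
    neighbours[first].add(second)
    neighbours[second].add(first)

print(sorted(neighbours['league']))  # ['a', 'half', 'onward']
print(sorted(neighbours['in']))      # ['all', 'the']
```

Note that with this approach a text containing a single word produces no keys at all, so a one-word input would need its key added explicitly.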
Run your function on the file WandP.txt and pickle the resulting dictionary in a file called mb.colloc.pickle. Please include this file in your submission.
with open('mb.colloc.pickle', 'wb') as f:
    pickle.dump(collocations('WandP.txt'), f)
In this exercise, we'll encounter our old friend the vector yet again, this time taking an object-oriented approach.
Define a class Vector. Every vector should have a dimension (a non-negative integer) and a list or tuple of its entries. The initializer for your class should take the dimension as its first argument and a list or tuple of numbers (ints or floats), representing the vector's entries, as its second argument. Choose sensible default behavior for the case where the user supplies only a dimension and no entries. The initializer should raise a sensible error in the case where the dimension is invalid (i.e., wrong type or a negative number), and should also raise an error in the event that the dimension and the number of supplied entries disagree.
Implement comparison operators for Vector objects. We will say that two Vector objects are equivalent if they have the same coordinates. Otherwise, comparison should be analogous to tuples in Python, so that comparison is done on the first coordinate first, then the second coordinate, then the third, and so on. So, for example, the two-dimensional vector (2, 4) is ordered before (less than) (2, 5). Attempting to compare two vectors of different dimensions should result in an error.
Implement a method Vector.dot that takes a single Vector as its argument and returns the inner product of the caller with the given Vector object. Your method should raise an appropriate error in the event that the argument is not of the correct type or in the event that the dimensions of the two vectors do not agree.
Extend your Vector class to support scalar multiplication. Left- or right-multiplication by a scalar, e.g., 2*v or v*2, where v is a Vector object, should result in a new Vector object with its entries all scaled by the given scalar. We will also follow R and numpy (which you will learn in a few weeks), and use * to denote entrywise vector-vector multiplication, so that for Vector objects v and w, v*w results in a new Vector object, with the $i$-th entry of v*w equal to the $i$-th entry of v multiplied by the $i$-th entry of w. Implement the appropriate operators to support this multiplication operation. Many languages have a convention for dealing with multiplication of vectors that differ in their dimension, but we will punt on this matter. Your method should raise an appropriate error in the event that v and w disagree in their dimensions.
The $p$-norm of a vector $v$ with entries $v_1, \dots, v_d$ is $\|v\|_p = \left( \sum_{i=1}^{d} |v_i|^p \right)^{1/p}$. Strictly speaking, this is only a norm for $p \geq 1$, but that's beside the point. Implement a method Vector.norm that takes a single int or float p as an argument and returns the p-norm of the calling Vector object. Your method should work whether p is an integer or float. Your method should raise a sensible error in the event that $p$ is negative.
Hint: see https://docs.python.org/3/library/functions.html#float for documentation on representing positive infinity in Python.
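As a worked example of the norm, the sketch below computes it for a plain list of numbers. It assumes the usual definition $(\sum_i |v_i|^p)^{1/p}$ for finite $p > 0$, and follows the conventions of the solution code for the two special cases: $p = 0$ counts the non-zero entries and $p = \infty$ returns the largest absolute entry. The helper name p_norm is hypothetical and is not part of the Vector class.

```python
import math

def p_norm(entries, p):
    """Hypothetical helper: p-norm of a sequence of numbers."""
    if p < 0:
        raise ValueError('p should be non-negative!')
    if p == 0:
        # Convention assumed here: number of non-zero entries
        return float(sum(1 for x in entries if x != 0))
    if p == math.inf:
        # Largest absolute entry; default=0 covers an empty sequence
        return float(max((abs(x) for x in entries), default=0))
    return sum(abs(x) ** p for x in entries) ** (1 / p)

v = [3, -4, 0]
print(p_norm(v, 0))             # 2.0
print(p_norm(v, 1))             # 7.0
print(p_norm(v, 2))             # 5.0
print(p_norm(v, float('inf')))  # 4.0
```

Comparing against math.inf works for float('inf') as well, since the two are equal.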
class Vector:
    # 1. Constructor for Vector Class
    def __init__(self, dimension, entries=None):
        # Error Checking
        if not isinstance(dimension, int):
            raise TypeError('Dimension should be an int!')
        if dimension < 0:
            raise ValueError('Dimension cannot be negative!')
        if entries is not None:
            if not isinstance(entries, (list, tuple)):
                raise TypeError('Entries should be a list or a tuple!')
            if not all(isinstance(elmt, (int, float)) for elmt in entries):
                raise TypeError('Elements of list/tuple should be of type int/float!')
            if dimension != len(entries):
                raise ValueError('Dimension and number of elements in entries disagree!')
        # Initialize Instance Attributes
        self.dimension = dimension
        if entries is None:
            # Default to the zero vector when no entries are supplied
            self.entries = tuple(0 for _ in range(dimension))
        else:
            self.entries = tuple(entries)
    # 2. I made vector entries into a tuple, to protect vectors from being modified. Any
    # vector modifications or vector operations should instead return a new Vector.
    # 3. I made the dimension and entries instance attributes, because I want every object
    # to have its own copy of these attributes.
    # 4. Defining operators for Vector Class
    # Equality '=='
    def __eq__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries == other.entries
    # Not equal '!='
    def __ne__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries != other.entries
    # Less than '<'
    def __lt__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries < other.entries
    # Greater than '>'
    def __gt__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries > other.entries
    # Less than or equal '<='
    def __le__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries <= other.entries
    # Greater than or equal '>='
    def __ge__(self, other):
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        return self.entries >= other.entries
    # 5. Vector.dot method
    def dot(self, other):
        if not isinstance(other, Vector):
            raise TypeError('Should supply argument of type Vector!')
        if self.dimension != other.dimension:
            raise ValueError('Vectors should have the same dimension!')
        # Sum of entrywise products, returned as a float
        return float(sum(x * y for x, y in zip(self.entries, other.entries)))
    # 6. Vector and Scalar multiplication
    def __mul__(self, other):
        if isinstance(other, (int, float)):
            # Scalar multiplication: return a new Vector with every entry scaled
            return Vector(self.dimension, [other * x for x in self.entries])
        if isinstance(other, Vector):
            if self.dimension != other.dimension:
                raise ValueError('Vectors should have the same dimension!')
            # Entrywise vector-vector multiplication, also returned as a Vector
            return Vector(self.dimension, [x * y for x, y in zip(self.entries, other.entries)])
        raise TypeError('Should only multiply with scalars or Vectors!')
    def __rmul__(self, other):
        # Both kinds of multiplication are commutative, so 2*v reuses v*2
        return self.__mul__(other)
    # 7. Vector.norm method
    def norm(self, p):
        if not isinstance(p, (int, float)):
            raise TypeError('Input should be of type int/float!')
        if p < 0:
            raise ValueError('Input should be non-negative!')
        if p == 0:
            # Number of non-zero entries
            return float(sum(1 for x in self.entries if x != 0))
        elif p == float('inf'):
            # Largest absolute entry; default=0 covers the zero-dimensional vector
            return float(max((abs(x) for x in self.entries), default=0))
        else:
            return sum(abs(x)**p for x in self.entries)**(1 / p)
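The six comparison methods above repeat the same dimension check. One possible alternative, sketched here rather than taken from the solution, is functools.total_ordering from the standard library, which derives the remaining ordering operators from __eq__ and __lt__. MiniVector is a hypothetical stand-in that holds only a tuple of entries.

```python
from functools import total_ordering

@total_ordering
class MiniVector:
    """Hypothetical stand-in for Vector, holding only a tuple of entries."""

    def __init__(self, entries):
        self.entries = tuple(entries)

    def _check(self, other):
        # Shared validation used by every comparison
        if not isinstance(other, MiniVector):
            raise TypeError('Can only compare with another MiniVector!')
        if len(self.entries) != len(other.entries):
            raise ValueError('Vectors should have the same dimension!')

    def __eq__(self, other):
        self._check(other)
        return self.entries == other.entries

    def __lt__(self, other):
        self._check(other)
        # Tuples compare lexicographically: first entry, then second, ...
        return self.entries < other.entries

print(MiniVector((2, 4)) < MiniVector((2, 5)))   # True
print(MiniVector((2, 5)) >= MiniVector((2, 4)))  # True
```

The derived operators call __lt__ and __eq__, so a dimension mismatch still raises ValueError for <=, > and >= as well. The trade-off is slightly slower comparisons in exchange for less repeated code.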