Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.
Homework 4 is due Thursday, 2/13 at 11:59pm. Start early so that you can come to office hours if you're stuck. Check the website for the office hours schedule.Late work will not be accepted as per the policies of this course.
# Don't change this cell; just run it.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic.\n",
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
The following table contains the words from four chapters of Charles Dickens' A Tale of Two Cities. We're going to compute some simple facts about each chapter. Since we're performing the same computation on each chapter, it's best to encapsulate each computational procedure in a function, and then call the function several times. Run the cell to get a table with one column.
# Just run this cell to load the data.
tale_chapters = Table.read_table("tale.csv")
tale_chapters
...
word_count(tale_chapters.column("Chapter text").item(0))
Question 1.2. Create an array called chapter_lengths
which contains the length of each chapter in tale_chapters
.
Hint: Consider using apply
along with the function you have defined in the previous question.
chapter_lengths = ...
chapter_lengths
Question 1.3. Write a function called character_count
. It should take a string as its argument and return the number of characters in that string that aren't spaces (" "), periods ("."), exclamation marks ("!"), or question marks ("?"). Remember that tale_chapters
is a table, and that the function takes in only the text of one chapter as input.
Hint: Try using the string method replace
several times to remove the characters we don't want to count.
...
Question 1.4. Write a function called chapter_number
. It should take a single argument, the text of a chapter from our dataset, and return the number of that chapter, as a Roman numeral. (For example, it should return the string "I" for the first chapter and "II" for the second.) If the argument doesn't have a chapter number in the same place as the chapters in our dataset, chapter_number
can return whatever you like.
To help you with this, we've included a function called text_before
. Its documentation describes what it does.
def text_before(full_text, pattern):
"""Finds all the text that occurs in full_text before the specified pattern.
Parameters
----------
full_text : str
The text we want to search within.
pattern : str
The thing we want to search for.
Returns
-------
str
All the text that occurs in full_text before pattern. If pattern
doesn't appear anywhere, all of full_text is returned.
Examples
--------
>>> text_before("The rain in Spain falls mainly on the plain.", "Spain")
'The rain in '
>>> text_before("The rain in Spain falls mainly on the plain.", "ain")
'The r'
>>> text_before("The rain in Spain falls mainly on the plain.", "Portugal")
'The rain in Spain falls mainly on the plain.'
"""
return np.array(full_text.split(pattern)).item(0)
def chapter_number(chapter_text):
...
Below we load tables containing 200,000 weekday Uber rides in the Manila, Philippines, and Boston, Massachusetts metropolitan areas from the Uber Movement project. The sourceid
and dstid
columns contain codes corresponding to start and end locations of each ride. The hod
column contains codes corresponding to the hour of the day the ride took place. The ride time
table contains the length of the ride, in minutes.
boston = Table.read_table("boston.csv")
manila = Table.read_table("manila.csv")
print("Boston Table")
boston.show(4)
print("Manila Table")
manila.show(4)
Question 2.1. Produce histograms of all ride times in Boston and in Manila, using the given bins. Please put the code for both of them in the following cell, and put the ride times for Boston first.
bins = np.arange(0, 120, 5)
...
...
Question 2.2. Set the two variables below to estimates of what percentage of rides are less than 10 minutes in Boston and Manila. Find your estimates by visually assessing the histograms. Your solution should consist of only mathematical operations and numbers.
boston_under_10 = ...
manila_under_10 = ...
Question 2.3. Comment on the main difference between the two histograms. What might be causing this?
Your Answer Here:
The following two questions are optional!
Please do make an attempt at them, but they will not be incorporated into the final grading of this homework.
Optional Question 2.4. The hod
column in each table represents the hour of the day during which the Uber was called. 0 corresponds to 12-1 AM, 1 to 1-2 AM, 13 to 1-2 PM, etc. Write a function which takes in a table like boston
or manila
, and an hod
number between 0 and 23, and displays a histogram of ride lengths from that hour in that city. Use the same bins as before.
def hist_for_time(tbl, hod):
bins = np.arange(0, 120, 5)
...
#DO NOT DELETE THIS LINE!
hist_for_time(boston, 12)
Optional Question 2.5. Which city has a larger difference between Uber ride times at 10 AM vs. 10 PM? In other words, which is larger: the difference between 10 AM and 10 PM Uber ride times in Manila or the difference between 10 AM and 10 PM uber ride times in Boston. Use the function you just created to answer this question. You do not need to calculate an actual difference.
Assign larger_diff
to the number 1 if the answer is Manila, and 2 if the answer is Boston.
larger_diff = ...
In the next cell, we load a dataset created by the Daily Cal which contains Berkeley faculty, their departments, their positions, and their gross salaries in 2015. (Unfortunately this information is not publicly available for Yale!)
raw_profs = Table.read_table("faculty.csv").where("year", are.equal_to(2015)).drop("year", "title")
profs = raw_profs.relabeled("title_category", "position")
profs
We want to use this table to generate arrays with the names of each professor in each department.
Question 3.1 Set prof_names
to a table with two columns. The first column should be called "department" and have the name of every department once, and the second column should be called "faculty" and contain an array of the names of all faculty members in that department.
Hint: Think about how group
works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to group
.
# Pick between the two functions defined below
def identity(array):
return array
def first(array):
return array.item(0)
prof_names = ...
prof_names
Question 3.2 At the moment, the name
column is sorted by last name. Would the arrays you generated in the previous part be the same if we had sorted by first name instead before generating them? Two arrays are the same if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. Explain your answer. If you feel you need to make certain assumptions about the data, feel free to state them in your response.
Write your answer here, replacing this text.
Question 3.3 Set biggest_range_dept
to the name of the department with the largest salary range, where range is defined as the difference between the lowest and highest salaries in the department.
Hint: First you'll need to define a new function salary_range
which takes in an array of salaries and returns the salary range of the corresponding department. Then, set department_ranges
to a table containing the names and salary ranges of each department.
# Define salary_range in this cell
...
...
department_ranges = ...
biggest_range_dept = ...
biggest_range_dept
Once you're finished, submit your assignment as a .ipynb (Jupyter Notebook) and .pdf (download as .html, then print to save as a .pdf) on the class Canvas site.