Projects using Python:
We can boil down programming into two main 'entities': Data and Operations. In a program, we have some information, or data, and we perform some operations on it to produce the desired output. So let's see how these two are represented in Python.
As you can imagine, data can come in several forms and be packaged in many ways. The fundamental Python data types that we will cover here are the following: integers, floats, booleans, and strings. We can store the values of these different types in what we call 'variables'. You can just think of them as containers with a name (no spaces or special characters!) that have a piece of data. Below are some examples of how we can work with these data types.
# this is a comment. anything that follows a '#' symbol in Python is ignored by the interpreter
# here are two ways we can represent numbers
#integers a.k.a int
x = 4 #here we store the integer value of 4 in the variable x
y = 2
#floating point numbers
a = 0.5
b = 0.1
# strings are sequences of characters. they are always contained in single or double quotes
s = "Hello, world."
#booleans are data types that can take on only the values 'True' or 'False':
n = False
m = True
#finally, we have the 'None' type which is basically nothing, it's like an empty variable
t = None
## so now what we've done is store some values in some variables. how can we see what those values are? print!
print(x)
print(y)
print(a, b)
#print the rest of the variables
print(n)
4 2 0.5 0.1 False
#another useful thing is type-casting where we can convert some types into other types.
x = 5.0
y = "5.0"
#converts a floating point number to an integer
print(int(x))
#converts the string "5.0" into a numeric value
print(float(y))
5 5.0
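Type-casting has a couple of quirks worth seeing once. Here is a small sketch (the variable names are made up for illustration) showing that int() simply drops the decimal part, and that int() cannot read a string like "5.0" directly, so you go through float() first.

```python
# A small sketch of type-casting; all variable names here are made up for illustration.
as_int = int(5.9)          # float -> int drops the decimal part (it does not round)
as_float = float("5.0")    # numeric string -> float
as_text = str(42)          # number -> string

# int() cannot parse a string that contains a decimal point directly...
try:
    int("5.0")
    direct_cast_worked = True
except ValueError:
    direct_cast_worked = False

# ...so we convert to a float first, and then to an int
two_step = int(float("5.0"))

print(as_int, as_float, as_text, direct_cast_worked, two_step)
```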
# we can do some basic operations on these data types
print(4 + 4) #add or subtract two integers
print(4/5) # divide
8 0.8
# we can assign the value of a variable back to itself.
print(x)
x = x + 5
print(x)
5.0 10.0
#we can 'concatenate' two strings
print(s + " How are you?")
Hello, world. How are you?
# This is a list:
l = [1, 2, 3, 4]
# we can access each element in a list with its index (starting from 0)
print(l)
print(l[0]) #prints the first element in the list
print(l[1]) #prints the second element in the list
print(l[-1]) #what does this print?
[1, 2, 3, 4] 1 2 4
#list slicing. a very useful way to access sub-lists in python
#say we want to make a list that only has the elements at positions 2 and 3 in list l (the slice stops just before position 4)
sliced = l[2:4]
print(sliced)
#now say we want a list that contains everything but the first number in l
everything_but_first = l[1:]
print(everything_but_first)
#now make a list that has everything except the last element in the list (remember the -1 index)
everything_but_last = l[:-1]
print(everything_but_last)
[3, 4] [2, 3, 4] [1, 2, 3]
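The general pattern for a slice is list[start:stop:step], where start is included, stop is excluded, and any of the three can be left out. Here is a rough sketch on a made-up list of letters (the list and variable names are just for illustration):

```python
letters = ['a', 'b', 'c', 'd', 'e']  # made-up list for illustration

first_two = letters[:2]        # from the start up to (not including) index 2
from_two_on = letters[2:]      # from index 2 to the end
middle = letters[1:4]          # indices 1, 2 and 3
every_other = letters[::2]     # a third number sets the step size
reversed_copy = letters[::-1]  # a step of -1 walks the list backwards

print(first_two, from_two_on, middle, every_other, reversed_copy)
```

Slicing always builds a new list, so the original letters list is left unchanged.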
#we can modify the values in the list
l[1] = 3000
print(l[1])
print(l)
3000 [1, 3000, 3, 4]
#we can add values to the end of the list with the append() function
print("adding a value")
l.append("hi!")
print(l)
adding a value [1, 3000, 3, 4, 'hi!']
Tuples are like lists, but they are less flexible. Unlike lists, they have a fixed size, and you can't reassign elements to them once they're assigned. The advantage is that they are more memory efficient than lists. So it's good to use them when you know that you will not be changing your data around much.
#declare a tuple. use round brackets instead of square brackets
person = ('Martin', 50, 1.65)
print(person)
#you can still access its elements just like in a list
print(person[1])
('Martin', 50, 1.65) 50
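To see what "less flexible" means in practice, here is a short sketch. It assumes the three values in the tuple stand for a name, an age, and a height (that reading is a guess, the variable names are made up). It shows tuple unpacking and what happens when you try to change an element.

```python
person = ('Martin', 50, 1.65)

# unpacking: assign each element to its own variable in one line
# (the meaning of each value is an assumption made for this example)
name, age, height = person

# trying to change an element of a tuple raises a TypeError
try:
    person[1] = 51
    could_modify = True
except TypeError:
    could_modify = False

print(name, age, height, could_modify)
```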
Dictionaries are one of the most useful data structures in Python because they are very powerful for organizing data. A dictionary is like a list, except every element is indexed by a 'key' instead of a number like we saw with lists and tuples.
d[key] = value
# dictionaries are initialized with curly braces
d = {} # is an empty dictionary
#let's add some keys to the dictionary and give it a value.
d['heights'] = []
d['weights'] = []
print(d)
{'weights': [], 'heights': []}
#now we can append values to the lists stored under each key
d['heights'].append(165.4)
d['weights'].append(221.2)
print(d)
#if we want to get the 'keys' of the dictionary we use the keys() function
print(d.keys())
print(d['heights'])
{'weights': [221.2], 'heights': [165.4]} dict_keys(['weights', 'heights']) [165.4]
As you may have started to notice, it is possible to store any kind of mixture of data into lists, tuples, and dictionaries. Here are some examples:
mix = ['hi', 1, 2, ('a', 2, 'e')]
print(mix)
d2 = {1: [1, 2, 3], 'bob': 4}
print(d2)
['hi', 1, 2, ('a', 2, 'e')] {1: [1, 2, 3], 'bob': 4}
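Coming back to dictionaries for a moment, two things come up all the time: walking through the keys and values together, and asking for a key that might not exist. Here is a small sketch of both, using a dictionary shaped like the one above (the 'ages' key is made up for the example):

```python
d = {'heights': [165.4], 'weights': [221.2]}

# items() gives us each key together with its value
for key, value in d.items():
    print(key, value)

# the 'in' operator checks whether a key exists in the dictionary
has_ages = 'ages' in d

# get() returns a fallback value instead of raising a KeyError for a missing key
ages = d.get('ages', [])

print(has_ages, ages)
```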
Now that we have an idea of how Python stores data, we would like to be able to do something interesting with it. That is, perform operations on the data in an efficient manner.
for loops
Loops make Python repeat a set of commands a given number of times. for loops are by far the most widely used kind of loop.
# for loops store each item in the list, one at a time, in the variable following the 'for'.
#let's iterate through that 'mix' list that we made in the previous cell and print each item in the list
for i in mix:
    print(i)
hi 1 2 ('a', 2, 'e')
# we can also iterate through a range of numbers by using the range(n) command which returns a special kind of
# list (not exactly a normal list) containing the integers within the specified range
for i in range(10):
    print(i)
0 1 2 3 4 5 6 7 8 9
#write a for loop that adds the numbers from 1 to n. declare any variables you may need
count = 0
n = 10
#range(1, n + 1) yields 1, 2, ..., n (range(n) alone would yield 0, 1, ..., n - 1)
for i in range(1, n + 1):
    count = count + i
print(count)
55
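A handy habit is to cross-check a loop like this against another way of getting the same answer. The sketch below adds the numbers from 1 to n three ways: with an explicit loop, with the built-in sum() function, and with the closed-form formula n(n+1)/2. The variable names are made up for the example.

```python
n = 10

# 1. explicit loop over 1, 2, ..., n (range stops just before its second argument)
total = 0
for i in range(1, n + 1):
    total = total + i

# 2. the built-in sum() function applied to the same range
total_builtin = sum(range(1, n + 1))

# 3. the closed-form formula n * (n + 1) / 2, using integer division
total_formula = n * (n + 1) // 2

print(total, total_builtin, total_formula)
```

If the three numbers disagree, the usual suspect is an off-by-one in the range() arguments.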
if statements
if statements allow us to control which parts of the code are executed depending on a condition. This condition is expressed as a boolean that can be True or False. If the condition is True, the code contained in the if statement executes. Otherwise, it gets skipped.
#first, a bit more about booleans. we can compare two booleans using the '==' operator to obtain a 'True' if they are
# equal and 'False' if they are not equal.
print(True == True)
print(True == False)
True False
# We can perform some operations on boolean variables to better express conditions
#the 'and' operation gives a True boolean if both elements are true
a,b = True, True
print(a and b)
a,b = True, False
print(a and b)
# the 'or' operator gives True if either one (or both) of the elements is true
print(a or b)
# the 'not' operator simply gives you the opposite of the given element
print(not a)
True False True False
feel_like_it = True
raining = False
date_tonight = True
#write a boolean to decide if you should go to the gym based on the three booleans above.
go_to_gym = (feel_like_it and not raining) or date_tonight
print(go_to_gym)
#play around with different values of the variables!
True
We can use if statements to make our loops more powerful.
#write a for loop that prints every even number below n. (hint: use the modulo operator, which returns the
# remainder of dividing two numbers. e.g. 5 % 2 = 1, 10 % 2 = 0)
n = 25
for i in range(n):
    if i % 2 == 0:
        print(i)
    else:
        continue #the 'continue' statement lets you skip to the next iteration in the loop
0 2 4 6 8 10 12 14 16 18 20 22 24
if feel_like_it and raining: #whatever follows the if has to be a boolean statement
    print("i feel like it, but it's raining")
#if the previous clause is not met, the 'elif' or 'else if' block is checked
elif date_tonight:
    print("i have a date tonight")
#if neither the if nor the elif matches then we go into the 'else'
else:
    print("i should just stay home")
i have a date tonight
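Try an example for yourself. This is a famous programming challenge to test if statements, known as the 'fizzbuzz' test. You have to print the numbers from 0 to 49, following three rules:
1. if the number is a multiple of 3, print "fizz" instead of the number
2. if the number is a multiple of 5, print "buzz" instead of the number
3. if the number is a multiple of both 3 and 5, print "fizzbuzz" instead of the number
There are many different ways to do this, so take a couple of minutes to come up with yours. (hint: use the % operator).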
for i in range(50):
    if i % 15 == 0:
        print("fizzbuzz")
    elif i % 5 == 0:
        print("buzz")
    elif i % 3 == 0:
        print("fizz")
    else:
        print(i)
fizzbuzz 1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 17 fizz 19 buzz fizz 22 23 fizz buzz 26 fizz 28 29 fizzbuzz 31 32 fizz 34 buzz fizz 37 38 fizz buzz 41 fizz 43 44 fizzbuzz 46 47 fizz 49
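Since there are many ways to write fizzbuzz, here is one alternative sketch for comparison. Instead of checking for multiples of 15, it builds the word piece by piece, and it collects the results in a list (called results here, a made-up name) rather than printing them one at a time.

```python
results = []
for i in range(16):
    word = ""
    # a multiple of 3 contributes "fizz"
    if i % 3 == 0:
        word = word + "fizz"
    # a multiple of 5 contributes "buzz" (so multiples of both get "fizzbuzz")
    if i % 5 == 0:
        word = word + "buzz"
    # an empty word means i is a multiple of neither 3 nor 5
    if word == "":
        results.append(i)
    else:
        results.append(word)

print(results)
```

Note the two separate if statements (not if/elif): both checks need to run for every number.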
List comprehensions are a very nice Python feature that allows you to make lists in a single line. Let's see how they work.
b = []
for i in range(10):
    b.append(i)
print(b)
b_comprehension = [i for i in range(10)]
print(b_comprehension)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# this will grow a list where each element in the list is whatever the statement preceding
#the 'for' evaluates to
numbers_times_two = [n*2 for n in range(11)]
print(numbers_times_two)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
# we can also add if statements in the list comprehension
odd_numbers = [i for i in range(10) if i % 2 != 0]
print(odd_numbers)
[1, 3, 5, 7, 9]
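It can help to read a comprehension as a compressed for loop. The sketch below writes the same list twice, once as a loop with an if statement and once as a comprehension that both filters and transforms. The variable names are made up for the example.

```python
# loop version: keep the even numbers and square them
even_squares_loop = []
for i in range(10):
    if i % 2 == 0:
        even_squares_loop.append(i * i)

# comprehension version: [expression  for item in iterable  if condition]
even_squares = [i * i for i in range(10) if i % 2 == 0]

print(even_squares_loop)
print(even_squares)
```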
#now try one yourself. make a list comprehension where
#each item is a tuple (number, number*number)
square_tuples = [(i, i*i) for i in range(10)]
print(square_tuples)
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81)]
That day of the week function was super useful! Let's say we want to use that code again many times, but sometimes we want it to find a different day of the week. We would have to change our if statement each time and re-run the code. This seems like a bit of a pain. Thankfully there is a better way.. functions! Think of a function as a little machine that takes in some input and does something to it and returns some output. So let's turn the days of the week finder into a function.
#the first line of every function is a header. headers have 3 parts. the 'def' keyword which tells python you are
#about to declare a function, then the name of the function, and finally the inputs to the function
def day_finder(day_to_find):
    days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    for day in days_of_week:
        if day == day_to_find:
            #the return statement ends the function and 'sends' its output to whoever called it
            return "Thank God it's %s." % (day) # the %s in the string is like a placeholder that gets filled with
            # the value of the string variable 'day'
    return "no match"
#you can call any function just by typing its name and the arguments you want it to work on in parentheses ()
match = day_finder("Tuesday") #the comparison is case-sensitive, so "tuesday" would give "no match"
print(match) #"Thank God it's Tuesday." is the return value of the function day_finder()
Thank God it's Tuesday.
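One thing to keep in mind is that comparing strings with == is case-sensitive, so "tuesday" and "Tuesday" count as different strings. Here is a sketch of one way to make the lookup more forgiving. The function name day_finder_any_case is made up for this example, and it relies on the string method lower(), which returns a lowercase copy of a string.

```python
def day_finder_any_case(day_to_find):
    """Hypothetical variant of day_finder() that ignores upper/lower case."""
    days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                    'Friday', 'Saturday', 'Sunday']
    for day in days_of_week:
        # compare lowercase copies so that the case of the input doesn't matter
        if day.lower() == day_to_find.lower():
            return "Thank God it's %s." % (day)
    # we only get here if none of the days matched
    return "no match"

print(day_finder_any_case("tuesday"))
print(day_finder_any_case("Funday"))
```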
a = "hi im here"
def test(inp):
    scope_test = 0 #scope_test is created inside the function
    print(a) #a was defined outside the function, so it is visible in here
test("hi")
hi im here
print(scope_test) #fails: scope_test only exists inside test()
NameError: name 'scope_test' is not defined
As you may have noticed, it would be a mess if every variable were accessible from everywhere in the code. In Python, a variable defined at the top level of a file can be read from anywhere, while a variable defined inside a function only exists inside that function. Note that indentation blocks such as for loops and if statements do not create a scope of their own.
Example:
a = 3
def fun():
    b = 3
    for i in range(10):
        i = i + 1
In this example a can be accessed from anywhere in the code. b and i can only be seen inside the function fun(). Because the for loop does not create its own scope, i is still visible in the rest of fun() after the loop has finished.
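Strictly speaking, Python draws the scope boundary at functions rather than at every indented block: a variable created inside a for loop is still around after the loop ends, while a variable created inside a function is not visible outside of it. Here is a small sketch to check this for yourself (the names last, result, and b_visible are made up for the example):

```python
a = 3  # defined at the top level, so it can be read everywhere below

def fun():
    b = 3
    for i in range(10):
        last = i + 1
    # the loop has finished, but i and last still exist inside fun()
    return a, b, i, last

result = fun()
print(result)

# b lives only inside fun(), so looking it up out here raises a NameError
try:
    b
    b_visible = True
except NameError:
    b_visible = False

print(b_visible)
```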
Just as we generalized a piece of code that runs on similar kinds of input and gave it a name in the functions example, we can do the same thing with just about any piece of Python. Objects let us easily work with instances of some kind of thing, where each instance might be a little different from the others but still behaves according to its type of thing.
Let's illustrate this with our thing being a Recipe. We can have many different types of recipes but we would like to be able to deal with each one in a uniform manner. We therefore define a Recipe as an object that is described by some 'attributes'.
# every object is defined in a 'class' and classes are declared with the 'class' keyword
class Recipe:
    def __init__(self, n, ig, ml, pt):
        self.name = n
        self.ingredients = ig #set the ingredients attribute of the object to the value of ig
        self.meal = ml
        self.prep_time = pt
r = Recipe("hot dog", ["bread", "sausage"], "snack", 15)
print(r.ingredients)
['bread', 'sausage']
#now that we've defined what a Recipe should look like, we can nicely store some new recipes.
#let's make some recipes
sandwich = Recipe('sandwich', ['ham', 'bread', 'cheese'], 'lunch', 10)
cake = Recipe('cake', ['flour', 'sugar', 'eggs'], 'dessert', 90)
#add two more recipes of your own!
# now we can very conveniently access some information about our recipes by name. This can be done with the '.'
# operator that we saw above
#print the ingredients in the sandwich, and the preparation time of the cake
print(sandwich.ingredients)
print(cake.prep_time)
['ham', 'bread', 'cheese'] 90
Let's go even further and make a new object to contain our Recipe objects. Let's call this class Cookbook and it will simply contain a list of Recipe objects. However, it will not only have some data variables, but it will also be able to perform functions.
class Cookbook:
    def __init__(self, r):
        self.recipes = r
    #functions declared inside a class are called 'methods' and they act on an object
    def get_vegetarian(self):
        """This function is going to return recipes in the cookbook that do not contain meat."""
        meats = ['ham', 'beef', 'chicken', 'fish']
        veg_dishes = []
        #hint: use the 'in' operator which returns true if a value is in the
        #given list. e.g.: 5 in [1, 3, 5] returns True.
        #the 'self' argument is used to access the object the function is being called on
        for recipe in self.recipes:
            #check whether any of the meats appears in this recipe's ingredients
            has_meat = False
            for meat in meats:
                if meat in recipe.ingredients:
                    has_meat = True
            #only keep the recipes that contain no meat
            if not has_meat:
                veg_dishes.append(recipe)
        return veg_dishes
    #write a method that returns a list of recipes that take under 'n' minutes to prepare
    def preptime(self, n):
        matches = []
        for r in self.recipes:
            if r.prep_time < n:
                matches.append(r)
        return matches
# let's make a cookbook
#gather some recipes
recipes = []
recipes.append(Recipe('steak', ['beef', 'butter', 'mashed potatoes'], 'main', 120))
recipes.append(Recipe('toast', ['nutella', 'bread'], 'snack', 5))
recipes.append(Recipe('salad', ['lettuce', 'kale', 'tomatoes'], 'main', 15))
recipes.append(Recipe('chicken parm', ['chicken', 'sauce', 'cheese'], 'main', 90))
recipes.append(Recipe('brownie', ['chocolate', 'flour', 'eggs'], 'dessert', 15))
#put them in a cookbook
cookbook = Cookbook(recipes)
print([r.name for r in cookbook.get_vegetarian()])
['toast', 'salad', 'brownie']
#get vegetarian recipes
print(cookbook.get_vegetarian())
#get recipes that take less than 20 minutes to prepare
print(...)
#oops what is that? because our functions are returning lists of objects we are seeing how the objects are represented
# when we print them. What you see are addresses in memory where the objects are stored. Let's print it in a way
# we can understand
#print the name and ingredients of recipes that take under 20 minutes to prepare
short_recipes = cookbook.preptime(20)
for s in short_recipes:
    print(s.name, s.ingredients)
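If you get tired of seeing memory addresses when you print objects, one option is to give the class a __repr__ method, which Python calls whenever it needs to display an object. The sketch below is a simplified copy of the Recipe class with a __repr__ added. The exact text it returns is just one possible choice, and the list name quick is made up for the example.

```python
class Recipe:
    def __init__(self, n, ig, ml, pt):
        self.name = n
        self.ingredients = ig
        self.meal = ml
        self.prep_time = pt

    def __repr__(self):
        # Python calls __repr__ whenever it needs to display the object,
        # for example when printing a list of recipes
        return "Recipe(%s, %s min)" % (self.name, self.prep_time)

quick = [Recipe('toast', ['nutella', 'bread'], 'snack', 5),
         Recipe('salad', ['lettuce', 'kale', 'tomatoes'], 'main', 15)]

# the list now prints in a readable form instead of as memory addresses
print(quick)
```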
We've now covered most of the basics you need to get up and running. However, one of the nicest things about Python is that it has a fairly extensive 'standard library'. The standard library is a set of functions that come with Python that do a variety of useful things so that you don't have to reinvent the wheel each time you write a program. We've already seen an example with the range() function. I'm going to give some examples of some of the most useful functions in the standard library, but you should always check if what you are trying to do has already been implemented to save you time.
Some functions are directly built-in and some you have to import from a module, which is just the name of a Python program whose functions you can use in your code.
#help prints what the given function does
help(abs)
Help on built-in function abs in module builtins: abs(x, /) Return the absolute value of the argument.
#a set is a group of unique items from a list
pop_stars = ["beyonce", "rihanna", "lady gaga", "lady gaga"]
hip_hop_stars = ["rihanna", "nicki minaj", "lil kim"]
pop = set(pop_stars)
print(pop)
hip_hop = set(hip_hop_stars)
print(pop, hip_hop)
#you can perform some set operations. look up the Python
# set documentation and print the intersection between both sets.
overlap = pop.intersection(hip_hop)
print(overlap)
{'rihanna', 'lady gaga', 'beyonce'} {'rihanna', 'lady gaga', 'beyonce'} {'lil kim', 'rihanna', 'nicki minaj'} {'rihanna'}
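Intersection is only one of the set operations. Here is a short sketch of the three most common ones on the same two sets, using the operator shorthand (&, | and -) that Python sets support. The result names are made up for the example.

```python
pop = {"beyonce", "rihanna", "lady gaga"}
hip_hop = {"rihanna", "nicki minaj", "lil kim"}

both = pop & hip_hop      # intersection, same as pop.intersection(hip_hop)
either = pop | hip_hop    # union: everyone who appears in at least one set
only_pop = pop - hip_hop  # difference: in pop but not in hip_hop

# sets have no fixed order, so we sort them to get a predictable printout
print(sorted(both), sorted(either), sorted(only_pop))
```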
#sort a list
s = sorted([1, 5, 3, 2])
print(s)
#max and min of a list
print(max(s))
print(min(s))
#print the sum of a list's elements
print(sum(s))
#get length of list
print(len(s))
[1, 2, 3, 5] 5 1 11 4
#write a for loop that iterates through each *index* in test_list and prints the
# index (Hint: combine range() and len())
test_list = [1, 2, 3, 4, 5]
for index in range(len(test_list)):
    print(index)
0 1 2 3 4
#the enumerate() function gives you the item and the index of a list at the same time
string_list = ["carlos", "steve", "joe"]
for i, name in enumerate(string_list):
    print(i, name)
0 carlos 1 steve 2 joe
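A small extra that is easy to miss: enumerate() takes an optional start argument, which is handy when you want the numbering to begin at 1 instead of 0. Here is a quick sketch (the list name numbered is made up for the example):

```python
string_list = ["carlos", "steve", "joe"]

numbered = []
# start=1 only changes the counter, the items themselves are unchanged
for position, name in enumerate(string_list, start=1):
    numbered.append("%s. %s" % (position, name))

print(numbered)
```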
#let's import some libraries so we can use their functions
import math
import random
#python numerical tools
import numpy as np
#generate a random floating point number between 0 and 1
ran = random.random()
print(ran)
#print a random integer in the given range
random_integer = random.randint(2, 100)
print(random_integer)
#sample randomly from a list
print(np.random.choice(["watermelon", "mango", "pineapple"], p=[0.5, 0.25, 0.25], replace=False))
0.7928981520871538 5 watermelon
#compute e^(n)
print(math.exp(10))
#compute x^n
print(math.pow(2,3))
22026.465794806718 8.0
# you can remove characters from both ends of a string with the strip() function
disney = "I'm. Walt. Disney."
print(disney.strip()) #with no argument, strip() removes whitespace from both ends (there is none to remove here)
#the join function inserts characters between elements in a list and joins them into a string
print("\t".join(["1", "2", "3"]))
#the opposite of join is the split() function which breaks a string up into a list of sub-strings
print(disney.split(".")) #you can also tell split() which characters to break on
I'm. Walt. Disney. 1 2 3 ["I'm", ' Walt', ' Disney', '']
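Here is a sketch that puts strip(), split(), and join() together to clean up the string above. It shows that strip() works on both ends of a string, that you can pass it the characters to remove, and how to get rid of the leftover spaces and the empty piece that split(".") produces. The variable names are made up for the example.

```python
padded = "   hello   "
both_ends = padded.strip()    # whitespace removed from both ends
left_only = padded.lstrip()   # only from the left
right_only = padded.rstrip()  # only from the right

disney = "I'm. Walt. Disney."
no_final_dot = disney.strip(".")  # strip() can also remove a character we choose

# split on ".", strip the spaces off each piece, and skip pieces that end up empty
parts = [p.strip() for p in disney.split(".") if p.strip()]
rejoined = " ".join(parts)

print(no_final_dot)
print(parts)
print(rejoined)
```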
A very important part of dealing with scientific data is handling files containing your data and loading them into your Python scripts. Thankfully Python also makes this very easy.
#to open a file you just use the open() function along with the 'with' context manager (more on this below)
#the 'with' block takes care of opening and closing the file, and giving us the lines of the file to iterate over
with open("food.csv", "r") as food: # the "r" argument tells open() that we want to read the file
    for item in food:
        print(item.strip()) #let's get rid of the linebreak character
food_item, price, quantity bananas, 1.5, 10 cupcakes, 3, 4 skittles, 2, 200 tacos, 2, 100 chips, 1, 2.5
#we can also write to a file
with open("helloworld.txt", "w+") as hello:
    hello.write("Hello, world!")
#check to see if the file was created
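One way to do that check from Python itself is os.path.exists(). The sketch below writes a small file, confirms that it exists, and reads it back. To avoid cluttering your working folder it writes into a temporary directory created with the tempfile module; that choice, the second line of text, and the variable names are all just for illustration.

```python
import os
import tempfile

# work in a throwaway directory so we don't clutter the current folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "helloworld.txt")

# write() does not add line breaks for us, so we include "\n" ourselves
with open(path, "w") as hello:
    hello.write("Hello, world!\n")
    hello.write("Second line\n")

# os.path.exists() tells us whether there is a file at that path
created = os.path.exists(path)

# read the file back, stripping the line break off each line
with open(path, "r") as hello:
    lines = [line.strip() for line in hello]

print(created, lines)
```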
Most computers today have multiple processors, meaning you can use these processors simultaneously to make your computations go a lot faster. By default, Python code runs as a single process. So if you have some work that can be split up into independent parts, you can easily run those parts in parallel with the multiprocessing module in Python.
import multiprocessing
import time #this module allows us to do some operations involving time.
MAX_PROC = 4
def square(x):
    time.sleep(3) #let's make it a bit slower by making python sleep for 3 seconds before returning the output
    return x*x
to_do = [1, 2, 3, 4, 5, 6, 7, 8]
#let's write a loop that squares each element in that list in series
#record the time at which we start the computation
t_start_serial = time.time()
squared_serial = []
for i in to_do:
    squared_serial.append(square(i))
#get the difference in time between now and the start time to get total time.
t_total_serial = time.time() - t_start_serial
print("Serial job took %s seconds." % (t_total_serial))
print(squared_serial)
Serial job took 24.030616998672485 seconds. [1, 4, 9, 16, 25, 36, 49, 64]
#that was slow! now let's do this in parallel
pool = multiprocessing.Pool(MAX_PROC) #we start a 'pool' of workers that we can send processing jobs to
t_start_para = time.time()
squared_para = []
for squared_number in pool.map(square, to_do):
    squared_para.append(squared_number)
t_total_para = time.time() - t_start_para
print("Parallel job took %s seconds!" % (t_total_para))
print(squared_para)
Parallel job took 6.007381200790405 seconds! [1, 4, 9, 16, 25, 36, 49, 64]
At this point you should have a pretty good idea of the Python basics and some of the extras. Yet, we are still only scratching the surface. Python has applications in pretty much every field of programming and software development. But since we are mainly interested in using it as a tool for data handling, let's do a mini data-science project to get an idea of some more data-specific Python capabilities. Here the goal is to be able to load a dataset efficiently and do some nice visualizations.
The mini project is to take baby name data from the US social security database and try to identify "Hipster" names. This example is inspired by a Kaggle post (https://www.kaggle.com/ryanburge/d/kaggle/us-baby-names/hipster-names). You can load the original dataset here.
I filtered and re-arranged the original data to make it easier for us to handle (but trust me, I used Python for that and it was very easy). You can find the dataset we will use in this exercise in the workshop downloads under the name BabyNames.csv.
Here's what the data looks like:
name, 1880, 1881, 1882, .... , 2014
Aaron, 3, 4, 6, 22, 0, ...., 199
Aaliyah, 0, 0, 0, 0, 1, ...., 100
As you can see, each row in the file is a baby name, and each column contains the number of babies with that name for each year this dataset was collected.
import os
#this is python's main plotting library
import matplotlib.pyplot as plt
#tell the notebook to make plots appear inline
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10
# the os module lets us take care of operating system operations. let's use it to specify
# the path to a file for opening
#let's load the name dataset. Replace the arguments with your own path.
babypath = os.path.join("BabyNames.csv")
#let's first get an idea of what the lines look like.
#print the first 10 lines
with open(babypath, "r") as b:
    for i, line in enumerate(b):
        if i < 10:
            print(line)
        else:
            break
Name,1880,1881,1882,1883,1884,1885,1886,1887,1888,1889,1890,1891,1892,1893,1894,1895,1896,1897,1898,1899,1900,1901,1902,1903,1904,1905,1906,1907,1908,1909,1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014 Aaliyah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0,40.0,46.0,38.0,39.0,28.0,28.0,20.0,21.0,10.0,18.0,14.0,17.0,14.0,12.0,15.0,25.0,22.0,1457.0,1254.0,831.0,1738.0,1404.0,1088.0,1494.0,3360.0,4786.0,3677.0,3496.0,3465.0,3739.0,3951.0,4038.0,4365.0,4663.0,5104.0,5502.0,5225.0,4855.0 
Aaron,102.0,94.0,85.0,105.0,97.0,88.0,86.0,78.0,90.0,85.0,96.0,69.0,95.0,81.0,79.0,94.0,69.0,87.0,89.0,71.0,103.0,80.0,78.0,93.0,117.0,96.0,96.0,130.0,114.0,142.0,145.0,187.0,303.0,417.0,490.0,553.0,588.0,601.0,663.0,645.0,668.0,697.0,705.0,626.0,682.0,657.0,600.0,552.0,560.0,475.0,506.0,466.0,514.0,463.0,477.0,465.0,443.0,470.0,482.0,471.0,513.0,562.0,574.0,558.0,514.0,485.0,577.0,689.0,710.0,802.0,804.0,879.0,944.0,905.0,981.0,1108.0,1291.0,1351.0,1405.0,1556.0,1794.0,1892.0,2030.0,2088.0,2413.0,2445.0,2618.0,2929.0,3450.0,4647.0,6655.0,8477.0,7924.0,9319.0,10601.0,10671.0,11502.0,11780.0,12471.0,13218.0,13342.0,14838.0,14563.0,14627.0,13532.0,13271.0,12824.0,12820.0,14526.0,15414.0,14633.0,14342.0,14593.0,13936.0,14484.0,13368.0,12047.0,11231.0,10603.0,9901.0,9586.0,9568.0,9039.0,8897.0,8443.0,7824.0,8314.0,8955.0,8559.0,7995.0,7473.0,7627.0,7530.0,7289.0,7357.0 Abigail,12.0,8.0,14.0,11.0,13.0,9.0,15.0,13.0,18.0,20.0,9.0,11.0,11.0,21.0,13.0,15.0,15.0,15.0,11.0,7.0,14.0,13.0,18.0,20.0,8.0,11.0,15.0,12.0,13.0,12.0,16.0,18.0,24.0,14.0,25.0,38.0,37.0,41.0,26.0,31.0,33.0,24.0,29.0,32.0,31.0,37.0,30.0,33.0,40.0,33.0,25.0,27.0,30.0,32.0,30.0,37.0,36.0,45.0,50.0,53.0,48.0,40.0,48.0,63.0,56.0,53.0,77.0,78.0,72.0,113.0,111.0,137.0,152.0,167.0,148.0,181.0,189.0,188.0,193.0,206.0,228.0,233.0,264.0,255.0,244.0,211.0,190.0,245.0,263.0,281.0,362.0,392.0,395.0,481.0,605.0,614.0,838.0,796.0,1009.0,1226.0,1587.0,1819.0,1892.0,1920.0,1848.0,1859.0,2011.0,2014.0,2382.0,3423.0,3739.0,3810.0,4003.0,5196.0,7252.0,7833.0,8615.0,9633.0,11117.0,11695.0,13103.0,14822.0,15313.0,15941.0,15512.0,15760.0,15638.0,15488.0,15092.0,14392.0,14237.0,13250.0,12675.0,12377.0,11997.0 
Adam,104.0,116.0,114.0,107.0,83.0,96.0,103.0,84.0,120.0,93.0,74.0,87.0,106.0,87.0,86.0,78.0,96.0,88.0,87.0,74.0,111.0,71.0,99.0,82.0,103.0,101.0,105.0,127.0,109.0,113.0,154.0,189.0,307.0,354.0,489.0,574.0,609.0,614.0,653.0,542.0,579.0,588.0,469.0,435.0,464.0,459.0,425.0,389.0,357.0,308.0,274.0,300.0,257.0,235.0,237.0,238.0,216.0,222.0,214.0,216.0,227.0,230.0,241.0,252.0,221.0,209.0,247.0,270.0,244.0,301.0,306.0,297.0,295.0,332.0,282.0,351.0,394.0,434.0,469.0,585.0,1019.0,1581.0,2155.0,2349.0,2867.0,2578.0,2512.0,2537.0,2557.0,2884.0,4341.0,5883.0,5774.0,6879.0,8489.0,8699.0,9992.0,11096.0,13989.0,17163.0,18965.0,20118.0,20209.0,23612.0,24075.0,20278.0,18244.0,17014.0,16535.0,16987.0,14743.0,12320.0,11939.0,11549.0,11053.0,10495.0,9577.0,8817.0,8367.0,8239.0,8149.0,7762.0,7757.0,7705.0,7505.0,6850.0,6801.0,6786.0,6098.0,5669.0,5102.0,5207.0,5303.0,5232.0,5300.0 Addison,19.0,17.0,21.0,20.0,17.0,17.0,16.0,12.0,20.0,11.0,14.0,10.0,12.0,9.0,20.0,19.0,14.0,8.0,18.0,12.0,11.0,9.0,14.0,15.0,9.0,16.0,7.0,11.0,17.0,13.0,12.0,20.0,33.0,33.0,59.0,62.0,71.0,73.0,76.0,64.0,59.0,78.0,74.0,64.0,58.0,44.0,59.0,65.0,54.0,49.0,57.0,51.0,42.0,41.0,42.0,34.0,35.0,37.0,33.0,37.0,33.0,35.0,40.0,44.0,32.0,26.0,35.0,34.0,25.0,34.0,42.0,40.0,31.0,31.0,41.0,36.0,30.0,33.0,21.0,30.0,32.0,31.0,26.0,30.0,20.0,28.0,26.0,18.0,16.0,27.0,34.0,38.0,28.0,23.0,22.0,28.0,29.0,16.0,25.0,37.0,43.0,25.0,40.0,36.0,44.0,73.0,327.0,324.0,289.0,310.0,319.0,347.0,382.0,481.0,595.0,664.0,778.0,889.0,1041.0,1213.0,1419.0,1586.0,1856.0,2132.0,2484.0,3414.0,8062.0,12284.0,11021.0,10895.0,10509.0,9484.0,8334.0,7848.0,7079.0 
Adrian,18.0,12.0,18.0,14.0,9.0,12.0,17.0,13.0,19.0,16.0,17.0,16.0,18.0,13.0,18.0,18.0,22.0,27.0,17.0,30.0,27.0,29.0,26.0,23.0,30.0,33.0,34.0,31.0,41.0,38.0,42.0,62.0,138.0,164.0,192.0,242.0,253.0,287.0,337.0,295.0,292.0,295.0,290.0,347.0,350.0,284.0,298.0,235.0,242.0,242.0,220.0,233.0,224.0,210.0,232.0,234.0,243.0,247.0,276.0,270.0,266.0,295.0,277.0,258.0,281.0,273.0,338.0,378.0,371.0,415.0,401.0,450.0,469.0,490.0,463.0,519.0,551.0,568.0,542.0,591.0,666.0,741.0,847.0,773.0,947.0,888.0,914.0,1047.0,1189.0,1244.0,1511.0,1737.0,1917.0,1810.0,2013.0,2167.0,2329.0,2534.0,2624.0,2712.0,2737.0,2783.0,2923.0,3022.0,3112.0,3518.0,3248.0,3200.0,3601.0,3722.0,4375.0,4249.0,4166.0,4076.0,4054.0,4139.0,4322.0,4344.0,5674.0,5242.0,5549.0,5613.0,5875.0,6279.0,6306.0,6883.0,7557.0,7911.0,8083.0,7788.0,7558.0,7487.0,7059.0,6951.0,6764.0 Adriana,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,6.0,9.0,0.0,5.0,13.0,19.0,13.0,15.0,11.0,12.0,13.0,18.0,16.0,22.0,20.0,20.0,16.0,15.0,16.0,11.0,16.0,21.0,9.0,6.0,23.0,17.0,25.0,17.0,17.0,16.0,23.0,21.0,21.0,24.0,21.0,23.0,21.0,35.0,45.0,33.0,38.0,43.0,46.0,49.0,71.0,53.0,77.0,89.0,77.0,102.0,137.0,148.0,149.0,172.0,217.0,238.0,258.0,283.0,360.0,475.0,458.0,464.0,447.0,490.0,519.0,614.0,683.0,796.0,816.0,938.0,1113.0,1055.0,1195.0,1067.0,977.0,1137.0,1263.0,1224.0,1465.0,2290.0,2167.0,2264.0,2217.0,2085.0,2188.0,2076.0,2109.0,2561.0,2483.0,2309.0,2686.0,2649.0,2392.0,2838.0,2837.0,2709.0,3101.0,2932.0,2724.0,2567.0,2460.0,2318.0,2147.0,2092.0,1869.0 
Agnes,473.0,424.0,565.0,623.0,703.0,695.0,779.0,901.0,1046.0,1033.0,1095.0,1125.0,1333.0,1399.0,1477.0,1466.0,1661.0,1585.0,1708.0,1619.0,1922.0,1534.0,1681.0,1622.0,1672.0,1729.0,1767.0,1842.0,1941.0,1985.0,2176.0,2315.0,2959.0,3217.0,3735.0,4803.0,4861.0,5078.0,5303.0,4907.0,4920.0,4816.0,4488.0,4235.0,4252.0,3927.0,3551.0,3354.0,3033.0,2782.0,2521.0,2312.0,2126.0,1887.0,1942.0,1588.0,1608.0,1428.0,1395.0,1304.0,1200.0,1093.0,1141.0,1054.0,962.0,824.0,868.0,839.0,833.0,729.0,748.0,655.0,626.0,642.0,523.0,563.0,508.0,494.0,418.0,425.0,429.0,396.0,353.0,320.0,278.0,259.0,202.0,193.0,154.0,129.0,120.0,132.0,121.0,94.0,94.0,86.0,65.0,63.0,65.0,71.0,77.0,93.0,69.0,82.0,62.0,67.0,69.0,70.0,76.0,79.0,76.0,67.0,69.0,54.0,58.0,61.0,64.0,47.0,64.0,63.0,68.0,60.0,53.0,62.0,57.0,59.0,68.0,65.0,61.0,81.0,67.0,96.0,122.0,123.0,187.0 Aidan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,7.0,9.0,7.0,5.0,7.0,0.0,5.0,7.0,11.0,9.0,12.0,13.0,14.0,8.0,11.0,23.0,13.0,22.0,25.0,17.0,33.0,28.0,38.0,44.0,37.0,38.0,45.0,67.0,86.0,65.0,75.0,97.0,150.0,184.0,307.0,370.0,551.0,1061.0,1163.0,1466.0,1792.0,2348.0,3440.0,4826.0,7143.0,10290.0,10017.0,9912.0,10199.0,8545.0,7816.0,5881.0,4351.0,3762.0,3254.0,2683.0,2285.0
Now we can write a function that stores the information in the file so that we can work with it in Python.
#this function reads the file in the given path line by line and returns two things:
# 1. a dictionary in the form dict['name'] = [0, 33, 110, 3, 0, ...] where each key is a baby name
# and the value at each key is a list of counts per year
# 2. a list whose elements are the values of the years in the study i.e. [1880, 1881, .., 2014]
def read_names(file_path):
    names_dict = {}
    with open(file_path) as f:
        for row_number, row_string in enumerate(f):
            #get the values of each column into a list (hint: split())
            row_info = row_string.split(",")
            #first we need to check if we're at a header row
            if row_number == 0:
                #let's store the years so we can use them as labels later
                #we split the row_string by the comma since this is a csv file
                #and we only keep the entries after the first one since it is just the first column label
                #split() and list slicing come in handy here!
                years = row_info[1:]
                #convert each year in the list to an integer
                int_years = [int(y.strip()) for y in years]
            #if we're not at a header then we're in the real data. so we need to store counts and baby names
            else:
                name = row_info[0]
                counts = [float(c.strip()) for c in row_info[1:]]
                #make the key-value entry in the dictionary
                names_dict[name] = counts
    #return a dictionary with all the names and their data, and a list with the years
    return names_dict, int_years
baby_dict, year_list = read_names(babypath)
#look up your name in the dictionary!
print(baby_dict["Carlos"])
[17.0, 19.0, 20.0, 22.0, 13.0, 28.0, 16.0, 20.0, 29.0, 25.0, 17.0, 16.0, 24.0, 26.0, 25.0, 20.0, 30.0, 28.0, 30.0, 36.0, 37.0, 38.0, 37.0, 41.0, 38.0, 58.0, 50.0, 74.0, 56.0, 60.0, 82.0, 69.0, 124.0, 165.0, 210.0, 273.0, 234.0, 298.0, 347.0, 313.0, 431.0, 399.0, 447.0, 434.0, 453.0, 477.0, 506.0, 551.0, 584.0, 570.0, 587.0, 553.0, 534.0, 573.0, 568.0, 565.0, 541.0, 544.0, 598.0, 589.0, 668.0, 621.0, 659.0, 709.0, 699.0, 736.0, 878.0, 1037.0, 1043.0, 1149.0, 1207.0, 1252.0, 1271.0, 1408.0, 1628.0, 1719.0, 1816.0, 1810.0, 1923.0, 1954.0, 2027.0, 2029.0, 2144.0, 2132.0, 2208.0, 2159.0, 2187.0, 2317.0, 2532.0, 2925.0, 3415.0, 3469.0, 3473.0, 3613.0, 3914.0, 3839.0, 3827.0, 3842.0, 3772.0, 4209.0, 4127.0, 4215.0, 4316.0, 3952.0, 3998.0, 4113.0, 4142.0, 4189.0, 4174.0, 4682.0, 5251.0, 5369.0, 5365.0, 5381.0, 5349.0, 5565.0, 5490.0, 5481.0, 5491.0, 6679.0, 6332.0, 6861.0, 6606.0, 6231.0, 6269.0, 6571.0, 6551.0, 6414.0, 6048.0, 5371.0, 4592.0, 4179.0, 4008.0, 3668.0, 3402.0]
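Splitting on commas by hand works well for a simple file like this one. For messier files, the standard library's csv module does the splitting for you. Below is a sketch of the same parsing idea using csv.reader. To keep it self-contained it reads from a tiny in-memory sample with the same layout as BabyNames.csv; the names and counts in the sample are made up and are not taken from the real dataset.

```python
import csv
import io

# made-up miniature data in the same layout as BabyNames.csv (NOT real counts)
sample = io.StringIO("Name,1880,1881,1882\nAda,3.0,4.0,6.0\nBo,0.0,0.0,1.0\n")

# csv.reader splits each line into a list of column values for us
reader = csv.reader(sample)

# next() pulls out the first row, which is the header
header = next(reader)
years = [int(y) for y in header[1:]]

# every remaining row is a name followed by its counts
counts_by_name = {}
for row in reader:
    counts_by_name[row[0]] = [float(c) for c in row[1:]]

print(years)
print(counts_by_name)
```

With a real file you would pass the object returned by open() to csv.reader instead of the StringIO sample.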
#first let's look for an individual name and plot its popularity trend in time.
#notice the 'title=' argument. this is known as a keyword argument. it is useful for giving a default value to a function argument.
#so in this case the user can decide whether or not to give the plot a title. if they don't, the title
#defaults to "Plot Title"
def name_plot(name, names_dict, year_list, title="Plot Title"):
    #we give the matplotlib function plot() the x and y lists that we would like to plot
    plt.plot(year_list, names_dict[name], label=name)
    plt.title(title)
    plt.xlabel("Year")
    plt.ylabel("Count")
#let's try out the function. give it a name, the names dictionary, and the list of years
name_plot("Carlos", baby_dict, year_list, title="Carlos Popularity")
# we define a hipster name with the following three criteria. let's say a name is popular if it passes 1000 babies in
# a given year.
# Criteria:
# * was very popular a long time ago (more than 1000 in some year between 1915 and 1929)
# * very unpopular in between (never more than 1000 in any year between 1960 and 1999)
# * popular in recent years (more than 1000 in some year from 2010 on)
def hipster_names(baby_dict, year_list):
    #let's start an empty list to contain all the matching 'hip' names
    hip_names = []
    #let's establish the ranges of time to look at
    popular_range = range(1915, 1930)
    unpopular_range = range(1960, 2000)
    recent = 2010
    #get all the baby names so we can check each one
    names = baby_dict.keys()
    #we go through each name in the dictionary
    for name in names:
        #set some booleans with some initial values that will tell us together if the name is hip
        was_popular = False
        was_unpopular = True
        becoming_popular = False
        #we go through each year's count for the name
        for index, count in enumerate(baby_dict[name]):
            #look up what year we're in with the year_list
            current_year = year_list[index]
            #check if the name was popular a long time ago
            if current_year in popular_range and count > 1000:
                was_popular = True
                continue
            #check if the name stayed unpopular in between (it fails if we find a year in the range where it was popular)
            if current_year in unpopular_range and count > 1000:
                was_unpopular = False
                continue
            #check if the name is growing in popularity
            if current_year >= recent and count > 1000:
                becoming_popular = True
                continue
        #combine all the booleans to tell us if the name matches all our criteria. if it does, add it to the list
        if was_popular and was_unpopular and becoming_popular:
            hip_names.append(name)
    #return the list.
    return hip_names
#now we just have to call the function hipster_names.
hip_names = hipster_names(baby_dict, year_list)
print(hip_names)
['Eleanor', 'Rosalie', 'Clara', 'Oliver', 'Ella', 'Hazel', 'Genevieve', 'Violet', 'Lena', 'Stella']
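For comparison, here is a more compact sketch of the same three checks using the built-in any() and all() functions. The helper name is_hip, the threshold argument, and the toy data are all made up for this example (the names and counts are invented and are not from BabyNames.csv). The year ranges are the ones used in the code above.

```python
# made-up toy data: one count per year in toy_years (NOT real baby name counts)
toy_years = [1920, 1970, 2012]
toy_dict = {
    "OldNew": [1500.0, 200.0, 1200.0],  # popular, then faded, then back again
    "Steady": [600.0, 5000.0, 7000.0],  # never popular in the early range
}

def is_hip(counts, years, threshold=1000):
    """Hypothetical helper: True if the counts match all three hipster criteria."""
    pairs = list(zip(years, counts))  # pair every year with its count
    # popular in at least one year between 1915 and 1929
    was_popular = any(c > threshold for y, c in pairs if 1915 <= y < 1930)
    # never popular in any year between 1960 and 1999
    was_unpopular = all(c <= threshold for y, c in pairs if 1960 <= y < 2000)
    # popular in at least one year from 2010 on
    becoming_popular = any(c > threshold for y, c in pairs if y >= 2010)
    return was_popular and was_unpopular and becoming_popular

hip = [name for name in toy_dict if is_hip(toy_dict[name], toy_years)]
print(hip)
```

Writing each criterion as a single any() or all() line keeps it close to the wording of the criteria, at the cost of walking through the counts three times instead of once.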
#let's plot the trendlines of each hip name
for n in hip_names:
    name_plot(n, baby_dict, year_list, title="Hipster Names")
#we add a legend so we can tell the lines apart
plt.legend(loc="lower left")
There are many, many, many more useful libraries that I didn't have time to cover. I'm going to list a few of the ones that you might want to check out.