As we've often discussed in the course, computers are dumb. I say this, writing weeks before the course actually meets, confident that we will say it at least once. Computers are very bad at inferring things, and your data needs to be clearly structured in order to work with it properly. Because of these difficulties, it is no accident that a number of different best practices have emerged for working with data. One of the most common formats for storing data that you will interact with is a Comma Separated Value (CSV) file. We will spend just a little time working with this file type today, as this form of data cleaning forms the basis for a lot of the work you are likely to do with Python.

A CSV file is, just as it sounds, a series of data fields separated by commas, often with headers at the top. And there is one data field per line:

id,last_name, first_name, cool?
0,reed,ethan,so cool
1,walsh,brandon,the coolest

And so on. By keeping careful track of our data in this way we are essentially creating a spreadsheet. This is good, because computers are quite good at reading spreadsheets.

But first we will need some data on our computer. Let's try these commands from the terminal:

$ mkdir csvs
$ curl http://humanitiesprogramming.github.io/python/notebooks/csvs/basic.csv > csvs/basic.csv

The first one should be pretty familiar, but what do you think the second command did? Let's break it down step by step.

curl http://humanitiesprogramming.github.io/python/notebooks/csvs/basic.csv > basic.csv do_something the_thing_being_affected arrow filename

The second piece is clearly a URL, and the fact that's a .csv extension suggests that it's some data of the filetype that we are discussing. The arrow suggests we're sending it somewhere. You got it! The curl command copies the contents of a URL. And the > sends the results of the thing on the left to the thing on the right. So we're saying "Take the contents of this URL and save it in this filename that I've given you on the right.

Next, we'll pull in our newly created CSV file that we need. Notice the $'s have disappeared, so we're back in Python:

In [3]:
import csv
with open('csvs/basic.csv', 'r') as csvfile:
    our_reader = csv.reader(csvfile)
    names = [row for row in our_reader]
print(names)
[['id', 'last_name', ' first_name', ' cool?'], ['0', 'reed', 'ethan', 'so cool'], ['1', 'walsh', 'brandon', 'the coolest']]

What I've done above is open the CSV file. Then, using the csv.reader function, we walk over the CSV file to see what is inside. The fourth line above is where we actually construct the data in a way we can work with it - we walk over every row in the table and smash those rows into a list called 'data'. At the end, data is a list of lists, and each sublist contains one row of the CSV file. We can use list indexing to explore parts of the data. We might be interested in the second row. Indexing once will give you the row you're interested in:

In [2]:
print(names[1])
Out[2]:
['0', 'reed', 'ethan', 'so cool']

Indexing twice will let you select first the row and then the column.

In [3]:
print(names[1][2])
Out[3]:
'ethan'

Since these are just lists, we can do anything to them that we might want to do to lists.

In [4]:
print(len(names))

# find the length of each first name
for row in names:
    print(len(row[2]))
    
# find the longest first name
longest = ""
for row in names:
    if len(row[2]) > len(longest):
        longest = row[2]
print(longest)

# construct a new list consisting of only the first names we have here.
first_names = [row[2] for row in names]
first_names.reverse()
print(first_names)
11
5
7
 first_name
['brandon', 'ethan', ' first_name']

Since our CSV is just a list of lists, we could add to it by adding another row. And that's as easy as adding a new list:

In [5]:
new_row = [2,'wayne','graham','meh']
names.append(new_row)
print(names)
Out[5]:
[['id', 'last_name', ' first_name', ' cool?'],
 ['0', 'reed', 'ethan', 'so cool'],
 ['1', 'walsh', 'brandon', 'the coolest'],
 [2, 'wayne', 'graham', 'meh']]

We could go on and on, adding to our CSV one row at a time. Let's try something else.

In [6]:
a_row = [3,'fox','eliza','SO COOL']
names + a_row
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-ca0d4006bc8c> in <module>()
      1 a_row = [3,'fox','eliza','SO COOL']
----> 2 data + a_row

NameError: name 'data' is not defined

What happened? Take a look and see if you can tell.

When we tried adding our new list to our collection of lists, it broke it apart and tried to add the individual items. You can tell this because the brackets disappear and we start getting rows of one item each.

We're starting to get to the point where we could use some more sophisticated ways of working with data. You might have noticed that I'm doing a LOT of looping. There are sometimes easier ways to work with your data than this. The way we're interacting with this CSV as a list of lists is really slow. There are other data structures that let you suck in a csv and, say, quickly get a particular column without having to loop over it first. If you're interested in learning better ways for interacting with there's a great book on using Python for data science that I can't recommend enough. For the purposes of this lesson, though, we'll keep things simple.

Another common thing that you'll want to do with CSV files is write them out. They're a common way of sharing data, and being able to get your data into them can offer a quick way to make your work more public. Rather than simply producing analyses of interesting data, you can also make your foundational resources public by sharing the CSV file. Let's start by getting something to write to the csv:

In [4]:
with open('csvs/practice.csv', 'w') as fout:
    csvwriter = csv.writer(fout)
    for i in range(0, 100, 10):
        csvwriter.writerow([i, i + 5, i + 10, i + 15, i + 20])

Above we open a new file and, using a range, write out to that csv file. Did it work? We can check using the things we've already learned!

In [5]:
with open('csvs/practice.csv', 'r') as fin:
    our_reader = csv.reader(fin)
    data = [row for row in our_reader]
data
Out[5]:
[['0', '5', '10', '15', '20'],
 ['10', '15', '20', '25', '30'],
 ['20', '25', '30', '35', '40'],
 ['30', '35', '40', '45', '50'],
 ['40', '45', '50', '55', '60'],
 ['50', '55', '60', '65', '70'],
 ['60', '65', '70', '75', '80'],
 ['70', '75', '80', '85', '90'],
 ['80', '85', '90', '95', '100'],
 ['90', '95', '100', '105', '110']]

Using what we already know about indexing and element assignment in lists, we could modify particular elements and then write it back out to a CSV again.

In [49]:
data[3][5] = 'Ethan'
data[5][4] = 'Brandon'
data[2][6] = 'Tony the cat'

with open('csvs/practice.csv', 'w') as fout:
    csvwriter = csv.writer(fout)
    for row in data:
        csvwriter.writerow(row)
        
print(data)
Out[49]:
[['0', '5', '10', '15', '20', '25', '30'],
 ['10', '15', '20', '25', '30', '35', '40'],
 ['20', '25', '30', '35', '40', '45', 'Tony the cat'],
 ['30', '35', '40', '45', '50', 'Ethan', '60'],
 ['40', '45', '50', '55', '60', '65', '70'],
 ['50', '55', '60', '65', 'Brandon', '75', '80'],
 ['60', '65', '70', '75', '80', '85', '90'],
 ['70', '75', '80', '85', '90', '95', '100'],
 ['80', '85', '90', '95', '100', '105', '110'],
 ['90', '95', '100', '105', '110', '115', '120']]

Check out your 'practice.csv' file to see if it worked! Do you remember how to do this from the command line?

$ cat csvs/practice.csv

But maybe you want to get a little more nuanced than this. After all, we put things in CSV files because we care about their organization. Working with lists of lists gets confusing quickly. There's a more nuanced way of reading these files using the DictWriter and DictReader classes.

In [6]:
groceries = []
with open('csvs/shopping.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        groceries.append(row)
        
print(groceries)
[OrderedDict([('id', '0'), ('name', 'apple'), ('nutritious', 'True')]), OrderedDict([('id', '1'), ('name', 'banana'), ('nutritious', 'True')]), OrderedDict([('id', '2'), ('name', 'potato'), ('nutritious', 'True')]), OrderedDict([('id', '3'), ('name', "ethan's heart"), ('nutritious', 'True')]), OrderedDict([('id', '4'), ('name', 'puppy'), ('nutritious', 'False')])]

We still can't easily get to all the values for single column, but it does allow us to query by the column name when we're looking at a particular item:

In [73]:
print(groceries[2]['name'])
Out[73]:
'potato'

We could get at a particular column by looping over the whole list.

In [74]:
for item in groceries:
    print(item['name'])
apple
banana
potato
ethan's heart
puppy

The Pandas package that I mentioned above is far better for this sort of thing, so check it out if you start doing complicated things with large amounts of data. Just as there is a DictReader class, there is also a DictWriter class that allows you to write that data back out to a file. Let's add some things to our grocery list and then write it to a new file.

In [76]:
tony = {'name': 'tony', 'nutritious': 'False', 'id': '5'}

groceries.append(tony)

with open('csvs/updated_groceries.csv', 'w') as fout:
    column_names = ['id', 'name', 'nutritious']
    dictwriter = csv.DictWriter(fout, fieldnames=column_names)
    dictwriter.writeheader()
    for row in groceries:
        dictwriter.writerow(row)

Check out your file and see if it worked. If we left any fields out in the tony item, it would still get added to the csv. It would just have blank spaces where those fields would be.

Exercises

The following exercises ask you to work with a dataset of historical press reports on Jack the Ripper that was scrapped from the Jack the Ripper Casebook. More information about the dataset can be found here. I've pulled out just the first 300 or so rows from the corpus for you to work with. The CSV file lives at this URL - http://humanitiesprogramming.github.io/python/notebooks/csvs/ripper.csv

  1. Adapt the above methods to get this CSV file onto your computer.
  2. Successfully read in the CSV and prove that you did it by printing out the column headings.
  3. Get all the texts of the press reports and store it in a variable called 'all_texts'. Print this variable out.
  4. What is the earliest date that an article was published?
  5. Write a new CSV with all the same data as the original CSV, but lowercase all the texts for the press reports.

Answers Note: there are many approaches to any programming problem. You might use wildly different solutions than the ones described below!

In [39]:
# 2. Potential Answer:

import csv

with open('csvs/ripper.csv', 'r') as fin:
    reader = csv.reader(fin)
    results = [row for row in reader]
    
results[0]
Out[39]:
['0',
 'index/aberdeen_weekly_journal_18830629.txt',
 'aberdeen_weekly_journal',
 '1883-06-29',
 "Abderdeen Weekly Journal29 June 1883\tALLEGED WIFE MURDER IN LONDON -An inquest was held at St. Bartholomew's Hospital, London, yesterday, respecting the death of Sarah Ann Kelly, aged twenty-two, late of Cottage Lane, City Road, who is alleged to have died from injuries inflicted by her husband, James Kelly, now under remand at Clerkenwell.  The couple had only been married a fortnight.  Last Thursday they quarreled, and Kelly stabbed his wife several times.  The jury returned a verdict of willful murder. "]
In [40]:
# 3. Potential Answer:

import csv

with open('csvs/ripper.csv', 'r') as fin:
    reader = csv.reader(fin)
    results = [row[4] for row in reader]
print(results[0])

# or, slightly more verbose way:

import csv

with open('csvs/ripper.csv', 'r') as fin:
    reader = csv.reader(fin)
    results = [row for row in reader]
    
texts = [row[4] for row in results]
print(texts[0])
Abderdeen Weekly Journal29 June 1883	ALLEGED WIFE MURDER IN LONDON -An inquest was held at St. Bartholomew's Hospital, London, yesterday, respecting the death of Sarah Ann Kelly, aged twenty-two, late of Cottage Lane, City Road, who is alleged to have died from injuries inflicted by her husband, James Kelly, now under remand at Clerkenwell.  The couple had only been married a fortnight.  Last Thursday they quarreled, and Kelly stabbed his wife several times.  The jury returned a verdict of willful murder. 
Abderdeen Weekly Journal29 June 1883	ALLEGED WIFE MURDER IN LONDON -An inquest was held at St. Bartholomew's Hospital, London, yesterday, respecting the death of Sarah Ann Kelly, aged twenty-two, late of Cottage Lane, City Road, who is alleged to have died from injuries inflicted by her husband, James Kelly, now under remand at Clerkenwell.  The couple had only been married a fortnight.  Last Thursday they quarreled, and Kelly stabbed his wife several times.  The jury returned a verdict of willful murder. 
In [41]:
# 4. Potential answers

import csv

with open('csvs/ripper.csv', 'r') as fin:
    reader = csv.reader(fin)
    results = [row for row in reader]

dates = [row[3] for row in results]
print(min(dates))

"""Bonus! The date fields here are actually being interpreted as strings, 
so you might run into problems trying to sort them (according to a string, 1, 2, 11 would be
sorted as 1, 11, 2). To parse them more thoroughly as dates you would need to import the
time module."""
1857-11-04
Out[41]:
'Bonus! The date fields here are actually being interpreted as strings, \nso you might run into problems trying to sort them (according to a string, 1, 2, 11 would be\nsorted as 1, 11, 2). To parse them more thoroughly as dates you would need to import the\ntime module.'
In [42]:
# 5. Potential answer

import csv
with open('csvs/ripper.csv', 'r') as fin:
    reader = csv.reader(fin)
    results = [row for row in reader]

for row in results:
    row[4] = row[4].lower()
    
with open('csvs/ripper_texts.csv', 'w') as fout:
    writer = csv.writer(fout)
    for row in results:
        writer.writerow(row)