There is a natural dichotomy between data files that are easy for a computer to read and data files that are easy for a human to read. In this article, we'll look at three common kinds of data files you are likely to encounter that are fairly human-readable, as well as being easily manipulated in Python.
CSV is a very simple file format. The initials stand for "comma-separated values", and the format is used to encode a table of data. Spreadsheets and many databases can be exported into CSV format, and spreadsheet software such as Excel can import CSV files, so this is a nice format to be familiar with. Line-breaks (that is, \n or \r characters) separate the rows, and commas separate the values in each row. Well, it's a bit more lenient than that; you can choose any character to separate the values, but comma is the one that stuck around in the name for some reason. So a file of rows like

Deadpool,783112979,8.0,637685

is a perfectly reasonable .csv file, but so is one using a different separator, such as

Deadpool|783112979|8.0|637685
As we should have come to expect in Python, there is a standard library module for handling .csv files, called (wait for it) csv. The two main functions offered by this module are reader() and writer(), respectively used for reading and writing csv files. Let's look at reading first.
import csv

with open('topmovies.csv', encoding='UTF-8') as f:  # file contains highest grossing films of 2016
    moviescsv = csv.reader(f, delimiter=',')
    for line in moviescsv:
        print(line)
['movie name', ' gross', 'IMDB rating', 'IMDB votes']
['Captain America: Civil War', ' 1153304495', '7.9', '418614']
['Rogue One', ' 1056057273', '7.9', '334132']
['Finding Dory', ' 1028570889', '7.4', '161213']
['Zootopia', ' 1023784195', '8.1', '302479']
['The Jungle Book', ' 966550600', '7.5', '200798']
['The Secret Life Of Pets', ' 875457937', '6.6', '124027']
['Batman v Superman', ' 873260194', '6.7', '479880']
['Deadpool', ' 783112979', '8.0', '637685']
['Suicide Squad', ' 745600054', '6.2', '404862']
['Sing', ' 632443719', '7.2', '66129']
As we can see, each line is turned into a list of strings (by default). In this use of reader, we specified a delimiter character -- the character that separates each entry. We chose a comma, although this is redundant since it is the default anyway.
We can see a problem with this import. Most entries have a space following the comma, which we don't want. This is easily fixed:
with open('topmovies.csv', encoding='UTF-8') as f:
    moviescsv = csv.reader(f, delimiter=',', skipinitialspace=True)
    for line in moviescsv:
        print(line)
['movie name', 'gross', 'IMDB rating', 'IMDB votes']
['Captain America: Civil War', '1153304495', '7.9', '418614']
['Rogue One', '1056057273', '7.9', '334132']
['Finding Dory', '1028570889', '7.4', '161213']
['Zootopia', '1023784195', '8.1', '302479']
['The Jungle Book', '966550600', '7.5', '200798']
['The Secret Life Of Pets', '875457937', '6.6', '124027']
['Batman v Superman', '873260194', '6.7', '479880']
['Deadpool', '783112979', '8.0', '637685']
['Suicide Squad', '745600054', '6.2', '404862']
['Sing', '632443719', '7.2', '66129']
The csv.reader object provides a large number of options and fixes to get the import just right. Another consideration with CSV files is "what if an entry contains the delimiter?" For example, what if one of the movie titles in the file I imported contained a comma? For this, we use quotes: delimiters inside quotation marks are ignored. Hence, the csv reader also allows us to specify which character is used for quoting.
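To see quoting in action, here's a small sketch (with a made-up two-line table) where one field contains the delimiter:

```python
import csv
import io

# a made-up table where one title contains a comma;
# the quotation marks keep it together as a single field
data = 'title,rating\n"Hello, Dolly!",7.0\n'

rows = list(csv.reader(io.StringIO(data), delimiter=',', quotechar='"'))
print(rows)  # the comma inside the quotes does not split the field
```

Without the quotation marks in the data, the title would have been split across two entries.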
The csv.reader object has not created a copy of the csv file in Python's memory; it is still reading from the file. Therefore, the file must still be open when we access the lines via the reader. If we want to close the file, we should move the entries into another structure such as a list or dictionary:
with open('topmovies.csv', encoding='UTF-8') as f:
    moviescsv = csv.reader(f, delimiter=',', skipinitialspace=True)
    movielist = [line for line in moviescsv]

print(movielist)
[['movie name', 'gross', 'IMDB rating', 'IMDB votes'], ['Captain America: Civil War', '1153304495', '7.9', '418614'], ['Rogue One', '1056057273', '7.9', '334132'], ['Finding Dory', '1028570889', '7.4', '161213'], ['Zootopia', '1023784195', '8.1', '302479'], ['The Jungle Book', '966550600', '7.5', '200798'], ['The Secret Life Of Pets', '875457937', '6.6', '124027'], ['Batman v Superman', '873260194', '6.7', '479880'], ['Deadpool', '783112979', '8.0', '637685'], ['Suicide Squad', '745600054', '6.2', '404862'], ['Sing', '632443719', '7.2', '66129']]
This is, of course, a rather crude demonstration (in fact it breaks our rule of avoiding lists of mixed types); we'd likely want to be more organized with how we structure the data we obtain from a csv, but that will depend on your individual needs.
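For instance, if we'd rather have each row as a dictionary keyed by the header entries, the module's csv.DictReader does the structuring for us -- here's a sketch with a couple of rows from the table inlined as a string:

```python
import csv
import io

# two rows from the movie table, inlined here so the sketch is self-contained
data = "movie name,gross\nDeadpool,783112979\nSing,632443719\n"

movies = list(csv.DictReader(io.StringIO(data)))
print(movies[0]['movie name'])  # Deadpool
print(movies[0]['gross'])       # 783112979
```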
Writing to a csv is basically just as easy, and follows practically the same format. Let's put the movielist into a new csv, with a different delimiter, and quotation marks around non-numeric data (this will make it easier to import in the future, since with this style, the reader can distinguish between numbers and strings when it reads the file). Firstly, we should actually convert the numeric data into numbers:
for row in movielist:
    for i, entry in enumerate(row):
        try:
            row[i] = float(entry)
        except ValueError:
            pass

print(movielist)
[['movie name', 'gross', 'IMDB rating', 'IMDB votes'], ['Captain America: Civil War', 1153304495.0, 7.9, 418614.0], ['Rogue One', 1056057273.0, 7.9, 334132.0], ['Finding Dory', 1028570889.0, 7.4, 161213.0], ['Zootopia', 1023784195.0, 8.1, 302479.0], ['The Jungle Book', 966550600.0, 7.5, 200798.0], ['The Secret Life Of Pets', 875457937.0, 6.6, 124027.0], ['Batman v Superman', 873260194.0, 6.7, 479880.0], ['Deadpool', 783112979.0, 8.0, 637685.0], ['Suicide Squad', 745600054.0, 6.2, 404862.0], ['Sing', 632443719.0, 7.2, 66129.0]]
Exactly how this code works will be explained in a future article, but try and guess what it does anyway! (Clue: float("movie name") will usually produce an error. What do try and except do?)
Now that our table is in a nicer format, let's write it to a csv.
with open("movies-new.csv", "w", encoding='UTF-8', newline='') as f:
    # newline='' lets the csv module control the line endings itself
    # creates a "writer" that is ready to write whatever we give to it
    moviecsv = csv.writer(f, delimiter='|', quotechar="`", quoting=csv.QUOTE_NONNUMERIC)
    # give the list to the writer
    moviecsv.writerows(movielist)
The resulting file on my computer looks like this:
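Given the options we chose -- pipe delimiters, backticks for quoting, and quotes around only the non-numeric entries -- the first few rows should come out as:

```
`movie name`|`gross`|`IMDB rating`|`IMDB votes`
`Captain America: Civil War`|1153304495.0|7.9|418614.0
`Rogue One`|1056057273.0|7.9|334132.0
...
```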
Each argument here should be fairly self-explanatory. Rows can be written one at a time with writer.writerow(row), instead of providing a full list all at once.
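As a quick sketch (writing to an in-memory buffer rather than a file, with just two rows), the writerows call above could be replaced by a loop:

```python
import csv
import io

rows = [['movie name', 'gross'], ['Deadpool', 783112979.0]]

buffer = io.StringIO()
moviecsv = csv.writer(buffer, delimiter='|', quoting=csv.QUOTE_NONNUMERIC)
for row in rows:
    moviecsv.writerow(row)  # one row at a time, instead of writerows(rows)

print(buffer.getvalue())
```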
JSON stands for "JavaScript Object Notation". JavaScript is a programming language used to make interactive and dynamic webpages, which means there needs to be a way for a JavaScript script in a webpage to send and receive bundles of data without reloading the whole page. The JSON file format is one such solution. However, JSON files can be used to store many kinds of data, and are commonly encountered when grabbing information from the web. In fact, Jupyter notebook files such as this very article are really just JSON files with a different suffix; try opening one in your text editor to see for yourself!
Thankfully for us, JSON files are easy for any Python programmer to understand. Here is an example of a JSON file containing an Amazon review for a piece of audio equipment (and yes, Amazon reviews really are sent over to your computer as JSON - in fact most of the things you see on any given Amazon page are):
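Here it is, pretty-printed (it's the first review from the file we'll load in a moment):

```json
{
  "reviewerID": "A2IBPI20UZIR0U",
  "asin": "1384719342",
  "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...",
  "helpful": [0, 0],
  "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
  "overall": 5.0,
  "summary": "good",
  "unixReviewTime": 1393545600,
  "reviewTime": "02 28, 2014"
}
```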
Look familiar? It's basically just a dictionary. This is a good reason for Python programmers to know about JSON files, as it means we can save the data from our Python programs in an external file, so long as the values are of a suitable format (strings, numbers, etc.).
Of course, it's not precisely a dictionary. There are some translations that need to be made from the JSON format (and terminology). The dictionary-like structure of a JSON file is called a JSON object. Since objects can have objects as their values, this gives JSON objects a tree-like structure. Booleans in JSON files use all lower case (so Python's True is true in JSON). Lists and tuples are rolled into one object called an array, and all floats and ints are rolled into one object called a number. Let's see how we can import JSONs, using the json module, which does all the necessary translations for us (phew!).
The main functions of the module are dump and dumps (for creating JSON objects), and load and loads (for reading JSON objects). The s appearing in two of these functions simply stands for string. We'll see the difference in a second.
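A quick sketch with a made-up dictionary shows the pattern: dumps produces a string, and loads turns such a string back into a Python object.

```python
import json

record = {'name': 'example', 'count': 3}

text = json.dumps(record)    # Python object -> JSON string
print(type(text))            # <class 'str'>

restored = json.loads(text)  # JSON string -> Python object
print(restored == record)    # True
```

The fence-free versions, dump and load, do the same jobs but write to and read from file objects directly.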
import json

with open("Musical_Instruments_5.json", "r", encoding='UTF-8') as f:
    reviews = [json.loads(line) for line in f]

for review in reviews[:5]:
    print(review)
{'reviewerID': 'A2IBPI20UZIR0U', 'asin': '1384719342', 'reviewerName': 'cassandra tu "Yeah, well, that\'s just like, u...', 'helpful': [0, 0], 'reviewText': "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", 'overall': 5.0, 'summary': 'good', 'unixReviewTime': 1393545600, 'reviewTime': '02 28, 2014'}
{'reviewerID': 'A14VAT5EAX3D9S', 'asin': '1384719342', 'reviewerName': 'Jake', 'helpful': [13, 14], 'reviewText': "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]", 'overall': 5.0, 'summary': 'Jake', 'unixReviewTime': 1363392000, 'reviewTime': '03 16, 2013'}
{'reviewerID': 'A195EZSQDW3E21', 'asin': '1384719342', 'reviewerName': 'Rick Bennette "Rick Bennette"', 'helpful': [1, 1], 'reviewText': 'The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies. The double cloth filter blocks the pops and lets the voice through with no coloration. The metal clamp mount attaches to the mike stand secure enough to keep it attached. The goose neck needs a little coaxing to stay where you put it.', 'overall': 5.0, 'summary': 'It Does The Job Well', 'unixReviewTime': 1377648000, 'reviewTime': '08 28, 2013'}
{'reviewerID': 'A2C00NNG1ZQQG2', 'asin': '1384719342', 'reviewerName': 'RustyBill "Sunday Rocker"', 'helpful': [0, 0], 'reviewText': 'Nice windscreen protects my MXL mic and prevents pops. Only thing is that the gooseneck is only marginally able to hold the screen in position and requires careful positioning of the clamp to avoid sagging.', 'overall': 5.0, 'summary': 'GOOD WINDSCREEN FOR THE MONEY', 'unixReviewTime': 1392336000, 'reviewTime': '02 14, 2014'}
{'reviewerID': 'A94QU4C90B1AX', 'asin': '1384719342', 'reviewerName': 'SEAN MASLANKA', 'helpful': [0, 0], 'reviewText': "This pop filter is great. It looks and performs like a studio filter. If you're recording vocals this will eliminate the pops that gets recorded when you sing.", 'overall': 5.0, 'summary': 'No more pops when I record my vocals.', 'unixReviewTime': 1392940800, 'reviewTime': '02 21, 2014'}
Let's break it down. As usual, we open the file using open to give us a file object. This file contains a different JSON object on each line. Because we can read through a file line-by-line, and those lines are read as text, we use loads, since each line is a string.
Once we have loaded it, each JSON object is now precisely a dictionary:
print(type(reviews[0]))
print("This person thought the product they purchased was: {}".format(reviews[0]["summary"]))
<class 'dict'>
This person thought the product they purchased was: good
This strategy is very good when we have a file containing lots of JSON objects. We read each object (line) as a string and then pass it to loads to convert it. If we have a file containing only a single JSON object, we can pass the whole file to load instead:
with open("book_review.json", encoding='UTF-8') as f:
    bookreview = json.load(f)

print(bookreview["reviewText"])
Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!
Writing JSON objects to files is much the same scenario. This provides a convenient way to save the important information in a Python program, in a way that can be re-opened by the same program, another Python program, or even a program written in a different programming language.
Let's encode a JSON object. Suppose I run a business and my program is storing the delivery details of my customers. I might store each customer as a dictionary like this:
customer = {
    'name': 'Sam Cassidy',
    'address': '19 Notareal Avenue, Liverpool',
    'postcode': 'L00 9JR',
    'purchase-history': [('inflatable pet', 3, '05.09.2016'),
                         ('disposable widget', 60, '01.01.2017')],
    'Phone number': '0712345567889'
}
So the information contained here is mostly strings, although some are organized into lists of tuples, and there are a couple of integers as well. Let's create a JSON object from this dictionary.
customer_JSON = json.dumps(customer)
print(customer_JSON)
{"name": "Sam Cassidy", "address": "19 Notareal Avenue, Liverpool", "postcode": "L00 9JR", "purchase-history": [["inflatable pet", 3, "05.09.2016"], ["disposable widget", 60, "01.01.2017"]], "Phone number": "0712345567889"}
As we can see, the reproduction is mostly faithful, but our tuples have been turned into JSON arrays. That's okay: if we wish to recover the original tuples when we load from this JSON file, we can just write a function to convert them back.
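For instance, a little post-processing after loads restores the tuples (a sketch using a pared-down version of the customer object, so it stands alone):

```python
import json

# a pared-down customer JSON string, for illustration
json_text = '{"purchase-history": [["inflatable pet", 3, "05.09.2016"]]}'

loaded = json.loads(json_text)
# JSON arrays always come back as lists, so convert each purchase to a tuple
loaded['purchase-history'] = [tuple(p) for p in loaded['purchase-history']]
print(loaded['purchase-history'][0])  # ('inflatable pet', 3, '05.09.2016')
```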
JSON ignores whitespace between values, so we can actually make a slightly nicer JSON file by setting the indent parameter:
customer_JSON = json.dumps(customer, indent=2)
print(customer_JSON)
{
  "name": "Sam Cassidy",
  "address": "19 Notareal Avenue, Liverpool",
  "postcode": "L00 9JR",
  "purchase-history": [
    [
      "inflatable pet",
      3,
      "05.09.2016"
    ],
    [
      "disposable widget",
      60,
      "01.01.2017"
    ]
  ],
  "Phone number": "0712345567889"
}
This is easier to read, and closer to how we defined customer in the file. Since this is now just a string, we can write it to a .json file using the usual write methods:
with open('my_customers.json', 'a', encoding='UTF-8') as f:
    f.write(customer_JSON)
And there, the data is saved and can be reopened using json.load() or json.loads().
The difficulty with JSON files is that ultimately, they're only text files. For one, this means encoding is sometimes important, and all the previous things we've said about encoding should be considered. For two, this means that more complicated Python data structures (for example custom-made classes, which we'll meet soon enough) might need to be cleverly converted into text, somehow or other. For instance, it might be more useful inside my program to work with Python's smart "date" objects, rather than just strings containing the date:
from datetime import date

customer = {
    'name': 'Sam Cassidy',
    'address': '19 Notareal Avenue, Liverpool',
    'postcode': 'L00 9JR',
    'purchase-history': [('inflatable pet', 3, date(2016, 9, 5)),
                         ('disposable widget', 60, date(2017, 1, 1))],
    'Phone number': '0712345567889'
}
The date object is much smarter than a string. For example:
inflatable_pet_date = customer['purchase-history'][0][2]
print(inflatable_pet_date)
print(inflatable_pet_date.isoweekday())  # day of the week as a number -- this day was a Monday (day 1)!
print(inflatable_pet_date.strftime('%A %d %B'))  # there's a little mini-language to control this representation
2016-09-05
1
Monday 05 September
Now let's write this object to a JSON again:
customer_JSON = json.dumps(customer, indent=2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-1767e5e3802e> in <module>()
----> 1 customer_JSON = json.dumps(customer, indent=2)

...

TypeError: Object of type 'date' is not JSON serializable
Umm... ouch. A simple solution is to write functions that can convert between strings and date objects, and apply them just before encoding or just after decoding. The date module of course already provides a way to convert the date to a string -- you can just use str() on the object.
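Here's the round trip in miniature: str() gives the ISO yyyy-mm-dd form, and splitting that string recovers the numbers that date needs.

```python
from datetime import date

d = date(2016, 9, 5)
s = str(d)
print(s)  # 2016-09-05

# back from the string to a date object
year, month, day = (int(x) for x in s.split('-'))
print(date(year, month, day) == d)  # True
```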
def customer_to_jsonable(customer):
    from copy import deepcopy
    # deep copy, so we don't damage the original dictionary (a shallow
    # copy would still share the purchase-history list with the original)
    customer = deepcopy(customer)
    for i, entry in enumerate(customer['purchase-history']):
        customer['purchase-history'][i] = (entry[0], entry[1], str(entry[2]))
    return customer
def json_to_customer(jsonCust):
    for i, entry in enumerate(jsonCust['purchase-history']):
        date_list = entry[2].split('-')
        date_list = [int(x) for x in date_list]
        jsonCust['purchase-history'][i] = (entry[0], entry[1], date(date_list[0], date_list[1], date_list[2]))
    return jsonCust
# now call the function before converting to JSON
customer_JSON = json.dumps(customer_to_jsonable(customer), indent=2)
print(customer_JSON)
{
  "name": "Sam Cassidy",
  "address": "19 Notareal Avenue, Liverpool",
  "postcode": "L00 9JR",
  "purchase-history": [
    [
      "inflatable pet",
      3,
      "2016-09-05"
    ],
    [
      "disposable widget",
      60,
      "2017-01-01"
    ]
  ],
  "Phone number": "0712345567889"
}
original_customer = json_to_customer(json.loads(customer_JSON))
print(original_customer, '\n')

print("Customer {} purchased {} on:".format(original_customer['name'],
                                            original_customer['purchase-history'][0][0]))
print(original_customer['purchase-history'][0][2])
{'name': 'Sam Cassidy', 'address': '19 Notareal Avenue, Liverpool', 'postcode': 'L00 9JR', 'purchase-history': [('inflatable pet', 3, datetime.date(2016, 9, 5)), ('disposable widget', 60, datetime.date(2017, 1, 1))], 'Phone number': '0712345567889'}

Customer Sam Cassidy purchased inflatable pet on:
2016-09-05
So, with a little bit of thinking and tinkering, we've recovered all of the original information, even though JSON only allows us to store text.
There's another very common way to store and manipulate data on the web. It's called XML, and is in many ways like JSON, but more flexible, and more complicated. Python has an xml module for working with these formats, which again exposes hierarchies of attributes as dictionary-like structures. There's a great article about how the format works and how to use xml here: http://www.diveintopython3.net/xml.html.
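As a taster, here's a sketch using the xml.etree.ElementTree submodule on a made-up snippet of XML:

```python
import xml.etree.ElementTree as ET

# a made-up XML document describing two customers
data = """<customers>
  <customer id="1"><name>Sam Cassidy</name></customer>
  <customer id="2"><name>Ada Lovelace</name></customer>
</customers>"""

root = ET.fromstring(data)
for customer in root:
    # each element's attributes are exposed as a dictionary
    print(customer.attrib['id'], customer.find('name').text)
```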