#!/usr/bin/env python
# coding: utf-8

# ### This little tutorial means to further acquaint you with Python, its data science libraries, and Jupyter Notebook. It also aims to help you better understand one of the data formats that will likely comprise a significant part of your future: .XML

# At the same time, it tries to contextualize all of this work with how data science actually gets done, and what it can look like *en route*. This is a lot to take in all at once, I know. But, in my small opinion, you cannot separate lessons in *doing data science* from lessons in *programming* from a *working knowledge* of where and how our data lives. These three components have to inform one another, or there is no point going forward.

# I suggest that you work along with this Jupyter Notebook tutorial by opening up your own notebook (with a Python 3 kernel) and building each section as you make sense of it. Don't hesitate to break things, add things, go off in your own direction. Remember: This is the advantage of the digital Notebook format. Don't get stuck in *pen-and-ink-think.*

# ## ONE

# You're going to need an iTunes **.plist** (property list) file for this exercise. It's actually a version of an .XML file. In case you don't have one handy, or if you'd prefer to work along with the same one I'm using, I happen to have a copy of it right here for download.
#
# [download drm_itunes.xml](drm_itunes.xml)
#
# *Like most .xml files, this is just a flatfile: It doesn't actually contain any of the music to which it refers.*

# ## TWO

# This week, Jupyter Notebook is again our weapon of choice.

# ## THREE

# #### Data Handling

# Don't forget: *Never work on the original copy of a data file*. You will regret that decision. If you're using your own music.xml file, make a duplicate, *leaving the original untouched and unmoved.*

# Rename your copy of the file in order to mark it as a disposable working copy: Say, `myiTunesData.xml`. Make note of that in your processing book.
# (These ostensibly *disposable* copies really add up over time. I found yesterday that I had 4 copies of the (awesome) NYC TaxiCab GPS dataset (January 2016 edition); each copy is just over 2GB in size.)

# Now from Mac's Finder or Windows Explorer (or the command line, if you're feeling adventuresome), move your `.xml` datafile into the same directory that now contains your Jupyter file. In my case, for example, `.iTunesPlotter.ipynb` and `itunes_gwl.xml` are both safely stored in the same directory:
#
#     Users/garrisonlemasters/Documents/AnacondaProjects/iTunes/

# The `.xml` file does not *have* to be in the same directory as the Jupyter notebook (the .ipynb file), but it makes life much easier.

# **Q:** *Doesn't all of this seem like a lot of administrative work?*
#
# **A:** It probably only seems that way because **it is a lot of administrative work**. Doing data science means learning to love file versioning systems, duplication macros, RAID arrays and redundant storage systems. It means you fantasize about off-site storage and data firehoses. Do a good job keeping track of your data and the rest will follow.

# ## FOUR

# Let's look at the file we want to ingest and manipulate. As we can tell from the file extension and from the file header, it is an .XML file. Here's the header:

# ```
# <?xml version="1.0" encoding="UTF-8"?>
# <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
# <plist version="1.0">
# ```

# One of the principles to which .XML files adhere is that their headers point to a place on the web where someone maintains the file's schema: That is, a detailed description of how the file is organized. That's the second line (above), the one that begins !DOCTYPE and ends with the standard file extension for a Document Type Definition, .dtd.

# Apple -- like many companies -- uses the .XML standard, but doesn't always adhere to the standards themselves. This is usually annoying. In this case, it means that the file is easier for us to interpret, but less easily interpreted by Python. To understand why, let's talk about dictionaries.
# In computation, a DICTIONARY is a fairly simple way of storing information. It is called a dictionary because it roughly mimics the way a dictionary stores data: With a unique term on the left-hand side of the page and one or more definitions on the right-hand side. When we consult a dictionary, we always know the term that we're looking for, and we use that term to find its definition.

# Doing extensive work with an unfamiliar library, data format, or hardware? Keep a notebook or a webpage with all of the relevant links to tutorials, source code, and (ideally) the developers' docs. For decades, documentation of code has been abysmal: But it has gotten to the point that some documentation is genuinely useful now. [Here is one page](https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man5/plist.5.html), for example, that Apple maintains on its use of XML in PLIST formats. And [here is a page devoted to the Python Software Foundation's plistlib](https://docs.python.org/3/library/plistlib.html). The changelog that appears right above the yellow box on that page hints at the terrific difficulties that even the best documentation is hard-pressed to ameliorate: "Changed in version 3.4: New API, old API deprecated. Support for binary format plists added." While the shift from 3.3 to 3.4 doesn't *seem* like much, the 3.4 library is effectively completely new.

# ## Process

# In[1]:

import plistlib

# In[2]:

fileName = 'drm_itunes.xml'

# In[3]:

myFileObject = open(fileName, 'rb')

# This statement opens our file (fileName) as a Read-Only file ('r') in Binary mode ('b'). "Oh no you didn't", you'll complain: "What's this yer on about now, *binary*? I thought if I could read the file, it was a *plaintext* file -- and I can bloody well read this `.xml` file just fine, *gov'nuh*, I do thank ye very much. It isn't binary!"
#
# Well, yes: *How colorful you are!* And how *right* you are!
#
# Or rather, right you *were,* because the 'b' for Binary doesn't describe the nature of the file -- "Python, open that file, and be careful, it's in BINARY!". Instead, it tells Python how we want it to hand over the data it finds inside: "Python, open that file, and give me its contents as raw bytes, exactly as they sit on the disk, without decoding them into text first!"

# So the obvious question is: Why? Why ask for a stream of raw bytes instead of a legible (human-readable) stream of letters?
#
# The answer isn't that interesting, in truth: In our case, we do that because the library we're about to use -- Python's `plistlib` -- wants its data delivered as a binary stream.
#
# And so that is exactly what we've done with our `open(fileName, 'rb')` statement: We've asked Python to deliver our file full of song titles and singers' names as a stream of bytes. Nothing in the file itself changes. Python obliges by wrapping the open file inside an `object` called an \_io.BufferedReader: Basically, *a high-tech chrome-and-carbon-fiber box*, full of bells and whistles that make it easier for us to get immediate access to lots of data, quickly.

# We can get a quick overview of the `myFileObject` object if we `print()` it to our screen:

# In[4]:

print(myFileObject)

# Our high-tech box, filled with liquid data, is called a BufferedReader.

# So: Python knows how to pull data in and push data out (*it's all 1's and 0's at this point, after all*). But we usually need one or more external libraries if we're going to make automated sense of that data at all. In our case, Python already has a plist library (called, appropriately enough, `plistlib`) ready to go.
# *(Note that there are other libraries that would also work here, like the XML parsing library; but plistlib is custom-made for Apple-style data, which suits us perfectly.)*
#
# When we use one of the tools in the library (in this case, `load` is a 'function' -- an action, a verb -- stored inside `plistlib`), we need to tell Python that we are using a tool *from that library*. We've done this sort of thing before. In this case, it looks like this:

# In[5]:

songLibraryData = plistlib.load(myFileObject)

# Remember: In the code above, `songLibraryData` and `myFileObject` are both variables: We want Python to 'read between the lines,' not to interpret those words literally. Because of that, we do NOT use quotes around them. If I put quotes around `myFileObject`, for example, Python would think that the phrase `myFileObject` was important. In this case, it clearly is not: The *value contained within that variable* is what matters.

# Python asks the plist library to take a look at that binary stream of data we opened up; if it recognizes how things are organized, we're in luck!
#
# Let's see what we've got: What does `songLibraryData` look like?

# In[6]:

print(songLibraryData)

# A good sign!
#
# We can read it!
#
# Sort of, at least. Take a moment and compare what you see rendered by the plistlib with the original text of the .xml file (below). You can see that there have been some significant changes: The **header** is gone; in the processed data, single quotes (above) have replaced most tags (below). And so on.
#
#     <?xml version="1.0" encoding="UTF-8"?>
#     <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
#     <plist version="1.0">
#     <dict>
#         <key>Major Version</key><integer>1</integer>
#         <key>Minor Version</key><integer>1</integer>
#         <key>Date</key><date>2018-02-08T16:16:42Z</date>
#         <key>Application Version</key><string>12.7.3.46</string>
#         <key>Features</key><integer>5</integer>
#         <key>Show Content Ratings</key><true/>

# In[7]:

import plistlib
import pandas

# In[8]:

fileName = 'drm_itunes.xml'
# 'rb' = read-only, binary mode; the with-block closes the file when we're done
with open(fileName, 'rb') as myFileObject:
    songLibraryData = plistlib.load(myFileObject)

# In[9]:

# Prep the arrays to hold the lists of data
# trackSerial will be where we store trackID: The four-digit
# code assigned each track inside the .xml file
# The only non-obvious one is trackTime:
# In the .xml, that is an integer describing
# the length of a song in milliseconds.
# So for minutes, we need to divide by 1000
# and then by 60...
trackSerial = []
trackName = []
trackArtist = []
trackAlbum = []
trackTime = []
trackYear = []

# In[10]:

# If you look closely at the parsed data (the output
# of In[6]), you'll see that most of it is organized
# into sets of curly braces. If we look at the top-most
# name for those sets, we see that it is 'Tracks'.
# The library is organized into units called 'Tracks'.

# ```
# {
#   'Major Version': 1,
#   'Minor Version': 1,
#   'Features': 5,
#   'Show Content Ratings': True,
#   'Tracks': {
#     '1494': {
#       'Track ID': 1494,
#       'Name': 'Take on Me',
#       'Artist': 'NO BS! Brass',
#       'Library Folder Count': 1
#     },
#     '1496': {
#       'Track ID': 1496,
#       'Name': 'Splatter Splatter',
#       'Artist': 'Moxy Fruvous'...
# }
# ```

# Again, this is basically just a Dictionary, and everything is ordered in KVPs: Key-Value Pairs. The Key is the term you look up, and the Value is its definition. They are separated by a colon.
#
# THE TRICK HERE is that the VALUE of a KEY is frequently more than just a number or a word. It often is, for example, a whole second dictionary, with KVPs of its own. This sounds complicated. It is not: Just work backwards from our earlier metaphor: If we use a dictionary to keep track of the meaning of each word, then how might we keep track of the different kinds of dictionaries we keep? Right: Inside a dictionary of dictionaries.
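#
# A minimal sketch of that dictionary-of-dictionaries idea, assuming a small hand-built stand-in (`demoLibrary`, a hypothetical name) instead of the real parsed file. The two tracks are copied from the sample above; nothing here reads `drm_itunes.xml`.

```python
# Hand-built stand-in for the parsed library (hypothetical, for illustration only)
demoLibrary = {
    'Major Version': 1,
    'Tracks': {
        '1494': {'Track ID': 1494, 'Name': 'Take on Me', 'Artist': 'NO BS! Brass'},
        '1496': {'Track ID': 1496, 'Name': 'Splatter Splatter', 'Artist': 'Moxy Fruvous'},
    },
}

# One Key, one Value: here the Value happens to be another dictionary
demoTracks = demoLibrary['Tracks']

# Each extra pair of square brackets opens one more layer of curly braces
firstName = demoLibrary['Tracks']['1494']['Name']

print(type(demoTracks).__name__)
print(firstName)
print(sorted(demoTracks.keys()))
```

# However deep the nesting goes, every step is still one Key that leads to one Value.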
#
# The important thing is that everything resolves to a single K and a single V -- even if that single V is actually a pair of curly braces with lots and lots of messy things deep down inside it. We just need to be sure that if we climb all the way to the top of the data set, we have one K for one V.

# In[11]:

# That's what we've got here. 'Tracks' is one of the topmost
# Keys in our .XML file. Inside the VALUE for that one Key
# is our entire music collection. So it makes sense that
# it is the Key we want to open:
tracks = songLibraryData['Tracks']

# In[12]:

# In order to iterate through data, Python
# prefers this for statement. You'll see
# it everywhere you go.
# Here's an archetypal version of it:
# for KEY, VALUE in DICTIONARY.items():
# It just means "Get a key and a value
# from each item in this dictionary."
# As I get each value, I "append" that
# information to a list of similar data.
# (Append just means "add to the end
# of the current list").
# So for trackYear, for example:
# trackYear = ['1994']
# and then trackYear.append('1999')
# so trackYear = ['1994', '1999']
# and then trackYear.append('2000')
# so trackYear = ['1994', '1999', '2000']
for trackID, track in tracks.items():
    trackSerial.append(trackID)
    trackName.append(track['Name'])
    trackAlbum.append(track['Album'])
    trackArtist.append(track['Artist'])
    trackYear.append(track['Year'])
    trackTime.append(track['Total Time'])

# In[30]:

for ID, T in tracks.items():
    print(T)
    print("___")

# In[13]:

# Look at some random data
print(trackName[2])
print(trackAlbum[6])
print(trackArtist[1])
# Just a word of warning: We're not slowing down
# to bother with Error Checking, although we
# probably should. Because of that, if any of
# our data is off-kilter or is missing a line or
# two of values, everything will go to hell quickly.
# With "real" data, you can't afford to skip
# error-handling. Here we can, because I've
# pre-screened this dataset and I know it is
# 100% complete.
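
# If you'd rather not trust a dataset to be 100% complete, one hedged option is the dictionary's `.get()` method, which hands back `None` where square brackets would raise a `KeyError`. What follows is only a sketch: `demoTracks`, `wantedKeys`, and `collectRows` are hypothetical names, and the track values are invented for illustration (they are not from `drm_itunes.xml`).

```python
# Invented sample data: the second track is deliberately missing 'Album' and 'Year'
demoTracks = {
    '1': {'Name': 'Song A', 'Album': 'Album A', 'Artist': 'Artist A',
          'Year': 1999, 'Total Time': 225000},
    '2': {'Name': 'Song B', 'Artist': 'Artist B', 'Total Time': 180000},
}

wantedKeys = ['Name', 'Album', 'Artist', 'Year', 'Total Time']


def collectRows(tracks, keys):
    """Return one dictionary per track, with None wherever a key is missing."""
    rows = []
    for trackID, track in tracks.items():
        row = {'TrackID': trackID}
        for key in keys:
            # track[key] would raise KeyError on a missing key; .get() returns None
            row[key] = track.get(key)
        rows.append(row)
    return rows


demoRows = collectRows(demoTracks, wantedKeys)
for row in demoRows:
    print(row)
```

# A list of dictionaries like `demoRows` can be handed straight to `pandas.DataFrame(...)`, where the missing values show up as empty cells instead of stopping the loop halfway through.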
# In[14]:

# You know why I love the PANDAS library so much?
# Because it always seems to work. Plus I love
# writing code that talks to pandas.
TianTian = pandas.DataFrame({
    'TrackID': trackSerial,
    'Name': trackName,
    'Album': trackAlbum,
    'Artist': trackArtist,
    'Year': trackYear,
    'Duration': trackTime
})

# ## The Payoff

# The payoff here is pretty great, I think. Yes, it took a while to get here, but look at what we can do in a single line. Below, I'm telling TianTian to sort his list by Year, and to add a bar graph to the Duration column, showing the relative length of each song.

# In[15]:

TianTian.sort_values(by='Year').style.bar(subset='Duration', color='#ffcc33')

# In[16]:

# Finally, saving from PANDAS is soooo easy: Ready?
TianTian.to_csv("myTunesData.csv")
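
# The Duration column still holds raw milliseconds. The arithmetic mentioned back in In[9] (divide by 1000, then by 60) can be sketched as two small helpers. `msToMinutes` and `msToClock` are hypothetical names, and the sample value is invented for illustration.

```python
def msToMinutes(milliseconds):
    """Convert milliseconds to minutes as a decimal number."""
    return milliseconds / 1000 / 60


def msToClock(milliseconds):
    """Convert milliseconds to a 'minutes:seconds' string, dropping leftover ms."""
    totalSeconds = milliseconds // 1000
    minutes, seconds = divmod(totalSeconds, 60)
    return f"{minutes}:{seconds:02d}"


sampleDuration = 225000  # invented value: 225 seconds
print(msToMinutes(sampleDuration))
print(msToClock(sampleDuration))
```

# With the DataFrame above, the same arithmetic could be applied to the whole column at once, for example `TianTian['Duration'] / 60000`.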