#!/usr/bin/env python
# coding: utf-8

# # Tutorial: Getting to that Sweet, Sweet Data
# ### I/O tutorial covering (in several parts) .txt, .HTML, .CSV, .XML, .JSON
# #### Digital Research Methods (LeMasters, 2018)
# Draft 2 (9 September 2018)

# I'd built some tutorials this week that demonstrated three different techniques for acquiring (and making some sense of) data from text- and HTML-based files; instead of presenting those three projects in a serial, discrete fashion, though, I've begun chopping them up and re-organizing them thematically -- I think they'll be much more useful to us that way.
#
# So, for example, instead of building an entire mini-application in one tutorial, we'll review one tutorial in which we look at three different techniques for loading files; one where we look at three different ways of parsing text; and then one where we look at three approaches to rendering that data.
#
# I've never approached the problem in this way, but it looks promising. Be sure to let me know what you think.

# ## Pulling text into your data science pipeline -- and getting it back out.

# ### I/O

# I/O is ubiquitous in computation: It stands for Input/Output. The I/O interface is where the computer and the world meet. There are many ways to handle it -- but in the end, there are only a few techniques that you'll return to again and again. The biggest determining factor in your approach is (1) the kind of file you're reading and (2) its disposition in your workflow: What comes out of that file?
#
# As we mentioned on Thursday, there are many file types you're likely to deal with, but .txt, along with four related filetypes, stands out as most common (for now):
# 1. **.txt** | The plain-vanilla text file. When in doubt, you can almost always read or write any kind of information in this format. The formats below are all .txt files with fancy hats.
# 2. **.HTML** | HTML (HyperText Markup Language) is plain text that is heavily "marked up" with tags (words inside angle brackets). Because HTML is typically used to build web pages, it is heavily engaged in layout and presentation -- not just content. And because HTML was never really intended to be used in this fashion, it is now joined by dozens of auxiliary languages, scripts, and objects -- all of which tend to get in your way while you work with HTML.
# 3. **.CSV** | The venerable comma-separated-values file is the workhorse of contemporary data science. .CSV files (and their fraternal twin, .TSV, tab-separated values) tend to do a great job delivering data with a minimal amount of cruft. Their main drawback is that they are uncompressed. It is not uncommon to find even powerful computers choking on unnecessarily large .CSV files.
# 4. **.JSON** | Originating with the popular web programming language JavaScript, JavaScript Object Notation is a simple, efficient, and remarkably popular format: It makes all the other formats jealous. Many professionals who work with data science toolsets have come to prefer .JSON over .XML, in part because it simply looks less intimidating. While APIs tend to make their data available in both formats, JSON now tends to be the default.
# 5. **.XML** | The W3C describes the Extensible Markup Language (XML) as "a simple, very flexible text format." I would call that description "overly optimistic." When we have time, we can review some .XML files and you'll see why fewer and fewer institutions seem to depend on it the way they used to.
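# To make the "fancy hats" point concrete, here is a minimal sketch that prints one tiny, made-up record the way it would look in a .CSV file and then in a .JSON file. (The record is invented purely for illustration -- we'll work with the real tutorial files below, and we'll return to these formats in later tutorials.)
#
#     import json
#
#     poet = {'name': 'Rainer Maria Rilke', 'born': 1875}
#
#     # the same record as a .CSV file: a header row, then the values
#     print('name,born')
#     print('{},{}'.format(poet['name'], poet['born']))
#
#     # and as .JSON: the same information as labeled keys and values
#     print(json.dumps(poet, indent=2))
#
# Same information, different hats.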
# #### Before you get started:
#
# We need some files to download. Once you've worked through these examples, I encourage you to repeat them using your own files. To get started, though:
#
# Begin by launching Anaconda Navigator. From there, launch Jupyter. It will send you to the default directory listing: If that is not where you want to store your Notebook files, navigate to the place where you DO want to keep them. Once you are satisfied that you are looking at the right directory for this purpose, go to the upper right-hand corner of the browser window and choose the button/pulldown labeled NEW. From the list it creates, choose Python 3.
#
# You should be in Jupyter now: Let's fix the name of this notebook first. At the top of the document, click on the word "Untitled" and rename the file to:
#
#     IO_Tutorial_One
#
# Note: The underscores look silly, but real spaces tend to make code complicated. For example, in order to change to a directory called "/rough draft graphics/color imgs", I would type:
#
#     cd /rough\ draft\ graphics/color\ imgs
#
# See what I mean? So we all try to avoid using spaces whenever possible.
#
# Save the document by clicking on the little floppy disk icon on the leftmost side of the tool ribbon. Let's make sure we're off to a good start: Put that browser window aside (minimize it, hide it, etc.) and open up your operating system's native file browser (the one you use every day: OSX uses the Finder; Windows uses Explorer). Navigate to the directory where your Jupyter notebook should be, and make sure it is, in fact, there. Jupyter will have added a file extension to the filename, so it should look like this:
#
#     IO_Tutorial_One.ipynb
#
# Great. Now before you leave this directory, download [this file from our website](http://www.digitalresearch.online/IO_Tutorial_One_Data.zip) and save it in the same directory as your notebook file.
#
# Explanation: You are downloading a compressed copy of five or more files inside a single folder. This is delivered to you as a file called IO_Tutorial_One_Data.zip. BUT MacOS may automatically unzip it for you, without asking your permission. This can make things confusing. If so, use the folder MacOS unzipped -- if not, unzip the file yourself and drag the resulting folder to the same place your notebook is stored.

# ## PATH

# Now we need to ingest the data file. It is a three-step process.
#
# 1. Identify the path. Tell Jupyter where it can find your file.
# 2. Open the file. Tell Jupyter how to open your file.
# 3. Read the file. Tell Jupyter where to put the data.
#
# Piece of cake!

# > Quick Question: Why so many steps?
#
# > It's true, all of this can be accomplished in one step. But don't. Spread it out so you can get a sense of what is happening. You can compress the process once you're more confident about the component parts.
#
# > Quick Question: Why use variables instead of just using the file's name?
#
# > You certainly don't have to use variables -- but the idea is that you won't just do this once. You'll do it dozens, even hundreds of times in the near future. By using a variable instead of the file's "real" path, you save yourself effort, and you are working in the spirit of data scientists and programmers, whose motto is always DRY: Don't Repeat Yourself.

# ## Identify the Path

# Tell Python how to find your file. The location of the file you want to load into memory is the *file path*. In our case, it is where we can find the folder called IO_Tutorial_One_Data. (If you haven't downloaded that yet, see the section above.)
#
# Your code will reuse path variables like this a lot.
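# Before we wire a path into anything else, here is a minimal sanity check you can run in a cell of its own -- a sketch that assumes you kept the folder and file names from the download step. It uses Python's built-in os module; if it prints False, the data folder is not where the notebook expects it to be.
#
#     import os
#
#     # True if the unzipped data folder sits next to this notebook
#     print(os.path.exists('IO_Tutorial_One_Data/RilkePoem.txt'))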
# Now let's return to our Jupyter notebook and create a variable that points right to the file we need:
#
#     myFilePath = 'IO_Tutorial_One_Data/RilkePoem.txt'
#
# Don't forget the quotes (single or double quotes both work in Python -- just pick one and stay consistent). And note that the variable name myFilePath is arbitrary. When I name variables that contain information specific to me, I tend to put the possessive adjective in front -- it helps me recognize variables that are "mine." But again -- it is arbitrary:
#
#     secretNuclearStorageFacility = 'IO_Tutorial_One_Data/RilkePoem.txt'
#
# Whatever makes sense to you is fine for now. But generally, a variable should be named so that others can understand what you're doing.

# ### Bonus Tip

# Define your path variables way up at the top of your code, right after you import your libraries: They will change frequently from project to project, and by keeping them near the top, you'll spend less time searching for them.

# ## Open()

# Now we call Python's open() function to ask the operating system for access to the file. We also need to let Python know what our long-term intentions are. There are many details you can share with Python about your file, but we only care about two:
# 1. The file path
# 2. The access mode
#
# We've already handled the file path. Let's turn to the *mode*.
#
# The **mode parameter** wants to know how you will use the file:
#
#     'r' : use for reading
#     'w' : use for writing
#     'x' : use for creating and writing to a new file
#     'a' : use for appending to a file
#     'r+': use for reading and writing to the same file
#
# For the most part, 'r' is a strong choice -- you typically don't want to overwrite your data files.
#
# Let's put these steps together, then:

# In[21]:

myFilePath = 'IO_Tutorial_One_Data/RilkePoem.txt'

# In[22]:

myPoemData = open(myFilePath, 'r')

# No complaints from Jupyter is a good sign!

# We're so close! Let's seal the deal by reading the poem into the system!

# # Read()

# In order to see just what we've wrought, let's make use of Jupyter's fancy *interactive mode* again: Instead of writing out the code that a program will execute ("interpret"), we're just going to talk with Python. We can't get much calculation done this way, but it gives us a much more intimate look at internal processes than other programming languages almost ever allow.
#
# So then: We've set the path and opened the file. All that is left to do is read it.

# In[23]:

myPoemData.read()

# **Bam!** *Welcome to flavor country!* Jupyter paints the whole poem right there, line for line.
#
# Of course, the poem looks a bit wilted, a bit crushed: All of those '\n' sequences are usually hidden from sight. Each one is an *escape sequence* -- in this case, the one that stands for the newline character, also called a linefeed (sometimes abbreviated LF). It is a code that originally told a printer to advance its sheet of paper to a new line -- about 5 mm.

# What about those empty parentheses? Parentheses, empty or not, almost always signify *action*. It's useful to think in terms of grammar (because, in the end, it is a kind of grammar):
#
#     myPoemData.read()
#
# can be understood as:
#
# (in the imperative) "Tell myPoemData to read itself."
#
# And it does just that, spilling the results all over our page because we didn't tell it where to store them. Let's do that now.

# In[24]:

textOutput = myPoemData.read()

# Excellent! The data is safely locked inside our variable textOutput. See?

# In[25]:

print(textOutput)

# Umm...Wait, what? Where's our poem?
# Why didn't that work?

# Sigh. Here's the story: When I .read() the file to which myPoemData refers, I initiate a lot of work on the computer's part. After all, I'm moving a block of data from the file I've opened on my hard drive to a new home in memory. Python uses a pointer (like a cursor) to keep track of its place in the file as the data gets channeled along. When that pointer reaches the end of the file, it just stops -- much like the needle arm on an old-fashioned record player. So our second .read() started at the end of the file and came back with nothing. If I want to hear that song again, or if I want to see that data again, then we're going to have to move the needle ourselves.

# How? Easy. The pointer is stored inside the data object we built. We just need to reset it thus:

# In[26]:

myPoemData.seek(0)

# The 0 we hand to .seek() is the position we want to jump to -- character zero, the very beginning of the file -- and the 0 Jupyter prints back is simply the pointer's new position. Now let's try our .read() again.

# In[27]:

myPoemData.read()

# *Voila!*
#
# There are other reading methods we can make use of, too, each time using .seek(0) to reset the needle on the record. Let's try them first, and then make sense of what we're seeing:

# In[28]:

myPoemData.seek(0)
myPoemData.readlines()

# That's every line at once, delivered as a list. Its singular cousin, .readline(), behaves differently:

# In[35]:

myPoemData.seek(0)
myPoemData.readline()

# Ooh! Again, without the .seek(0)?

# In[36]:

myPoemData.readline()

# Interesting. We're moving line by line, right? Let's see:

# In[37]:

myPoemData.readline()

# In[44]:

myPoemData.readline()

# Fine. So, clearly, we need to get the data into a more dependable place: I don't want to reset the pointer every time I want to look at the data. So this time, we'll .read() it into a new variable for safekeeping.

# In[46]:

myPoemData.seek(0)                     # reset our pointer
workingPoemData = myPoemData.read()

# And then let's peer into the new variable and see what's up.

# In[47]:

print(workingPoemData)

# # Recap

# OK: That was a lot, and not-a-lot, all at once. Just to recall the most important points in the form of working code:

# In[51]:

myFilePath = 'IO_Tutorial_One_Data/RilkePoem.txt'
myPoemData = open(myFilePath, 'r')
workingPoemData = myPoemData.read()
print(workingPoemData)

# Of course, I can read any flat file this way -- as long as it isn't a binary file. All of these will work just as well:

# In[53]:

myFilePathA = 'IO_Tutorial_One_Data/RilkePoem.csv'
myFilePathB = 'IO_Tutorial_One_Data/RilkePoem.html'
# and even the related style sheet:
myFilePathC = 'IO_Tutorial_One_Data/main.css'

# In[57]:

myPoemDataB = open(myFilePathB, 'r')
workingPoemDataB = myPoemDataB.read()
print(workingPoemDataB)

# Now it's your turn. Grab a few text files and write out enough code in Jupyter to pull those files in and display them.
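# If you want a starting point for that exercise, here is a minimal sketch -- the list of paths is just a placeholder, so swap in your own files. It loops over several paths, reads and prints each one, and calls .close() when it is finished so the operating system gets the file back:
#
#     myFiles = ['IO_Tutorial_One_Data/RilkePoem.txt',
#                'IO_Tutorial_One_Data/RilkePoem.html']   # replace with your own paths
#
#     for myFilePath in myFiles:
#         myTextData = open(myFilePath, 'r')
#         print(myTextData.read())
#         myTextData.close()   # finished reading, so release the file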