This notebook goes through the use case described in Chapter 1 of Data Science at the Command Line by Jeroen Janssens. The goal here is to demonstrate the Bash kernel for IPython Notebook, and its ability to show inline images. Chapter 1 is also available as a free PDF on the O'Reilly product page.
display < ~/book/.cover.png
In the previous sections, we’ve given you a definition of data science and explained to you why the command line can be a great environment for doing data science. Now it’s time to demonstrate the power and flexibility of the command line through a real-world use case. We'll go pretty fast, so don’t worry if some things don’t make sense yet.
Personally, we never seem to remember when Fashion Week is happening in New York. We know it’s held twice a year, but every time it comes as a surprise! In this section we’ll consult the wonderful web API of The New York Times to figure out when it's being held. Once you have obtained your own API keys on the developer website, you’ll be able to, for example, search for articles, get the list of best sellers, and see a list of events.
The particular API endpoint that we’re going to query is the article search one. We expect that a spike in the amount of coverage in The New York Times about New York Fashion Week indicates whether it’s happening. The results from the API are paginated, which means that we have to execute the same query multiple times but with a different page number. (It’s like clicking Next on a search engine.) This is where GNU Parallel (Tange, 2014) comes in handy because it can act as a for loop. The entire command looks as follows (don’t worry about all the command-line arguments given to parallel; we’ll discuss them in great detail in Chapter 8):
cd ~/book/ch01/data
parallel -j1 --delay 0.1 --results results "curl -sL "\
"'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=New+York+'"\
"'Fashion+Week&begin_date={1}0101&end_date={1}1231&page={2}&api-key='"\
"'<your-api-key>'" ::: {2009..2013} ::: {0..99} > /dev/null
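To make the "for loop" analogy concrete, here is a rough plain-bash sketch of what parallel is iterating over: the two ::: input sources produce every combination of year and page, just as a nested loop would. The echo merely stands in for the actual curl request; parallel additionally runs the jobs with a delay and captures each job's output under the results directory.

```shell
# Rough sequential equivalent of the parallel invocation above:
# iterate over every (year, page) combination. echo stands in for curl.
for year in {2009..2013}; do
  for page in {0..99}; do
    echo "would fetch year=$year page=$page"
  done
done | head -n 3
```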
Basically, we’re performing the same query for the years 2009 through 2013. The API only allows up to 100 pages (starting at 0) per query, so we’re generating 100 numbers using brace expansion. These numbers are used by the page parameter in the query. We’re searching for articles in each of those years that contain the search term New+York+Fashion+Week. Because the API has certain rate limits, we ensure that there’s only one concurrent request, with a 0.1-second delay between them. Make sure that you replace <your-api-key> with your own API key for the article search endpoint.
Each request returns up to 10 articles, so with 500 requests that’s at most 5,000 articles in total. These are sorted by page views, so this should give us a good estimate of the coverage. The results are in JSON format, which we store in the results directory. The command-line tool tree (Baker, 2014) gives an overview of how the subdirectories are structured:
tree results | head
results
└── 1
    ├── 2009
    │   └── 2
    │       ├── 0
    │       │   ├── stderr
    │       │   └── stdout
    │       ├── 1
    │       │   ├── stderr
    │       │   └── stdout
Let's have a look at the JSON of one article using jq (Dolan, 2014):
< results/1/2009/2/0/stdout jq '.response.docs[0]'
{
  "web_url": "http://www.nytimes.com/video/2009/02/20/fashion/1194838010547/recap-fall-fashion-week-new-york.html",
  "snippet": "Eric Wilson interviews buyers, editors and designers about trends and designing in this economic environment.",
  "lead_paragraph": "Eric Wilson interviews buyers, editors and designers about trends and designing in this economic environment.",
  "abstract": null,
  "print_page": null,
  "blog": [],
  "source": "The New York Times",
  "multimedia": [
    {
      "subtype": "wide",
      "url": "images/2009/02/20/fashion/4877_1_fwrecap_190x126.jpg",
      "height": 126,
      "width": 190,
      "legacy": {
        "wide": "images/2009/02/20/fashion/4877_1_fwrecap_190x126.jpg",
        "wideheight": "126",
        "widewidth": "190"
      },
      "type": "image"
    },
    {
      "subtype": "thumbnail",
      "url": "images/2009/02/20/fashion/4877_1_fwrecap_75x75.jpg",
      "height": 75,
      "width": 75,
      "legacy": {
        "thumbnailheight": "75",
        "thumbnail": "images/2009/02/20/fashion/4877_1_fwrecap_75x75.jpg",
        "thumbnailwidth": "75"
      },
      "type": "image"
    }
  ],
  "headline": {
    "main": "Recap: Fall Fashion Week, New York",
    "sub": "Eric Wilson on This Year's Trends"
  },
  "keywords": [
    {
      "value": "Jacobs, Marc",
      "is_major": "N",
      "rank": "3",
      "name": "persons"
    },
    {
      "value": "New York Fashion Week",
      "is_major": "N",
      "rank": "1",
      "name": "subject"
    },
    {
      "value": "Kors, Michael",
      "is_major": "N",
      "rank": "2",
      "name": "persons"
    }
  ],
  "pub_date": "2009-02-20T15:19:37Z",
  "document_type": "multimedia",
  "news_desk": "Fashion & Style",
  "section_name": "Fashion & Style",
  "subsection_name": null,
  "byline": {
    "person": [
      {
        "organization": "",
        "role": "reported",
        "rank": 1,
        "firstname": "Jigar",
        "lastname": "Mehta"
      }
    ],
    "original": "Jigar Mehta"
  },
  "type_of_material": "Video",
  "_id": "52457c867988105ad44f2dd1",
  "word_count": "15"
}
We can combine and process the results using cat (Granlund & Stallman, 2012), jq, and json2csv (Czebotar, 2014):
cat results/1/*/2/*/stdout |
jq -c '.response.docs[] | {date: .pub_date, type: .document_type, title: .headline.main}' |
json2csv -p -k date,type,title > fashion.csv
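As a quick sanity check of the jq filter, here is a self-contained sketch: we feed it a single hand-written document shaped like the API response shown earlier (the sample values are copied from that response, the rest of the document is omitted) and it emits one compact JSON object per article:

```shell
# Run the same jq filter on one minimal, hand-crafted API-style document;
# it produces one compact JSON object per article.
echo '{"response":{"docs":[{"pub_date":"2009-02-20T15:19:37Z","document_type":"multimedia","headline":{"main":"Recap: Fall Fashion Week, New York"}}]}}' |
jq -c '.response.docs[] | {date: .pub_date, type: .document_type, title: .headline.main}'
```

This prints `{"date":"2009-02-20T15:19:37Z","type":"multimedia","title":"Recap: Fall Fashion Week, New York"}`, which is exactly the shape that json2csv then flattens into a CSV row.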
Let’s break down this command:

- We combine the output of the 500 parallel jobs (or API requests) with cat.
- We use jq to extract the publication date, the document type, and the headline of each article.
- We convert the JSON data to CSV using json2csv and store it as fashion.csv.

With wc -l (Rubin & MacKenzie, 2012), we find out that this data set contains 4,850 articles (and not the maximum of 5,000, because we probably retrieved everything from 2009):
wc -l fashion.csv
4850 fashion.csv
Let’s inspect the first 10 articles to verify that we have succeeded in obtaining the data:
< fashion.csv head | csvlook
|-----------------------+------------+-----------------------------------------|
|  date                 | type       | title                                   |
|-----------------------+------------+-----------------------------------------|
|  2009-02-20T15:19:37Z | multimedia | Recap: Fall Fashion Week, New York      |
|  2009-02-15T23:56:26Z | multimedia | Michael Kors                            |
|  2009-09-17T03:54:58Z | multimedia | UrbanEye: Backstage at Marc Jacobs      |
|  2009-02-16T23:56:55Z | multimedia | Bill Cunningham on N.Y. Fashion Week    |
|  2009-09-17T21:55:48Z | multimedia | Fashion Week Spring 2010                |
|  2009-02-12T19:40:39Z | multimedia | Alexander Wang                          |
|  2009-09-11T23:07:51Z | multimedia | Of Color | Diversity Beyond the Runway  |
|  2009-09-18T18:44:10Z | multimedia | On the Street | The Look                |
|  2009-09-14T14:20:32Z | multimedia | A Designer Reinvents Himself            |
|-----------------------+------------+-----------------------------------------|
That seems to have worked! In order to gain any insight, we’d better visualize the data. The figure below contains a line graph created with R (R Foundation for Statistical Computing, 2014), Rio (Janssens, 2014), and ggplot2 (Wickham, 2009).
export RIO_DPI=100 # Increase image size produced by ggplot2 a bit (default is 72)
< fashion.csv Rio -ge 'g + geom_freqpoly(aes(as.Date(date), color = type), binwidth = 7) '\
'+ scale_x_date() + labs(x = "date", title = "Coverage of New York Fashion Week in New York Times")' | display
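Incidentally, even without R you can spot the spikes with plain text tools: counting articles per month with cut, sort, and uniq gives a crude text version of the same graph. This is just a sketch; the dates below are invented for illustration, and on the real data you would extract the first column of fashion.csv instead of using printf.

```shell
# Sketch: a text-only way to see which months have the most coverage.
printf '%s\n' 2009-02-20 2009-02-15 2009-09-17 2009-09-18 2010-02-11 |
cut -c1-7 |     # keep only the year-month part of each date
sort | uniq -c  # count articles per month
```

On the invented dates this prints counts of 2 for 2009-02, 2 for 2009-09, and 1 for 2010-02; on fashion.csv the February and September months stand out the same way.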
By looking at the line graph, we can infer that New York Fashion Week happens two times per year. And now we know when: once in February and once in September. Let’s hope that it’s going to be the same this year so that we can prepare ourselves! In any case, we hope that with this example, we’ve shown that The New York Times API is an interesting source of data. More importantly, we hope that we’ve convinced you that the command line can be a very powerful approach for doing data science.
In this section, we’ve peeked at some important concepts and some exciting command-line tools. Don’t worry if some things don’t make sense yet. Most of the concepts will be discussed in Chapter 2, and in the subsequent chapters we’ll go into more detail for all the command-line tools used in this section.