Getting Started Working with Twitter Data Using jq

jq is a command line JSON processor that's helpful for working with JSON data from Twitter. You'll want to download and install jq on your system to use this notebook with your data. You could also use jqplay to try out these jq statements.

This notebook works with tweets collected from the Twitter filterstream API using an earlier version of Social Feed Manager, but there are lots of tools to get data from the Twitter APIs. To use this notebook with your own data, set the path to your data file as DATA.

As background, Twitter streaming API data is line-oriented JSON, meaning one tweet in JSON format per line. Output from tools such as twarc is also often line-oriented JSON.

This notebook is intended to help people getting started with working with Twitter data using jq. There are many additional software libraries available to do further analysis, including within a notebook. As an example, see Cody Buntain's notebook analyzing #Ferguson tweets as part of the Researching Ferguson Teach-In at MITH in 2015.

We use jq a lot in working with students and faculty at GW Libraries. Do you have useful jq statements we could share here? We welcome suggestions and improvements to this notebook, via Github, Twitter (@liblaura, @dankerchner, @justin_littman or email (lwrubel at gwu dot edu).

In [1]:
DATA="data/tweets"

Basic filtering

View the JSON data, both keys and values, in a prettified format. I'm using the head command to just show the first tweet in the file. Alternatively, you can use cat to look at the whole file.

In [2]:
!head -1 $DATA | jq '.'
{
  "created_at": "Wed Apr 06 22:32:54 +0000 2016",
  "id": 717842544655208400,
  "id_str": "717842544655208449",
  "text": "#WMATA GM says he has lots of partners, critics. \"I've got lots of critics and that's good. We should have critics.\" Opens up to questions",
  "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
  "truncated": false,
  "in_reply_to_status_id": 717842320452948000,
  "in_reply_to_status_id_str": "717842320452947968",
  "in_reply_to_user_id": 62842049,
  "in_reply_to_user_id_str": "62842049",
  "in_reply_to_screen_name": "faizsays",
  "user": {
    "id": 62842049,
    "id_str": "62842049",
    "name": "Faiz Siddiqui",
    "screen_name": "faizsays",
    "location": "Washington, DC",
    "url": "https://www.washingtonpost.com/people/faiz-siddiqui",
    "description": "Transportation reporter @washingtonpost. Formerly @BostonGlobe. Books, biryani, good raps. [email protected]",
    "protected": false,
    "verified": true,
    "followers_count": 1188,
    "friends_count": 897,
    "listed_count": 51,
    "favourites_count": 797,
    "statuses_count": 6132,
    "created_at": "Tue Aug 04 15:13:48 +0000 2009",
    "utc_offset": -14400,
    "time_zone": "Eastern Time (US & Canada)",
    "geo_enabled": true,
    "lang": "en",
    "contributors_enabled": false,
    "is_translator": false,
    "profile_background_color": "C0DEED",
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_tile": false,
    "profile_link_color": "0084B4",
    "profile_sidebar_border_color": "C0DEED",
    "profile_sidebar_fill_color": "DDEEF6",
    "profile_text_color": "333333",
    "profile_use_background_image": true,
    "profile_image_url": "http://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/62842049/1454078321",
    "default_profile": true,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null
  },
  "geo": null,
  "coordinates": null,
  "place": null,
  "contributors": null,
  "is_quote_status": false,
  "retweet_count": 0,
  "favorite_count": 0,
  "entities": {
    "hashtags": [
      {
        "text": "WMATA",
        "indices": [
          0,
          6
        ]
      }
    ],
    "urls": [],
    "user_mentions": [],
    "symbols": []
  },
  "favorited": false,
  "retweeted": false,
  "filter_level": "low",
  "lang": "en",
  "timestamp_ms": "1459981974172"
}

View just the values of each field, without the labels:

In [3]:
!head -1 $DATA| jq '.[]'
"Wed Apr 06 22:32:54 +0000 2016"
717842544655208400
"717842544655208449"
"#WMATA GM says he has lots of partners, critics. \"I've got lots of critics and that's good. We should have critics.\" Opens up to questions"
"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
false
717842320452948000
"717842320452947968"
62842049
"62842049"
"faizsays"
{
  "id": 62842049,
  "id_str": "62842049",
  "name": "Faiz Siddiqui",
  "screen_name": "faizsays",
  "location": "Washington, DC",
  "url": "https://www.washingtonpost.com/people/faiz-siddiqui",
  "description": "Transportation reporter @washingtonpost. Formerly @BostonGlobe. Books, biryani, good raps. [email protected]",
  "protected": false,
  "verified": true,
  "followers_count": 1188,
  "friends_count": 897,
  "listed_count": 51,
  "favourites_count": 797,
  "statuses_count": 6132,
  "created_at": "Tue Aug 04 15:13:48 +0000 2009",
  "utc_offset": -14400,
  "time_zone": "Eastern Time (US & Canada)",
  "geo_enabled": true,
  "lang": "en",
  "contributors_enabled": false,
  "is_translator": false,
  "profile_background_color": "C0DEED",
  "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
  "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
  "profile_background_tile": false,
  "profile_link_color": "0084B4",
  "profile_sidebar_border_color": "C0DEED",
  "profile_sidebar_fill_color": "DDEEF6",
  "profile_text_color": "333333",
  "profile_use_background_image": true,
  "profile_image_url": "http://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
  "profile_image_url_https": "https://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
  "profile_banner_url": "https://pbs.twimg.com/profile_banners/62842049/1454078321",
  "default_profile": true,
  "default_profile_image": false,
  "following": null,
  "follow_request_sent": null,
  "notifications": null
}
null
null
null
null
false
0
0
{
  "hashtags": [
    {
      "text": "WMATA",
      "indices": [
        0,
        6
      ]
    }
  ],
  "urls": [],
  "user_mentions": [],
  "symbols": []
}
false
false
"low"
"en"
"1459981974172"

Filter your data down to specific fields:

In [4]:
!head -3 $DATA | jq '[.created_at, .text]'
[
  "Wed Apr 06 22:32:54 +0000 2016",
  "#WMATA GM says he has lots of partners, critics. \"I've got lots of critics and that's good. We should have critics.\" Opens up to questions"
]
[
  "Wed Apr 06 22:33:10 +0000 2016",
  "Wiedefeld says he has lots of critics, looking for partners. Wants smart people for #wmata"
]
[
  "Wed Apr 06 22:34:21 +0000 2016",
  "Chairwoman Hermanson is first: Q: With all of the news about long-term plan, are there specific ways that riders can help? #wmata"
]

The Twitter API documentation describes the responses from the streaming (e.g. filter, sample) and REST (user timeline, search) APIs.

JSON is hierarchical, and the created_at and text fields are at the top level of the tweet. Some fields in a tweet have additional fields within them. For example, the user field contains fields with information about the user who tweeted, including a count of their followers, location, and a unique id (id_str):

In [5]:
!head -1 $DATA | jq '[.user]'
[
  {
    "id": 62842049,
    "id_str": "62842049",
    "name": "Faiz Siddiqui",
    "screen_name": "faizsays",
    "location": "Washington, DC",
    "url": "https://www.washingtonpost.com/people/faiz-siddiqui",
    "description": "Transportation reporter @washingtonpost. Formerly @BostonGlobe. Books, biryani, good raps. [email protected]",
    "protected": false,
    "verified": true,
    "followers_count": 1188,
    "friends_count": 897,
    "listed_count": 51,
    "favourites_count": 797,
    "statuses_count": 6132,
    "created_at": "Tue Aug 04 15:13:48 +0000 2009",
    "utc_offset": -14400,
    "time_zone": "Eastern Time (US & Canada)",
    "geo_enabled": true,
    "lang": "en",
    "contributors_enabled": false,
    "is_translator": false,
    "profile_background_color": "C0DEED",
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_tile": false,
    "profile_link_color": "0084B4",
    "profile_sidebar_border_color": "C0DEED",
    "profile_sidebar_fill_color": "DDEEF6",
    "profile_text_color": "333333",
    "profile_use_background_image": true,
    "profile_image_url": "http://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/706225181254139909/dXqML--m_normal.jpg",
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/62842049/1454078321",
    "default_profile": true,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null
  }
]

To filter for a subset of the user fields, use dot notation:

In [6]:
!head -2 $DATA | jq '[.user.screen_name, .user.name, .user.followers_count, .user.id_str]'
[
  "faizsays",
  "Faiz Siddiqui",
  1188,
  "62842049"
]
[
  "MetroReasons",
  "Metro Reasons",
  1492,
  "3378190378"
]

Some fields occur multiple times, such as hashtags and mentions. Pull out the hashtag text fields and put them together into one field, separated by commas:

In [7]:
!cat $DATA | jq '[([.entities.hashtags[].text] | join(","))]'
[
  "WMATA"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "WMATA,SupportUnions"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "WMATA,RAC"
]
[
  "wmata"
]
[
  "WMATA"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata"
]
[
  ""
]
[
  "wmata,rac"
]
[
  "WMATA"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata,rac"
]
[
  "wmata"
]
[
  "wmata"
]
[
  "wmata"
]

Output to CSV

A common use of jq is to turn your JSON data into a csv file to load into other analysis software. The -r option (--raw-output) formats the field as a string suitable for csv, as opposed to a JSON-formatted string with quotes.

In [8]:
!head -8 $DATA | jq -r '[.id_str, .created_at, .text] | @csv'
"717842544655208449","Wed Apr 06 22:32:54 +0000 2016","#WMATA GM says he has lots of partners, critics. ""I've got lots of critics and that's good. We should have critics."" Opens up to questions"
"717842614859513856","Wed Apr 06 22:33:10 +0000 2016","Wiedefeld says he has lots of critics, looking for partners. Wants smart people for #wmata"
"717842912202063873","Wed Apr 06 22:34:21 +0000 2016","Chairwoman Hermanson is first: Q: With all of the news about long-term plan, are there specific ways that riders can help? #wmata"
"717843310124081153","Wed Apr 06 22:35:56 +0000 2016","RT @FixWMATA: #WMATA RAC meeting is starting - should be live streaming... eventually. https://t.co/iUy3EXkFEJ"
"717843816611495936","Wed Apr 06 22:37:57 +0000 2016","@FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like that."""
"717844057104494593","Wed Apr 06 22:38:54 +0000 2016","RT @faizsays: @FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like…"
"717844406334853121","Wed Apr 06 22:40:18 +0000 2016","Time to lead people out in handcuffs. @ATULocal689 #wmata  https://t.co/ex0nNRzje3"
"717844672656326658","Wed Apr 06 22:41:21 +0000 2016","[email protected] @ATULocal689: We've swindled riders and taxpayers for decades. That should qualify as being ""smart""! #WMATA #SupportUnions"

You probably want to write that data to a file, however:

In [9]:
!cat $DATA | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv
In [10]:
!head tweets.csv
"717842544655208449","Wed Apr 06 22:32:54 +0000 2016","#WMATA GM says he has lots of partners, critics. ""I've got lots of critics and that's good. We should have critics."" Opens up to questions"
"717842614859513856","Wed Apr 06 22:33:10 +0000 2016","Wiedefeld says he has lots of critics, looking for partners. Wants smart people for #wmata"
"717842912202063873","Wed Apr 06 22:34:21 +0000 2016","Chairwoman Hermanson is first: Q: With all of the news about long-term plan, are there specific ways that riders can help? #wmata"
"717843310124081153","Wed Apr 06 22:35:56 +0000 2016","RT @FixWMATA: #WMATA RAC meeting is starting - should be live streaming... eventually. https://t.co/iUy3EXkFEJ"
"717843816611495936","Wed Apr 06 22:37:57 +0000 2016","@FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like that."""
"717844057104494593","Wed Apr 06 22:38:54 +0000 2016","RT @faizsays: @FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like…"
"717844406334853121","Wed Apr 06 22:40:18 +0000 2016","Time to lead people out in handcuffs. @ATULocal689 #wmata  https://t.co/ex0nNRzje3"
"717844672656326658","Wed Apr 06 22:41:21 +0000 2016","[email protected] @ATULocal689: We've swindled riders and taxpayers for decades. That should qualify as being ""smart""! #WMATA #SupportUnions"
"717844796254117888","Wed Apr 06 22:41:50 +0000 2016","At #WMATA RAC mtg. Glad Wiedefeld showed."
"717844863748816897","Wed Apr 06 22:42:07 +0000 2016","Wiedefeld notes frustration of going from crisis to crisis. Wants to change approach to break that cycle to deal with issues. #wmata"

Some fields, particularly the text of a tweet, have newline characters. This can be a problem with your csv, breaking a tweet across lines. Substitute all occurrences of the newline character (\n) with a space:

In [11]:
!cat $DATA | jq -r '[.id_str, .created_at, (.text | gsub("\n";" "))] | @csv' > tweets-oneline.csv
In [12]:
!head tweets-oneline.csv
"717842544655208449","Wed Apr 06 22:32:54 +0000 2016","#WMATA GM says he has lots of partners, critics. ""I've got lots of critics and that's good. We should have critics."" Opens up to questions"
"717842614859513856","Wed Apr 06 22:33:10 +0000 2016","Wiedefeld says he has lots of critics, looking for partners. Wants smart people for #wmata"
"717842912202063873","Wed Apr 06 22:34:21 +0000 2016","Chairwoman Hermanson is first: Q: With all of the news about long-term plan, are there specific ways that riders can help? #wmata"
"717843310124081153","Wed Apr 06 22:35:56 +0000 2016","RT @FixWMATA: #WMATA RAC meeting is starting - should be live streaming... eventually. https://t.co/iUy3EXkFEJ"
"717843816611495936","Wed Apr 06 22:37:57 +0000 2016","@FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like that."""
"717844057104494593","Wed Apr 06 22:38:54 +0000 2016","RT @faizsays: @FixWMATA raises potential shutdown conversation to #wmata GM. ""I think the ridership shouldn't be blindsided by things like…"
"717844406334853121","Wed Apr 06 22:40:18 +0000 2016","Time to lead people out in handcuffs. @ATULocal689 #wmata  https://t.co/ex0nNRzje3"
"717844672656326658","Wed Apr 06 22:41:21 +0000 2016","[email protected] @ATULocal689: We've swindled riders and taxpayers for decades. That should qualify as being ""smart""! #WMATA #SupportUnions"
"717844796254117888","Wed Apr 06 22:41:50 +0000 2016","At #WMATA RAC mtg. Glad Wiedefeld showed."
"717844863748816897","Wed Apr 06 22:42:07 +0000 2016","Wiedefeld notes frustration of going from crisis to crisis. Wants to change approach to break that cycle to deal with issues. #wmata"

Output to JSON

If you'd like JSON format as your output, you can specify the keys in the JSON objects created in the output:

In [2]:
!cat $DATA | jq -c '{{id: .id_str, user_id: .user.id_str, screen_name: .user.screen_name, created_at: .created_at, text: .text, user_mentions: [.entities.user_mentions[]?.screen_name], hashtags: [.entities.hashtags[]?.text], urls: [.entities.urls[]?.expanded_url]}}' > newtweets.json
cat: data/tweets: No such file or directory
In [19]:
!head newtweets.json
{"id":717842544655208400,"user_id":"62842049","screen_name":"faizsays","created_at":"Wed Apr 06 22:32:54 +0000 2016","text":"#WMATA GM says he has lots of partners, critics. \"I've got lots of critics and that's good. We should have critics.\" Opens up to questions","user_mentions":[],"hashtags":["WMATA"],"urls":[]}
{"id":717842614859513900,"user_id":"3378190378","screen_name":"MetroReasons","created_at":"Wed Apr 06 22:33:10 +0000 2016","text":"Wiedefeld says he has lots of critics, looking for partners. Wants smart people for #wmata","user_mentions":[],"hashtags":["wmata"],"urls":[]}
{"id":717842912202063900,"user_id":"62842049","screen_name":"faizsays","created_at":"Wed Apr 06 22:34:21 +0000 2016","text":"Chairwoman Hermanson is first: Q: With all of the news about long-term plan, are there specific ways that riders can help? #wmata","user_mentions":[],"hashtags":["wmata"],"urls":[]}
{"id":717843310124081200,"user_id":"105560705","screen_name":"waflanagan","created_at":"Wed Apr 06 22:35:56 +0000 2016","text":"RT @FixWMATA: #WMATA RAC meeting is starting - should be live streaming... eventually. https://t.co/iUy3EXkFEJ","user_mentions":["FixWMATA"],"hashtags":["WMATA"],"urls":["http://www.wmata.com/about_metro/riders_advisory_council/events_and_docs_by_month.cfm"]}
{"id":717843816611496000,"user_id":"62842049","screen_name":"faizsays","created_at":"Wed Apr 06 22:37:57 +0000 2016","text":"@FixWMATA raises potential shutdown conversation to #wmata GM. \"I think the ridership shouldn't be blindsided by things like that.\"","user_mentions":["FixWMATA"],"hashtags":["wmata"],"urls":[]}
{"id":717844057104494600,"user_id":"1587152041","screen_name":"FixWMATA","created_at":"Wed Apr 06 22:38:54 +0000 2016","text":"RT @faizsays: @FixWMATA raises potential shutdown conversation to #wmata GM. \"I think the ridership shouldn't be blindsided by things like…","user_mentions":["faizsays","FixWMATA"],"hashtags":["wmata"],"urls":[]}
{"id":717844406334853100,"user_id":"461101349","screen_name":"FireDrGridlock","created_at":"Wed Apr 06 22:40:18 +0000 2016","text":"Time to lead people out in handcuffs. @ATULocal689 #wmata  https://t.co/ex0nNRzje3","user_mentions":["ATULocal689"],"hashtags":["wmata"],"urls":["https://twitter.com/concernemploye1/status/717008115389095936"]}
{"id":717844672656326700,"user_id":"178065130","screen_name":"ConGrpThink","created_at":"Wed Apr 06 22:41:21 +0000 2016","text":"[email protected] @ATULocal689: We've swindled riders and taxpayers for decades. That should qualify as being \"smart\"! #WMATA #SupportUnions","user_mentions":["MetroReasons","ATULocal689"],"hashtags":["WMATA","SupportUnions"],"urls":[]}
{"id":717844796254117900,"user_id":"82183980","screen_name":"swdcjenn","created_at":"Wed Apr 06 22:41:50 +0000 2016","text":"At #WMATA RAC mtg. Glad Wiedefeld showed.","user_mentions":[],"hashtags":["WMATA"],"urls":[]}
{"id":717844863748816900,"user_id":"3378190378","screen_name":"MetroReasons","created_at":"Wed Apr 06 22:42:07 +0000 2016","text":"Wiedefeld notes frustration of going from crisis to crisis. Wants to change approach to break that cycle to deal with issues. #wmata","user_mentions":[],"hashtags":["wmata"],"urls":[]}
In [ ]: