All About TEDx

TEAM : Chan Kim, JT Huang

Our Goal

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, and Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing that spread with TED Talk videos.

The questions we are trying to answer:

  • How are the TEDx videos distributed among different languages and places?
  • How does the spread of TEDx videos compare to that of TED videos?

Outline

About Dataset

Since TEDx does not provide an API for retrieving video data, we wrote our own scraper to crawl various attributes of the TEDx videos. And since all TEDx videos are on YouTube, we also use the YouTube API to retrieve more information about the videos.

  • TEDx Website

    • Language
    • Event
    • Country
    • Topic
  • YouTube API

    • Uploaded Timestamp
    • Title
    • Tags
    • Thumbnail
    • Duration
    • Like Count
    • Rating
    • Rating Count
    • View Count
    • Favorite Count
    • Comment Count

TEDx Web Scraper

First, we collect the portal links for each type (language, event, country, topic) from the TEDx home URL.

Using Beautiful Soup, we look for every link on the home page whose href begins with one of the following strings.

  • Language Pages: '/browse/talks-by-language/', EX:
    • Korean: '/browse/talks-by-language/korean'
    • Chinese: '/browse/talks-by-language/chinese'
  • Event Pages: '/browse/talks-by-event/', EX:
    • TEDxBerkeley: '/browse/talks-by-event/tedxberkeley'
    • TEDxStanford: '/browse/talks-by-event/tedxstanford'
  • Country Pages: '/browse/talks-by-country/', EX:
    • South Korea: '/browse/talks-by-country/korea'
    • Taiwan: '/browse/talks-by-country/taiwan'
  • Topic Pages: '/browse/talks-by-topic/', EX:
    • Technology: '/browse/talks-by-topic/technology'
    • Design: '/browse/talks-by-topic/design'
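The four prefixes above can be handled in one pass. The following is a minimal sketch, not the scraper we actually ran: the helper name `group_portal_links` and the dictionary `PORTAL_PREFIXES` are hypothetical, and the short type names simply mirror the attribute names used later in the merge step. It works on a plain list of href strings, so it does not depend on how the links were extracted.

```python
# Hypothetical helper: bucket hrefs by the four portal prefixes listed above.
PORTAL_PREFIXES = {
    "lang": "/browse/talks-by-language/",
    "event": "/browse/talks-by-event/",
    "country": "/browse/talks-by-country/",
    "topic": "/browse/talks-by-topic/",
}


def group_portal_links(hrefs):
    """Return {type: [href, ...]} for hrefs that start with a known prefix."""
    groups = {name: [] for name in PORTAL_PREFIXES}
    for href in hrefs:
        for name, prefix in PORTAL_PREFIXES.items():
            # Keep only links that carry a slug after the prefix.
            if href.startswith(prefix) and len(href) > len(prefix):
                groups[name].append(href)
                break
    return groups


sample = [
    "/browse/talks-by-language/korean",
    "/browse/talks-by-event/tedxberkeley",
    "/browse/talks-by-country/taiwan",
    "/browse/talks-by-topic/design",
    "/about",
]
groups = group_portal_links(sample)
print(groups["lang"])
```

Links that match none of the prefixes, such as '/about' in the sample, are simply dropped.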

Sample Code and Results

In [68]:
import requests
from bs4 import BeautifulSoup

TEDX_HOME_URL = "http://tedxtalks.ted.com"
LANG_URL = "/browse/talks-by-language/"

s = requests.get(TEDX_HOME_URL)
soup = BeautifulSoup(s.content)

total = 0
link_tags = soup.find_all('a', href=True)
for link_tag in link_tags:
    link = link_tag['href']
    lang = link_tag.next_element.next_element.next_element
    if link.startswith(LANG_URL):
        print("Language %s: %s" % (lang, link))
        total += 1

print("Total: %d" % (total))
Language American Sign Language: /browse/talks-by-language/asl
Language Azerbaijani: /browse/talks-by-language/azerbaijani
Language Galician: /browse/talks-by-language/galician
Language Arabic: /browse/talks-by-language/arabic
Language Bulgarian: /browse/talks-by-language/bulgarian
Language Catalan: /browse/talks-by-language/catalan
Language Chinese: /browse/talks-by-language/chinese
Language Croatian: /browse/talks-by-language/croatian
Language Czech: /browse/talks-by-language/czech
Language Dutch: /browse/talks-by-language/dutch
Language English: /browse/talks-by-language/english
Language Estonian: /browse/talks-by-language/estonian
Language Finnish: /browse/talks-by-language/finnish
Language French: /browse/talks-by-language/french
Language German: /browse/talks-by-language/german
Language Greek: /browse/talks-by-language/greek
Language Hebrew: /browse/talks-by-language/hebrew
Language Hindi: /browse/talks-by-language/hindi
Language Hungarian: /browse/talks-by-language/hungarian
Language Icelandic: /browse/talks-by-language/icelandic
Language Indonesian: /browse/talks-by-language/indonesian
Language Italian: /browse/talks-by-language/italian
Language Japanese: /browse/talks-by-language/japanese
Language Korean: /browse/talks-by-language/korean
Language Lithuanian: /browse/talks-by-language/lithuanian
Language Malay: /browse/talks-by-language/malay
Language Polish: /browse/talks-by-language/polish
Language Portuguese: /browse/talks-by-language/portuguese
Language Rajasthani: /browse/talks-by-language/rajasthani
Language Romanian: /browse/talks-by-language/romanian
Language Russian: /browse/talks-by-language/russian
Language Slovak: /browse/talks-by-language/slovak
Language Slovene: /browse/talks-by-language/slovene
Language Spanish: /browse/talks-by-language/spanish
Language Swedish: /browse/talks-by-language/swedish
Language Tamil: /browse/talks-by-language/tamil
Language Thai: /browse/talks-by-language/thai
Language Turkish: /browse/talks-by-language/turkish
Language Ukrainian: /browse/talks-by-language/ukrainian
Language Urdu: /browse/talks-by-language/urdu
Total: 40

Then we visit each type portal link to collect the video links on it; the type of the portal becomes an attribute of each video found there.

We go through page 1, page 2, and so on until there are no more pages for that attribute. For example, we go to 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1', then 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2', and stop at 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3', which is empty, to get all 28 videos in Icelandic.

On each page we find the video links by using Beautiful Soup with a regular expression on the anchor id. For example, the second video on 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1' has the link '/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson' in the following tag:

<a id="mvp_grid_panel_img_1" href="/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson" class="mvp_thumbnail_magnified" style="position: relative; width: 293px; height: 220px; background-image: url('http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg'); filter: progid:DXImageTransform.Microsoft.AlphaImageLoader( src='http://s3.amazonaws.com/magnifythumbs/GZL3TH25NHYK1LB6.jpg', sizingMethod='scale');" title="TEDxReykjavik - Eythor Edvardsson -"></a>

Sample Code and Results

In [13]:
import re

VIDEO_LINK_PREFIX = "mvp_grid_panel_img_"
MSG_CLASS = "mvp_padded_message"
EMPTY_PAGE_MSG = "This page is empty."

portal_url = "http://tedxtalks.ted.com/browse/talks-by-language/icelandic"
page = 1
while True:
    # EX: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1
    url = portal_url + "?page=" + str(page)
    print("Reading URL: " + url)
    s = requests.get(url)
    soup = BeautifulSoup(s.content)
    
    # if there is no Next page
    # <div class="mvp_padded_message">This page is empty.</div>
    msg_tag = soup.find('div', {'class': MSG_CLASS})
    if msg_tag and msg_tag.get_text() == EMPTY_PAGE_MSG:
        print("empty page.")
        break
    
    link_tags = soup.find_all('a', id=re.compile(VIDEO_LINK_PREFIX), href=True)
    for link_tag in link_tags:
        link = link_tag['href']
        # EX: from /list/search%3Atag%3A%22chinese%22/video/The-tragedy-of-Hong-Kong-Archiv to /video/The-tragedy-of-Hong-Kong-Archiv
        pos = link.find("/video")
        link = link[pos:]
        print("video link: %s (attr Language: Icelandic)" % (link))

    page += 1
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1
video link: /video/TEDxReykjavik-Berghildur-Bergrs (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Eythor-Edvardsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ari-Kristinn-Jons (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Danielle-Morrill (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Deepa-Iyengar-Jed (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Rakel-Solvadottir (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Alex-MacNeil-Exit (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Daddi-Gudbergsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Peter-Anderson-Mo (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Iceland-Dance-Com (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Hrund-Gunnsteinsd (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Gudrun-Petursdott (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ingibjorg-Greta-G (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Ragnheidur-Harald (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Smri-McCarthy-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Skli-Mogensen-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Edda-Bjrgvinsdtti (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Margrt-Dra-Ragnar (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Jnas-Antonsson-96 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Guni-Gunnarsson-9 (attr Language: Icelandic)
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2
video link: /video/TEDxReykjavik-Gumundur-Oddur-96 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Andri-Heiar-Krist (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Kristin-Drfjr-960 (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Gurn-Lilja-Gunnla (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Teitur-orkelsson (attr Language: Icelandic)
video link: /video/TEDxReykjavik-orvaldur-orsteins (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Mary-Frances-Davi (attr Language: Icelandic)
video link: /video/TEDxReykjavik-Torfi-G-Yngvason (attr Language: Icelandic)
Reading URL: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3
empty page.
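The stop condition of the paging loop can also be separated from the network access, which makes it easy to try out offline. This is only a sketch under assumptions: `iter_pages`, its `fetch` and `is_empty` parameters, the `max_pages` safety cap and the 'example.invalid' URLs are all hypothetical stand-ins, and only the "This page is empty." message comes from the site.

```python
def iter_pages(portal_url, fetch, is_empty, max_pages=1000):
    """Yield (page_number, html) until is_empty(html) is true."""
    for page in range(1, max_pages + 1):
        html = fetch(portal_url + "?page=" + str(page))
        if is_empty(html):
            return
        yield page, html


# Stand-in for the network: two pages of content, then the empty-page message.
fake_site = {
    "http://example.invalid/portal?page=1": "<a>video 1</a>",
    "http://example.invalid/portal?page=2": "<a>video 2</a>",
}
EMPTY_HTML = '<div class="mvp_padded_message">This page is empty.</div>'

pages = list(iter_pages(
    "http://example.invalid/portal",
    fetch=lambda url: fake_site.get(url, EMPTY_HTML),
    is_empty=lambda html: "This page is empty." in html,
))
print([number for number, _ in pages])
```

In a real crawl, `fetch` could wrap `requests.get(url).content` and `is_empty` could perform the same check on the 'mvp_padded_message' div as the cell above.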

Stage 3: Get the YouTube ID of the Video

Since all TEDx videos are on YouTube, and we also use the YouTube API to get other information about the videos, we use the YouTube ID as the key for each video. Because Beautiful Soup does not parse through the <embed> tag, we use a regular expression to get the YouTube ID instead.

For example, the page for the video 'http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson' contains the following tag, from which we can read its YouTube ID: 'bzF4GPguPL8'

<embed type="application/x-shockwave-flash" src="http://www.youtube.com/v/bzF4GPguPL8&amp;rel=0&amp;fs=1&amp;showsearch=0&amp;enablejsapi=1&amp;modestbranding=1&amp;autoplay=1&amp;playerapiid=mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" width="634" height="382" style="undefined" id="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" name="mvp_swfo_embed_V8C4K631YLWW0QF3_1299136773" quality="high" allowfullscreen="true" allowscriptaccess="always" wmode="opaque" loop="false">

Sample Code and Results

In [18]:
VIDEO_ID_RE = b"""
<embed.*\ src=\\\\\".*/v/(.*?)\\\\\".*>.*</embed>
"""
url = "http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson"
s = requests.get(url)
html = s.content
video_ids = re.findall(VIDEO_ID_RE, html, re.IGNORECASE|re.VERBOSE)
for video_id in video_ids:
    # NOTE: byte string => need decode
    print("YouTube ID: %s (%s)" % (url, video_id.decode('utf-8')))
YouTube ID: http://tedxtalks.ted.com/video/TEDxReykjavik-Eythor-Edvardsson (bzF4GPguPL8)
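As a variation on the cell above, the ID can also be matched directly in the '/v/<id>' part of the embed URL. This is a sketch rather than the pattern we ran: it assumes that a YouTube ID is 11 characters drawn from letters, digits, '_' and '-', which holds for every ID in this notebook but is an assumption all the same. The names `YOUTUBE_ID_RE` and `extract_youtube_ids` are hypothetical.

```python
import re

# Assumption: a YouTube video ID is 11 characters from [A-Za-z0-9_-].
YOUTUBE_ID_RE = re.compile(r"youtube\.com/v/([A-Za-z0-9_-]{11})")


def extract_youtube_ids(html):
    """Return the distinct IDs found in /v/<id> embed URLs, in order of appearance."""
    seen = []
    for video_id in YOUTUBE_ID_RE.findall(html):
        if video_id not in seen:
            seen.append(video_id)
    return seen


# Shortened version of the <embed> tag shown earlier.
sample = ('<embed type="application/x-shockwave-flash" '
          'src="http://www.youtube.com/v/bzF4GPguPL8&amp;rel=0&amp;fs=1">')
print(extract_youtube_ids(sample))
```

Because the pattern does not look at the quoting around the src attribute, it matches whether the page escapes the quotes or not.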

YouTube API

In [2]:
#https://developers.google.com/youtube/articles/view_youtube_jsonc_responses
#https://developers.google.com/youtube/2.0/developers_guide_jsonc

import requests
import json
import time

user_id = "tedxtalks"
page = 1
maxcount = 25
count = 0
start_index = 0

# Obtaining Total page number
s = requests.get("https://gdata.youtube.com/feeds/api/users/"+user_id+"/uploads?v=2&alt=jsonc&start-index=1&max-results=1")
data = [json.loads(row) for row in s.content.split("\n") if row]
totalcount = data[0]['data']['totalItems']
pagenumber = totalcount/maxcount +1

key = ['id', 'uploaded', 'category', 'title', 'tags', 'thumbnail', 'duration', 'likeCount', 'rating', 'ratingCount', 'viewCount', 'favoriteCount', 'commentCount'] 
tedx ={'id':'',
        'data':{   'uploaded':'','title':'','tags':'','thumbnail':'','duration':'','likeCount':'','rating':'','ratingCount':'','viewCount':'','favoriteCount':'','commentCount':''}
    }

# Obtaining Data from each page (sample)
for index in range(1,2): #range(1,pagenumber):
    # start-index is 1-based, so page 1 starts at 1, page 2 at 26, and so on
    start_index = (index - 1) * maxcount + 1
    s = requests.get("https://gdata.youtube.com/feeds/api/users/"+user_id+"/uploads?v=2&alt=jsonc&start-index="+str(start_index)+"&max-results="+str(maxcount))
    data = [json.loads(row) for row in s.content.split("\n") if row]
    metadata = data[0]['data']['items']
    
    # obtaining each data in a page (25 items)
    for i in range(5):#len(metadata)):
        count +=1
        u = metadata[i]

        #missing key-value pair
        for j in key:
            if j=='id':
                tedx['id']=u['id']
            elif j =='thumbnail':
                tedx['data'][j] = u[j][u'hqDefault']
            elif j == 'title': 
                tedx['data'][j] = u[j].encode('utf-8')
            else:
                # fall back to '-' when the key is missing from the response
                tedx['data'][j] = u.get(j, '-')
        
        the_dump = json.dumps(tedx)
        print the_dump
    # delay
    time.sleep(1)

# https://developers.google.com/youtube/2.0/developers_guide_jsonc 
{"data": {"uploaded": "2013-05-02T08:46:29.000Z", "rating": "-", "tags": "-", "likeCount": "-", "commentCount": 0, "ratingCount": "-", "duration": 960, "category": "People", "viewCount": 2, "title": "Corporate rebels: Peter Vander Auwera at TEDxBrusselsChange", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/GgdwNiOwajg/hqdefault.jpg"}, "id": "GgdwNiOwajg"}
{"data": {"uploaded": "2013-05-02T06:43:09.000Z", "rating": 5.0, "tags": "-", "likeCount": "3", "commentCount": 0, "ratingCount": 3, "duration": 1134, "category": "Tech", "viewCount": 39, "title": "One Chance at Life - What Would You Do: Chuck Berry at TEDxQueenstown", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/5LMMIu813zQ/hqdefault.jpg"}, "id": "5LMMIu813zQ"}
{"data": {"uploaded": "2013-05-02T05:47:59.000Z", "rating": 4.75, "tags": "-", "likeCount": "15", "commentCount": 2, "ratingCount": 16, "duration": 866, "category": "People", "viewCount": 130, "title": "The Habbits of Highly Boring People: Chris Sauve at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/3rbVQNTzCh8/hqdefault.jpg"}, "id": "3rbVQNTzCh8"}
{"data": {"uploaded": "2013-05-02T05:05:34.000Z", "rating": 5.0, "tags": "-", "likeCount": "4", "commentCount": 0, "ratingCount": 4, "duration": 924, "category": "People", "viewCount": 21, "title": "A Selfless Good Deed: Trevor Deley at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/OvZlGIT1tOA/hqdefault.jpg"}, "id": "OvZlGIT1tOA"}
{"data": {"uploaded": "2013-05-02T04:24:49.000Z", "rating": 5.0, "tags": "-", "likeCount": "2", "commentCount": 0, "ratingCount": 2, "duration": 858, "category": "Sports", "viewCount": 14, "title": "Fishing for the Future: Dr. Steven Cooke at TEDxCarletonU", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/Wsz8Wn76h-4/hqdefault.jpg"}, "id": "Wsz8Wn76h-4"}
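The paging arithmetic for the feed can be checked on its own, without calling the API. The sketch below assumes that the start index is 1-based, as the 'start-index=1' request in the cell suggests; the function name `page_start_indices` is hypothetical.

```python
def page_start_indices(total_items, per_page):
    """Return the 1-based start index of every page needed to cover total_items."""
    return list(range(1, total_items + 1, per_page))


# 60 items at 25 per page need three requests.
print(page_start_indices(60, 25))
```

With 25 items per page, consecutive pages start at 1, 26, 51 and so on, so no item is skipped or fetched twice.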

Merged Results

Now that we've got the video attributes from both the TEDx website and YouTube, we can merge them, using the YouTube ID as the key.

Sample Code and Results

In [29]:
import json

SITE_JSON = "tedx_video.json"
YOUTUBE_JSON = "tedx_v7.txt"
SITE_ATTR_LIST = ['lang', 'event', 'country', 'topic']

# import JSON from TEDx website and make video_dict
with open(SITE_JSON) as site_json_file:
    site_json = json.load(site_json_file)
video_dict = {}
for video in site_json:
    vid = site_json[video]['id']
    video_dict[vid] = {}
    for attr in SITE_ATTR_LIST:
        if attr in site_json[video]:
            video_dict[vid][attr] = site_json[video][attr]

# get JSON from YouTube and print to merged result file
merged_cnt = 0
with open(YOUTUBE_JSON, "r") as youtube_json_file:
    for line in youtube_json_file:
        if merged_cnt >= 10:
            break
        youtube_json = json.loads(line)
        vid = youtube_json['id']
        merged_video = youtube_json['data']
        merged_video['id'] = vid
        if vid in video_dict:
            attr_cnt = 0
            for attr in SITE_ATTR_LIST:
                if attr in video_dict[vid]:
                    merged_video[attr] = video_dict[vid][attr]
                    attr_cnt += 1
            if attr_cnt == 4:
                print(json.dumps(merged_video))
                merged_cnt += 1
{"uploaded": "2013-04-24T08:42:45.000Z", "rating": 4.6, "lang": "English", "tags": "-", "country": "Spain", "id": "JcqXD5JgVXw", "title": "Wonder and beauty in education: Catherine L'Ecuyer at TEDxManresa", "event": "TEDxManresa", "likeCount": "9", "commentCount": 0, "topic": "Education", "ratingCount": 10, "duration": 1087, "category": "Nonprofit", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/JcqXD5JgVXw/hqdefault.jpg", "viewCount": 900}
{"uploaded": "2013-04-23T10:09:17.000Z", "rating": 4.8974357, "lang": "English", "tags": "-", "country": "Greece", "id": "s6KM9MxY5ZM", "title": "Learning is a Game: Ed Cooke at TEDxThessaloniki", "event": "TEDxThessaloniki", "likeCount": "38", "commentCount": 5, "topic": "Education", "ratingCount": 39, "duration": 1187, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/s6KM9MxY5ZM/hqdefault.jpg", "viewCount": 2152}
{"uploaded": "2013-04-23T07:26:23.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "pZsORC8sgl4", "title": "Architektura jako starost o m\u00edsto kde \u017eijeme: Roman Brychta at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "1", "commentCount": 0, "topic": "Education", "ratingCount": 1, "duration": 981, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/pZsORC8sgl4/hqdefault.jpg", "viewCount": 89}
{"uploaded": "2013-04-23T07:25:08.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "HeSH7cKTs0s", "title": "Kdy\u017e se chce, tak to jde... V\u011b\u0159te n\u00e1m, testovali jsme to na lidech: Dan P\u0159ib\u00e1\u0148 at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "82", "commentCount": 6, "topic": "Education", "ratingCount": 82, "duration": 1031, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/HeSH7cKTs0s/hqdefault.jpg", "viewCount": 2799}
{"uploaded": "2013-04-23T07:24:51.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "Bkrku3_sv88", "title": "Geometrie trojrozm\u011brn\u00e9ho \u017eivota: Jan Han\u00e1k at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "1", "commentCount": 0, "topic": "Education", "ratingCount": 1, "duration": 883, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/Bkrku3_sv88/hqdefault.jpg", "viewCount": 92}
{"uploaded": "2013-04-23T07:24:17.000Z", "rating": 5.0, "lang": "Czech", "tags": "-", "country": "Czech Republic", "id": "zNS7kdSMVac", "title": "Psan\u00edm k sebepozn\u00e1n\u00ed: Ji\u0159\u00ed Van\u011bk at TEDxHradecKralove", "event": "TEDxHradecKralove", "likeCount": "2", "commentCount": 0, "topic": "Education", "ratingCount": 2, "duration": 874, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/zNS7kdSMVac/hqdefault.jpg", "viewCount": 243}
{"uploaded": "2013-04-22T21:18:14.000Z", "rating": 4.6363635, "lang": "English", "tags": "-", "country": "United States", "id": "hktzJ7QNcMU", "title": "Empowering Women and Girls: Halima Hima at TEDxChange", "event": "TEDxChange", "likeCount": "10", "commentCount": 0, "topic": "Entertainment", "ratingCount": 11, "duration": 1419, "category": "Entertainment", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/hktzJ7QNcMU/hqdefault.jpg", "viewCount": 173}
{"uploaded": "2013-04-20T00:51:52.000Z", "rating": 4.6363635, "lang": "English", "tags": "-", "country": "United States", "id": "845UrCAFTsQ", "title": "Iconic toilets: Mathew Lippincott at TEDxConcordiaUPortland", "event": "TEDxConcordiaUPortland", "likeCount": "20", "commentCount": 6, "topic": "Education", "ratingCount": 22, "duration": 648, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/845UrCAFTsQ/hqdefault.jpg", "viewCount": 1104}
{"uploaded": "2013-04-18T09:38:39.000Z", "rating": 5.0, "lang": "English", "tags": "-", "country": "India", "id": "MiwjplU6kAc", "title": "Three laws of user experience: Apala Lahiri Chavan at TEDxGolfLinksPark", "event": "TEDxGolflinkspark", "likeCount": "10", "commentCount": 2, "topic": "Education", "ratingCount": 10, "duration": 1393, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/MiwjplU6kAc/hqdefault.jpg", "viewCount": 1069}
{"uploaded": "2013-04-18T02:18:44.000Z", "rating": 4.2941175, "lang": "English", "tags": "-", "country": "United States", "id": "5-YIxJEyBBs", "title": "Crowd sourcing the feminine intelligence of the planet: Jensine Larsen at TEDxConcordiaUPortland", "event": "TEDxConcordiaUPortland", "likeCount": "14", "commentCount": 6, "topic": "Education", "ratingCount": 17, "duration": 1149, "category": "Education", "favoriteCount": 0, "thumbnail": "http://i.ytimg.com/vi/5-YIxJEyBBs/hqdefault.jpg", "viewCount": 432}
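The merge rule used above, which keeps a video only when all four website attributes are known, can be written as a small function on in-memory records. This is a sketch of that rule, not the code that produced the output: the names `SITE_ATTRS` and `merge_video` are hypothetical, and the sample record reuses a few fields from the first merged line shown above.

```python
SITE_ATTRS = ["lang", "event", "country", "topic"]


def merge_video(youtube_record, site_attrs_by_id):
    """Return a flat merged dict, or None unless all four site attributes are known."""
    vid = youtube_record["id"]
    site_attrs = site_attrs_by_id.get(vid, {})
    if any(attr not in site_attrs for attr in SITE_ATTRS):
        return None
    merged = dict(youtube_record["data"])  # copy, so the input is not modified
    merged["id"] = vid
    for attr in SITE_ATTRS:
        merged[attr] = site_attrs[attr]
    return merged


youtube_record = {"id": "JcqXD5JgVXw", "data": {"viewCount": 900, "duration": 1087}}
site_attrs_by_id = {
    "JcqXD5JgVXw": {"lang": "English", "event": "TEDxManresa",
                    "country": "Spain", "topic": "Education"},
}
merged = merge_video(youtube_record, site_attrs_by_id)
print(merged["country"], merged["viewCount"])
```

A video that is missing any of the four attributes yields None, which the caller can skip.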

Basic Statistics

Now that we've merged all the information we got, we can try to discover some basic statistics of these videos.

Load JSON Data and Convert Them into DataFrame

In [93]:
import pandas as pd
from pandas import Series, DataFrame
import json

TEDX_JSON_FILE = "final_tedx.json"
tedx_video_list = []
with open(TEDX_JSON_FILE, "r") as tedx_json_file:
    for line in tedx_json_file:
        tedx_video_list.append(json.loads(line))
    tedx_df = DataFrame(tedx_video_list)
tedx_df.set_index('id', inplace=True, drop=True)
tedx_df
Out[93]:
<class 'pandas.core.frame.DataFrame'>
Index: 29982 entries, ew0ovccWuQg to QZkUPZr1Zbc
Data columns:
category         27081  non-null values
commentCount     27081  non-null values
country          19813  non-null values
duration         27081  non-null values
event            23177  non-null values
favoriteCount    27081  non-null values
lang             24061  non-null values
likeCount        27081  non-null values
rating           27081  non-null values
ratingCount      27081  non-null values
tags             27081  non-null values
thumbnail        27081  non-null values
title            27081  non-null values
topic            13332  non-null values
uploaded         27081  non-null values
viewCount        27081  non-null values
dtypes: float64(1), object(15)

Statistics by Language

Number of Videos by Language

Conclusion?

In [94]:
tedx_df['lang'].value_counts()[:10]
Out[94]:
English       17479
Spanish        1388
Portuguese     1002
Korean          654
French          559
Arabic          471
Russian         397
Japanese        289
Italian         253
Polish          158
In [95]:
tedx_df[tedx_df.lang!='English']['lang'].value_counts().plot(kind="bar")
Out[95]:
<matplotlib.axes.AxesSubplot at 0x18f04610>

Number of View Counts by Language

Conclusion?

In [96]:
tedx_df['viewCount'] = tedx_df['viewCount'].fillna(0)
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort_values('viewCount', ascending=False)[:10]
Out[96]:
viewCount
lang
English 58477968
Arabic 4213183
Spanish 4173229
French 3571510
Portuguese 1516388
Japanese 909865
Korean 892479
Polish 827042
Greek 490969
Indonesian 478289
In [98]:
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df[tmp_tedx_df.viewCount!='-'][['viewCount', 'lang']].groupby('lang').sum().sort_values('viewCount', ascending=False).plot(kind="bar")
Out[98]:
<matplotlib.axes.AxesSubplot at 0x1976f0f0>

Number of Events by Language

Conclusion?

In [99]:
tedx_df.groupby('lang').event.nunique().sort_values(ascending=False)[:10]
Out[99]:
lang
English       1035
Spanish        101
Portuguese      63
French          54
Korean          48
Arabic          35
Russian         26
Italian         19
Japanese        15
Chinese         14
In [100]:
tmp_tedx_df = tedx_df[tedx_df.lang!='English'].copy()
tmp_tedx_df.groupby('lang').event.nunique().sort_values(ascending=False).plot(kind="bar")
Out[100]:
<matplotlib.axes.AxesSubplot at 0x17887670>

Statistics by Country

Number of Videos by Country

In [51]:
tedx_df['country'].value_counts()[:10]
Out[51]:
United States     5634
Canada            1249
India              815
Brazil             726
Netherlands        710
South Korea        685
Spain              674
Australia          605
United Kingdom     591
Japan              474

Number of View Counts by Country

In [104]:
tedx_df[tedx_df.viewCount!='-'][['viewCount', 'country']].groupby('country').sum().sort_values('viewCount', ascending=False)[:10]
Out[104]:
viewCount
country
United States 18189465
Canada 4572056
France 3065666
United Kingdom 2226657
Argentina 2171808
Netherlands 1947574
Japan 1734785
India 1510407
Yemen 1308153
Spain 1196308

Number of Events by Country

In [48]:
tedx_df.groupby('country').event.nunique().sort_values(ascending=False)[:10]
Out[48]:
country
United States     359
India              89
Canada             80
United Kingdom     51
South Korea        47
Brazil             44
Spain              41
Netherlands        32
France             30
Australia          28

Trends of TEDx over 5 years (by language)

In [ ]:
# more information for visualization, including how to prepare data for D3.js
# http://nbviewer.ipython.org/5501063
In [3]:
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_popularity" width="1000" height="800"></iframe>')

# you can navigate years by sliding 'year' digits
Out[3]:

Comparison between TED and TEDx by language

In [6]:
from IPython.display import HTML
HTML('<iframe src="http://96chany.com/projects/tedx_comparison" width="1100" height="750"></iframe>')
Out[6]:
In [1]:
from pandas import read_csv
from pandas import Series, DataFrame

# read_csv accepts a local file path directly
df = read_csv("list of languages by number of native speaker.csv")
df.set_index('Language',inplace=True,drop=True)
df[:30].plot(kind="bar")
Out[1]:
<matplotlib.axes.AxesSubplot at 0x6aacb90>

Issues We Encountered

Data inconsistency: The video TEDxCausewayBay - Terence Wong - 04/15/10 appears under both Chinese and Korean as its language, and under both TEDxCAU and TEDxCausewayBay as its event.
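Conflicts like this one can be listed automatically once the scraped attributes are available as rows. The sketch below assumes a hypothetical row layout of (video, attribute, value) tuples, which is not the format of our JSON files, and a hypothetical helper `find_conflicts`. The sample rows restate the conflict described above, plus one consistent video for contrast.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return {(video, attr): sorted values} for pairs seen with more than one value."""
    seen = defaultdict(set)
    for video, attr, value in rows:
        seen[(video, attr)].add(value)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


VIDEO = "TEDxCausewayBay - Terence Wong - 04/15/10"
rows = [
    (VIDEO, "lang", "Chinese"),
    (VIDEO, "lang", "Korean"),
    (VIDEO, "event", "TEDxCAU"),
    (VIDEO, "event", "TEDxCausewayBay"),
    ("TEDxReykjavik - Eythor Edvardsson -", "lang", "Icelandic"),
]
conflicts = find_conflicts(rows)
for (video, attr), values in sorted(conflicts.items()):
    print(video, attr, values)
```

Whether a conflicting video should be dropped, kept under every value, or corrected by hand is a separate decision that this sketch does not make.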

What's Next?

  • Video Recommendation for Translators Based on Translators' Translating History
  • Comparing the Translated Results of Human Translators and Machine Translators (Google/Bing Translate) to Improve the Translation Quality