mlcourse.ai – Open Machine Learning Course

Author: Artem Kuznetsov, ODS Slack te

Exploring TED Talks

Research plan

- Dataset and features description
- Exploratory data analysis
- Visual analysis of the features
- Patterns, insights, peculiarities of data
- Data preprocessing
- Metric selection
- Feature engineering and description
- Cross-validation, hyperparameter tuning
- Validation and learning curves
- Prediction for hold-out set
- Model selection
- Conclusions

Part 1. Dataset and features description

TED is a conference organizer that holds events where people from different fields give public talks about important ideas. In recent years TED has grown significantly in popularity thanks to the published video and audio recordings of its talks.

The dataset was collected by Rounak Banik and published on Kaggle: https://www.kaggle.com/rounakbanik/ted-talks/. It is unclear whether it was collected by web scraping or through the TED API (now closed). The data covers talks before September 21st, 2017.

The data set consists of two files:

ted_main.csv - metadata about talks and speakers

  • comments - The number of first level comments made on the talk (number)
  • description - A blurb of what the talk is about (string)
  • duration - The duration of the talk in seconds (number)
  • event - The TED/TEDx event where the talk took place (string)
  • film_date - The Unix timestamp of the filming (date in unix time format)
  • languages - The number of languages in which the talk is available (number)
  • main_speaker - The first named speaker of the talk (string)
  • name - The official name of the TED Talk. Includes the title and the speaker. (string)
  • num_speaker - The number of speakers in the talk (number)
  • published_date - The Unix timestamp for the publication of the talk on TED.com (date in unix time format)
  • ratings - A stringified dictionary of the various ratings given to the talk (inspiring, fascinating, jaw dropping, etc.) (json)
  • related_talks - A list of dictionaries of recommended talks to watch next (json)
  • speaker_occupation - The occupation of the main speaker (string)
  • tags - The themes associated with the talk (list)
  • title - The title of the talk (string)
  • url - The URL of the talk (string)
  • views - The number of views on the talk (number)

transcripts.csv - talk transcripts

  • transcript - The official English transcript of the talk. (string)
  • url - The URL of the talk (string)
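
The ratings, related_talks and tags columns hold Python-style literals stored as strings (single-quoted keys, as the df_ted.head() output shows), so json.loads would reject them. A minimal sketch of one way to parse such a cell, assuming the format seen in the sample rows; the helper name parse_ratings and the second rating entry are made up for illustration:

```python
import ast

# Shortened sample in the shape of one `ratings` cell.
# The 'Funny' entry comes from the first row of the data; 'Inspiring' with
# count 100 is a hypothetical value added for illustration.
raw_ratings = (
    "[{'id': 7, 'name': 'Funny', 'count': 19645}, "
    "{'id': 10, 'name': 'Inspiring', 'count': 100}]"
)


def parse_ratings(raw):
    """Turn a stringified list of rating dicts into a {name: count} mapping."""
    items = ast.literal_eval(raw)  # evaluates Python literals only, no arbitrary code
    return {item['name']: item['count'] for item in items}


rating_counts = parse_ratings(raw_ratings)
print(rating_counts)
```

On the real data the same helper could be applied with df_ted['ratings'].apply(parse_ratings), provided every cell follows this format.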

The goal of this project is to investigate how the number of views of a talk can be predicted.

In [436]:
import re
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import scipy.stats

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV, learning_curve
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge

DATA_PATH = '../data/'

# Set up seeds
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

%matplotlib inline
In [437]:
plt.rcParams['figure.figsize'] = 12., 9.

Part 2. Exploratory data analysis

In [3]:
# Load data
df_ted_main = pd.read_csv(DATA_PATH + 'ted_main.csv.zip')
df_ted_transcripts = pd.read_csv(DATA_PATH + 'transcripts.csv.zip')
In [4]:
df_ted_main.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [5]:
df_ted_transcripts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2467 entries, 0 to 2466
Data columns (total 2 columns):
transcript    2467 non-null object
url           2467 non-null object
dtypes: object(2)
memory usage: 38.6+ KB

The two datasets contain different numbers of records, so there are probably fewer transcripts than talks.

Duplicates check

In [6]:
df_ted_main[df_ted_main.duplicated()]
Out[6]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views
In [7]:
df_ted_transcripts[df_ted_transcripts.duplicated()]
Out[7]:
transcript url
1114 I have a question for you: Are you religious? ... https://www.ted.com/talks/jonathan_haidt_human...
1115 The recent debate over copyright laws like SOP... https://www.ted.com/talks/rob_reid_the_8_billi...
1116 I'm going to tell you a little bit about my TE... https://www.ted.com/talks/brene_brown_listenin...

OK, df_ted_transcripts contains some duplicates; let's remove them.

In [8]:
df_ted_transcripts = df_ted_transcripts.drop_duplicates()

Merge datasets

In [9]:
df_ted_main.shape, df_ted_transcripts.shape
Out[9]:
((2550, 17), (2464, 2))
In [10]:
df_ted = pd.merge(df_ted_main, df_ted_transcripts, how='left', on='url')
df_ted.shape
Out[10]:
(2550, 18)
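
The left join keeps every talk and leaves transcript empty where no URL matches. As a hedged sketch on toy data (the URLs and values below are hypothetical), the indicator and validate options of pd.merge can make that explicit: indicator adds a _merge column showing where each row came from, and validate raises an error if url is not unique on either side.

```python
import pandas as pd

# Toy frames with hypothetical URLs, standing in for df_ted_main and df_ted_transcripts.
main = pd.DataFrame({'url': ['a', 'b', 'c'], 'views': [10, 20, 30]})
transcripts = pd.DataFrame({'url': ['a', 'c'], 'transcript': ['text a', 'text c']})

merged = pd.merge(
    main, transcripts,
    how='left', on='url',
    validate='one_to_one',  # fails loudly if a URL is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only' here
)

# Rows that exist only in the left frame are talks without a transcript.
missing_transcripts = int((merged['_merge'] == 'left_only').sum())
print(merged.shape, missing_transcripts)
```

With validate='one_to_one' the merge would have failed before the duplicated transcript rows were dropped, which is one reason to remove duplicates first.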
In [11]:
df_ted.columns
Out[11]:
Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'transcript'],
      dtype='object')
In [12]:
df_ted.head()
Out[12]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
0 4553 Sir Ken Robinson makes an entertaining and pro... 1164 TED2006 1140825600 60 Ken Robinson Ken Robinson: Do schools kill creativity? 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 19645}, {... [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... Author/educator ['children', 'creativity', 'culture', 'dance',... Do schools kill creativity? https://www.ted.com/talks/ken_robinson_says_sc... 47227110 Good morning. How are you?(Laughter)It's been ...
1 265 With the same humor and humanity he exuded in ... 977 TED2006 1140825600 43 Al Gore Al Gore: Averting the climate crisis 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... Climate advocate ['alternative energy', 'cars', 'climate change... Averting the climate crisis https://www.ted.com/talks/al_gore_on_averting_... 3200520 Thank you so much, Chris. And it's truly a gre...
2 124 New York Times columnist David Pogue takes aim... 1286 TED2006 1140739200 26 David Pogue David Pogue: Simplicity sells 1 1151367060 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... Technology columnist ['computers', 'entertainment', 'interface desi... Simplicity sells https://www.ted.com/talks/david_pogue_says_sim... 1636292 (Music: "The Sound of Silence," Simon & Garfun...
3 200 In an emotionally charged talk, MacArthur-winn... 1116 TED2006 1140912000 35 Majora Carter Majora Carter: Greening the ghetto 1 1151367060 [{'id': 3, 'name': 'Courageous', 'count': 760}... [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... Activist for environmental justice ['MacArthur grant', 'activism', 'business', 'c... Greening the ghetto https://www.ted.com/talks/majora_carter_s_tale... 1697550 If you're here today — and I'm very happy that...
4 593 You've never seen data presented like this. Wi... 1190 TED2006 1140566400 48 Hans Rosling Hans Rosling: The best stats you've ever seen 1 1151440680 [{'id': 9, 'name': 'Ingenious', 'count': 3202}... [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... Global health expert; data visionary ['Africa', 'Asia', 'Google', 'demo', 'economic... The best stats you've ever seen https://www.ted.com/talks/hans_rosling_shows_t... 12005869 About 10 years ago, I took on the task to teac...
In [13]:
DATE_COLUMNS = 'film_date', 'published_date'
for column in DATE_COLUMNS:
    df_ted[column] = pd.to_datetime(df_ted[column], unit='s')
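
As a quick sanity check of what unit='s' means, the sketch below converts the timestamps of the first row shown by df_ted.head() with the standard library only. The values are seconds since 1970-01-01 UTC; pd.to_datetime returns the same instants as timezone-naive timestamps.

```python
from datetime import datetime, timezone

film_ts = 1140825600       # film_date of the first row (TED2006)
published_ts = 1151367060  # published_date of the first row

# Interpret both values as seconds since the Unix epoch, in UTC.
filmed = datetime.fromtimestamp(film_ts, tz=timezone.utc)
published = datetime.fromtimestamp(published_ts, tz=timezone.utc)

print(filmed.date())                             # 2006-02-25
print(published.strftime('%Y-%m-%d %H:%M:%S'))   # 2006-06-27 00:11:00
```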
In [14]:
df_ted.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2550 entries, 0 to 2549
Data columns (total 18 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null datetime64[ns]
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null datetime64[ns]
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
transcript            2464 non-null object
dtypes: datetime64[ns](2), int64(5), object(11)
memory usage: 378.5+ KB

Missing values

It looks like we have transcripts for almost all talks, but some are missing. Some values of speaker_occupation are missing as well.

Recheck NA's

In [15]:
for column in df_ted.columns:
    na_count = df_ted[column].isna().sum()
    if na_count > 0:
        print('%s : %s' % (column, na_count))
speaker_occupation : 6
transcript : 86
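
The same check can be written without an explicit loop. A minimal sketch on a toy frame (the rows are hypothetical; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; only the column names follow the TED dataset.
toy = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'speaker_occupation': ['Architect', np.nan, 'Engineer'],
    'transcript': [np.nan, np.nan, 'some text'],
})

# Count missing values per column, then keep only columns that have any.
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0]
print(na_counts)
```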

Common numerics stats

In [16]:
df_ted.describe()
Out[16]:
comments duration languages num_speaker views
count 2550.000000 2550.000000 2550.000000 2550.000000 2.550000e+03
mean 191.562353 826.510196 27.326275 1.028235 1.698297e+06
std 282.315223 374.009138 9.563452 0.207705 2.498479e+06
min 2.000000 135.000000 0.000000 1.000000 5.044300e+04
25% 63.000000 577.000000 23.000000 1.000000 7.557928e+05
50% 118.000000 848.000000 28.000000 1.000000 1.124524e+06
75% 221.750000 1046.750000 33.000000 1.000000 1.700760e+06
max 6404.000000 5256.000000 72.000000 5.000000 4.722711e+07
In [17]:
df_ted.median(numeric_only=True)
Out[17]:
comments           118.0
duration           848.0
languages           28.0
num_speaker          1.0
views          1124523.5
dtype: float64

description

In [19]:
df_ted['description'].nunique(), len(df_ted['description'])
Out[19]:
(2550, 2550)
In [20]:
df_ted['description'].str.len().describe()
Out[20]:
count    2550.000000
mean      313.656078
std       106.381571
min        52.000000
25%       236.000000
50%       296.000000
75%       379.000000
max       769.000000
Name: description, dtype: float64
In [21]:
df_ted['duration'].values[:100]
Out[21]:
array([1164,  977, 1286, 1116, 1190, 1305,  992, 1198, 1485, 1262, 1414,
       1538, 1550,  527, 1057, 1481, 1445,  906, 1170, 1201, 1114, 1136,
       1006, 1407, 1225, 1140, 1316, 1275, 1050, 1276, 1177, 1129, 1365,
        952,  773, 1080, 1125, 1177, 1083, 1672, 2065, 1609, 1280,  805,
       1376, 1200,  848,  210,  247,  198,  843, 1001, 1321, 1115, 1046,
       1151,  603, 1141,  825, 1385,  869, 1038, 1155, 1447, 1355, 1316,
       1645, 1021, 1211,  873, 1207, 1340,  930, 1054, 1240, 1204,  276,
        850,  891, 1012, 1399, 1011,  893, 1045,  977,  985,  163,  459,
       1308, 1929, 1205, 1031,  251,  378,  312, 1172, 1750, 1195,  201,
        858])

Each talk has a unique description.

event

In [22]:
df_ted['event'].value_counts()
Out[22]:
TED2014                         84
TED2009                         83
TED2016                         77
TED2013                         77
TED2015                         75
TED2011                         70
TEDGlobal 2012                  70
TED2010                         68
TEDGlobal 2011                  68
TED2007                         68
TED2017                         67
TEDGlobal 2013                  66
TED2012                         65
TEDGlobal 2009                  65
TED2008                         57
TEDGlobal 2010                  55
TEDGlobal 2014                  51
TED2006                         45
TED2005                         37
TEDIndia 2009                   35
TEDWomen 2010                   34
TED2003                         34
TEDSummit                       34
TED2004                         31
TED2002                         28
TEDWomen 2015                   28
TEDGlobal 2007                  27
TEDGlobal 2005                  26
TEDWomen 2016                   25
TEDxBeaconStreet                22
                                ..
TEDxPenn                         1
TEDxProvidence                   1
TEDxZurich                       1
TEDxNatick                       1
TEDxConcordiaUPortland           1
TEDxGöteborg 2010                1
TEDxSussexUniversity             1
TEDxUM                           1
Elizabeth G. Anderson School     1
TEDxSantaCruz                    1
TEDxCHUV                         1
TEDxImperialCollege              1
TEDxIslay                        1
TEDxIndianapolis                 1
TEDxToulouse                     1
TED Prize Wish                   1
TEDxNewy                         1
TEDxNextGenerationAsheville      1
TED-Ed Weekend                   1
TEDxSF                           1
TEDxMIA                          1
TEDNairobi Ideas Search          1
[email protected]               1
TED1984                          1
TEDxUofM                         1
TEDxGeorgetown                   1
TEDxGroningen                    1
TEDxSaltLakeCity                 1
TEDxO'Porto                      1
TEDxUF                           1
Name: event, Length: 355, dtype: int64

There are different types of events here, and TED2014 is the event with the most talks. We can see TED and TEDx events, as well as some events that are neither. Let's investigate a little more.

In [23]:
sorted(df_ted['event'].unique())
Out[23]:
['AORN Congress',
 'Arbejdsglaede Live',
 'BBC TV',
 'Bowery Poetry Club',
 'Business Innovation Factory',
 'Carnegie Mellon University',
 'Chautauqua Institution',
 'DICE Summit 2010',
 'DLD 2007',
 'EG 2007',
 'EG 2008',
 'Elizabeth G. Anderson School',
 "Eric Whitacre's Virtual Choir",
 'Fort Worth City Council',
 'Full Spectrum Auditions',
 'Gel Conference',
 'Global Witness HQ',
 'Handheld Learning',
 'Harvard University',
 'INK Conference',
 'Justice with Michael Sandel',
 'LIFT 2007',
 'Michael Howard Studios',
 'Mission Blue II',
 'Mission Blue Voyage',
 'New York State Senate',
 'NextGen:Charity',
 'Princeton University',
 'RSA Animate',
 'Royal Institution',
 'Serious Play 2008',
 'Skoll World Forum 2007',
 'SoulPancake',
 'Stanford University',
 'TED Dialogues',
 'TED Fellows 2015',
 'TED Fellows Retreat 2013',
 'TED Fellows Retreat 2015',
 'TED Prize Wish',
 'TED Residency',
 'TED Senior Fellows at TEDGlobal 2010',
 'TED Studio',
 'TED Talks Education',
 'TED Talks Live',
 'TED in the Field',
 'TED-Ed',
 'TED-Ed Weekend',
 'TED1984',
 'TED1990',
 'TED1994',
 'TED1998',
 'TED2001',
 'TED2002',
 'TED2003',
 'TED2004',
 'TED2005',
 'TED2006',
 'TED2007',
 'TED2008',
 'TED2009',
 'TED2010',
 'TED2011',
 'TED2012',
 'TED2013',
 'TED2014',
 'TED2015',
 'TED2016',
 'TED2017',
 '[email protected] Berlin',
 '[email protected] London',
 '[email protected] Paris',
 '[email protected] San Francisco',
 '[email protected] Singapore',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected] York',
 '[email protected]',
 '[email protected]',
 '[email protected] Street Boston',
 '[email protected] Street London',
 '[email protected] Boston',
 '[email protected]',
 '[email protected]',
 'TEDActive 2011',
 'TEDActive 2014',
 'TEDActive 2015',
 'TEDCity2.0',
 'TEDGlobal 2005',
 'TEDGlobal 2007',
 'TEDGlobal 2009',
 'TEDGlobal 2010',
 'TEDGlobal 2011',
 'TEDGlobal 2012',
 'TEDGlobal 2013',
 'TEDGlobal 2014',
 'TEDGlobal 2017',
 'TEDGlobal>Geneva',
 'TEDGlobal>London',
 'TEDGlobalLondon',
 'TEDIndia 2009',
 'TEDLagos Ideas Search',
 'TEDMED 2009',
 'TEDMED 2010',
 'TEDMED 2011',
 'TEDMED 2012',
 'TEDMED 2013',
 'TEDMED 2014',
 'TEDMED 2015',
 'TEDMED 2016',
 'TEDNYC',
 'TEDNairobi Ideas Search',
 '[email protected]',
 'TEDSalon 2006',
 'TEDSalon 2007 Hot Science',
 'TEDSalon 2009 Compassion',
 'TEDSalon Berlin 2014',
 'TEDSalon London 2009',
 'TEDSalon London 2010',
 'TEDSalon London Fall 2011',
 'TEDSalon London Fall 2012',
 'TEDSalon London Spring 2011',
 'TEDSalon London Spring 2012',
 'TEDSalon NY2011',
 'TEDSalon NY2012',
 'TEDSalon NY2013',
 'TEDSalon NY2014',
 'TEDSalon NY2015',
 'TEDSummit',
 'TEDWomen 2010',
 'TEDWomen 2013',
 'TEDWomen 2015',
 'TEDWomen 2016',
 'TEDYouth 2011',
 'TEDYouth 2012',
 'TEDYouth 2013',
 'TEDYouth 2014',
 'TEDYouth 2015',
 'TEDxABQ',
 'TEDxAmazonia',
 'TEDxAmericanRiviera',
 'TEDxAmoskeagMillyard',
 'TEDxAmsterdam',
 'TEDxArendal',
 'TEDxAsheville',
 'TEDxAthens',
 'TEDxAtlanta',
 'TEDxAustin',
 'TEDxBG',
 'TEDxBeaconStreet',
 'TEDxBeirut',
 'TEDxBend',
 'TEDxBerkeley',
 'TEDxBerlin',
 'TEDxBinghamtonUniversity',
 'TEDxBloomington',
 'TEDxBoston',
 'TEDxBoston 2009',
 'TEDxBoston 2010',
 'TEDxBoston 2011',
 'TEDxBoston 2012',
 'TEDxBoulder',
 'TEDxBoulder 2011',
 'TEDxBratislava',
 'TEDxBrighton',
 'TEDxBrussels',
 'TEDxCERN',
 'TEDxCHUV',
 'TEDxCMU',
 'TEDxCaFoscariU',
 'TEDxCaltech',
 'TEDxCambridge',
 'TEDxCanberra',
 'TEDxCannes',
 'TEDxChange',
 'TEDxChapmanU',
 'TEDxClaremontColleges',
 'TEDxColbyCollege',
 'TEDxColoradoSprings',
 'TEDxColumbus',
 'TEDxColumbusWomen',
 'TEDxConcorde',
 'TEDxConcordiaUPortland',
 'TEDxCreativeCoast',
 'TEDxCrenshaw',
 'TEDxDU 2010',
 'TEDxDU 2011',
 'TEDxDanubia',
 'TEDxDeExtinction',
 'TEDxDelft',
 'TEDxDesMoines',
 'TEDxDirigo',
 'TEDxDubai',
 'TEDxDublin',
 'TEDxEQChCh',
 'TEDxEast',
 'TEDxEastEnd',
 'TEDxEdmonton',
 'TEDxEuston',
 'TEDxExeter',
 'TEDxFiDiWomen',
 'TEDxFrankfurt',
 'TEDxFulbrightDublin',
 'TEDxGatewayWomen',
 'TEDxGeorgetown',
 'TEDxGhent',
 'TEDxGlasgow',
 'TEDxGoldenGatePark 2012',
 'TEDxGoodenoughCollege',
 'TEDxGrandRapids',
 'TEDxGreatPacificGarbagePatch',
 'TEDxGroningen',
 'TEDxGöteborg 2010',
 'TEDxHamburg',
 'TEDxHampshireCollege',
 'TEDxHelvetia',
 'TEDxHogeschoolUtrecht',
 'TEDxHousesOfParliament',
 'TEDxHouston',
 'TEDxImperialCollege',
 'TEDxIndianaUniversity',
 'TEDxIndianapolis',
 'TEDxIslay',
 'TEDxJaffa 2012',
 'TEDxJaffa 2013',
 'TEDxKC',
 '[email protected]',
 '[email protected]',
 'TEDxKrakow',
 'TEDxKyoto',
 'TEDxLeuvenSalon',
 'TEDxLinnaeusUniversity',
 'TEDxLondonBusinessSchool',
 'TEDxMIA',
 'TEDxMaastricht',
 'TEDxManchester',
 'TEDxManhattan',
 'TEDxManhattanBeach',
 'TEDxMarin',
 'TEDxMaui',
 'TEDxMet',
 'TEDxMiamiUniversity',
 'TEDxMidAtlantic',
 'TEDxMidAtlantic 2013',
 'TEDxMidwest',
 'TEDxMileHigh',
 'TEDxMonroeCorrectionalComplex',
 'TEDxMonterey',
 'TEDxMontreal',
 'TEDxMtHood',
 'TEDxMuncyStatePrison',
 'TEDxNASA',
 '[email protected]',
 'TEDxNYED',
 'TEDxNatick',
 'TEDxNewYork',
 'TEDxNewy',
 'TEDxNextGenerationAsheville',
 'TEDxNijmegen',
 'TEDxNorrkoping',
 'TEDxNorthwesternU',
 "TEDxO'Porto",
 'TEDxObserver',
 'TEDxOilSpill',
 'TEDxOmaha',
 'TEDxOrangeCoast',
 'TEDxOrcasIsland',
 'TEDxOslo',
 'TEDxPSU',
 'TEDxParis 2010',
 'TEDxParis 2012',
 'TEDxPeachtree',
 'TEDxPenn',
 'TEDxPennQuarter',
 'TEDxPennsylvaniaAvenue',
 'TEDxPerth',
 'TEDxPhoenix',
 'TEDxPittsburgh',
 'TEDxPlaceDesNations',
 'TEDxPortland',
 'TEDxPortofSpain',
 'TEDxProvidence',
 'TEDxPuget Sound ',
 'TEDxRC2',
 'TEDxRainier',
 'TEDxRiodelaPlata',
 'TEDxRotterdam 2010',
 'TEDxSBU',
 'TEDxSF',
 'TEDxSFU',
 'TEDxSMU',
 'TEDxSaltLakeCity',
 'TEDxSanDiego',
 'TEDxSanJoseCA',
 'TEDxSanMigueldeAllende',
 'TEDxSanQuentin',
 'TEDxSantaCruz',
 'TEDxSeattleU',
 'TEDxSeoul',
 'TEDxSiliconValley',
 'TEDxSkoll',
 'TEDxSonomaCounty',
 'TEDxSouthBank',
 'TEDxStanford',
 'TEDxSummit',
 'TEDxSussexUniversity',
 'TEDxSydney',
 'TEDxTC',
 'TEDxTeen',
 'TEDxTelAviv 2010',
 'TEDxThessaloniki',
 'TEDxTokyo',
 'TEDxToronto',
 'TEDxToronto 2010',
 'TEDxToronto 2011',
 'TEDxToulouse',
 'TEDxUCL',
 'TEDxUF',
 'TEDxUIUC',
 'TEDxUM',
 'TEDxUMKC',
 'TEDxUSC',
 'TEDxUW',
 'TEDxUdeM',
 'TEDxUniversityofNevada',
 'TEDxUofM',
 'TEDxVancouver',
 'TEDxVictoria',
 'TEDxVienna',
 'TEDxVirginiaTech',
 'TEDxWarwick',
 'TEDxWaterloo',
 'TEDxWinnipeg',
 'TEDxWitsUniversity',
 'TEDxWomen 2011',
 'TEDxWomen 2012',
 'TEDxYYC',
 '[email protected]',
 'TEDxYou[email protected]',
 'TEDxZurich',
 'TEDxZurich 2011',
 'TEDxZurich 2012',
 'TEDxZurich 2013',
 'Taste3 2008',
 'The Do Lectures',
 'Toronto Youth Corps',
 'University of California',
 'Web 2.0 Expo 2008',
 'World Science Festival']
In [24]:
sorted(df_ted[df_ted['event'].str.startswith('TEDx')]['event'].unique())
Out[24]:
['TEDxABQ',
 'TEDxAmazonia',
 'TEDxAmericanRiviera',
 'TEDxAmoskeagMillyard',
 'TEDxAmsterdam',
 'TEDxArendal',
 'TEDxAsheville',
 'TEDxAthens',
 'TEDxAtlanta',
 'TEDxAustin',
 'TEDxBG',
 'TEDxBeaconStreet',
 'TEDxBeirut',
 'TEDxBend',
 'TEDxBerkeley',
 'TEDxBerlin',
 'TEDxBinghamtonUniversity',
 'TEDxBloomington',
 'TEDxBoston',
 'TEDxBoston 2009',
 'TEDxBoston 2010',
 'TEDxBoston 2011',
 'TEDxBoston 2012',
 'TEDxBoulder',
 'TEDxBoulder 2011',
 'TEDxBratislava',
 'TEDxBrighton',
 'TEDxBrussels',
 'TEDxCERN',
 'TEDxCHUV',
 'TEDxCMU',
 'TEDxCaFoscariU',
 'TEDxCaltech',
 'TEDxCambridge',
 'TEDxCanberra',
 'TEDxCannes',
 'TEDxChange',
 'TEDxChapmanU',
 'TEDxClaremontColleges',
 'TEDxColbyCollege',
 'TEDxColoradoSprings',
 'TEDxColumbus',
 'TEDxColumbusWomen',
 'TEDxConcorde',
 'TEDxConcordiaUPortland',
 'TEDxCreativeCoast',
 'TEDxCrenshaw',
 'TEDxDU 2010',
 'TEDxDU 2011',
 'TEDxDanubia',
 'TEDxDeExtinction',
 'TEDxDelft',
 'TEDxDesMoines',
 'TEDxDirigo',
 'TEDxDubai',
 'TEDxDublin',
 'TEDxEQChCh',
 'TEDxEast',
 'TEDxEastEnd',
 'TEDxEdmonton',
 'TEDxEuston',
 'TEDxExeter',
 'TEDxFiDiWomen',
 'TEDxFrankfurt',
 'TEDxFulbrightDublin',
 'TEDxGatewayWomen',
 'TEDxGeorgetown',
 'TEDxGhent',
 'TEDxGlasgow',
 'TEDxGoldenGatePark 2012',
 'TEDxGoodenoughCollege',
 'TEDxGrandRapids',
 'TEDxGreatPacificGarbagePatch',
 'TEDxGroningen',
 'TEDxGöteborg 2010',
 'TEDxHamburg',
 'TEDxHampshireCollege',
 'TEDxHelvetia',
 'TEDxHogeschoolUtrecht',
 'TEDxHousesOfParliament',
 'TEDxHouston',
 'TEDxImperialCollege',
 'TEDxIndianaUniversity',
 'TEDxIndianapolis',
 'TEDxIslay',
 'TEDxJaffa 2012',
 'TEDxJaffa 2013',
 'TEDxKC',
 '[email protected]',
 '[email protected]',
 'TEDxKrakow',
 'TEDxKyoto',
 'TEDxLeuvenSalon',
 'TEDxLinnaeusUniversity',
 'TEDxLondonBusinessSchool',
 'TEDxMIA',
 'TEDxMaastricht',
 'TEDxManchester',
 'TEDxManhattan',
 'TEDxManhattanBeach',
 'TEDxMarin',
 'TEDxMaui',
 'TEDxMet',
 'TEDxMiamiUniversity',
 'TEDxMidAtlantic',
 'TEDxMidAtlantic 2013',
 'TEDxMidwest',
 'TEDxMileHigh',
 'TEDxMonroeCorrectionalComplex',
 'TEDxMonterey',
 'TEDxMontreal',
 'TEDxMtHood',
 'TEDxMuncyStatePrison',
 'TEDxNASA',
 '[email protected]',
 'TEDxNYED',
 'TEDxNatick',
 'TEDxNewYork',
 'TEDxNewy',
 'TEDxNextGenerationAsheville',
 'TEDxNijmegen',
 'TEDxNorrkoping',
 'TEDxNorthwesternU',
 "TEDxO'Porto",
 'TEDxObserver',
 'TEDxOilSpill',
 'TEDxOmaha',
 'TEDxOrangeCoast',
 'TEDxOrcasIsland',
 'TEDxOslo',
 'TEDxPSU',
 'TEDxParis 2010',
 'TEDxParis 2012',
 'TEDxPeachtree',
 'TEDxPenn',
 'TEDxPennQuarter',
 'TEDxPennsylvaniaAvenue',
 'TEDxPerth',
 'TEDxPhoenix',
 'TEDxPittsburgh',
 'TEDxPlaceDesNations',
 'TEDxPortland',
 'TEDxPortofSpain',
 'TEDxProvidence',
 'TEDxPuget Sound ',
 'TEDxRC2',
 'TEDxRainier',
 'TEDxRiodelaPlata',
 'TEDxRotterdam 2010',
 'TEDxSBU',
 'TEDxSF',
 'TEDxSFU',
 'TEDxSMU',
 'TEDxSaltLakeCity',
 'TEDxSanDiego',
 'TEDxSanJoseCA',
 'TEDxSanMigueldeAllende',
 'TEDxSanQuentin',
 'TEDxSantaCruz',
 'TEDxSeattleU',
 'TEDxSeoul',
 'TEDxSiliconValley',
 'TEDxSkoll',
 'TEDxSonomaCounty',
 'TEDxSouthBank',
 'TEDxStanford',
 'TEDxSummit',
 'TEDxSussexUniversity',
 'TEDxSydney',
 'TEDxTC',
 'TEDxTeen',
 'TEDxTelAviv 2010',
 'TEDxThessaloniki',
 'TEDxTokyo',
 'TEDxToronto',
 'TEDxToronto 2010',
 'TEDxToronto 2011',
 'TEDxToulouse',
 'TEDxUCL',
 'TEDxUF',
 'TEDxUIUC',
 'TEDxUM',
 'TEDxUMKC',
 'TEDxUSC',
 'TEDxUW',
 'TEDxUdeM',
 'TEDxUniversityofNevada',
 'TEDxUofM',
 'TEDxVancouver',
 'TEDxVictoria',
 'TEDxVienna',
 'TEDxVirginiaTech',
 'TEDxWarwick',
 'TEDxWaterloo',
 'TEDxWinnipeg',
 'TEDxWitsUniversity',
 'TEDxWomen 2011',
 'TEDxWomen 2012',
 'TEDxYYC',
 '[email protected]',
 '[email protected]',
 'TEDxZurich',
 'TEDxZurich 2011',
 'TEDxZurich 2012',
 'TEDxZurich 2013']
In [25]:
sorted(df_ted[~df_ted['event'].str.startswith('TEDx')]['event'].unique())
Out[25]:
['AORN Congress',
 'Arbejdsglaede Live',
 'BBC TV',
 'Bowery Poetry Club',
 'Business Innovation Factory',
 'Carnegie Mellon University',
 'Chautauqua Institution',
 'DICE Summit 2010',
 'DLD 2007',
 'EG 2007',
 'EG 2008',
 'Elizabeth G. Anderson School',
 "Eric Whitacre's Virtual Choir",
 'Fort Worth City Council',
 'Full Spectrum Auditions',
 'Gel Conference',
 'Global Witness HQ',
 'Handheld Learning',
 'Harvard University',
 'INK Conference',
 'Justice with Michael Sandel',
 'LIFT 2007',
 'Michael Howard Studios',
 'Mission Blue II',
 'Mission Blue Voyage',
 'New York State Senate',
 'NextGen:Charity',
 'Princeton University',
 'RSA Animate',
 'Royal Institution',
 'Serious Play 2008',
 'Skoll World Forum 2007',
 'SoulPancake',
 'Stanford University',
 'TED Dialogues',
 'TED Fellows 2015',
 'TED Fellows Retreat 2013',
 'TED Fellows Retreat 2015',
 'TED Prize Wish',
 'TED Residency',
 'TED Senior Fellows at TEDGlobal 2010',
 'TED Studio',
 'TED Talks Education',
 'TED Talks Live',
 'TED in the Field',
 'TED-Ed',
 'TED-Ed Weekend',
 'TED1984',
 'TED1990',
 'TED1994',
 'TED1998',
 'TED2001',
 'TED2002',
 'TED2003',
 'TED2004',
 'TED2005',
 'TED2006',
 'TED2007',
 'TED2008',
 'TED2009',
 'TED2010',
 'TED2011',
 'TED2012',
 'TED2013',
 'TED2014',
 'TED2015',
 'TED2016',
 'TED2017',
 '[email protected] Berlin',
 '[email protected] London',
 '[email protected] Paris',
 '[email protected] San Francisco',
 '[email protected] Singapore',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected] York',
 '[email protected]',
 '[email protected]',
 '[email protected] Street Boston',
 '[email protected] Street London',
 '[email protected] Boston',
 '[email protected]',
 '[email protected]',
 'TEDActive 2011',
 'TEDActive 2014',
 'TEDActive 2015',
 'TEDCity2.0',
 'TEDGlobal 2005',
 'TEDGlobal 2007',
 'TEDGlobal 2009',
 'TEDGlobal 2010',
 'TEDGlobal 2011',
 'TEDGlobal 2012',
 'TEDGlobal 2013',
 'TEDGlobal 2014',
 'TEDGlobal 2017',
 'TEDGlobal>Geneva',
 'TEDGlobal>London',
 'TEDGlobalLondon',
 'TEDIndia 2009',
 'TEDLagos Ideas Search',
 'TEDMED 2009',
 'TEDMED 2010',
 'TEDMED 2011',
 'TEDMED 2012',
 'TEDMED 2013',
 'TEDMED 2014',
 'TEDMED 2015',
 'TEDMED 2016',
 'TEDNYC',
 'TEDNairobi Ideas Search',
 '[email protected]',
 'TEDSalon 2006',
 'TEDSalon 2007 Hot Science',
 'TEDSalon 2009 Compassion',
 'TEDSalon Berlin 2014',
 'TEDSalon London 2009',
 'TEDSalon London 2010',
 'TEDSalon London Fall 2011',
 'TEDSalon London Fall 2012',
 'TEDSalon London Spring 2011',
 'TEDSalon London Spring 2012',
 'TEDSalon NY2011',
 'TEDSalon NY2012',
 'TEDSalon NY2013',
 'TEDSalon NY2014',
 'TEDSalon NY2015',
 'TEDSummit',
 'TEDWomen 2010',
 'TEDWomen 2013',
 'TEDWomen 2015',
 'TEDWomen 2016',
 'TEDYouth 2011',
 'TEDYouth 2012',
 'TEDYouth 2013',
 'TEDYouth 2014',
 'TEDYouth 2015',
 'Taste3 2008',
 'The Do Lectures',
 'Toronto Youth Corps',
 'University of California',
 'Web 2.0 Expo 2008',
 'World Science Festival']

We can add a feature that distinguishes the different types of events.

In [26]:
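def get_event_type(event):
    '''
    Returns type of event
    '''
    if 'TED' not in event:
        return 'NOT_TED'
    elif event.startswith('TEDx'):
        return 'TEDx'
    # Partner events; the 'TED@' prefix is an assumption based on the event names
    elif event.startswith('TED@'):
        return 'TED@'
    # Main conference named by year, e.g. 'TED2014'
    elif re.fullmatch(r'TED\d{4}', event) is not None:
        return 'TED_YEAR'
    else:
        # Otherwise use the first word, e.g. 'TEDGlobal 2012' -> 'TEDGlobal'
        return event.split()[0]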
In [27]:
df_ted['event'].apply(get_event_type).value_counts()
Out[27]:
TED_YEAR            978
TEDx                471
TEDGlobal           431
NOT_TED             111
TEDWomen             96
TED@                 87
TEDSalon             79
TEDMED               68
TED                  58
TEDIndia             35
TEDSummit            34
TEDYouth             19
TEDNYC               19
TEDGlobal>London     13
TEDCity2.0           11
TEDGlobal>Geneva     11
TED-Ed               10
TEDGlobalLondon       8
TEDActive             6
TEDLagos              2
[email protected]           2
TEDNairobi            1
Name: event, dtype: int64
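
The counts above show that TEDGlobal events end up under several labels (TEDGlobal, TEDGlobal>London, TEDGlobal>Geneva, TEDGlobalLondon), because the fallback branch takes the first whitespace-separated word. One possible refinement, which the notebook itself does not apply, is to fold these variants into a single category. The helper name merge_tedglobal_variants is hypothetical:

```python
def merge_tedglobal_variants(event_type):
    """Map 'TEDGlobal>London', 'TEDGlobalLondon', ... onto plain 'TEDGlobal'."""
    if event_type.startswith('TEDGlobal'):
        return 'TEDGlobal'
    return event_type


# Labels taken from the value_counts output above.
sample_types = [
    'TEDGlobal', 'TEDGlobal>London', 'TEDGlobal>Geneva',
    'TEDGlobalLondon', 'TEDx', 'TED_YEAR',
]
merged_types = [merge_tedglobal_variants(t) for t in sample_types]
print(merged_types)
```

Fewer, larger categories are usually easier to one-hot encode, though whether merging helps the model here would have to be checked on validation data.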

Wikipedia has some additional information on the different conference types: https://en.wikipedia.org/wiki/TED_(conference)

In [28]:
df_ted.columns
Out[28]:
Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'transcript'],
      dtype='object')

film_date

In [29]:
df_ted['film_date'].describe()
Out[29]:
count                    2550
unique                    735
top       2017-04-24 00:00:00
freq                       64
first     1972-05-14 00:00:00
last      2017-08-27 00:00:00
Name: film_date, dtype: object

Some talks were filmed as early as 1972. Let's look at all talks filmed before 2000.

In [30]:
df_ted[df_ted['film_date'] < '2000-01-01']
Out[30]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
200 64 With surprising accuracy, Nicholas Negroponte ... 1523 TED1984 1984-02-02 00:00:00 18 Nicholas Negroponte Nicholas Negroponte: 5 predictions, from 1984 1 2008-03-11 01:26:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 7... [{'id': 288, 'hero': 'https://pe.tedcdn.com/im... Tech visionary ['demo', 'design', 'entertainment', 'future', ... 5 predictions, from 1984 https://www.ted.com/talks/nicholas_negroponte_... 974087 In this rather long sort of marathon presentat...
202 9 Before he was a legend, architect Frank Gehry ... 2678 TED1990 1990-03-03 00:00:00 19 Frank Gehry Frank Gehry: My days as a young rebel 1 2008-03-13 01:38:00 [{'id': 11, 'name': 'Longwinded', 'count': 76}... [{'id': 13, 'hero': 'https://pe.tedcdn.com/ima... Architect ['architecture', 'collaboration', 'design', 'g... My days as a young rebel https://www.ted.com/talks/frank_gehry_as_a_you... 620806 I'm going to go right into the slides. And all...
260 791 Speaking at TED in 1998, Rev. Billy Graham mar... 1580 TED1998 1998-02-02 00:00:00 27 Billy Graham Billy Graham: On technology and faith 1 2008-07-16 01:00:00 [{'id': 3, 'name': 'Courageous', 'count': 470}... [{'id': 71, 'hero': 'https://pe.tedcdn.com/ima... Preacher ['Christianity', 'God', 'death', 'faith', 'rel... On technology and faith https://www.ted.com/talks/billy_graham_on_tech... 1532675 As a clergyman, you can imagine how out of pla...
290 66 With vibrant video clips captured by submarine... 800 TED1998 1998-02-28 00:00:00 25 David Gallo David Gallo: Life in the deep oceans 1 2008-09-11 01:00:00 [{'id': 10, 'name': 'Inspiring', 'count': 155}... [{'id': 40, 'hero': 'https://pe.tedcdn.com/ima... Oceanographer ['animals', 'geology', 'life', 'oceans', 'scie... Life in the deep oceans https://www.ted.com/talks/david_gallo_on_life_... 996736 (Applause) David Gallo: This is Bill Lange. I'...
316 29 In 1998, aircraft designer Paul MacCready look... 1368 TED1998 1998-02-02 00:00:00 19 Paul MacCready Paul MacCready: Nature vs. humans 1 2008-10-22 01:00:00 [{'id': 9, 'name': 'Ingenious', 'count': 47}, ... [{'id': 74, 'hero': 'https://pe.tedcdn.com/ima... Engineer ['demo', 'design', 'drones', 'flight', 'green'... Nature vs. humans https://www.ted.com/talks/paul_maccready_on_na... 197139 You hear that this is the era of environment —...
376 67 In this TED archive video from 1998, paralympi... 1345 TED1998 1998-02-02 00:00:00 30 Aimee Mullins Aimee Mullins: Changing my legs - and my mindset 1 2009-01-28 01:00:00 [{'id': 3, 'name': 'Courageous', 'count': 249}... [{'id': 82, 'hero': 'https://pe.tedcdn.com/ima... Athlete and actor ['beauty', 'body language', 'design', 'prosthe... Changing my legs - and my mindset https://www.ted.com/talks/aimee_mullins_on_run... 1013266 Sheryl Shade: Hi, Aimee. Aimee Mullins: Hi.SS:...
382 21 From the TED archives: The legendary graphic d... 914 TED1998 1998-02-02 00:00:00 20 Milton Glaser Milton Glaser: Using design to make ideas new 1 2009-02-11 01:00:00 [{'id': 22, 'name': 'Fascinating', 'count': 58... [{'id': 215, 'hero': 'https://pe.tedcdn.com/im... Graphic designer ['art', 'communication', 'creativity', 'cultur... Using design to make ideas new https://www.ted.com/talks/milton_glaser_on_usi... 382985 'Theme and variations' is one of those forms t...
395 91 At TED in 1998, Brenda Laurel asks: Why are al... 788 TED1998 1998-02-02 00:00:00 30 Brenda Laurel Brenda Laurel: Why not make video games for gi... 1 2009-03-02 01:00:00 [{'id': 25, 'name': 'OK', 'count': 90}, {'id':... [{'id': 361, 'hero': 'https://pe.tedcdn.com/im... Designer and theorist ['children', 'culture', 'design', 'entertainme... Why not make video games for girls? https://www.ted.com/talks/brenda_laurel_on_mak... 382517 Back in 1992, I started working for a company ...
600 133 At the Royal Institution in 1991, Richard Dawk... 3475 Royal Institution 1991-12-20 14:11:00 0 Richard Dawkins Richard Dawkins: Growing up in the universe 1 2010-01-23 13:10:00 [{'id': 22, 'name': 'Fascinating', 'count': 15... [{'id': 98, 'hero': 'https://pe.tedcdn.com/ima... Evolutionary biologist ['biology', 'evolution', 'life', 'science', 'u... Growing up in the universe https://www.ted.com/talks/richard_dawkins_grow... 318423 NaN
629 146 In this archival footage from BBC TV, celebrat... 3955 BBC TV 1983-07-08 17:00:00 0 Richard Feynman Richard Feynman: Physics is fun to imagine 1 2010-03-03 15:57:00 [{'id': 8, 'name': 'Informative', 'count': 324... [{'id': 194, 'hero': 'https://pe.tedcdn.com/im... Physicist ['astronomy', 'physics', 'science'] Physics is fun to imagine https://www.ted.com/talks/richard_feynman\n 521974 NaN
686 373 In this rare clip from 1972, legendary psychia... 262 Toronto Youth Corps 1972-05-14 00:00:00 0 Viktor Frankl Viktor Frankl: Why believe in others 1 2010-05-14 14:37:00 [{'id': 10, 'name': 'Inspiring', 'count': 1673... [{'id': 272, 'hero': 'https://pe.tedcdn.com/im... Psychiatrist, neurologist, author ['humanity', 'mind', 'peace', 'psychology', 'w... Why believe in others https://www.ted.com/talks/viktor_frankl_youth_... 1028630 NaN
1131 71 From deep in the TED archive, Danny Hillis out... 1150 TED1994 1994-02-20 00:00:00 22 Danny Hillis Danny Hillis: Back to the future (of 1994) 1 2012-02-03 16:06:37 [{'id': 11, 'name': 'Longwinded', 'count': 35}... [{'id': 1082, 'hero': 'https://pe.tedcdn.com/i... Computer theorist ['DNA', 'TED Brain Trust', 'computers', 'engin... Back to the future (of 1994) https://www.ted.com/talks/danny_hillis_back_to... 581419 Because I usually take the role of trying to e...

The table shows three talks that are not from TED events, all of them filmed before 1992.
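
The observation above can also be checked in code instead of being read off the table. This is a minimal sketch, assuming `film_date` is already a datetime column (as the output suggests) and that TED-organized events have names starting with "TED". The frame `sample` and the helper `non_ted_before` are hypothetical and only stand in for `df_ted`:

```python
import pandas as pd

# Toy stand-in for df_ted: event names and filming dates taken from the table above
# (times of day dropped for brevity).
sample = pd.DataFrame({
    'event': ['TED1984', 'TED1990', 'Royal Institution', 'BBC TV',
              'Toronto Youth Corps', 'TED1994'],
    'film_date': pd.to_datetime(['1984-02-02', '1990-03-03', '1991-12-20',
                                 '1983-07-08', '1972-05-14', '1994-02-20']),
})

def non_ted_before(df, year):
    """Return talks filmed before `year` whose event name does not start with 'TED'."""
    is_early = df['film_date'].dt.year < year
    is_ted = df['event'].str.startswith('TED')
    return df[is_early & ~is_ted]

early_non_ted = non_ted_before(sample, 1992)
print(early_non_ted['event'].tolist())
```

On the real data the same call would be `non_ted_before(df_ted, 1992)`. Note that the prefix test also treats TEDx events as TED events, which may or may not be the desired definition.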

languages

In [31]:
df_ted['languages'].describe(), df_ted['languages'].median()
Out[31]:
(count    2550.000000
 mean       27.326275
 std         9.563452
 min         0.000000
 25%        23.000000
 50%        28.000000
 75%        33.000000
 max        72.000000
 Name: languages, dtype: float64, 28.0)

Interestingly, some of the talks have a language count of zero. Let's investigate.

In [32]:
df_ted[df_ted['languages'] == 0]
Out[32]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
58 222 Two Pilobolus dancers perform "Symbiosis." Doe... 825 TED2005 2005-02-25 00:00:00 0 Pilobolus Pilobolus: A dance of "Symbiosis" 1 2007-02-09 00:11:00 [{'id': 1, 'name': 'Beautiful', 'count': 1810}... [{'id': 40, 'hero': 'https://pe.tedcdn.com/ima... Dance company ['dance', 'entertainment', 'nature', 'performa... A dance of "Symbiosis" https://www.ted.com/talks/pilobolus_perform_sy... 3051507 NaN
115 27 The avant-garde string quartet Ethel performs ... 214 TED2006 2006-02-02 00:00:00 0 Ethel Ethel: A string quartet plays "Blue Room" 1 2007-06-18 16:29:00 [{'id': 1, 'name': 'Beautiful', 'count': 216},... [{'id': 103, 'hero': 'https://pe.tedcdn.com/im... String quartet ['cello', 'collaboration', 'culture', 'enterta... A string quartet plays "Blue Room" https://www.ted.com/talks/ethel_performs_blue_... 384641 NaN
135 36 After Vusi Mahlasela's 3-song set at TEDGlobal... 299 TEDGlobal 2007 2007-06-08 00:00:00 0 Vusi Mahlasela Vusi Mahlasela: "Woza" 1 2007-08-21 11:24:00 [{'id': 8, 'name': 'Informative', 'count': 4},... [{'id': 158, 'hero': 'https://pe.tedcdn.com/im... Musician, activist ['Africa', 'entertainment', 'guitar', 'live mu... "Woza" https://www.ted.com/talks/vusi_mahlasela_s_enc... 416603 NaN
209 67 Rokia Traore sings the moving "M'Bifo," accomp... 419 TEDGlobal 2007 2007-06-06 00:00:00 0 Rokia Traore Rokia Traore: "M'Bifo" 1 2008-03-27 01:18:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 5... [{'id': 265, 'hero': 'https://pe.tedcdn.com/im... Singer-songwriter ['Africa', 'entertainment', 'guitar', 'live mu... "M'Bifo" https://www.ted.com/talks/rokia_traore_sings_m... 294936 NaN
237 43 Singer-songwriter Rokia Traore performs "Kouna... 386 TEDGlobal 2007 2007-06-06 00:00:00 0 Rokia Traore Rokia Traore: "Kounandi" 1 2008-06-05 01:00:00 [{'id': 22, 'name': 'Fascinating', 'count': 84... [{'id': 186, 'hero': 'https://pe.tedcdn.com/im... Singer-songwriter ['Africa', 'guitar', 'live music', 'music', 's... "Kounandi" https://www.ted.com/talks/rokia_traore_sings_k... 82488 NaN
249 50 Composer Sxip Shirey makes music from the simp... 186 TED2008 2008-02-12 00:00:00 0 Sxip Shirey + Rachelle Garniez Sxip Shirey + Rachelle Garniez: A performance ... 2 2008-06-30 01:00:00 [{'id': 9, 'name': 'Ingenious', 'count': 44}, ... [{'id': 115, 'hero': 'https://pe.tedcdn.com/im... Musician ['entertainment', 'live music', 'music'] A performance with breath, music, passion https://www.ted.com/talks/sxip_shirey_at_the_b... 217663 NaN
399 194 Eric Lewis, an astonishingly talented crossove... 636 TED2009 2009-02-06 00:00:00 0 Eric Lewis Eric Lewis: Piano jazz that rocks 1 2009-03-06 01:00:00 [{'id': 21, 'name': 'Unconvincing', 'count': 8... [{'id': 46, 'hero': 'https://pe.tedcdn.com/ima... Pianist ['entertainment', 'innovation', 'invention', '... Piano jazz that rocks https://www.ted.com/talks/eric_lewis_strikes_c... 697257 NaN
446 138 Eric Lewis explores the piano's expressive pow... 294 TED2009 2009-02-05 00:00:00 0 Eric Lewis Eric Lewis: Chaos and harmony on piano 1 2009-05-12 01:00:00 [{'id': 26, 'name': 'Obnoxious', 'count': 84},... [{'id': 478, 'hero': 'https://pe.tedcdn.com/im... Pianist ['art', 'entertainment', 'live music', 'music'... Chaos and harmony on piano https://www.ted.com/talks/eric_lewis_plays_cha... 391427 NaN
474 135 Organ virtuoso Qi Zhang plays her electric ren... 185 TEDxUSC 2009-03-23 00:00:00 0 Qi Zhang Qi Zhang: An electrifying organ performance 1 2009-06-19 08:50:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 46, 'hero': 'https://pe.tedcdn.com/ima... Organist ['TEDx', 'china', 'music', 'performance', 'pia... An electrifying organ performance https://www.ted.com/talks/qi_zhang_s_electrify... 803691 NaN
512 146 Vishal Vaid and his band explore a traditional... 814 TED2006 2006-02-09 00:00:00 0 Vishal Vaid Vishal Vaid: Hypnotic South Asian improv music 1 2009-09-11 01:00:00 [{'id': 25, 'name': 'OK', 'count': 68}, {'id':... [{'id': 581, 'hero': 'https://pe.tedcdn.com/im... Musician ['Asia', 'beauty', 'culture', 'history', 'musi... Hypnotic South Asian improv music https://www.ted.com/talks/vishal_vaid_s_hypnot... 414396 NaN
547 46 The euphonium, with its sweet brass sound, is ... 141 TEDGlobal 2009 2009-07-23 00:00:00 0 Matthew White Matthew White: The modern euphonium 1 2009-10-30 01:00:00 [{'id': 9, 'name': 'Ingenious', 'count': 43}, ... [{'id': 478, 'hero': 'https://pe.tedcdn.com/im... Brass virtuoso ['creativity', 'live music', 'music', 'perform... The modern euphonium https://www.ted.com/talks/matthew_white_gives_... 771962 NaN
580 307 Is torture ever justified? Would you steal a d... 3296 Justice with Michael Sandel 2005-09-01 00:00:00 0 Michael Sandel Michael Sandel: What's the right thing to do? 1 2009-12-23 12:46:00 [{'id': 24, 'name': 'Persuasive', 'count': 132... [{'id': 187, 'hero': 'https://pe.tedcdn.com/im... Political philosopher ['government', 'law', 'philosophy', 'politics'] What's the right thing to do? https://www.ted.com/talks/michael_sandel_what_... 393459 NaN
581 105 At the BIF innovation summit, Cat Laine draws ... 889 Business Innovation Factory 2009-10-07 00:00:00 0 Cat Laine Cat Laine: Engineering a better life for all 1 2009-12-23 12:49:00 [{'id': 22, 'name': 'Fascinating', 'count': 40... [{'id': 702, 'hero': 'https://pe.tedcdn.com/im... Social entrepreneur ['business', 'creativity', 'engineering', 'glo... Engineering a better life for all https://www.ted.com/talks/cat_laine_engineerin... 154698 NaN
588 124 In 2007, Carnegie Mellon professor Randy Pausc... 4587 Carnegie Mellon University 2007-09-18 00:00:00 0 Randy Pausch Randy Pausch: Really achieving your childhood ... 1 2010-01-08 11:32:00 [{'id': 3, 'name': 'Courageous', 'count': 299}... [{'id': 229, 'hero': 'https://pe.tedcdn.com/im... Professor ['culture', 'disease', 'education', 'life', 's... Really achieving your childhood dreams https://www.ted.com/talks/randy_pausch_really_... 564781 NaN
589 204 At Stanford University, primatologist Robert S... 2246 Stanford University 2009-09-08 15:40:00 0 Robert Sapolsky Robert Sapolsky: The uniqueness of humans 1 2010-01-08 14:39:00 [{'id': 25, 'name': 'OK', 'count': 40}, {'id':... [{'id': 11, 'hero': 'https://pe.tedcdn.com/ima... Neuroscientist, primatologist, writer ['biology', 'brain', 'humanity', 'life', 'mind... The uniqueness of humans https://www.ted.com/talks/robert_sapolsky_the_... 572845 NaN
590 87 Matt Weinstein lost his life savings to Bernie... 510 AORN Congress 2009-03-14 10:00:00 0 Matt Weinstein Matt Weinstein: What Bernie Madoff couldn't st... 1 2010-01-09 08:59:00 [{'id': 10, 'name': 'Inspiring', 'count': 246}... [{'id': 211, 'hero': 'https://pe.tedcdn.com/im... Motivational speaker ['business', 'life', 'money', 'presentation'] What Bernie Madoff couldn't steal from me https://www.ted.com/talks/matt_weinstein_what_... 149818 NaN
594 20 In the midst of an earlier crisis, Haitian aut... 3573 University of California 2004-10-13 00:00:00 0 Edwidge Danticat Edwidge Danticat: Stories of Haiti 1 2010-01-14 15:04:00 [{'id': 1, 'name': 'Beautiful', 'count': 25}, ... [{'id': 652, 'hero': 'https://pe.tedcdn.com/im... Author ['books', 'disaster relief', 'novel', 'poetry'... Stories of Haiti https://www.ted.com/talks/edwidge_danticat_sto... 50443 NaN
599 104 Percussionist Sivamani delivers one of TED's l... 1000 TEDIndia 2009 2009-11-06 00:00:00 0 Sivamani Sivamani: Rhythm is everything, everywhere 1 2010-01-22 08:24:00 [{'id': 11, 'name': 'Longwinded', 'count': 46}... [{'id': 544, 'hero': 'https://pe.tedcdn.com/im... Percussionist ['live music', 'music', 'performance'] Rhythm is everything, everywhere https://www.ted.com/talks/sivamani_rhythm_is_e... 556163 NaN
600 133 At the Royal Institution in 1991, Richard Dawk... 3475 Royal Institution 1991-12-20 14:11:00 0 Richard Dawkins Richard Dawkins: Growing up in the universe 1 2010-01-23 13:10:00 [{'id': 22, 'name': 'Fascinating', 'count': 15... [{'id': 98, 'hero': 'https://pe.tedcdn.com/ima... Evolutionary biologist ['biology', 'evolution', 'life', 'science', 'u... Growing up in the universe https://www.ted.com/talks/richard_dawkins_grow... 318423 NaN
601 177 Ever heard the phrase "Those who can't do, tea... 182 Bowery Poetry Club 2005-11-12 14:18:00 0 Taylor Mali Taylor Mali: What teachers make 1 2010-01-23 13:16:00 [{'id': 10, 'name': 'Inspiring', 'count': 825}... [{'id': 66, 'hero': 'https://pe.tedcdn.com/ima... Slam poet ['education', 'performance', 'poetry', 'writing'] What teachers make https://www.ted.com/talks/taylor_mali_what_tea... 676741 NaN
607 356 At her Harvard commencement speech, "Harry Pot... 1258 Harvard University 2008-06-05 00:00:00 0 JK Rowling JK Rowling: The fringe benefits of failure 1 2010-01-30 15:03:00 [{'id': 26, 'name': 'Obnoxious', 'count': 18},... [{'id': 453, 'hero': 'https://pe.tedcdn.com/im... Author ['books', 'creativity', 'goal-setting', 'pover... The fringe benefits of failure https://www.ted.com/talks/jk_rowling_the_fring... 1406151 NaN
625 321 In this fun, 3-min performance from the World ... 184 World Science Festival 2009-06-10 00:00:00 0 Bobby McFerrin Bobby McFerrin: Watch me play ... the audience! 1 2010-02-27 01:00:00 [{'id': 24, 'name': 'Persuasive', 'count': 163... [{'id': 286, 'hero': 'https://pe.tedcdn.com/im... Musician ['brain', 'mind', 'music', 'performance'] Watch me play ... the audience! https://www.ted.com/talks/bobby_mcferrin_hacks... 3302312 NaN
629 146 In this archival footage from BBC TV, celebrat... 3955 BBC TV 1983-07-08 17:00:00 0 Richard Feynman Richard Feynman: Physics is fun to imagine 1 2010-03-03 15:57:00 [{'id': 8, 'name': 'Informative', 'count': 324... [{'id': 194, 'hero': 'https://pe.tedcdn.com/im... Physicist ['astronomy', 'physics', 'science'] Physics is fun to imagine https://www.ted.com/talks/richard_feynman\n 521974 NaN
637 556 At the Web 2.0 Expo, entrepreneur Gary Vaynerc... 927 Web 2.0 Expo 2008 2008-09-19 17:49:00 0 Gary Vaynerchuk Gary Vaynerchuk: Do what you love (no excuses!) 1 2010-03-12 16:46:00 [{'id': 10, 'name': 'Inspiring', 'count': 758}... [{'id': 473, 'hero': 'https://pe.tedcdn.com/im... Entrepreneur ['Internet', 'business', 'entrepreneur', 'mone... Do what you love (no excuses!) https://www.ted.com/talks/gary_vaynerchuk_do_w... 757791 NaN
640 101 Blind river dolphins, reclusive lemurs, a parr... 5256 University of California 2001-05-16 00:00:00 0 Douglas Adams Douglas Adams: Parrots, the universe and every... 1 2010-03-16 17:54:00 [{'id': 22, 'name': 'Fascinating', 'count': 29... [{'id': 635, 'hero': 'https://pe.tedcdn.com/im... Author, satirist ['biodiversity', 'biology', 'comedy', 'humor',... Parrots, the universe and everything https://www.ted.com/talks/douglas_adams_parrot... 473220 NaN
649 74 Patsy Rodenburg says the world needs actors mo... 407 Michael Howard Studios 2008-10-09 00:00:00 0 Patsy Rodenburg Patsy Rodenburg: Why I do theater 1 2010-03-26 13:48:00 [{'id': 24, 'name': 'Persuasive', 'count': 70}... [{'id': 60, 'hero': 'https://pe.tedcdn.com/ima... Acting and voice coach ['humanity', 'psychology', 'theater'] Why I do theater https://www.ted.com/talks/patsy_rodenburg_why_... 176995 NaN
655 368 Games are invading the real world -- and the r... 1698 DICE Summit 2010 2010-02-18 00:00:00 0 Jesse Schell Jesse Schell: When games invade real life 1 2010-04-03 08:32:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 1... [{'id': 799, 'hero': 'https://pe.tedcdn.com/im... Game designer ['business', 'consumerism', 'design', 'enterta... When games invade real life https://www.ted.com/talks/jesse_schell_when_ga... 449161 NaN
665 152 185 voices from 12 countries join a choir that... 255 Eric Whitacre's Virtual Choir 2010-03-10 16:04:00 0 Eric Whitacre Eric Whitacre: A choir as big as the Internet 1 2010-04-16 14:59:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 2... [{'id': 805, 'hero': 'https://pe.tedcdn.com/im... Composer, conductor ['Internet', 'music', 'online video', 'perform... A choir as big as the Internet https://www.ted.com/talks/a_choir_as_big_as_th... 399280 NaN
686 373 In this rare clip from 1972, legendary psychia... 262 Toronto Youth Corps 1972-05-14 00:00:00 0 Viktor Frankl Viktor Frankl: Why believe in others 1 2010-05-14 14:37:00 [{'id': 10, 'name': 'Inspiring', 'count': 1673... [{'id': 272, 'hero': 'https://pe.tedcdn.com/im... Psychiatrist, neurologist, author ['humanity', 'mind', 'peace', 'psychology', 'w... Why believe in others https://www.ted.com/talks/viktor_frankl_youth_... 1028630 NaN
696 85 This haunting, intimate performance by Europea... 1384 TEDGlobal 2009 2009-07-23 00:00:00 0 Sophie Hunger Sophie Hunger: Songs of secrets and city lights 1 2010-05-28 09:21:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 5... [{'id': 823, 'hero': 'https://pe.tedcdn.com/im... Singer ['cities', 'music', 'performance', 'poetry', '... Songs of secrets and city lights https://www.ted.com/talks/sophie_hunger_plays_... 518461 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1125 104 Bjarke Ingels' architecture is luxurious, sust... 1335 TEDxEast 2011-05-09 00:00:00 0 Bjarke Ingels Bjarke Ingels: Hedonistic sustainability 1 2012-01-28 15:03:24 [{'id': 23, 'name': 'Jaw-dropping', 'count': 2... [{'id': 1072, 'hero': 'https://pe.tedcdn.com/i... Architect ['TEDx', 'architecture', 'cities', 'design', '... Hedonistic sustainability https://www.ted.com/talks/bjarke_ingels_hedoni... 422460 NaN
1133 80 Five billion people can't use the Internet. Al... 594 TEDxSanMigueldeAllende 2011-08-06 00:00:00 0 Aleph Molinari Aleph Molinari: Let's bridge the digital divide! 1 2012-02-04 15:01:38 [{'id': 26, 'name': 'Obnoxious', 'count': 9}, ... [{'id': 892, 'hero': 'https://pe.tedcdn.com/im... Economist, techno-activist ['TEDx', 'global issues', 'poverty', 'technolo... Let's bridge the digital divide! https://www.ted.com/talks/aleph_molinari_let_s... 115346 NaN
1134 245 From the "I have a dream" speech to Steve Jobs... 1090 TEDxEast 2011-11-11 00:00:00 0 Nancy Duarte Nancy Duarte: The secret structure of great talks 1 2012-02-05 14:58:38 [{'id': 11, 'name': 'Longwinded', 'count': 100... [{'id': 353, 'hero': 'https://pe.tedcdn.com/im... CEO, presentation designer ['TEDx', 'communication', 'presentation', 'sto... The secret structure of great talks https://www.ted.com/talks/nancy_duarte_the_sec... 1152886 NaN
1141 70 Keith Nolan always wanted to join the United S... 1118 TEDxIslay 2011-04-23 00:00:00 0 Keith Nolan Keith Nolan: Deaf in the military 1 2012-02-12 14:56:39 [{'id': 22, 'name': 'Fascinating', 'count': 61... [{'id': 1311, 'hero': 'https://pe.tedcdn.com/i... Teacher ['TEDx', 'disability', 'global issues', 'milit... Deaf in the military https://www.ted.com/talks/keith_nolan_deaf_in_... 120274 NaN
1144 15 Singer Inara George and guitarist Mike Andrews... 199 TEDxGreatPacificGarbagePatch 2010-11-06 00:00:00 0 Inara George Inara George: "Family Tree" 1 2012-02-14 22:43:40 [{'id': 7, 'name': 'Funny', 'count': 8}, {'id'... [{'id': 557, 'hero': 'https://pe.tedcdn.com/im... Singer and songwriter ['TEDx', 'family', 'live music', 'music', 'per... "Family Tree" https://www.ted.com/talks/inara_george_sings_f... 217851 NaN
1149 80 How often do you see the true beauty of the ni... 669 TEDxPhoenix 2011-11-11 00:00:00 0 Lucianne Walkowicz Lucianne Walkowicz: Look up for a change 1 2012-02-19 14:53:14 [{'id': 10, 'name': 'Inspiring', 'count': 169}... [{'id': 141, 'hero': 'https://pe.tedcdn.com/im... Stellar astronomer ['NASA', 'Planets', 'TED Fellows', 'TEDx', 'ac... Look up for a change https://www.ted.com/talks/lucianne_walkowicz_l... 175395 NaN
1156 90 Were you the favorite child, the wild child or... 1254 TEDxAsheville 2011-11-13 00:00:00 0 Jeffrey Kluger Jeffrey Kluger: The sibling bond 1 2012-02-26 15:09:02 [{'id': 8, 'name': 'Informative', 'count': 341... [{'id': 1241, 'hero': 'https://pe.tedcdn.com/i... Senior Editor, TIME Magazine ['TEDx', 'anthropology', 'brain', 'compassion'... The sibling bond https://www.ted.com/talks/jeffrey_kluger_the_s... 418651 NaN
1169 123 Kelli Anderson shatters our expectations about... 953 TEDxPhoenix 2011-11-11 00:00:00 0 Kelli Anderson Kelli Anderson: Design to challenge reality 1 2012-03-10 15:00:37 [{'id': 23, 'name': 'Jaw-dropping', 'count': 6... [{'id': 1340, 'hero': 'https://pe.tedcdn.com/i... Artist, designer ['TEDx', 'art', 'design', 'materials'] Design to challenge reality https://www.ted.com/talks/kelli_anderson_desig... 306296 NaN
1171 152 By dissecting a cockroach ... yes, live on sta... 376 TED-Ed 2011-11-17 00:00:00 0 Greg Gage Greg Gage: The cockroach beatbox 1 2012-03-12 15:49:00 [{'id': 22, 'name': 'Fascinating', 'count': 40... [{'id': 1267, 'hero': 'https://pe.tedcdn.com/i... Neuroscientist ['TED Fellows', 'TED-Ed', 'biology', 'brain', ... The cockroach beatbox https://www.ted.com/talks/the_cockroach_beatbox\n 303986 NaN
1172 428 TED curator Chris Anderson shares his obsessio... 728 TED-Ed 2012-03-12 00:00:00 0 Chris Anderson (TED) Chris Anderson (TED): Questions no one knows t... 1 2012-03-12 19:01:44 [{'id': 21, 'name': 'Unconvincing', 'count': 6... [{'id': 955, 'hero': 'https://pe.tedcdn.com/im... TED Curator ['Planets', 'String theory', 'TED-Ed', 'consci... Questions no one knows the answers to https://www.ted.com/talks/questions_no_one_kno... 659450 NaN
1173 62 In the deepest, darkest parts of the oceans ar... 508 TED-Ed 2012-03-12 00:00:00 0 David Gallo David Gallo: Deep ocean mysteries and wonders 1 2012-03-13 15:04:50 [{'id': 23, 'name': 'Jaw-dropping', 'count': 1... [{'id': 206, 'hero': 'https://pe.tedcdn.com/im... Oceanographer ['TED-Ed', 'biology', 'deextinction', 'ecology... Deep ocean mysteries and wonders https://www.ted.com/talks/deep_ocean_mysteries... 277544 NaN
1174 145 Adam Savage walks through two spectacular exam... 452 TED-Ed 2011-11-02 00:00:00 0 Adam Savage Adam Savage: How simple ideas lead to scientif... 1 2012-03-13 19:02:32 [{'id': 8, 'name': 'Informative', 'count': 655... [{'id': 1385, 'hero': 'https://pe.tedcdn.com/i... Maker, critical thinker ['Nobel prize', 'Planets', 'TED-Ed', 'ancient ... How simple ideas lead to scientific discoveries https://www.ted.com/talks/how_simple_ideas_lea... 877096 NaN
1178 84 Prosthetics can't replicate the look and feel ... 663 TEDxCambridge 2011-11-19 00:00:00 0 Scott Summit Scott Summit: Beautiful artificial limbs 1 2012-03-17 13:56:49 [{'id': 10, 'name': 'Inspiring', 'count': 191}... [{'id': 1311, 'hero': 'https://pe.tedcdn.com/i... Industrial Designer ['TEDx', 'beauty', 'design', 'industrial desig... Beautiful artificial limbs https://www.ted.com/talks/scott_summit_beautif... 132199 NaN
1179 99 Architecture can bring people together, or div... 1191 TEDxPortofSpain 2011-11-11 00:00:00 0 Mark Raymond Mark Raymond: Victims of the city 1 2012-03-18 14:17:57 [{'id': 11, 'name': 'Longwinded', 'count': 89}... [{'id': 1253, 'hero': 'https://pe.tedcdn.com/i... Architect ['TEDx', 'architecture', 'cities', 'design', '... Victims of the city https://www.ted.com/talks/mark_raymond_victims... 142164 NaN
1184 59 Jer Thorp creates beautiful data visualization... 1042 TEDxVancouver 2011-11-12 00:00:00 0 Jer Thorp Jer Thorp: Make data more human 1 2012-03-24 14:01:48 [{'id': 8, 'name': 'Informative', 'count': 190... [{'id': 1227, 'hero': 'https://pe.tedcdn.com/i... Data artist ['TEDx', 'data', 'design', 'media', 'news'] Make data more human https://www.ted.com/talks/jer_thorp_make_data_... 220716 NaN
1191 86 Solar-powered LED lightbulbs could transform t... 338 TEDxPittsburgh 2011-11-19 00:00:00 0 Daniel Schnitzer Daniel Schnitzer: Inventing is the easy part. ... 1 2012-03-31 14:13:34 [{'id': 8, 'name': 'Informative', 'count': 255... [{'id': 91, 'hero': 'https://pe.tedcdn.com/ima... Founder and Executive Director, Earthspark Int... ['TEDx', 'alternative energy', 'global issues'... Inventing is the easy part. Marketing takes work https://www.ted.com/talks/daniel_schnitzer_inv... 208115 NaN
1193 76 New videography techniques have opened up the ... 362 TED-Ed 2012-04-02 00:00:00 0 Tierney Thys + Plankton Chronicles Project Tierney Thys + Plankton Chronicles Project: T... 2 2012-04-02 15:02:39 [{'id': 23, 'name': 'Jaw-dropping', 'count': 1... [{'id': 126, 'hero': 'https://pe.tedcdn.com/im... Marine biologist ['TED-Ed', 'oceans'] The secret life of plankton https://www.ted.com/talks/the_secret_life_of_p... 197120 NaN
1198 175 At TEDYouth 2011, performance artist Carvens L... 305 TED-Ed 2011-11-11 00:00:00 0 Carvens Lissaint Carvens Lissaint: "Put the financial aid in th... 1 2012-04-07 14:17:21 [{'id': 3, 'name': 'Courageous', 'count': 233}... [{'id': 208, 'hero': 'https://pe.tedcdn.com/im... Performance artist ['TED-Ed', 'TEDYouth', 'culture', 'entertainme... "Put the financial aid in the bag" https://www.ted.com/talks/put_the_financial_ai... 186308 NaN
1212 224 Just how small are atoms? Really, really, real... 328 TED-Ed 2012-04-25 00:00:00 0 Jon Bergmann Jon Bergmann: Just how small is an atom? 1 2012-04-25 15:24:02 [{'id': 22, 'name': 'Fascinating', 'count': 43... [{'id': 1408, 'hero': 'https://pe.tedcdn.com/i... Educator ['TED-Ed', 'chemistry', 'nanoscale', 'physics'... Just how small is an atom? https://www.ted.com/talks/just_how_small_is_an... 419672 NaN
1223 59 Rick Guidotti is a fashion photographer with a... 1084 TEDxPhoenix 2011-11-11 00:00:00 0 Rick Guidotti Rick Guidotti: From stigma to supermodel 1 2012-05-06 15:41:54 [{'id': 8, 'name': 'Informative', 'count': 71}... [{'id': 1647, 'hero': 'https://pe.tedcdn.com/i... Photographer ['TEDx', 'beauty', 'fashion', 'global issues',... From stigma to supermodel https://www.ted.com/talks/rick_guidotti_from_s... 166959 NaN
1229 73 The revolution that made music more marketable... 769 TEDxSMU 2011-12-03 00:00:00 0 José Bowen José Bowen: Beethoven the businessman 1 2012-05-12 13:58:07 [{'id': 24, 'name': 'Persuasive', 'count': 50}... [{'id': 1331, 'hero': 'https://pe.tedcdn.com/i... Professor of music ['TEDx', 'business', 'history', 'music'] Beethoven the businessman https://www.ted.com/talks/jose_bowen_beethoven... 117756 NaN
1237 67 An average teaspoon of ocean water contains fi... 697 TEDxMonterey 2012-04-18 00:00:00 0 Melissa Garren Melissa Garren: The sea we've hardly seen 1 2012-05-20 14:01:08 [{'id': 22, 'name': 'Fascinating', 'count': 13... [{'id': 509, 'hero': 'https://pe.tedcdn.com/im... Marine biologist ['TEDx', 'bacteria', 'biodiversity', 'biology'... The sea we've hardly seen https://www.ted.com/talks/melissa_garren_the_s... 166835 NaN
1301 43 Giles Duley gave up a life of glamour and cele... 711 TEDxObserver 2012-03-10 00:00:00 0 Giles Duley Giles Duley: When a reporter becomes the story 1 2012-07-29 14:55:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 4... [{'id': 1311, 'hero': 'https://pe.tedcdn.com/i... Photojournalist ['TEDx', 'culture', 'disability', 'journalism'... When a reporter becomes the story https://www.ted.com/talks/giles_duley_when_a_r... 144044 NaN
1307 148 Can an algorithm forecast the site of the next... 602 TEDxUCL 2012-06-03 00:00:00 0 Hannah Fry Hannah Fry: Is life really that complex? 1 2012-08-04 13:57:17 [{'id': 24, 'name': 'Persuasive', 'count': 108... [{'id': 1006, 'hero': 'https://pe.tedcdn.com/i... Complexity theorist ['TEDx', 'algorithm', 'anthropology', 'behavio... Is life really that complex? https://www.ted.com/talks/hannah_fry_is_life_r... 353350 NaN
1341 180 Gravity. The stars in day. Thoughts. The human... 528 TED-Ed 2012-09-26 00:00:00 0 John Lloyd John Lloyd: An animated tour of the invisible 1 2012-09-26 15:01:49 [{'id': 9, 'name': 'Ingenious', 'count': 274},... [{'id': 510, 'hero': 'https://pe.tedcdn.com/im... Producer ['AI', 'DNA', 'TED-Ed', 'astronomy', 'chemistr... An animated tour of the invisible https://www.ted.com/talks/john_lloyd_an_animat... 336430 NaN
1427 229 Kid President commands you to wake up, listen ... 208 SoulPancake 2013-01-24 00:00:00 0 Kid President Kid President: I think we all need a pep talk 1 2013-02-01 16:07:02 [{'id': 10, 'name': 'Inspiring', 'count': 771}... [{'id': 815, 'hero': 'https://pe.tedcdn.com/im... Inspirer ['children', 'comedy', 'humor'] I think we all need a pep talk https://www.ted.com/talks/kid_president_i_thin... 828203 NaN
1467 183 As we move through the world, we have an innat... 388 TEDYouth 2012 2012-11-17 00:00:00 0 Katherine Kuchenbecker Katherine Kuchenbecker: The technology of touch 1 2013-03-29 14:59:50 [{'id': 9, 'name': 'Ingenious', 'count': 150},... [{'id': 1692, 'hero': 'https://pe.tedcdn.com/i... Mechanical engineer ['TEDYouth', 'engineering', 'technology'] The technology of touch https://www.ted.com/talks/katherine_kuchenbeck... 274986 NaN
1486 126 What color is a mirror? How much does a video ... 441 TED-Ed 2013-02-28 00:00:00 0 Michael Stevens Michael Stevens: How much does a video weigh? 1 2013-04-24 14:59:29 [{'id': 22, 'name': 'Fascinating', 'count': 17... [{'id': 1386, 'hero': 'https://pe.tedcdn.com/i... YouTube educator ['Internet', 'TED-Ed', 'computers', 'humor', '... How much does a video weigh? https://www.ted.com/talks/how_much_does_a_vide... 195899 NaN
2407 5 Grammy-winning Silk Road Ensemble display thei... 389 TED2016 2016-02-15 00:00:00 0 Silk Road Ensemble Silk Road Ensemble: "Turceasca" 1 2017-03-17 14:00:15 [{'id': 1, 'name': 'Beautiful', 'count': 80}, ... [{'id': 2611, 'hero': 'https://pe.tedcdn.com/i... Musical explorers ['art', 'live music', 'music', 'performance'] "Turceasca" https://www.ted.com/talks/silk_road_ensemble_t... 640734 NaN
2418 11 Sō Percussion creates adventurous compositions... 609 TED2016 2016-02-15 00:00:00 0 Sō Percussion Sō Percussion: "Music for Wood and Strings" 1 2017-03-31 12:34:06 [{'id': 21, 'name': 'Unconvincing', 'count': 8... [{'id': 2538, 'hero': 'https://pe.tedcdn.com/i... Percussion ensemble ['live music', 'music', 'performance', 'perfor... "Music for Wood and Strings" https://www.ted.com/talks/so_percussion_music_... 767660 NaN

86 rows × 18 columns

In [33]:
df_ted[df_ted['languages'] == 0]['url'].values[:10]
Out[33]:
array(['https://www.ted.com/talks/pilobolus_perform_symbiosis\n',
       'https://www.ted.com/talks/ethel_performs_blue_room\n',
       'https://www.ted.com/talks/vusi_mahlasela_s_encore_at_tedglobal2007\n',
       'https://www.ted.com/talks/rokia_traore_sings_m_bifo\n',
       'https://www.ted.com/talks/rokia_traore_sings_kounandi\n',
       'https://www.ted.com/talks/sxip_shirey_at_the_breathing_place\n',
       'https://www.ted.com/talks/eric_lewis_strikes_chords_to_rock_the_jazz_world\n',
       'https://www.ted.com/talks/eric_lewis_plays_chaos_and_harmony\n',
       'https://www.ted.com/talks/qi_zhang_s_electrifying_organ_performance\n',
       'https://www.ted.com/talks/vishal_vaid_s_hypnotic_song\n'],
      dtype=object)
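
The URLs in this output end with a newline character (`\n`), which could break exact string comparisons or joins on `url`. A minimal sketch of removing it with pandas' `Series.str.strip`; the frame `sample` is a hypothetical stand-in for `df_ted`, with two URLs taken from the output above:

```python
import pandas as pd

# Toy stand-in for df_ted with the trailing newline present in the raw data.
sample = pd.DataFrame({'url': [
    'https://www.ted.com/talks/pilobolus_perform_symbiosis\n',
    'https://www.ted.com/talks/ethel_performs_blue_room\n',
]})

# Strip leading and trailing whitespace, including the trailing newline.
sample['url'] = sample['url'].str.strip()
print(sample['url'].tolist())
```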

Most of the talks with zero languages are art performances, but not all. These records also have no transcript.
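
Both statements can be checked in code. This is a sketch under stated assumptions: `tags` is stored as a stringified Python list (as it appears in the table), and a talk counts as a performance if its tags include any of `music`, `live music`, `performance` or `dance`. That tag set is an illustrative choice, not something defined by the dataset. The frame `sample` is a hypothetical toy stand-in for `df_ted`:

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-in for df_ted; the last row is a made-up talk with translations.
sample = pd.DataFrame({
    'languages': [0, 0, 0, 18],
    'tags': ["['live music', 'music', 'performance']",
             "['government', 'law', 'philosophy', 'politics']",
             "['entertainment', 'live music', 'music']",
             "['demo', 'design']"],
    'transcript': [np.nan, np.nan, np.nan, 'placeholder transcript text'],
})

# Assumed set of tags that mark a talk as an art performance.
PERFORMANCE_TAGS = {'music', 'live music', 'performance', 'dance'}

zero_lang = sample[sample['languages'] == 0]

# True when every zero-language talk lacks a transcript.
all_missing = zero_lang['transcript'].isna().all()

# Parse the stringified tag lists and flag talks with a performance-related tag.
tag_sets = zero_lang['tags'].apply(lambda s: set(ast.literal_eval(s)))
is_performance = tag_sets.apply(lambda tags: bool(tags & PERFORMANCE_TAGS))

print(all_missing, is_performance.mean())
```

On the real data, `is_performance.mean()` would give the share of zero-language talks that carry a performance-related tag, which puts a number on "most".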

main_speaker

In [34]:
df_ted['main_speaker'].value_counts()
Out[34]:
Hans Rosling                                  9
Juan Enriquez                                 7
Marco Tempest                                 6
Rives                                         6
Nicholas Negroponte                           5
Bill Gates                                    5
Jacqueline Novogratz                          5
Clay Shirky                                   5
Julian Treasure                               5
Dan Ariely                                    5
Steven Johnson                                4
Ken Robinson                                  4
Eve Ensler                                    4
Chris Anderson                                4
David Pogue                                   4
Lawrence Lessig                               4
Jonathan Haidt                                4
Dan Dennett                                   4
Barry Schwartz                                4
Robert Full                                   4
Jonathan Drori                                4
Stewart Brand                                 4
Stefan Sagmeister                             4
Tom Wujec                                     4
Kevin Kelly                                   4
Al Gore                                       4
Andrew Solomon                                3
Ben Saunders                                  3
Dan Gilbert                                   3
Aimee Mullins                                 3
                                             ..
Shimon Steinberg                              1
Daniel Suarez                                 1
Shai Agassi                                   1
Vanessa Ruiz                                  1
Stephen Palumbi                               1
William Black                                 1
David Holt                                    1
Ricardo Semler                                1
Alice Goffman                                 1
Devdutt Pattanaik                             1
Myriam Sidibe                                 1
LaToya Ruby Frazier                           1
Jim Toomey                                    1
Brian Skerry                                  1
Charles Moore                                 1
Conrad Wolfram                                1
Franco Sacchi                                 1
Ivan Krastev                                  1
Radhika Nagpal                                1
Eric Sanderson                                1
Jamie Bartlett                                1
Baba Shiv                                     1
Karen Tse                                     1
Lee Smolin                                    1
Jill Farrant                                  1
Amanda Palmer, Jherek Bischoff, Usman Riaz    1
Freeman Hrabowski                             1
Marvin Minsky                                 1
Kamal Meattle                                 1
Amber Case                                    1
Name: main_speaker, Length: 2156, dtype: int64
In [35]:
df_ted['main_speaker'].value_counts().describe()
Out[35]:
count    2156.000000
mean        1.182746
std         0.574799
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         9.000000
Name: main_speaker, dtype: float64

Most people have given a talk at TED events only once.
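The quartiles above already suggest this, but the exact share of one-time speakers is easy to compute. A small sketch on a made-up speaker column (the letters are placeholders, not real speakers):

```python
import pandas as pd

# Hypothetical speaker column; in the notebook this would be df_ted['main_speaker'].
speakers = pd.Series(['A', 'A', 'B', 'C', 'D', 'A', 'E'], name='main_speaker')

# Number of talks per distinct speaker.
talks_per_speaker = speakers.value_counts()

# Fraction of distinct speakers who appear exactly once.
one_talk_share = (talks_per_speaker == 1).mean()
print(round(one_talk_share, 2))
```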

In [36]:
df_ted['main_speaker'].str.len().describe()
Out[36]:
count    2550.000000
mean       13.552157
std         4.119468
min         2.000000
25%        11.000000
50%        13.000000
75%        15.000000
max        58.000000
Name: main_speaker, dtype: float64
In [37]:
df_ted[df_ted['main_speaker'].str.len() > 20]
Out[37]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
83 335 Savage-Rumbaugh's work with bonobo apes, which... 1045 TED2004 2004-02-02 28 Susan Savage-Rumbaugh Susan Savage-Rumbaugh: The gentle genius of bo... 1 2007-04-05 00:11:00 [{'id': 1, 'name': 'Beautiful', 'count': 544},... [{'id': 340, 'hero': 'https://pe.tedcdn.com/im... Primate authority ['animals', 'apes', 'biology', 'culture', 'evo... The gentle genius of bonobos https://www.ted.com/talks/susan_savage_rumbaug... 2197377 I work with a species called "Bonobo." And I'm...
101 56 Google co-founders Larry Page and Sergey Brin ... 1233 TED2004 2004-02-02 31 Sergey Brin + Larry Page Sergey Brin + Larry Page: The genesis of Google 2 2007-05-03 17:51:00 [{'id': 11, 'name': 'Longwinded', 'count': 143... [{'id': 319, 'hero': 'https://pe.tedcdn.com/im... Computer scientist, entrepreneur and philanthr... ['Google', 'TED Brain Trust', 'business', 'col... The genesis of Google https://www.ted.com/talks/sergey_brin_and_larr... 1451846 Sergey Brin: I want to discuss a question I kn...
103 328 In James Howard Kunstler's view, public spaces... 1184 TED2004 2004-02-02 21 James Howard Kunstler James Howard Kunstler: The ghastly tragedy of ... 1 2007-05-12 08:20:00 [{'id': 7, 'name': 'Funny', 'count': 1006}, {'... [{'id': 123, 'hero': 'https://pe.tedcdn.com/im... Social critic ['alternative energy', 'architecture', 'cars',... The ghastly tragedy of the suburbs https://www.ted.com/talks/james_howard_kunstle... 1683456 The immersive ugliness of our everyday environ...
108 260 Blaise Aguera y Arcas leads a dazzling demo of... 450 TED2007 2007-03-03 34 Blaise Agüera y Arcas Blaise Agüera y Arcas: How PhotoSynth can conn... 1 2007-05-27 00:37:00 [{'id': 1, 'name': 'Beautiful', 'count': 632},... [{'id': 147, 'hero': 'https://pe.tedcdn.com/im... Software architect ['collaboration', 'demo', 'microsoft', 'photog... How PhotoSynth can connect the world's images https://www.ted.com/talks/blaise_aguera_y_arca... 4772595 What I'm going to show you first, as quickly a...
193 46 Two TED favorites, Jill Sobule and Julia Sween... 374 TED2007 2007-03-03 23 Jill Sobule + Julia Sweeney Jill Sobule + Julia Sweeney: The Jill and Juli... 2 2008-02-20 01:03:00 [{'id': 3, 'name': 'Courageous', 'count': 34},... [{'id': 86, 'hero': 'https://pe.tedcdn.com/ima... Singer/songwriter ['collaboration', 'comedy', 'entertainment', '... The Jill and Julia Show https://www.ted.com/talks/the_jill_and_julia_s... 487972 ♫ Jill Sobule: At a conference in Monterey by ...
197 70 Educator Roy Gould and researcher Curtis Wong ... 402 TED2008 2008-02-27 26 Roy Gould + Curtis Wong Roy Gould + Curtis Wong: A preview of the Worl... 2 2008-02-27 23:00:00 [{'id': 8, 'name': 'Informative', 'count': 170... [{'id': 178, 'hero': 'https://pe.tedcdn.com/im... Researcher ['astronomy', 'collaboration', 'demo', 'scienc... A preview of the WorldWide Telescope https://www.ted.com/talks/roy_gould_and_curtis... 1034064 Roy Gould: Less than a year from now, the worl...
216 87 Tod Machover of MIT's Media Lab is devoted to ... 1241 TED2008 2008-03-03 21 Tod Machover + Dan Ellsey Tod Machover + Dan Ellsey: Inventing instrumen... 2 2008-04-15 03:25:00 [{'id': 22, 'name': 'Fascinating', 'count': 16... [{'id': 103, 'hero': 'https://pe.tedcdn.com/im... Composer, inventor ['creativity', 'demo', 'design', 'entertainmen... Inventing instruments that unlock new music https://www.ted.com/talks/tod_machover_and_dan... 497153 The first idea I'd like to suggest is that we ...
249 50 Composer Sxip Shirey makes music from the simp... 186 TED2008 2008-02-12 0 Sxip Shirey + Rachelle Garniez Sxip Shirey + Rachelle Garniez: A performance ... 2 2008-06-30 01:00:00 [{'id': 9, 'name': 'Ingenious', 'count': 44}, ... [{'id': 115, 'hero': 'https://pe.tedcdn.com/im... Musician ['entertainment', 'live music', 'music'] A performance with breath, music, passion https://www.ted.com/talks/sxip_shirey_at_the_b... 217663 NaN
272 16 After Robert Lang's talk on origami at TED2008... 178 TED2008 2008-02-02 38 Bruno Bowden + Rufus Cappadocia Bruno Bowden + Rufus Cappadocia: Blindfold ori... 2 2008-08-01 01:00:00 [{'id': 9, 'name': 'Ingenious', 'count': 53}, ... [{'id': 321, 'hero': 'https://pe.tedcdn.com/im... Engineer and origamist ['cello', 'entertainment', 'music', 'origami'] Blindfold origami and cello https://www.ted.com/talks/bruno_bowden_folds_w... 375734 Hello everyone. And so the two of us are here ...
317 243 Mihaly Csikszentmihalyi asks, "What makes a li... 1135 TED2004 2004-02-29 34 Mihaly Csikszentmihalyi Mihaly Csikszentmihalyi: Flow, the secret to h... 1 2008-10-23 01:00:00 [{'id': 22, 'name': 'Fascinating', 'count': 92... [{'id': 97, 'hero': 'https://pe.tedcdn.com/ima... Positive psychologist ['culture', 'global issues', 'happiness', 'mus... Flow, the secret to happiness https://www.ted.com/talks/mihaly_csikszentmiha... 4016531 I grew up in Europe, and World War II caught m...
321 41 The Inventables guys, Zach Kaplan and Keith Sc... 946 TED2005 2005-02-02 20 Zach Kaplan + Keith Schacht Zach Kaplan + Keith Schacht: Toys and material... 2 2008-10-30 01:00:00 [{'id': 2, 'name': 'Confusing', 'count': 15}, ... [{'id': 1692, 'hero': 'https://pe.tedcdn.com/i... Inventor ['business', 'creativity', 'design', 'industri... Toys and materials from the future https://www.ted.com/talks/toys_from_the_future\n 411647 Zach Kaplan: Keith and I lead a research team....
387 204 The Teresa Carreño Youth Orchestra contains th... 1026 TED2009 2009-02-05 29 Gustavo Dudamel and the Teresa Carreño Youth O... Gustavo Dudamel and the Teresa Carreño Youth O... 1 2009-02-18 18:00:00 [{'id': 1, 'name': 'Beautiful', 'count': 668},... [{'id': 464, 'hero': 'https://pe.tedcdn.com/im... Ensemble ['TED Prize', 'children', 'conducting', 'cultu... El Sistema's top youth orchestra https://www.ted.com/talks/astonishing_performa... 2062308 Chris Anderson: And now we go live to Caracas ...
401 768 This demo -- from Pattie Maes' lab at MIT, spe... 522 TED2009 2009-02-06 43 Pattie Maes + Pranav Mistry Pattie Maes + Pranav Mistry: Meet the SixthSen... 2 2009-03-10 01:00:00 [{'id': 25, 'name': 'OK', 'count': 444}, {'id'... [{'id': 457, 'hero': 'https://pe.tedcdn.com/im... Researcher ['demo', 'design', 'interface design', 'techno... Meet the SixthSense interaction https://www.ted.com/talks/pattie_maes_demos_th... 9753630 I've been intrigued by this question of whethe...
421 252 Bruce Bueno de Mesquita uses mathematical anal... 1145 TED2009 2009-02-07 21 Bruce Bueno de Mesquita Bruce Bueno de Mesquita: A prediction for the ... 1 2009-04-07 01:00:00 [{'id': 3, 'name': 'Courageous', 'count': 57},... [{'id': 33, 'hero': 'https://pe.tedcdn.com/ima... Political scientist ['global issues', 'math', 'prediction', 'techn... A prediction for the future of Iran https://www.ted.com/talks/bruce_bueno_de_mesqu... 744448 What I'm going to try to do is explain to you ...
530 1155 Our lives, our cultures, are composed of many ... 1129 TEDGlobal 2009 2009-07-23 46 Chimamanda Ngozi Adichie Chimamanda Ngozi Adichie: The danger of a sing... 1 2009-10-07 01:00:00 [{'id': 1, 'name': 'Beautiful', 'count': 5607}... [{'id': 159, 'hero': 'https://pe.tedcdn.com/im... Novelist ['Africa', 'books', 'culture', 'identity', 'st... The danger of a single story https://www.ted.com/talks/chimamanda_adichie_t... 13298341 I'm a storyteller. And I would like to tell yo...
583 380 Neuroscientist Vilayanur Ramachandran outlines... 463 TEDIndia 2009 2009-11-05 41 Vilayanur Ramachandran Vilayanur Ramachandran: The neurons that shape... 1 2010-01-04 07:30:00 [{'id': 10, 'name': 'Inspiring', 'count': 436}... [{'id': 184, 'hero': 'https://pe.tedcdn.com/im... Brain expert ['biology', 'brain', 'cities', 'cognitive scie... The neurons that shaped civilization https://www.ted.com/talks/vs_ramachandran_the_... 1939385 I'd like to talk to you today about the human ...
612 36 TED visits Tom Shannon in his Manhattan studio... 801 TED in the Field 2009-05-05 24 Tom Shannon, John Hockenberry Tom Shannon, John Hockenberry: The painter and... 2 2010-02-05 09:08:00 [{'id': 1, 'name': 'Beautiful', 'count': 143},... [{'id': 534, 'hero': 'https://pe.tedcdn.com/im... Sculptor ['art', 'astronomy', 'creativity', 'illness', ... The painter and the pendulum https://www.ted.com/talks/tom_shannon_the_pain... 433879 John Hockenberry: It's great to be here with y...
615 209 In a demo that drew gasps at TED2010, Blaise A... 465 TED2010 2010-02-11 28 Blaise Agüera y Arcas Blaise Agüera y Arcas: Augmented-reality maps 1 2010-02-13 09:54:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 8... [{'id': 129, 'hero': 'https://pe.tedcdn.com/im... Software architect ['cities', 'design', 'map', 'technology', 'vir... Augmented-reality maps https://www.ted.com/talks/blaise_aguera\n 1718568 About a year and a half ago, Stephen Lawler, w...
634 210 Fifty percent of traffic accidents happen at i... 266 TED2010 2010-02-02 31 Gary Lauder's new traffic sign Gary Lauder's new traffic sign: Take Turns 1 2010-03-09 08:39:00 [{'id': 9, 'name': 'Ingenious', 'count': 216},... [{'id': 212, 'hero': 'https://pe.tedcdn.com/im... Venture capitalist ['cities', 'culture', 'design', 'transportation'] Take Turns https://www.ted.com/talks/gary_lauder_s_new_tr... 575346 I only have three minutes so I'm going to have...
645 76 Biologist Juliana Machado Ferreira, a TED Seni... 334 TED2010 2010-02-02 31 Juliana Machado Ferreira Juliana Machado Ferreira: The fight to end rar... 1 2010-03-23 11:01:00 [{'id': 3, 'name': 'Courageous', 'count': 53},... [{'id': 40, 'hero': 'https://pe.tedcdn.com/ima... Biologist ['Brazil', 'South America', 'activism', 'biodi... The fight to end rare-animal trafficking in Br... https://www.ted.com/talks/juliana_machado_ferr... 320717 Illegal wildlife trade in Brazil is one of the...
694 1502 Filmmaker Sharmeen Obaid-Chinoy takes on a ter... 489 TED2010 2010-02-10 32 Sharmeen Obaid-Chinoy Sharmeen Obaid-Chinoy: Inside a school for sui... 1 2010-05-26 09:26:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 171, 'hero': 'https://pe.tedcdn.com/im... Filmmaker ['TED Fellows', 'children', 'culture', 'film',... Inside a school for suicide bombers https://www.ted.com/talks/sharmeen_obaid_chino... 1057238 Today, I want you to look at children who beco...
700 477 The founder of 4chan, a controversial, uncenso... 790 TED2010 2010-02-12 28 Christopher "moot" Poole" Christopher "moot" Poole": The case for anonym... 1 2010-06-02 08:29:00 [{'id': 22, 'name': 'Fascinating', 'count': 36... [{'id': 714, 'hero': 'https://pe.tedcdn.com/im... Founder, 4chan ['Internet', 'activism', 'collaboration', 'com... The case for anonymity online https://www.ted.com/talks/christopher_m00t_poo... 1451649 Tom Green: That's a 4chan thing. These kids on...
706 921 Nuclear power: the energy crisis has even die-... 1379 TED2010 2010-02-12 27 Stewart Brand + Mark Z. Jacobson Stewart Brand + Mark Z. Jacobson: Debate: Does... 2 2010-06-10 09:25:00 [{'id': 8, 'name': 'Informative', 'count': 585... [{'id': 767, 'hero': 'https://pe.tedcdn.com/im... Environmentalist, futurist ['Anthropocene', 'TED Brain Trust', 'climate c... Debate: Does the world need nuclear energy? https://www.ted.com/talks/debate_does_the_worl... 1294171 Chris Anderson: We're having a debate. The deb...
709 223 Margaret Gould Stewart, YouTube's head of user... 345 TED2010 2010-02-10 35 Margaret Gould Stewart Margaret Gould Stewart: How YouTube thinks abo... 1 2010-06-15 08:40:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 4... [{'id': 187, 'hero': 'https://pe.tedcdn.com/im... User experience master ['art', 'business', 'creativity', 'culture', '... How YouTube thinks about copyright https://www.ted.com/talks/margaret_stewart_how... 777079 So, if you're in the audience today, or maybe ...
712 155 Renowned classical Indian dancer Ananda Shanka... 967 TEDIndia 2009 2009-11-12 31 Ananda Shankar Jayant Ananda Shankar Jayant: Fighting cancer with dance 1 2010-06-18 09:00:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 7... [{'id': 688, 'hero': 'https://pe.tedcdn.com/im... Dancer and choreographer ['cancer', 'dance', 'music', 'performance', 'p... Fighting cancer with dance https://www.ted.com/talks/ananda_shankar_jayan... 587639 (Music)[Sanskrit] This is an ode to the mother...
757 215 His Holiness the Karmapa talks about how he wa... 1523 TEDIndia 2009 2009-11-04 29 His Holiness the Karmapa His Holiness the Karmapa: The technology of th... 1 2010-09-01 08:48:00 [{'id': 2, 'name': 'Confusing', 'count': 33}, ... [{'id': 71, 'hero': 'https://pe.tedcdn.com/ima... Spiritual leader ['brain', 'culture', 'happiness', 'religion', ... The technology of the heart https://www.ted.com/talks/his_holiness_the_kar... 800001 Tyler Dewar: The way I feel right now is that ...
761 70 Alwar Balasubramaniam's sculpture plays with t... 1011 TEDIndia 2009 2009-11-06 22 Alwar Balasubramaniam Alwar Balasubramaniam: Art of substance and ab... 1 2010-09-08 08:38:00 [{'id': 22, 'name': 'Fascinating', 'count': 11... [{'id': 32, 'hero': 'https://pe.tedcdn.com/ima... Artist ['art', 'design', 'entertainment', 'visualizat... Art of substance and absence https://www.ted.com/talks/alwar_balasubramania... 427590 The moment I say "school," so many memories co...
770 209 Christien Meindertsma, author of "Pig 05049" l... 534 TEDGlobal 2010 2010-07-09 30 Christien Meindertsma Christien Meindertsma: How pig parts make the ... 1 2010-09-20 08:43:00 [{'id': 8, 'name': 'Informative', 'count': 605... [{'id': 214, 'hero': 'https://pe.tedcdn.com/im... Artist ['books', 'business', 'consumerism', 'design',... How pig parts make the world turn https://www.ted.com/talks/christien_meindertsm... 1157394 Hello. I would like to start my talk with actu...
796 63 David Byrne sings the Talking Heads' 1988 hit,... 195 TED2010 2010-02-12 34 David Byrne, Ethel + Thomas Dolby David Byrne, Ethel + Thomas Dolby: "(Nothing B... 3 2010-10-22 08:49:00 [{'id': 1, 'name': 'Beautiful', 'count': 225},... [{'id': 883, 'hero': 'https://pe.tedcdn.com/im... Electronic music pioneer ['future', 'garden', 'music', 'performance', '... "(Nothing But) Flowers" with string quartet https://www.ted.com/talks/david_byrne_sings_no... 621361 (Music)♫ Here we stand ♫♫ Like an Adam and an ...
831 219 Babble.com publishers Rufus Griscom and Alisa ... 1028 TEDWomen 2010 2010-12-08 32 Rufus Griscom + Alisa Volkman Rufus Griscom + Alisa Volkman: Let's talk pare... 1 2010-12-16 14:41:00 [{'id': 25, 'name': 'OK', 'count': 187}, {'id'... [{'id': 347, 'hero': 'https://pe.tedcdn.com/im... Website co-founders ['children', 'communication', 'culture', 'ente... Let's talk parenting taboos https://www.ted.com/talks/rufus_griscom_alisa_... 2108409 Alisa Volkman: So this is where our story begi...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1952 98 With humor and charm, mathematician Eduardo Sá... 581 TEDxRiodelaPlata 2014-10-01 24 Eduardo Sáenz de Cabezón Eduardo Sáenz de Cabezón: Math is forever 1 2015-04-07 15:53:05 [{'id': 8, 'name': 'Informative', 'count': 349... [{'id': 1811, 'hero': 'https://pe.tedcdn.com/i... Math educator ['TED en Español', 'TEDx', 'humor', 'math', 's... Math is forever https://www.ted.com/talks/eduardo_saenz_de_cab... 1610293 Imagine you're in a bar, or a club, and you st...
2015 69 Stacey Baker has always been obsessed with how... 618 TED2015 2015-03-18 33 Alec Soth + Stacey Baker Alec Soth + Stacey Baker: This is what endurin... 2 2015-07-15 14:49:51 [{'id': 25, 'name': 'OK', 'count': 252}, {'id'... [{'id': 2285, 'hero': 'https://pe.tedcdn.com/i... Photo editor ['culture', 'love', 'photography', 'relationsh... This is what enduring love looks like https://www.ted.com/talks/alec_soth_stacey_bak... 1780746 Alec Soth: So about 10 years ago, I got a call...
2082 50 As a gay couple in San Francisco, Jenni Chang ... 710 TEDWomen 2015 2015-05-27 23 Jenni Chang and Lisa Dazols Jenni Chang and Lisa Dazols: This is what LGBT... 1 2015-11-09 16:09:45 [{'id': 8, 'name': 'Informative', 'count': 180... [{'id': 2005, 'hero': 'https://pe.tedcdn.com/i... Documentary filmmakers ['Gender equality', 'LGBT', 'film', 'global is... This is what LGBT life is like around the world https://www.ted.com/talks/jenni_chang_and_lisa... 1609136 Jenni Chang: When I told my parents I was gay,...
2090 81 Written language, the hallmark of human civili... 725 TED Fellows Retreat 2015 2015-08-26 31 Genevieve von Petzinger Genevieve von Petzinger: Why are these 32 symb... 1 2015-11-20 16:31:46 [{'id': 1, 'name': 'Beautiful', 'count': 189},... [{'id': 1081, 'hero': 'https://pe.tedcdn.com/i... Paleoanthropologist and rock art researcher ['Europe', 'TED Fellows', 'ancient world', 'an... Why are these 32 symbols found in ancient cave... https://www.ted.com/talks/genevieve_von_petzin... 3436514 There's something about caves — a shadowy open...
2101 35 Nicole Paris was raised to be a beatboxer -- w... 421 TEDYouth 2015 2015-11-14 26 Nicole Paris and Ed Cage Nicole Paris and Ed Cage: A beatboxing lesson ... 1 2015-12-11 16:19:07 [{'id': 7, 'name': 'Funny', 'count': 210}, {'i... [{'id': 1458, 'hero': 'https://pe.tedcdn.com/i... Beatboxers ['TEDYouth', 'art', 'entertainment', 'family',... A beatboxing lesson from a father-daughter duo https://www.ted.com/talks/nicole_paris_and_ed_... 2852664 Nicole Paris: TEDYouth, make some noise!(Beatb...
2105 52 Legendary duo Jane Fonda and Lily Tomlin have ... 944 TEDWomen 2015 2015-05-27 29 Jane Fonda and Lily Tomlin Jane Fonda and Lily Tomlin: A hilarious celebr... 2 2015-12-17 16:12:21 [{'id': 3, 'name': 'Courageous', 'count': 122}... [{'id': 2236, 'hero': 'https://pe.tedcdn.com/i... Actor and activist ['Gender equality', 'aging', 'comedy', 'friend... A hilarious celebration of lifelong female fri... https://www.ted.com/talks/jane_fonda_and_lily_... 2269844 Pat Mitchell: So I was thinking about female f...
2108 44 For sculptor Jason deCaires Taylor, the ocean ... 669 Mission Blue II 2015-10-10 30 Jason deCaires Taylor Jason deCaires Taylor: An underwater art museu... 1 2015-12-22 16:33:01 [{'id': 1, 'name': 'Beautiful', 'count': 492},... [{'id': 2160, 'hero': 'https://pe.tedcdn.com/i... Sculptor ['art', 'creativity', 'design', 'ecology', 'en... An underwater art museum, teeming with life https://www.ted.com/talks/jason_decaires_taylo... 1446673 Ten years ago, I had my first exhibition here....
2128 105 Plastic bags are essentially indestructible, y... 660 TEDGlobal>London 2015-09-29 29 Melati and Isabel Wijsen Melati and Isabel Wijsen: Our campaign to ban ... 1 2016-01-29 16:12:22 [{'id': 10, 'name': 'Inspiring', 'count': 721}... [{'id': 1056, 'hero': 'https://pe.tedcdn.com/i... Activists ['activism', 'big problems', 'consumerism', 'e... Our campaign to ban plastic bags in Bali https://www.ted.com/talks/melati_and_isabel_wi... 1219306 Melati Wijsen: Bali — island of gods.Isabel Wi...
2236 36 We're on the edge of a new frontier in art and... 1054 TED@BCG Paris 2016-05-18 22 Blaise Agüera y Arcas Blaise Agüera y Arcas: How computers are learn... 1 2016-06-28 14:46:01 [{'id': 1, 'name': 'Beautiful', 'count': 135},... [{'id': 766, 'hero': 'https://pe.tedcdn.com/im... Software architect ['AI', 'algorithm', 'art', 'beauty', 'brain', ... How computers are learning to be creative https://www.ted.com/talks/blaise_aguera_y_arca... 1413042 So, I lead a team at Google that works on mach...
2280 266 "We're not in a clean energy revolution; we're... 838 TEDSummit 2016-06-28 21 Michael Shellenberger Michael Shellenberger: How fear of nuclear pow... 1 2016-09-14 14:51:50 [{'id': 3, 'name': 'Courageous', 'count': 150}... [{'id': 1727, 'hero': 'https://pe.tedcdn.com/i... Climate policy expert ['Natural resources', 'activism', 'alternative... How fear of nuclear power is hurting the envir... https://www.ted.com/talks/michael_shellenberge... 1185114 Have you heard the news? We're in a clean ener...
2303 86 Tango, waltz, foxtrot ... these classic ballro... 933 TEDxMontreal 2015-11-07 16 Trevor Copp and Jeff Fox Trevor Copp and Jeff Fox: Ballroom dance that ... 2 2016-10-14 14:59:07 [{'id': 1, 'name': 'Beautiful', 'count': 296},... [{'id': 2147, 'hero': 'https://pe.tedcdn.com/i... Artistic director ['TEDx', 'art', 'beauty', 'creativity', 'cultu... Ballroom dance that breaks gender roles https://www.ted.com/talks/trevor_copp_jeff_fox... 571009 (Music)(Applause)Trevor Copp: When "Dancing Wi...
2313 6 Singer Rhiannon Giddens joins international mu... 523 TED2016 2016-02-16 25 Silk Road Ensemble + Rhiannon Giddens Silk Road Ensemble + Rhiannon Giddens: "St. Ja... 2 2016-10-28 12:04:39 [{'id': 1, 'name': 'Beautiful', 'count': 161},... [{'id': 2366, 'hero': 'https://pe.tedcdn.com/i... Musical explorers ['art', 'live music', 'music', 'performance', ... "St. James Infirmary Blues" https://www.ted.com/talks/silk_road_ensemble_r... 820295 (Music)I went down to St. James InfirmaryTo se...
2314 326 In a society obsessed with body image and mark... 740 TEDxSydney 2016-05-24 26 Kelli Jean Drinkwater Kelli Jean Drinkwater: Enough with the fear of... 1 2016-10-28 16:55:49 [{'id': 21, 'name': 'Unconvincing', 'count': 1... [{'id': 1205, 'hero': 'https://pe.tedcdn.com/i... Artist, activist ['TEDx', 'activism', 'art', 'beauty', 'creativ... Enough with the fear of fat https://www.ted.com/talks/kelli_jean_drinkwate... 1594248 I'm here today to talk to you about a very pow...
2319 8 Singer Amanda Palmer pays tribute to the inimi... 369 TED2016 2016-02-19 28 Amanda Palmer, Jherek Bischoff, Usman Riaz Amanda Palmer, Jherek Bischoff, Usman Riaz: "S... 3 2016-11-04 12:05:31 [{'id': 1, 'name': 'Beautiful', 'count': 206},... [{'id': 1682, 'hero': 'https://pe.tedcdn.com/i... Musician, blogger ['TED Fellows', 'guitar', 'live music', 'music... "Space Oddity" https://www.ted.com/talks/amanda_palmer_jherek... 738590 (Music)Amanda Palmer (singing): Ground Control...
2334 99 Born out of a social media post, the Black Liv... 965 TEDWomen 2016 2016-10-27 13 Alicia Garza, Patrisse Cullors and Opal Tometi Alicia Garza, Patrisse Cullors and Opal Tometi... 4 2016-11-29 15:50:19 [{'id': 26, 'name': 'Obnoxious', 'count': 80},... [{'id': 1378, 'hero': 'https://pe.tedcdn.com/i... Writer, activist ['Africa', 'Gender equality', 'activism', 'col... An interview with the founders of Black Lives ... https://www.ted.com/talks/alicia_garza_patriss... 782276 Mia Birdsong: Why is Black Lives Matter import...
2338 53 Love is a tool for revolutionary change and a ... 1027 TEDWomen 2016 2016-10-27 13 Tiq Milan and Kim Katrin Milan Tiq Milan and Kim Katrin Milan: A queer vision... 2 2016-12-05 16:24:27 [{'id': 1, 'name': 'Beautiful', 'count': 447},... [{'id': 2005, 'hero': 'https://pe.tedcdn.com/i... Transgender activist ['Gender equality', 'Gender spectrum', 'LGBT',... A queer vision of love and marriage https://www.ted.com/talks/tiq_milan_and_kim_ka... 1172873 Tiq Milan: Our first conversation was on Faceb...
2360 43 Nature is wonderfully abundant, diverse and my... 759 TEDxKC 2016-08-19 20 Alejandro Sánchez Alvarado Alejandro Sánchez Alvarado: To solve old probl... 1 2017-01-12 16:07:48 [{'id': 1, 'name': 'Beautiful', 'count': 118},... [{'id': 206, 'hero': 'https://pe.tedcdn.com/im... Developmental and regeneration biologist ['TEDx', 'animals', 'beauty', 'biodiversity', ... To solve old problems, study new species https://www.ted.com/talks/alejandro_sanchez_al... 1059199 For the past few years, I've been spending my ...
2377 308 In 1996, Thordis Elva shared a teenage romance... 1146 TEDWomen 2016 2016-10-26 21 Thordis Elva and Tom Stranger Thordis Elva and Tom Stranger: Our story of ra... 2 2017-02-07 11:44:34 [{'id': 3, 'name': 'Courageous', 'count': 1053... [{'id': 2068, 'hero': 'https://pe.tedcdn.com/i... Writer ['activism', 'collaboration', 'communication',... Our story of rape and reconciliation https://www.ted.com/talks/thordis_elva_tom_str... 3950921 [This talk contains graphic language and descr...
2421 42 How can we bridge the gap between left and rig... 2853 TED Dialogues 2017-03-01 5 Gretchen Carlson, David Brooks Gretchen Carlson, David Brooks: Political comm... 4 2017-04-03 21:50:55 [{'id': 3, 'name': 'Courageous', 'count': 19},... [{'id': 2695, 'hero': 'https://pe.tedcdn.com/i... TV journalist, women's empowerment advocate ['United States', 'collaboration', 'communicat... Political common ground in a polarized United ... https://www.ted.com/talks/gretchen_carlson_dav... 890478 Chris Anderson: Welcome to this next edition o...
2430 55 We teach girls that they can have ambition, bu... 1768 TEDxEuston 2012-12-01 14 Chimamanda Ngozi Adichie Chimamanda Ngozi Adichie: We should all be fem... 1 2017-04-14 14:52:57 [{'id': 1, 'name': 'Beautiful', 'count': 263},... [{'id': 652, 'hero': 'https://pe.tedcdn.com/im... Novelist ['Africa', 'Gender equality', 'TEDx', 'childre... We should all be feminists https://www.ted.com/talks/chimamanda_ngozi_adi... 1318454 So I would like to start by telling you about ...
2432 30 Our universe is strange, wonderful and vast, s... 925 TEDxPerth 2016-10-14 17 Natasha Hurley-Walker Natasha Hurley-Walker: How radio telescopes sh... 1 2017-04-18 15:02:39 [{'id': 8, 'name': 'Informative', 'count': 201... [{'id': 2723, 'hero': 'https://pe.tedcdn.com/i... Astronomer ['astronomy', 'discovery', 'space', 'telescope... How radio telescopes show us unseen galaxies https://www.ted.com/talks/natasha_hurley_walke... 970115 Space, the final frontier.I first heard these ...
2435 67 Financial literacy isn't a skill -- it's a lif... 663 TEDxSanQuentin 2016-01-22 27 Curtis "Wall Street" Carroll Curtis "Wall Street" Carroll: How I learned to... 1 2017-04-21 15:00:21 [{'id': 10, 'name': 'Inspiring', 'count': 867}... [{'id': 2453, 'hero': 'https://pe.tedcdn.com/i... Financial literacy advocate ['Criminal Justice', 'TEDx', 'business', 'capi... How I learned to read -- and trade stocks -- i... https://www.ted.com/talks/curtis_wall_street_c... 2198889 I was 14 years old inside of a bowling alley, ...
2438 235 A single individual is enough for hope to exis... 1072 TED2017 2017-04-25 32 His Holiness Pope Francis His Holiness Pope Francis: Why the only future... 1 2017-04-26 01:01:05 [{'id': 1, 'name': 'Beautiful', 'count': 1665}... [{'id': 2110, 'hero': 'https://pe.tedcdn.com/i... Bishop of Rome ['Christianity', 'children', 'climate change',... Why the only future worth building includes ev... https://www.ted.com/talks/pope_francis_why_the... 2679881 [His Holiness Pope Francis Filmed in Vatican C...
2439 22 Twenty-three Grand Slam titles later, tennis s... 1108 TED2017 2017-04-24 13 Serena Williams and Gayle King Serena Williams and Gayle King: On tennis, lov... 2 2017-04-27 18:50:06 [{'id': 10, 'name': 'Inspiring', 'count': 173}... [{'id': 2327, 'hero': 'https://pe.tedcdn.com/i... Athlete ['Gender equality', 'children', 'family', 'gen... On tennis, love and motherhood https://www.ted.com/talks/serena_williams_gayl... 1426132 Gayle King: Have a seat, Serena Williams, or s...
2455 48 T. Morgan Dixon and Vanessa Garrison, founders... 933 TED2017 2017-04-24 8 T. Morgan Dixon and Vanessa Garrison T. Morgan Dixon and Vanessa Garrison: When Bla... 2 2017-05-19 15:03:52 [{'id': 1, 'name': 'Beautiful', 'count': 175},... [{'id': 2726, 'hero': 'https://pe.tedcdn.com/i... Health activist ['activism', 'community', 'health', 'heart hea... When Black women walk, things change https://www.ted.com/talks/t_morgan_dixon_and_v... 907047 Vanessa Garrison: I am Vanessa, daughter of An...
2467 35 The more we read and watch online, the harder ... 866 TED2017 2017-04-24 18 Michael Patrick Lynch Michael Patrick Lynch: How to see past your ow... 1 2017-06-05 14:59:55 [{'id': 10, 'name': 'Inspiring', 'count': 188}... [{'id': 2625, 'hero': 'https://pe.tedcdn.com/i... Philosopher ['Internet', 'communication', 'democracy', 'id... How to see past your own perspective and find ... https://www.ted.com/talks/michael_patrick_lync... 1264166 So, imagine that you had your smartphone minia...
2470 28 Attention isn't just about what we focus on --... 392 TED2017 2017-04-24 29 Mehdi Ordikhani-Seyedlar Mehdi Ordikhani-Seyedlar: What happens in your... 1 2017-06-08 14:48:53 [{'id': 8, 'name': 'Informative', 'count': 374... [{'id': 2495, 'hero': 'https://pe.tedcdn.com/i... Neuroscientist ['AI', 'algorithm', 'brain', 'cognitive scienc... What happens in your brain when you pay attent... https://www.ted.com/talks/mehdi_ordikhani_seye... 1781721 Paying close attention to something: Not that ...
2484 48 It's a fateful moment in history. We've seen d... 756 TED2017 2017-04-24 17 Rabbi Lord Jonathan Sacks Rabbi Lord Jonathan Sacks: How we can face the... 1 2017-07-11 14:54:57 [{'id': 23, 'name': 'Jaw-dropping', 'count': 3... [{'id': 2744, 'hero': 'https://pe.tedcdn.com/i... Religious leader ['community', 'fear', 'future', 'global issues... How we can face the future without fear, together https://www.ted.com/talks/rabbi_lord_jonathan_... 1244502 "These are the times," said Thomas Paine, "tha...
2507 6 Movement artists Jon Boogz and Lil Buck debut ... 575 TED2017 2017-04-24 12 Jon Boogz and Lil Buck Jon Boogz and Lil Buck: A dance to honor Mothe... 5 2017-08-11 10:56:02 [{'id': 1, 'name': 'Beautiful', 'count': 84}, ... [{'id': 2589, 'hero': 'https://pe.tedcdn.com/i... Movement artist ['art', 'creativity', 'dance', 'performance', ... A dance to honor Mother Earth https://www.ted.com/talks/jon_boogz_and_lil_bu... 182975 Mother Earth: Our end was imminent yet finalit...
2536 21 Can you still be friends with someone who does... 865 TEDxMileHigh 2017-07-08 5 Caitlin Quattromani and Lauran Arledge Caitlin Quattromani and Lauran Arledge: How ou... 2 2017-09-11 15:01:00 [{'id': 3, 'name': 'Courageous', 'count': 45},... [{'id': 2625, 'hero': 'https://pe.tedcdn.com/i... Marketing leader ['TEDx', 'communication', 'friendship', 'polit... How our friendship survives our opposing politics https://www.ted.com/talks/caitlin_quattromani_... 566101 Caitlin Quattromani: The election of 2016 felt...

107 rows × 18 columns

When the main_speaker field is long, we can suspect more than one speaker.

name

In [38]:
df_ted['name'].nunique()
Out[38]:
2550

Every talk has a unique name.

In [39]:
df_ted['name'].str.len().describe()
Out[39]:
count    2550.000000
mean       50.654510
std        12.602419
min        19.000000
25%        42.000000
50%        49.000000
75%        59.000000
max       100.000000
Name: name, dtype: float64

num_speaker

In [40]:
df_ted['num_speaker'].describe()
Out[40]:
count    2550.000000
mean        1.028235
std         0.207705
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         5.000000
Name: num_speaker, dtype: float64

Most speakers present their talks alone.

published_date

In [41]:
df_ted['published_date'].describe()
Out[41]:
count                    2550
unique                   2490
top       2007-04-05 00:11:00
freq                       20
first     2006-06-27 00:11:00
last      2017-09-22 15:00:22
Name: published_date, dtype: object
In [42]:
df_ted[df_ted['event'].str.startswith('TED')]['film_date'].min()
Out[42]:
Timestamp('1984-02-02 00:00:00')

The first published_date is 2006-06-27, but the first talk was filmed on 1984-02-02. So it may be interesting to look at the time span between filming and publication.

In [43]:
(df_ted['published_date'] - df_ted['film_date']).describe()
Out[43]:
count                        2550
mean     249 days 23:21:53.703529
std      616 days 14:41:39.952308
min           -348 days +04:57:00
25%       49 days 15:00:10.750000
50%             100 days 05:21:30
75%             190 days 12:45:15
max           13879 days 14:37:00
dtype: object
In [44]:
(df_ted['published_date'] - df_ted['film_date']).median()
Out[44]:
Timedelta('100 days 05:21:30')
In [45]:
df_ted[(df_ted['published_date'] - df_ted['film_date']).dt.total_seconds() < 0]
Out[45]:
comments description duration event film_date languages main_speaker name num_speaker published_date ratings related_talks speaker_occupation tags title url views transcript
323 76 While we all agree that compassion is a great ... 946 TEDSalon 2009 Compassion 2009-10-01 30 Jackie Tabick Jackie Tabick: The balancing act of compassion 1 2008-10-31 04:17:00 [{'id': 25, 'name': 'OK', 'count': 14}, {'id':... [{'id': 676, 'hero': 'https://pe.tedcdn.com/im... Spiritual leader ['charter for compassion', 'compassion', 'glob... The balancing act of compassion https://www.ted.com/talks/jackie_tabick\n 176245 One of my favorite cartoon characters is Snoop...
324 99 Swami Dayananda Saraswati unravels the paralle... 1014 Chautauqua Institution 2009-10-01 38 Dayananda Saraswati Dayananda Saraswati: The profound journey of c... 1 2008-10-31 04:41:00 [{'id': 10, 'name': 'Inspiring', 'count': 369}... [{'id': 675, 'hero': 'https://pe.tedcdn.com/im... Vedantic teacher ['charter for compassion', 'compassion', 'glob... The profound journey of compassion https://www.ted.com/talks/swami_dayananda_sara... 273396 A human child is born, and for quite a long ti...
325 30 Join Rev. James Forbes at the dinner table of ... 1118 Chautauqua Institution 2009-10-01 20 James Forbes James Forbes: Compassion at the dinner table 1 2008-10-31 04:45:00 [{'id': 10, 'name': 'Inspiring', 'count': 73},... [{'id': 674, 'hero': 'https://pe.tedcdn.com/im... Preacher ['charter for compassion', 'compassion', 'fait... Compassion at the dinner table https://www.ted.com/talks/james_forbes\n 204410 Compassion: what does it look like? Come with ...
326 152 Imam Faisal Abdul Rauf combines the teachings ... 1007 TEDSalon 2009 Compassion 2009-10-14 47 Feisal Abdul Rauf Feisal Abdul Rauf: Lose your ego, find your co... 1 2008-10-31 04:57:00 [{'id': 11, 'name': 'Longwinded', 'count': 35}... [{'id': 674, 'hero': 'https://pe.tedcdn.com/im... Chairman of the Cordoba Initiative ['charter for compassion', 'compassion', 'glob... Lose your ego, find your compassion https://www.ted.com/talks/imam_feisal_abdul_ra... 433202 I'm speaking about compassion from an Islamic ...
327 46 It's hard to always show compassion -- even to... 1087 Chautauqua Institution 2009-10-01 27 Robert Thurman Robert Thurman: Expanding your circle of compa... 1 2008-10-31 04:58:00 [{'id': 10, 'name': 'Inspiring', 'count': 148}... [{'id': 674, 'hero': 'https://pe.tedcdn.com/im... Buddhist scholar ['charter for compassion', 'compassion', 'glob... Expanding your circle of compassion https://www.ted.com/talks/robert_thurman_on_co... 304800 I want to open by quoting Einstein's wonderful...
328 82 Robert Wright uses evolutionary biology and ga... 1016 TEDSalon 2009 Compassion 2009-10-14 21 Robert Wright Robert Wright: The evolution of compassion 1 2008-10-31 19:39:00 [{'id': 11, 'name': 'Longwinded', 'count': 32}... [{'id': 673, 'hero': 'https://pe.tedcdn.com/im... Journalist, philosopher ['charter for compassion', 'compassion', 'evol... The evolution of compassion https://www.ted.com/talks/robert_wright_the_ev... 236002 I'm going to talk about compassion and the gol...
614 1137 Sharing powerful stories from his anti-obesity... 1313 TED2010 2010-02-20 49 Jamie Oliver Jamie Oliver: Teach every child about food 1 2010-02-11 15:36:00 [{'id': 23, 'name': 'Jaw-dropping', 'count': 1... [{'id': 10, 'hero': 'https://pe.tedcdn.com/ima... Chef, activist ['business', 'education', 'food', 'global issu... Teach every child about food https://www.ted.com/talks/jamie_oliver\n 7638978 Sadly, in the next 18 minutes when I do our ch...
857 146 Imagine playing a video game controlled by you... 904 TEDxToronto 2011 2011-09-23 26 Ariel Garten Ariel Garten: Know thyself, with a brain scanner 1 2011-01-26 16:08:00 [{'id': 21, 'name': 'Unconvincing', 'count': 1... [{'id': 1152, 'hero': 'https://pe.tedcdn.com/i... Artist, scientist and entrepreneur ['TEDx', 'neuroscience', 'psychology', 'scienc... Know thyself, with a brain scanner https://www.ted.com/talks/ariel_garten_know_th... 399314 The maxim, "Know thyself" has been around sinc...
1493 632 Rita Pierson, a teacher for 40 years, once hea... 468 TED Talks Education 2013-05-07 46 Rita Pierson Rita Pierson: Every kid needs a champion 1 2013-05-03 14:02:17 [{'id': 10, 'name': 'Inspiring', 'count': 5946... [{'id': 66, 'hero': 'https://pe.tedcdn.com/ima... Educator ['children', 'education', 'motivation', 'teach... Every kid needs a champion https://www.ted.com/talks/rita_pierson_every_k... 7469445 I have spent my entire life either at the scho...
1846 49 Before he hit eighteen, Fred Swaniker had live... 806 TEDGlobal 2014 2014-10-22 24 Fred Swaniker Fred Swaniker: The leaders who ruined Africa, ... 1 2014-10-21 14:58:36 [{'id': 25, 'name': 'OK', 'count': 51}, {'id':... [{'id': 156, 'hero': 'https://pe.tedcdn.com/im... Educational entrepreneur ['Africa', 'TED Fellows', 'entrepreneur'] The leaders who ruined Africa, and the generat... https://www.ted.com/talks/fred_swaniker_the_le... 1239635 I experienced my first coup d'état at the age ...

Interesting: it looks like we have some mistakes in the data. The records above have a published_date earlier than film_date, which should not be possible.
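A minimal sketch of how such records could be flagged for later inspection. The toy frame is hypothetical and only mirrors the two date columns of df_ted; the `lag_days` and `date_error` column names are assumptions and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same two date columns as df_ted
toy = pd.DataFrame({
    'film_date': pd.to_datetime(['2009-10-01', '2010-02-20', '2006-02-25']),
    'published_date': pd.to_datetime(['2008-10-31', '2010-02-11', '2006-06-27']),
})

# Lag between filming and publication, in whole days
toy['lag_days'] = (toy['published_date'] - toy['film_date']).dt.days

# A talk cannot be published before it is filmed,
# so a negative lag marks a probable data error
toy['date_error'] = toy['lag_days'] < 0
print(toy)
```

A flag like this keeps the suspicious rows in the dataset while making them easy to filter or to use as an extra feature.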

ratings

This is the rating from the TED site. TED asks viewers to describe a video (talk) in three words. The count is simply the number of people who chose the category. We will not use this field because it is closely linked to our target variable "views": the more views a video has, the more people have rated it.

In [46]:
df_ted['ratings'].values[0]
Out[46]:
"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"
In [47]:
df_ted['ratings'].values[1]
Out[47]:
"[{'id': 7, 'name': 'Funny', 'count': 544}, {'id': 3, 'name': 'Courageous', 'count': 139}, {'id': 2, 'name': 'Confusing', 'count': 62}, {'id': 1, 'name': 'Beautiful', 'count': 58}, {'id': 21, 'name': 'Unconvincing', 'count': 258}, {'id': 11, 'name': 'Longwinded', 'count': 113}, {'id': 8, 'name': 'Informative', 'count': 443}, {'id': 10, 'name': 'Inspiring', 'count': 413}, {'id': 22, 'name': 'Fascinating', 'count': 132}, {'id': 9, 'name': 'Ingenious', 'count': 56}, {'id': 24, 'name': 'Persuasive', 'count': 268}, {'id': 23, 'name': 'Jaw-dropping', 'count': 116}, {'id': 26, 'name': 'Obnoxious', 'count': 131}, {'id': 25, 'name': 'OK', 'count': 203}]"
In [48]:
df_ted['ratings'].values[2]
Out[48]:
"[{'id': 7, 'name': 'Funny', 'count': 964}, {'id': 3, 'name': 'Courageous', 'count': 45}, {'id': 9, 'name': 'Ingenious', 'count': 183}, {'id': 1, 'name': 'Beautiful', 'count': 60}, {'id': 21, 'name': 'Unconvincing', 'count': 104}, {'id': 11, 'name': 'Longwinded', 'count': 78}, {'id': 8, 'name': 'Informative', 'count': 395}, {'id': 10, 'name': 'Inspiring', 'count': 230}, {'id': 22, 'name': 'Fascinating', 'count': 166}, {'id': 2, 'name': 'Confusing', 'count': 27}, {'id': 25, 'name': 'OK', 'count': 146}, {'id': 24, 'name': 'Persuasive', 'count': 230}, {'id': 23, 'name': 'Jaw-dropping', 'count': 54}, {'id': 26, 'name': 'Obnoxious', 'count': 142}]"
In [49]:
df_ted['ratings'].values[3]
Out[49]:
"[{'id': 3, 'name': 'Courageous', 'count': 760}, {'id': 1, 'name': 'Beautiful', 'count': 291}, {'id': 2, 'name': 'Confusing', 'count': 32}, {'id': 7, 'name': 'Funny', 'count': 59}, {'id': 9, 'name': 'Ingenious', 'count': 105}, {'id': 21, 'name': 'Unconvincing', 'count': 36}, {'id': 11, 'name': 'Longwinded', 'count': 53}, {'id': 8, 'name': 'Informative', 'count': 380}, {'id': 10, 'name': 'Inspiring', 'count': 1070}, {'id': 22, 'name': 'Fascinating', 'count': 132}, {'id': 24, 'name': 'Persuasive', 'count': 460}, {'id': 23, 'name': 'Jaw-dropping', 'count': 230}, {'id': 26, 'name': 'Obnoxious', 'count': 35}, {'id': 25, 'name': 'OK', 'count': 85}]"

We will not use this field in the research because it is too complex to analyze.
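Should the ratings field be needed later, the stringified list can be parsed without `eval`. The sketch below is only an illustration: `parse_ratings` is a hypothetical helper, and the sample string is a shortened copy of the second ratings record shown above.

```python
import ast

# Shortened sample in the same format as df_ted['ratings']
raw = ("[{'id': 7, 'name': 'Funny', 'count': 544}, "
       "{'id': 3, 'name': 'Courageous', 'count': 139}, "
       "{'id': 10, 'name': 'Inspiring', 'count': 413}]")


def parse_ratings(s):
    """Turn a stringified list of rating dicts into {name: count}."""
    return {item['name']: item['count'] for item in ast.literal_eval(s)}


ratings = parse_ratings(raw)
# Category chosen by the largest number of viewers
top_rating = max(ratings, key=ratings.get)
print(ratings, top_rating)
```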

In [50]:
df_ted['related_talks'].values[0]
Out[50]:
'[{\'id\': 865, \'hero\': \'https://pe.tedcdn.com/images/ted/172559_800x600.jpg\', \'speaker\': \'Ken Robinson\', \'title\': \'Bring on the learning revolution!\', \'duration\': 1008, \'slug\': \'sir_ken_robinson_bring_on_the_revolution\', \'viewed_count\': 7266103}, {\'id\': 1738, \'hero\': \'https://pe.tedcdn.com/images/ted/de98b161ad1434910ff4b56c89de71af04b8b873_1600x1200.jpg\', \'speaker\': \'Ken Robinson\', \'title\': "How to escape education\'s death valley", \'duration\': 1151, \'slug\': \'ken_robinson_how_to_escape_education_s_death_valley\', \'viewed_count\': 6657572}, {\'id\': 2276, \'hero\': \'https://pe.tedcdn.com/images/ted/3821f3728e0b755c7b9aea2e69cc093eca41abe1_2880x1620.jpg\', \'speaker\': \'Linda Cliatt-Wayman\', \'title\': \'How to fix a broken school? Lead fearlessly, love hard\', \'duration\': 1027, \'slug\': \'linda_cliatt_wayman_how_to_fix_a_broken_school_lead_fearlessly_love_hard\', \'viewed_count\': 1617101}, {\'id\': 892, \'hero\': \'https://pe.tedcdn.com/images/ted/e79958940573cc610ccb583619a54866c41ef303_2880x1620.jpg\', \'speaker\': \'Charles Leadbeater\', \'title\': \'Education innovation in the slums\', \'duration\': 1138, \'slug\': \'charles_leadbeater_on_education\', \'viewed_count\': 772296}, {\'id\': 1232, \'hero\': \'https://pe.tedcdn.com/images/ted/0e3e4e92d5ee8ae0e43962d447d3f790b31099b8_800x600.jpg\', \'speaker\': \'Geoff Mulgan\', \'title\': \'A short intro to the Studio School\', \'duration\': 376, \'slug\': \'geoff_mulgan_a_short_intro_to_the_studio_school\', \'viewed_count\': 667971}, {\'id\': 2616, \'hero\': \'https://pe.tedcdn.com/images/ted/71cde5a6fa6c717488fb55eff9eef939a9241761_2880x1620.jpg\', \'speaker\': \'Kandice Sumner\', \'title\': "How America\'s public schools keep kids in poverty", \'duration\': 830, \'slug\': \'kandice_sumner_how_america_s_public_schools_keep_kids_in_poverty\', \'viewed_count\': 1181333}]'

speaker_occupation

In [51]:
df_ted['speaker_occupation'].value_counts()
Out[51]:
Writer                                                      45
Designer                                                    34
Artist                                                      34
Journalist                                                  33
Entrepreneur                                                31
Architect                                                   30
Inventor                                                    27
Psychologist                                                26
Photographer                                                25
Filmmaker                                                   21
Economist                                                   20
Author                                                      20
Educator                                                    20
Neuroscientist                                              20
Roboticist                                                  16
Philosopher                                                 16
Biologist                                                   15
Physicist                                                   14
Marine biologist                                            11
Musician                                                    11
Technologist                                                10
Activist                                                    10
Global health expert; data visionary                        10
Historian                                                    9
Astronomer                                                   9
Behavioral economist                                         9
Poet                                                         9
Graphic designer                                             9
Philanthropist                                               9
Singer/songwriter                                            9
                                                            ..
Executive chair, Ford Motor Co.                              1
Student scientist                                            1
Director, designer                                           1
Curator, rare book scholar                                   1
Product creator                                              1
Science communicator                                         1
Motivational speaker                                         1
Interface designer                                           1
Designer and theorist                                        1
Social activist                                              1
Neurosurgeon                                                 1
Sustainable development expert                               1
High seas policy advisor                                     1
Choreographer                                                1
New media artist                                             1
Twitter co-founder                                           1
Creative entrepreneur                                        1
Founder and Executive Director, Earthspark International     1
Physician, disaster-preparedness activist                    1
Comedian, journalist, activist                               1
Global leader                                                1
Theremin player                                              1
Religious leader                                             1
Experimental particle physicist                              1
CEO of Google                                                1
Space explorer                                               1
Quantum Researcher                                           1
New philanthropist                                           1
Endurance runner                                             1
Lie detector                                                 1
Name: speaker_occupation, Length: 1458, dtype: int64
In [52]:
df_ted['speaker_occupation'].str.len().describe()
Out[52]:
count    2544.000000
mean       17.337264
std         8.699253
min         3.000000
25%        10.000000
50%        16.000000
75%        21.000000
max        72.000000
Name: speaker_occupation, dtype: float64
In [53]:
df_ted[df_ted['speaker_occupation'].str.len() > 50]['speaker_occupation']
Out[53]:
80      Science writer, innovation consultant, conserv...
101     Computer scientist, entrepreneur and philanthr...
441     Astronaut, engineer, entrepreneur, physician a...
497     Chief Economist and Senior Vice President, Wor...
498     Science writer, innovation consultant, conserv...
955     Chief Economist and Senior Vice President, Wor...
1191    Founder and Executive Director, Earthspark Int...
1507    Computer scientist, entrepreneur and philanthr...
1721    Former U.S. Representative and NASA astronaut;...
2095    Strategy consultant, social entrepreneur and a...
2116    Chief of the Community Partnership Division, B...
2133    Special Olympics International Sargent Shriver...
2155    Principal Investigator and Director of the Ope...
Name: speaker_occupation, dtype: object

The most popular occupations come from the arts, business, journalism, architecture and psychology. Some speakers describe themselves with several different occupations. The number of occupations could be used as a feature later.
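One possible way to derive such a count is sketched below. `count_occupations` is a hypothetical helper, and the choice of separators (comma, semicolon, slash and the word "and") is an assumption based on the values listed above. It is a rough heuristic: job titles that contain these separators themselves, such as "Chief Economist and Senior Vice President, World Bank", will be over-counted.

```python
import re


def count_occupations(occupation):
    """Count parts separated by ',', ';', '/' or the word 'and'."""
    if not isinstance(occupation, str):
        return 0  # missing value (NaN or None)
    parts = re.split(r'\s*(?:,|;|/|\band\b)\s*', occupation)
    # Drop empty pieces produced by adjacent separators such as ', and'
    return len([p for p in parts if p])


examples = [
    'Writer',
    'Journalist, philosopher',
    'Artist, scientist and entrepreneur',
    'Singer/songwriter',
    'Global health expert; data visionary',
]
counts = [count_occupations(o) for o in examples]
print(list(zip(examples, counts)))
```

On the full dataset this could be applied as `df_ted['speaker_occupation'].apply(count_occupations)`.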

tags

In [54]:
df_ted['tags'].values[:5]
Out[54]:
array(["['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching']",
       "['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustainability', 'technology']",
       "['computers', 'entertainment', 'interface design', 'media', 'music', 'performance', 'simplicity', 'software', 'technology']",
       "['MacArthur grant', 'activism', 'business', 'cities', 'environment', 'green', 'inequality', 'politics', 'pollution']",
       "['Africa', 'Asia', 'Google', 'demo', 'economics', 'global development', 'global issues', 'health', 'math', 'statistics', 'visualizations']"],
      dtype=object)
In [55]:
import ast

# literal_eval parses the stringified list without executing arbitrary code
df_ted['tags'] = df_ted['tags'].apply(ast.literal_eval)
In [56]:
df_ted['tags'].values.reshape(-1,1)
Out[56]:
array([[list(['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching'])],
       [list(['alternative energy', 'cars', 'climate change', 'culture', 'environment', 'global issues', 'science', 'sustainability', 'technology'])],
       [list(['computers', 'entertainment', 'interface design', 'media', 'music', 'performance', 'simplicity', 'software', 'technology'])],
       ...,
       [list(['AI', 'ants', 'fish', 'future', 'innovation', 'insects', 'intelligence', 'robots', 'science'])],
       [list(['Internet', 'TEDx', 'United States', 'community', 'compassion', 'politics', 'race'])],
       [list(['cities', 'design', 'future', 'infrastructure', 'play', 'public spaces', 'society', 'software', 'urban planning'])]],
      dtype=object)
In [57]:
type(df_ted['tags'].values[0])
Out[57]:
list
In [58]:
# Some code to flatten list of tags
df_ted['tags'].apply(pd.Series).reset_index().melt(id_vars='index').value.dropna().value_counts()
Out[58]:
technology           727
science              567
global issues        501
culture              486
TEDx                 450
design               418
business             348
entertainment        299
health               236
innovation           229
society              224
art                  221
social change        218
future               195
communication        191
biology              189
creativity           189
humanity             182
collaboration        174
environment          165
economics            164
medicine             162
brain                158
activism             157
education            153
community            148
history              146
TED Fellows          143
children             143
invention            140
                    ... 
CRISPR                 4
vulnerability          4
urban                  4
jazz                   4
ants                   4
glacier                4
sleep                  4
nuclear weapons        4
microsoft              3
Nobel prize            3
novel                  3
forensics              3
mining                 3
epidemiology           3
3d printing            3
Brand                  3
street art             3
pandemic               3
cello                  3
apes                   3
Moon                   3
blockchain             3
grammar                2
augmented reality      2
evil                   2
origami                2
testing                1
funny                  1
skateboarding          1
cloud                  1
Name: value, Length: 416, dtype: int64
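The same tag frequencies can also be obtained without building a wide intermediate dataframe. This is a sketch using only the standard library; `tag_lists` is a hypothetical stand-in for the parsed `df_ted['tags']` column.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the parsed df_ted['tags'] column
tag_lists = [
    ['children', 'creativity', 'culture'],
    ['culture', 'science', 'technology'],
    ['science', 'technology', 'culture'],
]

# Flatten the list of lists and count each tag
tag_counts = Counter(chain.from_iterable(tag_lists))
print(tag_counts.most_common(3))
```

With the real data the call would be `Counter(chain.from_iterable(df_ted['tags']))`, assuming the column already holds Python lists.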

title

In [59]:
df_ted['title'].nunique()
Out[59]:
2550

Every talk has its own title.

In [60]:
df_ted['title'].str.len().describe()
Out[60]:
count    2550.000000
mean       35.143137
std        11.834592
min         6.000000
25%        27.000000
50%        34.000000
75%        42.750000
max        78.000000
Name: title, dtype: float64
In [61]:
df_ted['title'].values[:5]
Out[61]:
array(['Do schools kill creativity?', 'Averting the climate crisis',
       'Simplicity sells', 'Greening the ghetto',
       "The best stats you've ever seen"], dtype=object)

It looks like name = main_speaker + ': ' + title.

In [62]:
df_ted[['name', 'main_speaker', 'title']].head()
Out[62]:
name main_speaker title
0 Ken Robinson: Do schools kill creativity? Ken Robinson Do schools kill creativity?
1 Al Gore: Averting the climate crisis Al Gore Averting the climate crisis
2 David Pogue: Simplicity sells David Pogue Simplicity sells
3 Majora Carter: Greening the ghetto Majora Carter Greening the ghetto
4 Hans Rosling: The best stats you've ever seen Hans Rosling The best stats you've ever seen
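This assumption can be checked directly instead of by eye. A minimal sketch on a toy frame built from the first rows shown above; the `': '` separator is inferred from those rows, and `share_matching` is a hypothetical name.

```python
import pandas as pd

# Toy frame with the first rows shown above
toy = pd.DataFrame({
    'name': ['Ken Robinson: Do schools kill creativity?',
             'Al Gore: Averting the climate crisis'],
    'main_speaker': ['Ken Robinson', 'Al Gore'],
    'title': ['Do schools kill creativity?', 'Averting the climate crisis'],
})

# Rebuild the name from its parts and compare with the stored value
rebuilt = toy['main_speaker'] + ': ' + toy['title']
share_matching = (rebuilt == toy['name']).mean()
print(share_matching)
```

A share below 1.0 on the full dataset would point to rows where the name cannot be reconstructed, for example talks with several speakers.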

url

In [63]:
df_ted['url'].nunique()
Out[63]:
2550
In [64]:
df_ted['url'].values[:5]
Out[64]:
array(['https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity\n',
       'https://www.ted.com/talks/al_gore_on_averting_climate_crisis\n',
       'https://www.ted.com/talks/david_pogue_says_simplicity_sells\n',
       'https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal\n',
       'https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen\n'],
      dtype=object)
In [65]:
sum(df_ted['url'].str.endswith('\n'))
Out[65]:
2550

Every URL ends with '\n', so we can strip it.

In [66]:
df_ted['url'] = df_ted['url'].str.strip()
In [67]:
df_ted['url'].apply(lambda s: s.split('/')[0]).value_counts()
Out[67]:
https:    2550
Name: url, dtype: int64
In [68]:
df_ted['url'].apply(lambda s: s.split('/')[2]).value_counts()
Out[68]:
www.ted.com    2550
Name: url, dtype: int64
In [69]:
df_ted['url'].apply(lambda s: s.split('/')[3]).value_counts()
Out[69]:
talks    2550
Name: url, dtype: int64

All URLs have the form 'https://www.ted.com/talks/name_of_talk', so we can omit the field without consequences.

views

'views' is our target variable. We also need to check whether its distribution is normal.

In [70]:
df_ted['views'].describe()
Out[70]:
count    2.550000e+03
mean     1.698297e+06
std      2.498479e+06
min      5.044300e+04
25%      7.557928e+05
50%      1.124524e+06
75%      1.700760e+06
max      4.722711e+07
Name: views, dtype: float64

It doesn't look normally distributed. Let's check with plots and statistical tests.

In [71]:
df_ted['views'].hist(bins=100);
In [72]:
scipy.stats.normaltest(df_ted['views'])
Out[72]:
NormaltestResult(statistic=3634.35222369011, pvalue=0.0)
In [73]:
scipy.stats.shapiro(df_ted['views'])
Out[73]:
(0.40940701961517334, 0.0)
In [74]:
sm.qqplot(df_ted['views'], line='s');
In [75]:
scipy.stats.normaltest(np.log(df_ted['views']))
Out[75]:
NormaltestResult(statistic=201.78023146791185, pvalue=1.5274938070975164e-44)
In [76]:
scipy.stats.shapiro(np.log(df_ted['views']))
Out[76]:
(0.9729341864585876, 1.4359117312548763e-21)
In [77]:
sm.qqplot(np.log(df_ted['views']), line='s');
In [78]:
np.log(df_ted['views']).hist(bins=100);
In [79]:
alpha = 0.001
# Test the log-transformed views, the same transformation used for the target
p = scipy.stats.shapiro(np.log(df_ted['views']))[1]

if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
The null hypothesis can be rejected

The distribution is still not normal after applying the logarithm, but it looks much closer to normal. So we will assume that our log-transformed target variable is approximately normally distributed.
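To illustrate why the logarithm helps with a heavily right-skewed variable, here is a small sketch on synthetic data. The sample is drawn from a log-normal distribution with arbitrarily chosen parameters; it is an assumption for illustration only and not a model of the real views, which remain non-normal even after the transformation.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample (parameters chosen arbitrarily)
rng = np.random.RandomState(0)
views_like = rng.lognormal(mean=14, sigma=0.8, size=2000)

# Skewness is far from 0 for the raw values and close to 0 after the log
raw_skew = stats.skew(views_like)
log_skew = stats.skew(np.log(views_like))
print(raw_skew, log_skew)
```

Skewness is only one aspect of normality, so on real data it complements the tests and Q-Q plots above rather than replacing them.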

In [80]:
df_ted.columns
Out[80]:
Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'transcript'],
      dtype='object')
In [81]:
df_ted['target'] = np.log(df_ted['views'])

transcripts

In [82]:
df_ted['transcript'].nunique(), len(df_ted['transcript']), sum(df_ted['transcript'].isna())
Out[82]:
(2464, 2550, 86)

Not every talk has a transcript, and each available transcript is unique.

In [83]:
df_ted['transcript'].str.len().describe()
Out[83]:
count     2464.000000
mean     11438.002029
std       5291.533620
min         17.000000
25%       7568.000000
50%      11538.500000
75%      15113.750000
max      51355.000000
Name: transcript, dtype: float64
In [84]:
df_ted[df_ted['transcript'].str.len() < 200]['transcript'].values
Out[84]:
array(['(Applause)(Music)(Applause)',
       "Let's just get started here.Okay, just a moment.(Whirring)All right. (Laughter) Oh, sorry.(Music) (Beatboxing)Thank you.(Applause)",
       '(Music)(Applause)(Music)(Music) (Applause)(Music) (Applause) (Applause)Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)Thank you. Thank you very much. (Applause)',
       '(Music)(Applause)(Music)(Applause)',
       '(Music)(Applause)(Music)(Applause)(Music)(Applause)(Music)(Applause)',
       '(Mechanical noises)(Music) (Applause)', '(Music)(Applause)',
       '(Music)(Music) (Applause)(Applause)',
       '(Guitar music starts)(Music ends)(Applause)(Distorted guitar music starts)(Music ends)(Applause)(Ambient/guitar music starts)(Music ends)(Applause)',
       '(Guitar music starts)(Cheers)(Cheers)(Music ends)'], dtype=object)

OK, it looks like some transcripts come from musical performances.
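Such transcripts could be detected by measuring how much text remains once the parenthesised stage directions are removed. This is a sketch: `spoken_length` is a hypothetical helper, and the samples are taken from the output above.

```python
import re


def spoken_length(transcript):
    """Length of a transcript after removing '(...)' stage directions."""
    return len(re.sub(r'\([^)]*\)', '', transcript).strip())


samples = [
    '(Music)(Applause)',
    '(Mechanical noises)(Music) (Applause)',
    "Let's just get started here.(Laughter)",
]
lengths = [spoken_length(s) for s in samples]
print(lengths)
```

A length of zero, or one below some small threshold, would then mark a talk as a performance with little or no speech.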

Part 3. Visual analysis of the features

In [85]:
df_ted.columns
Out[85]:
Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'transcript', 'target'],
      dtype='object')
In [86]:
df_ted.drop('views', axis=1, inplace=True)
In [87]:
df_ted.drop('related_talks', axis=1, inplace=True)
In [88]:
df_ted.drop('comments', axis=1, inplace=True)
In [89]:
# Make a separate dataframe to prepare data for plotting
df_plot = df_ted.copy()
# Casting datetime64[ns] to int64 gives nanoseconds since the Unix epoch
df_plot['film_date_unix'] = df_ted['film_date'].astype('int64')
df_plot['published_date_unix'] = df_ted['published_date'].astype('int64')
In [90]:
%%time

sns.pairplot(df_plot, diag_kind="kde", markers="+",
    plot_kws=dict(s=50, edgecolor="b", linewidth=1),
    diag_kws=dict(shade=True));
/usr/lib64/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
CPU times: user 2.77 s, sys: 2.55 s, total: 5.32 s
Wall time: 2.2 s
Out[90]:
<seaborn.axisgrid.PairGrid at 0x7f47f5d50320>

There is a clearly visible correlation between the number of languages and the view count.

In [91]:
df_ted.columns
Out[91]:
Index(['description', 'duration', 'event', 'film_date', 'languages',
       'main_speaker', 'name', 'num_speaker', 'published_date', 'ratings',
       'speaker_occupation', 'tags', 'title', 'url', 'transcript', 'target'],
      dtype='object')
In [92]:
df_plot.corr(method='pearson')
Out[92]:
duration languages num_speaker target film_date_unix published_date_unix
duration 1.000000 -0.295681 0.022257 0.012922 -0.242941 -0.166324
languages -0.295681 1.000000 -0.063100 0.544463 -0.061957 -0.171836
num_speaker 0.022257 -0.063100 1.000000 -0.056512 0.040227 0.049240
target 0.012922 0.544463 -0.056512 1.000000 0.167044 0.144343
film_date_unix -0.242941 -0.061957 0.040227 0.167044 1.000000 0.902565
published_date_unix -0.166324 -0.171836 0.049240 0.144343 0.902565 1.000000
In [93]:
sns.heatmap(df_plot.corr(method='pearson').abs(), annot=True)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47f44d1080>
In [94]:
df_plot.corr(method='spearman')
Out[94]:
duration languages num_speaker target film_date_unix published_date_unix
duration 1.000000 -0.325966 -0.010589 0.036616 -0.198422 -0.197449
languages -0.325966 1.000000 -0.050928 0.497200 -0.121390 -0.149548
num_speaker -0.010589 -0.050928 1.000000 -0.050292 0.032236 0.032061
target 0.036616 0.497200 -0.050292 1.000000 0.229521 0.208939
film_date_unix -0.198422 -0.121390 0.032236 0.229521 1.000000 0.987641
published_date_unix -0.197449 -0.149548 0.032061 0.208939 0.987641 1.000000
In [95]:
sns.heatmap(df_plot.corr(method='spearman').abs(), annot=True)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47f4410550>
In [98]:
plt.plot(df_plot['published_date_unix'])
Out[98]:
[<matplotlib.lines.Line2D at 0x7f47f426a9b0>]

So the data is sorted by published_date.

In [99]:
plt.plot(df_plot['target'])
Out[99]:
[<matplotlib.lines.Line2D at 0x7f47f4250278>]
In [100]:
plt.plot(df_plot['published_date_unix'], df_plot['target'])
Out[100]:
[<matplotlib.lines.Line2D at 0x7f47f41ae9b0>]
In [102]:
sns.countplot(df_ted['event']);
In [103]:
plt.plot(df_plot.groupby(by='event')['target'].mean().sort_values(ascending=False), 'o-');

The mean of the target variable looks close to normally distributed across event names.

In [104]:
sns.countplot(df_ted['main_speaker']);
In [105]:
plt.plot(df_plot.groupby(by='main_speaker')['target'].mean().sort_values(ascending=False), 'o-');

The same near-normal distribution also holds across speaker names.

In [106]:
sns.countplot(df_ted['speaker_occupation']);
In [107]:
plt.plot(df_plot.groupby(by='speaker_occupation')['target'].mean().sort_values(ascending=False), 'o-');

The mean target across speaker_occupation values also looks close to normally distributed.

Part 4. Patterns, insights, pecularities of data

From the previous analysis we have the following observations:

  • The view count is not normally distributed according to the tests. A linear model is usually better served by a roughly normal target variable, so we apply a logarithm to it. This does not make the distribution normal, but it brings it closer.
  • The new target variable is highly correlated with the language count. This may be because the most popular talks are more often translated into more languages. Since this is only a correlation analysis, we cannot say so for sure without additional data, but it may be useful to omit the languages variable.
  • There is also a strong correlation between the published and film dates, so we need to exclude one of them for more accurate predictions.
  • We clearly see different kinds of events, so it may be useful to add a feature with event type information.
  • url and name are redundant because the information they contain is also available from other fields.
  • The data is sorted by published date, so we can use TimeSeriesSplit to avoid a data leak. This also suits the task, because we are interested in predicting the future, and we don't need to sort the data ourselves.
  • There are some errors in the data where published_date is earlier than film_date. Since we are more interested in the publication date, and published_date and film_date are highly correlated, we will exclude film_date.
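The point about TimeSeriesSplit can be illustrated with a minimal sketch. The 12 row ids below are a toy stand-in for talks already sorted by publication date (an assumption for illustration, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for rows already sorted by publication date (illustrative only).
n_rows = 12
row_ids = np.arange(n_rows)

splits = list(TimeSeriesSplit(n_splits=3).split(row_ids))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Every training index precedes every test index,
    # so the model is never fitted on talks published after the ones it is scored on.
    print(fold, train_idx.tolist(), test_idx.tolist())
```

Each fold trains on a growing prefix of the data and validates on the block that follows it, which is why the training folds have different sizes.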

Part 5. Data preprocessing

We will apply different types of preprocessing to different types of columns:

  • numeric columns will be scaled using StandardScaler
  • text columns will be converted to lower case and then vectorized using TfidfVectorizer
  • categorical variables will be factorized (similar to label encoding) and then transformed using OneHotEncoder
  • empty values in the transcript field will be filled with the 'na' string
  • the date will be converted to unix time and used as a numeric column
  • tags arrays will be converted to strings and used as a text column
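As a small sketch of the categorical step, the snippet below factorizes a toy column and one-hot encodes the resulting codes. The event names are made up for illustration; the notebook applies the same idea to event, main_speaker and the other categorical columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, not taken from the dataset.
events = pd.Series(['TED2006', 'TEDxNYC', 'TED2006', 'TEDGlobal'])

# factorize() assigns integer codes in order of first appearance.
codes, uniques = events.factorize()

encoder = OneHotEncoder(handle_unknown='ignore')
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

# A code that was never seen during fit becomes an all-zero row instead of raising.
unseen = encoder.transform([[99]])
print(codes, one_hot.toarray(), unseen.toarray(), sep='\n')
```

With handle_unknown='ignore', categories that appear only in later (validation) data do not break the pipeline; they simply contribute no active indicator column.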
In [108]:
df_ted.columns
Out[108]:
Index(['description', 'duration', 'event', 'film_date', 'languages',
       'main_speaker', 'name', 'num_speaker', 'published_date', 'ratings',
       'speaker_occupation', 'tags', 'title', 'url', 'transcript', 'target'],
      dtype='object')
In [226]:
# Keep only the features selected in the previous part
X = df_ted[['description', 'duration', 'event', 'languages',
       'main_speaker', 'num_speaker', 'published_date',
       'speaker_occupation', 'tags', 'title', 'transcript']].copy()
y = df_ted['target'].copy()

NUMERIC_COLUMNS = ['duration', 'languages']
DATE_COLUMNS = ['published_date']

# We will convert 'tags' column to string
TEXT_COLUMNS = ['description', 'tags', 'title', 'transcript']

CATEGORICAL_COLUMNS = ['event',
       'main_speaker', 'speaker_occupation', 'num_speaker']

# Convert published_date to an integer timestamp and use it as a numeric column
# (astype(int) on a datetime column gives nanoseconds since the Unix epoch)
for c in DATE_COLUMNS:
    X[c] = X[c].astype(int)

# Date columns are treated as ordinary numeric columns
NUMERIC_COLUMNS += DATE_COLUMNS

# StandardScaler would convert these fields to float64 with a warning, so we cast them up front
for c in NUMERIC_COLUMNS:
    X[c] = X[c].astype(float)

# Convert tags to string
X['tags'] = X['tags'].apply(lambda tags: ' '.join(tags))

X['transcript'] = X['transcript'].fillna('na')

# Convert all text columns to lower case
for c in TEXT_COLUMNS:
    X[c] = X[c].str.lower()

# Factorize categorical_columns (similar to LabelEncoding)
for c in CATEGORICAL_COLUMNS:
    X[c] = X[c].factorize()[0]
In [227]:
X.head()
Out[227]:
description duration event languages main_speaker num_speaker published_date speaker_occupation tags title transcript
0 sir ken robinson makes an entertaining and pro... 1164.0 0 60.0 0 0 1.151367e+18 0 children creativity culture dance education pa... do schools kill creativity? good morning. how are you?(laughter)it's been ...
1 with the same humor and humanity he exuded in ... 977.0 0 43.0 1 0 1.151367e+18 1 alternative energy cars climate change culture... averting the climate crisis thank you so much, chris. and it's truly a gre...
2 new york times columnist david pogue takes aim... 1286.0 0 26.0 2 0 1.151367e+18 2 computers entertainment interface design media... simplicity sells (music: "the sound of silence," simon & garfun...
3 in an emotionally charged talk, macarthur-winn... 1116.0 0 35.0 3 0 1.151367e+18 3 macarthur grant activism business cities envir... greening the ghetto if you're here today — and i'm very happy that...
4 you've never seen data presented like this. wi... 1190.0 0 48.0 4 0 1.151441e+18 4 africa asia google demo economics global devel... the best stats you've ever seen about 10 years ago, i took on the task to teac...
In [228]:
preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])
In [229]:
# It's crucial not to shuffle the split: we want to predict the future, so no future data may end up in the train set
# No seed is needed here because shuffle is disabled
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, shuffle=False)
In [230]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
Out[230]:
((1785, 11), (765, 11), (1785,), (765,))

Part 6. Metric selection

For regression tasks the two most popular metrics are RMSE and MAE.

$ \begin{align} RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{n}{(\hat{y}_j - y_j)^2}} \end{align} $

$ \begin{align} MAE = \frac{1}{n}\sum_{j=1}^{n}{\lvert\hat{y}_j - y_j\rvert} \end{align} $

RMSE puts a higher weight on larger prediction errors, and it tends to grow relative to MAE as the sample size increases. In our case larger errors should not be treated in a special way. MAE is also easier to interpret, especially since the target is log-transformed: exp(MAE) can be viewed as a multiplicative factor relative to the true value of the original variable.

So we will go with MAE.
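To make the exp(MAE) interpretation concrete, here is a minimal sketch on made-up view counts (the numbers are invented for illustration and are not results from this notebook).

```python
import numpy as np

# Made-up true and predicted view counts, purely for illustration.
views_true = np.array([1_000_000, 250_000, 4_000_000, 800_000])
views_pred = np.array([1_500_000, 200_000, 1_000_000, 900_000])

# Errors are measured on the log scale, as with the log-transformed target.
errors = np.log(views_pred) - np.log(views_true)

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

# exp(MAE) is the geometric-mean factor by which predictions miss the true value.
typical_factor = np.exp(mae)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  typical factor={typical_factor:.2f}x")
```

For instance, an MAE of 0.5 on the log scale corresponds to predictions that are off by a factor of about exp(0.5) ≈ 1.65 in views, in either direction. RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.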

Part 7. Feature engineering and description

Let's try Ridge from sklearn.

In [319]:
%%time
model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ....................... , score=-0.529572765869789, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.9s remaining:    0.0s
[CV] ...................... , score=-0.4670555265153398, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.0s remaining:    0.0s
[CV] ...................... , score=-0.4514045360568452, total=   2.7s
[CV]  ................................................................
[CV] ..................... , score=-0.48895244539870886, total=   3.3s
[CV]  ................................................................
[CV] ..................... , score=-0.46075069989416256, total=   4.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.5s finished
CPU times: user 26 s, sys: 29.3 ms, total: 26 s
Wall time: 26 s
In [320]:
cv.best_score_
Out[320]:
-0.4795471947469691

We will use this score as the baseline from here on.

Let's construct new features:

  • length of the transcript (because the target may depend on how much the presenter speaks)
  • event type, because some events are clearly more popular (like TED events vs regional TEDx events)
  • published_date hour, month, dayofweek
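The date-part features can be sketched on their own with pandas. The two Unix timestamps below are illustrative values, and the sketch assumes timestamps in seconds that are parsed as UTC.

```python
import pandas as pd

# Two example Unix timestamps in seconds (chosen for illustration only).
published_unix = pd.Series([1151367060, 1151440680])
published = pd.to_datetime(published_unix, unit='s')

features = pd.DataFrame({
    'published_hour': published.dt.hour,
    'published_month': published.dt.month,
    'published_dayofweek': published.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(features)
```

These parts are cyclic labels rather than quantities, which is one reason to treat them as categorical columns instead of scaling them as numbers.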
In [322]:
# Keep only the features selected in the previous part
X = df_ted[['description', 'duration', 'event', 'languages',
       'main_speaker', 'num_speaker', 'published_date',
       'speaker_occupation', 'tags', 'title', 'transcript']].copy()
y = df_ted['target'].copy()

X['transcript'] = X['transcript'].fillna('na')
X['transcript_len'] = X['transcript'].str.len()
X['event_type'] = X['event'].apply(get_event_type)
X['published_hour'] = X['published_date'].dt.hour
X['published_month'] = X['published_date'].dt.month
X['published_dayofweek'] = X['published_date'].dt.dayofweek


NUMERIC_COLUMNS = ['duration', 'languages',
                   'transcript_len'
                  ]
DATE_COLUMNS = ['published_date']

# We will convert 'tags' column to string
TEXT_COLUMNS = ['description', 'tags', 'title', 'transcript']

CATEGORICAL_COLUMNS = ['event',
       'main_speaker', 'speaker_occupation', 'num_speaker', 
                       'event_type',
                       'published_hour',
                       'published_month',
                       'published_dayofweek'
                      ]

# Convert published_date to an integer timestamp and use it as a numeric column
# (astype(int) on a datetime column gives nanoseconds since the Unix epoch)
for c in DATE_COLUMNS:
    X[c] = X[c].astype(int)

# Date columns are treated as ordinary numeric columns
NUMERIC_COLUMNS += DATE_COLUMNS

# Convert tags to string
X['tags'] = X['tags'].apply(lambda tags: ' '.join(tags))

# StandardScaler would convert these fields to float64 with a warning, so we cast them up front
for c in NUMERIC_COLUMNS:
    X[c] = X[c].astype(float)

# Convert all text columns to lower case
for c in TEXT_COLUMNS:
    X[c] = X[c].str.lower()

# Factorize categorical_columns (similar to LabelEncoding)
for c in CATEGORICAL_COLUMNS:
    X[c] = X[c].factorize()[0]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, shuffle=False)

We will test the new features one by one, relying on a ColumnTransformer property: with the default remainder='drop', it drops every column that is not mentioned in the transformers list.
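The column-dropping behaviour can be checked on a toy frame. The values below are made up, and 'transcript_len' is deliberately left out of the transformer list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values.
df = pd.DataFrame({
    'duration': [600.0, 900.0, 1200.0],
    'languages': [20.0, 35.0, 50.0],
    'transcript_len': [5000.0, 9000.0, 12000.0],
})

# Only two of the three columns are listed; remainder='drop' is the default.
preprocessing = ColumnTransformer(transformers=[
    ('scaler', StandardScaler(), ['duration', 'languages']),
])

transformed = preprocessing.fit_transform(df)
print(transformed.shape)  # only the listed columns survive
```

This is why a feature can be switched on or off simply by editing the column lists, without rebuilding X.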

Let's try to exclude published_date

In [323]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages']

CATEGORICAL_COLUMNS = ['event',
       'main_speaker', 'speaker_occupation', 'num_speaker']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5515183016074823, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.9s remaining:    0.0s
[CV] ....................... , score=-0.450775278258968, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.1s remaining:    0.0s
[CV] ...................... , score=-0.4427393393712061, total=   2.6s
[CV]  ................................................................
[CV] ...................... , score=-0.4683652757283662, total=   3.6s
[CV]  ................................................................
[CV] ...................... , score=-0.4704698330144345, total=   3.9s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.7s finished
CPU times: user 26.1 s, sys: 52 ms, total: 26.2 s
Wall time: 26.2 s
In [324]:
cv.best_score_
Out[324]:
-0.4767736055960914

The score improves slightly over the baseline, so let's continue.

transcript_len

In [325]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len'
                  ]

CATEGORICAL_COLUMNS = ['event',
       'main_speaker', 'speaker_occupation', 'num_speaker']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5388501938587508, total=   1.4s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.1s remaining:    0.0s
[CV] ..................... , score=-0.43580727618207843, total=   2.1s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.3s remaining:    0.0s
[CV] ..................... , score=-0.44389246819413897, total=   2.6s
[CV]  ................................................................
[CV] ..................... , score=-0.47417138790282504, total=   3.1s
[CV]  ................................................................
[CV] ...................... , score=-0.4850193724144867, total=   3.9s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.1s finished
CPU times: user 25.1 s, sys: 29 ms, total: 25.2 s
Wall time: 25.2 s
In [326]:
cv.best_score_
Out[326]:
-0.475548139710456

The previous value was -0.4767736055960914, so we have a small improvement.

event_type

In [340]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len']

CATEGORICAL_COLUMNS = [
       'main_speaker', 'speaker_occupation', 'num_speaker', 'event_type']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5322149001704244, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.9s remaining:    0.0s
[CV] ...................... , score=-0.4690839922595546, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.1s remaining:    0.0s
[CV] ...................... , score=-0.4444954311015843, total=   2.7s
[CV]  ................................................................
[CV] ..................... , score=-0.47892641110948575, total=   3.3s
[CV]  ................................................................
[CV] ...................... , score=-0.4364469923530735, total=   4.1s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.5s finished
CPU times: user 26.2 s, sys: 75.3 ms, total: 26.3 s
Wall time: 26.3 s
In [341]:
cv.best_score_
Out[341]:
-0.4722335453988245

This is our new best cross-validation score.

published_hour

In [342]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len'
                  ]

CATEGORICAL_COLUMNS = ['event_type',
       'main_speaker', 'speaker_occupation', 'num_speaker', 'published_hour']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5307546302183648, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s
[CV] ...................... , score=-0.4823182767578877, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.0s remaining:    0.0s
[CV] ..................... , score=-0.45352675804496617, total=   2.6s
[CV]  ................................................................
[CV] ..................... , score=-0.49771810106941766, total=   3.5s
[CV]  ................................................................
[CV] ...................... , score=-0.4153635981565675, total=   3.9s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.3s finished
CPU times: user 25.8 s, sys: 61.1 ms, total: 25.9 s
Wall time: 25.9 s
In [343]:
cv.best_score_
Out[343]:
-0.47593627284944073

No improvement over the best score.

published_month

In [344]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len'
                  ]

CATEGORICAL_COLUMNS = ['event_type',
       'main_speaker', 'speaker_occupation', 'num_speaker', 'published_month']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ....................... , score=-0.541120473167134, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.9s remaining:    0.0s
[CV] ...................... , score=-0.4878525644474154, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.1s remaining:    0.0s
[CV] ...................... , score=-0.4491814631916443, total=   2.7s
[CV]  ................................................................
[CV] ..................... , score=-0.47936596719331287, total=   3.2s
[CV]  ................................................................
[CV] ...................... , score=-0.4436368339011812, total=   3.9s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.3s finished
CPU times: user 25.5 s, sys: 71.7 ms, total: 25.6 s
Wall time: 25.6 s
In [346]:
cv.best_score_
Out[346]:
-0.48023146038013753

No improvement over the best score.

published_dayofweek

In [347]:
%%time
NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len'
                  ]

CATEGORICAL_COLUMNS = ['event_type',
       'main_speaker', 'speaker_occupation', 'num_speaker', 'published_dayofweek']

preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)


cv = GridSearchCV(model_ridge, param_grid={}, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5545001113157026, total=   1.3s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.1s remaining:    0.0s
[CV] ..................... , score=-0.46930188759870906, total=   2.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.2s remaining:    0.0s
[CV] ..................... , score=-0.45479889351105673, total=   2.6s
[CV]  ................................................................
[CV] ...................... , score=-0.4804840991338316, total=   3.2s
[CV]  ................................................................
[CV] ..................... , score=-0.43782215712619676, total=   3.9s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.1s finished
CPU times: user 25.3 s, sys: 58.9 ms, total: 25.3 s
Wall time: 25.3 s
In [348]:
cv.best_score_
Out[348]:
-0.4793814297370994

No improvement over the best score.

Conclusion on feature engineering

We found two useful new features:

  • event_type instead of event
  • transcript_len

Cross-validation with Ridge shows a score improvement with both of them.
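The repeated experiments above all follow one pattern: pick column lists, rebuild the pipeline, and cross-validate. One possible way to wrap that pattern is sketched below. The helper name cv_mae and the synthetic data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def cv_mae(X, y, numeric, categorical, text, n_splits=5):
    """Mean time-ordered CV score (negative MAE) for one feature subset (hypothetical helper)."""
    transformers = [
        ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('scaler', StandardScaler(), numeric),
    ]
    # TfidfVectorizer expects a 1-D column, so each text column is passed by name, not as a list.
    transformers += [(f'tfidf_{c}', TfidfVectorizer(), c) for c in text]
    model = Pipeline([
        ('preprocessing', ColumnTransformer(transformers)),
        ('ridge', Ridge()),
    ])
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',
                             cv=TimeSeriesSplit(n_splits=n_splits))
    return scores.mean()


# Synthetic stand-in data, only to show the call pattern.
rng = np.random.RandomState(0)
n = 60
X_toy = pd.DataFrame({
    'duration': rng.uniform(300, 1500, n),
    'event_type': rng.choice(['TED', 'TEDx'], n),
    'occupation': rng.choice(['writer', 'designer', 'physicist'], n),
    'title': rng.choice(['the future of science', 'why art matters', 'a new idea'], n),
})
y_toy = rng.normal(14, 1, n)

with_event = cv_mae(X_toy, y_toy, ['duration'], ['event_type', 'occupation'], ['title'])
without_event = cv_mae(X_toy, y_toy, ['duration'], ['occupation'], ['title'])
print(with_event, without_event)
```

Because the toy target is pure noise, the two scores carry no meaning here; the sketch only shows how candidate feature sets could be compared with a single call each.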

Part 8. Cross-validation, hyperparameter tuning

We will use the features we have already found and selected.

In [350]:
%%time

X = df_ted[['description', 'duration', 'event', 'languages',
       'main_speaker', 'num_speaker',
       'speaker_occupation', 'tags', 'title', 'transcript']].copy()
y = df_ted['target'].copy()

X['transcript'] = X['transcript'].fillna('na')
X['transcript_len'] = X['transcript'].str.len()
X['event_type'] = X['event'].apply(get_event_type)
X.drop('event', axis=1, inplace=True)

NUMERIC_COLUMNS = ['duration', 'languages', 'transcript_len']

CATEGORICAL_COLUMNS = ['main_speaker', 'speaker_occupation', 'num_speaker', 'event_type']


# We will convert 'tags' column to string
TEXT_COLUMNS = ['description', 'tags', 'title', 'transcript']

# Convert tags to string
X['tags'] = X['tags'].apply(lambda tags: ' '.join(tags))

# StandardScaler would convert these fields to float64 with a warning, so we cast them up front
for c in NUMERIC_COLUMNS:
    X[c] = X[c].astype(float)

# Convert all text columns to lower case
for c in TEXT_COLUMNS:
    X[c] = X[c].str.lower()

# Factorize categorical_columns (similar to LabelEncoding)
for c in CATEGORICAL_COLUMNS:
    X[c] = X[c].factorize()[0]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, shuffle=False)
CPU times: user 163 ms, sys: 2.92 ms, total: 166 ms
Wall time: 164 ms
In [351]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
Out[351]:
((1785, 11), (765, 11), (1785,), (765,))
In [395]:
preprocessing = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), CATEGORICAL_COLUMNS),
    ('scaler', StandardScaler(), NUMERIC_COLUMNS),
    ('tfidf_0', TfidfVectorizer(), TEXT_COLUMNS[0]),
    ('tfidf_1', TfidfVectorizer(), TEXT_COLUMNS[1]),
    ('tfidf_2', TfidfVectorizer(), TEXT_COLUMNS[2]),
    ('tfidf_3', TfidfVectorizer(), TEXT_COLUMNS[3]),
])

Let's tune alpha (the L2 regularization strength) for Ridge.

In [390]:
model_ridge = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('ridge', Ridge(random_state=RANDOM_SEED))
    ]
)

params = {
    
    'ridge__alpha' : np.logspace(-2, 5, num=8)
}

cv = GridSearchCV(model_ridge, param_grid=params, scoring='neg_mean_absolute_error', cv=TimeSeriesSplit(n_splits=5),
                 return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] ridge__alpha=0.01 ...............................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ..... ridge__alpha=0.01, score=-0.5357153659348366, total=   1.8s
[CV] ridge__alpha=0.01 ...............................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.4s remaining:    0.0s
[CV] ..... ridge__alpha=0.01, score=-0.4793202746461553, total=   2.2s
[CV] ridge__alpha=0.01 ...............................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.8s remaining:    0.0s
[CV] ...... ridge__alpha=0.01, score=-0.448421181620962, total=   3.0s
[CV] ridge__alpha=0.01 ...............................................
[CV] ..... ridge__alpha=0.01, score=-0.4906695619710239, total=   3.7s
[CV] ridge__alpha=0.01 ...............................................
[CV] .... ridge__alpha=0.01, score=-0.44097338299417826, total=   4.4s
[CV] ridge__alpha=0.1 ................................................
[CV] ...... ridge__alpha=0.1, score=-0.5353455161397344, total=   1.3s
[CV] ridge__alpha=0.1 ................................................
[CV] ...... ridge__alpha=0.1, score=-0.4780922412197138, total=   2.1s
[CV] ridge__alpha=0.1 ................................................
[CV] ...... ridge__alpha=0.1, score=-0.4477217958033026, total=   2.7s
[CV] ridge__alpha=0.1 ................................................
[CV] ...... ridge__alpha=0.1, score=-0.4893866331473308, total=   3.5s
[CV] ridge__alpha=0.1 ................................................
[CV] ..... ridge__alpha=0.1, score=-0.44021358423441437, total=   4.6s
[CV] ridge__alpha=1.0 ................................................
[CV] ...... ridge__alpha=1.0, score=-0.5322149001704244, total=   1.5s
[CV] ridge__alpha=1.0 ................................................
[CV] ...... ridge__alpha=1.0, score=-0.4690839922595546, total=   2.1s
[CV] ridge__alpha=1.0 ................................................
[CV] ...... ridge__alpha=1.0, score=-0.4444954311015843, total=   2.9s
[CV] ridge__alpha=1.0 ................................................
[CV] ..... ridge__alpha=1.0, score=-0.47892641110948575, total=   3.5s
[CV] ridge__alpha=1.0 ................................................
[CV] ...... ridge__alpha=1.0, score=-0.4364469923530735, total=   4.1s
[CV] ridge__alpha=10.0 ...............................................
[CV] ..... ridge__alpha=10.0, score=-0.5320912105003295, total=   1.2s
[CV] ridge__alpha=10.0 ...............................................
[CV] ..... ridge__alpha=10.0, score=-0.4442325056997138, total=   1.9s
[CV] ridge__alpha=10.0 ...............................................
[CV] ..... ridge__alpha=10.0, score=-0.4452005348298871, total=   2.6s
[CV] ridge__alpha=10.0 ...............................................
[CV] ..... ridge__alpha=10.0, score=-0.4514352365780252, total=   3.1s
[CV] ridge__alpha=10.0 ...............................................
[CV] ..... ridge__alpha=10.0, score=-0.4491669601185079, total=   3.4s
[CV] ridge__alpha=100.0 ..............................................
[CV] .... ridge__alpha=100.0, score=-0.5676194783408207, total=   1.3s
[CV] ridge__alpha=100.0 ..............................................
[CV] ... ridge__alpha=100.0, score=-0.46300717838384514, total=   1.7s
[CV] ridge__alpha=100.0 ..............................................
[CV] ... ridge__alpha=100.0, score=-0.47292708728155525, total=   2.2s
[CV] ridge__alpha=100.0 ..............................................
[CV] ... ridge__alpha=100.0, score=-0.44866508887682943, total=   3.0s
[CV] ridge__alpha=100.0 ..............................................
[CV] .... ridge__alpha=100.0, score=-0.4734402115312486, total=   3.8s
[CV] ridge__alpha=1000.0 .............................................
[CV] ... ridge__alpha=1000.0, score=-0.6538506025194109, total=   1.2s
[CV] ridge__alpha=1000.0 .............................................
[CV] .... ridge__alpha=1000.0, score=-0.528534169213813, total=   1.8s
[CV] ridge__alpha=1000.0 .............................................
[CV] ... ridge__alpha=1000.0, score=-0.5430780081829365, total=   2.2s
[CV] ridge__alpha=1000.0 .............................................
[CV] .. ridge__alpha=1000.0, score=-0.48825027703781926, total=   2.7s
[CV] ridge__alpha=1000.0 .............................................
[CV] .. ridge__alpha=1000.0, score=-0.49576501146287166, total=   3.2s
[CV] ridge__alpha=10000.0 ............................................
[CV] .. ridge__alpha=10000.0, score=-0.7019348604455806, total=   1.3s
[CV] ridge__alpha=10000.0 ............................................
[CV] .. ridge__alpha=10000.0, score=-0.6017495854924655, total=   1.9s
[CV] ridge__alpha=10000.0 ............................................
[CV] .. ridge__alpha=10000.0, score=-0.6529428614540771, total=   2.3s
[CV] ridge__alpha=10000.0 ............................................
[CV] .. ridge__alpha=10000.0, score=-0.5663993241821252, total=   2.8s
[CV] ridge__alpha=10000.0 ............................................
[CV] .. ridge__alpha=10000.0, score=-0.5405247634473374, total=   3.3s
[CV] ridge__alpha=100000.0 ...........................................
[CV] . ridge__alpha=100000.0, score=-0.7081482315571792, total=   1.3s
[CV] ridge__alpha=100000.0 ...........................................
[CV] . ridge__alpha=100000.0, score=-0.6148774530464202, total=   1.8s
[CV] ridge__alpha=100000.0 ...........................................
[CV] . ridge__alpha=100000.0, score=-0.6776921846170127, total=   2.4s
[CV] ridge__alpha=100000.0 ...........................................
[CV] .. ridge__alpha=100000.0, score=-0.585744708430061, total=   2.9s
[CV] ridge__alpha=100000.0 ...........................................
[CV] . ridge__alpha=100000.0, score=-0.5546580164692534, total=   3.7s
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  2.8min finished
Out[390]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('ohe', OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values... fit_intercept=True, max_iter=None,
   normalize=False, random_state=42, solver='auto', tol=0.001))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'ridge__alpha': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=3)
In [391]:
cv.best_score_
Out[391]:
-0.46442528954529266
In [392]:
cv.best_params_
Out[392]:
{'ridge__alpha': 10.0}
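The search above uses TimeSeriesSplit(n_splits=5), so each fold trains on a longer history than the previous one while the test block keeps the same length. The sketch below is a simplified pure-Python imitation of that expanding-window scheme, not the scikit-learn implementation. It assumes the default settings (no gap, no cap on the training size), and the function name expanding_window_splits and the sample count of 13 are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) pairs with a growing training window.

    Simplified imitation of an expanding-window time series split:
    every test block has the same length and always follows its training block.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# Toy example: 13 time-ordered samples (hypothetical size, not the TED data).
fold_sizes = [(len(train), len(test))
              for train, test in expanding_window_splits(13, n_splits=5)]
print(fold_sizes)
```

Because the earliest folds are fit on only a small part of the data, their scores tend to be noisier, which is one reason the spread across folds can be wide.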
In [393]:
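def plot_param_tuning(params, param_name, cv, x_scale_log=False):
    """Plot mean train/test CV scores against the values of one tuned parameter.

    Requires a fitted GridSearchCV created with return_train_score=True.
    """
    # Mean score over the CV folds for every candidate value
    plt.plot(params[param_name], cv.cv_results_['mean_train_score'], 'o-', label='train')
    plt.plot(params[param_name], cv.cv_results_['mean_test_score'], 'o-', label='test')

    # Shaded bands: mean +/- one standard deviation across folds
    plt.fill_between(params[param_name],
                     cv.cv_results_['mean_train_score'] - cv.cv_results_['std_train_score'],
                     cv.cv_results_['mean_train_score'] + cv.cv_results_['std_train_score'],
                     alpha=0.2
                    )
    plt.fill_between(params[param_name],
                     cv.cv_results_['mean_test_score'] - cv.cv_results_['std_test_score'],
                     cv.cv_results_['mean_test_score'] + cv.cv_results_['std_test_score'],
                     alpha=0.2
                    )
    # A log axis suits grids that span several orders of magnitude (e.g. alpha)
    if x_scale_log:
        plt.xscale('log')

    plt.legend();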
In [398]:
plot_param_tuning(params, 'ridge__alpha', cv, x_scale_log=True)
plt.xlabel('alpha')
plt.ylabel('neg_mean_absolute_error')
plt.title('Ridge alpha tuning');

It is rather difficult to select a good alpha value, because the standard deviation across folds is wide and the training sample size differs from fold to fold due to TimeSeriesSplit. The grid search itself picks alpha=10 by mean test score, but alpha=10^2 can be considered a good guess, because the difference between the train and test scores is minimal there.
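Choosing by the train/test gap alone can favour heavily regularised models that score poorly on both samples, so one possible refinement is to look at the gap only among candidates whose test score is close to the best one. The sketch below illustrates that rule. The helper pick_by_gap, the tolerance value and all the numbers in the example are made up for illustration and are not taken from the search above.

```python
def pick_by_gap(param_values, mean_train, mean_test, tolerance):
    """Return the parameter value with the smallest train/test gap among
    candidates whose mean test score is within `tolerance` of the best one.

    Scores follow the scikit-learn convention: higher (less negative) is better.
    """
    best_test = max(mean_test)
    # Keep only the candidates that are nearly as good as the best test score
    candidates = [i for i, score in enumerate(mean_test)
                  if score >= best_test - tolerance]
    # Among them, prefer the smallest absolute difference between train and test
    chosen = min(candidates, key=lambda i: abs(mean_train[i] - mean_test[i]))
    return param_values[chosen]


# Made-up scores, for illustration only (not the values from this notebook).
alphas = [0.01, 1.0, 100.0, 10000.0]
toy_train = [-0.10, -0.20, -0.45, -0.60]
toy_test = [-0.56, -0.50, -0.51, -0.62]

chosen_alpha = pick_by_gap(alphas, toy_train, toy_test, tolerance=0.02)
print(chosen_alpha)
```

With a fitted search, the same helper could be fed cv.cv_results_['mean_train_score'] and cv.cv_results_['mean_test_score']; the tolerance is a judgment call and should be read against the standard deviation across folds.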

In [399]:
model_lgb = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('lgb', LGBMRegressor(random_state=RANDOM_SEED))
    ]
)

# Empty grid: a single candidate with LightGBM defaults, used as a baseline
params = {}

cv = GridSearchCV(model_lgb, param_grid=params, scoring='neg_mean_absolute_error',
                  cv=TimeSeriesSplit(n_splits=5), return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...................... , score=-0.5377324746200957, total=   3.0s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.6s remaining:    0.0s
[CV] ...................... , score=-0.4423628653660104, total=   4.5s
[CV]  ................................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    9.2s remaining:    0.0s
[CV] ...................... , score=-0.5044568359615489, total=   9.2s
[CV]  ................................................................
[CV] ..................... , score=-0.48230697124313326, total=   9.8s
[CV]  ................................................................
[CV] ...................... , score=-0.5002212593493808, total=  12.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   46.2s finished
Out[399]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('ohe', OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values....0, reg_lambda=0.0, silent=True,
       subsample=1.0, subsample_for_bin=200000, subsample_freq=0))]),
       fit_params=None, iid='warn', n_jobs=None, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=3)
In [401]:
cv.best_score_
Out[401]:
-0.4934160813080337
In [402]:
%%time

model_lgb = Pipeline(
    steps=[
        ('preprocessing', preprocessing),
        ('lgb', LGBMRegressor(random_state=RANDOM_SEED))
    ]
)

params = {
    'lgb__max_depth': list(range(2, 17))  # depths 2..16
}

cv = GridSearchCV(model_lgb, param_grid=params, scoring='neg_mean_absolute_error',
                  cv=TimeSeriesSplit(n_splits=5), return_train_score=True, verbose=3)
cv.fit(X_train, y_train)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV] lgb__max_depth=2 ................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ...... lgb__max_depth=2, score=-0.5363547363643915, total=   2.6s
[CV] lgb__max_depth=2 ................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.2s remaining:    0.0s
[CV] ...... lgb__max_depth=2, score=-0.4337780170809148, total=   2.5s
[CV] lgb__max_depth=2 ................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.8s remaining:    0.0s
[CV] ..... lgb__max_depth=2, score=-0.48950605455392715, total=   3.1s
[CV] lgb__max_depth=2 ................................................
[CV] ...... lgb__max_depth=2, score=-0.4718975528129215, total=   3.9s
[CV] lgb__max_depth=2 ................................................
[CV] ..... lgb__max_depth=2, score=-0.48978397338973145, total=   4.6s
[CV] lgb__max_depth=3 ................................................
[CV] ...... lgb__max_depth=3, score=-0.5400955390757082, total=   1.6s
[CV] lgb__max_depth=3 ................................................
[CV] ...... lgb__max_depth=3, score=-0.4372414722430798, total=   2.6s
[CV] lgb__max_depth=3 ................................................
[CV] ...... lgb__max_depth=3, score=-0.5000008257113193, total=   3.5s
[CV] lgb__max_depth=3 ................................................
[CV] ...... lgb__max_depth=3, score=-0.4746778905394939, total=   4.9s
[CV] lgb__max_depth=3 ................................................
[CV] ...... lgb__max_depth=3, score=-0.4837750537751768, total=   5.2s
[CV] lgb__max_depth=4 ................................................
[CV] ...... lgb__max_depth=4, score=-0.5480828489066587, total=   2.6s
[CV] lgb__max_depth=4 ................................................
[CV] ...... lgb__max_depth=4, score=-0.4440811565874886, total=   3.1s
[CV] lgb__max_depth=4 ................................................
[CV] ...... lgb__max_depth=4, score=-0.5167066173854458, total=   3.9s
[CV] lgb__max_depth=4 ................................................
[CV] ...... lgb__max_depth=4, score=-0.4741879893214106, total=   4.9s
[CV] lgb__max_depth=4 ................................................
[CV] ..... lgb__max_depth=4, score=-0.48522061216789525, total=   5.9s
[CV] lgb__max_depth=5 ................................................
[CV] ...... lgb__max_depth=5, score=-0.5389135181631772, total=   1.8s
[CV] lgb__max_depth=5 ................................................
[CV] ........ lgb__max_depth=5, score=-0.44062233385855, total=   4.2s
[CV] lgb__max_depth=5 ................................................
[CV] ...... lgb__max_depth=5, score=-0.5190314501431224, total=   5.7s
[CV] lgb__max_depth=5 ................................................
[CV] ...... lgb__max_depth=5, score=-0.4817039404353225, total=   6.2s
[CV] lgb__max_depth=5 ................................................
[CV] ...... lgb__max_depth=5, score=-0.5012955336791588, total=   7.6s
[CV] lgb__max_depth=6 ................................................
[CV] ...... lgb__max_depth=6, score=-0.5340562291784183, total=   2.1s
[CV] lgb__max_depth=6 ................................................
[CV] ...... lgb__max_depth=6, score=-0.4314344691337506, total=   7.8s
[CV] lgb__max_depth=6 ................................................
[CV] ...... lgb__max_depth=6, score=-0.5084641707433356, total=   5.3s
[CV] lgb__max_depth=6 ................................................
[CV] ...... lgb__max_depth=6, score=-0.4695681248939802, total=   6.9s
[CV] lgb__max_depth=6 ................................................
[CV] ....... lgb__max_depth=6, score=-0.510045108813432, total=   8.6s
[CV] lgb__max_depth=7 ................................................
[CV] ...... lgb__max_depth=7, score=-0.5483741838916626, total=   1.9s
[CV] lgb__max_depth=7 ................................................
[CV] ...... lgb__max_depth=7, score=-0.4424249505053146, total=   3.5s
[CV] lgb__max_depth=7 ................................................
[CV] ...... lgb__max_depth=7, score=-0.5187641657954664, total=   5.2s
[CV] lgb__max_depth=7 ................................................
[CV] ..... lgb__max_depth=7, score=-0.47653225383177505, total=   7.6s
[CV] lgb__max_depth=7 ................................................
[CV] ..... lgb__max_depth=7, score=-0.48790860992204155, total=  12.7s
[CV] lgb__max_depth=8 ................................................
[CV] ...... lgb__max_depth=8, score=-0.5480853388239934, total=   3.0s
[CV] lgb__max_depth=8 ................................................
[CV] ..... lgb__max_depth=8, score=-0.43915056903164623, total=   4.2s
[CV] lgb__max_depth=8 ................................................
[CV] ...... lgb__max_depth=8, score=-0.5086828756505191, total=   6.1s
[CV] lgb__max_depth=8 ................................................
[CV] ..... lgb__max_depth=8, score=-0.48328977947021245, total=   8.6s
[CV] lgb__max_depth=8 ................................................
[CV] ...... lgb__max_depth=8, score=-0.4942216158698957, total=   8.9s
[CV] lgb__max_depth=9 ................................................
[CV] ...... lgb__max_depth=9, score=-0.5395376416484111, total=   1.9s
[CV] lgb__max_depth=9 ................................................
[CV] ...... lgb__max_depth=9, score=-0.4398516371239985, total=   6.4s
[CV] lgb__max_depth=9 ................................................
[CV] ...... lgb__max_depth=9, score=-0.5093088520383662, total=   5.9s
[CV] lgb__max_depth=9 ................................................
[CV] ...... lgb__max_depth=9, score=-0.4827919133555588, total=   7.6s
[CV] lgb__max_depth=9 ................................................
[CV] ...... lgb__max_depth=9, score=-0.5049014306906645, total=  17.5s
[CV] lgb__max_depth=10 ...............................................
[CV] ..... lgb__max_depth=10, score=-0.5468177577719269, total=   3.7s
[CV] lgb__max_depth=10 ...............................................
[CV] ..... lgb__max_depth=10, score=-0.4385671840960952, total=   4.0s
[CV] lgb__max_depth=10 ...............................................
[CV] ..... lgb__max_depth=10, score=-0.5067452905242577, total=   5.9s
[CV] lgb__max_depth=10 ...............................................
[CV] .... lgb__max_depth=10, score=-0.48902941475796496, total=   8.5s
[CV] lgb__max_depth=10 ...............................................
[CV] ..... lgb__max_depth=10, score=-0.4984288957651853, total=  15.2s
[CV] lgb__max_depth=11 ...............................................
[CV] ..... lgb__max_depth=11, score=-0.5399480519041409, total=   1.9s
[CV] lgb__max_depth=11 ...............................................
[CV] .... lgb__max_depth=11, score=-0.43254242652775565, total=   5.9s
[CV] lgb__max_depth=11 ...............................................
[CV] ..... lgb__max_depth=11, score=-0.5031389859045671, total=  10.4s
[CV] lgb__max_depth=11 ...............................................
[CV] .... lgb__max_depth=11, score=-0.48089783261005925, total=  11.3s
[CV] lgb__max_depth=11 ...............................................
[CV] ..... lgb__max_depth=11, score=-0.5001205186533934, total=  14.3s
[CV] lgb__max_depth=12 ...............................................
[CV] ..... lgb__max_depth=12, score=-0.5377324746200957, total=   4.6s
[CV] lgb__max_depth=12 ...............................................
[CV] .... lgb__max_depth=12, score=-0.42878661603795304, total=   4.5s
[CV] lgb__max_depth=12 ...............................................
[CV] ..... lgb__max_depth=12, score=-0.5012145319915863, total=   8.9s
[CV] lgb__max_depth=12 ...............................................
[CV] ..... lgb__max_depth=12, score=-0.4794492816134496, total=   9.6s
[CV] lgb__max_depth=12 ...............................................
[CV] ..... lgb__max_depth=12, score=-0.5073113386460569, total=  11.4s
[CV] lgb__max_depth=13 ...............................................
[CV] ..... lgb__max_depth=13, score=-0.5377324746200957, total=   2.7s
[CV] lgb__max_depth=13 ...............................................
[CV] ..... lgb__max_depth=13, score=-0.4439606981830166, total=   5.1s
[CV] lgb__max_depth=13 ...............................................
[CV] ..... lgb__max_depth=13, score=-0.4954312418345777, total=   8.3s
[CV] lgb__max_depth=13 ...............................................
[CV] ..... lgb__max_depth=13, score=-0.4806526691336914, total=  10.4s
[CV] lgb__max_depth=13 ...............................................
[CV] ...... lgb__max_depth=13, score=-0.505976281011339, total=  11.6s
[CV] lgb__max_depth=14 ...............................................
[CV] ..... lgb__max_depth=14, score=-0.5377324746200957, total=   2.6s
[CV] lgb__max_depth=14 ...............................................
[CV] ..... lgb__max_depth=14, score=-0.4372661787934551, total=   6.8s
[CV] lgb__max_depth=14 ...............................................
[CV] ...... lgb__max_depth=14, score=-0.506699137329816, total=   7.4s
[CV] lgb__max_depth=14 ...............................................
[CV] ..... lgb__max_depth=14, score=-0.4793763908093043, total=   9.8s
[CV] lgb__max_depth=14 ...............................................
[CV] .... lgb__max_depth=14, score=-0.49677176330703043, total=  11.4s
[CV] lgb__max_depth=15 ...............................................
[CV] ..... lgb__max_depth=15, score=-0.5377324746200957, total=   2.0s
[CV] lgb__max_depth=15 ...............................................
[CV] ...... lgb__max_depth=15, score=-0.434209459061889, total=   4.6s
[CV] lgb__max_depth=15 ...............................................
[CV] ..... lgb__max_depth=15, score=-0.5100886400350553, total=   7.3s
[CV] lgb__max_depth=15 ...............................................
[CV] ....... lgb__max_depth=15, score=-0.48164551789708, total=  10.1s
[CV] lgb__max_depth=15 ...............................................
[CV] ..... lgb__max_depth=15, score=-0.5040092203238494, total=  12.6s
[CV] lgb__max_depth=16 ...............................................
[CV] ..... lgb__max_depth=16, score=-0.5377324746200957, total=   2.0s
[CV] lgb__max_depth=16 ...............................................
[CV] ..... lgb__max_depth=16, score=-0.4409961227795214, total=   5.0s
[CV] lgb__max_depth=16 ...............................................
[CV] ...... lgb__max_depth=16, score=-0.502360001799488, total=   7.9s
[CV] lgb__max_depth=16 ...............................................
[CV] ..... lgb__max_depth=16, score=-0.4802425516555526, total=  10.3s
[CV] lgb__max_depth=16 ...............................................
[CV] ..... lgb__max_depth=16, score=-0.5087572902010157, total=  12.6s
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed: 10.1min finished
CPU times: user 23min 5s, sys: 2.28 s, total: 23min 7s
Wall time: 10min 10s
In [403]:
cv.best_score_
Out[403]:
-0.48426406684037726
In [405]:
cv.best_params_
Out[405]:
{'lgb__max_depth': 2}
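As a sanity check, best_score_ can be reproduced by averaging the five per-fold scores printed in the verbose logs above. The sketch below does this for the default LightGBM run and for max_depth=2. It assumes best_score_ is the plain mean over folds, which the logged numbers are consistent with; the helper name mean_score is hypothetical.

```python
def mean_score(fold_scores):
    """Plain average of the per-fold scores."""
    return sum(fold_scores) / len(fold_scores)


# Per-fold neg_mean_absolute_error values copied from the verbose logs above.
default_lgb_folds = [
    -0.5377324746200957, -0.4423628653660104, -0.5044568359615489,
    -0.48230697124313326, -0.5002212593493808,
]
depth2_folds = [
    -0.5363547363643915, -0.4337780170809148, -0.48950605455392715,
    -0.4718975528129215, -0.48978397338973145,
]

default_mean = mean_score(default_lgb_folds)
depth2_mean = mean_score(depth2_folds)
improvement = depth2_mean - default_mean

print(default_mean, depth2_mean, improvement)
```

The gain from limiting max_depth to 2 is about 0.009 in mean absolute error, which is small next to the spread between individual folds (roughly 0.43 to 0.54).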
In [404]:
plot_param_tuning(params, 'lgb__max_depth', cv)
plt.xlabel('max_depth')
plt.ylabel('neg_mean_absolute_error')
plt.title('Lgb max_depth tuning');