Our dream is finally coming to life: we've been invited to Jeopardy. This is a U.S. TV show where the participants' general knowledge is tested. A funny quirk of this show is that the candidates are prompted with answers, and must respond with questions. For example, a prompt could be "A statistical significance test named after a Greek letter" and a correct response would be "What is a chi-squared test?".
As data scientists, our first instinct in preparing for the show is to dig deep into a database of past questions. The show's quirk won't affect us: we will only be interested in the prompts and the words they contain.
As always, we start by taking a look at the data, and preparing it for our analysis.
#importing libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from random import sample
import time
%matplotlib inline
#Reading the data
jeopardy = pd.read_csv('jeopardy.csv')
#Printing the first few rows
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
Our first task will be to normalize the column names, meaning removing any spaces.
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
jeopardy.columns = jeopardy.columns.str.replace(' ','')
jeopardy.columns
Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
We are going to normalize the questions' text, as it will make our analysis easier. In this case, this means removing punctuation and converting to lowercase. To do so, we are going to define a function, and use the apply() method.
There is one more detail to take care of, that will become important later. Some of the question texts contain hyperlinks. The reason for this is that images, sounds and videos are sometimes played during the show. We want to remove any link in questions, as the characters in the link are not part of the question text.
To remove them, we will use a regex: every link starts with the characters '<a' and ends with the characters 'a>' (from the closing '</a>' tag).
#Using the regex sub function to remove punctuation and hyperlinks
def normalize_q_a(s):
    #this regex matches any string of the form <a***a> or
    #any non-word character
    return re.sub(r'<a.*a>|[^\w ]', '', s.lower())
#Normalizing
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_q_a)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_q_a)
#Let's see if it worked
jeopardy[['clean_question','clean_answer']].head(5)
clean_question | clean_answer | |
---|---|---|
0 | for the last 8 years of his life galileo was u... | copernicus |
1 | no 2 1912 olympian football star at carlisle i... | jim thorpe |
2 | the city of yuma in this state has a record av... | arizona |
3 | in 1963 live on the art linkletter show this c... | mcdonalds |
4 | signer of the dec of indep framer of the const... | john adams |
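As an extra sanity check, here is the normalizer applied to a hand-written string containing a link (the URL and wording below are made up for illustration):

```python
import re

def normalize_q_a(s):
    # strip <a...a> link markup, then punctuation, and lowercase everything
    return re.sub(r'<a.*a>|[^\w ]', '', s.lower())

# a fabricated question text with an embedded hyperlink
raw = 'It\'s <a href="http://example.com/clip.jpg">seen here</a> today!'
print(normalize_q_a(raw))
```

One caveat worth knowing: since `.*` is greedy, a question containing two separate links would have everything between the first '<a' and the last 'a>' removed, including any real text in between. Such questions are rare enough that this is acceptable here.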
We'd also like to normalize the Value column, meaning replacing strings of the form '$200' or '$1,000' with an integer. We will also convert the AirDate column to a datetime column.
Before normalizing the value column, let's inspect its possible values.
jeopardy['Value'].value_counts(dropna= False)
$400      3892
$800      2980
$200      2784
$600      1890
$1000     1796
$2000     1074
$1200     1069
$1600     1027
$100       804
$500       798
$300       764
None       336
$1,000     184
$2,000     149
$3,000      70
$1,500      50
$1,200      42
$4,000      32
$5,000      23
$1,800      22
$1,400      20
$1,600      19
$2,500      18
$700        15
$2,200      11
$2,400       8
$3,600       8
$6,000       7
$7,000       7
$1,100       6
          ... 
$5,600       2
$2,100       2
$4,400       2
$4,500       1
$8,200       1
$2,021       1
$1,492       1
$3,389       1
$7,500       1
$2,300       1
$6,200       1
$1,700       1
$10,800      1
$1,111       1
$2,900       1
$2,127       1
$3,900       1
$5,400       1
$5,800       1
$1,020       1
$6,800       1
$6,100       1
$3,300       1
$7,400       1
$4,700       1
$5,200       1
$367         1
$4,100       1
$9,000       1
$750         1
Name: Value, Length: 76, dtype: int64
Some None values! Let's look at a few of these.
jeopardy[jeopardy['Value'] == 'None'].head(5)
ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | |
---|---|---|---|---|---|---|---|---|---|
55 | 4680 | 2004-12-31 | Final Jeopardy! | THE SOLAR SYSTEM | None | Objects that pass closer to the sun than Mercu... | Icarus | objects that pass closer to the sun than mercu... | icarus |
116 | 5957 | 2010-07-06 | Final Jeopardy! | HISTORIC WOMEN | None | She was born in Virginia around 1596 & died in... | Pocahontas | she was born in virginia around 1596 died in ... | pocahontas |
174 | 3751 | 2000-12-18 | Final Jeopardy! | SPORTS LEGENDS | None | If Joe DiMaggio's hitting streak had gone one ... | H.J. Heinz (Heinz 57 Varieties) | if joe dimaggios hitting streak had gone one m... | hj heinz heinz 57 varieties |
235 | 3673 | 2000-07-19 | Final Jeopardy! | THE MAP OF EUROPE | None | Bordering Italy, Austria, Hungary & Croatia, i... | Slovenia | bordering italy austria hungary croatia its o... | slovenia |
296 | 4931 | 2006-02-06 | Final Jeopardy! | FAMOUS SHIPS | None | On December 27, 1831 it departed Plymouth, Eng... | the HMS Beagle | on december 27 1831 it departed plymouth engla... | the hms beagle |
A few things to note here: every one of these rows comes from the Final Jeopardy! round, where contestants wager an amount of their own choosing, so the question has no fixed dollar value. Let's check whether any None values appear outside that round.
#Looking at rows with None value
#And the round isn't Final Jeopardy
jeopardy[(jeopardy['Value'] == 'None') & ~(jeopardy['Round'] == 'Final Jeopardy!')].head(5)
ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | |
---|---|---|---|---|---|---|---|---|---|
12305 | 5332 | 2007-11-13 | Tiebreaker | CHILD'S PLAY | None | A Longfellow poem & a Lillian Hellman play abo... | The Children's Hour | a longfellow poem a lillian hellman play abou... | the childrens hour |
This is the only exception. We see that the round, in this case, was 'Tiebreaker'. The tiebreaker is played when the two highest-scoring contestants finish the show tied; tiebreaker questions carry no fixed value either.
In both cases, the string 'None' stands for the absence of a fixed dollar value. Thus, we do not lose any information by replacing it with 0.
#Defining a normalizing function
def normalize_val(s):
    if s == 'None':
        return 0
    else:
        #removing both ',' and '$'
        return int(s.replace('$','').replace(',',''))
#Normalizing and converting to datetime
jeopardy['Value'] = jeopardy['Value'].apply(normalize_val)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy[['Value','AirDate']].head()
Value | AirDate | |
---|---|---|
0 | 200 | 2004-12-31 |
1 | 200 | 2004-12-31 |
2 | 200 | 2004-12-31 |
3 | 200 | 2004-12-31 |
4 | 200 | 2004-12-31 |
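For reference, the same value normalization can be done without apply(), using pandas string methods. Here is a sketch on a toy Series rather than the real column (the sample values are hypothetical):

```python
import pandas as pd

# a few representative raw values from the Value column
vals = pd.Series(['$400', '$1,000', 'None'])

# map 'None' to '0', strip '$' and ',', then cast to integer
clean = (vals.replace('None', '0')
             .str.replace('[$,]', '', regex=True)
             .astype(int))
print(clean.tolist())
```

On a column of this size either approach is fast; the vectorized version mostly pays off on much larger data.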
Now that our data cleaning is over, let's start the analysis! We want to determine if we should prioritize studying general knowledge or old Jeopardy questions. To do so, we can start by investigating two things:

  * how often the answer can be deduced from the question, i.e. how many of the answer's words already appear in the question;
  * how often new questions are recycled from old ones, i.e. how many of a question's words have appeared in earlier questions.
First we'll turn answers and questions into lists of words.
jeopardy['split_answer'] = jeopardy['clean_answer'].str.split()
jeopardy['split_question'] = jeopardy['clean_question'].str.split()
#For each row, we apply this function, which computes
#the proportion of words from the answer appearing in the question
def match_count(row):
    match_count = 0
    #we remove the word 'the' from the answer if it appears
    if 'the' in row['split_answer']:
        row['split_answer'].remove('the')
    if len(row['split_answer']) != 0:
        #This loops over words in the answer,
        #counting any word that also appears in the question
        for word in row['split_answer']:
            if word in row['split_question']:
                match_count += 1
    else:
        return 0
    return match_count/len(row['split_answer'])
jeopardy['answer_in_question'] = jeopardy.apply(match_count, axis = 1)
jeopardy['answer_in_question'].mean()
0.05668720665470503
On average, answers have 6% of their words appearing in the question. That's probably too low to be useful. However, we can create a histogram to explore this further and make sure we aren't missing anything interesting.
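Before moving on, we can sanity-check match_count on a hand-built row; a plain dictionary works, since the function only indexes by column name (the sample words below are made up):

```python
def match_count(row):
    # same logic as above: drop 'the', then count answer words found in the question
    match_count = 0
    if 'the' in row['split_answer']:
        row['split_answer'].remove('the')
    if len(row['split_answer']) != 0:
        for word in row['split_answer']:
            if word in row['split_question']:
                match_count += 1
    else:
        return 0
    return match_count/len(row['split_answer'])

row = {'split_answer': ['the', 'hms', 'beagle'],
       'split_question': ['it', 'departed', 'plymouth', 'aboard', 'the', 'beagle']}
print(match_count(row))  # 'beagle' matches, 'hms' does not: 1 out of 2
```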
#We create a histogram for the answer_in_question column
histo = jeopardy['answer_in_question'].value_counts(normalize = True
,bins = 5)*100
histo.sort_index(inplace = True)
#Plotting that histogram
#Creating figure and plot
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1,1,1)
#Setting limits and ticks
ax.set_xticks([0, 1.0, 2.0, 3.0, 4.0, 5.0])
ax.set_xlim(0,5)
ax.set_ylim(0,90)
#Title
ax.set_title('Percentage of Words Already in the Question', size = 25)
#Plotting
histo.plot.bar(color = 'palegoldenrod')
#Adding axes labels
ax.set_xlabel('Percent of Words in Question',fontsize = 20)
ax.set_ylabel('Percentage of Questions', fontsize = 20)
#Adding better xticks labels
ax.set_xticklabels(['0-20%','20-40%','40-60%','60-80%','80-100%'])
#Rotating the xticks labels for readability
plt.xticks(rotation = 20)
This histogram shows that the vast majority of the answers have none of their words contained in the question. This makes sense, as otherwise it wouldn't make for a very interesting game. This does not seem to be of any help for winning Jeopardy.
As stated at the start, another angle is to determine how often new questions are constructed from old ones. To do so, we will first sort the jeopardy database by date, from oldest to newest.
jeopardy.sort_values(by = 'AirDate', inplace = True)
We will now create a set called terms_used, in which we will progressively add words used in the questions. We will also create a list called question_overlap which will contain, for each question, the proportion of its words that have already been used.
To remove words like 'the', 'a',... from consideration, we will only consider words with six or more letters.
terms_used = set()
question_overlap = []
To fill these two, for each row we will:

  * keep only the words of the question with six or more letters;
  * count how many of those words already appear in terms_used;
  * add the question's words to terms_used;
  * append the proportion of already-used words to question_overlap.
#Looping over rows
for row in jeopardy.iterrows():
    match_count = 0.0
    #Only looking at words with 6 or more letters
    split_question = [word for word in row[1]['split_question'] \
                      if len(word) >= 6]
    #Counting words already used
    for word in split_question:
        if word in terms_used:
            match_count += 1
    #Adding words to terms_used
    for word in split_question:
        terms_used.add(word)
    #adding the proportion of words already used to
    #question_overlap
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
0.6918449051268193
On average, a question has 70% of its (six-letter-or-longer) words appearing in older questions. That is a pretty high percentage, and studying old questions may be worthwhile. Let us create a histogram to visualize the distribution of these percentages.
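The heart of the loop above can be checked on a tiny, made-up question list:

```python
terms_used = set()
question_overlap = []

# three fabricated, already-split questions, ordered oldest first
questions = [['empire', 'romans', 'built'],
             ['romans', 'empire'],
             ['garden', 'of', 'eden']]

for q in questions:
    # only words with six or more letters count
    words = [w for w in q if len(w) >= 6]
    matches = sum(w in terms_used for w in words)
    terms_used.update(words)
    question_overlap.append(matches / len(words) if words else 0)

print(question_overlap)
```

The second question reuses both of the first question's long words, so its overlap is 1.0; the others contribute nothing seen before and get 0.0.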
histo = jeopardy['question_overlap'].value_counts(normalize = True
,bins = 10)*100
histo.sort_index(inplace = True)
#Creating figure and plot
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(1,1,1)
#Setting limits and ticks
ax.set_xticks([0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
ax.set_xlim(0,5)
ax.set_ylim(0,50)
#Title
ax.set_title('Percentage of Words Coming from Old Questions', size = 25)
#Plotting
histo.plot.bar(color = 'palegoldenrod')
#Adding axes labels
ax.set_xlabel('Percentage of Words in Old Questions',fontsize = 20)
ax.set_ylabel('Percentage of Questions', fontsize = 20)
#Adding xticks labels
ax.set_xticklabels(['0-10%','10-20%','20-30%','30-40%','40-50%'
,'50-60%','60-70%','70-80%','80-90%','90-100%'])
#Rotating the xticks labels for readability
plt.xticks(rotation = 20)
It seems 30% of the questions have more than 90% of their words consisting of previously used terms. Another large portion has at least half of their words consisting of previously used terms.
This indicates that studying past questions might be a good idea.
We concluded, in the previous section, that studying old questions may be worthwhile. But is there a specific theme we should study more? Ideally, we want to study questions that are worth more money. Using a chi-squared test, we can find out, for each word in terms_used, whether it appears unusually often in high value questions.
To do so, we will split questions into two categories:

  * high value: worth more than $800;
  * low value: worth $800 or less.
Once this is done, for each word, we'll count how many times it appears in high and low value questions. We can also compute the expected counts under the assumption that the appearance of the word is independent of the question's value.
The difference between the actual and expected counts will then be quantified with a chi-squared test.
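As a concrete illustration of the test we are about to run, suppose a word appears 9 times in high value questions and 3 times in low value ones, while high value questions make up 28% of the database (both numbers are made up for this sketch):

```python
from scipy.stats import chisquare

observed = [9, 3]                        # high/low counts for the word
total = sum(observed)
expected = [total * 0.28, total * 0.72]  # counts expected under independence

chi2, p = chisquare(observed, f_exp=expected)
print(chi2, p)
```

A split this lopsided yields a tiny p-value, i.e. strong evidence the word is tied to high value questions. Keep in mind that most real words appear only once or twice, and with counts that small the chi-squared approximation is shaky, so low-count "significant" words should be read with caution.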
#We are going to time our code, to see if it would be
#reasonable to use on a larger database
#This will store the time when this cell is started
start_time = time.time()
#Defining a function to segregate into high/low value:
def hi_lo(row):
    if row['Value'] > 800:
        return True
    else:
        return False
#Applying it to the Dataframe
jeopardy['high_value'] = jeopardy.apply(hi_lo,axis = 1)
Instead of computing the number of high/low occurrences for each word separately, we are going to go through the database once, counting occurrences as we go. This is faster: one pass over the database, instead of one pass per word.
#This dictionary will store the word occurrences
#Each word will be a key, and its value a dictionary
#of the form {'high': nb of high occurrences, 'low': nb of low occurrences}
word_counts = {}
#Iterating over rows
for row in jeopardy.iterrows():
    #If it's a high value question
    if row[1]['high_value'] == True:
        #Loop over words in the question
        for word in row[1]['split_question']:
            #Updating word_counts
            if word in word_counts:
                word_counts[word]['high'] += 1
            else:
                word_counts[word] = {'high':1 , 'low':0}
    else:
        for word in row[1]['split_question']:
            if word in word_counts:
                word_counts[word]['low'] += 1
            else:
                word_counts[word] = {'high':0 , 'low':1}
Done! Let's access some high/low occurrence counts to get familiar with the syntax.
word_counts['english']
{'high': 37, 'low': 57}
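Incidentally, the same counting pattern can be written more compactly with collections.defaultdict, which removes the need for the if/else on the first sighting of a word. A sketch on two fabricated rows:

```python
from collections import defaultdict

# each missing word starts at {'high': 0, 'low': 0}
word_counts = defaultdict(lambda: {'high': 0, 'low': 0})

# (high_value flag, split question) pairs, made up for the example
rows = [(True,  ['henry', 'england']),
        (False, ['england', 'france'])]

for high_value, words in rows:
    key = 'high' if high_value else 'low'
    for word in words:
        word_counts[word][key] += 1

print(dict(word_counts)['england'])
```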
We also need a function that computes the expected number of occurrences in high/low value questions, based on the total number of occurrences of a word and the proportions of high/low value questions. For the chi-squared test, we compute this under the assumption that the occurrence of the word is independent of the question's value.
#Computing number of high/low value questions
high_q = jeopardy[jeopardy['high_value']].shape[0]
low_q = jeopardy[~jeopardy['high_value']].shape[0]
#And their proportions
high_prop = high_q/(high_q+low_q)
low_prop = 1 - high_prop
#This computes the expected high/low distribution
#For any given word
def exp_count(s):
    tot_count = word_counts[s]['high']+word_counts[s]['low']
    return [tot_count*high_prop,tot_count*low_prop]
#This dictionary will store the results of our chi-squared tests
dico_chi = {}
#Selecting 10 words at random
#(random.sample needs a sequence, so we convert the set to a list first)
comparison_terms = sample(list(terms_used),10)
comparison_terms
['quills', 'armand', 'beniamino', 'raindrops', 'charlestown', 'sledding', 'probably', 'friday', '270000000', 'pajamas']
#Printing the p-values for our random selection
#We are going to store results in the dictionary dico_chi
#This will speed up the computation, as there are a lot of repeated counts
for word in comparison_terms:
    #store the counts as a tuple, as it will be a dictionary key
    hi_lo_word = (word_counts[word]['high'],word_counts[word]['low'])
    if hi_lo_word in dico_chi:
        p_val = dico_chi[hi_lo_word]
    else:
        p_val = chisquare(list(hi_lo_word),exp_count(word))[1]
        dico_chi[hi_lo_word] = p_val
    print('The word {w} appears in :'.format(w = word))
    print('{h} high value questions'.format(h = hi_lo_word[0]))
    print('{l} low value ones'.format(l = hi_lo_word[1]))
    print('This distribution has a p-value of: ')
    print(p_val)
    print('\n')
The word quills appears in :
0 high value questions
3 low value ones
This distribution has a p-value of: 
0.27214791766902047

The word armand appears in :
0 high value questions
2 low value ones
This distribution has a p-value of: 
0.3699222378079571

The word beniamino appears in :
1 high value questions
0 low value ones
This distribution has a p-value of: 
0.11473257634454047

The word raindrops appears in :
0 high value questions
1 low value ones
This distribution has a p-value of: 
0.5260772985705469

The word charlestown appears in :
0 high value questions
3 low value ones
This distribution has a p-value of: 
0.27214791766902047

The word sledding appears in :
0 high value questions
1 low value ones
This distribution has a p-value of: 
0.5260772985705469

The word probably appears in :
4 high value questions
30 low value ones
This distribution has a p-value of: 
0.02926264474238991

The word friday appears in :
1 high value questions
6 low value ones
This distribution has a p-value of: 
0.39999189913636146

The word 270000000 appears in :
1 high value questions
0 low value ones
This distribution has a p-value of: 
0.11473257634454047

The word pajamas appears in :
0 high value questions
1 low value ones
This distribution has a p-value of: 
0.5260772985705469
It seems that our code may be fast enough to do the whole vocabulary. Let's try it. We'll store our results in a dictionary.
#This stores the words and their p values
words_p_val = {}
for word in terms_used:
    #computing the high/low counts for this word
    #using dico_chi to find the p value if we can
    hi_lo_word = (word_counts[word]['high'],word_counts[word]['low'])
    if hi_lo_word in dico_chi:
        p_val = dico_chi[hi_lo_word]
    #otherwise we compute it
    else:
        p_val = chisquare(list(hi_lo_word),exp_count(word))[1]
        dico_chi[hi_lo_word] = p_val
    words_p_val[word] = {'distribution': hi_lo_word, 'p_val': p_val}
#Note: we avoid naming this variable 'time', which would shadow the time module
elapsed = time.time() - start_time
print("Dictionary of p-values constructed in {t} seconds".format(t = round(elapsed)))
Dictionary of p-values constructed in 3 seconds
What we now have is a dictionary containing, as keys, every word in terms_used. The values are themselves dictionaries, containing the distribution of the word's occurrences between high and low value questions, and the associated p-value. It was constructed in three seconds, so it seems feasible to run this on the full Jeopardy archive in a reasonable amount of time.
Now that we have stored everything, we can sort the dictionary by p values, and find the most significant results.
#Sorting the dictionary by p-value
def get_p(key):
    return words_p_val.get(key).get('p_val')
sorted_words = sorted(words_p_val, key = get_p)
We can print the 10 most significant words, with their distribution and p values.
#Printing the ten most significant results
for word in sorted_words[0:10]:
    print('The word {w} has a distribution of {d} \n and p value of {p}'
          .format(w = word
                  ,d = words_p_val[word]['distribution']
                  ,p = words_p_val[word]['p_val']))
    print('--------')
The word french has a distribution of (107, 134) 
 and p value of 6.709675721369154e-08
--------
The word painter has a distribution of (16, 8) 
 and p value of 3.854581067310279e-05
--------
The word african has a distribution of (34, 33) 
 and p value of 6.453960170352826e-05
--------
The word italian has a distribution of (45, 53) 
 and p value of 0.00015972040578677148
--------
The word german has a distribution of (45, 53) 
 and p value of 0.00015972040578677148
--------
The word pulitzer has a distribution of (15, 9) 
 and p value of 0.0002476749265413058
--------
The word string has a distribution of (17, 12) 
 and p value of 0.000361935427804063
--------
The word liquid has a distribution of (17, 12) 
 and p value of 0.000361935427804063
--------
The word hormone has a distribution of (9, 3) 
 and p value of 0.00038697277745745515
--------
The word largely has a distribution of (5, 0) 
 and p value of 0.0004204697200298204
--------
We can roughly split these words into two groups: nationalities and languages (french, african, italian, german) and arts-and-science terms (painter, pulitzer, string, liquid, hormone).
This indicates that some categories might be worth studying for their tendency to appear in high value questions. We should dive further into the database to see exactly which categories to prioritize.