Jeopardy is an American TV show in which participants answer questions to win money. It has been running for many years. Over multiple rounds, contestants can choose a category, and get a question from that category, where different questions have different dollar values. A more extensive description can be found here.
Imagine you want to participate in Jeopardy, and win. You ask yourself: how do I prepare for this? Is it just a matter of studying a lot? Or is there perhaps something to learn from past questions?
A couple of years ago, someone crawled the Jeopardy archives and posted a short article on Reddit, from which one can download a file with no fewer than 216,930 earlier Jeopardy questions, with answers and other data.
In this notebook, we are going to explore whether there is something to learn from Jeopardy history that will help you prepare. The notebook contains the following sections:
2.1 Columns "Question" and "Answer"
2.2 Column "Value"
2.3 Column "Air Date"
3. Data analysis
3.1 Answer included in question
3.2 Repeated and popular terms
3.3 Terms used in high-value questions
4. Conclusions
Let's start with reading in the data (I took the .csv file) and explore it.
# Import pandas library
import pandas as pd
# Import the data into a dataframe
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
# Show some rows
jeopardy.head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
# Get column information
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null object
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
# Get the number of rows and columns
jeopardy.shape
(216930, 7)
Some column names appear to have leading spaces. That's inconvenient, so let's remove those.
# Remove the leading spaces from the column names
jeopardy.columns = jeopardy.columns.str.strip()
# Check the result
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null object
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
# Check from when the questions originate
jeopardy['Air Date'].value_counts().sort_index()
1984-09-10    48
1984-09-11    50
1984-09-12    51
1984-09-13    53
1984-09-14    54
              ..
2012-01-20    58
2012-01-23    61
2012-01-24    59
2012-01-25    61
2012-01-27    30
Name: Air Date, Length: 3640, dtype: int64
# Show a sample again with better layout
# Avoid truncation (None means "no limit"; the older value -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)
# Display with left alignment
jeopardy.head(10).style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record average of 4,055 hours of sunshine each year | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", this company served its billionth burger | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States | John Adams |
5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect shared billing with a grasshopper | the ant |
6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of Italy, it's still in use today | the Appian Way |
7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls | Michael Jordan |
8 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $400 | In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state | Washington |
9 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $400 | This housewares store was named for the packaging its merchandise came in & was first displayed on | Crate & Barrel |
# Summary of the questions that we have
print(jeopardy.shape[0],'questions dating from', jeopardy['Air Date'].min(),'to', jeopardy['Air Date'].max())
216930 questions dating from 1984-09-10 to 2012-01-27
# Check how many questions per round
jeopardy['Round'].value_counts()
Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!       3631
Tiebreaker               3
Name: Round, dtype: int64
So we have 216,930 questions from between 1984 and 2012. The data looks nearly complete: the only missing values are 2 answers. Most columns were read with dtype `object`.
One somewhat surprising observation concerns the "Questions" and "Answers". From a game-play description I understood that participants do not so much get a "question" that they need to "answer", but rather get an "answer" for which they need to come up with the right "question". The sample above does not really show that. If someone is told "McDonald's", I can hardly imagine them asking which fast food chain served its billionth burger live on The Art Linkletter Show in 1963.
I am not sure about the cause of this discrepancy between (my understanding of) Jeopardy game-play and the question-and-answer archive. However, for the purpose of our study, it seems okay to just treat these as "questions" with "answers".
# Import re library to enable reformatting
import re
# Create a function that takes a string and returns it normalized (no punctuation, all lowercase)
def normalize_string(text):
    # Replace every non-word character with a space and convert to lowercase
    replaced_punctuation = re.sub(r'\W', ' ', text).lower()
    # Collapse repeated spaces and trim the ends
    removed_spaces = re.sub(' +', ' ', replaced_punctuation).strip()
    return removed_spaces
# Test function
normalize_string('Hello!! Do DO dO 16:17 ?two2,and:FOO, bar?')
'hello do do do 16 17 two2 and foo bar'
Looks good. Let's apply this to columns `Question` and `Answer`. We'll add new columns with the result.
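One detail of the regular expression is worth knowing: `\W` treats the underscore as a word character, so tokens such as `22_j_08` or `_blank` survive normalization. The sketch below compares the notebook's pattern with a stricter one; `normalize_string_strict` is a hypothetical variant for illustration, not part of the notebook.

```python
import re

def normalize_string_strict(text):
    """Hypothetical variant of normalize_string that also treats '_' as punctuation."""
    # [\W_] matches every non-word character plus the underscore
    replaced = re.sub(r'[\W_]', ' ', text).lower()
    return re.sub(r' +', ' ', replaced).strip()

sample = 'target="_blank" 22_J_08.jpg'

# Pattern used in the notebook: underscores are kept
original_style = re.sub(r' +', ' ', re.sub(r'\W', ' ', sample).lower()).strip()
print(original_style)                   # target _blank 22_j_08 jpg

# Stricter pattern: underscores become spaces as well
print(normalize_string_strict(sample))  # target blank 22 j 08 jpg
```

Whether the stricter variant is preferable depends on the analysis; the rest of this notebook keeps the original behaviour.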
# Add 2 columns with normalized versions of Question and Answer (cast to str first, so non-string values such as missing answers can be processed)
jeopardy['question_clean'] = jeopardy['Question'].astype('str').apply(normalize_string)
jeopardy['answer_clean'] = jeopardy['Answer'].astype('str').apply(normalize_string)
# Show result on a random sample
jeopardy.sample(10, random_state = 0)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | question_clean | answer_clean |
---|---|---|---|---|---|---|---|---|---|
112079 | 3452 | 1999-09-14 | Double Jeopardy! | SWEET 16 | $1000 | This king had a lot taken off the top January 21, 1793 | Louis XVI | this king had a lot taken off the top january 21 1793 | louis xvi |
50465 | 3967 | 2001-11-27 | Jeopardy! | MISS UNIVERSE | $800 | Crowned in Cyprus in May 2000, Bombay U. grad Lara Dutta represented this country | India | crowned in cyprus in may 2000 bombay u grad lara dutta represented this country | india |
71223 | 5920 | 2010-05-14 | Double Jeopardy! | FAUX | $800 | Nickname of Sam, leader of The Pharaohs, who sang "Wolly Bully" | "The Sham" | nickname of sam leader of the pharaohs who sang wolly bully | the sham |
26234 | 2916 | 1997-04-14 | Double Jeopardy! | FICTIONAL FEMALES | $800 | Miranda, a young woman, appears in several of her works, including "Pale Horse, Pale Rider" | Katherine Anne Porter | miranda a young woman appears in several of her works including pale horse pale rider | katherine anne porter |
86973 | 1295 | 1990-03-30 | Jeopardy! | FAMOUS JOES & JOSEPHS | $200 | This Delaware senator chairs the Senate Judiciary Committee | Joseph Biden | this delaware senator chairs the senate judiciary committee | joseph biden |
127358 | 5685 | 2009-05-01 | Double Jeopardy! | BROWN | $800 | Between 1960 & 1986, he racked up 44 Top 40 hits, but no No. 1s | James Brown | between 1960 1986 he racked up 44 top 40 hits but no no 1s | james brown |
148314 | 3810 | 2001-03-09 | Jeopardy! | IN A MINUTE | $500 | Under the slogan "Real Estate for the Real World", this company claims on average to buy or sell a home every minute | Century 21 | under the slogan real estate for the real world this company claims on average to buy or sell a home every minute | century 21 |
115787 | 2883 | 1997-02-26 | Double Jeopardy! | BLACK AMERICA | $1000 | Ebony & Jet are among the magazines launched by this publisher | John Johnson | ebony jet are among the magazines launched by this publisher | john johnson |
118519 | 3472 | 1999-10-12 | Jeopardy! | A LITTLE DICKENS | $500 | It's no mystery why this work was Dickens' last; he didn't live to finish it | The Mystery of Edwin Drood | it s no mystery why this work was dickens last he didn t live to finish it | the mystery of edwin drood |
193578 | 5447 | 2008-04-22 | Jeopardy! | MELROSE PLACE | $400 | Hey, nice to meet <a href="http://www.j-archive.com/media/2008-04-22_J_08.jpg" target="_blank">this</a> actress who played Jennifer Mancini in 1997; "Charmed", I'm sure | Alyssa Milano | hey nice to meet a href http www j archive com media 2008 04 22_j_08 jpg target _blank this a actress who played jennifer mancini in 1997 charmed i m sure | alyssa milano |
Looks good. The last one in the table shows there is some messy data, where a hyperlink to a picture was included in the data.
We'll ignore that for now, but let's keep it in mind.
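If the hyperlinks ever need to be cleaned up, one possible approach is to strip anything that looks like an HTML tag before normalizing. This is only a sketch: `strip_html_tags` and `TAG_PATTERN` are hypothetical names, and the simple pattern assumes that questions do not contain literal `<` and `>` characters for other reasons.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <a href="..."> or </a>
TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    """Remove HTML tags while keeping the text between them."""
    return TAG_PATTERN.sub('', text)

raw = ('Hey, nice to meet <a href="http://www.j-archive.com/media/2008-04-22_J_08.jpg" '
       'target="_blank">this</a> actress')
print(strip_html_tags(raw))  # Hey, nice to meet this actress
```

Applied before `normalize_string`, this would keep link fragments such as `href`, `archive` and `_blank` out of the cleaned question text.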
Next, we'll change column `Value` into a numeric field, so that it is easier to work with. In the samples so far we see entries like \$200 and \$1,800. Let's first check whether there are other formats.
# Check which different values there are for field Value
jeopardy['Value'].unique()
array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200', '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300', '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100', '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800', '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000', '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000', '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600', '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127', '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600', '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800', '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700', '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263', '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500', '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250', '$3,989', '$8,917', '$9,500', '$1,246', '$6,435', '$8,800', '$2,222', '$2,746', '$10,400', '$7,600', '$6,700', '$5,100', '$13,200', '$4,300', '$1,407', '$12,400', '$5,401', '$7,800', '$1,183', '$1,203', '$13,000', '$11,600', '$14,200', '$1,809', '$8,400', '$8,700', '$11,000', '$5,201', '$1,801', '$3,499', '$5,700', '$601', '$4,008', '$50', '$2,344', '$2,811', '$18,000', '$1,777', '$3,599', '$9,800', '$796', '$3,150', '$20', '$1,810', '$22', '$9,200', '$1,512', '$8,500', '$585', '$1,534', '$13,800', '$5,001', '$4,238', '$16,400', '$1,347', '$2547', '$11,200'], dtype=object)
# Check the amount of 'None' values
len(jeopardy[jeopardy['Value']=='None'])
3634
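It is not entirely clear what these are, although the count of 3,634 equals the number of Final Jeopardy! (3,631) and Tiebreaker (3) questions combined, which suggests they are questions without a fixed dollar value. Let's replace all 'None' values with 0 and convert everything else to a number.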
# Create a function that takes a string as in the Value column and returns a number
def normalize_value(value):
    # 'None' entries are mapped to 0
    if value == 'None':
        return 0
    # Strip the dollar sign and thousands separators, then convert to an integer
    keep_numbers = value.replace('$', '').replace(',', '')
    return int(keep_numbers)
# Test the function
print (normalize_value('None'), normalize_value('$200'), normalize_value('$1,534'), normalize_value('$200')+normalize_value('$1,534'))
0 200 1534 1734
Looks good. Let's apply this to column `Value`. We'll add a new column with the result.
# Add a column with the normalized version of Value (cast to str first, so every entry can be processed as text)
jeopardy['value_clean'] = jeopardy['Value'].astype('str').apply(normalize_value)
# Verification 1: check that all are numbers, by summing them
print('Total is:', jeopardy['value_clean'].sum())
# Verification 2: show all values
print(sorted(jeopardy['value_clean'].unique()))
Total is: 160525700 [0, 5, 20, 22, 50, 100, 200, 250, 300, 350, 367, 400, 500, 585, 600, 601, 700, 750, 796, 800, 900, 1000, 1020, 1100, 1111, 1183, 1200, 1203, 1246, 1263, 1300, 1347, 1400, 1407, 1492, 1500, 1512, 1534, 1600, 1700, 1777, 1800, 1801, 1809, 1810, 1900, 2000, 2001, 2021, 2100, 2127, 2200, 2222, 2300, 2344, 2400, 2500, 2547, 2600, 2700, 2746, 2800, 2811, 2900, 2990, 3000, 3100, 3150, 3200, 3201, 3300, 3389, 3400, 3499, 3500, 3599, 3600, 3700, 3800, 3900, 3989, 4000, 4008, 4100, 4200, 4238, 4300, 4400, 4500, 4600, 4637, 4700, 4800, 5000, 5001, 5100, 5200, 5201, 5400, 5401, 5500, 5600, 5700, 5800, 6000, 6100, 6200, 6300, 6400, 6435, 6600, 6700, 6800, 7000, 7200, 7400, 7500, 7600, 7800, 8000, 8200, 8400, 8500, 8600, 8700, 8800, 8917, 9000, 9200, 9500, 9800, 10000, 10400, 10800, 11000, 11200, 11600, 12000, 12400, 13000, 13200, 13800, 14000, 14200, 16400, 18000]
Looks good.
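As an aside, the same conversion can be expressed with vectorized pandas operations instead of a row-wise `apply`. The sketch below runs on a small hand-made Series rather than on the real column; the sample values are taken from the list of unique values above.

```python
import pandas as pd

# Small stand-in for the Value column (sample values only)
values = pd.Series(['$200', '$1,534', 'None', '$2547'])

cleaned = (
    values.str.replace(r'[$,]', '', regex=True)   # drop '$' and thousands separators
          .pipe(pd.to_numeric, errors='coerce')   # anything non-numeric ('None') becomes NaN
          .fillna(0)                              # treat missing values as 0, as above
          .astype(int)
)
print(cleaned.tolist())  # [200, 1534, 0, 2547]
```

On a column of this size the difference in speed is modest, so the choice is mostly a matter of taste.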
Next, we'll turn `Air Date` into a date field, which is easier to analyze. We'll add a new column.
# Add a column, with the airdate as date-time
# (%m is the month directive; %M means minutes and would turn the month into a time of day)
jeopardy['date_clean'] = pd.to_datetime(jeopardy['Air Date'], format = '%Y-%m-%d')
# Check result
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 11 columns):
Show Number       216930 non-null int64
Air Date          216930 non-null object
Round             216930 non-null object
Category          216930 non-null object
Value             216930 non-null object
Question          216930 non-null object
Answer            216928 non-null object
question_clean    216930 non-null object
answer_clean      216930 non-null object
value_clean       216930 non-null int64
date_clean        216930 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(8)
memory usage: 18.2+ MB
# Check a sample
jeopardy.head(2)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | question_clean | answer_clean | value_clean | date_clean |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory | Copernicus | for the last 8 years of his life galileo was under house arrest for espousing this man s theory | copernicus | 200 | 2004-12-31 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves | Jim Thorpe | no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves | jim thorpe | 200 | 2004-12-31 |
Looks good.
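The format string deserves a careful look, because `%m` (month) and `%M` (minute) differ only in case and both parse without raising an error. The sketch below uses the standard library's `datetime.strptime`, which uses the same directives, on one of the air dates from the data.

```python
from datetime import datetime

air_date = '2004-12-31'

# %m reads the middle field as the month
as_month = datetime.strptime(air_date, '%Y-%m-%d')
# %M reads the middle field as minutes; the month silently defaults to January
as_minute = datetime.strptime(air_date, '%Y-%M-%d')

print(as_month)   # 2004-12-31 00:00:00
print(as_minute)  # 2004-01-31 00:12:00
```

A quick sanity check is to compare the parsed column against the original `Air Date` strings for a few rows.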
Now that we have the data in an easy-to-analyze format, we can start the analysis.
What may happen sometimes is that the question already contains the answer, or parts of it. If that happens a lot, it may help you develop your strategy to win.
We are going to analyze such overlap between questions and answers by calculating (for every question-with-answer) which fraction of the words in the answer also appears in the question. The words `the` and `a` will be excluded: they appear a lot, but are not meaningful for this analysis.
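To make the fraction concrete, here is a minimal sketch on plain strings. The function name `overlap_fraction` and its `ignore` parameter are hypothetical; as an assumption, the ignored words are removed from the question only and still count towards the length of the answer.

```python
def overlap_fraction(question, answer, ignore=('the', 'a')):
    """Fraction of answer words that also occur in the question."""
    # Words from the question, without the ignored filler words
    question_words = {word for word in question.split() if word not in ignore}
    answer_words = answer.split()
    if not answer_words:
        return 0.0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# 'island' is in the question, 'easter' is not: 1 of 2 words
print(overlap_fraction(
    'this island in the south pacific is named for the day of its discovery',
    'easter island'))                                          # 0.5

# The answer is one of the listed options: 1 of 1 word
print(overlap_fraction('sofia sarajevo saigon', 'saigon'))     # 1.0
```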
# Create a function that takes a row of the dataframe as an input, and returns
# the fraction of words in answer_clean that also occur in question_clean
def count_overlap(row):
    # Split question and answer into individual words
    split_answer = row['answer_clean'].split()
    split_question = row['question_clean'].split()
    # Remove all occurrences of 'the' and 'a' from the question (these are not meaningful;
    # 'a' was added after seeing the first results)
    split_question = [word for word in split_question if word not in ('the', 'a')]
    # Count how many words in the answer appear in the question, and calculate the fraction
    result = 0
    match_count = 0
    if len(split_answer) > 0:
        for word in split_answer:
            if word in split_question:
                match_count += 1
        result = match_count / len(split_answer)
    return result
# test an example that contains overlap
test_row1 = jeopardy.iloc[118519]
print(test_row1['question_clean'])
print(test_row1['answer_clean'])
count_overlap(test_row1)
it s no mystery why this work was dickens last he didn t live to finish it
the mystery of edwin drood
0.2
# test an example having 'the' in the question multiple times
# (This test tested whether all instances of 'the' were removed; after commenting out print this is not visible anymore)
test_row2 = jeopardy.iloc[7]
print(test_row2['question_clean'])
print(test_row2['answer_clean'])
count_overlap(test_row2)
no 8 30 steals for the birmingham barons 2 306 steals for the bulls
michael jordan
0.0
# test an example where 'the' appears in both the question and the answer
test_row3 = jeopardy.iloc[5]
print(test_row3['question_clean'])
print(test_row3['answer_clean'])
count_overlap(test_row3)
in the title of an aesop fable this insect shared billing with a grasshopper
the ant
0.0
Looks good so far, so let's apply this to the dataframe (add a new column), and then check some more results.
# Add a new column to include the indicator of overlap between question and answer
jeopardy['answer_in_question'] = jeopardy.apply(count_overlap, axis = 1)
# See the outcome: fraction of words in the answer that is also in the question
jeopardy['answer_in_question'].value_counts()
0.000000 195068 0.500000 8179 0.333333 6437 0.250000 2435 1.000000 1237 0.200000 1151 0.666667 654 0.166667 467 0.400000 350 0.142857 210 0.285714 129 0.125000 118 0.750000 98 0.600000 73 0.111111 51 0.222222 40 0.428571 36 0.375000 28 0.100000 21 0.800000 20 0.571429 14 0.090909 12 0.714286 9 0.181818 9 0.300000 9 0.083333 8 0.153846 7 0.625000 7 0.833333 7 0.272727 5 0.555556 3 0.857143 3 0.444444 3 0.888889 2 0.076923 2 0.363636 2 0.066667 2 0.133333 2 0.545455 2 0.071429 2 0.117647 1 0.052632 1 0.769231 1 0.454545 1 0.583333 1 0.727273 1 0.384615 1 0.875000 1 0.909091 1 0.105263 1 0.818182 1 0.700000 1 0.642857 1 0.777778 1 0.095238 1 0.416667 1 0.230769 1 0.214286 1 Name: answer_in_question, dtype: int64
# Check some samples with an overlap fraction of 0.5
jeopardy[jeopardy['answer_in_question'] ==0.5].head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | question_clean | answer_clean | value_clean | date_clean | answer_in_question |
---|---|---|---|---|---|---|---|---|---|---|---|---|
53 | 4680 | 2004-12-31 | Double Jeopardy! | MUSICAL TRAINS | $2000 | In 1961 James Brown announced "all aboard" for this train | "Night Train" | in 1961 james brown announced all aboard for this train | night train | 2000 | 2004-12-31 | 0.5 |
68 | 5957 | 2010-07-06 | Jeopardy! | GEOGRAPHY "E" | $600 | This island in the South Pacific is named for the day of its discovery, a religious holiday | Easter Island | this island in the south pacific is named for the day of its discovery a religious holiday | easter island | 600 | 2010-07-06 | 0.5 |
80 | 5957 | 2010-07-06 | Jeopardy! | GEOGRAPHY "E" | $2,000 | The family history you wrote for school might include entering the U.S. at this island in New York Bay | Ellis Island | the family history you wrote for school might include entering the u s at this island in new york bay | ellis island | 2000 | 2010-07-06 | 0.5 |
83 | 5957 | 2010-07-06 | Jeopardy! | BE FRUITFUL & MULTIPLY | $1000 | 2 x 1,035 | 2,070 | 2 x 1 035 | 2 070 | 1000 | 2010-07-06 | 0.5 |
112 | 5957 | 2010-07-06 | Double Jeopardy! | JUST THE FACTS | $2000 | He's the older son of Prince Charles and the late Princess Diana | Prince William | he s the older son of prince charles and the late princess diana | prince william | 2000 | 2010-07-06 | 0.5 |
# Check some samples with an overlap fraction of 0.8
jeopardy[jeopardy['answer_in_question'] ==0.8].head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | question_clean | answer_clean | value_clean | date_clean | answer_in_question |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8375 | 4787 | 2005-05-31 | Jeopardy! | TALK LIKE A BRIT | $400 | Of stay in bed, hit someone on the head or rub till it's red, what you do if you cosh | hit someone on the head | of stay in bed hit someone on the head or rub till it s red what you do if you cosh | hit someone on the head | 400 | 2005-05-31 | 0.8 |
10975 | 4362 | 2003-07-15 | Jeopardy! | STUPID ANSWERS | $1000 | It's the state song of the state of Maine | "The State of Maine Song" | it s the state song of the state of maine | the state of maine song | 1000 | 2003-07-15 | 0.8 |
15184 | 3340 | 1999-02-26 | Double Jeopardy! | PUT 'EM IN ORDER | $200 | Oscar Winners "The English Patient", "Unforgiven", "Braveheart" | Unforgiven, Braveheart, The English Patient | oscar winners the english patient unforgiven braveheart | unforgiven braveheart the english patient | 200 | 1999-02-26 | 0.8 |
19375 | 6070 | 2011-01-21 | Double Jeopardy! | JOB HUNTING | $1600 | In a 60-year-old man age really takes its toll on the body, no matter which sport he works in | manager (in man age really) | in a 60 year old man age really takes its toll on the body no matter which sport he works in | manager in man age really | 1600 | 2011-01-21 | 0.8 |
19644 | 4583 | 2004-07-07 | Jeopardy! | WHAT'S THE NEXT LINE? | $800 | "Yes we have no bananas..." | "...We have no bananas today" | yes we have no bananas | we have no bananas today | 800 | 2004-07-07 | 0.8 |
# Check some samples with an overlap fraction of 1.0
jeopardy[jeopardy['answer_in_question'] ==1.0].head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer | question_clean | answer_clean | value_clean | date_clean | answer_in_question |
---|---|---|---|---|---|---|---|---|---|---|---|---|
266 | 4931 | 2006-02-06 | Double Jeopardy! | NOT A CURRENT NATIONAL CAPITAL | $400 | Ljubljana, Bratislava, Barcelona | Barcelona | ljubljana bratislava barcelona | barcelona | 400 | 2006-02-06 | 1.0 |
272 | 4931 | 2006-02-06 | Double Jeopardy! | NOT A CURRENT NATIONAL CAPITAL | $800 | Istanbul, Ottawa, Amman | Istanbul | istanbul ottawa amman | istanbul | 800 | 2006-02-06 | 1.0 |
278 | 4931 | 2006-02-06 | Double Jeopardy! | NOT A CURRENT NATIONAL CAPITAL | $1200 | Sofia, Sarajevo, Saigon | Saigon | sofia sarajevo saigon | saigon | 1200 | 2006-02-06 | 1.0 |
284 | 4931 | 2006-02-06 | Double Jeopardy! | NOT A CURRENT NATIONAL CAPITAL | $1600 | Bucharest, Bonn, Bern | Bonn | bucharest bonn bern | bonn | 1600 | 2006-02-06 | 1.0 |
290 | 4931 | 2006-02-06 | Double Jeopardy! | NOT A CURRENT NATIONAL CAPITAL | $2000 | Belize City, Guatemala City, Panama City | Belize City | belize city guatemala city panama city | belize city | 2000 | 2006-02-06 | 1.0 |
# Calculate the mean of the overlap fraction
jeopardy['answer_in_question'].mean()
0.042752736697187994
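What we can observe: in roughly 90% of the questions (195,068 out of 216,930), not a single word of the answer appears in the question, and the mean overlap fraction is only about 4%. In the samples above, a high overlap mostly comes from special category formats, such as picking one item from a list.
A strategy where you hope to find answers in the questions themselves is therefore not going to help you win Jeopardy.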
Jeopardy has been running for many, many years. One may wonder to what extent the same questions are repeated.
We are going to analyze this, not by finding questions that are literally the same, but by determining for every question which fraction of its words also appeared in earlier questions. We will focus on longer terms (6 characters or more), as those are the terms that typically form the 'heart' of the question. Shorter words (not only 'the' and 'a', but also e.g. 'more' and 'each') that are repeated won't teach us a lot.
Along the way, we'll also create an overview of the terms that appear most often in the questions, including a count of how many times they appear. That could be interesting information as well.
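The idea can be sketched on a handful of made-up questions using `collections.Counter`. The function name `repeated_term_fractions` and the sample questions are hypothetical. One assumption differs slightly from a word-by-word update: the counter is updated only after a question has been scored, so a term repeated within the same question does not count as "seen before".

```python
from collections import Counter

def repeated_term_fractions(questions, min_length=6):
    """Per question: fraction of long terms that occurred in an earlier question."""
    seen = Counter()          # term -> number of occurrences so far
    fractions = []
    for question in questions:
        terms = [word for word in question.split() if len(word) >= min_length]
        repeated = sum(1 for word in terms if word in seen)
        # Questions without long terms get 0.0 instead of raising ZeroDivisionError
        fractions.append(repeated / len(terms) if terms else 0.0)
        seen.update(terms)
    return fractions, seen

questions = [
    'this country borders france',   # 3 long terms, none seen before
    'capital of this country',       # 'country' was seen, 'capital' was not
    'it is on a river',              # no terms of 6+ characters
]
fractions, counts = repeated_term_fractions(questions)
print(fractions)                 # [0.0, 0.5, 0.0]
print(counts.most_common(1))     # [('country', 2)]
```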
# Iterate over the rows of the jeopardy dataframe and calculate, per question, the fraction of terms
# (words of 6 characters or more) that already appeared in earlier rows. Store these fractions in a list.
# Note: "earlier" refers to row order; the dataframe is not sorted by air date.
# On the fly, create a dictionary of all terms and their frequency over all questions.
# Initiate a list with the fractions, a set with all terms used, and a dictionary with term counts
questions_overlap = []
terms_used = set()
terms_dictionary = {}
# Iterate
for i, row in jeopardy.iterrows():
    split_question = row['question_clean'].split()
    # Only keep the words of 6 characters or longer
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    # Check for every term whether it was used before. If so: increase counts. If not: add to set and dictionary.
    for word in split_question:
        if word in terms_used:
            match_count += 1
            terms_dictionary[word] += 1
        else:
            terms_dictionary[word] = 1
            terms_used.add(word)
    # Calculate the fraction; default to 0.0 for questions without long words
    # (otherwise the value of the previous row would silently be reused)
    fraction_used_before = 0.0
    if len(split_question) > 0:
        fraction_used_before = match_count / len(split_question)
    # Append fraction to list
    questions_overlap.append(fraction_used_before)
# Add the calculated fractions (a list with the correct length and in the correct sequence) as a new column to the dataframe
jeopardy['fraction_overlap_before'] = questions_overlap
# Take a look at the first rows in the dataframe, where one would expect no overlap with earlier questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].head(5)
| | date_clean | question_clean | answer_clean | fraction_overlap_before |
---|---|---|---|---|
0 | 2004-12-31 | for the last 8 years of his life galileo was under house arrest for espousing this man s theory | copernicus | 0.0 |
1 | 2004-12-31 | no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves | jim thorpe | 0.0 |
2 | 2004-12-31 | the city of yuma in this state has a record average of 4 055 hours of sunshine each year | arizona | 0.0 |
3 | 2004-12-31 | in 1963 live on the art linkletter show this company served its billionth burger | mcdonald s | 0.0 |
4 | 2004-12-31 | signer of the dec of indep framer of the constitution of mass second president of the united states | john adams | 0.0 |
# Take a look at the final rows in the dataframe, where one would expect a lot of overlap with earlier questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].tail(20)
| | date_clean | question_clean | answer_clean | fraction_overlap_before |
---|---|---|---|---|
216910 | 2006-05-11 | in his prime this athlete said it s hard to be humble when you re as great as i am | muhammad ali | 1.000000 |
216911 | 2006-05-11 | it s home to the holmenkollen ski jump | oslo | 1.000000 |
216912 | 2006-05-11 | we d like to enlighten you about the musical sidd it s based on this novel | siddhartha | 0.500000 |
216913 | 2006-05-11 | he created the musical riddles called the enigma variations | edward elgar | 1.000000 |
216914 | 2006-05-11 | one species of this bird breeds in the arctic tundra vacations at the other end of the globe | a tern | 1.000000 |
216915 | 2006-05-11 | in his teens he worked in an assistant d a s office later his perry mason character made fools of d a s | erle stanley gardner | 1.000000 |
216916 | 2006-05-11 | oscar wilde called this 4 letter word the curse of the drinking classes | work | 1.000000 |
216917 | 2006-05-11 | guyanese capital named for a hanoverian monarch | georgetown | 1.000000 |
216918 | 2006-05-11 | a naughty 18th c novel originally titled memoirs of a woman of pleasure inspired the 2006 musical named for her | fanny hill | 1.000000 |
216919 | 2006-05-11 | if this riddling belgian surrealist painter born 1898 worked for jeopardy he might write this is not a clue | magritte | 0.833333 |
216920 | 2006-05-11 | nightingales robins belong to this family of melodious songbirds | thrushes | 0.833333 |
216921 | 2006-05-11 | her hotsy totsy diaries trace back to one she began as an 11 year old aboard ship in 1914 | anaïs nin | 1.000000 |
216922 | 2006-05-11 | a motto of hers was in politics if you want anything said ask a man if you want anything done ask a woman | margaret thatcher | 1.000000 |
216923 | 2006-05-11 | it s on the suriname river | paramaribo | 1.000000 |
216924 | 2006-05-11 | in 2006 the cast of this long running hit embarked on a href http www j archive com media 2006 05 11_dj_26 wmv an exuberant noisy campaign a to clean up new york city | stomp | 1.000000 |
216925 | 2006-05-11 | this puccini opera turns on the solution to 3 riddles posed by the heroine | turandot | 1.000000 |
216926 | 2006-05-11 | in north america this term is properly applied to only 4 species that are crested including the tufted | a titmouse | 1.000000 |
216927 | 2006-05-11 | in penny lane where this hellraiser grew up the barber shaves another customer then flays him alive | clive barker | 1.000000 |
216928 | 2006-05-11 | from ft sill okla he made the plea arizona is my land my home my father s land to which i now ask to return | geronimo | 1.000000 |
216929 | 2006-05-11 | a silent movie title includes the last name of this 18th c statesman favorite of catherine the great | grigori alexandrovich potemkin | 1.000000 |
# Take a look at the fraction-overlap in the final 100 rows
print(jeopardy.tail(100)['fraction_overlap_before'].value_counts())
1.000000 80 0.800000 5 0.833333 3 0.875000 3 0.888889 2 0.400000 1 0.600000 1 0.666667 1 0.500000 1 0.857143 1 0.750000 1 0.000000 1 Name: fraction_overlap_before, dtype: int64
So it seems that almost all longer words in these final rows have been used before (for what it's worth).
# Calculate the mean value of this fraction-overlap
import numpy as np
np.mean(questions_overlap)
0.9225954554223076
So what we can observe is that most longer words that are used in questions, have been used before. Certainly for the later episodes. That is not so surprising, given that we have more than 20 years worth of questions.
I am not convinced though that this piece of knowledge is going to help a lot.
Possibly a more complex analysis could help:
These are complex analyses to carry out, and I am not really convinced that they would give a lot of insight.
What we can still do is take some frequently used terms and look at some questions that include those terms. Do we happen to see any overlap? For this, we can use the dictionary with word counts that we created. (That overview can be interesting by itself already.)
# Create (and show) a dictionary with those terms that are used more than 1000 times in questions
top_terms = {k: v for (k, v) in terms_dictionary.items() if v > 1000}
for k in sorted(top_terms, key=top_terms.get, reverse=True):
    print(k, top_terms[k])
archive 12979 target 10717 _blank 10649 country 6045 called 5487 president 3294 american 3210 became 3165 played 3014 before 2920 capital 2883 french 2576 famous 2560 island 2534 people 2341 letter 2318 largest 2150 company 2133 author 2074 during 2001 national 1988 british 1976 century 1922 character 1868 little 1831 around 1823 between 1682 series 1636 family 1602 meaning 1571 founded 1546 school 1423 include 1409 million 1377 america 1350 museum 1332 university 1332 number 1331 popular 1322 musical 1322 english 1313 because 1274 second 1268 classic 1268 reports 1268 through 1245 father 1203 person 1177 george 1158 german 1151 general 1130 england 1100 leader 1095 nation 1074 created 1060 italian 1046 william 1034 former 1032
A lot of terms are used more than 1000 times in questions (over the course of 20 years).
The ones at the very top ('archive', 'target', and '_blank') are surprising actually.
Let's have a look at examples for some of these frequently used terms.
# Function to print examples of a term
# It takes a term and a number n as its inputs
# Then the first n questions containing the term will be printed
# (fewer if the term occurs in fewer than n questions)
def print_n_examples_for_term(term, n):
    printed = 0
    for _, row in jeopardy.iterrows():
        # Stop as soon as n examples have been printed
        if printed >= n:
            break
        # Only consider words longer than 5 characters
        split_question = [word for word in row['question_clean'].split() if len(word) > 5]
        if term in split_question:
            print(row['Question'])
            printed += 1
# Print for 'galileo' to test (as we know the first question in the dataframe contains Galileo)
print_n_examples_for_term('galileo', 3)
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory The 4 largest moons of this planet are called Galilean satellites after Galileo, who saw them in 1610 Galileo was the first person to see the rings around this planet
# Print 5 examples for 'archive'
print_n_examples_for_term('archive', 5)
<a href="http://www.j-archive.com/media/2004-12-31_DJ_23.mp3">Beyond ovoid abandonment, beyond ovoid betrayal... you won't believe the ending when he "Hatches the Egg"</a> The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters <a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a> <a href="http://www.j-archive.com/media/2004-12-31_DJ_25.mp3">Somewhere between truth & fiction lies Marco's reality... on Halloween, you won't believe you saw it on this St.</a> <a href="http://www.j-archive.com/media/2004-12-31_DJ_24.mp3">"500 Hats"... 500 ways to die. On July 4th, this young boy will defy a king... & become a legend</a>
# Print 5 examples for '_blank'
print_n_examples_for_term('_blank', 5)
The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859 <a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire
# Print 5 examples for 'target'
print_n_examples_for_term('target', 5)
The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859 <a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire
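The examples above suggest that 'archive', 'target' and '_blank' do not come from the clues themselves but from HTML link markup embedded in some questions (`<a href="http://www.j-archive.com/..." target="_blank">`). Below is a minimal sketch of removing such tags before counting words, assuming a simple regular expression is good enough for this kind of markup. The helper name `strip_html_tags` is hypothetical and not part of this notebook.

```python
import re

# Hypothetical helper (not part of the original notebook): remove HTML tags
# such as <a href="..." target="_blank"> and </a>, keeping the visible text.
TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    # Drop every tag, then collapse any repeated whitespace
    without_tags = TAG_PATTERN.sub('', text)
    return ' '.join(without_tags.split())

# One of the questions printed above
example = ('The shorter glass seen <a href="http://www.j-archive.com/media/'
           '2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint '
           'cocktail made with sugar & bitters')
cleaned_example = strip_html_tags(example)
print(cleaned_example)
```

If this were applied to the 'Question' column before the cleaned text and the word counts are built (for instance with `jeopardy['Question'].apply(strip_html_tags)`), the link markup should no longer show up among the most frequent terms.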
# Print 10 examples for 'country'
print_n_examples_for_term('country', 10)
Africa's lowest temperature was 11 degrees below zero in 1935 at Ifrane, just south of Fez in this country Cross-country skiing is sometimes referred to by these 2 letters, the same ones used to denote 90 in Roman numerals Parts of the Arabian and Libyan deserts are found in this African country A 7.0 magnitude earthquake in this Caribbean country Jan. 12, 2010 brought a world outpouring of aid Andy Garcia is a native of this country whose flag is seen here This Mediterranean country whose flag is seen here is "The Word" Porfirio Diaz seized power in this country in 1876, ruled for 35 years, fled in 1911 & died in exile Exiled for manslaughter, Eric the Red was forced to leave this country around 981 Moshoeshoe II was exiled twice before regaining this southern African country's throne in 1995 Under the 1814 Treaty of Kiel, this country gave Norway to Sweden but kept Greenland & other islands
# Print 10 examples for 'president'
print_n_examples_for_term('president', 10)
Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States His first act after being sworn in as president of the Confederacy was to send a peace commission to Washington, D.C. In the midst of the Korean War, this South Korean president was elected to his second of 4 terms Its headquarters compound in Langley, Virginia is named for Former President George Bush This president's 1972 visit to China inspired an opera that played at the Kennedy Center in 1988 This political satire starred John Travolta as a Southern governor running for president If a president is impeached, this official presides over the trial in the Senate On January 20, 1965 he was inaugurated as U.S. vice president Gerald Ford was the last president born under this "crab"by sign In 1976 this current president of France founded the Rally for the Republic Party
What we can observe:
The value of this 2nd observation must be discounted though. These are just the first examples found for each term, and it would of course be better to compare examples that are years apart.
From what we have seen so far though, this does not help us much with finding a study strategy for Jeopardy.
Let's check if there are any particular terms that appear significantly more in high-value questions.
We will do the following, for multiple popular terms:
Let me explain in layman's terms how that works. Suppose that 30% of all Jeopardy questions are low-value questions and 70% are high-value questions. Now we check all questions that contain the term 'president'. We expect the ratio to be 30% / 70% for those questions as well. If we find 29% / 71%, that still sounds reasonable. If we find 10% / 90%, that looks suspicious. With a chi-squared test one can calculate how likely a deviation of that size (or larger) is under the hypothesis that the expected distribution is 30% / 70%. If that probability is very small (e.g. smaller than 5%), we reject the idea that the outcome is a mere coincidence, and conclude that the term 'president' is truly under- or over-represented (which one depends on the observation) in the high-value questions.
For more background on chi-squared tests, there are many descriptions on the internet. Here is one example.
So the null-hypothesis is "The term <.....> is not over- or under-represented in high-value questions".
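To make the idea concrete, here is a minimal sketch that computes the chi-squared statistic by hand for the two-category case. It uses the hypothetical 30% / 70% split from the explanation above and a made-up total of 1000 questions containing the term. The function name `chi_squared_two_categories` is an assumption for illustration only. With two categories there is one degree of freedom, and in that case the p-value can be obtained from the complementary error function, so no extra library is needed here.

```python
import math

def chi_squared_two_categories(observed, expected):
    # Chi-squared statistic: sum over categories of (observed - expected)^2 / expected
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories means one degree of freedom; in that case the
    # p-value equals erfc(sqrt(statistic / 2))
    pvalue = math.erfc(math.sqrt(statistic / 2))
    return statistic, pvalue

# Hypothetical numbers: 1000 questions contain the term, expected split 30% / 70%
expected = [300, 700]

# Observed 29% / 71%: close to the expectation
stat_close, p_close = chi_squared_two_categories([290, 710], expected)
# Observed 10% / 90%: far from the expectation
stat_far, p_far = chi_squared_two_categories([100, 900], expected)

print(stat_close, p_close)
print(stat_far, p_far)
```

The first p-value is well above 0.05, so 29% / 71% is compatible with the 30% / 70% expectation. The second one is far below 0.05, so 10% / 90% would lead us to reject the null hypothesis.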
# Get overall numbers of low_value and high_value questions
low_value_max = 800
low_value_count = len(jeopardy[jeopardy['value_clean'] <= low_value_max])
high_value_count = len(jeopardy[jeopardy['value_clean'] > low_value_max])
total_count = low_value_count + high_value_count
low_value_fraction = low_value_count / total_count
high_value_fraction = high_value_count / total_count
print(low_value_count, high_value_count, total_count, low_value_fraction, high_value_fraction)
155508 61422 216930 0.7168579726178952 0.2831420273821048
# Create a function that gets a term and returns how frequently it occurs in low-value and high-value questions
def return_low_high(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        # Only consider words longer than 5 characters
        split_question = row['question_clean'].split()
        split_question = [word for word in split_question if len(word) > 5]
        if term in split_question:
            # Count the question once, in the low-value or the high-value group
            if row['value_clean'] <= low_value_max:
                low_count += 1
            elif row['value_clean'] > low_value_max:
                high_count += 1
    return low_count, high_count
# Test the function on the term 'galileo'
test_term_1 = 'galileo'
a, b = return_low_high(test_term_1)
print (a,b)
32 7
# Test the function on the term 'president'
test_term_1 = 'president'
a, b = return_low_high(test_term_1)
print (a,b, a+b)
2297 883 3180
For 'president' one would initially expect a total of 3294, as that is the count we had for 'president' in our dictionary. One explanation could be that the term 'president' appears more than once in some questions. For the dictionary, each occurrence was counted. Here, we count the questions that contain the term 'president', regardless of how many times the word appears in a question.
I did not check whether this indeed explains the difference. Rather, I conclude that the function seems to work, so let's apply it to a couple of popular terms.
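The difference between the two ways of counting can be illustrated on a few made-up questions (the list `toy_questions` is hypothetical and not taken from the dataset). As soon as one question uses the term twice, counting every occurrence gives a higher number than counting the questions that contain the term:

```python
# Hypothetical cleaned questions, only for illustration
toy_questions = [
    'this president succeeded another president in 1901',
    'the capital of this country is named for a president',
    'this island is the largest in the mediterranean',
]

term = 'president'

# Count every occurrence of the term (how a word-count dictionary would count)
occurrence_count = sum(q.split().count(term) for q in toy_questions)

# Count the questions that contain the term at least once
question_count = sum(1 for q in toy_questions if term in q.split())

print(occurrence_count, question_count)
```

Running the same two counts over the cleaned questions in the dataframe would show whether repeated occurrences account for the full gap between 3294 and 3180.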
# Select popular terms that are used in questions a lot
popular_terms = ['country','president','american', 'capital', 'island']
# For those terms, calculate how many times each appears in low-value and high-value questions. Store in a dictionary.
popular_terms_low_high = {}
for term in popular_terms:
    popular_terms_low_high[term] = return_low_high(term)
# Show the result
popular_terms_low_high
{'country': (4332, 1647), 'president': (2297, 883), 'american': (2115, 1053), 'capital': (1988, 797), 'island': (1665, 770)}
So for each of these terms we can now do a chi-squared test to figure out whether these counts are significantly different from what one would expect.
Let's first do a quick round for the term 'country', then do it for all terms in a more structured way.
# Calculate expected values for 'country' and print them
country_expected_low = low_value_fraction * (4332 + 1647)
country_expected_high = high_value_fraction * (4332 + 1647)
print(country_expected_low, country_expected_high)
4286.093818282396 1692.9061817176046
# Import chisquare test from library
from scipy.stats import chisquare
# Execute chisquared test for 'country'
observed = np.array([4332, 1647])
expected = np.array([country_expected_low, country_expected_high])
chisquare_value, pvalue = chisquare(observed, expected)  # returns the test statistic and the p-value
# Print result
print(chisquare_value,pvalue)
1.7365061769211498 0.18758212679269415
Now let's do this for all popular terms that we selected in the same way, and print the results in a readable way.
# For all selected popular terms, perform a chi-squared test and present the results in a readable format
for term in popular_terms:
    observed_low = popular_terms_low_high[term][0]
    observed_high = popular_terms_low_high[term][1]
    observed_total = observed_low + observed_high
    expected_low = low_value_fraction * observed_total
    expected_high = high_value_fraction * observed_total
    print('Term:', term)
    print('Observed (low/high):', observed_low, observed_high, "{:.1%} {:.1%}".format(observed_low/observed_total, observed_high/observed_total))
    print('Expected (low/high):', round(expected_low,1), round(expected_high,1), "{:.1%} {:.1%}".format(expected_low/observed_total, expected_high/observed_total))
    observed = np.array([observed_low, observed_high])
    expected = np.array([expected_low, expected_high])
    chisquare_value, pvalue = chisquare(observed, expected)  # returns the test statistic and the p-value
    print('P-value:', pvalue)
    # The p-value is the probability of such a deviation under the null hypothesis, not the probability of the null hypothesis itself
    print('In words: if the term', term, 'is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is', pvalue)
    print('\n')
Term: country
Observed (low/high): 4332 1647 72.5% 27.5%
Expected (low/high): 4286.1 1692.9 71.7% 28.3%
P-value: 0.18758212679269415
In words: if the term country is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is 0.18758212679269415

Term: president
Observed (low/high): 2297 883 72.2% 27.8%
Expected (low/high): 2279.6 900.4 71.7% 28.3%
P-value: 0.49362469359700045
In words: if the term president is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is 0.49362469359700045

Term: american
Observed (low/high): 2115 1053 66.8% 33.2%
Expected (low/high): 2271.0 897.0 71.7% 28.3%
P-value: 7.641747011862597e-10
In words: if the term american is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is 7.641747011862597e-10

Term: capital
Observed (low/high): 1988 797 71.4% 28.6%
Expected (low/high): 1996.4 788.6 71.7% 28.3%
P-value: 0.722302281221693
In words: if the term capital is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is 0.722302281221693

Term: island
Observed (low/high): 1665 770 68.4% 31.6%
Expected (low/high): 1745.5 689.5 71.7% 28.3%
P-value: 0.000290975819901636
In words: if the term island is not over- or underrepresented in high-value questions, the probability of a deviation at least this large is 0.000290975819901636
Observations:
So we could give some advice now: study things that are 'american' or relate to 'islands'. It is questionable though whether this will really help.
We started with the question whether, from an analysis of past questions and answers, we can give advice on how to prepare if you are going to participate in Jeopardy. We saw the following:
It is very hard to defend though that any of this knowledge is going to help you prepare for Jeopardy a lot.
One could certainly analyze more, e.g.:
That's fun to do! If your goal is to prepare for Jeopardy, though, I doubt whether this is a good investment of your time, given what we have seen so far. You may be better off just studying general knowledge instead!