Jeopardy is a popular American TV show in which participants answer questions to win money. The show debuted on March 30, 1964 and has remained popular ever since. You can learn more about Jeopardy here.
The dataset we will be working with is JEOPARDY_CSV.csv. It contains 216,930 rows and was collected from Reddit. Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:
Show Number - the Jeopardy episode number.
Air Date - the date the episode aired.
Round - the round of Jeopardy.
Category - the category of the question.
Value - the number of dollars the correct answer is worth.
Question - the text of the question.
Answer - the text of the answer.
The goal of this project is to analyse the dataset and look for patterns that could help you win Jeopardy.
import pandas as pd
import re
import numpy as np
from random import choice
from scipy.stats import chisquare
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
jeopardy.head()
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   Show Number  216930 non-null  int64
 1   Air Date     216930 non-null  object
 2   Round        216930 non-null  object
 3   Category     216930 non-null  object
 4   Value        216930 non-null  object
 5   Question     216930 non-null  object
 6   Answer       216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
With the exception of the Answer column, none of the columns has null values. The Answer column has just two, so we are going to drop those rows; two rows out of 216,930 are too few to affect the analysis.
jeopardy.dropna(inplace=True)
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
jeopardy.columns = jeopardy.columns.str.strip()  # removes whitespace at the beginning and end of each column name
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
def clean_text(text):
    '''Takes in a text input, removes all punctuation
    and converts every word to lower case.'''
    text = re.sub(r'[^\w\s]', '', text)  # removes every character that is neither a word character nor whitespace
    text = text.lower()
    return text
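def clean_value(value):
    '''Strips non-word characters such as "$" and "," and converts
    the result to an integer. Some entries in the Value column are
    not numeric; those are set to 0 instead of raising a ValueError.'''
    value = re.sub(r'\W', '', value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value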
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(clean_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   Show Number     216928 non-null  int64
 1   Air Date        216928 non-null  datetime64[ns]
 2   Round           216928 non-null  object
 3   Category        216928 non-null  object
 4   Value           216928 non-null  object
 5   Question        216928 non-null  object
 6   Answer          216928 non-null  object
 7   clean_question  216928 non-null  object
 8   clean_answer    216928 non-null  object
 9   clean_value     216928 non-null  int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 18.2+ MB
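To make the cleaning step concrete, here is a minimal standalone sketch of the two helpers applied to a few strings. The function bodies mirror the ones above, but the sample inputs are chosen only for illustration, and the dollar strings in particular are assumptions about what the Value column can contain.

```python
import re

def clean_text(text):
    """Remove punctuation and lower-case the text."""
    return re.sub(r'[^\w\s]', '', text).lower()

def clean_value(value):
    """Turn a dollar string such as '$2,000' into an int; non-numeric input becomes 0."""
    digits = re.sub(r'\W', '', value)
    try:
        return int(digits)
    except ValueError:
        return 0

# Illustrative inputs, chosen only to exercise each branch
cleaned = [clean_text(s) for s in ["McDonald's", "No. 2: 1912 Olympian"]]
values = [clean_value(v) for v in ["$200", "$2,000", "None"]]

print(cleaned)
print(values)
```

Punctuation is removed without inserting a space, so "McDonald's" becomes a single token, and anything that cannot be read as a number ends up with a value of 0.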
One strategy you might consider is to look for hints to the answer in the question itself. We are going to look at every question and measure how much of its answer already appears in the question text.
def count_matches(row):
    '''Returns the proportion of words in clean_answer
    that also appear in clean_question.'''
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')  # drops the first "the", which is too common to be a meaningful match
    if len(split_answer) == 0:  # to avoid dividing by 0
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
jeopardy.head()
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 |
jeopardy['answer_in_question'].mean()
0.057921237245162335
On average, only about 6% of the words in an answer also appear in its question. That is too little overlap to hope to win by working out the answer from the wording of the question, so the best strategy is still to actually study for Jeopardy.
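As a small worked example of the overlap measure, the sketch below applies the same logic to plain strings instead of DataFrame rows. The helper name `answer_overlap` and the question and answer pairs are hypothetical, made up only to show the three possible outcomes.

```python
def answer_overlap(clean_question, clean_answer):
    """Share of answer words (ignoring the first 'the') found in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:  # nothing left to compare, avoid dividing by 0
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already-cleaned question/answer pairs
no_hint = answer_overlap("this state is home to the city of yuma", "arizona")
partial = answer_overlap("this new york paper is nicknamed the gray lady", "the new york times")
only_the = answer_overlap("a very common english article", "the")

print(no_hint, partial, only_the)
```

The first pair shares no words, the second shares two of the three remaining answer words, and the third answer is empty once "the" is removed, so it scores 0.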
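In most Q & A competitions it is fairly common for questions to be repeated. We want to find out how much newer questions overlap with older ones, and whether studying past questions is a good strategy. Working through the questions in order of air date, we compute for each one the share of its words longer than five characters that have already appeared in an earlier question.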
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values('Air Date')  # process questions from oldest to newest
for _, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) > 5]  # keep only words longer than 5 characters
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:  # to avoid dividing by 0
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy.head()
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | question_overlap |
---|---|---|---|---|---|---|---|---|---|---|---|---|
84523 | 1 | 1984-09-10 | Jeopardy! | LAKES & RIVERS | $100 | River mentioned most often in the Bible | the Jordan | river mentioned most often in the bible | the jordan | 100 | 0.000000 | 0.0 |
84565 | 1 | 1984-09-10 | Double Jeopardy! | THE BIBLE | $1000 | According to 1st Timothy, it is the "root of a... | the love of money | according to 1st timothy it is the root of all... | the love of money | 1000 | 0.333333 | 0.0 |
84566 | 1 | 1984-09-10 | Double Jeopardy! | '50'S TV | $1000 | Name under which experimenter Don Herbert taug... | Mr. Wizard | name under which experimenter don herbert taug... | mr wizard | 1000 | 0.000000 | 0.0 |
84567 | 1 | 1984-09-10 | Double Jeopardy! | NATIONAL LANDMARKS | $1000 | D.C. building shaken by November '83 bomb blast | the Capitol | dc building shaken by november 83 bomb blast | the capitol | 1000 | 0.000000 | 0.0 |
84568 | 1 | 1984-09-10 | Double Jeopardy! | NOTORIOUS | $1000 | After the deed, he leaped to the stage shoutin... | John Wilkes Booth | after the deed he leaped to the stage shouting... | john wilkes booth | 1000 | 0.000000 | 0.0 |
jeopardy['question_overlap'].mean()
0.8721734034756163
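On average, 87% of the longer terms in a question have already appeared in an earlier question. This measures repeated words rather than repeated questions, but it still suggests that it might be worth looking at older questions when preparing for Jeopardy.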
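The running-vocabulary idea is easier to follow on a tiny input. The sketch below is a simplified version of the loop above that works on a plain list of strings; the function name `term_overlap`, the `min_length` parameter and the three questions are all made up for illustration.

```python
def term_overlap(clean_questions, min_length=6):
    """For each question, in order, return the share of its long words
    that already appeared in an earlier question."""
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        overlaps.append(seen / len(words) if words else 0)
        terms_used.update(words)  # only now do this question's words count as "old"
    return overlaps

# Hypothetical cleaned questions, oldest first
questions = [
    "this river is mentioned most often in the bible",
    "this american river is mentioned in a famous spiritual",
    "the capital of this american state",
]
overlaps = term_overlap(questions)
print(overlaps)
```

The first question can never overlap with anything, and the scores rise as the set of seen terms grows. This is one reason the average over a large, date-sorted dataset can be high even when no complete question is ever repeated.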
We want to find out whether there is any relationship between certain terms and high value questions, so that we can prepare for the questions that are worth the most. We are going to use a chi-squared test to find out.
def value_category(row):
    '''Categorises rows into high or low value:
    1 = high value, 0 = low value.'''
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)
def count_value(word):
    '''Counts how many high value and low value questions
    in the clean_question column contain the given word.'''
    low_count = 0
    high_count = 0
    for _, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
terms = list(terms_used)
comparison_terms = [choice(terms) for _ in range(10)]  # picks a random sample of 10 terms with replacement
observed_expected = []
for term in comparison_terms:
    result = count_value(term)  # (high_count, low_count) for this term
    observed_expected.append(result)
print(observed_expected)
[(0, 1), (0, 1), (1, 0), (0, 1), (1, 5), (4, 7), (1, 1), (0, 1), (3, 1), (2, 0)]
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []
for value in observed_expected:
    total = sum(value)
    total_prop = total / jeopardy.shape[0]
    exp_high = high_value_count * total_prop
    exp_low = low_value_count * total_prop
    observed = np.array([value[0], value[1]])  # (high_count, low_count)
    expected = np.array([exp_high, exp_low])  # must be in the same (high, low) order as observed
    chi_squared.append(chisquare(observed, expected))
chi_squared
[Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834), Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834), Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796), Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834), Power_divergenceResult(statistic=8.948179686982321, pvalue=0.0027774621368179186), Power_divergenceResult(statistic=6.76146672712397, pvalue=0.009314716768224153), Power_divergenceResult(statistic=0.4633727036157106, pvalue=0.4960519396377898), Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834), Power_divergenceResult(statistic=0.02164944004882361, pvalue=0.8830235016084509), Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538)]
In our observed_expected list, most terms appear more often in low value questions, which is to be expected because there are more low value questions than high value ones. Every term in this sample occurs only a handful of times, so the expected counts are far below the roughly five per category that the chi-squared approximation needs, and the p-values above are unreliable even where they fall below 0.05. From this small sample we therefore find no convincing evidence of a strong relationship between any term and high value questions.
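To show what the test computes for a single term, here is a minimal hand-rolled sketch of the two-category chi-squared calculation using only the standard library. It is a stand-in for `scipy.stats.chisquare`, not a replacement for it; the function name `chi_squared_two_way` and all the counts are hypothetical. The p-value uses the fact that, with one degree of freedom, the chi-squared tail probability equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_way(observed, totals):
    """Chi-squared statistic and p-value (1 degree of freedom) for a term's
    (high, low) counts against the overall (high, low) question counts."""
    n_term = sum(observed)
    n_all = sum(totals)
    # Expected counts, in the same (high, low) order as `observed`
    expected = [n_term * t / n_all for t in totals]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Tail probability of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value

# Hypothetical counts: 300 high value and 700 low value questions overall,
# and a term seen in 6 high value and 4 low value questions
expected, statistic, p_value = chi_squared_two_way((6, 4), (300, 700))
print(expected, round(statistic, 3), round(p_value, 3))
```

In this made-up case the p-value falls below 0.05, yet one of the expected counts is only 3, which is exactly the situation where the result should be treated with caution.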
Jeopardy is played in rounds, and here we want to find the most frequent category in each round.
jeopardy['Round'].value_counts(normalize=True)
Jeopardy!           0.495017
Double Jeopardy!    0.488231
Final Jeopardy!     0.016738
Tiebreaker          0.000014
Name: Round, dtype: float64
jeopardy_grp = jeopardy.groupby('Round')
for i in jeopardy['Round'].unique():
    j_round = jeopardy_grp.get_group(i)
    category_shares = j_round['Category'].value_counts(normalize=True)  # sorted from most to least frequent
    top_cat_name = category_shares.index[0]  # name of the most frequent category in this round
    top_cat_percentage = category_shares.iloc[0] * 100  # its share of the round's questions, in percent
    print(f'''
{top_cat_name} category make up {top_cat_percentage:.3}% of the questions in {i} round.
''')
POTPOURRI category make up 0.237% of the questions in Jeopardy! round.
BEFORE & AFTER category make up 0.425% of the questions in Double Jeopardy! round.
U.S. PRESIDENTS category make up 1.38% of the questions in Final Jeopardy! round.
THE AMERICAN REVOLUTION category make up 33.3% of the questions in Tiebreaker round.
Most of the questions in our dataset are from the Jeopardy! and Double Jeopardy! rounds, which together make up about 98% of the data. Even though we know the top categories for these rounds, they account for only a small percentage of the questions in each round: about 0.2% and 0.4% respectively. Focusing on one particular category for a specific round isn't a very good strategy.
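The per-round calculation can also be sketched without pandas, which makes the arithmetic behind the percentages explicit. The function name `top_category_share` and the (round, category) pairs below are hypothetical and far smaller than the real dataset.

```python
from collections import Counter, defaultdict

def top_category_share(rows):
    """Return {round: (top_category, share)} from (round, category) pairs."""
    counts = defaultdict(Counter)
    for round_name, category in rows:
        counts[round_name][category] += 1
    result = {}
    for round_name, counter in counts.items():
        category, count = counter.most_common(1)[0]
        # Share of this round's questions that belong to its top category
        result[round_name] = (category, count / sum(counter.values()))
    return result

# Hypothetical (round, category) pairs
rows = [
    ("Jeopardy!", "HISTORY"), ("Jeopardy!", "HISTORY"),
    ("Jeopardy!", "POTPOURRI"), ("Jeopardy!", "SCIENCE"),
    ("Double Jeopardy!", "BEFORE & AFTER"), ("Double Jeopardy!", "BEFORE & AFTER"),
    ("Double Jeopardy!", "OPERA"),
]
shares = top_category_share(rows)
for round_name, (category, share) in shares.items():
    print(f"{category}: {share:.1%} of the questions in the {round_name} round")
```

With only a few categories per round the top share looks large here. In the real data each round has so many distinct categories that even the most frequent one stays well under 2% in the three main rounds.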
While there is no guaranteed strategy for winning Jeopardy, as we have found out, it might be worthwhile to look at past questions while preparing.
We also found no reliable relationship between any term and high value questions, so there is no keyword to look out for when preparing for high value questions.
There isn't a dominant question category to focus on for any Jeopardy round; it's best to be prepared for as many categories as possible.