Best Way To Prepare For Jeopardy¶

Introduction¶

Jeopardy is a popular American TV show where participants answer questions to win money.The show debuted in March 30, 1964 and has since become very popular. You can learn more about jeopardy here.

The dataset we will be working with is JEOPARDY_CSV.csv and it conatains 216,930 rows and was collected from reddit. Each row on the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

Show Number - the Jeopardy episode number.
Air Date - the date the episode aired.
Round - the round of Jeopardy.
Category - the category of the question.
Value - the number of dollars the correct answer is worth.
Question - the text of the question.
Answer - the text of the answer.

The goal of this project is to analyse the dataset and look for patterns that could help you win jeopardy.

Data Exploration¶

In [1]:

import pandas as pd
import re
import numpy as np
from random import choice
from scipy.stats import chisquare

In [2]:

jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
jeopardy.head()

Out[2]:

	Show Number	Air Date	Round	Category	Value	Question	Answer
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams

In [3]:

jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB

With the exception of the Answer column, none of the column have null values. There are just two null values in the Answer column so we are going to drop the rows with null values as they are quite insignificant.

Data Cleaning¶

In [4]:

jeopardy.dropna(inplace=True)

In [5]:

jeopardy.columns

Out[5]:

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [6]:

jeopardy.columns = jeopardy.columns.str.strip() # removes white space at the beiginning and end of strings
jeopardy.columns

Out[6]:

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:

def clean_text(text):
    ''' Takes in a text input, removes every punctuation
    and converts every word to lower case.'''
    text = re.sub(r'[^\w\s]', '', text) # replaces characters that are not alpha-numeric follow by a whitespace with an empty string
    text = text.lower()
    return text

def clean_value(value):
    value = re.sub(r'\W', '', value)
    
    '''value column contains none dtype, 
    use try and except to avoid raising valueError'''
    try:
        value = int(value)
    except:
        value = 0
    return value

In [8]:

jeopardy['clean_question'] = jeopardy['Question'].apply(clean_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(clean_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.head()

Out[8]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus	for the last 8 years of his life galileo was u...	copernicus	200
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe	no 2 1912 olympian football star at carlisle i...	jim thorpe	200
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona	the city of yuma in this state has a record av...	arizona	200
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's	in 1963 live on the art linkletter show this c...	mcdonalds	200
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams	signer of the dec of indep framer of the const...	john adams	200

In [9]:

jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Show Number     216928 non-null  int64         
 1   Air Date        216928 non-null  datetime64[ns]
 2   Round           216928 non-null  object        
 3   Category        216928 non-null  object        
 4   Value           216928 non-null  object        
 5   Question        216928 non-null  object        
 6   Answer          216928 non-null  object        
 7   clean_question  216928 non-null  object        
 8   clean_answer    216928 non-null  object        
 9   clean_value     216928 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 18.2+ MB

Answer In Questions:¶

One strategy you might want to consider when answering questions is taking a hint from the question to derive an answer. We are going to look at all the questions and find out what percentage of questions have their answers in them.

In [10]:

def count_matches(row):
    '''returns the proportion for clean_question
    & clean_answer with matching terms'''
    
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0: # to avoid dividing by 0
        return 0
    
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] =  jeopardy.apply(count_matches, axis=1)

In [11]:

jeopardy.head()

Out[11]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus	for the last 8 years of his life galileo was u...	copernicus	200
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe	no 2 1912 olympian football star at carlisle i...	jim thorpe	200
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona	the city of yuma in this state has a record av...	arizona	200
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's	in 1963 live on the art linkletter show this c...	mcdonalds	200
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams	signer of the dec of indep framer of the const...	john adams	200

In [12]:

jeopardy['answer_in_question'].mean()

Out[12]:

0.057921237245162335

On average, only 6% of questions have their answers in the questions asked. This is not a whole lot of questions and means we can't hope to win by trying to figure out the answers of questions using the question. So the best strategy will be to actually study for jeopardy.

Repeated Questions¶

It is somewhat common that in most Q & A competitions, questions are repeated. We want to find out what percentage of questions are repeated and if it is a good strategy to study past questions.

In [13]:

question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for row in jeopardy.iterrows():
    row = row[1]
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0: # to avoid dividing by 0
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

In [14]:

jeopardy.head()

Out[14]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value	answer_in_question
84523	1	1984-09-10	Jeopardy!	LAKES & RIVERS	$100	River mentioned most often in the Bible	the Jordan	river mentioned most often in the bible	the jordan	100	0.000000
84565	1	1984-09-10	Double Jeopardy!	THE BIBLE	$1000	According to 1st Timothy, it is the "root of a...	the love of money	according to 1st timothy it is the root of all...	the love of money	1000	0.333333
84566	1	1984-09-10	Double Jeopardy!	'50'S TV	$1000	Name under which experimenter Don Herbert taug...	Mr. Wizard	name under which experimenter don herbert taug...	mr wizard	1000	0.000000
84567	1	1984-09-10	Double Jeopardy!	NATIONAL LANDMARKS	$1000	D.C. building shaken by November '83 bomb blast	the Capitol	dc building shaken by november 83 bomb blast	the capitol	1000	0.000000
84568	1	1984-09-10	Double Jeopardy!	NOTORIOUS	$1000	After the deed, he leaped to the stage shoutin...	John Wilkes Booth	after the deed he leaped to the stage shouting...	john wilkes booth	1000	0.000000

In [15]:

jeopardy['question_overlap'].mean()

Out[15]:

0.8721734034756163

87% of terms used in old questions are repeated on newer questions so it might be worth it to look at older questions when preparing for Jeopardy.

High Value vs Low Value Questions.¶

We want to find if there is any relationhsip between certain terms and high value questions so we are prepared enough to answer high value questions. We are going to be using a chisquare hypothesis testing to find this.

In [16]:

def value_category(row):
    '''categorises rows into high or low value
    1 = high value, 0 = low vaue'''
    
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)

In [17]:

def count_value(word):
    '''counts the value of individual words 
    in the clean question column'''
    
    low_count = 0
    high_count = 0
    for row in jeopardy.iterrows():
        row = row[1]
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [18]:

comparison_terms = []
comparison_terms = [choice(list(terms_used)) for i in range(10)] # picks a random smaple of 10 terms with replacement

observed_expected = []
for i in comparison_terms:
    result = count_value(i)
    observed_expected.append(result)
    
print(observed_expected)

[(0, 1), (0, 1), (1, 0), (0, 1), (1, 5), (4, 7), (1, 1), (0, 1), (3, 1), (2, 0)]

In [19]:

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for value in observed_expected:
    total = sum(value)
    total_prop = total / jeopardy.shape[0]
    exp_high = high_value_count * total_prop
    exp_low = low_value_count * total_prop
    
    observed = np.array([value[0], value[1]])
    expected = np.array([exp_low, exp_high])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

Out[19]:

[Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=8.948179686982321, pvalue=0.0027774621368179186),
 Power_divergenceResult(statistic=6.76146672712397, pvalue=0.009314716768224153),
 Power_divergenceResult(statistic=0.4633727036157106, pvalue=0.4960519396377898),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.02164944004882361, pvalue=0.8830235016084509),
 Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538)]

In our observed_expected list, terms seem to be more frequent in lower value questions, this could be due to the fact that there are more low value questions than high value ones. In cases where there were significant differences(of at least 3) in the term frequencies for low and high value, the pvalues are all less than 0.05 which would mean a strong relationship between those terms and low value words which makes sense as low value questions are more common. Although it was a small sample, there are no strong relationship between terms and high value questions.

Popular Categories Per Round.¶

Jeopardy has rounds and here we want to find out the most frequent category in each of the rounds.

In [20]:

jeopardy['Round'].value_counts(normalize=True)

Out[20]:

Jeopardy!           0.495017
Double Jeopardy!    0.488231
Final Jeopardy!     0.016738
Tiebreaker          0.000014
Name: Round, dtype: float64

In [21]:

jeopardy_grp =  jeopardy.groupby(['Round'])

In [22]:

for i in jeopardy['Round'].unique():
    j_round = jeopardy_grp.get_group(i)
    top_cat_proportion = j_round['Category'].value_counts(normalize=True)[0] # returns the value for the category with the highest proportion
    top_cat_percentage = top_cat_proportion * 100
    top_cat_name = j_round['Category'].value_counts().index[0] # returns the name of the category with the highest frequency in each round
    
    print(f'''
    {top_cat_name} category make up {top_cat_percentage:.3}% of the questions in {i} round.
''')

    POTPOURRI category make up 0.237% of the questions in Jeopardy! round.


    BEFORE & AFTER category make up 0.425% of the questions in Double Jeopardy! round.


    U.S. PRESIDENTS category make up 1.38% of the questions in Final Jeopardy! round.


    THE AMERICAN REVOLUTION category make up 33.3% of the questions in Tiebreaker round.

Most of the questions in our dataset are from the Jeopardy! and Double Jeopardy! rounds, with these round making up nearly 99% of the data, even though we know the top categories for these rounds, these categories make up only a small percentage of the total questions; 0.2% and 0.3% respectively. Focusing on just one particular category of question for a specific round isn't a very good strategy.

Conclusion¶

While there is no guaranteed strategy to winning Jeopardy as we have found out, it might be worth while to look at past questions while preparing.
There also isn't any significant relationship between any term and high questions, so there is no keyword to look out for to prepare for high value questions.
There isn't a significant question category to focus on for any jeopardy round, it's best to be prepared for as much ccategories as possible.