Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.
The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded from here.
Data Dictionary:
Show Number - the Jeopardy episode number of the show this question was in.
Air Date - the date the episode aired.
Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
Category - the category of the question.
Value - the number of dollars answering the question correctly is worth.
Question - the text of the question.
Answer - the text of the answer.
Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.dropna(inplace=True)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
# Removing spaces from column names
jeopardy.columns = [x.strip() for x in jeopardy.columns]
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216928 non-null int64
Air Date       216928 non-null object
Round          216928 non-null object
Category       216928 non-null object
Value          216928 non-null object
Question       216928 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 13.2+ MB
Before we begin the analysis, we need to normalize the text columns (the `Question` and `Answer` columns). The idea is to lowercase words and remove punctuation so "Don't" and "don't" aren't considered different words when we compare them.
import re

def normalizing_string(string):
    """Lowercase text and strip everything except letters, digits, and whitespace."""
    string = string.lower()
    string = re.sub(r"[^A-Z0-9a-z\s]", "", string)
    return string
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_string)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | |
---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
There are some other columns to normalize.

The `Value` column should be numeric so we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should be a datetime, not a string, so we can work with it more easily.
def normalizing_values(value):
    """Strip the dollar sign and commas, then convert to int (0 if not numeric, e.g. 'None')."""
    value = re.sub(r"[^A-Z0-9a-z\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy.dtypes)
jeopardy.head()
Show Number int64 Air Date datetime64[ns] Round object Category object Value object Question object Answer object clean_question object clean_answer object clean_value int64 dtype: object
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

How often the answer is deducible from the question.
How often new questions are repeats of older questions.

We can answer the second question by seeing how often longer words (six or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question.

We'll work on the first question now and come back to the second.
def function_ans_in_ques(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    # "the" is too common to be meaningful, so drop it from the answer
    try:
        split_answer.remove('the')
    except ValueError:
        pass
    if len(split_answer) == 0:
        return 0
    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(function_ans_in_ques, axis=1)
jeopardy['answer_in_question'].mean()
0.057921237245162335
jeopardy[jeopardy['answer_in_question'] != 0][['clean_question', 'clean_answer']].head()
clean_question | clean_answer | |
---|---|---|
14 | on june 28 1994 the natl weather service began... | the uv index |
24 | this asian political party was founded in 1885... | the congress party |
31 | it can be a place to leave your puppy when you... | a kennel |
38 | during the 19541955 sun sessions elvis climbed... | the mystery train |
53 | in 1961 james brown announced all aboard for t... | night train |
The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
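As a quick sanity check (not part of the original analysis), the matching logic above can be exercised on a couple of hand-built examples. The helper below re-implements the same fraction so the snippet runs standalone:

```python
def answer_in_question(question, answer):
    """Fraction of answer words (excluding 'the') that also appear in the question."""
    split_answer = answer.split()
    split_question = question.split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Both answer words appear in the question -> 1.0
print(answer_in_question(
    "in 1961 james brown announced all aboard for the night train",
    "night train"))

# The answer never appears in the question -> 0.0
print(answer_in_question(
    "for the last 8 years of his life galileo was under house arrest",
    "copernicus"))
```

This matches the examples surfaced in the table above: rows like "night train" score highly because every answer word is already spoken in the question.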
overlap_ratio = []
terms_repeated_in_ques = []
terms_repeated_overall = set()
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    terms = [x for x in split_question if len(x) > 5]  # keep words more than 5 letters long
    temp = []
    match_count = 0
    for word in terms:
        if word in terms_used:
            match_count += 1  # the word was already seen in an earlier question
            terms_repeated_overall.add(word)  # repeated words (a subset of terms_used)
            temp.append(word)  # this row's repeated words, stored in a dataframe column below
        terms_used.add(word)  # every qualifying word ends up in the used-words set
    if len(terms) > 0:
        match_count /= len(terms)
    overlap_ratio.append(match_count)
    terms_repeated_in_ques.append(temp)
jeopardy['overlap_ratio'] = overlap_ratio
jeopardy['overlap_terms'] = terms_repeated_in_ques
jeopardy['overlap_ratio'].mean()
0.8740091471018069
There is an 87% overlap of individual words between new questions and old ones. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not by itself significant.
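One way to probe that limitation (a sketch, not part of the original analysis) is to repeat the overlap measurement on two-word phrases instead of single words, since adjacent word pairs preserve more meaning than isolated terms:

```python
def question_overlap(questions):
    """For each question, the fraction of its bigrams already seen in earlier questions."""
    seen = set()
    ratios = []
    for q in questions:
        words = q.split()
        bigrams = set(zip(words, words[1:]))  # adjacent word pairs
        if bigrams:
            ratios.append(len(bigrams & seen) / len(bigrams))
        else:
            ratios.append(0)
        seen |= bigrams
    return ratios

# hypothetical mini-corpus: the second question reuses the phrase "this state"
print(question_overlap([
    "this state has a record average temperature",
    "this state borders the pacific ocean",
]))
```

Running the same loop over `jeopardy['clean_question']` would give a phrase-level overlap ratio, which we'd expect to be much lower than the 87% word-level figure.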
Let's say we only want to study high-value questions instead of low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

Low value - any row where clean_value is 800 or less.
High value - any row where clean_value is greater than 800.
def high_or_low_value(row):
    """1 if the question is worth more than $800, else 0."""
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(high_or_low_value, axis=1)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | overlap_ratio | overlap_terms | high_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 | 0.0 | [] | 0 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 | 0.0 | [] | 0 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 | 0.0 | [] | 0 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 | 0.0 | [] | 0 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 | 0.0 | [] | 0 |
The high_value column categorizes each question as high value (1) or low value (0).

Next, we'll write a function that takes a word and returns the number of high-value and low-value questions the word appeared in.
def high_or_low_count(word):
    """Return how many high-value and low-value questions contain the given word."""
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
observed_high_low = []
# Note: set iteration order is arbitrary, so these five terms can differ between runs
comparison_terms = list(terms_repeated_overall)[:5]
for item in comparison_terms:
    observed_high_low.append(high_or_low_count(item))
observed_high_low
[(15, 31), (0, 2), (16, 40), (2, 1), (1, 2)]
comparison_terms
['somewhere', 'cholla', 'patriot', 'noonan', 'bharat']
For the terms in `comparison_terms`, the observed high and low counts are listed in `observed_high_low`.

We can use a chi-squared test to see whether the high/low splits of the terms in `comparison_terms` are statistically significant. For that, we'll first compute the expected high and low counts, then pass the expected and observed values to the `chisquare` function from `scipy.stats` to get the chi-squared statistic and p-value.
from scipy.stats import chisquare
import numpy as np
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []
for item in observed_high_low:
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    observed = np.array([item[0], item[1]])
    expected = np.array([high_value_expected, low_value_expected])
    chi_squared.append(chisquare(observed, expected))
chi_squared
[Power_divergenceResult(statistic=0.4179159369510027, pvalue=0.5179787872243642), Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538), Power_divergenceResult(statistic=0.0018217776638311067, pvalue=0.9659547992512113), Power_divergenceResult(statistic=2.174012332188078, pvalue=0.1403596010701264), Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368)]
None of the p-values are less than 0.05, so none of these terms show a statistically significant association with high- or low-value questions.

*However, if we performed the same test for all the words, the words with a p-value less than 0.05 and a high chi-squared statistic would be the most valuable to study.*
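That extension might look like the sketch below (an illustration, not part of the original analysis). To keep the snippet self-contained it computes the chi-squared statistic directly rather than calling scipy, and uses hypothetical counts; on the real data, the counts would come from high_or_low_count and the totals from the high_value column. For one degree of freedom, p < 0.05 corresponds to a statistic above roughly 3.841:

```python
CHI2_CRITICAL_1DF = 3.841  # chi-squared value at p = 0.05 with one degree of freedom

def chi_squared_term(high_count, low_count, n_high, n_low):
    """Chi-squared statistic for one term's high/low split vs. the overall proportions."""
    total = high_count + low_count
    total_prop = total / (n_high + n_low)
    expected_high = total_prop * n_high
    expected_low = total_prop * n_low
    return ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)

# hypothetical counts: a term appearing in 20 of 100 high-value questions
# and 10 of 200 low-value questions
stat = chi_squared_term(20, 10, 100, 200)
print(stat, stat > CHI2_CRITICAL_1DF)
```

Looping this over every term that appears at least a handful of times and keeping those above the critical value would surface candidate words to study. A minimum-frequency cutoff matters here because the chi-squared test is unreliable when expected counts are very small.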