Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded from here.

Data Dictionary:

Show Number - the Jeopardy episode number of the show this question was in.
Air Date - the date the episode aired.
Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
Category - the category of the question.
Value - the number of dollars answering the question correctly is worth.
Question - the text of the question.
Answer - the text of the answer.

Aim

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

Introduction

  • We will extract data into pandas dataframe.
  • Clean the dataset by dropping colimns with null values
  • Clean the Column names
In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.dropna(inplace=True)
jeopardy.head()
Out[1]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams
In [2]:
jeopardy.columns
Out[2]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
In [3]:
# Removing spaces from column names
jeopardy.columns = [x.strip() for x in jeopardy.columns]
jeopardy.columns
Out[3]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
In [4]:
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216928 non-null int64
Air Date       216928 non-null object
Round          216928 non-null object
Category       216928 non-null object
Value          216928 non-null object
Question       216928 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 13.2+ MB

Normalizing Text

Before we begin to do analysis, we need to normalize all of the text columns (the Question and Answer columns). The idea is to lowercase words and remove punctuation so Don't and don't aren't considered to be different words while comparing them.

In [5]:
import re

def normalizing_string(string):
    string = string.lower()
    string = re.sub("[^A-Z0-9a-z\s]", "", string)
    return string
In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_string)
jeopardy.head()
Out[6]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams

Normalizing Columns

There are some other columns to be normalized.

The Value column should be numeric to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, enabling us to work with it more easily.

In [7]:
def normalizing_values(value):
    value = re.sub("[^A-Z0-9a-z\s]", "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value
In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy.dtypes)
jeopardy.head()
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
Out[8]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer clean_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

How often the answer is deducible from the question.
How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question.

We'll work on the first question now, and come back to the second.


How often the answer is deducible from the question.

In [15]:
def function_ans_in_ques(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    match_count = 0

    try:
        split_answer.remove('the')
    except ValueError:
        pass

    if len(split_answer) == 0:
        return 0

    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(function_ans_in_ques, axis=1)
jeopardy['answer_in_question'].mean()
Out[15]:
0.057921237245162335
In [16]:
jeopardy[jeopardy['answer_in_question'] != 0][['clean_question', 'clean_answer']].head()
Out[16]:
clean_question clean_answer
14 on june 28 1994 the natl weather service began... the uv index
24 this asian political party was founded in 1885... the congress party
31 it can be a place to leave your puppy when you... a kennel
38 during the 19541955 sun sessions elvis climbed... the mystery train
53 in 1961 james brown announced all aboard for t... night train

The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.


How often new questions are repeats of older questions

In [17]:
overlap_ratio = []
terms_repeated_in_ques = []
terms_repeated_overall = set()
terms_used = set()

for i, rows in jeopardy.iterrows():
    split_question = rows['clean_question'].split(" ")
    terms = [x for x in split_question if len(x) > 5]   ## Words which are more then 5 letters long 
    temp = []
    match_count = 0

    for word in terms:
        if word in terms_used:
            match_count += 1                  ## increases match count if word is already seen earlier
            terms_repeated_overall.add(word)  ## adds word to the repeated words set (smaller then used words)
            temp.append(word)                 ## Word added in temporary array to be added in repeat words 
                                              ## column in dataframe
        terms_used.add(word)                  ## adds word to the used words set
            
    if len(terms) > 0:
        match_count /= len(terms)
        
    overlap_ratio.append(match_count)
    terms_repeated_in_ques.append(temp)
    
jeopardy['overlap_ratio'] = overlap_ratio
jeopardy['overlap_terms'] = terms_repeated_in_ques

jeopardy['overlap_ratio'].mean()
Out[17]:
0.8740091471018069

There is a 87% overlap of words between new questions and old ones. However words can be put together as different phases with a big difference in meaning. So this huge overlap is not super significant.


Low-Value vs High-Value Questions

Let's say we only want to study questions that pertain to high value questions instead of low value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

Low value - Any row where Value is less than 800.
High value - Any row where Value is greater than 800.

In [13]:
def high_or_low_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
    
jeopardy['high_value'] = jeopardy.apply(high_or_low_value, axis=1)
jeopardy.head()
Out[13]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer clean_value answer_in_question overlap_ratio overlap_terms high_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200 0.0 0.0 [] 0
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200 0.0 0.0 [] 0
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200 0.0 0.0 [] 0
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200 0.0 0.0 [] 0
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200 0.0 0.0 [] 0

High Value column categorizes data into either High Value [1] or Low Value [0].


Observed Quantity of High Value vs Low Value Questions

We will create a function that takes in a word, then return the # of high/low values questions this word showed up in.

In [19]:
def high_or_low_count(word):
    low_count = 0
    high_count = 0
    
    for i, rows in jeopardy.iterrows():
        split_question = rows['clean_question'].split(" ")
        if word in split_question:
            if rows['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count

observed_high_low = []
comparison_terms = list(terms_repeated_overall)[:5]

for item in comparison_terms:
    observed_high_low.append(high_or_low_count(item))
    
observed_high_low
Out[19]:
[(15, 31), (0, 2), (16, 40), (2, 1), (1, 2)]
In [20]:
comparison_terms
Out[20]:
['somewhere', 'cholla', 'patriot', 'noonan', 'bharat']

For terms in comparison_terms the Observed High Count, Low Count is mentioned in observed_high_low


Applying the Chi-Squared test

We can use the chi squared test to see if the values of the terms in "comparsion_terms" are statiscally significant.

For that, we will find the expected High Count, Low Count.
Then, we will pass expected and observed values through chisquare function from scipy.stats to get the chi-value.

In [21]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for item in observed_high_low:
    total = sum(item)
    total_prop = total/jeopardy.shape[0]
    high_value_expected = total_prop*high_value_count
    low_value_expected = total_prop*low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([high_value_expected, low_value_expected])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared
Out[21]:
[Power_divergenceResult(statistic=0.4179159369510027, pvalue=0.5179787872243642),
 Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538),
 Power_divergenceResult(statistic=0.0018217776638311067, pvalue=0.9659547992512113),
 Power_divergenceResult(statistic=2.174012332188078, pvalue=0.1403596010701264),
 Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368)]

Chi Squared Results

None of the p values are less than 0.05, so these are not statiscally significant.

However, if we perform the same test for all the words, then words with pvalue less then 0.05 and high chi-square value would be most valuable to study.