Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.
The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded from here.
Data Dictionary:
Show Number - the Jeopardy episode number of the show this question was in.
Air Date - the date the episode aired.
Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
Category - the category of the question.
Value - the number of dollars answering the question correctly is worth.
Question - the text of the question.
Answer - the text of the answer.
Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.dropna(inplace=True)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
# Removing spaces from column names
jeopardy.columns = [x.strip() for x in jeopardy.columns]
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216928 non-null int64
Air Date       216928 non-null object
Round          216928 non-null object
Category       216928 non-null object
Value          216928 non-null object
Question       216928 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 13.2+ MB
Before we begin the analysis, we need to normalize the text columns (the `Question` and `Answer` columns). The idea is to lowercase words and remove punctuation so "Don't" and "don't" aren't considered different words when we compare them.
import re

def normalizing_string(string):
    """Lowercase text and strip everything except letters, digits, and whitespace."""
    string = string.lower()
    string = re.sub(r"[^A-Z0-9a-z\s]", "", string)
    return string
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_string)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | |
---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
There are some other columns to normalize.

The `Value` column should be numeric so we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should be a datetime, not a string, so we can work with it more easily.
def normalizing_values(value):
    """Strip the dollar sign and commas, then convert to int (0 if not numeric, e.g. 'None')."""
    value = re.sub(r"[^A-Z0-9a-z\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy.dtypes)
jeopardy.head()
Show Number int64 Air Date datetime64[ns] Round object Category object Value object Question object Answer object clean_question object clean_answer object clean_value int64 dtype: object
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

How often the answer is deducible from the question.
How often new questions are repeats of older questions.

We can answer the second question by seeing how often longer words (six or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question.

We'll work on the first question now and come back to the second.
def function_ans_in_ques(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    # "the" is too common to be meaningful, so drop it from the answer
    try:
        split_answer.remove('the')
    except ValueError:
        pass
    if len(split_answer) == 0:
        return 0
    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(function_ans_in_ques, axis=1)
jeopardy['answer_in_question'].mean()
0.057921237245162335
jeopardy[jeopardy['answer_in_question'] != 0][['clean_question', 'clean_answer']].head()
clean_question | clean_answer | |
---|---|---|
14 | on june 28 1994 the natl weather service began... | the uv index |
24 | this asian political party was founded in 1885... | the congress party |
31 | it can be a place to leave your puppy when you... | a kennel |
38 | during the 19541955 sun sessions elvis climbed... | the mystery train |
53 | in 1961 james brown announced all aboard for t... | night train |
The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
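As a quick sanity check (not part of the original analysis), the matching logic above can be exercised on a couple of hand-built examples. The helper below re-implements the same fraction so the snippet runs standalone:

```python
def answer_in_question(question, answer):
    """Fraction of answer words (excluding 'the') that also appear in the question."""
    split_answer = answer.split()
    split_question = question.split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Both answer words appear in the question -> 1.0
print(answer_in_question(
    "in 1961 james brown announced all aboard for the night train",
    "night train"))

# The answer never appears in the question -> 0.0
print(answer_in_question(
    "for the last 8 years of his life galileo was under house arrest",
    "copernicus"))
```

This matches the examples surfaced in the table above: rows like "night train" score highly because every answer word is already spoken in the question.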
overlap_ratio = []
terms_repeated_in_ques = []
terms_repeated_overall = set()
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    terms = [x for x in split_question if len(x) > 5]  # keep words more than 5 letters long
    temp = []
    match_count = 0
    for word in terms:
        if word in terms_used:
            match_count += 1  # the word was already seen in an earlier question
            terms_repeated_overall.add(word)  # repeated words (a subset of terms_used)
            temp.append(word)  # this row's repeated words, stored in a dataframe column below
        terms_used.add(word)  # every qualifying word ends up in the used-words set
    if len(terms) > 0:
        match_count /= len(terms)
    overlap_ratio.append(match_count)
    terms_repeated_in_ques.append(temp)
jeopardy['overlap_ratio'] = overlap_ratio
jeopardy['overlap_terms'] = terms_repeated_in_ques
jeopardy['overlap_ratio'].mean()
0.8740091471018069
There is an 87% overlap of individual words between new questions and old ones. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not by itself significant.
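One way to probe that limitation (a sketch, not part of the original analysis) is to repeat the overlap measurement on two-word phrases instead of single words, since adjacent word pairs preserve more meaning than isolated terms:

```python
def question_overlap(questions):
    """For each question, the fraction of its bigrams already seen in earlier questions."""
    seen = set()
    ratios = []
    for q in questions:
        words = q.split()
        bigrams = set(zip(words, words[1:]))  # adjacent word pairs
        if bigrams:
            ratios.append(len(bigrams & seen) / len(bigrams))
        else:
            ratios.append(0)
        seen |= bigrams
    return ratios

# hypothetical mini-corpus: the second question reuses the phrase "this state"
print(question_overlap([
    "this state has a record average temperature",
    "this state borders the pacific ocean",
]))
```

Running the same loop over `jeopardy['clean_question']` would give a phrase-level overlap ratio, which we'd expect to be much lower than the 87% word-level figure.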
Let's say we only want to study high-value questions instead of low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

Low value - any row where clean_value is 800 or less.
High value - any row where clean_value is greater than 800.
def high_or_low_value(row):
    """1 if the question is worth more than $800, else 0."""
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(high_or_low_value, axis=1)
jeopardy.head()
Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | overlap_ratio | overlap_terms | high_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 | 0.0 | [] | 0 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 | 0.0 | [] | 0 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 | 0.0 | [] | 0 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 | 0.0 | [] | 0 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 | 0.0 | [] | 0 |
The high_value column categorizes each question as high value (1) or low value (0).

Next, we'll write a function that takes a word and returns the number of high-value and low-value questions the word appeared in.
def high_or_low_count(word):
    """Return how many high-value and low-value questions contain the given word."""
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
observed_high_low = []
# Note: set iteration order is arbitrary, so these five terms can differ between runs
comparison_terms = list(terms_repeated_overall)[:5]
for item in comparison_terms:
    observed_high_low.append(high_or_low_count(item))
observed_high_low
[(15, 31), (0, 2), (16, 40), (2, 1), (1, 2)]
comparison_terms
['somewhere', 'cholla', 'patriot', 'noonan', 'bharat']
For the terms in `comparison_terms`, the observed high and low counts are listed in `observed_high_low`.

We can use a chi-squared test to see whether the high/low splits of the terms in `comparison_terms` are statistically significant. For that, we'll first compute the expected high and low counts, then pass the expected and observed values to the `chisquare` function from `scipy.stats` to get the chi-squared statistic and p-value.
from scipy.stats import chisquare
import numpy as np
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []
for item in observed_high_low:
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    observed = np.array([item[0], item[1]])
    expected = np.array([high_value_expected, low_value_expected])
    chi_squared.append(chisquare(observed, expected))
chi_squared
[Power_divergenceResult(statistic=0.4179159369510027, pvalue=0.5179787872243642), Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538), Power_divergenceResult(statistic=0.0018217776638311067, pvalue=0.9659547992512113), Power_divergenceResult(statistic=2.174012332188078, pvalue=0.1403596010701264), Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368)]
None of the p-values are less than 0.05, so none of these terms show a statistically significant association with high- or low-value questions.

*However, if we performed the same test for all the words, the words with a p-value less than 0.05 and a high chi-squared statistic would be the most valuable to study.*
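That extension might look like the sketch below (an illustration, not part of the original analysis). To keep the snippet self-contained it computes the chi-squared statistic directly rather than calling scipy, and uses hypothetical counts; on the real data, the counts would come from high_or_low_count and the totals from the high_value column. For one degree of freedom, p < 0.05 corresponds to a statistic above roughly 3.841:

```python
CHI2_CRITICAL_1DF = 3.841  # chi-squared value at p = 0.05 with one degree of freedom

def chi_squared_term(high_count, low_count, n_high, n_low):
    """Chi-squared statistic for one term's high/low split vs. the overall proportions."""
    total = high_count + low_count
    total_prop = total / (n_high + n_low)
    expected_high = total_prop * n_high
    expected_low = total_prop * n_low
    return ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)

# hypothetical counts: a term appearing in 20 of 100 high-value questions
# and 10 of 200 low-value questions
stat = chi_squared_term(20, 10, 100, 200)
print(stat, stat > CHI2_CRITICAL_1DF)
```

Looping this over every term that appears at least a handful of times and keeping those above the critical value would surface candidate words to study. A minimum-frequency cutoff matters here because the chi-squared test is unreliable when expected counts are very small.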