# Winning Jeopardy¶

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded from here.

Data Dictionary:

Show Number - the Jeopardy episode number of the show this question was in.
Air Date - the date the episode aired.
Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
Category - the category of the question.
Value - the number of dollars answering the question correctly is worth.
Question - the text of the question.
Answer - the text of the answer.

### Aim¶

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

### Introduction¶

• We will extract the data into a pandas dataframe.
• Clean the dataset by dropping rows with null values.
• Clean the column names.
In [1]:
import pandas as pd

# Load the dataset and drop rows with missing values
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.dropna(inplace=True)
jeopardy.head()

Out[1]:
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |

In [2]:
jeopardy.columns

Out[2]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Removing spaces from column names
jeopardy.columns = [x.strip() for x in jeopardy.columns]
jeopardy.columns

Out[3]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [4]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216928 non-null int64
Air Date       216928 non-null object
Round          216928 non-null object
Category       216928 non-null object
Value          216928 non-null object
Question       216928 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 13.2+ MB

### Normalizing Text¶

Before we begin the analysis, we need to normalize the text columns (the Question and Answer columns). The idea is to lowercase words and remove punctuation so that "Don't" and "don't" aren't treated as different words when comparing them.

In [5]:
import re

def normalizing_string(string):
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_string)
jeopardy.head()

Out[6]:

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
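As a quick sanity check, the normalizer can be exercised on its own (a standalone copy of the function above):

```python
import re

def normalizing_string(string):
    # Lowercase and strip anything that isn't a letter, digit, or whitespace
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

print(normalizing_string("Don't"))        # dont
print(normalizing_string("McDonald's"))   # mcdonalds
```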

### Normalizing Columns¶

There are some other columns to be normalized.

The Value column should be numeric to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, enabling us to work with it more easily.

In [7]:
def normalizing_values(value):
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value
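A quick check of this helper (a self-contained copy): dollar values like "$200" become integers, and non-numeric placeholders fall back to 0.

```python
import re

def normalizing_values(value):
    # Strip the dollar sign and any commas, then try to convert to int
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value

print(normalizing_values("$200"))    # 200
print(normalizing_values("$2,000"))  # 2000
print(normalizing_values("None"))    # 0
```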

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy.dtypes)
jeopardy.head()

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

Out[8]:
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

• How often the answer is deducible from the question.
• How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

### How often the answer is deducible from the question¶

In [15]:
def function_ans_in_ques(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    # 'the' is too common to count as a meaningful match
    try:
        split_answer.remove('the')
    except ValueError:
        pass
    if len(split_answer) == 0:
        return 0
    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(function_ans_in_ques, axis=1)
jeopardy['answer_in_question'].mean()

Out[15]:
0.057921237245162335

In [16]:
jeopardy[jeopardy['answer_in_question'] != 0][['clean_question', 'clean_answer']].head()

Out[16]:

| | clean_question | clean_answer |
|---|---|---|
| 14 | on june 28 1994 the natl weather service began... | the uv index |
| 24 | this asian political party was founded in 1885... | the congress party |
| 31 | it can be a place to leave your puppy when you... | a kennel |
| 38 | during the 19541955 sun sessions elvis climbed... | the mystery train |
| 53 | in 1961 james brown announced all aboard for t... | night train |

The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
### How often new questions are repeats of older questions¶

In [17]:
overlap_ratio = []
terms_repeated_in_ques = []
terms_repeated_overall = set()
terms_used = set()

for i, rows in jeopardy.iterrows():
    split_question = rows['clean_question'].split(" ")
    # Keep only words that are more than 5 letters long
    terms = [x for x in split_question if len(x) > 5]
    temp = []
    match_count = 0
    for word in terms:
        if word in terms_used:
            match_count += 1  # increase match count if the word was seen earlier
            terms_repeated_overall.add(word)  # repeated-words set (smaller than the used-words set)
            temp.append(word)  # temporary list for the repeat-words column in the dataframe
        terms_used.add(word)  # add the word to the used-words set
    if len(terms) > 0:
        match_count /= len(terms)
    overlap_ratio.append(match_count)
    terms_repeated_in_ques.append(temp)

jeopardy['overlap_ratio'] = overlap_ratio
jeopardy['overlap_terms'] = terms_repeated_in_ques
jeopardy['overlap_ratio'].mean()

Out[17]:
0.8740091471018069

There is an 87% overlap of words between new questions and old ones. However, single words can be combined into different phrases with very different meanings, so this large overlap is not by itself very significant.

### Low-Value vs High-Value Questions¶

Let's say we only want to study high-value questions instead of low-value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

• Low value - any row where clean_value is 800 or less.
• High value - any row where clean_value is greater than 800.
In [13]:
def high_or_low_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(high_or_low_value, axis=1)
jeopardy.head()

Out[13]:

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | overlap_ratio | overlap_terms | high_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 | 0.0 | [] | 0 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 | 0.0 | [] | 0 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 | 0.0 | [] | 0 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 | 0.0 | [] | 0 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 | 0.0 | [] | 0 |

The high_value column categorizes each question as either high value (1) or low value (0).
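The cutoff behaviour can be verified on a couple of hand-made rows (a standalone copy of the function; note that a value of exactly 800 counts as low):

```python
def high_or_low_value(row):
    # 1 = high value (clean_value above 800), 0 = low value
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

print(high_or_low_value({'clean_value': 200}))   # 0
print(high_or_low_value({'clean_value': 800}))   # 0
print(high_or_low_value({'clean_value': 1000}))  # 1
```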

### Observed Quantity of High Value vs Low Value Questions¶

We will create a function that takes in a word and returns the number of high-value and low-value questions the word showed up in.

In [19]:
def high_or_low_count(word):
    low_count = 0
    high_count = 0

    for i, rows in jeopardy.iterrows():
        split_question = rows['clean_question'].split(" ")
        if word in split_question:
            if rows['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1

    return high_count, low_count

observed_high_low = []
comparison_terms = list(terms_repeated_overall)[:5]

for item in comparison_terms:
    observed_high_low.append(high_or_low_count(item))

observed_high_low

Out[19]:
[(15, 31), (0, 2), (16, 40), (2, 1), (1, 2)]
In [20]:
comparison_terms

Out[20]:
['somewhere', 'cholla', 'patriot', 'noonan', 'bharat']

For each term in comparison_terms, the corresponding observed (high count, low count) pair is listed in observed_high_low.

### Applying the Chi-Squared test¶

We can use the chi-squared test to see whether the high-value/low-value counts of the terms in comparison_terms are statistically significant.

For that, we will first find the expected high and low counts for each term.
Then, we will pass the expected and observed values to the chisquare function from scipy.stats to get the chi-squared statistic and p-value.
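As a toy illustration of how chisquare works (the counts below are made up): 30 occurrences split 10 high / 20 low are compared against an expected even 15/15 split.

```python
from scipy.stats import chisquare

# Hypothetical counts: 10 high-value and 20 low-value occurrences,
# compared against an expected even 15/15 split
result = chisquare([10, 20], f_exp=[15, 15])
print(result.statistic)  # (10-15)^2/15 + (20-15)^2/15 ≈ 3.333
print(result.pvalue)
```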

In [21]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for item in observed_high_low:
    total = sum(item)
    total_prop = total / jeopardy.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count

    observed = np.array([item[0], item[1]])
    expected = np.array([high_value_expected, low_value_expected])

    chi_squared.append(chisquare(observed, expected))

chi_squared

Out[21]:
[Power_divergenceResult(statistic=0.4179159369510027, pvalue=0.5179787872243642),
Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538),
Power_divergenceResult(statistic=0.0018217776638311067, pvalue=0.9659547992512113),
Power_divergenceResult(statistic=2.174012332188078, pvalue=0.1403596010701264),
Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368)]

### Chi Squared Results¶

None of the p-values are less than 0.05, so none of these term counts are statistically significant.

However, if we performed the same test for all the words, then the words with a p-value less than 0.05 and a high chi-squared statistic would be the most valuable to study.
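A sketch of that full scan, run here on a tiny made-up stand-in for the real dataframe (the column names and cutoff mirror the ones above; on the real ~200k-row dataset the nested iterrows loop would be slow, so each question's word set is precomputed once):

```python
import pandas as pd
from scipy.stats import chisquare

# Tiny made-up stand-in for the real jeopardy dataframe
toy = pd.DataFrame({
    'clean_question': ['this greek philosopher taught plato',
                       'this philosopher wrote the republic',
                       'capital of france'],
    'high_value': [1, 0, 0],
})

# Precompute each question's word set instead of re-splitting per term
word_sets = toy['clean_question'].str.split().apply(set)
high_total = (toy['high_value'] == 1).sum()
low_total = (toy['high_value'] == 0).sum()

results = {}
for term in ['philosopher', 'capital']:
    hits = word_sets.apply(lambda s: term in s)
    high = int((hits & (toy['high_value'] == 1)).sum())
    low = int((hits & (toy['high_value'] == 0)).sum())
    prop = (high + low) / len(toy)
    results[term] = chisquare([high, low],
                              f_exp=[prop * high_total, prop * low_total])

# Terms with p < 0.05 and a large statistic would be the ones worth studying
for term, res in results.items():
    print(term, round(res.statistic, 3), round(res.pvalue, 3))
```

On the toy data neither term comes close to significance, as expected with so few rows; the point is only the shape of the computation.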