Jeopardy is a popular TV show in the US in which participants answer questions to win money. It has been running for many years and is a major force in popular culture. In this project, we'll work with a dataset of Jeopardy questions to look for patterns that could help a contestant win.
The dataset contains 216,930 rows of Jeopardy questions. Each row represents a single question asked on a single episode. Here are explanations of each column:

  * `Show Number` -- the Jeopardy episode number
  * `Air Date` -- the date the episode aired
  * `Round` -- the round of Jeopardy the question was asked in
  * `Category` -- the category of the question
  * `Value` -- the dollar value of answering the question correctly
  * `Question` -- the text of the question
  * `Answer` -- the text of the answer
import pandas as pd
import re
import random
import numpy as np
from scipy.stats import chisquare
# Read the csv file, parsing dates from " Air Date" column
jeopardy = pd.read_csv("jeopardy.csv", parse_dates = [" Air Date"])
# Display basic info
print("Shape of the dataframe:", jeopardy.shape)
print("Columns of the dataframe:", jeopardy.columns)
display(jeopardy.info())
display(jeopardy.head())
Shape of the dataframe: (216930, 7)
Columns of the dataframe: Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   Show Number  216930 non-null  int64
 1   Air Date     216930 non-null  datetime64[ns]
 2   Round        216930 non-null  object
 3   Category     216930 non-null  object
 4   Value        216930 non-null  object
 5   Question     216930 non-null  object
 6   Answer       216928 non-null  object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 11.6+ MB
None
 | Show Number | Air Date | Round | Category | Value | Question | Answer
---|---|---|---|---|---|---|---
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
# Remove white spaces from column names
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
print("Columns of the dataframe:", jeopardy.columns)
Columns of the dataframe: Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
We'll continue by converting the "Question" and "Answer" columns to lowercase and removing all punctuation.
# Define the function
def normalize(string):
    """
    Takes a string as input, transforms it to lowercase and removes punctuation.
    Args:
        string: String to be normalized
    Returns:
        normalized_string: String after lowercase transformation and punctuation removal
    """
    lower_string = str(string).lower()
    normalized_string = re.sub(r"[^\w\s]", "", lower_string)
    return normalized_string
# Apply the function to "Question" and "Answer" columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)
# Display results
display(jeopardy.head())
 | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer
---|---|---|---|---|---|---|---|---|---
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
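As a quick standalone check of the cleaning logic (the function is restated here so the snippet runs on its own), `normalize` lowercases text and strips punctuation while keeping digits and whitespace:

```python
import re

def normalize(string):
    """Lowercase a string and remove punctuation, keeping word characters and whitespace."""
    lower_string = str(string).lower()
    return re.sub(r"[^\w\s]", "", lower_string)

# Periods, colons, and semicolons are removed; digits and spaces survive
print(normalize("No. 2: 1912 Olympian; football star at Carlisle"))
# no 2 1912 olympian football star at carlisle
```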
The next step is removing the "$" sign from the "Value" column and converting it to integer type.
# Define the function
def norm_value(value):
    """
    Takes a string as input, removes all non-digit characters (including "$"),
    and converts the result to an integer.
    If conversion fails (e.g., the string contains no digits), returns zero.
    Args:
        value: Input string to be normalized
    Returns:
        int_value: Value as an integer, without the "$" character
    """
    clean_value = re.sub(r"\D", "", str(value))
    try:
        int_value = int(clean_value)
    except ValueError:
        int_value = 0
    return int_value
# Apply the function to "Value" column
jeopardy["clean_value"] = jeopardy["Value"].apply(norm_value)
# Display results
display(jeopardy.head())
 | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value
---|---|---|---|---|---|---|---|---|---|---
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
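A quick standalone check of `norm_value` (restated here with the same logic): all non-digit characters are stripped, and values with no digits fall back to zero:

```python
import re

def norm_value(value):
    """Strip non-digit characters and convert to int, falling back to 0."""
    clean_value = re.sub(r"\D", "", str(value))
    try:
        return int(clean_value)
    except ValueError:
        return 0

print(norm_value("$1,000"))  # 1000
print(norm_value("None"))    # 0 (no digits left after cleaning)
```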
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

  * How often the answer can be deduced from the question.
  * How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (more than six characters) reoccur, and the first by seeing how many words in the answer also occur in the question. We'll work on the first question now and come back to the second.
# Define the function
def match_words(row):
    """
    Takes a row as input and returns the proportion of words in the "clean_answer"
    column that also appear in the "clean_question" column.
    Args:
        row: Row of the dataframe to work with
    Returns:
        p_answer_in_question: Proportion of answer words that also occur in the question
    """
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    # "the" is too common to be meaningful, so drop it from the answer
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    p_answer_in_question = match_count / len(split_answer)
    return p_answer_in_question
# Apply the function to the dataframe
jeopardy["answer_in_question"] = jeopardy.apply(match_words, axis=1)
# Calculate the mean value of the new column
mean_value = jeopardy["answer_in_question"].mean()
# Display the results
print("The mean percentage of words in the answer that also occur in the question is", round(mean_value * 100, 1), "%")
The mean percentage of words in the answer that also occur in the question is 5.8 %
The analysis above shows that the answer appears in the question only about 6% of the time, so we can't count on simply deducing the answer from the wording of the question; some studying will be necessary.
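Because `match_words` only indexes its argument by column name, it can be exercised on a plain dict. Here is a standalone sketch with a made-up row (the function is restated so the snippet runs on its own):

```python
def match_words(row):
    """Return the proportion of answer words that also appear in the question."""
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

row = {"clean_question": "this city is the capital of france",
       "clean_answer": "the city of paris"}
# "the" is dropped; "city" and "of" match, "paris" does not -> 2/3
print(round(match_words(row), 3))  # 0.667
```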
Let's say we want to investigate how often new questions are repeats of older ones. We can't answer this completely, since the dataset doesn't cover every question ever asked on the show, but we can at least look for evidence. We'll only analyze words with six or more characters, to filter out common words like "the" and "than" that don't tell us much about a question.
# Initiate an empty list and set to be used later
question_overlap = []
terms_used = set()
# Sort the dataframe by the "AirDate" column
jeopardy = jeopardy.sort_values(by="AirDate")
# For each row, split "clean_question", keep words with more than 5 characters,
# and check whether each word has already been seen (the "terms_used" set).
# Calculate the proportion of previously used words per row.
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
# Assign the list of proportions calculated above to "question_overlap" column.
jeopardy["question_overlap"] = question_overlap
# Convert the column to float type
jeopardy["question_overlap"] = jeopardy["question_overlap"].astype(float)
# Calculate the mean proportion of the new column
mean_val = jeopardy["question_overlap"].mean()
# Display the last rows (later questions are more likely to have nonzero overlap, since more terms have been seen by then)
display(jeopardy.tail())
# Display the mean overlap
print("The mean overlap between terms in new questions and terms in older questions is", round(mean_val * 100, 1), "%")
 | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | question_overlap
---|---|---|---|---|---|---|---|---|---|---|---|---
105947 | 6300 | 2012-01-27 | Jeopardy! | VISITING THE CITY | $800 | There's a great opera house on Bennelong Point... | Sydney | theres a great opera house on bennelong point ... | sydney | 800 | 0.0 | 1.000000 |
105948 | 6300 | 2012-01-27 | Jeopardy! | PANTS | $1,400 | Tight-fitting pants patterned after those worn... | toreador pants | tightfitting pants patterned after those worn ... | toreador pants | 1400 | 0.5 | 1.000000 |
105949 | 6300 | 2012-01-27 | Jeopardy! | CHILD ACTORS | $800 | This kid, with a familiar last name, is seen <... | Jaden Smith | this kid with a familiar last name is seen a h... | jaden smith | 800 | 0.0 | 0.500000 |
105951 | 6300 | 2012-01-27 | Jeopardy! | LESSER-KNOWN SCIENTISTS | $800 | Joseph Lagrange insisted on 10 as the basic un... | the metric system | joseph lagrange insisted on 10 as the basic un... | the metric system | 800 | 0.5 | 0.777778 |
105930 | 6300 | 2012-01-27 | Jeopardy! | PANTS | $200 | A synonym for freight, or pants with large bel... | cargo pants | a synonym for freight or pants with large bell... | cargo pants | 200 | 0.5 | 0.833333 |
The mean overlap between terms in new questions and terms in older questions is 87.3 %
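The overlap computation above can be sketched on a toy list of questions (made-up examples, not rows from the dataset): each question's longer words are checked against the set of terms seen so far, then added to it:

```python
questions = [
    "galileo was under house arrest",
    "galileo improved the telescope design",
    "who improved the telescope",
]
terms_used = set()
overlaps = []
for q in questions:
    # Keep only words with more than 5 characters
    words = [w for w in q.split() if len(w) > 5]
    # Count how many of them have appeared in an earlier question
    match_count = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(match_count / len(words) if words else 0)
print(overlaps)  # [0.0, 0.25, 1.0]
```

The third question's long words ("improved", "telescope") have all been seen before, so its overlap is 1.0.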
# Create a column with value 1 for "clean_value" of 800 or more, and 0 otherwise
jeopardy["high_value"] = jeopardy["clean_value"].apply(lambda x: 1 if x >= 800 else 0)
# Define the function
def high_low_counts(word):
    """
    Takes a word as input and returns the number of times it appeared
    in high value and low value questions.
    Args:
        word: Word to be searched for in past questions
    Returns:
        high_count: Number of times the word appeared in high value questions
        low_count: Number of times the word appeared in low value questions
    """
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row["clean_question"].split()
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
# Take a random sample from the terms_used list to test the function
random.seed(1)
comparison_terms = random.sample(list(terms_used), 10)
print("The random sample of terms is:", comparison_terms)
# Apply the function in the sample, and display the results
observed_expected = []
for word in comparison_terms:
observed_expected.append(high_low_counts(word))
print("The observed high and low values counts are:", observed_expected)
The random sample of terms is: ['hrefhttpwwwjarchivecommedia20010717_j_01mp3johnny', 'offending', 'behaves', 'thiswell', 'hydrants', 'hrefhttpwwwjarchivecommedia20080321_dj_27jpg', 'voorhis', 'neosho', 'hrefhttpwwwjarchivecommedia20051031_dj_05ajpg', 'schulman']
The observed high and low values counts are: [(0, 1), (1, 4), (0, 4), (0, 1), (1, 1), (1, 0), (1, 1), (0, 1), (1, 0), (1, 0)]
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
high_value_count = jeopardy["high_value"].value_counts()[1]
low_value_count = jeopardy["high_value"].value_counts()[0]
chi_squared = []
# For each term, derive the expected high/low counts from its overall frequency
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    expected_high = high_value_count * total_prop
    expected_low = low_value_count * total_prop
    chisq_value, pvalue = chisquare(np.array(obs), np.array([expected_high, expected_low]))
    chi_squared.append([chisq_value, pvalue])
display(chi_squared)
[[0.7544157608695651, 0.38508176583769604], [1.0792362429771156, 0.29886847648042514], [3.0176630434782603, 0.08236207316095237], [0.7544157608695651, 0.38508176583769604], [0.039972400921050075, 0.8415345528964892], [1.325529040972535, 0.24960216618620146], [0.039972400921050075, 0.8415345528964892], [0.7544157608695651, 0.38508176583769604], [1.325529040972535, 0.24960216618620146], [1.325529040972535, 0.24960216618620146]]
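To make the expected-count arithmetic concrete, here is a standalone sketch for a single term. The observed pair (1, 4) matches the term "offending" above, but the high/low question totals are illustrative round numbers, not the notebook's actual counts:

```python
import numpy as np
from scipy.stats import chisquare

# Assumed illustrative totals (not the real dataset counts)
high_value_count, low_value_count = 5000, 15000
n_rows = high_value_count + low_value_count

observed = np.array([1, 4])           # (high_count, low_count) for one term
total_prop = observed.sum() / n_rows  # share of questions containing the term
expected = np.array([high_value_count * total_prop,
                     low_value_count * total_prop])
chisq, pvalue = chisquare(observed, expected)
print(np.round(expected, 2))  # [1.25 3.75]
print(round(chisq, 3))        # 0.067
print(pvalue > 0.05)          # True: no evidence of a high/low difference
```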
# Transform the observed_expected values to a dataframe
# (high_low_counts returns (high_count, low_count), so the high value column comes first)
results = pd.DataFrame(observed_expected, index=comparison_terms, columns=["High value count", "Low value count"])
# Add the chi square and p values as columns
results["Chi"] = chi_squared
results[["Chi Square", "p value"]] = pd.DataFrame(results.Chi.tolist(), index=results.index)
results.drop("Chi", axis=1, inplace=True)
# Display the results
display(results)
 | High value count | Low value count | Chi Square | p value
---|---|---|---|---
hrefhttpwwwjarchivecommedia20010717_j_01mp3johnny | 0 | 1 | 0.754416 | 0.385082 |
offending | 1 | 4 | 1.079236 | 0.298868 |
behaves | 0 | 4 | 3.017663 | 0.082362 |
thiswell | 0 | 1 | 0.754416 | 0.385082 |
hydrants | 1 | 1 | 0.039972 | 0.841535 |
hrefhttpwwwjarchivecommedia20080321_dj_27jpg | 1 | 0 | 1.325529 | 0.249602 |
voorhis | 1 | 1 | 0.039972 | 0.841535 |
neosho | 0 | 1 | 0.754416 | 0.385082 |
hrefhttpwwwjarchivecommedia20051031_dj_05ajpg | 1 | 0 | 1.325529 | 0.249602 |
schulman | 1 | 0 | 1.325529 | 0.249602 |
The results above show that, for the sampled terms, none of the chi-squared results is statistically significant (all p-values are above 0.05), so there's no evidence that any of them appears more often in high value than in low value questions. Note also that the observed frequencies are all very low (five or fewer), which makes the chi-squared test unreliable here; it would be better to rerun it with only higher-frequency terms.
The present project has shown some interesting results from the data on the Jeopardy TV quiz:

  * Words from the answer appear in the question only about 6% of the time, so we can't count on deducing the answer from the question itself.
  * Roughly 87% of the longer terms in new questions have already appeared in older questions, suggesting substantial recycling of topics.
  * For a small sample of terms, we found no statistically significant difference in usage between high value and low value questions.

Therefore, if we were to study for the contest, we would bet on studying questions that have already been used.