Preparing for Jeopardy¶

Jeopardy is an American TV show in which participants answer questions to win money. It has been running for many years. Over multiple rounds, contestants can choose a category, and get a question from that category, where different questions have different dollar values. A more extensive description can be found here.

Imagine you want to participate in Jeopardy - and win. And you wonder yourself: how do I prepare for this? Is it just a matter of studying a lot? Or possibly, is there something to learn from questions from the past?

A couple of years ago, someone crawled Jeopardy archives and posted on Reddit a short article from which one can download a file with no less then 216,930 earlier Jeopardy questions, with answers and other data.

In this notebook, we are going to explore if there is something to learn from Jeopardy history that will help you prepare. The notebook contains the following sections:

Initial data exploration
Data re-formatting

2.1 Columns "Question" and "Answer"
2.2 Column "Value"
2.3 Column "Air Date" 3. Data analysis
3.1 Answer included in question
3.2 Repeated and popular terms
3.3 Terms used in high-value questions
4. Conclusions

1. Initial data exploration¶

Let's start with reading in the data (I took the .csv file) and explore it.

In [1]:

# Import pandas library 
import pandas as pd

# Import the data into a dataframe
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')

# Show some rows
jeopardy.head(5)

Out[1]:

	Show Number	Air Date	Round	Category	Value	Question	Answer
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams

In [2]:

# Get column information
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
 Air Date      216930 non-null object
 Round         216930 non-null object
 Category      216930 non-null object
 Value         216930 non-null object
 Question      216930 non-null object
 Answer        216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB

In [3]:

# Get the number of rows and columns
jeopardy.shape

Out[3]:

(216930, 7)

Some column names appear to have leading spaces. That's inconvenient, so let's remove those.

In [4]:

# Remove the leading spaces from the column names
jeopardy.columns = jeopardy.columns.str.strip()

# Check the result
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null object
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB

In [5]:

# Check from when the questions originate 
jeopardy['Air Date'].value_counts().sort_index()

Out[5]:

1984-09-10    48
1984-09-11    50
1984-09-12    51
1984-09-13    53
1984-09-14    54
              ..
2012-01-20    58
2012-01-23    61
2012-01-24    59
2012-01-25    61
2012-01-27    30
Name: Air Date, Length: 3640, dtype: int64

In [6]:

# Show a sample again with better layout

# Avoid truncation
pd.set_option('display.max_colwidth', -1)
# Display with left alignment
jeopardy.head(10).style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

Out[6]:

	Show Number	Air Date	Round	Category	Value	Question	Answer
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory	Copernicus
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves	Jim Thorpe
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record average of 4,055 hours of sunshine each year	Arizona
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", this company served its billionth burger	McDonald's
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States	John Adams
5	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$200	In the title of an Aesop fable, this insect shared billing with a grasshopper	the ant
6	4680	2004-12-31	Jeopardy!	HISTORY	$400	Built in 312 B.C. to link Rome & the South of Italy, it's still in use today	the Appian Way
7	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$400	No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls	Michael Jordan
8	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$400	In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state	Washington
9	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$400	This housewares store was named for the packaging its merchandise came in & was first displayed on	Crate & Barrel

In [7]:

# Summary of the questions that we have
print(jeopardy.shape[0],'questions dating from', jeopardy['Air Date'].min(),'to', jeopardy['Air Date'].max())

216930 questions dating from 1984-09-10 to 2012-01-27

In [8]:

# Check how many questions per round
jeopardy['Round'].value_counts()

Out[8]:

Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!     3631  
Tiebreaker          3     
Name: Round, dtype: int64

So we have 216,930 questions from between 1984 and 2012. It looks like the data in the columns is pretty complete (almost no missing values, only 2 answers). Most columns took the format of an object.

One somewhat surprising observation are the "Questions" and "Answers". From a game-play description I understood that participants do not so much get a "question" that they need to "answer", but rather get an "answer" for which they need to come up with the right "question". The sample above does not really show that. If someone is told "Mc Donald's", I can hardly imagine someone asking which fast food chain served its billionth burger live on The Art Linkletter Show in 1963.

I am not sure about the cause of this descrepancy between (my understanding of) Jeopardy game-play and the question-and-answer-archive. However, for the purpose of our study, it seems okay to just consider this "questions" with "answers".

2. Data re-formatting¶

For analysis purposes, it is helpful though to reformat and normalize parts of the data. That's what we will do in this section.

2.1 Columns "Question" and "Answer"¶

To be able to correctly count words (in Question and Answer):

remove interpunction
put everything in lowercase

In [9]:

# Import re library to enable reformatting 
import re

# Create a function that takes a string and returns it normalized (no interpunction, all lowercase)
def normalize_string(input):
    replaced_interpunction = re.sub(r'\W', ' ', input).lower()
    removed_spaces = re.sub(' +',' ', replaced_interpunction).strip()
    return removed_spaces

In [10]:

# Test function
normalize_string('Hello!! Do DO dO  16:17 ?two2,and:FOO, bar?')

Out[10]:

'hello do do do 16 17 two2 and foo bar'

Looks good. Let's apply this to columns Question and Answer. We'll add new columns with the result.

In [11]:

# Add 2 columns, with normalized versions of Question and Answer. (Offer a 'string versions' of the objects.)
jeopardy['question_clean'] = jeopardy['Question'].astype('str').apply(normalize_string)
jeopardy['answer_clean'] = jeopardy['Answer'].astype('str').apply(normalize_string)

In [12]:

# Show result on a random sample
jeopardy.sample(10, random_state = 0)

Out[12]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	question_clean	answer_clean
112079	3452	1999-09-14	Double Jeopardy!	SWEET 16	$1000	This king had a lot taken off the top January 21, 1793	Louis XVI	this king had a lot taken off the top january 21 1793	louis xvi
50465	3967	2001-11-27	Jeopardy!	MISS UNIVERSE	$800	Crowned in Cyprus in May 2000, Bombay U. grad Lara Dutta represented this country	India	crowned in cyprus in may 2000 bombay u grad lara dutta represented this country	india
71223	5920	2010-05-14	Double Jeopardy!	FAUX	$800	Nickname of Sam, leader of The Pharaohs, who sang "Wolly Bully"	"The Sham"	nickname of sam leader of the pharaohs who sang wolly bully	the sham
26234	2916	1997-04-14	Double Jeopardy!	FICTIONAL FEMALES	$800	Miranda, a young woman, appears in several of her works, including "Pale Horse, Pale Rider"	Katherine Anne Porter	miranda a young woman appears in several of her works including pale horse pale rider	katherine anne porter
86973	1295	1990-03-30	Jeopardy!	FAMOUS JOES & JOSEPHS	$200	This Delaware senator chairs the Senate Judiciary Committee	Joseph Biden	this delaware senator chairs the senate judiciary committee	joseph biden
127358	5685	2009-05-01	Double Jeopardy!	BROWN	$800	Between 1960 & 1986, he racked up 44 Top 40 hits, but no No. 1s	James Brown	between 1960 1986 he racked up 44 top 40 hits but no no 1s	james brown
148314	3810	2001-03-09	Jeopardy!	IN A MINUTE	$500	Under the slogan "Real Estate for the Real World", this company claims on average to buy or sell a home every minute	Century 21	under the slogan real estate for the real world this company claims on average to buy or sell a home every minute	century 21
115787	2883	1997-02-26	Double Jeopardy!	BLACK AMERICA	$1000	Ebony & Jet are among the magazines launched by this publisher	John Johnson	ebony jet are among the magazines launched by this publisher	john johnson
118519	3472	1999-10-12	Jeopardy!	A LITTLE DICKENS	$500	It's no mystery why this work was Dickens' last; he didn't live to finish it	The Mystery of Edwin Drood	it s no mystery why this work was dickens last he didn t live to finish it	the mystery of edwin drood
193578	5447	2008-04-22	Jeopardy!	MELROSE PLACE	$400	Hey, nice to meet <a href="http://www.j-archive.com/media/2008-04-22_J_08.jpg" target="_blank">this</a> actress who played Jennifer Mancini in 1997; "Charmed", I'm sure	Alyssa Milano	hey nice to meet a href http www j archive com media 2008 04 22_j_08 jpg target _blank this a actress who played jennifer mancini in 1997 charmed i m sure	alyssa milano

Looks good. The last one in the table shows there is some messy data, where a hyperlink to a picture was included in the data.

We'll ignore that for now, but let's keep it in mind.

2.2 Column "Value"¶

Next, we'll change column Value into a numeric field, to be able to manipulate it easier. In the samples so far we see entries like \$200 and \$1,800. Let's first check if there is more.

In [13]:

# Check which different values there are for field Value
jeopardy['Value'].unique()

Out[13]:

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263',
       '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500',
       '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250',
       '$3,989', '$8,917', '$9,500', '$1,246', '$6,435', '$8,800',
       '$2,222', '$2,746', '$10,400', '$7,600', '$6,700', '$5,100',
       '$13,200', '$4,300', '$1,407', '$12,400', '$5,401', '$7,800',
       '$1,183', '$1,203', '$13,000', '$11,600', '$14,200', '$1,809',
       '$8,400', '$8,700', '$11,000', '$5,201', '$1,801', '$3,499',
       '$5,700', '$601', '$4,008', '$50', '$2,344', '$2,811', '$18,000',
       '$1,777', '$3,599', '$9,800', '$796', '$3,150', '$20', '$1,810',
       '$22', '$9,200', '$1,512', '$8,500', '$585', '$1,534', '$13,800',
       '$5,001', '$4,238', '$16,400', '$1,347', '$2547', '$11,200'],
      dtype=object)

In [14]:

# Check the amount of 'None' values
len(jeopardy[jeopardy['Value']=='None'])

Out[14]:

Not entirely sure what they are, but let's replace all 'None' values with 0. Convert everything else to a number.

In [15]:

# Create a function that takes a string as in Value column and returns a number
def normalize_value(input):
    if input == 'None':
        output = 0
    else:
        keep_numbers = input.replace('$','').replace(',','')
        output = int(keep_numbers)
    #replaced_interpunction = re.sub(r'\W', ' ', input).lower()
    #removed_spaces = re.sub(' +',' ', replaced_interpunction).strip()
    return output

In [16]:

# Test the function
print (normalize_value('None'), normalize_value('$200'), normalize_value('$1,534'), normalize_value('$200')+normalize_value('$1,534'))

0 200 1534 1734

Looks good. Let's apply this to column Value. We'll add a new column with the result.

In [17]:

# Add a column, with normalized versions of Value. (Offer a 'string version' of the object.)
jeopardy['value_clean'] = jeopardy['Value'].astype('str').apply(normalize_value)

In [18]:

# Verification 1: check that all are numbers, by summing them
print('Total is:', jeopardy['value_clean'].sum())
# Verification 2: show all values
print(sorted(jeopardy['value_clean'].unique()))

Total is: 160525700
[0, 5, 20, 22, 50, 100, 200, 250, 300, 350, 367, 400, 500, 585, 600, 601, 700, 750, 796, 800, 900, 1000, 1020, 1100, 1111, 1183, 1200, 1203, 1246, 1263, 1300, 1347, 1400, 1407, 1492, 1500, 1512, 1534, 1600, 1700, 1777, 1800, 1801, 1809, 1810, 1900, 2000, 2001, 2021, 2100, 2127, 2200, 2222, 2300, 2344, 2400, 2500, 2547, 2600, 2700, 2746, 2800, 2811, 2900, 2990, 3000, 3100, 3150, 3200, 3201, 3300, 3389, 3400, 3499, 3500, 3599, 3600, 3700, 3800, 3900, 3989, 4000, 4008, 4100, 4200, 4238, 4300, 4400, 4500, 4600, 4637, 4700, 4800, 5000, 5001, 5100, 5200, 5201, 5400, 5401, 5500, 5600, 5700, 5800, 6000, 6100, 6200, 6300, 6400, 6435, 6600, 6700, 6800, 7000, 7200, 7400, 7500, 7600, 7800, 8000, 8200, 8400, 8500, 8600, 8700, 8800, 8917, 9000, 9200, 9500, 9800, 10000, 10400, 10800, 11000, 11200, 11600, 12000, 12400, 13000, 13200, 13800, 14000, 14200, 16400, 18000]

Looks good.

2.3 Column "Air Date"¶

Next, we'll turn Air Date into a date field, which is easier to analyze. We'll add a new column.

In [19]:

# Add a column, with the airdate as date-time
jeopardy['date_clean'] = pd.to_datetime(jeopardy['Air Date'], format = '%Y-%M-%d')
# Check result
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 11 columns):
Show Number       216930 non-null int64
Air Date          216930 non-null object
Round             216930 non-null object
Category          216930 non-null object
Value             216930 non-null object
Question          216930 non-null object
Answer            216928 non-null object
question_clean    216930 non-null object
answer_clean      216930 non-null object
value_clean       216930 non-null int64
date_clean        216930 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(8)
memory usage: 18.2+ MB

In [20]:

# Check a sample
jeopardy.head(2)

Out[20]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	question_clean	answer_clean	value_clean	date_clean
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory	Copernicus	for the last 8 years of his life galileo was under house arrest for espousing this man s theory	copernicus	200	2004-01-31 00:12:00
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves	Jim Thorpe	no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves	jim thorpe	200	2004-01-31 00:12:00

Looks good.

3. Data analysis¶

Now we have a set of data in an easy-to-analyze format we can start analysis.

3.1 Answer included in question¶

What may happen sometimes, is that the question already contains the answers, or parts of it. If that happens a lot, that may help you develop your strategy to win.

We are going to analyze such overlap between questions and answers by calculating (for every question-with-answer) which fraction of words in the answer is also a word in the question. The words the and a will be excluded: they appear a lot, but are not meaningful for this analysis.

In [21]:

# Create a function that takes a row of the dataframe as an input, and returns
# how many times terms in answer_clean occur in question_clean
def count_overlap(row):
    # Split question and answer in individual words
    split_answer = row['answer_clean'].split()
    split_question = row['question_clean'].split()
    # print(split_question) # commented out after verifying
    
    # Remove all occurrences of 'the' from the question (as this is not meaningful)
    while 'the' in split_question:
        split_question.remove('the')
    # print(split_question) # commented out after verifying
    
    # Do the same for 'a' (added after seeing the result)
    while 'a' in split_question:
        split_question.remove('a')
    # print(split_question) # commented out after verifying
    
    # Count how many words in the answer appear in the question, calculate the fraction
    result = 0
    match_count = 0
    if len(split_answer) > 0:
        for word in split_answer:
            if word in split_question:
                match_count +=1
        # print ('match_count:', match_count) # commented out after verifying
        # print ('answer length:',len(split_answer)) # commented out after verifying
        result = match_count / len(split_answer)
    
    return result       
    

In [22]:

# test an example that contains overlap
test_row1 = jeopardy.iloc[118519]
print(test_row1['question_clean'])
print(test_row1['answer_clean'])
count_overlap(test_row1)

it s no mystery why this work was dickens last he didn t live to finish it
the mystery of edwin drood

Out[22]:

0.2

In [23]:

# test an example having 'the' in the question multiple times
# (This test tested whether all instances of 'the' were removed; after commenting out print this is not visible anymore)
test_row2 = jeopardy.iloc[7]
print(test_row2['question_clean'])
print(test_row2['answer_clean'])
count_overlap(test_row2)

no 8 30 steals for the birmingham barons 2 306 steals for the bulls
michael jordan

Out[23]:

0.0

In [24]:

# test an example where 'the' is in the overlap  question containing 'the' multiple times
test_row2 = jeopardy.iloc[5]
print(test_row2['question_clean'])
print(test_row2['answer_clean'])
count_overlap(test_row2)

in the title of an aesop fable this insect shared billing with a grasshopper
the ant

Out[24]:

0.0

Looks good so far, so let's apply this to the dataframe (add a new column), and then check some more results.

In [25]:

# Add a new column to include the indicator of overlap between question and answer
jeopardy['answer_in_question'] = jeopardy.apply(count_overlap, axis = 1)

In [26]:

# See the outcome: fraction of words in the answer that is also in the question
jeopardy['answer_in_question'].value_counts()

Out[26]:

0.000000    195068
0.500000    8179  
0.333333    6437  
0.250000    2435  
1.000000    1237  
0.200000    1151  
0.666667    654   
0.166667    467   
0.400000    350   
0.142857    210   
0.285714    129   
0.125000    118   
0.750000    98    
0.600000    73    
0.111111    51    
0.222222    40    
0.428571    36    
0.375000    28    
0.100000    21    
0.800000    20    
0.571429    14    
0.090909    12    
0.714286    9     
0.181818    9     
0.300000    9     
0.083333    8     
0.153846    7     
0.625000    7     
0.833333    7     
0.272727    5     
0.555556    3     
0.857143    3     
0.444444    3     
0.888889    2     
0.076923    2     
0.363636    2     
0.066667    2     
0.133333    2     
0.545455    2     
0.071429    2     
0.117647    1     
0.052632    1     
0.769231    1     
0.454545    1     
0.583333    1     
0.727273    1     
0.384615    1     
0.875000    1     
0.909091    1     
0.105263    1     
0.818182    1     
0.700000    1     
0.642857    1     
0.777778    1     
0.095238    1     
0.416667    1     
0.230769    1     
0.214286    1     
Name: answer_in_question, dtype: int64

In [27]:

# Check some samples with an overlap fraction of 0.5
jeopardy[jeopardy['answer_in_question'] ==0.5].head(5)

Out[27]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	question_clean	answer_clean	value_clean	date_clean	answer_in_question
53	4680	2004-12-31	Double Jeopardy!	MUSICAL TRAINS	$2000	In 1961 James Brown announced "all aboard" for this train	"Night Train"	in 1961 james brown announced all aboard for this train	night train	2000	2004-01-31 00:12:00	0.5
68	5957	2010-07-06	Jeopardy!	GEOGRAPHY "E"	$600	This island in the South Pacific is named for the day of its discovery, a religious holiday	Easter Island	this island in the south pacific is named for the day of its discovery a religious holiday	easter island	600	2010-01-06 00:07:00	0.5
80	5957	2010-07-06	Jeopardy!	GEOGRAPHY "E"	$2,000	The family history you wrote for school might include entering the U.S. at this island in New York Bay	Ellis Island	the family history you wrote for school might include entering the u s at this island in new york bay	ellis island	2000	2010-01-06 00:07:00	0.5
83	5957	2010-07-06	Jeopardy!	BE FRUITFUL & MULTIPLY	$1000	2 x 1,035	2,070	2 x 1 035	2 070	1000	2010-01-06 00:07:00	0.5
112	5957	2010-07-06	Double Jeopardy!	JUST THE FACTS	$2000	He's the older son of Prince Charles and the late Princess Diana	Prince William	he s the older son of prince charles and the late princess diana	prince william	2000	2010-01-06 00:07:00	0.5

In [28]:

# Check some samples with an overlap fraction of 0.8
jeopardy[jeopardy['answer_in_question'] ==0.8].head(5)

Out[28]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	question_clean	answer_clean	value_clean	date_clean	answer_in_question
8375	4787	2005-05-31	Jeopardy!	TALK LIKE A BRIT	$400	Of stay in bed, hit someone on the head or rub till it's red, what you do if you cosh	hit someone on the head	of stay in bed hit someone on the head or rub till it s red what you do if you cosh	hit someone on the head	400	2005-01-31 00:05:00	0.8
10975	4362	2003-07-15	Jeopardy!	STUPID ANSWERS	$1000	It's the state song of the state of Maine	"The State of Maine Song"	it s the state song of the state of maine	the state of maine song	1000	2003-01-15 00:07:00	0.8
15184	3340	1999-02-26	Double Jeopardy!	PUT 'EM IN ORDER	$200	Oscar Winners "The English Patient", "Unforgiven", "Braveheart"	Unforgiven, Braveheart, The English Patient	oscar winners the english patient unforgiven braveheart	unforgiven braveheart the english patient	200	1999-01-26 00:02:00	0.8
19375	6070	2011-01-21	Double Jeopardy!	JOB HUNTING	$1600	In a 60-year-old man age really takes its toll on the body, no matter which sport he works in	manager (in man age really)	in a 60 year old man age really takes its toll on the body no matter which sport he works in	manager in man age really	1600	2011-01-21 00:01:00	0.8
19644	4583	2004-07-07	Jeopardy!	WHAT'S THE NEXT LINE?	$800	"Yes we have no bananas..."	"...We have no bananas today"	yes we have no bananas	we have no bananas today	800	2004-01-07 00:07:00	0.8

In [29]:

# Check some samples with an overlap fraction of 1.0
jeopardy[jeopardy['answer_in_question'] ==1.0].head(5)

Out[29]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	question_clean	answer_clean	value_clean	date_clean	answer_in_question
266	4931	2006-02-06	Double Jeopardy!	NOT A CURRENT NATIONAL CAPITAL	$400	Ljubljana, Bratislava, Barcelona	Barcelona	ljubljana bratislava barcelona	barcelona	400	2006-01-06 00:02:00	1.0
272	4931	2006-02-06	Double Jeopardy!	NOT A CURRENT NATIONAL CAPITAL	$800	Istanbul, Ottawa, Amman	Istanbul	istanbul ottawa amman	istanbul	800	2006-01-06 00:02:00	1.0
278	4931	2006-02-06	Double Jeopardy!	NOT A CURRENT NATIONAL CAPITAL	$1200	Sofia, Sarajevo, Saigon	Saigon	sofia sarajevo saigon	saigon	1200	2006-01-06 00:02:00	1.0
284	4931	2006-02-06	Double Jeopardy!	NOT A CURRENT NATIONAL CAPITAL	$1600	Bucharest, Bonn, Bern	Bonn	bucharest bonn bern	bonn	1600	2006-01-06 00:02:00	1.0
290	4931	2006-02-06	Double Jeopardy!	NOT A CURRENT NATIONAL CAPITAL	$2000	Belize City, Guatemala City, Panama City	Belize City	belize city guatemala city panama city	belize city	2000	2006-01-06 00:02:00	1.0

In [30]:

# Calculate the mean of the overlap fraction
jeopardy['answer_in_question'].mean()

Out[30]:

0.042752736697187994

What we can observe:

for the vast majority of questions (195K out of 216K) there is no overlap at all: the answer does not appear in the question; and on average only 0.04 words of the answer were part of the question
then, when looking at examples with relatively much overlap, it's not going to be helpful; e.g. multiple choice questions where the answer is indeed part of the question; or the question asks 'which island' and the answer contains 'island'

A strategy where you hope to find answers in the questions themselves is not going to help you whatsoever to win Jeopardy.

3.2 Repeated and popular terms¶

Jeopardy has been running for many, many years. One may wonder till what extent the same questions are repeated.

We are going to analyze, not by finding questions that are literally the same, but by analyzing for all questions which fraction of the words in those questions appeared in questions before as well. We will focus on longer terms (6 characters or more), as those are the termss that typically form the 'heart' of the question. Shorter words (not only 'the' and 'a', but also e.g. 'more' and 'each') that are repeated won't teach us a lot.

We'll on the fly also simply create an overview of the terms that appear most in the questions, including a count how many times. That could be interesting information as well.

In [31]:

# Iterate over the rows of the jeopardy dataframe, and calculates the fraction of terms (words longer than 5 characters)
# that appeared in earlier questions as well. Store these values in a list.

# On the fly, create a dictionary of all terms and their frequency over all questions.

# Initate a list with the overlap_count, a set with all terms used, and a dictionary with all words used
questions_overlap = []
terms_used = set()
terms_dictionary = {}

# Iterare 
for i, row in jeopardy.iterrows():
    split_question = row['question_clean'].split()
    # print(split_question) # commented out after verifying
    # Only keep the words 6 characters or longer 
    split_question = [word for word in split_question if len(word) >5]
    # print(split_question) # commented out after verifying
    
    match_count = 0
    
    # Check for every term whether used before already. If so: increase counts. If not: add to set and dictionary.
    for word in split_question:
        if word in terms_used:
            match_count += 1
            terms_dictionary[word]+=1
        else:
            terms_dictionary[word] = 1
        terms_used.add(word)
    
    # Calculate fraction 
    if len(split_question)>0:
        fraction_used_before = match_count / len(split_question)
    
    # Append fraction to list
    questions_overlap.append(fraction_used_before)

In [32]:

# Add the calculated fractions (a list with the correct length and in correct sequence) as a new column to the datafame
jeopardy['fraction_overlap_before'] = questions_overlap

In [33]:

# Take a look at the final rows in the dataframe, where one would expect no overlap with earliers questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].head(5)

Out[33]:

	date_clean	question_clean	answer_clean
0	2004-01-31 00:12:00	for the last 8 years of his life galileo was under house arrest for espousing this man s theory	copernicus
1	2004-01-31 00:12:00	no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves	jim thorpe
2	2004-01-31 00:12:00	the city of yuma in this state has a record average of 4 055 hours of sunshine each year	arizona
3	2004-01-31 00:12:00	in 1963 live on the art linkletter show this company served its billionth burger	mcdonald s
4	2004-01-31 00:12:00	signer of the dec of indep framer of the constitution of mass second president of the united states	john adams

In [34]:

# Take a look at the final rows in the dataframe, where one would expect no overlap with earliers questions
jeopardy[['date_clean','question_clean','answer_clean','fraction_overlap_before']].tail(20)

Out[34]:

	date_clean	question_clean	answer_clean	fraction_overlap_before
216910	2006-01-11 00:05:00	in his prime this athlete said it s hard to be humble when you re as great as i am	muhammad ali	1.000000
216911	2006-01-11 00:05:00	it s home to the holmenkollen ski jump	oslo	1.000000
216912	2006-01-11 00:05:00	we d like to enlighten you about the musical sidd it s based on this novel	siddhartha	0.500000
216913	2006-01-11 00:05:00	he created the musical riddles called the enigma variations	edward elgar	1.000000
216914	2006-01-11 00:05:00	one species of this bird breeds in the arctic tundra vacations at the other end of the globe	a tern	1.000000
216915	2006-01-11 00:05:00	in his teens he worked in an assistant d a s office later his perry mason character made fools of d a s	erle stanley gardner	1.000000
216916	2006-01-11 00:05:00	oscar wilde called this 4 letter word the curse of the drinking classes	work	1.000000
216917	2006-01-11 00:05:00	guyanese capital named for a hanoverian monarch	georgetown	1.000000
216918	2006-01-11 00:05:00	a naughty 18th c novel originally titled memoirs of a woman of pleasure inspired the 2006 musical named for her	fanny hill	1.000000
216919	2006-01-11 00:05:00	if this riddling belgian surrealist painter born 1898 worked for jeopardy he might write this is not a clue	magritte	0.833333
216920	2006-01-11 00:05:00	nightingales robins belong to this family of melodious songbirds	thrushes	0.833333
216921	2006-01-11 00:05:00	her hotsy totsy diaries trace back to one she began as an 11 year old aboard ship in 1914	anaïs nin	1.000000
216922	2006-01-11 00:05:00	a motto of hers was in politics if you want anything said ask a man if you want anything done ask a woman	margaret thatcher	1.000000
216923	2006-01-11 00:05:00	it s on the suriname river	paramaribo	1.000000
216924	2006-01-11 00:05:00	in 2006 the cast of this long running hit embarked on a href http www j archive com media 2006 05 11_dj_26 wmv an exuberant noisy campaign a to clean up new york city	stomp	1.000000
216925	2006-01-11 00:05:00	this puccini opera turns on the solution to 3 riddles posed by the heroine	turandot	1.000000
216926	2006-01-11 00:05:00	in north america this term is properly applied to only 4 species that are crested including the tufted	a titmouse	1.000000
216927	2006-01-11 00:05:00	in penny lane where this hellraiser grew up the barber shaves another customer then flays him alive	clive barker	1.000000
216928	2006-01-11 00:05:00	from ft sill okla he made the plea arizona is my land my home my father s land to which i now ask to return	geronimo	1.000000
216929	2006-01-11 00:05:00	a silent movie title includes the last name of this 18th c statesman favorite of catherine the great	grigori alexandrovich potemkin	1.000000

In [35]:

# Take a look at the fraction-overlap in the final 100 rows
print(jeopardy.tail(100)['fraction_overlap_before'].value_counts())

1.000000    80
0.800000    5 
0.833333    3 
0.875000    3 
0.888889    2 
0.400000    1 
0.600000    1 
0.666667    1 
0.500000    1 
0.857143    1 
0.750000    1 
0.000000    1 
Name: fraction_overlap_before, dtype: int64

So it seems that almost all words have been used before.... (For what it's worth.)

In [36]:

# Calculate the mean value of this fraction-overlap 
import numpy as np
np.mean(questions_overlap)

Out[36]:

0.9225954554223076

So what we can observe is that most longer words that are used in questions, have been used before. Certainly for the later episodes. That is not so surprising, given that we have more than 20 years worth of questions.

I am not convinced though that this piece of knowledge is going to help a lot.

Possibly a more complex analysis could help:

rather than looking at individual words, look at combinations of words
don't look back to all history, but only to e.g. the last one or two years if there is any overlap

These are complex analyses to do so, and I am not really convinced that it will give a lot of insight.

What we can still do is just take some frequently used terms, and look at some questions that include those terms. Do we happen to see any overlap? For this, we can use the dictionary with word counts that we created. (Which is on overview that by itself can be interesting already.)

In [37]:

# Create (and show) a dictionary with those terms that are used more than 1000 times in questions
top_terms = {k:v for (k,v) in terms_dictionary.items() if v > 1000}

for k in sorted(top_terms, key=top_terms.get, reverse=True):
    print(k, top_terms[k])

archive 12979
target 10717
_blank 10649
country 6045
called 5487
president 3294
american 3210
became 3165
played 3014
before 2920
capital 2883
french 2576
famous 2560
island 2534
people 2341
letter 2318
largest 2150
company 2133
author 2074
during 2001
national 1988
british 1976
century 1922
character 1868
little 1831
around 1823
between 1682
series 1636
family 1602
meaning 1571
founded 1546
school 1423
include 1409
million 1377
america 1350
museum 1332
university 1332
number 1331
popular 1322
musical 1322
english 1313
because 1274
second 1268
classic 1268
reports 1268
through 1245
father 1203
person 1177
george 1158
german 1151
general 1130
england 1100
leader 1095
nation 1074
created 1060
italian 1046
william 1034
former 1032

A lot of terms that are used more than 1000 times in questions (over the course of 20 years).

The ones at the very top ('archive', 'target', and '_blank') are surprising actually.

Let's go have a look at examples with some of some frequently-used terms.

In [38]:

# Function to print examples of a term
# It takes a term and a number n as its inputs
# Then the first n questions containing the term will be printed

def print_n_examples_for_term(term, n):
    printed = 0
    row_index = 0
    while printed < n:
        row = jeopardy.iloc[row_index]
        split_question = row['question_clean'].split()
        split_question = [word for word in split_question if len(word) >5]
        # print(split_question) # commented out after verifying
        if term in split_question:
            print(row['Question'])
            printed +=1
        row_index +=1

In [39]:

# Print for 'galileo' to test (as we know the first question in the dataframe contains Galileo)
print_n_examples_for_term('galileo', 3)

For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
The 4 largest moons of this planet are called Galilean satellites after Galileo, who saw them in 1610
Galileo was the first person to see the rings around this planet

In [40]:

# Print 5 examples for 'archive'
print_n_examples_for_term('archive', 5)

<a href="http://www.j-archive.com/media/2004-12-31_DJ_23.mp3">Beyond ovoid abandonment, beyond ovoid betrayal... you won't believe the ending when he "Hatches the Egg"</a>
The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
<a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>
<a href="http://www.j-archive.com/media/2004-12-31_DJ_25.mp3">Somewhere between truth & fiction lies Marco's reality... on Halloween, you won't believe you saw it on this St.</a>
<a href="http://www.j-archive.com/media/2004-12-31_DJ_24.mp3">"500 Hats"... 500 ways to die.  On July 4th, this young boy will defy a king... & become a legend</a>

In [41]:

# Print 5 examples for '_blank'
print_n_examples_for_term('_blank', 5)

The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see
Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859
<a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire

In [42]:

# Print 5 examples for '_blank'
print_n_examples_for_term('target', 5)

The shorter glass seen <a href="http://www.j-archive.com/media/2004-12-31_DJ_12.jpg" target="_blank">here</a>, or a quaint cocktail made with sugar & bitters
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_26.jpg" target="_blank">this</a> type of mollusk you see
Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859
<a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion
Say the name of <a href="http://www.j-archive.com/media/2010-07-06_DJ_28.jpg" target="_blank">this</a> bug; don't worry, it doesn't breathe fire

In [43]:

# Print 10 examples for 'country'
print_n_examples_for_term('country', 10)

Africa's lowest temperature was 11 degrees below zero in 1935 at Ifrane, just south of Fez in this country
Cross-country skiing is sometimes referred to by these 2 letters, the same ones used to denote 90 in Roman numerals
Parts of the Arabian and Libyan deserts are found in this African country
A 7.0 magnitude earthquake in this Caribbean country Jan. 12, 2010 brought a world outpouring of aid
Andy Garcia is a native of this country whose flag is seen here
This Mediterranean country whose flag is seen here is "The Word"
Porfirio Diaz seized power in this country in 1876, ruled for 35 years, fled in 1911 & died in exile
Exiled for manslaughter, Eric the Red was forced to leave this country around 981
Moshoeshoe II was exiled twice before regaining this southern African country's throne in 1995
Under the 1814 Treaty of Kiel, this country gave Norway to Sweden but kept Greenland & other islands

In [44]:

# Print 10 examples for 'president'
print_n_examples_for_term('president', 10)

Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States
His first act after being sworn in as president of the Confederacy was to send a peace commission to Washington, D.C.
In the midst of the Korean War, this South Korean president was elected to his second of 4 terms
Its headquarters compound in Langley, Virginia is named for Former President George Bush
This president's 1972 visit to China inspired an opera that played at the Kennedy Center in 1988
This political satire starred John Travolta as a Southern governor running for president
If a president is impeached, this official presides over the trial in the Senate
On January 20, 1965 he was inaugurated as U.S. vice president
Gerald Ford was the last president born under this "crab"by sign
In 1976 this current president of France founded the Rally for the Republic Party

What we can observe:

that the most-used terms 'archive', '_blank' and 'target' are not really terms used in questions. Rather, many questions contain hyperlinks that include these terms
if we check frequently used terms like 'country' and 'president', we see all different questions.

The value of this 2nd observation must be discounted though. These are the first examples for each row chronologically, and it would of course be better to check for examples that are years apart.

For what we have seen though, this does not help us that much with finding a study strategy for Jeopardy.

3.3 Terms used in high-value questions¶

Let's check if there are any particular terms that appear significantly more in high-value questions.

We will do the following, for multiple popular terms:

check how many times this term appears in low-value questions (<= 800) vs high-value questions (>800)
do a chi-squared test if that ratio is realistic given expectations of overall low-value vs high-value questions

Let me explain in layman's terms how that works. Suppose that 30% of all Jeopardy questions are low-value questions and 70% are high-value questions. Now we are going to check for all questions that contains the term 'president'. Then we have an expectation that also for those, the ratio is 30% / 70%. If you find 29% / 71%: sounds still reasonable. If you find 10% / 90%: that looks suspicious though. With a chi-squared test one can calculate how likely such an observation still is under the hypothesis that the expected distribution is 30% / 70%. If that likelihood is very small (e.g. smaller than 5%) we reject the idea that the outcome is a mere coincidence, and conclude that the term 'president' is truly under- or over-represented (which one depends on the observation) in the high-value questions.

For more background about chi-squared tests, refer to the internet, there are many descriptions. Here is one example.

So the null-hypothesis is "The term <.....> is not over- or under-represented in high-value questions".

In [45]:

# Get overall numbers of low_value and high_value questions
low_value_max = 800

low_value_count = len(jeopardy[jeopardy['value_clean']<=800])
high_value_count = len(jeopardy[jeopardy['value_clean']> 800])

low_value_fraction = low_value_count / (low_value_count + high_value_count)
high_value_fraction = high_value_count / (low_value_count + high_value_count)

print(low_value_count, high_value_count, low_value_count + high_value_count, low_value_fraction, high_value_fraction)
                               

155508 61422 216930 0.7168579726178952 0.2831420273821048

In [46]:

# Create a function that gets a term and returns how frequently it occurs in low-value and high-value questions

def return_low_high (term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['question_clean'].split()
        split_question = [word for word in split_question if len(word) >5]
        # print(split_question) # commented out after verifying
        if term in split_question:
            # print(split_question, row['value_clean']) # commented out after verifying
            if row['value_clean'] <= 800:
                low_count += 1
            if row['value_clean'] > 800:
                high_count +=1
    return low_count, high_count
    

In [47]:

# Test the function on th term 'galileo'
test_term_1 = 'galileo'

a, b = return_low_high(test_term_1)
print (a,b)

32 7

In [48]:

# Test the function on the term 'president'
test_term_1 = 'president'

a, b = return_low_high(test_term_1)
print (a,b, a+b)

2297 883 3180

For 'president' one would initially expect a total of 3294 as that is the count we had for 'president' in our dictionary. One explanation can be that the term 'president' appeared more than once in some questions. For the dictionary each of them was counted. On this occassion, we count questions that contain the term 'president', regardless how many times the word is in that question.

I did not check whether this explains indeed. Rather, I conclude that the functions seems to work, so let's apply it to a couple of popular terms.

In [49]:

# Select popular terms that are used in questions a lot
popular_terms = ['country','president','american', 'capital', 'island']

# For those terms, calculate how many times it appears in low-value and high-value questions. Story in a dictionary.
popular_terms_low_high = {}
for term in popular_terms:
    popular_terms_low_high[term] = return_low_high(term)

In [50]:

# Show the result
popular_terms_low_high

Out[50]:

{'country': (4332, 1647),
 'president': (2297, 883),
 'american': (2115, 1053),
 'capital': (1988, 797),
 'island': (1665, 770)}

So for each of these terms we can now do a chi-squared test to figure out if these counts are significantly different than what one could expect.

Let's first do a quick round for the term 'country', then do it for all terms in a more structured way.

In [51]:

# Calculated expected values for 'country' and print them
country_expected_low = low_value_fraction * (4332 + 1647)
country_expected_high = high_value_fraction * (4332 + 1647)
print(country_expected_low, country_expected_high)

4286.093818282396 1692.9061817176046

In [52]:

# Import chisquare test from library
from scipy.stats import chisquare

# Execute chisquared test for 'country'
observed = np.array([4332, 1647])
expected = np.array([country_expected_low, country_expected_high])
chisquare_value, pvalue = chisquare(observed, expected) # returns a list

# Print result
print(chisquare_value,pvalue)

1.7365061769211498 0.18758212679269415

Now let's do this for all popular terms that we selected in the same way, and print the results in a readable way.

In [53]:

# For all selected popular terms, perform a chi-squared test and present the results in a readable format
for term in popular_terms:
    observed_low = popular_terms_low_high[term][0]
    observed_high = popular_terms_low_high[term][1]
    observed_total = observed_low + observed_high
    expected_low = low_value_fraction * (observed_total)
    expected_high = high_value_fraction * (observed_total)
    print('Term:', term)
    print('Observed (low/high):', observed_low, observed_high, "{:.1%} {:.1%}".format(observed_low/observed_total, observed_high/observed_total))
    print('Expected (low/high):', round(expected_low,1), round(expected_high,1), "{:.1%} {:.1%}".format(expected_low/observed_total, expected_high/observed_total)) 
    observed = np.array([observed_low, observed_high])
    expected = np.array([expected_low, expected_high])
    chisquare_value, pvalue = chisquare(observed, expected) # returns a list
    print('P-value:', pvalue)
    print('In words: the probability of the null-hypothesis that the term', term, 'is not over- or underrepresented in high-value questions is', pvalue)
    print('\n')
    

    

Term: country
Observed (low/high): 4332 1647 72.5% 27.5%
Expected (low/high): 4286.1 1692.9 71.7% 28.3%
P-value: 0.18758212679269415
In words: the probability of the null-hypothesis that the term country is not over- or underrepresented in high-value questions is 0.18758212679269415


Term: president
Observed (low/high): 2297 883 72.2% 27.8%
Expected (low/high): 2279.6 900.4 71.7% 28.3%
P-value: 0.49362469359700045
In words: the probability of the null-hypothesis that the term president is not over- or underrepresented in high-value questions is 0.49362469359700045


Term: american
Observed (low/high): 2115 1053 66.8% 33.2%
Expected (low/high): 2271.0 897.0 71.7% 28.3%
P-value: 7.641747011862597e-10
In words: the probability of the null-hypothesis that the term american is not over- or underrepresented in high-value questions is 7.641747011862597e-10


Term: capital
Observed (low/high): 1988 797 71.4% 28.6%
Expected (low/high): 1996.4 788.6 71.7% 28.3%
P-value: 0.722302281221693
In words: the probability of the null-hypothesis that the term capital is not over- or underrepresented in high-value questions is 0.722302281221693


Term: island
Observed (low/high): 1665 770 68.4% 31.6%
Expected (low/high): 1745.5 689.5 71.7% 28.3%
P-value: 0.000290975819901636
In words: the probability of the null-hypothesis that the term island is not over- or underrepresented in high-value questions is 0.000290975819901636

Observations:

For 'american' and 'island' we see a probability of almost zero that the observed numbers are a mere coincidence. Looking at the numbers, one can state that for these terms there is a significant overrepresentation in high-value questions
For 'country', 'president' and 'capital' there is some over-or under-representation as well, but there is not sufficient evidence (likelihood) that that is not by chance.

So we could give some advice now: study things that are 'american' or relate to 'islands'. It is questionable though whether this will really help.

4. Conclusions¶

We started with the question whether from the analysis of questions-and-answers from the past, we can give advice about how to prepare if you are going to participate in Jeopardy. And saw the following:

Expecting to find the answers of questions within the question themselves is not a viable strategy. If there are words in the question that are also part of the answer at all (which doesn't happen a lot in the first place), that is not going to help.
Almost all terms in questions (longer than 5 characters) have appeared before in questions. However, there are really a lot of 'popular' terms, and when looking at some examples this did not mean that the questions were repeated.
There appear to be popular terms that are (statistically significant) used relatively a lot in high-value questions.

It is very hard to defend though that any of this knowledge is going to help you prepare for Jeopardy a lot.

One could certainly analyze more, e.g.:

Play with the functions developed in this notebook, e.g. to detect more terms that are overrepresented in high-value questions.
Do more thorough analysis. E.g. when comparing with history, only look at the past couple of years rather than to more than 20 years. Or look at combinations of terms rather than individual ones in questions.
Do an analysis using the 'Category' column (that we ignored so far).

That's fun doing! If your goal is to prepare for Jeopardy though, I doubt though whether this is a good investment of your time, given what we saw so far. You may better just study general knowledge instead!