Media is deeply coupled with American culture; in 2017 alone almost 600 million tickets were sold at movie theaters in the US (http://www.the-numbers.com/market/). How does the media we consume affect the public?
One theory presents media as a possibly negative influence on culture. Dating back to the 1950s, media has often been villainized as an inciter of violence. More recently, in 1999, a Gallup poll found that 62% of the population believed violent media was a major cause of violent actions (http://www.gallup.com/poll/5626/blame-game-youth-media-violence.aspx). Mass shootings are often immediately tied to the culprit's media consumption habits; this happened famously with Sandy Hook, and also with San Bernardino (https://www.thea-blast.org/top-stories/2016/01/14/is-violence-in-the-media-to-blame-for-mass-shootings/). Society clearly concedes that movies should be content and age restricted, as shown by our TV censorship laws and the movie rating system (https://en.wikipedia.org/wiki/Censorship_in_the_United_States https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system). Violent films are screened, and children are prevented from viewing them. Yet common sense tells us that, with media being such a hallmark of our culture, media cannot simply create violent people. People can enjoy violent films without being incited to commit violence.
Though there is an immediate desire to blame media for violence, research often states the opposite. In a study by the National Bureau of Economic Research, violent films were shown to reduce violent crime and drug consumption. Specifically, "estimates suggest that in the short-run violent movies deter almost 1,000 assaults on an average weekend" (http://www.nber.org/papers/w13718). This study claims that the violent nature of the films serves as an outlet for violent energy, leading to fewer committed acts. Our team wonders whether violent media has an effect in the long run. We also wonder about the validity of the claim that the violent content itself is what deters violent acts in the short run.
Other researchers disagree, arguing that violent films do promote aggression (http://www.nytimes.com/2008/01/07/business/media/07violence.html). Their work argues that exposure to violence increases overall aggressive behavior. This article, however, fails to address the observed short-run decrease in violent crime in conjunction with violent films.
Another study, answering the short-run decrease in violent crime, says it is not the content of the films that causes the reduction. The paper asserts that the effect is not a result of genre, but a result of occupying young men (those most likely to commit violent crime). This article also suggests that specific actors could contribute to the crime rate as well (https://eml.berkeley.edu/~sdellavi/wp/moviescrimeQJEProofs2009.pdf).
There is clearly a lot of controversy and differing opinion surrounding violent media's effects on the public. Does violent media have long-term effects? Will the effect of violent media be seen over a larger time scale? Is violent film the only genre that will influence the crime rate? Do other genres that attract young male audiences affect crime similarly to violent films?
We hope to combine these questions in one project, looking at genre specifically and enlarging the time frame. This will give us a better, more complete picture of violent crime and its interaction with media. Our question stands: How does the number of films of each genre produced per year affect the crime rate in San Diego? Is there a large-scale, long-time-frame influence deriving from the type of media consumed?
We originally started with three data sets, two relating to crime and one relating to movies.
Our movie data set was taken from Kaggle and was already pre-scraped from IMDb for us. This data set contains movies dating back 100 years, and each entry has information about title, main actor, genre, director, budget, year released, rating, and more. The data set can be found at: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset. In total the data contains 5044 entries. We hope to use this data to make new dataframes that isolate movies by year released and by genre.
When we load this data we call it "moviedata" and the file is called "movie_metadata.csv."
We then had two data sets on crime, one national and one local to the San Diego region. Our local data set spans 2008 to 2012 and has detailed information on the time, location, and type of each crime in the San Diego region. This data can be found at: http://data.sandiegodata.org/dataset/clarinova_com-crime-incidents-casnd-7ba4-extract and has 797979 entries. This data will be used initially in our analysis of the years 2008-2012 to see if there are local large-scale effects of media consumption.
When we load this data we call it "incidents" and the file is called "incidents-5y.csv."
Our national data set has data dating back to 1960, reporting the number of crimes by type (murder, robbery, etc.) for each year. The national data set can be found at: https://data.world/government/uniform-crime-reports and has 53 entries. This data set will be used after our preliminary tests to enlarge our time scale.
When we load this data we call it "reportNational" and the file is called "Uniform crime reports.csv."
#Import all the packages we need
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import patsy
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from sklearn import datasets
#Here we load the data
moviedata = pd.read_csv("movie_metadata.csv")
incidents = pd.read_csv("incidents-5y.csv")
reportNational = pd.read_csv("Uniform crime reports.csv")
We are only interested in certain parts of the dataframe, so we select the year, title, genre, keywords, country, and rating of each movie. We remove any missing data points. Lastly, we exclude any movies not released in the US, because we assume that only movies played regularly in the US will affect US crime rates.
#Data Clean for the moviedata
#Select:
# title_year, movie_title, genres, plot_keywords, country, content_rating
#Methods:
# drop all the NAN and extract "USA" movies
movie1 = moviedata[['title_year','movie_title','genres','plot_keywords','country','content_rating']]
movie2 = movie1.dropna()
movieUSA = movie2[movie2['country'] == "USA"]
USASort = movieUSA.sort_values(['title_year'], ascending = True)
At the end of the above box, we sort the movies by year released. We will only be focusing on movies released from 2008 to 2012 because those dates correspond to our crime data set. In the next box we exclude any movie not released in the period 2008-2012; having the data sorted made this step much easier.
#sort values in ascending year from 2008 to 2012
Myear0812 = USASort[(USASort['title_year'] >= 2008) & (USASort['title_year'] <= 2012)].reset_index(drop=True)
Because 2008-2012 is a fairly small time frame, we assume there will be no large differences in total number of movies released each year. Below we visualize our data to check this prediction.
#Movie production in each year from 2008-2012 approximate the same
plt.figure(figsize = (15,3))
year = Myear0812.loc[:, 'title_year'].value_counts().sort_index()
year.plot.bar()
Our assumption was correct: each year from 2008 to 2012 has approximately the same number of movies released.
Now we want to split our main movie dataframe up by year, to further analyze each year in our time frame.
#Sort data into df by year
Mgy08 = USASort[USASort['title_year'] == 2008.0]
Mgy09 = USASort[USASort['title_year'] == 2009.0]
Mgy10 = USASort[USASort['title_year'] == 2010.0]
Mgy11 = USASort[USASort['title_year'] == 2011.0]
Mgy12 = USASort[USASort['title_year'] == 2012.0]
The genres we have identified are: action, animation, comedy, documentary, etc. Below we initialize a list containing all the genres. This will help if we need a list of genres later, and will also let us count how many movies of each genre were produced in each year.
#Set Variable for all movie genres
ser_genres = ['Action','Animation','Comedy','Documentary','Family','Horror','Musical','Romance', 'Sport','War','Adventure','Biography','Crime','Drama','Fantasy','History','Music','Mystery','Sci-Fi','Thriller','Western']
Below we initialize data frames that will eventually be populated with the counts of each genre by year.
#Creating a genre-count frame for each year from 2008 to 2012
df08, df09, df10, df11, df12 = (
    pd.DataFrame({'genres': ser_genres, 'n': 0}) for _ in range(5))
Because we will need to repeat the process of searching our movie data set for each genre name and populating the dataframes with counts, we wrote three helper functions.
#Function: count how many entries of Series x contain the substring y
def countString(x, y):
    return int(x.str.contains(y).sum())

#Function: fill df_temp with the genre counts found in database
def fillDataFramesMov(database, df_temp):
    for i in range(len(df_temp)):
        df_temp.loc[i, 'n'] = countString(database['genres'], df_temp.loc[i, 'genres'])
    return df_temp

#Function: fill df_temp with the crime-type counts found in database
def fillDataFramesCrime(database, df_temp):
    for i in range(len(df_temp)):
        df_temp.loc[i, 'n'] = countString(database['type'], df_temp.loc[i, 'crime'])
    return df_temp
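As a quick sanity check, here is how `countString` behaves on a tiny hypothetical Series (the data below is made up for illustration, not taken from the project):

```python
import pandas as pd

def countString(x, y):
    #count how many entries of Series x contain the substring y
    return int(x.str.contains(y).sum())

genres = pd.Series(['Action|Comedy', 'Drama', 'Action|Drama'])
print(countString(genres, 'Action'))  # 2
print(countString(genres, 'Horror'))  # 0
```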
With our helper functions, we populate each movie dataframe with the counts of movie genres by year.
#Assign frequency to the movie genres in 2008
df08 = fillDataFramesMov(Mgy08,df08)
#Assign frequency to the movie genres in 2009
df09 = fillDataFramesMov(Mgy09,df09)
#Assign frequency to the movie genres in 2010
df10 = fillDataFramesMov(Mgy10,df10)
#Assign frequency to the movie genres in 2011
df11 = fillDataFramesMov(Mgy11,df11)
#Assign frequency to the movie genres in 2012
df12 = fillDataFramesMov(Mgy12,df12)
With our movie data cleaned and organized, we now move on to our crime data. We drop columns that are irrelevant to our experiment, keeping only the year, the type of crime, and the location of the crime. We then focus on the neighborhoods most proximal to theaters, which we assume are most influenced by movies. Because the data set is so large (797979 entries), we randomly sample 1000 data points. This provides a representative sample without being unwieldy to work with.
#Clean for the incidents data
#Select: year, is_night, type, city, nbrhood, asr_zone, lat, lon
#Method: assuming incidents happened in commercial areas (asr_zone = 6.0)
#        we want the population close to theater (commercial) areas
# drop all NAN.
# taking 1000 samples randomly
incidents1 = incidents[['year', 'is_night', 'type', 'city', 'nbrhood', 'asr_zone', 'lat', 'lon']]
incidentsN = incidents1.dropna()
commercial = incidentsN[incidentsN['asr_zone'] == 6]
c = commercial.sample(n=1000).reset_index(drop=True)  #pass random_state here for a reproducible sample
We want to know which crimes occur more often than others; we suspect the crime types are far from evenly distributed. Let's visualize this with a bar graph.
#Categorization of crime from our 1000 samples
plt.figure(figsize = (10,3))
x = c.loc[:, 'type'].value_counts()
x.plot.barh().invert_yaxis()
plt.xlabel('Event Count', color = "Blue")
f1 = plt.gcf()
We see that drug and alcohol violations are the most common in San Diego, and that theft/larceny crimes far outnumber sex crimes and weapons crimes. A sample may not perfectly represent the full data, but we have confidence that a sample of 1000 is reasonably robust to chance effects.
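As a rough back-of-the-envelope check on that claim (our own sketch, not part of the cleaning pipeline), the worst-case 95% margin of error for a proportion estimated from 1000 samples is about three percentage points:

```python
import math

#worst-case (p = 0.5) 95% margin of error for a proportion from n samples
n = 1000
margin = 1.96 * math.sqrt(0.5 * 0.5 / n)
print(f"95% margin of error: +/- {margin:.1%}")  # +/- 3.1%
```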
Below we repeat for our crime data steps similar to those performed for our movie data. We isolate crime by year within our 2008-2012 time frame, create a list of crime types, then initialize and populate dataframes containing counts of each type of crime by year.
#Get each title year in crime
c08 = c[c['year'] == 2008]
c09 = c[c['year'] == 2009]
c10 = c[c['year'] == 2010]
c11 = c[c['year'] == 2011]
c12 = c[c['year'] == 2012]
#Crime by category
ser_crimes = ['DRUGS','LARCENY','VEHICLE','MOTOR','ASSAULT','BURGLARY','VANDALISM','DUI','ROBBERY','FRAUD','SEX','WEAPONS','ARSON','HOMICIDE']
#Creating a crime-count frame for each year from 2008 to 2012
dfc08, dfc09, dfc10, dfc11, dfc12 = (
    pd.DataFrame({'crime': ser_crimes, 'n': 0}) for _ in range(5))
#Assign frequency to the crime types in 2008
dfc08 = fillDataFramesCrime(c08,dfc08)
#Assign frequency to the crime types in 2009
dfc09 = fillDataFramesCrime(c09,dfc09)
#Assign frequency to the crime types in 2010
dfc10 = fillDataFramesCrime(c10,dfc10)
#Assign frequency to the crime types in 2011
dfc11 = fillDataFramesCrime(c11,dfc11)
#Assign frequency to the crime types in 2012
dfc12 = fillDataFramesCrime(c12,dfc12)
We now have two sets of five data sets. Movie genres by year are described by the data sets df08-df12. Crime type by year is described in the data sets dfc08-dfc12.
To work with our data more easily, we combined the two sets of five into one master dataset for crime and one for genre. Each chunk of code is followed by a display of the new dataset.
#Merge all Genres by Year
df08.columns = ['Genres','2008']
df09.columns = ['Genres','2009']
df10.columns = ['Genres','2010']
df11.columns = ['Genres','2011']
df12.columns = ['Genres','2012']
genresY = df08.merge(df09, on='Genres').merge(df10, on='Genres').merge(df11, on='Genres').merge(df12, on='Genres')
genresY = pd.pivot_table(genresY, index = 'Genres')
genresY
| Genres | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|
Action | 33 | 33 | 41 | 40 | 39 |
Adventure | 28 | 31 | 29 | 32 | 30 |
Animation | 7 | 11 | 9 | 12 | 9 |
Biography | 7 | 4 | 7 | 5 | 8 |
Comedy | 68 | 72 | 71 | 67 | 70 |
Crime | 21 | 28 | 24 | 18 | 26 |
Documentary | 6 | 3 | 4 | 6 | 4 |
Drama | 79 | 84 | 76 | 69 | 67 |
Family | 21 | 21 | 28 | 21 | 16 |
Fantasy | 20 | 27 | 27 | 18 | 28 |
History | 5 | 2 | 3 | 1 | 4 |
Horror | 17 | 22 | 16 | 17 | 22 |
Music | 13 | 12 | 7 | 8 | 10 |
Musical | 5 | 3 | 3 | 4 | 4 |
Mystery | 13 | 20 | 11 | 18 | 12 |
Romance | 43 | 45 | 51 | 27 | 33 |
Sci-Fi | 20 | 28 | 13 | 23 | 19 |
Sport | 9 | 4 | 5 | 6 | 5 |
Thriller | 37 | 48 | 45 | 41 | 45 |
War | 7 | 3 | 2 | 2 | 3 |
Western | 1 | 0 | 3 | 2 | 2 |
#Merge all Crime by Year
dfc08.columns = ['Crime','2008']
dfc09.columns = ['Crime','2009']
dfc10.columns = ['Crime','2010']
dfc11.columns = ['Crime','2011']
dfc12.columns = ['Crime','2012']
crimeY = dfc08.merge(dfc09, on='Crime').merge(dfc10, on='Crime').merge(dfc11, on='Crime').merge(dfc12, on='Crime')
#removes sparse rows and the 2012 column (zero counts)
#used later for the chi-squared test
crime_no_zero = crimeY[:12]
crime_no_zero = crime_no_zero.drop('2012', axis=1)
crime_no_zero = pd.pivot_table(crime_no_zero, index = "Crime")
#crime_no_zero
crime_pivot = pd.pivot_table(crimeY, index = "Crime")
crime_pivot
| Crime | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|
ARSON | 0 | 0 | 1 | 0 | 0 |
ASSAULT | 21 | 22 | 9 | 19 | 22 |
BURGLARY | 14 | 15 | 7 | 16 | 9 |
DRUGS | 62 | 59 | 75 | 63 | 27 |
DUI | 19 | 17 | 22 | 15 | 17 |
FRAUD | 7 | 2 | 8 | 6 | 7 |
HOMICIDE | 0 | 0 | 0 | 0 | 0 |
LARCENY | 29 | 22 | 36 | 39 | 36 |
MOTOR | 25 | 19 | 8 | 8 | 13 |
ROBBERY | 10 | 7 | 7 | 3 | 5 |
SEX | 4 | 0 | 3 | 5 | 5 |
VANDALISM | 12 | 11 | 13 | 12 | 8 |
VEHICLE | 54 | 41 | 21 | 18 | 31 |
WEAPONS | 3 | 1 | 1 | 2 | 0 |
We also wanted the same data sets transposed for easier graphing. Below is the code that transposes them, along with the printed results.
#Creates data frame with columns being genres
genByGen = genresY.transpose()
#the transpose already carries the genre names (in alphabetical order)
#from genresY's index, so the columns must not be reassigned from ser_genres
genByGen.index.name = 'Year'
genByGen
| Year | Action | Adventure | Animation | Biography | Comedy | Crime | Documentary | Drama | Family | Fantasy | ... | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008 | 33 | 28 | 7 | 7 | 68 | 21 | 6 | 79 | 21 | 20 | ... | 17 | 13 | 5 | 13 | 43 | 20 | 9 | 37 | 7 | 1 |
2009 | 33 | 31 | 11 | 4 | 72 | 28 | 3 | 84 | 21 | 27 | ... | 22 | 12 | 3 | 20 | 45 | 28 | 4 | 48 | 3 | 0 |
2010 | 41 | 29 | 9 | 7 | 71 | 24 | 4 | 76 | 28 | 27 | ... | 16 | 7 | 3 | 11 | 51 | 13 | 5 | 45 | 2 | 3 |
2011 | 40 | 32 | 12 | 5 | 67 | 18 | 6 | 69 | 21 | 18 | ... | 17 | 8 | 4 | 18 | 27 | 23 | 6 | 41 | 2 | 2 |
2012 | 39 | 30 | 9 | 8 | 70 | 26 | 4 | 67 | 16 | 28 | ... | 22 | 10 | 4 | 12 | 33 | 19 | 5 | 45 | 3 | 2 |
5 rows × 21 columns
#Creates data frame with columns being crime genres
criByCri = crimeY.transpose()
criByCri.columns = ser_crimes
criByCri = criByCri.drop(criByCri.index[[0]])
criByCri.index.name = 'Year'
criByCri
| Year | DRUGS | LARCENY | VEHICLE | MOTOR | ASSAULT | BURGLARY | VANDALISM | DUI | ROBBERY | FRAUD | SEX | WEAPONS | ARSON | HOMICIDE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2008 | 62 | 29 | 54 | 25 | 21 | 14 | 12 | 19 | 10 | 7 | 4 | 3 | 0 | 0 |
2009 | 59 | 22 | 41 | 19 | 22 | 15 | 11 | 17 | 7 | 2 | 0 | 1 | 0 | 0 |
2010 | 75 | 36 | 21 | 8 | 9 | 7 | 13 | 22 | 7 | 8 | 3 | 1 | 1 | 0 |
2011 | 63 | 39 | 18 | 8 | 19 | 16 | 12 | 15 | 3 | 6 | 5 | 2 | 0 | 0 |
2012 | 27 | 36 | 31 | 13 | 22 | 9 | 8 | 17 | 5 | 7 | 5 | 0 | 0 | 0 |
Data visualization is vital to communicating scientific results to the public. We use several visualization methods below, mainly bar graphs, because our data compares multiple groups and tracks changes over time. All of our graphs have color legends explaining how each genre or crime type is presented.
These first graphs show each individual year as a single bar, with each genre colored to show what proportion of the year's production it represents.
#plot each year as one bar, stacked by genre proportions
fig = genByGen.plot(kind='bar', figsize=(10, 10), stacked=True)
The next boxes show our attempt to fit regression lines that could predict both the number of each crime type and the number of movies in each genre by year. We decided this was not useful because neither quantity changed much year to year.
#returns genre based on index
def getGenre(num):
    return ser_genres[num]

#will hold the fitted line coefficients for each genre
linesMov = pd.DataFrame(columns=['Genres', 'a', 'b'])
linesMov['Genres'] = ser_genres

#for i in range(len(genByGen.columns)):
#    tempGen = getGenre(i)
#    x = np.array(genByGen.index, dtype=float)
#    y = np.array(genByGen[tempGen], dtype=float)
#    a, b = np.polyfit(x, y, 1)
#    linesMov.loc[i, 'a'] = a
#    linesMov.loc[i, 'b'] = b
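For reference, a working version of the abandoned fit on a single genre could look like this (using the 2008-2012 Action counts from the genre table above; the flatness of these lines is why we dropped the approach):

```python
import numpy as np

#Action movie counts per year, read from the genre table above
years = np.array([2008, 2009, 2010, 2011, 2012], dtype=float)
action = np.array([33, 33, 41, 40, 39], dtype=float)

#least-squares line: count ~ a*year + b
a, b = np.polyfit(years, action, 1)
print(f"slope a = {a:.2f} movies/year")  # slope a = 1.90 movies/year
```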
The next plot is a grouped bar chart showing the number of movies produced in each genre, with one bar per year from 2008-2012. It helps visualize the unequal distribution of genres (more dramas than sport films) while also showing how similar the genre counts are from year to year.
#Plot grouped bar chart: x = 'genres', y = 'amounts'
#each year looks nearly identical
genresY.plot.bar(figsize = (15,5))
Next we plot the same data, but as a function of year with each bar representing a genre.
#Plot grouped horizontal bar chart: bars = genres, grouped by year
genByGen.plot.barh(figsize=(15, 10))
plt.legend(loc='best')
We now repeat the above graphs with our crime data. The first graph shows the number of each type of crime, with each bar representing a specific year from 2008-2012. Again, the relative numbers of crimes are nearly unchanged year to year, but certain crimes are always committed more than others (more drug use than arson).
#Plot grouped bar chart: x = 'crime type', y = 'amounts'
#each year looks nearly identical
crime_pivot.plot.bar(figsize = (15,5))
The graph below shows number of crime by year, with each bar representing a particular type of crime.
#Plot grouped bar chart: x = 'crimes', y = 'amounts'
#each year looks nearly identical
criByCri.plot.bar(figsize=(10, 10))
plt.legend(loc='best')
To further emphasize that there is not much change from year to year, we plotted each year again using seaborn, which produces graphs that are more pleasing to the eye. These graphs repeat our findings above: some crimes routinely occur more than others, but there is no significant variance in crime from year to year.
#crime 08
sns.set(style="whitegrid", color_codes=True)
sns.barplot(x='2008', y='Crime', data=crimeY, orient = "h");
#crime 09
sns.set(style="whitegrid", color_codes=True)
sns.barplot(x='2009', y='Crime', data=crimeY, orient = "h");
#crime 10
sns.set(style="whitegrid", color_codes=True)
sns.barplot(x='2010', y='Crime', data=crimeY, orient = "h");
#crime 11
sns.set(style="whitegrid", color_codes=True)
sns.barplot(x='2011', y='Crime', data=crimeY, orient = "h");
After visualizing our data and performing a sniff test, we think it is clear that there will be no significant result comparing genres of movies to types of crime. Because both the genre distribution and crime distribution do not change year to year, we decided to investigate another facet of our data set...
We went back to our original data set of movies and again selected only movies released in the US, but this time we kept all movies from 1960-2012. We did this so we can compare against our larger crime data set, which goes back to 1960. We hope that expanding our time frame beyond 5 years will reveal more patterns.
#Take year, genres, country from USASort and select movies from 1960-2012
m1 = USASort[['title_year', 'genres', 'country']]
m1 = m1[(m1.title_year > 1959) & (m1.title_year < 2013)]
We also suspect that looking at all genres is too broad a focus. Because our experiment builds on previous work focusing on violent films and violent crimes, we follow in those footsteps and isolate only movies likely to have violent themes: those in the Crime, War, Horror, and Action genres.
#Keep only movies whose genres contain Crime, War, Horror, or Action
violent_genres = ['Crime', 'War', 'Horror', 'Action']
m1 = m1[m1['genres'].str.contains('|'.join(violent_genres))]
We then group together and count all the films we suspect of having violent content, i.e. those in a possibly violent genre.
#Group and count genres which contains keyword 'crime','war','horror','action' in response to the year
m1 = m1[['title_year', 'genres']]
m1 = m1.groupby('title_year').count()
We struggled to write a method to count the number of movies in violent genres by year, so we manually counted and placed the results in an array representing the number of war, crime, action, or horror movies in each year.
#Creating a vector that represents the genres in each year
#Vector of violent-genre movie counts, one entry per year from 1960 to 2012
vector = np.array([0, 1, 1, 1, 0, 0, 1, 2, 0, 1, 0, 4, 2, 2, 1, 0, 1, 0, 0, 2, 1,
                   3, 1, 3, 3, 2, 2, 5, 4, 4, 5, 9, 8, 8, 16, 17, 23, 25, 20, 23,
                   26, 41, 35, 31, 22, 24, 21, 27, 22, 29, 24, 19, 27])
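For reference, the per-year counting we did by hand can also be sketched programmatically (a minimal example on made-up rows mimicking the IMDB layout, where `genres` is a pipe-separated string):

```python
import pandas as pd

#hypothetical mini-frame mimicking the IMDB layout (made-up rows)
movies = pd.DataFrame({
    'title_year': [1960, 1960, 1961, 1961, 1962],
    'genres': ['Action|Drama', 'Comedy', 'War|History', 'Romance', 'Crime|Thriller'],
})

#flag movies whose genre string mentions any possibly violent genre
violent = movies['genres'].str.contains('Crime|War|Horror|Action')
counts = movies[violent].groupby('title_year').size()
print(counts.to_dict())  # {1960: 1, 1961: 1, 1962: 1}
```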
Next we prune our national crime data set, which goes back to 1960, down to the relevant columns. We only care about violent crime, so the new data set keeps only the year and the total number of violent crimes.
#Get year, violent crime total from national report
selection = reportNational[['Year', 'Violent crime total']]
reportNational_pivot = pd.pivot_table(reportNational, index = "Year")
reportNational
| | Year | Population | Violent crime total | Murder and nonnegligent Manslaughter | Forcible rape | Robbery | Aggravated assault | Property crime total | Burglary | Larceny-theft | Motor vehicle theft |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1960 | 179323175 | 288460 | 9110 | 17190 | 107840 | 154320 | 3095700 | 912100 | 1855400 | 328200 |
1 | 1961 | 182992000 | 289390 | 8740 | 17220 | 106670 | 156760 | 3198600 | 949600 | 1913000 | 336000 |
2 | 1962 | 185771000 | 301510 | 8530 | 17550 | 110860 | 164570 | 3450700 | 994300 | 2089600 | 366800 |
3 | 1963 | 188483000 | 316970 | 8640 | 17650 | 116470 | 174210 | 3792500 | 1086400 | 2297800 | 408300 |
4 | 1964 | 191141000 | 364220 | 9360 | 21420 | 130390 | 203050 | 4200400 | 1213200 | 2514400 | 472800 |
5 | 1965 | 193526000 | 387390 | 9960 | 23410 | 138690 | 215330 | 4352000 | 1282500 | 2572600 | 496900 |
6 | 1966 | 195576000 | 430180 | 11040 | 25820 | 157990 | 235330 | 4793300 | 1410100 | 2822000 | 561200 |
7 | 1967 | 197457000 | 499930 | 12240 | 27620 | 202910 | 257160 | 5403500 | 1632100 | 3111600 | 659800 |
8 | 1968 | 199399000 | 595010 | 13800 | 31670 | 262840 | 286700 | 6125200 | 1858900 | 3482700 | 783600 |
9 | 1969 | 201385000 | 661870 | 14760 | 37170 | 298850 | 311090 | 6749000 | 1981900 | 3888600 | 878500 |
10 | 1970 | 203235298 | 738820 | 16000 | 37990 | 349860 | 334970 | 7359200 | 2205000 | 4225800 | 928400 |
11 | 1971 | 206212000 | 816500 | 17780 | 42260 | 387700 | 368760 | 7771700 | 2399300 | 4424200 | 948200 |
12 | 1972 | 208230000 | 834900 | 18670 | 46850 | 376290 | 393090 | 7413900 | 2375500 | 4151200 | 887200 |
13 | 1973 | 209851000 | 875910 | 19640 | 51400 | 384220 | 420650 | 7842200 | 2565500 | 4347900 | 928800 |
14 | 1974 | 211392000 | 974720 | 20710 | 55400 | 442400 | 456210 | 9278700 | 3039200 | 5262500 | 977100 |
15 | 1975 | 213124000 | 1039710 | 20510 | 56090 | 470500 | 492620 | 10252700 | 3265300 | 5977700 | 1009600 |
16 | 1976 | 214659000 | 1004210 | 18780 | 57080 | 427810 | 500530 | 10345500 | 3108700 | 6270800 | 966000 |
17 | 1977 | 216332000 | 1029580 | 19120 | 63500 | 412610 | 534350 | 9955000 | 3071500 | 5905700 | 977700 |
18 | 1978 | 218059000 | 1085550 | 19560 | 67610 | 426930 | 571460 | 10123400 | 3128300 | 5991000 | 1004100 |
19 | 1979 | 220099000 | 1208030 | 21460 | 76390 | 480700 | 629480 | 11041500 | 3327700 | 6601000 | 1112800 |
20 | 1980 | 225349264 | 1344520 | 23040 | 82990 | 565840 | 672650 | 12063700 | 3795200 | 7136900 | 1131700 |
21 | 1981 | 229465714 | 1361820 | 22520 | 82500 | 592910 | 663900 | 12061900 | 3779700 | 7194400 | 1087800 |
22 | 1982 | 231664458 | 1322390 | 21010 | 78770 | 553130 | 669480 | 11652000 | 3447100 | 7142500 | 1062400 |
23 | 1983 | 233791994 | 1258087 | 19308 | 78918 | 506567 | 653294 | 10850543 | 3129851 | 6712759 | 1007933 |
24 | 1984 | 235824902 | 1273282 | 18692 | 84233 | 485008 | 685349 | 10608473 | 2984434 | 6591874 | 1032165 |
25 | 1985 | 237923795 | 1327767 | 18976 | 87671 | 497874 | 723246 | 11102590 | 3073348 | 6926380 | 1102862 |
26 | 1986 | 240132887 | 1489169 | 20613 | 91459 | 542775 | 834322 | 11722700 | 3241410 | 7257153 | 1224137 |
27 | 1987 | 242288918 | 1483999 | 20096 | 91111 | 517704 | 855088 | 12024709 | 3236184 | 7499851 | 1288674 |
28 | 1988 | 244498982 | 1566221 | 20675 | 92486 | 542968 | 910092 | 12356865 | 3218077 | 7705872 | 1432916 |
29 | 1989 | 246819230 | 1646037 | 21500 | 94504 | 578326 | 951707 | 12605412 | 3168170 | 7872442 | 1564800 |
30 | 1990 | 249464396 | 1820127 | 23438 | 102555 | 639271 | 1054863 | 12655486 | 3073909 | 7945670 | 1635907 |
31 | 1991 | 252153092 | 1911767 | 24703 | 106593 | 687732 | 1092739 | 12961116 | 3157150 | 8142228 | 1661738 |
32 | 1992 | 255029699 | 1932274 | 23760 | 109062 | 672478 | 1126974 | 12505917 | 2979884 | 7915199 | 1610834 |
33 | 1993 | 257782608 | 1926017 | 24526 | 106014 | 659870 | 1135607 | 12218777 | 2834808 | 7820909 | 1563060 |
34 | 1994 | 260327021 | 1857670 | 23326 | 102216 | 618949 | 1113179 | 12131873 | 2712774 | 7879812 | 1539287 |
35 | 1995 | 262803276 | 1798792 | 21606 | 97470 | 580509 | 1099207 | 12063935 | 2593784 | 7997710 | 1472441 |
36 | 1996 | 265228572 | 1688540 | 19645 | 96252 | 535594 | 1037049 | 11805323 | 2506400 | 7904685 | 1394238 |
37 | 1997 | 267783607 | 1636096 | 18208 | 96153 | 498534 | 1023201 | 11558475 | 2460526 | 7743760 | 1354189 |
38 | 1998 | 270248003 | 1533887 | 16974 | 93144 | 447186 | 976583 | 10951827 | 2332735 | 7376311 | 1242781 |
39 | 1999 | 272690813 | 1426044 | 15522 | 89411 | 409371 | 911740 | 10208334 | 2100739 | 6955520 | 1152075 |
40 | 2000 | 281421906 | 1425486 | 15586 | 90178 | 408016 | 911706 | 10182584 | 2050992 | 6971590 | 1160002 |
41 | 2001 | 285317559 | 1439480 | 16037 | 90863 | 423557 | 909023 | 10437189 | 2116531 | 7092267 | 1228391 |
42 | 2002 | 287973924 | 1423677 | 16229 | 95235 | 420806 | 891407 | 10455277 | 2151252 | 7057379 | 1246646 |
43 | 2003 | 290788976 | 1383676 | 16528 | 93883 | 414235 | 859030 | 10442862 | 2154834 | 7026802 | 1261226 |
44 | 2004 | 293656842 | 1360088 | 16148 | 95089 | 401470 | 847381 | 10319386 | 2144446 | 6937089 | 1237851 |
45 | 2005 | 296507061 | 1390745 | 16740 | 94347 | 417438 | 862220 | 10174754 | 2155448 | 6783447 | 1235859 |
46 | 2006 | 299398484 | 1435123 | 17309 | 94472 | 449246 | 874096 | 10019601 | 2194993 | 6626363 | 1198245 |
47 | 2007 | 301621157 | 1422970 | 17128 | 92160 | 447324 | 866358 | 9882212 | 2190198 | 6591542 | 1100472 |
48 | 2008 | 304059724 | 1394461 | 16465 | 90750 | 443563 | 843683 | 9774152 | 2228887 | 6586206 | 959059 |
49 | 2009 | 307006550 | 1325896 | 15399 | 89241 | 408742 | 812514 | 9337060 | 2203313 | 6338095 | 795652 |
50 | 2010 | 309330219 | 1251248 | 14722 | 85593 | 369089 | 781844 | 9112625 | 2168459 | 6204601 | 739565 |
51 | 2011 | 311587816 | 1206031 | 14661 | 84175 | 354772 | 752423 | 9052743 | 2185140 | 6151095 | 716508 |
52 | 2012 | 313914040 | 1214464 | 14827 | 84376 | 354522 | 760739 | 8975438 | 2103787 | 6150598 | 721053 |
We now construct a dataframe spanning 1960 to 2012 that contains the total number of violent crimes per year, as well as the total number of movies with possibly violent content (action, horror, crime, war).
#Creating a final report that contains year, violent crime, movies(contains the violent key words) nationally
col = ['year', 'violent crime', 'movies']
final_report = pd.DataFrame(columns = col)
final_report['year'] = selection['Year']
final_report['violent crime'] = selection['Violent crime total']
final_report['movies'] = vector
final_report
| | year | violent crime | movies |
|---|---|---|---|
0 | 1960 | 288460 | 0 |
1 | 1961 | 289390 | 1 |
2 | 1962 | 301510 | 1 |
3 | 1963 | 316970 | 1 |
4 | 1964 | 364220 | 0 |
5 | 1965 | 387390 | 0 |
6 | 1966 | 430180 | 1 |
7 | 1967 | 499930 | 2 |
8 | 1968 | 595010 | 0 |
9 | 1969 | 661870 | 1 |
10 | 1970 | 738820 | 0 |
11 | 1971 | 816500 | 4 |
12 | 1972 | 834900 | 2 |
13 | 1973 | 875910 | 2 |
14 | 1974 | 974720 | 1 |
15 | 1975 | 1039710 | 0 |
16 | 1976 | 1004210 | 1 |
17 | 1977 | 1029580 | 0 |
18 | 1978 | 1085550 | 0 |
19 | 1979 | 1208030 | 2 |
20 | 1980 | 1344520 | 1 |
21 | 1981 | 1361820 | 3 |
22 | 1982 | 1322390 | 1 |
23 | 1983 | 1258087 | 3 |
24 | 1984 | 1273282 | 3 |
25 | 1985 | 1327767 | 2 |
26 | 1986 | 1489169 | 2 |
27 | 1987 | 1483999 | 5 |
28 | 1988 | 1566221 | 4 |
29 | 1989 | 1646037 | 4 |
30 | 1990 | 1820127 | 5 |
31 | 1991 | 1911767 | 9 |
32 | 1992 | 1932274 | 8 |
33 | 1993 | 1926017 | 8 |
34 | 1994 | 1857670 | 16 |
35 | 1995 | 1798792 | 17 |
36 | 1996 | 1688540 | 23 |
37 | 1997 | 1636096 | 25 |
38 | 1998 | 1533887 | 20 |
39 | 1999 | 1426044 | 23 |
40 | 2000 | 1425486 | 26 |
41 | 2001 | 1439480 | 41 |
42 | 2002 | 1423677 | 35 |
43 | 2003 | 1383676 | 31 |
44 | 2004 | 1360088 | 22 |
45 | 2005 | 1390745 | 24 |
46 | 2006 | 1435123 | 21 |
47 | 2007 | 1422970 | 27 |
48 | 2008 | 1394461 | 22 |
49 | 2009 | 1325896 | 29 |
50 | 2010 | 1251248 | 24 |
51 | 2011 | 1206031 | 19 |
52 | 2012 | 1214464 | 27 |
Below we visualize the data as a scatter plot. Each point represents one year, plotted by its total number of violent crimes against its total number of possibly violent movies.
# Plot the result: violent crimes (x) vs. possibly violent movies (y), one point per year
plt.figure(figsize=(15, 10))
plt.scatter(final_report['violent crime'], final_report['movies'])
plt.title("Number of movies vs. violent crimes")
# Label each point with its year
for label, movie_count, crime_count in zip(final_report['year'], final_report['movies'], final_report['violent crime']):
    plt.annotate(label, xy=(crime_count, movie_count), xytext=(1, -1), textcoords='offset points')
plt.xlabel('Number of violent crimes')
plt.ylabel('Number of possibly violent movies')
We want to see whether, on a national scale, there is an association between year and the number of each type of crime committed. We will run a chi-squared test, which will later let us compare the result against a chi-squared test between year and the number of each type of crime committed in San Diego.
valueN, pN, dfN, expectedN = chi2_contingency(observed= reportNational_pivot)
Nationally there seems to be an association between year and type of crime committed. However, our time frame is much larger nationally than our time frame for San Diego. So we will see if there is still an association on a national level between the years 2008 and 2012.
part_of_national = reportNational[48:53]
valuepN, pPN, dfPN, expectedPN= chi2_contingency(observed= part_of_national)
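As a self-contained reminder of how `chi2_contingency` behaves, here is a minimal sketch on a toy contingency table (the counts below are fabricated for illustration and have nothing to do with our crime data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table: rows could be years, columns crime types
# (made-up counts purely for illustration)
observed = np.array([[120, 30, 50],
                     [115, 28, 55],
                     [130, 35, 45]])

stat, p, dof, expected = chi2_contingency(observed)
print(dof)       # (3 rows - 1) * (3 cols - 1) = 4 degrees of freedom
print(p > 0.05)  # True: rows this similar give no evidence against independence
```

The test compares the observed counts against the counts expected under independence of rows and columns; a small p-value is evidence that the row and column variables are associated.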
We want to check if there is any relation between year and the number of movies released in each genre. We will run a chi-squared test to see if there is an association. We don't expect there to be an association, but we will run the test to be sure.
valueG, pG, dfG, expectedG = chi2_contingency(observed= genresY)
print(pG)
0.984870188048
Given the high p-value, we fail to reject the null hypothesis: we find no evidence of an association between year and the number of movies released in each genre. For consistency, we will run another chi-squared test on year and the number of each type of crime.
valueC, pC, dfC, expectedC = chi2_contingency(observed = crime_no_zero )
print(pC)
0.000706387719749
The p-value here, by contrast, is small, indicating an association between year and the number of each type of crime. Since movie genre does not vary with year while crime does, these broad tests cannot link the two, so we will now narrow our scope. We will see if there is an association between the number of violent crimes committed in a year nationally and the number of American movies released in a year under the genres Action, Crime, War and Horror. We created a scatter plot above, and the relationship does not appear to be entirely linear, but we will run a regression test to be sure. In this case our null hypothesis is that there is no significant relationship between the number of violent crimes committed in a year and the number of movies released in those specific genres.
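Before running it on our data, here is a self-contained sketch of what such a test looks like on synthetic data with a known linear relationship (using scipy's `linregress`; the variables below are fabricated, not ours):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data with a known linear relationship (true slope 3)
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 3.0 * x + rng.normal(scale=5.0, size=50)

result = linregress(x, y)
print(result.pvalue < 0.05)  # True: slope is significantly different from zero
```

`linregress` tests the null hypothesis that the slope is zero; a small p-value rejects it, which is exactly the logic we apply to crimes and movie counts below.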
# Fit a least-squares line and visualize the regression with seaborn
a1, b1 = np.polyfit(final_report['violent crime'], final_report['movies'], 1)
sns.regplot(final_report['violent crime'], final_report['movies'])
# Fit an OLS model; note that no intercept is added (sm.add_constant is not
# used), so the fitted line is forced through the origin
result = sm.OLS(final_report['violent crime'], final_report['movies']).fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:          violent crime   R-squared:                       0.567
Model:                            OLS   Adj. R-squared:                  0.559
Method:                 Least Squares   F-statistic:                     68.13
Date:                Tue, 13 Jun 2017   Prob (F-statistic):           5.06e-11
Time:                        15:32:37   Log-Likelihood:                -798.18
No. Observations:                  53   AIC:                             1598.
Df Residuals:                      52   BIC:                             1600.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
movies      6.196e+04   7506.389      8.254      0.000      4.69e+04   7.7e+04
==============================================================================
Omnibus:                        2.676   Durbin-Watson:                   0.091
Prob(Omnibus):                  0.262   Jarque-Bera (JB):                2.332
Skew:                          -0.407   Prob(JB):                        0.312
Kurtosis:                       2.373   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The results shown above tell us that R-squared is 0.567, which gives an R value of about 0.753. Since the p-value is much smaller than 0.05, we can reject our null hypothesis. We find that there is a relationship between the number of violent crimes committed in a year and the number of Action, Horror, War and Crime related movies released. It is important to note that the conditions for inference weren't completely met, since the relationship doesn't look linear, and the model was fit without an intercept, which tends to inflate R-squared.
While simple statistics suggest we have found an effect, it may be more of an artifact than a real discovery.
With this project we demonstrated our ability to clean large data sets and decide which variables are relevant to our question. But it is also important to remember that with each assumption, we may have jeopardized our results. We chose to look only at American movies, which could make us oblivious to results informed by foreign films. We also classified 'War', 'Crime', 'Horror', and 'Action' films as violent. This is not always true, nor exhaustive. We feel it was a good estimate of content, but we could also have analyzed keywords to better capture content.
We also had a very large time frame. We first focused on 2008-2012, in which we saw no effects. We then zoomed out to 1960-2012. While at this scale we could see change, the time frame is so large that the change is more likely due to outside factors than to movies. Movies are also not shown continuously for a year, so the release year may not be representative of when a film was actually viewed. And since a movie only lasts a couple of hours, perhaps crime should also be analyzed at a finer time resolution than the year.
Setting aside these possible confounds, we conclude that movie genre has an effect on crime rate: we find a statistically significant effect that as the number of violent films produced increases, so does the number of violent crimes committed.
A possible future direction for studies like this would be to compare the number of attendees and the popularity of movies against crime rate. This would address previous research postulating that crime reduction during violent movie showtimes occurs because people are occupied in the theater, not because of the content. The research could also be replicated with a focus on keywords, while attempting to disentangle overall trends in crime and movie genres from the results.
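One concrete way to start disentangling shared long-run trends is to correlate year-over-year changes rather than raw yearly totals. A minimal sketch on synthetic data (the series below are fabricated for illustration, not our actual crime or movie counts):

```python
import numpy as np

# Two synthetic yearly series that share an upward trend but have
# unrelated year-to-year fluctuations
rng = np.random.default_rng(1)
years = np.arange(1960, 2013)
trend = np.linspace(0, 100, years.size)
crimes = trend + rng.normal(scale=5.0, size=years.size)
movies = trend + rng.normal(scale=5.0, size=years.size)

# Raw levels look strongly correlated because of the shared trend...
level_corr = np.corrcoef(crimes, movies)[0, 1]
# ...but year-over-year differences reveal the fluctuations are unrelated
diff_corr = np.corrcoef(np.diff(crimes), np.diff(movies))[0, 1]

print(level_corr > 0.8, abs(diff_corr) < 0.5)
```

This is why our 1960-2012 correlation should be treated with caution: two series that merely drift in the same direction will correlate strongly in levels even when their fluctuations have nothing to do with each other.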
This research is vital because we consume immense amounts of media, especially in the US. That media may be affecting how we think and act, and it is important to stay critical of what we consume. But that does not mean you should feel guilty for watching your favorite Tarantino film!