Summary
We extended our analysis of referrals/suspensions by analyzing the California data set.
Descriptive statistics on total referrals by school show that the standard deviation of these totals is actually higher than their mean. Since the entries are nonnegative counts, this implies that a small group of schools with an exceptionally high number of referrals drives the standard deviation up. To help explain this pattern, we looked at the racial composition of schools and found that schools with a higher percentage of ethnic groups 5 and 6 tend to have more referrals per student than other schools.
This notebook shows the process of extracting useful information from the California Discipline and Referrals Dataset. Methods for extraction and visualization are found below.
import pandas as pd
import pylab as P
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
from nltk import FreqDist
%matplotlib inline
See comments for individual descriptions.
# read in file as pandas dataframe
def read_in_file(filename):
    data = pd.read_csv(filename)
    return data
# for sanity checks of data within columns
# prints out the unique categorical values found in specified columns
def get_unique_values(df, verbose=False):
    column_names = list(df)
    agg_levels = pd.unique(df[column_names[0]])
    names = pd.unique(df[column_names[2]])
    discip_type = pd.unique(df[column_names[3]])
    ethnic_groups = pd.unique(df[column_names[4]])
    if verbose:
        print("Aggregate Levels are: %s" % agg_levels)
        print("Names of schools are: %s" % names)
        print("Discipline types are: %s" % discip_type)
        print("Ethnic groups are: %s" % ethnic_groups)
    return
# subset the data by AGGREGATION LEVEL
# D=Local educational agency totals (includes districts and direct funded charter schools)
# O=County totals
# S=SchoolTotals
# T=State totals
def subset_by_agg_level(df):
    # note: 'AggegateLevel' (sic) matches the column header as spelled in the source CSVs
    column_names = list(df)
    county_data = df[df.AggegateLevel == 'O']
    district_data = df[df.AggegateLevel == 'D']
    school_data = df[df.AggegateLevel == 'S']
    state_data = df[df.AggegateLevel == 'T']
    # label columns in each subset:
    county_data.columns = column_names
    district_data.columns = column_names
    school_data.columns = column_names
    state_data.columns = column_names
    return county_data, district_data, school_data, state_data
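The four nearly identical subset lines above can also be collapsed into a loop. A sketch of that variant (the function name `subset_by_agg_level_compact` is ours, not from the original code; `.copy()` is added so the subsets can be modified without pandas chained-assignment warnings):

```python
import pandas as pd

def subset_by_agg_level_compact(df):
    # map each aggregation code to a readable name, then filter once per code
    codes = {'O': 'county', 'D': 'district', 'S': 'school', 'T': 'state'}
    return {name: df[df['AggegateLevel'] == code].copy()
            for code, name in codes.items()}
```

This returns a dict keyed by level name (e.g. `subsets['school']`) instead of a 4-tuple, which makes call sites less order-dependent.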
# export a specified dataframe to csv
def export_df_to_csv(df, output_filename):
    df.to_csv(output_filename, index=False, header=True)
    return
# grab a subset of a dataframe by column indices
def subset_cols_dataframe(df, col_indexes):
    subset = df.iloc[:, col_indexes]
    return subset
# append together an array of dataframes(any number) in order
# example: array_df = [dataframe1, dataframe2, dataframe3]
def append_dataframes(array_df):
    temp = array_df[0]
    for i in range(1, len(array_df)):
        temp = temp.append(array_df[i])
    return temp
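In newer pandas versions `DataFrame.append` has been removed; `pd.concat` does the same row-wise stacking in a single call. An equivalent sketch (the name `append_dataframes_concat` is ours; note `ignore_index=True` renumbers the combined index, which `append` did not do by default):

```python
import pandas as pd

def append_dataframes_concat(array_df):
    # stack any number of dataframes row-wise in one call;
    # ignore_index=True gives the result a fresh 0..n-1 index
    return pd.concat(array_df, ignore_index=True)
```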
# this chunk of code just takes the input csv and subsets it
ca14 = read_in_file('CA_discip14.csv') # 2014 California Data saved in ca14
county_data14, district_data14, school_data14, state_data14 = subset_by_agg_level(ca14)
ca13 = read_in_file('CA_discip13.csv') # 2013 California Data saved in ca13
county_data13, district_data13, school_data13, state_data13 = subset_by_agg_level(ca13)
ca12 = read_in_file('CA_discip12.csv') # 2012 California Data saved in ca12
county_data12, district_data12, school_data12, state_data12 = subset_by_agg_level(ca12)
print("2012 data:")
get_unique_values(ca12, verbose=True)
print("2013 data:")
get_unique_values(ca13, verbose=True)
print("2014 data:")
get_unique_values(ca14, verbose=True)
print("-----------------------------------------------------------------------------------------")
2012 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State' nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [ 2.  6.  0.  4.  5.  1.  3.  9.  7.  nan]
2013 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State' nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [ 2.  6.  4.  5.  1.  3.  9.  7.  0.  nan]
2014 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State' nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [ 2.  6.  0.  4.  5.  3.  9.  7.  1.  nan]
-----------------------------------------------------------------------------------------
Since the imported data makes sense (categorical values are as expected, names were not split incorrectly, etc.), we now subset the data to the columns we actually need for each analysis.
# creating the useful dataframe for visualizing change in schools over time
indices_schools = [2, -2,-1] # for this information we only need the name of school, total referrals, and year
schools14 = subset_cols_dataframe(school_data14, indices_schools) # slicing out columns for each year
schools13 = subset_cols_dataframe(school_data13, indices_schools)
schools12 = subset_cols_dataframe(school_data12, indices_schools)
school_totals = append_dataframes([schools14, schools13, schools12]) # appending all years together
by_school = school_totals.groupby(['Name','Year']).sum()
print("Preview of groupby product: ")
print(by_school[0:10])
Preview of groupby product:
                                              Total
Name                                    Year
100 Black Men of the Bay Area Community 2013      0
                                        2014      0
180 Program                             2013      0
A. E. Arnold Elementary                 2012      0
                                        2013      0
                                        2014      0
A. G. Currie Middle                     2012     84
                                        2013     88
                                        2014    137
A. J. Cook Elementary                   2013      0
print(by_school.describe())
              Total
count  20885.000000
mean      51.283649
std      115.911895
min        0.000000
25%        0.000000
50%       14.000000
75%       54.000000
max     3536.000000
As you can see, the standard deviation of referrals across schools is very high. What does this tell us?
Most schools (75%) have 54 or fewer referrals per year, so the spread is being driven by a small number of schools with very large totals.
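The "standard deviation larger than the mean" pattern is exactly what a heavy right tail produces on nonnegative data. A tiny synthetic check (the numbers below are made up for illustration, not taken from the dataset):

```python
import numpy as np

# hypothetical per-school yearly totals: mostly small, plus a couple of extreme outliers
totals = np.array([0, 0, 5, 10, 14, 20, 40, 54, 1500, 3500])

print(totals.mean())                        # 514.3
print(totals.std(ddof=1) > totals.mean())   # True: the outliers inflate the spread
```

Removing the two largest values brings the standard deviation back below the mean, which is the same mechanism suspected for the school totals.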
by_school.sort_values('Total', ascending=False).head(50)
Name | Year | Total |
---|---|---|
Franklin Middle | 2012 | 3536 |
Garey High | 2012 | 2927 |
Blaker-Kinser Junior High | 2012 | 2312 |
Merrill F. West High | 2012 | 2248 |
Vallejo High | 2012 | 1963 |
Southwest High | 2013 | 1917 |
Central Union High | 2012 | 1890 |
Franklin Middle | 2014 | 1682 |
Encina Preparatory High | 2012 | 1660 |
Fontana High | 2012 | 1626 |
East Bakersfield High | 2013 | 1617 |
Oasis Community | 2013 | 1597 |
Charles W. Tewinkle Middle | 2012 | 1550 |
Los Banos Junior High | 2012 | 1479 |
Eisenhower Senior High | 2012 | 1448 |
Southwest High | 2012 | 1429 |
Calexico High | 2013 | 1322 |
Silverado High | 2012 | 1310 |
San Joaquin County Community | 2012 | 1299 |
Hogan Middle | 2014 | 1279 |
Phoenix High Community Day | 2012 | 1264 |
Bakersfield High | 2013 | 1230 |
Solano Middle | 2012 | 1157 |
Tioga Middle | 2012 | 1155 |
Franklin Middle | 2013 | 1152 |
John Muir Middle | 2012 | 1138 |
South High | 2013 | 1123 |
Alisal High | 2013 | 1121 |
Washington Middle | 2012 | 1120 |
San Leandro High | 2012 | 1115 |
Eisenhower Senior High | 2013 | 1103 |
Victor Valley High | 2012 | 1076 |
Cobalt Institute of Math and Science Academy | 2012 | 1074 |
Estancia High | 2012 | 1055 |
Ukiah High | 2012 | 1041 |
Gifford C. Cole Middle | 2012 | 1033 |
Nidorf Barry J. Juvenile Hall | 2012 | 1014 |
Calexico High | 2012 | 1012 |
Littlerock High | 2013 | 1000 |
Sequoia Middle | 2012 | 990 |
Los Padrinos Juvenile Hall | 2013 | 979 |
Modesto High | 2012 | 968 |
Lincoln High | 2012 | 954 |
Bear Creek High | 2013 | 953 |
Alisal High | 2012 | 953 |
Littlerock High | 2014 | 949 |
Fred C. Beyer High | 2012 | 947 |
Lindsay Senior High | 2013 | 942 |
Saddleback High | 2012 | 938 |
Silverado High | 2014 | 933 |
total_ref = by_school.Total.tolist() # get just the number total referrals/year in a list
cleanedList = [x for x in total_ref if str(x) != 'nan'] # drop missing values, which appear here as the string 'nan' rather than NaN
# make histogram from this list of values
yearly_ref = Counter()
for value in cleanedList:
yearly_ref[value] += 1
yearly_ref_values = list(yearly_ref.keys())
print(yearly_ref.most_common(15))
[(0.0, 9089), (11.0, 564), (12.0, 450), (13.0, 329), (14.0, 285), (15.0, 275), (16.0, 243), (17.0, 203), (25.0, 181), (18.0, 175), (20.0, 164), (27.0, 163), (19.0, 161), (22.0, 161), (23.0, 152)]
P.hist(cleanedList, bins=100, facecolor='green') # histogram over all school-year totals, not just the unique values
plt.title('Counts of Number of Referrals per Year')
plt.ylabel('Number of Schools')
plt.xlabel('Number Referrals')
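Because zero-referral school-years dominate the counts (9,089 of the 20,885 entries are 0), the tail of the histogram is hard to see. One option is to drop the zeros and log-scale the y-axis; a sketch with a hypothetical stand-in list (substitute the real `cleanedList` from above):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside the notebook
import matplotlib.pyplot as plt

# stand-in for cleanedList; the real list comes from the cells above
cleaned = [0] * 50 + [12, 14, 15, 54, 137, 990, 3536]
nonzero = [x for x in cleaned if x > 0]

plt.hist(nonzero, bins=100, facecolor='green', log=True)
plt.title('Counts of Nonzero Referrals per Year (log scale)')
plt.ylabel('Number of Schools')
plt.xlabel('Number of Referrals')
plt.savefig('referrals_nonzero_hist.png')
```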
# choose which columns you want to subset
indices_ethnicity = [4, 6, 12]
# state data w/o agg_level and CID and name
indices_IDless = [4, 3, 5, 6, 7, 8, 9, 10, 12]
ethnic14state = subset_cols_dataframe(state_data14, indices_IDless)
ethnic13state = subset_cols_dataframe(state_data13, indices_IDless)
ethnic12state = subset_cols_dataframe(state_data12, indices_IDless)
# append the state data from all 3 years together
ethnic_discip_state = append_dataframes([ethnic14state, ethnic13state, ethnic12state])
# optional: uncomment to export this dataframe in csv form
# export_df_to_csv(ethnic_discip_state, 'ethnic_discip_state.csv')
group_year_ethnicity = ethnic_discip_state.groupby(['Ethnicity', 'Year']).sum()
group_ethnic = ethnic_discip_state.groupby(['Ethnicity']).sum()
group_year = ethnic_discip_state.groupby(['Year']).sum()
# optional: uncomment to export this dataframe in csv form
# export_df_to_csv(group_year_ethnicity, 'group_year_ethnicity.csv')
group_year_ethnicity.plot(kind='bar', stacked=True, title='Referrals by Ethnicity and Year')
#InAmGroup = group_year_ethnicity.groupby(group_year_ethnicity.Ethnicity==1)
group_year_ethnicity.describe().plot(kind='area', stacked=True, title='')
group_year_ethnicity.describe()
 | Weapons | Drugs | ViolenceWithInjury | ViolenceWithoutInjury | OtherNonDefiance | OtherDefiance |
---|---|---|---|---|---|---|
count | 27.000000 | 27.00000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 |
mean | 1698.888889 | 5519.62963 | 5588.814815 | 22907.000000 | 3082.592593 | 29629.111111 |
std | 2744.328001 | 9428.57981 | 8030.540436 | 33888.038374 | 4863.285675 | 48120.322021 |
min | 76.000000 | 199.00000 | 270.000000 | 1053.000000 | 133.000000 | 965.000000 |
25% | 159.000000 | 503.50000 | 456.500000 | 1558.500000 | 287.500000 | 1992.500000 |
50% | 324.000000 | 1012.00000 | 1015.000000 | 4315.000000 | 597.000000 | 4697.000000 |
75% | 1801.500000 | 5563.50000 | 10269.000000 | 41724.000000 | 4322.000000 | 43781.000000 |
max | 9432.000000 | 31517.00000 | 25863.000000 | 111628.000000 | 15880.000000 | 190815.000000 |
# note: this file is produced by the export_df_to_csv call above (uncomment it first)
ethnic_discip = read_in_file('ethnic_discip_state.csv')
print("Columns are: %s" %list(ethnic_discip)) # columns of dataframe
print("Number of entries is: %s" %len(ethnic_discip)) # num entries
Columns are: ['Ethnicity', 'DisciplineType', 'Weapons', 'Drugs', 'ViolenceWithInjury', 'ViolenceWithoutInjury', 'OtherNonDefiance', 'OtherDefiance', 'Total', 'Year'] Number of entries is: 81
ethnic_discip.plot(x='Ethnicity', y='Drugs', kind='scatter', xticks=[0,1,2,3,4,5,6,7,8,9], title='Drug Referrals by Ethnicity')
byInAm = ethnic_discip[ethnic_discip.Ethnicity==1]
print(byInAm)
    Ethnicity DisciplineType  Weapons  Drugs  ViolenceWithInjury  \
5           1              E       22     35                  12
14          1              I       13     28                  95
23          1              O      155    546                 541
32          1              E       23     40                  16
41          1              I       14     21                 113
50          1              O      173    612                 605
59          1              E       24     46                  24
68          1              I        8     14                  89
77          1              O      213    548                 615

    ViolenceWithoutInjury  OtherNonDefiance  OtherDefiance  Total  Year
5                      25                 3              3    100  2014
14                    310                55            897   1398  2014
23                   2179               253           1742   5416  2014
32                     41                 3             24    147  2013
41                    382                42           1377   1949  2013
50                   2443               249           2368   6450  2013
59                     37                 6             21    158  2012
68                    320                65           1626   2122  2012
77                   2392               263           3050   7081  2012
year = byInAm.groupby(['Year']).sum()
print(year)
      Ethnicity  Weapons  Drugs  ViolenceWithInjury  ViolenceWithoutInjury  \
Year
2012          3      245    608                 728                   2749
2013          3      210    673                 734                   2866
2014          3      190    609                 648                   2514

      OtherNonDefiance  OtherDefiance  Total
Year
2012               334           4697   9361
2013               294           3769   8546
2014               311           2642   6914
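The printed totals suggest "OtherDefiance" referrals for this group fell each year (4697 → 3769 → 2642). That trend can be quantified with `Series.pct_change`; a sketch using values transcribed from the output above:

```python
import pandas as pd

# values copied from the printed groupby result above
year = pd.DataFrame(
    {'OtherDefiance': [4697, 3769, 2642], 'Total': [9361, 8546, 6914]},
    index=pd.Index([2012, 2013, 2014], name='Year'))

# fractional change relative to the previous year (the first year is NaN)
print(year['OtherDefiance'].pct_change())
```

Both the OtherDefiance column and the overall Total decline year over year, which would be worth checking against enrollment figures before drawing conclusions.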