California Demographics Descriptive Statistics

Summary

We extended our analysis of referrals/suspensions by examining the California data set.

Running descriptive statistics on total referrals by school, we found that the standard deviation is actually higher than the mean. Since the entries must be nonnegative, this implies that a small group of schools with an exceptionally high number of referrals is pulling the standard deviation up. To help explain this skew, we looked at the racial composition of schools and found that schools with a higher percentage of students in ethnic groups 5 and 6 tend to have more referrals per student than other schools.
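
To see why that implication holds, here is a toy illustration (hypothetical numbers, not drawn from the data set): for nonnegative values, a standard deviation larger than the mean typically signals a small group of entries sitting far above the rest.

import numpy as np

# hypothetical referral counts: mostly small, one extreme school
sample = [0, 0, 0, 5, 10, 500]
print(np.mean(sample))  # ~85.8
print(np.std(sample))   # ~185.3 -- exceeds the mean, driven entirely by the outlier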

This notebook shows the process of extracting useful information from the California Discipline and Referrals Dataset. Methods for extraction and visualization are found below.

In [1]:
import pandas as pd
import pylab as P
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
%matplotlib inline 

Some methods I wrote for an easier time subsetting and concatenating datasets.

See comments for individual descriptions.

In [2]:
# read in file as pandas dataframe
def read_in_file(filename):
    data = pd.read_csv(filename)
    return data

# for sanity checks of data within columns
#  prints out the unique categorical values found in specified columns
def get_unique_values(df, verbose=False):
    column_names = list(df)
    agg_levels = pd.unique(df[column_names[0]])
    names = pd.unique(df[column_names[2]])
    discip_type = pd.unique(df[column_names[3]])
    ethnic_groups = pd.unique(df[column_names[4]])
    if verbose:
        print("Aggregate Levels are: %s" % agg_levels)
        print("Names of schools are: %s" % names)
        print("Discipline types are: %s" % discip_type)
        print("Ethnic groups are: %s" % ethnic_groups)
    return

#  subset the data by AGGREGATION LEVEL
#  D=Local educational agency totals (includes districts and direct funded charter schools)
#  O=County totals
#  S=School totals
#  T=State totals
def subset_by_agg_level(df):
    # note: row subsets inherit the original column labels, so no relabeling is needed
    county_data = df[df.AggegateLevel == 'O']
    district_data = df[df.AggegateLevel == 'D']
    school_data = df[df.AggegateLevel == 'S']
    state_data = df[df.AggegateLevel == 'T']
    return county_data, district_data, school_data, state_data

#  export a specified dataframe to csv
def export_df_to_csv(df, output_filename):
    df.to_csv(output_filename, index=False, header=True)
    return

# grab a subset of a dataframe by column indices
def subset_cols_dataframe(df, col_indexes):
    subset = df.iloc[:, col_indexes]
    return subset

# append together an array of dataframes (any number) in order
# example: array_df = [dataframe1, dataframe2, dataframe3]
def append_dataframes(array_df):
    # pd.concat stacks the frames in order (equivalent to repeated .append,
    # which is deprecated in newer pandas)
    return pd.concat(array_df)

Reading in the California data set CSV files and subsetting them by aggregation level:

In [3]:
#  this chunk of code just takes the input csv and subsets it
ca14 = read_in_file('CA_discip14.csv')  # 2014 California Data saved in ca14
county_data14, district_data14, school_data14, state_data14 = subset_by_agg_level(ca14)
ca13 = read_in_file('CA_discip13.csv')  # 2013 California Data saved in ca13
county_data13, district_data13, school_data13, state_data13 = subset_by_agg_level(ca13)
ca12 = read_in_file('CA_discip12.csv')  # 2012 California Data saved in ca12
county_data12, district_data12, school_data12, state_data12 = subset_by_agg_level(ca12)

Getting a quick overview of values contained in the data (sanity check):

In [4]:
print("2012 data:")
get_unique_values(ca12, verbose=True)
print("2013 data:")
get_unique_values(ca13, verbose=True)
print("2014 data:")
get_unique_values(ca14, verbose=True)
print("-----------------------------------------------------------------------------------------")
2012 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State'
 nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [  2.   6.   0.   4.   5.   1.   3.   9.   7.  nan]
2013 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State'
 nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [  2.   6.   4.   5.   1.   3.   9.   7.   0.  nan]
2014 data:
Aggregate Levels are: ['O' 'D' 'S' 'T' nan]
Names of schools are: ['Alameda' 'Alpine' 'Amador' ..., 'Wheatland Community Day High' 'State'
 nan]
Discipline types are: ['E' 'I' 'O' nan]
Ethnic groups are: [  2.   6.   0.   4.   5.   3.   9.   7.   1.  nan]
-----------------------------------------------------------------------------------------

Querying the Imported Data

Since the imported data makes sense (categorical values are as expected, names were not split incorrectly, etc.), we now want to subset the data down to the columns we actually need for each analysis.

  • I made a method earlier called subset_cols_dataframe(df, col_indexes) to do this easily.
  • There is also a method called append_dataframes(array_df) which takes in any size array of dataframes and appends them together. I use this method to put together all 3 years worth of data.

Looking at Each School's Total Referrals Over Time

In [5]:
# creating the useful dataframe for visualizing change in schools over time
indices_schools = [2, -2,-1]  # for this information we only need the name of school, total referrals, and year

schools14 = subset_cols_dataframe(school_data14, indices_schools)  # slicing out columns for each year
schools13 = subset_cols_dataframe(school_data13, indices_schools)
schools12 = subset_cols_dataframe(school_data12, indices_schools)

school_totals = append_dataframes([schools14, schools13, schools12])  # appending all years together

Now add up the totals with respect to each school:

In [6]:
by_school = school_totals.groupby(['Name','Year']).sum()
print("Preview of groupby product: ")
print(by_school[0:10])
Preview of groupby product: 
                                              Total
Name                                    Year       
100 Black Men of the Bay Area Community 2013      0
                                        2014      0
180 Program                             2013      0
A. E. Arnold Elementary                 2012      0
                                        2013      0
                                        2014      0
A. G. Currie Middle                     2012     84
                                        2013     88
                                        2014    137
A. J. Cook Elementary                   2013      0

Descriptive statistics on the totals for each school every year:

In [7]:
print(by_school.describe())
              Total
count  20885.000000
mean      51.283649
std      115.911895
min        0.000000
25%        0.000000
50%       14.000000
75%       54.000000
max     3536.000000

Thoughts on these numbers:

As you can see, the standard deviation of referrals between schools is very high. What does this tell us?

  • The range of the data, especially at the tails, is very large
  • Why? Look at the maximum number of referrals at a single school: 3,536

Most schools (75%) have 54 or fewer referrals a year.

Let's take a closer look at what is causing this skew in the data:

Schools with the most referrals:

In [8]:
by_school.sort_values('Total', ascending=False).head(50)
Out[8]:
Name  Year  Total
Franklin Middle 2012 3536
Garey High 2012 2927
Blaker-Kinser Junior High 2012 2312
Merrill F. West High 2012 2248
Vallejo High 2012 1963
Southwest High 2013 1917
Central Union High 2012 1890
Franklin Middle 2014 1682
Encina Preparatory High 2012 1660
Fontana High 2012 1626
East Bakersfield High 2013 1617
Oasis Community 2013 1597
Charles W. Tewinkle Middle 2012 1550
Los Banos Junior High 2012 1479
Eisenhower Senior High 2012 1448
Southwest High 2012 1429
Calexico High 2013 1322
Silverado High 2012 1310
San Joaquin County Community 2012 1299
Hogan Middle 2014 1279
Phoenix High Community Day 2012 1264
Bakersfield High 2013 1230
Solano Middle 2012 1157
Tioga Middle 2012 1155
Franklin Middle 2013 1152
John Muir Middle 2012 1138
South High 2013 1123
Alisal High 2013 1121
Washington Middle 2012 1120
San Leandro High 2012 1115
Eisenhower Senior High 2013 1103
Victor Valley High 2012 1076
Cobalt Institute of Math and Science Academy 2012 1074
Estancia High 2012 1055
Ukiah High 2012 1041
Gifford C. Cole Middle 2012 1033
Nidorf Barry J. Juvenile Hall 2012 1014
Calexico High 2012 1012
Littlerock High 2013 1000
Sequoia Middle 2012 990
Los Padrinos Juvenile Hall 2013 979
Modesto High 2012 968
Lincoln High 2012 954
Bear Creek High 2013 953
Alisal High 2012 953
Littlerock High 2014 949
Fred C. Beyer High 2012 947
Lindsay Senior High 2013 942
Saddleback High 2012 938
Silverado High 2014 933
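
Before moving on, here is a quick check (a minimal sketch reusing the by_school dataframe from In [6]) of how concentrated referrals are in the upper quarter of school-years:

# sketch: how much of the referral total does the top quarter account for?
threshold = by_school.Total.quantile(0.75)  # 54 referrals, per the describe() output above
tail = by_school[by_school.Total > threshold]
print("School-years above the 75th percentile: %d of %d" % (len(tail), len(by_school)))
print("Share of all referrals they account for: %.1f%%"
      % (100.0 * tail.Total.sum() / by_school.Total.sum()))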

Histogram of total referrals per year

In [9]:
total_ref = by_school.Total.tolist()  # just the total referrals per school-year, as a list
cleanedList = [x for x in total_ref if str(x) != 'nan']  # temp fix since my 'nan' is string not NaN

# tally how many school-years saw each referral count (used for most_common below)
yearly_ref = Counter()
for value in cleanedList:
    yearly_ref[value] += 1

Most common numbers of referrals per school per year, as (num_referrals, num_schools) pairs:

In [10]:
print(yearly_ref.most_common(15))
[(0.0, 9089), (11.0, 564), (12.0, 450), (13.0, 329), (14.0, 285), (15.0, 275), (16.0, 243), (17.0, 203), (25.0, 181), (18.0, 175), (20.0, 164), (27.0, 163), (19.0, 161), (22.0, 161), (23.0, 152)]
In [11]:
P.hist(cleanedList, bins=100, facecolor='green')  # histogram the raw totals so the y-axis counts schools
plt.title('Counts of Number of Referrals per Year')
plt.ylabel('Number of Schools')
plt.xlabel('Number Referrals')
Out[11]:
<matplotlib.text.Text at 0x10df87588>
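
A cumulative variant of the same histogram would show how quickly schools pile up below a given referral count. A minimal sketch, reusing cleanedList from In [9]:

P.hist(cleanedList, bins=100, cumulative=True, facecolor='green')
plt.title('Schools with at Most a Given Number of Yearly Referrals')
plt.ylabel('Number of Schools (cumulative)')
plt.xlabel('Number of Referrals')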

Takeaways

  • A small number of schools have far more referrals than the average school
  • This is an important point to keep in mind when deciding how to model the rest of the data
  • Although the great majority of schools have few yearly referrals, our job is not done
  • It is more important to focus on the few schools with an extraordinarily large number of referrals

Why do select schools have such a large number of referrals compared to the majority?

In [12]:
#  choose which columns you want to subset:
#  state data without agg_level, CID, and name
indices_IDless = [4, 3, 5, 6, 7, 8, 9, 10, 12]
ethnic14state = subset_cols_dataframe(state_data14, indices_IDless)
ethnic13state = subset_cols_dataframe(state_data13, indices_IDless)
ethnic12state = subset_cols_dataframe(state_data12, indices_IDless)

# append the state data from all 3 years together
ethnic_discip_state = append_dataframes([ethnic14state, ethnic13state, ethnic12state])

# optional: uncomment to export this dataframe in csv form
# export_df_to_csv(ethnic_discip_state, 'ethnic_discip_state.csv')

Creating dataframes for visualizing descriptive statistics on referrals with respect to ethnicity:

Now, to look at the number of referrals for each ethnicity with respect to year:
In [13]:
group_year_ethnicity = ethnic_discip_state.groupby(['Ethnicity', 'Year']).sum()
group_ethnic = ethnic_discip_state.groupby(['Ethnicity']).sum()
group_year = ethnic_discip_state.groupby(['Year']).sum()

# optional: uncomment to export this dataframe in csv form
# export_df_to_csv(group_year_ethnicity, 'group_year_ethnicity.csv')

This information visualized:

In [14]:
group_year_ethnicity.plot(kind='bar', stacked=True, title='Referrals by Ethnicity and Year')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ec9b668>
In [16]:
group_year_ethnicity.describe().plot(kind='area', stacked=True, title='Descriptive Statistics of Referrals by Category')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x110994668>
In [17]:
group_year_ethnicity.describe()
Out[17]:
            Weapons        Drugs  ViolenceWithInjury  ViolenceWithoutInjury  \
count     27.000000     27.00000           27.000000              27.000000   
mean    1698.888889   5519.62963         5588.814815           22907.000000   
std     2744.328001   9428.57981         8030.540436           33888.038374   
min       76.000000    199.00000          270.000000            1053.000000   
25%      159.000000    503.50000          456.500000            1558.500000   
50%      324.000000   1012.00000         1015.000000            4315.000000   
75%     1801.500000   5563.50000        10269.000000           41724.000000   
max     9432.000000  31517.00000        25863.000000          111628.000000   

       OtherNonDefiance  OtherDefiance  
count         27.000000      27.000000  
mean        3082.592593   29629.111111  
std         4863.285675   48120.322021  
min          133.000000     965.000000  
25%          287.500000    1992.500000  
50%          597.000000    4697.000000  
75%         4322.000000   43781.000000  
max        15880.000000  190815.000000  
In [18]:
# note: this reads the csv produced by the (commented-out) export_df_to_csv call above,
# so run that export first
ethnic_discip = read_in_file('ethnic_discip_state.csv')
print("Columns are: %s" % list(ethnic_discip))  # columns of dataframe
print("Number of entries is: %s" % len(ethnic_discip))  # num entries
Columns are: ['Ethnicity', 'DisciplineType', 'Weapons', 'Drugs', 'ViolenceWithInjury', 'ViolenceWithoutInjury', 'OtherNonDefiance', 'OtherDefiance', 'Total', 'Year']
Number of entries is: 81
In [19]:
ethnic_discip.plot(x='Ethnicity', y='Drugs', kind='scatter', xticks=[0,1,2,3,4,5,6,7,8,9], title='Drug Referrals by Ethnicity')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x112d8ddd8>
In [20]:
byInAm = ethnic_discip[ethnic_discip.Ethnicity == 1]  # subset ethnic group 1 (American Indian)
print(byInAm)
    Ethnicity DisciplineType  Weapons  Drugs  ViolenceWithInjury  \
5           1              E       22     35                  12   
14          1              I       13     28                  95   
23          1              O      155    546                 541   
32          1              E       23     40                  16   
41          1              I       14     21                 113   
50          1              O      173    612                 605   
59          1              E       24     46                  24   
68          1              I        8     14                  89   
77          1              O      213    548                 615   

    ViolenceWithoutInjury  OtherNonDefiance  OtherDefiance  Total  Year  
5                      25                 3              3    100  2014  
14                    310                55            897   1398  2014  
23                   2179               253           1742   5416  2014  
32                     41                 3             24    147  2013  
41                    382                42           1377   1949  2013  
50                   2443               249           2368   6450  2013  
59                     37                 6             21    158  2012  
68                    320                65           1626   2122  2012  
77                   2392               263           3050   7081  2012  
In [21]:
year = byInAm.groupby(['Year']).sum()  # note: the Ethnicity column sums to 3 here (three rows of value 1), so ignore it
print(year)
      Ethnicity  Weapons  Drugs  ViolenceWithInjury  ViolenceWithoutInjury  \
Year                                                                         
2012          3      245    608                 728                   2749   
2013          3      210    673                 734                   2866   
2014          3      190    609                 648                   2514   

      OtherNonDefiance  OtherDefiance  Total  
Year                                          
2012               334           4697   9361  
2013               294           3769   8546  
2014               311           2642   6914
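
The summary at the top frames the racial-composition finding in terms of referrals per student, which raw totals alone cannot show. As a closing sketch, this is one way the normalization could look, assuming a hypothetical enrollment file; the file name and its Name/Year/Enrollment columns are placeholders, not part of this data set:

# sketch only -- 'CA_enrollment.csv' and its columns are hypothetical
enrollment = read_in_file('CA_enrollment.csv')  # expected columns: Name, Year, Enrollment
per_school = by_school.reset_index().merge(enrollment, on=['Name', 'Year'])
per_school['ReferralsPerStudent'] = per_school.Total / per_school.Enrollment
print(per_school.sort_values('ReferralsPerStudent', ascending=False).head(10))

Joining referral totals against enrollment this way would let the skew analysis above control for school size.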