by Numan Yilmaz, 18 July 2016

Thousands of students graduate from schools in California every year. The California Department of Education (CDE) has one of the largest student populations in the country. You can find more information about the CDE at http://www.cde.ca.gov.

The CDE has a Data & Statistics section on their website. A variety of datasets have been shared for educators, researchers and the public from the early 90s to current.

This projected uses the dataset for the 2014 academic year. The following questions will be answered:

• What is the maximum, minimum and total number of graduates in that academic year?
• What is the mean and median number of graduates?
• How many male and female graduates are there? What is the range?
• What is the number of graduates by ethnicity?

In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
from IPython.core.display import HTML
HTML('<style>{}</style>'.format(css))

Out[2]:

This code and CSS file is from Brandon Rhodes's lecture at PyCon 2015. https://www.youtube.com/watch?v=5JnMutdy6Fw. It allows you to differentiate the column and index from the dataframe and visualize the dataset.

### CDS_CODE¶

This 14-digit code is the official, unique identification of a school within California. The first two digits identify the county, the next five digits identify the school district, and the last seven digits identify the school.

### ETHNIC¶

This is a coded field for race/ethnic designation. The ethnic designations are coded as follows:

• Code 0 = Not reported
• Code 1 = American Indian or Alaska Native, Not Hispanic
• Code 2 = Asian, Not Hispanic
• Code 3 = Pacific Islander, Not Hispanic
• Code 4 = Filipino, Not Hispanic
• Code 5 = Hispanic or Latino
• Code 6 = African American, not Hispanic (formerly known as Black, not Hispanic)
• Code 7 = White, not Hispanic
• Code 9 = Two or More Races, Not Hispanic (See Glossary for complete definitions of the ethnic groups.)

### GENDER¶

This field is a coded field identifying gender. The gender is coded as follows:

• M = Male
• F = Female

Number of twelfth-grade graduates. This data includes summer graduates and does not include students with high school equivalencies (i.e., General Educational Development (GED) test or California High School Proficiency Examination (CHSPE)).

Number of twelfth-grade graduates who also completed all courses required for entry into the University of California (UC) and/or California State University (CSU) with a grade "C" or better. This data includes summer graduates and does not include students with high school equivalencies (i.e., GED or CHSPE).

### YEAR¶

Year of data.

In [3]:
# assign dataframe to df

Out[3]:
0 1100170109835 2 M 12 1 1415
1 1100170109835 5 M 1 0 1415
2 1100170109835 6 F 3 0 1415
3 1100170109835 6 M 1 0 1415
4 1100170109835 5 F 7 1 1415
In [4]:
#get a general overview of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21013 entries, 0 to 21012
Data columns (total 6 columns):
CDS_CODE    21013 non-null int64
ETHNIC      21013 non-null int64
GENDER      21013 non-null object
YEAR        21013 non-null int64
dtypes: int64(5), object(1)
memory usage: 985.1+ KB


There are 21,013 rows in our dataset. There is no need to delete or remove spaces, letters or missing data. The type of columns are also listed.

In [5]:
# another way to learn data types
df.dtypes

Out[5]:
CDS_CODE     int64
ETHNIC       int64
GENDER      object
YEAR         int64
dtype: object
In [6]:
# another way of getting the size of dataset
len(df['CDS_CODE'])

Out[6]:
21013

How many schools are there in the dataset?

In [7]:
len(df['CDS_CODE'].unique())

Out[7]:
2490
In [8]:
grouped = df.groupby(['CDS_CODE']).sum().sort_values('GRADS', ascending=False).head(10)[['GRADS','UC_GRADS']]
grouped.plot(kind='bar', figsize=(15, 8))

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10617b940>

Let's do some research on the GRADS column, which is the number of graduate students in California in the 2014-2015 academic year.

In [9]:
#For convenience, let's assign the output of this dataframe to a variable...

In [10]:
# largest number of graduates

Out[10]:
508
In [11]:
# smallest number of graduates

Out[11]:
1
In [12]:
# display the result of the largest number of graduates, sorted by graduates' value.

Out[12]:
7022 19648731936749 5 F 508 221 1415
4685 19644511932441 5 F 430 218 1415
7021 19648731936749 5 M 422 138 1415
4686 19644511932441 5 M 422 135 1415
4701 19644511939164 5 F 385 206 1415
In [13]:
## total graduated students in the academic year of 2014-2015

Out[13]:
426950

now let's add UC_GRADS columns to our dataframe and apply function to see multiple columns

In [14]:
df[['GRADS', 'UC_GRADS']].apply(sum)

Out[14]:
GRADS       426950
dtype: int64

OR

In [15]:
df[['GRADS', 'UC_GRADS']].sum()

Out[15]:
GRADS       426950
dtype: int64
In [16]:
# how many percent of graduate qualify to go to the UC.

Out[16]:
43.372526056915326
In [17]:
# mean of grads column

Out[17]:
20.318374339694476
In [18]:
# median of grads column

Out[18]:
5.0

The median is lower than the mean. This indicates that some of the schools had a very high number of graduates, pushing the value of the mean up.

In [19]:
# or just use describe() to get general info about the column

Out[19]:
count    21013.000000
mean        20.318374
std         39.613251
min          1.000000
25%          1.000000
50%          5.000000
75%         17.000000
max        508.000000
Name: GRADS, dtype: float64

Calculate the number of male graduates and assign them to a new variable called m

In [20]:
m = df[df.GENDER == 'M']
m

Out[20]:
210947

Calculate the number of female graduates and assign them to a new variable called f

In [21]:
f = df[df.GENDER == 'F']
f

Out[21]:
216003

Range of female and male student numbers

In [22]:
f - m

Out[22]:
5056

According to the dataset 5056 more female students graduated in the academic year of 2014.

We have previously seen the number of total graduates in the 2014 academic year. Let's confirm here.

In [23]:
f + m

Out[23]:
426950

Now, lets focus on ethnics and graduates columns and do some math.

In [24]:
#For convenience, let's assign the output of this dataframe to a new variable, df_ethnic


Out[24]:
0 2 12
1 5 1
2 6 3
3 6 1
4 5 7

Number of students whose ethnicity code is 2.

In [25]:
d = df_ethnic
d = d[d['ETHNIC'] == 2]
d

Out[25]:
41698

The value of etchicity codes are in the dataset description. They are stored as integers. Let's create a function which assigns a value to each code. The function is called eth.

In [26]:
def eth (value):
if value == 0:
return 'Not Reported'
elif value == 1:
return 'American Indian or Alaska Native, Not Hispanic'
elif value == 2:
return 'Asian, Not Hispanic'
elif value == 3:
return 'Pacific Islander, Not Hispanic'
elif value == 4:
return 'Filipino, Not Hispanic'
elif value == 5:
return 'Hispanic or Latino'
elif value == 6:
return 'African American, not Hispanic'
elif value == 7:
return 'White, not Hispanic'
elif value == 9:
return 'Two or More Races, Not Hispanic'
else:
return value

In [27]:
# test the funtion
eth(2)

Out[27]:
'Asian, Not Hispanic'
In [28]:
# test the funtion one more time
eth(5)

Out[28]:
'Hispanic or Latino'
In [29]:
#apply the funtion in all dataset and create a new column

df_ethnic['ETHNIC_NAME'] = df_ethnic['ETHNIC'].apply(eth)

/anaconda/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
app.launch_new_instance()

Out[29]:
0 2 12 Asian, Not Hispanic
1 5 1 Hispanic or Latino
2 6 3 African American, not Hispanic
3 6 1 African American, not Hispanic
4 5 7 Hispanic or Latino
In [30]:
# create a new variable called df_ethnicName


Out[30]:
0 12 Asian, Not Hispanic
1 1 Hispanic or Latino
2 3 African American, not Hispanic
3 1 African American, not Hispanic
4 7 Hispanic or Latino
In [31]:
#plot the number of total graduates by ethnicity

df_ethnicName.groupby(df_ethnicName['ETHNIC_NAME']).sum().plot(kind = 'barh', grid= True, figsize = (10,5))

Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1123cacf8>

Let's go back to our original dataaset

In [32]:
df.head()

Out[32]:
0 1100170109835 2 M 12 1 1415
1 1100170109835 5 M 1 0 1415
2 1100170109835 6 F 3 0 1415
3 1100170109835 6 M 1 0 1415
4 1100170109835 5 F 7 1 1415

Now, Let's see the relation between GRADS and UC_GRADS columns by scatter plotting.

In [33]:
df.plot(x=['GRADS'], y=['UC_GRADS'], kind='scatter', grid=True, figsize = (10, 5))

Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1127267f0>

As we have seen before from the mean and median, schools have different number of graduates. Graduate numbers between 0 and 150 and UC graduates number between 0 to 50 are high. There are some schools had zero UC grads.

# Conclusion¶

Close to half million students graduated from in a California high school in the 2014-2015 academic year. There are more female gradutes then male graduates. Almost half of the students are eligible to continue their education in a California University. Hispanic or Latino ethnicity is dominant in the schools. There is a big variety in the number of graduates. Most schools had graduates between 0 and 120.

In [ ]: