This tutorial was partially adapted from http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial, where you can learn more about Open Refine. It used to be called Google Refine so try that too when you are searching for information. However, in our case, we use python to do the same thing.

The university data can be downloaed from http://enipedia.tudelft.nl/enipedia/images/f/ff/UniversityData.zip

## What you can learn¶

The data contains quite a few issues, and this tutorial shows how to do things like:

• Cleaning up inconsistent spelling of terms (i.e. "USA", "U.S.A", "U.S.", etc).
• Converting values that are text descriptions of numeric values (i.e. $123 million) to actual numeric values (i.e. 123000000) which are usable for analysis. • Identifying which rows of a specific column contain a search term • Extracting and cleaning values for dates • Removing duplicate rows • Using a scatterplot to visualize relationships between values in different columns • Finding geographic coordinates for a list of place names (i.e. the names of universities, etc.) • Exporting cleaned data to Excel # Reading Data with Pandas¶ In [487]: %matplotlib inline import sys import pandas as pd import statsmodels.api as sm from collections import Counter, defaultdict import numpy as np import datetime import matplotlib.pyplot as plt import matplotlib matplotlib.style.use('ggplot')  In [579]: df = pd.read_csv('universityData.csv', sep = '\t', encoding = 'utf-8') df.head()  Out[579]: university endowment numFaculty numDoctoral country numStaff established numPostgrad numUndergrad numStudents 0 Paris Universitas 15 5500 8000 France NaN 2005 NaN 25000 70000 1 Paris Universitas 15 5500 8000 France NaN 2005 NaN 25000 70000 2 Lumi%C3%A8re University Lyon 2 121 NaN 1355 France NaN 1835 7046 14851 27393 3 Confederation College 4700000 NaN NaN Canada NaN 1967 not available pre-university students; technical 21160 4 Rocky Mountain College 16586100 NaN NaN United States NaN 1878 66 878 894 In [580]: print df.university[2]  Lumi%C3%A8re University Lyon 2  Apparently, there are duplicate rows. In [489]: len(df)  Out[489]: 75043 # Deduplicate entries¶ In [490]: df = df.drop_duplicates() len(df)  Out[490]: 15534 We already see an issue here where there is both the full name of a country (United States) and its abbreviation (US). To fix this, we can just copy/paste "United States" as the new cell value. # Clean up country names¶ In [491]: df['country'].unique()  Out[491]: array(['France', 'Canada', 'United States', 'USA', 'Italy', 'South Korea', 'Japan', 'United States of America', 'England, UK', 'United States )', 'Saudi Arabia', 'Honduras', 'United Kingdom', 'England', 'the Netherlands', 'India', 'Russia', 'U.S.', 'Brazil', 'US', 'U.S.A.', 'Philippines', 'Australia', 'England, United Kingdom', 'Wales', ',', 'China', 'South Africa', 'UK', 'Puerto Rico', 'Bulgaria', 'Botswana', 'Taiwan', 'Sri Lanka', 'Colombia', 'Iran', 'Russian Federation', 'Rossija', 'Finland', 'Republic of China', 'Chile', 'Romania', 'Utopia', 'Singapore', 'Lebanon', 'Turkey', 'Canada B1P 6L2', 'Jordan', 'Albania', 'Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795', 'Scotland', 'Pakistan', 'Scotland, United Kingdom', 'Denmark', 'Mexico', 'Thailand', 'Argentina', 'Cura%C3%A7ao', 'Sweden', 'Scotland, UK', 'Bangladesh', 'Nepal', 'Netherlands', 'Switzerland', 'Egypt', 'Spain', 'Satellite locations:', 'Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL'], dtype=object) In [492]: country_df = df.groupby('country').size() for k in country_df.index: print(k, country_df[k])  (',', 2) ('Albania', 8) ('Argentina', 1) ('Australia', 49) ('Bangladesh', 27) ('Botswana', 1) ('Brazil', 10) ('Bulgaria', 2) ('Canada', 625) ('Canada B1P 6L2', 40) ('Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795', 1) ('Chile', 1) ('China', 4) ('Colombia', 1) ('Cura%C3%A7ao', 1) ('Denmark', 8) ('Egypt', 1) ('England', 338) ('England, UK', 286) ('England, United Kingdom', 64) ('Finland', 2) ('France', 8) ('Honduras', 1) ('India', 40) ('Iran', 9) ('Italy', 2) ('Japan', 112) ('Jordan', 2) ('Lebanon', 5) ('Mexico', 2) ('Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL', 1) ('Nepal', 4) ('Netherlands', 2) ('Pakistan', 2) ('Philippines', 147) ('Puerto Rico', 2) ('Republic of China', 1) ('Romania', 1) ('Rossija', 2) ('Russia', 1) ('Russian Federation', 2) ('Satellite locations:', 1) ('Saudi Arabia', 1) ('Scotland', 64) ('Scotland, UK', 16) ('Scotland, United Kingdom', 16) ('Singapore', 4) ('South Africa', 7) ('South Korea', 7) ('Spain', 1) ('Sri Lanka', 4) ('Sweden', 24) ('Switzerland', 8) ('Taiwan', 1) ('Thailand', 4) ('Turkey', 4) ('U.S.', 304) ('U.S.A.', 81) ('UK', 83) ('US', 455) ('USA', 5104) ('United Kingdom', 405) ('United States', 6949) ('United States )', 1) ('United States of America', 167) ('Utopia', 1) ('Wales', 2) ('the Netherlands', 2)  In [493]: us_condition = df['country'].isin(['U.S.', 'U.S.A.', 'US', 'United States', 'United States )', 'United States of America']) df['country'][us_condition] = 'USA'  In [494]: ca_condition =[] for i in df['country']: if i.__contains__('Canada'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'Canada'  In [495]: ca_condition =[] for i in df['country']: if i.__contains__('China'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'China'  In [496]: ca_condition =[] for i in df['country']: if i.__contains__('England'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'England'  In [497]: ca_condition =[] for i in df['country']: if i.__contains__('Scotland'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'England'  In [498]: uk_condition = df['country'].isin(['UK', 'United Kingdom', 'Wales']) df['country'][uk_condition] = 'England'  In [499]: ca_condition =[] for i in df['country']: if i.__contains__('Netherlands'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'Netherlands'  In [500]: ca_condition =[] for i in df['country']: if i.__contains__('Russia'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'Russia'  In [501]: ca_condition =[] for i in df['country']: if i.__contains__('Bahamas'): ca_condition.append(True) else: ca_condition.append(False) df['country'][ca_condition] = 'Bahamas'  In [502]: 'Russian Federation'.__contains__('Russia')  Out[502]: True In [503]: uk_condition = df['country']=='Cura%C3%A7ao' df['country'][uk_condition] = "Curacao"  In [504]: df[df['country']==',']  Out[504]: university endowment numFaculty numDoctoral country numStaff established numPostgrad numUndergrad numStudents 2130 Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco 23760.0 2188 NaN , 1087 1958-11-20 971 24921 35271 3077 Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco 34140.0 2188 NaN , 1087 1958-11-20 971 24921 35271 Universidad Juárez Autónoma de Tabasco is a public institution of higher learning located in Villahermosa, Tabasco, Mexico. In [505]: uk_condition = df['country']==',' df['country'][uk_condition] = "Mexico"  In [506]: df[df['country']=='Satellite locations:']  Out[506]: university endowment numFaculty numDoctoral country numStaff established numPostgrad numUndergrad numStudents 75009 Nova Southeastern University US$64.5 million 2083 NaN Satellite locations: 4319 1964 22060 6397 28457

Nova Southeastern University (NSU) is a private nonprofit university, with a main campus located on 300 acres (120 ha) in Davie, in the US state of Florida. Formerly referred to as "Nova" and now commonly called "NSU", the university currently consists of 18 colleges and schools offering over 175 programs of study with more than 250 majors.

In [507]:
uk_condition = df['country']=='Satellite locations:'
df['country'][uk_condition]  = "USA"

In [508]:
country_df = df.groupby('country').size()
for k in country_df.index:
print(k, country_df[k])

('Albania', 8)
('Argentina', 1)
('Australia', 49)
('Bahamas', 1)
('Botswana', 1)
('Brazil', 10)
('Bulgaria', 2)
('Chile', 1)
('China', 5)
('Colombia', 1)
('Curacao', 1)
('Denmark', 8)
('Egypt', 1)
('England', 1274)
('Finland', 2)
('France', 8)
('Honduras', 1)
('India', 40)
('Iran', 9)
('Italy', 2)
('Japan', 112)
('Jordan', 2)
('Lebanon', 5)
('Mexico', 4)
('Nepal', 4)
('Netherlands', 4)
('Pakistan', 2)
('Philippines', 147)
('Puerto Rico', 2)
('Romania', 1)
('Rossija', 2)
('Russia', 3)
('Saudi Arabia', 1)
('Singapore', 4)
('South Africa', 7)
('South Korea', 7)
('Spain', 1)
('Sri Lanka', 4)
('Sweden', 24)
('Switzerland', 8)
('Taiwan', 1)
('Thailand', 4)
('Turkey', 4)
('USA', 13062)
('Utopia', 1)


# Clean up values for the number of students¶

We need to clean the data for the number of students. Not all of the values are numeric, and many of them contain bits of text in addition to the actual number of the students. To figure out which entries need to be fixed, we need to use a Numeric facet:

In [509]:
df.describe()

Out[509]:
count 15534 15504 13541 602 15534 3166 15534 15183 15532 13830
unique 1085 1172 601 63 47 352 448 887 1104 698
top Pennsylvania State University 1.708E9 8864 60 USA 3373 1855 800 77,179 Total 800
freq 8960 4480 8960 192 13062 216 8970 651 896 564
In [510]:
df.numStudents.unique()

Out[510]:
array(['70000', '27393', '21160', '894', '15553', '62801', '65234', nan,
'15424', '4533', '5756', '24424', '2426', '7730', '2987', '10477',
'13773', '14388', '14522', '32472', '40829', '17101', '7497',
'2400', '7306', '17500', '4463', '2200', '6448', '4863', '900+',
'13570', '4927', '32739', '1604', '2499', '13785', '16317', '339',
'1874', '10790', '4494', '4836', '4176', '4352', '30819', '6314',
'26851', '3251', '616', '49589', '50116', '3224', '52405',
'~50,000', '7773', '3785', '3822', '5000', '20712', '21000',
'21048', '7787', '17231', '18502', '6158', '11804', '36041',
'38912', '29390', '1207', '3090', '3249', '24370', '2654', '2815',
'15705', '14575', '1144', '10235', '1015', '3657', '1033', '1062',
'9840', '32762', '14806', '10623', '13438', '18900', '23420',
'3192', '2480', '12125', '4800', '30930', '7848', '5560', '12130',
'25063', '25277', '16567', '12270', '2566', '7881', '1972', '1779',
'2715', '4179', '3200', '971', '7538', '10133', '6214', '14196',
'35271', '11034', '17055', '14153', '998', '4238', '9251', '23500',
'4666', '17502', '11722', '47430', '4600', '9000', '7110', '3120',
'20655', '2766', '29290', '3740', '8394', '600', '3927', '1836',
'7313', '4061', '2073', '2183', '1315', '15484', '3345', '735',
'15872', '11308', '15889', '7885', '1066', '1331', '3574', '6840',
'1737', '11604', '6058', '1917', '5079', '23859', '23930', '2110',
'2640', '14000', '2942', '3483', '4854', '5930', '23175', '23855',
'2355', '9625', '4500', '1000', '779', '1760', '1885', '37641',
'3739', '17000', 'http://www.brocku.ca/athletics/quickfacts.php',
'20762', '21080', '757', '1116', '2000', '9748', '14177', '2242',
'2488', '19000', '2305', '16548', '16809', '38140', '13863', '4900',
'10363', '21016', '27008', '18,234 -', '26,101 -', '36000', '4200',
'21091', '110436', '18396', '18448', '9558', '7.07184E7',
'7.71264E7', '10000', '1650', '5097', '5471', '5008', '7542',
'8800', '1791', '23122', '30328', '6080', '2776', '14235', '2636',
'2744', '4250', '8094', '13600', '34767', '3709', '22125', '24875',
'2225', '9203', '27676', '8955', '25215', '29549', '30000', '11659',
'17', '3168', '17849', '22280', '19020', '800', '2725', '32275',
'33825', '4337', '3974', '12114', '23135', '3000', '3485', '29703',
'8831', '30303', '11476', '11842', '602', '8649', '1703', '3500',
'5201', '3719', '42099', '42595', '5760', '6082', '2175', '2231',
'22479', '4183', '47122', '5790', '4014', '28602', '42513', '1118',
'15405', '18442', '5033', '4600800.0', '10534', '28525', '55014',
'7751', '7755', '58698', '7567', '11201', '2100', '24531', '1500',
'1,500+', '46422', '46423', '754', '17290', '39165', '9300',
'41674', '42716', '42761', '42907', '49020', '12445', '19333',
'22405', '24662', '18572', '11593', '3209', '23162', '23590',
'8742', '5552', '5,552 total', '1415', '16822', '7942', '6957',
'11733', '28311', '5200', '13995', '14070', '30823', '31766',
'10688', '12557', '21225', '25313', '12024', '12683', '2124',
'5500', '5514', '5525', '12725', '9517', '35274', '37132', '450',
'23315', '24735', '10502', '20407', '8034', '1210', '46302',
'13959', '24849', '13490', '15265', '8606', '29462', '7615',
'24170', '8603', '8645', '9135', '8878', '9370', '29616', '22707',
'2334', '28018', '33585', '21210', '3439', '19700', '25890',
'26073', '9555', '13351', '13676', '4278', '35200', '14560', '7943',
'8005', '12714', '12002', '23761', '24977', '12400', '82', '3432',
'24594', '2648', '19890', '10820', '6101', '2565', '29894', '29952',
'5260', '22275', '16000', '20222', '21011', '37426', '19391',
'25700', '26960', '5379', '24593', '14765', '2900',
'faculty to student ratio: 12:1', '9150', '1600', '35204', '4000',
'47966', '1683', '~1,610', '363', '44', '32304', '6945', '51611',
'51721', '51853', '30540', '32500', '13183', '9940', '18475',
'38934', '57409', '24225', '22805', '30461', '44595', '49000',
'4991', '9339', '23085', '23336', '17455', '10383', '12000', '811',
'3246', '21827', '2282', '11646', '15642', '8810', '23600', '24192',
'8298', '23470', '13089', '19415', '19740', '41215', '2836', '3700',
'3770', '3829', '3721', '4525', '5365', '3050', '2345', '7700',
'7257', '2263', '19780', '2589', '2542', '29898', '2120', '2300',
'3600', '4384', '1748', '7928', '7005', '7277', '7740', '2270',
'1130.0', '1841.0', '6919', '47878', '34000', '9106', '8278',
'3378', '7131', '19966', '6964', '5809', '19664', '19993', '3420',
'2658', '2609', '1938', '2442', '42606', '2435', '10074', '20212',
'20,000+', '1460', '6276', '1308', '2719', '40000', '11065', '2660',
'3633', '80022011', '20487', '111329', '852', '131403', '24595',
'28', '1371', '3800', '23883', '3137', '33788', '28394', '30377',
'5.29092E7', '5.49E7', '14706', '720', '2674', '2886', '23103',
'8000', '10482', '6', '1890', '1872', '34255', '53337', '27523',
'18498', '11922', '6398', '15189', '70440', '72254', '23588',
'24378', '8985', '3461', '2', '27209', '22974', '11581', '47800',
'47954', '14769', '15195', '1771', '16355',
'One MEELLLLIOONNN DOLL HAIRS', '19500', '17499', '10894', '33500',
'15319', '3480', '28094', '8846', '6726', '27606', '20939', '20330',
'21535', '6647', '1454', '20643', '21329', '13925', '4496', '17261',
'17333', '21559', '630', '795', '6290', '8524', '33977', '44817',
'95833', '630 Dickinson School of Law', '800 College of Medicine',
'8,524 World Campus', '44,817 University Park', '795 Great Valley',
'95,833 Total', '33,977 Commonwealth Campuses',
'6,290 PA College of Tech', '25045', '29887', '31040', '27816',
'87274', '17999', '17950', '4197033329', '41710', '15649', '11956',
'28766', '28823', '2724', '45963', '978', '11180', '12312', '19721',
'5233', '29689', '4072', '11867', '38010', '18630', '27195',
'15657', '14820', '13381', '13893', '3245', '15064', '15473',
'7521', '5913', '5998', '33602', '31899', '39271', '39922', '33490',
'24125', '28994', '28290', '1115', '17351', '2284', '17533',
'18971', '39697', '39726', '27269', '19379', '13410', '2203',
'19728', '3005', '3045', '3540', '2459', '18004', '45126', '45954',
'2559', '15446', '32611', '56868', '28203', '20956', '3537', '6654',
'285392012', '15920', '34870', '8289', '25469', '10549', '18762',
'14713', '28091', '55115', '32653', '5152', '380', '14754', '7764',
'15951', '16040', '15536', '9799', '9352', '28457'], dtype=object)
In [511]:
float('5.29092E7')

Out[511]:
52909200.0
In [512]:
df.numStudents = [str(i).replace('+', '').replace('~', '') for i in df.numStudents]
df.numStudents = [str(i).replace(',', '').replace('-', '').strip() for i in df.numStudents]
df.numStudents = [i.split(' ')[0] for i in df.numStudents]

In [513]:
# https://stackoverflow.com/questions/4138202/using-isdigit-for-floats
def isDigit(x):
try:
float(x)
return True
except ValueError:
return False

In [514]:
isDigit('1130.0')

Out[514]:
True
In [515]:
str.isdigit('1130.0')

Out[515]:
False
In [516]:
df['numStudents'] = df['numStudents'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)

In [517]:
odds = {}
for i in df['numStudents']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

In [518]:
len(df)

Out[518]:
15534
In [519]:
odds

Out[519]:
{}
In [520]:
df.numStudents.iloc[0]

Out[520]:
70000.0
In [521]:
df.numStudents.unique()[:10]

Out[521]:
array([ 70000.,  27393.,  21160.,    894.,  15553.,  62801.,  65234.,
nan,  15424.,   4533.])
In [522]:
df.numStudents.max(), df.numStudents.min()

Out[522]:
(4197033329.0, 2.0)
In [523]:
# df.numStudents.astype('float')

In [524]:
# pd.to_numeric(df.numStudents)#, errors='ignore')

In [525]:
df.describe()

Out[525]:
numStudents
count 1.380000e+04
mean 3.934969e+05
std 3.585477e+07
min 2.000000e+00
25% 1.065000e+03
50% 1.068800e+04
75% 3.397700e+04
max 4.197033e+09
In [526]:
plt.hist(df.numStudents.dropna(), bins = 200, color='blue')
plt.yscale('log')
plt.xscale('log')
plt.show()


# Clean up values for the endowment¶

First remove the numeric facet for numStudents and create a new numeric facet for endowment. Select only the non-numeric values, as was done for the number of students. Already we see issues like "US$1.3 billion" and "US$186 million"

In [527]:
df['endowment']

Out[527]:
0                       15
2                      121
3                  4700000
4                 16586100
5                 16586100
6                 40200750
7                 40200750
8                 40200750
9                 40200750
10                40200750
11                40200750
12                40200750
13                40200750
14                40200750
15                40200750
16                40200750
17                40200750
18               562000000
19               562000000
22                     NaN
23                     NaN
24                     NaN
25                     NaN
26                     NaN
27                   1.3E7
28                   1.0E7
29                   3.5E8
30                   4.5E8
31                     0.0
32                     0.0
...
74996        $2,17 billion 74997$2,17 billion
74998        $2,17 billion 74999$2,17 billion
75000        $2,17 billion 75001$2,17 billion
75002        $2,17 billion 75003$2,17 billion
75004    US $401.2 million 75006 US$213.2 million
75007      US $381 million 75008 US$64.5 million
75009     US $64.5 million 75010 US$64.5 million
75011    US $716.8 million 75012 US$716.8 million
75013    US $716.8 million 75014 US$716.8 million
75015    US $716.8 million 75016 US$716.8 million
75017    US $716.8 million 75018 US$716.8 million
75019    US $716.8 million 75020 US$716.8 million
75021    US $716.8 million 75022 US$716.8 million
75023    US $716.8 million 75024 US$716.8 million
75025    US $716.8 million 75026 US$716.8 million
Name: endowment, dtype: object
In [528]:
np.sum(odds.values())

Out[528]:
0.0
In [529]:
odds = {}
for i in df['endowment']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[529]:
[('US$226 million', 1), ('US$6.44 billion', 16),
('1,5 billion \xe2\x82\xac', 1),
('US $988.626', 1), ('USD$130.0 million', 1),
('S$830 million', 2), ('$772.8 million AUD', 2),
('CHF 193.64 million annual budget', 4),
('US $620 million', 2), ('US$7.2 billion', 1),
('US $64.5 million', 3), ('US$612 million', 4),
('US$668 million', 1), ('US$12,197,377', 1),
('US $13,111,000', 1), ('US$13.3 million', 1),
('US $1.77 billion', 32), ('R$ 43.518 million', 1),
('Rs1 billion', 1),
('U.S. $n/a', 1), ('C$143 million', 1),
('US $1.786 billion', 16), ('US$ 398.9 million', 1),
('US$270.6 million', 2), ('State: US$101 million', 1),
('US $388.7 million', 6), ('US$24,011,651', 4),
('USD $54.500 million', 1), ('US$ 12 million', 1),
('US $29 million', 1), ('US$582 million  Beneficiary of Duke Endowment', 1),
('US$800 million', 1), ('US$380 million', 2),
('US$28.6 million', 1), ('US$49 million', 1),
('INR 40 Crore', 4),
('US $16.7 billion', 1), ('\xc2\xa3 61.3M', 16), ('US$ 176.50 million', 1),
('$138 million;', 1), ('US$1.55 billion', 32),
('US$14.543 million', 1), ('US$ 4.3 million', 1),
('US$687 million', 1), ('US$2.1 million', 1),
('US $12.3 million', 1), ('US$511 million', 1),
('US$144 million', 1), ('Indian Rupees', 16), ('US$665 million', 8),
('$21.9 million USD http://colleges.usnews.rankingsandreviews.com/best-colleges/sarasota-fl/ringling-college-12574', 1), ('US$2.224 billion in 2006', 48),
('C$77.5 million', 3), ('US$51 million', 1),
('US$40.3 million', 1), ('US$96.8 million', 1),
('US $14.8 million', 1), ('US$ 66.15 million', 1),
('US$59,600,000', 8), ('US$45,478,931', 1),
('US$25 million', 1), ('US$52 million', 1),
('US$79 million', 1), ('US$66 million', 1),
('US$2.002 billion in 2012', 8), ('Barstow College Foundation', 1), ('US$  billion', 2),
('US$20.8 million', 1), ('US$1.5 million', 2),
('US$70,025,283', 1), ('US$ 120 million', 1),
('US$190 million', 1), ('\xe2\x82\xb14.46 billion', 32), ('US$86.7 million', 2),
('US $77.9 million', 1), ('US$31.1 million', 8),
('US$472 million', 1), ('US$719.85 million', 1),
('\xe0\xb8\xbf5.95 billion', 2),
('US$5.28 billion', 9), ('US$14 million', 1),
('\xc2\xa3 57.8M', 16),
('No available figures', 2),
('US$16.4 million', 1), ('US$266.2 million', 1),
('\xc2\xa361.3M', 16),
('US$155.9 million', 1), ('US$59 million', 1),
('US $14.5 million', 1), ('US$ 16.5 billion', 2),
('US$1.4 billion', 1), ('US$80.0 million', 1),
('US $867 million', 4), ('US$28,610,629', 1),
('Unknown, depends on donations by businesses affiliated with the G\xc3\xbclen movement.',
4),
('US $1.9 billion', 2), ('US$590 million', 1),
('US$541 million', 1), ('US$85 million', 1),
('$1.008 billion USD', 2), ('US$513.4 million', 1),
('US$108.0 million', 6), ('US$ 349.7 million', 1),
('US$110.2 million', 4), ('US$16 million', 1),
('US$5.35 billion', 16), ('US$1.24 billion', 2),
('US $10.1 million', 1), ('S$88.62 million', 1),
('US $1.87 billion', 1), ('US$330.2 million', 2),
('US$1.707 billion', 2), ('US$283 million', 1),
('US$-- billion', 2), ('US$67 million', 1),
('US $213 million', 1), ('Unreported', 1), ('US$3.5 billion', 1),
('US $165.9 million', 1), ('US$254 million', 1),
('US $7.1 million', 1), ('US$7.80 billion', 48),
('US $704 million', 2), ('US$513 million', 2),
('R2,173.4 million', 3),
('USD $71.5 million', 1), ('C$5,930,722', 2),
('US $807 million', 1), ('US$ 518 million', 1),
('US$123,456,789', 1), ('Ministry of Human Resources Development, Government of India', 1), ('US$36.389 million', 1),
('US $329 million', 6), ('US$127.29 million', 1),
('US $42 million', 1), ('USD$41 million', 2),
('$CAD145 million', 1), ('US$27 million', 1),
('US$474 million', 1), ('US$ 1.25 billion', 6),
('$44 million USD', 1), ('US$73.9 Million', 1),
('US $13.6 million', 4), ('$30+ million', 1),
('US $117.4 million', 1), ('$22.5M', 1),
('NT$1.2 billion', 1), ('US$400.5 million', 1),
('over $100 million', 1), ('US$ 657 million', 2),
('C$920.7 million', 8), ('US$5.1 billion', 4),
('US$47.46 million', 1), ('US$ 10.0 million', 1),
('US $88.5 Million', 1), ('U.S.$19.2 million', 2),
('C$1.518 billion', 32), ('US$73.0 million', 1),
('US $27.4 million', 16), ('US$109.4 million', 1),
('US$52,177,934', 1), ('US$1.04 billion', 1),
('U.S. $25.7 million', 2), ('US$63 million', 1),
('US$675 million', 1), ('AUD$1.0082 Bn', 1),
('US $80 million', 1), ('Grants', 1), ('U.S.$30.3 million', 1),
('US$27.9 million', 2), ('US$ 10.3 billion', 1),
('US$138 million', 1), ('US$1 billion', 2),
('US $2.18 Billion', 10), ('US$ 962.196 million', 2),
('$72 million;', 1), ('US$2.54 billion', 16),
('US$14.4 billion', 2), ('$2,17 billion', 16),
('US$8.6 million', 1), ('US$65 million', 1),
('US $246 million', 1), ('US$847.8 million', 6),
('$800M', 4), ('US$45.3 million', 1),
('US $6.56 billion', 48), ('US$785 million', 8),
('US $21,727,123', 1), ('C$ 32.5 million', 2),
('$155.5 Million as of May 2006', 1), ('US$208.9 million', 3),
('US $7,313,220', 1), ('PhP 395.024 Million', 1), ('C$807.5 million', 8),
('US$2.5 Billion', 10), ('US$165 million', 2),
('US $2.59 billion', 2), ('US$185.7 million', 1),
('US$7.8 billion', 1), ('US$3 million', 1),
('US $2.22 billion', 2), ('US$ 7.0 billion', 1),
('US $41,617,510', 2), ('US$63.6 million', 1),
('US$40.5 million', 1), ('US$208.9 million parent institution', 16),
('US$30.2 million', 1), ('US$13,330,712.40', 4),
('US $665 million', 8), ('$429.95 million 2011', 1),
('http://dbpedia.org/resource/Public', 1),
('USD $622 million', 1), ('US$1.77 billion', 32),
('A$1.2371 billion', 8), ('US$955 million', 6),
('S$88,619 million', 1), ('US$ 237.6', 16),
('C$13.1 million', 1), ('US$30.1 million', 2),
('US$320 million', 1), ('US$1.88 billion', 1),
('US19.2 million', 2),
('US$170.2 million', 1), ('HK$5.3 billion', 1),
('US $5.1 billion', 4), ('US$908 million', 1),
('approx. $100 million', 1), ('\xe2\x82\xb1233 million', 32), ('US$140.9 million', 32),
('US$631,947,260.', 1), ('US$38.7 million', 1),
('$31.4 million USD', 2), ('U.S.$201.9 million', 2),
('US $478 million', 4), ('US$25.1 million', 1),
('USD $175,552,772', 1), ('R$189,976,461.00', 1),
('US$8.8 billion', 1), ('US$1.3 billion', 19),
('--09-30', 1),
('US $6.3 billion', 1), ('US$3.6 million', 1),
('US$10,386,829', 1), ('US$ 1.02 billion', 16),
('US $6.1 million', 1), ('US$45 million', 1),
('US $754.1 million', 48), ('US$1.12 billion', 49),
('US $640 million', 3), ('US$196 million', 1),
('US $145.7 million', 1), ('US$522 million', 1),
('US$63.1 million', 2), ('US$ 5.24 billion', 1),
('C$121.9 million', 8), ('US$ 39,009,814', 16),
('US $122.4 million', 1), ('US$ 65.12 million', 1),
('--06-30', 12),
('US$67,364,000', 8), ('US$ 19.3 billion', 1),
('Government of India, Government of Maharashtra', 2),
('US$931 thousand', 1), ('US$ 2.93 billion', 3),
('US $270 million', 1), ('\xe0\xb8\xbf4,171,014,200', 2), ('US$149.4 million', 1),
('US $69 million', 1), ('US$ 22.7 million', 1),
('US $16.5 Million', 1), ('US$ $14.543 million', 1), ('Public', 2), ('US$12.8 million', 1),
('US$30.1 million', 1), ('US$2.7 million', 36),
('C$34 million', 2), ('US$532 million', 1),
('US $25.2 Million', 1), ('R$229,647,000', 2),
('$43 million 2011', 1), ('$90000 million AUD', 2),
('US$69,551,090', 32), ('US$105.9 million', 32),
('$55.0 million in 2010', 1), ('US$1.008 billion', 2),
('P334 million', 1),
('Other: $48 million', 1), ('US$1.017 billion', 48),
('$173.4 million 2011', 1), ('US$237.6 million', 16),
('Government Funded', 2),
('CAN$200 million', 1), ('US$791,231,000', 1),
('USD $1.04 billion', 1), ('US$3.1 million', 1),
('$13 million Annual Report', 1), ('US$56.1 million', 32),
('USD $75,500,000', 2), ('$29 million 2011', 1),
('US$93.7 million', 1), ('US$17,167,060', 1),
('$179 million as of May 2010', 1), ('US$197.4 million', 8),
('US$357 million', 1), ('US$26.8 million', 1),
('US $1.519 billion', 24), ('US$975 million', 1),
('US $206.2 million', 1), ('US$1.14 billion', 2),
('US $374.4million', 4), ('C$230 million', 1),
('US $31.8 million', 1), ('Rs300 Crore per year', 1), ('US$73.1 million', 1),
('US$2.9 billion', 1), ('US$ 4.45 billion', 1),
('$65.8 M http://www.nacubo.org/Images/All%20Institutions%20Listed%20by%20FY%202007%20Market%20Value%20of%20Endowment%20Assets_2007%20NES.pdf', 1), ('CHF 183 million annual budget', 4), ('$70,025,283 USD', 1),
('$38 M annually', 1), ('\xe2\x82\xb1286 million', 1), ('\xc2\xa35.9M', 2), ('US$134.6 million parent institution', 1),
('US $1.449 billion', 16), ('US$1.218 billion', 1),
('R$200,000,000', 1), ('US$ 50 million', 1),
('US $49 million', 1), ('Approx. 450,00,00,000 Rs. per annum', 2), ('US$113.3 million', 1),
('US $186 million', 1), ('US$44 Million', 1),
('CAD$712 million', 48), ('US$13.8 billion', 2),
('USD $241.7 million', 1), ('C$ 94.7 million', 2),
('US$29.9 million', 1), ('US$239 million', 1),
('US$6.8 billion', 2), ('US$87.4 million', 1),
('US$13,324,363', 2), ('A$1.3 billion', 1),
('US $115.4 million', 1), ('US$1.14 billion', 2),
('US$2.8 billion', 1), ('R429.6 million', 1), ('US$716.8 million', 16),
('US$50 million', 1), ('US$664.6 million', 1),
('US $127.679 million', 1), ('US$144 million', 1),
('US $624.7 million', 1), ('A$829 million', 1),
('C$33 million', 2), ('A$882.8 million', 1),
('US $384.3 million', 2), ('$46.3 million 2011', 1),
('US$133,171,000', 4), ('US$1.295 billion', 16),
('US $467 million', 1), ('US$101 million', 4),
('C$99M', 1), ('US$1.058 Billion', 1),
('C$1.437 billion', 32), ('US$9.6 million', 1),
('US $544 million', 1), ('US$1.25 billion', 6),
('US $174.8 Million', 1), ('US$6.58 billion', 16),
('US$8.1 million', 8), ('USD$2 billion', 2),
('US$95 million', 4), ('US$28.9 million', 2),
('\xe2\x82\xb18.5 billion', 1),
('US $603 million', 11), ('US$25.8 million', 4),
('US $368.7 million', 1), ('-', 1), ('US$375.2 million', 1),
('US$87.1 million', 1), ('C$9,576,604', 1),
('US$15.7 million', 1), ('US$880.3 million', 8),
('A$72.3 million', 1), ('US$ 1.6 billion', 1),
('C$114 million', 1), ('$19.2 million 2011', 1),
('US$61.58 million', 1), ('$139 million .', 4),
('Govt., UGC', 27),
('US $5.4 billion', 8), ('US$32 billion', 6),
('$147M', 1), ('US$2.64 billion', 1),
('US $1.62 billion', 9), ('US$381 million', 1),
('US$152 million', 1), ('US$37,524,837', 1),
('$100 million +', 2), ('--02-29', 324), ('US$28.9 million', 2),
('Approx. 450,00,00,000 Rs.', 2),
('U.S. $9.6 million', 1), ('US$54.9 million', 1),
('US$199,876,589', 1), ('US$284.3 million', 2),
('US$37 million', 2), ('US$481 million', 1),
('US$848.3 million', 1), ('US$30.29 million', 1),
('US$1.194 billion', 16), ('US$170.1 million', 2),
('US $469.2 million', 16), ('$12.5 million.', 1),
('US$2.0 billion', 3), ('US$29,985,851', 1),
('U.S.$4.6 billion', 16), ('US$700 million', 1),
('US $213.2 million', 1), ('US$256 million', 1),
('USD $52.0 million', 1), ('US$10 billion', 1),
('$731.28 million 2011', 1), ('US$8.76 million', 1),
('CAD$1.05 billion', 48), ('US$3,254,000', 2),
('US $70,000,000', 2), ('Over$405 million', 1),
('Grants/private funding', 1),
('US $1.703 billion', 24), ('per annum', 7), ('US$1.65 billion', 3),
('US$525 million', 1), ('US$ 739 million', 1),
('US$96.6 million', 1), ('US$2.503 billion in 2012', 48),
('US$453 million', 1), ('US$22.4 million', 2),
('US$6.58 billion', 16), ('US$63.35 million', 2),
('US $720 million', 2), ('PH\xe2\x82\xb1 4 billion', 32), ('US$ 27 million', 2),
('US$244 million', 6), ('US$6.8  billion', 16),
('C$164.6 Million', 8), ('US$1 Billion+', 1),
('US $401.2 million', 1), ('USD$12,820,445', 1),
('US $529 million', 1), ('US$2.002 billionin 2012', 8),
('R$1.078.315.210,14', 1), ('US$283 million', 1),
('US $169.3 million', 1), ('US$3.7 million', 1),
('US $174,505', 1), ('U.S.$436.7 million', 1),
('US $815.1 million', 48), ('US$47 million', 4),
('US $813,882', 1), ('US$1.5 billion', 32),
('US$51 million', 1), ('US$411 million', 1),
('USD$246.1 million', 1), ('US$47 Million', 1),
('$28.07 Million USD', 1), ('$603.6 million parent institution', 3),
('US $7.8 billion', 48), ('US$12 million', 2),
('US $37,360,494', 1), ('US$ 17.1 billion', 2),
('Total: approx. $149 million', 1), ('US$130 million', 1),
('R$653.010.340,45', 1), ('US$891 million', 1),
('US $650,000', 1), ('US$61.5 million', 1),
('US$350,500,000', 2), ('U.S.$226 million', 1),
('US $39 million', 4), ('more than$100 million', 2)]
In [530]:
np.float('1E6')

Out[530]:
1000000.0
In [531]:
df.endowment = [str(i).replace('US $', '').replace('US$', '') for i in df.endowment]
df.endowment = [str(i).replace('USD$', '').replace('USD$', '') for i in df.endowment]
df.endowment = [str(i).replace('U.S. $', '').replace(',', '').strip() for i in df.endowment] endowment = [] for i in df.endowment: if i.__contains__('$'):
endowment.append(str(i).split('$')[1]) else: endowment.append(i) df.endowment = endowment  In [532]: df.endowment = [str(i).replace(' million', 'E6').replace(' billion', 'E9').strip() for i in df.endowment] df.endowment = [str(i).replace('million', 'E6').replace('billion', 'E9').strip() for i in df.endowment] df.endowment = [str(i).replace(' Million', 'E6').replace(' Billion', 'E9').strip() for i in df.endowment] df.endowment = [str(i).split(' ')[0] for i in df.endowment] df.endowment = [str(i).replace('M', 'E6').strip() for i in df.endowment] df.endowment = [str(i).replace(';', '').replace('+', '').strip() for i in df.endowment] # df.endowment = [str(i).split('xbf')[1] for i in df.endowment] # df.endowment = [str(i).split('xb')[1] for i in df.endowment] # df.endowment = [str(i).split('xa')[1] for i in df.endowment]  After most of this has been cleaned up, select the non-numeric values, and delete them, just as was done for the numStudents. In [533]: df['endowment'] = df['endowment'].apply(lambda x: np.float(x) if isDigit(x) else np.nan)  In [534]: df.describe()  Out[534]: endowment numStudents count 1.490400e+04 1.380000e+04 mean 2.149103e+09 3.934969e+05 std 1.927573e+10 3.585477e+07 min 0.000000e+00 2.000000e+00 25% 2.430000e+08 1.065000e+03 50% 1.546000e+09 1.068800e+04 75% 1.708000e+09 3.397700e+04 max 1.545840e+12 4.197033e+09 • 巴西雷亚尔的符号 R\$
• CANADA DOLLARS C\$• 澳元的货币符号 A\$

Both "million" and "Million" are in the values, so it's useful to convert all the values to lowercase instead of cleaning this up twice.

In [535]:
df.head()

Out[535]:
0 Paris Universitas 15.0 5500 8000 France NaN 2005 NaN 25000 70000.0
2 Lumi%C3%A8re University Lyon 2 121.0 NaN 1355 France NaN 1835 7046 14851 27393.0
3 Confederation College 4700000.0 NaN NaN Canada NaN 1967 not available pre-university students; technical 21160.0
4 Rocky Mountain College 16586100.0 NaN NaN USA NaN 1878 66 878 894.0
5 Rocky Mountain College 16586100.0 NaN NaN USA NaN 1878 66 878 894.0

# Others¶

## numFaculty¶

In [536]:
df.numFaculty = [str(i).replace('Total: ', '').replace(',', '') for i in df.numFaculty]
df.numFaculty = [str(i).replace('>', '').replace('~', '') for i in df.numFaculty]
df.numFaculty = [str(i).split(' ')[0] for i in df.numFaculty]

In [537]:
df['numFaculty'] = df['numFaculty'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)

In [538]:
odds = {}
for i in df['numFaculty']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[538]:
[]

## numDoctoral¶

In [539]:
odds = {}
for i in df['numDoctoral']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[539]:
[('available', 1),
('N\\A', 27),
('not available', 40),
('N.A', 16),
('~25', 4)]
In [540]:
df.numDoctoral = [str(i).replace('~', '') for i in df.numFaculty]

In [541]:
df['numDoctoral'] = df['numDoctoral'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)


## numStaff¶

In [542]:
odds = {}
for i in df['numStaff']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[542]:
[('incl. 1,403 academics and 150 researchers', 2),
('Total: 873', 48),
('Approximately 7,170', 1),
('approximately 30', 1),
('Around 120', 27),
('~500', 4),
('190 researchers & lecturers, 153 administrative & technical staff', 2),
('Approximately 500', 1),
('4,478 employees', 64),
('Around 5,000', 48),
('>21,000', 2),
('?', 1),
('~100', 4),
('full-time, part-time', 1),
('Total: 1,608', 2),
('Total: 1,600', 16),
('Full-time: 1,469', 2),
('156 full-time; 229 part-time', 1),
('Part-time: 139', 2),
('960 full-time, 460 part-time', 16),
('appx. 20', 8),
('~', 1)]
In [543]:
df.numStaff = [str(i).replace('Total: ', '').replace(',', '') for i in df.numStaff]
df.numStaff = [str(i).replace('>', '').replace('~', '') for i in df.numStaff]
df.numStaff = [str(i).replace('Around ', '').replace('appx. ', '') for i in df.numStaff]
df.numStaff = [str(i).replace(' employees', '').replace('Approximately ', '') for i in df.numStaff]
df.numStaff = [str(i).replace('Full-time: ', '').replace('Part-time: ', '') for i in df.numStaff]
df.numStaff = [str(i).replace('approximately ', '') for i in df.numStaff]
df.numStaff = [str(i).split(' ')[0] for i in df.numStaff]

In [544]:
df['numStaff'] = df['numStaff'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)


In [545]:
odds = {}
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[545]:
[('268 full-time MBA', 24),
('approx. 300', 1),
('~650', 4),
('n/a', 1),
('Aprx. 2,000', 2),
('630 Dickinson School of Law', 640),
('120+', 8),
('795 Great Valley', 640),
('Does not offer postgraduate studies', 1),
('~3,914', 1),
('4,937 full-time, 1,446 part-time', 2),
('800 College of Medicine', 640),
('not available', 1),
('~3,200', 2),
('Some postdoctoral students and visiting scholars', 1),
('~160', 3),
('142 Ph.D. students', 2),
('1,682 Commonwealth Campuses', 640),
('available', 1),
('71 MLHR', 24),
('14,020 Total', 640),
('Approximately 1000', 1),
('Approx. 600', 1),
('----', 4),
('~500', 4),
('~5,500', 2),
('~60', 4),
('325 part-time MBA', 24),
('over 1,300', 1),
('N\\A', 9),
('none', 1),
('6,223 University Park', 640),
('ca. 3,230', 2),
('95 MAcc', 24),
('9,957 \xe2\x80\x93 Vancouver', 24),
('531 \xe2\x80\x93 Okanagan', 24),
('3,890 World Campus', 640),
('TBD', 2)]
In [546]:
df.numPostgrad = [str(i).replace('~', '').replace(',', '') for i in df.numPostgrad]


In [547]:
df['numPostgrad'] = df['numPostgrad'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)


In [548]:
odds = {}
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[548]:
[('4,747 full-time', 96),
('over 2,000', 1),
('Approximately 10,000', 1),
('2,000+', 8),
('approximately 1,500', 1),
('n/a', 1),
('Aprx. 2,000', 2),
('School of Liberal Arts; School of Science & Technology; School of Graduate & Professional Studies; Shannon School of Business',
20),
('approx. 2,000', 1),
('36,518 \xe2\x80\x93 Vancouver', 24),
('~25,000', 2),
('65 per year', 1),
('77,179 Total', 896),
('approx. 1,200', 8),
('Approx. 12,000', 2),
('38,594  University Park', 896),
('32,295 Commonwealth Campuses', 896),
('Approximately 2,300', 1),
('7,004 \xe2\x80\x93 Okanagan', 24),
('~400', 1),
('pre-university students; technical', 1),
('Around 10,000', 2),
('available', 3),
('21,726 -', 24),
('None', 1),
('900+', 1),
('ca. 3,046', 2),
('Approx. 7,100', 1),
('6,290 PA College of Tech', 896),
('4,634 World Campus', 896),
('28,477 full-time, 2,102 part-time', 2),
('none', 1),
('approx. 2,150', 4),
('Approximately 730', 1),
('~13,000', 2),
('~2,000', 4),
('diploma, degree available', 1),
('Approx. 13,000', 1),
('~1,560', 1),
('???', 1),
('~1,550', 4)]
In [549]:
df.numUndergrad = [str(i).replace('~', '').replace(',', '') for i in df.numUndergrad]

In [550]:
df['numUndergrad'] = df['numUndergrad'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)


## established¶

In [551]:
df.established

Out[551]:
0                            2005
2                            1835
3                            1967
4                            1878
5                            1878
6                            1901
7                            1901
8                            1947
9                            1947
10                           1963
11                           1963
12       1963 - university status
13       1963 - university status
14       1947 - four-year college
15       1947 - four-year college
16                         1901 -
17                         1901 -
18                           1924
19                           1924
22                           1970
23                     1918-05-01
24                           1925
25               Established 1985
26                 Chartered 1984
27                           1994
28                           1947
29                           1948
30                           1878
31                     2004-09-30
32                     2004-09-30
...
74996                        1848
74997                        1848
74998                        1848
74999                        1848
75000                        1848
75001                        1848
75002                        1848
75003                        1848
75004                  1881-08-28
75006                  1855-10-15
75007                        1911
75008                        1964
75009                        1964
75010                        1964
75011                        1851
75012                        1851
75013                        1851
75014                        1851
75015                        1851
75016                        1851
75017                        1851
75018                        1851
75019                        1851
75020                        1851
75021                        1851
75022                        1851
75023                        1851
75024                        1851
75025                        1851
75026                        1851
Name: established, dtype: object
In [570]:
odds = {}
for i in df['established']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1

odds.items()

Out[570]:
[]
In [571]:
import re

def getYear(s):
try:
match = re.match(r'.*([1-3][0-9]{3})', s)
return np.int(match.group(1))
except:
return np.nan

In [572]:
df.established = [getYear(i) for i in df.established]

In [573]:
df.describe()

Out[573]:
count 1.490400e+04 1.348700e+04 1.348700e+04 3.161000e+03 15186.000000 1.513800e+04 1.550000e+04 1.380000e+04
mean 2.149103e+09 8.695432e+03 8.695432e+03 1.057127e+04 1865.196036 1.104538e+04 2.762582e+05 3.934969e+05
std 1.927573e+10 1.286384e+05 1.286384e+05 2.483439e+05 65.865596 3.622826e+05 2.581618e+07 3.585477e+07
min 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1066.000000 0.000000e+00 0.000000e+00 2.000000e+00
25% 2.430000e+08 1.953000e+03 1.953000e+03 1.592000e+03 1855.000000 8.000000e+02 6.290000e+03 1.065000e+03
50% 1.546000e+09 8.864000e+03 8.864000e+03 2.799000e+03 1855.000000 3.067000e+03 1.782100e+04 1.068800e+04
75% 1.708000e+09 8.864000e+03 8.864000e+03 5.000000e+03 1881.000000 6.223000e+03 3.667500e+04 3.397700e+04
max 1.545840e+12 1.407201e+07 1.407201e+07 1.280201e+07 2012.000000 2.998201e+07 3.198523e+09 4.197033e+09

# University¶

In [90]:
df['university'].unique(), df['university'].unique().size

Out[90]:
(array(['Paris Universitas', 'Lumi%C3%A8re University Lyon 2',
'Confederation College', ..., 'University of San Francisco',
'Loyola Marymount University', 'Nova Southeastern University'], dtype=object),
1085)
In [236]:
# university_df = df.groupby('university').size()
# for k in university_df.index:
#     print(k, university_df[k])