Quick Look at REF Data

The UK Higher Education Research Excellence Framework, 2014 results are out, so I gave myself an hour to explore the data, see what's there, and get an idea for some of the more obvious stories we might try to pull out.

The data is published as an Excel spreadsheet, so let's grab a copy.

In [5]:
#Grab the data file from a copied URL
#!curl http://results.ref.ac.uk/\(S\(jjedtxoydmmvwidxuktryu15\)\)/DownloadFile/AllResults/xlsx > ref2014.xlsx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  636k  100  636k    0     0   107k      0  0:00:05  0:00:05 --:--:--  187k
In [37]:
import pandas as pd
xls=pd.ExcelFile('ref2014.xlsx')

#How many sheets are there?
xls.sheet_names
Out[37]:
['REF2014 Profiles']

So there's just a single sheet. Let's have a look at it.

In [39]:
#Quick preview of the data
df=pd.read_excel('ref2014.xlsx')
df[:10]
Out[39]:
2014 Research Excellence Framework Results Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Quality profiles for all submissions NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Note: In this table, joint submissions are sho... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 UKPRN Institution SortOrder MainPanel UOA UnitOfAssessment msubId MultipleSubmission JointSubmission Profile StaffFte FourStar ThreeStar TwoStar OneStar Unclassified
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Percentage of the submission meeting the stand... NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
7 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Outputs 11.3 6.4 68.1 25.5 0 0
8 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Impact 11.3 20 80 0 0 0
9 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Environment 11.3 12.5 75 12.5 0 0

Row 3 (counting form 0) has column codes, row 6 has full column names. Let's load in the data as a simple dataframe using the full column names.

In [40]:
#We have some metadata rows so let's try again...
df=pd.read_excel('ref2014.xlsx',header=6)
df[:5]
Out[40]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
0 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Outputs 11.3 6.4 68.1 25.5 0 0
1 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Impact 11.3 20 80 0 0 0
2 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Environment 11.3 12.5 75 12.5 0 0
3 10000291 Anglia Ruskin University 10 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Overall 11.3 10 72 18 0 0
4 10000291 Anglia Ruskin University 10 A 4 Psychology, Psychiatry and Neuroscience NaN NaN Outputs 13.7 10.3 61.5 25.6 2.6 0

Preliminary Exploration of the Data

The dataset is completely new to me so let's get a feel for some of the things it contains.

In [24]:
#What are the columns?
for col in df.columns: print(col)
Institution code (UKPRN)
Institution name
Institution sort order
Main panel
Unit of assessment number
Unit of assessment name
Multiple submission letter
Multiple submission name
Joint submission
Profile
FTE Category A staff submitted
4*
3*
2*
1*
unclassified

Visual inspection of the first few lines of the dataframe shown above suggests that each institution puts in for different units of assessment, and each unit of assessment receives scores on four profile elements:

In [42]:
df['Profile'].unique()
Out[42]:
array(['Outputs', 'Impact', 'Environment', 'Overall'], dtype=object)
In [26]:
#What are the units of assessment?
df['Unit of assessment name'].unique()
Out[26]:
array(['Allied Health Professions, Dentistry, Nursing and Pharmacy',
       'Psychology, Psychiatry and Neuroscience', 'Biological Sciences',
       'General Engineering',
       'Architecture, Built Environment and Planning',
       'Geography, Environmental Studies and Archaeology',
       'Business and Management Studies', 'Law',
       'Social Work and Social Policy', 'Education',
       'English Language and Literature', 'History',
       'Art and Design: History, Practice and Theory',
       'Music, Drama, Dance and Performing Arts',
       'Communication, Cultural and Media Studies, Library and Information Management',
       'Computer Science and Informatics',
       'Electrical and Electronic Engineering, Metallurgy and Materials',
       'Area Studies', 'Modern Languages and Linguistics', 'Chemistry',
       'Physics', 'Mathematical Sciences',
       'Aeronautical, Mechanical, Chemical and Manufacturing Engineering',
       'Sport and Exercise Sciences, Leisure and Tourism',
       'Earth Systems and Environmental Sciences',
       'Economics and Econometrics', 'Politics and International Studies',
       'Sociology', 'Philosophy', 'Clinical Medicine',
       'Public Health, Health Services and Primary Care',
       'Civil and Construction Engineering', 'Classics',
       'Theology and Religious Studies',
       'Agriculture, Veterinary and Food Science',
       'Anthropology and Development Studies'], dtype=object)
In [30]:
#How many institutions?
df['Institution code (UKPRN)'].unique().size
Out[30]:
154
In [34]:
#How many institutions by unit of assessment, ordered?
#Rather than mulitpl count, let's base this on the Overall profile results
df[df['Profile']=='Overall'].groupby(['Unit of assessment name']).size().order(ascending=False)
Out[34]:
Unit of assessment name
Business and Management Studies                                 101
Allied Health Professions, Dentistry, Nursing and Pharmacy       94
Computer Science and Informatics                                 89
English Language and Literature                                  89
Music, Drama, Dance and Performing Arts                          84
Art and Design: History, Practice and Theory                     84
History                                                          83
Psychology, Psychiatry and Neuroscience                          82
Education                                                        76
Geography, Environmental Studies and Archaeology                 74
Communication, Cultural and Media Studies, Library and Information Management     67
Law                                                              67
General Engineering                                              62
Social Work and Social Policy                                    62
Modern Languages and Linguistics                                 57
Politics and International Studies                               56
Mathematical Sciences                                            53
Sport and Exercise Sciences, Leisure and Tourism                 51
Architecture, Built Environment and Planning                     45
Earth Systems and Environmental Sciences                         45
Biological Sciences                                              44
Physics                                                          41
Philosophy                                                       40
Chemistry                                                        37
Electrical and Electronic Engineering, Metallurgy and Materials     37
Theology and Religious Studies                                   33
Public Health, Health Services and Primary Care                  32
Clinical Medicine                                                31
Sociology                                                        29
Agriculture, Veterinary and Food Science                         29
Economics and Econometrics                                       28
Anthropology and Development Studies                             25
Aeronautical, Mechanical, Chemical and Manufacturing Engineering     25
Area Studies                                                     23
Classics                                                         22
Civil and Construction Engineering                               14
dtype: int64
In [46]:
#Which institutions submitted to most Units of Assessment?
df[df['Profile']=='Overall'].groupby(['Institution name']).size().order(ascending=False)
Out[46]:
Institution name
University College London     36
University of Manchester      35
University of Sheffield       35
University of Leeds           33
University of Birmingham      33
University of Nottingham      32
University of Cambridge       32
University of Glasgow         32
University of Oxford          31
University of Edinburgh       31
University of Bristol         31
Queen's University Belfast    28
Newcastle University          28
King's College London         27
Cardiff University            27
...
Royal College of Art                              1
Norwich University of the Arts                    1
Royal College of Music                            1
Royal Conservatoire of Scotland                   1
Royal Northern College of Music                   1
SRUC                                              1
Guildhall School of Music & Drama                 1
London Business School                            1
Stranmillis University College                    1
Institute of Zoology                              1
Trinity Laban Conservatoire of Music and Dance    1
Heythrop College                                  1
Royal Academy of Music                            1
Harper Adams University                           1
St Mary's University College                      1
Length: 154, dtype: int64

The way I phrased the previous question compared to the way I ran the query assumes that an insitution can only appear once per Unit of Assessment. Is that true?

In [53]:
#Can an institution have more than one submission to the same Units of Assessment?
df[df['Profile']=='Overall'].groupby(['Institution name','Unit of assessment name']).size().order(ascending=False)[:10]
Out[53]:
Institution name           Unit of assessment name                                         
King's College London      Allied Health Professions, Dentistry, Nursing and Pharmacy          3
University College London  Geography, Environmental Studies and Archaeology                    2
University of Oxford       Anthropology and Development Studies                                2
University of Sheffield    Allied Health Professions, Dentistry, Nursing and Pharmacy          2
                           Aeronautical, Mechanical, Chemical and Manufacturing Engineering    2
University of East Anglia  Allied Health Professions, Dentistry, Nursing and Pharmacy          2
Swansea University         Modern Languages and Linguistics                                    2
Liverpool Hope University  Music, Drama, Dance and Performing Arts                             2
University of Leeds        Music, Drama, Dance and Performing Arts                             2
University of York         Music, Drama, Dance and Performing Arts                             2
dtype: int64

Hmmm... so what's going on there then?

In [58]:
dfo=df[df['Profile']=='Overall']
dfo[(dfo['Institution name']=='King\'s College London') & (dfo['Unit of assessment name']=='Allied Health Professions, Dentistry, Nursing and Pharmacy')]
Out[58]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
2359 10003645 King's College London 690 A 3 Allied Health Professions, Dentistry, Nursing ... A Dentistry NaN Overall 52.95 42 48 9 1 0
2363 10003645 King's College London 690 A 3 Allied Health Professions, Dentistry, Nursing ... B Pharmacy and Nutritional Sciences NaN Overall 93.93 42 49 8 0 1
2367 10003645 King's College London 690 A 3 Allied Health Professions, Dentistry, Nursing ... C Nursing and Palliative Care NaN Overall 40.75 43 47 10 0 0

Ah, so we can have multiple submissions... Maybe I should have read some guidance about how the data is presented?!;-)

In that table above, the 4*, 3*, 2* numbers are pretty consistent. Is that unusual? (Perhaps file that thought for later if we have time.)

Ranking in a Unit of Assessment

Folk always want league tables. How about we rank institutions in a unit of assessment? But what does rank mean? Let's start by just doing it on the basis of the Overall scores on the doors, omitting the number of staff submitted.

In [62]:
dfo[dfo['Unit of assessment name']=='Civil and Construction Engineering'].sort(['4*','3*','2*','1*','unclassified'],
                                                                               ascending=False)[:5]
Out[62]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
7203 10007814 Cardiff University 7080 B 14 Civil and Construction Engineering NaN NaN Overall 14.30 47 50 3 0 0
2179 10003270 Imperial College London 630 B 14 Civil and Construction Engineering NaN NaN Overall 56.60 47 48 5 0 0
6279 10007852 University of Dundee 6720 B 14 Civil and Construction Engineering NaN NaN Overall 14.50 41 49 10 0 0
3575 10007798 University of Manchester 1220 B 14 Civil and Construction Engineering NaN NaN Overall 21.85 30 60 9 1 0
3799 10007799 Newcastle University 1280 B 14 Civil and Construction Engineering NaN NaN Overall 40.60 30 58 12 0 0

How does this compare with a ranking on the basis of another Profile? eg Impact?

In [63]:
dfi=df[df['Profile']=='Impact']
dfi[dfi['Unit of assessment name']=='Civil and Construction Engineering'].sort(['4*','3*','2*','1*','unclassified'],
                                                                               ascending=False)[:5]
Out[63]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
7201 10007814 Cardiff University 7080 B 14 Civil and Construction Engineering NaN NaN Impact 14.30 100 0 0 0 0
2177 10003270 Imperial College London 630 B 14 Civil and Construction Engineering NaN NaN Impact 56.60 65.7 34.3 0 0 0
3573 10007798 University of Manchester 1220 B 14 Civil and Construction Engineering NaN NaN Impact 21.85 50 50 0 0 0
6277 10007852 University of Dundee 6720 B 14 Civil and Construction Engineering NaN NaN Impact 14.50 50 30 20 0 0
3797 10007799 Newcastle University 1280 B 14 Civil and Construction Engineering NaN NaN Impact 40.60 42 58 0 0 0

This may be an interesting canned question... Take in the name of a unit of assessment and a profile, and return the sorted table.

In [66]:
def rankUoAbyProfile(uoa,profile):
    tmp=df[df['Profile']==profile]
    tmp=tmp[tmp['Unit of assessment name']==uoa].sort(['4*','3*','2*','1*','unclassified'],ascending=False)
    return tmp
In [ ]:
rankUoAbyProfile('Computer Science and Informatics','Environment')[:5]

I guess another way of looking at insitutions is by the number of people they're submitting? Can we assume that each profile has the same number of FTEs associated with it? Go with that for now, but really should check.

In [73]:
def rankUoAbySize(uoa,profile='Overall'):
    tmp=df[df['Profile']==profile]
    tmp=tmp[tmp['Unit of assessment name']==uoa].sort(['FTE Category A staff submitted','4*','3*','2*','1*','unclassified'],ascending=False)
    return tmp
In [74]:
rankUoAbySize('History')[:5]
Out[74]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
4399 10007774 University of Oxford 1410 D 30 History NaN NaN Overall 130.05 45 37 17 1 0
915 10007788 University of Cambridge 220 D 30 History NaN NaN Overall 115.10 44 37 18 1 0
6431 10007790 University of Edinburgh 6730 D 30 History NaN NaN Overall 61.22 32 51 15 2 0
3351 10004063 London School of Economics and Political Science 1150 D 30 History NaN NaN Overall 44.00 34 48 17 1 0
6591 10007794 University of Glasgow 6760 D 30 History NaN NaN Overall 43.80 40 40 18 2 0

Within an Institution...

Which submission attracted most entries?

In [75]:
def internalSize(hei):
    tmp=dfo[dfo['Institution name']==hei]
    tmp=tmp.sort(['FTE Category A staff submitted','4*','3*','2*','1*','unclassified'],ascending=False)
    return tmp
In [76]:
internalSize("Open University")
Out[76]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
4187 10007773 Open University 1390 B 7 Earth Systems and Environmental Sciences NaN NaN Overall 58.01 11 66 22 1 0
4223 10007773 Open University 1390 C 25 Education NaN NaN Overall 54.26 38 31 26 5 0
4215 10007773 Open University 1390 C 23 Sociology NaN NaN Overall 37.00 22 42 36 0 0
4195 10007773 Open University 1390 B 11 Computer Science and Informatics NaN NaN Overall 31.10 13 62 24 1 0
4247 10007773 Open University 1390 D 34 Art and Design: History, Practice and Theory NaN NaN Overall 23.20 29 57 12 2 0
4219 10007773 Open University 1390 C 24 Anthropology and Development Studies NaN NaN Overall 22.10 24 45 27 4 0
4191 10007773 Open University 1390 B 10 Mathematical Sciences NaN NaN Overall 18.80 9 53 33 0 5
4211 10007773 Open University 1390 C 22 Social Work and Social Policy NaN NaN Overall 18.60 15 54 29 2 0
4199 10007773 Open University 1390 B 13 Electrical and Electronic Engineering, Metallu... NaN NaN Overall 18.00 8 77 15 0 0
4207 10007773 Open University 1390 C 19 Business and Management Studies NaN NaN Overall 17.90 19 53 25 3 0
4183 10007773 Open University 1390 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Overall 17.00 16 61 22 1 0
4227 10007773 Open University 1390 D 29 English Language and Literature NaN NaN Overall 16.50 26 50 23 1 0
4231 10007773 Open University 1390 D 30 History NaN NaN Overall 14.30 27 49 23 1 0
4203 10007773 Open University 1390 C 17 Geography, Environmental Studies and Archaeology NaN NaN Overall 13.00 23 53 21 3 0
4235 10007773 Open University 1390 D 31 Classics NaN NaN Overall 11.80 16 39 42 3 0
4251 10007773 Open University 1390 D 35 Music, Drama, Dance and Performing Arts NaN NaN Overall 10.60 44 50 6 0 0
4239 10007773 Open University 1390 D 32 Philosophy NaN NaN Overall 8.00 6 31 59 4 0
4243 10007773 Open University 1390 D 33 Theology and Religious Studies NaN NaN Overall 6.00 18 35 47 0 0

How about ranking on the rankings within an institution?

In [77]:
def internalRanking(hei):
    tmp=dfo[dfo['Institution name']==hei]
    tmp=tmp.sort(['4*','3*','2*','1*','unclassified'],ascending=False)
    return tmp
In [78]:
internalRanking("Open University")
Out[78]:
Institution code (UKPRN) Institution name Institution sort order Main panel Unit of assessment number Unit of assessment name Multiple submission letter Multiple submission name Joint submission Profile FTE Category A staff submitted 4* 3* 2* 1* unclassified
4251 10007773 Open University 1390 D 35 Music, Drama, Dance and Performing Arts NaN NaN Overall 10.60 44 50 6 0 0
4223 10007773 Open University 1390 C 25 Education NaN NaN Overall 54.26 38 31 26 5 0
4247 10007773 Open University 1390 D 34 Art and Design: History, Practice and Theory NaN NaN Overall 23.20 29 57 12 2 0
4231 10007773 Open University 1390 D 30 History NaN NaN Overall 14.30 27 49 23 1 0
4227 10007773 Open University 1390 D 29 English Language and Literature NaN NaN Overall 16.50 26 50 23 1 0
4219 10007773 Open University 1390 C 24 Anthropology and Development Studies NaN NaN Overall 22.10 24 45 27 4 0
4203 10007773 Open University 1390 C 17 Geography, Environmental Studies and Archaeology NaN NaN Overall 13.00 23 53 21 3 0
4215 10007773 Open University 1390 C 23 Sociology NaN NaN Overall 37.00 22 42 36 0 0
4207 10007773 Open University 1390 C 19 Business and Management Studies NaN NaN Overall 17.90 19 53 25 3 0
4243 10007773 Open University 1390 D 33 Theology and Religious Studies NaN NaN Overall 6.00 18 35 47 0 0
4183 10007773 Open University 1390 A 3 Allied Health Professions, Dentistry, Nursing ... NaN NaN Overall 17.00 16 61 22 1 0
4235 10007773 Open University 1390 D 31 Classics NaN NaN Overall 11.80 16 39 42 3 0
4211 10007773 Open University 1390 C 22 Social Work and Social Policy NaN NaN Overall 18.60 15 54 29 2 0
4195 10007773 Open University 1390 B 11 Computer Science and Informatics NaN NaN Overall 31.10 13 62 24 1 0
4187 10007773 Open University 1390 B 7 Earth Systems and Environmental Sciences NaN NaN Overall 58.01 11 66 22 1 0
4191 10007773 Open University 1390 B 10 Mathematical Sciences NaN NaN Overall 18.80 9 53 33 0 5
4199 10007773 Open University 1390 B 13 Electrical and Electronic Engineering, Metallu... NaN NaN Overall 18.00 8 77 15 0 0
4239 10007773 Open University 1390 D 32 Philosophy NaN NaN Overall 8.00 6 31 59 4 0

What other data could we bring in?

Funding from Gateway to Research?

Okay... that's it, time up....