HKBU Library 2019 Workshop @ HKBU

Ep. 1 A flashtalk on data processing

  • Notes: The code in this notebook is adapted from various sources, including the official tutorial, tutorial 01, and this one. All code and data files demonstrated here are for educational purposes only and are released under the MIT licence.

Importing the raw data

In this notebook, we will cover some commonly used data-cleaning steps.

In [ ]:
import pandas as pd 
In [ ]:
# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets 

df = pd.read_csv('data/BL-Flickr-Images-Book_2.csv')
df.head()
In [ ]:
df.shape 
# (the number of cases/observations, the number of variables)
In [ ]:
df.info()
In [ ]:
df.columns
In [ ]:
# can drop some useless columns (variables)
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']
In [ ]:
df.drop(to_drop, inplace=True, axis=1) # axis=0 drops along the rows (the index in pandas); axis=1 along the columns
# equivalent: df.drop(columns=to_drop, inplace=True)
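As a quick illustration of the axis argument (a toy DataFrame, not the workshop data):

```python
import pandas as pd

# hypothetical two-row, three-column frame, for illustration only
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

toy.drop(0, axis=0)    # drops the row with index label 0 -> one row left
toy.drop('b', axis=1)  # drops column 'b' -> columns ['a', 'c'] remain
```

Note that without inplace=True, drop returns a new DataFrame and leaves the original untouched.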
In [ ]:
df.columns # now you can see that the columns are dropped 
In [ ]:
df.head()
In [ ]:
# whether the records are unique 
df['Identifier'].is_unique
In [ ]:
# set a new index 
df = df.set_index('Identifier')
In [ ]:
df.head()
In [ ]:
# look up a record by its identifier 
# loc = label-based indexing 
df.loc[472]
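A minimal sketch of the difference between label-based .loc and position-based .iloc (toy data, not the Flickr set):

```python
import pandas as pd

# a Series whose index labels are identifiers, not positions
s = pd.Series(['x', 'y', 'z'], index=[472, 480, 481])

s.loc[472]   # selects by the index *label* 472 -> 'x'
s.iloc[0]    # selects by the integer *position* 0 -> 'x'
```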

Cleaning the numerical columns

Regular Expression

  • A regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Resources:

  1. a general introduction: https://regexr.com/
  2. a quick start: https://www.regular-expressions.info/quickstart.html
  3. test your code: https://regex101.com/
  4. A cheat sheet: https://www.rexegg.com/regex-quickstart.html
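The pattern used below can also be tried with Python's built-in re module before applying it to the whole column (a minimal sketch; the sample strings are made up):

```python
import re

regex = r'^(\d{4})'  # capture four digits at the start of the string

m = re.match(regex, '1879 [1878]')
print(m.group(1) if m else None)   # -> '1879'

m = re.match(regex, 'ca. 1900')    # does not start with four digits
print(m.group(1) if m else None)   # -> None
```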
In [ ]:
df.loc[1800:, 'Date of Publication'].head(10)
In [ ]:
# regular expression here 
regex = r'^(\d{4})'
In [ ]:
# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html 
extr = df['Date of Publication'].str.extract(regex, expand=False) 
extr.head()
In [ ]:
df['Date of Publication'] = pd.to_numeric(extr)
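Note that pd.to_numeric raises an error on strings it cannot parse; if the extraction had left any stray text behind, errors='coerce' would turn those values into NaN instead (a toy example, not the workshop data):

```python
import pandas as pd

s = pd.Series(['1879', '1900', 'unknown'])
out = pd.to_numeric(s, errors='coerce')
print(out)  # 'unknown' becomes NaN, the rest become floats
```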
In [ ]:
df['Date of Publication'].dtype
In [ ]:
df['Date of Publication']
In [ ]:
df['Date of Publication'].isnull().sum() # how many missing values? 
In [ ]:
df.isnull().sum() # missing values in the entire dataset  

Cleaning the strings

In [ ]:
df['Place of Publication'].head(10)
In [ ]:
print(df.loc[4157862])
print('---------------------------- another case ----------------------------') 
print(df.loc[4159587])
In [ ]:
df['Place of Publication'].unique()
In [ ]:
place_df = df.groupby('Place of Publication').size()
for k in place_df.index:
    print(k, place_df[k])
In [ ]:
# let's take a look at London 
london_pub = []
for place in df['Place of Publication']:
    london_pub.append('London' in place)

# assign via .loc to avoid pandas' SettingWithCopyWarning on chained indexing
df.loc[london_pub, 'Place of Publication'] = 'London'
In [ ]:
df['Place of Publication']
In [ ]:
Newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])
df.loc[Newcastle_pub, 'Place of Publication'] = 'Newcastle' 
In [ ]:
import numpy as np 

pub = df['Place of Publication']
oxford_pub = pub.str.contains('Oxford')
df['Place of Publication'] = np.where(oxford_pub, 'Oxford', 
                                      pub.str.replace('-', ' '))
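np.where works like a vectorised if/else: where the condition is True it takes the first value, otherwise the second (a toy example with made-up place names):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Oxford University', 'New-York', 'Oxford Press'])

# 'Oxford' where matched; elsewhere, hyphens replaced with spaces
cleaned = np.where(s.str.contains('Oxford'), 'Oxford', s.str.replace('-', ' '))
print(cleaned)  # -> ['Oxford' 'New York' 'Oxford']
```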
In [ ]:
df['Place of Publication']
In [ ]:
df.head(20)