HKBU Library 2019 Workshop @ HKBU

Ep. 1 A flashtalk on data processing

  • Notes: The code in this notebook is adapted from various sources, including the official tutorial, tutorial 01, and this one. All code and data files demonstrated here are for educational purposes only and are released under the MIT licence.

Importing the raw data

In this notebook, we will cover some commonly used data-cleaning steps.

In [ ]:
import pandas as pd 
In [ ]:
# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets 

df = pd.read_csv('data/BL-Flickr-Images-Book_2.csv')
df.head()
In [ ]:
df.shape 
# (the number of cases/observations, the number of variables)
In [ ]:
df.info()
In [ ]:
df.columns
In [ ]:
# can drop some useless columns (variables)
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']
In [ ]:
df.drop(to_drop, inplace=True, axis=1) # axis=0 drops along the rows (the index in pandas); axis=1 along the columns
# equivalent: df.drop(columns=to_drop, inplace=True)
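As a quick illustration of the axis argument (a toy DataFrame, not the workshop data):

```python
import pandas as pd

# hypothetical two-row, three-column frame, for illustration only
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

toy.drop(0, axis=0)    # drops the row with index label 0 -> one row left
toy.drop('b', axis=1)  # drops column 'b' -> columns ['a', 'c'] remain
```

Note that without inplace=True, drop returns a new DataFrame and leaves the original untouched.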
In [ ]:
df.columns # now you can see that the columns are dropped 
In [ ]:
df.head()
In [ ]:
# whether the records are unique 
df['Identifier'].is_unique
In [ ]:
# set a new index 
df = df.set_index('Identifier')
In [ ]:
df.head()
In [ ]:
# look up a record by its identifier 
# loc = label-based indexing 
df.loc[472]
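A minimal sketch of the difference between label-based .loc and position-based .iloc (toy data, not the Flickr set):

```python
import pandas as pd

# a Series whose index labels are identifiers, not positions
s = pd.Series(['x', 'y', 'z'], index=[472, 480, 481])

s.loc[472]   # selects by the index *label* 472 -> 'x'
s.iloc[0]    # selects by the integer *position* 0 -> 'x'
```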

Cleaning the numerical columns

Regular Expression

  • A regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Resources:

  1. a general introduction: https://regexr.com/
  2. a quick start: https://www.regular-expressions.info/quickstart.html
  3. test your code: https://regex101.com/
  4. A cheat sheet: https://www.rexegg.com/regex-quickstart.html
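The pattern used below can also be tried with Python's built-in re module before applying it to the whole column (a minimal sketch; the sample strings are made up):

```python
import re

regex = r'^(\d{4})'  # capture four digits at the start of the string

m = re.match(regex, '1879 [1878]')
print(m.group(1) if m else None)   # -> '1879'

m = re.match(regex, 'ca. 1900')    # does not start with four digits
print(m.group(1) if m else None)   # -> None
```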
In [ ]:
df.loc[1800:, 'Date of Publication'].head(10)
In [ ]:
# regular expression here 
regex = r'^(\d{4})'
In [ ]:
# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html 
extr = df['Date of Publication'].str.extract(regex, expand=False) 
extr.head()
In [ ]:
df['Date of Publication'] = pd.to_numeric(extr)
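Note that pd.to_numeric raises an error on strings it cannot parse; if the extraction had left any stray text behind, errors='coerce' would turn those values into NaN instead (a toy example, not the workshop data):

```python
import pandas as pd

s = pd.Series(['1879', '1900', 'unknown'])
out = pd.to_numeric(s, errors='coerce')
print(out)  # 'unknown' becomes NaN, the rest become floats
```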
In [ ]:
df['Date of Publication'].dtype
In [ ]:
df['Date of Publication']
In [ ]:
df['Date of Publication'].isnull().sum() # how many missing values? 
In [ ]:
df.isnull().sum() # missing values in the entire dataset  

Cleaning the strings

In [ ]:
df['Place of Publication'].head(10)
In [ ]:
print(df.loc[4157862])
print('---------------------------- another case ----------------------------') 
print(df.loc[4159587])
In [ ]:
df['Place of Publication'].unique()
In [ ]:
place_df = df.groupby('Place of Publication').size()
for k in place_df.index:
    print(k, place_df[k])
In [ ]:
# let's take a look at London 
london_pub = []
for place in df['Place of Publication']:
    london_pub.append('London' in place)

# assign via .loc to avoid pandas' SettingWithCopyWarning on chained indexing
df.loc[london_pub, 'Place of Publication'] = 'London'
In [ ]:
df['Place of Publication']
In [ ]:
Newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])
df.loc[Newcastle_pub, 'Place of Publication'] = 'Newcastle' 
In [ ]:
import numpy as np 

pub = df['Place of Publication']
oxford_pub = pub.str.contains('Oxford')
df['Place of Publication'] = np.where(oxford_pub, 'Oxford', 
                                      pub.str.replace('-', ' '))
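np.where works like a vectorised if/else: where the condition is True it takes the first value, otherwise the second (a toy example with made-up place names):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Oxford University', 'New-York', 'Oxford Press'])

# 'Oxford' where matched; elsewhere, hyphens replaced with spaces
cleaned = np.where(s.str.contains('Oxford'), 'Oxford', s.str.replace('-', ' '))
print(cleaned)  # -> ['Oxford' 'New York' 'Oxford']
```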
In [ ]:
df['Place of Publication']
In [ ]:
df.head(20)