# HKBU Library 2019 Workshop @ HKBU

## Ep. 1 A flashtalk on data processing

• Date & Venue: 10 April 2019, 4/F HKBU Library

• Facilitator: Dr. Xinzhi Zhang (JOUR, Hong Kong Baptist University, MSc in AI & Digital Media)

• Workshop Outcomes

1. Installing Python
2. --> Data processing <--
3. Importing and knowing your data
   1. Pandas, Matplotlib, and Seaborn
   2. Getting the attributes of your data
   3. Case selection
   4. Basic statistics
   5. Pivot table
4. Data exploration: exploring the data by visualization
   1. Univariate (unidimensional) and bivariate (two-way) data visualization
   2. Multivariate (multidimensional) problem-driven data visualization and data-driven exploration

• Notes: The code in this notebook is adapted from various sources, including the official tutorial, tutorial 01, and this one. All code and data files demonstrated here are for educational purposes only and are released under the MIT licence.

## Importing the raw data

In this notebook, we will cover some commonly used data cleaning steps.

import pandas as pd

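# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets
# assumption: the file is BL-Flickr-Images-Book.csv from that folder, saved next to this notebook; adjust the path if needed
df = pd.read_csv('BL-Flickr-Images-Book.csv')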

df.shape
# (the number of cases/observations, the number of variables)

df.info()

df.columns

# drop columns (variables) that are not needed for this analysis
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

# axis=0 refers to the rows (the index in pandas); axis=1 refers to the columns
df.drop(to_drop, inplace=True, axis=1)
# equivalent: df.drop(columns=to_drop, inplace=True)
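
The two ways of dropping columns can be compared on a small table. The sketch below uses a made-up DataFrame (`toy` and its values are illustrative, not taken from the workshop data) and returns new objects instead of using `inplace=True`, so the original stays available for comparison.

```python
import pandas as pd

# Toy data standing in for the real catalogue (values are made up).
toy = pd.DataFrame({
    "Identifier": [1, 2, 3],
    "Title": ["A", "B", "C"],
    "Engraver": [None, None, None],
    "Shelfmarks": ["x", "y", "z"],
})

unwanted = ["Engraver", "Shelfmarks"]

# axis=1 tells pandas to look for the labels among the columns.
by_axis = toy.drop(unwanted, axis=1)
# The columns= keyword expresses the same intent more explicitly.
by_keyword = toy.drop(columns=unwanted)

print(list(by_axis.columns))
print(by_axis.equals(by_keyword))
```

Because neither call uses `inplace=True`, `toy` still contains all four columns afterwards.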

df.columns # now you can see that the columns are dropped

df.head()

# check whether every record has a unique identifier
df['Identifier'].is_unique

# set a new index
df = df.set_index('Identifier')

df.head()

# look up a record by its identifier
# loc = label-based indexing (iloc is the position-based counterpart)
df.loc[472]
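
The difference between looking up a row by label and by position is easy to miss once the index holds integers. Here is a minimal sketch on a made-up table (the titles and the first two identifiers are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Identifier": [206, 216, 472],
    "Title": ["First", "Second", "Third"],
})

# An identifier column is a sensible index only if it has no duplicates.
print(toy["Identifier"].is_unique)

toy = toy.set_index("Identifier")

by_label = toy.loc[472]     # the row whose index label is 472
by_position = toy.iloc[2]   # the third row, whatever its label is

print(by_label["Title"], by_position["Title"])
```

Here both lookups reach the same row, but `toy.iloc[472]` would raise an `IndexError` because the table has only three rows.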

## Cleaning the numerical columns

### Regular Expression

• A regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Resources:

1. a general introduction: https://regexr.com/
2. a quick start: https://www.regular-expressions.info/quickstart.html
3. A cheat sheet: https://www.rexegg.com/regex-quickstart.html

df.loc[1800:, 'Date of Publication'].head(10)
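
Before applying a pattern to a whole column, it can help to try it on a few strings with Python's built-in `re` module. The sketch below tests `^(\d{4})`, which captures four digits at the very start of a string. The sample strings are made up to resemble messy date entries and are not claimed to be actual rows of the dataset.

```python
import re

# ^ anchors the match at the start; (\d{4}) captures exactly four digits.
pattern = re.compile(r'^(\d{4})')

# Hypothetical examples of messy date strings.
samples = ["1879 [1878]", "1868", "[1897?]", "1839, 38-54", "pp. 40. 1860"]

years = []
for text in samples:
    match = pattern.match(text)
    # group(1) is the captured year; None marks "no year at the start".
    years.append(match.group(1) if match else None)

for text, year in zip(samples, years):
    print(repr(text), '->', year)
```

Strings that do not begin with four digits yield `None`, which corresponds to the missing values that appear after extraction in pandas.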

# regular expression: four digits at the very start of the string
regex = r'^(\d{4})'

# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
extr = df['Date of Publication'].str.extract(regex, expand=False)

df['Date of Publication'] = pd.to_numeric(extr)

df['Date of Publication'].dtype
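
The extract-then-convert step can be checked on a tiny Series. This is a minimal sketch with made-up values (`raw` is illustrative); it shows that entries without a leading four-digit year become missing values, which is why the resulting column usually ends up with a floating-point dtype and not an integer one.

```python
import pandas as pd

# Made-up date strings, including one entry that is already missing.
raw = pd.Series(["1879 [1878]", "1868", "[1897?]", None])

# expand=False returns a Series (not a DataFrame) for a single capture group.
extracted = raw.str.extract(r'^(\d{4})', expand=False)

# Convert the captured text to numbers; unmatched entries stay missing.
years = pd.to_numeric(extracted)

print(years.dtype)
print(years.isnull().sum())
```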

df['Date of Publication']

df['Date of Publication'].isnull().sum() # how many missing values?

df.isnull().sum() # missing values in the entire dataset

## Cleaning the strings

df['Place of Publication'].head(10)

print(df.loc[4157862])
print('---------------------------- another case ----------------------------')
print(df.loc[4159587])

df['Place of Publication'].unique()

place_df = df.groupby('Place of Publication').size()
for k in place_df.index:
    print(k, place_df[k])
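
Counting how often each value occurs can also be done with `Series.value_counts()`, which sorts the result by frequency. The sketch below compares the two approaches on a made-up Series (`places` is illustrative):

```python
import pandas as pd

places = pd.Series(
    ["London", "London", "Oxford", "London; Virtue & Yorston"],
    name="Place of Publication",
)

# Group the Series by its own values and count the members of each group.
by_group = places.groupby(places).size()

# value_counts gives the same counts, most frequent value first.
by_counts = places.value_counts()

for place, n in by_counts.items():
    print(place, n)
```

Note that `"London"` and `"London; Virtue & Yorston"` are counted as different places, which is the kind of inconsistency string cleaning is meant to resolve.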

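# let's take a look at London
london_pub = []
for i in df['Place of Publication']:
    # guard against missing values, which are not strings
    if isinstance(i, str) and 'London' in i:
        london_pub.append(True)
    else:
        london_pub.append(False)

# assign through .loc so that the change is written to df itself
df.loc[london_pub, 'Place of Publication'] = 'London'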

df['Place of Publication']

newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])
df.loc[newcastle_pub, 'Place of Publication'] = 'Newcastle'
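
Replacing values through a boolean mask is commonly written with `.loc`, which selects the rows and the column in a single step. Chained indexing of the form `df[col][mask] = value` can trigger a `SettingWithCopyWarning` and may leave the original DataFrame unchanged in pandas versions that use copy-on-write. The following is a minimal sketch on a made-up table (`toy` and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Place of Publication": [
        "London",
        "London; Virtue & Yorston",
        "Newcastle-upon-Tyne",
        "Oxford",
    ],
})

# Vectorised substring test; na=False treats missing values as "no match".
is_london = toy["Place of Publication"].str.contains("London", na=False)
toy.loc[is_london, "Place of Publication"] = "London"

# isin matches whole values against a list of accepted spellings.
is_newcastle = toy["Place of Publication"].isin(
    ["Newcastle-upon-Tyne", "Newcastle upon Tyne"]
)
toy.loc[is_newcastle, "Place of Publication"] = "Newcastle"

print(toy["Place of Publication"].tolist())
```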

import numpy as np

pub = df['Place of Publication']
oxford_pub = pub.str.contains('Oxford')
df['Place of Publication'] = np.where(oxford_pub, 'Oxford',
                                      pub.str.replace('-', ' '))
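
`np.where(condition, a, b)` picks from `a` where the condition is true and from `b` otherwise, so one call can handle the Oxford case and the hyphen clean-up together. Below is a minimal sketch on made-up values (`pub` here is a toy Series); `regex=False` is added so that the hyphen is treated as a literal character in every pandas version.

```python
import numpy as np
import pandas as pd

pub = pd.Series(["Oxford, Clarendon Press", "Newcastle-upon-Tyne", "London"])

is_oxford = pub.str.contains("Oxford")

# Where is_oxford is True use 'Oxford'; elsewhere use the hyphen-free string.
cleaned = np.where(is_oxford, "Oxford", pub.str.replace("-", " ", regex=False))

print(list(cleaned))
```

The result is a NumPy array, which pandas accepts when it is assigned back to a DataFrame column of the same length.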

df['Place of Publication']

df.head(20)

