Date & Venue: 10 April 2019, 4/F HKBU Library
Facilitator: Dr. Xinzhi Zhang (JOUR, Hong Kong Baptist University, MSc in AI & Digital Media)
xzzhang2@gmail.com || http://drxinzhizhang.com/ || https://github.com/xzzhang2 || @xin_zhi_zhang || https://scholar.google.com.hk/citations?user=iOFeIDIAAAAJ&hl=en
Workshop Outcomes
In this notebook, we will cover some commonly used data-cleaning steps.
import pandas as pd
# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets
df = pd.read_csv('data/BL-Flickr-Images-Book_2.csv')
df.head()
df.shape
# (the number of cases/observations, the number of variables)
df.info()
df.columns
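The inspection calls above can be tried on a small made-up DataFrame (the column names here are invented for illustration, not from the workshop dataset):

```python
import pandas as pd

# A tiny made-up DataFrame to illustrate the inspection calls.
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'year': [1850, 1901, 1888]})

print(toy.shape)              # (3, 2) -> (rows, columns)
print(toy.columns.tolist())   # ['title', 'year']
toy.info()                    # dtypes and non-null counts per column
```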
# drop some columns (variables) we do not need
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']
df.drop(to_drop, inplace=True, axis=1) # axis=0 along the rows (namely, index in pandas), and axis=1 along the columns.
# equivalent: df.drop(columns=to_drop, inplace=True)
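An alternative (a sketch, not used in the workshop code) is to never load the unwanted columns at all, via the `usecols` parameter of `read_csv`. The CSV text below is made up for illustration:

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file.
csv_text = "Identifier,Title,Engraver\n1,Book A,X\n2,Book B,Y\n"

# Only the listed columns are parsed; the rest are skipped at load time.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=['Identifier', 'Title'])
print(df_small.columns.tolist())  # ['Identifier', 'Title']
```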
df.columns # now you can see that the columns are dropped
df.head()
# check whether each record's Identifier is unique
df['Identifier'].is_unique
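If `is_unique` ever returns `False`, `Series.duplicated()` can show which values repeat. A sketch on made-up data:

```python
import pandas as pd

ids = pd.Series([1, 2, 2, 3])
print(ids.is_unique)                   # False
# duplicated() flags the second and later occurrences of a value.
print(ids[ids.duplicated()].tolist())  # [2]
```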
# set a new index
df = df.set_index('Identifier')
df.head()
# look up a record by its Identifier (now the index)
# loc = label-based indexing
df.loc[472]
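Note the difference between label-based `.loc` and position-based `.iloc`. A sketch on a made-up frame (the index labels are invented):

```python
import pandas as pd

books = pd.DataFrame({'Title': ['A', 'B', 'C']}, index=[472, 480, 491])

print(books.loc[472, 'Title'])   # 'A' -- selected by index label
print(books.iloc[0]['Title'])    # 'A' -- selected by integer position
```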
df.loc[1800:, 'Date of Publication'].head(10)
# regular expression: capture a four-digit year at the start of the string
regex = r'^(\d{4})'
# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
extr = df['Date of Publication'].str.extract(regex, expand=False)
extr.head()
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype
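Strings that do not match the regex become NaN before conversion. If you convert raw strings directly, `pd.to_numeric(..., errors='coerce')` gives similarly forgiving behaviour, turning unparseable values into NaN. A sketch on made-up dates:

```python
import pandas as pd

dates = pd.Series(['1879', '1868?', None])
# errors='coerce' turns values that cannot be parsed into NaN
nums = pd.to_numeric(dates, errors='coerce')
print(nums.dtype)  # float64, because NaN forces a float column
```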
df['Date of Publication']
df['Date of Publication'].isnull().sum() # how many missing values?
df.isnull().sum() # missing values in the entire dataset
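What to do with missing values depends on the analysis; two common options, dropping or filling, are sketched below on made-up data (the workshop itself does not prescribe either):

```python
import pandas as pd

s = pd.Series([1879.0, None, 1901.0])
print(s.dropna().tolist())    # [1879.0, 1901.0] -- drop missing rows
print(s.fillna(0).tolist())   # [1879.0, 0.0, 1901.0] -- fill with a default
```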
df['Place of Publication'].head(10)
print(df.loc[4157862])
print('---------------------------- another case ----------------------------')
print(df.loc[4159587])
df['Place of Publication'].unique()
place_df = df.groupby('Place of Publication').size()
for k in place_df.index:
    print(k, place_df[k])
# let's take a look at London
london_pub = []
for i in df['Place of Publication']:
    if 'London' in i:
        london_pub.append(True)
    else:
        london_pub.append(False)
# use .loc for the assignment to avoid pandas' SettingWithCopyWarning
df.loc[london_pub, 'Place of Publication'] = 'London'
df['Place of Publication']
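The loop above can be replaced by a single vectorized call to `Series.str.contains`; passing `na=False` treats missing values as non-matches instead of raising. A sketch on made-up place strings:

```python
import pandas as pd

places = pd.Series(['London; Printed by X', 'Oxford', None])
# Vectorized substring test; na=False maps missing values to False.
mask = places.str.contains('London', na=False)
print(mask.tolist())  # [True, False, False]
```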
Newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])
df.loc[Newcastle_pub, 'Place of Publication'] = 'Newcastle'
import numpy as np
pub = df['Place of Publication']
oxford_pub = pub.str.contains('Oxford')
df['Place of Publication'] = np.where(oxford_pub, 'Oxford',
                                      pub.str.replace('-', ' '))
df['Place of Publication']
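The same `np.where` pattern, sketched on made-up publication places: where the condition holds, use the first value; elsewhere, use the cleaned-up string.

```python
import numpy as np
import pandas as pd

pub = pd.Series(['Oxford University Press', 'Newcastle-upon-Tyne'])
# Where 'Oxford' appears, standardize to 'Oxford'; otherwise just
# replace hyphens with spaces (regex=False: treat '-' literally).
cleaned = pd.Series(np.where(pub.str.contains('Oxford'), 'Oxford',
                             pub.str.replace('-', ' ', regex=False)))
print(cleaned.tolist())  # ['Oxford', 'Newcastle upon Tyne']
```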
df.head(20)