import pandas as pd
import numpy as np
import builtins
# Shadow the built-in print so every call ends with a blank line
def print(*args, **kwargs):
    builtins.print(*args, **kwargs, end='\n\n')
The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value: one that can be easily detected.
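NaN never compares equal to anything, not even itself, which is why detection goes through isnull/notnull rather than ==; a quick illustration (not specific to any dataset):
np.nan == np.nan  # False: NaN is unequal even to itself
pd.isnull(np.nan)  # True: isnull/notnull are the reliable detectors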
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
string_data.isnull().sum()
1
# The built-in Python None value is also treated as NA in object arrays
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
pd.options.display.max_colwidth = 500
df = pd.read_csv(r'examples/NA_handling_methods.csv', names=['Description'], sep='\n')
df.index = list(range(1, 5))
def f1(x):
    x = x.split(' ')
    return x.pop(0)
df['Argument'] = df['Description'].map(f1) # get the first word
def f2(x):
    x = x.split(' ')
    return " ".join(x[1:])
df['Description'] = df['Description'].map(f2) # remove the first word
df = df.reindex(columns=['Argument', 'Description'])
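An alternative sketch of the same munging with the vectorized str accessor instead of map (n=1 splits on the first space only):
# parts = df['Description'].str.split(' ', n=1, expand=True)
# df['Argument'], df['Description'] = parts[0], parts[1]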
print("NA handling methods")
df
NA handling methods
 | Argument | Description |
---|---|---|
1 | dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
2 | fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
3 | isnull | Return boolean values indicating which values are missing/NA. |
4 | notnull | Negation of isnull. |
There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, dropna can be helpful.
# On a Series, it returns the Series with only the non-null data and index values
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
# data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
data
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
cleaned = data.dropna()
cleaned
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
# Passing how='all' will only drop rows that are all NA
data.dropna(how='all')
# data.dropna(how='any')
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
# To drop columns in the same way, pass axis=1
data[4] = NA
data
 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1, how='all')
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | NaN | NaN |
1 | -1.007727 | NaN | NaN |
2 | 2.909089 | NaN | -2.142449 |
3 | -0.866425 | NaN | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
print(df.dropna())
print(df.dropna(thresh=2))
          0         1         2
4  0.300446 -1.266830  0.296984
5 -0.963050  1.649808 -1.686630
6 -0.087371 -0.342160 -0.608004

          0         1         2
2  2.909089       NaN -2.142449
3 -0.866425       NaN  1.881426
4  0.300446 -1.266830  0.296984
5 -0.963050  1.649808 -1.686630
6 -0.087371 -0.342160 -0.608004
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | NaN | NaN |
1 | -1.007727 | NaN | NaN |
2 | 2.909089 | NaN | -2.142449 |
3 | -0.866425 | NaN | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
print(df.dropna(subset=[1, 2], how='any').shape)
print(df.dropna(subset=[1, 2], how='all').shape)
(3, 3)
(5, 3)
df.fillna(0)
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.000000 | 0.000000 |
1 | -1.007727 | 0.000000 | 0.000000 |
2 | 2.909089 | 0.000000 | -2.142449 |
3 | -0.866425 | 0.000000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
# Calling fillna with a dict, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.500000 | 0.000000 |
1 | -1.007727 | 0.500000 | 0.000000 |
2 | 2.909089 | 0.500000 | -2.142449 |
3 | -0.866425 | 0.500000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
# fillna returns a new object, but you can modify the existing object in-place
_ = df.fillna(0, inplace=True)
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.000000 | 0.000000 |
1 | -1.007727 | 0.000000 | 0.000000 |
2 | 2.909089 | 0.000000 | -2.142449 |
3 | -0.866425 | 0.000000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | NaN | -0.741355 |
3 | 1.690786 | NaN | -0.396711 |
4 | -0.296663 | NaN | NaN |
5 | -1.422233 | NaN | NaN |
df.fillna(method='ffill')
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | -0.012437 | -0.741355 |
3 | 1.690786 | -0.012437 | -0.396711 |
4 | -0.296663 | -0.012437 | -0.396711 |
5 | -1.422233 | -0.012437 | -0.396711 |
df.fillna(method='ffill', limit=2)
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | -0.012437 | -0.741355 |
3 | 1.690786 | -0.012437 | -0.396711 |
4 | -0.296663 | NaN | -0.396711 |
5 | -1.422233 | NaN | -0.396711 |
# We might pass the mean or median of a Series for simple imputation
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
Arguments to fillna
Argument | Description |
---|---|
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation; by default 'ffill' if function called with no other arguments |
axis | Axis to fill on; default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
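Not every argument in the table appears above. As a minimal sketch, method can also back-fill, and axis=1 fills across columns rather than down rows (the tiny frame here is illustrative only):
frame = pd.DataFrame([[1., NA, 3.], [NA, 5., NA]])
frame.fillna(method='bfill')  # propagate the next valid value backward
frame.fillna(method='ffill', axis=1)  # fill along each row instead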
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not.
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
# drop_duplicates returns a DataFrame where the duplicated array is False
data.drop_duplicates()
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
# Suppose we had an additional column of values and wanted to
# filter duplicates only based on the 'k1' column
data['v1'] = range(7)
data
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
5 | two | 4 | 5 |
6 | two | 4 | 6 |
data.drop_duplicates(['k1'])
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
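duplicated and drop_duplicates keep the first observed value combination by default; passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1'], keep='last')  # rows 4 and 6 survive rather than 0 and 1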
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
 | food | ounces |
---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
# Suppose you wanted to add a column indicating the
# type of animal that each food came from
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
# or
# data['food'].map(lambda x: meat_to_animal[x.lower()])
data
 | food | ounces | animal |
---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but replace provides a simpler and more flexible way to do so
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
print(data.replace(-999, np.nan))
print(data.replace([-999, -1000], np.nan))
print(data.replace([-999, -1000], [np.nan, 0]))
print(data.replace({-999: np.nan, -1000: 0}, inplace=True))
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

None
(The last call prints None because inplace=True modifies data in place rather than returning a new object.)
Axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
# Like a Series, the axis indexes have a map method
print(data.index.map(transform))
# -----------
trans = lambda x: x[:2].upper()
print(data.index.map(trans))
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
Index(['OH', 'CO', 'NE'], dtype='object')
data.index = data.index.map(transform)
data
 | one | two | three | four |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename
data.rename(index=str.title, columns=str.upper)
 | ONE | TWO | THREE | FOUR |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colo | 4 | 5 | 6 | 7 |
New | 8 | 9 | 10 | 11 |
# rename can be used in conjunction with a dict-like object
# providing new values for a subset of the axis labels
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'}, inplace=True)
data
 | one | two | peekaboo | four |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
# TODO: Discretization and Binning
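Until that section is written, a minimal sketch of the idea: pd.cut bins continuous values into discrete intervals, and pd.qcut bins by sample quantiles (the ages and bin edges below are illustrative only):
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)  # Categorical of intervals such as (18, 25]
pd.value_counts(cats)  # count of observations per bin
pd.qcut(ages, 4)  # four quantile-based bins with equal counts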
Filtering or transforming outliers is largely a matter of applying array operations.
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.015383 | 0.024896 | 0.038430 | -0.064249 |
std | 1.008377 | 0.996469 | 0.989867 | 0.995204 |
min | -3.141061 | -3.430585 | -2.886733 | -3.260575 |
25% | -0.708238 | -0.597616 | -0.656763 | -0.766798 |
50% | 0.051303 | 0.051438 | 0.048483 | -0.071339 |
75% | 0.730775 | 0.691972 | 0.666596 | 0.639310 |
max | 3.766905 | 3.349857 | 3.361674 | 2.854713 |
col = data[0]
col[np.abs(col) > 3]
1     -3.141061
56     3.766905
247    3.008167
787   -3.135201
Name: 0, dtype: float64
To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame
data[(np.abs(data) > 3).any(axis=1)]
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
1 | -3.141061 | 0.783127 | 0.469851 | -0.395636 |
56 | 3.766905 | 0.030909 | 1.633494 | 1.140207 |
247 | 3.008167 | -0.992888 | 0.260335 | -3.063802 |
327 | -0.074563 | 3.349857 | 2.032817 | -0.609838 |
405 | -0.362679 | -0.103159 | 3.087357 | 2.517252 |
448 | -0.437464 | 3.054526 | -0.254406 | 1.047457 |
611 | -0.369185 | -3.430585 | -0.515431 | -1.668876 |
619 | -0.311282 | 0.535545 | 3.039966 | -1.199799 |
631 | -0.923738 | 0.346249 | -0.485636 | -3.260575 |
765 | -0.856659 | -0.717638 | -1.155797 | -3.069395 |
787 | -3.135201 | -1.338375 | -0.404171 | 1.373373 |
917 | 0.437193 | -3.026429 | -0.438433 | 0.024585 |
984 | -0.629788 | 0.308067 | 3.361674 | -1.683448 |
Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.014885 | 0.024949 | 0.037941 | -0.063855 |
std | 1.004937 | 0.993711 | 0.988330 | 0.994002 |
min | -3.000000 | -3.000000 | -2.886733 | -3.000000 |
25% | -0.708238 | -0.597616 | -0.656763 | -0.766798 |
50% | 0.051303 | 0.051438 | 0.048483 | -0.071339 |
75% | 0.730775 | 0.691972 | 0.666596 | 0.639310 |
max | 3.000000 | 3.000000 | 3.000000 | 2.854713 |
# The statement np.sign(data) produces 1 and –1 values based on
# whether the values in data are positive or negative
np.sign(data).head()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -1.0 | -1.0 | -1.0 | -1.0 |
1 | -1.0 | 1.0 | 1.0 | -1.0 |
2 | 1.0 | -1.0 | 1.0 | -1.0 |
3 | 1.0 | -1.0 | 1.0 | -1.0 |
4 | 1.0 | -1.0 | -1.0 | 1.0 |
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering
df = pd.DataFrame(np.arange(20).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([3, 4, 1, 2, 0])
That array can then be used in iloc-based indexing or the equivalent take function
df
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
df.take(sampler)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
0 | 0 | 1 | 2 | 3 |
To select a random subset without replacement, you can use the sample method on Series and DataFrame
df.sample(n=3)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
4 | 16 | 17 | 18 | 19 |
2 | 8 | 9 | 10 | 11 |
1 | 4 | 5 | 6 | 7 |
# sample with replacement
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
3    6
1    7
3    6
0    5
3    6
0    5
1    7
2   -1
2   -1
4    4
dtype: int64
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# create a dummy column for Sex column
# using map method
train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
# using get_dummies
pd.get_dummies(train.Sex)
 | female | male |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 0 | 1 |
... | ... | ... |
886 | 0 | 1 |
887 | 1 | 0 |
888 | 1 | 0 |
889 | 0 | 1 |
890 | 0 | 1 |
891 rows × 2 columns
Generally, if a feature takes k categorical values, you can represent it with k-1 dummy variables.
Sex takes two categorical values, so we need only one variable to encode all the given information, and we drop the first column.
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]
 | Sex_male |
---|---|
0 | 1 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 1 |
... | ... |
886 | 1 |
887 | 0 |
888 | 0 |
889 | 1 |
890 | 1 |
891 rows × 1 columns
train.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 | 0 | 1 |
# Passing columns to get_dummies makes this easier
train = pd.read_csv('http://bit.ly/kaggletrain')
pd.get_dummies(train, columns=['Sex', 'Embarked'])
 | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 1 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 0 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | 0 | 1 | 0 | 0 | 1 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | 1 | 0 | 0 | 0 | 1 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | 1 | 0 | 0 | 0 | 1 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | 0 | 1 | 1 | 0 | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | 0 | 1 | 0 | 1 | 0 |
891 rows × 15 columns
# drop the first column after get_dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)
 | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 1 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | 1 | 0 | 1 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | 0 | 0 | 1 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | 0 | 0 | 1 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | 1 | 0 | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | 1 | 1 | 0 |
891 rows × 13 columns
# example of MovieLens 1M dataset
mnames = ['movieid', 'title', 'genres']
movies = pd.read_table(r'MovieLens-1M/movies.dat', sep='::', header=None,
names=mnames, engine='python')
movies.head()
 | movieid | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 'Western'], dtype=object)
# One way to construct the indicator DataFrame is to start with a
# DataFrame of all zeros
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use dummies.columns to compute the column indices for each genre
gen = movies.genres[0]
print(gen.split('|'))
dummies.columns.get_indexer(gen.split('|'))
['Animation', "Children's", 'Comedy']
array([0, 1, 2], dtype=int64)
Then, we can use .iloc to set values based on these indices
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
# join with movies
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movieid                                    1
title                       Toy Story (1995)
genres           Animation|Children's|Comedy
Genre_Animation                            1
Genre_Children's                           1
Genre_Comedy                               1
Genre_Adventure                            0
Genre_Fantasy                              0
Genre_Romance                              0
Genre_Drama                                0
Genre_Action                               0
Genre_Crime                                0
Genre_Thriller                             0
Genre_Horror                               0
Genre_Sci-Fi                               0
Genre_Documentary                          0
Genre_War                                  0
Genre_Musical                              0
Genre_Mystery                              0
Genre_Film-Noir                            0
Genre_Western                              0
Name: 0, dtype: object
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
"::".join(pieces)
'a::b::guido'
Built-in Python string methods
Argument | Description |
---|---|
count | Return the number of non-overlapping occurrences of substring in the string. |
endswith | Returns True if string ends with suffix. |
startswith | Returns True if string starts with prefix. |
join | Use string as delimiter for concatenating a sequence of other strings. |
index | Return position of first character in substring if found in the string; raises ValueError if not found. |
find | Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found. |
rfind | Return position of first character of last occurrence of substring in the string; returns –1 if not found. |
replace | Replace occurrences of string with another string. |
strip, rstrip, lstrip | Trim whitespace, including newlines, for each element; equivalent to x.strip() (and rstrip, lstrip, respectively). |
split | Break string into list of substrings using passed delimiter. |
lower | Convert alphabet characters to lowercase. |
upper | Convert alphabet characters to uppercase. |
casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. |
ljust, rjust | Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width. |
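A few of the table's methods applied to the val string from above:
val.count(',')  # 2
val.find(':')  # -1 (not found; index would raise ValueError instead)
val.replace(',', '::')  # 'a::b:: guido'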
The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes.
import re
text = "foo bar\tbaz \tqux"
text
'foo bar\tbaz \tqux'
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match; match is stricter still, matching only at the beginning of the string
search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start(): m.end()]
'dave@google.com'
print(regex.match(text))
None
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
To segment each address into its three components (username, domain name, and domain suffix), put parentheses around the parts of the pattern to segment
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Regular expression methods
Argument | Description |
---|---|
findall | Return all non-overlapping matching patterns in a string as a list |
finditer | Like findall, but returns an iterator |
match | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None |
search | Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning |
split | Break string into pieces at each occurrence of pattern |
sub, subn | Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \1, \2, ... to refer to match group elements in the replacement string |
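finditer is the only method in the table not demonstrated above; a minimal sketch using the compiled three-group regex and text from earlier:
for m in regex.finditer(text):
    print(m.group())  # each full match in turn, e.g. 'dave@google.com'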
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but this will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example,
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
Regular expressions can be used, too, along with any re options like IGNORECASE
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute (see the sketch after the output below)
matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
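Neither retrieval style is demonstrated above; a minimal sketch applied to data (both operate on the string elements, and NA values pass through):
data.str.get(0)  # first character of each address
data.str[:5]  # first five characters of each address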