import pandas as pd
import numpy as np
import builtins
# Shadow the built-in print so every call ends with a blank line
def print(*args, **kwargs):
    builtins.print(*args, **kwargs, end='\n\n')
The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value: one that can be easily detected.
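NaN never compares equal to anything, not even itself, which is why detection goes through isnull/notnull rather than ==; a quick illustration (not specific to any dataset):
np.nan == np.nan  # False: NaN is unequal even to itself
pd.isnull(np.nan)  # True: isnull/notnull are the reliable detectors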
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
string_data.isnull().sum()
1
# The built-in Python None value is also treated as NA in object arrays
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
pd.options.display.max_colwidth = 500
df = pd.read_csv(r'examples/NA_handling_methods.csv', names=['Description'], sep='\n')
df.index = list(range(1, 5))
def f1(x):
    x = x.split(' ')
    return x.pop(0)
df['Argument'] = df['Description'].map(f1) # get the first word
def f2(x):
    x = x.split(' ')
    return " ".join(x[1:])
df['Description'] = df['Description'].map(f2) # remove the first word
df = df.reindex(columns=['Argument', 'Description'])
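An alternative sketch of the same munging with the vectorized str accessor instead of map (n=1 splits on the first space only):
# parts = df['Description'].str.split(' ', n=1, expand=True)
# df['Argument'], df['Description'] = parts[0], parts[1]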
print("NA handling methods")
df
NA handling methods
 | Argument | Description |
---|---|---|
1 | dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
2 | fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
3 | isnull | Return boolean values indicating which values are missing/NA. |
4 | notnull | Negation of isnull. |
There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, dropna can be helpful.
# On a Series, it returns the Series with only the non-null data and index values
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
# data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
data
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
cleaned = data.dropna()
cleaned
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
# Passing how='all' will only drop rows that are all NA
data.dropna(how='all')
# data.dropna(how='any')
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
# To drop columns in the same way, pass axis=1
data[4] = NA
data
 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1, how='all')
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | NaN | NaN |
1 | -1.007727 | NaN | NaN |
2 | 2.909089 | NaN | -2.142449 |
3 | -0.866425 | NaN | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
print(df.dropna())
print(df.dropna(thresh=2))
          0         1         2
4  0.300446 -1.266830  0.296984
5 -0.963050  1.649808 -1.686630
6 -0.087371 -0.342160 -0.608004

          0         1         2
2  2.909089       NaN -2.142449
3 -0.866425       NaN  1.881426
4  0.300446 -1.266830  0.296984
5 -0.963050  1.649808 -1.686630
6 -0.087371 -0.342160 -0.608004
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | NaN | NaN |
1 | -1.007727 | NaN | NaN |
2 | 2.909089 | NaN | -2.142449 |
3 | -0.866425 | NaN | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
print(df.dropna(subset=[1, 2], how='any').shape)
print(df.dropna(subset=[1, 2], how='all').shape)
(3, 3)
(5, 3)
df.fillna(0)
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.000000 | 0.000000 |
1 | -1.007727 | 0.000000 | 0.000000 |
2 | 2.909089 | 0.000000 | -2.142449 |
3 | -0.866425 | 0.000000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
# Calling fillna with a dict, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.500000 | 0.000000 |
1 | -1.007727 | 0.500000 | 0.000000 |
2 | 2.909089 | 0.500000 | -2.142449 |
3 | -0.866425 | 0.500000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
# fillna returns a new object, but you can modify the existing object in-place
_ = df.fillna(0, inplace=True)
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.196168 | 0.000000 | 0.000000 |
1 | -1.007727 | 0.000000 | 0.000000 |
2 | 2.909089 | 0.000000 | -2.142449 |
3 | -0.866425 | 0.000000 | 1.881426 |
4 | 0.300446 | -1.266830 | 0.296984 |
5 | -0.963050 | 1.649808 | -1.686630 |
6 | -0.087371 | -0.342160 | -0.608004 |
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | NaN | -0.741355 |
3 | 1.690786 | NaN | -0.396711 |
4 | -0.296663 | NaN | NaN |
5 | -1.422233 | NaN | NaN |
df.fillna(method='ffill')
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | -0.012437 | -0.741355 |
3 | 1.690786 | -0.012437 | -0.396711 |
4 | -0.296663 | -0.012437 | -0.396711 |
5 | -1.422233 | -0.012437 | -0.396711 |
df.fillna(method='ffill', limit=2)
 | 0 | 1 | 2 |
---|---|---|---|
0 | -1.283063 | 1.456059 | 0.639081 |
1 | -0.863694 | -0.012437 | -0.741003 |
2 | 0.533858 | -0.012437 | -0.741355 |
3 | 1.690786 | -0.012437 | -0.396711 |
4 | -0.296663 | NaN | -0.396711 |
5 | -1.422233 | NaN | -0.396711 |
# We might pass the mean or median of a Series for simple imputation
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
Arguments to fillna
Argument | Description |
---|---|
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation; by default 'ffill' if function called with no other arguments |
axis | Axis to fill on; default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
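Not every argument in the table appears above. As a minimal sketch, method can also back-fill, and axis=1 fills across columns rather than down rows (the tiny frame here is illustrative only):
frame = pd.DataFrame([[1., NA, 3.], [NA, 5., NA]])
frame.fillna(method='bfill')  # propagate the next valid value backward
frame.fillna(method='ffill', axis=1)  # fill along each row instead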
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not.
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
# drop_duplicates returns a DataFrame where the duplicated array is False
data.drop_duplicates()
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
# Suppose we had an additional column of values and wanted to
# filter duplicates only based on the 'k1' column
data['v1'] = range(7)
data
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
5 | two | 4 | 5 |
6 | two | 4 | 6 |
data.drop_duplicates(['k1'])
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
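duplicated and drop_duplicates keep the first observed value combination by default; passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1'], keep='last')  # rows 4 and 6 survive rather than 0 and 1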
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
 | food | ounces |
---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
# Suppose you wanted to add a column indicating the
# type of animal that each food came from
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
# or
# data['food'].map(lambda x: meat_to_animal[x.lower()])
data
 | food | ounces | animal |
---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but replace provides a simpler and more flexible way to do so
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
print(data.replace(-999, np.nan))
print(data.replace([-999, -1000], np.nan))
print(data.replace([-999, -1000], [np.nan, 0]))
print(data.replace({-999: np.nan, -1000: 0}, inplace=True))
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

None
(The last call prints None because inplace=True modifies data in place rather than returning a new object.)
Axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
# Like a Series, the axis indexes have a map method
print(data.index.map(transform))
# -----------
trans = lambda x: x[:2].upper()
print(data.index.map(trans))
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
Index(['OH', 'CO', 'NE'], dtype='object')
data.index = data.index.map(transform)
data
 | one | two | three | four |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename
data.rename(index=str.title, columns=str.upper)
 | ONE | TWO | THREE | FOUR |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colo | 4 | 5 | 6 | 7 |
New | 8 | 9 | 10 | 11 |
# rename can be used in conjunction with a dict-like object
# providing new values for a subset of the axis labels
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'}, inplace=True)
data
 | one | two | peekaboo | four |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
# TODO: Discretization and Binning
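Until that section is written, a minimal sketch of the idea: pd.cut bins continuous values into discrete intervals, and pd.qcut bins by sample quantiles (the ages and bin edges below are illustrative only):
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)  # Categorical of intervals such as (18, 25]
pd.value_counts(cats)  # count of observations per bin
pd.qcut(ages, 4)  # four quantile-based bins with equal counts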
Filtering or transforming outliers is largely a matter of applying array operations.
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.015383 | 0.024896 | 0.038430 | -0.064249 |
std | 1.008377 | 0.996469 | 0.989867 | 0.995204 |
min | -3.141061 | -3.430585 | -2.886733 | -3.260575 |
25% | -0.708238 | -0.597616 | -0.656763 | -0.766798 |
50% | 0.051303 | 0.051438 | 0.048483 | -0.071339 |
75% | 0.730775 | 0.691972 | 0.666596 | 0.639310 |
max | 3.766905 | 3.349857 | 3.361674 | 2.854713 |
col = data[0]
col[np.abs(col) > 3]
1     -3.141061
56     3.766905
247    3.008167
787   -3.135201
Name: 0, dtype: float64
To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame
data[(np.abs(data) > 3).any(axis=1)]
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
1 | -3.141061 | 0.783127 | 0.469851 | -0.395636 |
56 | 3.766905 | 0.030909 | 1.633494 | 1.140207 |
247 | 3.008167 | -0.992888 | 0.260335 | -3.063802 |
327 | -0.074563 | 3.349857 | 2.032817 | -0.609838 |
405 | -0.362679 | -0.103159 | 3.087357 | 2.517252 |
448 | -0.437464 | 3.054526 | -0.254406 | 1.047457 |
611 | -0.369185 | -3.430585 | -0.515431 | -1.668876 |
619 | -0.311282 | 0.535545 | 3.039966 | -1.199799 |
631 | -0.923738 | 0.346249 | -0.485636 | -3.260575 |
765 | -0.856659 | -0.717638 | -1.155797 | -3.069395 |
787 | -3.135201 | -1.338375 | -0.404171 | 1.373373 |
917 | 0.437193 | -3.026429 | -0.438433 | 0.024585 |
984 | -0.629788 | 0.308067 | 3.361674 | -1.683448 |
Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.014885 | 0.024949 | 0.037941 | -0.063855 |
std | 1.004937 | 0.993711 | 0.988330 | 0.994002 |
min | -3.000000 | -3.000000 | -2.886733 | -3.000000 |
25% | -0.708238 | -0.597616 | -0.656763 | -0.766798 |
50% | 0.051303 | 0.051438 | 0.048483 | -0.071339 |
75% | 0.730775 | 0.691972 | 0.666596 | 0.639310 |
max | 3.000000 | 3.000000 | 3.000000 | 2.854713 |
# The statement np.sign(data) produces 1 and –1 values based on
# whether the values in data are positive or negative
np.sign(data).head()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -1.0 | -1.0 | -1.0 | -1.0 |
1 | -1.0 | 1.0 | 1.0 | -1.0 |
2 | 1.0 | -1.0 | 1.0 | -1.0 |
3 | 1.0 | -1.0 | 1.0 | -1.0 |
4 | 1.0 | -1.0 | -1.0 | 1.0 |
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering
df = pd.DataFrame(np.arange(20).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([3, 4, 1, 2, 0])
That array can then be used in iloc-based indexing or the equivalent take function
df
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
df.take(sampler)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
0 | 0 | 1 | 2 | 3 |
To select a random subset without replacement, you can use the sample method on Series and DataFrame
df.sample(n=3)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
4 | 16 | 17 | 18 | 19 |
2 | 8 | 9 | 10 | 11 |
1 | 4 | 5 | 6 | 7 |
# sample with replacement
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
3    6
1    7
3    6
0    5
3    6
0    5
1    7
2   -1
2   -1
4    4
dtype: int64
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# create a dummy column for Sex column
# using map method
train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
# using get_dummies
pd.get_dummies(train.Sex)
 | female | male |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 0 | 1 |
... | ... | ... |
886 | 0 | 1 |
887 | 1 | 0 |
888 | 1 | 0 |
889 | 0 | 1 |
890 | 0 | 1 |
891 rows × 2 columns
Generally, if a feature takes k categorical values, you can represent it with k-1 dummy variables.
Sex takes two categorical values, so we need only one variable to encode all the given information, and we drop the first column.
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]
 | Sex_male |
---|---|
0 | 1 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 1 |
... | ... |
886 | 1 |
887 | 0 |
888 | 0 |
889 | 1 |
890 | 1 |
891 rows × 1 columns
train.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 | 0 | 1 |
# Passing columns to get_dummies makes this easier
train = pd.read_csv('http://bit.ly/kaggletrain')
pd.get_dummies(train, columns=['Sex', 'Embarked'])
 | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 1 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 0 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | 0 | 1 | 0 | 0 | 1 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | 1 | 0 | 0 | 0 | 1 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | 1 | 0 | 0 | 0 | 1 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | 0 | 1 | 1 | 0 | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | 0 | 1 | 0 | 1 | 0 |
891 rows × 15 columns
# drop the first column after get_dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)
 | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 1 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | 1 | 0 | 1 |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | 0 | 0 | 1 |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | 0 | 0 | 1 |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | 1 | 0 | 0 |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | 1 | 1 | 0 |
891 rows × 13 columns
# example of MovieLens 1M dataset
mnames = ['movieid', 'title', 'genres']
movies = pd.read_table(r'MovieLens-1M/movies.dat', sep='::', header=None,
names=mnames, engine='python')
movies.head()
 | movieid | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 'Western'], dtype=object)
# One way to construct the indicator DataFrame is to start with a
# DataFrame of all zeros
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use dummies.columns to compute the column indices for each genre
gen = movies.genres[0]
print(gen.split('|'))
dummies.columns.get_indexer(gen.split('|'))
['Animation', "Children's", 'Comedy']
array([0, 1, 2], dtype=int64)
Then, we can use .iloc to set values based on these indices
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
# join with movies
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movieid                                    1
title                       Toy Story (1995)
genres           Animation|Children's|Comedy
Genre_Animation                            1
Genre_Children's                           1
Genre_Comedy                               1
Genre_Adventure                            0
Genre_Fantasy                              0
Genre_Romance                              0
Genre_Drama                                0
Genre_Action                               0
Genre_Crime                                0
Genre_Thriller                             0
Genre_Horror                               0
Genre_Sci-Fi                               0
Genre_Documentary                          0
Genre_War                                  0
Genre_Musical                              0
Genre_Mystery                              0
Genre_Film-Noir                            0
Genre_Western                              0
Name: 0, dtype: object
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
"::".join(pieces)
'a::b::guido'
Built-in Python string methods
Argument | Description |
---|---|
count | Return the number of non-overlapping occurrences of substring in the string. |
endswith | Returns True if string ends with suffix. |
startswith | Returns True if string starts with prefix. |
join | Use string as delimiter for concatenating a sequence of other strings. |
index | Return position of first character in substring if found in the string; raises ValueError if not found. |
find | Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found. |
rfind | Return position of first character of last occurrence of substring in the string; returns –1 if not found. |
replace | Replace occurrences of string with another string. |
strip, rstrip, lstrip | Trim whitespace, including newlines, for each element; equivalent to x.strip() (and rstrip, lstrip, respectively). |
split | Break string into list of substrings using passed delimiter. |
lower | Convert alphabet characters to lowercase. |
upper | Convert alphabet characters to uppercase. |
casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. |
ljust, rjust | Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width. |
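A few of the table's methods applied to the val string from above:
val.count(',')  # 2
val.find(':')  # -1 (not found; index would raise ValueError instead)
val.replace(',', '::')  # 'a::b:: guido'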
The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes.
import re
text = "foo bar\tbaz \tqux"
text
'foo bar\tbaz \tqux'
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match; match is stricter still, matching only at the beginning of the string
search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start(): m.end()]
'dave@google.com'
print(regex.match(text))
None
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
To segment each address into its three components (username, domain name, and domain suffix), put parentheses around the parts of the pattern to segment
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Regular expression methods
Argument | Description |
---|---|
findall | Return all non-overlapping matching patterns in a string as a list |
finditer | Like findall, but returns an iterator |
match | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None |
search | Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning |
split | Break string into pieces at each occurrence of pattern |
sub, subn | Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \1, \2, ... to refer to match group elements in the replacement string |
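finditer is the only method in the table not demonstrated above; a minimal sketch using the compiled three-group regex and text from earlier:
for m in regex.finditer(text):
    print(m.group())  # each full match in turn, e.g. 'dave@google.com'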
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but this will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example,
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
Regular expressions can be used, too, along with any re options like IGNORECASE
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute (see the sketch after the output below)
matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
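Neither retrieval style is demonstrated above; a minimal sketch applied to data (both operate on the string elements, and NA values pass through):
data.str.get(0)  # first character of each address
data.str[:5]  # first five characters of each address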