String Munging In Dataframe¶

Author: Chris Albon, @ChrisAlbon
Date: -
Repo: Python 3 code snippets for data science
Note:

import modules¶

In [20]:

import pandas as pd
import numpy as np
import re as re

Create dataframe¶

In [21]:

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.NAN, 'battler@milner.com', 'Ames1234@yahoo.com'], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df

Out[21]:

	first_name	last_name	email	preTestScore	postTestScore
0	Jason	Miller	jas203@gmail.com	4	25
1	Molly	Jacobson	momomolly@gmail.com	24	94
2	Tina	Ali	NaN	31	57
3	Jake	Milner	battler@milner.com	2	62
4	Amy	Cooze	Ames1234@yahoo.com	3	70

5 rows × 5 columns

Which strings in the email column contains 'gmail'¶

In [22]:

df['email'].str.contains('gmail')

Out[22]:

0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object

Create a regular expression pattern that breaks apart emails¶

In [23]:

pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

Find everything in df.email that contains that pattern¶

In [24]:

df['email'].str.findall(pattern, flags=re.IGNORECASE)

Out[24]:

0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object

Create a pandas series containing the email elements¶

In [29]:

matches = df['email'].str.match(pattern, flags=re.IGNORECASE)
matches

Out[29]:

0       (jas203, gmail, com)
1    (momomolly, gmail, com)
2                        NaN
3     (battler, milner, com)
4     (Ames1234, yahoo, com)
Name: email, dtype: object

Select the domains of the df.email¶

In [42]:

matches.str[1]

Out[42]:

0     gmail
1     gmail
2       NaN
3    milner
4     yahoo
Name: email, dtype: object