Oftentimes, there will be something a bit off with the string data in your dataset. You may want to replace some characters, change the case, or strip the whitespace. You know, anything you normally need to do with strings.
Now this might lead you to want to loop through each row and manipulate the data, but before you do that, step back and lean into vectorization.
A Series
provides a way to use vectorized string methods in a property named str
and the vectorized methods are then available.
Let's take a look at some examples that require us to use these methods.
# Setup
import os
import pandas as pd
from utils import make_chaos
pd.options.display.max_rows = 10
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Pay no attention to the person behind the curtain
make_chaos(transactions, 42, ['sender'], lambda val: '$' + val)
make_chaos(transactions, 88, ['receiver'], lambda val: val.upper())
When CashBox first got started, usernames were allowed to start with a dollar sign. As time progressed, they changed their mind. They made a mass update to the system. However, someone on the Customer Support team reported that there are some records in the transactions
DataFrame
still showing some senders whose user name still had the $ prefix.
In order to get ahold of those rows where the sender starts with a $, we can use the Series.str.startswith
method. This will return a boolean Series
which we can use as an index.
transactions[transactions.sender.str.startswith('$')]
sender | receiver | amount | sent_date | |
---|---|---|---|---|
59 | $porter | gail7896 | 75.16 | 2018-05-14 |
70 | $emily.lewis | kevin | 5.49 | 2018-05-21 |
158 | $robinson | rodriguez | 8.91 | 2018-06-25 |
168 | $nancy | margaret265 | 84.15 | 2018-06-26 |
198 | $acook | adam.saunders | 9.31 | 2018-07-04 |
... | ... | ... | ... | ... |
877 | $april9082 | jacob.davis | 50.37 | 2018-09-21 |
889 | $victor | anthony1788 | 39.06 | 2018-09-21 |
900 | $andersen | corey.ingram | 4.81 | 2018-09-22 |
927 | $janet.williams | bsmith | 50.15 | 2018-09-23 |
934 | $robert8280 | roger | 98.35 | 2018-09-24 |
42 rows × 4 columns
We can now just go ahead and replace all $
with an empty string, essentially removing all $
from every sender by using the Series.str.replace
method.
# Replace all "$" in the sender field with an empty string
transactions.sender = transactions.sender.str.replace('$', '')
# Verify we got them all
len(transactions[transactions.sender.str.startswith('$')])
0
When you want to select or merge by specific values, the case, you know upper case or lower case, matters.
Our CashBox customer support representative also raised another issue for us to take a look at. All usernames should be lowercased, but they have reported that they noticed the receiver
column has some uppercased values.
We can get a handle on those by using the Series.str.isupper
method which will return a boolean Series
that we can use for an index.
transactions[transactions.receiver.str.isupper()]
sender | receiver | amount | sent_date | |
---|---|---|---|---|
2 | rose.eaton | EMILY.LEWIS | 62.67 | 2018-02-15 |
5 | francis.hernandez | LMOORE | 91.46 | 2018-03-14 |
14 | palmer | CHAD.CHEN | 36.27 | 2018-04-07 |
28 | elang | DONNA1922 | 26.07 | 2018-04-23 |
34 | payne | GRIFFIN4992 | 85.21 | 2018-04-26 |
... | ... | ... | ... | ... |
963 | stanley7729 | JOSEPH.LOPEZ | 50.84 | 2018-09-25 |
977 | martha6969 | PATRICIA | 87.33 | 2018-09-25 |
987 | alvarado | PAMELA | 48.74 | 2018-09-25 |
990 | robert | HEATHER.WADE | 86.44 | 2018-09-25 |
992 | pamela | CALEB | 25.01 | 2018-09-25 |
88 rows × 4 columns
So let's select the rows we want from transactions
and then update the receiver
column to the matching lowercased value. We can use the Series.str.lower
vectorized method.
# Update the receiver column of the specific rows that are uppercased.
transactions.loc[transactions.receiver.str.isupper(), 'receiver'] = transactions.receiver.str.lower()
# Verify that we got them
len(transactions[transactions.receiver.str.isupper()])
0
As you work on cleaning up datasets, you'll end up in this space quite a bit. Make sure to check out the documentation on String handling so you know what super powers you have on your side.