Manipulation Techniques

There are lots of ways to change the shape and data in your DataFrame.

Let's explore the popular options.

In [1]:
# Setup
from datetime import datetime
import os

import numpy as np
import pandas as pd

users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Pop out a quick sanity check
(users.shape, transactions.shape)
Out[1]:
((475, 7), (998, 4))

Assigning values

Let's assume we work on the CashBox Customer Support team. A user named Adrian Fang called and told us about an error in his balance: it should be $35.00, but is currently $30.01.

With Chaining (don't do this)

Let's walk ourselves right into a common problem...

In [2]:
# First let's make sure there is only one Adrian Fang
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]
Out[2]:
first_name last_name email email_verified signup_date referral_count balance
adrian Adrian Fang [email protected] True 2018-04-28 3 30.01

Yo, Adrian! I did it!

Our goal is to update the balance. The common thought process here usually leads us to just chain off the returned DataFrame like so...

In [3]:
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance']
Out[3]:
adrian    30.01
Name: balance, dtype: float64

... and since that appears to work, maybe we'll go ahead and set it to the new value.

In [4]:
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance'] = 35.00
/Users/craig/miniconda3/envs/panduhs/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Uh oh! As the SettingWithCopyWarning above explains, you should not attempt to chain and assign this way.
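To see what the warning is protecting us against, here's a minimal sketch using a toy DataFrame (not the users data): chained indexing assigns to a temporary copy, while .loc selects and assigns in a single operation.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'balance': [10.0, 20.0]})

# Chained indexing: boolean masking returns a *copy*, so the
# assignment lands on that temporary copy and is silently lost.
df[df.name == 'a']['balance'] = 99.0
assert df.loc[0, 'balance'] == 10.0  # unchanged!

# .loc selects the row(s) and column and assigns in one operation
# on the original DataFrame.
df.loc[df.name == 'a', 'balance'] = 99.0
assert df.loc[0, 'balance'] == 99.0
```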

Using loc

The solution is to use the DataFrame.loc indexer to locate the row and the specific column to update.

In [5]:
users.loc[(users.first_name == "Adrian") & (users.last_name == "Fang"), 'balance'] = 35.00
# Display our updated user with the new value assigned
users.loc['adrian']
Out[5]:
first_name                               Adrian
last_name                                  Fang
email             [email protected]
email_verified                             True
signup_date                          2018-04-28
referral_count                                3
balance                                      35
Name: adrian, dtype: object

Using at

You can also use the DataFrame.at indexer to quickly get and set scalar values.

In [6]:
users.at['adrian', 'balance'] = 35.00
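As a minimal sketch (a toy frame, not the real users data), .at reads and writes a single scalar addressed by row and column label:

```python
import pandas as pd

df = pd.DataFrame({'balance': [30.01]}, index=['adrian'])

# .at handles exactly one scalar at a time, addressed by labels,
# which makes it faster than .loc for single-cell access.
before = df.at['adrian', 'balance']
df.at['adrian', 'balance'] = 35.00
```

For positional (integer-based) access, the analogous indexer is DataFrame.iat.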

Adding Rows

So we changed the balance variable for Adrian, and now we need to track that the transaction occurred.

Let's take a quick peek at the transactions DataFrame.

In [7]:
transactions.head()
Out[7]:
sender receiver amount sent_date
0 stein smoyer 49.03 2018-01-24
1 holden4580 joshua.henry 34.64 2018-02-06
2 rose.eaton emily.lewis 62.67 2018-02-15
3 lmoore kallen 1.94 2018-03-05
4 scott3928 lmoore 27.82 2018-03-10
In [8]:
# Let's build a new record
record = dict(sender=np.nan, receiver='adrian', amount=4.99, sent_date=datetime.now().date())

Appending with DataFrame.append

The [DataFrame.append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) method adds new row(s) to a DataFrame. It doesn't change the original dataset; it returns a copy of the DataFrame with the new row(s) appended. (Note that append was deprecated in pandas 1.4 and removed in 2.0; pandas.concat is the modern replacement.)

The index for our transactions is auto assigned, so we'll set the ignore_index keyword argument to True, so it gets generated.

In [9]:
# Remember this is returning a copy...
transactions.append(record, ignore_index=True).tail()
Out[9]:
sender receiver amount sent_date
994 king3246 john 25.37 2018-09-25
995 shernandez kristen1581 75.77 2018-09-25
996 leah6255 jholloway 63.62 2018-09-25
997 pamela michelle4225 2.54 2018-09-25
998 NaN adrian 4.99 2018-11-06

If you are appending multiple rows, the more efficient way to get the job done is the pandas.concat method.
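For example, here's a minimal sketch with toy data (the column names mirror the transactions frame above): build all the new rows as one DataFrame, then concatenate once.

```python
import pandas as pd

transactions = pd.DataFrame({
    'sender': ['stein'],
    'receiver': ['smoyer'],
    'amount': [49.03],
    'sent_date': ['2018-01-24'],
})

# Build all the new rows as a single DataFrame...
new_rows = pd.DataFrame([
    dict(sender='lmoore', receiver='adrian', amount=4.99, sent_date='2018-11-06'),
    dict(sender='adrian', receiver='kallen', amount=1.00, sent_date='2018-11-06'),
])

# ...then concatenate once. ignore_index=True regenerates the
# auto-incrementing index across both frames.
transactions = pd.concat([transactions, new_rows], ignore_index=True)
```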

Setting With Enlargement

If you assign to a non-existent index key, the DataFrame will be enlarged automatically and the row will be added.

There is a slight problem here, as the index in the transactions DataFrame is autogenerated. A popular workaround is to find the last used index and increment it.

In [10]:
# Largest current index, incremented
next_key = transactions.index.max() + 1
transactions.loc[next_key] = record 
# Make sure it got added
transactions.tail()
Out[10]:
sender receiver amount sent_date
994 king3246 john 25.37 2018-09-25
995 shernandez kristen1581 75.77 2018-09-25
996 leah6255 jholloway 63.62 2018-09-25
997 pamela michelle4225 2.54 2018-09-25
998 NaN adrian 4.99 2018-11-06

Adding Columns

You can add columns much like you do rows; missing values will be set to np.nan.

Setting With Enlargement

In [11]:
latest_id = transactions.index.max()
# Add a new column named notes
transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report billing error.'
transactions.tail()
Out[11]:
sender receiver amount sent_date notes
994 king3246 john 25.37 2018-09-25 NaN
995 shernandez kristen1581 75.77 2018-09-25 NaN
996 leah6255 jholloway 63.62 2018-09-25 NaN
997 pamela michelle4225 2.54 2018-09-25 NaN
998 NaN adrian 4.99 2018-11-06 Adrian called customer support to report billi...

The column can be added and assigned from an expression.

In [12]:
# Add a new column called large. This is a bad name and use of a column ;)
transactions['large'] = transactions.amount > 70
In [13]:
transactions.head()
Out[13]:
sender receiver amount sent_date notes large
0 stein smoyer 49.03 2018-01-24 NaN False
1 holden4580 joshua.henry 34.64 2018-02-06 NaN False
2 rose.eaton emily.lewis 62.67 2018-02-15 NaN False
3 lmoore kallen 1.94 2018-03-05 NaN False
4 scott3928 lmoore 27.82 2018-03-10 NaN False

Renaming Columns

Renaming columns can be achieved using the DataFrame.rename method. You specify the current name(s) as the key(s) and the new name(s) as the value(s).

By default this returns a copy, but you can use the inplace parameter to change the existing DataFrame.

In [14]:
transactions.rename(columns={'large': 'big_sender'}, inplace=True)
transactions.head()
Out[14]:
sender receiver amount sent_date notes big_sender
0 stein smoyer 49.03 2018-01-24 NaN False
1 holden4580 joshua.henry 34.64 2018-02-06 NaN False
2 rose.eaton emily.lewis 62.67 2018-02-15 NaN False
3 lmoore kallen 1.94 2018-03-05 NaN False
4 scott3928 lmoore 27.82 2018-03-10 NaN False

Deleting Columns

In addition to slicing a DataFrame so that it simply doesn't include a specific existing column, you can also drop columns by name. Let's remove the two that we added, in place.
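The slicing approach mentioned above might look like this (a toy frame for illustration): select every column except the one you want gone.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# Keep every column except 'b' by selecting the others explicitly.
# Note this returns a new DataFrame; df itself is untouched.
kept = df[[col for col in df.columns if col != 'b']]
```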

In [15]:
transactions.drop(columns=['notes'], inplace=True)
transactions.head()
Out[15]:
sender receiver amount sent_date big_sender
0 stein smoyer 49.03 2018-01-24 False
1 holden4580 joshua.henry 34.64 2018-02-06 False
2 rose.eaton emily.lewis 62.67 2018-02-15 False
3 lmoore kallen 1.94 2018-03-05 False
4 scott3928 lmoore 27.82 2018-03-10 False

You might have seen this done before using the axis parameter.

Let's get rid of the oddly named big_sender column. Why'd you let me name it that way?

In [16]:
transactions.drop(['big_sender'], axis='columns', inplace=True)
transactions.head()
Out[16]:
sender receiver amount sent_date
0 stein smoyer 49.03 2018-01-24
1 holden4580 joshua.henry 34.64 2018-02-06
2 rose.eaton emily.lewis 62.67 2018-02-15
3 lmoore kallen 1.94 2018-03-05
4 scott3928 lmoore 27.82 2018-03-10

Deleting Rows

You can also use the DataFrame.drop method to remove row(s) by index.

In [17]:
last_key = transactions.index.max()
transactions.drop(index=[last_key], inplace=True)
transactions.tail()
Out[17]:
sender receiver amount sent_date
993 coleman sarah.evans 36.29 2018-09-25
994 king3246 john 25.37 2018-09-25
995 shernandez kristen1581 75.77 2018-09-25
996 leah6255 jholloway 63.62 2018-09-25
997 pamela michelle4225 2.54 2018-09-25