There are lots of ways to change the shape and data in your DataFrame. Let's explore the popular options.
# Setup
from datetime import datetime
import os
import numpy as np
import pandas as pd
from utils import render
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Pop out a quick sanity check
(users.shape, transactions.shape)
# First let's make sure there is only one Adrian Yang
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]
Our goal is to update the balance, and the common thought process usually leads us to just chain off the returned DataFrame, like so...
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]['balance']
... and since that appears to work, maybe we'll go ahead and set it to the new value.
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]['balance'] = 35.00
That chained indexing triggers a SettingWithCopyWarning: the assignment targets a temporary copy, and the original DataFrame is left unchanged. The reliable approach is to do the selection and assignment in a single DataFrame.loc call, passing the boolean mask as the row indexer and the column name as the column indexer.
users.loc[(users.first_name == "Adrian") & (users.last_name == "Yang"), 'balance'] = 35.00
# Display our updated user with the new value assigned
users.loc['adrian']
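The difference between the two assignment styles is easier to see on a toy DataFrame. This is a minimal sketch with made-up names and balances, not our actual users data:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['ada', 'bob'], 'balance': [10.0, 20.0]})

# Chained indexing: the boolean mask returns a copy, so the assignment
# lands on a temporary object and df itself is unchanged (pandas warns here)
df[df['first_name'] == 'ada']['balance'] = 99.0

# A single .loc call with the mask and the column label updates df in place
df.loc[df['first_name'] == 'ada', 'balance'] = 99.0
```

After the `.loc` assignment, ada's balance is 99.0 and bob's is untouched.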
at
You can also use the DataFrame.at accessor to quickly set scalar values.
users.at['adrian', 'balance'] = 35.00
So we changed the balance column for Adrian, and now we need to track that the transaction occurred. Let's take a quick peek at the transactions DataFrame.
transactions.head()
# Let's build a new record
record = dict(sender=np.nan, receiver='adrian', amount=4.99, sent_date=datetime.now().date())
DataFrame.append
There is a method on DataFrames that provides a way to append a new row to an existing dataset. This returns a copy of the DataFrame with the new row(s) appended.
The index for our transactions DataFrame is auto-assigned, so we'll set the ignore_index keyword argument to True, so it gets generated.
# Remember this is returning a copy...
transactions.append(record, ignore_index=True).tail()
If you are appending multiple rows, the more effective way to get the job done is the pandas.concat function. In newer versions of pandas (2.0+), DataFrame.append has been removed entirely, so pandas.concat is the tool to reach for regardless.
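Here is a minimal sketch of pandas.concat with a couple of made-up transaction records (the receivers and amounts are illustrative, not from our dataset):

```python
import pandas as pd

existing = pd.DataFrame([{'receiver': 'bob', 'amount': 7.25}])

new_rows = pd.DataFrame([
    {'receiver': 'adrian', 'amount': 4.99},
    {'receiver': 'adrian', 'amount': 12.50},
])

# concat returns a new DataFrame; ignore_index=True generates a fresh RangeIndex
combined = pd.concat([existing, new_rows], ignore_index=True)
```

Like append, concat returns a new DataFrame rather than modifying either input.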
If you assign to a non-existent index key, the DataFrame will be enlarged automatically and the row will simply be added.
There is a slight problem here, as the index in the transactions DataFrame is autogenerated. A popular workaround is to figure out the last used index and increment it.
# Largest current record, incremented
next_key = transactions.index.max() + 1
transactions.loc[next_key] = record
# Make sure it got added
transactions.tail()
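The enlargement behavior is easy to verify on a tiny throwaway DataFrame (the amounts here are made up):

```python
import pandas as pd

df = pd.DataFrame({'amount': [4.99, 7.50]})  # auto RangeIndex: 0, 1

# Assigning to a label that doesn't exist yet enlarges the DataFrame
next_key = df.index.max() + 1
df.loc[next_key] = [12.00]
```

The frame now has three rows, with the new one at label 2.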
You can add columns much like you do rows; missing values will be set to np.nan.
latest_id = transactions.index.max()
# Add a new column named notes
transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report billing error.'
transactions.tail()
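A quick sketch of the NaN fill on a toy DataFrame, with an invented notes value:

```python
import pandas as pd

df = pd.DataFrame({'amount': [1.0, 2.0]})

# Setting a cell in a column that doesn't exist yet creates the column;
# every row we didn't assign gets NaN
df.at[1, 'notes'] = 'follow up'
```

Row 0 never got a notes value, so it comes back as NaN.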
A column can also be added and assigned from an expression.
# Add a new column called large. This is a bad name and use of a column ;)
transactions['large'] = transactions.amount > 70
transactions.head()
Renaming columns can be achieved using the DataFrame.rename method. You specify the current name(s) as the key(s) and the new name(s) as the value(s).
By default this returns a copy, but you can use the inplace keyword argument to change the existing DataFrame.
transactions.rename(columns={'large': 'big_sender'}, inplace=True)
transactions.head()
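The copy-versus-inplace distinction is worth seeing in isolation. A minimal sketch on a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'large': [True, False]})

# Without inplace=True, rename returns a new DataFrame
# and leaves the original untouched
renamed = df.rename(columns={'large': 'big_sender'})
```

The original still has its old column name; only the returned copy is renamed.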
In addition to slicing a DataFrame to simply not include a specific existing column, you can also drop columns by name.
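The slicing approach looks like this on a toy DataFrame (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'amount': [1.0], 'notes': ['x'], 'large': [False]})

# Select only the columns you want to keep...
kept = df[['amount']]
# ...which returns a new DataFrame and leaves the original intact
```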
Let's remove the two that we added, in place.
transactions.drop(columns=['notes'], inplace=True)
transactions.head()
You might also see this done using the axis parameter.
Let's get rid of the oddly named big_sender column. Why'd you let me name it that way?
transactions.drop(['big_sender'], axis='columns', inplace=True)
transactions.head()
You can also use the DataFrame.drop method to remove row(s) by index.
last_key = transactions.index.max()
transactions.drop(index=[last_key], inplace=True)
transactions.tail()