There are lots of ways to change the shape and data in your DataFrame. Let's explore the popular options.
# Setup
from datetime import datetime
import os
import numpy as np
import pandas as pd
from utils import render
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Pop out a quick sanity check
(users.shape, transactions.shape)
# First let's make sure there is only one Adrian Yang
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]
Our goal is to update the balance, and the common thought process usually leads us to just chain off the returned DataFrame, like so...
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]['balance']
... and since that appears to work, maybe we'll go ahead and set it to the new value.
users[(users.first_name == "Adrian") & (users.last_name == "Yang")]['balance'] = 35.00
That chained indexing triggers a SettingWithCopyWarning: the assignment targets a temporary copy, and the original DataFrame is left unchanged. The reliable approach is to do the selection and assignment in a single DataFrame.loc call, passing the boolean mask as the row indexer and the column name as the column indexer.
users.loc[(users.first_name == "Adrian") & (users.last_name == "Yang"), 'balance'] = 35.00
# Display our updated user with the new value assigned
users.loc['adrian']
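The difference between the two assignment styles is easier to see on a toy DataFrame. This is a minimal sketch with made-up names and balances, not our actual users data:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['ada', 'bob'], 'balance': [10.0, 20.0]})

# Chained indexing: the boolean mask returns a copy, so the assignment
# lands on a temporary object and df itself is unchanged (pandas warns here)
df[df['first_name'] == 'ada']['balance'] = 99.0

# A single .loc call with the mask and the column label updates df in place
df.loc[df['first_name'] == 'ada', 'balance'] = 99.0
```

After the `.loc` assignment, ada's balance is 99.0 and bob's is untouched.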
at
You can also use the DataFrame.at accessor to quickly set scalar values.
users.at['adrian', 'balance'] = 35.00
So we changed the balance column for Adrian, and now we need to track that the transaction occurred. Let's take a quick peek at the transactions DataFrame.
transactions.head()
# Let's build a new record
record = dict(sender=np.nan, receiver='adrian', amount=4.99, sent_date=datetime.now().date())
DataFrame.append
There is a method on DataFrames that provides a way to append a new row to an existing dataset. This returns a copy of the DataFrame with the new row(s) appended.
The index for our transactions DataFrame is auto-assigned, so we'll set the ignore_index keyword argument to True, so it gets generated.
# Remember this is returning a copy...
transactions.append(record, ignore_index=True).tail()
If you are appending multiple rows, the more effective way to get the job done is the pandas.concat function. In newer versions of pandas (2.0+), DataFrame.append has been removed entirely, so pandas.concat is the tool to reach for regardless.
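Here is a minimal sketch of pandas.concat with a couple of made-up transaction records (the receivers and amounts are illustrative, not from our dataset):

```python
import pandas as pd

existing = pd.DataFrame([{'receiver': 'bob', 'amount': 7.25}])

new_rows = pd.DataFrame([
    {'receiver': 'adrian', 'amount': 4.99},
    {'receiver': 'adrian', 'amount': 12.50},
])

# concat returns a new DataFrame; ignore_index=True generates a fresh RangeIndex
combined = pd.concat([existing, new_rows], ignore_index=True)
```

Like append, concat returns a new DataFrame rather than modifying either input.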
If you assign to a non-existent index key, the DataFrame will be enlarged automatically and the row will simply be added.
There is a slight problem here, as the index in the transactions DataFrame is autogenerated. A popular workaround is to figure out the last used index and increment it.
# Largest current record, incremented
next_key = transactions.index.max() + 1
transactions.loc[next_key] = record
# Make sure it got added
transactions.tail()
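The enlargement behavior is easy to verify on a tiny throwaway DataFrame (the amounts here are made up):

```python
import pandas as pd

df = pd.DataFrame({'amount': [4.99, 7.50]})  # auto RangeIndex: 0, 1

# Assigning to a label that doesn't exist yet enlarges the DataFrame
next_key = df.index.max() + 1
df.loc[next_key] = [12.00]
```

The frame now has three rows, with the new one at label 2.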
You can add columns much like you do rows; missing values will be set to np.nan.
latest_id = transactions.index.max()
# Add a new column named notes
transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report billing error.'
transactions.tail()
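A quick sketch of the NaN fill on a toy DataFrame, with an invented notes value:

```python
import pandas as pd

df = pd.DataFrame({'amount': [1.0, 2.0]})

# Setting a cell in a column that doesn't exist yet creates the column;
# every row we didn't assign gets NaN
df.at[1, 'notes'] = 'follow up'
```

Row 0 never got a notes value, so it comes back as NaN.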
A column can also be added and assigned from an expression.
# Add a new column called large. This is a bad name and use of a column ;)
transactions['large'] = transactions.amount > 70
transactions.head()
Renaming columns can be achieved using the DataFrame.rename method. You specify the current name(s) as the key(s) and the new name(s) as the value(s).
By default this returns a copy, but you can use the inplace keyword argument to change the existing DataFrame.
transactions.rename(columns={'large': 'big_sender'}, inplace=True)
transactions.head()
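The copy-versus-inplace distinction is worth seeing in isolation. A minimal sketch on a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'large': [True, False]})

# Without inplace=True, rename returns a new DataFrame
# and leaves the original untouched
renamed = df.rename(columns={'large': 'big_sender'})
```

The original still has its old column name; only the returned copy is renamed.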
In addition to slicing a DataFrame to simply not include a specific existing column, you can also drop columns by name.
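The slicing approach looks like this on a toy DataFrame (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'amount': [1.0], 'notes': ['x'], 'large': [False]})

# Select only the columns you want to keep...
kept = df[['amount']]
# ...which returns a new DataFrame and leaves the original intact
```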
Let's remove the two that we added, in place.
transactions.drop(columns=['notes'], inplace=True)
transactions.head()
You might also see this done using the axis parameter.
Let's get rid of the oddly named big_sender column. Why'd you let me name it that way?
transactions.drop(['big_sender'], axis='columns', inplace=True)
transactions.head()
You can also use the DataFrame.drop method to remove row(s) by index.
last_key = transactions.index.max()
transactions.drop(index=[last_key], inplace=True)
transactions.tail()