There are lots of ways to change the shape and data in your DataFrame. Let's explore the popular options.
# Setup
from datetime import datetime
import os
import numpy as np
import pandas as pd
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# Pop out a quick sanity check
(users.shape, transactions.shape)
((475, 7), (998, 4))
# First let's make sure there is only one Adrian Fang
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]
| | first_name | last_name | email | email_verified | signup_date | referral_count | balance |
---|---|---|---|---|---|---|---|
adrian | Adrian | Fang | adrian.fang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
Our goal is to update the balance. The common thought process usually leads us to just chain off the returned DataFrame, like so...
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance']
adrian 30.01 Name: balance, dtype: float64
... and since that appears to work, maybe we'll go ahead and set it to the new value.
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance'] = 35.00
/Users/craig/miniconda3/envs/panduhs/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
As the warning suggests, that assignment may have landed on a temporary copy rather than the original DataFrame. The safe approach is a single DataFrame.loc call that selects the rows and the column together:

users.loc[(users.first_name == "Adrian") & (users.last_name == "Fang"), 'balance'] = 35.00
# Display our updated user with the new value assigned
users.loc['adrian']
first_name                               Adrian
last_name                                  Fang
email             adrian.fang@teamtreehouse.com
email_verified                             True
signup_date                          2018-04-28
referral_count                                3
balance                                      35
Name: adrian, dtype: object
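To see why the single .loc call works where the chained version warned, here is a minimal sketch with a toy DataFrame (the names and values are invented, not from the users dataset):

```python
import pandas as pd

# Toy DataFrame standing in for users (invented data)
df = pd.DataFrame(
    {'name': ['ann', 'bob'], 'balance': [10.00, 20.00]},
    index=['ann', 'bob'],
)

# Row selector and column selector in a single .loc call,
# so the assignment modifies df itself, not a temporary copy
df.loc[df.name == 'bob', 'balance'] = 25.00
```

Because the row mask and the column label are passed to one indexing operation, pandas writes straight into the original DataFrame.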
## at
You can also use the DataFrame.at accessor to quickly set scalar values:
users.at['adrian', 'balance'] = 35.00
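As a quick self-contained illustration (toy data, not the users file), .at reads and writes exactly one cell by its row and column labels:

```python
import pandas as pd

# One-row toy frame mirroring the shape of the example above
df = pd.DataFrame({'balance': [30.01]}, index=['adrian'])

# .at targets a single scalar by row label and column label
before = df.at['adrian', 'balance']
df.at['adrian', 'balance'] = 35.00
```

.at is built for fast scalar access; for selections that span multiple rows or columns, .loc remains the tool.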
So we changed the balance value for Adrian, and now we need to track that the transaction occurred. Let's take a quick peek at the transactions DataFrame.
transactions.head()
| | sender | receiver | amount | sent_date |
---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 |
2 | rose.eaton | emily.lewis | 62.67 | 2018-02-15 |
3 | lmoore | kallen | 1.94 | 2018-03-05 |
4 | scott3928 | lmoore | 27.82 | 2018-03-10 |
# Let's build a new record
record = dict(sender=np.nan, receiver='adrian', amount=4.99, sent_date=datetime.now().date())
## DataFrame.append
The [DataFrame.append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) method adds new row(s) to a DataFrame. It doesn't change the original DataFrame; instead it returns a copy with the new row(s) appended.
The index for our transactions DataFrame is auto-assigned, so we'll set the ignore_index keyword argument to True so that a fresh index value is generated for the new row.
# Remember this is returning a copy...
transactions.append(record, ignore_index=True).tail()
| | sender | receiver | amount | sent_date |
---|---|---|---|---|
994 | king3246 | john | 25.37 | 2018-09-25 |
995 | shernandez | kristen1581 | 75.77 | 2018-09-25 |
996 | leah6255 | jholloway | 63.62 | 2018-09-25 |
997 | pamela | michelle4225 | 2.54 | 2018-09-25 |
998 | NaN | adrian | 4.99 | 2018-11-06 |
If you are appending multiple rows, the more efficient way to get the job done is the pandas.concat function.
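A sketch of that multi-row case, using invented rows shaped like the transactions data: gather the new records into their own DataFrame and concatenate once, rather than appending in a loop.

```python
import pandas as pd

# Stand-in for the existing transactions (invented data)
transactions = pd.DataFrame({
    'sender': ['stein'],
    'receiver': ['smoyer'],
    'amount': [49.03],
})

# Several new rows gathered into their own DataFrame
new_rows = pd.DataFrame({
    'sender': ['lmoore', 'pamela'],
    'receiver': ['kallen', 'adrian'],
    'amount': [1.94, 4.99],
})

# One concat call; ignore_index=True regenerates the RangeIndex
combined = pd.concat([transactions, new_rows], ignore_index=True)
```

Like append, concat returns a new DataFrame and leaves the inputs untouched.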
If you assign to a non-existent index key, the DataFrame is enlarged automatically and the row is simply added; pandas calls this setting with enlargement. There is a slight problem here, as the index in the transactions DataFrame is autogenerated, so we need to know which key to use. A popular workaround is to find the last used index and increment it.
# Largest current record, incremented
next_key = transactions.index.max() + 1
transactions.loc[next_key] = record
# Make sure it got added
transactions.tail()
| | sender | receiver | amount | sent_date |
---|---|---|---|---|
994 | king3246 | john | 25.37 | 2018-09-25 |
995 | shernandez | kristen1581 | 75.77 | 2018-09-25 |
996 | leah6255 | jholloway | 63.62 | 2018-09-25 |
997 | pamela | michelle4225 | 2.54 | 2018-09-25 |
998 | NaN | adrian | 4.99 | 2018-11-06 |
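One edge case worth hedging on: index.max() returns NaN on an empty index, so the increment trick fails for the very first row. A small sketch (column names invented) that guards against that:

```python
import pandas as pd

# An empty frame: index.max() here would be NaN
transactions = pd.DataFrame(columns=['sender', 'receiver', 'amount'])

# Fall back to 0 when there are no rows yet
next_key = 0 if len(transactions) == 0 else transactions.index.max() + 1

# Setting with enlargement: assigning to a new key adds the row
transactions.loc[next_key] = ['pamela', 'adrian', 4.99]
```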
You can add columns much like you do rows; missing values will be set to np.nan.
latest_id = transactions.index.max()
# Add a new column named notes
transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report billing error.'
transactions.tail()
| | sender | receiver | amount | sent_date | notes |
---|---|---|---|---|---|
994 | king3246 | john | 25.37 | 2018-09-25 | NaN |
995 | shernandez | kristen1581 | 75.77 | 2018-09-25 | NaN |
996 | leah6255 | jholloway | 63.62 | 2018-09-25 | NaN |
997 | pamela | michelle4225 | 2.54 | 2018-09-25 | NaN |
998 | NaN | adrian | 4.99 | 2018-11-06 | Adrian called customer support to report billi... |
A column can also be added and assigned from an expression.
# Add a new column called large. This is a bad name and use of a column ;)
transactions['large'] = transactions.amount > 70
transactions.head()
| | sender | receiver | amount | sent_date | notes | large |
---|---|---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 | NaN | False |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 | NaN | False |
2 | rose.eaton | emily.lewis | 62.67 | 2018-02-15 | NaN | False |
3 | lmoore | kallen | 1.94 | 2018-03-05 | NaN | False |
4 | scott3928 | lmoore | 27.82 | 2018-03-10 | NaN | False |
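Such expressions aren't limited to booleans. As a sketch with invented amounts, np.where can derive a labeled column from the same comparison:

```python
import numpy as np
import pandas as pd

# Toy amounts (invented data)
transactions = pd.DataFrame({'amount': [49.03, 75.77, 2.54]})

# Boolean column from a vectorized comparison
transactions['large'] = transactions.amount > 70

# np.where picks one of two labels per row from the same condition
transactions['size_label'] = np.where(transactions.amount > 70, 'large', 'small')
```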
Renaming columns can be achieved using the DataFrame.rename method. You specify the current name(s) as the key(s) and the new name(s) as the value(s). By default this returns a copy, but you can use the inplace parameter to change the existing DataFrame.
transactions.rename(columns={'large': 'big_sender'}, inplace=True)
transactions.head()
| | sender | receiver | amount | sent_date | notes | big_sender |
---|---|---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 | NaN | False |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 | NaN | False |
2 | rose.eaton | emily.lewis | 62.67 | 2018-02-15 | NaN | False |
3 | lmoore | kallen | 1.94 | 2018-03-05 | NaN | False |
4 | scott3928 | lmoore | 27.82 | 2018-03-10 | NaN | False |
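Because rename returns a copy unless inplace=True is passed, it is easy to lose the result by forgetting to capture the return value. A tiny sketch (toy column) contrasting the two behaviors:

```python
import pandas as pd

# Toy frame with the badly named column
df = pd.DataFrame({'large': [True, False]})

# Without inplace=True, rename leaves df untouched and returns a new frame
renamed = df.rename(columns={'large': 'big_sender'})
```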
In addition to slicing a DataFrame to simply not include a specific existing column, you can also drop columns by name. Let's remove the two columns we added, in place.
transactions.drop(columns=['notes'], inplace=True)
transactions.head()
| | sender | receiver | amount | sent_date | big_sender |
---|---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 | False |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 | False |
2 | rose.eaton | emily.lewis | 62.67 | 2018-02-15 | False |
3 | lmoore | kallen | 1.94 | 2018-03-05 | False |
4 | scott3928 | lmoore | 27.82 | 2018-03-10 | False |
You might have seen this done before using the axis parameter.
Let's get rid of the oddly named big_sender column. Why'd you let me name it that way?
transactions.drop(['big_sender'], axis='columns', inplace=True)
transactions.head()
| | sender | receiver | amount | sent_date |
---|---|---|---|---|
0 | stein | smoyer | 49.03 | 2018-01-24 |
1 | holden4580 | joshua.henry | 34.64 | 2018-02-06 |
2 | rose.eaton | emily.lewis | 62.67 | 2018-02-15 |
3 | lmoore | kallen | 1.94 | 2018-03-05 |
4 | scott3928 | lmoore | 27.82 | 2018-03-10 |
You can also use the DataFrame.drop method to remove row(s) by index.
last_key = transactions.index.max()
transactions.drop(index=[last_key], inplace=True)
transactions.tail()
| | sender | receiver | amount | sent_date |
---|---|---|---|---|
993 | coleman | sarah.evans | 36.29 | 2018-09-25 |
994 | king3246 | john | 25.37 | 2018-09-25 |
995 | shernandez | kristen1581 | 75.77 | 2018-09-25 |
996 | leah6255 | jholloway | 63.62 | 2018-09-25 |
997 | pamela | michelle4225 | 2.54 | 2018-09-25 |