When you receive a CSV file from an external source, you'll want to get a feel for the data.
Let's import some data and then learn how to explore it.
# Setup
import os
import pandas as pd
# We use os.path.join because Windows uses a back slash (\) to separate directories
# while others use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name
'data/users.csv'
If you want to take a peek at your CSV file, you could open it in an editor.
Let's just use some standard Python to see the first couple lines of the file.
# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
for _ in range(5):
# The `file` object is an iterator, so just get the next line
print(next(lines))
,first_name,last_name,email,email_verified,signup_date,referral_count,balance aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14 acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45 adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12 adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01
Notice how the first line is a header row. It has column names in it. By default, it will be assumed that the first row is your header row.
Also note how the first column of that header row is empty...the values below in that first column appear to be usernames. They are what we want for the index.
We can use the index_col
parameter of the pandas.read_csv
function.
# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)
A quick way to check and see if you got your CSV file read correctly is to use the DataFrame.head
method. This gives you the first x number of rows. The head method, by default, returns 5 records. You can specify the number you want as the first argument, for instance users.head(10)
will return the first 10 rows.
users.head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
aaron | Aaron | Davis | aaron6348@gmail.com | True | 2018-08-31 | 6 | 18.14 |
acook | Anthony | Cook | cook@gmail.com | True | 2018-05-12 | 2 | 55.45 |
adam.saunders | Adam | Saunders | adam@gmail.com | False | 2018-05-29 | 3 | 72.12 |
adrian | Adrian | Fang | adrian.fang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
adrian.blair | Adrian | Blair | adrian9335@gmail.com | True | 2018-06-16 | 7 | 25.85 |
Nice! We got it. So let's see how many rows we have. There are a couple of ways.
# Pythonic approach still works
len(users)
475
Side note: This length call is quick. Under the covers this DataFrame.__len__
call is actually performing a len(df.index)
, counting the rows by using the index. You might see older code that uses the style of len(df.index)
to get a count of rows. As of pandas version 0.11, len(df)
is the same as len(df.index)
.
The DataFrame.shape
property works just like np.array.shape
does. This is the length of each axis of your data frame, rows and columns.
users.shape
(475, 7)
The DataFrame.count
method will count up each column for how many non-empty values we have.
users.count()
first_name 475 last_name 430 email 475 email_verified 475 signup_date 475 referral_count 475 balance 475 dtype: int64
Most of our columns include values for each row, but looks like last_name has some missing ones. The missing data will show up as np.nan
-- NumPy's not a number -- in those records.
The count
method is data missing aware.
Remember that a DataFrame
has the ability to contain multiple data types or dtypes
.
You can use the DataFrame.dtypes
to see the dtype
of each column.
users.dtypes
first_name object last_name object email object email_verified bool signup_date object referral_count int64 balance float64 dtype: object
As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how automatically email_verified
is bool
, referral_count
is an integer, and balance
a float. This happened when we used pd.read_csv
.
One thing to note though that the signup_date
field is an object
and not a datetime
. You can convert these during or after import if you need to, and we'll do some of that later in this course.
The DataFrame.describe
method is a great way to get a vibe for all numeric data in your DataFrame
. You'll notice only columns that have numeric data are returned, and ones with booleans or text, like email_verified
and first_name
are left out.
You'll see lots of different aggregations.
users.describe()
referral_count | balance | |
---|---|---|
count | 475.000000 | 475.000000 |
mean | 3.429474 | 49.933263 |
std | 2.281085 | 28.280448 |
min | 0.000000 | 0.050000 |
25% | 2.000000 | 25.305000 |
50% | 3.000000 | 51.570000 |
75% | 5.000000 | 74.480000 |
max | 7.000000 | 99.900000 |
Most of these aggregations are available by themselves as well
# The mean or average
users.mean()
email_verified 0.818947 referral_count 3.429474 balance 49.933263 dtype: float64
# Standard deviation
users.std()
email_verified 0.385468 referral_count 2.281085 balance 28.280448 dtype: float64
# The minimum of each column
users.min()
first_name Aaron email aalvarez@hotmail.com email_verified False signup_date 2018-01-01 referral_count 0 balance 0.05 dtype: object
# The maximum of each column
users.max()
first_name Zachary email zneal@gmail.com email_verified True signup_date 2018-09-25 referral_count 7 balance 99.9 dtype: object
Since columns are in reality a Series
you can quickly access their counts of different values using the value_counts
method.
users.email_verified.value_counts()
True 389 False 86 Name: email_verified, dtype: int64
By default the value counts are sorted descending, so the most frequent are at top.
# Most common first name
users.first_name.value_counts().head()
Mark 11 David 10 Michael 9 Jennifer 7 Christopher 7 Name: first_name, dtype: int64
You can create a new DataFrame
that is sorted by using the sort_values
method.
Let's sort the DataFrame so that the user with the highest balance
is at the top. By default, ascending order is assumed, you can change that by setting the ascending
keyword argument to False
.
users.sort_values(by='balance', ascending=False).head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
twhite | Timothy | White | white5136@hotmail.com | True | 2018-07-06 | 5 | 99.90 |
karen.snow | Karen | Snow | ksnow@yahoo.com | True | 2018-05-06 | 2 | 99.38 |
king | Billy | King | billy.king@hotmail.com | True | 2018-05-29 | 4 | 98.80 |
king3246 | Brittney | King | brittney@yahoo.com | True | 2018-04-15 | 6 | 98.79 |
crane203 | Valerie | Crane | valerie7051@hotmail.com | True | 2018-05-12 | 3 | 98.69 |
You'll notice that sort_values
call actually created a new DataFrame
. If you want to permanently change the sort from the default (sorted by index), you can pass True
as an argument to the inplace
keyword parameter.
# Sort first by last_name and then first_name. By default, np.nan show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)
# Sort order is now changed
users.head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
darlene.adams | Darlene | Adams | adams@hotmail.com | True | 2018-09-15 | 2 | 67.02 |
lauren | Lauren | Aguilar | lauren.aguilar@summers.com | False | 2018-05-31 | 4 | 69.90 |
daniel | Daniel | Allen | allen@hotmail.com | False | 2018-07-01 | 2 | 21.21 |
kallen | Kathy | Allen | kathy@hotmail.com | False | 2018-02-20 | 1 | 43.72 |
alvarado | Denise | Alvarado | alvarado@hotmail.com | True | 2018-09-07 | 6 | 26.72 |
And if you want to sort by the index, like it was originally, you can use the sort_index
method.
users.sort_index(inplace=True)
users.head()
first_name | last_name | email_verified | signup_date | referral_count | balance | ||
---|---|---|---|---|---|---|---|
aaron | Aaron | Davis | aaron6348@gmail.com | True | 2018-08-31 | 6 | 18.14 |
acook | Anthony | Cook | cook@gmail.com | True | 2018-05-12 | 2 | 55.45 |
adam.saunders | Adam | Saunders | adam@gmail.com | False | 2018-05-29 | 3 | 72.12 |
adrian | Adrian | Fang | adrian.fang@teamtreehouse.com | True | 2018-04-28 | 3 | 30.01 |
adrian.blair | Adrian | Blair | adrian9335@gmail.com | True | 2018-06-16 | 7 | 25.85 |