Exploration Methods¶

When you receive a CSV file from an external source, you'll want to get a feel for the data.

Let's import some data and then learn how to explore it.

In [1]:

# Setup
import os
import pandas as pd

# We use os.path.join because Windows uses a back slash (\) to separate directories
#  while others use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name

Out[1]:

'data/users.csv'

CSV File Exploration¶

If you want to take a peek at your CSV file, you could open it in an editor.

Let's just use some standard Python to see the first couple lines of the file.

In [19]:

# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line 
        print(next(lines))

,first_name,last_name,email,email_verified,signup_date,referral_count,balance

aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14

acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45

adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12

adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01

Notice how the first line is a header row. It has column names in it. By default, it will be assumed that the first row is your header row.

Also note how the first column of that header row is empty...the values below in that first column appear to be usernames. They are what we want for the index.

We can use the index_col parameter of the pandas.read_csv function.

In [3]:

# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)

Explore your imported DataFrame¶

A quick way to check and see if you got your CSV file read correctly is to use the DataFrame.head method. This gives you the first x number of rows. The head method, by default, returns 5 records. You can specify the number you want as the first argument, for instance users.head(10) will return the first 10 rows.

In [4]:

users.head()

Out[4]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance
aaron	Aaron	Davis	aaron6348@gmail.com	True	2018-08-31	6	18.14
acook	Anthony	Cook	cook@gmail.com	True	2018-05-12	2	55.45
adam.saunders	Adam	Saunders	adam@gmail.com	False	2018-05-29	3	72.12
adrian	Adrian	Fang	adrian.fang@teamtreehouse.com	True	2018-04-28	3	30.01
adrian.blair	Adrian	Blair	adrian9335@gmail.com	True	2018-06-16	7	25.85

Nice! We got it. So let's see how many rows we have. There are a couple of ways.

In [5]:

# Pythonic approach still works
len(users)

Out[5]:

Side note: This length call is quick. Under the covers this DataFrame.__len__ call is actually performing a len(df.index), counting the rows by using the index. You might see older code that uses the style of len(df.index) to get a count of rows. As of pandas version 0.11, len(df) is the same as len(df.index).

The DataFrame.shape property works just like np.array.shape does. This is the length of each axis of your data frame, rows and columns.

In [6]:

users.shape

Out[6]:

(475, 7)

Exploring from a bird's eye view¶

Counts¶

The DataFrame.count method will count up each column for how many non-empty values we have.

In [7]:

users.count()

Out[7]:

first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64

Most of our columns include values for each row, but looks like last_name has some missing ones. The missing data will show up as np.nan -- NumPy's not a number -- in those records.

The count method is data missing aware.

Remember that a DataFrame has the ability to contain multiple data types or dtypes.

You can use the DataFrame.dtypes to see the dtype of each column.

Data types¶

In [8]:

users.dtypes

Out[8]:

first_name         object
last_name          object
email              object
email_verified       bool
signup_date        object
referral_count      int64
balance           float64
dtype: object

As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how automatically email_verified is bool, referral_count is an integer, and balance a float. This happened when we used pd.read_csv.

One thing to note though that the signup_date field is an object and not a datetime. You can convert these during or after import if you need to, and we'll do some of that later in this course.

Describe your data¶

The DataFrame.describe method is a great way to get a vibe for all numeric data in your DataFrame. You'll notice only columns that have numeric data are returned, and ones with booleans or text, like email_verified and first_name are left out.

You'll see lots of different aggregations.

In [9]:

users.describe()

Out[9]:

	referral_count	balance
count	475.000000	475.000000
mean	3.429474	49.933263
std	2.281085	28.280448
min	0.000000	0.050000
25%	2.000000	25.305000
50%	3.000000	51.570000
75%	5.000000	74.480000
max	7.000000	99.900000

Most of these aggregations are available by themselves as well

In [10]:

# The mean or average
users.mean()

Out[10]:

email_verified     0.818947
referral_count     3.429474
balance           49.933263
dtype: float64

In [11]:

# Standard deviation
users.std()

Out[11]:

email_verified     0.385468
referral_count     2.281085
balance           28.280448
dtype: float64

In [12]:

# The minimum of each column
users.min()

Out[12]:

first_name                       Aaron
email             aalvarez@hotmail.com
email_verified                   False
signup_date                 2018-01-01
referral_count                       0
balance                           0.05
dtype: object

In [13]:

# The maximum of each column
users.max()

Out[13]:

first_name                Zachary
email             zneal@gmail.com
email_verified               True
signup_date            2018-09-25
referral_count                  7
balance                      99.9
dtype: object

Since columns are in reality a Series you can quickly access their counts of different values using the value_counts method.

In [14]:

users.email_verified.value_counts()

Out[14]:

True     389
False     86
Name: email_verified, dtype: int64

By default the value counts are sorted descending, so the most frequent are at top.

In [15]:

# Most common first name
users.first_name.value_counts().head()

Out[15]:

Mark           11
David          10
Michael         9
Jennifer        7
Christopher     7
Name: first_name, dtype: int64

Rearranging your data¶

You can create a new DataFrame that is sorted by using the sort_values method.

Let's sort the DataFrame so that the user with the highest balance is at the top. By default, ascending order is assumed, you can change that by setting the ascending keyword argument to False.

In [16]:

users.sort_values(by='balance', ascending=False).head()

Out[16]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance
twhite	Timothy	White	white5136@hotmail.com	True	2018-07-06	5	99.90
karen.snow	Karen	Snow	ksnow@yahoo.com	True	2018-05-06	2	99.38
king	Billy	King	billy.king@hotmail.com	True	2018-05-29	4	98.80
king3246	Brittney	King	brittney@yahoo.com	True	2018-04-15	6	98.79
crane203	Valerie	Crane	valerie7051@hotmail.com	True	2018-05-12	3	98.69

You'll notice that sort_values call actually created a new DataFrame. If you want to permanently change the sort from the default (sorted by index), you can pass True as an argument to the inplace keyword parameter.

In [17]:

# Sort first by last_name and then first_name. By default, np.nan show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)
# Sort order is now changed
users.head()

Out[17]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance
darlene.adams	Darlene	Adams	adams@hotmail.com	True	2018-09-15	2	67.02
lauren	Lauren	Aguilar	lauren.aguilar@summers.com	False	2018-05-31	4	69.90
daniel	Daniel	Allen	allen@hotmail.com	False	2018-07-01	2	21.21
kallen	Kathy	Allen	kathy@hotmail.com	False	2018-02-20	1	43.72
alvarado	Denise	Alvarado	alvarado@hotmail.com	True	2018-09-07	6	26.72

And if you want to sort by the index, like it was originally, you can use the sort_index method.

In [18]:

users.sort_index(inplace=True)
users.head()

Out[18]:

	first_name	last_name	email	email_verified	signup_date	referral_count	balance
aaron	Aaron	Davis	aaron6348@gmail.com	True	2018-08-31	6	18.14
acook	Anthony	Cook	cook@gmail.com	True	2018-05-12	2	55.45
adam.saunders	Adam	Saunders	adam@gmail.com	False	2018-05-29	3	72.12
adrian	Adrian	Fang	adrian.fang@teamtreehouse.com	True	2018-04-28	3	30.01
adrian.blair	Adrian	Blair	adrian9335@gmail.com	True	2018-06-16	7	25.85