#!/usr/bin/env python
# coding: utf-8

# # Exploration Methods

# When you receive a CSV file from an external source, you'll want to get a feel for the data.
#
# Let's import some data and then learn how to explore it.

# In[1]:

# Setup
import os

import pandas as pd

from utils import render

# We use os.path.join because Windows uses a backslash (\) to separate directories
# while other operating systems use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name

# ## CSV File Exploration

# If you want to take a peek at your CSV file, you could open it in an editor.
#
# Let's just use some standard Python to see the first couple of lines of the file.

# In[2]:

# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line
        render(next(lines))

# Notice how the first line is a header row containing the column names. By default, `pandas.read_csv` assumes that the first row is your header row.
#
# Also note how the first column of that header row is empty. The values below it in that first column appear to be usernames, and they are what we want for the index.
#
# We can use the `index_col` parameter of the `pandas.read_csv` function.

# In[3]:

# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)

# ## Explore your imported DataFrame

# A quick way to check whether your CSV file was read correctly is to use the [`DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method. This gives you the first **x** rows. It defaults to 5.

# In[4]:

users.head()

# Nice! We got it. So let's see how many rows we have. There are a couple of ways.

# In[5]:

# Pythonic approach still works
len(users)

# *Side note*: This length call is quick. Under the covers, the `DataFrame.__len__` call actually performs `len(df.index)`, counting the rows by using the index.
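# To see that equivalence for yourself, here is a minimal sketch using a small hypothetical frame (the names and values are made up for illustration, not taken from `users.csv`):

# ```python
# import pandas as pd
#
# # Hypothetical three-row frame standing in for the users DataFrame
# df = pd.DataFrame({'balance': [10.5, 3.25, 7.0]},
#                   index=['alice', 'bob', 'carol'])
#
# # len(df) delegates to the index, so the two calls always agree
# print(len(df), len(df.index))  # 3 3
# ```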
# You might see older code that uses the `len(df.index)` style to get a count of rows. As of pandas version 0.11, `len(df)` is the same as `len(df.index)`.

# The `DataFrame.shape` property works just like `np.array.shape` does. It gives the length of each axis of your data frame: rows and columns.

# In[6]:

users.shape

# ### Exploring from a bird's eye view

# #### Counts

# The [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) method counts how many non-empty values each column has.

# In[7]:

users.count()

# Looks like most of our columns have values in every row, but **last_name** has some `np.nan` values.
#
# The `count` method is aware of missing data.

# Remember that a `DataFrame` can contain multiple data types, or [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes).
#
# You can use `DataFrame.dtypes` to see the `dtype` of each column.

# #### Data types

# In[8]:

users.dtypes

# As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how **`email_verified`** is automatically a `bool`, **`referral_count`** is an integer, and **`balance`** a float. This happened when we used `pd.read_csv`.
#
# One thing to note, though, is that the **`signup_date`** field is an `object` and not a `datetime`. You can convert these during or after import if you need to, and we'll do some of that later in this course.

# #### Describe your data

# The `DataFrame.describe` method is a great way to get a vibe for all numeric data in your `DataFrame`. You'll see lots of different aggregations.
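# One thing worth knowing before we run it: `describe` summarizes only numeric columns by default. A minimal sketch on a hypothetical mini-frame (made-up values, not from `users.csv`) shows this, along with the `include='all'` option that pulls in the non-numeric columns too:

# ```python
# import pandas as pd
#
# # Hypothetical frame: one numeric column, one object (string) column
# df = pd.DataFrame({'balance': [10.0, 20.0, 30.0],
#                    'name': ['a', 'b', 'a']})
#
# # By default, describe() summarizes numeric columns only
# print(df.describe().loc['mean', 'balance'])  # 20.0
#
# # include='all' adds object columns (count / unique / top / freq rows)
# print(df.describe(include='all').loc['unique', 'name'])  # 2
# ```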
# In[9]:

users.describe()

# Most of these aggregations are also available by themselves.

# In[10]:

# The mean or average. numeric_only=True limits the calculation to numeric
# columns, which newer versions of pandas require for mixed-type frames
users.mean(numeric_only=True)

# In[11]:

# Standard deviation
users.std(numeric_only=True)

# In[12]:

# The minimum of each column
users.min()

# In[13]:

# The maximum of each column
users.max()

# Since columns are in reality a `Series`, you can quickly access their counts of different values using the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method.

# In[14]:

users.email_verified.value_counts()

# By default the value counts are sorted in descending order, so the most frequent values are at the top.

# In[15]:

# Most common first name
users.first_name.value_counts().head()

# ### Rearranging your data

# You can create a new, sorted `DataFrame` by using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method.
#
# Let's sort the DataFrame so that the user with the highest **`balance`** is at the top. By default, ascending order is assumed; you can change that by setting the `ascending` keyword argument to `False`.

# In[16]:

users.sort_values(by='balance', ascending=False).head()

# You'll notice that the `sort_values` call actually created a new `DataFrame`. If you want to permanently change the sort from the default (sorted by index), you can pass `True` as an argument to the `inplace` keyword parameter.

# In[17]:

# Sort first by last_name and then first_name. By default, np.nan values show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)

# Sort order is now changed
users.head()

# And if you want to sort by the index, like it was originally, you can use the `sort_index` method.

# In[18]:

users.sort_index(inplace=True)
users.head()
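# To recap the sorting behavior above, here is a minimal, self-contained sketch on a hypothetical frame (made-up usernames and values, not from `users.csv`). It shows `ascending=False`, that `sort_values` returns a new frame, and the `na_position` parameter that controls where missing values land:

# ```python
# import numpy as np
# import pandas as pd
#
# # Hypothetical frame mirroring the sort examples above
# df = pd.DataFrame({'last_name': ['Ng', np.nan, 'Adams'],
#                    'balance': [9.0, 12.5, 5.0]},
#                   index=['carol', 'alice', 'bob'])
#
# # Highest balance first; the original df is left untouched
# by_balance = df.sort_values(by='balance', ascending=False)
# print(by_balance.index.tolist())  # ['alice', 'carol', 'bob']
#
# # Missing last names sort to the end by default; na_position='first' flips that
# by_name = df.sort_values(by='last_name', na_position='first')
# print(by_name.index.tolist())  # ['alice', 'bob', 'carol']
# ```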