#!/usr/bin/env python
# coding: utf-8

# # Exploration Methods

# When you receive a CSV file from an external source, you'll want to get a feel for the data.
#
# Let's import some data and then learn how to explore it.

# In[1]:

# Setup
import os

import pandas as pd

from utils import render

# We use os.path.join because Windows uses a backslash (\) to separate directories
# while other operating systems use a forward slash (/)
users_file_name = os.path.join('data', 'users.csv')
users_file_name

# ## CSV File Exploration

# If you want to take a peek at your CSV file, you could open it in an editor.
#
# Let's just use some standard Python to see the first couple of lines of the file.

# In[2]:

# Open the file and print out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line
        render(next(lines))

# Notice how the first line is a header row containing the column names. By default, `pandas.read_csv` assumes that the first row is your header row.
#
# Also note how the first column of that header row is empty. The values below it in that first column appear to be usernames, and they are what we want for the index.
#
# We can use the `index_col` parameter of the `pandas.read_csv` function.

# In[3]:

# Create a new `DataFrame` and set the index to the first column
users = pd.read_csv(users_file_name, index_col=0)

# ## Explore your imported DataFrame

# A quick way to check whether your CSV file was read correctly is to use the [`DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method. This gives you the first **x** rows. It defaults to 5.

# In[4]:

users.head()

# Nice! We got it. So let's see how many rows we have. There are a couple of ways.

# In[5]:

# Pythonic approach still works
len(users)

# *Side note*: This length call is quick. Under the covers, the `DataFrame.__len__` call actually performs `len(df.index)`, counting the rows by using the index.
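# To see that equivalence for yourself, here is a minimal sketch using a small hypothetical frame (the names and values are made up for illustration, not taken from `users.csv`):

# ```python
# import pandas as pd
#
# # Hypothetical three-row frame standing in for the users DataFrame
# df = pd.DataFrame({'balance': [10.5, 3.25, 7.0]},
#                   index=['alice', 'bob', 'carol'])
#
# # len(df) delegates to the index, so the two calls always agree
# print(len(df), len(df.index))  # 3 3
# ```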
# You might see older code that uses the `len(df.index)` style to get a count of rows. As of pandas version 0.11, `len(df)` is the same as `len(df.index)`.

# The `DataFrame.shape` property works just like `np.array.shape` does. It gives the length of each axis of your data frame: rows and columns.

# In[6]:

users.shape

# ### Exploring from a bird's eye view

# #### Counts

# The [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) method counts how many non-empty values each column has.

# In[7]:

users.count()

# Looks like most of our columns have values in every row, but **last_name** has some `np.nan` values.
#
# The `count` method is aware of missing data.

# Remember that a `DataFrame` can contain multiple data types, or [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes).
#
# You can use `DataFrame.dtypes` to see the `dtype` of each column.

# #### Data types

# In[8]:

users.dtypes

# As you can see, most of the data types of these columns were inferred, or assumed, correctly. See how **`email_verified`** is automatically a `bool`, **`referral_count`** is an integer, and **`balance`** a float. This happened when we used `pd.read_csv`.
#
# One thing to note, though, is that the **`signup_date`** field is an `object` and not a `datetime`. You can convert these during or after import if you need to, and we'll do some of that later in this course.

# #### Describe your data

# The `DataFrame.describe` method is a great way to get a vibe for all numeric data in your `DataFrame`. You'll see lots of different aggregations.
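# One thing worth knowing before we run it: `describe` summarizes only numeric columns by default. A minimal sketch on a hypothetical mini-frame (made-up values, not from `users.csv`) shows this, along with the `include='all'` option that pulls in the non-numeric columns too:

# ```python
# import pandas as pd
#
# # Hypothetical frame: one numeric column, one object (string) column
# df = pd.DataFrame({'balance': [10.0, 20.0, 30.0],
#                    'name': ['a', 'b', 'a']})
#
# # By default, describe() summarizes numeric columns only
# print(df.describe().loc['mean', 'balance'])  # 20.0
#
# # include='all' adds object columns (count / unique / top / freq rows)
# print(df.describe(include='all').loc['unique', 'name'])  # 2
# ```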
# In[9]:

users.describe()

# Most of these aggregations are also available by themselves.

# In[10]:

# The mean or average. numeric_only=True limits the calculation to numeric
# columns, which newer versions of pandas require for mixed-type frames
users.mean(numeric_only=True)

# In[11]:

# Standard deviation
users.std(numeric_only=True)

# In[12]:

# The minimum of each column
users.min()

# In[13]:

# The maximum of each column
users.max()

# Since columns are in reality a `Series`, you can quickly access their counts of different values using the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method.

# In[14]:

users.email_verified.value_counts()

# By default the value counts are sorted in descending order, so the most frequent values are at the top.

# In[15]:

# Most common first name
users.first_name.value_counts().head()

# ### Rearranging your data

# You can create a new, sorted `DataFrame` by using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method.
#
# Let's sort the DataFrame so that the user with the highest **`balance`** is at the top. By default, ascending order is assumed; you can change that by setting the `ascending` keyword argument to `False`.

# In[16]:

users.sort_values(by='balance', ascending=False).head()

# You'll notice that the `sort_values` call actually created a new `DataFrame`. If you want to permanently change the sort from the default (sorted by index), you can pass `True` as an argument to the `inplace` keyword parameter.

# In[17]:

# Sort first by last_name and then first_name. By default, np.nan values show up at the end
users.sort_values(by=['last_name', 'first_name'], inplace=True)

# Sort order is now changed
users.head()

# And if you want to sort by the index, like it was originally, you can use the `sort_index` method.

# In[18]:

users.sort_index(inplace=True)
users.head()
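# To recap the sorting behavior above, here is a minimal, self-contained sketch on a hypothetical frame (made-up usernames and values, not from `users.csv`). It shows `ascending=False`, that `sort_values` returns a new frame, and the `na_position` parameter that controls where missing values land:

# ```python
# import numpy as np
# import pandas as pd
#
# # Hypothetical frame mirroring the sort examples above
# df = pd.DataFrame({'last_name': ['Ng', np.nan, 'Adams'],
#                    'balance': [9.0, 12.5, 5.0]},
#                   index=['carol', 'alice', 'bob'])
#
# # Highest balance first; the original df is left untouched
# by_balance = df.sort_values(by='balance', ascending=False)
# print(by_balance.index.tolist())  # ['alice', 'carol', 'bob']
#
# # Missing last names sort to the end by default; na_position='first' flips that
# by_name = df.sort_values(by='last_name', na_position='first')
# print(by_name.index.tolist())  # ['alice', 'bob', 'carol']
# ```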