Dataframes 10/7/2019 & 10/25/2019

1. A recap from last time

Recall that we previously talked about pandas Series. A DataFrame is made up of columns, each of which behaves like an individual pandas Series, and it resembles tabular data in Microsoft Excel. Keep in mind that a DataFrame has both a row index and a column index. Let's construct a DataFrame object. A common way is with a dict of equal-length lists or NumPy arrays. Each column can then be accessed as a Series object.

In [2]:
import pandas as pd

states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']
year = [2000, 2001, 2002, 2001, 2002]
pop = [1.5, 1.7, 3.6, 2.4, 2.9]
data = {'States': states, 'Year': year, 'Population': pop}
dataframe = pd.DataFrame(data)
dataframe #May have to use print(dataframe) if running this as a script rather than in a notebook
Out[2]:
Population States Year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In [5]:
#Specify columns, index
dataframe2 = pd.DataFrame(data, columns = ['States', 'Population', 'Year'],
                          index = ['one', 'two', 'three', 'four', 'five'])

dataframe2
Out[5]:
States Population Year
one Ohio 1.5 2000
two Ohio 1.7 2001
three Ohio 3.6 2002
four Nevada 2.4 2001
five Nevada 2.9 2002
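
The paragraph above also mentioned NumPy arrays. As a quick sketch (the variable names here are just for illustration), the same construction works with equal-length arrays:

import numpy as np

array_data = {'States': np.array(['Ohio', 'Ohio', 'Nevada']),
              'Year': np.array([2000, 2001, 2002]),
              'Population': np.array([1.5, 1.7, 2.4])}
dataframe3 = pd.DataFrame(array_data) #each array becomes a column, just like the lists above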

Selecting a single column returns a Series object. Use the syntax below:

In [9]:
dataframe2['Population']
#dataframe2[['Population']] <- passing a list of columns returns a DataFrame object instead
Out[9]:
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
Name: Population, dtype: float64

With this in mind, we can create new columns in a similar manner:

In [7]:
dataframe2['West'] = [False, False, False, True, True]
dataframe2
Out[7]:
States Population Year West
one Ohio 1.5 2000 False
two Ohio 1.7 2001 False
three Ohio 3.6 2002 False
four Nevada 2.4 2001 True
five Nevada 2.9 2002 True
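
As a side note, rather than typing the flags by hand, the same kind of column can be derived from a condition on an existing column; a minimal sketch:

dataframe2['West'] = dataframe2['States'] == 'Nevada' #True where the state is Nevada, False otherwise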

2. Indexing

Sometimes we may want to retrieve one or more columns and possibly assign them to a variable. Selecting pieces of the DataFrame and examining them is a big part of exploratory data analysis, the phase that comes before machine learning.

In [54]:
salaries = pd.read_csv("Position_Salaries.csv")
In [11]:
salaries.columns # <- prints the column names
Out[11]:
Index(['Position', 'Level', 'Salary'], dtype='object')

Retrieve the Level and Salary columns:

In [12]:
salaries[['Level', 'Salary']]
Out[12]:
Level Salary
0 1 45000
1 2 50000
2 3 60000
3 4 80000
4 5 110000
5 6 150000
6 7 200000
7 8 300000
8 9 500000
9 10 1000000

Print the first five rows:

In [13]:
salaries[:5]
Out[13]:
Position Level Salary
0 Business Analyst 1 45000
1 Junior Consultant 2 50000
2 Senior Consultant 3 60000
3 Manager 4 80000
4 Country Manager 5 110000
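
An equivalent way to do this is head(), which defaults to the first five rows; a quick sketch:

salaries.head()  #same result as salaries[:5]
salaries.head(3) #first three rows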

2a. Boolean Indexing

In [18]:
salaries[salaries['Salary'] > 100000]
Out[18]:
Position Level Salary
4 Country Manager 5 110000
5 Region Manager 6 150000
6 Partner 7 200000
7 Senior Partner 8 300000
8 C-level 9 500000
9 CEO 10 1000000
In [21]:
salaries['Salary'][salaries['Salary'] > 100000]
Out[21]:
4     110000
5     150000
6     200000
7     300000
8     500000
9    1000000
Name: Salary, dtype: int64
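
Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A short sketch (the thresholds here are arbitrary):

salaries[(salaries['Salary'] > 100000) & (salaries['Level'] < 9)] #salary above 100000 AND level below 9
salaries[(salaries['Level'] < 3) | (salaries['Salary'] > 400000)] #level below 3 OR salary above 400000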

2b. iloc/loc

In [25]:
salaries.iloc[0] #prints first row, returns series object
Out[25]:
Position    Business Analyst
Level                      1
Salary                 45000
Name: 0, dtype: object
In [28]:
salaries.iloc[:, [0,1]] #prints all rows, first two columns (Position and Level, at positions 0 and 1)
Out[28]:
Position Level
0 Business Analyst 1
1 Junior Consultant 2
2 Senior Consultant 3
3 Manager 4
4 Country Manager 5
5 Region Manager 6
6 Partner 7
7 Senior Partner 8
8 C-level 9
9 CEO 10
In [30]:
salaries.iloc[[4,5,6], :] #prints the rows at positions 4, 5, and 6, all columns
Out[30]:
Position Level Salary
4 Country Manager 5 110000
5 Region Manager 6 150000
6 Partner 7 200000
In [56]:
#Slightly more complicated:
import numpy as np
salaries.iloc[np.r_[1:3, 6:8, 8], np.r_[0:1, 2:3]] #np.r_ concatenates slices and indices; slice stops are exclusive, but a bare index like the final 8 is included
Out[56]:
Position Salary
1 Junior Consultant 50000
2 Senior Consultant 60000
6 Partner 200000
7 Senior Partner 300000
8 C-level 500000
In [55]:
salaries.iloc[4, 1] = 88888 # Change individual value
salaries
Out[55]:
Position Level Salary
0 Business Analyst 1 45000
1 Junior Consultant 2 50000
2 Senior Consultant 3 60000
3 Manager 4 80000
4 Country Manager 88888 110000
5 Region Manager 6 150000
6 Partner 7 200000
7 Senior Partner 8 300000
8 C-level 9 500000
9 CEO 10 1000000

loc is fairly similar; instead of integer positions, use the row/column labels. For this specific example, let's change the index of the DataFrame to the positions.

In [57]:
salaries.set_index(['Position'], inplace = True)
salaries
Out[57]:
Level Salary
Position
Business Analyst 1 45000
Junior Consultant 2 50000
Senior Consultant 3 60000
Manager 4 80000
Country Manager 88888 110000
Region Manager 6 150000
Partner 7 200000
Senior Partner 8 300000
C-level 9 500000
CEO 10 1000000
In [59]:
salaries.loc[['Business Analyst', 'Junior Consultant'], :] #prints first two rows, all columns
Out[59]:
Level Salary
Position
Business Analyst 1 45000
Junior Consultant 2 50000
In [60]:
salaries.loc[:, ['Level', 'Salary']] #prints all rows, the Level and Salary columns
Out[60]:
Level Salary
Position
Business Analyst 1 45000
Junior Consultant 2 50000
Senior Consultant 3 60000
Manager 4 80000
Country Manager 88888 110000
Region Manager 6 150000
Partner 7 200000
Senior Partner 8 300000
C-level 9 500000
CEO 10 1000000
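
One more note on loc: unlike iloc, slicing by label includes both endpoints. A minimal sketch, assuming the Position index set above:

salaries.loc['Manager':'Partner', :]        #rows from Manager through Partner, inclusive
salaries.loc['Manager':'Partner', 'Salary'] #same rows, only the Salary column (returns a Series)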

We use the 'any' function to determine whether any value in the Salary column is less than $10,000.

In [61]:
(salaries.Salary < 10000).values.any()
Out[61]:
False

3. Basic Arithmetic

In [62]:
# Let's reset the index:
# multiply the salaries by 2.
salaries.reset_index(inplace = True)
salaries['Salary'] = salaries['Salary'] * 2
salaries
Out[62]:
Position Level Salary
0 Business Analyst 1 90000
1 Junior Consultant 2 100000
2 Senior Consultant 3 120000
3 Manager 4 160000
4 Country Manager 88888 220000
5 Region Manager 6 300000
6 Partner 7 400000
7 Senior Partner 8 600000
8 C-level 9 1000000
9 CEO 10 2000000

You can use other arithmetic operations when manipulating individual columns.
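
For instance, here is a short sketch of a few such operations (the new column names are made up for illustration):

salaries['Salary'] = salaries['Salary'] / 2                                #undo the doubling above
salaries['Salary_thousands'] = salaries['Salary'] / 1000                   #derived column, in thousands
salaries['Above_Average'] = salaries['Salary'] > salaries['Salary'].mean() #boolean column from a comparison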

4. Filling Missing Data

We will be importing the wine dataset. This dataset contains null (missing) values: 5 of its columns contain missing data, spread across 22 rows.

In [11]:
wine = pd.read_csv("winequality-red.csv")
wine.isnull().any(axis = 0) #True for each column with at least one missing value; chain .sum() to count missing values instead
Out[11]:
fixed acidity           False
volatile acidity        False
citric acid              True
residual sugar          False
chlorides                True
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
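
A quick sketch of how the figures quoted above (5 columns and 22 rows with missing data) could be checked, chaining .sum() as the comment suggests:

wine.isnull().sum()             #number of missing values in each column
wine.isnull().any(axis=0).sum() #number of columns with at least one missing value
wine.isnull().any(axis=1).sum() #number of rows with at least one missing value
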
In [4]:
#Check the first few rows of the data frame. 
wine.head()
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [ ]:
#wine.dropna(axis = 0, inplace = True) #drop every row containing a missing value
#wine.dropna(axis = 1, inplace = True) #drop every column containing a missing value

Sometimes, we may want to fill in the missing data with a specific value. Keep in mind that dropping null values or replacing them with a particular value may ultimately affect the accuracy of your model. It's important to choose a strategy for handling null values based on the context of the problem; filling them in with substitute values is referred to as data imputation. Here is an article on some common approaches: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
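
For example, one common approach from the article is mean imputation, where each missing value is replaced by the mean of its column. A minimal sketch (not run here, since the cells below fill with constants instead):

wine.fillna(wine.mean(), inplace = True) #fill each column's missing values with that column's mean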

In [67]:
#Fill all missing data with zeros
wine.fillna(0, inplace = True)
wine.isnull().any(axis=0)
Out[67]:
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool

We can also fill each column's missing data with a value specific to that column.

In [13]:
wine.fillna({wine.columns[2]: 0, wine.columns[4]: 2}, inplace = True) #Fill missing values in columns 2 and 4 (citric acid and chlorides) with 0 and 2, respectively. wine.columns holds the column names.
wine.isnull().any(axis = 0)
Out[13]:
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
In [70]:
wine.head()
Out[70]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5