This Jupyter Notebook will share more details about how to process your data.
Data processing is like preparing ingredients before cooking: if you prepare them poorly (e.g., leave things half-peeled and dirty), the meal will taste bad no matter how skilled a chef you are.
The same is true in machine learning. Processing your dataset well is one of the most important things you can do to get your model to perform well.
You can read more about dataset processing on the course notes here.
If you haven't already, follow the setup instructions here to get all necessary software installed.
## The `?` symbol

As you go through this notebook and learn more about processing data in IPython, it will be helpful to know the `?` symbol.
For example, try typing the following into Python:
import sklearn
sklearn?
Typing the `?` symbol after a function, module, or variable brings up the documentation for that bit of code, assuming it exists. It'll tell you more about the variable, function, or module.
import pandas as pd
from sklearn import preprocessing
Download the student performance data and change the path below to wherever you put the data.
student_data = pd.read_csv('../data/student/student-mat.csv', sep=';')
student_data.head()
Looking at the data above, we want to convert a number of the columns from categorical to numerical. Most machine learning models deal with numbers and don't know how to model data in text form. As a result, we need to learn how to do things such as convert the values in the `school` column to numbers.
## Converting the `school` column

# This shows a list of unique values and how many times each appears
student_data['school'].value_counts()
# Converting values in the school column to numbers
# We define a function that takes a single value, then apply it to all the values
def convert_school(row):
    if row == 'GP':
        return 0
    elif row == 'MS':
        return 1
    else:
        return None
# Here's a slow way of using the above function
%%time
converted_school = []
for row in student_data['school']:
    new_value = convert_school(row)
    converted_school.append(new_value)
converted_school
# Don't do this! It's very slow.
## `.apply`

This will do the same thing as the for loop above, but much faster. It applies a function to every value of a `Series` (or every row/column of a `DataFrame`).
%%time
converted_school = student_data['school'].apply(convert_school)
converted_school
Look how much faster that was!
## `.map()`

You can also use the `.map()` function to map certain values to other values.
For example, imagine you had a column named `'colors'` that contained the values `"red"` and `"blue"`, and you wanted to convert these to the numbers `1` and `2`.
mappings = {
'red': 1,
'blue': 2
}
data['colors_mapped'] = data['colors'].map(mappings)
The above will create a new column called `colors_mapped` that has the values `1` and `2`.
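One caveat worth knowing: any value that doesn't appear in the mapping dictionary gets mapped to NaN rather than raising an error. A small runnable sketch (the colors data here is made up for illustration):

```python
import pandas as pd

data = pd.DataFrame(data={'colors': ['red', 'blue', 'green']})
mappings = {'red': 1, 'blue': 2}

# 'green' has no entry in the dictionary, so it becomes NaN
data['colors_mapped'] = data['colors'].map(mappings)
print(data)
```

This is handy for spotting unexpected categories, but it also means silent missing values if your dictionary is incomplete.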
# sklearn's LabelEncoder does the same conversion automatically:
# it assigns an integer to each unique value in the column
enc_school = preprocessing.LabelEncoder()
transformed_school = enc_school.fit_transform(student_data['school'])
transformed_school
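The fitted encoder stores the label assigned to each integer (sorted alphabetically) in its `classes_` attribute, and `inverse_transform` recovers the original strings. A minimal sketch using made-up school values:

```python
from sklearn import preprocessing

# Fit on a small, made-up list of school codes
enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(['GP', 'MS', 'GP'])

print(enc.classes_)                  # which label each integer code stands for
print(enc.inverse_transform(codes))  # back to the original strings
```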
See example at https://stackoverflow.com/a/43589167/2159992
# First, label-encode the Mjob column into integers
enc_mjob = preprocessing.LabelEncoder()
encoded_mjob = enc_mjob.fit_transform(student_data['Mjob'])
encoded_mjob
# Then one-hot encode: each integer becomes its own 0/1 column
# (note: in scikit-learn 1.2+ this argument is named sparse_output instead of sparse)
onehot_mjob = preprocessing.OneHotEncoder(sparse=False)
transformed_mjob = onehot_mjob.fit_transform(encoded_mjob.reshape(-1, 1))
transformed_mjob
Once we've fitted the label encoder and one-hot encoder, we can use them to transform more values.
onehot_mjob.transform(enc_mjob.transform(['other', 'health']).reshape(-1,1))
For instance, what if we want to create a new column with a 1 if both parents have the highest level of education measured?
medu_index = student_data.columns.get_loc('Medu')
fedu_index = student_data.columns.get_loc('Fedu')

def both_parents_edu(row):
    if row[medu_index] >= 4 and row[fedu_index] >= 4:
        return 1
    else:
        return 0
# axis 1 means that we will apply the function to each row
student_data['parents_high_edu'] = student_data.apply(both_parents_edu, axis=1)
student_data.head(10)
`pandas` has a lot of built-in methods for working with text-based data, and `sklearn` similarly has a number of modules for this.
This section gives a brief outline of the things you can try.
If you want to see a fuller list, with examples, of how `pandas` deals with text data, you can look at the documentation here.
#### First, I'm going to make some fake data that we can work with for the rest of this section
data = pd.DataFrame(data={'text': ['apple', '%badly,formatted,data%', 'pear']})
data
Okay, we want to remove the `','` and `'%'` symbols from the data. How do we do so?
data['text_removed'] = data['text'].str.replace(',', '')
data
Nice. Now try to replace the `'%'` symbols.
#### Your code here
Now, we want to check whether the text contains certain values, and keep only the rows that contain those values.
### Again, I have to make some fake data
data = pd.DataFrame(data={'text': ['Nueva Maverick', 'San Francisco Maverick', 'Vikings']})
data
Cool, what if we only wanted to get the rows that contain the word `'Maverick'`?
data['text'].str.contains('Maverick')
Now we can use this `Series` of boolean `True` and `False` values to index into our data!
condition = data['text'].str.contains('Maverick')
filtered_data = data[condition]
filtered_data
Some are listed below:

- `str.startswith()` and `str.endswith()` - check whether a string starts or ends with a given argument
- `str.count()` - counts the number of appearances of a certain pattern
- `str.isnumeric()` - checks whether the string is numeric (e.g., `23123` is numeric whereas `213123abc` is not)
- `str.split()` - splits each string on some delimiter and returns the pieces (a `DataFrame` if you pass `expand=True`)

There's plenty more, and you can see the documentation here for more.
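A quick runnable sketch of a few of these methods, on a made-up `Series`:

```python
import pandas as pd

s = pd.Series(['apple pie', 'apple tart', 'banana'])

print(s.str.startswith('apple'))  # True, True, False
print(s.str.count('a'))           # number of 'a' characters in each string
print(s.str.isnumeric())          # False for all three of these
print(s.str.split(' '))           # each string split into a list of words
```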
raw_text = ["""This is a giant series of sentences that you want to convert into a DataFrame containing
the raw counts for each word. There are some abbr. and some punctuations here and there that make things more complicated.
So how in the world do we turn this into something that we can build a machine learning model off of?
"""]
Okay, so we want to turn the above into a DataFrame where every column is a different word, and each entry stores the number of times that word came up.
We're going to use the `CountVectorizer` class in `sklearn`.
A more in-depth tutorial on how to use it, and more, can be found here.
from sklearn.feature_extraction.text import CountVectorizer
# Initializing an empty CountVectorizer object
count_vect = CountVectorizer()
# Now we fit the object to our actual data
counts = count_vect.fit_transform(raw_text)
# The result is a sparse matrix, which saves memory by storing only the non-zero counts
counts
# Let's use the `.todense()` function to turn this sparse matrix into something that can be transformed into a DataFrame
word_counts_df = pd.DataFrame(data=counts.todense())
word_counts_df
Great, but what does each of the columns mean?
We can inspect the `count_vect.vocabulary_` attribute to find out.
count_vect.vocabulary_
Great. Now we know the word behind each of the columns.
Your challenge: Write some code to rename the columns of `word_counts_df` to the corresponding words in `count_vect.vocabulary_`.
# If you're successful it should look like the output below.
| | abbr | and | are | build | can | complicated | containing | convert | counts | dataframe | ... | there | things | this | to | turn | want | we | word | world | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |

1 rows × 45 columns
### Your code here
To show you how to deal with null values, I'm going to make some simulated data of students.
import numpy as np
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
num_friends_or_none = list(range(0, 20)) + [None] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
new_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})
new_data.head(n=20)
# Drop the rows where '# Friends' is missing (returns a new Series)
new_data['# Friends'].dropna()
# Drop every row of the DataFrame that has any missing value
new_data.dropna()
# Alternatively, fill the missing values in with the column's average
average_friends = new_data['# Friends'].mean()
new_data['# Friends'].fillna(average_friends)
# Assign the result back to actually store the filled-in column
new_data['# Friends'] = new_data['# Friends'].fillna(average_friends)
If missing values are marked with a placeholder like `"Unknown"` instead of NaN, try the `.replace` function.
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
num_friends_or_none = list(range(0, 20)) + ["Unknown"] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
unknown_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})
unknown_data
unknown_data.replace("Unknown", 10)
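Note that `.replace`, like most pandas methods, returns a new object rather than modifying the DataFrame in place, so assign the result back if you want to keep it. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame(data={'# Friends': [3, 'Unknown', 7]})

# Without assignment, df itself is unchanged
cleaned = df.replace('Unknown', 10)
print(cleaned)
print(df)
```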
## Standardizing the data

By standardizing, I mean transforming our data so that it has a mean of 0 and a standard deviation of 1.
Why would we want to do this?
Well, many models will produce strange parameter estimates if different bits of our data are in wildly different ranges.
Many researchers have noted the importance of standardizing variables for multivariate analysis.
Otherwise, variables measured at different scales do not contribute equally to the analysis.
For example, in boundary detection, a variable that ranges between 0 and 100 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization in effect gives the variable with the larger range a weight of 100 in the analysis.
Transforming the data to comparable scales can prevent this problem. Typical data standardization procedures equalize the range and/or data variability.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(new_data)
The above will transform the data so that all the columns have an average of 0 and a standard deviation of 1.
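You can check the result yourself: after scaling, each column's mean should be (numerically) 0 and its standard deviation 1. A small self-contained sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```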
You can read the full documentation for the `StandardScaler` here.
Maybe you have data in a column that's a mashup between multiple values.
For example, imagine you have a column that stores values like `'8th Grade - 13 years old'` and `'12th grade - 17 years old'`, and you want to create two columns, `grade` and `age`, to store the two separate bits of data.
How do you do so?
# I'm going to generate some fake data here. Ignore the details below.
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
grades_and_ages = ['Grade {grade} - {age} years old'.format(grade=grade, age=grade+6) for grade in grades]
num_friends_or_none = list(range(0, 20)) + ["Unknown"] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
combined_data = pd.DataFrame(data={'Grade and Age': grades_and_ages, '# Friends': num_friends})
combined_data
Awesome, now let's split things up. We'll use the built-in `.str.split()` function with the extra argument `expand=True`.
The `expand=True` will convert the split data into a `DataFrame` instead of keeping a list of values.
(Try taking out `expand=True` and seeing what happens.)
combined_data['Grade and Age'].str.split(' - ', expand=True)
Your challenge: Write some code that splits the `'Grade and Age'` column into separate `grade` and `age` columns in the `combined_data` DataFrame.

### Your code here