This Jupyter Notebook will share more details about how to process your data.
Data processing is like preparing ingredients before cooking: if you prepare them poorly (e.g., leave things half-peeled and dirty), the meal will taste bad no matter how skilled a chef you are.
The same is true in machine learning. Processing your dataset well is one of the most important things you can do to get your model to perform well.
You can read more about dataset processing on the course notes here.
If you haven't already, follow the setup instructions here to get all necessary software installed.
## The `?` symbol

As you go through this notebook and learn more about processing data in IPython, it will be helpful to know the `?` symbol.
For example, try typing the following into Python:
import sklearn
sklearn?
Typing the `?` symbol after a function, module, or variable brings up the documentation for that bit of code, assuming it exists. It'll tell you more about the variable, function, or module.
import pandas as pd
from sklearn import preprocessing
Download the student performance data and change the path below to wherever you put the data.
student_data = pd.read_csv('../data/student/student-mat.csv', sep=';')
student_data.head()
Looking at the data above, we want to convert a number of the columns from categorical to numerical. Most machine learning models deal with numbers and don't know how to model data in text form. As a result, we need to learn how to do things such as convert the values in the `school` column to numbers.
## Converting the `school` column

# This shows a list of unique values and how many times each appears
student_data['school'].value_counts()
# Converting values in the school column to numbers
# We define a function that takes a single value, then apply it to all the values
def convert_school(row):
    if row == 'GP':
        return 0
    elif row == 'MS':
        return 1
    else:
        return None
# Here's a slow way of using the above function
%%time
converted_school = []
for row in student_data['school']:
    new_value = convert_school(row)
    converted_school.append(new_value)
converted_school
# Don't do this! It's very slow.
## `.apply`

This will do the same thing as the for loop above, but much faster. It applies a function to every value of a `Series` (or every row/column of a `DataFrame`).
%%time
converted_school = student_data['school'].apply(convert_school)
converted_school
Look how much faster that was!
## `.map()`

You can also use the `.map()` function to map certain values to other values.
For example, imagine you had a column named `'colors'` that contained the values `"red"` and `"blue"`, and you wanted to convert these to the numbers `1` and `2`.
mappings = {
'red': 1,
'blue': 2
}
data['colors_mapped'] = data['colors'].map(mappings)
The above will create a new column called `colors_mapped` that has the values `1` and `2`.
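One caveat worth knowing: any value that doesn't appear in the mapping dictionary gets mapped to NaN rather than raising an error. A small runnable sketch (the colors data here is made up for illustration):

```python
import pandas as pd

data = pd.DataFrame(data={'colors': ['red', 'blue', 'green']})
mappings = {'red': 1, 'blue': 2}

# 'green' has no entry in the dictionary, so it becomes NaN
data['colors_mapped'] = data['colors'].map(mappings)
print(data)
```

This is handy for spotting unexpected categories, but it also means silent missing values if your dictionary is incomplete.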
# sklearn's LabelEncoder does the same conversion automatically:
# it assigns an integer to each unique value in the column
enc_school = preprocessing.LabelEncoder()
transformed_school = enc_school.fit_transform(student_data['school'])
transformed_school
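The fitted encoder stores the label assigned to each integer (sorted alphabetically) in its `classes_` attribute, and `inverse_transform` recovers the original strings. A minimal sketch using made-up school values:

```python
from sklearn import preprocessing

# Fit on a small, made-up list of school codes
enc = preprocessing.LabelEncoder()
codes = enc.fit_transform(['GP', 'MS', 'GP'])

print(enc.classes_)                  # which label each integer code stands for
print(enc.inverse_transform(codes))  # back to the original strings
```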
See example at https://stackoverflow.com/a/43589167/2159992
# First, label-encode the Mjob column into integers
enc_mjob = preprocessing.LabelEncoder()
encoded_mjob = enc_mjob.fit_transform(student_data['Mjob'])
encoded_mjob
# Then one-hot encode: each integer becomes its own 0/1 column
# (note: in scikit-learn 1.2+ this argument is named sparse_output instead of sparse)
onehot_mjob = preprocessing.OneHotEncoder(sparse=False)
transformed_mjob = onehot_mjob.fit_transform(encoded_mjob.reshape(-1, 1))
transformed_mjob
Once we've fitted the label encoder and one-hot encoder, we can use them to transform more values.
onehot_mjob.transform(enc_mjob.transform(['other', 'health']).reshape(-1,1))
For instance, what if we want to create a new column with a 1 if both parents have the highest level of education measured?
medu_index = student_data.columns.get_loc('Medu')
fedu_index = student_data.columns.get_loc('Fedu')

def both_parents_edu(row):
    if row[medu_index] >= 4 and row[fedu_index] >= 4:
        return 1
    else:
        return 0
# axis 1 means that we will apply the function to each row
student_data['parents_high_edu'] = student_data.apply(both_parents_edu, axis=1)
student_data.head(10)
`pandas` has a lot of built-in methods for working with text-based data, and `sklearn` similarly has a number of modules for this.
This section gives a brief outline of the things you can try.
If you want to see a fuller list, with examples, of how `pandas` deals with text data, you can look at the documentation here.
#### First, I'm going to make some fake data that we can work with for the rest of this section
data = pd.DataFrame(data={'text': ['apple', '%badly,formatted,data%', 'pear']})
data
Okay, we want to remove the `','` and `'%'` symbols from the data. How do we do so?
data['text_removed'] = data['text'].str.replace(',', '')
data
Nice. Now try to replace the `'%'` symbols.
#### Your code here
Now, we want to check whether the text contains certain values, and keep only the rows that contain those values.
### Again, I have to make some fake data
data = pd.DataFrame(data={'text': ['Nueva Maverick', 'San Francisco Maverick', 'Vikings']})
data
Cool, what if we only wanted to get the rows that contain the word `'Maverick'`?
data['text'].str.contains('Maverick')
Now we can use this `Series` of boolean `True` and `False` values to index into our data!
condition = data['text'].str.contains('Maverick')
filtered_data = data[condition]
filtered_data
Some are listed below:

- `str.startswith()` and `str.endswith()` - check whether a string starts or ends with a given argument
- `str.count()` - counts the number of appearances of a certain pattern
- `str.isnumeric()` - checks whether the string is numeric (e.g., `23123` is numeric whereas `213123abc` is not)
- `str.split()` - splits each string on some delimiter and returns the pieces (a `DataFrame` if you pass `expand=True`)

There's plenty more, and you can see the documentation here for more.
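A quick runnable sketch of a few of these methods, on a made-up `Series`:

```python
import pandas as pd

s = pd.Series(['apple pie', 'apple tart', 'banana'])

print(s.str.startswith('apple'))  # True, True, False
print(s.str.count('a'))           # number of 'a' characters in each string
print(s.str.isnumeric())          # False for all three of these
print(s.str.split(' '))           # each string split into a list of words
```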
raw_text = ["""This is a giant series of sentences that you want to convert into a DataFrame containing
the raw counts for each word. There are some abbr. and some punctuations here and there that make things more complicated.
So how in the world do we turn this into something that we can build a machine learning model off of?
"""]
Okay, so we want to turn the above into a DataFrame where every column is a different word, and each entry stores the number of times that word came up.
We're going to use the `CountVectorizer` class in `sklearn`.
A more in-depth tutorial on how to use it, and more, can be found here.
from sklearn.feature_extraction.text import CountVectorizer
# Initializing an empty CountVectorizer object
count_vect = CountVectorizer()
# Now we fit the object to our actual data
counts = count_vect.fit_transform(raw_text)
# The result is a sparse matrix, which saves memory by storing only the non-zero counts
counts
# Let's use the `.todense()` function to turn this sparse matrix into something that can be transformed into a DataFrame
word_counts_df = pd.DataFrame(data=counts.todense())
word_counts_df
Great, but what does each of the columns mean?
We can inspect the `count_vect.vocabulary_` attribute to find out.
count_vect.vocabulary_
Great. Now we know the word behind each of the columns.
Your challenge: Write some code to rename the columns of `word_counts_df` to the corresponding words in `count_vect.vocabulary_`.
# If you're successful it should look like the output below.
| | abbr | and | are | build | can | complicated | containing | convert | counts | dataframe | ... | there | things | this | to | turn | want | we | word | world | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |

1 rows × 45 columns
### Your code here
To show you how to deal with null values, I'm going to make some simulated data of students.
import numpy as np
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
num_friends_or_none = list(range(0, 20)) + [None] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
new_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})
new_data.head(n=20)
# Drop the rows where '# Friends' is missing (returns a new Series)
new_data['# Friends'].dropna()
# Drop every row of the DataFrame that has any missing value
new_data.dropna()
# Alternatively, fill the missing values in with the column's average
average_friends = new_data['# Friends'].mean()
new_data['# Friends'].fillna(average_friends)
# Assign the result back to actually store the filled-in column
new_data['# Friends'] = new_data['# Friends'].fillna(average_friends)
If missing values are marked with a placeholder like `"Unknown"` instead of NaN, try the `.replace` function.
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
num_friends_or_none = list(range(0, 20)) + ["Unknown"] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
unknown_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})
unknown_data
unknown_data.replace("Unknown", 10)
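Note that `.replace`, like most pandas methods, returns a new object rather than modifying the DataFrame in place, so assign the result back if you want to keep it. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame(data={'# Friends': [3, 'Unknown', 7]})

# Without assignment, df itself is unchanged
cleaned = df.replace('Unknown', 10)
print(cleaned)
print(df)
```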
## Standardizing the data

By standardizing, I mean transforming our data so that it has a mean of 0 and a standard deviation of 1.
Why would we want to do this?
Well, many models will produce strange parameter estimates if different bits of our data are in wildly different ranges.
Many researchers have noted the importance of standardizing variables for multivariate analysis.
Otherwise, variables measured at different scales do not contribute equally to the analysis.
For example, in boundary detection, a variable that ranges between 0 and 100 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization in effect gives the variable with the larger range a weight of 100 in the analysis.
Transforming the data to comparable scales can prevent this problem. Typical data standardization procedures equalize the range and/or data variability.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(new_data)
The above will transform the data so that all the columns have an average of 0 and a standard deviation of 1.
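You can check the result yourself: after scaling, each column's mean should be (numerically) 0 and its standard deviation 1. A small self-contained sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```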
You can read the full documentation for the `StandardScaler` here.
Maybe you have data in a column that's a mashup between multiple values.
For example, imagine you have a column that stores values like `'8th Grade - 13 years old'` and `'12th grade - 17 years old'`, and you want to create two columns, `grade` and `age`, to store the two separate bits of data.
How do you do so?
# I'm going to generate some fake data here. Ignore the details below.
grades = np.random.choice(range(1, 13), 100) # chooses 100 random numbers between 1 - 12
grades_and_ages = ['Grade {grade} - {age} years old'.format(grade=grade, age=grade+6) for grade in grades]
num_friends_or_none = list(range(0, 20)) + ["Unknown"] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
combined_data = pd.DataFrame(data={'Grade and Age': grades_and_ages, '# Friends': num_friends})
combined_data
Awesome, now let's split things up. We'll use the built-in `.str.split()` function with the extra argument `expand=True`.
The `expand=True` will convert the split data into a `DataFrame` instead of keeping a list of values.
(Try taking out `expand=True` and seeing what happens.)
combined_data['Grade and Age'].str.split(' - ', expand=True)
Your challenge: Write some code that splits the `'Grade and Age'` column into separate `grade` and `age` columns in the `combined_data` DataFrame.

### Your code here