🤖⚡ scikit-learn tip #10 (video)

Q: Why set a value for "random_state"?

A: It ensures that a "random" process produces the same results every time you run it, which makes your code reproducible (by you and others!)

See example 👇

In [1]:

import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

In [2]:

cols = ['Fare', 'Embarked', 'Sex']
X = df[cols]
y = df['Survived']

In [3]:

from sklearn.model_selection import train_test_split

In [4]:

X

Out[4]:

	Fare	Embarked	Sex
0	7.2500	S	male
1	71.2833	C	female
2	7.9250	S	female
3	53.1000	S	female
4	8.0500	S	male
5	8.4583	Q	male

In [5]:

# any non-negative integer (0 up to 2**32 - 1) can be used for the random_state value
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train

Out[5]:

	Fare	Embarked	Sex
0	7.2500	S	male
3	53.1000	S	female
5	8.4583	Q	male

In [6]:

# using the SAME random_state value results in the SAME random split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train

Out[6]:

	Fare	Embarked	Sex
0	7.2500	S	male
3	53.1000	S	female
5	8.4583	Q	male
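
Rather than comparing the two tables by eye, the same check can be done in code. This is a minimal sketch: it rebuilds the six feature rows shown above by hand (so it runs without the download), and the variable names `first_train`, `second_train`, and `same_split` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the six feature rows shown above by hand, so no download is needed
X = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
})

# Two independent calls that share the same random_state value
first_train, first_test = train_test_split(X, test_size=0.5, random_state=1)
second_train, second_test = train_test_split(X, test_size=0.5, random_state=1)

# DataFrame.equals compares values, index labels, and row order
same_split = first_train.equals(second_train) and first_test.equals(second_test)
print(same_split)
```

If the two splits match row for row, `same_split` is `True`.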

In [7]:

# using a DIFFERENT random_state value results in a DIFFERENT random split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)
X_train

Out[7]:

	Fare	Embarked	Sex
2	7.9250	S	female
5	8.4583	Q	male
0	7.2500	S	male
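
What happens if you leave `random_state` unset? The sketch below, which uses a hypothetical helper named `train_rows` and only the `Fare` column typed in by hand, repeats the split 20 times with a fixed value and 20 times with the default of `None`, then counts how many distinct training sets appear.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Only the Fare column, typed in by hand from the table above
X = pd.DataFrame({'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583]})

def train_rows(seed):
    """Hypothetical helper: return the training-set row labels for one split."""
    X_train, _ = train_test_split(X, test_size=0.5, random_state=seed)
    return tuple(X_train.index)

# With a fixed random_state, every call reproduces one and the same split
seeded = {train_rows(1) for _ in range(20)}

# With random_state=None (the default), each call draws fresh randomness,
# so repeated calls will usually produce several different splits
unseeded = {train_rows(None) for _ in range(20)}

print(len(seeded))    # one distinct split
print(len(unseeded))  # usually more than one; varies from run to run
```

That run-to-run variation is what a fixed `random_state` removes.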

Want more tips? View all tips on GitHub or sign up to receive 2 tips by email every week 💌