Let's take a look at some data, ask some questions, and use linear regression to answer them.
# imports
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
# this allows plots to appear directly in the notebook
%matplotlib inline
# read data into a DataFrame and verify contents
data = pd.read_csv('./goog.csv')
data.head(3)
Now that we have our dataset, let's split the dates and prices into their own arrays.
# create an array for the dates
# extract the day component from each value in the Date column using a list comprehension
dates = [int(i.split('-')[0]) for i in np.array(data)[:,0]]
# create an array for the open prices
# select the data in the Open column
prices = np.array(data)[:,1]
# create an array for the high prices
# select the data from the High column
high = np.array(data)[:,2]
# reshape the prices into a column vector
prices = np.array([prices]).T
# reshape the dates into a column vector
dates = np.array([dates]).T
#high = np.array([high]).T
#prices = np.hstack((price, high))
#print(dates)
#print(prices)
#print(high)
With our price and date data split, we can now create functions to simplify the process.
The first function will be for predicting the price of a stock on day 'x'.
# define a function for predicting the price
# given the dates and prices in a dataframe
# and a day value represented as x
def predict_price(dates, prices, x):
    # initialize the linear regression model
    linear_mod = linear_model.LinearRegression()
    # fit the data to the model
    linear_mod.fit(dates, prices)
    # predict expects a 2D array, so wrap the single day value before predicting
    predicted_price = linear_mod.predict(np.array([[x]]))
    # return the predicted price, linear coefficient, and the intercept
    return predicted_price, linear_mod.coef_, linear_mod.intercept_
Next, let's create a second function for plotting our data points.
This function will display a scatter of the data along with the model's line of best fit.
# define a function for displaying a plot given
# the dates and prices data as X and Y values
def show_plot(dates, prices):
    # initialize the linear regression model
    linear_mod = linear_model.LinearRegression()
    # fit the submitted data to the model
    linear_mod.fit(dates, prices)
    # mark the scatter points using the dates and prices as X and Y values
    plt.scatter(dates, prices, color='lime')
    # plot the line of best fit using our model prediction
    plt.plot(dates, linear_mod.predict(dates), color='blue', linewidth=3)
    # display the plot
    plt.show()
    return
With our functions in place, let's test them!
# display the result of the predict_prices function
# pass the function the dates, prices, and an x value
predict_price(dates, prices, 39)
Now let's view our plot.
# display the result of the show_plot function
show_plot(dates, prices)
For this practical example we will be using a prepared dataset provided by sklearn's 'datasets' module.
The dataset we will be using is called 'iris'. The goal for this example is to predict the target value by using the feature values.
# import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model, datasets
This enables us to load a dataset by calling a single function that returns a 'bunch' of data - quite literally, a Bunch object.
A 'bunch' is similar to a dictionary, as it provides named attributes for our dataset, mainly:
‘data’, the data to learn
‘target’, the classification labels
‘target_names’, the meaning of the labels
‘feature_names’, the meaning of the features
‘DESCR’, the full description of the dataset
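For example, here is a quick sketch (not part of the original notebook) that peeks at a few of these attributes directly:
# a quick look at a few Bunch attributes (sketch)
iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 feature measurements each
print(iris.target_names)    # the meaning of the labels: setosa, versicolor, virginica
print(iris.feature_names)   # the meaning of the features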
In order to build our new dataframe, we will need to extract the data and the target!
The code below will load the iris dataset's bunch of data so we can then store it in a dataframe.
iris = datasets.load_iris()
To learn more about this function and the other bundled datasets, see the sklearn 'datasets' documentation.
With the above in mind, let's get the iris dataset bunch so we can turn it into a dataframe using pandas' 'DataFrame()' function.
# import some data to play with using sklearn's datasets.load_iris() function
iris = datasets.load_iris()
# display the bunch's feature names
print(iris.feature_names)
# use pandas to combine the data with the target
# define the column/feature names for our new dataframe
df = pd.DataFrame(np.c_[iris.data, iris.target], columns = ["Sepal Length", "Sepal Width",
"Petal Length", "Petal Width",
"Class"])
# verify successful creation of the dataframe
#df.head()
It is always good practice to verify your data before continuing to the next step!
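A couple of quick checks you might run at this point (a sketch; it assumes the df built above):
# quick sanity checks on the new dataframe (sketch)
print(df.shape)              # expect (150, 5): 150 samples, 4 features plus the Class column
print(df['Class'].unique())  # expect the three iris classes encoded as 0.0, 1.0, 2.0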
After we have our sub-frame of selected features and their target values, we can randomize the row order with a shuffle to increase the variability between results. Shuffling before splitting into training and testing pools helps us gauge the effectiveness of the logistic regression model, since each run trains and tests on a different slice of the data.
# select the first two features from the bunch data and combine them with the target values
data = np.c_[iris.data[:, :2], iris.target]
# shuffle the rows in place for increased variability when splitting into train/test
np.random.shuffle(data)
#print(data)
Now that we have our data together and randomized, let's assign the 'X' and 'y' data to be plotted.
We can do this by selecting all rows (:) and then specifying the column slice we want (:n for the first n columns, or n: for everything from column n onward).
Look back at the np.c_ combine step above if you're feeling lost about selecting data by row and column.
# for the X values, we want all rows and the first two columns ( :2 )
#X = ?
# for the y values, we want all rows and the last column ( 2: )
#y = ?
#print(X)
#print(y)
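If you want to check your work, one possible fill-in (a sketch, assuming the data array built with np.c_ above) is:
# possible solution sketch: features in the first two columns, target in the last
X = data[:, :2]
y = data[:, 2:]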
Our data is now ready to be split into training and test sets.
However, before we split it, let's determine the point at which it will be split.
We do this by multiplying the number of rows in the data by 0.7 to find the row index that sits 70 percent of the way through the data.
For the training data we will select all of the rows up to that calculated index.
# determine the fraction of the data used for training
# 0.7 gives a 70/30 train/test split
test_train_split = 0.7
# get the training data for X by multiplying its number of rows by
# the split factor and selecting every row BEFORE that index
X_training = X[:int(X.shape[0]*test_train_split),:]
# get the training data for y by multiplying its number of rows by
# the split factor and selecting every row BEFORE that index
y_training = y[:int(y.shape[0]*test_train_split)]
#print(int(X.shape[0]*test_train_split))
#print(int(y.shape[0]*test_train_split))
Our training data is prepared; now all we need to do is change the position of the ':' to select the rows after the calculated index for testing.
This technique makes it easy to reassign the split on the fly when experimenting with which train/test ratio works best for your data.
That being said, a split of 70/30 is quite common and should suffice in the majority of cases!
# get the testing data for X by multiplying its number of rows by
# the split factor and selecting every row AFTER that index
#X_testing = ?
# get the testing data for y by multiplying its number of rows by
# the split factor and selecting every row AFTER that index
#y_testing = ?
#print(X_testing)
#print(y_testing)
#print(X_testing)
#print(X_testing.ravel())
#print(X_testing.shape)
#print(y_testing)
#print(y_testing.ravel())
#print(y_testing.shape)
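For reference, one way to complete the testing split (a sketch that reuses the test_train_split factor defined above):
# possible solution sketch: take every row AFTER the calculated split index
X_testing = X[int(X.shape[0]*test_train_split):, :]
y_testing = y[int(y.shape[0]*test_train_split):]
sklearn also offers model_selection.train_test_split, which handles both the shuffling and the slicing for you; the manual approach above just makes the mechanics explicit.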
# Initialize the model using the LogisticRegression function
logreg = linear_model.LogisticRegression()
Then we need to fit the training data to the model.
# use the initialized model to fit the training data
logreg.fit(X_training, y_training.ravel())
# run prediction on the test data and store it as 'Z'
Z = logreg.predict(X_testing)
# compare the data
#print(Z)
#print(y_testing.ravel())
# compute the fraction of predictions that match the true labels
def classification_rate(y, Z):
    num_right = 0
    # count how many predictions match the corresponding true value
    for i in range(len(Z)):
        if y[i] == Z[i]:
            num_right = num_right + 1
    # return the proportion of correct predictions
    return num_right/Z.shape[0]
classification_rate(y_testing.ravel(), Z)
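As a side note, sklearn ships an equivalent helper; here is a quick sketch using metrics.accuracy_score, which computes the same fraction of correct predictions:
# equivalent check with sklearn's built-in accuracy metric (sketch)
from sklearn import metrics
print(metrics.accuracy_score(y_testing.ravel(), Z))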
As a further check, we can use cross-validation, which repeatedly trains the model on one portion of the data and predicts on the held-out portion.
from sklearn import metrics, model_selection
from sklearn.linear_model import LogisticRegression
Z_cross_validation = model_selection.cross_val_predict(LogisticRegression(), X, y.ravel(), cv=10)
#print(model_selection.cross_val_score(LogisticRegression(), X, y.ravel()))
With our new cross-validation predictions, let's see what the classification rate changed to.
# execute the classification_rate function on the new predictions (Z_cross_validation)
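One way to do that (a sketch reusing the names defined above):
# reuse classification_rate on the cross-validated predictions
classification_rate(y.ravel(), Z_cross_validation)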
import numpy as np
# from scipy.interpolate import *
import matplotlib.pyplot as plt
%matplotlib inline
Hopefully plotting a few polynomial fits on a small example will help visualize what's going on a bit better.
# create a couple of small sample arrays to fit
X = np.array([0,1,2,3,4,5])
y = np.array([0,0.8,0.9,0.1,-0.8,-0.5])
# print to observe
print(X)
print(y)
Let's begin fitting our data. The polyfit method minimizes the sum of squared errors to compute the line of best fit.
In this first piece of code, we're going to stick with a straight line.
# the last parameter is 1 for now, as we'll do a linear fit to begin with
# We will use the 'polyfit()' function to determine the slope
# and intercept from our data
p1 = np.polyfit(X, y, 1)
# This prints the slope and intercept
print(p1)
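To see what polyfit is doing under the hood for the degree-1 case, here is a minimal sketch (not part of the original notebook) of the closed-form least-squares formulas for the slope and intercept:
# compute the least-squares slope and intercept by hand and compare with p1
x_mean, y_mean = X.mean(), y.mean()
slope = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
intercept = y_mean - slope * x_mean
print(slope, intercept)  # should match p1, which holds [slope, intercept]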
Now that we have our slope and intercept, let's plot the fit as a first-degree polynomial.
# polyval evaluates the polynomial at each X value; plot the resulting line of best fit
plt.plot(X, np.polyval(p1, X), color='blue')
# fit second- and third-degree polynomials with polyfit()
p2 = np.polyfit(X, y, 2)
p3 = np.polyfit(X, y, 3)
# evaluate each fit with polyval and plot the resulting curves
plt.plot(X, np.polyval(p2, X), color='lime')
plt.plot(X, np.polyval(p3, X), color='red')
Observe the coefficient values for each fit.
p1 # y = Ax + B
p2 # y = Ax^2 + Bx + C
p3 # y = Ax^3 + Bx^2 + Cx + D
# plot the data points
# ('yo' just means yellow circle markers)
plt.plot(X,y,'yo')
xp = X
#xp = np.linspace(-2,6,100)
# plot the data
plt.plot(xp, np.polyval(p1,xp), color='blue')
plt.plot(xp, np.polyval(p2,xp), color='red')
plt.plot(xp, np.polyval(p3,xp), color='lime')
Display the polynomial values for each defined point on the plot.
np.polyval(p3, X)
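If you uncomment the np.linspace line above, you can also evaluate the fits over a denser, extended range to see how each curve behaves beyond the original points; a quick sketch:
# evaluate the cubic fit over an extended range to see its behaviour outside the data
xp = np.linspace(-2, 6, 100)
plt.plot(X, y, 'yo')
plt.plot(xp, np.polyval(p3, xp), color='lime')
plt.show()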