This notebook introduces the use of pandas and the formula framework in statsmodels in the context of linear modeling.
Some general imports
Outcome:

- S, salaries for IT staff in a corporation

Predictors:

- X, experience in years
- M, management, 2 levels: 0 = non-management, 1 = management
- E, education, 3 levels: 1 = Bachelor's, 2 = Master's, 3 = Ph.D.

A context manager that mimics R's with() function can be found here: https://gist.github.com/2347382

Let's explore the data.
Look at the design matrix created for us. Every results instance has a reference to the model.
Since we initially passed in a DataFrame, we have a transformed DataFrame available.
There is a reference to the original untouched data in
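As a rough sketch of where these objects live, the snippet below fits an additive model on a small synthetic table and inspects the results instance. The data are made up for illustration (they are not the salary table), and the attribute names `model.exog`, `model.data.orig_exog`, and `model.data.frame` are an assumption about what the missing cells referenced.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"] + noise)

results = smf.ols("S ~ C(E) + C(M) + X", data=df).fit()

design_array = results.model.exog            # design matrix as a NumPy array
design_frame = results.model.data.orig_exog  # same matrix as a labelled DataFrame
original = results.model.data.frame          # the DataFrame passed to the formula

print(design_frame.head())
print(original.head())
```

The design matrix has one column for the intercept, two dummy columns for the three education levels, one for management, and one for experience.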
If you use the formula interface, statsmodels remembers this transformation. Say you want to know the predicted salary for someone with 12 years of experience and a Master's degree who is in a management position.
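A minimal sketch of such a prediction follows, again on synthetic data invented for illustration. Because the model was fit from a formula, the new observation can be passed as a DataFrame of raw values and the categorical coding is applied automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"] + noise)

results = smf.ols("S ~ C(E) + C(M) + X", data=df).fit()

# 12 years of experience, Master's degree (E=2), management (M=1).
new_person = pd.DataFrame({"X": [12], "E": [2], "M": [1]})
predicted = results.predict(new_person)
print(predicted)
```

Under the coefficients used to generate the synthetic data, the noise-free salary for this person would be 24000, so the prediction should land close to that value.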
So far we've assumed that the effect of experience is the same for each level of education and professional role. Perhaps this assumption isn't merited. We can formally test this using some interactions.

We can start by seeing if our model assumptions are met. Let's look at a residuals plot.

And some formal tests

Plot the residuals within the groups separately.
The contrasts are created here under the hood by patsy.

Interact education with management
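To illustrate what the formula machinery generates for an interaction, the sketch below fits `S ~ C(E)*C(M) + X` on synthetic data (made up for illustration) and lists the resulting design-matrix column names. In formula syntax, `a*b` expands to `a + b + a:b`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"] + noise)

# C(E)*C(M) expands to both main effects plus the E-by-M interaction.
inter_results = smf.ols("S ~ C(E)*C(M) + X", data=df).fit()

names = inter_results.model.exog_names
print(names)
```

With three education levels and two management levels, the treatment coding adds two interaction columns on top of the five columns of the additive model.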
There looks to be an outlier.
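One possible way to locate such a point is through studentized residuals. The sketch below plants an artificial outlier in synthetic data (both the data and the outlier position are invented for illustration), finds the observation with the largest absolute studentized residual, and drops it. The name `df_clean` is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"] + noise)
df.loc[10, "S"] -= 5000  # plant an artificial outlier in row 10

results = smf.ols("S ~ C(E)*C(M) + X", data=df).fit()

# Internally studentized residuals from the influence diagnostics.
resid = results.get_influence().resid_studentized_internal
outlier_pos = int(np.abs(resid).argmax())

# Drop the flagged row and keep the rest as the cleaned data set.
df_clean = df.drop(index=df.index[outlier_pos])
print(outlier_pos, len(df_clean))
```

Dropping a point on the basis of one diagnostic is a judgment call; it is worth checking why the observation is unusual before removing it.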
1. Fit the same linear model as above (i.e. without interaction terms) but with the cleaned data (=salary_table_clean). Remember to call the .fit() method on your model. Print the model summary. Compare the R-squared between this and the previous model.
2. Fit the same linear model as above but with an interaction term between education and experience. Again, use the cleaned data set. Print the model summary.
3. Run an ANOVA (or F-test) to test whether the interaction (education*experience) is significant (there are two valid ways to do this).
4. Run an ANOVA to test whether there exists an interaction between *education* and *management* on salary. Is there? Note: so that the code will continue to run below, create an interaction model and call it "interM_lm32".
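As a hedged illustration of the two testing routes mentioned in the exercises, the sketch below compares a nested pair of models on synthetic data (made up for illustration, with an education-by-management interaction deliberately built in). It uses `anova_lm` on the two fits and, alternatively, the `compare_f_test` method of the larger fit. The variable names are hypothetical and the snippet does not use the cleaned salary table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"]
           + 2000 * ((df["E"] == 3) & (df["M"] == 1))  # built-in interaction
           + noise)

additive = smf.ols("S ~ C(E) + C(M) + X", data=df).fit()
interaction = smf.ols("S ~ C(E)*C(M) + X", data=df).fit()

# Route 1: ANOVA table comparing the restricted and the full model.
table = anova_lm(additive, interaction)
print(table)

# Route 2: F-test of the full model against the restricted one.
f_value, p_value, df_diff = interaction.compare_f_test(additive)
print(f_value, p_value, df_diff)
```

Both routes test the same null hypothesis, that the interaction coefficients are all zero, so they report the same F statistic.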
Re-plotting the residuals
Our final plot of the fitted values
From our first look at the data, the difference between Master's and Ph.D. salaries in the management group differs from that in the non-management group. This is an interaction between the two qualitative variables management (M) and education (E). We can visualize this by first removing the effect of experience, then plotting the means within each of the 6 groups using interaction.plot.
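The numbers behind such a plot can be sketched as follows: subtract the fitted experience effect from salary, then average the adjusted values within each education-by-management cell. The data are synthetic (made up for illustration, with an interaction built in), and the names `U` and `cell_means` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the salary table (values are made up for illustration).
i = np.arange(36)
df = pd.DataFrame({
    "X": 1 + (7 * i) % 20,                # experience in years
    "E": np.tile([1, 2, 3], 12),          # education level
    "M": np.tile([0, 0, 0, 1, 1, 1], 6),  # management flag
})
noise = np.random.default_rng(0).normal(0, 300, size=len(df))
df["S"] = (8000 + 500 * df["X"] + 3000 * (df["E"] == 2)
           + 3500 * (df["E"] == 3) + 7000 * df["M"]
           + 2000 * ((df["E"] == 3) & (df["M"] == 1))  # built-in interaction
           + noise)

results = smf.ols("S ~ C(E)*C(M) + X", data=df).fit()

# Remove the fitted linear effect of experience from salary.
adjusted = df.assign(U=df["S"] - df["X"] * results.params["X"])

# Mean adjusted salary in each of the 6 education-by-management groups.
cell_means = adjusted.groupby(["E", "M"])["U"].mean().unstack("M")
print(cell_means)

# Gap between Ph.D. and Master's, compared across management levels.
gap_difference = ((cell_means.loc[3, 1] - cell_means.loc[2, 1])
                  - (cell_means.loc[3, 0] - cell_means.loc[2, 0]))
print(gap_difference)
```

In statsmodels, the plotting step itself is available as `interaction_plot` in `statsmodels.graphics.factorplots`, which draws one line per level of the trace factor; non-parallel lines suggest an interaction.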