#!/usr/bin/env python
# coding: utf-8

# # Structural Equation Modelling (SEM) with Jupyter and Anaconda

# This notebook is intended as an example of how to "do SEM" using different tools accessible in a Jupyter IPython Notebook installed with Anaconda. The notebook uses a Python 3 kernel and interfaces with several R packages. Alternatives would be to run an R kernel in the notebook directly, or to use an RStudio markdown notebook and the reticulate package to access Python functionality from R.
# Stefan Rank (strank at strank.info)
# To Do for future revisions:
#
# * Integrate instructions on using different hosted / collaborative notebook options for this tutorial:
#
#   * CoCalc: https://cocalc.com/
#   * Google Colaboratory in Google Drive: https://colab.research.google.com/
#     (Python works, but conda & R aren't available, so most of the meat of this notebook won't work. If you block third-party cookies, you might have to allow cookies from \[*.\]googleusercontent.com to get output working.)
#   * Microsoft Azure Notebooks: https://notebooks.azure.com/
#   * Binder: https://mybinder.org/
#   * A JupyterHub server: https://jupyterhub.readthedocs.io/en/stable/ (if I ever get to installing one)
#
# * Turn it into a RISE presentation https://rise.readthedocs.io/en/docs_hot_fixes/ (ideally also using the "Split Cells Notebook" extension for better visual organization)
#
# * Integrate other SEM tools available in R:
#
#   * lavaan, OpenMx, semPlot (DONE)
#   * sem (less well documented/supported)
#   * metaSEM and metafor for meta-analysis (goes beyond normal SEM)
#   * simsem's semTools
#   * bigsem, ctsem https://github.com/cdriveraus/ctsem, gsem, lavaan.survey, regsem, semdiag
#
# * Compare with other tools:
#
#   * Onyx GUI for SEM http://onyx.brandmaier.de/

# For local use, I recommend JupyterLab as the interface, usually accessible at http://localhost:8888/lab

# ## Installing Required (and Optional) Components

# To run the code in this notebook, a local (or hosted) installation of Anaconda is required. The following installs R components available through Anaconda. We use the --yes option to avoid confirmation prompts (which cannot be answered from within a notebook), but that also means you cannot check what will be installed before it proceeds. To avoid this, you could run the installations from an Anaconda prompt or using the Anaconda Navigator.

# RStudio is not strictly required, but it is a useful alternative way of using R. Also note that this assumes an Anaconda installation/environment that you have permission to change. On Windows, that means choosing the recommended "for this user only" option when installing Anaconda. Otherwise, these installations require Administrator / root rights.

# In[2]:

get_ipython().system('conda install --yes r-essentials')

# In[3]:

# !conda install --yes rstudio

# In[4]:

get_ipython().system('conda install --yes r-lavaan')

# ### Installing and Enabling the Python-R bridge

# Now we need `rpy2`, which provides the bridge between this Python notebook and R. (We also install tzlocal due to a current dependency bug in rpy2.)

# In[3]:

get_ipython().system('conda install --yes tzlocal rpy2')

# Let's **enable the rpy2 extension**, so that we can then execute R code with the %%R magic command at the top of a cell.

# In[1]:

get_ipython().run_line_magic('load_ext', 'rpy2.ipython')

# If you are on Windows and the text output from %%R cells shows up in the console window where Jupyter is running instead of in the notebook, this is a bug in rpy2 on Windows. There is a workaround that captures stdout; run the following cells, see https://github.com/vitorcurtis/RWinOut

# In[21]:

get_ipython().run_cell_magic('R', '', 'install.packages(c("R.utils"))\n')

# In[23]:

get_ipython().system('curl -O "https://raw.githubusercontent.com/vitorcurtis/RWinOut/master/RWinOut.py"')

# In[2]:

get_ipython().run_line_magic('load_ext', 'RWinOut')
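# As a quick sanity check that the bridge works, we can evaluate a one-line R expression with the %R line magic (a minimal sketch; any short R expression would do):

# In[ ]:

get_ipython().run_line_magic('R', 'R.version.string')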
# ### Installing Interface Conveniences

# And finally we want to see plots directly in the notebook. The simplest way is to request inline plots.

# In[3]:

get_ipython().run_line_magic('matplotlib', 'inline')

# But assuming you are running this in a JupyterLab interface, you might want the `ipympl` library to get interactive widget plots.

# In[41]:

# We install ipympl with pip, as it is not yet readily available with conda:
get_ipython().system('pip install ipympl')

# In[41]:

# nodejs is needed for the interactive features if using JupyterLab;
# the corresponding package for normal notebooks, widgetsnbextension, should already be installed.
get_ipython().system('conda install --yes nodejs')

# In[41]:

# install the JupyterLab extensions:
get_ipython().system('jupyter labextension install @jupyter-widgets/jupyterlab-manager')
get_ipython().system('jupyter labextension install jupyter-matplotlib')

# Now enable widget-based plots:

# In[1]:

get_ipython().run_line_magic('matplotlib', 'widget')

# Finally, since this is a long document, the following extension adds a table-of-contents sidebar to the JupyterLab interface:

# In[ ]:

get_ipython().system('jupyter labextension install @jupyterlab/toc')

# ### Installing More R Packages

# Now for some R packages that are not readily available in an Anaconda default install. They might be available through the *conda-forge* "channel"; however, at the time of writing, I cannot recommend this, as the performance of `conda install` is abysmal when using R packages from that repository.

# When using a server-hosted notebook, some or all of these packages might already be installed.

# In[16]:

get_ipython().run_cell_magic('R', '', 'install.packages(c("semPlot", "OpenMx", "semTools", "sem", "gpairs", "GGally"))\n')

# A note to avoid possible confusion: `lavaan` provides a function `cfa` as a convenience for confirmatory factor analysis. There is also an R package called `cfa`; however, that one is not related to SEM.

# ## Importing and Configuring Python & R Packages

# First, import all the basic packages in Python, such as `pandas`, `numpy`, and `matplotlib`. Also import `seaborn` for simple high-level plots with decent looks; it complements the default plotting provided by `matplotlib`. See here for a useful brief overview of looking at data using some of seaborn's plot types: https://elitedatascience.com/python-seaborn-tutorial

# In[4]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Next, some of the libraries that are statistics-oriented.

# In[5]:

import statsmodels.api as sm
import statsmodels.formula.api as smf  # the R-like interface for statsmodels
import statsmodels.graphics as smg
import sklearn.linear_model  # import the submodule explicitly; "import sklearn" alone does not expose it

# Now load the R packages we will be using.

# In[6]:

get_ipython().run_cell_magic('R', '', 'library(lavaan)\nlibrary(semPlot)\nlibrary(OpenMx)\nlibrary(semTools)\n')

# In R, there are also several packages providing convenient high-level plots, such as generalized pairs plots.

# In[7]:

get_ipython().run_cell_magic('R', '', 'library(GGally)\nlibrary(gpairs)\n')

# ### Configure Plotting

# To get nice visual output, we will also set some defaults for the plotting libraries.

# In[8]:

defaultfigwidth, defaultfigheight = 10, 9
# set a slightly larger default size for plots. default dpi is 100
plt.rcParams['figure.figsize'] = [defaultfigwidth, defaultfigheight]
# enable seaborn's defaults for nicer plots overall:
sns.set(color_codes=True)

# There is currently no built-in way to set default dimensions for plots in R. The %%R magic command from the `rpy2` library accepts width, height, and units parameters like this: `%%R -w 10 -h 9 -u in -r 100`, but it would be nice to set defaults.
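# For a single cell, these dimensions can be passed explicitly; a minimal illustration (any R plotting call would do):

# In[ ]:

get_ipython().run_cell_magic('R', '-w 10 -h 9 -u in -r 100', 'plot(1:10)\n')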
# Since this is Python, there is a way around that using monkey-patching. Note that this is usually a Bad Idea(TM) and should be avoided if possible. It is also purely cosmetic for the purposes of this notebook, so it can be safely ignored. :)

# In[9]:

# these are the defaults we want to set:
default_units = 'in'  # inch, to make it more easily comparable to matplotlib
default_res = 100  # dpi, same as default in matplotlib
default_width = 10
default_height = 9
# try monkey-patching a function in rpy2, so we effectively get these
# default settings for the width, height, and units arguments of the %R magic command
import rpy2
old_setup_graphics = rpy2.ipython.rmagic.RMagics.setup_graphics

def new_setup_graphics(self, args):
    if getattr(args, 'units') is not None:
        if args.units != default_units:  # a different units argument was passed, do not apply defaults
            return old_setup_graphics(self, args)
    args.units = default_units
    if getattr(args, 'res') is None:
        args.res = default_res
    if getattr(args, 'width') is None:
        args.width = default_width
    if getattr(args, 'height') is None:
        args.height = default_height
    return old_setup_graphics(self, args)

rpy2.ipython.rmagic.RMagics.setup_graphics = new_setup_graphics

# ## Loading Data

# We are borrowing example data from this excellent course offered at Harvard, S090A1: https://canvas.harvard.edu/courses/8737/pages/data

# The actual data is from the Zambian Early Childhood Development Project. The full sample has more than 1600 Zambian six-year-olds, from a study led by Günther Fink and Stephanie Zuilkowski.

# The data is in the proprietary Stata format, so we need to import it accordingly. We will use `pandas.read_stata`, but this could also be accomplished in R with the `foreign` package. First, we write a helper function to download the data file if it is not in the current directory. Defining a function will allow us to re-use it later for other datasets.

# In[10]:

import os.path
import urllib.request

def downloadIfMissing(filenameData, remoteLocation):
    '''Check if the file exists. If not, try downloading from remoteLocation.'''
    if not os.path.isfile(filenameData):
        with urllib.request.urlopen(remoteLocation) as response:
            with open(filenameData, 'xb') as destinationFile:
                destinationFile.write(response.read())

# In[11]:

# make sure we have the small data file available in the current directory, if not, try to download it:
filenameSmallZambiaData = "S090_InClass_Zambia.dta"
downloadIfMissing(filenameSmallZambiaData, "https://canvas.harvard.edu/courses/8737/files/1839865/download")

# In[12]:

# read the data into a pandas dataframe
smallZambiaDF = pd.read_stata(filenameSmallZambiaData)
len(smallZambiaDF)  # should return 1613

# In[13]:

# make sure we have the full measurement data file available in the current directory, if not, try to download it:
filenameMeasureZambiaData = "S090_InClass_Zambia_Measurement.dta"
downloadIfMissing(filenameMeasureZambiaData, "https://canvas.harvard.edu/courses/8737/files/1994882/download")

# In[14]:

# read the data into a pandas dataframe
measureZambiaDF = pd.read_stata(filenameMeasureZambiaData)
len(measureZambiaDF)  # should return 1623

# ## Data Overview

# We now have the data in dataframes; let's get an overview of what kind of data we are dealing with.

# In[15]:

smallZambiaDF.head()  # equivalent to [:5], i.e. first five entries

# In[10]:

smallZambiaDF.info()

# In[11]:

smallZambiaDF.describe(include='all', percentiles=[])  # describe categorical and numerical columns, don't bother with percentiles

# In[12]:

smallZambiaDF.male.describe()

# In[13]:

smallZambiaDF.male.value_counts()

# In[14]:

smallZambiaDF.wealth.value_counts()

# In[16]:

# visually check relations between numeric variables
grid = sns.pairplot(smallZambiaDF, hue="ece", height=defaultfigheight/6, kind='scatter')
# we have to use an explicit height per facet-figure here, since a grid of figures doesn't follow the matplotlib default size

# In[17]:

grid = sns.pairplot(smallZambiaDF, hue="ece", height=defaultfigheight/6, kind='reg')  # linear regressions on top of scatter

# In[18]:

pd.plotting.scatter_matrix(smallZambiaDF)  # pd.scatter_matrix was deprecated in favor of pd.plotting.scatter_matrix
plt.show()

# In[19]:

plt.figure()
sns.swarmplot(x="male", y="socemo", data=smallZambiaDF)
plt.show()

# In[17]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'pairs(smallZambiaDF)\n')

# In[18]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'ggpairs(smallZambiaDF) # from library GGally\n')

# In[33]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'gpairs(smallZambiaDF) # from library gpairs\n')

# In[20]:

# Calculate correlations
corr = smallZambiaDF.corr()
# Heatmap
plt.figure()
sns.heatmap(corr)
plt.show()

# To Do: implement an equivalent to `ggpairs` in python, similar to `scatter_matrix_all` found here: http://photo.etangkk.com/Python/blog-04.asp but using seaborn's PairGrid functionality. A first sketch follows below.
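# A first sketch toward that, using PairGrid with smallZambiaDF (the choice of plot type per panel is just one reasonable option, not a full ggpairs equivalent):

# In[ ]:

# PairGrid uses the numeric columns by default; color by ece as above
grid = sns.PairGrid(smallZambiaDF, hue="ece", height=defaultfigheight/6)
grid.map_upper(plt.scatter)   # scatter plots in the upper triangle
grid.map_lower(sns.kdeplot)   # bivariate density contours in the lower triangle
grid.map_diag(sns.kdeplot)    # univariate densities on the diagonal
grid.add_legend()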
# What are our observed variables like? Continuous (e.g. age or height), ordinal (e.g. Likert items), categorical (e.g. gender), or even dichotomous/binary (e.g. yes/no questions)? Is validity (the measure accurately captures the theoretical construct) and reliability (consistent results across time/raters/items) already established, or just assumed?

# ## Data Cleanup and Checking Assumptions

# SEM's maximum likelihood estimator assumes normality of all endogenous variables, which implies that they are continuous. These assumptions can sometimes be relaxed.
#
# TODO: how to check? (One rough option is sketched below.)
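# A rough univariate check, as a sketch (not a formal multivariate test): QQ plots plus D'Agostino-Pearson tests for the endogenous variables used in the models below. For multivariate normality, semTools also provides Mardia's statistics.

# In[ ]:

from scipy import stats  # scipy ships with Anaconda

for var in ['socemo', 'reasoning']:  # endogenous variables in the models below
    data = smallZambiaDF[var].dropna()
    fig = sm.qqplot(data, line='s')  # QQ plot against a fitted normal distribution
    fig.axes[0].set_title(var)
    k2, p = stats.normaltest(data)  # D'Agostino-Pearson omnibus test
    print('{}: K2 = {:.2f}, p = {:.4f}'.format(var, k2, p))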
# ### Simple Regression

# First, we do a simple regression. We can use `scikit-learn` (more oriented towards machine learning, less support for detailed statistics output), `statsmodels` (more statistics-oriented, with an R-like interface), or R.

# In[74]:

result = smf.ols(formula='socemo ~ ece', data=smallZambiaDF).fit()

# In[25]:

print(result.params)

# In[26]:

print(result.summary())

# In[75]:

fig = smg.regressionplots.abline_plot(model_results=result)
fig.axes[0].scatter(smallZambiaDF['ece'], smallZambiaDF['socemo'])

# Since this regression is based on a categorical variable (ece), it is effectively a one-way ANOVA, so we can also print it as that.

# In[27]:

table = sm.stats.anova_lm(result, typ=2)  # Type 2 ANOVA DataFrame
print(table)

# In[28]:

pd.get_dummies(smallZambiaDF, columns=['ece',]).head()

# In[30]:

smallZambiaDF['ece'].str.get_dummies()['ECE'].head()

# In[31]:

smallZambiaDF['socemo'].shape, smallZambiaDF['ece'].str.get_dummies()[['ECE']].shape

# In[32]:

linreg = sklearn.linear_model.LinearRegression()
linreg.fit(smallZambiaDF['ece'].str.get_dummies()[['ECE']], smallZambiaDF['socemo'])

# In[33]:

print(linreg.intercept_, linreg.coef_)

# In[61]:

sns.lmplot(x='ece_ECE', y='socemo', data=pd.get_dummies(smallZambiaDF), height=defaultfigheight)

# In[62]:

f = plt.figure()
residaxes = sns.residplot(x='ece_ECE', y='socemo', data=pd.get_dummies(smallZambiaDF))

# Calculating and plotting the simple linear model with R:

# In[30]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'model = lm(socemo ~ ece, data = smallZambiaDF)\nsummary(model)\n')

# In[39]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'plot(socemo ~ ece, data = smallZambiaDF)\nabline(model)\n')

# In[40]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', 'library(lattice) # xyplot comes from the lattice package\nxyplot(socemo ~ ece, data = smallZambiaDF)\n')

# ## Setting up the Model for SEM

# For comparison, let's use SEM to run a simple linear regression like the one above. We start with a continuous predictor (reasoning), since the categorical ece first needs dummy coding before we can use it in lavaan.

# In[121]:

get_ipython().run_cell_magic('R', '-i smallZambiaDF', "semfit <- sem('socemo ~ reasoning', data=smallZambiaDF, meanstructure=TRUE)\nsummary(semfit)\n")

# In[22]:

dummiedSmallZambiaDF = pd.get_dummies(smallZambiaDF, dtype=np.int8)
# We use int8 instead of uint8 for the dummy variables, because R has no unsigned integers,
# so we would get warnings when passing these new variables into R.
# Further: the new dummy columns use the actual value as part of the column name,
# and therefore might have spaces in the column name.
# this is not great for specifying models, so let us replace the spaces with underscores:
dummiedSmallZambiaDF.columns = dummiedSmallZambiaDF.columns.str.replace(' ', '_')

# In[122]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', "semfit <- sem('socemo ~ ece_ECE', data=dummiedSmallZambiaDF, meanstructure=TRUE)\nsummary(semfit, standardized=TRUE)\n")

# In[125]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', "semfit <- sem('socemo ~ ece_ECE', data=dummiedSmallZambiaDF, estimator='MLMV', meanstructure=TRUE)\nsummary(semfit, standardized=TRUE)\n")

# In[84]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'semPaths(lm(\'socemo ~ ece_ECE\', data=dummiedSmallZambiaDF), "std")\n')

# In[23]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'bigmodel <- \'\n socemo ~ ece_ECE + wealth\n reasoning ~ ece_ECE + wealth + books_Books_in_home\n socemo ~~ reasoning\n\'\nsemfit <- sem(bigmodel, data=dummiedSmallZambiaDF, estimator="MLMV", meanstructure=TRUE)\nsummary(semfit, standardized=TRUE)\n')

# In[129]:

get_ipython().run_cell_magic('R', '', 'semPaths(semfit, "std")\n')

# In[25]:

get_ipython().run_cell_magic('R', '', 'semPaths(semfit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,\n groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),\n color=c("lightblue", "purple"))\n')

# In[90]:

get_ipython().run_cell_magic('R', '', 'fitMeasures(semfit)\n')

# In[91]:

get_ipython().run_cell_magic('R', '', 'fitMeasures(semfit, c("cfi", "tli", "rmsea", "srmr"))\n')
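# To hunt for local misfit, lavaan can also report modification indices; a minimal sketch, sorted and limited to the ten largest:

# In[ ]:

get_ipython().run_cell_magic('R', '', 'modindices(semfit, sort.=TRUE, maximum.number=10)\n')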
# Now let's try to do the same with OpenMx. Here, we first need to drop variables that we are not actually using in the model, otherwise OpenMx will complain.

# In[107]:

smallZambiaDFForOMX = dummiedSmallZambiaDF[["wealth", "ece_ECE", "books_Books_in_home", "reasoning", "socemo"]]

# In[168]:

get_ipython().run_cell_magic('R', '-i smallZambiaDFForOMX', 'manifests <- names(smallZambiaDFForOMX)\nmodel <- mxModel("zambia simple", type="RAM",\n manifestVars=manifests,\n mxPath(from=c("wealth", "ece_ECE"), to="socemo", arrows=1, free=TRUE),\n mxPath(from=c("wealth", "ece_ECE", "books_Books_in_home"), to="reasoning", arrows=1, free=TRUE),\n mxPath(from="socemo", to="reasoning", arrows=2),\n mxPath(from=manifests, arrows=2, values=.8), # needs starting values or it won\'t converge\n mxPath(from="one", to=manifests),\n mxData(observed=smallZambiaDFForOMX, type="raw"))\nmodel\n')

# In[169]:

get_ipython().run_cell_magic('R', '', 'modelRun <- mxRun(model)\n')

# In[172]:

get_ipython().run_cell_magic('R', '', 'mxCheckIdentification(model)\n')

# In[171]:

get_ipython().run_cell_magic('R', '', 'summary(modelRun)\n')

# In[114]:

get_ipython().run_cell_magic('R', '', 'semPaths(modelRun, "std")\n')

# In[173]:

get_ipython().run_cell_magic('R', '', 'semPaths(modelRun, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,\n groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),\n color=c("lightblue", "purple"))\n')

# There's also another R package for doing SEM, called `sem`, but it seems older and less well supported/documented.

# ### A more complex model

# Research Question: Do ECE participation and book availability fully explain the relations between wealth and children's socio-emotional & reasoning skills in the Zambian context?

# In[196]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'biggermodel <- \'\n ece_ECE ~ wealth\n books_Books_in_home ~ wealth\n socemo ~ ece_ECE + books_Books_in_home\n reasoning ~ ece_ECE + books_Books_in_home\n socemo ~~ reasoning\n\'\nsemfit2 <- sem(biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)\n# WLS would correspond to the adf method, but does not run to completion with this dataset\nsummary(semfit2, standardized=TRUE)\n')

# In[183]:

get_ipython().run_cell_magic('R', '', 'fitMeasures(semfit2, c("cfi", "tli", "rmsea", "srmr"))\n')

# In[179]:

get_ipython().run_cell_magic('R', '', 'semPaths(semfit2, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,\n groups=list(c("wealth", "ece_ECE", "books_Books_in_home"), c("reasoning", "socemo")),\n color=c("lightblue", "purple"))\n')

# ### High-level tools for SEM in R

# simsem's semTools, see https://github.com/simsem/semTools/wiki/Functions

# In[119]:

get_ipython().run_cell_magic('R', '', 'chisqSmallN(semfit)\n')

# bigsem, ctsem, gsem, lavaan.survey, regsem, semdiag
# ## Model Fit

# A caveat regarding fit measures: similar to the arbitrary cutoff for p-values indicating significance, the use of fit indices is controversial.

# Usual indices:
#
# * chi-square: basic, but can be unclear
# * RMSEA (root mean square error of approximation): based on chi-square, but takes degrees of freedom (model parsimony) and sample size into account. Should be close to 0 (< 0.05-0.08)
# * CFI/TLI (Bentler comparative fit index, Tucker-Lewis index): measure relative fit vs. an "independence model" (with covariances 0). Should be close to 1 (> 0.90-0.95)
# * SRMR (standardized root mean square residual): measures the difference between observed and predicted correlations. Should be close to 0 (< 0.06-0.08)

# In[148]:

get_ipython().run_cell_magic('R', '', 'fitMeasures(semfit, c("cfi", "tli", "rmsea", "srmr"))\n')

# ## Latent Variables and Mediation

# A simpler model to show mediation.

# In[200]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'simplermodel <- \'\n # Here we use a and b and c to keep track of the parameter estimations,\n # so that we can then use them to estimate the indirect and total effects.\n # ece is the supposed mediator\n ece_ECE ~ a*wealth\n socemo ~ c*wealth + b*ece_ECE\n # also estimate indirect effect:\n indirect := a*b\n # total effect:\n total := c + (a*b)\n\'\nsemfit3 <- sem(simplermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)\nsummary(semfit3, standardized=TRUE)\n')

# In[201]:

get_ipython().run_cell_magic('R', '', 'fitMeasures(semfit3, c("cfi", "tli", "rmsea", "srmr"))\n')

# In[202]:

get_ipython().run_cell_magic('R', '', 'semPaths(semfit3, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=15,\n groups=list(c("wealth", "ece_ECE"), c("socemo")),\n color=c("lightblue", "purple"))\n')

# Other packages that should be useful for mediation analysis, for example to compare results: `psych` and `MBESS`, see here http://nickmichalak.com/blog_entries/2018/nrg01/nrg01.html

# ### Comparing two different models

# The semTools package is helpful for this.

# In[206]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'twostagesemfit1 <- sem.2stage(model=biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)\nsummary(twostagesemfit1, standardized=TRUE)\n')

# In[207]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'biggerermodel <- \'\n ece_ECE ~ wealth\n books_Books_in_home ~ wealth\n socemo ~ wealth + ece_ECE + books_Books_in_home\n reasoning ~ wealth + ece_ECE + books_Books_in_home\n socemo ~~ reasoning\n\'\ntwostagesemfit2 <- sem.2stage(model=biggerermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE)\nsummary(twostagesemfit2, standardized=TRUE)\n')

# In[209]:

get_ipython().run_cell_magic('R', '', 'anova(twostagesemfit1, twostagesemfit2) # equivalent to a "likelihood-ratio test"\n')

# In[210]:

get_ipython().run_cell_magic('R', '', 'miPowerFit(semfit2)\n')

# ### Multi-group analysis

# Compare a model across groups in the data, for example male vs. female. See http://lavaan.ugent.be/tutorial/groups.html for how to specify path constraints across groups; one constrained variant is sketched after the unconstrained fit below.

# In[212]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'multigroupsemfit <- sem(model=biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE,\n group="male_Male")\nsummary(multigroupsemfit, standardized=TRUE)\n')
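# As a sketch, lavaan's `group.equal` argument can constrain, for example, all regression paths to be equal across the two groups; `anova` then compares the constrained fit against the unconstrained one:

# In[ ]:

get_ipython().run_cell_magic('R', '-i dummiedSmallZambiaDF', 'constrainedsemfit <- sem(model=biggermodel, data=dummiedSmallZambiaDF, estimator="MLR", meanstructure=TRUE,\n group="male_Male", group.equal=c("regressions"))\nanova(multigroupsemfit, constrainedsemfit)\n')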
# ## Measurement Models, Confirmatory Factor Analysis

# SEM can be used not only to relate measured variables, as above, but also to relate indicators (items, or full "scales" that combine items) to theorized "latent" variables that cannot be directly observed.

# We use the extended version of the Zambia dataset here, which includes several more variables that are directly observed items. Initially, we'll use 5 items each for a socioemotional scale and a task-orientation scale. Each item is rated on a scale from 1 (never) to 4 (always).

# In[215]:

measureZambiaDF.head()

# In[219]:

measureZambiaDF.info()

# In[224]:

measureZambiaDF.describe(include="all")[["se1", "se2", "to1", "to2"]]

# The se variables are not yet numerically coded, so we have to fix that, and prepare dummy coding if needed.

# In[226]:

measureZambiaDF["se1"].unique()

# In[26]:

for colnum in range(1, 6):
    measureZambiaDF["se" + str(colnum) + "c"] = measureZambiaDF["se" + str(colnum)].cat.codes

# In[27]:

measureZambiaDF["se1c"].unique()

# Let's check correlations to make sure our assumption, i.e. that indicators within a factor are correlated, is correct.

# In[28]:

# Calculate correlations
corr = measureZambiaDF[["se1c", "se2c", "se3c", "se4c", "se5c", "to1", "to2", "to3", "to4", "to5", "socemo", "taskorient"]].corr()
# Heatmap
plt.figure()
sns.heatmap(corr)
plt.show()

# In[234]:

corr

# In[29]:

get_ipython().run_cell_magic('R', '-i measureZambiaDF', "model <- '\n to_latent =~ to1 + to2 + to3 + to4 + to5 \n se_latent =~ se1c + se2c + se3c + se4c + se5c \n'\nfit <- cfa(model, data=measureZambiaDF)\nsummary(fit, fit.measures=TRUE)\n")

# In[241]:

get_ipython().run_cell_magic('R', '', 'semPaths(fit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=5, sizeLat=10,\n groups="latents",\n color=c("lightblue", "purple"))\n')

# TODO: compare against models (similar to above) that use one latent factor instead of two, or that load some indicators (e.g. se4c) on both.

# ### Combining structural and measurement models for Structural Regression

# Let's test whether ECE partially explains the relation between wealth and children's socioemotional and task-orientation skills.

# In[243]:

# we need dummy coding again here, to use ECE
dummiedMeasureZambiaDF = pd.get_dummies(measureZambiaDF, dtype=np.int8)
# We use int8 instead of uint8 for the dummy variables, because R has no unsigned integers,
# so we would get warnings when passing these new variables into R.
# Further: the new dummy columns use the actual value as part of the column name,
# and therefore might have spaces in the column name.
# this is not great for specifying models, so let us replace the spaces with underscores:
dummiedMeasureZambiaDF.columns = dummiedMeasureZambiaDF.columns.str.replace(' ', '_')

# In[244]:

get_ipython().run_cell_magic('R', '-i dummiedMeasureZambiaDF', 'fullsrmodel <- \'\n to_latent =~ to1 + to2 + to3 + to4 + to5 \n se_latent =~ se1c + se2c + se3c + se4c + se5c \n ece_ECE ~ wealth\n se_latent ~ ece_ECE + wealth\n to_latent ~ ece_ECE + wealth\n\'\nsrsemfit <- sem(model=fullsrmodel, data=dummiedMeasureZambiaDF, estimator="MLR", meanstructure=TRUE)\nsummary(srsemfit, standardized=TRUE)\n')

# In[245]:

get_ipython().run_cell_magic('R', '', 'semPaths(srsemfit, "std", curvePivot=TRUE, intercepts=FALSE, layout="tree2", rotation=2, nCharNodes=9, sizeMan=5, sizeLat=10,\n groups="latents",\n color=c("lightblue", "purple"))\n')

# ## Presenting the Results

# In[ ]:

# ## Notes on Aspects Skipped

# * Power calculations: is the number of subjects/observations large enough for the number of parameters/connections to be estimated? In general, SEM needs large sample sizes, above 100. (A rough RMSEA-based sketch follows below.)
# * ...
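# A rough sketch of an RMSEA-based power calculation with semTools (MacCallum-style). The RMSEA values are conventional cutoffs, and the df and n below are illustrative placeholders, not taken from a fitted model above:

# In[ ]:

get_ipython().run_cell_magic('R', '', '# power to distinguish close fit (RMSEA = .05) from mediocre fit (RMSEA = .08)\n# at illustrative degrees of freedom and sample size:\nfindRMSEApower(rmsea0=.05, rmseaA=.08, df=4, n=1613)\n')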
# In[ ]:

# ## For reference: some models from tutorials with built-in data

# In[85]:

get_ipython().run_cell_magic('R', '', "model <- ' visual =~ x1 + x2 + x3 \n textual =~ x4 + x5 + x6\n speed =~ x7 + x8 + x9 '\nfit <- cfa(model, data=HolzingerSwineford1939)\n")

# In[86]:

get_ipython().run_cell_magic('R', '', 'summary(fit, fit.measures=TRUE)\n')

# In[87]:

get_ipython().run_cell_magic('R', '', "model <- '\n x1 ~ y3\n # measurement model\n # ind60 =~ x1 + x2 + x3\n # dem60 =~ y1 + y2 + y3 + y4\n # dem65 =~ y5 + y6 + y7 + y8\n # regressions\n # dem60 ~ ind60\n # dem65 ~ ind60 + dem60\n'\nfit <- sem(model, data=PoliticalDemocracy)\nsummary(fit, standardized=TRUE)\n")
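# For comparison, the complete model from the lavaan tutorial (http://lavaan.ugent.be/tutorial/sem.html), i.e. the commented-out measurement and regression parts above plus the residual correlations used there:

# In[ ]:

get_ipython().run_cell_magic('R', '', "model <- '\n # measurement model\n ind60 =~ x1 + x2 + x3\n dem60 =~ y1 + y2 + y3 + y4\n dem65 =~ y5 + y6 + y7 + y8\n # regressions\n dem60 ~ ind60\n dem65 ~ ind60 + dem60\n # residual correlations\n y1 ~~ y5\n y2 ~~ y4 + y6\n y3 ~~ y7\n y4 ~~ y8\n y6 ~~ y8\n'\nfit <- sem(model, data=PoliticalDemocracy)\nsummary(fit, standardized=TRUE)\n")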