Structural equation modeling (SEM) is a very popular statistical technique in the social sciences because it is very flexible and includes factor analysis, path analysis and other methods as special cases. While SEM is usually carried out with specialized programs, the same can be achieved in Mathematica, which has the benefit of allowing control over every aspect of the calculation. Moreover, a second approach to estimating these models is described that is conceptually much simpler yet potentially more powerful. This second approach is then used to describe a solution of the attenuation problem of regression.
The SEM Method
Linear structural equation modeling (SEM) is a technique that has found widespread use in many sciences in recent decades. An early foundational work is Bollen [1]; a more recent overview is provided by Hoyle [2]. The basic idea is to model the linear structure of observed variables of cases (observations, subjects) by linear equations that may involve latent variables. Latent variables are not measured directly but inferred from their linear relation to the observed variables.
Many commercial programs (including LISREL, Amos, Mplus) and free ones (including lavaan, sem, OpenMx) have been developed to carry out the estimation procedure. From my perspective, the R package lavaan [3, 4] by Yves Rosseel is the most reliable and convenient one among the free programs. I use it as the gold standard to judge the results of my own code.
This article first gives a quick overview of the standard SEM theory, then shows how to perform the calculations in Mathematica. In the last section, a second approach is discussed.
The Standard Example
There is a standard example due to Bollen that is also used in the lavaan manual. The dataset consists of observations of 11 manifest variables $x_1$, $x_2$, $x_3$, $y_1$, $y_2$, $y_3$, $y_4$, $y_5$, $y_6$, $y_7$, $y_8$. SEM models are usually depicted graphically. In the lavaan documentation, this model is displayed as in Figure 1.
Figure 1. Bollen’s democracy model (image from lavaan documentation [4]).
The variables $x_1$, $x_2$, $x_3$ are observed variables that measure the construct of industrialization in 1960, which is described by the latent variable $\mathrm{ind60}$. This means that the level of industrialization is assumed to be representable by one number for each country, but this number cannot be measured directly; it has to be inferred from its linear relation to gross national product $x_1$, energy consumption per capita $x_2$ and share of industrial workers $x_3$. Next, $\mathrm{dem60}$ and $\mathrm{dem65}$ are the democracy levels in 1960 and 1965, measured by $y_1$, $y_2$, $y_3$, $y_4$ and $y_5$, $y_6$, $y_7$, $y_8$ (these indicators are freedom of the press, etc.). The data matrix consists of these 11 numbers for each of 75 countries (cases). The data is delivered with the lavaan package for R. The aim of estimating the model is twofold. First, the weights of the linear connections (represented in the picture by arrows) are estimated. These arrows encode linear equations by the rule that all arrows that end in a variable indicate a linear combination that yields the value of this variable plus some error term variable. To bring this mysterious language down to earth, here are the equations represented in Figure 1:
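$$
\begin{aligned}
x_i &= \lambda_{x_i}\,\mathrm{ind60} + \delta_i, & i &= 1, 2, 3,\\
y_j &= \lambda_{y_j}\,\mathrm{dem60} + \varepsilon_j, & j &= 1, \dots, 4,\\
y_j &= \lambda_{y_j}\,\mathrm{dem65} + \varepsilon_j, & j &= 5, \dots, 8,\\
\mathrm{dem60} &= \gamma_1\,\mathrm{ind60} + \zeta_1,\\
\mathrm{dem65} &= \gamma_2\,\mathrm{ind60} + \beta\,\mathrm{dem60} + \zeta_2.
\end{aligned}
$$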
The variable $\mathrm{ind60}$ is called an exogenous latent variable because no arrow ends there. It has no associated error variable. However, its manifest (measured) indicator variables $x_1$, $x_2$, $x_3$ have associated error variables (they are called $\delta$ in [1]). The indicator variables $y_1$, $y_2$, $y_3$, $y_4$ and $y_5$, $y_6$, $y_7$, $y_8$ of the two endogenous latent variables (those latent variables where arrows end) have error variables (called $\varepsilon$ in [1]). The equations that relate latent and manifest variables define the measurement part of the model. The two equations (coming from three arrows) between the latent variables are the structure model, which is usually of most interest. Fitting the model to the data gives estimates for the weights of the arrows, that is, for the loadings of the indicators and for the path coefficients between the latent variables. The second goal of SEM modeling is to check how well the structure of the model fits the data; that is, SEM is also a hypothesis-testing method.
The equations given do not yet identify all variables. Assume we have a solution of $x_1 = \lambda_{x_1}\,\mathrm{ind60} + \delta_1$; then for any number $c \neq 0$, the numbers $\lambda_{x_1}/c$ and $c\,\mathrm{ind60}$ would be solutions, too. To avoid this problem, we either fix the variance of the latent variables to be 1 or we fix some of the weights to be 1. The latter is the default in lavaan and we adopt it here; hence the loading of the first indicator of each latent variable is set to 1, that is, $\lambda_{x_1} = 1$, $\lambda_{y_1} = 1$, $\lambda_{y_5} = 1$.
The Standard Way of Estimating SEM
Ever since SEM’s invention, SEM models have been estimated via the model’s covariance matrix. From the data, we get the empirical covariance matrix $S$. On the other hand, from the model, we can calculate a theoretical covariance matrix $\Sigma$ between the observed variables. ($\Sigma$ depends on the model and thus on the parameters.) For example, one entry in this matrix would be $\operatorname{Cov}(x_1, y_1)$. Using linearity and other properties of the covariance, this boils down to a matrix with entries that are polynomials in the model parameters and the covariances and variances between latent variables and error variables. However, without further assumptions, this gives a lot of covariances (e.g. between pairs of error variables) that are not determined by the model and hence must be estimated. As this usually leads to too much freedom, the general assumption is that most error variables are uncorrelated. Only some covariances between error variables are not assumed to be 0; those are marked in the diagram by two-headed arrows between the observed variables. For every pair of observed variables, we calculate the covariance by using the model equations given above as replacement rules and applying linearity and the independence assumptions. In the end, we get a covariance matrix $\Sigma$ that depends on the path weights, on the variances of the latent variables and on the covariances of error variables that are not assumed to be 0. Details can be found in Bollen [1].
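As a small worked illustration of this calculation, consider just two indicators of the same latent variable. The symbols are assumptions made for this sketch: $f$ is the latent variable, the loading of $x_1$ is fixed to 1, $\lambda$ is the loading of $x_2$, and $\delta_1$, $\delta_2$ are error variables taken to be uncorrelated with $f$ and with each other.

```latex
\begin{aligned}
\operatorname{Cov}(x_1, x_2)
  &= \operatorname{Cov}(f + \delta_1,\ \lambda f + \delta_2) \\
  &= \lambda \operatorname{Var}(f)
     + \operatorname{Cov}(f, \delta_2)
     + \lambda \operatorname{Cov}(\delta_1, f)
     + \operatorname{Cov}(\delta_1, \delta_2) \\
  &= \lambda \operatorname{Var}(f).
\end{aligned}
```

The last three terms of the second line vanish only because of the uncorrelatedness assumptions; allowing a two-headed arrow between $x_1$ and $x_2$ would keep $\operatorname{Cov}(\delta_1, \delta_2)$ as an additional free parameter.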
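To fit the theoretical covariance matrix to the empirical one, we have to choose these parameters to minimize some distance function. The three most common are unweighted least squares (ULS), $F_{\mathrm{ULS}} = \tfrac{1}{2}\operatorname{tr}\left[(S - \Sigma)^2\right]$; generalized least squares (GLS), $F_{\mathrm{GLS}} = \tfrac{1}{2}\operatorname{tr}\left[(I - \Sigma S^{-1})^2\right]$ ($I$ is the identity matrix); and maximum likelihood (ML), $F_{\mathrm{ML}} = \log|\Sigma| + \operatorname{tr}(S\Sigma^{-1}) - \log|S| - p$ (here $p$ is the number of manifest variables).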
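The three distance functions can be written down directly. The following is a minimal sketch using the textbook forms from Bollen [1]; the function names `fULS`, `fGLS` and `fML` are hypothetical and are not taken from the article's own code, which may scale the functions differently.

```mathematica
(* s: empirical covariance matrix, sigma: model-implied covariance matrix.   *)
(* The names fULS, fGLS, fML are illustrative, not the article's identifiers. *)

(* unweighted least squares: half the sum of squared entrywise differences *)
fULS[s_, sigma_] := Tr[(s - sigma).(s - sigma)]/2

(* generalized least squares: differences weighted by the inverse of s *)
fGLS[s_, sigma_] := With[{m = IdentityMatrix[Length[s]] - sigma.Inverse[s]},
  Tr[m.m]/2]

(* maximum likelihood discrepancy; Length[s] is the number p of manifest variables *)
fML[s_, sigma_] := Log[Det[sigma]] + Tr[s.Inverse[sigma]] - Log[Det[s]] - Length[s]

(* all three vanish when the model reproduces the data exactly *)
s0 = {{2, 1/2}, {1/2, 1}};
{fULS[s0, s0], fGLS[s0, s0], fML[s0, s0]}
```

Each function is zero when the two matrices agree and positive otherwise, so any of them can serve as the objective of a numerical minimizer.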
Now we are in a position to define a Mathematica function that performs SEM. First, we define a helper function that collects all variables contained in an expression in such a way that a composite expression that stands for a single quantity counts as one variable.
Here is an example.
The method will be explained with Bollen’s democracy dataset, so first, we need to load this dataset. The file bollen.csv contains headers (the names of the variables, which are saved in a separate list) and a first column numbering the cases, which is dropped.
The data has 75 rows.
Here is the first row of 11 numbers.
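The loading step can be sketched as follows. To keep the example self-contained, it parses a tiny made-up CSV string with the same layout (a header row and a leading case-number column) instead of the real file; the names `raw`, `names` and `data` and the numbers in the string are assumptions for illustration only.

```mathematica
(* Stand-in for the real file; with the file available one would use *)
(* raw = Import["bollen.csv"] instead of ImportString.              *)
csv = "id,x1,x2\n1,4.4,3.6\n2,5.4,5.1";   (* hypothetical values *)
raw = ImportString[csv, "CSV"];

names = Rest[First[raw]];   (* header row without the case-number column *)
data = Rest /@ Rest[raw];   (* drop the header row, then the first column *)

Dimensions[data]
```

Applied to the real dataset, `Dimensions[data]` should report 75 rows and 11 columns, matching the description in the text.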
The model itself has to be specified as a list of replacement rules that mirror the model equations discussed.
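A model specification of this kind might look like the following sketch. All symbol names (`lx2`, `ly2`, `g1`, `g2`, `b`, the error variables `d1`, `e1`, `z1` and so on) are hypothetical choices for this illustration; the first loading of each latent variable is fixed to 1 as in lavaan.

```mathematica
(* Each rule replaces a variable by the right-hand side of its model equation. *)
model = {
   x1 -> ind60 + d1, x2 -> lx2 ind60 + d2, x3 -> lx3 ind60 + d3,
   y1 -> dem60 + e1, y2 -> ly2 dem60 + e2,
   y3 -> ly3 dem60 + e3, y4 -> ly4 dem60 + e4,
   y5 -> dem65 + e5, y6 -> ly6 dem65 + e6,
   y7 -> ly7 dem65 + e7, y8 -> ly8 dem65 + e8,
   dem60 -> g1 ind60 + z1,
   dem65 -> g2 ind60 + b dem60 + z2};

(* Repeated replacement expresses y5 through the exogenous latent variable *)
(* and error variables only; dem65 and dem60 are eliminated in turn.       *)
y5expanded = Expand[y5 //. model]
```

Because no rule has `ind60` or an error variable on its left-hand side, `//.` terminates once every endogenous variable has been eliminated.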
The code for the estimation function includes some utilities. For example, it defines its own covariance and variance functions that take into account which variables are assumed to be uncorrelated. The input of the estimation function is the data matrix, a matrix of numerical values with one row per case. The structural equations are given in the format detailed in the previous section, “The Standard Example.” Moreover, the function needs:
• the list of free parameters (e.g. path weights)
• the list of endogenous latent variables
• the list of exogenous latent variables
• the list of error variables of latent variables
• the list of errors of exogenous manifest variables
• the list of errors of endogenous manifest variables
• a list of pairs of error variables specifying which error variables are allowed to be correlated
The later part of the code can be omitted on a first reading; it is only needed to calculate some fit indices (one option requests the fit index (FI) calculation; similarly, another option requests the maximum likelihood estimation). The estimation is done at the end of the function.
The goal of the first half of the program is the definition of the covariance function that takes into account the SEM assumptions: that most error variables are uncorrelated (except those specified to be correlated), leaving variances of latent variables as symbolic entities to be estimated.
This function is then used to calculate the model-implied covariance matrix $\Sigma$. Applying the model equation rules repeatedly gives a matrix that depends only on parameters, variances of latent variables and error variables and some allowed covariances of error variables. The code from the line defining the degrees of freedom onward is only important for getting fit indices. If we are only interested in estimating the model parameters, the next interesting lines are those where the numerical minimizer is applied to estimate the model. As described in the introduction, there are several strategies to measure the deviation between covariance matrices; each objective in the code is a straightforward coding of the corresponding distance between $S$ and $\Sigma$.
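The idea can be sketched for a deliberately small, hypothetical one-factor model with three indicators. This is not the article's estimation function: the names `cov`, `var`, `base`, `truth` and `uls` are assumptions, all base variables are taken to be mutually uncorrelated, and the "empirical" matrix is generated from known parameter values rather than from data.

```mathematica
(* Hypothetical one-factor model; the loading of x1 is fixed to 1. *)
model = {x1 -> f + d1, x2 -> l2 f + d2, x3 -> l3 f + d3};
base = {f, d1, d2, d3};   (* assumed mutually uncorrelated *)

(* Covariance of two expressions that are linear in the base variables: *)
(* by bilinearity only the variance terms survive.                      *)
cov[u_, v_] := Sum[Coefficient[u, w] Coefficient[v, w] var[w], {w, base}];

exprs = {x1, x2, x3} //. model;
sigma = Table[cov[u, v], {u, exprs}, {v, exprs}];

(* Illustrative "empirical" matrix built from known parameter values. *)
truth = {l2 -> 2, l3 -> 1/2, var[f] -> 3/2,
   var[d1] -> 1/2, var[d2] -> 2/5, var[d3] -> 3/10};
s = sigma /. truth;

(* Unweighted least-squares objective and numerical minimization. *)
uls = Tr[(s - sigma).(s - sigma)]/2;
params = {l2, l3, var[f], var[d1], var[d2], var[d3]};
{fmin, est} = FindMinimum[uls, ({#, 1} &) /@ params];
est
```

The symbolic matrix `sigma` has polynomial entries such as `l2 var[f]`, which is the structure described above. Allowing a correlated pair of error variables would add a cross term to `cov`, at the price of one more parameter to estimate.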
Let us run the code on Bollen’s model in a simplified version where no correlation of error variables is assumed. This may take several minutes.
The result combines parameter, variance and covariance estimates according to the various estimation strategies. To judge how well the model fits the data, you can use an option to request some fit indices:
• RMSEA is the root mean square error of approximation
• CFI is the comparative fit index
• TLI is the Tucker–Lewis fit index
• NFI is the normed fit index
RMSEA should be less than 0.1 or, better, less than 0.05, and the last three should all be greater than 0.9 or, better, 0.95 for good model fit.
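For orientation, these indices can be computed from the chi-square statistic of the fitted model and of a baseline (independence) model. The sketch below uses commonly cited definitions; the function names are hypothetical, the sample values are made up, and the article's own code (as well as lavaan) may use slightly different variants, for example $N$ instead of $N - 1$ in RMSEA.

```mathematica
(* chi2, df:   test statistic and degrees of freedom of the fitted model *)
(* chi2b, dfb: the same quantities for the baseline model                *)
(* n:          number of cases                                           *)
rmsea[chi2_, df_, n_] := Sqrt[Max[(chi2 - df)/(df (n - 1)), 0]]
cfi[chi2_, df_, chi2b_, dfb_] :=
  1 - Max[chi2 - df, 0]/Max[chi2 - df, chi2b - dfb, 0]
tli[chi2_, df_, chi2b_, dfb_] := (chi2b/dfb - chi2/df)/(chi2b/dfb - 1)
nfi[chi2_, chi2b_] := (chi2b - chi2)/chi2b

(* Made-up input values, for illustration only *)
N[{rmsea[40, 35, 75], cfi[40, 35, 700, 55], tli[40, 35, 700, 55], nfi[40, 700]}]
```

The `cfi` sketch does not guard against the degenerate case in which neither chi-square exceeds its degrees of freedom.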
The results of estimating using the three different methods differ somewhat. This is not a bug of our program; lavaan determines the same numbers up to several decimal places. There are results in the literature about which methods are equivalent under which conditions. For these fit indices to be interpretable, we need to assume that the data is multivariate normally distributed. If this assumption is violated, then we should judge model fit by other indices, which is beyond the scope of this article; however, they could be calculated based on the current approach as well. The book edited by Hoyle [2] gives some information on these methods.
For the original model that allows some covariances between error variables, the runtime gets worse, especially for maximum likelihood estimation. Hence, this is turned off in the following code.
The results of both models are exactly the same as calculated with lavaan.
An Alternative Approach: Case-based Estimation
When I first learned about SEM, I was puzzled by the many notions (e.g. exogenous, endogenous) and the assumptions needed. For example, I felt that correlation of error variables should be calculated by the estimation algorithm and not be set at will when specifying the model. However, these difficulties seem to play no large role in practice and there are thousands of research papers (mainly) in the social sciences that use these methods with great success. Yet, there are some reasons why the standard approach to SEM via covariance matrices can be criticized (a more detailed discussion is given in [5]). Traditional SEM:
• is well suited only for linear models (there are some nonlinear extensions, but they have not yet become mainstream)
• does not give estimates of the values of latent variables for each case (Bayesian variants can do this)
• requires the covariance matrix of observed data to be nonsingular; however, improving measurement methods in $x_1$, $x_2$, $x_3$, for example, may result in highly correlated measures of $\mathrm{ind60}$ (in the extreme case with identical vectors of measured values) and hence their covariance matrix will be almost singular
• yields parameter estimates that depend strongly on the estimation method used
• forbids certain linear models that are not identified in this approach, even though the model itself is sensible and well defined (e.g. the number of covariances of error variables allowed to be nonzero is limited, although in practice there may be correlations)
You may then wonder why the covariance matrix–based approach is so popular. I suppose that more than 40 years ago, computers were not powerful enough to deal with a full dataset, so that the information reduction by calculating the correlation matrix was essential. Since then, many powerful programs have been developed and research has been carried out that gave a good understanding of conditions under which the method works well. Moreover, the psychometric community reached a consensus on how model fit should be judged and thus studies using this method faced no problem being published.
After this discussion of pros and cons, it is time to present a case-based approach to SEM estimation. It is very easy (one may even call it naive) to implement, yet very flexible, and with today’s computing power it is feasible in many real-world situations.
I propose to do SEM case-based by least-squares optimization of the defects of the equations. Assume we have $N$ observations (cases) of manifest variables $x_{ij}$, $i = 1, \dots, N$. A general equational model consists of equations $f_k(x_i, l_i, p) = 0$, $k = 1, \dots, K$, which involve the data, latent variables $l_{ij}$ (one value per case $i$), and parameters $p$. Then the latent variables and the parameters are estimated by minimizing $\sum_{i=1}^{N} \sum_{k=1}^{K} f_k(x_i, l_i, p)^2$.
Another twist is needed to get the best results, however. The above objective function gives all equations the same weight. However, it turned out (by working with simulated data where it is clear which parameters should be found) that we get better results by multiplying the squared defect of each equation $k$ by a factor $w_k$ that gives the equations different weights, that is, by minimizing $\sum_{i} \sum_{k} w_k\, f_k(x_i, l_i, p)^2$. The factor can be modified by an option in the code that follows. Best results are obtained when the weight decreases with the number of latent variables in the equation. The idea behind this choice is that an equation that involves only one latent variable links this variable directly to the manifest data and thus should have a high weight. In contrast, equations with many latent variables are not so close to the manifest observations and are thus more hypothetical, so they should have a lower weight.
The model equations are not formulated as rules, as they were for the first function, SEM, but as equations with the name of the error variable attached to each equation. Moreover, the dataset is not normalized, so there are nonzero intercepts in the linear equations. In the first approach this had no consequences, because such additive values are eliminated by calculating the covariance matrix, but in the SEM2 approach, intercepts must be modeled explicitly (and we have the benefit of getting estimates for them as well).
The function SEM2 that carries out the model estimation takes as input the data, the model equations and the names of the manifest and latent variables. At the technical heart of the function is a subroutine that takes an equation involving latent variables and adds to the objective function the appropriate term for each case (i.e. with values from the data replacing the names of manifest variables):
There is one option.
This code estimates Bollen’s model.
As mentioned, there is a version that weights equations according to the number of latent variables they have.
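The case-based idea can be sketched for a small, hypothetical one-factor model with three indicators. This is not the SEM2 function itself: the simulated data, the noise level, the names `lat`, `res`, `weights` and the starting values are all assumptions made for this illustration. The first indicator has loading 1 and no intercept, so that the scale and origin of the latent variable are fixed.

```mathematica
SeedRandom[1];
n = 30;   (* number of cases *)

(* Simulated data from known values: intercepts 2 and -1, loadings 0.5 and 2 *)
latTrue = RandomVariate[NormalDistribution[0, 1], n];
noise[] := RandomVariate[NormalDistribution[0, 0.1], n];
data = Transpose[{latTrue + noise[],
    2 + 0.5 latTrue + noise[],
    -1 + 2 latTrue + noise[]}];

(* Defects of the three measurement equations for case i; *)
(* lat[i] is the unknown latent value of that case.        *)
res[i_] := {data[[i, 1]] - lat[i],
   data[[i, 2]] - (a2 + b2 lat[i]),
   data[[i, 3]] - (a3 + b3 lat[i])};

(* One weight per equation; equal weights here because every *)
(* equation contains exactly one latent variable.            *)
weights = {1, 1, 1};
objective = Total[Table[weights.(res[i]^2), {i, n}]];

unknowns = Join[{a2, a3, b2, b3}, Array[lat, n]];
start = ({#, 0.5} &) /@ unknowns;
{fmin, sol} = FindMinimum[objective, start, MaxIterations -> 1000];
{a2, a3, b2, b3} /. sol
```

Besides the parameters, `sol` also contains one estimated latent value `lat[i]` per case, which the covariance-based approach does not provide directly. The number of unknowns grows with the number of cases, which is why the numerical effort is higher.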
The resulting estimates differ from those of the traditional covariance matrix-based approach. A simulation study that compares the two approaches [5] showed that in many situations the case-based approach gives better results, especially when the assumption of independent errors is violated. Moreover, the case-based approach is easily applied to nonlinear equations. However, in certain situations it may be necessary to perform the minimization with higher accuracy than provided by standard hardware floating-point numbers.
Application to Measurement Error
In standard linear regression $y = a + b\,x + e$, one assumes that the independent variables are measured exactly, while the dependent variable has an error $e$ that is ideally normally distributed. If the independent variables are measured with error too, standard linear regression underestimates the regression coefficient. This is the famous attenuation problem and I will show how to solve it. Let us first simulate a dataset with error on both variables.
Then linear regression underestimates the slope, which should be 0.5.
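A simulation of this kind might look as follows. The true slope 0.5 is taken from the text; the sample size, the seed and the noise levels are assumptions chosen for this sketch.

```mathematica
SeedRandom[42];
n = 2000;

(* True values of the independent variable and noisy measurements of both variables *)
xTrue = RandomVariate[NormalDistribution[0, 1], n];
x = xTrue + RandomVariate[NormalDistribution[0, 0.5], n];      (* error in x *)
y = 0.5 xTrue + RandomVariate[NormalDistribution[0, 0.2], n];  (* error in y *)

(* Ordinary least squares of y on the error-prone x *)
lm = LinearModelFit[Transpose[{x, y}], u, u];
slopeOLS = lm["BestFitParameters"][[2]]
```

With these assumed noise levels, the classical attenuation formula predicts that the least-squares slope tends to $0.5 \cdot 1/(1 + 0.5^2) = 0.4$ rather than 0.5, because the measurement error inflates the variance of $x$ without adding covariance with $y$.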
When using case-based modeling, several strategies are possible. We may use one or two latent variables for the true values. As the true dependent variable is just a linear function of the true independent variable, the following code uses just one latent variable. Another twist is that the equations are divided by the empirical standard deviations to put them on an equal footing.
This example shows both the power of this method and the responsibility of the modeler to set up sensible equations. If we are sure that the errors are uncorrelated, we may add the condition that their empirical covariance vanishes as another constraint to further improve the estimate. This may also be done automatically with an extended version of SEM2, which will be published when its development is completed.
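The setup just described can be sketched as follows. The simulated data, the names `t`, `a`, `b` and the starting values are assumptions for this illustration, not the article's code: each case gets one latent true value `t[i]`, and the two defect equations are scaled by the empirical standard deviations `sx` and `sy`.

```mathematica
SeedRandom[42];
n = 100;
xTrue = RandomVariate[NormalDistribution[0, 1], n];
x = xTrue + RandomVariate[NormalDistribution[0, 0.5], n];
y = 0.5 xTrue + RandomVariate[NormalDistribution[0, 0.2], n];
{sx, sy} = StandardDeviation /@ {x, y};

(* Scaled squared defects of x = t and y = a + b t, summed over all cases *)
objective = Total[Table[
    ((x[[i]] - t[i])/sx)^2 + ((y[[i]] - a - b t[i])/sy)^2, {i, n}]];

(* Start the latent values at the measured x values *)
start = Join[{{a, 0}, {b, 1}}, Table[{t[i], x[[i]]}, {i, n}]];
{fmin, sol} = FindMinimum[objective, start, MaxIterations -> 1000];

bCase = b /. sol;
bOLS = LinearModelFit[Transpose[{x, y}], u, u]["BestFitParameters"][[2]];
{bOLS, bCase}
```

With this particular scaling, the problem is an orthogonal regression in standardized coordinates, so the converged slope can be checked against `sy/sx`. It is larger than the least-squares slope whenever the two variables are positively but not perfectly correlated.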
Summary
Two methods for the estimation of structural equation models are presented. One uses the traditional covariance matrix-based approach and is therefore restricted to linear equations, while the other approach is more general but not yet established in practice. Estimating the models is rather easy in Mathematica, but the numerical problems that arise can be demanding. The new case-based approach is very flexible and promising in certain situations where the standard approach shows limitations.
Conclusion
Case-based calculation of SEM looks very promising given the numerical power of today’s computers and might give insight in situations where the restrictions of the traditional approach urge researchers into making assumptions that may not be warranted.
Acknowledgments
It is my pleasure to thank Ed Merkle and Yves Rosseel for many explanations of SEM.
References
[1] K. A. Bollen, Structural Equations with Latent Variables, New York: Wiley, 1989.
[2] R. H. Hoyle (ed.), Handbook of Structural Equation Modeling, New York: Guilford Press, 2012.
[3] K. Gana and G. Broc, Structural Equation Modeling with lavaan, Hoboken: John Wiley & Sons, 2019.
[4] Y. Rosseel. “lavaan.” (Aug 25, 2019) https://lavaan.ugent.be.
[5] R. Oldenburg, “Case-based vs. Covariance-based SEM,” forthcoming.
R. Oldenburg, “Structural Equation Modeling,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-5.
About the Author
Reinhard Oldenburg has studied physics and mathematics and received a PhD in algebra. He has been a high-school teacher and now holds a professorship in Mathematics Education at Augsburg University. His research interests are computer algebra, the logic of elementary algebra and real-world applications.
Reinhard Oldenburg
Augsburg University
Mathematics Department
Universitätsstraße 14
86159 Augsburg, Germany
reinhard.oldenburg@math.uni-augsburg.de