-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path10-Estimation.Rmd
234 lines (166 loc) · 31.3 KB
/
10-Estimation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
# Estimation {#ch10}
In the chapter on identification we saw that model parameters can be written as a function of elements from $\mathbf{\Sigma}_\text{population}$. Subsequently, we can use these equations to estimate model parameters derived from the sample variances and covariances ($\mathbf{\Sigma}_\text{sample}$). For just-identified models, the number of known elements in $\mathbf{\Sigma}_\text{sample}$ equals the number of unknown model parameters, and one can derive parameter estimates by solving the equations that satisfy $\hat{\mathbf{\Sigma}}_\text{model}$ = $\mathbf{\Sigma}_\text{sample}$. For models that are over identified (i.e., that have positive $df$) the estimation of model parameters is more complicated. In our illustrative example of child anxiety, we can derive three different equations in terms of population variances and covariances (i.e., by rewriting the equations of $\sigma_{41}$, $\sigma_{42}$ and $\sigma_{43}$) for the expression of the single model parameter $\beta_{43}$. When we subsequently want to use these equations to estimate the model parameter $\hat\beta_{43}$ we can never derive an estimate so that all three elements $\hat\sigma_{41}$, $\hat\sigma_{42}$ and $\hat\sigma_{43}$ of $\hat{\mathbf{\Sigma}}_\text{model}$ are equal to the corresponding elements of $\mathbf{\Sigma}_\text{sample}$. Therefore, for over-identified models there will always be a certain amount of discrepancy between $\hat{\mathbf{\Sigma}}_\text{model}$ and $\mathbf{\Sigma}_\text{sample}$. In such cases, the estimation of model parameters requires an iterative procedure to minimize the discrepancy between $\mathbf{\Sigma}_\text{sample}$ and $\hat{\mathbf{\Sigma}}_\text{model}$ (i.e., What value of $\hat\beta_{43}$will lead to expressions of $\hat\sigma_{41}$, $\hat\sigma_{42}$ and $\hat\sigma_{43}$ that are as close as possible to the corresponding elements in $\mathbf{\Sigma}_\text{sample}$?).
The estimation procedures are iterative because the initial solution (i.e., the first attempt to estimate model parameters and evaluate the discrepancy between $\mathbf{\Sigma}_\text{sample}$ and $\hat{\mathbf{\Sigma}}_\text{model}$) is improved through subsequent cycles of calculations until the improvement in model fit falls below a predefined minimum value (i.e., the decrease in discrepancy between $\mathbf{\Sigma}_\text{sample}$ and $\hat{\mathbf{\Sigma}}_\text{model}$ is very small). When this happens, the estimation process has converged. Convergence may be achieved more quickly when the initial solution, or the initial estimates of the parameters (i.e., start values) are reasonably accurate. If these initial estimates are completely inaccurate, then iterative estimation may even fail to converge. Most computer programs automatically generate reasonable start values and will give a warning if a solution has not converged. However, sometimes it may be necessary to provide user-specified start values in order for the solution to converge, especially for more complex models.
Several discrepancy functions exist that can be used for the estimation of model parameters. Three desirable qualities of a good estimation procedure are *unbiasedness*, *consistency* and *efficiency.* Suppose we repeat the estimation procedure for a given model an infinite number of times on an infinite number of datasets. The estimator is unbiased when the expected parameter values (i.e., means of the sampling distributions of parameter estimates) are equal to the population values, i.e., they are not systematically biased upwards or downwards. The estimator is consistent when parameter estimates are closer to the population parameters with increased sample size, i.e., the larger the sample size, the closer the estimated values are to the population values (given that the sample is representative of the population). Efficiency refers to the variance of the sampling distribution of parameter estimates, where the most efficient estimator has a sampling distribution with the smallest variance (and thus the smallest $SE$s).
The most intuitive method that could be used to minimize the discrepancies between $\mathbf{\Sigma}_\text{sample}$ and $\hat{\mathbf{\Sigma}}_\text{model}$ is to simply calculate the difference between the two matrices and try to minimize the values within the resulting residual covariance matrix. This is what is commonly referred to as the unweighted least squares (ULS) discrepancy function^[In a later [chapter](#ch21) we will introduce mean structures, where we will give versions of Equations \@ref(eq:10-01) and \@ref(eq:10-02) that incorporate observed and model-implied means]:
```{=tex}
\begin{equation}
\text{F}_\text{ULS}(\mathbf\Sigma_{\text{sample}}, \hat{\mathbf{\Sigma}}_{\text{model}}) = ½ \text{trace}((\mathbf\Sigma_{\text{sample}}, \hat{\mathbf{\Sigma}}_{\text{model}})^2),
(\#eq:10-01)
\end{equation}
```
which minimizes the squared deviations between the observed sample variances and covariances and the corresponding elements predicted by the model. If the deviations are zero (i.e., observed and model-implied matrices are identical), then $F_\text{ULS}=0$, indicating perfect fit. Any nonzero discrepancies yield a positive $F_\text{ULS}$. Minimizing $F_\text{ULS}$ yields unweighted least squares (ULS) estimates of all the model parameters given by Equations \@ref(eq:3-09) and \@ref(eq:3-10), under the assumption that the model is valid and identified. It leads to unbiased and consistent estimates of model parameters, and does not require that the observed variables have a particular distribution. However, it is not the most efficient estimator.
A particularly attractive discrepancy function is given by:
```{=tex}
\begin{equation}
\text{F}_{\text{ML}}(\mathbf\Sigma_{\text{sample}}, \hat{\mathbf{\Sigma}}_{\text{model}}) = \log|\hat{\mathbf{\Sigma}}_{\text{model}}|-\log|\mathbf{\Sigma}_{\text{sample}}| + \text{trace}(\mathbf{\Sigma}_{\text{sample}}\hat{\mathbf{\Sigma}}^{-1}_{\text{model}})-p,
(\#eq:10-02)
\end{equation}
```
which gives so-called maximum likelihood (ML) estimates of all model parameters, assuming that the distribution of the scores of the observed variables ($p$) is multivariate normal and that the model is valid and identified. It is the most widely used fitting function for general structural equation models, and the estimator is unbiased, consistent and more efficient than ULS. Although it is a more complex nonlinear function, it has additional important properties. First, it allows for tests of statistical significance of parameter estimates as the estimators are asymptotically normally distributed. Second, unlike $F_\text{ULS}$, $F_\text{ML}$ is scale invariant and scale free. These properties refer to the dependency of the fitting function on the units of measurement of the observed variables. When the fitting function is scale invariant, the value of the fit function is independent of the measurement scales of the variables, i.e., the value of the fit function will be the same for different scales of measurement. In addition, scale-freeness refers to the property that when the scales of the observed variables are linearly transformed (i.e., multiplied by and/or added to a constant), the relationship between the parameter estimates of the transformed and untransformed solution can be derived too (i.e., they are linearly transformed in a similar fashion).
In general, smaller values of discrepancy functions indicate better fit of the model to the data. A value of zero indicates the fit is perfect, i.e., the parameter estimates can perfectly reproduce the sample covariance matrix ($\hat{\mathbf{\Sigma}}_\text{model}=\mathbf{\Sigma}_\text{sample}$).
Using Equation \@ref(eq:10-02) to arrive at ML estimates of model parameter of the path model from our illustrative example, yields:
$$
\hat{\mathbf{\Sigma}}_\text{model} =
\begin{bmatrix}
91.58 & \\
53.36 & 194.32 \\
28.39 & 50.90 & 130.19 \\
2.05 & 3.68 & 9.41 & 7.56
\end{bmatrix}
$$
whereas the observed variances and covariances of the sample are given by:
$$
\mathbf{\Sigma}_\text{sample} =
\begin{bmatrix}
91.58 & \\
53.36 & 194.32 \\
28.39 & 50.90 & 130.19 \\
9.21 & 4.98 & 9.41 & 7.56
\end{bmatrix}
$$
When we compare $\hat{\mathbf{\Sigma}}_\text{model}$ and $\mathbf{\Sigma}_\text{sample}$ from our illustrative example, we can see that there are some discrepancies between the two matrices. Specifically, the elements $\hat{\sigma}_{41}$ and $\hat{\sigma}_{42}$ of $\hat{\mathbf{\Sigma}}_\text{model}$ are different from the corresponding sample covariance. These two elements featured in the equations for the over-identified model parameter $\hat{\beta}_{43}$. As can be seen, the model parameter was estimated so that the element $\hat{\beta}_{43}$ of $\hat{\mathbf{\Sigma}}_\text{model}$ equals the corresponding element of the sample covariance matrix, which thus leads to a misfit for the elements $\hat{\sigma}_{41}$ and $\hat{\sigma}_{42}$. The residual covariance matrix (i.e., $\mathbf{\Sigma}_\text{sample} - \hat{\mathbf{\Sigma}}_\text{model}$) is:
$$
\mathbf{\Sigma}_{\text{residual}} =
\begin{bmatrix}
0\\
0&0\\
0&0&0\\
7.2 & 1.3 & 0 & 0
\end{bmatrix}
$$
This shows that the covariance between variable 4 (child anxiety) and variables 1 and 2 (parent anxiety and parent perfectionism) are underestimated by the model. When the residual values would be negative, this would indicate that the covariances are overestimated by the model. Model fit evaluation (which is the topic of Chapter 11) is used to evaluate whether the amount of misfit in these elements is substantial enough to reject the model, i.e., concluding that the specified model is not an adequate representation of the developmental process of child anxiety.
It is important to note that models should only be fit to covariance matrices. Fitting a model to a correlation matrix without additional constraints yields incorrect $SE$s (Cudeck, 1989), so Wald $z$ tests do not have nominal Type I error rates.
## Improper Solutions
Even when the estimation procedure has reached convergence, a researcher should always be aware of inadmissible solutions and so-called Heywood cases. Heywood cases refer to parameter estimates with an illogical value, like for example negative variance estimates, or estimated correlations with absolute values larger than 1.0^[If the estimated parameter is a covariance, then it would be necessary to check the standardized solution to see whether the correlation is greater than 1 in absolute value. Calculating a 95\% CI for the correlation would in that case require standardizing the upper and lower confidence limits of the covariance, using the model-implied $SD$s (i.e., square-roots of estimated variances).]. These kind of estimates have no plausible interpretations, and therefore such solutions are inadmissible. Sometimes the associated SEs of parameter estimates are very large (e.g., 999,999.99), which usually indicates that there is a problem with the model solution. Inadmissible solutions may be caused by several factors, including underidentification, misspecification, or bad starting values. Some computer programs give warnings that parameter estimates take on illogical values, but it is important to always carefully inspect all parameter estimates, unstandardized and standardized, to ensure that the solution is indeed admissible. Chen, Bollen, Paxton, Curran and Kirby (2001) describe the possible causes of, consequences of, and possible strategies to handle improper solutions in more detail.
For instance, a Heywood case could result from either (a) model misspecification or (b) sampling error. So before trying to "fix" a Heywood case, one should first test the null hypothesis that the parameter is an admissible solution. For example, if the 95% CI for a residual variance includes positive values, then you cannot reject the null hypothesis (using α = .05) that the true population value is indeed positive; in this case, if the model fits well and there are no other signs of misspecification, you could conclude that the true parameter is simply close enough to zero that sampling error occasionally yields a negative estimate.
## Specifying starting values in `lavaan`
Methods for choosing default start values are quite excellent in modern SEM software such as M*plus* and `lavaan`, so it is rarely necessary for a user to provide start values manually. However, it may be useful to know how to do so, particularly when having difficulty converging on a proper (or any) solution. There are three ways to provide non-default starting values in `lavaan`. The first two involve specifying them in the `lavaan` model syntax, either using the `start()` modifier:
```
AWmodel <- ' child_anx ~ start(0.666)*overcontrol '
```
Or you may use the ? operator:
```
AWmodel <- ' child_anx ~ 0.666?overcontrol '
```
Note that if you want to provide both a starting value and use other modifiers/operators (e.g., to fix a parameter or label it) on the same parameter, then you have to specify that parameter twice on the same line of syntax:
```
AWmodel <- ' child_anx ~ b43*overcontrol + start(0.666)*overcontrol '
```
## Alternative estimators
Additional estimators are available, but are less often used with continuous data. For categorical data, however, covariance structure analysis is not appropriate, so some version of the weighted least-squares (WLS) estimator is typically used. WLS estimation is available for continuous data as well, although this special case of WLS is referred to in the SEM literature as the asymptotical distribution-free (ADF; Browne, 1984) estimator. The WLS fit function is:
```{=tex}
\begin{equation}
\text{F}_{\text{WLS}}(\mathbf{\Sigma}_\text{sample},\hat{\mathbf{\Sigma}}_\text{model}) = \text{vech}(\mathbf{\Sigma}_\text{sample} - \hat{\mathbf{\Sigma}}_\text{model})^{T} W^{-1} \text{vech}(\mathbf{\Sigma}_\text{sample} - \hat{\mathbf{\Sigma}}_\text{model}),
(\#eq:10-03)
\end{equation}
```
where `vech()` vectorizes the unique elements of its argument. The WLS fit function essentially incorporates higher-order information beyond the covariance matrix (specifically, kurtosis) into a weight matrix (the $W$ matrix in Equation \@ref(eq:10-03)). It is usually chosen to be the sampling covariance matrix of the elements in $\mathbf{\Sigma}_\text{sample}$. Equation \@ref(eq:10-03) is a very general fit function, because in principle, any weight matrix can be used. If an identity matrix is used as the weight matrix, this yields the ULS estimator. The ML fit function (using Wishart likelihood for analyses of covariance structure only) can actually be conceived of as WLS where the weight matrix is derived from $\hat{\mathbf{\Sigma}}_\text{model}$. Likewise, there is also a “generalized” least-squares estimator (GLS), which can be conceived as WLS if the weight matrix would be derived from $\mathbf{\Sigma}_\text{sample}$. GLS also requires multivariate normality, but it is not as efficient as MLE, so there is no advantage in using GLS.
Computers did not always have such great hardware (processing and memory) and software (optimization techniques to find ML solutions) that we are used to today. Therefore, GLS was popular 20 years ago when it was more common that SEM programs would fail to converge on a ML solution. Nowadays, most of the above estimators (ULS, GLS, ADF) are rarely, if ever, used. Not only is the ML estimator the most efficient (i.e., the smallest SEs), but being based on likelihoods comes with other advantages, such as being able to calculate likelihood-based information criteria (e.g., AIC and BIC) to compare models. Also, ADF requires extremely large sample sizes to be stable (i.e., $N$ > 2000 or 5000). Although MLE is based on asymptotic (“large-sample”) theory, sample sizes in the hundreds are often sufficient for stable solutions (or even ~120 for smaller models).
## Robust estimators
If data are not drawn from a multivariate normal population, then ML estimates are still consistent (and effectively unbiased, as sample size increases). However, the $SE$s tend to be underestimated and fit statistic inflated, both leading to inflated Type I error rates. If you have a sample size in the thousands, you could just use ADF. However, such large samples are rare for practicing behavioral researchers. A more practical solution is to “correct” the SEs and fit statistic so that they do yield nominal Type I error rates.
The most often used correction for $SE$s is to use a “sandwich-type” estimator. The asymptotic covariance matrix of model parameters, $A$, estimates how much model parameters would (co)vary from sample to sample. $SE$s are the square-roots of the diagonal elements of $A$ (i.e., the sampling variances). If the normality assumption does not hold, then $A$ is incorrect, but it can be corrected using a weight matrix, $W$, calculated from the observed kurtosis. Without getting into the technical details, the term “sandwich” refers to the basic form of the matrix-algebra equation, in which $A$ is the outer “break” and $W$ is the “meat” (see Savalei, 2014, for more information).
Several robust corrections for the fit statistic have been proposed. The Satorra–Bentler (2001) scaled test statistic is the most popular, but it requires complete data, whereas the Yuan–Bentler (2000) scaled test statistic (T2\*) can be used with partially observed data (i.e., using full-information maximum likelihood)—they are asymptotically equivalent to each other (i.e., in large samples). These methods “scale” the actual distribution of the test statistic to have a mean that equals the df, which is the actual mean of the $\chi^2$ distribution. Satterthwaite-type methods also adjust the variance of the test statistic, so that its variance is approximately equal to the $\chi^2$ distribution’s (i.e., $2 \times df$). Scaling and shifting the observed statistic will yield Type I error rates that are closer to the nominal $α$ level. These scaled–shifted statistics are sometimes referred to as mean- and variance-adjusted statistics (e.g., in M*plus*). Because the unadjusted $\chi^2$ fit statistic will have inflated Type I errors when the normality assumption is not met, a scaled test statistic usually indicates better fit.
## Requesting alternative and robust estimators in lavaan
The `lavaan()` function (and the `cfa()` and `sem()` wrappers) accept arguments to specify the desired estimator, type of test statistic, and method for calculating $SE$s. In the examples up until now, we have been using the defaults for normally distributed data. Below, we explicitly request those default arguments:
```{r}
AWmodelOut <- lavaan(model = AWmodel,
sample.cov = AWcov, sample.nobs = 77,
likelihood = "wishart", fixed.x = FALSE,
estimator = "ML", se = "standard", test = "standard")
```
Alternative estimators can be requested easily; for example, to request GLS, simply use the argument `estimator = "GLS"`. When analyzing categorical outcomes (see Chapter 23), the default will be switched to `estimator = "DWLS"` (diagonally weighted least squares).
If you are analyzing raw data instead of summary statistics, robust $SE$s and fit statistics can be requested just as easily. For example, to request sandwich-type $SE$s, use the argument `se = "robust.sem"` with complete data or `se = "robust.huber.white"` with missing data (in combination with `missing = "FIML"`). To request a scaled test statistic, use the argument `test = "Satorra.Bentler"` with complete data or `test = "Yuan.Bentler"` with missing data (in combination with missing = "FIML"; see the following section for more information). Bootstrapping can be used to calculate $SE$s and the $p$ value for the fit statistic by requesting both `se = "boot"` and `test = "boot"` (in addition to setting the number of bootstrap samples to use in the `boot` argument); bootstrapping the fit statistics is currently only available with complete data, although the `semTools` package includes a function (`bsBootMiss()`) to bootstrap fit with missing data.
Because the popular SEM software M*plus* uses a single shortcut to specify the choice of estimator, $SE$s, and fit statistic, many of those shortcuts are being used by applied researchers as though they are the names of the estimator itself (e.g., “MLR” for robust ML). Even if robust $SE$s and fit statistic are used, ML is still the estimation method use to obtain point estimates of parameters (i.e., assuming normality)---only the $SE$s and fit statistic are adjusted *after* parameters are estimated. Still, `lavaan` allows some of these shortcuts, which you can read about on the `?lavaan` help page, in the description of the estimator argument. Likewise, the help page describes all other available estimators (e.g., `"PML"` for pairwise maximum likelihood), $SE$s (e.g., based on first-order derivatives), and test statistics (e.g., scaled and shifted, or mean- and variance-adjusted).
## Alternative calculation of the ML discrepancy function when analysing raw data {#ch10-6}
Equation \@ref(eq:10-02) shows how to calculate $F_\text{ML}$ from summary statistics (i.e., observed ($\mathbf{\Sigma}_\text{sample}$) and model-implied ($\hat{\mathbf{\Sigma}}_\text{model}$) covariance matrices). This equation only applies when analysing complete data. In practice, data are often partially observed. When some data are missing, it is still possible to use full-information maximum likelihood (FIML) estimation, where the likelihood of each individual is calculated using only the parameters relevant to the variables on which they have observed data. Individual $i$’s likelihood is calculated by plugging their vector of observed values ($\mathrm{y}_i$) into the multivariate normal probability density function:
```{=tex}
\begin{equation}
\text{likelihood}(\textbf{y}_i) = \frac{1}{(2\pi)^{p/2|\mathbf{\Sigma}|1/2}}\mathrm{e}^{-\frac{1}{2}(y_i-\mu)'\mathbf{\Sigma}^{-1}(y_i-\mu)}
(\#eq:10-04)
\end{equation}
```
using the model-implied mean vector^[When mean structure is excluded from the model, μ can be set to a vector of zeros, and each variable in the data set should be mean-centered.] as $μ$ and the model-implied covariance matrix as $\mathbf\Sigma$, where $p$ is the number of variables in $\mathrm{y}_i$. After calculating the likelihood for each row of data, the joint likelihood of observing the entire sample is the product^[This is the same as calculating the probability of flipping a coin twice and getting “heads” both times. The probability of “heads” is 50% for each coin-toss, so the probability of 2 “heads” is $0.5 \times 0.5 = 0.25$ (or 25\%).] of all the individual likelihoods. Because likelihoods are bound between 0 and 1, the joint likelihood becomes very small—too small even for computers to calculate with enough precision^[R, for instance, only keeps track to the 16th decimal place, although it might only be very precise to the 12th decimal place in practice.]. It is easier to work with log-likelihoods because the log of numbers ranging from 0 to 1 will range from $−\infty$ to 0 (so not as much precision is needed). Because the log of a product is equal to the sum of the individual logs---$\log(a \times b) = \log(a) + \log(b)$---we can simply add together all the individual log-likelihoods to calculate the log-likelihood of the sample. The log of equation \@ref(eq:10-04) is:
```{=tex}
\begin{equation}
\ell_i = \frac{p}{2}\log(2\pi) - \log(|\mathbf{\Sigma}|) - \frac{1}{2}(y_i-\mu)'\mathbf{\Sigma}^{-1}(y_i-\mu)
(\#eq:10-05)
\end{equation}
```
In a model with 5 variables, person $i$’s vector $\mathrm{y}_i$ would have 5 observed values in it, the model-implied mean vector^[When mean structure is excluded from the model, $μ$ can be set to zero and data should be mean-centered.] μ would have 5 means in it, and the model-implied covariance matrix $\mathbf\Sigma$ would have 5 rows and 5 columns. If person $i$ has complete data, all this information would be included in the calculation of equation \@ref(eq:10-05):
$$
y_i = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} \mu_i = \begin{bmatrix} \mu_1 \\ \mu_2\\ \mu_3\\ \mu_4\\ \mu_5 \end{bmatrix} \mathbf{\Sigma}_i = \begin{bmatrix}
\sigma^2_1 & \sigma_{12} & \sigma_{13} & \sigma_{14} & \sigma_{15} \\
\sigma_{21} & \sigma^2_2 & \sigma_{23} & \sigma_{24} & \sigma_{25} \\
\sigma_{31} & \sigma_{32} & \sigma^2_3 & \sigma_{34} & \sigma_{35} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma^2_4 & \sigma_{45} \\
\sigma_{51} & \sigma_{52} & \sigma_{53} & \sigma_{54} & \sigma^2_5 \end{bmatrix}
$$
However, if person $i$ was only observed on $\mathrm{y}_1$, $\mathrm{y}_2$, and $\mathrm{y}_5$:
$$
y_i = \begin{bmatrix} y_1 \\ y_2 \\. \\ . \\ y_5 \end{bmatrix} \mu_i = \begin{bmatrix} \mu_1 \\ \mu_2\\ \color{lightgrey}{\mu_3} \\ \color{lightgrey}{\mu_4}\\ \mu_5 \end{bmatrix}
\mathbf{\Sigma}_i = \begin{bmatrix}
\sigma^2_1 & \sigma_{12} & \color{lightgrey}{\sigma_{13}} & \color{lightgrey}{\sigma_{14}} & \sigma_{15} \\
\sigma_{21} & \sigma^2_2 & \color{lightgrey}{\sigma_{23}} & \color{lightgrey}{\sigma_{24}} & \sigma_{25} \\
\color{lightgrey}{\sigma_{31}} & \color{lightgrey}{\sigma_{32}} & \color{lightgrey}{\sigma^2_3} & \color{lightgrey}{\sigma_{34}} & \color{lightgrey}{\sigma_{35}} \\
\color{lightgrey}{\sigma_{41}} & \color{lightgrey}{\sigma_{42}} & \color{lightgrey}{\sigma_{43}} & \color{lightgrey}{\sigma^2_4} & \color{lightgrey}{\sigma_{45}} \\
\sigma_{51} & \sigma_{52} & \color{lightgrey}{\sigma_{53}} & \color{lightgrey}{\sigma_{54}} & \sigma^2_5
\end{bmatrix}
$$
Only use parameters involving those variables to calculate her/his $\ell_i$:
$$
y_i = \begin{bmatrix} y_1 \\ y_2 \\ y_5 \end{bmatrix}
\mu_i = \begin{bmatrix} \mu_1 \\ \mu_2\\ \mu_5 \end{bmatrix}
\mathbf{\Sigma}_i = \begin{bmatrix}
\sigma^2_1 & \sigma_{12} & \sigma_{15} \\
\sigma_{21} & \sigma^2_2 & \sigma_{25} \\
\sigma_{51} & \sigma_{52} & \sigma^2_5
\end{bmatrix}
$$
Thus, an individual’s likelihood is calculated using all available information. The log-likelihood $\ell$ for a sample is the sum of all the individual log-likelihoods ($\ell = \mathbf\Sigma\ell_i$), so FIML estimation utilizes each person’s available information when choosing the best parameter estimates that maximize the likelihood of the entire sample.
We discuss tests statistics for model fit and model comparison in Chapter 11. However, in order to understand how $\ell$ is related to $F_\text{ML}$, we must first introduce the likelihood-ratio test (LRT). The ratio of two likelihoods tells you how much more (or less) likely you are to observe the sample data if the model in the numerator (A) is true than if the model in the denominator (B) is true: $\frac{\text{likelihood under model A}}{\text{likelihood under model B}}$. Recall that we work with log-likelihoods for greater precision. Furthermore, we multiply $\ell$ by −2 so that the resulting statistics are distributed as $\chi^2$ random variables (see Chapter 11 for details). Because the log of a ratio is equal to the difference in each log---i.e., $log(a / b) = log(a) − log(b)$---we can calculate the log of the likelihood-ratio as a difference between the log-likelihoods $\ell$:
```{=tex}
\begin{equation}
LRT = -2\ell_A - (-2\ell_B) = -2 \times (\ell_A - \ell_B)
(\#eq:10-06)
\end{equation}
```
Suppose we label our hypothesized (target) model as Model A, and we compare it to a perfectly fitting saturated model (labelled Model B) whose estimates are therefore identical to the observed sample statistics ($\mathbf{\Sigma}_\text{sample}$). If the hypothesized model is the true data-generating model, then there should be little or no difference between $\ell_{\text{Target}}$ and $\ell_{\text{Saturated}}$, so the LRT should be small (close to zero). You might notice how equation \@ref(eq:10-02) incorporates this concept into the calculation of $F_\text{ML}$, by taking the difference between the $\log$s (of the determinants) of the target-model-implied $\Sigma$ (i.e., $\hat{\mathbf{\Sigma}}_\text{model}$) and the saturated-model-implied $\mathbf\Sigma$ (i.e., $\mathbf{\Sigma}_\text{sample}$), which would be Models A and B in equation \@ref(eq:10-06).
Having established the concept of a LRT, we can show how to calculate the same $F_\text{ML}$ from equation \@ref(eq:10-02) using $\ell$. Notice that a mean ($\overline{X}$) is simply a sum of values divided by the number of values: $\overline{X} = \frac{1}{N} \Sigma x_i = \frac{\Sigma x_i}{N}$. If we calculate equation \@ref(eq:10-06) using the average log-likelihood ($\overline{\ell} = \frac{1}{N} \Sigma\ell_i = \frac{\ell}{N}$) instead of the sum $\ell$, then the result would be $\frac{LRT}{N} = F_{\text{ML}}$. So $F_\text{ML}}$ can be interpreted as the average person’s log-likelihood-ratio in the sample. As you will see in Chapter 11, the $\chi^2$ statistic is used to test whether the model fits the entire sample, so we simply convert the mean ($F_\text{ML}$) back to a sum (LRT, or $\chi^2$) by multiplying the mean by the sample size^[Or by $N − 1$ when analysing only covariance structure based on complete-data summary statistics.] ($N \times F_{\text{ML}}$).
## Using FIML to analyse raw data with missing values in lavaan {#ch10-7}
In `lavaan`, the default way to handle missing values is to apply listwise deletion. FIML estimation will be invoked by adding the argument `missing = "ML"` to the model. The missing values in the raw data (here stored in the object `AWdata`) should be specified as `NA`. Since using FIML requires the analysis of the means of the variables, one should also include the model for the means in the model syntax. To specify an unrestricted mean structure one would freely estimate an intercept for each observed variable. The `lavaan` syntax to specify the intercept for a variable involves regressing the variable on the constant 1. For example, the intercept of variable $\mathrm{y}_1$ would be estimated by including the syntax: `y1 ~ 1`.
```
AWmodelOut <- lavaan(model = AWmodel,
data = AWdata,
missing = "ML")
```
## References {-}
Browne, M. W. (1984). Asymptotic distribution free methods in the analysis of covariance structures. *British Journal of Mathematical and Statistical Psychology, 37*, 62--83. doi:10.1111/j.2044-8317.1984.tb00789.x
Chen, F., Bollen, K. A., Paxton, P., Curran, P. J., & Kirby, J. B. (2001). Improper solutions in structural equation models. *Sociological Methods & Research, 29*, 468--508. doi:10.1177/0049124101029004003
Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. *Psychological Bulletin, 105*(2), 317--327. doi:10.1037/0033-2909.105.2.317
Heywood, H. B. (1931). On finite sequences of real numbers. *Proceedings of the Royal Society of London, 134*, 486--501.
Jöreskog, K. G. (1973). A general method for estimating a linear structural equation system. In A. S. Goldberger & O. D. Duncan (Eds.). *Structural equation models in the social sciences* (pp. 85--112). New York: Academic Press.
Myung, J. (2003). Tutorial on maximum likelihood estimation. *Journal of Mathematical Psychology, 47*, 90--100. doi:10.1016/S0022-2496(02)00028-7
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. *Psychometrika, 66*(4), 507--514. doi:10.1007/BF02296192
Savalei, V. (2014). Understanding robust corrections in structural equation modeling. *Structural Equation Modeling, 21*(1), 149--160. doi:10.1080/10705511.2013.824793
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. *Sociological Methodology, 30*(1), 165--200. doi:10.1111/0081-1750.00078