LCVAR/README.Rmd at master · AnieBee/LCVAR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
---
title: "LCVARclust Example"
subtitle: " "
author: "Anja Franziska Ernst  </br>  a.f.ernst[at]rug.nl </br> "
date: "December 12, 2019 </br>"
output: rmarkdown::github_document
css: styles.css
---


<div class="black">
</br>
</br>
This document illustrates how to fit a latent class vector-autoregressive model to a time series using the LCVARclust R function.

</br>
For technical details see:

<div style="font-size: 80%;">
  Ernst, A. F., Albers, C. J., Jeronimus, B. F. & Timmerman, M. E. (2020). Inter-individual differences in multivariate time series: Latent class vector-autoregressive modelling.     _European Journal of Psychological Assessment_.
</div>
</br>
Please report bugs to: <div class="UniRed"> a.f.ernst[at]rug.nl. </div>
</br>
</br>
</br>
</div>

## Load the required packages  {.large}

```{r,  message=FALSE}
require(fastDummies) # (v.1.1.0) for dummy_cols()
require(MASS) # (v.7.3-49) for ginv()
require(mvtnorm) # (v.1.0-6) for dmvnorm()

```


## Source all functions that are required to run LCVARclust {.large}
```{r,  message=FALSE}

source("Functions/calculateA.R")
source("Functions/calculateB.R")
source("Functions/calculateBandWZero.R")
source("Functions/calculateCoefficientsForRandoAndRational.R")
source("Functions/calculateFYZ.R")
source("Functions/calculatePosterior.R")
source("Functions/calculateTau.R")
source("Functions/calculateRatio.R")
source("Functions/calculateNPara.R")
source("Functions/calculateIC.R")
source("Functions/calculateW.R")
source("Functions/calculateU.R")
source("Functions/calculateSigma.R")
source("Functions/calculateLagList.R")
source("Functions/callCalculateCoefficientsForRandoAndRational.R")
source("Functions/callEMFuncs.R")
source("Functions/checkComponentsCollapsed.R")
source("Functions/checkConvergence.R")
source("Functions/checkOutliers.R")
source("Functions/checkSingularitySigma.R")
source("Functions/constraintsOnB.R")
source("Functions/checkLikelihoodsNA.R")
source("Functions/checkPosteriorsNA.R")
source("Functions/createOutputList.R")
source("Functions/createX.R")
source("Functions/determineLagOrder.R")
source("Functions/reorderLags.R")
source("Functions/InitFuncs.R")
source("Functions/EMInit.R")
source("Functions/EMFunc.R")
source("Functions/LCVARclust.R")

```


## Load the data {.large}
As an example we will analyze a generated data set "Dataset". Make sure no missing values are included in your dataframe, the LCVARclust function does not allow any missing values.
```{r,  message=FALSE}
load("ExampleData/ExampleData.RData")
head(Dataset)
```


## Run the LCVARclust function {.large}


The following arguments can be specified:

- **Data**: The dataframe to be used.

- **yVars**: An integer vector specifying the position of the column(s) in dataframe **Data** that contain the endogenous variables (= the VAR time series).

- **Time**: An integer specifying the position of the column in dataframe **Data** that contains the time point.

- **ID**: An integer specifying the position of the column in dataframe **Data** that contains the ID variable for every participant.

- **Covariates**: Constraints on the parameters of the exogenous variable(s). So far only "equal-within-clusters" can be specified.

- **Clusters**: An integer vector specifying the numbers of mixture components (clusters) for which LCVAR models are to be calculated.

- **LowestLag**: An integer specifying the lowest number of VAR(_p_) lags to consider in the calculation of LCVAR models.

- **HighestLag**: An integer specifying the higest number of VAR(_p_) lags to consider in the calculation of LCVAR models.

- **smallestClN**: An integer specifying the lowest number of individuals allowed in a cluster. When during estimation the crisp cluster membership of a cluster indicates less than **smallestClN** individuals, the covariance matrix and the posterior probabilities of cluster membership are reset.

- **ICType**: The information criterion used to select the ideal model for a given number of clusters across all EM-starts and lag combinations. One of c("HQ", "SC", "AIC").

- **seme**: An integer specifying the value supplied to $\texttt{set.seed()}$. Using the same seed guarantees reproducibility of solutions.

- **Rand**: The number of pseudo-random EM-starts used in fitting each possible model.

- **Rational**: Should a rational EM-start be used as well? Accepts TRUE or FALSE.

- **SigmaIncrease**: A numerical value specifying the value by which every element of Sigma will be increased when posterior probabilities of cluster memberships are reset.

- **it**: An integer specifying the maximum number of EM-iterations allowed after every EM-start. After **it** EM-iterations an EM-start is forced to terminate.

- **Conv**: A numerical value specifying the convergence criterion of the log likelihood to determine convergence of an EM-start. For details see Ernst et al. (2020) Inter-individual differences in multivariate time series: Latent class vector-autoregressive modelling.

Optional arguments:

- **xContinuous**: An integer vector specifying the position of the column(s) in dataframe **Data** that contain the continuous exogenous variable(s).

- **xFactor**: An integer vector specifying the position of the column(s) in dataframe **Data** that contain the categorical exogenous variable(s).

- **Initialization**: An integer specifying the position of a column in dataframe **Data** that contains a guess at participants' cluster membership for a fixed number of clusters.


For every fixed number of clusters as specified in **Clusters**, each combination of lag orders between **LowestLag** and **HighestLag** corresponds to a different statistical model. The models associated with all possible combinations of lag orders are estimated. For each model, the algorithm uses several EM-starts based on: pseudo-random initializations (**Rand**), a k-means based rational initialization (**Rational**), a guess at cluster membership (**Initialization**), and the use of a previous solution (always used).
Every EM-start leads to one solution after several EM-iterations. Solutions are either reached because the likelihood converged (**Conv**) or because the maximum number of EM-iterations has been reached (**it**). Thus several solutions are reached for every possible statistical model. The ideal statistical model for a given number of clusters across all EM-starts and all lag combinations is determined with the information criterion specified in **ICType**. As a result, for every number of clusters specified in **Clusters** there will be one solution displayed in $\texttt{Result\$BestSolutionsPerCluster}$.


```{r}
Result = LCVARclust(Data = Dataset, yVars = 1:4, Time = 6, ID = 5,
                    Covariates = "equal-within-clusters",
                    Clusters = 2, LowestLag = 1, HighestLag = 2, smallestClN = 3,
                    ICType = "HQ", seme = 3, Rand = 2, Rational = TRUE,
                    SigmaIncrease = 10, it = 25, Conv = 1e-06, xContinuous = 7, xFactor = 8)
```


```{r, echo=FALSE}
 # #The number of pseudo-random EM-starts used in estimating every statistical model can be set via **Rand**; whether a rational EM-start should be included in the estimation of every statistical model can be set via **Rational**; Optionally, researchers can supply a guess at cluster membership via **Initialization**.
```


## Interpret the output {.large}

$\texttt{Result}$ contains two lists: $\texttt{BestSolutionsPerCluster}$ which contains the ideal solution for each number of clusters, and $\texttt{AllSolutions}$ which contains the solutions for all starts of all estimated models. The two lists are structured as follows:

+ $\texttt{Result\$BestSolutionsPerCluster[[a]]}$: contains the ideal solution for the ath number of clusters within range **Clusters** across all lag combinations and EM-starts.
+ $\texttt{Result\$AllSolutions[[a]][[b]][[c]]:}$ contains the solution for the ath number of clusters within range **Clusters** for the bth combination of lag orders on the cth EM-start.

The output for every solution contains:


- **Converged**: Whether this EM-start converged.

- **A**: The VAR(p)-coefficients for all clusters. The rows give the variables, the columns the lag coefficients. Lag coefficients quantify the influence the endogenous variables
(columns) have on the endogenous variables (rows) at future time points.

- **B**: The exogenous coefficients for all clusters.

- **EMRepetitions**: The Number of EM-iterations before the algorithm terminated.

- **last.loglik**: The log likelihood of the this model.

- **nPara**: The number of parameters of this model.

- **Sigma**: The error covariance matrix.

- **LogLikelihood**: A numeric vector showing the value of the log likelihood after every EM-iteration.

- **EMiterationReset**: A vector of logical values indicating whether any values were reset during an EM-iteration.

- **PosteriorProbs**: A numeric vector indicating the posterior probabilities of cluster membership of every person. The output is ordered by factor(**ID**).

- **Lags**: The number of VAR(_p_) lags for every cluster in this solution. In our example, the best solution for two number of clusters is based on 1 lag for both clusters, we can conclude that the information criterion specified in **ICType** selected a VAR(1) model for both clusters.

- **Classification**: The crisp cluster membership for every person in this solution. The output is ordered by factor(**ID**).

- **IC**: The value of the information criterion that was specified in **ICType**.

- **SC**: The value of the SC information criterion.

- **Proportions**: The estimated mixing proportions of the clusters.


```{r}

Result$BestSolutionsPerCluster

Result$AllSolutions[[1]][[2]][[4]]

```