The temporal and financial resources needed to qualitatively analyze textual data can be prohibitive. However, a growing collection of quantitative modeling approaches for text offer potential alternatives. One latent variable approach, topic modeling, assumes that text can be summarized as a mixture of latent categories (i.e., topics; Blei, Ng, & Jordan, 2003). By representing patterns of word co-occurrences as topics, major themes and features of the text can be extracted and studied on their own or used as variables in subsequent analyses. Applications of topic modeling in psychological research include prediction of students’ test scores (Kim, Kwak, & Cohen, 2017; Kim, Kwak, Cardozo-Gaibisso, Buxton, & Cohen, 2017), understanding open-ended survey questions regarding sources of anxiety (Rohrer, Brümmer, Schmukle, Goebel, & Wagner, 2017), and, in conjunction with item-response theory models, prediction of clinical diagnoses (He, 2013).
Many applications estimate topics and then use them as predictors in later regression models. However, this two-stage procedure ignores variability in the latent topic estimates which can produce biased estimates and incorrect Type I error rates for the regression coefficients. Blei and McAuliffe (2010) proposed a one-stage model, supervised topic modeling (SLDA), wherein the latent topics predict an observed outcome and showed that it can outperform two-stage approaches. However, SLDA does not allow for the inclusion of additional predictors of the outcome which limits its utility for psychological research. Therefore, we extended SLDA to jointly estimate a latent variable model of the text and a regression model to predict an outcome using both the latent topics and additional predictors. The SLDAX model consists of two primary components: (1) a topic model for the text and (2) a regression model predicting the outcome using the topics and additional predictors. See Figure 1 for a graphical representation of the model and Figure 2 for the model as a generative process.
To estimate the SLDAX model, we analytically derived a Gibbs sampling algorithm. We further implemented the model in an R package to help researchers adopt the model. To evaluate the model and the Gibbs estimation method, we conducted a simulation study based on an SLDAX model with a continuous predictor (
where
Figure 1. Directed acyclic graphical representation of the SLDAX model
Figure 2. Generative process for the SLDAX model.
We assume a vocabulary of V unique words and let
Draw
Draw residual variance
For topic
4.1) Draw
4.2) For word
4.2.1) Draw topic indicator
4.2.2) Draw word
4.3) Draw outcome
Blei, D. M., & McAuliffe, J. D. (2010). Supervised topic models. ArXiv:1003.0783 [Stat]. Retrieved from https://arxiv.org/abs/1003.0783
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993
Cohen, J. (1988). Multiple regression and correlation analysis. In Statistical power analysis for the behavioral sciences (2nd ed., pp. 407–467). Hillsdale, NJ: Lawrence Erlbaum Associates.
He, Q. (2013). Text mining and IRT for psychiatric and psychological assessment (Dissertation). University of Twente. https://doi.org/10.3990/1.9789036500562
Kim, S., Kwak, M., Cardozo-Gaibisso, L., Buxton, C., & Cohen, A. S. (2017). Statistical and qualitative analyses of students’ answers to a constructed response test of science inquiry knowledge. Journal of Writing Analytics, 1(1), 82–102. Retrieved from https://journals.colostate.edu/index.php/analytics/article/view/120/87
Kim, S., Kwak, M., & Cohen, A. S. (2017). A mixture partial credit model analysis using language-based covariates. In L. A. van der Ark, M. Wiberg, S. A. Culpepper, J. A. Douglas, & W. C. Wang (Eds.), Quantitative Psychology (pp. 321–333). https://doi.org/10.1007/978-3-319-56294-0_28
Rohrer, J. M., Brümmer, M., Schmukle, S. C., Goebel, J., & Wagner, G. G. (2017). “What else are you worried about?” – Integrating textual responses into quantitative social science research. PLoS ONE, 12(7), e0182156. https://doi.org/10.1371/journal.pone.0182156