Supervised latent Dirichlet allocation with covariates: A Bayesian structural and measurement model of text and covariates

Kenneth Tyler Wilcox, Ross Jacobucci, Zhiyong Zhang, Brooke A. Ammerman

January 2021

Abstract

Text is a burgeoning data source for psychological researchers, but little methodological research has focused on adapting popular modeling approaches for text to the context of psychological research. One popular measurement model for text, topic modeling, uses a latent mixture model to represent topics underlying a body of documents. Recently, psychologists have studied relationships between these topics and other psychological measures by using estimates of the topics as regression predictors along with other manifest variables. While similar two-stage approaches involving estimated latent variables are known to yield biased estimates and incorrect standard errors, two-stage topic modeling approaches have received limited statistical study and, as we show, are subject to the same problems. To address these problems, we propose a novel statistical model, supervised latent Dirichlet allocation with covariates (SLDAX), that jointly incorporates a latent variable measurement model of text and a structural regression model so that the latent topics and other manifest variables can serve as predictors of an outcome. Using a simulation study with data characteristics consistent with psychological text data, we found that SLDAX estimates were generally more accurate and more efficient than those from the two-stage approach. We then compare the two-stage and SLDAX approaches in an empirical clinical application. Finally, we implemented the SLDAX model in an open-source R package to facilitate its use and further study.

Type

Preprint

Publication

PsyArXiv

Directed acyclic graph of the SLDAX model. Observed variables are represented by shaded circles: $w_{dn}$ denotes the $n$th word in document $d$; for subject $d$, $\vec{x}_d$ denotes $p$ predictor scores and $y_d$ denotes the outcome. Latent variables are represented by unshaded circles: $z_{dn}$ denotes topic assignments for each word in each document; $\vec{\theta}_d$ denotes the $K$ topic proportions for each document; $\vec{\beta}_k$ denotes the $V$ topic-word probabilities for topic $k$; $\vec{\eta}$ denotes the regression coefficients relating $\vec{x}_d$ and $\vec{\bar{z}}_d$ to $y_d$; $\sigma^2$ denotes the residual variance of $Y$. Fixed parameters are represented by dots: $\vec{\alpha}$ denotes the hyperparameters of the topic probabilities; $\vec{\gamma}$ denotes the hyperparameters of the topic-word probabilities; $\vec{\mu}_0$ and $\vec{\Sigma}_0$ denote the prior mean vector and covariance matrix of $\vec{\eta}$, respectively; $a_0$ and $b_0$ are the shape and rate hyperparameters for $\sigma^2$. A set of (conditionally) independent replicates (i.e., words given topics; documents; word probabilities given a topic) is represented by a rectangle.
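To make the structure in the graph concrete, the following is a minimal simulation sketch of the data-generating process the caption describes, written in Python with only the standard library. It is not the authors' R implementation. Several details are assumptions that the caption does not state: Dirichlet distributions for $\vec{\theta}_d$ and $\vec{\beta}_k$, categorical draws for $z_{dn}$ and $w_{dn}$, a normal outcome with no separate intercept, and $\vec{\bar{z}}_d$ taken as the empirical topic proportions of document $d$. For simplicity the regression coefficients and residual standard deviation are fixed at arbitrary illustrative values instead of being drawn from their priors, and the function name `simulate_sldax` and all default values are hypothetical.

```python
import random


def dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def simulate_sldax(D=50, N=40, K=3, V=20, p=2, seed=1):
    """Simulate D documents of N words with p covariates and one outcome each.

    All distributional choices and numeric values here are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    alpha = [1.0] * K                      # assumed symmetric hyperparameters for theta_d
    gamma = [0.5] * V                      # assumed symmetric hyperparameters for beta_k
    eta_x = [0.5] * p                      # assumed coefficients for the covariates x_d
    eta_z = [float(k) for k in range(K)]   # assumed coefficients for zbar_d
    sigma = 0.5                            # assumed residual standard deviation

    # Topic-word probabilities: one length-V probability vector per topic.
    beta = [dirichlet(gamma, rng) for _ in range(K)]

    docs = []
    for _ in range(D):
        theta = dirichlet(alpha, rng)                    # topic proportions
        z = rng.choices(range(K), weights=theta, k=N)    # topic assignment per word
        w = [rng.choices(range(V), weights=beta[k])[0] for k in z]  # word indices
        zbar = [z.count(k) / N for k in range(K)]        # empirical topic proportions
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]      # manifest covariates
        mean = sum(e * xi for e, xi in zip(eta_x, x))
        mean += sum(e * zk for e, zk in zip(eta_z, zbar))
        y = rng.gauss(mean, sigma)                       # outcome
        docs.append({"w": w, "z": z, "zbar": zbar, "x": x, "y": y})
    return beta, docs


if __name__ == "__main__":
    beta, docs = simulate_sldax()
    print(len(docs), "documents;", len(docs[0]["w"]), "words in the first document")
```

In this sketch the outcome depends on the text only through $\vec{\bar{z}}_d$, which is the link between the measurement part (words and topics) and the structural part (regression) that the graph depicts.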

Kenneth Tyler Wilcox

Statistical Consultant