Evaluation of the unsupervised latent Dirichlet allocation model through simulation

Directed acyclic graph of the LDA model.

Abstract

Topic modeling using latent Dirichlet allocation (LDA) has become an increasingly popular procedure in text mining. Unsupervised topic modeling focuses on identifying the correct number of latent topics and clustering words into those topics. This task is critical for latent topic modeling but understudied. Although LDA has been applied very widely in empirical research, evaluations of its performance have not covered all varieties of text corpora and their idiosyncrasies. For instance, corpora of interest such as short interview transcripts, social media posts, and online reviews of venues may contain documents of no more than one hundred words, yet the performance of topic modeling has not been fully examined for short text. In this paper, we develop a systematic strategy for simulating data to evaluate the performance of LDA. Based on the results of the unsupervised analyses, we demonstrate the effectiveness of LDA under various simulated conditions and provide recommendations on parameter choices for both simulation and empirical analysis. In addition, we illustrate the label switching issue that arises in simulation and provide methods for handling it in large-scale simulation studies for methodological research.
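To make the simulation idea and the label switching issue concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes symmetric Dirichlet priors, a fixed document length, and a small number of topics. The function names (`sample_dirichlet`, `simulate_corpus`, `align_topics`) and all parameter names are hypothetical and chosen only for illustration. Estimated topics are matched to true topics by brute-force search over permutations, which is feasible only for small K.

```python
import itertools
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Simulate a corpus from the LDA generative process (illustrative only).

    Returns (docs, phi): docs is a list of word-id lists, and phi holds the
    true topic-word distributions, one row per topic.
    """
    rng = random.Random(seed)
    # Topic-word distributions: phi_k ~ Dirichlet(beta, ..., beta)
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Document-topic proportions: theta_d ~ Dirichlet(alpha, ..., alpha)
        theta = sample_dirichlet([alpha] * n_topics, rng)
        # One topic assignment per word position, then one word per assignment
        topics = rng.choices(range(n_topics), weights=theta, k=doc_len)
        words = [rng.choices(range(vocab_size), weights=phi[z], k=1)[0] for z in topics]
        docs.append(words)
    return docs, phi


def align_topics(true_phi, est_phi):
    """Resolve label switching by matching estimated topics to true topics.

    Returns a tuple perm where perm[k] is the index of the estimated topic
    matched to true topic k, minimizing the total L1 distance between
    topic-word distributions. Brute force over permutations: small K only.
    """
    k = len(true_phi)

    def cost(perm):
        return sum(
            sum(abs(t - e) for t, e in zip(true_phi[i], est_phi[perm[i]]))
            for i in range(k)
        )

    return min(itertools.permutations(range(k)), key=cost)
```

For example, `simulate_corpus(n_docs=200, doc_len=50, n_topics=3, vocab_size=30, alpha=0.5, beta=0.1)` generates 200 short documents of 50 words each. If a fitted model returns the true topics in a different order, `align_topics(phi, est_phi)` recovers the permutation needed to compare estimates against the truth. For larger K, an assignment algorithm would be a more practical replacement for the exhaustive search.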

Date
May 26, 2020 — May 27, 2020
Location
University of Notre Dame
Notre Dame,
Kenneth Tyler Wilcox
Statistical Consultant

My research interests include integrative data analysis, meta-analysis, topic modeling, Bayesian statistics, multilevel modeling, statistical programming, and psychology.