An introduction to ensemble methods for machine learning

Abstract

While single models can clearly and effectively explain data, particularly when inference and interpretation are priorities, aggregating multiple models often provides tremendous gains in prediction. Aggregated models, commonly known as ensembles, are a powerful addition to a statistician’s and data scientist’s “toolbox.” I provide an overview of several current benchmark ensemble algorithms in contemporary machine learning. My focus includes bagging, random subspace learning, and boosting algorithms; the powerful random forest algorithm proposed by Breiman (2001) provides a focal point for this discussion. Such ensembles commonly deliver unmatched predictive accuracy without overfitting the training data; I explore the ways in which ensembles balance the trade-off between estimation bias and variance. Furthermore, I emphasize the use of ensembles in nonparametric function estimation. An intuition for their use, strengths, and weaknesses is developed through exploration of data sets using implementations of ensemble algorithms available in the R programming language.
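As a small taste of the R implementations mentioned above, the sketch below fits a random forest (Breiman, 2001) with the randomForest package. The data set (iris) and the hyperparameter values are illustrative assumptions for this page, not details taken from the talk itself.

```r
# Minimal sketch: a random forest in R via the randomForest package.
# iris, ntree, and mtry are illustrative choices, not the talk's settings.
library(randomForest)

set.seed(2016)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,  # number of bagged trees in the ensemble
                    mtry  = 2)    # predictors sampled per split (random subspace idea)

print(fit)        # includes the out-of-bag (OOB) estimate of generalization error
importance(fit)   # ensemble-based variable importance
```

The OOB error printed by the fit illustrates one appeal of bagging: each tree is trained on a bootstrap sample, so the held-out observations provide a built-in estimate of predictive accuracy without a separate validation set.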

Date
Apr 23, 2016
Kenneth Tyler Wilcox
Statistical Consultant

My research interests include integrative data analysis, meta-analysis, topic modeling, Bayesian statistics, multilevel modeling, statistical programming, and psychology.