An introduction to ensemble methods for machine learning

Abstract

While single models can clearly and effectively explain data, particularly when inference and interpretation are a priority, aggregation of multiple models often provides tremendous gains in inference and prediction. Aggregated models, commonly known as ensembles, are a powerful addition to a statistician’s and data scientist’s “toolbox.” I provide an overview of several current benchmark ensemble algorithms in contemporary machine learning. My focus includes general bagging, random subspace learning, and boosting algorithms; the powerful random forest algorithm proposed by Breiman (2001) provides a focal point for this discussion. Such ensembles commonly deliver unmatched predictive accuracy without overfitting the training data; I explore the ways in which ensembles balance the trade-off between estimation bias and variance. Furthermore, I emphasize the use of ensembles in nonparametric function estimation. I develop an intuition for their use, strengths, and weaknesses by exploring data sets with implementations of ensemble algorithms available in the R programming language.

Date

Apr 23, 2016

Event

Fifth Annual Joint Conference of the Upstate Chapters of the American Statistical Association

CART, machine learning, data mining, ensembles, tutorial

Kenneth Tyler Wilcox

Statistical Consultant