Trees

Tree-based methods are a well-known family of modeling techniques used for both regression and classification. The general idea is to segment the feature space into distinct regions and make a simple prediction within each one. The rules for assigning observations to regions can be summarized as a tree, which is why these methods are sometimes called decision tree methods. Tree-based methods come in many variants, including ensemble approaches such as bagging, boosting, and random forests.
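The segmentation idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes a single numeric feature, a regression target, and squared error as the splitting criterion. The function names (`best_split`, `build_tree`, `predict`) and the toy data are made up for this example.

```python
from statistics import mean


def best_split(xs, ys):
    """Return the threshold on one feature that minimizes summed squared error."""
    best = None  # (sse, threshold)
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold)
    # None means no split is possible (fewer than two distinct feature values).
    return None if best is None else best[1]


def build_tree(xs, ys, max_depth=2):
    """Recursively split the data; each leaf predicts the mean of its region."""
    threshold = best_split(xs, ys) if max_depth > 0 else None
    if threshold is None:
        return {"value": mean(ys)}
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        "threshold": threshold,
        "left": build_tree([x for x, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right], max_depth - 1),
    }


def predict(tree, x):
    """Follow the split rules from the root down to a leaf."""
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


# Made-up toy data: the response jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

tree = build_tree(xs, ys, max_depth=1)
print(tree["threshold"])
print(predict(tree, 2), predict(tree, 11))
```

With `max_depth=1` the tree contains a single rule, so the feature space is cut into two regions and every observation in a region receives the same prediction. Increasing `max_depth` adds more rules and finer regions, at the cost of a tree that is harder to read.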

Common Applications

Common Problem Types

Regression
Classification
Rules-based segmentation
Problems where interpretability is critical

A Brief History

Tree-based methods were first published in the early 1960s and have since grown into a remarkably diverse set of techniques and approaches. That growth was aided by free software and cheaper hardware, which made computations that were challenging to do by hand relatively easy. Tree-based methods have also sometimes been used to enhance traditional models such as least squares and logistic regression. For a technical overview of the various approaches and a more in-depth history, see Loh (2014).

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These code examples are Jupyter notebooks that run in a container, which comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.

Boosting

Explore gradient boosting, a tree-based method, using XGBoost to analyze hotel customer data.

Quickstart: Download Docker, then run the commands below in a terminal.

# Pull the container image (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:boosting

# Run the container and expose Jupyter on port 8888
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:boosting
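To give a feel for what gradient boosting does before opening the notebook, here is a minimal sketch in plain Python. It is not the notebook's code and not how XGBoost is implemented: it assumes squared-error loss, a single numeric feature, and one-split trees (stumps) as the weak learners, and it leaves out the regularization and optimizations that XGBoost adds. The function names and the toy data are made up for this example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the current residuals."""
    best = None  # (sse, threshold, left_mean, right_mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step toward the stump's correction.
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return mean((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data with three rough levels in the response.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

model_1 = fit_boosted(xs, ys, n_rounds=1)
model_20 = fit_boosted(xs, ys, n_rounds=20)
print(mse(model_1, xs, ys), mse(model_20, xs, ys))
```

Each round only has to correct the errors left by the previous rounds, which is why many weak trees together can fit the data much better than any one of them. The training error here is measured on the same toy data used for fitting, so it says nothing about performance on new data; in practice the number of rounds and the learning rate are tuned on held-out data.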

Need help implementing any of this code? Feel free to reach out to [email protected] and we can help!

Resources

All resources were selected by Data Mine staff for their quality, and most, if not all, of the content is free.

Websites

Tree-based Methods (Penn State)

Tree-based models (Exam PA Study Guide)

Tree-based Methods (Stanford)

Videos

Decision and Classification Trees, Clearly Explained!!! (StatQuest with Josh Starmer, ~18 minutes)

Decision Tree Classification Clearly Explained! (Normalized Nerd, ~10 minutes)

Random Forest Algorithm Clearly Explained (Normalized Nerd, ~8 minutes)

Statistical Learning: 8.1 Tree based methods (Stanford Online, ~14 minutes)

Regression Decision Trees (IQmates, ~18 minutes)

Books

An Introduction to Statistical Learning (also known as the "machine learning bible"; see Chapter 8 for tree-based methods)

Tree-based Methods for Statistical Learning in R

Random Forests with R

Articles

Tree-based Machine Learning Methods for Survey Research (2019)

Tree-based Machine Learning Methods for Modeling and Forecasting Mortality (2022)

Modelling Soil Temperature by Tree-Based Machine Learning Methods in Different Climatic Regions of China (2022)