Classification modeling workflow using tidymodels

tidyverse

tidymodels

stacks

XGBoost

Building a classification pipeline on the Weight Lifting Exercises dataset with tidymodels — workflowsets for cross-validated tuning across four candidate models, plus a stacks ensemble.

Author

Konstantinos Patelis

Published

April 11, 2021

Modified

May 12, 2026

A while back, I was working through the final assignment in the Practical Machine Learning Coursera course (part of the JHU Data Science Specialization), which entailed creating a model to predict the way people perform a weight-lifting exercise using data from accelerometers on the belt, forearm, arm, and dumbell of each participant. I thought this was a good opportunity to practice using the tidymodels family of packages to tackle this classification problem. So, in this post we will go through the series of steps to create our predictive model. We will cover defining the data pre-processing, specifying the model(s) to fit and using cross-validation to tune model hyperparameters. Additionally, we’ll have a look at one of the recent additions to the tidymodels packages, stacks to create an ensemble model out of our base models. We’ll see that most of our models perform almost equally well and an ensemble model is not required for achieving improved accuracy, and is presented mostly because this was a good opportunity to try it out. This blog post documents the approach taken to model this dataset.

Note

This post has been refreshed since it was first published in April 2021. The modeling section now uses the workflowsets package — which became available on CRAN as I was originally writing the post and which the original conclusion pointed at as the obvious next step — to tune all four candidate models in a single pipeline. Parallel execution has also been migrated from doParallel to the future framework that tune 1.2.0 adopted in 2024. See the Modified date in the title block for the refresh date.

Data

We’ll use the Weight Lifting Exercises data set (Velloso et al. 2013), provided as part of the Human Activity Recognition project. The data available to us is split in two parts, one is the training set, which contains the classe variable which we want to train our model to predict, and the quiz set, that contains 20 observations for which we needed to predict classe as part of the course assignment. For this blog post we’ll focus on the first data set, which we will split in two parts, one used for actually training the model and one to assess its accuracy.

## Libraries

# General data wrangling
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)

# Modeling packages
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

library(stacks)

Registered S3 method overwritten by 'butcher':
  method                 from    
  as.character.dev_topic generics

# Visualization
library(corrr)


Attaching package: 'corrr'

The following object is masked from 'package:skimr':

    focus

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Parallelization
library(future)

# EDA - will not be showing the outputs from using these packages, but very useful for exploration
# library(DataExplorer)
# library(explore)

theme_set(theme_bw())

## Data

initial_data <- read_data("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv") %>% select(-1) # read_data is a wrapper around read_csv

New names:
• `` -> `...1`

EDA

Let’s split the initial data a training set and an test set (80/20 split). For the exploratory analysis and subsequent modeling, hyperparameter tuning and model evaluation I will use the training data set. Then the model will be used on the test data to predict out-of-sample accuracy.

set.seed(1992)

split <- initial_split(initial_data, prop = .8, strata = classe)

train <-  training(split)
test <- testing(split)

skim(train)

Data summary
Name	train
Number of rows	15695
Number of columns	159
_______________________
Column type frequency:
character	105
numeric	53
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
user_name	0	1.00	5	8	6
raw_timestamp_part_1	0	1.00	10	10	837
raw_timestamp_part_2	0	1.00	3	6	13803
new_window	0	1.00	2	3	2
kurtosis_roll_belt	15382	0.02	7	9	306
kurtosis_picth_belt	15382	0.02	7	9	260
kurtosis_yaw_belt	15382	0.02	7	7	1
skewness_roll_belt	15382	0.02	7	9	304
skewness_roll_belt.1	15382	0.02	7	9	271
skewness_yaw_belt	15382	0.02	7	7	1
max_roll_belt	15382	0.02	1	5	165
max_picth_belt	15382	0.02	1	2	20
max_yaw_belt	15382	0.02	3	7	59
min_roll_belt	15382	0.02	2	5	162
min_pitch_belt	15382	0.02	1	2	15
min_yaw_belt	15382	0.02	3	7	59
amplitude_roll_belt	15382	0.02	1	5	125
amplitude_pitch_belt	15382	0.02	1	2	13
amplitude_yaw_belt	15382	0.02	4	7	3
var_total_accel_belt	15382	0.02	1	6	57
avg_roll_belt	15382	0.02	1	8	159
stddev_roll_belt	15382	0.02	1	6	61
var_roll_belt	15382	0.02	1	7	80
avg_pitch_belt	15382	0.02	1	8	182
stddev_pitch_belt	15382	0.02	1	6	39
var_pitch_belt	15382	0.02	1	6	55
avg_yaw_belt	15382	0.02	1	8	203
stddev_yaw_belt	15382	0.02	1	6	53
var_yaw_belt	15382	0.02	1	8	127
var_accel_arm	15382	0.02	1	8	308
avg_roll_arm	15382	0.02	1	9	255
stddev_roll_arm	15382	0.02	1	8	255
var_roll_arm	15382	0.02	1	10	255
avg_pitch_arm	15382	0.02	1	8	255
stddev_pitch_arm	15382	0.02	1	7	255
var_pitch_arm	15382	0.02	1	9	255
avg_yaw_arm	15382	0.02	1	9	255
stddev_yaw_arm	15382	0.02	1	8	254
var_yaw_arm	15382	0.02	1	10	254
kurtosis_roll_arm	15382	0.02	7	8	254
kurtosis_picth_arm	15382	0.02	7	8	254
kurtosis_yaw_arm	15382	0.02	7	8	308
skewness_roll_arm	15382	0.02	7	8	255
skewness_pitch_arm	15382	0.02	7	8	254
skewness_yaw_arm	15382	0.02	7	8	307
max_roll_arm	15382	0.02	1	5	234
max_picth_arm	15382	0.02	1	5	209
max_yaw_arm	15382	0.02	1	2	51
min_roll_arm	15382	0.02	1	5	222
min_pitch_arm	15382	0.02	1	5	233
min_yaw_arm	15382	0.02	1	2	38
amplitude_roll_arm	15382	0.02	1	5	242
amplitude_pitch_arm	15382	0.02	1	6	233
amplitude_yaw_arm	15382	0.02	1	2	51
kurtosis_roll_dumbbell	15382	0.02	6	7	305
kurtosis_picth_dumbbell	15382	0.02	6	7	310
kurtosis_yaw_dumbbell	15382	0.02	7	7	1
skewness_roll_dumbbell	15382	0.02	6	7	310
skewness_pitch_dumbbell	15382	0.02	6	7	309
skewness_yaw_dumbbell	15382	0.02	7	7	1
max_roll_dumbbell	15382	0.02	1	5	270
max_picth_dumbbell	15382	0.02	2	6	277
max_yaw_dumbbell	15382	0.02	3	7	66
min_roll_dumbbell	15382	0.02	1	6	274
min_pitch_dumbbell	15382	0.02	1	6	277
min_yaw_dumbbell	15382	0.02	3	7	66
amplitude_roll_dumbbell	15382	0.02	1	6	298
amplitude_pitch_dumbbell	15382	0.02	1	6	295
amplitude_yaw_dumbbell	15382	0.02	4	7	2
var_accel_dumbbell	15382	0.02	1	8	296
avg_roll_dumbbell	15382	0.02	5	9	307
stddev_roll_dumbbell	15382	0.02	1	8	301
var_roll_dumbbell	15382	0.02	1	10	301
avg_pitch_dumbbell	15382	0.02	5	8	307
stddev_pitch_dumbbell	15382	0.02	1	7	301
var_pitch_dumbbell	15382	0.02	1	9	301
avg_yaw_dumbbell	15382	0.02	5	9	307
stddev_yaw_dumbbell	15382	0.02	1	7	301
var_yaw_dumbbell	15382	0.02	1	9	301
kurtosis_roll_forearm	15382	0.02	6	7	245
kurtosis_picth_forearm	15382	0.02	6	7	245
kurtosis_yaw_forearm	15382	0.02	7	7	1
skewness_roll_forearm	15382	0.02	6	7	246
skewness_pitch_forearm	15382	0.02	6	7	241
skewness_yaw_forearm	15382	0.02	7	7	1
max_roll_forearm	15382	0.02	1	5	220
max_picth_forearm	15382	0.02	1	5	123
max_yaw_forearm	15382	0.02	3	7	43
min_roll_forearm	15382	0.02	1	5	213
min_pitch_forearm	15382	0.02	1	5	139
min_yaw_forearm	15382	0.02	3	7	43
amplitude_roll_forearm	15382	0.02	1	5	229
amplitude_pitch_forearm	15382	0.02	1	6	144
amplitude_yaw_forearm	15382	0.02	4	7	2
var_accel_forearm	15382	0.02	1	9	310
avg_roll_forearm	15382	0.02	1	10	245
stddev_roll_forearm	15382	0.02	1	9	243
var_roll_forearm	15382	0.02	1	11	243
avg_pitch_forearm	15382	0.02	1	9	247
stddev_pitch_forearm	15382	0.02	1	8	247
var_pitch_forearm	15382	0.02	1	10	247
avg_yaw_forearm	15382	0.02	1	10	247
stddev_yaw_forearm	15382	0.02	1	9	245
var_yaw_forearm	15382	0.02	1	11	245
classe	0	1.00	1	1	5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num_window	1	432.51	248.50	1.00	223.00	427.00	648.00	864.00	▇▇▇▇▇
roll_belt	1	64.69	62.73	-28.90	1.10	114.00	123.00	162.00	▇▁▁▅▅
pitch_belt	1	0.18	22.45	-55.80	1.67	5.27	14.90	60.30	▃▁▇▅▁
yaw_belt	1	-10.74	95.50	-180.00	-88.30	-12.30	13.70	179.00	▁▇▅▁▃
total_accel_belt	1	11.34	7.74	0.00	3.00	17.00	18.00	29.00	▇▁▂▆▁
gyros_belt_x	1	-0.01	0.21	-1.00	-0.03	0.03	0.11	2.22	▁▇▁▁▁
gyros_belt_y	1	0.04	0.08	-0.53	0.00	0.02	0.11	0.64	▁▁▇▁▁
gyros_belt_z	1	-0.13	0.24	-1.46	-0.20	-0.10	-0.02	1.62	▁▂▇▁▁
accel_belt_x	1	-5.41	29.74	-120.00	-21.00	-14.00	-5.00	83.00	▁▁▇▁▂
accel_belt_y	1	30.26	28.52	-65.00	3.00	36.00	61.00	164.00	▁▇▇▁▁
accel_belt_z	1	-73.05	100.39	-275.00	-162.00	-153.00	27.00	104.00	▁▇▁▅▃
magnet_belt_x	1	55.98	64.44	-52.00	9.00	35.00	60.00	485.00	▇▁▂▁▁
magnet_belt_y	1	593.68	35.62	354.00	581.00	601.00	610.00	673.00	▁▁▁▇▃
magnet_belt_z	1	-345.54	65.02	-623.00	-375.00	-320.00	-306.00	293.00	▁▇▁▁▁
roll_arm	1	18.22	72.38	-180.00	-31.20	0.00	77.20	180.00	▁▃▇▆▂
pitch_arm	1	-4.61	30.75	-88.20	-25.70	0.00	11.20	88.50	▁▅▇▂▁
yaw_arm	1	-0.50	71.12	-180.00	-42.70	0.00	45.60	180.00	▁▃▇▃▂
total_accel_arm	1	25.44	10.53	1.00	17.00	27.00	33.00	66.00	▃▆▇▁▁
gyros_arm_x	1	0.04	1.99	-6.37	-1.35	0.08	1.56	4.87	▁▃▇▆▂
gyros_arm_y	1	-0.26	0.85	-3.44	-0.80	-0.24	0.14	2.81	▁▂▇▂▁
gyros_arm_z	1	0.27	0.55	-2.33	-0.07	0.25	0.72	3.02	▁▂▇▂▁
accel_arm_x	1	-60.27	181.43	-404.00	-241.00	-43.00	83.00	435.00	▆▆▇▅▁
accel_arm_y	1	32.11	109.77	-315.00	-54.00	13.00	139.00	308.00	▁▃▇▆▂
accel_arm_z	1	-70.98	134.52	-636.00	-142.00	-47.00	23.00	271.00	▁▁▅▇▁
magnet_arm_x	1	191.39	443.76	-584.00	-299.00	284.00	638.00	780.00	▆▃▂▃▇
magnet_arm_y	1	156.45	202.35	-392.00	-10.00	202.00	323.00	583.00	▁▅▅▇▂
magnet_arm_z	1	306.97	326.20	-597.00	133.00	443.00	545.00	694.00	▁▂▂▃▇
roll_dumbbell	1	24.01	69.84	-153.51	-17.42	48.42	67.58	153.55	▂▂▃▇▂
pitch_dumbbell	1	-10.79	36.94	-149.59	-40.69	-21.10	17.64	149.40	▁▆▇▂▁
yaw_dumbbell	1	1.48	82.56	-150.87	-77.75	-4.16	79.97	154.75	▃▇▅▅▆
total_accel_dumbbell	1	13.71	10.21	0.00	4.00	10.00	19.00	42.00	▇▅▃▃▁
gyros_dumbbell_x	1	0.17	0.39	-1.99	-0.03	0.13	0.35	2.20	▁▁▇▁▁
gyros_dumbbell_y	1	0.04	0.48	-2.10	-0.14	0.05	0.21	4.37	▁▇▁▁▁
gyros_dumbbell_z	1	-0.15	0.32	-2.38	-0.31	-0.13	0.03	1.72	▁▁▇▂▁
accel_dumbbell_x	1	-28.62	67.02	-237.00	-50.00	-8.00	11.00	235.00	▁▂▇▁▁
accel_dumbbell_y	1	52.77	80.36	-189.00	-8.00	42.00	111.00	302.00	▁▇▇▅▁
accel_dumbbell_z	1	-38.32	109.57	-334.00	-141.00	-1.00	38.00	318.00	▁▆▇▃▁
magnet_dumbbell_x	1	-329.18	340.01	-643.00	-535.00	-480.00	-306.00	592.00	▇▂▁▁▂
magnet_dumbbell_y	1	220.95	326.12	-3600.00	232.00	311.00	389.00	633.00	▁▁▁▁▇
magnet_dumbbell_z	1	46.05	140.02	-262.00	-45.00	14.00	95.00	452.00	▁▇▆▂▂
roll_forearm	1	33.82	107.89	-180.00	-0.66	21.20	140.00	180.00	▃▂▇▂▇
pitch_forearm	1	10.67	27.97	-72.50	0.00	9.19	28.40	89.80	▁▁▇▃▁
yaw_forearm	1	19.76	103.09	-180.00	-68.00	0.00	110.00	180.00	▅▅▇▆▇
total_accel_forearm	1	34.77	10.06	0.00	29.00	36.00	41.00	79.00	▁▃▇▁▁
gyros_forearm_x	1	0.16	0.63	-4.95	-0.21	0.05	0.58	3.97	▁▁▇▃▁
gyros_forearm_y	1	0.06	2.17	-7.02	-1.48	0.03	1.62	6.13	▁▃▇▆▁
gyros_forearm_z	1	0.14	0.60	-8.09	-0.18	0.08	0.49	4.10	▁▁▁▇▁
accel_forearm_x	1	-62.55	180.36	-498.00	-179.00	-57.00	75.00	477.00	▂▆▇▅▁
accel_forearm_y	1	164.76	200.25	-632.00	57.00	201.00	313.00	591.00	▁▂▃▇▃
accel_forearm_z	1	-54.99	138.27	-446.00	-181.00	-39.00	26.00	291.00	▁▇▅▅▃
magnet_forearm_x	1	-314.94	345.59	-1280.00	-617.00	-382.00	-78.00	666.00	▁▇▇▅▁
magnet_forearm_y	1	382.40	508.76	-896.00	8.50	593.00	738.00	1480.00	▂▂▂▇▁
magnet_forearm_z	1	393.79	369.52	-973.00	193.00	511.00	653.00	1090.00	▁▁▂▇▃

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
cvtd_timestamp	0	1	2011-11-28 14:13:00	2011-12-05 14:24:00	2011-12-02 13:35:00	20

It seems like the majority of the variables are numeric, but one important thing to note is that there is a high percentage of missing observations for a subset of the variables. From the completion rate, it seems that the missing values across the different attributes occur for the same observations. Since the majority of observations have missing values for these variables, it is unlikely that we could impute them. When viewing the data in spreadsheet applications these variables have a mix of being coded as NA or being simply blank, while even for observations where values are available, there are instances of a value showing as #DIV/0.

eda <- train %>% select(where(~ !any(is.na(.))))

One important thing in classification problems is to investigate whether there is imbalance between the different classes in our training data. For example, if a class is over-represented in the data then our classifier might tend to over-predict that class. Let’s check how many times each class appears in the data.

ggplot(eda, aes(classe, fill = classe)) +
  geom_bar() +
  theme_bw()

Looking at the above, there does not seem to be severe imbalance among classes. The first class, A is represented more compared to the other ones, but it’s not to an overwhelming extent.

Considering the other columns in our data, user name cannot be part of our predictors because it cannot generalize to other data if our model was used on data that is not part of this study. Looking at the timestamp, each subject was fully measured at a different point in the day, with each exercise happening back-to-back. From Velloso et al. (2013) (Velloso et al. 2013), we know that all subjects were observed in the presence of a professional trainer to observe that the exercise was done according to specification each time. I will not consider the timestamps or any features derived from them (e.g. using the time of day) as a predictor in our models.

eda %>%
  mutate(new_window = as.numeric(factor(new_window))) %>%
  select(where(is.numeric)) %>%
  correlate(method = "pearson", quiet = TRUE) %>%
  shave() %>%
  rplot() %>%
  ggplotly()

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the corrr package.
  Please report the issue at <https://github.com/tidymodels/corrr/issues>.

Even excluding the columns mentioned above, there are a lot of features and presenting more information on EDA here would get too extensive. Since the purpose here is to demonstrate the modeling process with tidymodels, we will not be performing a more extensive EDA. For quick data exploration, you can use DataExplorer::create_report(eda) and explore::explore(eda) (after installing the two packages) to get a full data report on the data set from the former and a shiny app for interactive exploration from the latter.

Modeling

In this section we will define the recipe to pre-process the data, specify the candidate models and combine them in a workflow_set. We then tune all of them with a single call and compare their performance.

Pre-processing Recipe

I will use the recipes package to provide a specification of all the transformations to the data set before I fit any models, which will ensure that the same transformations are applied to the training, test and quiz data in the same way. Furthermore, it helps avoid information leakage as the transformations will be applied to all data sets using the statistics calculated for the training data. Specifically, I will remove all variables with missing values as well as other attributes discussed above, perform a transformation to try to make all predictors more symmetric (which is useful for models that benefit from predictors with distributions close to the Gaussian), normalize all variables (particularly important for the KNN and glmnet models) and then removing any predictors, if any, with very small variance. Note that we could define different pre-processing steps if required by the various models we will be tuning.

# Creating a vector with the names of all variables that should be removed because they contain NAs

cols_rm <- train %>%
  select(where(~ any(is.na(.)))) %>%
  colnames()

model_recipe <- recipe(classe ~ ., data = train) %>%
  step_rm(all_of(!!cols_rm), all_nominal_predictors(),
          cvtd_timestamp, num_window,
          new_window) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors())

# Below rows can be used to perform the transformation on the training set. Since we will be using the workflow package, this is not required.

# model_recipe_prepped <- model_recipe %>% prep()
# baked_recipe <- model_recipe_prepped %>% bake(new_data = NULL)

Model Specification

In this section I will use the parsnip package to create model specifications and set which model parameters will need to be tuned to ensure higher model performance. I will be trying four different models:

Random Forests (rf) - with 1000 trees and we will tune the number of predictors at each node split and the minimum number of data points in a node required for the node to be further split.
K Nearest Neighbours (knn) - with a tunable number k of neighbours, kernel function with which to weight distances, and the parameter for the Minkowski distance.
Multinomial Logistic Regression with Regularization (lm) - with a tunable regularization penalty.
Boosted Trees (boost) - where we tune the number of trees, the learning rate, tree depth, number of predictors at each node split and the minimum number of data points in a node.

With parsnip we can create the specification in a similar manner across models and specify “computational engines” - practically R functions/packages that implement the calculations.

rf_model <- rand_forest(
                        mtry = tune(),
                        min_n = tune(),
                        trees = 1000
                        ) %>%
            set_engine("ranger") %>%
            set_mode("classification")


# rf_fit <- fit(rf_model, classe ~ ., data = baked_recipe) # This call could be used to fit the model to the training data, but we will be using the workflows interface

knn_model <- nearest_neighbor(
                              neighbors = tune(),
                              weight_func = tune(),
                              dist_power = tune()
                              ) %>%
             set_engine("kknn") %>%
             set_mode("classification")

# knn_fit <- fit(knn_model, classe ~ ., data = baked_recipe)

lasso_model <- multinom_reg(
                             penalty = tune(),
                             mixture = 1
                             ) %>%
                set_engine("glmnet")

# lasso_fit <- fit(lasso_model, classe ~ ., data = baked_recipe)

boost_model <- boost_tree(
                          trees = tune(),
                          mtry = tune(),
                          min_n = tune(),
                          learn_rate = tune(),
                          tree_depth = tune()
                          ) %>%
               set_engine("xgboost") %>%
               set_mode("classification")

# boost_fit <- fit(boost_model, classe ~ ., data = baked_recipe)

Instead of building one workflow per model, we use the workflowsets package to pair the single pre-processing recipe with each of the four model specs in one go. The resulting workflow_set is a tibble with one row per model, and the rest of the modelling section operates on it directly.

wfs <- workflow_set(
  preproc = list(rec = model_recipe),
  models  = list(rf = rf_model, knn = knn_model, lasso = lasso_model, boost = boost_model)
)

wfs

# A workflow set/tibble: 4 × 4
  wflow_id  info             option    result    
  <chr>     <list>           <list>    <list>    
1 rec_rf    <tibble [1 × 4]> <opts[0]> <list [0]>
2 rec_knn   <tibble [1 × 4]> <opts[0]> <list [0]>
3 rec_lasso <tibble [1 × 4]> <opts[0]> <list [0]>
4 rec_boost <tibble [1 × 4]> <opts[0]> <list [0]>

Model Tuning

Let us tune the different model parameters using 10-fold cross-validation. To create the grid with the combinations of parameters we can use a space-filling design with 30 points, based on which 30 combinations of the parameters will be picked such that they cover the most area in the design space. The dials package contains sensible default ranges for the most common hyperparameters that are tuned in models. The user can modify those if required, and for some, the default range depends on the number of features. One such example is the mtry parameter in random forests and boosted trees algorithms, whose max value is equal to the number of predictors in the processed data set. Since the number of predictors is known after preparing the recipe, we use finalize() to set the upper bound of mtry for the two affected models and then attach those parameter sets to the relevant rows of the workflow set via option_add().

# Number of predictors after pre-processing — needed to finalize mtry for rf and boost.

trained_data <- model_recipe %>% prep() %>% bake(new_data = NULL)

rf_param    <- extract_parameter_set_dials(rf_model)    %>% finalize(trained_data)
boost_param <- extract_parameter_set_dials(boost_model) %>% finalize(trained_data)

wfs <- wfs %>%
  option_add(param_info = rf_param,    id = "rec_rf") %>%
  option_add(param_info = boost_param, id = "rec_boost")

rf_param %>% extract_parameter_dials("mtry")

# Randomly Selected Predictors (quantitative)

Range: [1, 53]

When tuning a model, it is always important to consider what we are trying to optimize the model (e.g. achieve highest possible accuracy, maximize true positives, etc). For our problem, the aim is to accurately predict the class of each observation, so at the end of the tuning process we will pick the hyperparameters that achieve highest accuracy. We pass an explicit metric_set(accuracy, roc_auc) to workflow_map() so that probability predictions (roc_auc) are also produced — they’re needed downstream by stacks. Parallel execution is handled by the future framework, which tune 1.2.0 adopted in 2024 in place of the older foreach/doParallel setup.

set.seed(9876)

folds <- vfold_cv(data = train, v = 10, strata = classe)

plan(multisession, workers = 10)

tuned_set <- wfs %>%
  workflow_map(
    "tune_grid",
    resamples = folds,
    grid      = 30,
    metrics   = metric_set(accuracy, roc_auc),
    control   = control_grid(
      save_pred     = TRUE,
      save_workflow = TRUE,
      allow_par     = TRUE,
      parallel_over = "resamples"
    ),
    seed    = 1992,
    verbose = TRUE
  )

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

i 1 of 4 tuning:     rec_rf

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

→ A | warning: ! 53 columns were requested but there were 52 predictors in the data.
               ℹ 52 predictors will be used.

✔ 1 of 4 tuning:     rec_rf (38m 46.7s)

i 2 of 4 tuning:     rec_knn

✔ 2 of 4 tuning:     rec_knn (54m 56.8s)

i 3 of 4 tuning:     rec_lasso

✔ 3 of 4 tuning:     rec_lasso (1m 10.9s)

i 4 of 4 tuning:     rec_boost

✔ 4 of 4 tuning:     rec_boost (11m 27.9s)

plan(sequential)

In-sample Accuracy

One can use the collect_metrics() function — or its workflow-set counterparts like extract_workflow_set_result() and rank_results() — to visualize the average accuracy for each combination of parameters (averaging across resamples), and see the various hyperparameters that achieve such accuracy.

tuned_set %>% extract_workflow_set_result("rec_rf") %>% autoplot(metric = "accuracy")

We can see that for the random forests model a combination of around 15-20 predictors and a minimal node size in the range between 5-15 seem to be optimal.

tuned_set %>% extract_workflow_set_result("rec_knn") %>% autoplot(metric = "accuracy")

For K-NN, a small number of neighbours is preferred, while Minkowski Distance of order 0.25 seems to perform best.

tuned_set %>% extract_workflow_set_result("rec_lasso") %>% autoplot(metric = "accuracy")

Small penalty is preferred for the LASSO model and it seems that up to a point, similar accuracy levels are achieved.

tuned_set %>% extract_workflow_set_result("rec_boost") %>% autoplot(metric = "accuracy")

For boosted trees, it seems that a higher learning rate is better. Higher tree depth (especially in the range of 9-14) seems to provide best results, while the number of trees and the minimal node size seem to have a wide range of values for which we achieve increased accuracy.

Let us select the best models from each type of model and compare in-sample accuracy. rank_results() will pull the best configuration per workflow in one call.

model_labels <- c(
  rec_rf    = "Random Forest",
  rec_knn   = "K-NN",
  rec_lasso = "Logistic Reg",
  rec_boost = "Boosted Trees"
)

rank_results(tuned_set, rank_metric = "accuracy", select_best = TRUE) %>%
  filter(.metric == "accuracy") %>%
  transmute(model = model_labels[wflow_id], accuracy = mean) %>%
  arrange(desc(accuracy)) %>%
  knitr::kable()

model	accuracy
Boosted Trees	0.9964958
K-NN	0.9952857
Random Forest	0.9950932
Logistic Reg	0.7256387

We can see that the random forests, K-NN, and boosted trees models perform exceptionally on the resamples of the train data, while even the best lasso logistic regression model performs much worse than the other three. However, there is high chance that our models have overfit on the training data and actually will not perform as well when generalizing to new data. This is where out-of-sample data comes to play, as we will use the portion of the data we set aside at the beginning to calculate accuracy on new data.

Out-of-sample Accuracy

Now that we have a set of hyperparameters that optimize performance for each model, we can update each workflow with its best parameters, fit on the entire training set and predict on the held-out test set. With workflowsets this becomes a single map over the rows of the tuned set rather than four separate finalize-and-fit blocks.

last_fits <- tibble(
  wflow_id = c("rec_rf", "rec_knn", "rec_lasso", "rec_boost"),
  pretty   = c("Random Forest", "K-NN", "LASSO LogReg", "Boosted Trees"),
  seed     = c(1209L, 1387L, NA_integer_, 54678L)
) %>%
  mutate(
    final_wf  = map(wflow_id, ~ {
      wf   <- extract_workflow(tuned_set, .x)
      best <- tuned_set %>% extract_workflow_set_result(.x) %>% select_best(metric = "accuracy")
      finalize_workflow(wf, best)
    }),
    final_fit = map2(final_wf, seed, ~ {
      if (!is.na(.y)) set.seed(.y)
      last_fit(.x, split)
    })
  )

last_fits %>%
  transmute(model = pretty, .metrics = map(final_fit, collect_metrics)) %>%
  unnest(.metrics) %>%
  filter(.metric == "accuracy") %>%
  select(model, accuracy = .estimate) %>%
  arrange(desc(accuracy)) %>%
  knitr::kable()

model	accuracy
K-NN	0.9966896
Boosted Trees	0.9966896
Random Forest	0.9949071
LASSO LogReg	0.7318564

We can see that the boosted trees and k nearest neighbours models perform great, with random forest trailing slightly behind. The LASSO logistic regression model has much lower performance and would not be preferred. At this point we could walk away with a model that has a 99.8% accuracy on unseen data. However, we can take it a couple of steps further to see if we can achieve even greater accuracy, as we’ll see in the next sections.

Ensemble Model

In the previous section we used the tune package to try out different hyperparameter combinations over our data and estimate model accuracy using cross-validation. Let’s assume we haven’t yet tested the best model on our test data as we’re only supposed to use the test set for final selection and we shouldn’t be using the knowledge from applying to test data to improve performance. Because each workflow in tuned_set was tuned with save_pred = TRUE and save_workflow = TRUE, we can pass the whole workflow set to stacks::add_candidates() in one call — since stacks 0.2.2, the helper accepts a workflow_set directly. The reason we kept roc_auc as a metric while tuning is because it creates soft predictions, which are required in classification problems to create the stack. Since the outputs of these models will be highly correlated, the blend_predictions function performs regularization to decide which outputs will be used in the final prediction. For the multinomial blend we use mn_log_loss (multinomial log loss) rather than accuracy — accuracy is not smooth in the predicted-probability space, which on a multiclass outcome can lead the LASSO blender to shrink every candidate coefficient to zero and produce an empty stack (tidymodels/stacks#225).

set.seed(5523)

model_stack <- stacks() %>%
  add_candidates(tuned_set) %>%
  blend_predictions(metric = metric_set(mn_log_loss))

Warning: The inputted `candidates` argument `rec_rf` generated notes during
tuning/resampling. Model stacking may fail due to these issues; see
`collect_notes()` (`?tune::collect_notes()`) if so.

Warning: Predictions from 85 candidates were identical to those from existing candidates
and were removed from the data stack.

model_stack_fit <- model_stack %>% fit_members()

Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>

class_levels <- levels(factor(train$classe))
pred_cols    <- paste0(".pred_", class_levels)

stack_pred_test <- test %>%
  bind_cols(predict(model_stack_fit, new_data = ., type = "prob")) %>%
  mutate(
    .pred_class = factor(
      class_levels[max.col(across(all_of(pred_cols)), ties.method = "first")],
      levels = class_levels
    )
  )

stack_pred_test %>%
  mutate(classe = factor(classe, levels = class_levels)) %>%
  accuracy(classe, .pred_class) %>%
  knitr::kable()

.metric	.estimator	.estimate
accuracy	multiclass	0.9979628

We see that in the end we achieved the same accuracy as our best model, which is not unexpected considering our accuracy was almost perfect. We can also have a look at the weights of the different models used in the ensemble.

autoplot(model_stack_fit, type = "weights") %>%
  ggplotly()

While there was not much room for improvement, as mentioned in the beginning, this was a good opportunity to play around with the new package in practice.

Although this data set did not present much of a challenge in terms of predicting the outcome, we managed to cover many of the different steps in the modeling process using tidymodels. The updated version of this post leans on workflowsets — the package the original 2021 post pointed at as future work — to express the four-model tuning sweep as a single pipeline, and on the future backend that tune 1.2.0 adopted for parallel execution. Further steps one could take in this analysis could potentially involve using functionality from the tidyposterior package to make statistical comparisons between the models we constructed that performed similarly.

References

Velloso, Eduardo, Andreas Bulling, Hans Gellersen, Wallace Ugulino, and Hugo Fuks. 2013. “The 4th Augmented Human International Conference.” https://doi.org/10.1145/2459236.2459256.