Machine Learning
This exercise is the fifth in the flu analysis series. This page follows the tidymodels framework to train and evaluate three machine learning models for the main continuous outcome, BodyTemp.
For this machine learning exercise, we will use the flu data and focus on body temperature as the outcome. We will fit a decision tree model, a LASSO model, and a random forest model, and determine the best-performing one.
Load packages
library(tidymodels)
library(tidyverse)
library(here)
library(rsample)
library(rpart.plot)
library(vip)
library(glmnet)
library(ranger)
library(ggplot2)
library(ggfortify)
Load data
#load data
filelocation <- here("fluanalysis", "data", "cleandata.rds")
load(filelocation)
#reassign as flu
flu <- cleandata
#remove columns with <50 entries in one category
flu <- flu %>%
  select(!c("Vision", "Hearing"))
Data splitting
#establish reproducibility by setting the seed
set.seed(123)
#stratify data split to balance outcomes between data sets
data_split1 <- initial_split(flu, prop = 7/10, strata = BodyTemp)
#create two data sets with 70% of data in training set
train_data1 <- training(data_split1)
test_data1 <- testing(data_split1)
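As a quick sanity check (a small sketch, not part of the original analysis), we can confirm that roughly 70% of the rows ended up in the training set:
#check the split sizes
dim(train_data1)
dim(test_data1)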
Cross Validation & Workflow set up
#set seed for reproducibility
set.seed(123)
folds <- vfold_cv(train_data1, v = 5)
#create recipe for machine learning
#step_ functions create dummy variables and order the severity columns
ml_rec <- recipe(BodyTemp ~ ., data = train_data1) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_ordinalscore()
#set linear regression
lm_mod <- linear_reg()
#create machine learning workflow
ml_wf <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(ml_rec)
Null model
We will use the null model as comparison for the other models. All other models should perform better than the null model since this model does not use any of the predictor variables.
#set null model using null_model() function
nullmod <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression")
#create null recipe
null_rec <- recipe(BodyTemp ~ ., data = train_data1) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_ordinalscore()
#create null workflow
nullmodwf <- workflow() %>%
  add_model(nullmod) %>%
  add_recipe(null_rec)
#fit the null model to training data
nullmodfit <- nullmodwf %>%
  fit(data = train_data1)
The null model can be used to make predictions of the outcome, but without any predictors, the model will produce the mean body temperature for the training and test data. The rmse()
function will produce the metric for this model.
#use null model to make predictions from training data
nullmod_predtrain <- augment(nullmodfit, train_data1) %>%
  rmse(truth = BodyTemp, .pred)
nullmod_predtrain
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.21
#use null model to make predictions from test data
nullmod_predtest <- augment(nullmodfit, test_data1) %>%
  rmse(truth = BodyTemp, .pred)
nullmod_predtest
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.16
The training null model has an RMSE of 1.209, and the test null model performs slightly better with an RMSE of 1.163.
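As a side check (a sketch, not part of the original analysis): because the null model simply predicts the training mean, its training RMSE should match the root mean squared deviation of BodyTemp around that mean.
#sanity check: the null model's prediction is just the mean BodyTemp of the training data
mean_bt <- mean(train_data1$BodyTemp)
sqrt(mean((train_data1$BodyTemp - mean_bt)^2)) #should be ~1.21, matching rmse() above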
Model tuning and fitting
For each model, we will use the tune()
function to find the best parameters during the cross-validation step. After finding the best model based on our performance metric RMSE, we will fit the best model.
Tree model specification
#set up decision tree model and tune the parameters
tune_spec <- decision_tree(
  cost_complexity = tune(), #one parameter we will tune
  tree_depth = tune()       #another parameter
) %>%
  set_engine("rpart") %>%   #specify engine
  set_mode("regression")    #regression due to continuous outcome
Tree model workflow
#create decision tree workflow
tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_recipe(ml_rec) #can use the general machine learning recipe
Tree model grid specification
The grid_regular() function produces values of the parameters that we will test to find the best model. For each value of tree_depth, 5 cost_complexity values will be tested.
#create grid of resampled values for cross-validation
tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5) #produces 5x5 = 25 combinations
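A quick look at the grid (a sketch) confirms the 25 combinations: 5 distinct cost_complexity values crossed with 5 tree_depth values.
#check the tuning grid dimensions
nrow(tree_grid)                 #25 rows
tree_grid %>% count(tree_depth) #5 cost_complexity values per tree_depth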
Tree model tuning with cross-validation
Cross-validation is required for tuning. Resampling the training set lets us evaluate each candidate model against "new" data and see how much its performance varies.
set.seed(123)
tree_res <- tree_wf %>%
  tune_grid(
    resamples = folds,
    grid = tree_grid,
    control = control_grid(save_pred = TRUE)
  )
! Fold1: internal: A correlation computation is required, but `estimate` is constant and ha...
! Fold2: internal: A correlation computation is required, but `estimate` is constant and ha...
! Fold3: internal: A correlation computation is required, but `estimate` is constant and ha...
! Fold4: internal: A correlation computation is required, but `estimate` is constant and ha...
! Fold5: internal: A correlation computation is required, but `estimate` is constant and ha...
These warnings arise because some candidate trees (those pruned back to the root) make constant predictions, so the correlation-based R-squared cannot be computed; they can be ignored since we judge the models by RMSE. Plotting the resampled models produces a summary of the RMSE in the top row. A lower RMSE is preferred.
tree_res %>%
  autoplot()
It would appear that the model with tree_depth = 1 has the best RMSE.
Tree model performance
To confirm the best-performing tree model, we can use the select_best() function and extract the metrics.
#select best tree model
best_tree <- tree_res %>% select_best("rmse")
best_tree
# A tibble: 1 × 3
cost_complexity tree_depth .config
<dbl> <int> <chr>
1 0.0000000001 1 Preprocessor1_Model01
#set up workflow
treefinal_wf <- tree_wf %>%
  finalize_workflow(best_tree)
#fit the final model
treefinal_fit <- treefinal_wf %>%
  fit(train_data1)
#extract RMSE value
treefinalfitted_rmse <- augment(treefinal_fit, train_data1) %>%
  select(BodyTemp, .pred) %>%
  rmse(truth = BodyTemp, .pred)
The best-performing tree model has a tree depth of 1 and an RMSE of 1.178, which is slightly better than the training null model's 1.209.
#save for later comparison: RMSE based on predictions from best model
tree_rmse <- tree_res %>%
  collect_predictions(parameters = best_tree) %>%
  rmse(truth = BodyTemp, .pred) %>%
  mutate(model = "Tree")
tree_rmse
# A tibble: 1 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rmse standard 1.19 Tree
The decision tree can be plotted using rpart.plot(). The importance of each predictor can be found using the vip() function.
#extract engine and plot decision tree
treefinal_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)
#plot importance of predictors
treefinal_fit %>%
  extract_fit_parsnip() %>%
  vip()
Plots for Tree model
We may also be interested in the model's predictions and residuals compared to the actual outcomes. First, we create the predictions using augment() and the residuals using mutate().
#create predictions and residuals
tree_resd <- augment(treefinal_fit, train_data1) %>%
  select(c(.pred, BodyTemp)) %>%
  mutate(.resid = BodyTemp - .pred)
#plot body temperature and predictions
treeplot1 <- ggplot(tree_resd) +
  geom_point(aes(BodyTemp, .pred)) +
  scale_x_continuous(limits = c(97, 103.5)) +
  scale_y_continuous(limits = c(97, 103.5))
treeplot1
#plot body temperature and residuals
treeplot2 <- ggplot(tree_resd) +
  geom_point(aes(BodyTemp, .resid))
treeplot2
The predictions and residuals show that the model does not perform well in predicting body temperature.
Compare model to null
As mentioned before, the tree model is only slightly better than the null model (cross-validated RMSE of 1.187 (SE = 0.061) compared to 1.209). The fitted tree model has an RMSE of 1.178.
#metrics for best model
tree_res %>% show_best()
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 5 × 8
cost_complexity tree_depth .metric .estimator mean n std_err .config
<dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0000000001 1 rmse standard 1.19 5 0.0613 Preprocesso…
2 0.0000000178 1 rmse standard 1.19 5 0.0613 Preprocesso…
3 0.00000316 1 rmse standard 1.19 5 0.0613 Preprocesso…
4 0.000562 1 rmse standard 1.19 5 0.0613 Preprocesso…
5 0.0000000001 4 rmse standard 1.20 5 0.0504 Preprocesso…
treefinalfitted_rmse
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.18
LASSO model specifications
Next we will look at the LASSO model following the same general steps as the decision tree model.
#set LASSO model
lm_modLASSO <- linear_reg(penalty = tune(), #penalty that shrinks coefficients, to be tuned
                          mixture = 1) %>%  #mixture = 1 gives a pure LASSO (L1) penalty
  set_engine("glmnet")
LASSO model workflow
#set LASSO workflow
LASS_wf <- workflow() %>%
  add_model(lm_modLASSO) %>%
  add_recipe(ml_rec)
LASSO model grid specifications
We will use a similar grid method as before in order to tune the parameters of the models.
#create LASSO grid
LASS_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30)) #creates a grid of penalty values to tune
LASS_grid %>% top_n(-5) #lowest penalty values
Selecting by penalty
# A tibble: 5 × 1
penalty
<dbl>
1 0.0001
2 0.000127
3 0.000161
4 0.000204
5 0.000259
The penalty shrinks coefficients toward zero, so models that keep more (or larger) coefficients pay a higher price; after fitting the final model below, we will check which predictors survive the selected penalty.
LASSO model tuning with cross-validation
set.seed(123)
LASS_res <-
  LASS_wf %>%
  tune_grid(folds, #cross-validation
            grid = LASS_grid, #penalty grid
            control = control_grid(save_pred = TRUE), #save CV predictions to use later
            metrics = metric_set(rmse)) #RMSE as metric
We can plot the metrics against penalty values.
LASS_plot <-
  LASS_res %>%
  autoplot()
LASS_plot
We want the lowest RMSE, ideally at a small penalty; from the plot, the best model should have an RMSE around 1.15.
LASSO model performance
Using the select_best() function and extracting the metrics will give us the model performance measure.
#show top performing models
LASS_showbest <- LASS_res %>% show_best()
#select best model
best_LASS_model <- LASS_res %>%
  select_best("rmse")
best_LASS_model
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.0386 Preprocessor1_Model26
#update workflow with best model
LASSfinal_wf <- LASS_wf %>%
  finalize_workflow(best_LASS_model)
#fit the model to the training data
LASSfinal_fit <- LASSfinal_wf %>%
  fit(train_data1)
#collect final RMSE
LASSfinalfitted_rmse <- augment(LASSfinal_fit, train_data1) %>%
  select(BodyTemp, .pred) %>%
  rmse(truth = BodyTemp, .pred)
The LASSO model (RMSE = 1.117) performed better than the decision tree model (1.187) and the null model (1.209).
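To see the shrinkage in action (a sketch, not part of the original analysis), we can pull the coefficients at the selected penalty. tidy() on the fitted parsnip/glmnet model reports them, and predictors that the penalty has shrunk to zero drop out of the filtered list.
#inspect which predictors survive the selected penalty
LASSfinal_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate != 0)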
#save for later comparison: RMSE based on predictions from best model
LASS_rmse <- LASS_res %>%
  collect_predictions(parameters = best_LASS_model) %>%
  rmse(truth = BodyTemp, .pred) %>%
  mutate(model = "LASSO")
LASS_rmse
# A tibble: 1 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rmse standard 1.15 LASSO
Plots for LASSO model
We can again plot the predictions and residuals against the outcome.
#create predictions and residuals
LASS_resd <- augment(LASSfinal_fit, train_data1) %>%
  select(c(.pred, BodyTemp)) %>%
  mutate(.resid = BodyTemp - .pred)
#plots LASSO predictions and body temperature
LASS_plot1 <- ggplot(LASS_resd) +
  geom_point(aes(BodyTemp, .pred)) +
  scale_x_continuous(limits = c(97, 103.5)) +
  scale_y_continuous(limits = c(97, 103.5))
LASS_plot1
#plots residuals and body temperature
LASS_plot2 <- ggplot(LASS_resd) +
  geom_point(aes(BodyTemp, .resid))
LASS_plot2
The prediction and residual plots show better performance than the two models we have looked at so far.
Compare model to null model
As mentioned, the LASSO model performs better than the null model, with a cross-validated RMSE of 1.148 (SE = 0.0534) compared to the null model's 1.209. The fitted LASSO model has an RMSE of 1.117.
LASS_res %>%
  show_best()
# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0386 rmse standard 1.15 5 0.0534 Preprocessor1_Model26
2 0.0489 rmse standard 1.15 5 0.0542 Preprocessor1_Model27
3 0.0304 rmse standard 1.15 5 0.0528 Preprocessor1_Model25
4 0.0621 rmse standard 1.15 5 0.0552 Preprocessor1_Model28
5 0.0240 rmse standard 1.15 5 0.0524 Preprocessor1_Model24
LASSfinalfitted_rmse
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.12
Random forest model specifications
Finally, following the same steps as above, we can look at random forest models, using cross-validation and tuning to find the best model.
#how many cores your computer can use for running tuning on multiple models at the same time
cores <- parallel::detectCores()
#set up random forest model
rf_mod <- rand_forest(mtry = tune(),  #parameter to tune
                      min_n = tune(), #parameter to tune
                      trees = 1000) %>%
  set_engine("ranger", num.threads = cores) %>%
  set_mode("regression")
Random forest model workflow
#create random forest recipe
rf_rec <- recipe(BodyTemp ~ ., data = train_data1) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_ordinalscore()
#create workflow
rf_wf <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_rec)
Random forest model grid specifications/tuning with cross-validation
We will use a similar grid method for tuning the model parameters.
#create grid for tuning parameters for cross-validation
set.seed(123)
rf_res <- rf_wf %>%
  tune_grid(folds,
            grid = 25, #25 candidate models
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(rmse))
i Creating pre-processing data to finalize unknown parameter: mtry
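This message appears because the upper bound of mtry depends on the number of predictors, which tune_grid() finalizes from the data automatically. If you prefer to set the range yourself, one option (a sketch using finalize() from dials; note the dummy-coded recipe will expand the raw column count somewhat) is:
#optional: finalize the data-dependent mtry range from the predictor columns
rf_params <- rf_wf %>%
  extract_parameter_set_dials() %>%
  finalize(train_data1 %>% select(-BodyTemp))
rf_params %>% extract_parameter_dials("mtry")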
We can look at the best random forest models by RMSE values.
#show top performing models
rf_showbest <- rf_res %>% show_best()
rf_showbest
# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 8 37 rmse standard 1.16 5 0.0574 Preprocessor1_Model25
2 4 14 rmse standard 1.16 5 0.0589 Preprocessor1_Model06
3 3 23 rmse standard 1.16 5 0.0603 Preprocessor1_Model13
4 13 35 rmse standard 1.16 5 0.0564 Preprocessor1_Model15
5 7 12 rmse standard 1.17 5 0.0559 Preprocessor1_Model18
The best-performing model has mtry (the number of predictors randomly sampled at each split) = 8, min_n (the minimum number of data points in a node required to split it further) = 37, and an RMSE of 1.158 (SE = 0.057). This is better than the null and decision tree models and comparable to the LASSO model.
The autoplot()
function visually shows the performance of the various models.
rf_res %>% autoplot()
We will extract the best model from the group using select_best().
#extract best model
rf_best <-
  rf_res %>%
  select_best(metric = "rmse")
#show results
rf_best
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 8 37 Preprocessor1_Model25
#collect predictions of best random forest model and calculate RMSE
#save for later comparison: RMSE based on predictions from best model
rf_rmse <-
  rf_res %>%
  collect_predictions(parameters = rf_best) %>%
  rmse(truth = BodyTemp, estimate = .pred) %>%
  mutate(model = "Random Forest")
rf_rmse
# A tibble: 1 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rmse standard 1.16 Random Forest
Using the best random forest model, the RMSE is 1.163, which is similar to the LASSO model's.
Final fit for random forest model
Now we can fit the model to the training data to get the final RMSE value.
#set random forest model using best model parameters
rffinal_mod <-
  rand_forest(mtry = 8, min_n = 37, trees = 1000) %>%
  set_engine("ranger", num.threads = cores, importance = "impurity") %>%
  set_mode("regression")
#update workflow
rffinal_wf <- rf_wf %>%
  update_model(rffinal_mod)
#fit to training data
rffinal_fit <- rffinal_wf %>%
  fit(train_data1)
#pull RMSE from fitted model
rffinalfitted_rmse <- augment(rffinal_fit, train_data1) %>%
  select(BodyTemp, .pred) %>%
  rmse(truth = BodyTemp, .pred)
The fitted random forest model has an RMSE of 1.028, which is slightly better than the fitted LASSO model and better than the null and decision tree models.
We can also look at the important predictors for the random forest model using the vip() function.
rffinal_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)
Sneezing remains the most important predictor of body temperature, followed closely by subjective fever.
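If you want the numbers behind the plot (a sketch; vi() from the vip package returns the importance scores as a tibble, and works here because the final model was fit with importance = "impurity"):
#importance scores as a table rather than a plot
rffinal_fit %>%
  extract_fit_engine() %>%
  vi() %>%
  head(10)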
Plots for Random forest model
We will again plot the predictions and residuals for the model using augment().
#create predictions and residuals
rf_resd <- augment(rffinal_fit, train_data1) %>%
  select(c(.pred, BodyTemp)) %>%
  mutate(.resid = BodyTemp - .pred)
#plot body temperature and predictions
rf_plot1 <- ggplot(rf_resd) +
  geom_point(aes(BodyTemp, .pred))
rf_plot1
#plot body temperature and residuals
rf_plot2 <- ggplot(rf_resd) +
  geom_point(aes(BodyTemp, .resid))
rf_plot2
The residuals and prediction plots look similar to the LASSO model.
Compare model to null model
rf_res %>% show_best() %>% arrange(mean) %>% top_n(1)
Selecting by .config
# A tibble: 1 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 8 37 rmse standard 1.16 5 0.0574 Preprocessor1_Model25
Compared to the null model, the random forest model is a better predictor of body temperature, with an RMSE of 1.158 (SE = 0.057) versus 1.209.
Choosing best model
Now that we have looked at all of the models, we can choose the best model to test with our test_data1 data set, which has not been used during the tuning/fitting process.
#create comparison tibble for three models
compare <- bind_rows(tree_rmse, LASS_rmse, rf_rmse)
compare
# A tibble: 3 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rmse standard 1.19 Tree
2 rmse standard 1.15 LASSO
3 rmse standard 1.16 Random Forest
Based on the RMSE metric, the best-performing models are the LASSO and random forest models. We can also check the uncertainty by looking at the plots shown above and at the standard errors.
LASS_showbest %>% head(1)
# A tibble: 1 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0386 rmse standard 1.15 5 0.0534 Preprocessor1_Model26
rf_showbest %>% head(1)
# A tibble: 1 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 8 37 rmse standard 1.16 5 0.0574 Preprocessor1_Model25
Since the LASSO model has a lower standard error (and a slightly lower RMSE) than the random forest model, we will choose the LASSO model for testing on new data.
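A related selection rule worth knowing (a sketch, not used in this analysis) is tune's select_by_one_std_err(), which picks the simplest candidate within one standard error of the numerically best model; for LASSO, "simplest" means the largest penalty.
#alternative selection: largest penalty within one SE of the best RMSE
LASS_res %>%
  select_by_one_std_err(desc(penalty), metric = "rmse")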
Final evaluation
Now we can use last_fit() to fit the final LASSO workflow on the training data and evaluate it on the test data, checking the RMSE on "new data".
LASStest_fit <-
  LASSfinal_wf %>%
  last_fit(data_split1)
LASStest_fit %>% collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 1.16 Preprocessor1_Model1
2 rsq standard 0.0297 Preprocessor1_Model1
With the test data, the model has an RMSE of 1.16, which is slightly worse than the fitted model's 1.12. This makes sense, since the fitted model was trained on the train_data1 set.
We will plot the same prediction and residual plots as before using the test_data1 outcomes. They should look fairly similar to those for the fitted LASSO model since the RMSE is similar.
#create predictions and residuals for test data
LASStest_resd <- augment(LASStest_fit) %>%
  select(c(.pred, BodyTemp)) %>%
  mutate(.resid = BodyTemp - .pred)
#plots LASSO predictions and body temperature for test data
LASS_plot3 <- ggplot(LASStest_resd) +
  geom_point(aes(BodyTemp, .pred)) +
  scale_x_continuous(limits = c(97, 103.5)) +
  scale_y_continuous(limits = c(97, 103.5))
LASS_plot3
#plots residuals and body temperature for test data
LASS_plot4 <- ggplot(LASStest_resd) +
  geom_point(aes(BodyTemp, .resid))
LASS_plot4