Chemometric Modeling with Tidymodels: A Tutorial for Spectroscopic Data
For this tutorial, we use the beer dataset, which is publicly available and commonly used for spectroscopy-based regression problems. It contains near-infrared spectroscopy (NIRS) data of beer samples alongside measurements of original gravity, which serves as the target variable. Original gravity (OG) is one of the primary metrics brewers use to estimate the potential alcohol content of a beer, as it reflects the fermentable sugar content available for yeast metabolism. By analyzing OG alongside the NIRS spectra, we can explore how the spectral data correlate with this fundamental brewing property, offering insights into the chemical composition and quality of the beer samples.
Setup
Below, we alias suppressPackageStartupMessages as ssh to keep the output clean, then load the essential packages:

- tidyverse for data manipulation and visualization.
- tidymodels for modeling workflows and machine learning.
- tidymodels_prefer() ensures consistent behavior when tidymodels functions conflict with those of other packages.
ssh <- suppressPackageStartupMessages
ssh(library(tidyverse))
ssh(library(tidymodels))
tidymodels_prefer()
Additional libraries include:

- kknn: implements k-nearest neighbors (KNN).
- glmnet: used for elastic net regression.
- ranger: used for random forests.
- plsmod: supports partial least squares (PLS) regression.
- magrittr: provides the pipe operators %>% and %<>%.
- patchwork: simplifies combining ggplot2 plots.
library(kknn)
library(plsmod)
ssh(library(glmnet))
ssh(library(ranger))
ssh(library(magrittr))
library(patchwork)
We set a custom theme with a clean white background and adjusted sizes for all ggplot2
plots.
base_size <- 15

theme_bw(
  base_size = base_size,
  base_line_size = base_size / 22,
  base_rect_size = base_size / 15
) %>%
  theme_set()
Dataset Overview
We begin by loading the beer dataset and identifying the spectral predictor columns, which correspond to the NIRS wavelength variables. I usually prefer storing these column names as a character vector named wavelength because it makes data manipulation easier. This approach adds flexibility when selecting, filtering, or grouping columns, simplifies integration with tidyverse functions, and ensures compatibility with tidymodels preprocessing workflows.
beer_data <- read_csv("beer.csv", show_col_types = FALSE)
wavelength <- beer_data %>% select(starts_with("xtrain")) %>% names()
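As a quick illustration (a minimal sketch, not part of the pipeline below), the stored names plug directly into tidyselect helpers such as all_of():

# Select only the spectral columns, or reshape them to long format,
# without typing hundreds of column names.
beer_data %>% select(all_of(wavelength))

beer_data %>%
  pivot_longer(all_of(wavelength), names_to = "wavelength", values_to = "intensity")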
Previewing the first rows of the dataset helps us ensure its integrity and understand its structure.
beer_data %>% head(5) %>% DT::datatable()
p <- beer_data %>%
  mutate(spectra_id = paste0("s", 1:80)) %>%
  pivot_longer(
    cols = -c(originalGravity, spectra_id),
    names_to = "wavelength",
    values_to = "intensity"
  ) %>%
  mutate(wavelength = rep(seq(1100, 2250, 2), times = 80)) %>%
  ggplot() +
  aes(x = wavelength, y = intensity, colour = originalGravity, group = spectra_id) +
  geom_line() +
  scale_color_viridis_c(option = "inferno", direction = 1) +
  labs(
    x = "Wavelength [nm]",
    y = "Absorbance [arb. units]",
    title = "NIRS Spectra of Beer Samples",
    subtitle = "Contains 80 samples, measured from 1100 to 2250 nm",
    color = "Original Gravity"
  ) +
  theme_minimal()

plotly::ggplotly(p)
Supervised Learning Techniques
For this analysis, we’ll evaluate and compare the performance of four supervised learning algorithms, categorized by whether their modeling approach is linear or non-linear:
| Algorithm | Acronym | Approach |
|---|---|---|
| sparse Partial Least Squares | sPLS | Linear |
| Elastic Net | ENet | Linear |
| k-Nearest Neighbors | KNN | Non-linear |
| Random Forests | RF | Non-linear |
Step 1: Data Splitting
To ensure unbiased model evaluation, we partition the data into training (80%) and testing (20%) sets, employing stratified sampling based on the target variable’s distribution.
set.seed(123)
split_data <- initial_split(beer_data, prop = 0.8, strata = originalGravity)
train_data <- training(split_data)
test_data <- testing(split_data)
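As an optional sanity check (a minimal sketch, not part of the original pipeline), we can confirm that stratification produced similar target distributions in both sets:

# Compare summary statistics of originalGravity across the two sets;
# with stratified sampling these should be close.
bind_rows(
  train_data %>% mutate(set = "train"),
  test_data %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  summarise(
    min = min(originalGravity),
    median = median(originalGravity),
    max = max(originalGravity)
  )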
Step 2: Cross-Validation
We use 5-fold cross-validation repeated three times for hyperparameter tuning and performance evaluation, minimizing the risk of overfitting.
cv_folds <- vfold_cv(train_data, v = 5, repeats = 3)
Conveniently, the folds object returned by vfold_cv can be tidied and plotted to visualize the distribution of data across folds, allowing us to confirm that the splits are evenly balanced. This ensures that each fold accurately represents the overall dataset, enhancing the reliability of the cross-validation results.
cv_folds %>%
  tidy() %>%
  ggplot(aes(x = Fold, y = Row, fill = Data)) +
  geom_tile() +
  facet_wrap(~Repeat) +
  scale_fill_brewer(palette = "Paired")
Step 3: Preprocessing
Preprocessing spectral data is a vast and intricate topic deserving its own dedicated discussion, which we will explore in a future post. For this tutorial, we remove zero-variance predictors and center the spectral intensities to ensure the data is well suited for modeling.
tidymodels
provides a wide array of preprocessing steps through its versatile step_*
functions. These functions allow for comprehensive data transformations, including centering, scaling, feature selection, and more, to be seamlessly integrated into the modeling workflow. Additionally, tidymodels
offers the flexibility to create custom recipe steps, enabling you to design and implement tailored data transformations that meet your specific needs.
base_recipe <- recipe(originalGravity ~ ., data = train_data) %>%
  update_role(originalGravity, new_role = "outcome") %>%
  update_role(all_of(wavelength), new_role = "predictor") %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors())
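As an optional check, prep() estimates the preprocessing statistics (here, the column means used for centering) on the training data, and bake() applies them, letting us inspect the processed spectra:

# Estimate the recipe's parameters and preview the processed training data.
base_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()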
For comparison with the base preprocessing step, we introduce an additional step that applies Principal Component Analysis (PCA) to the predictor variables. This transformation reduces the dimensionality of the data while retaining the most significant variance.
pca_recipe <- base_recipe %>%
  step_pca(all_predictors())
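Note that step_pca retains num_comp = 5 components by default. If you would rather keep enough components to explain a target share of the variance, the threshold argument offers an alternative; pca_recipe_95 below is a hypothetical variant that is not used later:

# Hypothetical alternative: keep as many components as needed to
# capture 95% of the total variance instead of a fixed number.
pca_recipe_95 <- base_recipe %>%
  step_pca(all_predictors(), threshold = 0.95)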
Step 4: Model Specifications
We now define model specifications for each algorithm, incorporating hyperparameter tuning within the tidymodels
framework. Notably, the tune
function is used to specify hyperparameters that require optimization during the tuning process. For parameters with predefined values, these can be directly assigned within their allowable range.
# k-Nearest Neighbors (KNN)
knn_spec <- nearest_neighbor() %>%
  set_args(neighbors = tune(), weight_func = tune(), dist_power = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Sparse Partial Least Squares (sPLS)
spls_spec <- pls() %>%
  set_args(predictor_prop = tune(), num_comp = tune()) %>%
  set_engine("mixOmics") %>%
  set_mode("regression")

# Elastic Net (ENet)
enet_spec <- linear_reg() %>%
  set_args(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# Random Forest (RF)
rf_spec <- rand_forest() %>%
  set_args(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
Step 5: Model Training and Tuning
Using the workflow_set function, we define a workflow for each recipe and model combination, tune hyperparameters with a grid search, and enable parallel processing on 4 CPU cores to accelerate computation during tuning.
wflowSet <- workflow_set(
  preproc = list(base = base_recipe, pca = pca_recipe),
  models = list(
    knn = knn_spec,
    spls = spls_spec,
    enet = enet_spec,
    rf = rf_spec),
  cross = TRUE
)
As mentioned earlier, we opted to train each model using both the base preprocessing approach and the PCA-transformed data. This strategy results in a total of eight unique model workflows, as follows:
wflowSet
# A workflow set/tibble: 8 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 base_knn <tibble [1 × 4]> <opts[0]> <list [0]>
2 base_spls <tibble [1 × 4]> <opts[0]> <list [0]>
3 base_enet <tibble [1 × 4]> <opts[0]> <list [0]>
4 base_rf <tibble [1 × 4]> <opts[0]> <list [0]>
5 pca_knn <tibble [1 × 4]> <opts[0]> <list [0]>
6 pca_spls <tibble [1 × 4]> <opts[0]> <list [0]>
7 pca_enet <tibble [1 × 4]> <opts[0]> <list [0]>
8 pca_rf <tibble [1 × 4]> <opts[0]> <list [0]>
ctrl_grid <- control_grid(
  verbose = TRUE,
  allow_par = TRUE,
  extract = NULL,
  save_pred = TRUE,
  pkgs = c("doParallel", "doFuture"),
  save_workflow = FALSE,
  event_level = "first",
  parallel_over = "resamples"
)
cl <- parallel::makePSOCKcluster(4)
doParallel::registerDoParallel(cl)
wflowSet %<>%
  workflow_map(
    fn = "tune_grid",
    resamples = cv_folds,
    grid = 10,
    metrics = metric_set(rmse, mae, rsq),
    control = ctrl_grid,
    seed = 3L,
    verbose = TRUE
  )
parallel::stopCluster(cl)
Next, we utilize the rank_results function to systematically rank the trained models based on key performance metrics: the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination \(R^2\). Following this, we visualize the model performance with error bars, providing a clear and insightful comparison of their predictive capabilities.
wflowSet %>%
  rank_results(rank_metric = "rmse") %>%
  relocate(c(rank, model, .metric, mean, std_err), .before = wflow_id) %>%
  mutate(mean = round(mean, 4), std_err = round(std_err, 4)) %>%
  DT::datatable(rownames = FALSE, filter = "top")
wflowSet %>%
  autoplot(std_errs = qnorm(0.95), type = "wflow_id") +
  ggsci::scale_color_lancet() +
  theme(legend.position = "bottom") +
  ylab("")
Step 6: Model Selection
By configuring the autoplot
function with the argument select_best = TRUE
, we rank the training models while visually emphasizing the best-performing model, making it easy to identify the optimal choice for further evaluation.
wflowSet %>%
  autoplot(select_best = TRUE, std_errs = qnorm(0.95), type = "wflow_id") +
  geom_point(size = 3) +
  ggsci::scale_color_lancet() +
  theme(legend.position = "bottom") +
  ylab("")
The extract_workflow_set_result function retrieves the tuning results for the best-performing workflow, as determined by the lowest RMSE, and select_best() then extracts the optimized hyperparameter values. In this analysis, it identifies the optimal settings for base_spls: the number of latent variables (num_comp) and the proportion of predictors (predictor_prop) allowed to have non-zero coefficients.
best_model <- wflowSet %>%
  extract_workflow_set_result("base_spls") %>%
  select_best(metric = "rmse") %>%
  print()
# A tibble: 1 × 3
predictor_prop num_comp .config
<dbl> <int> <chr>
1 0.0633 3 Preprocessor1_Model01
We use the collect_predictions function to gather the best-performing model’s cross-validation predictions and visualize the relationship between the actual and predicted values, allowing us to assess the model’s predictive accuracy. Additionally, we perform a residual analysis using standardized residuals to further evaluate the model’s performance and identify potential areas for improvement.
train_results <- wflowSet %>%
  collect_predictions() %>%
  filter(wflow_id == "base_spls", .config == pull(best_model, .config)) %>%
  select(-.row)
train_results %>%
  mutate(
    .pred = round(.pred, 2),
    originalGravity = round(originalGravity, 4),
    residuals = round((originalGravity - .pred) / sd(originalGravity - .pred), 2)
  ) %>%
  relocate(c(.pred, originalGravity, residuals, model), .before = wflow_id) %>%
  DT::datatable(rownames = FALSE)
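These diagnostics are easier to judge graphically. A minimal sketch of the actual vs. predicted and standardized-residual plots, combined side by side with patchwork (loaded earlier), might look like this:

# Actual vs. predicted values from the cross-validation folds.
p1 <- train_results %>%
  ggplot(aes(x = originalGravity, y = .pred)) +
  geom_abline(linetype = "dashed") +
  geom_point(alpha = 0.6) +
  labs(x = "Observed original gravity", y = "Predicted original gravity")

# Standardized residuals against predictions; points should scatter
# evenly around zero if the model is well behaved.
p2 <- train_results %>%
  mutate(residuals = (originalGravity - .pred) / sd(originalGravity - .pred)) %>%
  ggplot(aes(x = .pred, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(alpha = 0.6) +
  labs(x = "Predicted original gravity", y = "Standardized residuals")

p1 + p2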
Step 7: Model Testing
Finally, we finalize the best-performing model (base_spls
) by utilizing the extract_workflow
and finalize_workflow
functions. We then assess the model’s performance on the test set using last_fit(split = split_data)
, calculating key metrics to evaluate its accuracy. These metrics are retrieved using the collect_metrics()
function.
As done previously, we visualize the results through actual vs. predicted plots, complemented by residual diagnostics, to provide a comprehensive evaluation of the model’s performance.
test_results <- wflowSet %>%
  extract_workflow("base_spls") %>%
  finalize_workflow(best_model) %>%
  last_fit(split = split_data)
test_results$.predictions[[1]] %>%
  mutate(
    .pred = round(.pred, 2),
    originalGravity = round(originalGravity, 4),
    residuals = round((originalGravity - .pred) / sd(originalGravity - .pred), 2)
  ) %>%
  relocate(residuals, .before = .config) %>%
  DT::datatable(rownames = FALSE, width = 400)
test_results %>%
  collect_metrics() %>%
  mutate(.estimate = round(.estimate, 4)) %>%
  DT::datatable(rownames = FALSE, width = 400)
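If you later want to predict original gravity for new spectra, the workflow fitted on the full training set can be extracted from the last_fit() result and reused with predict(). A minimal sketch, where new_spectra is a hypothetical data frame with the same wavelength columns as beer_data:

# Extract the finalized workflow, fitted on the entire training set.
final_wf <- extract_workflow(test_results)

# Hypothetical usage on unseen samples:
# predict(final_wf, new_data = new_spectra)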
Conclusion
This tutorial showcased the versatility of the Tidymodels framework for chemometric applications. By leveraging its modular and tidy design, you can implement robust spectroscopic models tailored to your dataset, ensuring both accuracy and reproducibility.