This function performs hyperparameter tuning for a gradient boosting model
used in deweathering air pollution data. It uses cross-validation to find
optimal hyperparameters and returns the best performing model along with
performance metrics and visualizations. Parallel processing (e.g., through
the mirai package) is recommended to speed up tuning - see
https://tune.tidymodels.org/articles/extras/optimizations.html#parallel-processing.
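As a minimal sketch of the parallel set-up mentioned above: the snippet below starts local mirai daemons before tuning and shuts them down afterwards. It assumes the installed version of tune can dispatch work to mirai daemons, as the linked article describes. The worker count of 2 is an arbitrary illustrative choice.

```r
# Sketch only: start background workers, tune, then shut the workers down.
n_workers <- 2L  # arbitrary; adjust to the number of available cores

if (requireNamespace("mirai", quietly = TRUE)) {
  mirai::daemons(n_workers)  # launch local background worker processes
  # ... call tune_dw_model() here ...
  mirai::daemons(0)          # reset: shut the workers down again
}
```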
Usage
tune_dw_model(
data,
pollutant,
vars = c("trend", "ws", "wd", "hour", "weekday", "air_temp"),
tree_depth = 5,
trees = 50L,
learn_rate = 0.1,
mtry = NULL,
min_n = 10L,
loss_reduction = 0,
sample_size = 1L,
stop_iter = 45L,
engine = c("xgboost", "lightgbm", "ranger"),
split_prop = 3/4,
grid_levels = 5,
v_partitions = 10,
progress = TRUE,
...,
.date = "date"
)
Arguments
- data
An input data.frame containing one pollutant column (defined using pollutant) and a collection of feature columns (defined using vars).
- pollutant
The name of the column (likely a pollutant) in data to predict.
- vars
The names of the columns in data to use as model features - i.e., to predict the values in the pollutant column. Any character columns will be coerced to factors. "hour", "weekday", "trend", "yday", "week", and "month" are special terms and will be passed to append_dw_vars() if not present in names(data).
- tree_depth, trees, learn_rate, mtry, min_n, loss_reduction, sample_size, stop_iter
If length 1, the parameter will be fixed. If length 2, the parameter will be tuned within the range defined between the first and last value. For example, if tree_depth = c(1, 5) and grid_levels = 3, tree depths of 1, 3, and 5 will be tested. See build_dw_model() for specific parameter definitions.
- engine
A single character string specifying what computational engine to use for fitting. Can be "xgboost", "lightgbm" (boosted trees) or "ranger" (random forest). See the documentation below for more information.
- split_prop
The proportion of data to be retained for modeling/analysis. Passed to the prop argument of rsample::initial_split().
- grid_levels
An integer for the number of values of each parameter to use to make the regular grid. Passed to the levels argument of dials::grid_regular().
- v_partitions
The number of partitions of the data set to use for v-fold cross-validation. Passed to the v argument of rsample::vfold_cv().
- progress
Log progress in the console? Passed to the verbose argument of tune::control_grid(). Note that logging does not occur when parallel processing is used.
- ...
Used to pass additional engine-specific parameters to the model. The parameters listed here can be tuned using tune_dw_model(). All other parameters must be fixed.
  - alpha: <xgboost> L1 regularization term on weights.
  - lambda: <xgboost> L2 regularization term on weights.
  - num_leaves: <lightgbm> max number of leaves in one tree.
  - regularization.factor: <ranger> Regularization factor (gain penalization).
  - regularization.usedepth: <ranger> Consider the depth in regularization? (TRUE/FALSE).
  - splitrule: <ranger> Splitting rule. One of dials::ranger_reg_rules.
  - alpha: <ranger> Significance threshold to allow splitting (for splitrule = "maxstat").
  - minprop: <ranger> Lower quantile of covariate distribution to be considered for splitting (for splitrule = "maxstat").
  - num.random.splits: <ranger> Number of random splits to consider for each candidate splitting variable (for splitrule = "extratrees").
- .date
The name of the 'date' column which defines the air quality timeseries. Passed to append_dw_vars() if needed. Also used to extract the time zone of the data for later restoration if trend is used as a variable.
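The tree_depth example given for the tunable arguments can be reproduced directly with dials. This is only a sketch of how a regular grid expands length-2 ranges. It is not taken from the function's source, and the min_n range is an illustrative assumption.

```r
library(dials)

# Two integer parameters given as ranges, three levels each
grid <- grid_regular(
  tree_depth(range = c(1L, 5L)),
  min_n(range = c(2L, 10L)),  # illustrative range, not a package default
  levels = 3
)

sort(unique(grid$tree_depth))  # 1 3 5
nrow(grid)                     # 9 combinations (3 x 3)
```

Because a regular grid crosses every level of every tuned parameter, the number of candidate models grows as grid_levels raised to the number of tuned parameters.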
Details
The function performs the following steps:
Removes rows with missing values in the pollutant or predictor variables
Splits data into training and testing sets
Creates a tuning grid for any parameters specified as ranges
Performs grid search with cross-validation to find optimal hyperparameters
Fits a final model using the best hyperparameters
Generates predictions and performance metrics
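The first two steps, and the resampling used by the grid search, can be sketched with rsample on synthetic data. The data frame and its column names are made up for illustration. Building the folds from the training set is an assumption about the internals, not something this page states.

```r
library(rsample)

set.seed(123)

# Synthetic stand-in for an air quality data set (illustrative columns)
aq <- data.frame(
  no2      = rnorm(200, mean = 40, sd = 10),
  ws       = runif(200, 0, 10),
  air_temp = rnorm(200, mean = 12, sd = 5)
)
aq$no2[c(3, 50)] <- NA  # introduce two missing pollutant values

# Step 1: drop rows with missing values in the pollutant or predictors
aq_complete <- aq[stats::complete.cases(aq[, c("no2", "ws", "air_temp")]), ]

# Step 2: training/testing split, mirroring split_prop = 3/4
split <- initial_split(aq_complete, prop = 3/4)
train <- training(split)
test  <- testing(split)

# Resamples for the grid search, mirroring v_partitions = 10
# (assumption: folds are drawn from the training portion)
folds <- vfold_cv(train, v = 10)

nrow(aq_complete)  # 198 rows remain after removing the two incomplete rows
```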
At least one hyperparameter must be specified as a range (vector of length 2) for tuning to occur. Single values are treated as fixed parameters.
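A hedged usage sketch follows, with one parameter given as a range and another held fixed. The data are synthetic, and the package name deweather is an assumption, since this page does not name the package. The call is guarded so the snippet still runs when the package or the xgboost engine is not installed.

```r
set.seed(42)

# Synthetic hourly data; column names chosen to match the default `vars`
n <- 24 * 60
aq <- data.frame(
  date     = seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                 by = "hour", length.out = n),
  ws       = runif(n, 0, 12),
  wd       = runif(n, 0, 360),
  air_temp = rnorm(n, mean = 10, sd = 6)
)
aq$no2 <- 30 - 1.5 * aq$ws + 0.3 * aq$air_temp + rnorm(n, sd = 4)

# Package name "deweather" is assumed; adjust if the function lives elsewhere
if (requireNamespace("deweather", quietly = TRUE) &&
    requireNamespace("xgboost", quietly = TRUE)) {
  tuned <- deweather::tune_dw_model(
    data = aq,
    pollutant = "no2",
    tree_depth = c(1, 5),  # length 2: tuned over this range
    trees = 50L,           # length 1: held fixed
    grid_levels = 3,       # tree depths 1, 3 and 5 are tried
    v_partitions = 5
  )
}
```

The "trend", "hour" and "weekday" terms in the default vars are not present in the synthetic data, so they would be derived from the date column through append_dw_vars(), as described for the vars argument.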
Modelling Approaches and Parameters
Types of Model
There are two modelling approaches available to build_dw_model():
- Boosted Trees (xgboost, lightgbm)
- Random Forest (ranger)
Each of these approaches takes different parameters.
Boosted Trees
Two engines are available for boosted tree models:
- "xgboost"
- "lightgbm"
The following universal parameters apply and are tunable:
- tree_depth: Tree Depth
- trees: # Trees
- learn_rate: Learning Rate
- mtry: # Randomly Selected Predictors
- min_n: Minimal Node Size
- loss_reduction: Minimum Loss Reduction
- sample_size: Proportion Observations Sampled (xgboost only)
- stop_iter: # Iterations Before Stopping (xgboost only)
The following xgboost-specific parameters are tunable:
- alpha: L1 regularization term on weights. Increasing this value will make the model more conservative
- lambda: L2 regularization term on weights. Increasing this value will make the model more conservative
The following lightgbm-specific parameters are tunable:
- num_leaves: max number of leaves in one tree
Random Forest
One engine is available for random forest models:
- "ranger"
The following universal parameters apply and are tunable:
- mtry: # Randomly Selected Predictors
- trees: # Trees
- min_n: Minimal Node Size
The following ranger-specific parameters are tunable:
- regularization.factor: Regularization factor (gain penalization)
- regularization.usedepth: Consider the depth in regularization? (TRUE/FALSE)
- splitrule: Splitting rule. One of dials::ranger_reg_rules
- alpha: Significance threshold to allow splitting (for splitrule = "maxstat")
- minprop: Lower quantile of covariate distribution to be considered for splitting (for splitrule = "maxstat")
- num.random.splits: Number of random splits to consider for each candidate splitting variable (for splitrule = "extratrees")
See also
Other Model Tuning Functions:
plot_tdw_testing_scatter(),
plot_tdw_tuning_metrics()
