This function performs hyperparameter tuning for a gradient boosting model
used in deweathering air pollution data. It uses cross-validation to find
optimal hyperparameters and returns the best performing model along with
performance metrics and visualizations. Parallel processing (e.g., through
the mirai package) is recommended to speed up tuning - see
https://tune.tidymodels.org/articles/extras/optimizations.html#parallel-processing.
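As a rough sketch of the parallel setup mentioned above, background workers can be started with mirai::daemons() before tuning and shut down afterwards. This assumes the mirai package is installed and that the installed version of tune supports mirai. The worker count, the object my_data, and the column name "no2" are hypothetical placeholders, not values taken from this page.

```r
# Pick a worker count, leaving one core free (falls back to 1 if unknown).
n_workers <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

if (requireNamespace("mirai", quietly = TRUE)) {
  # Start background workers for the tuning step to use.
  mirai::daemons(n_workers)

  # Tuning would be run here; `my_data` and "no2" are hypothetical.
  # result <- tune_dw_model(my_data, pollutant = "no2",
  #                         tree_depth = c(1, 5), grid_levels = 3)

  # Shut the workers down again once tuning has finished.
  mirai::daemons(0)
}
```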
Usage
tune_dw_model(
  data,
  pollutant,
  vars = c("trend", "ws", "wd", "hour", "weekday", "air_temp"),
  tree_depth = 5,
  trees = 200L,
  learn_rate = 0.1,
  mtry = NULL,
  min_n = 10L,
  loss_reduction = 0,
  sample_size = 1L,
  stop_iter = 190L,
  engine = c("xgboost", "lightgbm"),
  split_prop = 3/4,
  grid_levels = 5,
  v_partitions = 10
)
Arguments
- data
  An input data.frame containing one pollutant column (defined using pollutant) and a collection of feature columns (defined using vars).
- pollutant
  The name of the column (likely a pollutant) in data to predict.
- vars
  The names of the columns in data to use as model features, i.e., to predict the values in the pollutant column. Any character columns will be coerced to factors. "hour", "weekday", "trend", "yday", "week", and "month" are special terms and will be passed to append_dw_vars() if not present in names(data).
- tree_depth, trees, learn_rate, mtry, min_n, loss_reduction, sample_size, stop_iter
  If length 1, the parameter will be fixed at that value. If length 2, the parameter will be tuned within the range defined by the first and last value. For example, if tree_depth = c(1, 5) and grid_levels = 3, tree depths of 1, 3, and 5 will be tested.
- engine
  A single character string specifying what computational engine to use for fitting.
- split_prop
  The proportion of data to be retained for modeling/analysis. Passed to the prop argument of rsample::initial_split().
- grid_levels
  An integer for the number of values of each parameter to use to make the regular grid. Passed to the levels argument of dials::grid_regular().
- v_partitions
  The number of partitions of the data set to use for v-fold cross-validation. Passed to the v argument of rsample::vfold_cv().
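The way length-2 ranges expand into a regular grid can be illustrated in base R. This is only a sketch of the idea, not the package's code: it assumes evenly spaced values on the untransformed scale, which matches the tree_depth example above, and the min_n range is a hypothetical choice added to show how two tuned parameters combine.

```r
grid_levels <- 3

# Each length-2 range becomes `grid_levels` evenly spaced candidate values.
tree_depth_values <- seq(1, 5, length.out = grid_levels)
min_n_values <- seq(2, 10, length.out = grid_levels) # hypothetical range

# A regular grid contains every combination of the candidate values.
grid <- expand.grid(
  tree_depth = tree_depth_values,
  min_n = min_n_values
)

print(tree_depth_values)
print(nrow(grid))
```

With two tuned parameters and three levels each, the grid has 3 x 3 = 9 candidate combinations, so the number of models fitted grows quickly as more parameters are given as ranges.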
Details
The function performs the following steps:
- Removes rows with missing values in the pollutant or predictor variables.
- Splits data into training and testing sets.
- Creates a tuning grid for any parameters specified as ranges.
- Performs grid search with cross-validation to find optimal hyperparameters.
- Fits a final model using the best hyperparameters.
- Generates predictions and performance metrics.
At least one hyperparameter must be specified as a range (vector of length 2) for tuning to occur. Single values are treated as fixed parameters.
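The steps listed above resemble a standard tidymodels tuning workflow. The following is a minimal sketch of such a workflow on simulated data, assuming the rsample, parsnip, workflows, dials, tune, and xgboost packages are installed. It is not the internal implementation of tune_dw_model(); the toy data frame, the column names, and the parameter values are all hypothetical.

```r
set.seed(123)

# Simulated stand-in for an air quality data frame (not real measurements).
n <- 300
toy <- data.frame(
  ws = runif(n, 0, 10),
  air_temp = rnorm(n, 12, 5),
  hour = sample(0:23, n, replace = TRUE)
)
toy$no2 <- 40 - 2 * toy$ws + 0.5 * toy$air_temp + rnorm(n, sd = 3)
toy$no2[sample(n, 10)] <- NA

# Remove rows with missing values.
toy <- toy[stats::complete.cases(toy), ]

# Split into training and testing sets.
split <- rsample::initial_split(toy, prop = 3 / 4)
train <- rsample::training(split)

# Model specification: tree_depth is tuned, the other parameters are fixed.
spec <- parsnip::boost_tree(
  mode = "regression",
  trees = 50,
  learn_rate = 0.1,
  tree_depth = tune::tune()
) |>
  parsnip::set_engine("xgboost")

wf <- workflows::workflow() |>
  workflows::add_formula(no2 ~ ws + air_temp + hour) |>
  workflows::add_model(spec)

# Regular tuning grid for the parameter given as a range.
grid <- dials::grid_regular(dials::tree_depth(range = c(1L, 5L)), levels = 3)

# Grid search with v-fold cross-validation on the training set.
folds <- rsample::vfold_cv(train, v = 5)
tuned <- tune::tune_grid(wf, resamples = folds, grid = grid)

# Final fit with the best hyperparameters, evaluated on the test set.
best <- tune::select_best(tuned, metric = "rmse")
final_fit <- tune::last_fit(tune::finalize_workflow(wf, best), split)

# Performance metrics and predictions for the held-out data.
metrics <- tune::collect_metrics(final_fit)
preds <- tune::collect_predictions(final_fit)
```

In this sketch only tree_depth is marked for tuning, which mirrors the requirement that at least one parameter be given as a range.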
