This function builds a 'deweathering' machine learning model with useful
methods for interrogating it in an air quality and meteorological context. It
uses any number of variables (most usefully meteorological variables such as
wind speed and wind direction, plus temporal variables defined in
append_dw_vars()) to fit a model predicting a given pollutant. While
these models are useful for 'removing' the effects of meteorology from an air
quality time series (e.g., through simulate_dw_met()), they are also useful
for explanatory analysis (e.g., through plot_dw_partial_1d()).
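As a minimal sketch of a typical call: the function, the pollutant/vars/engine arguments, and the special temporal term "weekday" are as documented above, while the data frame aq_data and its column names (no2, ws, wd, air_temp) are assumptions for illustration.

```r
# Sketch: fit a deweathering model for NO2 from meteorological and
# temporal features. `aq_data` is assumed to contain a 'date' column
# plus the pollutant and feature columns named below; "weekday" is not
# in the data, so it would be created via append_dw_vars().
mod <- build_dw_model(
  data = aq_data,
  pollutant = "no2",
  vars = c("ws", "wd", "air_temp", "weekday"),
  engine = "xgboost"
)
```

The fitted object can then be passed to the companion functions mentioned above, such as simulate_dw_met() or plot_dw_partial_1d().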
Arguments
- data
An input data.frame containing one pollutant column (defined using pollutant) and a collection of feature columns (defined using vars).
- pollutant
The name of the column (likely a pollutant) in data to predict.
- vars
The names of the columns in data to use as model features - i.e., to predict the values in the pollutant column. Any character columns will be coerced to factors. "hour", "weekday", "trend", "yday", "week", and "month" are special terms and will be passed to append_dw_vars() if not present in names(data).
- tree_depth
Tree Depth
<xgboost|lightgbm> An integer for the maximum depth of the tree (i.e., number of splits).
- trees
Number of Trees
<xgboost|lightgbm|ranger> An integer for the number of trees contained in the ensemble.
- learn_rate
Learning Rate
<xgboost|lightgbm> A number for the rate at which the boosting algorithm adapts from iteration to iteration. This is sometimes referred to as the shrinkage parameter.
- mtry
Number of Randomly Selected Predictors
<xgboost|lightgbm|ranger> A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models.
- min_n
Minimal Node Size
<xgboost|lightgbm|ranger> An integer for the minimum number of data points in a node that is required for the node to be split further.
- loss_reduction
Minimum Loss Reduction
<xgboost|lightgbm> A number for the reduction in the loss function required to split further.
- sample_size
Proportion Observations Sampled
<xgboost> A number for the number (or proportion) of data that is exposed to the fitting routine.
- stop_iter
Number of Iterations Before Stopping
<xgboost> The number of iterations without improvement before stopping.
- engine
A single character string specifying what computational engine to use for fitting. Can be "xgboost", "lightgbm" (boosted trees) or "ranger" (random forest). See the documentation below for more information.
- ...
Used to pass additional engine-specific parameters to the model. The parameters listed here can be tuned using tune_dw_model(). All other parameters must be fixed.
  - alpha: <xgboost> L1 regularization term on weights.
  - lambda: <xgboost> L2 regularization term on weights.
  - num_leaves: <lightgbm> Max number of leaves in one tree.
  - regularization.factor: <ranger> Regularization factor (gain penalization).
  - regularization.usedepth: <ranger> Consider the depth in regularization? (TRUE/FALSE).
  - splitrule: <ranger> Splitting rule. One of dials::ranger_reg_rules.
  - alpha: <ranger> Significance threshold to allow splitting (for splitrule = "maxstat").
  - minprop: <ranger> Lower quantile of covariate distribution to be considered for splitting (for splitrule = "maxstat").
  - num.random.splits: <ranger> Number of random splits to consider for each candidate splitting variable (for splitrule = "extratrees").
- .date
The name of the 'date' column which defines the air quality time series. Passed to append_dw_vars() if needed. Also used to extract the time zone of the data for later restoration if trend is used as a variable.
Modelling Approaches and Parameters
Types of Model
There are two modelling approaches available to build_dw_model():
- Boosted Trees (xgboost, lightgbm)
- Random Forest (ranger)
Each of these approaches takes different parameters.
Boosted Trees
Two engines are available for boosted tree models:
- "xgboost"
- "lightgbm"
The following universal parameters apply and are tunable:
- tree_depth: Tree Depth
- trees: # Trees
- learn_rate: Learning Rate
- mtry: # Randomly Selected Predictors
- min_n: Minimal Node Size
- loss_reduction: Minimum Loss Reduction
- sample_size: Proportion Observations Sampled (xgboost only)
- stop_iter: # Iterations Before Stopping (xgboost only)
The following xgboost-specific parameters are tunable:
- alpha: L1 regularization term on weights. Increasing this value will make the model more conservative.
- lambda: L2 regularization term on weights. Increasing this value will make the model more conservative.
The following lightgbm-specific parameters are tunable:
- num_leaves: Max number of leaves in one tree.
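A hedged sketch of a boosted-tree fit that sets several of the universal parameters above and passes the engine-specific num_leaves through ...; the parameter names and engine come from this page, while aq_data and its columns are illustrative assumptions, and the particular values are arbitrary starting points rather than recommendations.

```r
# Sketch: a lightgbm boosted-tree model. trees, tree_depth and learn_rate
# are universal boosted-tree parameters; num_leaves is lightgbm-specific
# and is passed through `...`. `aq_data` and its columns are assumed.
mod <- build_dw_model(
  data = aq_data,
  pollutant = "no2",
  vars = c("ws", "wd", "air_temp"),
  engine = "lightgbm",
  trees = 500,
  tree_depth = 6,
  learn_rate = 0.05,
  num_leaves = 31
)
```

Note that sample_size and stop_iter would be ignored here, as they apply to the "xgboost" engine only.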
Random Forest
One engine is available for random forest models:
- "ranger"
The following universal parameters apply and are tunable:
- mtry: # Randomly Selected Predictors
- trees: # Trees
- min_n: Minimal Node Size
The following ranger-specific parameters are tunable:
- regularization.factor: Regularization factor (gain penalization)
- regularization.usedepth: Consider the depth in regularization? (TRUE/FALSE)
- splitrule: Splitting rule. One of dials::ranger_reg_rules
- alpha: Significance threshold to allow splitting (for splitrule = "maxstat")
- minprop: Lower quantile of covariate distribution to be considered for splitting (for splitrule = "maxstat")
- num.random.splits: Number of random splits to consider for each candidate splitting variable (for splitrule = "extratrees")
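A corresponding sketch for the random forest approach: the universal ranger parameters are set directly, and a ranger-specific splitting rule with its companion parameter is passed through .... As before, aq_data and its columns are assumptions for illustration, and the values are arbitrary.

```r
# Sketch: a random forest via the ranger engine. mtry, trees and min_n
# are the universal random-forest parameters; splitrule and
# num.random.splits are ranger-specific and passed through `...`.
# num.random.splits only applies because splitrule = "extratrees".
mod <- build_dw_model(
  data = aq_data,
  pollutant = "no2",
  vars = c("ws", "wd", "air_temp"),
  engine = "ranger",
  trees = 300,
  mtry = 2,
  min_n = 5,
  splitrule = "extratrees",
  num.random.splits = 3
)
```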
