
Meteorological Normalisation with {deweather}
Source:vignettes/articles/deweather.Rmd
deweather.RmdIntroduction
Meteorology plays a central role in affecting the concentrations of pollutants in the atmosphere. When considering trends in air pollutants it can be very difficult to know whether a change in concentration is due to emissions or meteorology.
The deweather package uses a powerful statistical technique based on boosted regression trees using a variety of packages using the tidymodels framework. This allows for a variety of engines to be employed, with the default being xgboost. Statistical models are developed to explain concentrations using meteorological and other variables. These models can be tested on randomly withheld data with the aim of developing the most appropriate model.
Much of deweather supports parallel processing with the mirai package, so along with loading the library we’ll set some daemons as well as a seed.
Example data set
The deweather package comes with a comprehensive
data set of air quality and meteorological data. The air quality data is
from Marylebone Road in central London (obtained from the
openair package) and the meteorological data from
Heathrow Airport (obtained from the worldmet package).
The aqroadside data frame contains various pollutants such
a NOx, NO2, ethane and isoprene as well as
meteorological data including wind speed, wind direction, relative
humidity, ambient temperature and cloud cover.
head(aqroadside)
#> # A tibble: 6 × 11
#> date nox no2 ethane isoprene benzene ws wd air_temp
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2000-01-01 00:00:00 388 78 13.5 1.22 7.74 2.1 200 7.4
#> 2 2000-01-01 01:00:00 886 84 12.0 0.85 5.66 2.1 190 7.83
#> 3 2000-01-01 02:00:00 816 117 14.7 1.75 11.4 1.8 215 7.95
#> 4 2000-01-01 03:00:00 636 97 20.4 4.36 30.8 1.25 240 8.05
#> 5 2000-01-01 04:00:00 483 74 18.8 3.08 19.7 1.55 260 8.6
#> 6 2000-01-01 05:00:00 231 65 16.4 2.12 13.1 2.35 270 8.35
#> # ℹ 2 more variables: rh <dbl>, cl <dbl>Prepare for Model Building
A straightforward way to get started is to use the
append_dw_vars() function to attach a load of useful model
features to your air quality data.frame. Variables such as
the hour of the day ("hour") and day of the week
("weekday") are used as features to explain some of the
variation. "hour" for example very usefully acts as a proxy
for the diurnal variation in emissions. These temporal emission proxies
are also important to include to help the model differentiate between
emission versus weather-related changes. For example, emissions tend to
change throughout a day and so do variables such as wind speed and
ambient temperature (here already present in the data as
"ws" and "air_temp" respectively).
Note that, strictly, append_dw_vars() does not need to
be run directly; if build_dw_model() or
tune_dw_model() are passed any of "hour",
"weekday", "trend", "yday",
"week", or "month" to vars, the
parent function will use append_dw_vars() itself in the
background if those columns do not exist in the input dataframe.
append_dw_vars(aqroadside) |>
dplyr::glimpse()
#> Rows: 149,040
#> Columns: 18
#> $ date <dttm> 2000-01-01 00:00:00, 2000-01-01 01:00:00, 2000-01-01 02:00:0…
#> $ nox <dbl> 388, 886, 816, 636, 483, 231, 380, 287, 187, 111, 149, 132, 3…
#> $ no2 <dbl> 78, 84, 117, 97, 74, 65, 67, 69, 67, 57, 57, 57, 82, 84, 76, …
#> $ ethane <dbl> 13.54, 11.99, 14.69, 20.38, 18.75, 16.38, 16.16, NA, NA, NA, …
#> $ isoprene <dbl> 1.22, 0.85, 1.75, 4.36, 3.08, 2.12, 1.25, NA, NA, NA, NA, 0.4…
#> $ benzene <dbl> 7.74, 5.66, 11.38, 30.75, 19.66, 13.13, 7.64, NA, NA, NA, NA,…
#> $ ws <dbl> 2.10, 2.10, 1.80, 1.25, 1.55, 2.35, 1.80, 1.00, 0.50, 0.00, 0…
#> $ wd <dbl> 200, 190, 215, 240, 260, 270, 260, 250, 240, NA, 210, 275, 26…
#> $ air_temp <dbl> 7.400000, 7.833333, 7.950000, 8.050000, 8.600000, 8.350000, 8…
#> $ rh <dbl> 97.33482, 97.15970, 99.00136, 98.67437, 95.11393, 97.71058, 9…
#> $ cl <dbl> 5.0, 6.0, 4.5, 4.5, 7.5, 5.5, 8.0, 4.0, 1.5, 2.5, 4.0, 7.0, 4…
#> $ trend <dbl> 946684800, 946688400, 946692000, 946695600, 946699200, 946702…
#> $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ weekday <fct> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, S…
#> $ weekend <fct> weekend, weekend, weekend, weekend, weekend, weekend, weekend…
#> $ yday <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ week <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…Build a Model
Tuning a Model
There are many input parameters to build_dw_model() and,
while sensible defaults have been set, you may be interested in seeing
how tweaking them can influence how well the model behaves. This is
called “tuning”, and the tune_dw_model() function provides
an interface for this. Any of the boosted decision tree parameters
(e.g., tree_depth and trees) can be set as a
range which is then combined with grid_levels to create a
regular grid of all parameter combinations.
Note that this process can take a while, especially if many
parameters are being tuned. One way to speed this up is to invoke
parallel processing through the mirai package - see https://tune.tidymodels.org/articles/extras/optimizations.html#parallel-processing
for more information. In this example, we’ll also significantly trim
down the data to speed things up.
tuned_results <-
aqroadside |>
# want to sample by weekday to get good spread of categorical data
append_dw_vars("weekday") |>
dplyr::slice_sample(n = 250, by = weekday) |>
# tune model
tune_dw_model(
pollutant = "no2",
tree_depth = c(1, 5),
trees = c(150, 250),
grid_levels = 3L
)
#> → A | warning: `early_stop` was reduced to 149.
#> There were issues with some computations A: x1
#> There were issues with some computations A: x1
#> This output has a few useful features. First, we’re informed that the
best value for trees is 150 and for tree_depth
is 5.
tuned_results$best_params
#> $trees
#> [1] 150
#>
#> $tree_depth
#> [1] 5We can also see how the tuned model behaved on some reserved testing data. A table of predictions is returned, as well as a table of statistics and a scatter plot of modelled vs measured values.
dplyr::glimpse(tuned_results$final_fit$metrics)
#> Rows: 1
#> Columns: 12
#> $ pollutant <fct> no2
#> $ n <int> 424
#> $ fac2 <dbl> 0.9716981
#> $ mb <dbl> -1.593782
#> $ mge <dbl> 19.85444
#> $ nmb <dbl> -0.01676148
#> $ nmge <dbl> 0.208805
#> $ rmse <dbl> 27.60343
#> $ r <dbl> 0.8157166
#> $ p <dbl> 2.258024e-102
#> $ coe <dbl> 0.4743745
#> $ ioa <dbl> 0.7371873
tuned_results$final_fit$plot
Finalising a Model
Assuming that a good model can be developed, it can now be explored
in more detail. deweather has useful defaults for many of
the model parameters, but these can be adjusted by the user if
required.
no2_model <-
build_dw_model(
data = aqroadside,
pollutant = "no2",
vars = c("trend", "ws", "wd", "hour", "weekday", "air_temp"),
engine = "xgboost"
)This function returns a “deweathering model” object. We can see a quick summary of it by simply printing it.
no2_model
#>
#> ── Deweather Model ─────────────────────────────────────────────────────────────
#> • A model for predicting no2 using wd (42.1%), hour (27.9%), trend (12.8%),
#> weekday (8.2%), ws (6.5%), and air_temp (2.6%).
#>
#>
#> ── Model Parameters ──
#>
#>
#>
#> • tree_depth: 5
#>
#> • trees: 200
#>
#> • learn_rate: 0.1
#>
#> • mtry: NULL
#>
#> • min_n: 10
#>
#> • loss_reduction: 0
#>
#> • sample_size: 1
#>
#> • stop_iter: 190We can also pull out specific features of the model using
get_dw_*() functions. deweather has many of
such ‘getter’ functions, and they form a consistent, useful API for
accessing relevant features of different objects created in the
package.
get_dw_pollutant(no2_model)
#> [1] "no2"
get_dw_vars(no2_model)
#> [1] "trend" "ws" "wd" "hour" "weekday" "air_temp"
get_dw_input_data(no2_model)
#> # A tibble: 144,535 × 7
#> no2 trend ws wd hour weekday air_temp
#> <dbl> <dbl> <dbl> <dbl> <int> <fct> <dbl>
#> 1 78 946684800 2.1 200 0 Sat 7.4
#> 2 84 946688400 2.1 190 1 Sat 7.83
#> 3 117 946692000 1.8 215 2 Sat 7.95
#> 4 97 946695600 1.25 240 3 Sat 8.05
#> 5 74 946699200 1.55 260 4 Sat 8.6
#> 6 65 946702800 2.35 270 5 Sat 8.35
#> 7 67 946706400 1.8 260 6 Sat 8.2
#> 8 69 946710000 1 250 7 Sat 7.1
#> 9 67 946713600 0.5 240 8 Sat 6.2
#> 10 57 946720800 0.5 210 10 Sat 6.9
#> # ℹ 144,525 more rowsOne feature we immediately have access to is the “feature importance”
score. This can be obtained as a data.frame using a similar
approach to the above. This varies depending on the chosen
engine, but for boosted trees this represents ‘Gain’ - the
fractional contribution of each feature to the model based on the total
gain of this feature’s splits. A higher percentage means a more
important predictive feature.
get_dw_importance(no2_model)
#> # A tibble: 12 × 2
#> var importance
#> <fct> <dbl>
#> 1 wd 0.421
#> 2 hour 0.279
#> 3 trend 0.128
#> 4 ws 0.0646
#> 5 weekdaySun 0.0450
#> 6 weekdaySat 0.0282
#> 7 air_temp 0.0262
#> 8 weekdayMon 0.00257
#> 9 weekdayThu 0.00230
#> 10 weekdayWed 0.00163
#> 11 weekdayTue 0.001000
#> 12 weekdayFri 0.000862The Gain can be automatically plotted as a bar chart using the
plot_dw_importance() function. The wind direction is the
most predictive feature, and the day of the week being any week day is
the least predictive feature.
plot_dw_importance(no2_model)
The feature importance of our model.
You’ll notice that our ‘character’ variable, the day of the week, has
been split out into multiple levels. This is because
xgboost requires numeric variables, so behind the scenes
our ‘day of the week’ variable is split out into a matrix of seven
variables. This is informative - for example, we can see here that the
day of the week being a weekend is a useful feature. Regardless, if you
would like to see factor features as single features, the
aggregate_factors argument may be of use.
plot_dw_importance(no2_model, aggregate_factors = TRUE)
The feature importance of our model, with factors aggregated.
Examine the partial dependencies
Basic PD Plots
One of the benefits of the boosted regression tree approach is that the partial dependencies can be explored. In simple terms, the partial dependencies show the relationship between the pollutant of interest and the covariates used in the model while holding the value of other covariates at their mean level.
Lets plot a partial dependency for the hour variable.
We’ll set n to 100 for speed; this will sample
100 random samples of our original data to construct the plot. We can
see that no2 is highest during the day and lowest
overnight, everything else kept equal.
plot_dw_partial_1d(no2_model, "hour", n = 100)
A categorical variable like weekday looks slightly
different, but achieves a similar result. no2 is lowest on
weekends, all else being equal.
plot_dw_partial_1d(no2_model, "weekday", n = 100)
If a variable isn’t given, all variables will be plotted in a
patchwork assembly in order of importance.
plot_dw_partial_1d(no2_model, n = 100)
Grouped PD
Sometimes, when examining partial dependences, it is useful to consider grouped partial dependences. This means, before PDs are calculated, we split the data into groups and calculate PDs for each subset of the data. Let’s give this a go now.
plot_dw_partial_1d(no2_model, "hour", group = "weekday", n = 1000)
Note that group need not be categorical; you may provide
a continuous variable which will be binned into
group_intervals. Here we group by "air_temp"
and split it into 3 equally sized bins.
plot_dw_partial_1d(
no2_model,
"hour",
group = "air_temp",
group_intervals = 3L,
n = 1000
)
Two-Way Interactions
Above we needed to treat one of our continuous features as a factor feature.
It can be very useful to plot important two-way interactions. In this
example the interaction between "ws" and
"air_temp" is considered. The plot shows that
no2 tends to be high when the wind speed is low and the
temperature is low, i.e., stable atmospheric conditions. Also
no2 tends to be high when the temperature is high, which is
most likely due to more O3 available to convert NO to
NO2. In fact, background O3 would probably be a
useful covariate to add to the model.
plot_dw_partial_2d(no2_model, "ws", "air_temp", n = 200)
It can be easier to see some of these relationships using a binned
scale. contour = "lines" will overlay the continuous colour
surface with contour lines. contour = "fill" will bin the
entire colour scale. The number of bins is controlled using the
contour_bins argument.
plot_dw_partial_2d(
no2_model,
"ws",
"air_temp",
n = 200,
contour = "fill",
contour_bins = 10,
cols = "turbo"
)
Customisation
These plots can be customised like any other ggplot2 object.
library(ggplot2)
plot_dw_partial_1d(no2_model, "hour",group = "weekday") +
theme_light(14) +
theme(
title = element_text(face = "bold"),
plot.subtitle = element_text(face = "plain", size = 11),
legend.position = "none"
) +
scale_fill_manual(
values = c(
"orange",
"cadetblue",
"lightblue4",
"lightblue3",
"lightblue2",
"lightblue1",
"orange3"
),
aesthetics = c("fill", "color")
) +
labs(
x = "Hour of the Day",
y = "NOx (ug/m3)",
fill = "Weekday",
color = "Weekday",
subtitle = "Weekends (orange) have lower NOx emissions than weekdays (blue)."
) +
scale_y_continuous(limits = c(0, NA)) +
scale_x_continuous(breaks = seq(0, 24, 4), expand = expansion())
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
A thoroughly customised partial dependence plot.
Apply meteorological averaging
An indication of the meteorologically-averaged trend is
given by the plot_dw_partial_1d() function above.
plot_dw_partial_1d(no2_model, "trend", n = 100, intervals = 100)
A much better indication is given by using the model to predict many
times with random sampling of meteorological
conditions. This sampling is carried out by the
simulate_met() function.
Note that you’d typically want n to be a higher value
than what’s used in this example; it has been set to 50 for
speed. Recall also that some mirai::daemons() are set to
allow for parallelism.
demet <- simulate_dw_met(
no2_model,
vars = c("ws", "wd", "air_temp"),
n = 50
)Now it is possible to plot the resulting trend.
library(ggplot2)
ggplot(demet, aes(x = date, y = no2)) +
geom_line(linewidth = 0.01) +
theme_bw() +
labs(
y = openair::quickText("NO2"),
x = NULL
)
A deweathered nitrogen dioxide trend.
The plot shows the trend in NO2 controlling for the main weather variables. The plot now reveals the strong diurnal and weekly cycle in NO2 that is driven by variations in the sources of NO2 (NOx) rather than meteorology, i.e., road traffic which has strong hourly and daily variations throughout the year. It can be useful to simply average the results to provide a better indication of the overall trend. For example:
demet |>
openair::timeAverage("month") |>
ggplot(aes(x = date, y = no2)) +
geom_line() +
theme_bw() +
labs(
y = openair::quickText("NO2"),
x = NULL
)
A time-averaged deweathered nitrogen dioxide trend.
With some simple data manipulation, we can compare the meteorological simulation, the partial dependence profile, and the input data. It is clear that the input data has much more noise. The PD profile are simulated data are similar, but the simulated data are higher overall. This highlights that purely relying on the PD profiles (which isolates the trend from even temporal variables like day of the week) may underestimate pollutant concentrations.
# PD - set intervals to something high for a good profile
pd_trend <- plot_dw_partial_1d(
no2_model,
"trend",
plot = FALSE,
n = 50,
intervals = 200
) |>
purrr::pluck("trend") |>
dplyr::transmute(
date = as.POSIXct(trend),
no2 = mean
)
# input data
input_data <- get_dw_input_data(no2_model) |>
dplyr::transmute(
date = as.POSIXct(trend),
no2 = no2
)
# combine with demet and plot
dplyr::bind_rows(
"Simulated" = demet,
"Input" = input_data,
"Partial Dep" = pd_trend,
.id = "source"
) |>
openair::timeAverage("month", type = "source") |>
ggplot(aes(x = date, y = no2, color = source)) +
geom_line() +
theme_bw() +
labs(
y = openair::quickText("NO2"),
x = NULL
) +
scale_color_manual(
values = c(
"Input" = "grey70",
"Partial Dep" = "orange",
"Simulated" = "royalblue"
)
)
Note that the function predict_dw() is also provided for
generalised use of a deweather model for prediction.