Overview
Often, I want to analyze a suite of response variables using mixed-effects models that have identical predictor variables and the same random effects structure. Rather than creating models for each variable separately (copy-paste ad infinitum…) I’ve found that it’s more efficient and less error-prone to use a split-apply-combine process to work with all the models at once.
This script shows the general workflow for such a process using a dataset of plant and insect biomass responses to a nutrient addition (fertilization) experiment in arctic tundra. I needed to know whether the biomass of each food-web group (parasitoids, predators, herbivores, decomposers, and plants) responded to nutrient addition, and whether these responses were proportional to the amount of nutrients added (linear), or non-linear, i.e., saturating/accelerating at higher levels of nutrient enrichment.
This workflow has four steps: 1. Model selection: comparing models with and without a second-order polynomial term for the amount of nutrients added ([Grams N]^2) 2. Model summaries: extracting and compiling tables 3. Model predictions: generating predicted values for each response variable for plotting 4. Model plotting
Opening the toolbox
library(data.table)
# for data wrangling
library(ggplot2); library(cowplot)
# good graphics; made even better with cows
library(lme4); library(lmerTest)
# stats for mixed-effects models; I like P-values
Opening the data
summary(bio)
## Group Treatment_gNm2 Block Sample
## Detritivores:75 Min. : 0.000 Min. :1.00 Min. :1.000
## Hematophages:75 1st Qu.: 1.000 1st Qu.:1.00 1st Qu.:2.000
## Herbivores :75 Median : 2.000 Median :2.00 Median :3.000
## Parasitoids :67 Mean : 3.701 Mean :1.97 Mean :3.033
## Predators :43 3rd Qu.: 5.000 3rd Qu.:3.00 3rd Qu.:4.000
## Max. :10.000 Max. :3.00 Max. :5.000
## biomass
## Min. : 0.0214
## 1st Qu.: 2.0116
## Median : 6.5574
## Mean : 20.8025
## 3rd Qu.: 22.6586
## Max. :367.4464
These data are in “long form”, where the column “Group” refers to the the food-web group. The groups are Detritivores (decomposers), Hematophages (biting flies including mosquitoes and blackflies), Herbivores, Parasitoids (e.g., wasps that prey on caterpillars), and Predators (e.g., spiders).
Each treatment (0, 1, 2, 5 or 10 grams Nitrogen added) was replicated three times, with replicate plots grouped into 3 blocks; I sampled biomass of each group (in mg) five times in each plot.
Fit models for each response variable and select the best model
For each variable, I wanted to compare models of varying complexity for each response variable, and figure out whether more complex models were better able to predict the response and were more parsimonious. Specifically, I wanted to check whether a polynomial term for a predictor variable (Grams of nitrogen added to the plant community) was necessary; i.e., whether responses to nitrogen addition were accelerating, saturating, or simply linear. I used likelihood ratio tests to select the best model for each response variable, and stored these in a list. I also compared new-ish measure of model fit for mixed-effects models: the R2c from package MuMIn (Bartoń 2019). This value represents the total amount of variance explained by both fixed and random effects (block, in our case) in the model.
I used a for() loop to do this because the dataset is small, and I find loops easy to understand when tweaking and compiling little tables step-by-step. But, it could probably also be easy to do with a sapply() vectorized function, if your dataset was larger or your list of response variables longer.
# Create a list of data frames, split by group:
lmer_biom<-split(bio, bio$Group)
# Create an empty list to store the results from comparing models to each other
modcomparelist<-list()
# Create an empty list to store the models themselves
model_list<-list()
for(i in 1:length(lmer_biom)){
# extract the name of this group (the name of the split data frame):
this_group<-names(lmer_biom)[i]
x<-lmer_biom[[i]]
mod1 <- with(x, lmer(biomass ~ Treatment_gNm2 + (1|Block)))
mod2 <- with(x, lmer(biomass ~ poly(Treatment_gNm2, 2, raw = T) + (1|Block)))
modcompare<-anova(mod1, mod2)
# report the amount of variance explained by each model:
R2c<-rbind(round(MuMIn::r.squaredGLMM(mod1)[2], digits = 3),
# r.squaredGLMM(x)[2] is R2c, the total amt of variance explained by the model
#(including random effect for block)
round(MuMIn::r.squaredGLMM(mod2)[2], digits = 3))
# how much more variance is explained with the more complex model:
diff_r2c<-round(MuMIn::r.squaredGLMM(mod2)[2], digits = 3) -
round(MuMIn::r.squaredGLMM(mod1)[2], digits = 3)
# if less than .001, print special string:
diff_r2c<-ifelse(diff_r2c == 0, "<0.001", diff_r2c)
modcompare<-cbind(R2c, modcompare)
modcompare$diff_r2c<-c(NA, diff_r2c)
# store the model comparison in our empty list:
modcomparelist[[this_group]]<-modcompare
# extract p-value of likelihood ratio test to select best model:
p_val<-modcompare$`Pr(>Chisq)`[2]
# find the best model:
if(p_val < 0.05){best_model = mod2} else if (p_val>=0.05){best_model = mod1}
# store the best model in our list of models:
model_list[[this_group]]<-best_model
}
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help
## page.
Output tables for the comparison of the models.
modcomparetable<-rbindlist(modcomparelist, idcol = "Response")
modcomparetable<-data.table(modcomparetable)
# round some of the stats
modcomparetable[,c("AIC", "BIC", "logLik") := round(.SD,0), .SDcols=c("AIC", "BIC", "logLik")]
modcomparetable[,c("Chisq", "Pr(>Chisq)") := round(.SD,3), .SDcols=c("Chisq", "Pr(>Chisq)")]
knitr::kable(modcomparetable)
Response | R2c | Df | AIC | BIC | logLik | deviance | Chisq | Chi Df | Pr(>Chisq) | diff_r2c |
---|---|---|---|---|---|---|---|---|---|---|
Detritivores | 0.384 | 4 | 584 | 593 | -288 | 575.7350 | NA | NA | NA | NA |
Detritivores | 0.384 | 5 | 585 | 597 | -288 | 575.2715 | 0.463 | 1 | 0.496 | <0.001 |
Hematophages | 0.034 | 4 | 637 | 646 | -314 | 628.7215 | NA | NA | NA | NA |
Hematophages | 0.052 | 5 | 637 | 649 | -314 | 627.2555 | 1.466 | 1 | 0.226 | 0.018 |
Herbivores | 0.408 | 4 | 821 | 830 | -406 | 812.7011 | NA | NA | NA | NA |
Herbivores | 0.409 | 5 | 822 | 834 | -406 | 812.1684 | 0.533 | 1 | 0.465 | 0.001 |
Parasitoids | 0.108 | 4 | 315 | 324 | -154 | 307.3732 | NA | NA | NA | NA |
Parasitoids | 0.110 | 5 | 317 | 328 | -154 | 307.1560 | 0.217 | 1 | 0.641 | 0.002 |
Predators | 0.008 | 4 | 284 | 292 | -138 | 276.4769 | NA | NA | NA | NA |
Predators | 0.010 | 5 | 286 | 295 | -138 | 276.3918 | 0.085 | 1 | 0.771 | 0.002 |
Our model comparison indicates that all response variables have a linear relationship with the amount of nutrients added: likelihood ratio test P-values are all >0.05, indicating that linear models are most parsimonious. Furthermore, incorporating the a second-order polynomial term for the amount of nitrogen added explains at most 1.8% more varaince, and usually much less.
Generate summary tables for each model
Next, I customized summary tables for each model by rounding estimates, standard errors, and P-values to the number of decimals typically reported in ecological studies. Although I usually copy-paste the output into Excel or Word to complete table formatting, it’s nice to have some of the work already done.
# a place to store each summary:
summaries_list<-list()
# to store R2m and R2c scores:
goodnessoffit_list<-list()
# loop over our list of mixed-effects models:
for(i in 1:length(model_list)){
this_group<-names(model_list)[i]
x<-model_list[[i]]
mod_summary<-summary(x)
summarytab<-cbind(
Response = c(this_group, rep("", ntimes = length(rownames(mod_summary$coef))-1)),
Term = rownames(mod_summary$coef),
as.data.frame(mod_summary$coef))
# adding in a little string about goodness of fit:
goodnessoffit<-paste0("R2m = ", round(MuMIn::r.squaredGLMM(x)[1], digits = 2),
"; R2c = ", round(MuMIn::r.squaredGLMM(x)[2], digits = 2))
# put this in a footer row we can merge in with our summary table
# (this is super hack-y, should find a better way to do this with kable someday)
goodnessoffit_row<-summarytab[1,]
goodnessoffit_row[,c("Response", "Estimate", "Std. Error", "df", "t value", "Pr(>|t|)")]<-NA
goodnessoffit_row$Term<-goodnessoffit
summaries_list[[this_group]]<-rbind(summarytab, goodnessoffit_row)
}
Then we compile the summaries together.
summarytable<-rbindlist(summaries_list)
summarytable[,c("df", "t value") := round(.SD,1), .SDcols=c("df", "t value") ]
summarytable[,c("Estimate", "Std. Error") := round(.SD,2), .SDcols=c("Estimate", "Std. Error")]
summarytable$`Pr(>|t|)` <- round(summarytable$`Pr(>|t|)`, digits = 3)
summarytable$`Pr(>|t|)` <- ifelse(summarytable$`Pr(>|t|)` == 0, "<0.001", summarytable$`Pr(>|t|)`)
# more hacks and tweaks...
summarytable$Estimate<-as.character(summarytable$Estimate)
summarytable$`Std. Error`<-as.character(summarytable$`Std. Error`)
summarytable$df<-as.character(summarytable$df)
summarytable$`t value`<-as.character(summarytable$`t value`)
summarytable[grep("R2m", summarytable$Term),c("Response", "Estimate", "Std. Error","df", "t value", "Pr(>|t|)")]<-""
knitr::kable(summarytable)
Response | Term | Estimate | Std. Error | df | t value | Pr(>|t|) |
---|---|---|---|---|---|---|
Detritivores | (Intercept) | 10.45 | 2.68 | 3.4 | 3.9 | 0.025 |
Treatment_gNm2 | 2.23 | 0.36 | 71 | 6.3 | <0.001 | |
R2m = 0.33; R2c = 0.38 | ||||||
Hematophages | (Intercept) | 8.11 | 3.07 | 4.9 | 2.6 | 0.047 |
Treatment_gNm2 | -0.31 | 0.51 | 71 | -0.6 | 0.546 | |
R2m = 0; R2c = 0.03 | ||||||
Herbivores | (Intercept) | 28.16 | 17.42 | 2.6 | 1.6 | 0.218 |
Treatment_gNm2 | 9.81 | 1.7 | 71 | 5.8 | <0.001 | |
R2m = 0.27; R2c = 0.41 | ||||||
Parasitoids | (Intercept) | 1.29 | 0.49 | 5.8 | 2.6 | 0.041 |
Treatment_gNm2 | 0.2 | 0.08 | 63.1 | 2.5 | 0.016 | |
R2m = 0.08; R2c = 0.11 | ||||||
Predators | (Intercept) | 3.19 | 1.36 | 41 | 2.4 | 0.024 |
Treatment_gNm2 | 0.15 | 0.25 | 41 | 0.6 | 0.565 | |
R2m = 0.01; R2c = 0.01 |
Generate predicted values from the model for plotting
predicted_values_list<-list()
# we can use the same empty dataset to fill in with predictions for all the models, because the model was the same.
emptydat<-expand.grid(Treatment_gNm2 = unique(bio$Treatment_gNm2))
# (can expand this to other variables as needed)
# make a list to store the predicted values for each group/response variable
predicted_values_list<-list()
for (i in 1:length(model_list)){
this_group<-names(model_list)[i]
x<-model_list[[i]]
summary<-summaries_list[[i]]
# We only want to predict values for each treatment where treatment was significant
# what was the smallest P-value for treatment?
# (the way I have written this, it would also work if treatment was interacting with another variable.)
min_p<-min(summary$`Pr(>|t|)`[grep("Treatment", summary$Term)])
# generate predictions if P <0.05, otherwise ignore.
if(min_p<0.05){predicted_value = predict(x, newdata = emptydat, re.form = NA)}else if(min_p>=0.05){predicted_value = rep(NA, nrow(emptydat))}
predicted_value<-cbind(emptydat, predicted_value)
predicted_values_list[[this_group]]<-predicted_value
}
predicted_values<-rbindlist(predicted_values_list, idcol = "Group")
head(predicted_values, 10)
## Group Treatment_gNm2 predicted_value
## 1: Detritivores 0 10.44744
## 2: Detritivores 1 12.67267
## 3: Detritivores 2 14.89789
## 4: Detritivores 5 21.57357
## 5: Detritivores 10 32.69969
## 6: Hematophages 0 NA
## 7: Hematophages 1 NA
## 8: Hematophages 2 NA
## 9: Hematophages 5 NA
## 10: Hematophages 10 NA
Plot predicted values and observed values
# calculate means & standard errors from raw data for plotting.
# average for each block:
avg_block_bio<-with(bio,
aggregate(biomass ~ Group + Treatment_gNm2 + Block,
FUN = function(x) mn = mean(x)))
# average again for each treatment, also get a standard error for each treatment:
avg_bio<-with(avg_block_bio,
aggregate(biomass ~ Group + Treatment_gNm2,
FUN = function(x) c(mn = mean(x), se = sciplot::se(x))))
avg_bio<-do.call(data.frame, avg_bio)
biomass_stat_plot<-
ggplot(avg_bio, aes(x = Treatment_gNm2, y = biomass.mn)) +
geom_point(cex = 3)+
geom_errorbar(aes(ymin = biomass.mn - biomass.se, max = biomass.mn + biomass.se), width = 0)+
facet_wrap(~Group, nrow = 2, scales = "free")+
scale_x_continuous(breaks = c(0, 1, 2, 5, 10))+
geom_smooth(data = predicted_values, aes(x = Treatment_gNm2, y = predicted_value),
method = "loess", se = F, span = 2, color = "black")+
xlab(label=expression(paste("N (g m"^-2,")")))+
ylab(label=expression(paste("dry biomass (mg m"^-2, ")")))
biomass_stat_plot
## Warning: Removed 10 rows containing non-finite values (stat_smooth).