Model selection 101, using R

Lots of insignificant variables though! This is our first attempt at a "full" model, so let's define this as our 'fit.2' and continue.

fit.2 <- tempfit

4. Adding variable interaction

The easiest way to check for variable interactions is the R function 'add1'; it is simply a matter of defining a scope to test and which test to use when testing relative to the original model. F-tests are usually only relevant for LM and AOV models, so we can 'safely' ignore that criterion; we'll instead be using a χ²-test (Chisq or Chi²):

add1(fit.2, scope = .~. + .^2, test = "Chisq")

This scope is simply asking to test the current model (.~.) plus interactions between existing variables (+ .^2). This will output a lot of interactions, some with statistically significant P-values, but it can be annoying to sort through manually, so let's sort the list so we get the lowest P-values at the top:

add1.test <- add1(fit.2, scope = .~. + .^2, test = "Chisq")
add1.test[order(add1.test$'Pr(>Chi)'),]

Right, so it seems like there might be an interaction between the foreigner and age variables.

One thing to consider before simply adding the interaction with the lowest P-value is whether or not it makes sense in the context of our current model. Right now age² is actually the most significant variable in our model, so we might argue that adding the interaction between foreigner and age² is more intuitive; for simplicity we'll stick with the foreigner:age interaction.
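If you did want to look at the age² alternative discussed above, a single-term check could look like the sketch below (this call isn't from the original article; foreigner:I(age^2) is simply the alternative term under consideration):

add1(fit.2, scope = .~. + foreigner:I(age^2), test = "Chisq")  # test adding only this one interaction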

Let's test for more interactions after adding the variable interaction foreigner:age:

add1.test <- add1(update(fit.2, .~. + foreigner:age), scope = .~. + .^2, test = "Chisq")
add1.test[order(add1.test$'Pr(>Chi)'),]

Now it seems like there's a significant interaction in foreigner:factor(young.children + school.children == 0). After a few rounds of this we end up seeing no new statistically significant interactions; by the end we've added the following interactions:

+ foreigner:age + foreigner:factor(young.children + school.children == 0) + age:school.children + gov.support:factor(young.children == 0)

So let's update our fit with the new variable interactions as follows:

fit.3 <- update(fit.2, .~. + foreigner:age + foreigner:factor(young.children + school.children == 0) + age:school.children + gov.support:factor(young.children == 0))
summary(fit.3)

A small improvement in AIC once more, from 1017.7 to 1006.9.
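If you'd rather see the two AIC values side by side than read them off the two summaries, a quick comparison (not in the original article, but using only the model objects defined above) is:

AIC(fit.2, fit.3)  # small table with degrees of freedom and AIC for each model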

5. Removing insignificant variables

This process is quite similar to the last one in step 4. We'll simply be using the drop1 function in R instead of add1, and since we're seeking to remove rather than add variables, we look for the highest P-value instead of the lowest (we'll still use the χ²-test as our criterion):

drop1.test <- drop1(fit.3, test = "Chisq")
drop1.test[rev(order(drop1.test$'Pr(>Chi)')),]

This tells us mostly the same as our model summary: gov.support² definitely doesn't seem to be statistically significant, so we'll remove that first, and so on and so forth. We end up removing the following variables:

- gov.support²
- young.children
- education
- education²

After those have been removed we see that all the remaining variables are statistically significant, so let's update our fit by removing the variables listed above:

fit.4 <- update(fit.3, .~. - I(gov.support^2) - young.children - education - I(education^2))
summary(fit.4)

Nice, another improvement in AIC, although marginal and insignificant; the main advantage of this model over our previous one is the added simplicity inherent in the reduced number of explanatory variables!

One might wonder: "Why aren't we removing the gov.support variable? It's clearly insignificant when looking at the summary of our model!" This is due to the principle of marginality, which prohibits us from removing an insignificant variable if that variable has a significant interaction with another, like gov.support:factor(young.children == 0).

You might argue that removing gov.support would benefit the simplicity of the model, given that it's clearly insignificant and that the interaction with the young.children == 0 variable is only marginally significant (p = 0.435). However, upon further inspection, when gov.support is removed from the model the variable interaction simply splits into two variables for TRUE and FALSE, giving us no added simplicity; the AIC, all other coefficients, and the null and residual deviance stay exactly the same, and on that account I chose to leave it in the model.
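You can check this reparameterisation yourself; the snippet below isn't from the original article (fit.4b is just a temporary name), but given the same model objects it would show that only the coding of the interaction changes:

fit.4b <- update(fit.4, .~. - gov.support)  # drop the main effect, keep the interaction
summary(fit.4b)                             # the interaction now appears as separate TRUE/FALSE slopes
AIC(fit.4, fit.4b)                          # identical AIC: the two models are reparameterisations of each other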

Doing an add1 and a drop1 test on our new and improved model shows us there are no new significant interactions and that all current variables are significant, so we're done! The final fit is:

glm(employed == "yes" ~ foreigner + gov.support + age + school.children + factor(young.children == 0) + factor(school.children == 0) + factor(young.children + school.children == 0) + I(age^2) + foreigner:age + foreigner:factor(young.children + school.children == 0) + age:school.children + gov.support:factor(young.children == 0), family = binomial, data = dat)

6. Removing outliers

So now that we have a model we're satisfied with, we can look for outliers that negatively affect the model. Using the "car" package (https://www.rdocumentation.org/packages/car/versions/1.0-2) we can use the influencePlot() and outlierTest() functions to find potential outliers.
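The calls themselves aren't shown above (only their output was); a minimal sketch of how they might look, assuming the fit.4 object from step 5, is:

library(car)          # provides influencePlot() and outlierTest()
influencePlot(fit.4)  # flags high-influence points by studentized residual, hat value and Cook's distance
outlierTest(fit.4)    # Bonferroni-adjusted test on the largest studentized residual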

We see that datapoint 416 is classified as an outlier in both tests. We can take a look at the point in a few of our plots to gauge whether or not to remove it:

par(mfrow = c(2, 2)) # 2 row(s), 2 column(s)
plot(fit.4, col=ifelse(as.integer(rownames(dat))==416, "red", "black"), pch=20)

It seems like this point could very well be screwing a bit with our model. Note that we should actually be using Pearson residuals to gauge our model's fit, so the fact that we don't have anything close to a straight line in the upper-left plot is fine; Q-Q plots are irrelevant for this kind of model as well.

Let's try removing the point and take a look at the new fit:

final.fit <- update(fit.4, data = dat[-c(416),])
summary(final.fit)

Down to almost 1000 AIC from the original 1067. This isn't really a relevant measure of performance, though, since we're comparing the AIC of models fitted to two different datasets (we removed point 416); for a fair comparison we would have to conclude that 416 was an outlier in the initial model as well, remove it there too, and then compare the AIC of the initial model without point 416 to our final fit without point 416.

Looking at another round of influencePlot() and outlierTest() we find that datapoint 329 is acting out as well, but looking at the actual plots we see that we can't really justify removing it the way we could with 416. This is our final fit.

7. Model evaluation

So now that we have a final fit where we can't confidently add or remove any interactions, variables, or other transformations, it's time to evaluate whether our model actually fits our data and whether there's even a statistically significant difference between our final fit and the first "naive" fit.
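The plotting code below uses p.resid and fit.val, which aren't defined in this section; presumably they are the Pearson residuals and fitted values of the final model, which could be computed along these lines:

p.resid <- residuals(final.fit, type = "pearson")  # Pearson residuals of the final fit
fit.val <- fitted(final.fit)                       # fitted probabilities of the final fit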

Let's start by taking a look at our fitted values vs. the Pearson residuals:

par(mfrow = c(1, 1)) # 1 row(s), 1 column(s)
plot(p.resid ~ fit.val)
lines(ksmooth(y = p.resid, x = fit.val, bandwidth=.1, kernel='normal'), col='blue')
abline(h=0)

This is a pretty decent fit. Let's take a look at the fit for the major explanatory variables as well. With the exception of 'gov.support' everything looks quite nice; the reason for the bend in 'gov.support' seems to be a single outlier in which someone was granted a substantially lower amount of support than all our other data points.
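The per-variable residual plots discussed above appeared only as figures; a sketch of how they might be produced, assuming final.fit was fitted on dat[-c(416),] and dat416 is just a convenience name introduced here:

dat416 <- dat[-c(416), ]  # the data final.fit was fitted on
par(mfrow = c(1, 3))
for (v in c("age", "gov.support", "school.children")) {
  plot(p.resid ~ dat416[[v]], xlab = v, ylab = "Pearson residuals")
  abline(h = 0)
}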

Checking for potential overfitting

Overfitting is the bane of all statistical modelling: how can you make sure your model isn't just fitting the exact data you fed it? The goal is to make a model which generalizes and doesn't just cater to the data at hand. So how do we test whether our model is overfitting? A popular metric is the delta value generated through cross-validation; we can calculate it using the cv.glm function from the 'boot' package and compare our final fit to our first!

library(boot)
cv.glm(dat[-c(416),], final.fit, K = 13)$delta
cv.glm(dat[-c(416),], update(fit.1, data = dat[-c(416),]), K = 13)$delta

In the code above we're using k-fold cross-validation with k = 13 (since 13 is a factor of 871, the length of our data after removing the outlier), which means we're splitting our data into 13 'chunks'. The delta value is a vector whose first component is the raw cross-validation estimate of prediction error and whose second component is the adjusted cross-validation estimate (designed to compensate for the bias introduced by not using an exhaustive testing method such as leave-one-out). These estimates are subject to some random variance, so you might not get the exact same values, but in our run the prediction error is lower for the final fit, even when testing with cross-validation.

Thus we can assume that our model hasn't been overfitting on our data!

The final piece of the puzzle

So now we've concluded that our model is actually a pretty decent fit for our data, but is it statistically significantly different from the "naive" model without any transformations and variable interactions? We can use an ANOVA test for this; we just have to remove the same data point from both fits, as in the sketch following this paragraph. There's definitely a significant difference between the two fits, so we can happily conclude that our hard work has paid off!
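The comparison itself isn't shown in the text; a minimal sketch, assuming fit.1 is the first "naive" fit from earlier in the article and naive.fit is just a temporary name, would be:

naive.fit <- update(fit.1, data = dat[-c(416),])  # refit the naive model without the outlier
anova(naive.fit, final.fit, test = "Chisq")       # deviance comparison of the two fits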

8. Model visualization

Now that we've gotten ourselves a model, how do we actually visualize and interpret what it says about the relationships in our data? Take a look at the following walk-through, which uses the same data and model as this article: "Visualizing models 101, using R" (on medium.com).

Closing remarks

Please keep in mind that this is purely introductory and isn't an exhaustive analysis or conclusion! If we were more rigorous in our pursuit, we would've incorporated cross-validation tests and ANOVA tests on each new iteration of our model, i.e. whenever we add a new variable, interaction, or power transformation.

Feel free to message me if you have any questions, and please correct me if you feel like I missed something or did something wrong. Do keep in mind that this is supposed to serve as an introduction to modelling in R; I'm well aware that this process is highly simplified compared to more advanced methods! If you want to try your luck with this same dataset, give it a go here: https://github.com/pela15ae/statmod/blob/master/employment_data.txt
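Reading the file in isn't covered in the article; a sketch that assumes a plain whitespace-separated file with a header row (adjust the separator or encoding if needed) and that reads strings as factors, which the levels() calls below require, is:

dat <- read.table("employment_data.txt", header = TRUE, stringsAsFactors = TRUE)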

The data is Danish, so to convert the headers and categorical values to English, run this piece of code:

names(dat) <- c("employed", "foreigner", "gov.support", "age", "education", "young.children", "school.children")
levels(dat$employed)[1] <- "yes"
levels(dat$employed)[2] <- "no"
levels(dat$foreigner)[1] <- "yes"
levels(dat$foreigner)[2] <- "no"

◾️ ◾️ ◾️ Make sure to follow my profile if you enjoy this article and want to see more! ◾️ ◾️ ◾️
