Tests Based on Overall Error Contributions

In regression problems for data sets with not too many variables, the standard approach is stepwise variable selection. This can be performed in two directions. In backward selection, one starts with a model containing all possible variables and iteratively discards the variable that contributes least. In forward selection, one starts with an “empty” model, i.e., prediction with the mean of the dependent variable, and keeps adding variables until the contribution is no longer significant.

As a criterion for inclusion, values like AIC, BIC or Cp can be employed; these take into account both the improvement in the fit and a penalty for having more variables in the model. The default for the R functions add1 and drop1 is to use the AIC. Let us consider the regression form of LDA for the wine data, leaving out the Barolo class for the moment:

    > twowines.df <- data.frame(vintage = twovintages, twowines)
    > twowines.lm0 <- lm(as.integer(vintage) ~ 1, data = twowines.df)
    > add1(twowines.lm0, scope = names(twowines.df)[-1])
    Single term additions

    Model:
    as.integer(vintage) ~ 1
                      Df Sum of Sq  RSS  AIC
    <none>                         28.6 -168
    alcohol            1     11.34 17.3 -226
    malic.acid         1      8.75 19.9 -209
    ash                1      3.15 25.5 -179
    ash.alkalinity     1      1.07 27.6 -170
    magnesium          1      0.72 27.9 -168
    tot..phenols       1      7.57 21.1 -202
    flavonoids         1     15.87 12.8 -262
    non.flav..phenols  1      2.88 25.8 -178
    proanth            1      4.69 23.9 -187
    col..int.          1     18.07 10.6 -284
    col..hue           1     15.27 13.4 -256
    OD.ratio           1     17.94 10.7 -283
    proline            1      3.70 24.9 -182

The dependent variable should be numeric, so in the first argument of the lm function, the formula, we convert the vintages to class numbers first. According to this model, the first variable to enter should be col..int., which gives the lowest AIC. Since we are comparing equal-sized models, this also implies that the residual sum-of-squares of the model with only an intercept and col..int. is the smallest.
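The link between AIC and residual sum-of-squares can be checked by hand. For linear models, add1, drop1 and step obtain the AIC from extractAIC, which computes n log(RSS/n) + 2p, with n the number of samples and p the number of estimated coefficients. The sketch below uses the built-in mtcars data as a stand-in (the wine objects from the text are assumed not to be available), and aic_by_hand is a hypothetical helper written for illustration. If the two-class wine data contain n = 119 samples (an assumption, the count is not given here), the intercept-only model gives 119 log(28.6/119) + 2, which is approximately -168 and agrees with the <none> row above.

```r
# Stand-in data: the built-in mtcars set, NOT the wine data from the text.
fit0 <- lm(mpg ~ 1, data = mtcars)   # intercept-only ("empty") model

# Hypothetical helper: AIC as used by add1()/drop1()/step() for lm fits,
# n * log(RSS / n) + 2 * (number of estimated coefficients).
aic_by_hand <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  n * log(rss / n) + 2 * length(coef(fit))
}

aic_by_hand(fit0)
extractAIC(fit0)[2]                  # second element is the AIC

# Candidate models that each add one term carry the same penalty,
# so ranking them by AIC is the same as ranking them by RSS.
tab  <- add1(fit0, scope = ~ wt + hp + qsec)
cand <- tab[rownames(tab) != "<none>", ]
rownames(cand)[order(cand$AIC)]
rownames(cand)[order(cand$RSS)]
```

Both orderings in the last two lines should coincide, which is the point made in the text about equal-sized models.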

Conversely, when starting with the full model, the drop1 function would lead to elimination of the term that contributes least:

    > twowines.lmfull <- lm(as.integer(vintage) ~ ., data = twowines.df)
    > drop1(twowines.lmfull)
    Single term deletions

    Model:
    as.integer(vintage) ~ alcohol + malic.acid + ash + ash.alkalinity +
        magnesium + tot..phenols + flavonoids + non.flav..phenols +
        proanth + col..int. + col..hue + OD.ratio + proline
                      Df Sum of Sq  RSS  AIC
    <none>                         3.65 -387
    alcohol            1     0.026 3.68 -388
    malic.acid         1     0.331 3.98 -378
    ash                1     0.127 3.78 -384
    ash.alkalinity     1     0.015 3.67 -388
    magnesium          1     0.000 3.65 -389
    tot..phenols       1     0.098 3.75 -385
    flavonoids         1     0.821 4.47 -364
    non.flav..phenols  1     0.166 3.82 -383
    proanth            1     0.028 3.68 -388
    col..int.          1     0.960 4.61 -361
    col..hue           1     0.162 3.81 -383
    OD.ratio           1     0.254 3.91 -381
    proline            1     0.005 3.66 -388

In this case, removing magnesium leads to the lowest (most negative) AIC value, so this is the first variable to be removed.
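A single backward step of this kind can be written out explicitly. The following is a minimal sketch, again on the built-in mtcars data rather than the wine data, and with an arbitrarily chosen set of predictors; it picks the term whose removal gives the lowest AIC and refits without it only if that improves on the current model.

```r
# Stand-in data: built-in mtcars, NOT the wine data from the text.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

tab  <- drop1(full)                          # first row is "<none>"
cand <- tab[rownames(tab) != "<none>", ]     # one row per removable term
weakest <- rownames(cand)[which.min(cand$AIC)]

# Drop the term only if the reduced model has a lower AIC than the current one.
if (min(cand$AIC) < tab["<none>", "AIC"]) {
  reduced <- update(full, as.formula(paste(". ~ . -", weakest)))
} else {
  reduced <- full
}

weakest
formula(reduced)
```

Repeating this until no removal lowers the AIC gives plain backward selection.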

Concentrating solely on forward or backward selection will in practice often lead to sub-optimal solutions: the order in which the variables are eliminated or included is of great importance and the chance of ending up in a local optimum is very real. Therefore, forward and backward steps are often alternated. This is the procedure implemented in the step function:

    > step(twowines.lmfull, trace = 0)

    Call:
    lm(formula = as.integer(vintage) ~ malic.acid + ash + tot..phenols +
        flavonoids + non.flav..phenols + col..int. + col..hue + OD.ratio,
        data = twowines.df)

    Coefficients:
          (Intercept)         malic.acid                ash
               1.7220            -0.0571            -0.2359
         tot..phenols         flavonoids  non.flav..phenols
              -0.0833             0.2415             0.3821
            col..int.           col..hue           OD.ratio
              -0.0647             0.2236             0.1348
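The search direction of step can also be set explicitly through its direction argument. As a hedged illustration on the built-in mtcars data (a stand-in with an arbitrary choice of predictors, not the wine data), the sketch below runs a forward-only search from the empty model and an alternating search from the full model, then compares the resulting formulas and AIC values. The two searches need not end at the same model.

```r
# Stand-in data: built-in mtcars, NOT the wine data from the text.
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Forward only: start from the empty model; terms that enter are never removed.
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Alternating: start from the full model; terms may be dropped and re-added.
both <- step(full, direction = "both", trace = 0)

formula(fwd)
formula(both)
c(forward = extractAIC(fwd)[2], both = extractAIC(both)[2])
```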

From the thirteen original variables, only eight remain. 

Several other functions can be used for the same purpose: the MASS package contains functions stepAIC, addterm and dropterm, which allow more model classes to be considered. Package leaps contains function regsubsets, which is guaranteed to find the best subset, based on the branch-and-bound algorithm. Another package implementing this algorithm is subselect, with the function eleaps.

