Confidence Intervals for Individual Coefficients

Let’s use the wine data as an example, and predict class labels from the thirteen measured variables. We can assess the confidence intervals for the model quite easily, formulating the problem in a regression sense. For each of the three classes a regression vector is obtained. The coefficients for Grignolino, the third class, can be obtained as follows:

> X <- wines[wines.odd, ]
> C <- classvec2classmat(vintages[wines.odd])
> wines.lm <- lm(C ~ X)
> wines.lm.summ <- summary(wines.lm)
> wines.lm.summ[[3]]

Call:
lm(formula = Grignolino ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4657 -0.1387  0.0022  0.1326  0.4210

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.77235    0.63633    4.36  4.1e-05 ***
Xalcohol           -0.12466    0.04918   -2.53   0.0133 *
Xmalic acid        -0.06631    0.02628   -2.52   0.0138 *
Xash               -0.56351    0.12824   -4.39  3.6e-05 ***
Xash alkalinity     0.03227    0.00975    3.31   0.0014 **
Xmagnesium          0.00118    0.00173    0.68   0.4992
Xtot. phenols      -0.00434    0.07787   -0.06   0.9558
Xflavonoids         0.12497    0.05547    2.25   0.0272 *
Xnon-flav. phenols  0.36091    0.23337    1.55   0.1262
Xproanth            0.09320    0.05808    1.60   0.1128
Xcol. int.         -0.04748    0.01661   -2.86   0.0055 **
Xcol. hue           0.18276    0.16723    1.09   0.2779
XOD ratio           0.00589    0.06306    0.09   0.9258
Xproline           -0.00064    0.00012   -5.33  1.0e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.209 on 75 degrees of freedom
Multiple R-squared: 0.847, Adjusted R-squared: 0.82
F-statistic: 31.9 on 13 and 75 DF, p-value: <2e-16

The column with the stars in the output allows us to easily spot coefficients that are significant at a certain level. To get a summary of all variables that have p values smaller than, say, 0.1 for each of the three classes, we can issue:

> sapply(wines.lm.summ,
+        function(x) which(x$coefficients[, 4] < .1))
$`Response Barbera`
       Xmalic acid               Xash        Xflavonoids
                 3                  4                  8
Xnon-flav. phenols         Xcol. int.          XOD ratio
                 9                 11                 13

$`Response Barolo`
    (Intercept)        Xalcohol            Xash Xash alkalinity
              1               2               4               5
    Xflavonoids        Xproanth       XOD ratio        Xproline
              8              10              13              14

$`Response Grignolino`
    (Intercept)        Xalcohol     Xmalic acid            Xash
              1               2               3               4
Xash alkalinity     Xflavonoids      Xcol. int.        Xproline
              5               8              11              14

Variables ash and flavonoids occur as significant for all three cultivars; six others (not counting the intercept, of course) for two out of three cultivars. 
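Flagging a coefficient at p < 0.1 is equivalent to checking whether its 90% confidence interval excludes zero, so the same screening can be written with explicit intervals. The following is only a rough, self-contained sketch: it assumes the built-in iris data as a stand-in for the wine data (so that it runs without extra packages), builds the indicator matrix with model.matrix() rather than classvec2classmat(), and uses the illustrative names iris_X, iris_C, excludes_zero and n_classes, none of which appear in the text.

```r
# Sketch on the built-in iris data (assumption: stands in for the wine data).
iris_X <- as.matrix(iris[, 1:4])                    # four measured variables
iris_C <- model.matrix(~ Species - 1, data = iris)  # 0/1 class indicator matrix
colnames(iris_C) <- levels(iris$Species)

# For every class: regress the indicator column on the variables and flag
# coefficients whose 90% confidence interval excludes zero.
excludes_zero <- sapply(colnames(iris_C), function(cl) {
  fit <- lm(iris_C[, cl] ~ iris_X)
  ci <- confint(fit, level = 0.90)
  ci[, 1] > 0 | ci[, 2] < 0
})

# Number of classes in which each variable is flagged (intercept row dropped)
n_classes <- rowSums(excludes_zero[-1, , drop = FALSE])

print(excludes_zero)
print(n_classes)
```

Counting the flags per variable, as in n_classes, gives the kind of tally quoted above for ash and flavonoids.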

In cases where no confidence intervals can be calculated analytically, such as in PCR or PLS, we can, e.g., use bootstrap confidence intervals. For the gasoline data, modelled with PCR using four latent variables, we have calculated bootstrap confidence intervals in Sect. 9.6.2. The percentile intervals, shown in Fig. 9.6, already indicated that most regression coefficients are significantly different from zero. How does that look for the (better) BCα confidence intervals? Let’s find out:

> gas.BCACI <-
+   t(sapply(1:ncol(gasoline$NIR),
+            function(i, x)
+              boot.ci(x, index = i, type = "bca")$bca[, 4:5],
+            gas.pcr.bootCI))

A plot of the regression coefficients with these 95% confidence intervals (Fig. 10.1) immediately shows which variables are significantly different from zero:

> BCAcoef <- gas.pcr.bootCI$t0
> signif <- gas.BCACI[, 1] > 0 | gas.BCACI[, 2] < 0
> BCAcoef[!signif] <- NA
> matplot(wavelengths, gas.BCACI, type = "n",
+         xlab = "Wavelength (nm)",
+         ylab = "Regression coefficient",
+         main = "Gasoline data: PCR (4 PCs)")
> abline(h = 0, col = "gray")
> polygon(c(wavelengths, rev(wavelengths)),
+         c(gas.BCACI[, 1], rev(gas.BCACI[, 2])),
+         col = "pink", border = NA)
> lines(wavelengths, BCAcoef, lwd = 2)

Fig. 10.1 Significance of regression coefficients for PCR using four PCs on the gasoline data; coefficients whose 95% confidence interval (calculated with the BCα bootstrap and indicated in pink) includes zero are not shown

Re-fitting the model after keeping only the 325 wavelengths whose confidence interval excludes zero leads to

> smallmod <- pcr(octane ~ NIR[, signif], data = gasoline,
+                 ncomp = 4, validation = "LOO")
> RMSEP(smallmod, intercept = FALSE, estimate = "CV")
1 comps  2 comps  3 comps  4 comps
 1.4342   1.4720   0.2756   0.2497

The error estimate is lower even than the global minimum (at seven PCs) obtained with the full data set containing 401 wavelengths. Here, one could also consider going for the three-component model, which sacrifices very little in terms of RMSEP (it is still better than the seven-component model seen earlier) and has, well, one component fewer. After variable selection, refitting often leads to more parsimonious models in terms of the number of components needed. Even if the predictions are not (much) better, the improved interpretability is often seen as reason enough to consider variable selection.
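The selection procedure used here (bootstrap the coefficients, compute BCα intervals, keep the variables whose interval excludes zero) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the calculation of Sect. 9.6.2: it uses simulated data and ordinary least squares instead of PCR on the gasoline data, and the names dat, coef_stat, b and bca are hypothetical. Only boot() and boot.ci() from the boot package are used, the latter in the same way as in the code above.

```r
library(boot)

set.seed(1)
n <- 60
# Simulated data (assumption): x1 has a real effect on y, x2 has none
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

# Statistic for boot(): regression coefficients of a resampled data set
coef_stat <- function(d, idx) coef(lm(y ~ x1 + x2, data = d[idx, ]))

# BCa intervals need more bootstrap replicates than observations
b <- boot(dat, coef_stat, R = 2000)

# One 95% BCa interval per coefficient; columns 4 and 5 of the $bca
# element hold the lower and upper limits
bca <- t(sapply(seq_along(b$t0),
                function(i) boot.ci(b, index = i, type = "bca")$bca[, 4:5]))
dimnames(bca) <- list(names(b$t0), c("lower", "upper"))

# Keep the coefficients whose interval excludes zero
excl_zero <- bca[, "lower"] > 0 | bca[, "upper"] < 0
print(bca)
print(excl_zero)
```

In a setting like the gasoline data, the logical vector excl_zero would play the role of signif when subsetting the columns of the predictor matrix.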

Although this kind of procedure has been proposed in the literature several times, e.g., in Wehrens and van der Linden (1997), it is essentially incorrect. For the spectrum-like data, the correlations between the wavelengths are so large that the confidence intervals of individual coefficients are not particularly useful to determine which variables are significant: both errors of the first (false positives) and second kind (false negatives) are possible. Taking into account correlations and calculating so-called Scheffé intervals (Efron and Hastie 2016) often leads to intervals so wide that they have no practical relevance. The confidence intervals described above, for individual coefficients, at least give some idea of where important information is located.
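The effect of strong correlation on individual intervals can be made concrete with a small simulation. The sketch below is an illustration under assumptions of our own (two simulated predictors, ordinary least squares, hypothetical names such as x2_corr and fit_corr), not an analysis of the gasoline data. It compares the standard errors of the coefficients when the second predictor is nearly a copy of the first with the case where the two are unrelated.

```r
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2_corr  <- x1 + 0.05 * rnorm(n)  # nearly identical to x1
x2_indep <- rnorm(n)              # unrelated to x1
noise <- rnorm(n)

# Both predictors have a true coefficient of 1 in either setting
y_corr  <- x1 + x2_corr  + noise
y_indep <- x1 + x2_indep + noise

fit_corr  <- lm(y_corr ~ x1 + x2_corr)
fit_indep <- lm(y_indep ~ x1 + x2_indep)

# Standard errors of the two slopes (intercept row dropped)
se_corr  <- summary(fit_corr)$coefficients[-1, "Std. Error"]
se_indep <- summary(fit_indep)$coefficients[-1, "Std. Error"]

# Overall F-test of the correlated model: are the predictors jointly relevant?
f <- summary(fit_corr)$fstatistic
p_overall <- pf(f[1], f[2], f[3], lower.tail = FALSE)

print(confint(fit_corr))   # individual intervals, correlated predictors
print(confint(fit_indep))  # individual intervals, unrelated predictors
print(p_overall)
```

With nearly collinear predictors the individual intervals become much wider, so each one can include zero even though the overall F-test leaves no doubt that the pair is informative; this is the false-negative situation mentioned above.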
