ISLR :: 5.3 Lab: CV & BS :: (4) Bootstrap
5.3.1 Validation Set Approach
5.3.2 Leave-One-Out CV
5.3.3 k-Fold CV
5.3.4 Bootstrap
Estimating the accuracy of a linear regression model
5.3 Lab: Cross-Validation and the Bootstrap
5.3.4 The Bootstrap
Estimating the Accuracy of a Linear Regression Model
While fitting a linear model, we assess the variability of the regression coefficient estimates.
Using the Auto data, let us compare the variability estimates for the intercept and slope of the linear regression of Y (mpg, fuel economy) on X (horsepower)
obtained from the lm() function versus the bootstrap.
Data
### DATA Loading
Auto <- read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv",
                   header=T, sep=",", quote="\"",
                   na.strings=c("NA","-","?"))
Auto.omit.na <- Auto[complete.cases(Auto[,4]),]
#str(Auto.omit.na); head(Auto.omit.na)
Df <- Auto.omit.na
The lm() function
coef(lm(mpg~horsepower, data=Df))
lm.fit.linear <- lm(mpg~horsepower, data=Df)
summary(lm.fit.linear)$coefficients
              Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610  0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447  0.006445501 -24.48914  7.031989e-81
The bootstrap
library(boot)   # for boot()

alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower, data=data, subset=index)) )  # coefficient estimates for the resampled rows
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392, size=392, replace=T))
boot(Df, alpha.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Df, statistic = alpha.fn, R = 1000)

Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0296667441 0.860440524
t2* -0.1578447 -0.0003113047 0.007411218
lm():       SE(β̂0) = 0.7174,  SE(β̂1) = 0.0064
bootstrap:  SE(β̂0) = 0.8604,  SE(β̂1) = 0.0074
The two sets of estimates do not agree exactly.
The SE formulas used by lm() rely on certain model assumptions (and in fact the relationship in these data is closer to nonlinear than linear).
The bootstrap requires no such assumptions, so in this example its estimates may actually be the more accurate ones.
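To see why the bootstrap needs no formula at all, note that the "std. error" column reported by boot() is simply the standard deviation of each coefficient across resampled fits. Below is a minimal sketch of that computation done by hand; it uses the built-in mtcars data (mpg ~ hp) rather than the Auto data purely so the example is self-contained and does not depend on the download above.

```r
# Minimal sketch: bootstrap SE = sd of each coefficient over resampled fits.
# Uses built-in mtcars (mpg ~ hp) for self-containment; Auto works the same way.
set.seed(1)
n <- nrow(mtcars)
boot.coefs <- replicate(1000, {
  idx <- sample(n, n, replace = TRUE)              # resample rows with replacement
  coef(lm(mpg ~ hp, data = mtcars, subset = idx))  # refit on the bootstrap sample
})
boot.se <- apply(boot.coefs, 1, sd)                # sd across the 1000 replicates
boot.se                                            # bootstrap SEs of intercept and slope
```

Compare boot.se against summary(lm(mpg ~ hp, data = mtcars))$coefficients[, "Std. Error"]: the two will be close but not identical, exactly as with the Auto data above.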
library(ggplot2)

ggplot(Df, aes(x=horsepower, y=mpg)) + theme_bw() +
  geom_point(shape=1) +
  geom_smooth(method=lm, se=F, color="steelblue") +
  geom_smooth(color="coral", se=F)
Full script
#############################################################################
### DATA Loading
Auto <- read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv",
                   header=T, sep=",", quote="\"",
                   na.strings=c("NA","-","?"))
Auto.omit.na <- Auto[complete.cases(Auto[,4]),]
#str(Auto.omit.na); head(Auto.omit.na)
Df <- Auto.omit.na
#############################################################################
library(boot)   # for boot()

alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower, data=data, subset=index)) )  # coefficient estimates for the resampled rows
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392, size=392, replace=T))
boot(Df, alpha.fn, 1000)
#############################################################################
coef(lm(mpg~horsepower, data=Df))
lm.fit.linear <- lm(mpg~horsepower, data=Df)
summary(lm.fit.linear)$coefficients
#############################################################################
library(ggplot2)

ggplot(Df, aes(x=horsepower, y=mpg)) + theme_bw() +
  geom_point(shape=1) +
  geom_smooth(method=lm, se=F, color="steelblue") +
  geom_smooth(color="coral", se=F)
Adding a quadratic term makes the model more accurate, and the two sets of SE estimates agree more closely.
#############################################################################
library(boot)

alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower+I(horsepower^2), data=data, subset=index)) )  # coefficient estimates
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392, size=392, replace=T))
boot(Df, alpha.fn, 1000)
#############################################################################
coef(lm(mpg~horsepower+I(horsepower^2), data=Df))
lm.fit.linear <- lm(mpg~horsepower+I(horsepower^2), data=Df)
summary(lm.fit.linear)$coefficients
### Alternative data loading with data.table
library(data.table)

auto <- fread("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings=c("NA","-","?"))
dd <- auto[!is.na(horsepower),]