ISLR :: 5.3 Lab: CV & BS :: (4) Bootstrap
5.3.1 Validation Set Approach
5.3.2 Leave-One-Out CV
5.3.3 k-Fold CV
5.3.4 Bootstrap
Estimating the accuracy of a linear regression model
5.3 Lab: Cross-Validation and the Bootstrap
5.3.4 The Bootstrap
Estimating the Accuracy of a Linear Regression Model
When fitting a linear model, we assess the variability of the regression coefficients.
Using the Auto data, let's compare the results from the lm() function and from the bootstrap
for assessing the variability of the intercept and slope in the linear regression of
Y (mpg: fuel efficiency) on X (horsepower).
Data
### DATA Loading
Auto <- read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv",
                   header=T, sep=",", quote="\"", na.strings=c("NA","-","?"))
Auto.omit.na <- Auto[complete.cases(Auto[,4]),]   # column 4 is horsepower; drop rows where it is NA
#str(Auto.omit.na); head(Auto.omit.na)
Df <- Auto.omit.na
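A quick sanity check on the cleaned data (a minimal sketch; 392 is the number of complete observations used throughout this lab):

dim(Df)                    # expect 392 rows
sum(is.na(Df$horsepower))  # expect 0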
The lm() function
coef(lm(mpg~horsepower, data=Df))
lm.fit.linear <- lm(mpg~horsepower, data = Df)
summary(lm.fit.linear)$coefficients
              Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
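The standard errors reported above can also be pulled out programmatically, which is handy for the comparison with the bootstrap below (a small sketch using the lm.fit.linear object just created):

summary(lm.fit.linear)$coefficients[, "Std. Error"]   # SE of the intercept and slope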
The bootstrap
library(boot)   # for boot()

alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower, data=data, subset=index)) )   # fitted coefficients on the resampled rows
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392,size=392,replace=T,prob=NULL))
boot(Df, alpha.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Df, statistic = alpha.fn, R = 1000)
Bootstrap Statistics :
       original        bias     std. error
t1* 39.9358610  0.0296667441 0.860440524
t2* -0.1578447 -0.0003113047 0.007411218
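The "std. error" column is simply the standard deviation of the R = 1000 coefficient estimates stored in the boot object. Saving the result makes this explicit (a sketch; boot.out is our name, and the values will be close to, though not identical to, those above since the RNG state differs):

boot.out <- boot(Df, alpha.fn, R = 1000)
apply(boot.out$t, 2, sd)   # bootstrap SEs for the intercept and slope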
From lm():          SE(β̂0) = 0.7174, SE(β̂1) = 0.0064
From the bootstrap: SE(β̂0) = 0.8604, SE(β̂1) = 0.0074
The two sets of estimates do not match exactly.
The SE formulas behind lm() rely on certain model assumptions (and in fact these data look closer to nonlinear than linear).
The bootstrap requires no such assumptions, so in this example it may well be the more accurate of the two.
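To see that no model assumptions enter the bootstrap SE, here is a minimal hand-rolled version that resamples rows, refits, and takes the standard deviation of the estimates (a sketch; B and boot.coefs are names we introduce):

set.seed(1)
B <- 1000
n <- nrow(Df)
boot.coefs <- t(replicate(B, {
  idx <- sample(n, n, replace = TRUE)            # resample rows with replacement
  coef(lm(mpg ~ horsepower, data = Df[idx, ]))   # refit on the bootstrap sample
}))
apply(boot.coefs, 2, sd)   # SEs for the intercept and slope, close to boot()'s output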
library(ggplot2)

ggplot(Df, aes(x=horsepower, y=mpg)) + theme_bw() +
  geom_point(shape=1) +
  geom_smooth(method=lm, se=F, color="steelblue") +
  geom_smooth(color="coral", se=F)
Full code
#############################################################################
library(boot)      # for boot()
library(ggplot2)   # for ggplot()

### DATA Loading
Auto <- read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv",
header=T, sep=",", quote="\"", na.strings=c("NA","-","?"))
Auto.omit.na <- Auto[complete.cases(Auto[,4]),]   # column 4 is horsepower; drop rows where it is NA
#str(Auto.omit.na); head(Auto.omit.na)
Df <- Auto.omit.na
#############################################################################
alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower, data=data, subset=index)) )   # fitted coefficients on the resampled rows
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392,size=392,replace=T,prob=NULL))
boot(Df, alpha.fn, 1000)
#############################################################################
coef(lm(mpg~horsepower, data=Df))
lm.fit.linear <- lm(mpg~horsepower, data = Df)
summary(lm.fit.linear)$coefficients
#############################################################################
ggplot(Df, aes(x=horsepower, y=mpg)) + theme_bw() +
  geom_point(shape=1) +
  geom_smooth(method=lm, se=F, color="steelblue") +
  geom_smooth(color="coral", se=F)
Adding a quadratic term makes the fit more accurate (see the SE comparison sketch after the code below).
#############################################################################
alpha.fn <- function(data, index){
  return( coef(lm(mpg~horsepower+I(horsepower^2), data=data, subset=index)) )   # fitted coefficients on the resampled rows
}
alpha.fn(Df, 1:392)
set.seed(1)
alpha.fn(Df, sample(x=392,size=392,replace=T,prob=NULL))
boot(Df, alpha.fn, 1000)
#############################################################################
coef(lm(mpg~horsepower+I(horsepower^2), data=Df))
lm.fit.quad <- lm(mpg~horsepower+I(horsepower^2), data = Df)
summary(lm.fit.quad)$coefficients
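As in the linear case, the two sets of standard errors can be compared side by side; with the quadratic term included they should agree more closely (a sketch; boot.out.quad is our name):

set.seed(1)
boot.out.quad <- boot(Df, alpha.fn, R = 1000)
cbind(lm.SE   = summary(lm.fit.quad)$coefficients[, "Std. Error"],
      boot.SE = apply(boot.out.quad$t, 2, sd))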
#############################################################################
### Alternative: load the data with data.table
library(data.table)
auto <- fread("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings=c("NA","-","?"))
dd <- auto[!is.na(horsepower),]