scale(), Normalization (Standardization)

Published by onesixx on

Types of Scaling

https://en.wikipedia.org/wiki/Feature_scaling
http://www.projectl33t.xyz/archives/8359
http://pds9.egloos.com/pds/200807/14/28/Chapter4.pdf

Standardization

Subtract each feature's mean and divide by its standard deviation. This standardizes the feature to mean 0 and variance 1 (standard scores).
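Written as a formula, the step reads as below. The symbols are chosen here for illustration: x is the original value, x' the standardized value, \bar{x} the feature's mean, and \sigma its standard deviation.

```latex
x' = \frac{x - \bar{x}}{\sigma}
```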

Example ML methods: used when feeding multidimensional data to support vector machines, logistic regression, neural networks, and similar models.
Example data: audio signals, pixel values of images, and so on.

Re-scaling

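Transform the range of the observations to 0 to 1, or to -1 to 1.

x' = (x - min(x)) / (max(x) - min(x))

Here, x is the original value before the transformation and x' is the normalized value after it.

Example: when the students in a class weigh between 60 kg and 100 kg, first subtract 60 from each weight, then divide by the range, 40.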
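The weight example above can be sketched in R as follows. The vector `weight` and the helper `rescale01` are hypothetical names introduced only for illustration; the sketch uses nothing beyond base R's `range()`.

```r
# Hypothetical sample: weights (kg) of five students, between 60 and 100
weight <- c(60, 70, 80, 90, 100)

# Min-max rescaling to [0, 1]: subtract the minimum, divide by the range
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

w01 <- rescale01(weight)   # 0.00 0.25 0.50 0.75 1.00
w11 <- 2 * w01 - 1         # the same data mapped to [-1, 1]
```

Note that a constant vector has range 0, so this sketch would return NaN for it (0/0), just as scale() does for zero-variance columns.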

Scaling to unit length

Divide the feature vector by its Euclidean length (L2 norm) so that the vector has length 1.
As a variant, the L1 norm of the feature vector (also known as Manhattan distance, city-block length, or taxicab geometry) is sometimes used instead.
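A minimal sketch of both variants in base R. The vector `v` and the matrix `m` are made-up inputs for illustration, and treating each row of `m` as one feature vector is an assumption of this sketch.

```r
# Hypothetical feature vector
v <- c(3, 4)

l2 <- sqrt(sum(v^2))   # Euclidean (L2) norm: 5
l1 <- sum(abs(v))      # Manhattan (L1) norm: 7

v_l2 <- v / l2         # 0.6 0.8, Euclidean length 1
v_l1 <- v / l1         # absolute values sum to 1

# Row-wise version for a matrix whose rows are feature vectors.
# The divisor has one entry per row, so recycling divides row i by its own norm.
m <- rbind(c(3, 4), c(1, 0), c(5, 12))
m_unit <- m / sqrt(rowSums(m^2))
rowSums(m_unit^2)      # every row now has squared length 1
```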

https://cos.name/cn/topic/101615/

scale()

The defaults are center = TRUE, scale = TRUE.

Center (mean 0)    Scale (sd 1)
T                  T
F                  T
T                  F
#scale
x <- as.matrix(c(1:10))
mean(x); sd(x)

# center=T, scale=T
scale(x, center=T, scale=T)
scale(x, center=T, scale=apply(x,2,sd))
apply(x, 2, function(x){(x-mean(x))/sd(x)})

#  center=T, scale=F
scale(x, center=T, scale=F)
apply(x, 2, function(x){(x-mean(x))})

#  center=F, scale=T
scale(x, center=F, scale=T)
apply(x, 2, function(x){x/sqrt(sum(x^2)/(length(x)-1))})


#  center=F, scale=F  (x is returned unchanged)
scale(x, center=F, scale=F)
apply(x, 2, function(x){x})


# Reference: typing the function name prints the source of the default method
scale.default
(centered.x <- scale(x, scale=TRUE))
cov(centered.scaled.x <- scale(x)) # all 1
apply(centered.scaled.x, 2, mean)
apply(centered.scaled.x, 2, sd)


(centered.x <- scale(x, scale = FALSE))
cov(centered.scaled.x <- scale(x)) # all 1
colMeans(centered.scaled.x)
apply(x, 2, mean)
apply(centered.scaled.x, 2, mean)
apply(centered.scaled.x, 2, sd)


x <- matrix(1:10, ncol = 2)
x1 <- x[,1]
x2 <- x[,2]

(centered01 <- scale(x))
#(centered01 <- scale(x, center=TRUE,  scale=TRUE ))
(centered02 <- scale(x, center=TRUE,  scale=FALSE))
(centered03 <- scale(x, center=FALSE, scale=TRUE ))

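# Min-max rescaling to [0, 1]; as.data.frame() names the columns V1, V2, so transform() overwrites them
(centered04 <- transform(as.data.frame(x), V1=(x1-min(x1))/(max(x1)-min(x1)), V2=(x2-min(x2))/(max(x2)-min(x2))))
# Standardization by hand (same values as centered01)
(centered05 <- transform(as.data.frame(x), V1=(x1-mean(x1))/sd(x1), V2=(x2-mean(x2))/sd(x2)))
# Overwrites centered05: V1 is only divided by its sd, V2 is standardized
(centered05 <- transform(as.data.frame(x), V1=x1/sd(x1), V2=(x2-mean(x2))/sd(x2)))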

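# biocLite() has been retired; Bioconductor packages are now installed through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("bioDist")
library(bioDist)
h <- matrix(rnorm(200), nrow = 5)
euc(h)   # pairwise Euclidean distances between the 5 rows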

colMeans(centered01)
colMeans(centered02)
colMeans(centered03)
colMeans(centered04)
colMeans(centered05)

apply(centered01, 2, sd)
apply(centered02, 2, sd)
apply(centered03, 2, sd)
apply(centered04, 2, sd)
apply(centered05, 2, sd)

Why does scale return NaN for zero variance columns?
> x1 <- c(1,1,1,1,1)
> scale(x1)
     [,1]
[1,]  NaN
[2,]  NaN
[3,]  NaN
[4,]  NaN
[5,]  NaN
> scale(x1, 
+       center = TRUE, 
+       scale = (var(x1)!=0))
     [,1]
[1,]    0
[2,]    0
[3,]    0
[4,]    0
[5,]    0
attr(,"scaled:center")
[1] 1
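The NaN appears because a constant column has standard deviation 0, so (x - mean) / sd becomes 0 / 0. In the second call, `var(x1) != 0` evaluates to FALSE, so the column is only centered and never divided. One way to apply the same idea to a whole matrix is sketched below. `safe_scale` is a hypothetical helper, not part of base R, and replacing a zero sd with 1 is just one possible convention.

```r
# Hypothetical helper: standardize every column, but only center the constant ones
safe_scale <- function(x) {
  x <- as.matrix(x)
  sds <- apply(x, 2, sd)
  sds[sds == 0] <- 1   # dividing by 1 leaves the centered zeros untouched
  scale(x, center = TRUE, scale = sds)
}

m <- cbind(a = c(1, 1, 1, 1, 1), b = c(1, 2, 3, 4, 5))
scale(m)        # column a is all NaN
safe_scale(m)   # column a is all 0, column b is standardized
```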
Categories: R Reshaping
