optimizer, loss, metrics

Published by onesixx

What are the loss function and the evaluation metric, and how do they differ?
https://gombru.github.io/2018/05/23/cross_entropy_loss/
https://teddylee777.github.io/tensorflow/keras-metic-%EC%BB%A4%EC%8A%A4%ED%85%80

๋ชจ๋ธ์„ ์ปดํŒŒ์ผ(ํ•™์Šต ๊ณผ์ • ์„ค์ •)ํ•  ๋•Œ, “optimizer”, “loss”์™€ “metrics”์„ ์„ ํƒํ•œ๋‹ค.

# R (keras)
network %>% compile(
    loss = "categorical_crossentropy", optimizer = "rmsprop", metrics = c("accuracy")
)

# Python (tf.keras)
model.compile(loss='mse', optimizer='adam',    metrics=['mse', 'mae', 'mape'])
model.compile(loss='mse', optimizer='rmsprop', metrics=['mse', 'mae', 'mape'])
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer='adam', metrics=['accuracy'])
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=['mae'])

loss function (objective function)

๋„คํŠธ์›Œํฌ์˜ parameter๋“ค์„ ฮธ๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, ๋„คํŠธ์›Œํฌ์—์„œ ๋‚ด๋†“๋Š” ๊ฒฐ๊ณผ๊ฐ’๊ณผ ์‹ค์ œ ๊ฒฐ๊ณผ๊ฐ’ ์‚ฌ์ด์˜ ์ฐจ์ด๋ฅผ ์ •์˜ํ•˜๋Š” ํ•จ์ˆ˜ Loss function J(ฮธ)์˜ ๊ฐ’์„ ์ตœ์†Œํ™”

๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ฌ๋•Œ, Loss ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œ๋กœ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ฐ€์ค‘์น˜๋“ค์„ ์ฐพ๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ์‚ผ์Šต๋‹ˆ๋‹ค.

Tied to the training set; used during training.

Choose a loss function appropriate to the model being trained.

For example, when training a classifier over 10 classes, sparse categorical crossentropy can be used as the loss function (see the sketch below).

MAE (mean absolute error), hinge, categorical crossentropy, sparse categorical crossentropy, binary crossentropy
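
For instance, a minimal tf.keras sketch (the model architecture here is a made-up example, not from the original post):

import tensorflow as tf

# Hypothetical 10-class classifier trained with integer labels (0..9).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
# Integer labels -> sparse categorical crossentropy;
# one-hot labels would use categorical crossentropy instead.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])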

๋ชจ๋ธ์€ ์‹ค์ œ ๋ผ๋ฒจ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๊ฐ’์ด ์˜ˆ์ธก๋˜๋„๋ก ํ›ˆ๋ จ๋˜์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋•Œ ๊ทธ ๊ฐ€๊นŒ์šด ์ •๋„๋ฅผ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ์ด ์†์‹ค ํ•จ์ˆ˜(loss function)์ž…๋‹ˆ๋‹ค.

Regression

MSE (mean squared error), RMSE, MAE (mean absolute error)

MAE is a loss function commonly used when training regression models.
MAE is the mean of the absolute (rather than squared) errors.
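
As a quick numeric check, a minimal numpy sketch with made-up values:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean of absolute errors -> 0.5
mse = np.mean((y_true - y_pred) ** 2)    # mean of squared errors  -> 0.375
print(mae, mse)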

Metrics
MAE

Classification

binary crossentropy

Used when training a binary classifier (two classes: true/false, positive/negative).

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, t_i \log(y_i) + (1-t_i)\log(1-y_i) \,\right] \qquad \text{(binary crossentropy)}$$

A loss function should have the property that it equals 0 when the prediction and the true value coincide.
Let us check that the loss is 0 when the prediction and the true value are both 1 ($y_i = t_i = 1$).
Note that a binary classifier outputs a probability between 0 and 1: a value near 1 means one class (say, the True class) is likely, and a value near 0 means the other class (say, the False class) is likely. To keep things simple, assume there is only a single sample.

$$L = -[\,1\cdot\log 1 + (1-1)\log(1-1)\,] = 0$$

์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ๊ฐ’์ด ๊ฐ™์€ ๊ฒฝ์šฐ์—๋Š” ๊ธฐ๋Œ€ํ–ˆ๋˜ ๋Œ€๋กœ ์†์‹คํ•จ์ˆ˜๊ฐ’์€ 0์ด ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” ์˜ˆ์ธก๊ฐ’์€ 0, ์‹ค์ œ๊ฐ’์€ 1์ธ ์ƒํ™ฉ์—๋Š”(yi=0,ti=1yi=0,ti=1) ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

$$L = -[\,1\cdot\log 0 + (1-1)\log(1-0)\,] = \infty$$

The loss goes to positive infinity. In practice a predicted probability is never exactly 0, so think of this as producing a very large number. It is this property that makes binary crossentropy a suitable loss function for binary classification.
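
The two cases above can be verified with a minimal numpy sketch; the clipping epsilon below is an assumption to keep log(0) finite, similar in spirit to what Keras does internally:

import numpy as np

def binary_crossentropy(t, y, eps=1e-7):
    y = np.clip(y, eps, 1 - eps)   # keep log() finite; eps value is an assumption
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

print(binary_crossentropy(np.array([1.0]), np.array([1.0])))  # ~0 (perfect prediction)
print(binary_crossentropy(np.array([1.0]), np.array([0.0])))  # ~16.1 (large; capped only by eps)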

categorical crossentropy

Categorical crossentropy is used when there are three or more classes to distinguish, that is, for multiclass classification.
It is used when the labels are provided in one-hot form, such as [0,0,1,0,0], [1,0,0,0,0], [0,0,0,1,0]. The formula is as follows.

$$L = -\frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{C} t_{ij}\,\log(y_{ij}) \qquad \text{(categorical crossentropy)}$$

Here C is the number of classes.

์ด๋ฒˆ์—๋„ ์ƒ˜ํ”Œ์ด ํ•˜๋‚˜๋งŒ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ณ , ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์ด ์™„์ „ํžˆ ์ผ์น˜ํ•˜๋Š” ๊ฒฝ์šฐ์˜ ์†์‹คํ•จ์ˆ˜๊ฐ’์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 0์ด ๋‚˜์™€์•ผํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์ด ๋ชจ๋‘ [1 0 0 0 0]์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. 

$$L = -(1\cdot\log 1 + 0\cdot\log 0 + 0\cdot\log 0 + 0\cdot\log 0 + 0\cdot\log 0) = 0$$

๊ณ„์‚ฐํ–ˆ๋”๋‹ˆ 0์ด ๋‚˜์™”์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” ์‹ค์ œ๊ฐ’์€ [1 0 0 0 0], ์˜ˆ์ธก๊ฐ’์€ [0 1 0 0 0]์ธ ๊ฒฝ์šฐ์˜ ์†์‹คํ•จ์ˆ˜๊ฐ’์„ ๊ตฌํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

$$L = -(1\cdot\log 0 + 0\cdot\log 1 + 0\cdot\log 0 + 0\cdot\log 0 + 0\cdot\log 0) = \infty$$

๊ณ„์‚ฐํ–ˆ๋”๋‹ˆ ์–‘์˜ ๋ฌดํ•œ๋Œ€๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ์˜ˆ์ธก๊ฐ’์€ [0.02 0.94 0.02 0.01 0.01]์™€ ๊ฐ™์€ ์‹์œผ๋กœ ๋‚˜์˜ค๊ธฐ ๋•Œ๋ฌธ์— ์–‘์˜ ๋ฌดํ•œ๋Œ€๊ฐ€ ๋‚˜์˜ฌ๋ฆฌ๋Š” ์—†์ง€๋งŒ, ํฐ ๊ฐ’์ด ๋‚˜์˜ค๋Š” ๊ฒƒ๋งŒ์€ ๋ถ„๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— categorical crossentropy๋Š” ๋ฉ€ํ‹ฐํด๋ž˜์Šค ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ์†์‹คํ•จ์ˆ˜๋กœ ์‚ฌ์šฉ๋˜๊ธฐ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. 

sparse categorical crossentropy

Sparse categorical crossentropy is likewise a loss function for multiclass classification; the only difference in the name is the word "sparse". So when is sparse categorical crossentropy used?
Precisely when the labels are provided as integers, such as 0, 1, 2, 3, 4.

optimizer(์ตœ์ ํ™” ๊ณ„ํš)

http://shuuki4.github.io/deep%20learning/2016/05/20/Gradient-Descent-Algorithm-Overview.html

Implements specific variants of SGD (stochastic gradient descent).

The method for updating the network according to the loss function.

Neural network์˜ weight์„ ์กฐ์ ˆํ•˜๋Š” ๊ณผ์ •์—๋Š” ๋ณดํ†ต โ€˜Gradient Descentโ€™ ๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉ
์ด๋Š” ๋„คํŠธ์›Œํฌ์˜ parameter๋“ค์„ ฮธ๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, ๋„คํŠธ์›Œํฌ์—์„œ ๋‚ด๋†“๋Š” ๊ฒฐ๊ณผ๊ฐ’๊ณผ ์‹ค์ œ ๊ฒฐ๊ณผ๊ฐ’ ์‚ฌ์ด์˜ ์ฐจ์ด๋ฅผ ์ •์˜ํ•˜๋Š” ํ•จ์ˆ˜ Loss function J(ฮธ)์˜ ๊ฐ’์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ธฐ์šธ๊ธฐ(gradient) โˆ‡J(ฮธ)๋ฅผ ์ด์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. Gradient Descent์—์„œ๋Š” ฮธ ์— ๋Œ€ํ•ด gradient์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ ์ผ์ • ํฌ๊ธฐ๋งŒํผ ์ด๋™ํ•ด๋‚ด๋Š” ๊ฒƒ์„ ๋ฐ˜๋ณตํ•˜์—ฌ Loss function J(ฮธ) ์˜ ๊ฐ’์„ ์ตœ์†Œํ™”ํ•˜๋Š” ฮธ์˜ ๊ฐ’์„ ์ฐพ๋Š”๋‹ค. ํ•œ iteration์—์„œ์˜ ๋ณ€ํ™” ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.ฮธ=ฮธโˆ’ฮทโˆ‡J(ฮธ)

Here η is a predetermined step size, usually chosen somewhere around 0.01 to 0.001.
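
A minimal sketch of this update on a made-up one-dimensional example, J(θ) = θ² with gradient 2θ:

# Gradient descent on J(theta) = theta^2; the analytic gradient is 2*theta.
theta = 5.0
eta = 0.1            # step size
for _ in range(100):
    grad = 2 * theta
    theta = theta - eta * grad   # theta = theta - eta * grad J(theta)
print(theta)         # essentially 0, the minimizer of J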

Computing the loss over the entire training set at each step is called batch gradient descent. Done this way, every single step requires evaluating the loss on all of the data, which demands far too much computation. To avoid this, one normally uses stochastic gradient descent (SGD): the loss is computed not over the full dataset (batch) but over a small subset of the data (a mini-batch). This can be somewhat less accurate per step than batch gradient descent, but each step is much faster to compute, so more steps fit in the same amount of time, and over many iterations it usually converges to a result similar to the full-batch one. Moreover, SGD has some chance of escaping local minima that batch gradient descent would fall into and of converging somewhere better.

Neural networks are usually trained with this SGD. Plain SGD, however, has its limits. To start from the conclusion, look at the following figures.

[Figure: Gradient Descent Optimization Algorithms at Long Valley]
[Figure: Gradient Descent Optimization Algorithms at Beale's Function]
[Figure: Gradient Descent Optimization Algorithms at Saddle Point]

์œ„์˜ ๊ทธ๋ฆผ๋“ค์€ ๊ฐ๊ฐ SGD ๋ฐ SGD์˜ ๋ณ€ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์ด ์ตœ์ ๊ฐ’์„ ์ฐพ๋Š” ๊ณผ์ •์„ ์‹œ๊ฐํ™”ํ•œ ๊ฒƒ์ด๋‹ค. ๋นจ๊ฐ„์ƒ‰์˜ SGD๊ฐ€ ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” Naive Stochastic Gradient Descent ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๊ณ , Momentum, NAG, Adagrad, AdaDelta, RMSprop ๋“ฑ์€ SGD์˜ ๋ณ€ํ˜•์ด๋‹ค. ๋ณด๋‹ค์‹œํ”ผ ๋ชจ๋“  ๊ฒฝ์šฐ์—์„œ SGD๋Š” ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์— ๋น„ํ•ด ์„ฑ๋Šฅ์ด ์›”๋“ฑํ•˜๊ฒŒ ๋‚ฎ๋‹ค. ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค ๋ณด๋‹ค ์ด๋™์†๋„๊ฐ€ ํ˜„์ €ํ•˜๊ฒŒ ๋Š๋ฆด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๋ฐฉํ–ฅ์„ ์ œ๋Œ€๋กœ ์žก์ง€ ๋ชปํ•˜๊ณ  ์ด์ƒํ•œ ๊ณณ์—์„œ ์ˆ˜๋ ดํ•˜์—ฌ ์ด๋™ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ชจ์Šต๋„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฆ‰ ๋‹จ์ˆœํ•œ SGD๋ฅผ ์ด์šฉํ•˜์—ฌ ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๊ฒฝ์šฐ ๋„คํŠธ์›Œํฌ๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป์ง€ ๋ชปํ•  ๊ฒƒ์ด๋ผ๊ณ  ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์‹ค์ œ๋กœ๋Š” ์–ด๋–ค ๋ฐฉ๋ฒ•๋“ค์„ ์ด์šฉํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ์ธ๊ฐ€? ์ด ๊ธ€์—์„œ๋Š” Neural Network๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ์‹ค์ œ๋กœ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ๋‹ค์–‘ํ•œ SGD์˜ ๋ณ€ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์„ ๊ฐ„๋žตํ•˜๊ฒŒ ์‚ดํŽด๋ณด๊ฒ ๋‹ค. ๋‚ด์šฉ๊ณผ ๊ทธ๋ฆผ์˜ ์ƒ๋‹น ๋ถ€๋ถ„์€ Sebastian Ruder์˜ ๊ธ€ ์—์„œ ์ฐจ์šฉํ–ˆ๋‹ค.

Momentum

Momentum ๋ฐฉ์‹์€ ๋ง ๊ทธ๋Œ€๋กœ Gradient Descent๋ฅผ ํ†ตํ•ด ์ด๋™ํ•˜๋Š” ๊ณผ์ •์— ์ผ์ข…์˜ โ€˜๊ด€์„ฑโ€™์„ ์ฃผ๋Š” ๊ฒƒ์ด๋‹ค. ํ˜„์žฌ Gradient๋ฅผ ํ†ตํ•ด ์ด๋™ํ•˜๋Š” ๋ฐฉํ–ฅ๊ณผ๋Š” ๋ณ„๊ฐœ๋กœ, ๊ณผ๊ฑฐ์— ์ด๋™ํ–ˆ๋˜ ๋ฐฉ์‹์„ ๊ธฐ์–ตํ•˜๋ฉด์„œ ๊ทธ ๋ฐฉํ–ฅ์œผ๋กœ ์ผ์ • ์ •๋„๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. vtvt๋ฅผ time step t์—์„œ์˜ ์ด๋™ ๋ฒกํ„ฐ๋ผ๊ณ  ํ•  ๋•Œ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์œผ๋กœ ์ด๋™์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.vt=ฮณvtโˆ’1+ฮทโˆ‡ฮธJ(ฮธ)vt=ฮณvtโˆ’1+ฮทโˆ‡ฮธJ(ฮธ)ฮธ=ฮธโˆ’vtฮธ=ฮธโˆ’vt

Here γ is the momentum term that controls how much momentum to apply; a value around 0.9 is typical. Looking at the formula, the movement term v remembers how far the parameters moved in the past; each new movement is the previous movement scaled by the momentum term, plus the gradient step. Unrolling $v_t$ as below shows that the method can also be interpreted as moving by an exponentially weighted average of past gradients:

$$v_t = \eta\nabla_\theta J(\theta)_t + \gamma\,\eta\nabla_\theta J(\theta)_{t-1} + \gamma^2\eta\nabla_\theta J(\theta)_{t-2} + \cdots$$
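
A minimal sketch of the momentum update on the same made-up J(θ) = θ² example:

# Momentum update for J(theta) = theta^2 (illustrative values).
theta, v = 5.0, 0.0
eta, gamma = 0.1, 0.9        # step size and momentum term
for _ in range(300):
    grad = 2 * theta
    v = gamma * v + eta * grad   # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v            # theta = theta - v_t
print(theta)                     # spirals in toward the minimum at 0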

Momentum ๋ฐฉ์‹์€ SGD๊ฐ€ Oscilation ํ˜„์ƒ์„ ๊ฒช์„ ๋•Œ ๋”์šฑ ๋น›์„ ๋ฐœํ•œ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์ด SGD๊ฐ€ Oscilation์„ ๊ฒช๊ณ  ์žˆ๋Š” ์ƒํ™ฉ์„ ์‚ดํŽด๋ณด์ž.

[Figure: SGD oscillation]

SGD here needs to move toward the optimum in the middle, but the distance it can cover in a single step is limited, so when this oscillation occurs it keeps bouncing left and right and makes slow, labored progress.

[Figure: Momentum under oscillation]

With momentum, however, inertia builds up along the direction traveled most often, so even while oscillating the updates gain force in the direction of the center, and the optimizer moves comparatively faster than SGD.

[Figure: Avoiding local minima. Picture from http://www.yaldex.com.]

Momentum can also be expected to help escape local minima, as the figure above suggests. With plain SGD, once the parameters fall into the local minimum on the left, the gradient is 0 and no further movement is possible; with momentum, the inertia from the previous direction of travel can carry them out of that local minimum and on toward a better one. The trade-off is that, in addition to the parameters θ themselves, momentum must store the past movement for every variable, so the memory needed for the parameters doubles.

Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient (NAG) is based on the momentum scheme, but computes the gradient slightly differently. For quick intuition, look at the following figure first.

[Figure: Difference between Momentum and NAG. Picture from CS231.]

Momentum ๋ฐฉ์‹์—์„œ๋Š” ์ด๋™ ๋ฒกํ„ฐ vtvt ๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ํ˜„์žฌ ์œ„์น˜์—์„œ์˜ gradient์™€ momentum step์„ ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ณ  ํ•ฉ์นœ๋‹ค. ๋ฐ˜๋ฉด, NAG์—์„œ๋Š” momentum step์„ ๋จผ์ € ๊ณ ๋ คํ•˜์—ฌ, momentum step์„ ๋จผ์ € ์ด๋™ํ–ˆ๋‹ค๊ณ  ์ƒ๊ฐํ•œ ํ›„ ๊ทธ ์ž๋ฆฌ์—์„œ์˜ gradient๋ฅผ ๊ตฌํ•ด์„œ gradient step์„ ์ด๋™ํ•œ๋‹ค. ์ด๋ฅผ ์ˆ˜์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.vt=ฮณvtโˆ’1+ฮทโˆ‡ฮธJ(ฮธโˆ’ฮณvtโˆ’1)vt=ฮณvtโˆ’1+ฮทโˆ‡ฮธJ(ฮธโˆ’ฮณvtโˆ’1)ฮธ=ฮธโˆ’vtฮธ=ฮธโˆ’vt

NAG๋ฅผ ์ด์šฉํ•  ๊ฒฝ์šฐ Momentum ๋ฐฉ์‹์— ๋น„ํ•ด ๋ณด๋‹ค ํšจ๊ณผ์ ์œผ๋กœ ์ด๋™ํ•  ์ˆ˜ ์žˆ๋‹ค. Momentum ๋ฐฉ์‹์˜ ๊ฒฝ์šฐ ๋ฉˆ์ถฐ์•ผ ํ•  ์‹œ์ ์—์„œ๋„ ๊ด€์„ฑ์— ์˜ํ•ด ํ›จ์”ฌ ๋ฉ€๋ฆฌ ๊ฐˆ์ˆ˜๋„ ์žˆ๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•˜๋Š” ๋ฐ˜๋ฉด, NAG ๋ฐฉ์‹์˜ ๊ฒฝ์šฐ ์ผ๋‹จ ๋ชจ๋ฉ˜ํ…€์œผ๋กœ ์ด๋™์„ ๋ฐ˜์ •๋„ ํ•œ ํ›„ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์ด๋™ํ•ด์•ผํ•  ์ง€๋ฅผ ๊ฒฐ์ •ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ Momentum ๋ฐฉ์‹์˜ ๋น ๋ฅธ ์ด๋™์— ๋Œ€ํ•œ ์ด์ ์€ ๋ˆ„๋ฆฌ๋ฉด์„œ๋„, ๋ฉˆ์ถฐ์•ผ ํ•  ์ ์ ˆํ•œ ์‹œ์ ์—์„œ ์ œ๋™์„ ๊ฑฐ๋Š” ๋ฐ์— ํ›จ์”ฌ ์šฉ์ดํ•˜๋‹ค๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

Adagrad

Adagrad (Adaptive Gradient) is a scheme that updates each variable with its own step size. The basic idea is: 'give variables that have changed little so far a large step size, and variables that have changed a lot a small one.' Variables that occur frequently or have changed a lot are likely to be close to their optimum already, so they should take small steps and fine-tune; variables that have changed little probably still have far to go to reach their optimum, so they should move quickly to drive the loss down first. In particular, when learning word representations such as word2vec or GloVe, variables are used at markedly different rates depending on how often words occur, so an adaptive scheme like Adagrad can deliver considerably better performance.

Adagrad์˜ ํ•œ ์Šคํ…์„ ์ˆ˜์‹ํ™”ํ•˜์—ฌ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.Gt=Gtโˆ’1+(โˆ‡ฮธJ(ฮธt))2Gt=Gtโˆ’1+(โˆ‡ฮธJ(ฮธt))2ฮธt+1=ฮธtโˆ’ฮทGt+ฯตโˆ’โˆ’โˆ’โˆ’โˆ’โˆšโ‹…โˆ‡ฮธJ(ฮธt)ฮธt+1=ฮธtโˆ’ฮทGt+ฯตโ‹…โˆ‡ฮธJ(ฮธt)

Neural Network์˜ parameter๊ฐ€ k๊ฐœ๋ผ๊ณ  ํ•  ๋•Œ, GtGt๋Š” k์ฐจ์› ๋ฒกํ„ฐ๋กœ์„œ โ€˜time step t๊นŒ์ง€ ๊ฐ ๋ณ€์ˆ˜๊ฐ€ ์ด๋™ํ•œ gradient์˜ sum of squaresโ€™ ๋ฅผ ์ €์žฅํ•œ๋‹ค. ฮธฮธ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ์ƒํ™ฉ์—์„œ๋Š” ๊ธฐ์กด step size ฮทฮท์— GtGt์˜ ๋ฃจํŠธ๊ฐ’์— ๋ฐ˜๋น„๋ก€ํ•œ ํฌ๊ธฐ๋กœ ์ด๋™์„ ์ง„ํ–‰ํ•˜์—ฌ, ์ง€๊ธˆ๊นŒ์ง€ ๋งŽ์ด ๋ณ€ํ™”ํ•œ ๋ณ€์ˆ˜์ผ ์ˆ˜๋ก ์ ๊ฒŒ ์ด๋™ํ•˜๊ณ  ์ ๊ฒŒ ๋ณ€ํ™”ํ•œ ๋ณ€์ˆ˜์ผ ์ˆ˜๋ก ๋งŽ์ด ์ด๋™ํ•˜๋„๋ก ํ•œ๋‹ค. ์ด ๋•Œ ฯตฯต์€ 10โˆ’410โˆ’4 ~ 10โˆ’810โˆ’8 ์ •๋„์˜ ์ž‘์€ ๊ฐ’์œผ๋กœ์„œ 0์œผ๋กœ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์ž‘์€ ๊ฐ’์ด๋‹ค. ์—ฌ๊ธฐ์—์„œ GtGt๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ์‹์—์„œ ์ œ๊ณฑ์€ element-wise ์ œ๊ณฑ์„ ์˜๋ฏธํ•˜๋ฉฐ, ฮธฮธ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ์‹์—์„œ๋„ โ‹…โ‹… ์€ element-wiseํ•œ ์—ฐ์‚ฐ์„ ์˜๋ฏธํ•œ๋‹ค.

An advantage of Adagrad is that there is no need to fuss over a step-size decay schedule as training proceeds; typically the step size is set to about 0.01 and then left unchanged. Its weakness is that as training continues the step size shrinks too far: squared values keep being added to G, so its entries only ever grow, and after long training the step size becomes so small that the parameters barely move at all. RMSProp and AdaDelta are the algorithms that patch this flaw.

RMSProp

RMSProp์€ ๋”ฅ๋Ÿฌ๋‹์˜ ๋Œ€๊ฐ€ ์ œํ”„๋ฆฌ ํžŒํ†ค์ด ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ์„œ, Adagrad์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์ด๋‹ค. Adagrad์˜ ์‹์—์„œ gradient์˜ ์ œ๊ณฑ๊ฐ’์„ ๋”ํ•ด๋‚˜๊ฐ€๋ฉด์„œ ๊ตฌํ•œ GtGt ๋ถ€๋ถ„์„ ํ•ฉ์ด ์•„๋‹ˆ๋ผ ์ง€์ˆ˜ํ‰๊ท ์œผ๋กœ ๋ฐ”๊พธ์–ด์„œ ๋Œ€์ฒดํ•œ ๋ฐฉ๋ฒ•์ด๋‹ค. ์ด๋ ‡๊ฒŒ ๋Œ€์ฒด๋ฅผ ํ•  ๊ฒฝ์šฐ Adagrad์ฒ˜๋Ÿผ GtGt๊ฐ€ ๋ฌดํ•œ์ • ์ปค์ง€์ง€๋Š” ์•Š์œผ๋ฉด์„œ ์ตœ๊ทผ ๋ณ€ํ™”๋Ÿ‰์˜ ๋ณ€์ˆ˜๊ฐ„ ์ƒ๋Œ€์ ์ธ ํฌ๊ธฐ ์ฐจ์ด๋Š” ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค. ์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.G=ฮณG+(1โˆ’ฮณ)(โˆ‡ฮธJ(ฮธt))2G=ฮณG+(1โˆ’ฮณ)(โˆ‡ฮธJ(ฮธt))2ฮธ=ฮธโˆ’ฮทG+ฯตโˆ’โˆ’โˆ’โˆ’โˆ’โˆšโ‹…โˆ‡ฮธJ(ฮธt)ฮธ=ฮธโˆ’ฮทG+ฯตโ‹…โˆ‡ฮธJ(ฮธt)

AdaDelta

AdaDelta (Adaptive Delta), like RMSProp, was proposed to remedy Adagrad's weakness. Like RMSProp, AdaDelta computes G as an exponential moving average rather than a sum; but instead of simply using η as the step size, it uses an exponential moving average of the squared parameter updates:

$$G = \gamma G + (1-\gamma)\left(\nabla_\theta J(\theta_t)\right)^2$$
$$\Delta_\theta = \frac{\sqrt{s + \epsilon}}{\sqrt{G + \epsilon}} \cdot \nabla_\theta J(\theta_t)$$
$$\theta = \theta - \Delta_\theta$$
$$s = \gamma s + (1-\gamma)\,\Delta_\theta^{\,2}$$

์–ผํ• ๋ณด๋ฉด ์™œ ์ด๋Ÿฌํ•œ ์‹์ด ๋„์ถœ๋˜์—ˆ๋Š”์ง€ ์ดํ•ด๊ฐ€ ์•ˆ๋  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด๋Š” ์‚ฌ์‹ค Gradient Descent์™€ ๊ฐ™์€ first-order optimization ๋Œ€์‹  Second-order optimization์„ approximate ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์ด๋‹ค. ์‹ค์ œ๋กœ ๋…ผ๋ฌธ์˜ ์ €์ž๋Š” SGD, Momentum, Adagrad์™€ ๊ฐ™์€ ์‹๋“ค์˜ ๊ฒฝ์šฐ ฮ”ฮธฮ”ฮธ์˜ unit(๋‹จ์œ„)์„ ๊ตฌํ•ด๋ณด๋ฉด ฮธฮธ์˜ unit์ด ์•„๋‹ˆ๋ผ ฮธฮธ์˜ unit์˜ ์—ญ์ˆ˜๋ฅผ ๋”ฐ๋ฅธ๋‹ค๋Š” ๊ฒƒ์„ ์ง€์ ํ•œ๋‹ค. ฮธฮธ์˜ unit์„ u(ฮธ)u(ฮธ)๋ผ๊ณ  ํ•˜๊ณ  J๋Š” unit์ด ์—†๋‹ค๊ณ  ์ƒ๊ฐํ•  ๊ฒฝ์šฐ, first-order optimization์—์„œ๋Š”ฮ”ฮธโˆโˆ‚Jโˆ‚ฮธโˆ1u(ฮธ)ฮ”ฮธโˆโˆ‚Jโˆ‚ฮธโˆ1u(ฮธ)

whereas in a second-order method such as Newton's method

$$\Delta_\theta \propto \frac{\partial J / \partial \theta}{\partial^2 J / \partial \theta^2} \propto u(\theta)$$

so the units come out right. The author therefore takes Newton's method, with $\Delta_\theta = \frac{\partial J/\partial\theta}{\partial^2 J/\partial\theta^2}$, and uses $\frac{1}{\partial^2 J/\partial\theta^2} = \frac{\Delta_\theta}{\partial J/\partial\theta}$ to approximate this ratio by the ratio of the root mean square of the numerator to the root mean square of the denominator. Readers who want a fuller explanation should read the paper itself.
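
A minimal sketch of the four update equations (the ε value and iteration count are illustrative assumptions; note there is no step size η anywhere):

# AdaDelta on J(theta) = theta^2.
theta, G, s = 5.0, 0.0, 0.0
gamma, eps = 0.9, 1e-6
for _ in range(5000):
    grad = 2 * theta
    G = gamma * G + (1 - gamma) * grad ** 2          # EMA of squared gradients
    delta = ((s + eps) / (G + eps)) ** 0.5 * grad    # unit-consistent step
    theta -= delta
    s = gamma * s + (1 - gamma) * delta ** 2         # EMA of squared updates
print(theta)   # heads toward 0, slowly at first, with no hand-set step size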

Adam

Adam (Adaptive Moment Estimation) is an algorithm that can be seen as RMSProp and momentum combined. Like momentum, it keeps an exponential moving average of the gradients computed so far; like RMSProp, it keeps an exponential moving average of the squared gradients:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla_\theta J(\theta)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\nabla_\theta J(\theta)\right)^2$$

๋‹ค๋งŒ, Adam์—์„œ๋Š” m๊ณผ v๊ฐ€ ์ฒ˜์Œ์— 0์œผ๋กœ ์ดˆ๊ธฐํ™”๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต์˜ ์ดˆ๋ฐ˜๋ถ€์—์„œ๋Š” mt,vtmt,vt๊ฐ€ 0์— ๊ฐ€๊น๊ฒŒ bias ๋˜์–ด์žˆ์„ ๊ฒƒ์ด๋ผ๊ณ  ํŒ๋‹จํ•˜์—ฌ ์ด๋ฅผ unbiased ํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์ž‘์—…์„ ๊ฑฐ์นœ๋‹ค. mtmt ์™€ vtvt์˜ ์‹์„ โˆ‘โˆ‘ ํ˜•ํƒœ๋กœ ํŽผ์นœ ํ›„ ์–‘๋ณ€์— expectation์„ ์”Œ์›Œ์„œ ์ •๋ฆฌํ•ด๋ณด๋ฉด, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ณด์ •์„ ํ†ตํ•ด unbiased ๋œ expectation์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ์ด ๋ณด์ •๋œ expectation๋“ค์„ ๊ฐ€์ง€๊ณ  gradient๊ฐ€ ๋“ค์–ด๊ฐˆ ์ž๋ฆฌ์— mt^mt^, GtGt๊ฐ€ ๋“ค์–ด๊ฐˆ ์ž๋ฆฌ์— vt^vt^๋ฅผ ๋„ฃ์–ด ๊ณ„์‚ฐ์„ ์ง„ํ–‰ํ•œ๋‹ค.mt^=mt1โˆ’ฮฒt1mt^=mt1โˆ’ฮฒ1tvt^=vt1โˆ’ฮฒt2vt^=vt1โˆ’ฮฒ2tฮธ=ฮธโˆ’ฮทvt^+ฯตโˆ’โˆ’โˆ’โˆ’โˆ’โˆšmt^ฮธ=ฮธโˆ’ฮทvt^+ฯตmt^

Typical values are reportedly about 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for ε.
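
A minimal sketch of the full Adam update with bias correction, again on the made-up J(θ) = θ² example:

# Adam on J(theta) = theta^2.
theta, m, v = 5.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):                       # t starts at 1 for the bias correction
    grad = 2 * theta
    m = beta1 * m + (1 - beta1) * grad         # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the zero-initialization bias
    v_hat = v / (1 - beta2 ** t)
    theta -= eta / (v_hat + eps) ** 0.5 * m_hat
print(theta)   # approaches the minimum at 0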

Summing up

We have now looked at a range of variants of the gradient descent algorithm. Among Momentum, NAG, AdaGrad, AdaDelta, RMSProp, Adam, and the rest, it is hard to say that any one is best. Performance differs widely depending on which problem you are solving, which dataset you use, and which network you apply it to, so when actually training a network it is worth experimenting to see which optimizer performs best in the case at hand. Fortunately, with machine-learning libraries such as TensorFlow, changing a single line of code is enough to switch optimizers, making such experiments easy.

The algorithms described here are all variants of stochastic gradient descent, that is, of plain first-order optimization. There are also algorithms based on second-order optimization, such as Newton's method. However, naive second-order optimization requires computing the Hessian matrix of second partial derivatives and then inverting it, which is computationally expensive, so it is rarely used. To reduce this cost there are algorithms such as BFGS / L-BFGS, which proceed by approximating or estimating the Hessian, and Hessian-free optimization, which performs second-order optimization without computing the Hessian directly. I hope to study these and cover them another time.

metric(ํ‰๊ฐ€์ง€ํ‘œ)

Validation set์—์„œ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•  ๋•Œ, ์–ด๋–ค ํ‰๊ฐ€์ง€ํ‘œ๋กœ ํ‰๊ฐ€ํ• ์ง€๋ฅผ ๊ฒฐ์ •ํ•ด์ค๋‹ˆ๋‹ค.

๊ฒ€์ฆ์…‹๊ณผ ์—ฐ๊ด€. ํ›ˆ๋ จ ๊ณผ์ •์„ ๋ชจ๋‹ˆํ„ฐ๋งํ•˜๋Š”๋ฐ ์‚ฌ์šฉ.
(ํ‰๊ฐ€์ง€ํ‘œ๋กœ ์–ด๋–ค ๊ฒƒ์„ ์‚ฌ์šฉํ•˜๋”๋ผ๋„ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์˜ ์—…๋ฐ์ดํŠธ์—๋Š” ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๋Š”๋‹ค)

ํ•™์Šต๊ณก์„ ์„ ๊ทธ๋ฆด ๋•Œ ์†์‹คํ•จ์ˆ˜์™€ ํ‰๊ฐ€์ง€ํ‘œ๋ฅผ ์—ํฌํฌ(epoch)๋งˆ๋‹ค ๊ณ„์‚ฐํ•œ ๊ฒƒ์„ ๊ทธ๋ ค์ฃผ๋Š”๋ฐ, ์†์‹คํ•จ์ˆ˜์˜ ์ถ”์ด์™€ ํ‰๊ฐ€์ง€ํ‘œ์˜ ์ถ”์ด๋ฅผ ๋น„๊ตํ•ด๋ณด๋ฉด์„œ ๋ชจ๋ธ์ด ๊ณผ๋Œ€์ ํ•ฉ(overfit) ๋˜๋Š” ๊ณผ์†Œ์ ํ•ฉ(underfit)๋˜๊ณ  ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

bskyvision์˜ ์ถ”์ฒœ๊ธ€ โ˜ž

๋”ฅ๋Ÿฌ๋‹ ์†์‹ค ํ•จ์ˆ˜(loss function) ์ •๋ฆฌ: MSE, MAE, binary/categorical/sparse categorical crossentropy

[python] ํ”ผ์–ด์Šจ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๋ชจ๋ธ์˜ ์†์‹คํ•จ์ˆ˜ ๋˜๋Š” ํ‰๊ฐ€์ง€ํ‘œ๋กœ ์‚ฌ์šฉํ•˜๋ ค๋ฉด

<์ฐธ๊ณ ์ž๋ฃŒ>

[1] https://keras.io/ko/metrics/, Keras Documentation, “์ธก์ •ํ•ญ๋ชฉ์˜ ์‚ฌ์šฉ๋ฒ•”

[2] https://keras.io/ko/losses/, Keras Documentation, “์†์‹ค ํ•จ์ˆ˜์˜ ์‚ฌ์šฉ”

[3] https://www.tensorflow.org/guide/keras/overview?hl=ko, TensorFlow, “์ผ€๋ผ์Šค: ๋น ๋ฅด๊ฒŒ ํ›‘์–ด๋ณด๊ธฐ”

Categories: DeepLearning

onesixx

Blog Owner

Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x