1D-CNN-HAR

Published by onesixx

1D CNNs are a good fit when you want to recognize interesting patterns over short segments of a sequence and the pattern's location should not matter.

In terms of a 2D image, this makes sense: a cat should be recognized no matter where it appears in the frame.
What about applying this to stock prices? Unfortunately, stock data has trend and seasonality, so location does matter: the same upward pattern means something different when it appeared a moment ago than when it appeared a month ago.
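
A toy sketch in base R (no Keras, just a hand-rolled sliding dot product) illustrates this location invariance: the same filter produces the same peak response wherever the pattern occurs in the sequence.

# toy 1D convolution: same filter response regardless of pattern location
signal1 <- c(0,0,0, 1,2,1, 0,0,0,0)   # pattern near the start
signal2 <- c(0,0,0,0,0,0, 1,2,1, 0)   # same pattern near the end
filt    <- c(1,2,1)                   # a "pattern detector" filter

conv1d <- function(x, w) {
  sapply(seq_len(length(x) - length(w) + 1),
         function(i) sum(x[i:(i + length(w) - 1)] * w))
}
max(conv1d(signal1, filt))  # 6
max(conv1d(signal2, filt))  # 6  <- same peak, different location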

Example


Build a first Conv layer that applies 100 different filters of size 10 (10*3) to an 80*3 dataset,
then repeat the same operation in a second Conv layer, so the final output is a 62*100 matrix.
Pooling and dropout prevent overfitting, and an appropriate activation function produces the result.

  1. Height : length of one input dataset fed to the network, e.g. 80 rows
  2. Width : depth of one input dataset fed to the network, e.g. 3 columns for the accelerometer x, y, z axes
  3. Filter = feature detector = sliding window, e.g. filter size 10 * 3, one result per filter
    Kernel size : height of the filter, e.g. 10 rows; the filter slides downward (80-10+1) = 71 times
  4. Number of filters, e.g. 100 different filters, i.e. 100 different features are extracted
  5. Output : the number of rows is determined by the input height and the kernel size,
    and the number of columns by the number of filters.
    e.g. the 1st layer's output neurons are (80-10+1) * 100 => learn basic features,
    the 2nd layer's output neurons are (71-10+1) * 100 => learn more complex features
  6. Max Pooling : slides over the features in fixed steps and keeps only the max value, discarding the rest,
    which helps prevent overfitting of the learned features. e.g. in steps of 3 rows, i.e. 66% of the previous layer is discarded; output is floor(62/3) * 100

– Additional Conv layer and additional (average) pooling
– Dropout : to further prevent overfitting, a fraction of the input units (the dropout rate) is randomly set to 0.
e.g. dropout rate = 0.5 drops 50% of the neurons, making the network less sensitive to small changes.
7. Dense : fully connected layer; a softmax activation function produces a probability distribution over the 6 classes.
8. Final Output : the probability for each target class (a Keras sketch of this whole stack follows below).
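
The shapes in the list above can be checked with a minimal Keras sketch. The filter count, kernel size, pooling size, dropout rate, and class count come from the example; the global-average-pooling step before the dense layer is an assumption added here only to make the sketch runnable.

library(keras)
toyModel <- keras_model_sequential() %>%
  layer_conv_1d(filters=100, kernel_size=10, activation="relu",
                input_shape=c(80, 3)) %>%                            # -> (71, 100)
  layer_conv_1d(filters=100, kernel_size=10, activation="relu") %>%  # -> (62, 100)
  layer_max_pooling_1d(pool_size=3) %>%                              # -> (20, 100) = floor(62/3) * 100
  layer_dropout(rate=0.5) %>%
  layer_global_average_pooling_1d() %>%                              # assumed step, not prescribed above
  layer_dense(units=6, activation="softmax")                         # 6-class probability distribution
summary(toyModel)  # confirms the (71,100), (62,100), (20,100) shapes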

Human Activity Recognition (HAR)

Classify the user's activity type from the data ("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")
=> HAR (Human Activity Recognition part 1, part 2)
Data : smartphone accelerometer sensor data, sampled at equal time intervals, x/y/z axes
WISDM_ar_v1.1_raw.txt (github, kaggle)

Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set distributed by the University of California, Irvine.
https://blogs.rstudio.com/ai/posts/2018-07-17-activity-detection/

Predict physical activities from smartphone data.

data: accelerometer and gyroscope data from the smartphones of 30 subjects (HAPT)

  • data pre-processed with a feature-extraction technique (fast Fourier transform)
  • raw data (x, y, z)
### Classifying activity (HAR) ~~~
# https://blogs.rstudio.com/ai/posts/2018-07-17-activity-detection/
source(file.path(getwd(),"../00.global_dl.R"))

library(knitr)
library(rmarkdown)

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## 10. Load data ----
######~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### ` === Raw Data -------------------------------------------------------------
# http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
URL="http://archive.ics.uci.edu/ml/machine-learning-databases/00341/HAPT%20Data%20Set.zip"
temp <- tempfile()
download.file(URL, temp)
unzip(temp, exdir=file.path(DATA_PATH,"HAPT"))
unlink(temp)

rm(temp, URL)

### ` ` * Load -----------------------------------------------------------------
activityLabels <- fread(file.path(DATA_PATH,"HAPT","activity_labels.txt"), 
                        col.names=c("number", "label")) 

infoLabels <- fread(file.path(DATA_PATH,"HAPT","RawData","labels.txt"), 
                    col.names = c("experiment", "userId", "activity", "startPos", "endPos"))

dataFiles <- list.files(file.path(DATA_PATH,"HAPT","RawData"), pattern=".*[0-9].txt")
#dataFiles %>% glimpse()
###  acc/gyro _ exp1-61 _ user1-30 . txt

fileInfo <- data.table(filePath = dataFiles) %>%  #  filter(filePath != "labels.txt") %>% 
  separate(filePath, sep = '_',  into = c("type", "experiment", "userId"), remove=F) %>% 
  mutate( experiment = str_remove(experiment, "exp"),
          userId     = str_remove_all(userId, "user|\\.txt") ) %>%
  select(experiment,userId,type,filePath) %>% 
  spread(key=type, value=filePath)

fileInfo %>% head()
fileInfo %>% glimpse()

### ` ` * Reading and gathering data -------------------------------------------
# Read contents of single file to a dataframe with accelerometer and gyro data.
# fread("/Users/onesixx/DATA/CNN/HAPT/RawData/acc_exp01_user01.txt", col.names=c("a_x","a_y","a_z"))

readInData <- function(experiment, userId){
  ## : read data from files----
  genFilePath <- function(type){
    ## : get path ----
    file.path(DATA_PATH,"HAPT","RawData",str_c(type,"_exp",experiment,"_user",userId,".txt"))
  }
  bind_cols(
    fread(genFilePath("acc"),  col.names=c("a_x","a_y","a_z")),
    fread(genFilePath("gyro"), col.names=c("g_x","g_y","g_z"))
  )
}
# Read a given file and get the observations contained along with their classes.
loadFileData <- function(curExperiment, curUserId) {
  ## : load data from files----
  # curExperiment="01"; curUserId="01"
  # load sensor data from file into data.frame
  allData <- readInData(curExperiment, curUserId)
  
  # get observation locations in this file from labels dataframe
  label_byExperimentUserId <- infoLabels %>% 
    filter(userId    ==as.integer(curUserId), 
           experiment==as.integer(curExperiment))
  # extract observations as dataframes and save as a column in dataframe.
  label_byExperimentUserId %>% 
    mutate(data=map2(startPos, endPos, function(startPos, endPos){ # extractObservation
      allData[startPos:endPos,]
    }) ) %>% 
    select(-startPos, -endPos)
}

# scan through all experiment and userId combos and gather data into a dataframe. 
allObservations <- map2_df(fileInfo$experiment, fileInfo$userId, loadFileData) %>% 
  right_join(activityLabels, by= c("activity"="number")) %>% 
  rename(activityName=label)

# cache work. 
#ifelse(dir.exists(file.path(DATA_PATH,"PreProcess")), cat("OK","\n"), dir.create(file.path(DATA_PATH,"PreProcess")))
#write_fst(allObservations, file.path(DATA_PATH,"allObservations.fst"))
saveRDS(allObservations, file.path(DATA_PATH,"HAPT_obs.rds"))
allObservations %>% dim()
rm(allObservations)


## ` ` * Read RDS again ----
#allObservations <- read_fst(file.path(DATA_PATH,"allObservations.fst"))
allObservations <- readRDS(file.path(DATA_PATH,"HAPT_obs.rds"))
allObservations %>% head(20)

# allObservations %>% 
#   filter(activityName=="WALKING_UPSTAIRS") %>% 
#   mutate(recording_length=map_int(data, nrow)) %>% 
#   select(activityName, recording_length) %>% 
#   ggplot(aes(x=recording_length, y=activityName)) +
#   geom_density_ridges(alpha=0.8)


allObservations %>%  
  mutate(recording_length=map_int(data, nrow)) %>% 
  ggplot(aes(x=recording_length, y=activityName)) +
  ggridges::geom_density_ridges(alpha=0.8)

# recording_length differs between activities.
# => we need to zero-pad up to the length of the longest activity
unique(allObservations$activityName)

### ` ` * Filtering activities ----
desiredActivities <- c(
  "STAND_TO_SIT", "SIT_TO_STAND", "SIT_TO_LIE", 
  "LIE_TO_SIT",   "STAND_TO_LIE", "LIE_TO_STAND"  
)

filteredObservations <- allObservations %>% 
  filter(activityName %in% desiredActivities) %>% 
  mutate(observationId=1:n())

filteredObservations
#unique(filteredObservations$activityName)


### ` ` * Training/testing split ----
set.seed(42)
## randomly choose 24 for training (80% of 30 individuals) 
## set the rest of the users to the testing set
## get all users : 30 subjects
userIds  <- filteredObservations$userId %>% unique()
trainIds <- userIds %>% sample(size=24)
testIds  <- userIds %>% setdiff(trainIds)

## filter data. 
trainData <- filteredObservations %>% filter(userId %in% trainIds)
testData  <- filteredObservations %>% filter(userId %in% testIds)

### ` === EDA ------------------------------------------------------------------
unpackedObs <- 1:nrow(trainData) %>% 
  map_df(function(rowNum){
    #rowNum=1
    dataRow <- trainData[rowNum, ]
    
    dataRow$data[[1]] %>% 
      mutate(activityName =dataRow$activityName, 
             observationId=dataRow$observationId,
             time         = 1:n())
  }) %>%
  #gather(reading, value, -time, -activityName, -observationId) %>% 
  gather(key=reading, value=value, a_x,a_y,a_z, g_x,g_y,g_z) %>% 
  separate(col=reading, into=c("type", "direction"), sep="_") %>% 
  mutate(type=ifelse(type=="a", "acceleration", "gyro")) %>% as.data.table()

unpackedObs %>% #filter(type=="acceleration") %>% 
  ggplot(aes(x=time, y=value, color=direction)) +
  geom_line(alpha=.2) +
  geom_smooth(se=F, alpha=.7, size=.5) +
  facet_grid(type ~ activityName)

unpackedObs[activityName=="STAND_TO_SIT" & observationId=="1" & type=="acceleration", ]

### ` === Preprocess : Normalize / rescale /  ----------------------------------
### ` === INPUT LAYER ----------------------------------------------------------
# convert list of obs to matrices
# pad all obs
# turn into a 3D tensor

## ` ` * Padding(truncate) observations ----
# truncate long length obs (outlier)
padSize <- trainData$data %>% map_int(nrow) %>% 
              quantile(p=.98) %>% ceiling()

dataset <- data.table(trainData$data %>% map_int(nrow))
dataset%>% 
  ggplot(aes(V1)) + 
  geom_density() + #(aes(y=..density..)))
  geom_vline(aes(xintercept= quantile(dataset$V1, prob=.98)), color="red", size=.6) +
  scale_y_continuous(expand=c(0, NA)) 

convertToTensor <- function(myList) {
  # : keras::pad_sequences() ----
  # myList = trainData$data[1]
  myList %>% map(as.matrix) %>%
    pad_sequences(maxlen=padSize) 
}
# convertToTensor <- . %>% map(as.matrix) %>%
#                       pad_sequences(maxlen=padSize)

trainObs <- trainData$data %>% convertToTensor()
testObs  <- testData$data  %>% convertToTensor()

# trainData$data %>% length      [1] 290          of list
# trainObs %>% dim               [1] 290 320   6  of array
# trainData %>% mutate(nn=map_int(data, nrow)) %>% arrange(desc(nn))
# tt <- trainData[observationId==281, data] %>% convertToTensor()
tt <- trainObs[281,,]
# tt0 <- trainData$data %>% map(function(x){ x[x!=0]<-1;  x })
# ttObs <- tt0 %>% convertToTensor()
# tt <- ttObs[281,,]
reshape2::melt(tt) %>% 
  ggplot(aes(Var2, Var1, fill=value)) + geom_tile(color="white") + 
  scale_fill_gradient2(limit=c(-1,1), midpoint=0, name="", 
                       low="blue",mid="white",high="red")

# keras::pad_sequences(sequences, maxlen=NULL, dtype="int32",
#   padding="pre", truncating="pre", value=0
# )
# pads sequences of different lengths to a common length; works for both 2-D and 3-D output

trainObs %>% dim()  # 290 maxLength 6 (, , )
testObs  %>% dim()  #  68 333       6

## ` ` One-hot encoding ----
#filteredObservations$activity %>% unique() %>% sort()
oneHotClasses <- . %>% {. -7} %>%   # bring integers down to 0-5 from 7-12
                    to_categorical()   # One-hot encode

trainY <- trainData$activity %>% oneHotClasses()
testY  <- testData$activity  %>% oneHotClasses()

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### 20. Train the model ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### ` === Build/Reshape/complie the model --------------------------------------
input_shape <- dim(trainObs)[-1]
num_classes <- dim(trainY)[2]

filters <- 24     # number of convolutional filters to learn
kernel_size <- 8  # how many time-steps each conv layer sees.
dense_size <- 48  # size of our penultimate dense layer. 

# Initialize model
model <- keras_model_sequential()
model %>% 
  layer_conv_1d(input_shape=input_shape, filters=filters, kernel_size=kernel_size,
                padding="valid",  activation="relu") %>%
  layer_batch_normalization() %>%
  layer_spatial_dropout_1d(0.15) %>% 
  layer_conv_1d(filters = filters/2, kernel_size = kernel_size,
                activation = "relu") %>%
  # Apply average pooling:
  layer_global_average_pooling_1d() %>% 
  layer_batch_normalization() %>%
  layer_dropout(0.2) %>% 
  layer_dense(dense_size, activation="relu") %>% 
  layer_batch_normalization() %>%
  layer_dropout(0.25) %>% 
  layer_dense(num_classes, activation="softmax", name="dense_output") 

summary(model)
### ` === Train(fitting) the model : history, summary --------------------------

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "rmsprop",
  metrics = "accuracy"
)

trainHistory <- model %>% fit(x=trainObs, y=trainY,
                              epochs=350,
                              validation_data=list(testObs, testY),
                              callbacks=list( 
                                callback_model_checkpoint("best_model.h5", save_best_only=T)
                              )
                            )
trainHistory
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### 30. Evaluation ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### ` === Evaluate accuracy ----------------------------------------------------
# dataframe to get labels onto one-hot encoded prediction columns
oneHotToLabel <- activityLabels %>% mutate(number=number-7) %>% filter(number>=0) %>% 
                   mutate(class=paste0("V",number+1)) %>% 
                   select(-number)

# Load our best model checkpoint
bestModel <- load_model_hdf5("best_model.h5")

tidyPredictionProbs <- bestModel %>% predict(testObs) %>% data.table() %>% 
                        mutate(obs=1:n()) %>% 
                        gather(class, prob, -obs) %>% 
                        right_join(oneHotToLabel, by = "class")

### ` === Improve the model ----------------------------------------------------

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### 40. Make predictions ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### ` === explain Model --------------------------------------------------------
### ` === NEW DATA predictions -------------------------------------------------

predictionPerformance <- tidyPredictionProbs %>% 
  group_by(obs) %>% 
  summarise(
    highestProb = max(prob),
    predicted = label[prob == highestProb]
  ) %>% 
  mutate(
    truth = testData$activityName,
    correct = truth == predicted
  ) 

predictionPerformance %>% paged_table()
predictionPerformance 

predictionPerformance %>% mutate(result=ifelse(correct,'Correct','Incorrect')) %>% 
  ggplot(aes(highestProb)) +
    geom_histogram(binwidth=.01) +
    geom_rug(alpha=.5) +
    facet_grid(result~.) +
    ggtitle("Probabilities associated with prediction by correctness")


predictionPerformance %>% 
  group_by(truth, predicted) %>% summarise(count=n()) %>% 
  mutate(good= (truth==predicted)) %>% 
  ggplot(aes(x=truth,  y=predicted)) +
    geom_point(aes(size=count, color=good)) +
    geom_text(aes(label=count), hjust=0, vjust=0, nudge_x=.1, nudge_y=.1) + 
    guides(color=F, size=F) 

keras::pad_sequences()

  • The maxlen argument equalizes the sequence lengths.
    For example, if maxlen is set to 120, sequences shorter than 120 are zero-padded up to 120 and sequences longer than 120 are truncated to 120 steps.
  • For 2-D input the result is a numpy array of shape (num_samples, num_timesteps).
    If maxlen is set to 120, num_timesteps is 120 as well.
> library(keras)
> X = matrix(c(1.01528, -2.141667, 3.0597222, 
+              1.02083, -1.999999, 0.0555556, 
+              1.02778, -0.173611, 0.0694444,
+             -1,        0,        1         ), nrow=4, ncol=3, byrow=T)
> X
         [,1]      [,2]      [,3]
[1,]  1.01528 -2.141667 3.0597222
[2,]  1.02083 -1.999999 0.0555556
[3,]  1.02778 -0.173611 0.0694444
[4,] -1.00000  0.000000 1.0000000
> X %>% pad_sequences()
     [,1] [,2] [,3]
[1,]    1   -2    3
[2,]    1   -1    0
[3,]    1    0    0
[4,]   -1    0    1
> X %>% pad_sequences(maxlen=6)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    1   -2    3
[2,]    0    0    0    1   -1    0
[3,]    0    0    0    1    0    0
[4,]    0    0    0   -1    0    1


> X = list(matrix(c(1.01528, -2.141667, 3.0597222, 
+              1.02083, -1.999999, 0.0555556, 
+              1.02778, -0.173611, 0.0694444,
+              -1,        0,        1         ), nrow=4, ncol=3, byrow=T))
> X
[[1]]
         [,1]      [,2]      [,3]
[1,]  1.01528 -2.141667 3.0597222
[2,]  1.02083 -1.999999 0.0555556
[3,]  1.02778 -0.173611 0.0694444
[4,] -1.00000  0.000000 1.0000000

> X %>% pad_sequences()
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    1    1   -1
, , 2
     [,1] [,2] [,3] [,4]
[1,]   -2   -1    0    0
, , 3
     [,1] [,2] [,3] [,4]
[1,]    3    0    0    1

> X %>% pad_sequences(maxlen=6)
, , 1
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    1    1    1   -1
, , 2
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0   -2   -1    0    0
, , 3
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    3    0    0    1
> summary(model)
Model: "sequential_2"
_______________________________________________________________________________
Layer (type)                                        Output Shape       Param #           
===============================================================================
conv1d (Conv1D)                                     (None, 313, 24)     1176              
_______________________________________________________________________________
batch_normalization (BatchNormalization)            (None, 313, 24)       96                
_______________________________________________________________________________
spatial_dropout1d (SpatialDropout1D)                (None, 313, 24)        0                 
______________________________________________________________________________
conv1d_1 (Conv1D)                                   (None, 306, 12)     2316              
_______________________________________________________________________________
global_average_pooling1d (GlobalAveragePooling1D)   (None, 12)             0                 
______________________________________________________________________________
batch_normalization_1 (BatchNormalization)          (None, 12)            48                
_______________________________________________________________________________
dropout (Dropout)                                   (None, 12)             0                 
_______________________________________________________________________________
dense (Dense)                                       (None, 48)           624               
_______________________________________________________________________________
batch_normalization_2 (BatchNormalization)          (None, 48)           192               
_______________________________________________________________________________
dropout_1 (Dropout)                                 (None, 48)             0                 
_______________________________________________________________________________
dense_output (Dense)                                (None, 6)            294               
===============================================================================
Total params: 4,746
Trainable params: 4,578
Non-trainable params: 168
______________________________________________________________________________
Links and References

https://keras.io/guides/sequential_model/

  • Keras documentation for 1D convolutional neural networks
  • Keras examples for 1D convolutional neural networks
  • A good article with an introduction to 1D CNNs for natural language processing problems