data.table |C|R|U|D |group by

Published by onesixx on 17-07-0117-07-01

Table of contents

Create data.table
Read
Update & Insert

Create data.table

create from scratch

생성과정은 data.frame과 같지만, 단, 한가지 Data.Frame은 Character를 factor로 변환하지만, data.table은 그냥 Character로 저장

###### 1
data.table(a = c(1, 2), b = c("a", "b")) 
###### 2
i <- c(1:7) 
c <- LETTERS[i] 
n <- c(-3,-2,-1,0,1,2,3)  
l <- c(TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE) 
f <- c("MALE","FEMALE","MALE","FEMALE","FEMALE","MALE","MALE") %>% as.factor 
DT <- data.table(i,c,n,l,f)
###### 3
set.seed(666) 
DT <- data.table( A=rep(c("a","b","c"),each=2), B=c(1:3), C=sample(6), D=sample(6)) 
##### 4
DT <- data.table(A=1:10, B=letters[1:10], C=LETTERS[11:20], D=rep(c("One", "Two", "Three"), length.out=10))

create(convert) from data.frame

setDT()는 (사본을 만들거나 메모리 위치를 변경하지 않고) data.table을 만들 수 있다. (by reference) 즉, setDT() 함수로 사본을 만들지 않고, data.table 로 변환해 버린다.

setDF(my_dt)  # Convert data.table to data.frame
setDT(my_dt)  # Convert data.frame to data.table
diamondsDT <- as.data.table(diamonds)

Key setting

Key 세팅 하면, 해당 column으로 sorting 된다.

setkey(DT, A, B)

Alter table

Rename, Drop, Add Columns

# CONVERT COLUMN TYPE
DT[, b := as.integer(b)]                     # as.integer(), as.numeric(), as.character(), as.Date(), etc..

#* Rename Column----
setnames(DT, c("i","c"), c("ii","cc"))
setnames(DT, c(1,2), c("i","c"))
setnames(DT, c(names(dt[, ncol(aa), with=F])), "AA")

#* Drop(Hide) Column  ----
DT[, c(-1,-2)]
DT[ ,!c("year","month","day")]
DT[ ,!c(1:16,19)]

#* Adding Columns (Calculation on rows)  ----
dt <- dt[ ,dep_sch:= dep_time-dep_delay]
dt <- dt[ ,c("dep_sch","arr_sch"):= list(dep_time-dep_delay, arr_time-arr_delay)]
dt[, `:=`(c = 1 , d = 2)] 

# Delete Columns
dt[, c:= NULL]

#** conditions (IF...ELSE...)  ----
dt[ ,flag := ifelse(min < 50, 1, 0)]
dt[ ,flag := 1*(min<50)]

Read

J :: SELECT (columns Indexing)

결과가 벡터로~

하나의 컬럼을 사용할때는

DT[ , B]          
DT[ , c(B)]       #identical 
DT[["B"]]         #identical

결과가 data.table로

# c("" ) # dot # position

컬럼 선택시 숫자는 사용하지 않는것이 좋다. (select *을 쓰지않는것과 같다.)

DT[ ,"B"] 
DT[ ,c("B")]      #identical 
DT[ ,.(B)]        #identical 
DT[ ,list(B)]     #identical 
DT[ ,2]           #identical
DT[ ,c(2)]

DT[ ,c("A","B")
DT[ ,.(A,B)]
DT[ ,c(1,2)]

with=FALSE

Dataframe과 data.table의 차이

DF[1:2, ]
DF[DF$C>5, ]
DF[ ,DF$C>5]   # Error
sapply(DF$C, function(x){x>5})

DT[1:2, ]
DT[C>5, ]     #identical  DT[DT$C>5, ]
DT[ , C>5]    # T/F

vector scan, 일단 row수만큼의 TRUE/FALSE 벡터를 만들어 indexing한 후, binary search

data.frame에서 컬럼선택은 character 벡터를 사용하지만, data.table에서는 (character가 아닌) 실제 이름을 가진 List를 사용한다.
만약, 굳이 컬럼명을 character로 넘기려면, with=FALSE 옵션을 사용한다. (Default는 TRUE, 컬럼명은 variable)

theCols <- c("A","B") 
DT[ , theCols, with=FALSE]
DT[ , theCols, ]             # Error

위에서 with=FALSE를 사용하지 않으면 ,
theCols를 Character Vector(character/logical/integer)가 아닌, 컬럼명 variable로 해석하여 Error가 발생한다.
j열에는 선택할 컬럼을 표기하는 컬럼명이 오기 때문에.

KK <- c("A","B")
DT[ ,colnames(DT) %in% KK, with=F]
   A B
1: a 1
2: a 2
3: b 3
4: b 1
5: c 2
6: c 3

DT[ ,colnames(DT) %in% KK, ]
[1]  TRUE  TRUE FALSE FALSE

i :: Where (Rows subsetting)

Row accessing은 data.frame과 별반 다르지 않다.

DT[ c(4,6), ]
DT[DT$B==3, ] 
DT[   B==3, ]

Subset/Filter row

DT[C==6, ]                     # where

DT[A=="a" | C>3, ]             # where_or
DT[A==c("a","b") & C>3, ]      # where_and     
DT[!A==c("a","b") & !C>3, ]    # where_and_not 
DT[!(A=="b" & C>3), ]          # where_not_and 
DT[A %in% c("a","b"), ]        # where_in
DT[!A %in% c("a","b"), ]       # where_notin
DT[ A %like% "a", ]            # where_like
DT[C %between% c(2,4), ]       # where_between

DT[ str_detect(A,"(?i)a"),]

Key를 세팅한 경우

해당 Key의 값으로 row indexing가능

diamondsDT %>% setkey(cut,color)

diamondsDT[J("Ideal", "E"), ] 
diamondsDT[ cut=="Ideal" & color=="E", ]

전체 찾기

DT <- data.table(x=rep(c("b","a"),each=3), y=c(1,3,6), v=1:6)
   x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 4
5: a 3 5
6: a 6 6

각 요소에서 1인 값을 찾는 아래 수식은 data.table로 표현할수 없기 때문에, 일단 에러가 난다.

DT[DT==1]
Error in `[.data.table`(DT, DT == 1) : 
i is invalid type (matrix). 
Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14).
Please let datatable-help know if you'd like this, or add your comments to FR #657.

하지만, assign은 가능하다.

DT[DT==1] <- 9

   x y v
1: b 9 9
2: b 3 2
3: b 6 3
4: a 9 4
5: a 3 5
6: a 6 6

Group by

(By 또는 keyby) => by는 Group by 라고 생각하면 편하다.
A,B을 기준으로 원래 data.table을 2개의 Group(sub data.table)으로 나눌수 있다.

DT <- data.table( A=rep(c("a","b","c"),each=2), 
                  B=c(1:3), C=sample(6), D=sample(6)) 
   A B C D
1: a 1 1 3
2: a 2 6 6
3: b 3 5 4
4: b 1 3 2
5: c 2 4 1
6: c 3 2 5

Ex> A가 a b c 3개의 그룹으로 나눌수 있으므로, 이를 기준으로 DT는 3개의 sub Group으로 나뉜다.
여기서 중요한 것은 나뉜 각 sub data.table은 계산시 .SD 라고 생각하면된다.

DT[ , print(.SD), by=A]
#    B C D
# 1: 1 1 3
# 2: 2 6 6
#    B C D
# 1: 3 5 4
# 2: 1 3 2
#    B C D
# 1: 2 4 1
# 2: 3 2 5
# Empty data.table (0 rows) of 1 col: A

split(DT, by=c("A"))
split(DT, list(DT$A))
#$a
#    A B C D
# 1: a 1 1 3
# 2: a 2 6 6
#$b
#    A B C D
# 1: b 3 5 4
# 2: b 1 3 2
#$c
#    A B C D
# 1: c 2 4 1
# 2: c 3 2 5

Group subsetting (Within Group Calculation)

ex> 각 그룹의 row 갯수

> DT[ ,.(length(B)), by=A]
> DT[ ,      .N,     by=.(A)]
> DT[ , .SD[, .N],   by=c("A")]
#    A V1
# 1: a  2
# 2: b  2
# 3: c  2

ex> 각 그룹의 random sample 1개씩

DT[ ,.SD[sample(.N,1)], by=A]
#    A B C D
# 1: a 2 4 4
# 2: b 3 1 2
# 3: c 3 6 5

ex> 새로운 column 만들기

DT[ , .SD[ , paste(A,B,sep="")], by=A]
   A V1
1: a a1
2: a a2
3: b b3
4: b b1
5: c c2
6: c c3

DT[order(-value), c(.SD[1], sum(value)), by="GG"]

fimpSplit0 <- fimp0[, step:=str_extract(name,"Step[0-9]+")] 
fimpSplit0 <- fimp0[, `:=`( iter=str_extract(name,"Iter[0-9]+"), step=str_extract(name,"Step[0-9]+") )]

Order by

order() 사용

my_dt[order( i, c)]
my_dt[order(-i, c)]

my_dt %>% setorder(c)
setorder(my_dt, c, -f)
setorder(my_dt, -c)

   i c  n     l      f
1: 7 G  3 FALSE   MALE
2: 6 F  2 FALSE   MALE
3: 5 E  1 FALSE FEMALE
4: 4 D  0  TRUE FEMALE
5: 3 C -1 FALSE   MALE
6: 2 B -2 FALSE FEMALE
7: 1 A -3  TRUE   MALE

sub Query (Chaining)

dt[…][…]

DT[A %in% c("a","b"), ][ ,E:=C-D][ ,.(A,D,E)]

Update & Insert

set.seed(666) 
DT <- data.table( A=rep(c("a","b","c"),each=2), B=c(1:3), C=sample(6), D=sample(6))

insert 는 새로운 컬럼명

DT[ , E:=rep(7:9,2)] 
DT[ , ':='(E,rep(7:9,2)),]

   A B C D E
1: a 1 5 6 7
2: a 2 1 3 8
3: b 3 4 1 9
4: b 1 6 4 7
5: c 2 3 2 8
6: c 3 2 5 9

update는 기존의 column Name

dt[condition,`:=`(col2 = 123, col3 = 234, ...)]

DT[A=="b",':='(E=1) ]
DT[A=="b",':='(E,1) ]

   A B C D E
1: a 1 5 6 7
2: a 2 1 3 8
3: b 3 4 1 1
4: b 1 6 4 1
5: c 2 3 2 8
6: c 3 2 5 9

update는 기존의 column Index

dt[condition,`:=`(col2 , 123)]

DT[A=="b",':='(5,1) ]

   A B C D E
1: a 1 5 6 7
2: a 2 1 3 8
3: b 3 4 1 1
4: b 1 6 4 1
5: c 2 3 2 8
6: c 3 2 5 9

update는 기존의 column Index

DT[condition, (2:4)`:=`lapply(.SD, f), .SDcols=2:4]
DT[         , (names(DT)[2:4]):= lapply(.SD, f), .SDcols = names(DT)[2:4]]

dd[ , (2:4):=map(.SD, ~.*100), .SDcols=2:4]

   A   B   C   D E
1: a 100 200 400 7
2: a 200 600 600 8
3: b 300 400 200 2
4: b 100 300 200 2
5: c 200 500 300 8
6: c 300 100 100 9

data.table remove comma and convert numeric in r

col_number <- colnames(ddFS)[..]
dd[ , lapply(.SD, function(x){str_replace_all(x,',','') %>% as.numeric()}), .SDcols=col_number]

data.table |C|R|U|D |group by

Create data.table

create from scratch

create(convert) from data.frame

Key setting

Alter table

Read

J :: SELECT (columns Indexing)

결과가 벡터로~

결과가 data.table로

with=FALSE

Dataframe과 data.table의 차이

i :: Where (Rows subsetting)

Key를 세팅한 경우

전체 찾기

Group by

Group subsetting (Within Group Calculation)

Order by

sub Query (Chaining)

Update & Insert

insert 는 새로운 컬럼명

update는 기존의 column Name

update는 기존의 column Index

update는 기존의 column Index

data.table remove comma and convert numeric in r

onesixx

Simple Imputation (simputation)

seq()

group , cycle by rule

data.table |C|R|U|D |group by

Create data.table

create from scratch

create(convert) from data.frame

Key setting

Alter table

Read

J :: SELECT (columns Indexing)

결과가 벡터로~

결과가 data.table로

with=FALSE

Dataframe과 data.table의 차이

i :: Where (Rows subsetting)

Key를 세팅한 경우

전체 찾기

Group by

Group subsetting (Within Group Calculation)

Order by

sub Query (Chaining)

Update & Insert

insert 는 새로운 컬럼명

update는 기존의 column Name

update는 기존의 column Index

update는 기존의 column Index

data.table remove comma and convert numeric in r

onesixx

Related Posts

Simple Imputation (simputation)

seq()

group , cycle by rule