data.table intro

Published by onesixx on

Loading required package: data.table
data.table 1.13.0 using 1 threads (see ?getDTthreads).  
Latest news:
This installation of data.table has not detected OpenMP support. 
It should still work but in single-threaded mode.

This is a Mac. 
Please read 
Please engage with Apple and ask them for support. 
Check for updates, and our Mac instructions here: 

After several years of many reports of installation problems on Mac,
it is time to gingerly point out that there have been no similar problems on Windows or Linux.
~$➜  xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" to install updates

~$➜  brew update && brew install llvm
To use the bundled libc++ please add the following LDFLAGS:
  LDFLAGS="-L/usr/local/opt/llvm/lib -Wl,-rpath,/usr/local/opt/llvm/lib"

llvm is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

If you need to have llvm first in your PATH run:
  echo 'export PATH="/usr/local/opt/llvm/bin:$PATH"' >> ~/.zshrc

For compilers to find llvm you may need to set:
  export LDFLAGS="-L/usr/local/opt/llvm/lib"
  export CPPFLAGS="-I/usr/local/opt/llvm/include"

==> Summary
?  /usr/local/Cellar/llvm/10.0.1: 7,029 files, 1GB
# if you downloaded llvm manually above, replace with your chosen NEW_PATH/clang
LLVM_LOC = /usr/local/opt/llvm
CC=$(LLVM_LOC)/bin/clang -fopenmp
CXX=$(LLVM_LOC)/bin/clang++ -fopenmp
# -O3 should be faster than -O2 (default) level optimisation ..
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L/usr/local/opt/gettext/lib -L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
CPPFLAGS=-I/usr/local/opt/gettext/include -I$(LLVM_LOC)/include


data.table, written by Matt Dowle

data.table는 DB와 같이, 특정 column을 key값으로 indexing하여 더 빠른 access, group by, join이 가능하다. 
(data.frame과 달리, character 벡터를 factor로 자동으로 변환하지 않는다)

> library(data.table)
  The fastest way to learn (by data.table authors):
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides:
data.table 1.10.4
This installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode.
If this a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not yet support OpenMP.
In the meantime please follow our Mac installation instructions on the data.table homepage. 
If it works and you observe benefits from multiple threads as others have reported, 
please convince Simon Ubanek by sending him evidence and ask him to turn on OpenMP support 
when CRAN builds package binaries for Mac. Alternatives are to install Ubuntu on your Mac 
(which I have done and works well) or use Windows where OpenMP is supported and works well.
The following objects are masked from ‘package:dplyr’:  between, first, last
The following object is masked from ‘package:purrr’:  transpose


> ls("package:data.table")
 [1] ":="                 "address"            "alloc.col"          "as.chron.IDate"     "as.chron.ITime"    
 [6] ""      "as.Date.IDate"      "as.IDate"           "as.ITime"           "" 
[11] "between"            "%between%"          "chgroup"            "%chin%"             "chmatch"           
[16] "chorder"            "CJ"                 "copy"               "data.table"         "dcast"             
[21] ""   "fintersect"         "first"              "foverlaps"          "frank"             
[26] "frankv"             "fread"              "fsetdiff"           "fsetequal"          "fsort"             
[31] "funion"             "fwrite"             "getDTthreads"       "getNumericRounding" "haskey"            
[36] "hour"               "IDateTime"          "indices"            "inrange"            "%inrange%"         
[41] ""      "isoweek"            "key"                "key<-"              "key2"              
[46] "last"               "like"               "%like%"             "mday"               "melt"              
[51] ""    "minute"             "month"              "quarter"            "rbindlist"         
[56] "rleid"              "rleidv"             "rowid"              "rowidv"             "second"            
[61] "set"                "set2key"            "set2keyv"           "setattr"            "setcolorder"       
[66] "setDF"              "setDT"              "setDTthreads"       "setindex"           "setindexv"         
[71] "setkey"             "setkeyv"            "setnames"           "setNumericRounding" "setorder"          
[76] "setorderv"          "shift"              "shouldPrint"        "SJ"                 "tables"            
[81] ""    "timetaken"          "transpose"          "truelength"         "tstrsplit"         
[86] "uniqueN"            "wday"               "week"               "yday"               "year" 


DF보다는 일관성있는 문법을 가진다. 

DF[DF$code!="abc", "valA"] %>% sum()
DT[code!="abc", sum(valA)]             #same as above

aggregate(cbind(valA, valB)~id, DF[DF$code!="abc", ], sum)
DT[code!="abc", .(sum(valA), sum(valB)), by=id]               #same as above

DF[DF$code=="abc", "valA"] <- NA
DT[code="abc", valA:=NA]           #same as above
x[i, j, 
  by, keyby, 
  with = TRUE,
  on = NULL
  mult = "all",                                               # (first, last) row of each group
  nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_
  roll = FALSE,
  rollends = if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE),
  which = FALSE,
  verbose = getOption("datatable.verbose"),                   # default: FALSE
  allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE
  drop = NULL] 
byby=x      by="x"       by="x,y"by=.(x,y)      by=c("x", "y")group by
on               on="x" on=.(x,y)    on=c("x", "y")subset, join

기본 arguments

기본 argumentsclause in SQL
isubsetting rowsWHERE 
jmanipulate columnsSELECT 
bygrouped according toGROUP BY

다른 arguments

- with, which
- allow.cartesian
- roll, rollends
- .SD, .SDcols
- on, mult, nomatch

key for Index

더 빠른 속도를 위해 data.table상의 index를 잡아줄때 사용한다. (binary search algorithm)
Key가 설정되면, Key를 기준으로 reorder된다. 
Join시 필요하다. 
    ex)  binary search algorithm  5, 10, 7, 20, 3, 13, 26 에서 20을 찾을때, 
      1. 우선 sorting을 한다. 3, 5, 7, 10, 13, 20, 26
      2. 중간값과 비교한다. 20=10?  No 20>10
      3. 10이상의 값에서 다시 중간값을 찾아 비교한다...(반복)

setkey(DT, A)       #단일키 
setkey(DT, A, B)    #복합키

key(DT)             #현재 key 확인

Key를 설정하면 select (rows 선택)에 새로운 기능을 사용할 수 있는데, 
row number 또는 T/F를이용하는 방법이외에, Key컬럼의 값으로 select를 할수 있다. 

DT["a",]                # 단일키 
DT[J("a",1), ]          # 복합키 J()사용 
DT[J("a",c(1,2)), ]

Key를 설정하면,  기본 Aggregation과 d*ply 함수보다는 data.table의 aggregation함수를 활용하면 더 빠른 속도를 구현할수 있다. 

aggregate(C ~ A, DT, mean)     # basic aggregate
DT[, mean(C),      by=A]       # 컬럼명 V1
DT[, .(C=mean(C)), by=A]       # 컬럼명 C   using list

DT[, mean(C), by=.(A,B)]             # groupby추가 using list 
DT[, .(C=mean(C), D=sum(D)), by=A]  # 컬럼   추가 using list

======================================================================      by Deepanshu  
dplyr에 비해 큰 dataset을 다룰때 속도가 빠르다
Advanced tips and tricks with data.table by Andrew Brooks
DT = data.table(x1 = rep(letters[1:2], 6),
                x2 = rep(letters[3:5], 4),
                x3 = rep(letters[5:8], 3),
                y = rnorm(12))
DT = DT[sample(.N)]
DF =


> example(data.table)
## Not run: 
example(data.table)  # to run these examples yourself

## End(Not run)
DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
identical(dim(DT), dim(DF))    # TRUE
identical(DF$a, DT$a)          # TRUE
is.list(DF)                    # TRUE
is.list(DT)                    # TRUE              # TRUE


# basic row subset operations  결과는 모두 data.table
DT[2]                          # 2nd row
DT[3:2]                        # 3rd and 2nd row
DT[order(x)]                   # no need for order(DT$x)
DT[order(x), ]                 # same as above. The ',' is optional
DT[y>2]                        # all rows where DT$y > 2
DT[y>2 & v>5]                  # compound logical expressions
DT[!2:4]                       # all rows other than 2:4
DT[-(2:4)]                     # same

# select|compute columns data.table way
DT[, v]                        # v column (as vector)       벡터
DT[, list(v)]                  # v column (as data.table)
DT[, .(v)]                     # same as above, .() is a shorthand alias to list()
DT[, sum(v)]                   # sum of column v, returned as vector    벡터
DT[, .(sum(v))]                # same, but return data.table (column autonamed V1)
DT[, .(sv=sum(v))]             # same, but column named "sv"
DT[, .(v, v*2)]                # return two column data.table, v and v*2

# subset rows and select|compute data.table way
DT[2:3, sum(v)]                # sum(v) over rows 2 and 3, return vector   벡터
DT[2:3, .(sum(v))]             # same, but return data.table with column V1
DT[2:3, .(sv=sum(v))]          # same, but return data.table with column sv
DT[2:5, cat(v, "\
")]          # just for j's side effect

# select columns the data.frame way
DT[, 2]                        # 2nd column, returns a data.table always
colNum = 2                     # to refer vars in `j` from the outside of data use `..` prefix
DT[, ..colNum]                 # same, equivalent to DT[, .SD, .SDcols=colNum]
DT[["v"]]                      # same as DT[, v] but much faster

# grouping operations - j and by
DT[, sum(v), by=x]             # ad hoc by, order of groups preserved in result
DT[, sum(v), keyby=x]          # same, but order the result on by cols
DT[, sum(v), by=x][order(x)]   # same but by chaining expressions together

# fast ad hoc row subsets (subsets as joins)
DT["a", on="x"]                # same as x == "a" but uses binary search (fast)
DT["a", on=.(x)]               # same, for convenience, no need to quote every column
DT[.("a"), on="x"]             # same
DT[x=="a"]                     # same, single "==" internally optimised to use binary search (fast)
DT[x!="b" | y!=3]              # not yet optimized, currently vector scan subset
DT[.("b", 3), on=c("x", "y")]  # join on columns x,y of DT; uses binary search (fast)
DT[.("b", 3), on=.(x, y)]      # same, but using on=.()
DT[.("b", 1:2), on=c("x", "y")]             # no match returns NA
DT[.("b", 1:2), on=.(x, y), nomatch=NULL]   # no match row is not returned
DT[.("b", 1:2), on=c("x", "y"), roll=Inf]   # locf, nomatch row gets rolled by previous row
DT[.("b", 1:2), on=.(x, y), roll=-Inf]      # nocb, nomatch row gets rolled by next row
DT["b", sum(v*y), on="x"]                   # on rows where DT$x=="b", calculate sum(v*y)

# all together now
DT[x!="a", sum(v), by=x]                    # get sum(v) by "x" for each i != "a"
DT[!"a", sum(v), by=.EACHI, on="x"]         # same, but using subsets-as-joins
DT[c("b","c"), sum(v), by=.EACHI, on="x"]   # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)]  # same, using on=.()

# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))

DT[X, on="x"]                         # right join
X[DT, on="x"]                         # left join
DT[X, on="x", nomatch=NULL]           # inner join
DT[!X, on="x"]                        # not join
DT[X, on=c(y="v")]                    # join using column "y" of DT with column "v" of X
DT[X, on="y==v"]                      # same as above (v1.9.8+)

DT[X, on=.(y<=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on="y<=foo"]                    # same as above
DT[X, on=c("y<=foo")]                 # same as above
DT[X, on=.(y>=foo)]                   # NEW non-equi join (v1.9.8+)
DT[X, on=.(x, y<=foo)]                # NEW non-equi join (v1.9.8+)
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]  # Select x's join columns as well

DT[X, on="x", mult="first"]           # first row of each group
DT[X, on="x", mult="last"]            # last row of each group
DT[X, sum(v), by=.EACHI, on="x"]      # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"]  # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"]  # 'i,v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+)

# setting keys
kDT = copy(DT)                        # (deep) copy DT to kDT to work with it.
setkey(kDT,x)                         # set a 1-column key. No quotes, for convenience.
setkeyv(kDT,"x")                      # same (v in setkeyv stands for vector)
setkeyv(kDT,v)                        # same
# key(kDT)<-"x"                       # copies whole table, please use set* functions instead
haskey(kDT)                           # TRUE
key(kDT)                              # "x"

# fast *keyed* subsets
kDT["a"]                              # subset-as-join on *key* column 'x'
kDT["a", on="x"]                      # same, being explicit using 'on=' (preferred)

# all together
kDT[!"a", sum(v), by=.EACHI]          # get sum(v) for each i != "a"

# multi-column key
setkey(kDT,x,y)                       # 2-column key
setkeyv(kDT,c("x","y"))               # same

# fast *keyed* subsets on multi-column key
kDT["a"]                              # join to 1st column of key
kDT["a", on="x"]                      # on= is optional, but is preferred
kDT[.("a")]                           # same, .() is an alias for list()
kDT[list("a")]                        # same
kDT[.("a", 3)]                        # join to 2 columns
kDT[.("a", 3:6)]                      # join 4 rows (2 missing)
kDT[.("a", 3:6), nomatch=NULL]        # remove missing
kDT[.("a", 3:6), roll=TRUE]           # locf rolling join
kDT[.("a", 3:6), roll=Inf]            # same as above
kDT[.("a", 3:6), roll=-Inf]           # nocb rolling join
kDT[!.("a")]                          # not join
kDT[!"a"]                             # same

# more on special symbols, see also ?"special-symbols"
DT[.N]                                  # last row
DT[, .N]                                # total number of rows in DT
DT[, .N, by=x]                          # number of rows in each group
DT[, .SD, .SDcols=x:y]                  # select columns 'x' through 'y'
DT[ , .SD, .SDcols = !x:y]              # drop columns 'x' through 'y'
DT[ , .SD, .SDcols = patterns('^[xv]')] # select columns matching '^x' or '^v'
DT[, .SD[1]]                            # first row of all columns
DT[, .SD[1], by=x]                      # first row of 'y' and 'v' for each group in 'x'
DT[, c(.N, lapply(.SD, sum)), by=x]     # get rows *and* sum columns 'v' and 'y' by group
DT[, .I[1], by=x]                       # row number in DT corresponding to each group
DT[, grp := .GRP, by=x]                 # add a group counter column
X[, DT[.BY, y, on="x"], by=x]           # join within each group

# add/update/delete by reference (see ?assign)
print(DT[, z:=42L])                   # add new column by reference
print(DT[, z:=NULL])                  # remove column by reference
print(DT["a", v:=42L, on="x"])        # subassign to existing v column by reference
print(DT["b", v2:=84L, on="x"])       # subassign to new column by reference (NA padded)

DT[, m:=mean(v), by=x][]              # add new column by reference by group
                                      # NB: postfix [] is shortcut to print()
# advanced usage
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)

DT[, sum(v), by=.(y%%2)]              # expressions in by
DT[, sum(v), by=.(bool = y%%2)]       # same, using a named list to change by column name
DT[, .SD[2], by=x]                    # get 2nd row of each group
DT[, tail(.SD,2), by=x]               # last 2 rows of each group
DT[, lapply(.SD, sum), by=x]          # sum of all (other) columns for each group
DT[, .SD[which.min(v)], by=x]         # nested query by group

DT[, list(MySum=sum(v),
    by=.(x, y%%2)]                    # by 2 expressions

DT[, .(a = .(a), b = .(b)), by=x]     # list columns
DT[, .(seq = min(a):max(b)), by=x]    # j is not limited to just aggregations
DT[, sum(v), by=x][V1<20]             # compound query
DT[, sum(v), by=x][order(-V1)]        # ordering results
DT[, c(.N, lapply(.SD,sum)), by=x]    # get number of observations and sum per group
DT[, {tmp <- mean(y);
      .(a = a-tmp, b = b-tmp)
      }, by=x]                        # anonymous lambda in 'j', j accepts any valid
                                      # expression. TO REMEMBER: every element of
                                      # the list becomes a column in result.
DT[, plot(a,b), by=x]                 # can also plot in 'j'

# using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]

# Support guide and links:

## Not run: 
if (interactive()) {
  vignette(package="data.table")  # 9 vignettes               # 6,000 tests

  # keep up to date with latest stable version on CRAN

  # get the latest devel version
  # read more at:

## End(Not run)

Categories: Reshaping


Blog Owner

Notify of

Inline Feedbacks
View all comments
Would love your thoughts, please comment.x