data.table intro
Loading required package: data.table data.table 1.13.0 using 1 threads (see ?getDTthreads). Latest news: r-datatable.com ********** This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode. This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it is time to gingerly point out that there have been no similar problems on Windows or Linux. **********
~$➜ xcode-select --install xcode-select: error: command line tools are already installed, use "Software Update" to install updates ~$➜ brew update && brew install llvm ...... To use the bundled libc++ please add the following LDFLAGS: LDFLAGS="-L/usr/local/opt/llvm/lib -Wl,-rpath,/usr/local/opt/llvm/lib" llvm is keg-only, which means it was not symlinked into /usr/local, because macOS already provides this software and installing another version in parallel can cause all kinds of trouble. If you need to have llvm first in your PATH run: echo 'export PATH="/usr/local/opt/llvm/bin:$PATH"' >> ~/.zshrc For compilers to find llvm you may need to set: export LDFLAGS="-L/usr/local/opt/llvm/lib" export CPPFLAGS="-I/usr/local/opt/llvm/include" ==> Summary ? /usr/local/Cellar/llvm/10.0.1: 7,029 files, 1GB
# if you downloaded llvm manually above, replace with your chosen NEW_PATH/clang LLVM_LOC = /usr/local/opt/llvm CC=$(LLVM_LOC)/bin/clang -fopenmp CXX=$(LLVM_LOC)/bin/clang++ -fopenmp # -O3 should be faster than -O2 (default) level optimisation .. CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe LDFLAGS=-L/usr/local/opt/gettext/lib -L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib CPPFLAGS=-I/usr/local/opt/gettext/include -I$(LLVM_LOC)/include
Data.table이란
data.table, written by Matt Dowle
https://github.com/Rdatatable/data.table/wiki
http://datatable.r-forge.r-project.org/datatable-faq.pdf
https://stackoverflow.com/users/403310/matt-dowle
https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSat
RdaysBudapest2016.pdf
example
https://www.listendata.com/2016/10/r-data-table.html
http://jeffmax.io/notes-on-datatable-in-r.html
http://skyontech.com/blog/data/data-table-R-cheatsheet.html
https://riptutorial.com/data-table
install
https://github.com/Rdatatable/data.table/wiki/Installation
data.table는 DB와 같이, 특정 column을 key값으로 indexing하여 더 빠른 access, group by, join이 가능하다.
(data.frame과 달리, character 벡터를 factor로 자동으로 변환하지 않는다)
> library(data.table)
data.table 1.10.4.2
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
data.table 1.10.4
**********
This installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode.
If this a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not yet support OpenMP.
In the meantime please follow our Mac installation instructions on the data.table homepage.
If it works and you observe benefits from multiple threads as others have reported,
please convince Simon Ubanek by sending him evidence and ask him to turn on OpenMP support
when CRAN builds package binaries for Mac. Alternatives are to install Ubuntu on your Mac
(which I have done and works well) or use Windows where OpenMP is supported and works well.
**********
The following objects are masked from ‘package:dplyr’: between, first, last
The following object is masked from ‘package:purrr’: transpose
?data.table example(data.table) browseVignettes("data.table")
Objects
> ls("package:data.table") [1] ":=" "address" "alloc.col" "as.chron.IDate" "as.chron.ITime" [6] "as.data.table" "as.Date.IDate" "as.IDate" "as.ITime" "as.xts.data.table" [11] "between" "%between%" "chgroup" "%chin%" "chmatch" [16] "chorder" "CJ" "copy" "data.table" "dcast" [21] "dcast.data.table" "fintersect" "first" "foverlaps" "frank" [26] "frankv" "fread" "fsetdiff" "fsetequal" "fsort" [31] "funion" "fwrite" "getDTthreads" "getNumericRounding" "haskey" [36] "hour" "IDateTime" "indices" "inrange" "%inrange%" [41] "is.data.table" "isoweek" "key" "key<-" "key2" [46] "last" "like" "%like%" "mday" "melt" [51] "melt.data.table" "minute" "month" "quarter" "rbindlist" [56] "rleid" "rleidv" "rowid" "rowidv" "second" [61] "set" "set2key" "set2keyv" "setattr" "setcolorder" [66] "setDF" "setDT" "setDTthreads" "setindex" "setindexv" [71] "setkey" "setkeyv" "setnames" "setNumericRounding" "setorder" [76] "setorderv" "shift" "shouldPrint" "SJ" "tables" [81] "test.data.table" "timetaken" "transpose" "truelength" "tstrsplit" [86] "uniqueN" "wday" "week" "yday" "year"
Syntax
DF보다는 일관성있는 문법을 가진다.
DF[DF$code!="abc", "valA"] %>% sum() DT[code!="abc", sum(valA)] #same as above aggregate(cbind(valA, valB)~id, DF[DF$code!="abc", ], sum) DT[code!="abc", .(sum(valA), sum(valB)), by=id] #same as above DF[DF$code=="abc", "valA"] <- NA DT[code="abc", valA:=NA] #same as above
x[i, j,
by, keyby,
with = TRUE,
on = NULL
mult = "all", # (first, last) row of each group
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE),
which = FALSE,
.SDcols,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL]
by | by=x by="x" by="x,y" | by=.(x,y) by=c("x", "y") | group by |
on | on="x" | on=.(x,y) on=c("x", "y") | subset, join |
기본 arguments
기본 arguments | clause in SQL | |
i | subsetting rows | WHERE |
j | manipulate columns | SELECT |
by | grouped according to | GROUP BY |
다른 arguments
- with, which
- allow.cartesian
- roll, rollends
- .SD, .SDcols
- on, mult, nomatch
key for Index
더 빠른 속도를 위해 data.table상의 index를 잡아줄때 사용한다. (binary search algorithm)
Key가 설정되면, Key를 기준으로 reorder된다.
Join시 필요하다.
ex) binary search algorithm 5, 10, 7, 20, 3, 13, 26 에서 20을 찾을때,
1. 우선 sorting을 한다. 3, 5, 7, 10, 13, 20, 26
2. 중간값과 비교한다. 20=10? No 20>10
3. 10이상의 값에서 다시 중간값을 찾아 비교한다...(반복)
setkey(DT, A) #단일키 setkey(DT, A, B) #복합키 key(DT) #현재 key 확인
Key를 설정하면 select (rows 선택)에 새로운 기능을 사용할 수 있는데,
row number 또는 T/F를이용하는 방법이외에, Key컬럼의 값으로 select를 할수 있다.
DT["a",] # 단일키 DT[c("a","c"),] DT[J("a",1), ] # 복합키 J()사용 DT[J("a",c(1,2)), ]
Key를 설정하면, 기본 Aggregation과 d*ply 함수보다는 data.table의 aggregation함수를 활용하면 더 빠른 속도를 구현할수 있다.
aggregate(C ~ A, DT, mean) # basic aggregate DT[, mean(C), by=A] # 컬럼명 V1 DT[, .(C=mean(C)), by=A] # 컬럼명 C using list DT[, mean(C), by=.(A,B)] # groupby추가 using list DT[, .(C=mean(C), D=sum(D)), by=A] # 컬럼 추가 using list
======================================================================
http://www.listendata.com/2016/10/r-data-table.html by Deepanshu
dplyr에 비해 큰 dataset을 다룰때 속도가 빠르다.
Advanced tips and tricks with data.table
by Andrew Brooks
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/
https://github.com/Rdatatable/data.table/wiki
https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial
set.seed(666) DT = data.table(x1 = rep(letters[1:2], 6), x2 = rep(letters[3:5], 4), x3 = rep(letters[5:8], 3), y = rnorm(12)) DT = DT[sample(.N)] DF = as.data.frame(DT)
Example
> example(data.table)
## Not run: example(data.table) # to run these examples yourself ## End(Not run) DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) DF DT identical(dim(DT), dim(DF)) # TRUE identical(DF$a, DT$a) # TRUE is.list(DF) # TRUE is.list(DT) # TRUE is.data.frame(DT) # TRUE tables() # basic row subset operations 결과는 모두 data.table DT[2] # 2nd row DT[3:2] # 3rd and 2nd row DT[order(x)] # no need for order(DT$x) DT[order(x), ] # same as above. The ',' is optional DT[y>2] # all rows where DT$y > 2 DT[y>2 & v>5] # compound logical expressions DT[!2:4] # all rows other than 2:4 DT[-(2:4)] # same # select|compute columns data.table way DT[, v] # v column (as vector) 벡터 DT[, list(v)] # v column (as data.table) DT[, .(v)] # same as above, .() is a shorthand alias to list() DT[, sum(v)] # sum of column v, returned as vector 벡터 DT[, .(sum(v))] # same, but return data.table (column autonamed V1) DT[, .(sv=sum(v))] # same, but column named "sv" DT[, .(v, v*2)] # return two column data.table, v and v*2 # subset rows and select|compute data.table way DT[2:3, sum(v)] # sum(v) over rows 2 and 3, return vector 벡터 DT[2:3, .(sum(v))] # same, but return data.table with column V1 DT[2:3, .(sv=sum(v))] # same, but return data.table with column sv DT[2:5, cat(v, "\ ")] # just for j's side effect # select columns the data.frame way DT[, 2] # 2nd column, returns a data.table always colNum = 2 # to refer vars in `j` from the outside of data use `..` prefix DT[, ..colNum] # same, equivalent to DT[, .SD, .SDcols=colNum] DT[["v"]] # same as DT[, v] but much faster # grouping operations - j and by DT[, sum(v), by=x] # ad hoc by, order of groups preserved in result DT[, sum(v), keyby=x] # same, but order the result on by cols DT[, sum(v), by=x][order(x)] # same but by chaining expressions together # fast ad hoc row subsets (subsets as joins) DT["a", on="x"] # same as x == "a" but uses binary search (fast) DT["a", on=.(x)] # same, for convenience, no need to quote every column DT[.("a"), on="x"] # same DT[x=="a"] # same, single "==" internally optimised to use binary search (fast) DT[x!="b" | y!=3] # not yet optimized, currently vector scan subset DT[.("b", 3), on=c("x", "y")] # join on columns x,y of DT; uses binary search (fast) DT[.("b", 3), on=.(x, y)] # same, but using on=.() DT[.("b", 1:2), on=c("x", "y")] # no match returns NA DT[.("b", 1:2), on=.(x, y), nomatch=NULL] # no match row is not returned DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, nomatch row gets rolled by previous row DT[.("b", 1:2), on=.(x, y), roll=-Inf] # nocb, nomatch row gets rolled by next row DT["b", sum(v*y), on="x"] # on rows where DT$x=="b", calculate sum(v*y) # all together now DT[x!="a", sum(v), by=x] # get sum(v) by "x" for each i != "a" DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.() # joins as subsets X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)) X DT[X, on="x"] # right join X[DT, on="x"] # left join DT[X, on="x", nomatch=NULL] # inner join DT[!X, on="x"] # not join DT[X, on=c(y="v")] # join using column "y" of DT with column "v" of X DT[X, on="y==v"] # same as above (v1.9.8+) DT[X, on=.(y<=foo)] # NEW non-equi join (v1.9.8+) DT[X, on="y<=foo"] # same as above DT[X, on=c("y<=foo")] # same as above DT[X, on=.(y>=foo)] # NEW non-equi join (v1.9.8+) DT[X, on=.(x, y<=foo)] # NEW non-equi join (v1.9.8+) DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # Select x's join columns as well DT[X, on="x", mult="first"] # first row of each group DT[X, on="x", mult="last"] # last row of each group DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i,v' refers to X's v column DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+) # setting keys kDT = copy(DT) # (deep) copy DT to kDT to work with it. setkey(kDT,x) # set a 1-column key. No quotes, for convenience. setkeyv(kDT,"x") # same (v in setkeyv stands for vector) v="x" setkeyv(kDT,v) # same # key(kDT)<-"x" # copies whole table, please use set* functions instead haskey(kDT) # TRUE key(kDT) # "x" # fast *keyed* subsets kDT["a"] # subset-as-join on *key* column 'x' kDT["a", on="x"] # same, being explicit using 'on=' (preferred) # all together kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a" # multi-column key setkey(kDT,x,y) # 2-column key setkeyv(kDT,c("x","y")) # same # fast *keyed* subsets on multi-column key kDT["a"] # join to 1st column of key kDT["a", on="x"] # on= is optional, but is preferred kDT[.("a")] # same, .() is an alias for list() kDT[list("a")] # same kDT[.("a", 3)] # join to 2 columns kDT[.("a", 3:6)] # join 4 rows (2 missing) kDT[.("a", 3:6), nomatch=NULL] # remove missing kDT[.("a", 3:6), roll=TRUE] # locf rolling join kDT[.("a", 3:6), roll=Inf] # same as above kDT[.("a", 3:6), roll=-Inf] # nocb rolling join kDT[!.("a")] # not join kDT[!"a"] # same # more on special symbols, see also ?"special-symbols" DT[.N] # last row DT[, .N] # total number of rows in DT DT[, .N, by=x] # number of rows in each group DT[, .SD, .SDcols=x:y] # select columns 'x' through 'y' DT[ , .SD, .SDcols = !x:y] # drop columns 'x' through 'y' DT[ , .SD, .SDcols = patterns('^[xv]')] # select columns matching '^x' or '^v' DT[, .SD[1]] # first row of all columns DT[, .SD[1], by=x] # first row of 'y' and 'v' for each group in 'x' DT[, c(.N, lapply(.SD, sum)), by=x] # get rows *and* sum columns 'v' and 'y' by group DT[, .I[1], by=x] # row number in DT corresponding to each group DT[, grp := .GRP, by=x] # add a group counter column X[, DT[.BY, y, on="x"], by=x] # join within each group # add/update/delete by reference (see ?assign) print(DT[, z:=42L]) # add new column by reference print(DT[, z:=NULL]) # remove column by reference print(DT["a", v:=42L, on="x"]) # subassign to existing v column by reference print(DT["b", v2:=84L, on="x"]) # subassign to new column by reference (NA padded) DT[, m:=mean(v), by=x][] # add new column by reference by group # NB: postfix [] is shortcut to print() # advanced usage DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1) DT[, sum(v), by=.(y%%2)] # expressions in by DT[, sum(v), by=.(bool = y%%2)] # same, using a named list to change by column name DT[, .SD[2], by=x] # get 2nd row of each group DT[, tail(.SD,2), by=x] # last 2 rows of each group DT[, lapply(.SD, sum), by=x] # sum of all (other) columns for each group DT[, .SD[which.min(v)], by=x] # nested query by group DT[, list(MySum=sum(v), MyMin=min(v), MyMax=max(v)), by=.(x, y%%2)] # by 2 expressions DT[, .(a = .(a), b = .(b)), by=x] # list columns DT[, .(seq = min(a):max(b)), by=x] # j is not limited to just aggregations DT[, sum(v), by=x][V1<20] # compound query DT[, sum(v), by=x][order(-V1)] # ordering results DT[, c(.N, lapply(.SD,sum)), by=x] # get number of observations and sum per group DT[, {tmp <- mean(y); .(a = a-tmp, b = b-tmp) }, by=x] # anonymous lambda in 'j', j accepts any valid # expression. TO REMEMBER: every element of # the list becomes a column in result. pdf("new.pdf") DT[, plot(a,b), by=x] # can also plot in 'j' dev.off() # using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v' DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b] # Support guide and links: # https://github.com/Rdatatable/data.table/wiki/Support ## Not run: if (interactive()) { vignette(package="data.table") # 9 vignettes test.data.table() # 6,000 tests # keep up to date with latest stable version on CRAN update.packages() # get the latest devel version update.dev.pkg() # read more at: # https://github.com/Rdatatable/data.table/wiki/Installation } ## End(Not run)