data.table intro
What is data.table
data.table, written by Matt Dowle
https://github.com/Rdatatable/data.table/wiki
http://datatable.r-forge.r-project.org/datatable-faq.pdf
https://stackoverflow.com/users/403310/matt-dowle
https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf
example
https://www.listendata.com/2016/10/r-data-table.html
http://jeffmax.io/notes-on-datatable-in-r.html
http://skyontech.com/blog/data/data-table-R-cheatsheet.html
https://riptutorial.com/data-table
install
https://github.com/Rdatatable/data.table/wiki/Installation
Like a database, data.table can index specific columns as keys, which makes row access, group by, and join faster.
(Unlike data.frame in R versions before 4.0.0, it does not automatically convert character vectors to factors.)
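The factor point can be checked with a small sketch. The column names `id` and `grp` are made up for illustration, and `stringsAsFactors = TRUE` is passed explicitly so that the data.frame shows the conversion on any R version (it was the default before R 4.0.0).

```r
library(data.table)

# Hypothetical toy columns, not from the original notes
dt <- data.table(id = 1:3, grp = c("a", "b", "a"))
df <- data.frame(id = 1:3, grp = c("a", "b", "a"), stringsAsFactors = TRUE)

class(dt$grp)  # data.table keeps the strings as character
class(df$grp)  # data.frame converts them to factor when stringsAsFactors = TRUE
```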
> library(data.table)
data.table 1.10.4.2
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
data.table 1.10.4
**********
This installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode.
If this a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not yet support OpenMP.
In the meantime please follow our Mac installation instructions on the data.table homepage.
If it works and you observe benefits from multiple threads as others have reported,
please convince Simon Urbanek by sending him evidence and ask him to turn on OpenMP support
when CRAN builds package binaries for Mac. Alternatives are to install Ubuntu on your Mac
(which I have done and works well) or use Windows where OpenMP is supported and works well.
**********
The following objects are masked from ‘package:dplyr’: between, first, last
The following object is masked from ‘package:purrr’: transpose
?data.table
example(data.table)
browseVignettes("data.table")
Objects
Syntax
The syntax is more consistent than that of data.frame.
x[i, j,
by, keyby,
with = TRUE,
on = NULL,
mult = "all", # (first, last) row of each group
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE),
which = FALSE,
.SDcols,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL]
by | by=x by="x" by="x,y" | by=.(x,y) by=c("x", "y") | group by |
on | on="x" | on=.(x,y) on=c("x", "y") | subset, join |
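The spellings listed for `by` and `on` can be compared in a short sketch. The tables `DT` and `lookup` and their columns are hypothetical, chosen only to show that the forms give the same result.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(x = c("a", "a", "b", "b"),
                 y = c(1L, 2L, 1L, 1L),
                 v = c(10, 20, 30, 40))

# Three spellings of the same two-column grouping
r1 <- DT[, .(total = sum(v)), by = .(x, y)]
r2 <- DT[, .(total = sum(v)), by = c("x", "y")]
r3 <- DT[, .(total = sum(v)), by = "x,y"]

all.equal(r1, r2)
all.equal(r2, r3)

# on= accepts the same styles when joining to a lookup table
lookup <- data.table(x = c("a", "b"), label = c("first", "second"))
j1 <- DT[lookup, on = "x"]
j2 <- DT[lookup, on = .(x)]
```

The character forms are convenient when the column names are held in a variable; the `.()` form reads closer to ordinary R expressions.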
Basic arguments
argument | role | clause in SQL |
i | subsetting rows | WHERE |
j | manipulate columns | SELECT |
by | grouped according to | GROUP BY |
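The SQL analogy in the table can be sketched as one call. The table `DT` and its columns `grp` and `val` are invented for the example.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(grp = c("a", "a", "b", "b", "b"),
                 val = c(1, 2, 3, 4, 5))

# SQL analogue: SELECT grp, SUM(val) AS total FROM DT WHERE val > 1 GROUP BY grp
res <- DT[val > 1,               # i:  which rows to keep (WHERE)
          .(total = sum(val)),   # j:  what to compute (SELECT)
          by = grp]              # by: how to group (GROUP BY)
print(res)
```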
Other arguments
– with, which
– allow.cartesian
– roll, rollends
– .SD, .SDcols
– on, mult, nomatch
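A few of these arguments are easier to follow in use. The sketch below uses a hypothetical three-row table; `nomatch = 0L` is written in the older style that matches the data.table 1.10.x release shown above (newer releases also accept `nomatch = NULL`).

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(id = c("a", "a", "b"), x = 1:3, y = c(10, 20, 30))

# .SD is the per-group subset of DT; .SDcols limits which columns it holds
means <- DT[, lapply(.SD, mean), by = id, .SDcols = c("x", "y")]

# with = FALSE treats j as a character vector of column names
cols <- c("id", "y")
picked <- DT[, cols, with = FALSE]

# which = TRUE returns the matching row numbers instead of the rows
rows <- DT[x > 1, which = TRUE]

# on + mult: keep only the first row that matches id == "a"
first_a <- DT[.("a"), on = "id", mult = "first"]

# nomatch = 0L drops lookups that match nothing instead of returning NA rows
none <- DT[.("z"), on = "id", nomatch = 0L]
```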
key for Index
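A key is set on a data.table to act as an index for faster lookups (it enables a binary search algorithm).
Once a key is set, the rows are reordered by the key.
A key is needed for keyed joins (joins can also be specified ad hoc with `on`).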
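A minimal sketch of setting a key and using it in a join. The tables `DT` and `other` are hypothetical.

```r
library(data.table)

# Hypothetical toy table, deliberately out of order
DT <- data.table(id = c("c", "a", "b"), v = c(3, 1, 2))

setkey(DT, id)   # sorts DT by id in place and marks id as the key

key(DT)          # name of the key column
DT$id            # rows are now ordered by the key

# Keyed join: look up the ids of `other` in DT
other <- data.table(id = c("a", "c"), w = c(100, 300))
setkey(other, id)
joined <- DT[other]
```

`setkey()` modifies `DT` by reference, so no assignment is needed.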
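ex) binary search algorithm: finding 20 in 5, 10, 7, 20, 3, 13, 26
1. First, sort the values: 3, 5, 7, 10, 13, 20, 26
2. Compare the target with the middle value. 20 = 10? No, 20 > 10
3. Take the values greater than 10 (13, 20, 26), find their middle value, and compare again... (repeat until found)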
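The steps above can be sketched in plain R. This is only an illustration of the idea, not data.table's internal implementation, and the function name `binary_search` is made up for the example.

```r
# Returns the position of `target` in the sorted vector, or NA if it is absent
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L                 # index of the middle element
    if (sorted[mid] == target) return(mid)
    if (sorted[mid] < target) {
      lo <- mid + 1L                        # continue in the upper half
    } else {
      hi <- mid - 1L                        # continue in the lower half
    }
  }
  NA_integer_
}

values <- sort(c(5, 10, 7, 20, 3, 13, 26))  # 3 5 7 10 13 20 26
pos <- binary_search(values, 20)            # compares with 10, then with 20
```

Each comparison halves the remaining range, which is why a sorted key column can be searched much faster than an unsorted one.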
Setting a key also enables a new way to select rows:
in addition to row numbers or TRUE/FALSE vectors, rows can be selected by the values of the key column.
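The three ways of selecting rows look like this on a small keyed table. The table and its columns are hypothetical.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(id = c("b", "a", "c", "a"), v = 1:4)
setkey(DT, id)

by_number  <- DT[2]               # by row number (after reordering by the key)
by_logical <- DT[v > 2]           # by a TRUE/FALSE vector
by_key     <- DT[.("a")]          # by key value, using binary search on id
by_keys    <- DT[.(c("a", "c"))]  # several key values at once
```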
With a key set, data.table's own aggregation is usually faster than base aggregation or the d*ply functions.
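For comparison, here is the same grouped sum written with data.table and with base R. The table is hypothetical and far too small to show any speed difference; the sketch only shows that both forms give the same totals.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(grp = c("b", "a", "b", "a"), val = c(1, 2, 3, 4))
setkey(DT, grp)

# data.table aggregation: one row per group
dt_sum <- DT[, .(total = sum(val)), by = grp]

# Base R counterpart, shown only to compare the results
base_sum <- aggregate(val ~ grp, data = as.data.frame(DT), FUN = sum)
```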
======================================================================