data.table

Published by onesixx on

What is data.table

https://github.com/Rdatatable/data.table/wiki
http://datatable.r-forge.r-project.org/datatable-faq.pdf
https://stackoverflow.com/users/403310/matt-dowle (data.table was written by Matt Dowle)
https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf
Examples
https://www.listendata.com/2016/10/r-data-table.html
http://jeffmax.io/notes-on-datatable-in-r.html
http://skyontech.com/blog/data/data-table-R-cheatsheet.html
https://riptutorial.com/data-table

Like a database, data.table can index specific columns as keys, which enables faster access, group by, and join operations.
(Unlike data.frame in R versions before 4.0.0, it does not automatically convert character vectors to factors.)
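As a small illustration of the factor remark, the sketch below builds the same toy columns both ways. The column names are made up for the example, and `stringsAsFactors = TRUE` is passed explicitly to reproduce the old data.frame default on any R version.

```r
library(data.table)

# Hypothetical toy data; the names are illustrative only.
DT <- data.table(id = 1:3, name = c("a", "b", "c"))

# Request the pre-4.0.0 data.frame behaviour explicitly.
DF_old <- data.frame(id = 1:3, name = c("a", "b", "c"),
                     stringsAsFactors = TRUE)

class(DT$name)      # character: data.table leaves strings alone
class(DF_old$name)  # factor: converted by data.frame
```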

> library(data.table)
data.table 1.10.4.2
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
data.table 1.10.4
**********
This installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode.
If this a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not yet support OpenMP.
In the meantime please follow our Mac installation instructions on the data.table homepage. 
If it works and you observe benefits from multiple threads as others have reported, 
please convince Simon Ubanek by sending him evidence and ask him to turn on OpenMP support 
when CRAN builds package binaries for Mac. Alternatives are to install Ubuntu on your Mac 
(which I have done and works well) or use Windows where OpenMP is supported and works well.
**********
The following objects are masked from ‘package:dplyr’:  between, first, last
The following object is masked from ‘package:purrr’:  transpose
?data.table
example(data.table) 
browseVignettes("data.table")

Syntax

Its syntax is more consistent than that of data.frame.

R
x[i, j, 
  by, keyby, 
  with = TRUE,
  on = NULL,
  mult = "all",                                               # or "first" / "last": which matching row of each group to return
  nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_
  roll = FALSE,
  rollends = if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE),
  which = FALSE,
  .SDcols,
  verbose = getOption("datatable.verbose"),                   # default: FALSE
  allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE
  drop = NULL] 

Argument   Accepted forms                                              Purpose
by         by=x, by="x", by="x,y", by=.(x,y), by=c("x", "y")           group by
on         on="x", on=.(x,y), on=c("x", "y")                           subset, join
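The forms of by and on listed above can be tried on a toy table. This is only a sketch: the table and its column names (x, y, v) are invented for the example.

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(x = c("a", "a", "b"),
                 y = c(1L, 2L, 2L),
                 v = c(10, 20, 30))

# Grouping by one column: bare name or string.
r1 <- DT[, .(total = sum(v)), by = x]
r2 <- DT[, .(total = sum(v)), by = "x"]

# Grouping by two columns: .(), character vector, or one comma-separated string.
r3 <- DT[, .(total = sum(v)), by = .(x, y)]
r4 <- DT[, .(total = sum(v)), by = c("x", "y")]
r5 <- DT[, .(total = sum(v)), by = "x,y"]

# on= names the columns to match when subsetting or joining without a key.
s1 <- DT[.("a"), on = "x"]                  # rows where x == "a"
s2 <- DT[.("a", 2L), on = c("x", "y")]      # rows where x == "a" and y == 2
```

Within each family the spellings should give the same result, so the choice is mostly a matter of whether the column names are known when writing the code or arrive as strings.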

Basic arguments

Basic argument   Role                   Clause in SQL
i                subsetting rows        WHERE
j                manipulating columns   SELECT
by               grouping               GROUP BY
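To make the SQL analogy concrete, here is a minimal sketch that puts one query next to its data.table form. The table and its columns (dept, salary, year) are hypothetical.

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(dept   = c("A", "A", "B", "B"),
                 salary = c(100, 200, 300, 400),
                 year   = c(2020L, 2021L, 2020L, 2021L))

# SQL:  SELECT dept, AVG(salary) AS avg_salary
#       FROM DT
#       WHERE year >= 2021
#       GROUP BY dept
res <- DT[year >= 2021L,                    # i  : WHERE
          .(avg_salary = mean(salary)),     # j  : SELECT
          by = dept]                        # by : GROUP BY
res
```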

Other arguments

– with, which
– allow.cartesian
– roll, rollends
– .SD, .SDcols
– on, mult, nomatch
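A few of these arguments in action, as a rough sketch on invented tables (the names DT, prices and their columns are assumptions for the example). `nomatch = 0L` is used because it is accepted by the 1.10.x release shown in the startup message above.

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(g = c("a", "a", "b"), v1 = 1:3, v2 = c(10, 20, 30))

# with = FALSE: treat j as a character vector of column names.
cols <- c("g", "v1")
sel <- DT[, cols, with = FALSE]

# which = TRUE: return matching row numbers instead of the rows.
idx <- DT[v1 > 1L, which = TRUE]

# .SD / .SDcols: apply a function to chosen columns within each group.
sums <- DT[, lapply(.SD, sum), by = g, .SDcols = c("v1", "v2")]

# on + mult: keep only the first matching row per lookup value.
first_a <- DT[.("a"), on = "g", mult = "first"]

# on + nomatch: drop lookup values that have no match ("z" here).
matched <- DT[.(c("a", "z")), on = "g", nomatch = 0L]

# roll = TRUE: carry the last observation forward to a missing lookup value.
prices <- data.table(t = c(1L, 5L, 10L), price = c(100, 105, 110))
rolled <- prices[.(7L), on = "t", roll = TRUE]   # uses the row with t == 5
```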

Key as an index

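A key acts as an index on a data.table and is set for faster lookups (via a binary search algorithm).
Once a key is set, the table is reordered by the key.
Joins written without the on argument require it.
    ex)  Binary search: finding 20 in 5, 10, 7, 20, 3, 13, 26
      1. Sort first: 3, 5, 7, 10, 13, 20, 26
      2. Compare with the middle value: 20 = 10? No, 20 > 10
      3. Among the values greater than 10, find the middle value again and compare... (repeat)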
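The steps above can be sketched in plain R. This only illustrates the idea; it is not data.table's internal implementation, and `binary_search` is a hypothetical helper written for this note.

```r
# Return the position of `target` in an ascending vector, or NA if absent.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L            # middle position of the current range
    if (sorted[mid] == target) return(mid)
    if (sorted[mid] < target) {
      lo <- mid + 1L                   # continue in the upper half
    } else {
      hi <- mid - 1L                   # continue in the lower half
    }
  }
  NA_integer_
}

x <- sort(c(5, 10, 7, 20, 3, 13, 26))  # step 1: 3 5 7 10 13 20 26
binary_search(x, 20)                   # compares with 10, then 20: position 6
```

Each comparison halves the remaining range, which is why a lookup on a sorted key column needs far fewer comparisons than scanning every row.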


Setting a key also enables a new way of selecting rows:
besides row numbers or TRUE/FALSE vectors, rows can be selected by the values of the key column.
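A minimal sketch of the three ways of selecting rows, on an invented table (the names id and v are assumptions for the example):

```r
library(data.table)

# Hypothetical toy table, deliberately unsorted.
DT <- data.table(id = c("c", "a", "b", "a"), v = 1:4)

setkey(DT, id)   # sorts DT by id in place and marks id as the key
key(DT)          # "id"
DT$id            # "a" "a" "b" "c": rows were reordered by the key

DT[2]                  # select by row number
DT[v > 2L]             # select by logical condition
DT["a"]                # select by key value (both "a" rows)
DT[.("b")]             # same idea, using the .() form
DT[.(c("a", "c"))]     # several key values at once
```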


With a key set, using data.table's own aggregation instead of base aggregation or the d*ply functions can give faster results.
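As a rough sketch of the two styles, the snippet below aggregates the same invented table with data.table syntax and with base R's aggregate(). It makes no timing claim; it only shows that both give the same totals.

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(grp = c("x", "y", "x", "y"), val = c(1, 2, 3, 4))
setkey(DT, grp)

# data.table aggregation: sum and row count per group.
agg <- DT[, .(total = sum(val), n = .N), by = grp]

# Base R equivalent for the sum, shown for comparison.
base_agg <- aggregate(val ~ grp, data = as.data.frame(DT), FUN = sum)

agg
base_agg
```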


======================================================================

http://www.listendata.com/2016/10/r-data-table.html      by Deepanshu  
Compared with dplyr, it is faster when handling large datasets.
Advanced tips and tricks with data.table by Andrew Brooks
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/
https://github.com/Rdatatable/data.table/wiki
https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial
DATA for example
