package :: stringr

Published by onesixx on 17-09-24

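Overview

stringr provides string manipulation functions for R, built on top of stringi.

Features

1. Factors and character vectors are handled the same way.

2. Consistent function and argument names
– Every stringr function name starts with str_.
– The first argument is always the string vector, so the functions are easy to use with the pipe (%>%).
– Output is convenient to feed into other functions: a zero-length input returns a zero-length result.
– An NA in the input yields NA at the same position in the result.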

letters %>% .[1:10] %>% str_pad(3, "right") %>%  str_c(letters[2:11]) 
# [1] "a  b" "b  c" "c  d" "d  e" "e  f" "f  g" "g  h" "h  i" "i  j" "j  k"

3. Rarely used string operations are deliberately left out to keep the API simple.

paste("Hello", c("Jared", "Bob", "David"), c("Goodbye", "Seeya"))
#[1] "Hello Jared Goodbye" "Hello Bob Seeya"     "Hello David Goodbye"

waitTime <- 25
sprintf("Hello %s, your party of %s will be seated in %s minutes",
        c("Jared", "Bob"), c("eight", 16, "four", 10), waitTime)
#[1] "Hello Jared, your party of eight will be seated in 25 minutes" 
#[2] "Hello Bob, your party of 16 will be seated in 25 minutes"      
#[3] "Hello Jared, your party of four will be seated in 25 minutes"  
#[4] "Hello Bob, your party of 10 will be seated in 25 minutes"
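
The consistency rules in the feature list above can be checked directly. The following is a minimal sketch, assuming stringr is installed; the vector `f` is a made-up example, not data from this post.

```r
library(stringr)

# A factor is handled like the character vector of its labels.
f <- factor(c("apple", "banana"))
str_length(f)
# [1] 5 6

# An NA in the input gives NA at the same position in the result.
str_length(c("a", NA))
# [1]  1 NA
str_c("x", c("a", NA))
# [1] "xa" NA

# A zero-length input gives a zero-length result.
str_length(character(0))
# integer(0)
str_detect(character(0), "a")
# logical(0)
```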

Sample String

> sentences
  [1] "The birch canoe slid on the smooth planks."                "Glue the sheet to the dark blue background."              
  [3] "It's easy to tell the depth of a well."                    ...
[719] "She called his name many times."                           "When you hear the bell, come quickly."                    

> fruit
 [1] "apple"             "apricot"           "avocado"           "banana"            ...
[78] "tangerine"         "ugli fruit"        "watermelon"       

> words
  [1] "a"           "able"        "about"       "absolute"    "accept"      "account"     ...
[980] "young"

pattern matching engines

Modifier functions wrap a pattern and control how it is matched.

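fixed()	match exact bytes
coll()	match human letters
boundary()	match boundaries
regex()	match regular expressions (the default)
ignore.case()	case-insensitive matching in older stringr releases; the current form is regex(pattern, ignore_case = TRUE)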

Example>

bananas <- c("banana", "Banana", "BANANA")
fruit   <- c("apple", "banana", "pear", "pineapple", "사과")
patr <- "\\w{6}$"   # six word characters at the end of the string

str_count(fruit, patr)
str_detect(fruit, patr)

str_extract_all(fruit, patr)
str_match_all(fruit, patr)

str_locate_all(fruit, patr)
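
The `bananas` vector defined above is a natural test case for the modifier functions. This is only a sketch, assuming a current stringr release; the extra test strings are invented for illustration.

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# Default: the pattern is a case-sensitive regular expression.
str_detect(bananas, "banana")
# [1]  TRUE FALSE FALSE

# regex() exposes options such as ignore_case.
str_detect(bananas, regex("banana", ignore_case = TRUE))
# [1] TRUE TRUE TRUE

# fixed() compares literally, so "." is a plain dot, not "any character".
str_detect(c("a.c", "abc"), fixed("."))
# [1]  TRUE FALSE

# coll() compares with locale-aware collation rules.
str_detect(bananas, coll("banana", ignore_case = TRUE))
# [1] TRUE TRUE TRUE

# boundary() matches character, word, line, or sentence boundaries.
str_count("The birch canoe slid", boundary("word"))
# [1] 4
```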

String Manipulation

str_*(string, ...)

stringr	Description	Base function
str_length	Length of a string. Usage: str_length(string)	nchar()
str_c	Concatenate several strings into one. Usage: str_c(..., sep = "", collapse = NULL). sep is the separator placed between the strings being joined; collapse is the separator used to join the elements of the resulting vector into a single string.	paste(), paste0()
str_sub	Extract or replace substrings of a string. Usage: str_sub(string, start = 1L, end = -1L) and str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value. Example: str_sub(str, start = 1, end = 6)	substr()
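
The difference between sep and collapse, and the replacement form of str_sub, are easier to see in a short run. A minimal sketch with made-up inputs:

```r
library(stringr)

str_length(c("apple", "kiwi", NA))
# [1]  5  4 NA

# sep joins the arguments element by element.
str_c("a", "b", "c", sep = "-")
# [1] "a-b-c"

# collapse joins the elements of one vector into a single string.
str_c(c("a", "b", "c"), collapse = "-")
# [1] "a-b-c"

# Both together: sep first, then collapse.
str_c(c("x", "y"), c("1", "2"), sep = "_", collapse = ", ")
# [1] "x_1, y_2"

# Replacement form of str_sub: overwrite characters 1 to 2.
s <- "123456"
str_sub(s, 1, 2) <- "AB"
s
# [1] "AB3456"
```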


str_sub( , start, end)

> "123456" %>% str_sub( start=2,  end=5)
[1] "2345"
> "123456" %>% str_sub( 2,  2)
[1] "2"
>
> "123456" %>% str_sub(start = 2)    # end defaults to the last character
[1] "23456"
> "123456" %>% str_sub(start = -2)
[1] "56"
>
> "123456" %>% str_sub(end = 2)      # start defaults to the first character
[1] "12"
> "123456" %>% str_sub(end = -2)
[1] "12345"
>
> # Negative values count positions from the end of the string,
> # but the substring is always read left to right.
> "123456" %>% str_sub(-5, 3)  
[1] "23"
> "123456" %>% str_sub( 5, -3) 
[1] ""

cf. glue()

library(lubridate)   # ymd(), year()
library(glue)        # glue()

date <- "20150205" %>% ymd()
year <- year(date)
url <- glue("http://cran-logs.rstudio.com/{year}/{date}.csv.gz")

# http://cran-logs.rstudio.com/2015/2015-02-05.csv.gz

Pattern matching

RegEx
https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

str_* (string, pattern = " ")

stringr	Description	Result	Base function
str_detect	Detect the presence or absence of a pattern in a string. Matching is case-sensitive by default; prefix the pattern with (?i) to ignore case, e.g. dd[str_detect(dd$name, "(?i)korea"), ]. The two wrappers in the next rows keep the matching strings or find their positions.	TRUE/FALSE	grepl(pattern, x)
str_subset	wrapper around `x[str_detect(x, pattern)]`	vector	grep(pattern, x, value = TRUE)
str_which	wrapper around `str_detect(x, pattern) %>% which()`	index	grep(pattern, x)
str_count	Count the number of matches in a string. Similar to str_length, but counts matches of a pattern instead of characters.	vector
str_extract str_extract_all	Extract matching patterns from a string.	vector (list for str_extract_all)
str_match str_match_all	Extract matched groups from a string. The result is a matrix: column 1 holds the full match, the same as str_extract(string, pattern), and the following columns hold the match for each parenthesized group.	matrix
str_locate str_locate_all	Locate the position of patterns in a string.	start, end
str_replace str_replace_all	Replace matched patterns in a string.		sub(pattern, replacement, x), gsub()
str_replace_na	Turn NA into "NA".
str_split str_split_fixed	Split up a string into pieces. The maximum number of pieces can be set with n.	list	strsplit(x, pattern)
str_view str_view_all	View HTML rendering of regular expression match.

> str_replace_all("sixx123", "[[:digit:]]","")
[1] "sixx"
> str_replace("sixx123", "[[:digit:]]","")
[1] "sixx23"
> str_replace_all("sixx123", "[^[:digit:]]","")
[1] "123"

> gsub("[[:digit:]]","","sixx123")
[1] "sixx"
> gsub("[^[:digit:]]", "", "sixx123")
[1] "123"
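
The other functions in the table can be tried the same way. The sketch below uses a small invented vector `x`; the trailing comments describe the shape of each result.

```r
library(stringr)

x <- c("apple pie", "banana split", "cherry")

str_detect(x, "an")         # FALSE TRUE FALSE
str_subset(x, "p")          # "apple pie" "banana split"
str_which(x, "p")           # 1 2
str_count(x, "p")           # 3 1 0
str_extract(x, "[a-z]+")    # "apple" "banana" "cherry"

# Column 1 is the full match, columns 2 and 3 are the two groups.
str_match("key=value", "(\\w+)=(\\w+)")

# First match of "an" in "banana": start 2, end 3.
str_locate("banana", "an")

# str_split returns a list; str_split_fixed returns a matrix with n columns.
str_split("a,b,c", ",")
str_split_fixed("a,b,c", ",", n = 2)
```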

Formatting (Whitespace)

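stringr	Description	Usage
str_pad	Pad a string. Widens the string to width characters by adding the pad character on the given side.	str_pad(string, width, side = "left", pad = " ")
str_trunc	Truncate a character string. Keeps at most width characters and inserts ellipsis on the given side.	str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...")
str_trim	Trim whitespace from the start and/or end of a string.	str_trim(string, side = c("both", "left", "right"))
str_squish	Trim leading and trailing whitespace and reduce repeated whitespace inside the string to a single space.
str_wrap	Wrap text to the given width. indent is the left margin of the first line; exdent is the left margin of all following lines.	str_wrap(string, width = 80, indent = 0, exdent = 0)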
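
A short sketch of the formatting helpers with made-up inputs. The str_wrap output is printed with cat() rather than listed here, since the exact line breaks depend on the wrapping algorithm.

```r
library(stringr)

str_pad("7", width = 3, side = "left", pad = "0")
# [1] "007"

# 9 characters are kept, then the 3-character ellipsis is appended.
str_trunc("The birch canoe slid", width = 12)
# [1] "The birch..."

str_trim("  hello  ", side = "left")
# [1] "hello  "

str_squish("  a   b  ")
# [1] "a b"

# Lines after the first are indented by two spaces.
wrapped <- str_wrap("The birch canoe slid on the smooth planks.",
                    width = 20, exdent = 2)
cat(wrapped, "\n")
```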

Locale sensitive

stringr	Description
str_order str_sort	Order or sort a character vector.
str_to_upper , str_to_lower , str_to_title	Convert case of a string.
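
Locale matters mostly for sorting and case conversion. A minimal sketch; the Turkish dotted capital İ is the usual illustration of a locale-specific case rule.

```r
library(stringr)

fruits <- c("banana", "apple", "cherry")

str_sort(fruits)     # "apple" "banana" "cherry"
str_order(fruits)    # 2 1 3

str_to_title("hello world")        # "Hello World"

# The same letter upper-cases differently depending on the locale.
str_to_upper("i", locale = "en")   # "I"
str_to_upper("i", locale = "tr")   # "İ" (U+0130)
```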

stringr	Description
invert_match	Switch location of matches to location of non-matches.
str_conv	Specify the encoding of a string.
str_dup	Duplicate and concatenate strings within a character vector.
str_glue str_glue_data	Format and interpolate a string with glue
word	Extract words from a sentence.
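
A few of these helpers in use, as a rough sketch; the variable `pkg` is invented for the example.

```r
library(stringr)

str_dup("ab", 3)
# [1] "ababab"

# word() picks words by position, splitting on spaces by default.
word("The birch canoe", 2)
# [1] "birch"

# str_glue() interpolates R expressions written inside braces.
pkg <- "stringr"
greeting <- str_glue("Hello, {pkg}!")
greeting
# Hello, stringr!

str_replace_na(c("a", NA))
# [1] "a"  "NA"
```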

