Korean text in R (encoding issues) – csv, xlsx…
Basic settings
RStudio > Tools >
- Global Options… > Code > Saving > Default text encoding: UTF-8
- Project Options… > Code Editing > Text encoding: set to UTF-8
Linux (Ubuntu)
The Ubuntu system locale has to be configured correctly first; otherwise the fixes found on the web tend not to work.
https://lintut.com/how-to-set-up-system-locale-on-ubuntu-18-04/
Checking the system locale
~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
~$ localectl status
System Locale: LANG=en_US.UTF-8
VC Keymap: us
X11 Layout: us
All available locales
~$ locale -a
C
C.UTF-8
en_US.utf8
ko_KR
ko_KR.euckr
ko_KR.utf8
korean
korean.euc
POSIX
If the system locale for your region is not in the list, add it from the interactive screen that the command below opens.
~$ sudo dpkg-reconfigure locales
Editing the locale
~$ sudo vi /etc/default/locale
After editing, log out and back in, check that the change took effect, and then restart R as well.
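After restarting R, it can help to confirm from inside the session that the new system locale was actually picked up. The following is a minimal sketch using base R only; the helper name `check_utf8_session` is hypothetical and not part of the original notes.

```r
# Hypothetical helper: summarise whether the current R session runs in UTF-8.
check_utf8_session <- function() {
  info <- l10n_info()                      # base R: properties of the current locale
  list(
    ctype   = Sys.getlocale("LC_CTYPE"),   # e.g. "ko_KR.UTF-8" or "en_US.UTF-8"
    lang    = Sys.getenv("LANG"),          # value inherited from the system locale
    is_utf8 = isTRUE(info[["UTF-8"]])      # TRUE when the session charset is UTF-8
  )
}

str(check_utf8_session())
```

If `is_utf8` comes back `FALSE`, the session did not inherit a UTF-8 locale, and the system settings above are worth rechecking before changing anything in R.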
A read function that accounts for encoding
http://philogrammer.com/2017-03-15/encoding/
# library(devtools)
# install_github("plgrmr/readAny", force = TRUE)
# library(readAny)
library(readr)       # guess_encoding()
library(readxl)      # read_excel()
library(data.table)  # setDT(), fread()
library(magrittr)    # %>%

uF_readAny <- function(fileNm, sep = "", ...) {
  # Use the most likely encoding reported by readr
  encoding <- guess_encoding(fileNm)$encoding[1]
  extension <- tools::file_ext(fileNm)
  if (extension == "xlsx") {
    result <- read_excel(fileNm)
  } else {
    # An explicit sep, or an unknown extension, falls back to the caller's separator
    if (sep != "" || !(extension %in% c("csv", "txt"))) extension <- "custom"
    separate <- list(csv = ",", txt = "\t", custom = sep)
    result <- read.table(fileNm, sep = separate[[extension]], fileEncoding = encoding, ...)
  }
  return(result)
}

# Alternative ways to read the same file
dd <- read.table(fileNm, header = TRUE)
dd <- uF_readAny(fileNm, header = TRUE) %>% setDT()
dd <- fread(fileNm, encoding = "UTF-8")
# read.csv("path/to/file", fileEncoding = "euc-kr")
# read.table("path/to/file", fileEncoding = "euc-kr")
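To see what `fileEncoding` does, the sketch below writes a tiny EUC-KR file and reads it back. It assumes the R session runs in a UTF-8 locale and that the platform's iconv supports EUC-KR; the sample data and variable names are made up for illustration.

```r
# Sample rows with Korean text, written as \u escapes so the script itself is pure ASCII.
lines_utf8 <- c("name,city", "\uae40\ucca0\uc218,\uc11c\uc6b8")
text_utf8  <- paste0(paste(lines_utf8, collapse = "\n"), "\n")

# Convert to EUC-KR bytes and write them to a temporary file unchanged.
bytes_euckr <- iconv(text_utf8, from = "UTF-8", to = "EUC-KR", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(bytes_euckr, tmp)

# Reading with the matching fileEncoding re-encodes the text for the session (assumes UTF-8 locale).
dd <- read.csv(tmp, fileEncoding = "EUC-KR", stringsAsFactors = FALSE)
print(dd)

# Decoding the raw bytes directly does not depend on the session locale.
decoded <- iconv(list(readBin(tmp, "raw", file.size(tmp))),
                 from = "EUC-KR", to = "UTF-8")
```

Reading the same file without `fileEncoding` in a UTF-8 session is the usual cause of broken Korean text, because the EUC-KR bytes are then interpreted as if they were UTF-8.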
https://studyforus.tistory.com/167
RStudio encoding settings
- Tools -> Global Options…
Code > Saving "Default text encoding: " : "UTF-8"
- Tools -> Project Options…
Code Editing "Text encoding: " : "UTF-8"
https://r-bong.blogspot.com/2016/03/rstudio_26.html
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8; LC_NUMERIC=C; LC_TIME=en_US.UTF-8; LC_COLLATE=en_US.UTF-8; LC_MONETARY=en_US.UTF-8; LC_MESSAGES=en_US.UTF-8; LC_PAPER=en_US.UTF-8; LC_NAME=C; LC_ADDRESS=C; LC_TELEPHONE=C; LC_MEASUREMENT=en_US.UTF-8; LC_IDENTIFICATION=C"
> Sys.setlocale("LC_CTYPE", "C") # force-clear the language setting
> localeToCharset()
[1] "UTF-8" "ISO8859-1"

Sys.getlocale() # Sys.getlocale("LC_ALL")
#[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Sys.getlocale("LC_COLLATE")
Sys.getlocale("LC_CTYPE")
Sys.getlocale("LC_MONETARY")
Sys.getlocale("LC_NUMERIC")
Sys.getlocale("LC_TIME")
Sys.setlocale(category = "LC_CTYPE", locale = "ko_KR.UTF-8")
#[1] "en_US.UTF-8/ko_KR.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
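`Sys.setlocale()` does not stop with an error when the requested locale is not installed; it issues a warning and returns an empty string, which is easy to miss. A small wrapper, sketched below under the hypothetical name `try_set_ctype`, makes that outcome explicit.

```r
# Hypothetical helper: try to switch LC_CTYPE and report whether it worked.
try_set_ctype <- function(locale) {
  # Sys.setlocale() warns and returns "" when the OS cannot honour the request.
  res <- suppressWarnings(Sys.setlocale(category = "LC_CTYPE", locale = locale))
  ok <- nzchar(res)
  if (!ok) message("Locale not available: ", locale, " (compare with `locale -a`)")
  ok
}

old <- Sys.getlocale("LC_CTYPE")   # remember the current setting
try_set_ctype("ko_KR.UTF-8")       # TRUE only if ko_KR.UTF-8 is installed on this system
try_set_ctype(old)                 # restore the previous setting
```

When the call returns `FALSE`, the locale usually has to be generated on the system first, for example with the `dpkg-reconfigure locales` step shown earlier.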