rvest::(Web) Scraping
https://statkclee.github.io/yonsei/data/R_Web_Crawling.pdf
● HTTP communication: httr, RSelenium
● HTML elements: rvest, jsonlite
● Encoding: urltools, readr
● Pipe operator: magrittr (dplyr)
● Text preprocessing: stringr
rvest on CRAN
PDF file
https://www.datacamp.com/community/tutorials/r-web-scraping-rvest
https://cinema4dr12.tistory.com/1170
source(file.path(getwd(), "..", "00.global.R"))
library(rvest)
### Title: LEGO REVIEW ~~~
# read_html()
# html_node() / html_nodes()
# html_text()
### Parasite on IMDb ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
URL_Movies <- "https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1"
dd <- URL_Movies %>% read_html(encoding="UTF-8")
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie poster ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
poster <- dd %>% html_node(".poster img") %>% html_attr("src")
browseURL(poster)
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast names ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>% html_nodes(".primary_photo img") %>% html_attr("alt")
# [1] "Kang-ho Song" "Sun-kyun Lee" "Yeo-jeong Cho" "Woo-sik Choi"
# [5] "So-dam Park" "Lee Jeong-eun" "Hye-jin Jang" "Myeong-hoon Park"
# [9] "Ji-so Jung" "Hyun-jun Jung" "Keun-rok Park" "Esuz Jeong"
# [13] "Jae-myeong Jo" "Ik-han Jung" "Kyu-baek Kim"
##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast_table <- dd %>% html_node(".cast_list") %>% html_table()
cast_table[2:16,2]
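##### way3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>% html_node("#titleCast .primary_photo")
# html_node() returns only the first match; trimws() is base R
dd %>% html_node(".primary_photo+td a") %>% html_text() %>% trimws()
# html_nodes() returns every match; trim = TRUE strips surrounding whitespace
dd %>% html_nodes(".primary_photo+td a") %>% html_text(trim = TRUE)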
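The difference between html_node() and html_nodes() is easier to see on a small page that does not depend on the live site. The sketch below uses a made-up HTML fragment (the markup and the names "Actor A" / "Actor B" are assumptions for illustration, not IMDb's real structure).

```r
library(rvest)

# Hypothetical fragment that imitates a cast table; not IMDb's actual markup.
page <- read_html('
<table class="cast_list">
  <tr><td class="primary_photo"><img alt="Actor A"></td><td> <a href="/name/nm001/"> Actor A </a> </td></tr>
  <tr><td class="primary_photo"><img alt="Actor B"></td><td> <a href="/name/nm002/"> Actor B </a> </td></tr>
</table>')

# html_node(): first matching element only (length-1 result)
first_name <- page %>%
  html_node(".primary_photo+td a") %>%
  html_text(trim = TRUE)

# html_nodes(): all matching elements (one result per row)
all_names <- page %>%
  html_nodes(".primary_photo+td a") %>%
  html_text(trim = TRUE)

# html_attr() pulls an attribute instead of the text
alt_names <- page %>%
  html_nodes(".primary_photo img") %>%
  html_attr("alt")
```

In rvest 1.0.0 and later these functions are also available as html_element() and html_elements(); the older names used here still work.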
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast member page - So-dam Park ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast_rel_urls <- dd %>% html_nodes(".primary_photo+td a") %>% html_attr("href")
cast_rel_urls <- cast_rel_urls %>% url_absolute(URL_Movies)
browseURL(cast_rel_urls[5])
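The href values scraped above are relative, so they have to be resolved against the page URL before they can be opened. A minimal sketch of what url_absolute() does, using hypothetical relative paths (the two paths below are assumptions, not values taken from the page):

```r
library(xml2)  # url_absolute() is defined in xml2

base_url <- "https://www.imdb.com/title/tt6751668/"

# Hypothetical relative links: one root-relative, one relative to the current path
rel_urls <- c("/name/nm0000001/", "fullcredits")

abs_urls <- url_absolute(rel_urls, base_url)
# A root-relative path replaces the whole path of the base URL;
# a bare name is appended to the base URL's directory.
print(abs_urls)
```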
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie rating ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd <- read_html("https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1", encoding="UTF-8")
##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>%
  html_nodes(".title_bar_wrapper strong span") %>%
  html_text() %>% as.numeric()
##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>%
  html_nodes(xpath='//*[@id="title-overview-widget"]/div[1]/div[2]/div/div[1]/div[1]/div[1]/strong/span') %>%
  html_text() %>% as.numeric()
# [1] 8.6
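Both approaches select the same element, once by CSS selector and once by XPath. The sketch below reproduces that on a made-up fragment (the markup is an assumption that only imitates the class and id names used above), and uses a shorter XPath with `//` instead of a full positional path, which tends to break less often when the page layout changes.

```r
library(rvest)

# Hypothetical fragment; only the id and class names are borrowed from the selectors above.
page <- read_html('
<div id="title-overview-widget">
  <div class="title_bar_wrapper">
    <strong><span>8.6</span></strong>
  </div>
</div>')

# CSS selector
css_rating <- page %>%
  html_node(".title_bar_wrapper strong span") %>%
  html_text() %>%
  as.numeric()

# XPath: anchor on the id, then search descendants with //
xpath_rating <- page %>%
  html_node(xpath = '//*[@id="title-overview-widget"]//strong/span') %>%
  html_text() %>%
  as.numeric()
```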
Getting the XPath
