rvest::(Web) Scraping
https://statkclee.github.io/yonsei/data/R_Web_Crawling.pdf
● HTTP communication: httr, RSelenium
● HTML elements: rvest, jsonlite
● Encoding: urltools, readr
● Pipe operator: magrittr (dplyr)
● Text preprocessing: stringr
rvest on CRAN
PDF file
https://www.datacamp.com/community/tutorials/r-web-scraping-rvest
https://cinema4dr12.tistory.com/1170
source(file.path(getwd(), "..", "00.global.R"))
library(rvest)

### Title: LEGO REVIEW ~~~
# read_html()
# html_node() / html_nodes()
# html_text()
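The three functions listed above can be tried without network access. The following is a minimal sketch that assumes a small, made-up HTML snippet (the class names `title` and `actor` are hypothetical, not IMDb markup). In rvest 1.0 and later, `html_node()` and `html_nodes()` are superseded by `html_element()` and `html_elements()`, but the older names still work.

```r
library(rvest)

# Hypothetical inline page so the example runs offline.
page <- read_html('
<html><body>
  <h1 class="title">Sample Movie</h1>
  <ul>
    <li class="actor"> Actor One </li>
    <li class="actor"> Actor Two </li>
  </ul>
</body></html>')

# html_node() returns only the first element matching the CSS selector.
first_actor <- page %>% html_node(".actor") %>% html_text(trim = TRUE)

# html_nodes() returns every matching element.
all_actors <- page %>% html_nodes(".actor") %>% html_text(trim = TRUE)

# A selector that matches nothing yields NA with html_node() + html_attr().
missing_src <- page %>% html_node(".poster img") %>% html_attr("src")

print(first_actor)
print(all_actors)
print(missing_src)
```

The `NA` result in the last step is worth checking for before passing a scraped value on to something like `browseURL()`.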
### Parasite on IMDb ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
URL_Movies <- "https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1"
dd <- URL_Movies %>% read_html(encoding = "UTF-8")

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie poster ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
poster <- dd %>% html_node(".poster img") %>% html_attr("src")
browseURL(poster)

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast names ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read the names from the alt text of the cast photos.
dd %>% html_nodes(".primary_photo img") %>% html_attr("alt")
#  [1] "Kang-ho Song"  "Sun-kyun Lee"  "Yeo-jeong Cho" "Woo-sik Choi"
#  [5] "So-dam Park"   "Lee Jeong-eun" "Hye-jin Jang"  "Myeong-hoon Park"
#  [9] "Ji-so Jung"    "Hyun-jun Jung" "Keun-rok Park" "Esuz Jeong"
# [13] "Jae-myeong Jo" "Ik-han Jung"   "Kyu-baek Kim"

##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parse the whole cast table, then pick the name column.
cast_table <- dd %>% html_node(".cast_list") %>% html_table()
cast_table[2:16, 2]

##### way3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Select the link in the cell right after each photo cell.
dd %>% html_node("#titleCast .primary_photo")
# html_node() returns the first match only; trimws() is the base R trimmer
# (trim() is not defined by base R or rvest).
dd %>% html_node(".primary_photo+td a") %>% html_text() %>% trimws()
dd %>% html_nodes(".primary_photo+td a") %>% html_text(trim = TRUE)

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast member page: So-dam Park (박소담) ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast_rel_urls <- dd %>% html_nodes(".primary_photo+td a") %>% html_attr("href")
# Resolve the relative hrefs against the movie page URL.
cast_rel_urls <- cast_rel_urls %>% url_absolute(URL_Movies)
browseURL(cast_rel_urls[5])
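The three ways of reading cast names, and the step that turns relative links into full URLs, can be checked on a small offline table. This is a sketch under stated assumptions: the HTML below, the actor names, and `base_url` are all hypothetical stand-ins that only mimic the `.cast_list` / `.primary_photo` structure used above. `xml2::url_absolute()` is called with its namespace to make clear where the function comes from.

```r
library(rvest)

# Hypothetical base URL and markup, for illustration only.
base_url <- "https://www.example.com/title/tt0000001/"

page <- read_html('
<table class="cast_list">
  <tr><th>Photo</th><th>Name</th></tr>
  <tr><td class="primary_photo"><img alt="Actor One" src="a.jpg"></td>
      <td><a href="/name/nm0000001/"> Actor One </a></td></tr>
  <tr><td class="primary_photo"><img alt="Actor Two" src="b.jpg"></td>
      <td><a href="/name/nm0000002/"> Actor Two </a></td></tr>
</table>')

# way1: names from the alt attribute of each photo
names_alt <- page %>% html_nodes(".primary_photo img") %>% html_attr("alt")

# way2: names from the parsed table (second column)
cast_table <- page %>% html_node(".cast_list") %>% html_table()
names_tbl <- as.character(cast_table[[2]])

# way3: names from the link in the cell adjacent to the photo cell
names_link <- page %>% html_nodes(".primary_photo+td a") %>% html_text(trim = TRUE)

# Relative hrefs resolved against the base URL
rel_urls <- page %>% html_nodes(".primary_photo+td a") %>% html_attr("href")
abs_urls <- xml2::url_absolute(rel_urls, base_url)

print(names_alt)
print(names_tbl)
print(names_link)
print(abs_urls)
```

Because the hrefs start with `/`, they resolve against the host of `base_url`, not against its full path.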
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie rating ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd <- read_html("https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1", encoding = "UTF-8")

##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CSS selector
dd %>%
  html_nodes(".title_bar_wrapper strong span") %>%
  html_text() %>%
  as.numeric()

##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Full XPath copied from the browser inspector
dd %>%
  html_nodes(xpath = '//*[@id="title-overview-widget"]/div[1]/div[2]/div/div[1]/div[1]/div[1]/strong/span') %>%
  html_text() %>%
  as.numeric()
# [1] 8.6
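The CSS and XPath routes to the rating can be compared side by side on an offline snippet. This is a minimal sketch that assumes a simplified, hypothetical fragment reusing the `title-overview-widget` id and `title_bar_wrapper` class from the selectors above; the real page nests the element more deeply. The XPath here is deliberately shortened with `//`, which is an illustrative choice and not the author's expression.

```r
library(rvest)

# Hypothetical, simplified markup for illustration only.
page <- read_html('
<div id="title-overview-widget">
  <div class="title_bar_wrapper">
    <strong><span>8.6</span></strong>
  </div>
</div>')

# CSS selector: descendant combinators, as in way1
rating_css <- page %>%
  html_nodes(".title_bar_wrapper strong span") %>%
  html_text() %>%
  as.numeric()

# XPath: anchor on the id, then search descendants with //
rating_xpath <- page %>%
  html_nodes(xpath = '//*[@id="title-overview-widget"]//strong/span') %>%
  html_text() %>%
  as.numeric()

print(rating_css)
print(rating_xpath)
```

A fully positional XPath such as `/div[1]/div[2]/...` depends on the exact nesting of the page, so a selector anchored on an id or class tends to survive small layout changes better.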