rvest::(Web) Scraping

Published by onesixx on

https://statkclee.github.io/yonsei/data/R_Web_Crawling.pdf

● HTTP communication: httr, RSelenium
● HTML elements: rvest, jsonlite
● Encoding: urltools, readr
● Pipe operator: magrittr (dplyr)
● Text preprocessing: stringr
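
To make the division of labour concrete, here is a minimal offline sketch: rvest parses the HTML, the pipe chains the steps, and stringr cleans the text. The HTML string is invented for illustration; with a live site, httr (or RSelenium for pages rendered by JavaScript) would fetch the page first.

```r
library(rvest)    # HTML parsing; also re-exports the %>% pipe
library(stringr)  # text preprocessing

# Invented HTML for illustration; a real script would download this with httr.
html <- "<ul>
  <li class='cast'>  Kang-ho   Song </li>
  <li class='cast'> So-dam Park  </li>
</ul>"

cast <- read_html(html) %>%
  html_nodes("li.cast") %>%   # select the matching elements
  html_text() %>%             # extract their raw text
  str_squish()                # trim the ends and collapse repeated whitespace

cast
```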

rvest on CRAN
PDF file
https://www.datacamp.com/community/tutorials/r-web-scraping-rvest

https://cinema4dr12.tistory.com/1170

source(file.path(getwd(),"..","00.global.R"))
library(rvest)
### Title: LEGO REVIEW ~~~
# read_html()
# html_node() / html_nodes()
# html_text()
### Parasite (기생충) on IMDb ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
URL_Movies <- "https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1"
dd <- URL_Movies %>% read_html(encoding="UTF-8") 

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie poster ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
poster <- dd %>% html_node(".poster img") %>% html_attr("src")
browseURL(poster)

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast names ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>% html_nodes(".primary_photo img") %>% html_attr("alt")
#  [1] "Kang-ho Song"     "Sun-kyun Lee"     "Yeo-jeong Cho"    "Woo-sik Choi"    
#  [5] "So-dam Park"      "Lee Jeong-eun"    "Hye-jin Jang"     "Myeong-hoon Park"
#  [9] "Ji-so Jung"       "Hyun-jun Jung"    "Keun-rok Park"    "Esuz Jeong"      
# [13] "Jae-myeong Jo"    "Ik-han Jung"      "Kyu-baek Kim" 

##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast_table <- dd %>% html_node(".cast_list") %>% html_table()
cast_table[2:16,2]

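##### way3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# First matching photo cell inside the cast section
dd %>% html_node("#titleCast .primary_photo")
# First cast name; trimws() is base R (trim() is not defined)
dd %>% html_node(".primary_photo+td a") %>% html_text() %>% trimws()
# All cast names, trimmed by html_text() itself
dd %>% html_nodes(".primary_photo+td a") %>% html_text(trim = TRUE)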

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Cast member page - So-dam Park (박소담) ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast_rel_urls <- dd %>% html_nodes(".primary_photo+td a") %>% html_attr("href")
cast_rel_urls <- cast_rel_urls %>% url_absolute(URL_Movies)
browseURL(cast_rel_urls[5])

###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
###### Movie rating ----
###### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd <- read_html("https://www.imdb.com/title/tt6751668/?ref_=fn_al_tt_1", encoding="UTF-8") 

##### way1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>% 
  html_nodes(".title_bar_wrapper strong span") %>% 
  html_text() %>% as.numeric()

##### way2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dd %>% 
  html_nodes(xpath='//*[@id="title-overview-widget"]/div[1]/div[2]/div/div[1]/div[1]/div[1]/strong/span') %>% 
  html_text() %>% as.numeric()
# [1] 8.6

Finding the XPath
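
The long expression in way2 has the shape of an absolute path copied from a browser's developer tools (an assumption; this note does not say how it was obtained). A minimal sketch on an inline HTML snippet, made up for illustration and not IMDb's real markup, shows that a CSS selector and an XPath expression can select the same node:

```r
library(rvest)

# Hypothetical markup for illustration only; not IMDb's actual page structure.
html <- '<div id="overview">
  <div class="ratings"><strong><span>8.6</span></strong></div>
</div>'
page <- read_html(html)

# CSS selector: short, and tolerant of extra wrapper elements.
rating_css <- page %>%
  html_node(".ratings strong span") %>%
  html_text() %>%
  as.numeric()

# XPath: an absolute-style path anchored on the element id.
rating_xpath <- page %>%
  html_node(xpath = '//*[@id="overview"]/div[1]/strong/span') %>%
  html_text() %>%
  as.numeric()

identical(rating_css, rating_xpath)
```

Positional paths such as div[1]/div[2] tend to break when the page layout changes, so a class-based CSS selector is often the more durable choice.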

Advice

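Before crawling a website, check its 'robots.txt', reachable from the site's main page.
In particular, if you intend to use the collected data for business purposes, be sure to obtain a legal review.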
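
As a rough sketch of the robots.txt check, the snippet below builds the robots.txt URL from a page URL and lists the Disallow rules. To stay runnable offline it parses a made-up sample (`sample_robots` and the helper `disallowed_paths()` are hypothetical names, not part of rvest); for a real site you would pass `readLines(robots_url)` instead. It ignores User-agent grouping, so treat it as a quick look and not a compliant parser.

```r
library(xml2)

# robots.txt lives at the root of the host, regardless of the page path.
robots_url <- url_absolute("/robots.txt", "https://www.imdb.com/title/tt6751668/")

# Hypothetical helper: collect the paths named in Disallow lines.
disallowed_paths <- function(lines) {
  rules <- grep("^[[:space:]]*Disallow[[:space:]]*:", lines,
                value = TRUE, ignore.case = TRUE)
  paths <- trimws(sub("^[^:]*:", "", rules))
  paths[nzchar(paths)]  # an empty Disallow means nothing is blocked
}

# Made-up sample content; a real check would use readLines(robots_url).
sample_robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "Allow: /"
)

disallowed_paths(sample_robots)
```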

Categories: quant
