Rcrawler::Crawling
Rcrawler can do both crawling and scraping.
https://www.pluralsight.com/guides/web-crawling-in-r
"R을 이용한 퀀트 투자 포트폴리오 만들기" (Quant Investment Portfolios with R) – Chapter 4: Understanding Crawling
Crawling a specific website
Finding URLs using CSS selectors or XPath
### Title: ~~~
source(file.path(getwd(),"../00.global_quant.R"))
# > system("java -version")
# #java version "1.8.0_202"
# #Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
# #Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
library(Rcrawler) # require Java
myURL="https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"
Rcrawler(Website=myURL, no_cores=4, no_conn=4, Obeyrobots=TRUE, DIR=DATA_PATH)
###### A folder named finance.naver.com-091606 is created.
# Preparing multihreading cluster .. In process : 1..
# Progress: 100.00 % : 1 parssed from 1 | Collected pages: 1 | Level: 1
# + Check INDEX dataframe variable to see crawling details
# + Collected web pages are stored in Project folder
# + Project folder name : finance.naver.com-091606
# + Project folder path : ~/DATA/quant/cData/finance.naver.com-091606
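Once the crawl finishes, the downloaded pages can be listed and reloaded from the project folder. A minimal sketch using Rcrawler's `ListProjects()` and `LoadHTMLFiles()`; the folder name below is the one reported in the crawl output above:

```r
library(Rcrawler)

# List crawl project folders under the working data directory
ListProjects()

# Reload the collected HTML files of one project into a character vector
pages <- LoadHTMLFiles("finance.naver.com-091606", type = "vector")
length(pages)  # number of collected pages
```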
page <- LinkExtractor(url=myURL, ExternalLInks=T)
> page
# $Info
# $Info$Id
# [1] 638
# $Info$Url
# [1] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"
# $Info$Crawl_status
# [1] "finished"
# $Info$Crawl_level
# [1] 1
# $Info$SumLinks
# [1] 95
# $Info[[6]]
# [1] ""
# $Info$Status_code
# [1] 200
# $Info$Content_type
# [1] "text/html"
# $Info$Encoding
# [1] "EUC-KR"
# $Info$Source_page
# [1] "<full HTML source of the page; long output truncated>"
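The `LinkExtractor()` result above also carries the harvested URLs in `page$InternalLinks` and `page$ExternalLinks`; combining this with `ContentScraper()` pulls specific fields out of a page with XPath, as noted at the top. A hedged sketch — the XPath expressions below are illustrative assumptions, not verified against the actual Naver page markup:

```r
library(Rcrawler)

# Links collected by LinkExtractor()
head(page$InternalLinks)
head(page$ExternalLinks)

# Scrape fields from one article page with XPath patterns
# (patterns here are hypothetical; adjust to the real page structure)
article <- ContentScraper(
  Url = page$InternalLinks[1],
  XpathPatterns = c("//h3", "//span[@class='wdate']"),
  PatternsName = c("title", "date")
)
article
```

`ContentScraper()` also accepts `CssPatterns` if CSS selectors are preferred over XPath.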