Rcrawler::Crawling
Rcrawler can do both crawling and scraping.
https://www.pluralsight.com/guides/web-crawling-in-r
Reference: *Building a Quant Investment Portfolio with R* (R을 이용한 퀀트 투자 포트폴리오 만들기), Chapter 4: Understanding Crawling
Crawling a specific website
Finding URLs using CSS selectors or XPath
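Rcrawler's ContentScraper() takes CssPatterns or XpathPatterns arguments to pull specific elements out of a page. A minimal sketch, assuming the page's <title> element is what we want (the XPath pattern here is illustrative; adjust it to the target page's markup):

library(Rcrawler)

# Scrape a single page with an XPath pattern (use CssPatterns for CSS selectors).
page_data <- ContentScraper(
  Url           = "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258",
  XpathPatterns = c("//title"),
  PatternsName  = c("page_title")
)

ContentScraper() returns the matched text for each pattern, named by PatternsName.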
### Title: ~~~
source(file.path(getwd(), "../00.global_quant.R"))

# > system("java -version")
# java version "1.8.0_202"
# Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
# Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

library(Rcrawler)  # requires Java

myURL <- "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"
Rcrawler(Website    = myURL,
         no_cores   = 4,
         no_conn    = 4,
         Obeyrobots = TRUE,
         DIR        = DATA_PATH)

# A folder named finance.naver.com-091606 is created.
# Preparing multihreading cluster .. In process : 1..
# Progress: 100.00 % : 1 parssed from 1 | Collected pages: 1 | Level: 1
# + Check INDEX dataframe variable to see crawling details
# + Collected web pages are stored in Project folder
# + Project folder name : finance.naver.com-091606
# + Project folder path : ~/DATA/quant/cData/finance.naver.com-091606
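As the console output says, the crawl details land in an INDEX data frame in the global environment, and the saved project can be reloaded later. A short sketch of inspecting the results (the project folder name is taken from the output above):

# Rcrawler() leaves an INDEX data frame in the global environment,
# one row per collected page (Id, Url, Stats, Level, ...).
head(INDEX)

# Projects saved under DIR can be listed and their pages reloaded:
ListProjects()   # e.g. "finance.naver.com-091606"
pages <- LoadHTMLFiles("finance.naver.com-091606", type = "vector")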
page <- LinkExtractor(url = myURL, ExternalLInks = TRUE)

> page
# $Info
# $Info$Id
# [1] 638
#
# $Info$Url
# [1] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"
#
# $Info$Crawl_status
# [1] "finished"
#
# $Info$Crawl_level
# [1] 1
#
# $Info$SumLinks
# [1] 95
#
# $Info[[6]]
# [1] ""
#
# $Info$Status_code
# [1] 200
#
# $Info$Content_type
# [1] "text/html"
#
# $Info$Encoding
# [1] "EUC-KR"
#
# $Info$Source_page
# [1] "..."   (raw HTML source of the page, truncated)
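Besides $Info, the list returned by LinkExtractor() carries the links it found on the page, which is what the SumLinks count above refers to. A small follow-up sketch:

# Links extracted from the page:
head(page$InternalLinks)       # URLs under finance.naver.com
head(page$ExternalLinks)       # off-site URLs (collected because ExternalLInks = TRUE)

# These URLs can be fed back into LinkExtractor() or ContentScraper()
# to walk the site one level deeper.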