Rcrawler::Crawling

Published by onesixx on

Rcrawler can handle both crawling and scraping.

https://www.pluralsight.com/guides/web-crawling-in-r
Building a Quant Investment Portfolio with R, Chapter 4: Understanding Crawling

Crawling a specific website

Finding URLs with CSS or XPath

### Title: ~~~
source(file.path(getwd(),"../00.global_quant.R"))
# > system("java -version")
# #java version "1.8.0_202"
# #Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
# #Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
library(Rcrawler)  # requires Java
myURL <- "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"

# DATA_PATH is presumably defined in the sourced 00.global_quant.R
Rcrawler(Website = myURL, no_cores = 4, no_conn = 4, Obeyrobots = TRUE, DIR = DATA_PATH)
###### A finance.naver.com-091606 folder is created.

# Preparing multihreading cluster .. In process : 1..
# Progress: 100.00 %  :  1  parssed from  1  | Collected pages: 1  | Level: 1 
# + Check INDEX dataframe variable to see crawling details 
# + Collected web pages are stored in Project folder 
# + Project folder name : finance.naver.com-091606 
# + Project folder path : ~/DATA/quant/cData/finance.naver.com-091606 
# Note: the argument really is spelled ExternalLInks in Rcrawler
page <- LinkExtractor(url = myURL, ExternalLInks = TRUE)
page
# $Info
# $Info$Id
# [1] 638

# $Info$Url
# [1] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"

# $Info$Crawl_status
# [1] "finished"

# $Info$Crawl_level
# [1] 1

# $Info$SumLinks
# [1] 95

# $Info[[6]]
# [1] ""

# $Info$Status_code
# [1] 200

# $Info$Content_type
# [1] "text/html"

# $Info$Encoding
# [1] "EUC-KR"

# $Info$Source_page
# [1] "\n\n\n\n\n\n\n\n\n\n\t\n\t\n\t\n\t\n\n<...showPannel(layerId);\n        }\n    }\n}\n\n// add data-useragent\ndocument.documentElement.setAttribute('data-useragent',navigator.userAgent);\n\n\n\n"

# $Info$Title
# [1] "³×À̹ö ±ÝÀ¶"   (EUC-KR bytes shown mis-decoded; the title is "네이버 금융", Naver Finance)

# $InternalLinks
# [1] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258/"
# [2] "https://finance.naver.com/"
# [3] "https://finance.naver.com/sise/"
# [4] "https://finance.naver.com/world/"
# [5] "https://finance.naver.com/marketindex/"
# [6] "https://finance.naver.com/fund/"
# [7] "https://finance.naver.com/research/"
# [8] "https://finance.naver.com/news/"
# [9] "https://finance.naver.com/mystock/"
# [10] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258"
# [11] "https://finance.naver.com/news/mainnews.nhn"
# [12] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=401"
# [13] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=402"
# [14] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=403"
# [15] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=404"
# [16] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=406"
# [17] "https://finance.naver.com/news/news_list.nhn?mode=LSS3D&section_id=101&section_id2=258&section_id3=429"
# [18] "https://finance.naver.com/news/news_list.nhn?mode=RANK"
# [19] "https://finance.naver.com/news/news_list.nhn?mode=LSTD&section_id=101&section_id2=258&type=1"
# [20] "https://finance.naver.com/news/news_list.nhn?mode=TV&section_id=tv"
# [21] "https://finance.naver.com/news/market_notice.nhn"
# [22] "https://finance.naver.com/news/market_special.nhn"
# [23] "https://finance.naver.com/news/news_search.nhn"
# [24] "https://finance.naver.com/news/news_search.nhn?rcdate=&q=%BF%C0%B4%C3%C0%C7+%C1%F5%BD%C3%C0%CF%C1%A4&sm=title.basic&pd=1&stDateStart="
# [25] "https://finance.naver.com/news/news_read.nhn?article_id=0004897771&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [26] "https://finance.naver.com/news/news_read.nhn?article_id=0004897770&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [27] "https://finance.naver.com/news/news_read.nhn?article_id=0003895149&office_id=011&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [28] "https://finance.naver.com/news/news_read.nhn?article_id=0000949938&office_id=215&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [29] "https://finance.naver.com/news/news_read.nhn?article_id=0004897769&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [30] "https://finance.naver.com/news/news_read.nhn?article_id=0004897767&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [31] "https://finance.naver.com/news/news_read.nhn?article_id=0003895148&office_id=011&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [32] "https://finance.naver.com/news/news_read.nhn?article_id=0004897765&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [33] "https://finance.naver.com/news/news_read.nhn?article_id=0003895146&office_id=011&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [34] "https://finance.naver.com/news/news_read.nhn?article_id=0004897763&office_id=018&mode=LSS3D&type=0&section_id=101&section_id2=258&section_id3=401"
# [35] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258/javascript:view_newsflash_move('prev');"
# [36] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258/javascript:view_newsflash_move('next');"
# [37] "https://finance.naver.com/news/news_read.nhn?article_id=0004897771&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [38] "https://finance.naver.com/news/news_read.nhn?article_id=0004897770&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [39] "https://finance.naver.com/news/news_read.nhn?article_id=0003895149&office_id=011&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [40] "https://finance.naver.com/news/news_read.nhn?article_id=0004882120&office_id=277&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [41] "https://finance.naver.com/news/news_read.nhn?article_id=0000949938&office_id=215&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [42] "https://finance.naver.com/news/news_read.nhn?article_id=0004897769&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [43] "https://finance.naver.com/news/news_read.nhn?article_id=0004897767&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [44] "https://finance.naver.com/news/news_read.nhn?article_id=0003895148&office_id=011&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [45] "https://finance.naver.com/news/news_read.nhn?article_id=0004897765&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [46] "https://finance.naver.com/news/news_read.nhn?article_id=0003895146&office_id=011&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [47] "https://finance.naver.com/news/news_read.nhn?article_id=0004897763&office_id=018&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [48] "https://finance.naver.com/news/news_read.nhn?article_id=0010439801&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [49] "https://finance.naver.com/news/news_read.nhn?article_id=0000680150&office_id=417&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [50] "https://finance.naver.com/news/news_read.nhn?article_id=0004527819&office_id=015&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [51] "https://finance.naver.com/news/news_read.nhn?article_id=0004527816&office_id=015&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [52] "https://finance.naver.com/news/news_read.nhn?article_id=0010439794&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [53] "https://finance.naver.com/news/news_read.nhn?article_id=0010439793&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [54] "https://finance.naver.com/news/news_read.nhn?article_id=0010439791&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [55] "https://finance.naver.com/news/news_read.nhn?article_id=0010439787&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [56] "https://finance.naver.com/news/news_read.nhn?article_id=0010439786&office_id=003&mode=LSS2D&type=0&section_id=101&section_id2=258&section_id3=&date=20210409&page=1"
# [57] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=1"
# [58] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=2"
# [59] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=3"
# [60] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=4"
# [61] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=5"
# [62] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=6"
# [63] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=7"
# [64] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=8"
# [65] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=9"
# [66] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=10"
# [67] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=11"
# [68] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&page=31"
# [69] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&date=20210408"
# [70] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&date=20210407"
# [71] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&date=20210406"
# [72] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258&date=20210405"
# [73] "https://finance.naver.com/news/news_read.nhn?article_id=0003606973&office_id=023&mode=RANK&typ=0"
# [74] "https://finance.naver.com/news/news_read.nhn?article_id=0004570324&office_id=008&mode=RANK&typ=0"
# [75] "https://finance.naver.com/news/news_read.nhn?article_id=0001429037&office_id=005&mode=RANK&typ=0"
# [76] "https://finance.naver.com/news/news_read.nhn?article_id=0004570379&office_id=008&mode=RANK&typ=0"
# [77] "https://finance.naver.com/news/news_read.nhn?article_id=0005278879&office_id=421&mode=RANK&typ=0"
# [78] "https://finance.naver.com/news/news_read.nhn?article_id=0004777118&office_id=009&mode=RANK&typ=0"
# [79] "https://finance.naver.com/news/news_read.nhn?article_id=0003894898&office_id=011&mode=RANK&typ=0"
# [80] "https://finance.naver.com/news/news_read.nhn?article_id=0001819347&office_id=016&mode=RANK&typ=0"
# [81] "https://finance.naver.com/sise/lastsearch2.nhn"
# [82] "https://finance.naver.com/item/main.nhn?code=005930"
# [83] "https://finance.naver.com/item/main.nhn?code=000890"
# [84] "https://finance.naver.com/item/main.nhn?code=035720"
# [85] "https://finance.naver.com/item/main.nhn?code=011200"
# [86] "https://finance.naver.com/item/main.nhn?code=302440"
# [87] "https://finance.naver.com/rules.nhn"
# [88] "https://finance.naver.com/news/news_list.nhn?mode=LSS2D&section_id=101&section_id2=258/javascript:;"

# $ExternalLinks
# [1] "https://www.naver.com/"
# [2] "http://www.seibro.or.kr/websquare/control.jsp?w2xPath=/IPORTAL/user/company/BIP_CNTS01020V.xml&menuNo=273"
# [3] "https://www.naver.com/rules/service.html"
# [4] "https://www.naver.com/rules/privacy.html"
# [5] "https://www.naver.com/rules/disclaimer.html"
# [6] "https://help.naver.com/support/alias/contents2/finance/finance_1.naver"
# [7] "https://www.navercorp.com/"

XPath

//*[@id="contentarea_left"]/ul/li[1]/dl/dd[1]/a ==> //*[@id="contentarea_left"]/ul/li/dl/dd/a
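A minimal sketch of why the indices are dropped: the indexed XPath matches only the first link, while the generalized one matches every link in the list. It assumes the xml2 package is installed (it is not used in the original post), and the HTML fragment and its hrefs are made up to mimic the structure the XPath targets.

```r
library(xml2)  # assumption: xml2 is installed; not part of the original script

# Hypothetical fragment with the same nesting as the XPath above
html <- '
<div id="contentarea_left">
  <ul>
    <li><dl><dd><a href="/news/a1">first</a></dd><dd><a href="/news/a1b">first-b</a></dd></dl></li>
    <li><dl><dd><a href="/news/a2">second</a></dd></dl></li>
  </ul>
</div>'
doc <- read_html(html)

# Indexed XPath: only the <a> in the first <dd> of the first <li>
one <- xml_find_all(doc, '//*[@id="contentarea_left"]/ul/li[1]/dl/dd[1]/a')

# Generalized XPath: every <a> under any <li>/<dl>/<dd>
all_links <- xml_find_all(doc, '//*[@id="contentarea_left"]/ul/li/dl/dd/a')

xml_attr(one, "href")        # one href
xml_attr(all_links, "href")  # three hrefs
```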

Playing it safe with a local HTTP server

Terms of service

Terms of service: crawling Facebook
https://about.fb.com/news/2020/10/taking-legal-action-against-data-scraping/

robots.txt

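Before crawling, check the 'robots.txt' of the target website from its main page. In particular, if you intend to use the collected data for business purposes, be sure to get a legal review first.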
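As a rough illustration of the robots.txt check, the base R sketch below collects the Disallow rules that apply to `User-agent: *` and tests a path against them by prefix. The helper names `disallowed_for_all` and `is_allowed` and the sample rules are hypothetical, and the parser is deliberately simplified: it ignores `Allow` lines, wildcards, and grouped User-agent lines, so it is no substitute for a real parser or for the `Obeyrobots` option.

```r
# Hypothetical helper: Disallow rules that apply to "User-agent: *"
disallowed_for_all <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  rules <- character(0)
  applies <- FALSE
  for (ln in lines) {
    key <- tolower(trimws(sub(":.*$", "", ln)))   # text before the first colon
    val <- trimws(sub("^[^:]*:", "", ln))         # text after the first colon
    if (key == "user-agent") {
      applies <- (val == "*")
    } else if (key == "disallow" && applies && nzchar(val)) {
      rules <- c(rules, val)
    }
  }
  rules
}

# Hypothetical helper: a path is allowed if no Disallow rule is a prefix of it
is_allowed <- function(path, rules) {
  !any(startsWith(path, rules))
}

# Made-up robots.txt content; in practice one might read the real file with
# readLines() on the site's /robots.txt URL
robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /tmp/   # scratch area",
  "",
  "User-agent: BadBot",
  "Disallow: /"
)
rules <- disallowed_for_all(robots)
is_allowed("/news/list", rules)
is_allowed("/private/report", rules)
```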

https://warm-uk.tistory.com/39 [Data crawling] Understanding how to crawl Naver search data with the OPEN API

Crawl Gently

Rcrawler(Website = url, no_cores = 1, RequestsDelay = 2, MaxDepth = 1)

  • Limit the number of processes: no_cores = 1
  • Go slowly: RequestsDelay = 2
  • Limit depth: MaxDepth = 1
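To make the idea behind a request delay concrete, here is a small base R sketch that visits URLs one at a time and pauses between requests. It only illustrates throttling and is not how Rcrawler implements `RequestsDelay`; the helper `polite_fetch` and the stand-in `fake_fetch` are hypothetical names introduced for this example.

```r
# Hypothetical helper: fetch URLs sequentially, sleeping `delay` seconds between requests
polite_fetch <- function(urls, delay = 2,
                         fetch = function(u) readLines(u, warn = FALSE)) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # pause before every request except the first
    results[[i]] <- fetch(urls[[i]])
  }
  names(results) <- urls
  results
}

# Stand-in fetch function so that nothing is actually downloaded
fake_fetch <- function(u) paste("page for", u)

started <- Sys.time()
pages <- polite_fetch(c("a", "b", "c"), delay = 0.1, fetch = fake_fetch)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

With three URLs there are two pauses, so `elapsed` should be at least about 0.2 seconds here.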

https://rdrr.io/github/salimk/Rcrawler/f/README.md

View(INDEX)

Categories: quant
