[R] Scraping Naver Shopping Reviews with RSelenium

Information

Task : Scraping reviews from Naver Shopping with RSelenium
Author : 박진만
Date : 2021-05-18
Description :
Revision history :

Contents

[Features]

Introduces R code for scraping a Naver Shopping catalog page (see the example site below)

주연테크 리오나인 L9T27S : Naver Shopping (search.shopping.naver.com)

[Function]

Web crawling / web scraping

[Data used]

None

[Data processing and analysis methods]

None

[Usage]

See the source code examples

[OS]

Windows 10

[Language]

R v4.0.5
RStudio v1.2.5033
Source code

[Preparation]

Download the Selenium server (required by RSelenium)
- RSelenium cannot run without the file below.
- Download the file and unzip it.

selenium-server-standalone-3.141.59.zip (9.11 MB)

Download ChromeDriver
- Available from the link below
- Download the driver version that matches your installed Chrome version
- The download location is up to you

ChromeDriver - WebDriver for Chrome - Downloads (chromedriver.chromium.org)
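The version-matching rule above can be checked with a few lines of R. This is a minimal sketch that assumes "matching" means the same major version (the number before the first dot), which is how the ChromeDriver download page phrases it. The function names major_version and versions_match are hypothetical helpers, not part of RSelenium.

```r
# Extract the major version (the part before the first dot) from a version string.
major_version <- function(version) {
  as.integer(sub("\\..*$", "", version))
}

# TRUE when Chrome and ChromeDriver share the same major version.
versions_match <- function(chrome_version, driver_version) {
  major_version(chrome_version) == major_version(driver_version)
}

versions_match("91.0.4472.77", "91.0.4472.19")  # same major version
versions_match("91.0.4472.77", "90.0.4430.24")  # different major version
```

The Chrome version string itself can be read from the browser's "About Chrome" page and pasted in by hand.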

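Start the Selenium server
- Start the Selenium server from the command line with the command below.
- Requirement 1: Java must be installed / Requirement 2: the Selenium server jar file must be present
- If a screen like the one in the attached image appears after you run the command, the server started successfully.
- Because the script drives Chrome, the command passes the ChromeDriver path through webdriver.chrome.driver (webdriver.gecko.driver is the property for Firefox's geckodriver). It assumes chromedriver.exe is in the current directory; otherwise give its full path.

java -Dwebdriver.chrome.driver="chromedriver.exe" -jar selenium-server-standalone-3.141.59.jar -port 5000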
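Before running the R code, it can help to confirm that something is actually listening on port 5000. The sketch below uses only base R; is_port_open is a hypothetical helper name, and it only checks that a TCP connection can be opened, not that the listener is really the Selenium server.

```r
# Returns TRUE if something accepts TCP connections on host:port, FALSE otherwise.
is_port_open <- function(host = "localhost", port = 5000L, timeout = 2) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    )
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

if (is_port_open("localhost", 5000L)) {
  message("A server appears to be listening on port 5000.")
} else {
  message("Nothing is listening on port 5000; start the Selenium server first.")
}
```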

[Code walkthrough]

Load libraries

Sys.setlocale("LC_ALL")
#Sys.setlocale("LC_CTYPE", ".1251")
# Packages used by the code in this post: browser control, HTML parsing, string handling, and the pipe (%>%)
library(RSelenium)
library(rvest)
library(stringr)
library(tidyverse)
# The remaining packages are not called by the code shown here
library(data.table)
library(foreach)
library(httr)
library(webdriver)
library(seleniumPipes)
library(readxl)
library(ggwordcloud)
library(wordcloud2)
library(htmlwidgets)
library(webshot)
library(log4r)
library(tcltk)
library(beepr)
library(noncompliance)
library(ggplot2)
library(fs)

Load helper functions

# Web environment settings #
Sys.setlocale("LC_ALL")
options(encoding = "UTF-8")
Sys.setenv(LANG = "en_US.UTF-8")


## Create the Chrome driver object (port must match the -port value of the Selenium server) ##
remDr = remoteDriver(
  remoteServerAddr = "localhost"
  , port = 5000L
  , browserName = "chrome"
)

#?remoteDriver
## Internet Explorer driver object (disabled) ##
# remDrEdge = remoteDriver(
#   remoteServerAddr = "localhost"
#   , nativeEvents = FALSE
#   , port = 5001L
#   , browserName = "internet explorer"
# )

#################################### SUB ##############################
setWindowTab = function (remDr, windowId) {
  qpath = sprintf("%s/session/%s/window", remDr$serverURL, remDr$sessionInfo[["id"]])
  remDr$queryRD(qpath, "POST", qdata = list(handle = windowId))
}

getXpathText = function(xpath) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(xpath = xpath) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

getCssText = function(css) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(css = css) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}
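Both helpers above end with the same text clean-up chain: newlines, carets, and double quotes become spaces, runs of whitespace collapse to one space, and the ends are trimmed. The sketch below reproduces that chain with base R only so it can be tried without a browser session; clean_text is a hypothetical name and the sample string is made up.

```r
# Base R equivalent of the clean-up steps used in getXpathText() and getCssText().
clean_text <- function(x) {
  x <- gsub("\n", " ", x, fixed = TRUE)   # newlines -> space
  x <- gsub("^", " ", x, fixed = TRUE)    # literal carets -> space
  x <- gsub("\"", " ", x, fixed = TRUE)   # double quotes -> space
  x <- gsub("\\s+", " ", x)               # collapse repeated whitespace
  trimws(x, which = "both")               # strip leading and trailing space
}

clean_text("  Great\n\"laptop\" ^^  fast  ")
```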

Load the review collection function for Naver Shopping catalog pages
- Input parameters: target URL and delay between page clicks in seconds (default: 2 seconds)

naver_pc_case_catalog <- function(url, delay = 2){
  
  if(grepl(x = url,pattern = "https://search.shopping.naver.com/catalog/") == FALSE){
    print("The URL pattern does not match. It must contain https://search.shopping.naver.com/catalog/.")
    return("error")
  }
  
  remDr$open()
  remDr$navigate(url)
  
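  # "맨뒤" is the link text of the "last page" button in the review pager
  last_page = remDr$findElements(using = "link text",value = "맨뒤")
  
  if(length(last_page) >= 1){
    # Take the last page number from the digits in the button's data-nclick attribute
    last_page_info = as.numeric(gsub('\\D','', as.character(last_page[[1]]$getElementAttribute(attrName = "data-nclick"))))
  } else {
    # No "last page" button: read the total review count and divide by 20 reviews per page
    last_page_info = remDr$findElements(using = "xpath",value =  paste0("//a[@data-nclick='N=a:rev.grd']"))
    last_page_info = ceiling(as.numeric(gsub('\\D','', as.character(last_page_info[[1]]$getElementText()))) / 20)
  }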
  
  print(paste0("Total ",last_page_info," pages found."))
  
  result = data.frame()
  
  for (i in 1:last_page_info) {
    
    print(paste0(i,"/",last_page_info," Collecting pages"))
    
    if(i %% 10 == 1){
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    } else {
      
      if(i %% 10 != 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0(i))
        
      } else if (i %% 10 == 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0("다음"))
        
      }
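      if(length(next_page) == 0) {
        
        # No pager link was found, so stop and return what has been collected
        break
        
      } else if(length(next_page) == 1) {
        
        next_page[[1]]$clickElement()
        
      } else {
        
        # Several links share this text: pick the first one that belongs to the review pager
        for (j in seq_along(next_page)) {
          
          attrs_info = as.character(next_page[[j]]$getElementAttribute(attrName = "data-nclick"))
          
          if(any(grepl(x = attrs_info,pattern = "a:rev"))) {
            break
          }
          
        }
        
        next_page[[j]]$clickElement()
        
      }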
      
      Sys.sleep(delay)
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    }
    
    for (k in seq_along(stars)) {
      
      stars_info = as.character(stars[[k]]$getElementText())
      dates_info = as.character(dates[[k]]$getElementText())
      reviews_info = as.character(reviews[[k]]$getElementText())
      headlines_info = as.character(headlines[[k]]$getElementText())
      
      result_part = data.frame(date = dates_info, star = stars_info, headline = headlines_info,review = reviews_info)
      result = rbind(result,result_part)
    }
    
    
  }
  
  print("Collection complete... Close the browser.")
  remDr$quit()
  return(result)
  
}
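When the "맨뒤" (last page) button is absent, the function above falls back to dividing the total review count by 20 and rounding up. The sketch below isolates that arithmetic so it can be checked without a browser. The names extract_number and page_count are hypothetical, and the label text is made up; the real label on the page may look different.

```r
# Keep only the digits of a label and convert them to a number.
extract_number <- function(text) {
  as.numeric(gsub("\\D", "", text))
}

# Number of pages needed to show n_reviews at per_page reviews per page.
page_count <- function(n_reviews, per_page = 20) {
  ceiling(n_reviews / per_page)
}

n <- extract_number("Reviews 1,234")  # hypothetical label text
page_count(n)
```

Rounding up matters because a final, partially filled page still has to be visited: 21 reviews need two pages, not one.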

Applying the function
- An example call is shown below
- The result looks like the attached image

site_url = "https://search.shopping.naver.com/catalog/23922223523?adId=nad-a001-02-000000118607953&channel=nshop.npla&query=%EB%85%B8%ED%8A%B8%EB%B6%81&NaPm=ct%3Dkosubz0w%7Cci%3D0zC0000EBE9u7Nze51no%7Ctr%3Dpla%7Chk%3Dd0ee57966fadc5a015d744528d6a57111faf9abd&cid=0zC0000EBE9u7Nze51no"
result = naver_pc_case_catalog(url = site_url)
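The example URL carries a long query string (adId, NaPm, and so on) after the catalog ID. A minimal sketch for separating the two is shown below; strip_query and catalog_id are hypothetical helper names, and the shortened example URL is only for illustration. Whether the catalog page behaves identically without the query string is an assumption that has not been verified here.

```r
# Drop the query string (everything from the first "?") from a URL.
strip_query <- function(url) {
  sub("\\?.*$", "", url)
}

# Extract the numeric catalog ID that follows "/catalog/", or NA if there is none.
catalog_id <- function(url) {
  m <- regmatches(url, regexpr("(?<=/catalog/)[0-9]+", url, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

example_url <- "https://search.shopping.naver.com/catalog/23922223523?query=%EB%85%B8%ED%8A%B8%EB%B6%81"
strip_query(example_url)
catalog_id(example_url)
```

The stripped URL still contains "https://search.shopping.naver.com/catalog/", so it passes the pattern check at the top of naver_pc_case_catalog.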

[Full code]

Sys.setlocale("LC_ALL")
#Sys.setlocale("LC_CTYPE", ".1251")
# Packages used by the code in this post: browser control, HTML parsing, string handling, and the pipe (%>%)
library(RSelenium)
library(rvest)
library(stringr)
library(tidyverse)
# The remaining packages are not called by the code shown here
library(data.table)
library(foreach)
library(httr)
library(webdriver)
library(seleniumPipes)
library(readxl)
library(ggwordcloud)
library(wordcloud2)
library(htmlwidgets)
library(webshot)
library(log4r)
library(tcltk)
library(beepr)
library(noncompliance)
library(ggplot2)
library(fs)
# Local script that is not included in this post; this line fails if ./r2.R does not exist
source("./r2.R")

# Web environment settings #
Sys.setlocale("LC_ALL")
options(encoding = "UTF-8")
Sys.setenv(LANG = "en_US.UTF-8")


## Create the Chrome driver object (port must match the -port value of the Selenium server) ##
remDr = remoteDriver(
  remoteServerAddr = "localhost"
  , port = 5000L
  , browserName = "chrome"
)

#?remoteDriver
## Internet Explorer driver object (disabled) ##
# remDrEdge = remoteDriver(
#   remoteServerAddr = "localhost"
#   , nativeEvents = FALSE
#   , port = 5001L
#   , browserName = "internet explorer"
# )

#################################### SUB ##############################
setWindowTab = function (remDr, windowId) {
  qpath = sprintf("%s/session/%s/window", remDr$serverURL, remDr$sessionInfo[["id"]])
  remDr$queryRD(qpath, "POST", qdata = list(handle = windowId))
}

getXpathText = function(xpath) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(xpath = xpath) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

getCssText = function(css) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(css = css) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

naver_pc_case_catalog <- function(url, delay = 2){
  
  if(grepl(x = url,pattern = "https://search.shopping.naver.com/catalog/") == FALSE){
    print("The URL pattern does not match. It must contain https://search.shopping.naver.com/catalog/.")
    return("error")
  }
  
  remDr$open()
  remDr$navigate(url)
  
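  # "맨뒤" is the link text of the "last page" button in the review pager
  last_page = remDr$findElements(using = "link text",value = "맨뒤")
  
  if(length(last_page) >= 1){
    # Take the last page number from the digits in the button's data-nclick attribute
    last_page_info = as.numeric(gsub('\\D','', as.character(last_page[[1]]$getElementAttribute(attrName = "data-nclick"))))
  } else {
    # No "last page" button: read the total review count and divide by 20 reviews per page
    last_page_info = remDr$findElements(using = "xpath",value =  paste0("//a[@data-nclick='N=a:rev.grd']"))
    last_page_info = ceiling(as.numeric(gsub('\\D','', as.character(last_page_info[[1]]$getElementText()))) / 20)
  }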
  
  print(paste0("Total ",last_page_info," pages found."))
  
  result = data.frame()
  
  for (i in 1:last_page_info) {
    
    print(paste0(i,"/",last_page_info," Collecting pages"))
    
    if(i %% 10 == 1){
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    } else {
      
      if(i %% 10 != 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0(i))
        
      } else if (i %% 10 == 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0("다음"))
        
      }
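      if(length(next_page) == 0) {
        
        # No pager link was found, so stop and return what has been collected
        break
        
      } else if(length(next_page) == 1) {
        
        next_page[[1]]$clickElement()
        
      } else {
        
        # Several links share this text: pick the first one that belongs to the review pager
        for (j in seq_along(next_page)) {
          
          attrs_info = as.character(next_page[[j]]$getElementAttribute(attrName = "data-nclick"))
          
          if(any(grepl(x = attrs_info,pattern = "a:rev"))) {
            break
          }
          
        }
        
        next_page[[j]]$clickElement()
        
      }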
      
      Sys.sleep(delay)
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    }
    
    for (k in seq_along(stars)) {
      
      stars_info = as.character(stars[[k]]$getElementText())
      dates_info = as.character(dates[[k]]$getElementText())
      reviews_info = as.character(reviews[[k]]$getElementText())
      headlines_info = as.character(headlines[[k]]$getElementText())
      
      result_part = data.frame(date = dates_info, star = stars_info, headline = headlines_info,review = reviews_info)
      result = rbind(result,result_part)
    }
    
    
  }
  
  print("Collection complete... Close the browser.")
  remDr$quit()
  return(result)
  
}

site_url = "https://search.shopping.naver.com/catalog/23922223523?adId=nad-a001-02-000000118607953&channel=nshop.npla&query=%EB%85%B8%ED%8A%B8%EB%B6%81&NaPm=ct%3Dkosubz0w%7Cci%3D0zC0000EBE9u7Nze51no%7Ctr%3Dpla%7Chk%3Dd0ee57966fadc5a015d744528d6a57111faf9abd&cid=0zC0000EBE9u7Nze51no"
result = naver_pc_case_catalog(url = site_url)

References

[Papers]

None

[Reports]

None

[URL]

None

Contact

[Meteorology / Programming languages]

sangho.lee.1990@gmail.com

[Oceanography / Astronomy / Big data]

saimang0804@gmail.com


License: Attribution, NonCommercial, NoDerivatives
