[R] Scraping Naver Shopping Catalog Reviews with RSelenium

 Info

  • Task        : Scraping reviews from a Naver shopping catalog page with RSelenium
  • Author      : 박진만
  • Date        : 2021-05-18
  • Description :
  • Revisions   :

 

 Contents

[Features]

  • Introduces R code that scrapes a Naver shopping catalog page (example site below)
 

Example page: 주연테크 리오나인 L9T27S : 네이버 쇼핑 (search.shopping.naver.com)

 

[Functionality]

  • Web crawling / web scraping

 

[Data Used]

  • None

 

[Processing and Analysis Methods]

  • None

 

[Usage]

  • See the example source code

 

[OS Used]

  • Windows 10

 

[Languages Used]

  • R v4.0.5
  • RStudio v1.2.5033

 

 Source Code

[Setup]

  • Download the Selenium standalone server
    • The file below is required to run RSelenium.
    • Download the file and extract the archive

selenium-server-standalone-3.141.59.zip

 

  • Download the Chrome driver
    • Available from the link below
    • Download the driver version that matches your installed Chrome version
    • The download location after extraction is up to you

 

 

ChromeDriver - WebDriver for Chrome - Downloads (chromedriver.chromium.org)

 

  • Start the Selenium server
    • Run the command below on the command line before using RSelenium
    • Requirement 1: Java must be installed / Requirement 2: the selenium-server-standalone jar file must exist
    • If a screen like the attached image appears after entering the command, the server started successfully
java -Dwebdriver.chrome.driver="chromedriver.exe" -jar selenium-server-standalone-3.141.59.jar -port 5000
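
Before creating the driver object in R, it can be worth confirming that the server started above is actually listening. A minimal sketch using httr (loaded later in this post); the port must match the one passed to the java command:

```r
# Ping the Selenium standalone server's status endpoint on port 5000.
library(httr)

res = tryCatch(GET("http://localhost:5000/wd/hub/status"), error = function(e) NULL)

if (!is.null(res) && status_code(res) == 200) {
  print("Selenium server is up.")
} else {
  print("Selenium server is not reachable - check the java command and the port.")
}
```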

(attached image: Selenium server startup screen)

 

[Code Walkthrough]

  • Load libraries
Sys.setlocale("LC_ALL")
#Sys.setlocale("LC_CTYPE", ".1251")
library(RSelenium)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
library(foreach)
library(httr)
library(webdriver)
library(seleniumPipes)
library(readxl)
library(ggwordcloud)
library(wordcloud2)
library(htmlwidgets)
library(webshot)
library(log4r)
library(tcltk)
library(beepr)
library(noncompliance)
library(ggplot2)
library(fs)

 

  • Load helper functions
# Set up the web environment #
Sys.setlocale("LC_ALL")
options(encoding = "UTF-8")
Sys.setenv(LANG = "en_US.UTF-8")


## Create the Chrome driver object ##
remDr = remoteDriver(
  remoteServerAddr = "localhost"
  , port = 5000L
  , browserName = "chrome"
)

#?remoteDriver
## Create an Internet Explorer driver object (unused) ##
# remDrEdge = remoteDriver(
#   remoteServerAddr = "localhost"
#   , nativeEvents = FALSE
#   , port = 5001L
#   , browserName = "internet explorer"
# )

#################################### SUB ##############################
setWindowTab = function (remDr, windowId) {
  qpath = sprintf("%s/session/%s/window", remDr$serverURL, remDr$sessionInfo[["id"]])
  remDr$queryRD(qpath, "POST", qdata = list(handle = windowId))
}

getXpathText = function(xpath) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(xpath = xpath) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

getCssText = function(css) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(css = css) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}
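
Both helpers operate on the current page of the open remDr session, so they require remDr$open() and remDr$navigate() to have run first. A minimal sketch of equivalent XPath and CSS calls (the exact selectors here are illustrative, not verified against the live page):

```r
# Hypothetical usage of getXpathText()/getCssText(); assumes the Selenium
# server is running and a session has been opened as shown above.
remDr$open()
remDr$navigate("https://search.shopping.naver.com/catalog/23922223523")

# The same kind of query expressed two ways; both return cleaned text with
# newlines, quotes, and repeated whitespace collapsed.
reviewHeadsXpath = getXpathText("//div[@id='section_review']//em")  # XPath form
reviewHeadsCss   = getCssText("#section_review em")                 # CSS form

print(head(reviewHeadsXpath))
```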

 

  • Load the review-collection function for Naver shopping catalog pages
    • Input parameters: target URL and page-load delay in seconds (default: 2)
naver_pc_case_catalog <- function(url, delay = 2){
  
  if(!grepl(x = url, pattern = "https://search.shopping.naver.com/catalog/")){
    print("The URL pattern does not match. It must contain https://search.shopping.naver.com/catalog/.")
    return("error")
  }
  
  remDr$open()
  remDr$navigate(url)
  
  last_page = remDr$findElements(using = "link text",value = paste0("맨뒤"))
  
  if(length(last_page) >= 1){
    
    last_page_info = as.numeric(gsub('\\D','', as.character(last_page[[1]]$getElementAttribute(attrName = "data-nclick"))))
  } else {
    last_page_info = remDr$findElements(using = "xpath",value =  paste0("//a[@data-nclick='N=a:rev.grd']"))
    last_page_info = ceiling(as.numeric(gsub('\\D','', as.character(last_page_info[[1]]$getElementText()))) / 20)
  }
  
  print(paste0("Total ",last_page_info," pages found."))
  
  result = data.frame()
  
  for (i in 1:last_page_info) {
    
    print(paste0(i,"/",last_page_info," Collecting pages"))
    
    if(i %% 10 == 1){
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    } else {
      
      if(i %% 10 != 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0(i))
        
      } else if (i %% 10 == 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0("다음"))
        
      }
      if(length(next_page) == 1) {
        
        next_page[[1]]$clickElement()
        
      } else {
        
        for (j in 1:length(next_page)) {
          
          attrs_info = next_page[[j]]$getElementAttribute(attrName = "data-nclick")
          
          if(grepl(x = attrs_info,pattern = "a:rev") == TRUE) {
            break
          }
          
        }
        
        next_page[[j]]$clickElement()
        
      }
      
      Sys.sleep(delay)
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    }
    
    for (k in seq_along(stars)) {
      
      stars_info = as.character(stars[[k]]$getElementText())
      dates_info = as.character(dates[[k]]$getElementText())
      reviews_info = as.character(reviews[[k]]$getElementText())
      headlines_info = as.character(headlines[[k]]$getElementText())
      
      result_part = data.frame(date = dates_info, star = stars_info, headline = headlines_info,review = reviews_info)
      result = rbind(result,result_part)
    }
    
    
  }
  
  print("Collection complete... Close the browser.")
  remDr$quit()
  return(result)
  
}
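
The fallback branch above derives the page count from the total review count rather than the "맨뒤" link, assuming 20 reviews are shown per page:

```r
# Page-count arithmetic used in the fallback branch above:
# 20 reviews per page, so pages = ceiling(total / 20).
total_reviews = 1234  # hypothetical review count parsed from the page
pages = ceiling(total_reviews / 20)
print(pages)          # 62
```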

 

  • Apply the function
    • An example call is shown below
    • The execution result is shown in the attached images
site_url = "https://search.shopping.naver.com/catalog/23922223523?adId=nad-a001-02-000000118607953&channel=nshop.npla&query=%EB%85%B8%ED%8A%B8%EB%B6%81&NaPm=ct%3Dkosubz0w%7Cci%3D0zC0000EBE9u7Nze51no%7Ctr%3Dpla%7Chk%3Dd0ee57966fadc5a015d744528d6a57111faf9abd&cid=0zC0000EBE9u7Nze51no"
result = naver_pc_case_catalog(url = site_url)
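
The function returns a plain data frame, so the result can be saved directly, for example with data.table's fwrite (data.table is already loaded above); the file name here is arbitrary:

```r
# Persist the collected reviews and take a quick look at the table.
library(data.table)

fwrite(result, file = "naver_catalog_reviews.csv")  # arbitrary file name

print(nrow(result))     # total number of reviews collected
print(head(result, 3))  # columns: date / star / headline / review
```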

(attached images: execution results of the function call)

 

[Full Code]

Sys.setlocale("LC_ALL")
#Sys.setlocale("LC_CTYPE", ".1251")
library(RSelenium)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
library(foreach)
library(httr)
library(webdriver)
library(seleniumPipes)
library(readxl)
library(ggwordcloud)
library(wordcloud2)
library(htmlwidgets)
library(webshot)
library(log4r)
library(tcltk)
library(beepr)
library(noncompliance)
library(ggplot2)
library(fs)
source("./r2.R")

# Set up the web environment #
Sys.setlocale("LC_ALL")
options(encoding = "UTF-8")
Sys.setenv(LANG = "en_US.UTF-8")


## Create the Chrome driver object ##
remDr = remoteDriver(
  remoteServerAddr = "localhost"
  , port = 5000L
  , browserName = "chrome"
)

#?remoteDriver
## Create an Internet Explorer driver object (unused) ##
# remDrEdge = remoteDriver(
#   remoteServerAddr = "localhost"
#   , nativeEvents = FALSE
#   , port = 5001L
#   , browserName = "internet explorer"
# )

#################################### SUB ##############################
setWindowTab = function (remDr, windowId) {
  qpath = sprintf("%s/session/%s/window", remDr$serverURL, remDr$sessionInfo[["id"]])
  remDr$queryRD(qpath, "POST", qdata = list(handle = windowId))
}

getXpathText = function(xpath) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(xpath = xpath) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

getCssText = function(css) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    rvest::html_nodes(css = css) %>%
    rvest::html_text() %>%
    str_replace_all(pattern = "\n", replacement = " ") %>%
    str_replace_all(pattern = "[\\^]", replacement = " ") %>%
    str_replace_all(pattern = "\"", replacement = " ") %>%
    str_replace_all(pattern = "\\s+", replacement = " ") %>%
    str_trim(side = "both")
}

naver_pc_case_catalog <- function(url, delay = 2){
  
  if(!grepl(x = url, pattern = "https://search.shopping.naver.com/catalog/")){
    print("The URL pattern does not match. It must contain https://search.shopping.naver.com/catalog/.")
    return("error")
  }
  
  remDr$open()
  remDr$navigate(url)
  
  last_page = remDr$findElements(using = "link text",value = paste0("맨뒤"))
  
  if(length(last_page) >= 1){
    
    last_page_info = as.numeric(gsub('\\D','', as.character(last_page[[1]]$getElementAttribute(attrName = "data-nclick"))))
  } else {
    last_page_info = remDr$findElements(using = "xpath",value =  paste0("//a[@data-nclick='N=a:rev.grd']"))
    last_page_info = ceiling(as.numeric(gsub('\\D','', as.character(last_page_info[[1]]$getElementText()))) / 20)
  }
  
  print(paste0("Total ",last_page_info," pages found."))
  
  result = data.frame()
  
  for (i in 1:last_page_info) {
    
    print(paste0(i,"/",last_page_info," Collecting pages"))
    
    if(i %% 10 == 1){
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    } else {
      
      if(i %% 10 != 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0(i))
        
      } else if (i %% 10 == 0){
        
        next_page = remDr$findElements(using = "link text",value = paste0("다음"))
        
      }
      if(length(next_page) == 1) {
        
        next_page[[1]]$clickElement()
        
      } else {
        
        for (j in 1:length(next_page)) {
          
          attrs_info = next_page[[j]]$getElementAttribute(attrName = "data-nclick")
          
          if(grepl(x = attrs_info,pattern = "a:rev") == TRUE) {
            break
          }
          
        }
        
        next_page[[j]]$clickElement()
        
      }
      
      Sys.sleep(delay)
      
      headlines = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/em")) 
      reviews = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[2]/div[1]/p"))
      stars = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[1]"))
      dates = remDr$findElements(using = "xpath",value = paste0("//div[@id='section_review']/ul[1]/li[*]/div[1]/span[4]"))
      
    }
    
    for (k in seq_along(stars)) {
      
      stars_info = as.character(stars[[k]]$getElementText())
      dates_info = as.character(dates[[k]]$getElementText())
      reviews_info = as.character(reviews[[k]]$getElementText())
      headlines_info = as.character(headlines[[k]]$getElementText())
      
      result_part = data.frame(date = dates_info, star = stars_info, headline = headlines_info,review = reviews_info)
      result = rbind(result,result_part)
    }
    
    
  }
  
  print("Collection complete... Close the browser.")
  remDr$quit()
  return(result)
  
}

site_url = "https://search.shopping.naver.com/catalog/23922223523?adId=nad-a001-02-000000118607953&channel=nshop.npla&query=%EB%85%B8%ED%8A%B8%EB%B6%81&NaPm=ct%3Dkosubz0w%7Cci%3D0zC0000EBE9u7Nze51no%7Ctr%3Dpla%7Chk%3Dd0ee57966fadc5a015d744528d6a57111faf9abd&cid=0zC0000EBE9u7Nze51no"
result = naver_pc_case_catalog(url = site_url)

 

 

 References

[Papers]

  • None

[Reports]

  • None

[URL]

  • None

 

 Contact

[Meteorology / Programming Languages]

  • sangho.lee.1990@gmail.com

[Oceanography / Astronomy / Big Data]

  • saimang0804@gmail.com

This blog may receive a commission through affiliate (Partners) activity.