[R] SNU Statistical Research Institute (SNU-SRI) Training Course: Big Data Analysis with R

 Information

  • Task        : SNU Statistical Research Institute training course on big data analysis with R

  • Author      : 이상호

  • Date        : 2020-05-12

  • Description :

  • Revision history :

 

 Contents

[Overview]

  • Hello, this is 해솔, in charge of meteorological research and web development.

  • This post covers the SNU Statistical Research Institute's training course on big data analysis with R.

  • The course ran over five days, and each session combined lectures with hands-on practice.

  • What was taught has been particularly helpful when writing academic papers and reports.

  • Accordingly, the material is presented in the order "course schedule, practice materials, related materials".

 

 

[Features]

  • Written to aid understanding of big data training with R

 

[Topics]

  • Course schedule

  • Practice materials

  • Related materials

 

[Operating System]

  • Windows 10

 

[Language]

  • R v4.0.0

 

 Details

[Course Schedule]

  • Course outline

    • Day 1 : Data wrangling with the R tidyverse packages

    • Day 2 : Data Visualization

    • Day 3 : Regression Analysis

    • Day 4 : Classification and Clustering

    • Day 5 : Recommendation Systems

  • Daily timetable

    • 09:30-12:30 : Lecture

    • 13:30-15:30 : Hands-on practice

 

 Practice Materials

  • See the comments in the source code for explanations.

Day 1 : Data wrangling with the R tidyverse packages
  • Installing and loading the required packages
#################################################################################################
###                                                                                           ###
### Description : This is for the special course on "Big Data" of SNU-SRI.                    ### 
###               This concentrates on "Data Wrangling" with R package "dplyr" and "tidyr".   ###
###                                                                                           ###
### Reference : Jaimie (Jaimyoung) Kwon's github (and lecture note)                           ###
###                                                                                           ###
#################################################################################################

# Install required packages
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("data.table")


# Install packages needed for the exercises
install.packages("lubridate")
install.packages("Lahman")

# Install the example data packages
install.packages("nycflights13")
install.packages("babynames")


# Load the packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)

 

  • Practice code

#################################################################################################
###                                                                                           ###
### Description : This is for the special course on "Big Data" of SNU-SRI.                    ### 
###               This concentrates on "Data Wrangling" with R package "dplyr" and "tidyr".   ###
###                                                                                           ###
### Reference : Jaimie (Jaimyoung) Kwon's github (and lecture note)                           ###
###                                                                                           ###
#################################################################################################

#################################################################################################
##########                                    dplyr                                    ##########
#################################################################################################


library(nycflights13)

help(package="nycflights13")

# Shows the data transposed: each column (variable) is printed along with the values it takes, in order.
# This makes it convenient to inspect every column of a data frame in the console.
# Unlike str(), it prints as many values as will fit in the console.
glimpse(flights)
str(flights)

# Exercise 12/43
glimpse(airlines)
glimpse(airports)
glimpse(planes)
glimpse(weather)

flights
print(flights)

class(flights)

# x <- data.frame(c(1,1))
# class(x)
# class(tbl_df(x))

################################# Advanced exercise (13/43) #################################

head(as.data.frame(flights))


##################################### 1. filter() #####################################
# Description : keeps the rows that satisfy the given conditions.
# Basic usage  : filter(data frame, filtering condition(s))

# Flights that departed on January 1
filter(flights, month == 1, day == 1)
filter(flights, month == 1 & day == 1)

# Flights that departed in January or February
filter(flights, month == 1 | month == 2)

### slice() : selects rows by their row numbers.
# flights[1:10,] ==
slice(flights, 1:10)


##################################### 2. arrange() #####################################
# Description : sorts rows in ascending order of the given variables, i.e. the data are ordered
#               from the smallest to the largest value of the specified column(s).
# Basic usage  : arrange(data frame, column(s) to sort by in ascending order)

# Sort flights by year, month, and day
# flights[order(flights$year, flights$month, flights$day),] ==
arrange(flights, year, month, day)

### desc() : switches the sort to descending order.
# Sort so that the largest arr_delay comes first
arrange(flights, desc(arr_delay))


##################################### 3. select() #####################################
# Description : selects the columns you need; naming the columns explicitly is the most common usage.
# Basic usage  : select(data frame, names of the columns to keep)

# Select the columns named year, month, and day
select(flights, year, month, day)

# Select a range of columns
select(flights, year:arr_time)

# Select everything except the columns year, month, and day
select(flights, -year, -month, -day)

# Select everything except a range of columns
select(flights, -(year:arr_time))

### + starts_with() : selects every column whose name starts with the given string
# Drop every column whose name starts with "arr"
select(flights, -starts_with("arr"))

### + ends_with() : selects every column whose name ends with the given string
# Select every column whose name ends with "delay"
select(flights, ends_with("delay"))

### + matches() : selects every column whose name matches the given regular expression
# Select every column whose name has the form something_something
select(flights, matches("._."))
select(flights, matches(".r_t."))

### + contains() : selects every column whose name contains the given string
# Select every column whose name contains "_"
select(flights, contains("_"))
select(flights, contains("_."))

## See ?select for more helpers.

### rename() : renames a column.
rename(flights, tail_num = tailnum)


##################################### 4. mutate() #####################################
# Description : adds new column(s) computed from existing column(s).
#               A column created earlier in the same call can be used right away.
# Basic usage  : mutate(data frame, new column = expression)

mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60), speed = distance / air_time * 60)

# How does this compare with transform()?
mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60))
transform(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) # error: transform() cannot use a column created in the same call


##################################### 5. summarize() #####################################
# Description : collapses the data frame into a single summary row, i.e. computes summary statistics.
#               summarise() is an alias that does exactly the same thing.
# Basic usage  : summarize(data frame, desired summary statistics)

# Compute the mean of dep_delay
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

# Summary functions take a vector as input and return a single number.
# Base R provides min(), max(), mean(), sum(), sd(), median(), IQR(), and so on.
# dplyr adds n(), n_distinct(), first(), last(), and nth().
# They are most convenient in combination with group_by(), as shown below.
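# For example, a grouped summary per carrier (a minimal illustration; carrier and dep_delay are columns of flights):
summarize(group_by(flights, carrier), delay = mean(dep_delay, na.rm = TRUE), n = n())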


############################## 6. sample_n(), sample_frac() ##############################
# Description : sample_n() randomly samples a fixed number of rows.
#               sample_frac() randomly samples a fixed fraction of the rows.
# Basic usage  : sample_n(data frame, number of rows to sample)
#                sample_frac(data frame,
#                            fraction of the rows to sample)

# Randomly sample 10 rows
sample_n(flights, 10)

# Randomly sample 1% of the rows
sample_frac(flights, 0.01)

# Sampling is without replacement by default; the replace option switches to sampling with replacement.
# The weight option assigns sampling weights.
# For reproducible results, call set.seed() first, as in the sketch below.
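# A minimal sketch of those options (replace, weight, and set.seed()):
set.seed(123)                                  # fix the RNG so the sample is reproducible
sample_n(flights, 10, replace = TRUE,          # sample 10 rows with replacement,
         weight = distance)                    # weighted by the distance column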


##################################### 7. distinct() #####################################
# Description : finds the distinct (unique) rows of the data.
# Basic usage  : distinct(data frame, or a selection of its columns)

# Find the distinct tail numbers
distinct(select(flights, tailnum))

# Find the distinct origin-destination routes
distinct(select(flights, origin, dest))


############################# Language of data manipulation #############################
# (1) sort rows                    arrange()
# (2) select rows                  filter()
# (3) select columns               select()
# (4) compute new variables        mutate()
# (5) compute summary statistics   summarise()


########################### Grouped operations with group_by() ###########################
# Description : splits the data set into groups.
# Basic usage  : group_by(data frame, column(s) that define the groups)

flights
groups(flights)

(by_tailnum = group_by(flights, tailnum))
groups(by_tailnum)

### How the verbs behave on grouped data

## select() always keeps the grouping variables.
select(by_tailnum, year, month, day)

## arrange() sorts by the grouping variables first.
arrange(by_tailnum, day)
arrange(flights, day)

## mutate() and filter() are most useful together with window functions.
# vignette("window-functions")

# For each aircraft, keep the flight with the largest air_time
filter(by_tailnum, min_rank(desc(air_time)) == 1)

# For each aircraft, rank flights by air_time from smallest to largest
mutate(by_tailnum, rank = min_rank(air_time))

## sample_n() and sample_frac() sample within each group.
sample_n(by_tailnum, 1)

## slice() extracts rows within each group.
# Asking for more rows than a group contains simply returns the rows that exist.
slice(by_tailnum, 5)

# summarise() computes summary statistics for each group.
# See below.

#(by_tailnum = group_by(flights, tailnum))
delay = summarise(by_tailnum,
                  count = n(),
                  dist = mean(distance, na.rm = TRUE),
                  delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, count > 20, dist < 2000)
delay

### exercise 26/43
groups(by_tailnum)
by_tailnum = ungroup(by_tailnum)
groups(by_tailnum)

### exercise 26/43
rowwise(by_tailnum)
delay = summarise(rowwise(by_tailnum),
                  count = n(),
                  dist = mean(distance, na.rm = TRUE),
                  delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, dist < 2000)
delay


########################### Visualization with ggplot2 ###########################

by_tailnum = group_by(flights, tailnum)
delay = summarise(by_tailnum,
                  count = n(),
                  dist = mean(distance, na.rm = TRUE),
                  delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, count > 20, dist < 2000)
delay

ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

sum(select(delay, count)>=1000)
sum(is.na(select(delay, delay)))


########################### The %>% operator and chaining ###########################
# The downside of the functional nature of dplyr is that when you combine multiple data manipulation
# operations, you have to read from the inside out and the arguments may be very distant to the function
# call. These functions provide an alternative way of calling dplyr (and other data manipulation)
# functions that you can read from left to right. (from the "chain" section of the dplyr documentation)

# Method 1: temporary variables
a1 = group_by(flights, year, month, day)
a2 = select(a1, arr_delay, dep_delay)
a3 = summarise(a2,
               arr = mean(arr_delay, na.rm = TRUE),
               dep = mean(dep_delay, na.rm = TRUE))
(a4 = filter(a3, arr > 30 | dep > 30))

# Method 2: nested function calls
filter(
  summarise(
    select(
      group_by(flights, year, month, day),
      arr_delay, dep_delay
    ),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
    ),
  arr > 30 | dep > 30
)

# Both approaches hurt readability.

### The %>% operator

# simple1
head(iris)
iris %>% head

# simple2
head(iris, 10)
iris %>% head(10)

# Gives the same result as the temporary-variable and nested approaches above,
# and is far more readable.
flights %>% 
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)


################################## Exercise (31/43) ##################################

# Package for handling dates and times
library(lubridate)

# A locale is a C-level localization setting.
# Functions such as setlocale() and localeconv() control country-specific conventions: currency
# formatting, decimal notation (period vs. comma), date/time representation, and so on.
# [Naver encyclopedia] locale (Dictionary of Computer/Internet/IT Terms, 2011-01-20, Iljinsa)
Sys.setlocale("LC_TIME", "usa")
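# Note: the locale name "usa" works on Windows; on Linux/macOS something like
# Sys.setlocale("LC_TIME", "en_US.UTF-8") would be used instead.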

(per_day = flights %>%
  group_by(year, month, day) %>%
  summarise(flights = n(),
            distance = mean(distance),
            arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  mutate(dt_str = sprintf("%04d-%02d-%02d", year, month, day),
         dt = parse_date_time(dt_str, "ymd",tz = "US/Eastern"),
         dow = wday(dt, label=TRUE)
         ) 
)

# Number of distinct aircraft serving each destination
flights %>% group_by(tailnum, dest) %>% distinct() %>% group_by(dest) %>% summarise(count = n())

# Number of flights to each destination
flights %>% group_by(dest) %>% summarise(count = n())

# Number of flights per day
select(per_day, year:flights)
flights %>% group_by(year,month, day) %>% summarise(count = n())

# Number of flights per day of the week
ungroup(per_day) %>% group_by(dow) %>% summarise(count = sum(flights))
ungroup(per_day) %>% group_by(dow) %>% summarise(count = mean(flights))

per_day %>% ggplot(aes(dt, flights)) +
  geom_line() + geom_point() + geom_smooth()

per_day %>% ggplot(aes(dow, flights)) +
  geom_boxplot()

per_day %>% ggplot(aes(dt, dep)) +
  geom_line() + geom_point() + geom_smooth()

per_day %>% ggplot(aes(dow, dep)) +
  geom_boxplot()

per_day %>% ggplot(aes(dt, distance)) +
  geom_line() + geom_point() + geom_smooth()

per_day %>% ggplot(aes(dow, distance)) +
  geom_boxplot()


########################## Joining tables with the join functions ##########################

(df1 <- data_frame(x = c(1, 2), y = 2:1))
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))

# inner_join() keeps only the rows that match in both x and y: the intersection.
df1 %>% inner_join(df2)

# left_join() keeps every row of x; variables from y with no match become NA.
df1 %>% left_join(df2)

# right_join() keeps every row of y.
# Apart from the row order, it gives the same result as left_join(y, x).
df1 %>% right_join(df2)
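# Check the claim above: left_join(df2, df1) returns the same matched rows as the right_join above, just in a different order
df2 %>% left_join(df1)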

# full_join() keeps every row of both x and y: the union.
df1 %>% full_join(df2)


################################## Exercise (35/43) ##################################

(flights2 = flights %>% select(year:day, hour, origin, dest, tailnum, carrier))
airlines

flights2 %>% left_join(airlines)
# flights2 %>% left_join(airlines, by="carrier")


# rm(list = ls())
################################# Exercise (37/43)-1 #################################
ds <- tbl_df(mtcars)
ds
as.data.frame(ds)

if (require("Lahman") && packageVersion("Lahman") >= "3.0.1") {
  batting <- tbl_df(Batting)
  dim(batting)
  colnames(batting)
  head(batting)
  
  # Data manipulation verbs ---------------------------------------------------
  filter(batting, yearID > 2005, G > 130)
  select(batting, playerID:lgID)
  arrange(batting, playerID, desc(yearID))
  summarise(batting, G = mean(G), n = n())
  mutate(batting, rbi2 = if(is.null(AB)) 1.0 * R / AB else 0)
  
  # Group by operations -------------------------------------------------------
  # To perform operations by group, create a grouped object with group_by
  players <- group_by(batting, playerID)
  head(group_size(players), 100)
  
  summarise(players, mean_g = mean(G), best_ab = max(AB))
  best_year <- filter(players, AB == max(AB) | G == max(G))
  progress <- mutate(players, cyear = yearID - min(yearID) + 1,
                     rank(desc(AB)), cumsum(AB))
  
  # When you group by multiple levels, each summarise() peels off one level
  
  per_year <- group_by(batting, playerID, yearID)
  stints <- summarise(per_year, stints = max(stint))
  filter(stints, stints > 3)
  summarise(stints, max(stints))
  mutate(stints, cumsum(stints))
  
  
  # Joins ---------------------------------------------------------------------
  player_info <- select(tbl_df(Master), playerID, birthYear)
  hof <- select(filter(tbl_df(HallOfFame), inducted == "Y"),
                playerID, votedBy, category)
  
  # Match players and their hall of fame data
  inner_join(player_info, hof)
  # Keep all players, match hof data where available
  left_join(player_info, hof)
  # Find only players in hof
  semi_join(player_info, hof)
  # Find players not in hof
  anti_join(player_info, hof)
}


# rm(list = ls())
################################# Exercise (37/43)-2 #################################

library(babynames)
babynames
class(babynames)

babynames %>% group_by(year) %>% filter(min_rank(desc(n)) == 1)

babynames %>% group_by(year) %>% filter(min_rank(desc(n)) <= 10) %>% summarise(rate = sum(prop))


# rm(list = ls())
################################# Exercise (37/43)-3 #################################

# A file path could be given directly, but here the files are chosen interactively...
batting = read.csv(file.choose())
master = read.csv(file.choose())

head(batting)
head(master)

batting = tbl_df(batting)
master = tbl_df(master)

class(batting)
class(master)

# Compute each batter's batting average by year. ( AVG (batting average) = H (hits) / AB (at-bats) )

mutate(batting, AVG = H / AB) %>% select(playerID:lgID, AVG)

full.batting = mutate(batting, AVG = H / AB)

# Visualize the distribution of batting averages by year.

### plot
full.batting %>% mutate(year = factor(yearID)) %>% 
  group_by(year) %>% 
  ggplot(aes(year, AVG)) + geom_point() + geom_smooth(aes(group=1))

# The data are filtered before plotting, so treat this only as a rough reference.
full.batting %>% mutate(year = factor(yearID)) %>% 
  filter(AB >= 100, yearID >= 1985) %>% 
  group_by(year) %>% 
  ggplot(aes(year, AVG)) + geom_point() + geom_smooth(aes(group=1))

full.batting %>% filter(AB >= 100) %>% 
  group_by(yearID) %>% 
  summarise( n = n()) %>% 
  ggplot(aes(yearID, n))+ geom_point()

### box-plot
full.batting %>% mutate(year = factor(yearID)) %>% 
  group_by(year) %>% 
  ggplot(aes(year, AVG)) + geom_boxplot()

# The data are filtered before plotting, so treat this only as a rough reference.
full.batting %>% mutate(year = factor(yearID)) %>% 
  filter(AB >= 100, yearID >= 1985) %>% 
  group_by(year) %>% 
  ggplot(aes(year, AVG)) + geom_boxplot()


# Find the list of players whose best batting average was recorded in 1980 or later.

(AVGtable = full.batting %>% select(playerID:lgID, AVG))

# Behavior of the min_rank() function
cw <- c(1,2,3,4,NA, 2913)
min_rank(cw)

temp = AVGtable %>% group_by(playerID) %>% filter(min_rank(desc(AVG)) == 1 & yearID >= 1980)

AVGtable %>% filter(playerID == "blackti01")

# Join the result with the Master table to display the players' names.
(name = master %>% select(playerID, nameFirst, nameLast, nameGiven, retroID, bbrefID))

# Use select() to reorder the columns.
list = temp %>% left_join(name, by="playerID") %>% select(playerID, nameFirst:bbrefID, yearID:AVG)
list


#################################################################################################
##########                                    tidyr                                    ##########
#################################################################################################

# Key concepts : Long format <-> Wide format, Pivot table

# Core contents : spread() and gather() function

# library(tidyr)


##################################### 1. spread() #####################################
# Description : Long format -> Wide format
# Basic usage  : spread(data frame, key column that supplies the new column names, value column)

digest = flights %>% group_by(carrier, origin) %>% 
  summarise(air_time_mean = mean(air_time, na.rm=TRUE))
as.data.frame(digest)

# Pivot table
spread(digest, origin, air_time_mean)

flights %>% group_by(carrier, origin) %>% 
  summarise(air_time_mean = mean(air_time, na.rm=TRUE)) %>%
  spread(origin, air_time_mean)

# Fill NAs with 0
spread(digest, origin, air_time_mean, fill = 0)


##################################### 2. gather() #####################################
# Description : Wide format -> Long format
# Basic usage  : gather(data frame, name for the new key column, name for the new value column,
#                       columns to convert to long format)

series = flights %>% group_by(month, day) %>% 
  summarise(air_time_mean = mean(air_time, na.rm=TRUE), dep_delay_mean = mean(dep_delay, na.rm=TRUE))
series

series %>% gather(indicators, value, air_time_mean, dep_delay_mean)

series %>% gather(indicators, value, -month, -day)

 

Day 2 : Data Visualization
# Install Packages
install.packages('ggmap')
install.packages('ggplot2')
install.packages('tm')
install.packages('SnowballC')
install.packages('wordcloud')
install.packages('networkD3')
install.packages('curl')
install.packages('GGally')
install.packages('ggvis')
install.packages('shiny')
install.packages('plotly')

# Graphics for one-dimensional data
# Data: fine particulate matter (PM2.5)
setwd("~/Desktop/개인/Big Data 시각화 (2016.02)")
pollution <- read.csv("avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric"))
head(pollution)

##        pm25  fips region longitude latitude
## 1  9.771185 01003   east -87.74826 30.59278
## 2  9.993817 01027   east -85.84286 33.26581
## 3 10.688618 01033   east -87.72596 34.73148
## 4 11.337424 01049   east -85.79892 34.45913
## 5 12.119764 01055   east -86.03212 34.01860
## 6 10.827805 01069   east -85.35039 31.18973

# Five Number Summary
summary(pollution$pm25)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.383   8.549  10.050   9.836  11.360  18.440

# Boxplot
boxplot(pollution$pm25, col="blue")

boxplot(pollution$pm25, col="blue")
abline(h=12)

# Histogram
hist(pollution$pm25, col="green")
rug(pollution$pm25)

hist(pollution$pm25, col="green",breaks=100)
rug(pollution$pm25)

hist(pollution$pm25, col='green', breaks=c(2,5,8,10,14,18,20))

# Histogram with bar
hist(pollution$pm25, col="green")
abline(v=12,lwd=2)
abline(v=median(pollution$pm25), col="magenta", lwd=4)
abline(h=60, col="skyblue", lwd=4)

# Barplot
barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")

# Graphics for multi-dimensional data
# Multiple Boxplot
boxplot(pm25 ~ region, data = pollution, col = "red") #formula in R; y~x1+x2+...

# Multiple Histograms
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))  ### par(mfrow = c(1, 1)) or dev.off() to reset
hist(subset(pollution, region == "east")$pm25, col = "green") 
hist(subset(pollution, region == "west")$pm25, col = "green")

# Scatterplot
with(pollution, plot(latitude, pm25)) ## same as plot(pm25 ~ latitude, data=pollution)!
abline(h=12,lwd=2,lty=2)

with(pollution, plot(latitude, pm25, col = region)) 
abline(h=12,lwd=2,lty=2) 

# Multiple Scatterplots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1)) ## margin : c(bottom, left, top, right)
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West")) 
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East")) 

# R Plot System
# Base plotting system
# Data loading : airquality
library(datasets) 
with(airquality, plot(Wind, Ozone))

# Base Plot with Annotation
with(airquality, plot(Wind, Ozone))
title(main="Ozone and Wind in New York City") ## Add a title

# Base Plot with Regression Line
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20)) 
?points
model <- lm(Ozone ~ Wind, airquality)  
abline(model, lwd=2)
abline(1,2, lwd=2)  ### intercept, slope, lwd = line width

# Multiple Base Plots
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0)) 
with (airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation") 
  plot(Temp, Ozone, main = "Ozone and Temperature") 
  mtext("Ozone and Weather in New York City", outer = TRUE) # overall main title
}
)

# Lattice System in R
# Lattice system in R
library(lattice) 
str(airquality)

airquality <- transform(airquality, Month = factor(Month))  # Convert 'Month' to a factor variable
levels(airquality$Month)

xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))

xyplot(Ozone ~ Wind, data = airquality)

# Lattice, Why?
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50) 
y<-x+f-f*x+rnorm(100,sd=0.5)
f <- factor(f, labels = c("Group 1", "Group 2")) # group1(f=0; y=x+norm)=1:50, group2(f=1; y=1+norm)=51:100
xyplot(y ~ x)

xyplot(y ~ x | f, layout = c(2, 1)) 

# Lattice Panel Functions: Regression line
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) # First call default panel function 
panel.lmline(x, y, col = 2) # Overlay a simple linear regression line
})

ftn = function(x, y, ...) {
panel.xyplot(x, y, ...) # First call default panel function 
panel.lmline(x, y, col = 2) # Overlay a simple linear regression line
}

xyplot(y~x | f, panel=ftn)

# ggplot2
# Data ; MAACS
library("ggplot2")
load("maacs.Rda")

# basic ggplot ; using qplot (quick plot)
qplot(logpm25, NocturnalSympt, data = maacs, facets = . ~ bmicat,
      geom = c("point"))+stat_smooth(method="lm")
      
# Function ggplot
g<-ggplot(maacs, aes(logpm25, NocturnalSympt))

g

g+geom_point()

g+geom_point()+geom_smooth(method = "lm")

g+geom_point()+geom_smooth(method = "lm")+facet_grid(.~bmicat)

# Modifying Aesthetics
g + geom_point(color = "steelblue", size=4, alpha = 1/2)

g + geom_point(aes(color = bmicat), size=4, alpha=1/2)

# Modifying Labels
g + geom_point(aes(color = bmicat), size=4, alpha=1/2) +
  labs(title = "MAACS Cohort") + labs(x=expression("log"*PM[2.5]),y="Nocturnal Symptoms")
  
# Customizing the Smooth
g + geom_point(aes(color = bmicat), size=4, alpha=1/2) +
  geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
  
# Parallel Coordinate Plot
require(GGally)
head(mtcars)

ggparcoord(mtcars, columns = c(1,5:10))+geom_line()

# Using ggmap (map-based visualization)
# The ggmap package
library(ggmap)
geocode("Seoul National University")

##       lon      lat
## 1 126.978 37.56654

korea <- c(left = 124, bottom = 32, right = 132, top = 43.5)
map <- get_stamenmap(korea, zoom = 5, maptype = "toner-lite")
ggmap(map)

# Crime data for downtown Houston
# Basic plot on the map
# crime Data from Google Maps (included in ggmap packages)
str(crime)

## 'data.frame':    86314 obs. of  17 variables:
##  $ time    : POSIXt, format: "2010-01-01 15:00:00" "2010-01-01 15:00:00" ...
##  $ date    : chr  "1/1/2010" "1/1/2010" "1/1/2010" "1/1/2010" ...
##  $ hour    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ premise : chr  "18A" "13R" "20R" "20R" ...
##  $ offense : Factor w/ 7 levels "aggravated assault",..: 4 6 1 1 1 3 3 3 3 3 ...
##  $ beat    : chr  "15E30" "13D10" "16E20" "2A30" ...
##  $ block   : chr  "9600-9699" "4700-4799" "5000-5099" "1000-1099" ...
##  $ street  : chr  "marlive" "telephone" "wickview" "ashland" ...
##  $ type    : chr  "ln" "rd" "ln" "st" ...
##  $ suffix  : chr  "-" "-" "-" "-" ...
##  $ number  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ month   : Ord.factor w/ 8 levels "january"<"february"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day     : Ord.factor w/ 7 levels "monday"<"tuesday"<..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ location: chr  "apartment parking lot" "road / street / sidewalk" "residence / house" "residence / house" ...
##  $ address : chr  "9650 marlive ln" "4750 telephone rd" "5050 wickview ln" "1050 ashland st" ...
##  $ lon     : num  -95.4 -95.3 -95.5 -95.4 -95.4 ...
##  $ lat     : num  29.7 29.7 29.6 29.8 29.7 ...

downtown<-subset(crime,
                 -95.39681 <= lon & lon <= -95.34188 &
                   29.73631 <= lat & lat <= 29.78400)

qmplot(lon, lat, data = downtown, maptype = "toner-background", color = I("red"))

# Different source / type of map
# By changing the source (source option) and type (maptype option) of the background map,
# you can produce a visualization that suits the situation.

# With source = "stamen" (the default), the available maptype values are:
# "terrain", "terrain-background", "terrain-labels", "terrain-lines", "toner",
# "toner-2010", "toner-2011", "toner-background", "toner-hybrid", "toner-labels",
# "toner-lines", "toner-lite", "watercolor"

# With source = "google", the available maptype values are:
# "terrain", "satellite", "roadmap", and "hybrid"

qmplot(lon, lat, data = downtown, maptype = "watercolor", color = I("red"))

qmplot(lon, lat, data = downtown, maptype = "terrain-background", color = I("red"))

qmplot(lon, lat, data = downtown, maptype = "toner-hybrid", color = I("red")) 

qmplot(lon, lat, data = downtown, source = "google", maptype = "terrain", color = I("red"))

qmplot(lon, lat, data = downtown, source = "google", maptype = "satellite", color = I("red")) 

qmplot(lon, lat, data = downtown, source = "google", maptype = "hybrid", color = I("red"))

# Showing crime density
robberies <- subset(downtown, offense == "robbery")

qmplot(lon, lat, data = robberies, geom = "blank", zoom = 15, maptype = "toner-background", darken = .7) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA)+
  scale_fill_gradient2("Robbery\nPropensity", low = "white", mid = "yellow", high = "red", midpoint = 700)
  
# Showing crime density, drawn with the ggmap() function
downtownmap<-get_stamenmap(c(left=-95.39681,bottom= 29.73631,right=-95.34188,top=29.78400),zoom = 15, maptype = "toner-background", darken= .7)
ggmap(downtownmap, darken=.7)+
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA,data=robberies)+
  scale_fill_gradient2("Robbery\nPropensity", low = "white", mid = "yellow", high = "red", midpoint = 700)
  
# Visualization by type of offense
qmplot(lon, lat, data = downtown, maptype = "toner-background", color = offense) +
  facet_wrap(~ offense)
  
# Quiz1
# The file baseball.csv contains, for each of the 10 Korean professional baseball teams, the longitude (lon)
# and latitude (lat) of its home stadium and its 2015 attendance (crowd).
# Read the file (header = TRUE) and mark each team's home location on the map, using a different color per team.
# Hint :
# Adapt this code: qmplot(lon, lat, data = downtown, maptype = "toner-background", color = I("red"))

# Quiz1 - Advanced
# The Quiz1 result looks a little unfamiliar: the map does not show all of Korea at once, and giving all
# 10 teams different colors is not an easy way to tell them apart. baseball.csv also contains the
# attendance figures (crowd).
# Combining ggmap with ggplot functions lets you freely set the map extent and encode extra information,
# as in the sketch below.
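# A minimal sketch of one possible Quiz1 solution. Apart from lon, lat, and crowd, which the quiz
# states, the column name "team" is an assumption about baseball.csv:
baseball <- read.csv("baseball.csv", header = TRUE)
qmplot(lon, lat, data = baseball, maptype = "toner-lite", color = team, size = I(4))

# Advanced variant: fix the map to the whole country and map attendance (crowd) to point size
ggmap(get_stamenmap(c(left = 124, bottom = 33, right = 132, top = 39),
                    zoom = 7, maptype = "toner-lite")) +
  geom_point(data = baseball, aes(lon, lat, color = team, size = crowd), alpha = 0.8)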

# Visualizing text data
library(tm)
library(SnowballC)
library(wordcloud)

jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)

###Data Cleaning
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation) ## . , ? ! ()
jeopCorpus <- tm_map(jeopCorpus, removeWords, c("the", "this", "The", "This", stopwords('english'))) ## stopwords("en")
jeopCorpus <- tm_map(jeopCorpus, stemDocument) ## went -> go
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

# Visualizing network data
# Basic Network Diagram
library(networkD3)

src <- c("A", "A", "A", "A",
        "B", "B", "C", "C", "D")
target <- c("B", "C", "D", "J",
            "E", "F", "G", "H", "I")
networkData <- data.frame(src, target)

simpleNetwork(networkData)

# Sankey Diagram
library(networkD3)
URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)

head(Energy$nodes)

##                   name
## 1 Agricultural 'waste'
## 2       Bio-conversion
## 3               Liquid
## 4               Losses
## 5                Solid
## 6                  Gas

head(Energy$links)

##   source target   value
## 1      0      1 124.729
## 2      1      2   0.597
## 3      1      3  26.862
## 4      1      4 280.322
## 5      1      5  81.144
## 6      6      2  35.000

sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
             Target = "target", Value = "value", NodeID = "name",
             units = "TWh", fontSize = 12, nodeWidth = 30)

# Interactive Graphics
# ggvis system
# Data : mtcars data
# ggvis
# The %>% operator passes the left hand object to the first argument of the right hand expression

library(ggvis)
library(shiny)
p <- ggvis(mtcars, x = ~wt, y = ~mpg)
layer_points(p)

# ggvis using pipe function “%>%”
mtcars %>%   ## DATA
  ggvis(x = ~wt, y = ~mpg) %>%  ## model
  layer_points()   ## plotting
  
# Different Layers
mtcars %>%   
  ggvis(x = ~wt, y = ~mpg) %>%  
  layer_points()  
  
mtcars %>%   
  ggvis(x = ~wt, y = ~mpg) %>%  
  layer_lines()  
  
mtcars %>%   
  ggvis(x = ~wt, y = ~mpg) %>%  
  layer_bars() 

mtcars %>%   
  ggvis(x = ~wt, y = ~mpg) %>%  
  layer_smooths() 

# Interactive Controls
mtcars %>% 
  ggvis(~wt, ~mpg, 
    size := input_slider(10, 100),   # := set a property to a value
    #size = ~wt,                     # = set a property w.r.t a variable with ~variable format
    opacity := input_slider(0, 1)
    ) %>% 
  layer_points() ##Press Escape/Ctrl + C to stop.
  
mtcars %>% 
  ggvis(~wt, ~mpg, 
    size := input_numeric(30),   
    opacity := input_radiobuttons(choices=c(0.2,0.4,0.6,0.8,1))
    ) %>% 
  layer_points() ##Press Escape/Ctrl + C to stop.

# Examples
mtcars %>% ggvis(x=~wt) %>%
  layer_densities(
    adjust = input_slider(.1,2, value =1, step=.1, label="Bandwidth adjustment"),
    kernel = input_select(
      c("Gaussian"="gaussian","Epanechnikov"="epanechnikov","Rectangular"="rectangular", "Triangular"="triangular", "Biweight"="biweight","Cosine"="cosine", "Optcosine"="optcosine"), label="Kernel"))

# Quiz2
# With base R plotting we drew a scatter plot of the "pollution" data like this:
with(pollution, plot(latitude, pm25))

# This time, draw the same plot with the "ggvis" package and the pipe operator "%>%",
# making the opacity, size, and fill color interactively adjustable, as sketched below.

# Hint
# opacity ("opacity" option) : use input_slider
# size ("size" option) : use input_numeric
# fill color ("fill" option) : use input_select
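# A minimal sketch of one possible Quiz2 solution, following the hint:
pollution %>%
  ggvis(x = ~latitude, y = ~pm25,
        opacity := input_slider(0, 1, value = 0.5),               # transparency slider
        size    := input_numeric(50),                             # point size input box
        fill    := input_select(c("black", "red", "blue"))) %>%   # color drop-down
  layer_points()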

# plotly system
# plotly
library(plotly)
set.seed(100)
d<-diamonds[sample(nrow(diamonds),1000), ]
plot_ly(d, x=carat, y=price, text=
          paste("Clarity: ", clarity),
        mode = "markers", color = carat, size= carat)


plot_ly(ggplot2::diamonds, y=price, color=cut, type="box")

plot_ly(z=volcano, type="surface")

 

 

Day 3 : Regression Analysis
install.packages("ISLR")

# Fitting a simple linear regression model
Advertising = read.csv("~/Desktop/Advertising.csv", header = TRUE)
lm.fit = lm(Sales ~ TV, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119,Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
confint(lm.fit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256

# Fitting a multiple regression model
lm.fit = lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972,Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
cor(Advertising)
## X TV Radio Newspaper Sales
## X 1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
## TV 0.01771469 1.00000000 0.05480866 0.05664787 0.78222442
## Radio -0.11068044 0.05480866 1.00000000 0.35410375 0.57622257
## Newspaper -0.15494414 0.05664787 0.35410375 1.00000000 0.22829903
## Sales -0.05161625 0.78222442 0.57622257 0.22829903 1.00000000

# Dummy variables
Credit = read.csv("~/Desktop/Credit.csv", header = TRUE)
lm.fit = lm(Balance ~ Gender, data = Credit)
summary(lm.fit)
##
## Call:
## lm(formula = Balance ~ Gender, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## GenderFemale 19.73 46.05 0.429 0.669
## ---
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611,Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685

# A regression model including dummy variables
lm.fit = lm(Balance~ Ethnicity, data=Credit)
summary(lm.fit)
##
## Call:
## lm(formula = Balance ~ Ethnicity, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## EthnicityAsian -18.69 65.02 -0.287 0.774
## EthnicityCaucasian -12.50 56.68 -0.221 0.826
## ---
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188,Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575

# Relaxing the additivity assumption (interaction terms)
lm.fit = lm(Sales ~ TV*Radio, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV * Radio, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678,Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16

# Variable selection
library(ISLR)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0

# Model-selection statistics (RSS, adjusted R-squared, Cp, BIC) as the number of variables changes
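# Note: reg.summary used below is not defined anywhere above. A minimal sketch of the omitted
# step, assuming best-subset selection with the leaps package (as in the ISLR lab):
# install.packages("leaps")
library(leaps)
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)  # best subsets with up to 19 predictors
reg.summary = summary(regfit.full)                                 # provides $rss, $adjr2, $cp, $bic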
par(mfrow = c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
which.max(reg.summary$adjr2)
## [1] 11
points(11, reg.summary$adjr2[11], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
## [1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
which.min(reg.summary$bic)
## [1] 6
points(6, reg.summary$bic[6], col = "red", pch = 20)

# Ridge Regression and the LASSO
x = model.matrix(Salary ~ . , data = Hitters)[,-1]
y = Hitters$Salary

library(glmnet)
grid = 10^seq(10, -2, length=100)
ridge.mod = glmnet(x, y, alpha=0, lambda = grid)

dim(coef(ridge.mod))
## [1] 20 100

ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs
## 407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
## RBI Walks Years CAtBat CHits
## 0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
## CHmRun CRuns CRBI CWalks LeagueN
## 0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
## DivisionW PutOuts Assists Errors NewLeagueN
## -6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
## (Intercept) AtBat Hits HmRun Runs
## 54.32519950 0.11211115 0.65622409 1.17980910 0.93769713
## RBI Walks Years CAtBat CHits
## 0.84718546 1.31987948 2.59640425 0.01083413 0.04674557
## CHmRun CRuns CRBI CWalks LeagueN
## 0.33777318 0.09355528 0.09780402 0.07189612 13.68370191
## DivisionW PutOuts Assists Errors NewLeagueN
## -54.65877750 0.11852289 0.01606037 -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001

predict(ridge.mod, s=50, type = "coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
## RBI Walks Years CAtBat CHits
## 8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00

set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y.test = y[test]

ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s=4, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 101036.8

mean((mean(y[train]) - y.test)^2)
## [1] 193253.1
ridge.pred = predict(ridge.mod, s=1e10, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 193253.1

set.seed(1)
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam
## [1] 211.7416

ridge.pred = predict(ridge.mod, s=bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 96015.51

out = glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 9.88487157 0.03143991 1.00882875 0.13927624 1.11320781
## RBI Walks Years CAtBat CHits
## 0.87318990 1.80410229 0.13074381 0.01113978 0.06489843
## CHmRun CRuns CRBI CWalks LeagueN
## 0.45158546 0.12900049 0.13737712 0.02908572 27.18227535
## DivisionW PutOuts Assists Errors NewLeagueN
## -91.63411299 0.19149252 0.04254536 -1.81244470 7.21208390

# The LASSO case
lasso.mod=glmnet(x[train ,],y[train],alpha =1, lambda =grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train ,],y[train],alpha =1)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 16.78016
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test ,])
mean((lasso.pred-y.test)^2)
## [1] 100743.4
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam )[1:20 ,]
lasso.coef

## (Intercept) AtBat Hits HmRun Runs
## 18.5394844 0.0000000 1.8735390 0.0000000 0.0000000
## RBI Walks Years CAtBat CHits
## 0.0000000 2.2178444 0.0000000 0.0000000 0.0000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.0000000 0.2071252 0.4130132 0.0000000 3.2666677
## DivisionW PutOuts Assists Errors NewLeagueN
## -103.4845458 0.2204284 0.0000000 0.0000000 0.0000000

# Fitting polynomial regression models
fit = lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02

fit2 = lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(fit2))
fit2a = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
coef(summary(fit2a))
fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
coef(summary(fit2b))

agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)

attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey", main = "Degree-4 Polynomial")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd=1, col = "blue", lty = 3)

# Regression splines
library(splines)
fit = lm(wage~bs(age, knots=c(25,40,60)), data = Wage)
pred = predict(fit, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2*pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2*pred$se, lty = "dashed")

# Natural splines
fit2 = lm(wage ~ ns(age, df=4), data = Wage)
pred2 = predict(fit2, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred2$fit, col = "red", lwd = 2)

# Smoothing splines
plot(age, wage, xlim=agelims, cex = .5, col = "darkgrey", main = "Smoothing Spline")
fit = smooth.spline(age, wage, df = 16)
fit2 = smooth.spline(age, wage, cv = TRUE)
fit2$df
## [1] 6.794596
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright",legend=c("16 DF","6.8 DF"),col=c("red","blue"),lty=1, lwd=2, cex=.8)

# Local regression
plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
title("Local Regression")
fit=loess(wage~age,span=.2,data=Wage)
fit2=loess(wage~age,span=.5,data=Wage)
lines(age.grid,predict(fit,data.frame(age=age.grid)),col="red",lwd=2)
lines(age.grid,predict(fit2,data.frame(age=age.grid)),col="blue",lwd=2)
legend("topright",legend=c("Span=0.2","Span=0.5"),col=c("red","blue"),lty=1,lwd=2,cex=.8)

# Generalized Additive Models (GAMs)
gam1 = lm(wage~ns(year, 4)+ns(age, 5) + education, data=Wage)
library(gam)
gam.m3 = gam(wage~s(year, 4) + s(age, 5) + education, data=Wage)
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---

preds = predict(gam.m3, newdata = Wage)

par(mfrow = c(1,3))
plot(gam.m3, se=TRUE, col = "blue")

plot.gam(gam1, se=TRUE, col = "red")

gam.m1=gam(wage~s(age,5)+education,data=Wage)
gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
anova(gam.m1,gam.m2,gam.m3,test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---

gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,data=Wage)
par(mfrow=c(1,3))
plot.gam(gam.lo, se=TRUE, col="green")

gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)
library(akima)
par(mfrow=c(1,2))
plot(gam.lo.i)

table(education,I(wage>250))
##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45

par(mfrow=c(1,3))
gam.lr.s=gam(I(wage>250)~year+s(age, df=5)+education, family=binomial
, data=Wage, subset =( education !="1. < HS Grad"))
plot(gam.lr.s, se=T, col="green")

 

Day 4 : Classification and Clustering

[Practice]

  • Classification

## Data for An Introduction to Statistical Learning with Applications in R
library(ISLR)


## for lda() and qda()
library(MASS)

## for knn()
library(class)

## Plotting
library(ggplot2)
library(gridExtra)

data(package = "ISLR")

## Load Smarket (S&P Stock Market Data)
data(Smarket)
head(Smarket)


names(Smarket)
dim(Smarket)
summary(Smarket)
pairs(Smarket)

attach(Smarket)



# Logistic Regression

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
summary(glm.fit)
coef(glm.fit)
summary(glm.fit)$coef
summary(glm.fit)$coef[,4]
glm.probs=predict(glm.fit,type="response")
glm.probs[1:10]
contrasts(Direction)

glm.pred=rep("Down",1250)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction)
(507+145)/1250
mean(glm.pred==Direction)

train=(Year<2005)
Smarket.2005=Smarket[!train,]
dim(Smarket.2005)
Direction.2005=Direction[!train]

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
mean(glm.pred==Direction.2005)
mean(glm.pred!=Direction.2005)


## Plot

plotData <- ggplot(data = Smarket,
                   mapping = aes(x = Lag1, y = Lag2, color = Direction, shape = factor(Year))) +
  geom_point(stat = "identity", alpha = 0.8) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Original data")
plotData


glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial(link = "logit"),subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
mean(glm.pred==Direction.2005)
106/(106+76)
predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")




## Plot
plotData <- ggplot(data = Smarket.2005,
                   mapping = aes(x = Lag1, y = Lag2, color = Direction, shape = factor(Year))) +
  geom_point(stat = "identity", alpha = 0.8) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Test data")
plotData

## Put the predicted probability
predProbLogit <- predict(glm.fit,Smarket.2005,type="response")
predClassLogit <- factor(predict(glm.fit,Smarket.2005,type="response") > 0.5,
                                 levels = c(FALSE,TRUE), labels = c("Down","Up"))

## Plot (probability)
plotLogit <- ggplot(data = Smarket.2005,
                    mapping = aes(x = Lag1, y = Lag2, color = predProbLogit, shape = factor(Year))) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "yellow", high = "red") +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted probability of outcome (Logistic)")
grid.arrange(plotData, plotLogit, ncol = 2)


plotLdaClass <- ggplot(data = Smarket.2005,
                       mapping = aes(x = Lag1, y = Lag2, color = predClassLogit, shape = factor(Year))) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted outcome (Logistic; p>0.5)")
grid.arrange(plotData, plotLdaClass, ncol = 2)



#Linear Discriminant Analysis

library(MASS)
lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
lda.fit
plot(lda.fit)
lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
lda.class=lda.pred$class
table(lda.class,Direction.2005)
mean(lda.class==Direction.2005)
sum(lda.pred$posterior[,1]>=.5)
sum(lda.pred$posterior[,1]<.5)
lda.pred$posterior[1:20,1]
lda.class[1:20]
sum(lda.pred$posterior[,1]>.9)




## Put into the dataset
predProbLda <- lda.pred$posterior[,"Up"]
predClassLda <- lda.pred$class

## Plot (probability)
plotLdaProb <- ggplot(Smarket.2005,
                      mapping = aes(x = Lag1, y = Lag2, color = predProbLda, shape = factor(Year))) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "yellow", high = "red") +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted probability of outcome (LDA)")
grid.arrange(plotData, plotLdaProb, ncol = 2)

## Plot (classification)
plotLdaClass <- ggplot(Smarket.2005,
              mapping = aes(x = Lag1, y = Lag2, color = predClassLda, shape = factor(Year))) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted outcome (LDA)")
grid.arrange(plotData, plotLdaClass, ncol = 2)



# Quadratic Discriminant Analysis

qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
qda.fit
qda.pred=predict(qda.fit,Smarket.2005)
qda.class=qda.pred$class
table(qda.class,Direction.2005)
mean(qda.class==Direction.2005)

## Put into the dataset
predProbQda <- qda.pred$posterior[,"Up"]
predClassQda <- qda.pred$class

## Plot (probability)
plotQdaProb <- ggplot(Smarket.2005,
                      mapping = aes(x = Lag1, y = Lag2, color = predProbQda, shape = factor(Year))) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "yellow", high = "red") +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted probability of outcome (QDA)")
grid.arrange(plotData, plotQdaProb, ncol = 2)

## Plot (classification)
plotQdaClass <- ggplot(Smarket.2005,
                       mapping = aes(x = Lag1, y = Lag2,  color = predClassQda, shape = factor(Year))) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted outcome (QDA)")
grid.arrange(plotData, plotQdaClass, ncol = 2)




# K-Nearest Neighbors

library(class)
train.X=cbind(Lag1,Lag2)[train,]
test.X=cbind(Lag1,Lag2)[!train,]
train.Direction=Direction[train]
set.seed(1)
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,Direction.2005)
(83+43)/252
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,Direction.2005)
mean(knn.pred==Direction.2005)



## Fit KNN (the output is a vector of prediction)
knn.pred <- knn(train = train.X,
              test  = test.X,
              cl    = train.Direction, k = 3)
predClassKnn <- knn.pred

## Plot (classification)
plotKnnClass <- ggplot(Smarket.2005,
                       mapping = aes(x = Lag1, y = Lag2, color = predClassKnn, shape = factor(Year))) +
  geom_point(alpha = 0.8) +
  scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
  theme_bw() +
  theme(legend.key = element_blank()) +
  labs(title = "Predicted outcome (KNN)")
grid.arrange(plotData, plotKnnClass, ncol = 2)

 

  • Cluster analysis and factor analysis

# K-Means Clustering

library(ggplot2)

set.seed(2)
nd=matrix(rnorm(50*2), ncol=2)
nd[1:25,1]=nd[1:25,1]+3
nd[1:25,2]=nd[1:25,2]-4
colnames(nd)=c("x1","x2")
km.out=kmeans(nd,2,nstart=20)
km.out$cluster
cluster=factor(km.out$cluster)
nd1=data.frame(nd,cluster)

c1=ggplot(nd1, aes(x1, x2 , color = cluster)) +
  geom_point(alpha = 0.4, size = 3.5) + labs(title = "K-Means Clustering Results with K=2")
print(c1)
dev.off()



set.seed(4)
km.out=kmeans(nd,3,nstart=20)
km.out
cluster=factor(km.out$cluster)
nd2=data.frame(nd,cluster)

c2=ggplot(nd2, aes(x1, x2 , color = cluster)) +
  geom_point(size = 3.5) + labs(title = "K-Means Clustering Results with K=3")
print(c2)

centers=as.data.frame(km.out$centers)
ggplot(data=nd2, aes(x=x1, y=x2,color=cluster )) + 
  geom_point() + 
  geom_point(data=centers, aes(x=x1,y=x2, color='Center')) +
  geom_point(data=centers, aes(x=x1,y=x2, color='Center'), size=52, alpha=.3, show.legend=FALSE)


set.seed(3)
km.out=kmeans(nd,3,nstart=1)
km.out$tot.withinss
km.out=kmeans(nd,3,nstart=20)
km.out$tot.withinss

#cf)
library(ggfortify)
autoplot(km.out, data=nd1,frame = TRUE)
autoplot(km.out, data=nd1,frame = TRUE, frame.type = 'norm')
autoplot(km.out, data=nd2, frame.type = "convex", frame.level = 0.95,
         frame.alpha = 0.2, main="K-Means Clustering Results with K=3")




# Hierarchical Clustering

hc.complete=hclust(dist(nd), method="complete")
hc.average=hclust(dist(nd), method="average")
hc.single=hclust(dist(nd), method="single")



# your own labels (currently the rownames) are supplied in geom_text() via label=label
library(ggdendro)   # needed for dendro_data() and ggdendrogram()
dendr= dendro_data(hc.complete, type="rectangle") 
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Complete Linkage")

dendr = dendro_data(hc.average, type="rectangle") 
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Average Linkage")

dendr = dendro_data(hc.single, type="rectangle") 
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Single Linkage")



ndsc=scale(nd)
hcsc.complete=hclust(dist(ndsc), method="complete")
dendr = dendro_data(hcsc.complete, type="rectangle") 
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Hierarchical Clustering with Scaled Features")


nd3=matrix(rnorm(30*3), ncol=3)
ndd=as.dist(1-cor(t(nd3)))
hcc.complete=hclust(ndd, method="complete")
dendr = dendro_data(hcc.complete, type="rectangle") 
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Complete Linkage with Correlation-Based Distance")

cutree(hc.complete, 2)
cutree(hc.average, 2)
cutree(hc.single, 2)
cutree(hc.single, 4)


hc.average=hclust(dist(nd), method="average")       # hierarchical clustering
dendr  = dendro_data(hc.average, type="rectangle")  # convert for ggplot
clust   = cutree(hc.average,k=3)                    # find 3 clusters
names(clust)=c(1:50)
clust.df <- data.frame(label=names(clust), cluster=factor(clust))
# dendr[["labels"]] has the labels, merge with clust.df based on label column
dendr[["labels"]] <- merge(dendr[["labels"]],clust.df, by="label")


# plot the dendrogram; note use of color=cluster in geom_text(...)
ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x, y, label=label, hjust=0, color=cluster), 
            size=3) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())



#(cf)
# demonstrate plotting directly from object class hclust
ggdendrogram(hc.complete)
ggdendrogram(hc.complete, rotate = TRUE, size = 5, theme_dendro = FALSE)

# demonstrate converting hclust to dendro using dendro_data first
hcdata <- dendro_data(hc.complete)
ggdendrogram(hcdata, rotate=TRUE) + 
  labs(title="Dendrogram in ggplot2")

# Triangular lines
ddata <- dendro_data(as.dendrogram(hc.complete), type = "triangle")
ggplot(segment(ddata)) + geom_segment(aes(x = x, y = y, xend = xend, 
                                          yend = yend)) + 
  geom_text(data = label(ddata), aes(x = x,y = y, label = label), angle = 90, lineheight = 0)




# factor analysis
life<-read.table("life_expectation.txt",header=TRUE)

cor.life<-cor(life)
library(corrplot)
corrplot(cor.life, type="upper", order="hclust", tl.col="black", tl.srt=45)


d.factanal=factanal(life,factors=1, method="mle",rotation="varimax")


d.factanal=factanal(life,factors=2, method="mle",rotation="varimax")


d.factanal=factanal(life,factors=3, method="mle",rotation="varimax")


# How could we interpret the three factors?

sapply(1:3, function(f)
  factanal(life, factors = f, method="mle")$PVAL)


# The communalities:

life.exp <- factanal(life,factors=3, rotation="varimax")
1-life.exp$uniquenesses
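# Equivalently, each communality is the sum of squared loadings for that variable,
# so it can also be computed directly from the loading matrix as a quick consistency check:
rowSums(life.exp$loadings[, 1:3]^2)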




### Plot loadings against one another

# Factor 1 vs. Factor 2
text=rownames(d.factanal$loadings)
load = d.factanal$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
  geom_point() + 
  geom_text(angle = 20,size=5)+
  labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")

# Factor 1 vs. Factor 3
text=rownames(d.factanal$loadings)
load = d.factanal$loadings
ggplot(data.frame(load[,c(1,3)]),aes(Factor1, Factor3,label = text))+
  geom_point() + 
  geom_text(angle = 20,size=5)+
  labs(title="Factor 1 vs Factor 3", x="Factor1 loadings", y="Factor 3 loadings")


# Factor 2 vs. Factor 3:

ggplot(data.frame(load[, 2:3]),aes(Factor2, Factor3,label = text))+
  geom_point() + 
  geom_text(angle = 20,size=5)+
  labs(title="Factor 2 vs Factor 3", x="Factor2 loadings", y="Factor 3 loadings")



# Estimating factor scores for the life expectancy data set:

scores <- factanal(life,factors=3, rotation="varimax",scores="regression")$scores

### Doing separate 2-D scatterplots of the factor scores:

# Factor 1 vs. Factor 2
scores[,c(1,2,3)]
text=rownames(scores)
ggplot(data.frame(scores[, 1:2]),aes(Factor1, Factor2,label = text))+
  geom_point() + 
  geom_text(angle = 20)+
  labs(title="Factor 1 vs Factor 2", x="Factor1 scores", y="Factor 2 scores")


# Factor 1 vs. Factor 3:

ggplot(data.frame(scores[, c(1,3)]),aes(Factor1, Factor3,label = text))+
  geom_point() + 
  geom_text(angle = 20)+
  labs(title="Factor 1 vs Factor 3", x="Factor1 scores", y="Factor 3 scores")

# Factor 2 vs. Factor 3:

ggplot(data.frame(scores[, 2:3]),aes(Factor2, Factor3,label = text))+
  geom_point() + 
  geom_text(angle = 20)+
  labs(title="Factor 2 vs Factor 3", x="Factor2 scores", y="Factor 3 scores")



#(cf)1.
d.factanal = factanal(life, factors = 3, scores = 'regression')
autoplot(d.factanal, label = TRUE, label.size = 3,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size  = 3)



#(cf)2.
#Using fa() from package psych with rotation
install.packages("psych")
library(psych)
d.fa = fa(r=cor.life, nfactors=3, n.obs=31, rotate="varimax")
d.fa

#Factor scores
scores <- factor.scores(x=life, f=d.fa, method="regression")
head(scores)

#Visualize loadings
factor.plot(d.fa, cut=0.5)

 

  •  Tree models

### Decision Trees

# Fitting Classification Trees

Carseats<-read.table("Carseats.txt")

library(tree)
library(ggplot2)
library(ggdendro)
attach(Carseats)
High=as.factor(ifelse(Sales<=8,"No","Yes"))
Carseats=data.frame(Carseats,High)
tree.carseats=tree(High~.-Sales, Carseats)
plot(tree.carseats)
text(tree.carseats)
summary(tree.carseats)
dev.off()



# Confusion matrix
set.seed(2)
train=sample(1:nrow(Carseats), 200)
Carseats.test=Carseats[-train,]
High.test=High[-train]
tree.carseats=tree(High~.-Sales,Carseats,subset=train)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(86+57)/200
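# The hard-coded (86+57)/200 is just the sum of the diagonal of the confusion matrix
# divided by the test-set size; the same accuracy (and error rate) can be computed directly:
mean(tree.pred == High.test)   # test accuracy
mean(tree.pred != High.test)   # test error rate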


# perform cross-validation in order to determine the optimal level of tree complexity

set.seed(3)
cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)
# prune.tree() enacts optimal deviance pruning
# prune.misclass() enacts optimal misclassification rate pruning

names(cv.carseats)
cv.carseats
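# (cf) A minimal sketch of the deviance-based alternative mentioned above: calling
# cv.tree() with the default FUN = prune.tree cross-validates on deviance rather than
# on the misclassification rate.
cv.carseats.dev = cv.tree(tree.carseats, FUN = prune.tree)
cv.carseats.dev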

# plot the error rate as a function of both size and k
library(gridExtra)
cv=data.frame(cv.carseats$size,cv.carseats$k,cv.carseats$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
cv1=ggplot(cv,aes(x=size,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()


cv2=ggplot(cv,aes(x=k,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()

grid.arrange(cv1, cv2, nrow=1, ncol=2)


prune.carseats=prune.misclass(tree.carseats,best=9)
plot(prune.carseats)
text(prune.carseats,pretty=0)


# Confusion matrix

tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(94+60)/200



prune.carseats=prune.misclass(tree.carseats,best=15)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(86+62)/200


# (cf)
library(rpart)


tree.carseats = rpart(High~.-Sales, method = "class", data = Carseats)

# display the results 
printcp(tree.carseats) 

# Classification tree diagrams
ddata <- dendro_data(tree.carseats)
ggplot() + 
  geom_segment(data = ddata$segments, 
               aes(x = x, y = y, xend = xend, yend = yend)) + 
  geom_text(data = ddata$labels, 
            aes(x = x, y = y, label = label), size = 4, vjust = 0) +
  geom_text(data = ddata$leaf_labels, 
            aes(x = x, y = y, label = label), size = 4, vjust = 1) +
  theme_dendro()

# visualize cross-validation results 
plotcp(tree.carseats) 

# plot tree 
plot(tree.carseats, uniform=TRUE, 
     main="Classification Tree for Carseats")
text(tree.carseats, use.n=TRUE, all=TRUE, cex=.8)

# prune the tree 
pfit<- prune(tree.carseats, cp=tree.carseats$cptable[which.min(tree.carseats$cptable[,"xerror"]),"CP"])

# plot the pruned tree 
plot(pfit, uniform=TRUE, 
     main="Pruned Classification Tree for Carseats")
text(pfit, use.n=F, all=TRUE, cex=.8)


# perform cross-validation in order to determine the optimal level of tree complexity
ddata <- dendro_data(pfit)
ggplot() + 
  geom_segment(data = ddata$segments, 
               aes(x = x, y = y, xend = xend, yend = yend)) + 
  geom_text(data = ddata$labels, 
            aes(x = x, y = y, label = label), size = 4, vjust = 0) +
  geom_text(data = ddata$leaf_labels, 
            aes(x = x, y = y, label = label), size = 4, vjust = 1) +
  theme_dendro()



install.packages("rpart.plot")
library(rpart.plot)
prp(pfit, faclen = 0, cex = 0.8, extra = 1)

only_count <- function(x, labs, digits, varlen)
{
  paste(x$frame$n)
}

boxcols <- c("pink", "lightblue")[pfit$frame$yval]

par(xpd=TRUE)
prp(pfit, faclen = 0, cex = 0.8, node.fun=only_count, box.col = boxcols)
legend("bottomright", legend = c("No","Yes"), fill = c("pink", "lightblue"),
       title = "Group")





# Fitting Regression Trees

library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~.,Boston,subset=train)
summary(tree.boston)


# Regression tree diagrams

tree_data <- dendro_data(tree.boston)
ggplot(segment(tree_data)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n), 
               colour = "blue", alpha = 0.5) +
  scale_size("n") +
  geom_text(data = label(tree_data), 
            aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
  geom_text(data = leaf_label(tree_data), 
            aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
  theme_dendro()+
  labs(title="Regression tree diagrams")



# perform cross-validation in order to determine the optimal level of tree complexity
cv.boston=cv.tree(tree.boston)

# plot the error rate as a function of size

cv=data.frame(cv.boston$size,cv.boston$k,cv.boston$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
ggplot(cv,aes(x=size,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()

# plot the pruned tree 
prune.boston=prune.tree(tree.boston,best=5)
tree_data <- dendro_data(prune.boston)
ggplot(segment(tree_data)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n), 
               colour = "blue", alpha = 0.5) +
  scale_size("n") +
  geom_text(data = label(tree_data), 
            aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
  geom_text(data = leaf_label(tree_data), 
            aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
  theme_dendro()+
  labs(title="Regression tree diagrams")


# the test set MSE associated with the regression tree

yhat=predict(tree.boston,newdata=Boston[-train,])
boston.test=Boston[-train,"medv"]
mean((yhat-boston.test)^2)
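# Since medv is recorded in units of $1,000, the square root of the test MSE can be read
# as a typical prediction error in thousands of dollars:
sqrt(mean((yhat - boston.test)^2))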

ggplot()+
  geom_point(aes(x=yhat, y=boston.test))+
  geom_abline(intercept = 0, slope = 1, color="red", 
              linetype="dashed", size=1.5)+
  xlab(expression(hat(y)))+
         ylab(expression(y[0]))











# Bagging and Random Forests

# Bagging
library(randomForest)
set.seed(1)
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston


# the test MSE for bagging
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)

ggplot()+
  geom_point(aes(x=yhat.bag, y=boston.test))+
  geom_abline(intercept = 0, slope = 1, color="red", 
              linetype="dashed", size=1.5)+
  xlab(expression(hat(y)))+
  ylab(expression(y[0]))


bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)


# Random Forests

set.seed(1)
rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)

# the test MSE for random forests
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)


# Variable Importance Plot
var_importance <- data.frame(variable=setdiff(colnames(Boston), "medv"),
                             importance=as.vector(rf.boston$importance[,1]))
var_importance <- var_importance[order(var_importance[,2]),]
var_importance$variable <- factor(var_importance$variable, levels=var_importance$variable)

vi=ggplot(var_importance,aes(x=variable, y=importance, fill=variable)) +
  geom_bar(stat='identity', position='identity')+
  scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
  coord_flip()+
  theme(axis.text.x=element_blank(),
        axis.text.y=element_text(size=12),
        axis.title=element_text(size=16),
        plot.title=element_text(size=18),
        legend.title=element_text(size=16),
        legend.text=element_text(size=12))+
  ylab("Mean square error")


var_purity <- data.frame(variable=setdiff(colnames(Boston), "medv"),
                             purity=as.vector(rf.boston$importance[,2]))
var_purity = var_purity[order(var_purity[,2]),]
var_purity$variable <- factor(var_purity$variable, levels=var_purity$variable)

np=ggplot(var_purity,aes(x=variable, y=purity, fill=variable)) +
  geom_bar(stat='identity', position='identity') + 
  scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
  coord_flip()+
  theme(axis.text.x=element_blank(),
        axis.text.y=element_text(size=12),
        axis.title=element_text(size=16),
        plot.title=element_text(size=18),
        legend.title=element_text(size=16),
        legend.text=element_text(size=12))+
  ylab("Node impurity")


grid.arrange(vi, np, nrow=1, ncol=2,top="Variable Importance from Random Forest Fit")

varImpPlot(rf.boston)



# Boosting

library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)

# partial dependence plots for rm and lstat.

par(mfrow=c(1,2))
plot(boost.boston,i="rm",col="red")
plot(boost.boston,i="lstat",col="blue")


# the test MSE for boosting
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)

boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)

 

[Exercises]

  • Classification

# Logistic Regression 
# Exercise1


#(1)
titanic = read.csv("titanic3.csv")
dim(titanic)
str(titanic)
summary(titanic)

sum(is.na(titanic))
titanic = titanic[,c(1,2,4:7)]
titanic=na.omit(titanic)

glm.fit= glm(survived ~ pclass + sex + age + sibsp,
                             family = binomial, data = titanic)
summary(glm.fit)
coef(glm.fit)
summary(glm.fit)$coef
summary(glm.fit)$coef[,4]




#(2)
attach(titanic)
train = (parch < 1) 
test = !train
train.titanic = titanic[train, ]
test.titanic = titanic[test, ]
test.survived = survived[test]




#(3)
glm.fit=glm(survived ~ pclass + sex + age,
            family = binomial, data = titanic, subset = train)
glm.probs=predict(glm.fit,test.titanic,type="response")
glm.pred=rep(0,length(glm.probs))
glm.pred[glm.probs>.5]=1
mean(glm.pred!=test.survived)





# Linear Discriminant Analysis

# Exercise 2


#(1)
library(MASS)
lda.fit=lda(survived ~ pclass + sex + age + sibsp,data=titanic, subset=train)
lda.fit




#(2)
lda.pred=predict(lda.fit,test.titanic)
names(lda.pred)
lda.class=lda.pred$class
table(lda.class,test.survived)
mean(lda.class!=test.survived)





# Quadratic Discriminant Analysis

# Exercise 3


#(1)
qda.fit=qda(survived ~ pclass + sex + age + sibsp,data=titanic, subset=train)
qda.fit




#(2)
qda.class=predict(qda.fit ,test.titanic)$class
table(qda.class,test.survived)
mean(qda.class!=test.survived)






# K-Nearest Neighbors


# Exercise 4


#(1)
library(class)
train.x = cbind(pclass, sex, age, sibsp)[train, ]
rownames(train.x)=NULL
test.x = cbind(pclass, sex, age, sibsp)[test, ]
rownames(test.x)=NULL
train.survived = survived[train]

# K=1

set.seed(0413)
knn.pred=knn(train.x,test.x,train.survived,k=1)
mean(knn.pred != test.survived)

# K=3

knn.pred=knn(train.x,test.x,train.survived,k=3)
mean(knn.pred != test.survived)

# K=100

knn.pred = knn(train.x,test.x,train.survived,k = 100)
mean(knn.pred != test.survived)

 

  • Clustering and factor analysis



# K-Means Clustering

# Exercise 10


library(ggfortify)
geneData = read.csv("Ch10Ex11.csv", header=F)
dim(geneData)
geneData = t(geneData)
rownames(geneData)=c(paste("H", 1:20, sep = ""),paste("D", 1:20, sep = ""))
colnames(geneData)=c(paste("g",1:1000,sep=""))

set.seed(0413)
km.out=kmeans(geneData,2,nstart=20)
cluster = km.out$cluster
cluster

autoplot(km.out,data=geneData,label = TRUE, label.size = 3, frame=TRUE)





# Hierarchical Clustering

# Exercise 11


geneData = read.csv("Ch10Ex11.csv", header=F)
dim(geneData)
geneData = t(geneData)
rownames(geneData)=c(paste("H", 1:20, sep = ""),paste("D", 1:20, sep = ""))
colnames(geneData)=c(paste("g",1:1000,sep=""))

dd = as.dist(1 - cor(t(geneData)))

hc.complete=hclust(dd, method="complete")
hc.average=hclust(dd, method="average")
hc.single=hclust(dd, method="single")

library(ggdendro)
library(gridExtra)
dendr = dendro_data(hc.complete, type="rectangle") 
h1=ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Complete Linkage with Correlation-Based Distance")


dendr = dendro_data(hc.average, type="rectangle") 
h2=ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Average Linkage with Correlation-Based Distance")


dendr = dendro_data(hc.single, type="rectangle") 
h3= ggplot() + 
  geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank())+
  labs(title="Single Linkage with Correlation-Based Distance")

grid.arrange(h1, h2, h3,nrow=1, ncol=3)



# factor analysis

# Exercise 12


#(1)
pers<-read.csv("personality.csv")
library(corrplot)
corrplot(cor(pers), order = "hclust", tl.col="black", tl.srt=90)
stand.pers = as.data.frame(scale(pers))




#(2)
fac.pers = factanal(stand.pers, factors = 10, rotation = "none", na.action = na.omit)
fac.pers




#(3)


1-fac.pers$uniquenesses


# Plot loadings against one another
text=rownames(fac.pers$loadings)
load = fac.pers$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
  geom_point() + 
  geom_text(angle = 20,size=5)+
  labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")




#(4)
# Factor analysis with rotation
fac.pers2 = factanal(stand.pers, factors = 10, rotation = "varimax", na.action = na.omit)
fac.pers2
load2 = fac.pers2$loadings[,1:2]


### Plot loadings against one another
text=rownames(fac.pers2$loadings)
load = fac.pers2$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
  geom_point() + 
  geom_text(angle = 20,size=5)+
  labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")

 

  •  Tree models

### Decision Trees

# classification Trees

# Exercise 5


#(1)
titanic = read.csv("titanic3.csv")
titanic = titanic[,c(1,2,4:7)]
titanic=na.omit(titanic)

attach(titanic)
train = (parch < 1) 
test = !train
train.titanic = titanic[train, ]
test.titanic = titanic[test, ]
test.survived = survived[test]


tree.titanic=tree(as.factor(survived)~ pclass+age+sex+sibsp, titanic,subset=train)
summary(tree.titanic)


#(2)
plot(tree.titanic)
text(tree.titanic,pretty = 0)

tree.pred=predict(tree.titanic,test.titanic, type="class")
table(tree.pred,test.survived)
(70+4)/length(tree.pred)


#(3)
set.seed(0413)
cv.titanic = cv.tree(tree.titanic, FUN = prune.misclass)
names(cv.titanic)
cv.titanic

# plot the error rate as a function of both size and k
library(gridExtra)
cv=data.frame(cv.titanic$size,cv.titanic$k,cv.titanic$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
cv1=ggplot(cv,aes(x=size,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()


cv2=ggplot(cv,aes(x=k,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()

grid.arrange(cv1, cv2, nrow=1, ncol=2)


prune.titanic=prune.misclass(tree.titanic,best=3)


# Confusion matrix

tree.pred=predict(prune.titanic,test.titanic,type="class")
table(tree.pred,test.survived)
(70+4)/length(tree.pred)




# Regression Trees

# Exercise 6


#(1)
library(ISLR)
dim(Hitters)
sum(is.na(Hitters$Salary))
Hitters =na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))


train=1:200
train.hitters=Hitters[train,]
test.hitters=Hitters[-train,]
attach(Hitters)

tree.hitters=tree(log(Salary) ~., data=Hitters,subset=train)
tree.hitters

#(2)  

tree_data <- dendro_data(tree.hitters)
ggplot(segment(tree_data)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n), 
               colour = "blue", alpha = 0.5) +
  scale_size("n") +
  geom_text(data = label(tree_data), 
            aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
  geom_text(data = leaf_label(tree_data), 
            aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
  theme_dendro()+
  labs(title="Regression tree diagrams")

pred.hitters = predict(tree.hitters, test.hitters)
mean((log(test.hitters$Salary) - pred.hitters)^2)


#(3)
set.seed(0413)
cv.hitters = cv.tree(tree.hitters, FUN = prune.tree)
cv=data.frame(cv.hitters$size,cv.hitters$k,cv.hitters$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
ggplot(cv,aes(x=size,y=dev))+
  geom_point()+
  geom_point(data=min, size=3, colour=alpha("red", 0.7))+
  geom_line()


pruned.hitters = prune.tree(tree.hitters, best = 5)
pred.pruned = predict(pruned.hitters, test.hitters)
mean((log(test.hitters$Salary) - pred.pruned)^2)



# Bagging

# Exercise 7


#(1)
train=1:200
train.hitters=Hitters[train,]
test.hitters=Hitters[-train,]
attach(Hitters)

library(randomForest)
set.seed(0413)
bag.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 19, ntree = 500, 
                            importance = T)


#(2)
bag.pred = predict(bag.hitters, test.hitters)
mean((log(test.hitters$Salary) - bag.pred)^2)


bag.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 19, ntree = 25, 
                           importance = T)
bag.pred = predict(bag.hitters, test.hitters)
mean((log(test.hitters$Salary) - bag.pred)^2)




# Random Forest

# Exercise 8


#(1)
set.seed(0413)
rf.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 6, ntree = 500, 
                           importance = T)
rf.pred = predict(rf.hitters, test.hitters)
mean((log(test.hitters$Salary) - rf.pred)^2)


#(2)

# Variable Importance Plot
var_importance <- data.frame(variable=setdiff(colnames(Hitters), "Salary"),
                             importance=as.vector(rf.hitters$importance[,1]))
var_importance <- var_importance[order(var_importance[,2]),]
var_importance$variable <- factor(var_importance$variable, levels=var_importance$variable)

ggplot(var_importance,aes(x=variable, y=importance, fill=variable)) +
  geom_bar(stat='identity', position='identity')+
  scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
  coord_flip()+
  theme(axis.text.x=element_blank(),
        axis.text.y=element_text(size=12),
        axis.title=element_text(size=16),
        plot.title=element_text(size=18),
        legend.title=element_text(size=16),
        legend.text=element_text(size=12))+
  ylab("Mean square error")


# Boosting

# Exercise 9


#(1)
library(gbm)
set.seed(0413)
boost.hitters=gbm(log(Salary)~.,data=train.hitters,distribution="gaussian",
                  n.trees=5000,interaction.depth=4,shrinkage = 0.01)
summary(boost.hitters)


#(2)
yhat.boost=predict(boost.hitters,newdata=test.hitters,n.trees=5000)
mean((yhat.boost-log(test.hitters$Salary))^2)

 

Day 5 : Recommendation Algorithms (Recommendation System)
########################
#3. Collaborative filtering
#3.1 Jester5k data ####
#data loading
install.packages("recommenderlab")
library(recommenderlab)
data(Jester5k)
head(as(Jester5k,"matrix"))
image(Jester5k[1:80])         # show the rating matrix for Jester5k users 1-80 only

#collaborative filtering
r=Recommender(Jester5k[1:1000],method="UBCF")       # method="IBCF" gives item-based CF instead
pr=predict(r, Jester5k[1:2], type="ratings")
as(pr,"matrix")               # no predictions are returned for items a user has already rated
ptype=predict(r, Jester5k[1:2], n=5)    # n controls the number of recommended items
as(ptype,"list")

as(pr[1],"matrix")[,order(as(pr[1],"matrix"),decreasing = T)]     # sort the predicted ratings for the first user


#3.2 scale adjustment #######
#making dummy variables
a = as(Jester5k[1:1000],"matrix")
dgmat=cbind(a[1:100000],as.data.frame(cbind(rep(rownames(a),ncol(Jester5k[1:1000])),rep(colnames(a),each=nrow(Jester5k[1:1000])))))    # each row holds the information for one rating
colnames(dgmat)=c("rating","user","item")
user=unique(dgmat$user); item=unique(dgmat$item)
head(dgmat)
dgmat=dgmat[is.na(dgmat$rating)==F,]    # keep only the rows with an observed rating
dummy=model.matrix(rating~user+item,dgmat)  # create dummy variables for user and item
head(dummy)

#estimate mu by ridge regression
library(glmnet)
#cv.lm=cv.glmnet(dummy, dgmat$rating, type.measure="deviance", alpha=0)    # takes a long time... try it at home
lm = glmnet(dummy,dgmat$rating,family="gaussian",lambda=0.2, alpha=0)     # roughly matches the optimal lambda from CV; in practice use cv.lm$lambda.min

#convert the data back to the original format
dgmat$rating=dgmat$rating-dummy%*%lm$beta
user.index=match(dgmat$user,user); item.index=match(dgmat$item,item)  # find which user (item) each row corresponds to
mat=sparseMatrix(i=user.index,j=item.index,x=dgmat$rating)
mat=as.matrix(mat) ; sum(dgmat$rating==0,na.rm=T) ;mat[mat==0]=NA     # confirm there are no 0 ratings, then mark missing entries as NA
colnames(mat)=item ; rownames(mat)=user
mat = as(mat,"realRatingMatrix")        # back to a rating matrix for the Recommender function

# compute predicted ratings
r1=Recommender(mat,method="UBCF")
pr1=predict(r1, mat[1:2], type="ratings")
pr1 = as(pr1,"matrix")
rownames(lm$beta)=gsub('user','',rownames(lm$beta)) ;rownames(lm$beta)=gsub('item','',rownames(lm$beta))  # clean up the names of lm$beta (check why this is needed before running)
item=as.character(item) ; user=as.character(user)
user1_dummy=c(); user2_dummy=c()
for(i in 1:length(item)){           # build the dummy vectors used to add back the mu's estimated above for user1 and user2
  b1=rep(0,length(lm$beta)); b2=rep(0,length(lm$beta))
  b1[match(c(user[1],item[i]),rownames(lm$beta))]=1; b1[1]=1
  b2[match(c(user[2],item[i]),rownames(lm$beta))]=1; b2[1]=1
  user1_dummy=cbind(user1_dummy,b1); user2_dummy=cbind(user2_dummy,b2)
}
scale.value=rbind(t(lm$beta) %*% user1_dummy ,t(lm$beta) %*% user2_dummy)
pr1.final= scale.value+pr1 ; colnames(pr1.final)=item
pr1.final[,order(pr1.final[1,],decreasing = T)]     # sort user1's predicted ratings in decreasing order

# Example - MovieLense data
data("MovieLense")
# Try following the steps above yourself! (a minimal sketch follows below)
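# (cf) A minimal sketch for the MovieLense exercise, following the same UBCF steps as
# the Jester5k example above; the user indices below are illustrative choices only.
r.ml  = Recommender(MovieLense[1:500], method="UBCF")
pr.ml = predict(r.ml, MovieLense[501:502], type="ratings")
as(pr.ml, "matrix")[, 1:5]
top5  = predict(r.ml, MovieLense[501:502], n=5)
as(top5, "list")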

########################
#4. matrix factorization
#4.1 Jester5k data ####
#load the matrix factorization library
library(matfact)
mf = matfact(as(Jester5k,"matrix")[1:1000,])
pred = mf$P %*% t(mf$Q)
colnames(pred)=item ; rownames(pred)=user
head(pred)

#prediction
i1 = is.na(as(Jester5k,"matrix")[1,])     # extract the items that users 1 and 2 have not rated
i2 = is.na(as(Jester5k,"matrix")[2,])
pred[1,i1==T] ; sort(pred[1,i1==T],decreasing=T)     # look at predicted ratings only for unrated items
pred[2,i2==T] ; sort(pred[2,i2==T],decreasing=T)


#4.2 scale adjustment ####
mat1=as(mat,"matrix")           # reuse the mat variable built for the scale adjustment above
mf1 = matfact(mat1)
pred1 = mf1$P %*% t(mf1$Q)
colnames(pred1)=item ; rownames(pred1)=user
pred1.final=pred1[1:2,]+scale.value       # add the same adjustment values as before
pred1.final[1,i1==T] ; pred1.final[2,i2==T]       # predicted ratings for users 1 and 2 via matrix factorization
sort(pred1.final[1,i1==T],decreasing=T); sort(pred1.final[2,i2==T],decreasing=T)    # same, sorted by predicted rating

#matrix factorization

# Example - MovieLense data
# Try following the steps above yourself! (a minimal sketch follows below)
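# (cf) A minimal sketch for the MovieLense matrix-factorization exercise, reusing the
# matfact() interface (P and Q factor matrices) shown above; the number of users kept
# here is an arbitrary choice for illustration.
data("MovieLense")
ml.mat = as(MovieLense, "matrix")[1:500, ]
mf.ml  = matfact(ml.mat)
pred.ml = mf.ml$P %*% t(mf.ml$Q)
colnames(pred.ml)=colnames(ml.mat) ; rownames(pred.ml)=rownames(ml.mat)
i1.ml = is.na(ml.mat[1, ])                          # items user 1 has not rated
sort(pred.ml[1, i1.ml==T], decreasing=T)[1:10]      # top 10 predicted ratings for user 1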


##########################
#7. Recommendation using purchase data
#7.1 Association rule analysis ####

# load the purchase data
install.packages("arules")
library(arules)
shop=as.matrix(shop)
rules=apriori(shop, parameter=list(supp=0.1,conf=0.9))  # build association rules
rules.sorted=sort(rules, by="lift")
inspect(rules.sorted)

#7.2 conditional regression approach ####

# takes too long to run....
#pred=c()
#for(i in 1:ncol(tot)){
#  cv.fit=cv.glmnet(tot[,-i],tot[,i],family="binomial",alpha=0,nfolds=3)
#  pred.new=predict(cv.fit,newx=tot[,-i],s="lambda.min",type="response")
#  pred=cbind(pred,pred.new)
#}

# load the precomputed results
pred=read.csv("F:\\Stat\\job\\recommender_공개강좌\\실습자료\\prediction.csv")
rownames(pred)=pred[,1];pred=pred[,-1]
pred[which(tot==1)]=0  # items already purchased do not need a predicted probability, so set them to 0
sort(pred[1,],decreasing=T)[1:10]    # top 10 recommended items for the first user and their predicted probabilities


# Internet shopping-mall data
# load the shop data
# redo the purchase-data recommendations (a minimal sketch follows below)
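# (cf) A minimal sketch for redoing the association-rule step on the shopping-mall data,
# assuming `shop` has been loaded as a 0/1 purchase matrix as above; the support and
# confidence thresholds are illustrative.
shop.rules  = apriori(as.matrix(shop), parameter=list(supp=0.05, conf=0.8))
shop.sorted = sort(shop.rules, by="lift")
inspect(head(shop.sorted, 10))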

 

 Related Materials

  • See the lecture slides and training materials

 

03. [서울대 통계연구소] R을 이용한 빅데이터 분석.zip

 

drive.google.com

 

 References

[Papers]

  • None

[Reports]

  • None

[URL]

  • None

 

 Contact

[Meteorology / Programming Languages]

  • sangho.lee.1990@gmail.com

[Oceanography / Astronomy / Big Data]

  • saimang0804@gmail.com