Information
-
Task : SNU Statistical Research Institute training course on big data analysis with R
-
Author : Lee Sang-ho
-
Date : 2020-05-12
-
Description :
-
Revision history :
Contents
[Overview]
-
Hello! This is Haesol, in charge of meteorological research and web development.
-
Today I cover the Statistical Research Institute's training course on big data analysis with R.
-
The course ran for five days, and every session combined lectures with hands-on practice.
-
The material has since been a great help when writing academic papers and reports.
-
Accordingly, I will introduce it in the order "course schedule, practice materials, related materials".
[Features]
-
Written to aid understanding of big data training with R
[Topics]
-
Course schedule
-
Practice materials
-
Related materials
[OS Used]
-
Windows 10
[Language Used]
-
R v4.0.0
Details
[Course Schedule]
-
Course outline
-
Day 1 : Data wrangling with the R tidyverse packages
-
Day 2 : Data Visualization
-
Day 3 : Regression Analysis
-
Day 4 : Classification, Clustering
-
Day 5 : Recommendation System
-
-
Timetable
-
09:30-12:30 : Lecture
-
13:30-15:30 : Hands-on practice
-
Practice materials
-
Please refer to the comments in the source code for explanations.
Day 1 : Data wrangling with the R tidyverse packages
- Installing and loading the required packages
#################################################################################################
### ### ###
### Description : This is for the special course on "Big Data" of SNU-SRI. ###
### This concentrates on "Data Wrangling" with R package "dplyr" and "tidyr". ###
### ###
### Reference : Jaimie (Jaimyoung) Kwon's github (and lecture note) ###
### ###
#################################################################################################
# Install the core packages
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("data.table")
# Install additional packages used in the exercises
install.packages("lubridate")
install.packages("Lahman")
# Install the example data packages
install.packages("nycflights13")
install.packages("babynames")
# Load the packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)
-
Practice materials
#################################################################################################
### ###
### Description : This is for the special course on "Big Data" of SNU-SRI. ###
### This concentrates on "Data Wrangling" with R package "dplyr" and "tidyr". ###
### ###
### Reference : Jaimie (Jaimyoung) Kwon's github (and lecture note) ###
### ###
#################################################################################################
#################################################################################################
########## dplyr ##########
#################################################################################################
library(nycflights13)
help(package="nycflights13")
# glimpse() shows the data transposed: each variable (column) is listed with its values in order,
# so every column of the data frame can be inspected comfortably in the console.
# Unlike str(), it prints as many values as fit in the console width.
glimpse(flights)
str(flights)
# Exercise 12/43
glimpse(airlines)
glimpse(airports)
glimpse(planes)
glimpse(weather)
flights
print(flights)
class(flights)
# x <- data.frame(c(1,1))
# class(x)
# class(tbl_df(x))
################################# Advanced exercise (13/43) #################################
head(as.data.frame(flights))
##################################### 1. filter() #####################################
# Description : keeps the rows that satisfy the given conditions.
# Basic usage : filter(data (data frame), condition(s) to filter on)
# Flights that departed on January 1
filter(flights, month == 1, day == 1)
filter(flights, month == 1 & day == 1)
# Flights that departed in January or February
filter(flights, month == 1 | month == 2)
### slice() : selects rows by row number.
# flights[1:10,] ==
slice(flights, 1:10)
##################################### 2. arrange() #####################################
# Description : sorts the rows in ascending order of the given variables, i.e. the data are ordered
# from the smallest to the largest value of the specified column(s).
# Basic usage : arrange(data (data frame), column(s) to sort by in ascending order)
# Sort flights by year, month, day
# flights[order(flights$year, flights$month, flights$day),] ==
arrange(flights, year, month, day)
### desc() : switches the sort to descending order.
# Sort so that the largest arr_delay comes first
arrange(flights, desc(arr_delay))
##################################### 3. select() #####################################
# Description : selects the columns you need; naming the columns directly is the most common usage.
# Basic usage : select(data (data frame), names of the columns to extract)
# Extract the columns named year, month, day
select(flights, year, month, day)
# Extract a range of columns
select(flights, year:arr_time)
# Extract everything except the columns named year, month, day
select(flights, -year, -month, -day)
# Extract everything except a range of columns
select(flights, -(year:arr_time))
### + starts_with() : selects all columns whose names start with a given string
# Drop all columns whose names start with "arr"
select(flights, -starts_with("arr"))
### + ends_with() : selects all columns whose names end with a given string
# Extract all columns whose names end with "delay"
select(flights, ends_with("delay"))
### + matches() : selects all columns whose names match a given pattern
# Extract all columns whose names have the form something_something
select(flights, matches("._."))
select(flights, matches(".r_t."))
### + contains() : selects all columns whose names contain a given string
# Extract all columns whose names contain "_"
select(flights, contains("_"))
select(flights, contains("_."))
## See ?select for more helpers.
### rename() : renames a column.
rename(flights, tail_num = tailnum)
##################################### 4. mutate() #####################################
# Description : adds new columns computed from existing column(s).
# A column created inside the call can be used immediately in later expressions.
# Basic usage : mutate(data (data frame), new column = calculation)
mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60), speed = distance / air_time * 60)
# How does this compare with transform()?
mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60))
transform(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) #error
##################################### 5. summarize() #####################################
# Description : collapses the given data frame into a single row, i.e. computes summary statistics.
# summarise() does exactly the same thing.
# Basic usage : summarize(data (data frame), desired summary statistics)
# Compute the mean of dep_delay
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
# Summary functions take a vector as input and return a single number.
# Base R provides min(), max(), mean(), sum(), sd(), median(), IQR(), among others.
# dplyr adds n(), n_distinct(), first(), last() and nth().
# summarize() is most useful together with group_by(); a short sketch follows below.
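# A minimal sketch (not in the original notes) of those helpers combined with group_by():
# per carrier, the number of flights, the number of distinct destinations,
# and the first and last departure time appearing in the data.
summarize(group_by(flights, carrier),
          n_flights = n(),
          n_dest = n_distinct(dest),
          first_dep = first(dep_time),
          last_dep = last(dep_time))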
############################## 6. sample_n(), sample_frac() ##############################
# Description : sample_n() randomly samples a fixed number of rows.
# sample_frac() randomly samples a fixed fraction of the rows.
# Basic usage : sample_n(data (data frame), number of rows to sample)
# sample_frac(data (data frame),
# fraction of the rows to sample)
# Randomly sample 10 rows
sample_n(flights, 10)
# Randomly sample 1% of the rows
sample_frac(flights, 0.01)
# Sampling is without replacement by default; set replace = TRUE for sampling with replacement.
# The weight option assigns sampling weights.
# For reproducible research, call set.seed() first; a short sketch follows below.
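# A minimal sketch (not in the original notes) of the three options just mentioned;
# the seed value 2020 is arbitrary.
set.seed(2020)
sample_n(flights, 10, replace = TRUE) # sampling with replacement
sample_n(flights, 10, weight = distance) # longer flights are more likely to be drawn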
##################################### 7. distinct() #####################################
# Description : finds the unique (distinct) rows in the data.
# Basic usage : distinct(data (data frame) or selected columns of it)
# Distinct tail numbers
distinct(select(flights, tailnum))
# Distinct routes (origin-destination pairs)
distinct(select(flights, origin, dest))
############################# Language of data manipulation #############################
# (1) sort rows arrange()
# (2) select rows filter()
# (3) select columns select()
# (4) compute new variables mutate()
# (5) compute summary statistics summarise()
# A short sketch applying all five in sequence follows below.
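# A minimal sketch (not in the original notes) applying the five verbs in sequence,
# using the temporary-variable style introduced later in the chaining section.
tmp = arrange(flights, year, month, day) # (1) sort rows
tmp = filter(tmp, !is.na(dep_delay)) # (2) select rows
tmp = select(tmp, dep_delay, arr_delay, air_time) # (3) select columns
tmp = mutate(tmp, gain = arr_delay - dep_delay) # (4) compute a new variable
summarise(tmp, mean_gain = mean(gain, na.rm = TRUE)) # (5) summary statistic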
########################### Grouped operations with group_by() ###########################
# Description : splits the data set into groups.
# Basic usage : group_by(data (data frame), column(s) that define the groups)
flights
groups(flights)
(by_tailnum = group_by(flights, tailnum))
groups(by_tailnum)
### Using the verbs on grouped data
## select() always keeps the grouping variables.
select(by_tailnum, year, month, day)
## arrange() sorts by the grouping variables first.
arrange(by_tailnum, day)
arrange(flights, day)
## mutate() and filter() are most useful together with window functions.
# vignette("window-functions")
# For each aircraft, extract the flight with the largest air_time
filter(by_tailnum, min_rank(desc(air_time)) == 1)
# For each aircraft, rank the flights by air_time in ascending order
mutate(by_tailnum, rank = min_rank(air_time))
## sample_n() and sample_frac() sample within each group.
sample_n(by_tailnum, 1)
## slice() extracts rows within each group.
# If the requested row number exceeds a group's size, only the rows that exist are returned.
slice(by_tailnum, 5)
# summarise() computes the summary statistics within each group.
# See below
#(by_tailnum = group_by(flights, tailnum))
delay = summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, count > 20, dist < 2000)
delay
### exercise 26/43
groups(by_tailnum)
by_tailnum = ungroup(by_tailnum)
groups(by_tailnum)
### exercise 26/43
rowwise(by_tailnum)
delay = summarise(rowwise(by_tailnum),
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, dist < 2000)
delay
########################### Visualization with the ggplot2 package ###########################
by_tailnum = group_by(flights, tailnum)
delay = summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay = filter(delay, count > 20, dist < 2000)
delay
ggplot(delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area()
sum(select(delay, count)>=1000)
sum(is.na(select(delay, delay)))
########################### The %>% operator and chaining ###########################
# The downside of the functional nature of dplyr is that when you combine multiple data manipulation
# operations, you have to read from the inside out and the arguments may be very distant to the function
# call. These functions provide an alternative way of calling dplyr (and other data manipulation)
# functions that you can read from left to right. (from the "chain" section of the dplyr documentation)
# Temporary-variable approach
a1 = group_by(flights, year, month, day)
a2 = select(a1, arr_delay, dep_delay)
a3 = summarise(a2,
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
(a4 = filter(a3, arr > 30 | dep > 30))
# Nesting approach
filter(
summarise(
select(
group_by(flights, year, month, day),
arr_delay, dep_delay
),
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
),
arr > 30 | dep > 30
)
# Both approaches hurt readability.
### The %>% operator
# simple1
head(iris)
iris %>% head
# simple2
head(iris, 10)
iris %>% head(10)
# Gives the same result as the temporary-variable and nesting approaches above.
# It is also much more readable.
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
################################## Exercise (31/43) ##################################
# Package for handling dates and times
library(lubridate)
# A locale is the C-language mechanism for specifying region-dependent conventions:
# functions such as setlocale and localeconv control how currency, decimal numbers
# (point vs comma), dates and times, and other region-specific formats are represented.
# [Naver Encyclopedia] locale (Dictionary of Computer and Internet IT Terms, 2011-01-20, Iljinsa)
# (The exact locale string accepted below is OS-dependent; this one is for Windows.)
Sys.setlocale("LC_TIME", "usa")
(per_day = flights %>%
group_by(year, month, day) %>%
summarise(flights = n(),
distance = mean(distance),
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
mutate(dt_str = sprintf("%04d-%02d-%02d", year, month, day),
dt = parse_date_time(dt_str, "ymd",tz = "US/Eastern"),
dow = wday(dt, label=TRUE)
)
)
# Number of distinct aircraft for each destination
flights %>% group_by(tailnum, dest) %>% distinct() %>% group_by(dest) %>% summarise(count = n())
# Number of flights for each destination
flights %>% group_by(dest) %>% summarise(count = n())
# Number of flights per day
select(per_day, year:flights)
flights %>% group_by(year,month, day) %>% summarise(count = n())
# Number of flights per day of the week
ungroup(per_day) %>% group_by(dow) %>% summarise(count = sum(flights))
ungroup(per_day) %>% group_by(dow) %>% summarise(count = mean(flights))
per_day %>% ggplot(aes(dt, flights)) +
geom_line() + geom_point() + geom_smooth()
per_day %>% ggplot(aes(dow, flights)) +
geom_boxplot()
per_day %>% ggplot(aes(dt, dep)) +
geom_line() + geom_point() + geom_smooth()
per_day %>% ggplot(aes(dow, dep)) +
geom_boxplot()
per_day %>% ggplot(aes(dt, distance)) +
geom_line() + geom_point() + geom_smooth()
per_day %>% ggplot(aes(dow, distance)) +
geom_boxplot()
########################## Combining tables with the join functions ##########################
(df1 <- data_frame(x = c(1, 2), y = 2:1))
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
# inner_join() keeps only the rows that match in both x and y: the intersection.
df1 %>% inner_join(df2)
# left_join() keeps every row of x; columns coming from y are NA where there is no match.
df1 %>% left_join(df2)
# right_join() keeps every row of the y table.
# It gives the same result as left_join(y, x), differing only in row order.
df1 %>% right_join(df2)
# full_join() keeps every row of both x and y: the union.
df1 %>% full_join(df2)
################################## Exercise (35/43) ##################################
(flights2 = flights %>% select(year:day, hour, origin, dest, tailnum, carrier))
airlines
flights2 %>% left_join(airlines)
# flights2 %>% left_join(airlines, by="carrier")
# rm(list = ls())
################################# Exercise (37/43)-1 #################################
ds <- tbl_df(mtcars)
ds
as.data.frame(ds)
if (require("Lahman") && packageVersion("Lahman") >= "3.0.1") {
batting <- tbl_df(Batting)
dim(batting)
colnames(batting)
head(batting)
# Data manipulation verbs ---------------------------------------------------
filter(batting, yearID > 2005, G > 130)
select(batting, playerID:lgID)
arrange(batting, playerID, desc(yearID))
summarise(batting, G = mean(G), n = n())
mutate(batting, rbi2 = if(is.null(AB)) 1.0 * R / AB else 0)
# Group by operations -------------------------------------------------------
# To perform operations by group, create a grouped object with group_by
players <- group_by(batting, playerID)
head(group_size(players), 100)
summarise(players, mean_g = mean(G), best_ab = max(AB))
best_year <- filter(players, AB == max(AB) | G == max(G))
progress <- mutate(players, cyear = yearID - min(yearID) + 1,
rank(desc(AB)), cumsum(AB))
# When you group by multiple levels, each summarise peels off one level
per_year <- group_by(batting, playerID, yearID)
stints <- summarise(per_year, stints = max(stint))
filter(stints, stints > 3)
summarise(stints, max(stints))
mutate(stints, cumsum(stints))
# Joins ---------------------------------------------------------------------
player_info <- select(tbl_df(Master), playerID, birthYear)
hof <- select(filter(tbl_df(HallOfFame), inducted == "Y"),
playerID, votedBy, category)
# Match players and their hall of fame data
inner_join(player_info, hof)
# Keep all players, match hof data where available
left_join(player_info, hof)
# Find only players in hof
semi_join(player_info, hof)
# Find players not in hof
anti_join(player_info, hof)
}
# rm(list = ls())
################################# Exercise (37/43)-2 #################################
library(babynames)
babynames
class(babynames)
babynames %>% group_by(year) %>% filter(min_rank(desc(n)) == 1)
babynames %>% group_by(year) %>% filter(min_rank(desc(n)) <= 10) %>% summarise(rate = sum(prop))
# rm(list = ls())
################################# Exercise (37/43)-3 #################################
# The file path could also be given directly, but here file.choose() opens a file picker...
batting = read.csv(file.choose())
master = read.csv(file.choose())
head(batting)
head(master)
batting = tbl_df(batting)
master = tbl_df(master)
class(batting)
class(master)
# Compute each batter's batting average by year. ( AVG (batting average) = H (hits) / AB (at bats) )
mutate(batting, AVG = H / AB) %>% select(playerID:lgID, AVG)
full.batting = mutate(batting, AVG = H / AB)
# Visualize the distribution of batting averages by year.
### plot
full.batting %>% mutate(year = factor(yearID)) %>%
group_by(year) %>%
ggplot(aes(year, AVG)) + geom_point() + geom_smooth(aes(group=1))
# The distribution is drawn from a filtered subset of the data, so treat it only as a reference.
full.batting %>% mutate(year = factor(yearID)) %>%
filter(AB >= 100, yearID >= 1985) %>%
group_by(year) %>%
ggplot(aes(year, AVG)) + geom_point() + geom_smooth(aes(group=1))
full.batting %>% filter(AB >= 100) %>%
group_by(yearID) %>%
summarise( n = n()) %>%
ggplot(aes(yearID, n))+ geom_point()
### box-plot
full.batting %>% mutate(year = factor(yearID)) %>%
group_by(year) %>%
ggplot(aes(year, AVG)) + geom_boxplot()
# The distribution is drawn from a filtered subset of the data, so treat it only as a reference.
full.batting %>% mutate(year = factor(yearID)) %>%
filter(AB >= 100, yearID >= 1985) %>%
group_by(year) %>%
ggplot(aes(year, AVG)) + geom_boxplot()
# List the players whose highest batting average came in 1980 or later.
(AVGtable = full.batting %>% select(playerID:lgID, AVG))
# Behaviour of min_rank()
cw <- c(1,2,3,4,NA, 2913)
min_rank(cw)
temp = AVGtable %>% group_by(playerID) %>% filter(min_rank(desc(AVG)) == 1 & yearID >= 1980)
AVGtable %>% filter(playerID == "blackti01")
# Join the result with the Master table to show each player's name.
(name = master %>% select(playerID, nameFirst, nameLast, nameGiven, retroID, bbrefID))
# select() is used here to reorder the columns.
list = temp %>% left_join(name, by="playerID") %>% select(playerID, nameFirst:bbrefID, yearID:AVG)
list
#################################################################################################
########## tidyr ##########
#################################################################################################
# Key concepts : Long format <-> Wide format, Pivot table
# Core contents : the spread() and gather() functions
# library(tidyr)
##################################### 1. spread() #####################################
# Description : Long format -> Wide format
# Basic usage : spread(data (data frame), column to pivot on, column holding the values)
digest = flights %>% group_by(carrier, origin) %>%
summarise(air_time_mean = mean(air_time, na.rm=TRUE))
as.data.frame(digest)
# Pivot table
spread(digest, origin, air_time_mean)
flights %>% group_by(carrier, origin) %>%
summarise(air_time_mean = mean(air_time, na.rm=TRUE)) %>%
spread(origin, air_time_mean)
# Fill NAs with 0
spread(digest, origin, air_time_mean, fill = 0)
##################################### 2. gather() #####################################
# Description : Wide format -> Long format
# Basic usage : gather(data (data frame), name of the key column, name of the value column,
# columns to convert to long format)
series = flights %>% group_by(month, day) %>%
summarise(air_time_mean = mean(air_time, na.rm=TRUE), dep_delay_mean = mean(dep_delay, na.rm=TRUE))
series
series %>% gather(indicators, value, air_time_mean, dep_delay_mean)
series %>% gather(indicators, value, -month, -day)
Day 2 : Data Visualization
# Install Packages
install.packages('ggmap')
install.packages('ggplot2')
install.packages('tm')
install.packages('SnowballC')
install.packages('wordcloud')
install.packages('networkD3')
install.packages('curl')
install.packages('GGally')
install.packages('ggvis')
install.packages('shiny')
install.packages('plotly')
# Graphics for one-dimensional data
# Data: fine particulate matter (PM2.5) data
setwd("~/Desktop/개인/Big Data 시각화 (2016.02)")
pollution <- read.csv("avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric"))
head(pollution)
## pm25 fips region longitude latitude
## 1 9.771185 01003 east -87.74826 30.59278
## 2 9.993817 01027 east -85.84286 33.26581
## 3 10.688618 01033 east -87.72596 34.73148
## 4 11.337424 01049 east -85.79892 34.45913
## 5 12.119764 01055 east -86.03212 34.01860
## 6 10.827805 01069 east -85.35039 31.18973
# Five Number Summary
summary(pollution$pm25)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.383 8.549 10.050 9.836 11.360 18.440
# Boxplot
boxplot(pollution$pm25, col="blue")
boxplot(pollution$pm25, col="blue")
abline(h=12)
# Histogram
hist(pollution$pm25, col="green")
rug(pollution$pm25)
hist(pollution$pm25, col="green",breaks=100)
rug(pollution$pm25)
hist(pollution$pm25, col='green', breaks=c(2,5,8,10,14,18,20))
# Histogram with bar
hist(pollution$pm25, col="green")
abline(v=12,lwd=2)
abline(v=median(pollution$pm25), col="magenta", lwd=4)
abline(h=60, col="skyblue", lwd=4)
# Barplot
barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")
# Graphics for multi-dimensional data
# Multiple Boxplot
boxplot(pm25 ~ region, data = pollution, col = "red") #formula in R; y~x1+x2+...
# Multiple Histograms
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) ### par(mfrow = c(1, 1)) or dev.off() resets the layout
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
# Scatterplot
with(pollution, plot(latitude, pm25)) ## same as plot(pm25 ~ latitude, data=pollution)
abline(h=12,lwd=2,lty=2)
with(pollution, plot(latitude, pm25, col = region))
abline(h=12,lwd=2,lty=2)
# Multiple Scatterplots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1)) ## margin : c(bottom, left, top, right)
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
# R Plot System
# Base plotting system
# Data loading : airquality
library(datasets)
with(airquality, plot(Wind, Ozone))
# Base Plot with Annotation
with(airquality, plot(Wind, Ozone))
title(main="Ozone and Wind in New York City") ## Add a title
# Base Plot with Regression Line
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
?points
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd=2)
abline(1,2, lwd=2) ### intercept, slope, line width
# Multiple Base Plots
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with (airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
plot(Temp, Ozone, main = "Ozone and Temperature")
mtext("Ozone and Weather in New York City", outer = TRUE) # overall (outer) title
}
)
# Lattice System in R
# Lattice system in R
library(lattice)
str(airquality)
airquality <- transform(airquality, Month = factor(Month)) # Convert 'Month' to a factor variable
levels(airquality$Month)
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
xyplot(Ozone ~ Wind, data = airquality)
# Lattice, Why?
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y<-x+f-f*x+rnorm(100,sd=0.5)
f <- factor(f, labels = c("Group 1", "Group 2")) # group1(f=0; y=x+norm)=1:50, group2(f=1; y=1+norm)=51:100
xyplot(y ~ x)
xyplot(y ~ x | f, layout = c(2, 1))
# Lattice Panel Functions: Regression line
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) # First call default panel function
panel.lmline(x, y, col = 2) # Overlay a simple linear regression line
})
ftn = function(x, y, ...) {
panel.xyplot(x, y, ...) # First call default panel function
panel.lmline(x, y, col = 2) # Overlay a simple linear regression line
}
xyplot(y~x | f, panel=ftn)
# ggplot2
# Data ; MAACS
library("ggplot2")
load("maacs.Rda")
# basic ggplot ; using qplot (quick plot)
qplot(logpm25, NocturnalSympt, data = maacs, facets = . ~ bmicat,
geom = c("point"))+stat_smooth(method="lm")
# Function ggplot
g<-ggplot(maacs, aes(logpm25, NocturnalSympt))
g
g+geom_point()
g+geom_point()+geom_smooth(method = "lm")
g+geom_point()+geom_smooth(method = "lm")+facet_grid(.~bmicat)
# Modifying Aesthetics
g + geom_point(color = "steelblue", size=4, alpha = 1/2)
g + geom_point(aes(color = bmicat), size=4, alpha=1/2)
# Modifying Labels
g + geom_point(aes(color = bmicat), size=4, alpha=1/2) +
labs(title = "MAACS Cohort") + labs(x=expression("log"*PM[2.5]),y="Nocturnal Symptoms")
# Customizing the Smooth
g + geom_point(aes(color = bmicat), size=4, alpha=1/2) +
geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
# Parallel Coordinate Plot
require(GGally)
head(mtcars)
ggparcoord(mtcars, columns = c(1,5:10))+geom_line()
# Using ggmap (map-based visualization)
# The ggmap package
library(ggmap)
geocode("Seoul National University")
## lon lat
## 1 126.978 37.56654
korea <- c(left = 124, bottom = 32, right = 132, top = 43.5)
map <- get_stamenmap(korea, zoom = 5, maptype = "toner-lite")
ggmap(map)
# Crime data for downtown Houston
# Basic plot on the map
# crime Data from Google Maps (included in ggmap packages)
str(crime)
## 'data.frame': 86314 obs. of 17 variables:
## $ time : POSIXt, format: "2010-01-01 15:00:00" "2010-01-01 15:00:00" ...
## $ date : chr "1/1/2010" "1/1/2010" "1/1/2010" "1/1/2010" ...
## $ hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ premise : chr "18A" "13R" "20R" "20R" ...
## $ offense : Factor w/ 7 levels "aggravated assault",..: 4 6 1 1 1 3 3 3 3 3 ...
## $ beat : chr "15E30" "13D10" "16E20" "2A30" ...
## $ block : chr "9600-9699" "4700-4799" "5000-5099" "1000-1099" ...
## $ street : chr "marlive" "telephone" "wickview" "ashland" ...
## $ type : chr "ln" "rd" "ln" "st" ...
## $ suffix : chr "-" "-" "-" "-" ...
## $ number : int 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Ord.factor w/ 8 levels "january"<"february"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Ord.factor w/ 7 levels "monday"<"tuesday"<..: 5 5 5 5 5 5 5 5 5 5 ...
## $ location: chr "apartment parking lot" "road / street / sidewalk" "residence / house" "residence / house" ...
## $ address : chr "9650 marlive ln" "4750 telephone rd" "5050 wickview ln" "1050 ashland st" ...
## $ lon : num -95.4 -95.3 -95.5 -95.4 -95.4 ...
## $ lat : num 29.7 29.7 29.6 29.8 29.7 ...
downtown<-subset(crime,
-95.39681 <= lon & lon <= -95.34188 &
29.73631 <= lat & lat <= 29.78400)
qmplot(lon, lat, data = downtown, maptype = "toner-background", color = I("red"))
# Different sources / types of map
# By changing the source of the background map (source option) and its type (maptype option),
# the visualization can be matched to the situation.
# With source = "stamen" (the default), the available maptype values are:
# "terrain", "terrain-background", "terrain-labels", "terrain-lines", "toner",
# "toner-2010", "toner-2011", "toner-background", "toner-hybrid", "toner-labels",
# "toner-lines", "toner-lite", "watercolor"
# With source = "google", the available maptype values are:
# "terrain", "satellite", "roadmap", and "hybrid"
qmplot(lon, lat, data = downtown, maptype = "watercolor", color = I("red"))
qmplot(lon, lat, data = downtown, maptype = "terrain-background", color = I("red"))
qmplot(lon, lat, data = downtown, maptype = "toner-hybrid", color = I("red"))
qmplot(lon, lat, data = downtown, source = "google", maptype = "terrain", color = I("red"))
qmplot(lon, lat, data = downtown, source = "google", maptype = "satellite", color = I("red"))
qmplot(lon, lat, data = downtown, source = "google", maptype = "hybrid", color = I("red"))
# Showing crime density
robberies <- subset(downtown, offense == "robbery")
qmplot(lon, lat, data = robberies, geom = "blank", zoom = 15, maptype = "toner-background", darken = .7) +
stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA)+
scale_fill_gradient2("Robbery\nPropensity", low = "white", mid = "yellow", high = "red", midpoint = 700)
# Showing crime density - using the ggmap() function
downtownmap<-get_stamenmap(c(left=-95.39681,bottom= 29.73631,right=-95.34188,top=29.78400),zoom = 15, maptype = "toner-background", darken= .7)
ggmap(downtownmap, darken=.7)+
stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA,data=robberies)+
scale_fill_gradient2("Robbery\nPropensity", low = "white", mid = "yellow", high = "red", midpoint = 700)
# Visualization by type of offense
qmplot(lon, lat, data = downtown, maptype = "toner-background", color = offense) +
facet_wrap(~ offense)
# Quiz1
# The file baseball.csv contains, for each of the 10 Korean professional baseball (KBO) teams,
# the longitude (lon) and latitude (lat) of its home stadium and its 2015 attendance (crowd).
# Read the file (header = TRUE) and mark each team's home location on the map,
# using a different color for each team.
# Hint :
# adapt the code qmplot(lon, lat, data = downtown, maptype = "toner-background", color = I("red")).
# Quiz1 - Advanced
# The result of Quiz1 looks a little awkward: the map of Korea is not shown all at once, and giving
# each of the 10 teams its own color is not an easy way to tell them apart. The file baseball.csv
# also contains the attendance figures (crowd).
# Using ggmap together with ggplot functions gives free control over the map extent and lets the
# plot carry this extra information; a hedged sketch follows below.
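# A minimal sketch, not the official solution: lon, lat and crowd are the columns named in the
# quiz text, and a team-name column "team" is assumed here purely for illustration.
baseball <- read.csv("baseball.csv", header = TRUE)
skorea <- c(left = 124, bottom = 33, right = 132, top = 39)
kmap <- get_stamenmap(skorea, zoom = 6, maptype = "toner-lite")
ggmap(kmap) +
  geom_point(data = baseball, aes(x = lon, y = lat, color = team, size = crowd), alpha = 0.7)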
# Visualizing text data
library(tm)
library(SnowballC)
library(wordcloud)
jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
###Data Cleaning
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation) ## . , ? ! ()
jeopCorpus <- tm_map(jeopCorpus, removeWords, c("the", "this", "The", "This", stopwords('english'))) ## stopwords("en")
jeopCorpus <- tm_map(jeopCorpus, stemDocument) ## went -> go
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
# Visualizing network data
# Basic Network Diagram
library(networkD3)
src <- c("A", "A", "A", "A",
"B", "B", "C", "C", "D")
target <- c("B", "C", "D", "J",
"E", "F", "G", "H", "I")
networkData <- data.frame(src, target)
simpleNetwork(networkData)
# Sankey Diagram
library(networkD3)
URL <- paste0(
"https://cdn.rawgit.com/christophergandrud/networkD3/",
"master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
head(Energy$nodes)
## name
## 1 Agricultural 'waste'
## 2 Bio-conversion
## 3 Liquid
## 4 Losses
## 5 Solid
## 6 Gas
head(Energy$links)
## source target value
## 1 0 1 124.729
## 2 1 2 0.597
## 3 1 3 26.862
## 4 1 4 280.322
## 5 1 5 81.144
## 6 6 2 35.000
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "TWh", fontSize = 12, nodeWidth = 30)
# Interactive Graphics
# ggvis system
# Data : mtcars data
# ggvis
# The %>% operator passes the left hand object to the first argument of the right hand expression
library(ggvis)
library(shiny)
p <- ggvis(mtcars, x = ~wt, y = ~mpg)
layer_points(p)
# ggvis using pipe function “%>%”
mtcars %>% ## DATA
ggvis(x = ~wt, y = ~mpg) %>% ## model
layer_points() ## plotting
# Different Layers
mtcars %>%
ggvis(x = ~wt, y = ~mpg) %>%
layer_points()
mtcars %>%
ggvis(x = ~wt, y = ~mpg) %>%
layer_lines()
mtcars %>%
ggvis(x = ~wt, y = ~mpg) %>%
layer_bars()
mtcars %>%
ggvis(x = ~wt, y = ~mpg) %>%
layer_smooths()
# Interactive Controls
mtcars %>%
ggvis(~wt, ~mpg,
size := input_slider(10, 100), # := set a property to a value
#size = ~wt, # = set a property w.r.t a variable with ~variable format
opacity := input_slider(0, 1)
) %>%
layer_points() ##Press Escape/Ctrl + C to stop.
mtcars %>%
ggvis(~wt, ~mpg,
size := input_numeric(30),
opacity := input_radiobuttons(choices=c(0.2,0.4,0.6,0.8,1))
) %>%
layer_points() ##Press Escape/Ctrl + C to stop.
# Examples
mtcars %>% ggvis(x=~wt) %>%
layer_densities(
adjust = input_slider(.1,2, value =1, step=.1, label="Bandwidth adjustment"),
kernel = input_select(
c("Gaussian"="gaussian","Epanechnikov"="epanechnikov","Rectangular"="rectangular", "Triangular"="triangular", "Biweight"="biweight","Cosine"="cosine", "Optcosine"="optcosine"), label="Kernel"))
# Quiz2
# With base R graphics we drew a scatter plot of the "pollution" data like this:
with(pollution, plot(latitude, pm25))
# This time, draw the same plot with the "ggvis" package and the pipe operator "%>%",
# making the opacity, size and fill color interactively adjustable.
# Hint
# opacity ("opacity" property) : use input_slider
# size ("size" property) : use input_numeric
# fill color ("fill" property) : use input_select
# A hedged sketch follows below.
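# A minimal sketch, not the official solution, following the hints above.
pollution %>%
  ggvis(x = ~latitude, y = ~pm25,
        opacity := input_slider(0, 1, value = 0.7, label = "Opacity"),
        size := input_numeric(50, label = "Size"),
        fill := input_select(c("black", "steelblue", "red"), label = "Fill")
  ) %>%
  layer_points() ##Press Escape/Ctrl + C to stop.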
# plotly system
# plotly
library(plotly)
set.seed(100)
d<-diamonds[sample(nrow(diamonds),1000), ]
# Recent plotly versions use the formula interface (~variable) to map data columns.
plot_ly(d, x = ~carat, y = ~price, text = ~paste("Clarity: ", clarity),
        type = "scatter", mode = "markers", color = ~carat, size = ~carat)
plot_ly(ggplot2::diamonds, y = ~price, color = ~cut, type = "box")
plot_ly(z = ~volcano, type = "surface")
Day 3 : Regression Analysis
install.packages("ISLR")
# Fitting a linear regression model
Advertising = read.csv("~/Desktop/Advertising.csv", header = TRUE)
lm.fit = lm(Sales ~ TV, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119,Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
confint(lm.fit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
# Fitting a multiple regression model
lm.fit = lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972,Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
cor(Advertising)
## X TV Radio Newspaper Sales
## X 1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
## TV 0.01771469 1.00000000 0.05480866 0.05664787 0.78222442
## Radio -0.11068044 0.05480866 1.00000000 0.35410375 0.57622257
## Newspaper -0.15494414 0.05664787 0.35410375 1.00000000 0.22829903
## Sales -0.05161625 0.78222442 0.57622257 0.22829903 1.00000000
# Dummy variables
Credit = read.csv("~/Desktop/Credit.csv", header = TRUE)
lm.fit = lm(Balance ~ Gender, data = Credit)
summary(lm.fit)
##
## Call:
## lm(formula = Balance ~ Gender, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## GenderFemale 19.73 46.05 0.429 0.669
## ---
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611,Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
# A regression model with dummy variables
lm.fit = lm(Balance~ Ethnicity, data=Credit)
summary(lm.fit)
##
## Call:
## lm(formula = Balance ~ Ethnicity, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## EthnicityAsian -18.69 65.02 -0.287 0.774
## EthnicityCaucasian -12.50 56.68 -0.221 0.826
## ---
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188,Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
# Relaxing the additivity assumption (interactions)
lm.fit = lm(Sales ~ TV*Radio, data = Advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV * Radio, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678,Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
# Variable selection
library(ISLR)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
# How the variable-selection statistics (RSS, adjusted R-squared, Cp, BIC) change with model size
# (assumed fix: the best-subset fit that reg.summary refers to was missing here, so the standard
#  regsubsets() step from the leaps package is added so that the plots below run)
library(leaps)
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary = summary(regfit.full)
par(mfrow = c(2,2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
which.max(reg.summary$adjr2)
## [1] 11
points(11, reg.summary$adjr2[11], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
## [1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
which.min(reg.summary$bic)
## [1] 6
points(6, reg.summary$bic[6], col = "red", pch = 20)
# Ridge Regression and the LASSO
x = model.matrix(Salary ~ . , data = Hitters)[,-1]
y = Hitters$Salary
library(glmnet)
grid = 10^seq(10, -2, length=100)
ridge.mod = glmnet(x, y, alpha=0, lambda = grid)
dim(coef(ridge.mod))
## [1] 20 100
ridge.mod$lambda[50]
## [1] 11497.57
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs
## 407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
## RBI Walks Years CAtBat CHits
## 0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
## CHmRun CRuns CRBI CWalks LeagueN
## 0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
## DivisionW PutOuts Assists Errors NewLeagueN
## -6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.360612
ridge.mod$lambda[60]
## [1] 705.4802
coef(ridge.mod)[,60]
## (Intercept) AtBat Hits HmRun Runs
## 54.32519950 0.11211115 0.65622409 1.17980910 0.93769713
## RBI Walks Years CAtBat CHits
## 0.84718546 1.31987948 2.59640425 0.01083413 0.04674557
## CHmRun CRuns CRBI CWalks LeagueN
## 0.33777318 0.09355528 0.09780402 0.07189612 13.68370191
## DivisionW PutOuts Assists Errors NewLeagueN
## -54.65877750 0.11852289 0.01606037 -0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11001
predict(ridge.mod, s=50, type = "coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
## RBI Walks Years CAtBat CHits
## 8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y.test = y[test]
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s=4, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 101036.8
mean((mean(y[train]) - y.test)^2)
## [1] 193253.1
ridge.pred = predict(ridge.mod, s=1e10, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 193253.1
set.seed(1)
cv.out = cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam
## [1] 211.7416
ridge.pred = predict(ridge.mod, s=bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 96015.51
out = glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 9.88487157 0.03143991 1.00882875 0.13927624 1.11320781
## RBI Walks Years CAtBat CHits
## 0.87318990 1.80410229 0.13074381 0.01113978 0.06489843
## CHmRun CRuns CRBI CWalks LeagueN
## 0.45158546 0.12900049 0.13737712 0.02908572 27.18227535
## DivisionW PutOuts Assists Errors NewLeagueN
## -91.63411299 0.19149252 0.04254536 -1.81244470 7.21208390
# The LASSO
lasso.mod=glmnet(x[train ,],y[train],alpha =1, lambda =grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train ,],y[train],alpha =1)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 16.78016
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test ,])
mean((lasso.pred-y.test)^2)
## [1] 100743.4
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam )[1:20 ,]
lasso.coef
## (Intercept) AtBat Hits HmRun Runs
## 18.5394844 0.0000000 1.8735390 0.0000000 0.0000000
## RBI Walks Years CAtBat CHits
## 0.0000000 2.2178444 0.0000000 0.0000000 0.0000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.0000000 0.2071252 0.4130132 0.0000000 3.2666677
## DivisionW PutOuts Assists Errors NewLeagueN
## -103.4845458 0.2204284 0.0000000 0.0000000 0.0000000
# Fitting polynomial regression models
fit = lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
fit2 = lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(fit2))
fit2a = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
coef(summary(fit2a))
fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
coef(summary(fit2b))
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey", main = "Degree-4 Polynomial")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd=1, col = "blue", lty = 3)
# Regression Splines
library(splines)
fit = lm(wage~bs(age, knots=c(25,40,60)), data = Wage)
pred = predict(fit, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2*pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2*pred$se, lty = "dashed")
# Natural Splines
fit2 = lm(wage ~ ns(age, df=4), data = Wage)
pred2 = predict(fit2, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred2$fit, col = "red", lwd = 2)
# Smoothing Splines
plot(age, wage, xlim=agelims, cex = .5, col = "darkgrey", main = "Smoothing Spline")
fit = smooth.spline(age, wage, df = 16)
fit2 = smooth.spline(age, wage, cv = TRUE)
fit2$df
## [1] 6.794596
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright",legend=c("16 DF","6.8 DF"),col=c("red "," blue "),lty =1, lwd =2, cex =.8)
# Local Regression
plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
title("Local Regression")
fit=loess(wage~age,span=.2,data=Wage)
fit2=loess(wage~age,span=.5,data=Wage)
lines(age.grid,predict(fit,data.frame(age=age.grid)),col="red",lwd=2)
lines(age.grid,predict(fit2,data.frame(age=age.grid)),col="blue",lwd=2)
legend("topright",legend=c("Span=0.2","Span=0.5"),col=c("red","blue"),lty=1,lwd=2,cex=.8)
# Generalized Additive Models (GAMs)
gam1 = lm(wage~ns(year, 4)+ns(age, 5) + education, data=Wage)
library(gam)
gam.m3 = gam(wage~s(year, 4) + s(age, 5) + education, data=Wage)
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
preds = predict(gam.m3, newdata = Wage)
par(mfrow = c(1,3))
plot(gam.m3, se=TRUE, col = "blue")
plot.gam(gam1, se=TRUE, col = "red")
gam.m1=gam(wage~s(age,5)+education,data=Wage)
gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
anova(gam.m1,gam.m2,gam.m3,test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,data=Wage)
par(mfrow=c(1,3))
plot.gam(gam.lo, se=TRUE, col="green")
gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)
library(akima)
par(mfrow=c(1,2))
plot(gam.lo.i)
table(education,I(wage>250))
##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45
par(mfrow=c(1,3))
gam.lr.s=gam(I(wage>250)~year+s(age, df=5)+education, family=binomial
, data=Wage, subset =( education !="1. < HS Grad"))
plot(gam.lr.s, se=T, col="green")
Day 4 : Classification, Clustering
[Practice]
-
Classification
## Data for An Introduction to Statistical Learning with Applications in R
library(ISLR)
## for lda() and qda()
library(MASS)
## for knn()
library(class)
## Plotting
library(ggplot2)
library(gridExtra)
data(package = "ISLR")
## Load Smarket (S&P Stock Market Data)
data(Smarket)
head(Smarket)
names(Smarket)
dim(Smarket)
summary(Smarket)
pairs(Smarket)
attach(Smarket)
# Logistic Regression
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
summary(glm.fit)
coef(glm.fit)
summary(glm.fit)$coef
summary(glm.fit)$coef[,4]
glm.probs=predict(glm.fit,type="response")
glm.probs[1:10]
contrasts(Direction)
glm.pred=rep("Down",1250)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction)
(507+145)/1250
mean(glm.pred==Direction)
train=(Year<2005)
Smarket.2005=Smarket[!train,]
dim(Smarket.2005)
Direction.2005=Direction[!train]
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
mean(glm.pred==Direction.2005)
mean(glm.pred!=Direction.2005)
## Plot
plotData <- ggplot(data = Smarket,
mapping = aes(x = Lag1, y = Lag2, color = Direction, shape = factor(Year))) +
geom_point(stat = "identity", alpha = 0.8) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Original data")
plotData
glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial(link = "logit"),subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
mean(glm.pred==Direction.2005)
106/(106+76)
predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
## Plot
plotData <- ggplot(data = Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = Direction, shape = factor(Year))) +
geom_point(stat = "identity", alpha = 0.8) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Test data")
plotData
## Put the predicted probability
predProbLogit <- predict(glm.fit,Smarket.2005,type="response")
predClassLogit <- factor(predict(glm.fit,Smarket.2005,type="response") > 0.5,
levels = c(FALSE,TRUE), labels = c("Down","Up"))
## Plot (probability)
plotLogit <- ggplot(data = Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predProbLogit, shape = factor(Year))) +
geom_point(alpha = 0.8) +
scale_color_gradient(low = "yellow", high = "red") +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted probability of outcome (Logistic)")
grid.arrange(plotData, plotLogit, ncol = 2)
plotLdaClass <- ggplot(data = Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predClassLogit, shape = factor(Year))) +
geom_point(alpha = 0.7) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted outcome (Logistic; p>0.5)")
grid.arrange(plotData, plotLdaClass, ncol = 2)
#Linear Discriminant Analysis
library(MASS)
lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
lda.fit
plot(lda.fit)
lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
lda.class=lda.pred$class
table(lda.class,Direction.2005)
mean(lda.class==Direction.2005)
sum(lda.pred$posterior[,1]>=.5)
sum(lda.pred$posterior[,1]<.5)
lda.pred$posterior[1:20,1]
lda.class[1:20]
sum(lda.pred$posterior[,1]>.9)
## Put into the dataset
predProbLda <- lda.pred$posterior[,"Up"]
predClassLda <- lda.pred$class
## Plot (probability)
plotLdaProb <- ggplot(Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predProbLda, shape = factor(Year))) +
geom_point(alpha = 0.5) +
scale_color_gradient(low = "yellow", high = "red") +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted probability of outcome (LDA)")
grid.arrange(plotData, plotLdaProb, ncol = 2)
## Plot (classification)
plotLdaClass <- ggplot(Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predClassLda, shape = factor(Year))) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted outcome (LDA)")
grid.arrange(plotData, plotLdaClass, ncol = 2)
# Quadratic Discriminant Analysis
qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
qda.fit
qda.pred=predict(qda.fit,Smarket.2005)
qda.class=qda.pred$class
table(qda.class,Direction.2005)
mean(qda.class==Direction.2005)
## Put into the dataset
predProbQda <- qda.pred$posterior[,"Up"]
predClassQda <- qda.pred$class
## Plot (probability)
plotQdaProb <- ggplot(Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predProbQda, shape = factor(Year))) +
geom_point(alpha = 0.5) +
scale_color_gradient(low = "yellow", high = "red") +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted probability of outcome (QDA)")
grid.arrange(plotData, plotQdaProb, ncol = 2)
## Plot (classification)
plotQdaClass <- ggplot(Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predClassQda, shape = factor(Year))) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted outcome (QDA)")
grid.arrange(plotData, plotQdaClass, ncol = 2)
# K-Nearest Neighbors
library(class)
train.X=cbind(Lag1,Lag2)[train,]
test.X=cbind(Lag1,Lag2)[!train,]
train.Direction=Direction[train]
set.seed(1)
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,Direction.2005)
(83+43)/252
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,Direction.2005)
mean(knn.pred==Direction.2005)
## Fit KNN (the output is a vector of prediction)
knn.pred <- knn(train = train.X,
test = test.X,
cl = train.Direction, k = 3)
predClassKnn <- knn.pred
## Plot (classification)
plotKnnClass <- ggplot(Smarket.2005,
mapping = aes(x = Lag1, y = Lag2, color = predClassKnn, shape = factor(Year))) +
geom_point(alpha = 0.8) +
scale_color_manual(values = c("Down" = "yellow", "Up" = "red")) +
theme_bw() +
theme(legend.key = element_blank()) +
labs(title = "Predicted outcome (KNN)")
grid.arrange(plotData, plotKnnClass, ncol = 2)
-
Clustering and factor analysis
# K-Means Clustering
library(ggplot2)
set.seed(2)
nd=matrix(rnorm(50*2), ncol=2)
nd[1:25,1]=nd[1:25,1]+3
nd[1:25,2]=nd[1:25,2]-4
colnames(nd)=c("x1","x2")
km.out=kmeans(nd,2,nstart=20)
km.out$cluster
cluster=factor(km.out$cluster)
nd1=data.frame(nd,cluster)
c1=ggplot(nd1, aes(x1, x2 , color = cluster)) +
geom_point(alpha = 0.4, size = 3.5) + labs(title = "K-Means Clustering Results with K=2")
print(c1)
dev.off()
set.seed(4)
km.out=kmeans(nd,3,nstart=20)
km.out
cluster=factor(km.out$cluster)
nd2=data.frame(nd,cluster)
c2=ggplot(nd2, aes(x1, x2 , color = cluster)) +
geom_point(size = 3.5) + labs(title = "K-Means Clustering Results with K=3")
print(c2)
centers=as.data.frame(km.out$centers)
ggplot(data=nd2, aes(x=x1, y=x2,color=cluster )) +
geom_point() +
geom_point(data=centers, aes(x=x1,y=x2, color='Center')) +
geom_point(data=centers, aes(x=x1,y=x2, color='Center'), size=52, alpha=.3, show.legend=FALSE)
set.seed(3)
km.out=kmeans(nd,3,nstart=1)
km.out$tot.withinss
km.out=kmeans(nd,3,nstart=20)
km.out$tot.withinss
#cf)
library(ggfortify)
autoplot(km.out, data=nd1,frame = TRUE)
autoplot(km.out, data=nd1,frame = TRUE, frame.type = 'norm')
autoplot(km.out, data=nd2, frame.type = "convex", frame.level = 0.95,
frame.alpha = 0.2, main="K-Means Clustering Results with K=3")
# Hierarchical Clustering
hc.complete=hclust(dist(nd), method="complete")
hc.average=hclust(dist(nd), method="average")
hc.single=hclust(dist(nd), method="single")
# your own labels (the rownames) are supplied in geom_text() via label=label
# dendro_data() and ggdendrogram() come from the ggdendro package, so load it here
library(ggdendro)
dendr= dendro_data(hc.complete, type="rectangle")
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Complete Linkage")
dendr = dendro_data(hc.average, type="rectangle")
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Average Linkage")
dendr = dendro_data(hc.single, type="rectangle")
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Single Linkage")
ndsc=scale(nd)
hcsc.complete=hclust(dist(ndsc), method="complete")
dendr = dendro_data(hcsc.complete, type="rectangle")
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Hierarchical Clustering with Scaled Features")
nd3=matrix(rnorm(30*3), ncol=3)
ndd=as.dist(1-cor(t(nd3)))
hcc.complete=hclust(ndd, method="complete")
dendr = dendro_data(hcc.complete, type="rectangle")
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Complete Linkage with Correlation-Based Distance")
cutree(hc.complete, 2)
cutree(hc.average, 2)
cutree(hc.single, 2)
cutree(hc.single, 4)
hc.average=hclust(dist(nd), method="average") # hierarchical clustering
dendr = dendro_data(hc.average, type="rectangle") # convert for ggplot
clust = cutree(hc.average,k=3) # cut into 3 clusters
names(clust)=c(1:50)
clust.df <- data.frame(label=names(clust), cluster=factor(clust))
# dendr[["labels"]] has the labels, merge with clust.df based on label column
dendr[["labels"]] <- merge(dendr[["labels"]],clust.df, by="label")
# plot the dendrogram; note use of color=cluster in geom_text(...)
ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x, y, label=label, hjust=0, color=cluster),
size=3) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())
#(cf)
# demonstrate plotting directly from object class hclust
ggdendrogram(hc.complete)
ggdendrogram(hc.complete, rotate = TRUE, size = 5, theme_dendro = FALSE)
# demonstrate converting hclust to dendro using dendro_data first
hcdata <- dendro_data(hc.complete)
ggdendrogram(hcdata, rotate=TRUE) +
labs(title="Dendrogram in ggplot2")
# Triangular lines
ddata <- dendro_data(as.dendrogram(hc.complete), type = "triangle")
ggplot(segment(ddata)) + geom_segment(aes(x = x, y = y, xend = xend,
yend = yend)) +
geom_text(data = label(ddata), aes(x = x,y = y, label = label), angle = 90, lineheight = 0)
# factor analysis
life<-read.table("life_expectation.txt",header=TRUE)
cor.life<-cor(life)
library(corrplot)
corrplot(cor.life, type="upper", order="hclust", tl.col="black", tl.srt=45)
d.factanal=factanal(life,factors=1, method="mle",rotation="varimax")
d.factanal=factanal(life,factors=2, method="mle",rotation="varimax")
d.factanal=factanal(life,factors=3, method="mle",rotation="varimax")
# How could we interpret the three factors?
sapply(1:3, function(f)
factanal(life, factors = f, method="mle")$PVAL)
# The communalities:
life.exp <- factanal(life,factors=3, rotation="varimax")
1-life.exp$uniquenesses
### Plot loadings against one another
# Factor 1 vs. Factor 2
text=rownames(d.factanal$loadings)
load = d.factanal$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
geom_point() +
geom_text(angle = 20,size=5)+
labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")
# Factor 1 vs. Factor 3
text=rownames(d.factanal$loadings)
load = d.factanal$loadings
ggplot(data.frame(load[,c(1,3)]),aes(Factor1, Factor3,label = text))+
geom_point() +
geom_text(angle = 20,size=5)+
labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 3 loadings")
# Factor 2 vs. Factor 3:
ggplot(data.frame(load[, 2:3]),aes(Factor2, Factor3,label = text))+
geom_point() +
geom_text(angle = 20,size=5)+
labs(title="Factor 2 vs Factor 3", x="Factor2 loadings", y="Factor 3 loadings")
# Estimating factor scores for the life expectancy data set:
scores <- factanal(life,factors=3, rotation="varimax",scores="regression")$scores
### Doing separate 2-D scatterplots of the factor scores:
# Factor 1 vs. Factor 2
scores[,c(1,2,3)]
text=rownames(scores)
ggplot(data.frame(scores[, 1:2]),aes(Factor1, Factor2,label = text))+
geom_point() +
geom_text(angle = 20)+
labs(title="Factor 1 vs Factor 2", x="Factor1 scores", y="Factor 2 scores")
# Factor 1 vs. Factor 3:
ggplot(data.frame(scores[, c(1,3)]),aes(Factor1, Factor3,label = text))+
geom_point() +
geom_text(angle = 20)+
labs(title="Factor 1 vs Factor 3", x="Factor1 scores", y="Factor 3 scores")
# Factor 2 vs. Factor 3:
ggplot(data.frame(scores[, 2:3]),aes(Factor2, Factor3,label = text))+
geom_point() +
geom_text(angle = 20)+
labs(title="Factor 2 vs Factor 3", x="Factor2 scores", y="Factor 3 scores")
#(cf)1.
d.factanal = factanal(life, factors = 3, scores = 'regression')
autoplot(d.factanal, label = TRUE, label.size = 3,
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 3)
#(cf)2.
#Using fa() from package psych with rotation
install.packages("psych")
library(psych)
d.fa = fa(r=cor.life, nfactors=3, n.obs=31, rotate="varimax")
d.fa
#Factor scores
scores <- factor.scores(x=life, f=d.fa, method="regression")
head(scores)
#Visualize loadings
factor.plot(d.fa, cut=0.5)
-
Tree Models
### Decision Trees
# Fitting Classification Trees
Carseats<-read.table("Carseats.txt")
library(tree)
library(ggplot2)
library(ggdendro)
attach(Carseats)
High=as.factor(ifelse(Sales<=8,"No","Yes")) # make the response a factor so tree() fits a classification tree
Carseats=data.frame(Carseats,High)
tree.carseats=tree(High~.-Sales, Carseats)
plot(tree.carseats)
text(tree.carseats)
summary(tree.carseats)
dev.off()
# Confusion matrix
set.seed(2)
train=sample(1:nrow(Carseats), 200)
Carseats.test=Carseats[-train,]
High.test=High[-train]
tree.carseats=tree(High~.-Sales,Carseats,subset=train)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(86+57)/200
# perform cross-validation in order to determine the optimal level of tree complexity
set.seed(3)
cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)
# prune.tree() enacts optimal deviance pruning
# prune.misclass() enacts optimal misclassification rate pruning
names(cv.carseats)
cv.carseats
# plot the error rate as a function of both size and k
library(gridExtra)
cv=data.frame(cv.carseats$size,cv.carseats$k,cv.carseats$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
cv1=ggplot(cv,aes(x=size,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
cv2=ggplot(cv,aes(x=k,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
grid.arrange(cv1, cv2, nrow=1, ncol=2)
prune.carseats=prune.misclass(tree.carseats,best=9)
plot(prune.carseats)
text(prune.carseats,pretty=0)
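As the comments above note, prune.tree() performs deviance-based pruning; purely for comparison, a minimal sketch (reusing best=9, which is only an illustrative choice here):
prune.dev=prune.tree(tree.carseats,best=9) # deviance-based pruning, for comparison with prune.misclass()
summary(prune.dev)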
# Confusion matrix
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(94+60)/200
prune.carseats=prune.misclass(tree.carseats,best=15)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
(86+62)/200
# (cf)
library(rpart)
tree.carseats = rpart(High~.-Sales, method = "class", data = Carseats)
# display the results
printcp(tree.carseats)
# Classification tree diagrams
ddata <- dendro_data(tree.carseats)
ggplot() +
geom_segment(data = ddata$segments,
aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = ddata$labels,
aes(x = x, y = y, label = label), size = 4, vjust = 0) +
geom_text(data = ddata$leaf_labels,
aes(x = x, y = y, label = label), size = 4, vjust = 1) +
theme_dendro()
# visualize cross-validation results
plotcp(tree.carseats)
# plot tree
plot(tree.carseats, uniform=TRUE,
main="Classification Tree for Carseats")
text(tree.carseats, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfit<- prune(tree.carseats, cp=tree.carseats$cptable[which.min(tree.carseats$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pfit, uniform=TRUE,
main="Pruned Classification Tree for Carseats")
text(pfit, use.n=F, all=TRUE, cex=.8)
# plot the pruned tree as a ggplot dendrogram
ddata <- dendro_data(pfit)
ggplot() +
geom_segment(data = ddata$segments,
aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = ddata$labels,
aes(x = x, y = y, label = label), size = 4, vjust = 0) +
geom_text(data = ddata$leaf_labels,
aes(x = x, y = y, label = label), size = 4, vjust = 1) +
theme_dendro()
install.packages("rpart.plot")
library(rpart.plot)
prp(pfit, faclen = 0, cex = 0.8, extra = 1)
only_count <- function(x, labs, digits, varlen)
{
paste(x$frame$n)
}
boxcols <- c("pink", "lightblue")[pfit$frame$yval]
par(xpd=TRUE)
prp(pfit, faclen = 0, cex = 0.8, node.fun=only_count, box.col = boxcols)
legend("bottomright", legend = c("No","Yes"), fill = c("pink", "lightblue"),
title = "Group")
# Fitting Regression Trees
library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~.,Boston,subset=train)
summary(tree.boston)
# Regression tree diagrams
tree_data <- dendro_data(tree.boston)
ggplot(segment(tree_data)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n),
colour = "blue", alpha = 0.5) +
scale_size("n") +
geom_text(data = label(tree_data),
aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
geom_text(data = leaf_label(tree_data),
aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
theme_dendro()+
labs(title="Regression tree diagrams")
# perform cross-validation in order to determine the optimal level of tree complexity
cv.boston=cv.tree(tree.boston)
# plot the error rate as a function of size
cv=data.frame(cv.boston$size,cv.boston$k,cv.boston$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
ggplot(cv,aes(x=size,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
# plot the pruned tree
prune.boston=prune.tree(tree.boston,best=5)
tree_data <- dendro_data(prune.boston)
ggplot(segment(tree_data)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n),
colour = "blue", alpha = 0.5) +
scale_size("n") +
geom_text(data = label(tree_data),
aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
geom_text(data = leaf_label(tree_data),
aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
theme_dendro()+
labs(title="Regression tree diagrams")
# the test set MSE associated with the regression tree
yhat=predict(tree.boston,newdata=Boston[-train,])
boston.test=Boston[-train,"medv"]
mean((yhat-boston.test)^2)
ggplot()+
geom_point(aes(x=yhat, y=boston.test))+
geom_abline(intercept = 0, slope = 1, color="red",
linetype="dashed", size=1.5)+
xlab(expression(hat(y)))+
ylab(expression(y[0]))
# Bagging and Random Forests
# Bagging
library(randomForest)
set.seed(1)
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston
# the test MSE for bagging
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
ggplot()+
geom_point(aes(x=yhat.bag, y=boston.test))+
geom_abline(intercept = 0, slope = 1, color="red",
linetype="dashed", size=1.5)+
xlab(expression(hat(y)))+
ylab(expression(y[0]))
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
# Random Forests
set.seed(1)
rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
# the test MSE for random forests
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
# Variable Importance Plot
var_importance <- data.frame(variable=setdiff(colnames(Boston), "medv"),
importance=as.vector(rf.boston$importance[,1]))
var_importance <- var_importance[order(var_importance[,2]),]
var_importance$variable <- factor(var_importance$variable, levels=var_importance$variable)
vi=ggplot(var_importance,aes(x=variable, y=importance, fill=variable)) +
geom_bar(stat='identity', position='identity')+
scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
coord_flip()+
theme(axis.text.x=element_blank(),
axis.text.y=element_text(size=12),
axis.title=element_text(size=16),
plot.title=element_text(size=18),
legend.title=element_text(size=16),
legend.text=element_text(size=12))+
ylab("Mean square error")
var_purity <- data.frame(variable=setdiff(colnames(Boston), "medv"),
purity=as.vector(rf.boston$importance[,2]))
var_purity = var_purity[order(var_purity[,2]),]
var_purity$variable <- factor(var_purity$variable, levels=var_purity$variable)
np=ggplot(var_purity,aes(x=variable, y=purity, fill=variable)) +
geom_bar(stat='identity', position='identity') +
scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
coord_flip()+
theme(axis.text.x=element_blank(),
axis.text.y=element_text(size=12),
axis.title=element_text(size=16),
plot.title=element_text(size=18),
legend.title=element_text(size=16),
legend.text=element_text(size=12))+
ylab("Node impurity")
grid.arrange(vi, np, nrow=1, ncol=2,top="Variable Importance from Random Forest Fit")
varImpPlot(rf.boston)
# Boosting
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)
# partial dependence plots for rm and lstat.
par(mfrow=c(1,2))
plot(boost.boston,i="rm",col="red")
plot(boost.boston,i="lstat",col="blue")
# the test MSE for boosting
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
[Exercises]
-
Classification
# Logistic Regression
# Exercise1
#(1)
titanic = read.csv("titanic3.csv")
dim(titanic)
str(titanic)
summary(titanic)
sum(is.na(titanic))
titanic = titanic[,c(1,2,4:7)]
titanic=na.omit(titanic)
glm.fit= glm(survived ~ pclass + sex + age + sibsp,
family = binomial, data = titanic)
summary(glm.fit)
coef(glm.fit)
summary(glm.fit)$coef
summary(glm.fit)$coef[,4]
#(2)
attach(titanic)
train = (parch < 1)
test = !train
train.titanic = titanic[train, ]
test.titanic = titanic[test, ]
test.survived = survived[test]
#(3)
glm.fit=glm(survived ~ pclass + sex + age,
            family = binomial, data = titanic, subset = train)
glm.probs=predict(glm.fit,test.titanic,type="response")
glm.pred=rep(0,length(glm.probs))
glm.pred[glm.probs>.5]=1
mean(glm.pred!=test.survived)
# Linear Discriminant Analysis
# Exercise 2
#(1)
library(MASS)
lda.fit=lda(survived ~ pclass + sex + age + sibsp,data=titanic, subset=train)
lda.fit
#(2)
lda.pred=predict(lda.fit,test.titanic)
names(lda.pred)
lda.class=lda.pred$class
table(lda.class,test.survived)
mean(lda.class!=test.survived)
# Quadratic Discriminant Analysis
# Exercise 3
#(1)
qda.fit=qda(survived ~ pclass + sex + age + sibsp,data=titanic, subset=train)
qda.fit
#(2)
qda.class=predict(qda.fit ,test.titanic)$class
table(qda.class,test.survived)
mean(qda.class!=test.survived)
# K-Nearest Neighbors
# Exercise 4
#(1)
library(class)
# knn() requires a numeric matrix, so encode sex as a numeric code first
sex.num = as.numeric(factor(sex))
train.x = cbind(pclass, sex.num, age, sibsp)[train, ]
rownames(train.x)=NULL
test.x = cbind(pclass, sex.num, age, sibsp)[test, ]
rownames(test.x)=NULL
train.survived = survived[train]
# K=1
set.seed(0413)
knn.pred=knn(train.x,test.x,train.survived,k=1)
mean(knn.pred != test.survived)
# K=3
knn.pred=knn(train.x,test.x,train.survived,k=3)
mean(knn.pred != test.survived)
# K=100
knn.pred = knn(train.x,test.x,train.survived,k = 100)
mean(knn.pred != test.survived)
-
Clustering and Factor Analysis
# K-Means Clustering
# Exercise 10
library(ggfortify)
geneData = read.csv("Ch10Ex11.csv", header=F)
dim(geneData)
geneData = t(geneData)
rownames(geneData)=c(paste("H", 1:20, sep = ""),paste("D", 1:20, sep = ""))
colnames(geneData)=c(paste("g",1:1000,sep=""))
set.seed(0413)
km.out=kmeans(geneData,2,nstart=20)
cluster = km.out$cluster
cluster
autoplot(km.out,data=geneData,label = TRUE, label.size = 3, frame=TRUE)
# Hierarchical Clustering
# Exercise 11
geneData = read.csv("Ch10Ex11.csv", header=F)
dim(geneData)
geneData = t(geneData)
rownames(geneData)=c(paste("H", 1:20, sep = ""),paste("D", 1:20, sep = ""))
colnames(geneData)=c(paste("g",1:1000,sep=""))
dd = as.dist(1 - cor(t(geneData)))
hc.complete=hclust(dd, method="complete")
hc.average=hclust(dd, method="average")
hc.single=hclust(dd, method="single")
library(ggdendro)
library(gridExtra)
dendr = dendro_data(hc.complete, type="rectangle")
h1=ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Complete Linkage with Correlation-Based Distance")
dendr = dendro_data(hc.average, type="rectangle")
h2=ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Average Linkage with Correlation-Based Distance")
dendr = dendro_data(hc.single, type="rectangle")
h3= ggplot() +
geom_segment(data=segment(dendr), aes(x=x, y=y, xend=xend, yend=yend)) +
geom_text(data=label(dendr), aes(x=x, y=y, label=label, hjust=0), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 0)) +
theme(axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.title.y=element_blank(),
panel.background=element_rect(fill="white"),
panel.grid=element_blank())+
labs(title="Single Linkage with Correlation-Based Distance")
grid.arrange(h1, h2, h3,nrow=1, ncol=3)
# factor analysis
# Exercise 12
#(1)
pers<-read.csv("personality.csv")
library(corrplot)
corrplot(cor(pers), order = "hclust", tl.col="black", tl.srt=90)
stand.pers = as.data.frame(scale(pers))
#(2)
fac.pers = factanal(stand.pers, factors = 10, rotation = "none", na.action = na.omit)
fac.pers
#(3)
1-fac.pers$uniquenesses
# Plot loadings against one another
text=rownames(fac.pers$loadings)
load = fac.pers$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
geom_point() +
geom_text(angle = 20,size=5)+
labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")
#(4)
# Factor analysis with rotation
fac.pers2 = factanal(stand.pers, factors = 10, rotation = "varimax", na.action = na.omit)
fac.pers2
load2 = fac.pers2$loadings[,1:2]
### Plot loadings against one another
text=rownames(fac.pers2$loadings)
load = fac.pers2$loadings
ggplot(data.frame(load[,1:2]),aes(Factor1, Factor2,label = text))+
geom_point() +
geom_text(angle = 20,size=5)+
labs(title="Factor 1 vs Factor 2", x="Factor1 loadings", y="Factor 2 loadings")
-
Tree Models
### Decision Trees
# classification Trees
# Exercise 5
#(1)
titanic = read.csv("titanic3.csv")
titanic = titanic[,c(1,2,4:7)]
titanic=na.omit(titanic)
attach(titanic)
train = (parch < 1)
test = !train
train.titanic = titanic[train, ]
test.titanic = titanic[test, ]
test.survived = survived[test]
tree.titanic=tree(as.factor(survived)~ pclass+age+sex+sibsp, titanic,subset=train)
summary(tree.titanic)
#(2)
plot(tree.titanic)
text(tree.titanic,pretty = 0)
tree.pred=predict(tree.titanic,test.titanic, type="class")
table(tree.pred,test.survived)
(70+4)/length(tree.pred)
#(3)
set.seed(0413)
cv.titanic = cv.tree(tree.titanic, FUN = prune.misclass)
names(cv.titanic)
cv.titanic
# plot the error rate as a function of both size and k
library(gridExtra)
cv=data.frame(cv.titanic$size,cv.titanic$k,cv.titanic$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
cv1=ggplot(cv,aes(x=size,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
cv2=ggplot(cv,aes(x=k,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
grid.arrange(cv1, cv2, nrow=1, ncol=2)
prune.titanic=prune.misclass(tree.titanic,best=3)
# Confusion matrix
tree.pred=predict(prune.titanic,test.titanic,type="class")
table(tree.pred,test.survived)
(70+4)/length(tree.pred)
# Regression Trees
# Exercise 6
#(1)
library(ISLR)
dim(Hitters)
sum(is.na(Hitters$Salary))
Hitters =na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))
train=1:200
train.hitters=Hitters[train,]
test.hitters=Hitters[-train,]
attach(Hitters)
tree.hitters=tree(log(Salary) ~., data=Hitters,subset=train)
tree.hitters
#(2)
tree_data <- dendro_data(tree.hitters)
ggplot(segment(tree_data)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend, size = n),
colour = "blue", alpha = 0.5) +
scale_size("n") +
geom_text(data = label(tree_data),
aes(x = x, y = y, label = label), vjust = -0.5, size = 4) +
geom_text(data = leaf_label(tree_data),
aes(x = x, y = y, label = label), vjust = 0.5, size = 3) +
theme_dendro()+
labs(title="Regression tree diagrams")
pred.hitters = predict(tree.hitters, test.hitters)
mean((log(test.hitters$Salary) - pred.hitters)^2)
#(3)
set.seed(0413)
cv.hitters = cv.tree(tree.hitters, FUN = prune.tree)
cv=data.frame(cv.hitters$size,cv.hitters$k,cv.hitters$dev)
colnames(cv) = c("size","k","dev")
min=subset(cv, dev==min(dev))
ggplot(cv,aes(x=size,y=dev))+
geom_point()+
geom_point(data=min, size=3, colour=alpha("red", 0.7))+
geom_line()
pruned.hitters = prune.tree(tree.hitters, best = 5)
pred.pruned = predict(pruned.hitters, test.hitters)
mean((log(test.hitters$Salary) - pred.pruned)^2)
# Bagging
# Exercise 7
#(1)
train=1:200
train.hitters=Hitters[train,]
test.hitters=Hitters[-train,]
attach(Hitters)
library(randomForest)
set.seed(0413)
bag.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 19, ntree = 500,
importance = T)
#(2)
bag.pred = predict(bag.hitters, test.hitters)
mean((log(test.hitters$Salary) - bag.pred)^2)
bag.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 19, ntree = 25,
importance = T)
bag.pred = predict(bag.hitters, test.hitters)
mean((log(test.hitters$Salary) - bag.pred)^2)
# Random Forest
# Exercise 8
#(1)
set.seed(0413)
rf.hitters = randomForest(log(Salary) ~ ., data = train.hitters, mtry = 6, ntree = 500,
importance = T)
rf.pred = predict(rf.hitters, test.hitters)
mean((log(test.hitters$Salary) - rf.pred)^2)
#(2)
# Variable Importance Plot
var_importance <- data.frame(variable=setdiff(colnames(Hitters), "Salary"),
importance=as.vector(rf.hitters$importance[,1]))
var_importance <- var_importance[order(var_importance[,2]),]
var_importance$variable <- factor(var_importance$variable, levels=var_importance$variable)
ggplot(var_importance,aes(x=variable, y=importance, fill=variable)) +
geom_bar(stat='identity', position='identity')+
scale_fill_discrete(name="Variable Name",guide = guide_legend(reverse=TRUE)) +
coord_flip()+
theme(axis.text.x=element_blank(),
axis.text.y=element_text(size=12),
axis.title=element_text(size=16),
plot.title=element_text(size=18),
legend.title=element_text(size=16),
legend.text=element_text(size=12))+
ylab("Mean square error")
# Boosting
# Exercise 9
#(1)
library(gbm)
set.seed(0413)
boost.hitters=gbm(log(Salary)~.,data=train.hitters,distribution="gaussian",
n.trees=5000,interaction.depth=4,shrinkage = 0.01)
summary(boost.hitters)
#(2)
yhat.boost=predict(boost.hitters,newdata=test.hitters,n.trees=5000)
mean((yhat.boost-log(test.hitters$Salary))^2)
Day 5 : Recommendation System
########################
#3. Collaborative filtering
#3.1 Jester5k data ####
#data loading
install.packages("recommenderlab")
library(recommenderlab)
data(Jester5k)
head(as(Jester5k,"matrix"))
image(Jester5k[1:80]) # inspect the rating matrix for the first 80 users of Jester5k
#collaborative filtering
r=Recommender(Jester5k[1:1000],method="UBCF") # set method="IBCF" for item-based CF (see the short sketch below)
pr=predict(r, Jester5k[1:2], type="ratings")
as(pr,"matrix") #이미 평가한 자료의 경우 예측값을 주지 않는다
ptype=predict(r, Jester5k[1:2], n=5) #n을 통해 추천품목수를 조정 가능
as(ptype,"list")
as(pr[1],"matrix")[,order(as(pr[1],"matrix"),decreasing = T)] #첫번째 user에 대해 예상별점을 정렬
#3.2 Scale adjustment #######
#making dummy variable
a = as(Jester5k[1:1000],"matrix")
dgmat=cbind(a[1:100000],as.data.frame(cbind(rep(rownames(a),ncol(Jester5k[1:1000])),rep(colnames(a),each=nrow(Jester5k[1:1000]))))) # each row holds one rating together with its user and item
colnames(dgmat)=c("rating","user","item")
user=unique(dgmat$user); item=unique(dgmat$item)
head(dgmat)
dgmat=dgmat[is.na(dgmat$rating)==F,] # keep only the rows with an observed rating
dummy=model.matrix(rating~user+item,dgmat) # create dummy variables for user and item
head(dummy)
#estimate mu by ridge regression
library(glmnet)
#cv.lm=cv.glmnet(dummy, dgmat$rating, type.measure="deviance", alpha=0) # takes a long time to run; try it at home
lm = glmnet(dummy,dgmat$rating,family="gaussian",lambda=0.2, alpha=0) # roughly matches the optimal lambda from CV; in practice, plug in cv.lm$lambda.min
# convert the data back to the original format
dgmat$rating=dgmat$rating-as.vector(dummy%*%lm$beta)
user.index=match(dgmat$user,user); item.index=match(dgmat$item,item) # find which user (item) index each row corresponds to
mat=sparseMatrix(i=user.index,j=item.index,x=dgmat$rating)
mat=as.matrix(mat) ; sum(dgmat$rating==0,na.rm=T) ;mat[mat==0]=NA # confirm that no rating equals 0, then mark the empty cells as NA
colnames(mat)=item ; rownames(mat)=user
mat = as(mat,"realRatingMatrix") # back to a rating matrix for the Recommender function
# compute the predicted ratings
r1=Recommender(mat,method="UBCF")
pr1=predict(r1, mat[1:2], type="ratings")
pr1 = as(pr1,"matrix")
rownames(lm$beta)=gsub('user','',rownames(lm$beta)) ;rownames(lm$beta)=gsub('item','',rownames(lm$beta)) # clean up the names of lm$beta (check why this is needed before running the next step)
item=as.character(item) ; user=as.character(user)
user1_dummy=c() ; user2_dummy=c()
for(i in 1:length(item)){ # build the dummy rows needed to add back the mu terms estimated above for user1 and user2
b1=rep(0,length(lm$beta)) ; b2=rep(0,length(lm$beta))
b1[match(c(user[1],item[i]),rownames(lm$beta))]=1
b2[match(c(user[2],item[i]),rownames(lm$beta))]=1
b1[1]=1 ; b2[1]=1
user1_dummy=cbind(user1_dummy,b1) ; user2_dummy=cbind(user2_dummy,b2)
}
scale.value=rbind(t(lm$beta) %*% user1_dummy ,t(lm$beta) %*% user2_dummy)
pr1.final= scale.value+pr1 ; colnames(pr1.final)=item
pr1.final[,order(pr1.final[1,],decreasing = T)] # sort user1's predicted ratings in decreasing order
#Example - MovieLense data
data("MovieLense")
#Try following the same steps above! (a minimal sketch follows below)
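For MovieLense, a minimal sketch of the same UBCF workflow (the 500-user training split and the two held-out users are only illustrative assumptions):
r.ml=Recommender(MovieLense[1:500],method="UBCF") # train on the first 500 users
pr.ml=predict(r.ml, MovieLense[501:502], type="ratings") # predicted ratings for two held-out users
as(pr.ml,"matrix")[,1:5] # peek at a few predicted columns
top.ml=predict(r.ml, MovieLense[501:502], n=5) # top-5 recommendations instead of ratings
as(top.ml,"list")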
########################
#4. matrix factorization
#4.1 Jester5k data ####
#load library (the matfact package is assumed to be provided with the course materials; it is not on CRAN)
library(matfact)
mf = matfact(as(Jester5k,"matrix")[1:1000,])
pred = mf$P %*% t(mf$Q)
colnames(pred)=item ; rownames(pred)=user
head(pred)
#prediction
i1 = is.na(as(Jester5k,"matrix")[1,]) # items that users 1 and 2 have not rated
i2 = is.na(as(Jester5k,"matrix")[2,])
pred[1,i1==T] ; sort(pred[1,i1==T],decreasing=T) # look at predicted ratings only for the unrated items
pred[2,i2==T] ; sort(pred[2,i2==T],decreasing=T)
#4.2 Scale adjustment ####
mat1=as(mat,"matrix") # reuse the mat variable from the scale adjustment above
mf1 = matfact(mat1)
pred1 = mf1$P %*% t(mf1$Q)
colnames(pred1)=item ; rownames(pred1)=user
pred1.final=pred1[1:2,]+scale.value # add the same adjustment values as before
pred1.final[1,i1==T] ; pred1.final[2,i2==T] # predicted ratings for users 1 and 2 via matrix factorization
sort(pred1.final[1,i1==T],decreasing=T); sort(pred1.final[2,i2==T],decreasing=T) # the same predictions, sorted by rating
#matrix factorization
#Example - MovieLense data
#Try following the same steps above! (a minimal sketch follows below)
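Because matfact() is provided with the course materials, the sketch below substitutes recommenderlab::funkSVD() as the matrix-factorization routine for MovieLense; the 200-user subset and k=10 are illustrative assumptions only:
data("MovieLense")
ml=as(MovieLense[1:200],"matrix") # small subset so the factorization runs quickly
mf.ml=funkSVD(ml, k=10) # stand-in for matfact(); returns the latent factor matrices U and V
pred.ml=mf.ml$U %*% t(mf.ml$V) # reconstructed (predicted) rating matrix
dimnames(pred.ml)=dimnames(ml)
j1=is.na(ml[1,]) # items the first user has not rated
sort(pred.ml[1,j1==T],decreasing=T)[1:10] # top predicted ratings among the unrated items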
##########################
#7. Recommendation using purchase data
#7.1 Association rule analysis ####
#load the purchase data (the tot and shop matrices are assumed to be provided with the course materials)
install.packages("arules")
library(arules)
shop=as.matrix(shop)
rules=apriori(shop, parameter=list(supp=0.1,conf=0.9)) # mine the association rules
rules.sorted=sort(rules, by="lift")
inspect(rules.sorted)
#7.2 conditional regression approach ####
#takes too long to run here...
#pred=c()
#for(i in 1:ncol(tot)){
# cv.fit=cv.glmnet(tot[,-i],tot[,i],family="binomial",alpha=0,nfolds=3)
# pred.new=predict(cv.fit,newx=tot[,-i],s="lambda.min",type="response")
# pred=cbind(pred,pred.new)
#}
#load the precomputed results instead
pred=read.csv("F:\\Stat\\job\\recommender_공개강좌\\실습자료\\prediction.csv")
rownames(pred)=pred[,1];pred=pred[,-1]
pred[which(tot==1)]=0 # items already purchased do not need an expected probability, so set them to 0
sort(pred[1,],decreasing=T)[1:10] # top 10 recommended items for the first user and their expected probabilities
#Online shopping mall data
#load the shop data
#redo the purchase-data recommendation with it (a minimal sketch follows below)
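Because the shop data also comes with the course materials, here is a minimal sketch of the same association-rule workflow on the Groceries transactions bundled with arules; the support and confidence thresholds are illustrative assumptions:
data("Groceries") # built-in transaction data from the arules package
rules.g=apriori(Groceries, parameter=list(supp=0.01,conf=0.5)) # mine association rules
rules.g.sorted=sort(rules.g, by="lift")
inspect(head(rules.g.sorted, 10)) # top 10 rules by lift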
Related Materials
-
See the lecture slides and training materials.
References
[Papers]
- None
[Reports]
- None
[URL]
- None
Contact
[Meteorology / Programming Languages]
- sangho.lee.1990@gmail.com
[Oceanography / Astronomy / Big Data]
- saimang0804@gmail.com