[R] R를 이용한 고급 기상 통계학 실습 : R 기초, 자료의 특성, 자료의 상관성, 통계적 유의수준, 조화 분석, 주성분 분석, 기계 학습

정보
내용
주제
참고 문헌
문의사항

정보

업무명 : R를 이용한 고급 기상 통계학 실습 : R 기초, 자료의 특성, 자료의 상관성, 통계적 유의수준, 조화 분석, 주성분 분석, 기계 학습
작성자 : 이상호
작성일 : 2020-03-16
설 명 :
수정이력 :

내용

[개요]

안녕하세요? 기상 연구 및 웹 개발을 담당하고 있는 해솔입니다.
대학원 석사 1학기에 배운 고급 기상 통계학에 대한 실습 내용을 다루고자 합니다.
그에 따른 주제는 "R 기초, 자료의 특성, 자료의 상관성, 통계적 유의수준, 조화 분석, 주성분 분석, 기계 학습" 순으로 소개해 드리고자 합니다.
추가로 각 주제에 대한 이론적 배경을 소개한 링크를 보내드립니다.

[강릉원주대 대기환경과학과] 2015년 2학기 전필 고급기상통계학 소개 및 과제물

정보 업무명 : 고급기상통계학 작성자 : 이상호 작성일 : 2019-12-20 설 명 : 수정이력 : 내용 [특징] 고급기상통계학 수업에 대한 이해를 돕기위해 작성 [기능] 소개 주제별 과제물 자료의 특성 자료의 상관성..

shlee1990.tistory.com

[특징]

대기과학에 사용하는 기상 통계학을 이해하기 위해서 실습이 요구되며 이 프로그램은 이러한 목적을 달성하기 위한 소프트웨어

[기능]

R 기초
자료의 특성
자료의 상관성
통계적 유의수준
조화 분석
주성분 분석
인공 신경망 및 딥러닝

[활용 자료]

소스 코드 및 활용 자료 첨부

202. [Advanced Meteorological Statistics] R-project Practice_Ver 2.0 (MS. Sang-Ho Lee)_20171218.7z

drive.google.com

[자료 처리 방안 및 활용 분석 기법]

없음

[사용법]

라이브러리 읽기 : source("Library.R")
소스 코드 참조

[사용 OS]

Windows 10

[사용 언어]

R v3.6.3
R Studio v1.2.5033

주제

[R 기초]

1993년 뉴질랜드 오클랜드 대학의 통계학과 교수 2명 (Ross Ihaka, Robert Gentleman) 개발
특징
- 빠른 처리 속도 (H/W 메모리 크기에 영향 받음)
- 데이터, 함수, 차트 등 모든 것이 object로 관리
- 최신의 알고리즘 및 방법론이 패키지 (Package)로 제공됨 (Neural Networks, Deeplearning et al)
- 분석에 통찰을 부여할 수 있는 그래픽에 대한 강력한 지원

Windows 설치
- http://www.r-project.org/ → Download(CRAN)
- http://www.rstudio.com/ → Desktop → Rstudio

Linux 설치
- http://m.blog.daum.net/empirical/17382805

Rstudio 4가지 패널 창 기능
- 코딩 영역, 콘솔(실행결과), 작업내역/ 환경, 탐색시/그래프/도움말/패키지 구성

version와 패키지 및 session 보기
- sessionInfo() # 64 bit 권장 !
- ?sessionInfo

> sessionInfo()  # 64 bit 권장 ! 

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949   
[3] LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C                
[5] LC_TIME=Korean_Korea.949    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.2.2     usethis_1.5.1      reshape2_1.4.3     h2o_3.28.0.4      
 [5] neuralnet_1.44.2   MASS_7.3-51.5      mapdata_2.3.0      maps_3.3.0        
 [9] rgeos_0.5-2        colorRamps_2.3     leaflet_2.0.3      RNetCDF_2.1-1     
[13] maptools_0.9-9     sp_1.4-1           gstat_2.0-4        zoo_1.8-7         
[17] RColorBrewer_1.1-2 moments_0.14       MAT_2.2            ggcorrplot_0.1.3  
[21] gridExtra_2.3      factoextra_1.0.6   forcats_0.4.0      stringr_1.4.0     
[25] dplyr_0.8.3        purrr_0.3.3        readr_1.3.1        tidyr_1.0.2       
[29] tibble_2.1.3       tidyverse_1.3.0    GGally_1.4.0       ggplot2_3.2.1     
[33] akima_0.6-2       

loaded via a namespace (and not attached):
 [1] nlme_3.1-144      bitops_1.0-6      fs_1.3.1          xts_0.12-0       
 [5] lubridate_1.7.4   httr_1.4.1        rprojroot_1.3-2   tools_3.6.3      
 [9] backports_1.1.5   R6_2.4.1          DBI_1.1.0         lazyeval_0.2.2   
[13] colorspace_1.4-1  withr_2.1.2       prettyunits_1.1.1 tidyselect_0.2.5 
[17] processx_3.4.1    compiler_3.6.3    cli_2.0.2         rvest_0.3.5      
[21] xml2_1.2.2        desc_1.2.0        scales_1.1.0      callr_3.4.2      
[25] digest_0.6.24     foreign_0.8-75    pkgconfig_2.0.3   htmltools_0.4.0  
[29] sessioninfo_1.1.1 dbplyr_1.4.2      htmlwidgets_1.5.1 rlang_0.4.4      
[33] readxl_1.3.1      rstudioapi_0.11   FNN_1.1.3         generics_0.0.2   
[37] jsonlite_1.6.1    crosstalk_1.1.0.1 RCurl_1.98-1.1    magrittr_1.5     
[41] Rcpp_1.0.3        munsell_0.5.0     fansi_0.4.1       lifecycle_0.2.0  
[45] stringi_1.4.6     pkgbuild_1.0.6    plyr_1.8.5        grid_3.6.3       
[49] ggrepel_0.8.1     crayon_1.3.4      lattice_0.20-38   haven_2.2.0      
[53] hms_0.5.3         ps_1.3.0          zeallot_0.1.0     pillar_1.4.3     
[57] spacetime_1.2-3   pkgload_1.0.2     reprex_0.3.0      glue_1.3.1       
[61] remotes_2.1.1     modelr_0.1.6      vctrs_0.2.1       testthat_2.3.1   
[65] cellranger_1.1.0  gtable_0.3.0      reshape_0.8.8     assertthat_0.2.1 
[69] broom_0.5.5       intervals_0.15.1  memoise_1.1.0     ellipsis_0.3.0

단축키
- <Ctrl + Enter> or <Ctrl + r> : script 실행
- <Ctrl + s> : 저장

변수명 작성 규칙
- 첫자는 영문자, 두번째는 숫자,특수문자(_ .)
- 대소문자 구분(num, NUM)
- 명령어, 함수명 사용 불가
- 가장 최근값 저장
- R의 모든 변수 = 객체

패키지
- install.packages("moments") # 패키지 설치
- library(moments) # 패키지 사용
- source("Library.R") # 패키지 사용 및 설치

#=================================================================================================
# Routine  : Main R-project program
#
# Purpose  : Package install, load, and function 
#            source("Library.R")
#
# Author   : MS. Sang-Ho Lee
#
# Revisions: V1.0 August   21, 2017 First release (MS. Sang-Ho Lee)
#            V2.0 Decebmer 01, 2017 Change to ifelse when using package (MS. Sang-Ho Lee)
#=================================================================================================

      # install.packages("moments")
      # install.packages("RColorBrewer")
      # install.packages("dplyr")
      # install.packages("zoo")
      # install.packages("ggplot2")
      # install.packages("gstat")
      # install.packages("sp")
      # install.packages("maptools")
      # install.packages("RNetCDF")
      # install.packages("leaflet")
      # install.packages("colorRamps")
      # install.packages("gpclib")
      # install.packages("rgeos")
      # install.packages("mapdata")
      # install.packages("MASS")
      # install.packages("neuralnet")
      # install.packages("h2o")
      # install.packages("reshape2")
      # install.packages("devtools")
      # install.packages("sinkr")
      # install.packages("MASS")
      # install.packages("MAT")
      # install.packages("ggcorrplot")
      # install.packages("gridExtra")
      # install.packages("factoextra")
      # install.packages("tidyverse")
      # install.packages("GGally")


      library(akima)
      library(GGally)
      library(tidyverse)
      library(factoextra)
      library(gridExtra)
      library(ggcorrplot)
      library(MAT)
      library(moments)
      library(RColorBrewer)
      library(dplyr)
      library(zoo)
      library(ggplot2)
      library(gstat)
      library(sp)
      library(maptools)
      library(RNetCDF)
      library(leaflet)
      library(colorRamps)
      library(gpclib)
      library(rgeos)
      library(mapdata)
      library(MASS)
      library(neuralnet)
      library(h2o)
      library(reshape2)
      library(devtools)
      library(sinkr)
      library(MASS)


ggplot_windrose = function(data, spd, dir, spdres = 2, dirres = 30, spdmin = 2, spdmax = 20, spdseq = NULL,
                          palette = "YlGnBu", countmax = NA, debug = 0){
   
   # Look to see what data was passed in to the function
   if (is.numeric(spd) & is.numeric(dir)){
      # assume that we've been given vectors of the speed and direction vectors
      data <- data.frame(spd = spd, dir = dir)
      spd = "spd"
      dir = "dir"
   } else if (exists("data")){
      # Assume that we've been given a data frame, and the name of the speed 
      # and direction columns. This is the format we want for later use.    
   }  
   
   # Tidy up input data ----
   n.in <- NROW(data)
   dnu <- (is.na(data[[spd]]) | is.na(data[[dir]]))
   data[[spd]][dnu] <- NA
   data[[dir]][dnu] <- NA
   
   # figure out the wind speed bins ----
   if (missing(spdseq)){
      spdseq <- seq(spdmin, spdmax, spdres)
   } else {
      if (debug >0){
         cat("Using custom speed bins \n")
      }
   }
   # get some information about the number of bins, etc.
   n.spd.seq <- length(spdseq)
   n.colors.in.range <- n.spd.seq - 1
   
   # create the color map
   spd.colors <- colorRampPalette(brewer.pal(min(max(3,n.colors.in.range), min(9,n.colors.in.range)), palette))(n.colors.in.range)
   
   if (max(data[[spd]],na.rm = TRUE) > spdmax){    
      spd.breaks <- c(spdseq, max(data[[spd]],na.rm = TRUE))
      spd.labels <- c(paste(c(spdseq[1:n.spd.seq-1]), '-', c(spdseq[2:n.spd.seq])),
                      paste(spdmax, "-", max(data[[spd]],na.rm = TRUE)))
      spd.colors <- c(spd.colors, "grey50")
   } else{
      spd.breaks <- spdseq
      spd.labels <- paste(c(spdseq[1:n.spd.seq-1]), '-', c(spdseq[2:n.spd.seq]))    
   }
   data$spd.binned <- cut(x = data[[spd]], breaks = spd.breaks, labels = spd.labels, ordered_result = TRUE)
   # clean up the data
   data. <- na.omit(data)
   
   # figure out the wind direction bins
   dir.breaks <- c(-dirres/2, seq(dirres/2, 360-dirres/2, by = dirres), 360+dirres/2)  
   dir.labels <- c(paste(360-dirres/2,"-",dirres/2), paste(seq(dirres/2, 360-3*dirres/2, by = dirres),
                         "-", seq(3*dirres/2, 360-dirres/2, by = dirres)), paste(360-dirres/2,"-",dirres/2))
   # assign each wind direction to a bin
   dir.binned <- cut(data[[dir]], breaks = dir.breaks, ordered_result = TRUE)
   levels(dir.binned) <- dir.labels
   data$dir.binned <- dir.binned
   
   # Run debug if required ----
   if (debug>0){    
      cat(dir.breaks,"\n")
      cat(dir.labels,"\n")
      cat(levels(dir.binned),"\n")       
   }  
   
   # deal with change in ordering introduced somewhere around version 2.2
   if(packageVersion("ggplot2") > "2.2"){    
      # cat("Hadley broke my code\n")
      data$spd.binned = with(data, factor(spd.binned, levels = rev(levels(spd.binned))))
      spd.colors = rev(spd.colors)
   }
   
   # create the plot ----
   p.windrose <- ggplot(data = data, aes(x = dir.binned, fill = spd.binned)) +
      geom_bar() + 
      scale_x_discrete(drop = FALSE, labels = waiver()) +
      coord_polar(start = -((dirres/2)/360) * 2*pi) +
      scale_fill_manual(name = "Wind  Speed  [m/s]", values = spd.colors, drop = FALSE) +
      #theme_bw() +
      theme(axis.title.x = element_blank(),
            #panel.border = element_rect(colour = "blank"),
            panel.grid.major = element_line(colour="grey65")) +
      theme_bw() +
      theme(plot.title=element_text(face="bold", size=20)) +
      theme(axis.title.x = element_text(face="bold", size=15)) +
      theme(axis.title.y = element_text(face="bold", size=15, angle=90)) +
      theme(axis.text.x  = element_text(face="bold", size=15)) +
      theme(axis.text.y  = element_text(face="bold", size=15)) +
      theme(legend.text=element_text(size=15)) +
      theme(legend.title=element_text(size=15)) +
      labs(x = "") +
      labs(y = "Frequency  of  Counts  by  Wind  Direction") 
   
   # adjust axes if required
   if (!is.na(countmax)){
      p.windrose <- p.windrose +
         ylim(c(0,countmax))
   }
   
   # print the plot
   # print(p.windrose)  
   
   # return the handle to the wind rose
   return(p.windrose)
}

[자료의 특성]

기본 통계량 계산
- 자료의 개수 : length
- 평균 : mean
- 중간값 : median
- 최빈값 : which.max
- 분산 : var
- 표준편차 : sd
- 최대값 : max
- 최소값 : min
- 왜도 : skewness
- 첨도 : kurtosis
- 요약 통계량 : summary
- cat를 통해 자료 출력
- cat에서 file 옵션을 통해 자료 저장

X = c(1, 2, 3, 4, 5)
Y = c(3, 6, 1, 2, 3)

length(Y)    # 자료의 개수
mean(Y)      # 평균 = sum(Y)/length(Y)

median(Y)    # 중간값
names(which.max(table(Y)))  #  최빈값
var(Y)       # 분산 = sum((Y-mean(Y))^2)/(length(Y)-1)
sd(Y)        # 표준편차 = sqrt(sum((Y-mean(Y))^2)/(length(Y)-1))
max(Y)       # 최대값
min(Y)       # 최소값 
skewness(Y)  # 왜도 = sum((Y-mean(Y))^3)/(length(Y)*(sqrt(sum((Y-mean(Y))^2)/length(Y))^3))
kurtosis(Y)  # 첨도 = sum((Y-mean(Y))^4)/(length(Y)*(sqrt(sum((Y-mean(Y))^2)/length(Y))^4))

summary(Y)   # 요약 통계량 = Min./1st Qu./Median/Mean/3rd Qu./Max.

Number = length(Y)
Mean = mean(Y)
Median = median(Y)
Mode = names(which.max(table(Y)))
Var = var(Y)
Sd = sd(Y)
Max = max(Y)
Min = min(Y)
Skew = skewness(Y)
Kurt = kurtosis(Y)

## 자료 출력
cat(Number, Mean, Median, Mode, Var, Sd, Max, Min, Skew, Kurt, "\n")

## 자료 저장
cat(Number, Mean, Median, Mode, Var, Sd, Max, Min, Skew, Kurt, "\n", file='1.dat')

[자료의 상관성]

선형 회귀 모형 : lm
상관 계수 : cor.test에서 estimate 또는 cor
결정 계수 : cor.test에서 estimate의 거듭제곱
유의성 검정 (P-Value) : cor.test에서 p.value

X = c(1, 2, 3, 4, 5)
Y = c(3, 6, 4, 3, 2)

Number = length(Y)

## 선형 회귀모형
lm.fit = lm(Y ~ X)  #  Y : 종속변수, X : 독립변수
summary(lm.fit) 

## (Yfit) = a(X) + b -> coef.
a = coef(lm.fit)[2] 
b = coef(lm.fit)[1]
# a = ( sum(X*Y)-(sum(X)*sum(Y)/length(X)) ) / ( sum(X^2)-((sum(X)^2)/length(X)) )
# b = mean(Y) - (a*mean(X))

Yfit = (a*X) + b

Cor = cor.test(X, Y)$estimate  # 상관계수 (correlation coefficient)
Cor = cor(X, Y)
## Cor = ( sum(X*Y)-(sum(X)*sum(Y)/length(X)) ) / ( sqrt(sum(X^2)-((sum(X)^2)/length(X)))*sqrt(sum(Y^2)-((sum(Y)^2)/length(Y))) )

Rsq = (cor.test(X, Y)$estimate)^2  # 결정계수 (R-squared)
## Rsq = var(Yfit)/var(Y)

Pvalue = cor.test(X, Y)$p.value  # 유의성 검정(p-value)

cat(a, b, Cor, Rsq, Pvalue, "\n", file='2.dat')

## FIG
dev.off()      # Initialization
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(X, Y, pch=21, bg="white", col='black', main="Relationship between X and Y", 
     xlim=c(0, 10), ylim=c(0, 10), xlab="X lab", ylab="Y lab", xaxs="i", yaxs="i", cex=1.5)
lines(X, Yfit, col="red")  
abline(a=0, b=1, lty=2, col="black")
text(1, 9, paste0("Y = ", round(a,2), " X + ", round(b,2)), adj=0, col="red")
text(1, 8, paste0("R = ", round(cor(X, Yfit), 2), " (p-value < ", round(Pvalue, 3), ")"), adj=0, col="red")
text(1, 7, paste0("N = ", round(Number,2)), adj=0)


## 비선형 회귀모형
X2 = X^2
lm.fit = lm(Y ~ X + X2)
summary(lm.fit)

# (Yfit) = a(X2) + b(X) + c -> coeff.
a = coef(lm.fit)[3] 
b = coef(lm.fit)[2]
c = coef(lm.fit)[1]
Yfit = (a*X2) + (b*X) + c
Cor = cor(Yfit, Y)
Pvalue = cor.test(Yfit, Y)$p.value  # 유의성 검정(p-value)

# FIG
dev.off()      # Initialization
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(X, Y, pch=21, bg="white", col='black', main="Relationship between X and Y", 
     xlim=c(0, 10), ylim=c(0, 10), xlab="X lab", ylab="Y lab", xaxs="i", yaxs="i", cex=1.5)
lines(X, Yfit, col="red")  
abline(a=0, b=1, lty=2, col="black")
text(1, 9, paste0("Y = ", round(a,2), " X2 + ", round(b,2), " X +", round(c, 2)), adj=0, col="red")
text(1, 8, paste0("R = ", round(Cor, 2), " (p-value < ", round(Pvalue, 3), ")"), adj=0, col="red")
text(1, 7, paste0("N = ", round(Number,2)), adj=0)

선형 회귀 분석에 대한 산점도

비선형 회귀 분석에 대한 산점도

[통계적 유의수준]

모집단 및 표본집단의 평균/분산/표준편차 비교
- 중심극한정리 : 동일한 확률분포를 가진 독립확률 변수 n개의 평균값은 n이 클수록 정규분포에 가까워짐
- 표본집단을 통하여 모집단의 정보를 알고싶음
- 즉 표본평균집단의 평균/분산/표준편차를 이용하여 모집단의 평균/분산/표준편차를 유추할 수 있음
  - 모집단의 평균 = 표본평균집단의 평균
  - 모집단의 분산/자료수 = 표본평균집단의 분산
  - 모집단의 표준편차/루트(자료수) = 표본평균집단의 표준편차
  - 표본 개수가 많을수록 모집단을 더 잘 유추할 수 있음 > 정규분포에 가까워짐

######################################## 모집단 및 표본집단의 평균/분산/표준편차 비교 ##############################
##  - 중심극한정리 : 동일한 확률분포를 가진 독립확률 변수 n개의 평균값은 n이 클수록 정규분포에 가까워짐
##  - 표본집단을 통하여 모집단의 정보를 알고싶음
##  - 즉 표본평균집단의 평균/분산/표준편차를 이용하여 모집단의 평균/분산/표준편차를 유추할 수 있음
##    -- 모집단의 평균                  = 표본평균집단의 평균
##    -- 모집단의 분산/자료수           = 표본평균집단의 분산                  
##    -- 모집단의 표준편차/루트(자료수) = 표본평균집단의 표준편차
##  -> 표본 개수가 많을수록 모집단을 더 잘 유추할 수 있음  -> 정규분포에 가까워짐
####################################################################################################################

set.seed(1)                   # 난수로 생성된 수열을 고정시킴
X = runif(10000, min=0, max=1)  # 난수 생성 (모집단 생성)

hist(X)  # 모집단의 빈도분포 
cat(mean(X), var(X), sd(X), "\n")  # 모집단의 평균/분산/표준편차


################################################### 표본평균집단에 대해서 ###########################################################
for (i in c(10, 50, 100, 250)) {

   DO = 100000    # Number of repetition 
   N = i          # Number of sample
   # N = 30       
   
   ## 비복원 추출(무작위 정렬) : 한번 뽑은 것을 다시 뽑을 수 없는 추출
   Sort = lapply(1:DO, function(i) sample(X, N, replace=F))  
   Sort_mean <- mapply(mean, Sort)

   ## FIG
   # nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0)
   hist(Sort_sd, breaks=50, xlab="Sample Distribution Mean", xlim=c(0.20, 0.35), col = "light grey", border = "grey", xaxs="i", yaxs="i",
        main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", Do)))

   XX1 = mean(Sort_mean)+sd(Sort_mean)
   XX2 = mean(Sort_mean)-sd(Sort_mean)
   XX3 = mean(Sort_mean)+2*sd(Sort_mean)
   XX4 = mean(Sort_mean)-2*sd(Sort_mean)
   XX5 = mean(Sort_mean)+3*sd(Sort_mean)
   XX6 = mean(Sort_mean)-3*sd(Sort_mean)
   YY = max(hist(Sort_mean, breaks=50, plot=F)$counts)
   lines( c(mean(Sort_mean), mean(Sort_mean)), c(0, YY), lty=1, col=4) ; text(mean(Sort_mean), YY/2, "Mean")
   lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
   lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
   lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
   lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
   lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
   lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
   lines( c(max(Sort_mean), max(Sort_mean)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_mean), YY/2, "Max")
   lines( c(min(Sort_mean), min(Sort_mean)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_mean), YY/2, "Min")
   
   
   ## 자료수, 표본평균집단의 평균, 표본평균집단의 분산, 표본평균집단의 표준편차
   ## (표본평균집단의 평균-모집단의 평균), (표본평균집단의 분산 - 모집단의 분산/자료수), (표본평균집단의 표준편차 - 모집단의 표준편차/루트(자료수))
   # cat(N, mean(Sort_mean), var(Sort_mean), sd(Sort_mean),
       # mean(Sort_mean)-mean(X), var(Sort_mean)-(var(X)/N), sd(Sort_mean)-(sd(X)/sqrt(N)), "\n")
   
   # 모집단의 평균, 분산/자료수, 표준편차/루트(자료수)
   # cat(mean(X), var(X)/length(X), sd(X)/sqrt(length(X)), "\n")
}


################################################### 표본표준편차집단에 대해서 ###########################################################

X = runif(10000, min=0, max=1)  # 난수 생성 (모집단 생성)
hist(X)  # 모집단의 빈도분포 
cat(mean(X), var(X), sd(X), "\n")  # 모집단의 평균/분산/표준편차

for (i in c(5, 10, 25, 50, 100, 250)) {
   
   DO = 100000    # Number of repetition 
   N = i          # Number of sample
   # N = 30   
   
   Sort = lapply(1:DO, function(i) sample(X, N, replace=F))  # 비복원 추출(무작위 정렬) : 한번 뽑은 것을 다시 뽑을 수 없는 추출
   Sort_mean <- mapply(mean, Sort)
   Sort_sd <- mapply(sd, Sort)
   
   ## FIG
   hist(Sort_sd, breaks=50, xlab="Sample Distribution Stdev", xlim=c(0.2, 0.4), col = "light grey", border = "grey", xaxs="i", yaxs="i",
   main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO)))
   
   XX1 = mean(Sort_sd)+sd(Sort_sd)
   XX2 = mean(Sort_sd)-sd(Sort_sd)
   XX3 = mean(Sort_sd)+2*sd(Sort_sd)
   XX4 = mean(Sort_sd)-2*sd(Sort_sd)
   XX5 = mean(Sort_sd)+3*sd(Sort_sd)
   XX6 = mean(Sort_sd)-3*sd(Sort_sd)
   YY = max(hist(Sort_sd, breaks=50, plot=F)$counts)
   lines( c(mean(Sort_sd), mean(Sort_sd)), c(0, YY), lty=1, col=4) ; text(mean(Sort_mean), YY/2, "Mean")
   lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
   lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
   lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
   lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
   lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
   lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
   lines( c(max(Sort_sd), max(Sort_sd)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_mean), YY/2, "Max")
   lines( c(min(Sort_sd), min(Sort_sd)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_mean), YY/2, "Min")
   

   ## 샘플개수, 표본평균의 표준편차, 표본표준편차의 평균/sqrt(N),  (표본평균의 표준편차-표본표준편차의 평균/sqrt(N)
   cat(N, sd(Sort_mean), mean(Sort_sd)/sqrt(N), sd(Sort_mean)-(mean(Sort_sd)/sqrt(N)), "\n")
   
   ## 중심극한정리는 샘플 개수에 작으면 약해짐.  즉 샘플 개수가 작을수록 표준편차가 증가함.
}




################################################### 상관계수에 대한 유의성 검정 ###########################################################      
data = read.csv("INPUT/KMA/1904-2016_Yearly_Mean_Groud_Obs.KMA", sep="", header=TRUE)   
head(data)  # 파일의 첫 행
tail(data)  # 파일을 마지막 행

## 시간: 1904-2016년 연평균 자료 (지상관측)

## 변수: 지점,  년,  기온,  기압,  상대습도,  풍속
##        id   year  temp   pre       rh       ws

## 지점: 강릉,  광주,  대구,  대전,  부산,  서울,  전주,  제주,  청주,  춘천,  원주,  대관령
##        105    156    143    133    159    108    146    184    131    101    114    100

## 출처: 기상청  국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do)

## 지점마다 관측 기간이 다름
   ### 대관령: 1972-2016
   ### 춘천  : 1966-2016
   ### 강릉  : 1912-2016
   ### 서울  : 1908-2016
   ### 원주  : 1972-2016          
   ### 청주  : 1979-2016
   ### 대구  : 1909-2016
   ### 전주  : 1919-2016
   ### 광주  : 1940-2016
   ### 제주  : 1924-2016

   ## 1979-2016년에 대해 분석을 수행함.


####  강릉 지역에 대한 상관계수 및 유의성 검정
data_L1 = data %>% 
   filter( id == 105 ) %>%
   filter( year >= 1979 )


#### FIG
X = data_L1$year
Y = data_L1$temp

dev.off()      # Initialization
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(X, Y, col='black', main="Time  series  of  yearly  averaged  temperature  for  the  year  of  1979-2016", 
     xlab="Time  [Year]", ylab="Temperature  [℃]",  xlim=c(1979, 2016), type="l", ylim=c(10, 15), 
     xaxs="i", yaxs="i", cex=1.5)
lm.fit =lm(Y ~ X)
abline(lm.fit, col="red")
text(1980, 14.5, paste0("(Temperature) = ", round(coef(lm.fit)[2],2), " × (Year) ", round(coef(lm.fit)[1],2)), adj=0, col="red", cex=1.2)
text(1980, 14, paste0("R = ", round(cor(X, Y), 2)), adj=0, col="red", cex=1.2)
text(1980, 13.5, paste0("N = ", round(length(X),2)), adj=0, cex=1.2)



################################################### 강릉 지역에 대한 유의성 검정 ###########################################################
DO = 100000     # Number of repetition
N  = length(X)  # Number of sample

Sort_X = lapply(1:DO, function(i) X[1:N] )
Sort_Y = lapply(1:DO, function(i) sample(Y, N, replace=F))  # 비복원추출(무작위 정렬)
Sort_R = mapply(cor, Sort_X, Sort_Y)

#### FIG
# nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0) 
hist(Sort_R, breaks=50, xlab="Sample Distribution Coefficient Correlation", xlim=c(-1, 1), col = "light grey",  xaxs="i", yaxs="i",
     border = "grey", main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO))) 

XX1 = mean(Sort_R)+sd(Sort_R)
XX2 = mean(Sort_R)-sd(Sort_R)
XX3 = mean(Sort_R)+2*sd(Sort_R)
XX4 = mean(Sort_R)-2*sd(Sort_R)
XX5 = mean(Sort_R)+3*sd(Sort_R)
XX6 = mean(Sort_R)-3*sd(Sort_R)
YY = max(hist(Sort_R, breaks=50, plot=F)$counts)
lines( c(mean(Sort_R), mean(Sort_R)), c(0, YY), lty=1, col=4) ; text(mean(Sort_R), YY/2, "Mean")
lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
lines( c(max(Sort_R), max(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_R), YY/2, "Max")
lines( c(min(Sort_R), min(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_R), YY/2, "Min")
lines( c(cor(X, Y), cor(X, Y)), c(0, YY), lty=1, col=5, lw=2 ) ; text(cor(X, Y), YY/2, "R")


## 상관계수에 대한 유의성 검정  ->  유의확률 (significance probability) = P값 (p-value)*100 %
##                                   신뢰도 (1 - P값)

Pval = sum (abs(Sort_R) >= abs(cor(X, Y))) /length(Sort_R)    # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)
Pval

## 상관계수는 0.47로서 통계적으로 0.3 % 수준으로 유의하다.
##                                99.7 % 수준으로 신뢰할 만하다.
##                                1000번 중에 약 3번은 우연의 일치이다.
##                                1000번 중에 약 997번은 우연의 일치가 아니다.

cor.test(X, Y)              
   # n <- length(X)                                 # How is the p-value calculated for pearson correlation ?
   # r <- cor(X, Y)
   # df <- n - 2
   # t = sqrt(df) * r/sqrt(1 - r^2)
   # pval = 2 * min(pt(t, df), pt(t, df, lower.tail = FALSE))


################################################### 10 지점에 대한 상관계수 및 유의성 검정 ###########################################################
for ( i in c(105, 156, 143, 133, 159, 108, 146, 184, 131, 101, 114, 100) ) {
   
   data_L1 = data %>% 
      filter( id == i ) %>%
      # filter( year >= 1979 )
      filter( year >= 2000 )
   
   X = data_L1$year
   Y = data_L1$temp
   
   
   DO = 100000    # Number of repetition
   N  = length(X)  # Number of sample
   
   Sort_X = lapply(1:DO, function(i) X[1:N] )
   Sort_Y = lapply(1:DO, function(i) sample(Y, N, replace=F))  # 비복원추출(무작위 정렬)
   Sort_R = mapply(cor, Sort_X, Sort_Y)
   
   Pval = sum (abs(Sort_R) >= abs(cor(X, Y))) /length(Sort_R)    # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)

   Pval2 = cor.test(X, Y) 
   
   ## 지점, 상관계수, 몬테카를로 시뮬레이션에 의한 유의성, 근사식에 의한 유의성
   cat(i, cor(X, Y), Pval, Pval2$p.value, "\n")
}



################################################### 이동평균를 이용하여 상관계수 및 유의성 검정 ###########################################################

## 이동평균 예시
rollmean(c(1:6), 5)

data_L1 = data %>% 
   filter( id == 105 ) %>%
   filter( year >= 1979 ) 

   
X = data_L1$year
Y = data_L1$temp

## 5개의 이동평균      
X2 = rollmean(data_L1$year, 5)
Y2 = rollmean(data_L1$temp, 5)


#### FIG
dev.off()      # Initialization
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(X2, Y2, col='black', main="Time  series  of  yearly  averaged  temperature  for  the  year  of  1979-2016", 
     xlab="Time  [Year]", ylab="Temperature  [℃]",  xlim=c(1979, 2016), type="l", ylim=c(10, 15), 
     xaxs="i", yaxs="i", cex=1.5)
lm.fit =lm(Y2 ~ X2)
abline(lm.fit, col="red")
text(1980, 14.5, paste0("(Temperature) = ", round(coef(lm.fit)[2],2), " × (Year) ", round(coef(lm.fit)[1],2)), adj=0, col="red", cex=1.2)
text(1980, 14, paste0("R = ", round(cor(X2, Y2), 2)), adj=0, col="red", cex=1.2)
text(1980, 13.5, paste0("N = ", round(length(X2),2)), adj=0, cex=1.2)


#### FIG
DO = 10000         # Number of repetition
N  = length(X)     # Number of sample
# N2 = length(X2)  # Number of sample

Sort_X = lapply(1:DO, function(i) rollmean(X[1:N], 5) )  #  Sort_X = lapply(1:DO, function(i) X2[1:N2])
Sort_Y = lapply(1:DO, function(i) rollmean(sample(Y, N, replace=F), 5))  # 비복원추출(무작위 정렬)
Sort_R = mapply(cor, Sort_X, Sort_Y)

# nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0) 
hist(Sort_R, breaks=50, xlab="Sample Distribution Coefficient Correlation", xlim=c(-1, 1), col = "light grey",  xaxs="i", yaxs="i",
     border = "grey", main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO))) 

XX1 = mean(Sort_R)+sd(Sort_R)
XX2 = mean(Sort_R)-sd(Sort_R)
XX3 = mean(Sort_R)+2*sd(Sort_R)
XX4 = mean(Sort_R)-2*sd(Sort_R)
XX5 = mean(Sort_R)+3*sd(Sort_R)
XX6 = mean(Sort_R)-3*sd(Sort_R)
YY = max(hist(Sort_R, breaks=50, plot=F)$counts)
lines( c(mean(Sort_R), mean(Sort_R)), c(0, YY), lty=1, col=4) ; text(mean(Sort_R), YY/2, "Mean")
lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
lines( c(max(Sort_R), max(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_R), YY/2, "Max")
lines( c(min(Sort_R), min(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_R), YY/2, "Min")
lines( c(cor(X2, Y2), cor(X2, Y2)), c(0, YY), lty=1, col=5, lw=2 ) ; text(cor(X, Y), YY/2, "R")


## 상관계수에 대한 유의성 검정  -->  유의확률 (significance probability) = P값 (p-value)*100 %
##                                   신뢰도 (1 - P값)

Pval = sum (abs(Sort_R) >= abs(cor(X2, Y2))) /length(Sort_R)    # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)
Pval

## 상관계수는 0.78로서 통계적으로 3.3 % 수준으로 유의하다.
##                                96.7 % 수준으로 신뢰할 만하다.
##                                1000번 중에 약 33번은 우연의 일치이다.
##                                1000번 중에 약 967번은 우연의 일치가 아니다.



################################################### 10 지점에서 이동평균을 이용하여 상관계수 및 유의성 검정 ###########################################################

for ( i in c(105, 156, 143, 133, 159, 108, 146, 184, 131, 101, 114, 100) ) {
   
   data_L1 = data %>% 
      filter( id == i ) %>%
      filter( year >= 1979 )
      # filter( year >= 2000 )
   
   X = data_L1$year
   Y = data_L1$temp
   
   X2 = rollmean(data_L1$year, 5)
   Y2 = rollmean(data_L1$temp, 5)
   
   DO = 100000      # Number of repetition
   N  = length(X)   # Number of sample
   # N2 = length(X2)  # Number of sample
   
   Sort_X = lapply(1:DO, function(i) rollmean(X[1:N], 5) )  #  Sort_X = lapply(1:DO, function(i) X2[1:N2])
   Sort_Y = lapply(1:DO, function(i) rollmean(sample(Y, N, replace=F), 5))  # 비복원추출(무작위 정렬)
   Sort_R = mapply(cor, Sort_X, Sort_Y)
   
   Pval = sum (abs(Sort_R) >= abs(cor(X2, Y2))) /length(Sort_R)    
   # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)
   
   ## 지점, 상관계수, 몬테카를로 시뮬레이션에 의한 유의성
   cat(i, cor(X, Y), Pval, "\n")
}

표본 평균 집단

몬테카를로 시뮬레이션 (Monte Carlo Simulation)을 통한 표본 표준편차 집단
- 샘플 개수, 표본 평균의 표준편차, 표본 표준편차의 평균/Sqrt(N), 표본 평균의 표준편차 - 표본 표준편차의 평균 / Sqrt(N)

상관계수에 대한 유의성 검정

1979-2016년 기간 동안 강릉 지역에 대한 상관계수

몬테카를로 시뮬레이션 (Monte Carlo Simulation)을 통한 유의성 검정
- 상관 계수 및 유의 확률 (P-value)는 각각 0.47 및 0.00321로서 통계적으로 0.3 % 수준으로 유의하다.
- 즉 99.7 % 수준으로 신뢰할 만하다. 또는 1000번 중에 약 3번은 우연의 일치이다. 또는 1000번 중에 997번은 우연의 일치가 아니다.

cor.test 근사식에 통한 유의성 검정

[조화 분석]

## Example Data Sets 
X = seq(1, 12)
Y = c(-0.10, 0.40, 5.70, 13.30, 17.60, 20.60, 25.30, 25.10, 20.20, 15.80, 7.80, -0.10)

nn = length(X)
nh = trunc((nn-1)/2+1)

AA = numeric(nn)
BB = numeric(nn)
AAcos = matrix(rep(NA, nn*nh), nrow=nn, ncol=nh)
BBsin = matrix(rep(NA, nn*nh), nrow=nn, ncol=nh)
YY = numeric(nn)
YY2 = numeric(nn)
SYk = numeric(nn)
SSYk = numeric(nn)

SYt = 0  ;  SSYt = 0  ;  SYh = 0  ;  SSYh = 0
SYa = 0  ;  SSYa = 0  ;  SYk = 0  ;    PA = 0

for (i in 1:nh) {
   for (j in 1:nn) { 
       AA[i] = AA[i] + (Y[j]*cos(2.0*pi*j*i/nn))
       BB[i] = BB[i] + (Y[j]*sin(2.0*pi*j*i/nn))
   }
   AA[i] = AA[i]*2/nn
   BB[i] = BB[i]*2/nn

   if(i==nh){
      if(nn%%2==0){
         AA[i] = AA[i]/2
         BB[i] = 0.0

      }else{
         AA[i]=0
         BB[i]=0
   }
   }
}


for (i in 1:nn) {
   for (j in 1:nh) {
      AAcos[i,j] = AA[j]*cos(2*pi*j*i/nn)
      BBsin[i,j] = BB[j]*sin(2*pi*j*i/nn)
      YY[i] = YY[i] + (AAcos[i,j] + BBsin[i,j])
      if (AA[j] > 0)  { TT = atan(BB[j]/AA[j]) }
      if (AA[j] < 0)  { TT = atan(BB[j]/AA[j]) + pi }
      if (AA[j] == 0) { TT = pi/2 }
      YY2[i] = YY2[i] + sqrt(AA[j]**2 + BB[j]**2)*cos(2*pi*i*j/nn-TT)
   }
   
   Mean = mean(Y)
   SYt = SYt + Y[i]
   SSYt = SSYt + Y[i]**2
   SYh = SYh + (YY[i] + Mean)
   SSYh = SSYh + (YY[i] - Mean)
   SYa = SYa + (YY2[i] + Mean)
   SSYa = SSYa + (YY2[i] - Mean)
   
   for (j in 1:nh) {
      SYk[j] = SYk[j] + AAcos[i,j]+BBsin[i,j]
      SSYk[j] = SSYk[j] + (AAcos[i,j]+BBsin[i,j])**2
   }
}

for (j in 1:nh) {
   PA[j] = atan(BB[j]/AA[j])*180/pi
   if (AA[j] < 0) { PA[j] = PA[j] + 180 }
   if (AA[j] == 0) { PA[j] = 90 }
   PA[j] = PA[j] - 360.0/nn*j
   if (PA[j] < 0) { PA[j] = PA[j] + 360 }
   if (AA[j] == 0  &  BB[j]==0) { PA[j] = -999.0 }
}


data = data.frame(Y, AAcos+BBsin, YY+Mean, YY2+Mean)
colnames(data) = c("Yt", "Y01", "Y02", "Y03", "Y04", "Y05", "Y06", "Yh", "Ya")
summary(data)

round( data, 2)                                                                # Time, Yt, Y01, Y02, Y03, Y04, Y05, Y06 | Yh Ya
round( apply(data, 2, mean), 2)                                                # Mean
round( apply(data, 2, var), 2)                                                 # Var = (CK*CK)/2
round( apply(data, 2, var)/var(data$Yt)*100, 2)                                # Contribution [%] = R-squared/100
round( sqrt(AA[1:nh]^2+BB[1:nh]^2), 2)                                         # Amplitude (CK) 
round( ((sqrt(AA[1:nh]^2+BB[1:nh]^2))^2*nn)/((nn-1)*var(data$Yt))/2, 2)        # R-squared
round( PA[1:nh], 2)                                                            # Phase Anglea
round( nn/(1:nh), 2)                                                           # Time of length
round( 1+nn/360*PA[1:nh], 2)                                                   # Time of 1-6st Harm.


data_L1     = round( data, 2)                                                                
Mean        = round( apply(data, 2, mean), 2)                                                
Var         = round( apply(data, 2, stats::var), 2)                                                 
Contri      = round( apply(data, 2, stats::var)/var(data$Yt)*100, 2)                                
CK          = round( sqrt(AA[1:nh]^2+BB[1:nh]^2), 2)                                     
Rsq         = round( ((sqrt(AA[1:nh]^2+BB[1:nh]^2))^2*nn)/((nn-1)*var(data$Yt))/2, 2)    
Phase_angle = round( PA[1:nh], 2)                                                        
Time_length = round( nn/(1:nh), 2)                                                       
Time_1st    = round( 1+nn/360*PA[1], 2)                                                  


# FIG.
cols = brewer.pal(6, 'Set2')   
legend_text = paste(paste(c("1st", "2st", "3st", "4st", "5st", "6st"), Contri[2:7], sep=" = "), "%")
dev.off()      # Initialization
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot (X, data_L1[,1], xaxs="i", yavs="i", xlim=c(1, 12), ylim=c(-10, 30),
     xlab="X lab", ylab="Y lab")
lines(X, data_L1[,2]+Mean[1], type='l', col=cols[1], lwd=2)
lines(X, data_L1[,3]+Mean[1], type='l', col=cols[2], lwd=2)
lines(X, data_L1[,4]+Mean[1], type='l', col=cols[3], lwd=2)
lines(X, data_L1[,5]+Mean[1], type='l', col=cols[4], lwd=2)
lines(X, data_L1[,6]+Mean[1], type='l', col=cols[5], lwd=2)
lines(X, data_L1[,7]+Mean[1], type='l', col=cols[6], lwd=2)
legend( "topleft",  legend=c("Yt"), pch=c(21), col=c('black'),
       cex=1.2,  pt.cex=1.3, text.font=1, bty="n", text.col=c('black'),  y.intersp = 0)
legend( "topright", legend=legend_text, col=cols, text.col=cols,
       cex=1.2,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T,  y.intersp = 0.3)

## Fig. 2
plot(Time_length, Contri[2:7], xlim=c(1, 12.5), ylim=c(-5, 100.5), type="b", col="blue", 
     xlab="Time  length", ylab="Contribution  [%]")

각 주기에 따른 시계열

주기에 따른 기여도

[주성분 분석]

통계학 및 기상학/해양학에서는 각각 주성분 분석 (Principal Component Analysis) 및 경험적 직교함수 (Empirical Orthogonal Function)으로 의미함
여러 개의 독립변수로 얻어진 다변량 데이터에 대해 공분산/상관계수 구조를 선형결합식(주성분)으로 설명하고자 함
즉 차원 축소 및 주성분을 통한 데이터의 해석 용이
[연습] 1987년 1월 이타카, 캐넌다이과 지역에 일평균 최저 기온
[실습] 기상청 최고 기온 자료를 이용한 영동 및 영서에 대한 주성분 분석

## Example Data Sets 

## 시간: 1987년 1월 일평균 자료 (지상관측)
## 변수: 최저 기온
## 지점: 이타가, 캐넌다이과 

Ithaca = c(19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0, 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23)
Canandaigua = c(28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23, 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23)
data = data.frame(Ithaca, Canandaigua)
plot(Ithaca, Canandaigua, xlim=c(-10, 35), ylim=c(-10, 35))

cov(data)

summary(data)


nrow = dim(data)[1]
ncol = dim(data)[2]
anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
for (i in 1:ncol) {
   anomaly[,i] = data[ ,i] - colMeans(data)[i]
}
colnames(anomaly) = c("anomaly_Ithaca", "anomaly_Canandaigua")

#  Mean : 13.0   Mean : 20.23 
# anomaly_Ithaca = data$Ithaca - mean(data$Ithaca)
# anomaly_Canandaigua = data$Canandaigua - mean(data$Canandaigua)

################################## 상관계수 (Correlation Coefficient), 공분산 (Covariance) ###########################################################

################################## 공분산 (Covariance) ###############################################################################################
EOF = prcomp(data, scale=FALSE, center=TRUE)        # 표준화 미고려, 아노말리 고려
# EOF = prcomp(data, scale=FALSE, center=FALSE)     # 표준화 미고려, 아노말리 미고려
Eigen_value = EOF$sdev^2  ;  Eigen_value            # 고유근
Eigen_vector = EOF$rotation  ;  Eigen_vector        # 고유벡터
EOF_data = data.frame(data, anomaly_Ithaca, anomaly_Canandaigua, EOF$x)  ;  EOF_data
   ## 이타카, 캐넌다이과의 최저기온과 아노말리 및 주성분 (PC1, PC2)
   ## PC1 = -0.8478591*anomaly(Ithaca) -0.5302216*anomaly(Canandaigua)
   ## PC2 = -0.5302216*anomaly(Ithaca) -0.8478591*anomaly(Canandaigua)


summary(EOF)  # 각  주성분의 설명력 및 표준편차

## FIG. 스크리 그림 (Scree plot): 주성분에 따른 분산/설명력
PC_var = (Eigen_value/sum(Eigen_value))*100.0
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(PC_var, xlim=c(1, 10), ylim=c(0, 100), type='b',
     xlab='Principal  Component', ylab='Proportion  of  Variance,  Cumulative  Proportion')
points(cumsum(Eigen_value)/sum(Eigen_value)*100., col='blue', type='l')


## FIG. 행렬도 : 변수와 주성분(PC)의 관계를 그래프로 표현.
#                화살표는 변수와 PC의 상관계수, PC와 평행수록 해당 PC에 큰 영향.
#                화살표 벡터의 길이가 변수의 분산을 표현.
   par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
   biplot(EOF, choices=1:2, scale=0, xlab="PC1 (96.85% explained var)", ylab="PC2 (0.03% explained var)" ) ; abline(h=0, v=0, lty=2)
   cor(data)
   
## FIG. 변수와 주성분(PC)의 시계열
cols = brewer.pal(4, 'Set2')  
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
legend_text = c("PC1 (0.96%)", "PC2 (0.04%)", "Anomaly_Ithaca", "Anomaly_Canandaigua")
plot (EOF_data$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-40, 40), ylab="Temperature [deg. C]", xlab="Date  [Day]", lwd=2)
lines(EOF_data$PC2, type='l', col=cols[2], lwd=2)
lines(EOF_data$anomaly_Ithaca, type='l', col=cols[3], lwd=2)
lines(EOF_data$anomaly_Canandaigua, type='l', col=cols[4], lwd=2)
legend( "topleft", legend=legend_text, col=cols, text.col=cols,
        cex=1.5,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T,  y.intersp = 0.3)

      
################################## 상관계수 (Correlation Coefficient) ################################################################################
EOF = prcomp(data, scale=TRUE, center=FALSE)  # 표준화 미고려, 아노말리 고려
Eigen_value = EOF$sdev^2  ;  Eigen_value
Eigen_vector = EOF$rotation  ;  Eigen_vector
EOF_data = data.frame(data, EOF$x)  ;  EOF_data
## 이타카, 캐넌다이과의 최저기온과 주성분

summary(EOF)

# dev.off()

## FIG. 주성분에 따른 분산
PC_var = (Eigen_value/sum(Eigen_value))*100.0
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot(PC_var, xlim=c(1, 10), ylim=c(0, 100), type='b',
     xlab='Principal  Component', ylab='Proportion  of  Variance,  Cumulative  Proportion')
points(cumsum(Eigen_value)/sum(Eigen_value)*100., col='blue', type='l')


## FIG. 행렬도 : 변수와 주성분(PC)의 관계를 그래프로 표현.
#                   화살표는 변수와 PC의 상관계수, PC와 평행수록 해당 PC에 큰 영향.
#                   화살표 벡터의 길이가 변수의 분산을 표현.
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
biplot(EOF, choices=1:2, scale=0, xlab="PC1 (95.06% explained var)", ylab="PC2 (0.05% explained var)" ) ; abline(h=0, v=0, lty=2)

## FIG. 변수와 주성분(PC)의 시계열.
cols = brewer.pal(4, 'Set2')  
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
legend_text = c("PC1 (0.95%)", "PC2 (0.05%)")
plot(EOF_data$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-3, 3), ylab="PC1, PC2", lwd=2)
lines(EOF_data$PC2, type='l', col=cols[2], lwd=2)
legend( "topleft", legend=legend_text, col=cols, text.col=cols,
        cex=1.5,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T,  y.intersp = 0.3)




####################### 기상청 최저/최고 기온 자료를 이용하여 영동과 영서에 대한 주성분 분석 ###########################################
data = read.csv("INPUT/KMA/2012-2016_Daily_Mean_Maximum_Air_Temperature_Gangwon.KMA", sep="", header=TRUE)
data = read.csv("INPUT/KMA/2012-2016_Daily_Mean_Minimum_Air_Temperature_Gangwon.KMA", sep="", header=TRUE)
head(data)  # 파일의 첫 행
tail(data)  # 파일의 끝 행

#################################### 자료 설명 ##############################################################
##
## 시간: 2012-2016년 일평균 자료 (지상관측)
##
## 변수: 년,    월,      일,    8개 지점별 최저/최고 기온
##       year  month    day   
##
## 지점: 춘천,  홍천,  원주,  영월,  동해,  강릉,  속초,  양양
##       ~~~~~~~~~~~~~~~~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~~~~~~~~
##              영서 지역                              영동 지역
##
## 출처: 기상청 국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do) 
##       Received data from master's course Se-Hun Rim
##
############################################################################################################

## 시계열 그리기 위해서 "년, 월, 일"에 대한 변수를 고려한 xran 변수 추가 그리고 4월달 자료만 추출.
MONTH = c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
data_L1 = data %>% 
   mutate( xran = year + ((month-1)/12.) + ((day-1)/(12.*MONTH[month])) ) %>%
   filter( year == 2016 ) %>%
   filter( month == 4 )

head(data_L1)

data_L1_FIG = data_L1 %>%
   filter( year == 2012 )

## FIG. 영동 지역에서 최고/최저 기온 자료 시계열
cols = brewer.pal(4, 'Set2')  
# display.brewer.all(n=10, exact.n=FALSE)
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot (data_L1_FIG$day, data_L1_FIG$Donghae, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-10, 40), ylab="Max / Min  Air  Temperature [deg. C]", xlab="Time  [Year-Month-Day]", lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Gangneung, type='l', col=cols[2], lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Sokcho, type='l', col=cols[3], lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Yangyang, col=cols[4], lwd=2)
legend( "top", legend=c("Donghae", "Gangneung", "Sokcho", "Yangyang"), 
        col=cols, text.col=cols, cex=1.3,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, horiz=TRUE,  y.intersp = 0.25)

## FIG. 영서 지역에서 최고/최저 기온 자료 시계열.
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot (data_L1_FIG$day, data_L1_FIG$Chuncheon, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-10, 40), ylab="Max / Min  Air  Temperature [deg. C]", xlab="Time  [Year-Month-Day]", lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Hongcheon, type='l', col=cols[2], lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Wonju, type='l', col=cols[4], lwd=2)
lines(data_L1_FIG$day, data_L1_FIG$Yeongwol, type='l', col=cols[4], lwd=2)
legend( "top", legend=c("Chuncheon", "Hongcheon", "Wonju", "Yeongwol"), 
        col=cols, text.col=cols, cex=1.3,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, horiz=TRUE,  y.intersp = 0.25)


## 시간(년, 월, 일) 및 NA 제거
data_L2 = na.omit(data_L1)
head(data_L2)

## year, month, day, xran 컬럼 제거
data_L3 = data_L2[ ,-c(1:3, 12)]
head(data_L3)

## 공분산 행렬
COV = cov(data_L3)  ;  COV  ## 공분산 행렬
ggcorrplot(COV, outline.col = "white", lab = TRUE) +
   scale_fill_gradientn(colours = matlab.like(10)) +
   labs(fill = "Cov") 


## 상관계수 행렬
COR = cor(data_L3)  ;  COV          ## 상관계수 행렬 계산
COR_pvalue = cor_pmat(data_L3)      ## 상관계수 행렬에 대한 p-value 계산
ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.05, 
           colors = c("#6D9EC1", "white", "#E46726")) 


## Anomaly calculation
nrow = dim(data_L3)[1]
ncol = dim(data_L3)[2]
anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
for (i in 1:ncol) {
   anomaly[,i] = data_L3[ ,i] - colMeans(data_L3)[i]
}     
colnames(anomaly) = c("anomaly_chuncheon", "anomaly_hongcheon", "anomaly_wonju" , "anomaly_yeongwol",
                      "anomaly_donghae"  , "anomaly_gangneung", "anomaly_sokcho", "anomaly_yangyang")




## 주성분 분석
EOF = prcomp(data_L3, scale=FALSE, center=TRUE)  ;  EOF
   ## scale = FALSE , center = FALSE : 공분산 행렬   (아노말리 미고려)
   ## scale = FALSE , center = TRUR  : 공분산 행렬   (아노말리 고려)
   ## scale = TRUE  , center = FALSE : 상관계수 행렬 (아노말리 미고려)
   ## scale = TRUE  , center = TRUE  : 상관계수 행렬 (아노말리 고려)
names(EOF)
Eigen_value = EOF$sdev^2      ;  Eigen_value         # 고유근
Eigen_vector = EOF$rotation   ;  Eigen_vector        # 고유벡터
EOF_data = data.frame(EOF$x)  ;  EOF_data            # Principal Components
summary(EOF)


## Eigen_value Test
sum(diag(COV))        # 공분산의 대각선 합
sum(Eigen_value)      # 고유근의 합

## PC1 (Eigen_vector) Test (PC scores using R-project  vs  PC scores using calculator)
EOF_data[1,1]
## PC1[1,1] = 0.3316495*anomaly(Chuncheon) + ... +  0.3798507*anomaly(Yangyang)
# sum(-16.90909589*0.3448694, -17.093260274*0.3505063, -16.24663014*0.3380886, -16.991287671*0.3409752, -16.44942466*0.3352016,
    # -1.219956e+01*0.2659710, -13.33632877*0.2883571, -11.74356164*0.2815221, -15.407342466*0.3211499, -13.13715068*0.2810695)

## FIG. 주성분에 따른 설명력
PC_var = (Eigen_value/sum(Eigen_value))*100.0
fviz_screeplot(EOF, addlabels = TRUE, title="", x="Principal  Components", y="Percentage  of  Variance  [%]", ncp = 8) + 
   geom_point(aes(y=cumsum(PC_var)), col='red') + geom_line(aes(y=cumsum(PC_var)), col='red') +
   theme_bw(base_size = 15)

## FIG. 고유벡터를 이용하여 2D contour mapping 

## 위도 및 경도 설정
## 춘천,  홍천,  원주,  영월,  동해,  강릉,  속초,  양양
lat = c( 37.90256000,  37.68358000,  37.33756000,  37.18126000,  37.50708000,  37.75147000,  38.25085000,  38.08724000)
lon = c(127.73570000, 127.88040000, 127.94660000, 128.45743000, 129.12433000, 128.89098000, 128.56472000, 128.62966000)
obs = c( "Chuncheon",  "Hongcheon",    "Wonju",   "Yeongwol",    "Donghae",  "Gangneung",     "Sokcho",    "Yangyang")

data_L4 = data.frame( obs, lon, lat, unname(Eigen_vector) )
colnames(data_L4) = c("obs", "lon", "lat", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")

data_L5 = data_L4
head(data_L5)

coordinates(data_L5)  = ~lon + lat
plot(data_L5)

# summary(data_L5)

################################# 통계적 보간법 ##############################################################################
##
## 10개 지점별 고유벡터를 시각화하기 위해 공간분포의 통계적 보간법 (선형 내삽, 2차원 내삽, 역거리 가중치법, 크리킹 등) 이용
##
## 역거리 가중치법 (Inverse Distance Weighting, IDW)
## - 거리에 따른 주변값들의 가중값을 산정하여 가중평균을 통해 지점별 관측의 값을 보간하는 방법 (Kim et al., 2010)
## - 즉 IDW 보간법은 대표적인 공간분포 보간법으로 관측점간의 거리에 반비례하여 가중치를 할당하고, 
##   거리가 멀어짐에 따라 지수적으로 감소한다. 
##  
## Kim D, Ryu D, Choi Y, Lee W. 2010. Application of kriging and inverse distance weighting method for 
##        the estimation of geo-layer of Songdo area in Incheon. Journal of Korean Geotechnical Society 26(1):5-19.
##
##############################################################################################################################

## 1 km 해상도로 주변 격자를 구성
y.range = as.numeric(c(35, 40))    # min/max latitude  of the interpolation area
x.range = as.numeric(c(127, 130))  # min/max longitude of the interpolation area

grd = expand.grid(x = seq(from = x.range[1], to = x.range[2], by = 0.01),
                  y = seq(from = y.range[1], to = y.range[2], by = 0.01))  # expand points to grid
coordinates(grd) <- ~ x + y
gridded(grd) <- TRUE

# plot(grd, cex = 1.0, col = "grey")
# points(data_L5, pch = 1, col = "red", cex = 1)

idw = idw(formula = PC2 ~ 1, locations = data_L5, newdata = grd)  # apply idw model for the data
idw.output = as.data.frame(idw)  # output is defined as a data table
names(idw.output)[1:3] = c("long", "lat", "var1.pred")  # give names to the modelled variables

world = map_data("worldHires")

longitude = seq(100, 150, 0.5)
latitude  = seq(20, 60, 0.5)
x_longitude = unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
y_latitude  = unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

ggplot() + 
   coord_fixed(ratio = 0.75) +
   theme_bw() +
   geom_tile(data = idw.output, aes(x = long, y = lat, fill = var1.pred)) + 
   scale_fill_gradient2() +
   # scale_fill_gradientn(colours = matlab.like(10), limits=c(-1.0, 1.0)) +
   guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position="right", title.vjust = 0.5, title.hjust = 0.5) ) +
   geom_point(aes(x = lon, y = lat, colour=obs), data = data_L4, size=5, alpha=0.3, show.legend = FALSE) +
   geom_text(aes(x=lon, y=lat, label=obs, colour=obs), hjust=-0.25, vjust=0.5, nudge_x=0, nudge_y=0, size=5,
             colour='black', data=data_L4) +
   # geom_label(aes(x=lon, y=lat, label=obs, colour=obs), hjust=-0.25, vjust=0.5, nudge_x=0, nudge_y=0, size=5,
   # data=data_L4, show.legend = FALSE) +
   geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
   # scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(125, 130)) +
   # scale_y_continuous(breaks = latitude,  labels = y_latitude,  expand=c(0,0), limits=c(32.5, 40)) +
   scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(127, 130)) +
   scale_y_continuous(breaks = latitude,  labels = y_latitude,  expand=c(0,0), limits=c(36, 39)) +
   theme(plot.title=element_text(face="bold", size=20, color="black")) +
   theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
   theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
   theme(axis.text.x  = element_text(face="bold", size=15, colour="black")) +
   theme(axis.text.y  = element_text(face="bold", size=15, colour="black")) +
   theme(legend.text=element_text(size=14, color="black")) +
   theme(legend.title=element_text(size=14, angle=90)) +
   labs(x = "") +
   labs(y = "") +
   labs(fill = "Eigen  Vector") +
   labs(colour = "") +
   labs(title  = paste0("PC2 (", sprintf("%.2f", PC_var)[2], "%  Explanined  Variance) \n")) +
   theme(legend.background=element_blank())


## FIG. 각 지점에서 각 주성분들의 기여도
var = get_pca_var(EOF)
contrib = var$contrib
ggcorrplot(contrib, outline.col = "white", lab = TRUE) + 
   scale_fill_gradientn(colours = matlab.like(10)) +
   labs(fill = "Contrib") 

rowSums(contrib)  # 행에 대한 합계
colSums(contrib)  # 열에 대한 합계


## FIG. 각 지점별 주성분들의 기여도
fviz_contrib(EOF, choice = "var", axes = 2)

## FIG. 각 사례별 주성분들의 기여도 == 고유벡터의 절대값이 큰 경우
fviz_contrib(EOF, choice = "ind", axes = 2) + coord_flip()

## PC 1-2에 대해 고유벡터를 시각화
fviz_pca_var( EOF, axes=c(2,2), col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") ) 

## PC 1-2에 대해 주성분들를 시각화
fviz_pca_ind(EOF, axes=c(2,2), col.ind = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

## PC 1-2에 대해 고유벡터 및 주성분 Score를 시각화
fviz_pca_biplot(EOF, axes = c(2, 2), col.var="contrib", col.ind ="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
   theme_bw(base_size = 15)


## 년, 월, 일, 관측소별 최고/최저 기온 및 아노말리, 주성분 score
data_L6 = data.frame(data_L2$year, data_L2$month, data_L2$day, data_L3, anomaly, EOF_data)
head(data_L6)

summary(EOF)

data_L6_FIG = data_L6 %>%
   dplyr::rename( year  = data_L2.year, 
                  month = data_L2.month, 
                  day   = data_L2.day ) %>%
   dplyr::filter( year == 2016 )

## FIG. 변수와 주성분(PC 1-5)의 시계열.
cols = brewer.pal(5, 'Set2')  
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot  (data_L6_FIG$day, data_L6_FIG$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-80, 80), ylab="Principal  Component  Scores", xlab="Time  [Year-Month-Day]", lwd=2)
points(data_L6_FIG$day, data_L6_FIG$PC2, type='l', col=cols[2], lwd=2)
points(data_L6_FIG$day, data_L6_FIG$PC3, type='l', col=cols[3], lwd=2)
points(data_L6_FIG$day, data_L6_FIG$PC4, type='l', col=cols[4], lwd=2)
points(data_L6_FIG$day, data_L6_FIG$PC5, type='l', col=cols[5], lwd=2)
legend_text = c("PC1 (80.94%)", "PC2 (15.62%)", "PC3 (1.47%)", "PC4 (0.70%)", "PC5 (0.55%)")
legend("top", legend=legend_text, col=cols, text.col=cols,
cex=1.2,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T,  y.intersp = 0.25, horiz=TRUE)


############################# 영동 및 영서 지점들에 대한 바람장미 ##############################################
## - 영동과 영서 지역에 온도 차가 클 때 (PC2 절대값 최대) 바람장미의 분포를 보고자 함.
## - 2016년 4월 9, 17일 (영동 - 영서) > 0 (푄 현상으로 인해 기온 상승)          ->  서풍 계열 예상
## - 2016년 4월    26일               < 0 (시베리안 고기압으로 인해 기온 하강)  ->  북풍 또는 동풍 계열 예상
################################################################################################################

wind_data = read.csv("INPUT/KMA/201604_Hourly_Mean_Wind_Gangwon.KMA", sep="", header=TRUE)
head(wind_data)  # 파일의 첫 행

#################################### 자료 설명 ##############################################################
##
## 시간: 2016년 4월 시간별 평균된 자료 (지상관측)
##
## 변수: 년,    월,      일,    7개 지점별 풍향 및 풍속 자료
##       year  month    day                *_wd    *_ws
##
## 지점: 춘천,  홍천,  원주,  영월,  동해,  강릉,  속초
##       ~~~~~~~~~~~~~~~~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~
##              영서 지역                영동 지역
##
## 출처: 기상청 국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do) 
##
############################################################################################################

wind_data_L1 = na.omit(wind_data)  # NA 제거

wind_data_L2 = wind_data_L1 %>%
   filter( year  == 2016 ) %>%
   filter( month == 4 ) %>%
   filter( day   == 26 )

summary(wind_data_L2)

## 영동 지역에 대한 바람장미
FIG_Gangneung_wr = ggplot_windrose(spd = wind_data_L2$Gangneung_ws, dir = wind_data_L2$Gangneung_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Gangneung")
FIG_Donghae_wr = ggplot_windrose(spd = wind_data_L2$Donghae_ws, dir = wind_data_L2$Donghae_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Donghae")
FIG_Sokcho_wr = ggplot_windrose(spd = wind_data_L2$Sokcho_ws, dir = wind_data_L2$Sokcho_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Sokcho")
grid.arrange(FIG_Gangneung_wr, FIG_Donghae_wr, FIG_Sokcho_wr, nrow=1, ncol=3)   ## 다중 그림


## 영서 지역에 대한 바람장미
FIG_Chuncheon_wr = ggplot_windrose(spd = wind_data_L2$Chuncheon_ws, dir = wind_data_L2$Chuncheon_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Chuncheon")
FIG_Wonju_wr = ggplot_windrose(spd = wind_data_L2$Wonju_ws, dir = wind_data_L2$Wonju_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Wonju")
FIG_Yeongwol_wr = ggplot_windrose(spd = wind_data_L2$Yeongwol_ws, dir = wind_data_L2$Yeongwol_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
   labs(title = "Yeongwol")
grid.arrange(FIG_Chuncheon_wr, FIG_Wonju_wr, FIG_Yeongwol_wr, nrow=1, ncol=3)   ## 다중 그림


#################################### ECMWF 자료를 이용하여 주성분 분석 및 그래픽 표출 ###########################################
 
#################################### 자료 설명 ##############################################################
##
## 시간: 2016년 4월 1-30일 일평균 자료 (ECMWF ERA-Interim)
##
## 변수: 위도 (181),    경도 (360),      시간 (30),    지표면 기압 (360, 181, 30)   ------ 1 deg
##        longitude      latitude          time              Surface pressure
##
##       위도 (1441),   경도 (2880),     시간 (30),    지표면 기압 (2880, 1441, 30) ------ 0.125 deg
##        longitude      latitude          time              Surface pressure
##
## 출처: ECMWF era-interim (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim) 
##       Received data from master's course Wenting Zhang
##
#############################################################################################################

ncfile = open.nc("INPUT/ECMWF_mslp_1deg_201604.nc")
print.nc(ncfile)

ncdata = read.nc(ncfile)

file_lines = dim.inq.nc(ncfile, "time")$length
time_units = att.get.nc(ncfile, "time", "units")
time = utcal.nc(time_units, ncdata$time)

# slp_scale     =  att.get.nc(ncfile, "sp", "scale_factor")
# slp_offset    =  att.get.nc(ncfile, "sp", "add_offset")
# slp_fillvalue =  att.get.nc(ncfile, "sp", "_FillValue")
   
lon  = ncdata[["longitude"]]
lat  = ncdata[["latitude"]]
# slp  = ncdata[["sp"]]/100.0  # (ncdata[["sp"]]*slp_scale) + slp_offset
                               # Pa -> hPa
msl  = ncdata[["msl"]]/100.0   # (ncdata[["sp"]]*slp_scale) + slp_offset
                               # Pa -> hPa

## Convert 2D matrix to 1D matrix
msl_matrix = matrix(msl, nrow = dim(lon)*dim(lat), ncol = file_lines)
head(msl_matrix)

##================================== slp_matxix ======================================================
##       1일      2일     ...     29일       30일     (2017년 4월)   => slp_matrix[1:65160, 30]
##    360*181                               360*181
##    (65,160)                              (65,160)   
##====================================================================================================


## FIG. ECMWF 자료를 이용한 지표면 기압 맵핑 (-90°N-90°N, 0°E-360°E)
lonlat = as.matrix(expand.grid(lon,lat))
colnames(lonlat) = c("long", "lat")

longitude   = seq(-180, 180, 20)
latitude    = seq(-90, 90, 10)
x_longitude = unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
y_latitude  = unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

output = data.frame(lonlat, slp_matrix)

## Convert longitude (0-360) to (-180 to 180 or 180°W-180°E)
output$long = ( (output$long+180) %% 360 ) - 180.0

## Convert longitude (-180 to 180 or 180°W-180°E) to (0-360)
# output$long = ( output$long %% 360 )

world = map_data("worldHires")

ggplot() + 
   coord_fixed(ratio = 0.75) +
   theme_bw() +
   geom_tile(data = output, aes(x = long, y = lat, fill = X1)) +
   scale_fill_gradientn(colours = matlab.like(10)) +
   # scale_fill_gradientn(colours = matlab.like(10), limits=c(500, 1100), na.value="#AA0000", title.vjust = 0.5, title.hjust = 0.5) +
   guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position = "right", title.vjust = 0.5, title.hjust = 0.5) ) +
   geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
   # scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(-180, 180)) +
   # scale_y_continuous(breaks = latitude,  labels = y_latitude,  expand=c(0,0), limits=c(-90, 90)) +
   scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(80, 160)) +
   scale_y_continuous(breaks = latitude,  labels = y_latitude,  expand=c(0,0), limits=c(10, 70)) +
   theme(plot.title=element_text(face="bold", size=20, color="black")) +
   theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
   theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
   theme(axis.text.x  = element_text(face="bold", size=15, colour="black")) +
   theme(axis.text.y  = element_text(face="bold", size=15, colour="black")) +
   theme(legend.text=element_text(size=14, color="black")) +
   theme(legend.title=element_text(size=14, angle=90)) +
   labs(x = "") +
   labs(y = "") +
   labs(fill = "Mean  Sea  Level  Pressure  [hPa]") +
   theme(legend.background=element_blank())



## 동아시아 영역에 대한 주성분 분석 (20°N-60°N, 115°E-150°E)
output_L1 = output %>%
   dplyr::filter(115 <= long  &  long <= 150) %>%
   dplyr::filter(20  <= lat   &  lat <= 60)

head(output_L1)

## 위도, 경도 추출 
mls_matxix_L1 = output_L1[ ,-c(1,2)]

head(mls_matxix_L1)


## 전치행렬로 변환
##================================== slp_matxix_L1 ===================================================
##       1일      2일     ...     29일       30일     (2017년 4월)   => slp_matrix_L1[1:4941, 30]
##     81*61                                81*61
##     (4941)            ..                 (4941)   
##====================================================================================================

##                                  ↓ transpose of a matrix
 
##==================================  t(slp_matxix_L1)================================================
##      1일           81*61 (4941)                                   => t(slp_matrix_L1[1:4941, 30])
##      2일                                                          => [30, 1:4941]
##      .                 .
##      .                 .
##      .                 .
##     29일 
##     30일          81*61 (4941)
##  (2017년 4월)
##====================================================================================================

EOF = prcomp(t(mls_matxix_L1), scale=F, center=T)
Eigen_value  = EOF$sdev^2          # 고유근
Eigen_vector = EOF$rotation        # 고유벡터
EOF_data = data.frame(EOF$x)  
summary(EOF)

## Anomaly calculation
nrow = dim(mls_matxix_L1)[1]
ncol = dim(mls_matxix_L1)[2]
anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
for (i in 1:ncol) {
   anomaly[,i] = mls_matxix_L1[ ,i] - colMeans(mls_matxix_L1)[i]
}



## FIG. 주성분에 따른 설명력
PC_var = (Eigen_value/sum(Eigen_value))*100.0
fviz_screeplot(EOF, addlabels = TRUE, title="", x="Principal  Components", y="Percentage  of  Variance  [%]", ncp = 30) + 
   geom_point(aes(y=cumsum(PC_var)), col='red') + geom_line(aes(y=cumsum(PC_var)), col='red') +
   theme_bw(base_size = 15)


## FIG. 주성분의 고유벡터를 이용하여 맵핑
long = output_L1$long
lat  = output_L1$lat
output_L3 = data.frame(long, lat, Eigen_vector)
head(output_L3)

# world = map_data("worldHires")
world = map_data("world")

longitude <- seq(80, 160, 5)
latitude <- seq(10, 70, 5)
x_longitude <- unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
y_latitude <- unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

# summary(output_L3)

ggplot() + 
   coord_fixed(ratio = 0.75) +
   theme_bw() +
   geom_tile(data = output_L3, aes(x = long, y = lat, fill = -PC1)) + 
   # scale_fill_gradient2(limits=c(-0.1, 0.1)) +
   scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red') +
   guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position = "right", title.vjust = 0.5, title.hjust = 0.5) ) +
   geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
   scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(115, 150)) +
   scale_y_continuous(breaks = latitude,  labels = y_latitude,  expand=c(0,0), limits=c(20, 60)) +
   theme(plot.title=element_text(face="bold", size=20, color="black")) +
   theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
   theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
   theme(axis.text.x  = element_text(face="bold", size=15, colour="black")) +
   theme(axis.text.y  = element_text(face="bold", size=15, colour="black")) +
   theme(legend.text=element_text(size=14, color="black")) +
   theme(legend.title=element_text(size=14, angle=90)) +
   labs(x = "") +
   labs(y = "") +
   labs(fill = "Eigen  Vector \n") +
   labs(title  = paste0("PC1 (", sprintf("%.2f", PC_var)[1], "%  Explanined  Variance) \n" )) +
   theme(legend.background=element_blank()) +
   ggsave(filename=paste0('FIG/PC2.png'), width=12, height=6, dpi=600)



## FIG. 각 지점별 주성분들의 기여도
# fviz_contrib(EOF, choice = "var", axes = 3)

## FIG. 각 사례별 주성분들의 기여도 == 고유벡터의 절대값이 큰 경우
fviz_contrib(EOF, choice = "ind", axes = 3) + coord_flip()

## PC 1-2에 대해 고유벡터를 시각화
# fviz_pca_var( EOF, axes=c(2,2), col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") ) 

## PC 1-2에 대해 주성분들를 시각화
fviz_pca_ind(EOF, axes=c(2,2), col.ind = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

## PC 1-2에 대해 고유벡터 및 주성분 Score를 시각화
# fviz_pca_biplot(EOF, axes = c(2, 2), col.var="contrib", col.ind ="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
   # theme_bw(base_size = 15)



## FIG. 변수와 주성분(PC)의 시계열.
output_L4 = data.frame(time, EOF_data)
head(output_L4)

summary(output_L4)

summary(EOF)

cols = brewer.pal(5, 'Set2')  
par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
plot  (output_L4$day, output_L4$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], xlim=c(1, 30), ylim=c(-600, 600), ylab="Principal  Component  Scores", xlab="Time  [Day]", lwd=2)
points(output_L4$day, output_L4$PC2, type='l', col=cols[2], lwd=2)
points(output_L4$day, output_L4$PC3, type='l', col=cols[3], lwd=2)
points(output_L4$day, output_L4$PC4, type='l', col=cols[4], lwd=2)
points(output_L4$day, output_L4$PC5, type='l', col=cols[5], lwd=2)
## ECMWF_slp_1deg_201704
legend_text = c("PC1 (34.04%)", "PC2 (16.21%)", "PC3 (11.79%)", "PC4 (8.61%)", "PC5 (6.26%)")
## ECMWF_slp_0.125deg_201704
# legend_text = c("PC1 (34.52%)", "PC2 (15.82%)", "PC3 (11.67%)", "PC4 (8.61%)", "PC5 (6.28%)")
legend( "top", legend=legend_text, col=cols, text.col=cols,
        cex=1.0,  pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T,  y.intersp = 0.25, horiz=TRUE)


########################### 영동 및 영서 지점의 기온과 주성분(PC)의 상관 분석 ###################################################
## - 2016년 4월 1-30 일 영동 지역 (기온 하강 ~ 북풍 또는 동풍 계열)  <-  PC3의 고유벡터가 동해상에 고기압 위치
##                                                                   <-  영동 지역의 기온과 PC3의 주성분과의 높은 상관성 예상
##
##                      영서 지역 (기온 상수 ~ 서풍 계열)            <-  PC4의 고유벡터가 동해상에 저기압 위치
##                                                                   <-  영서 지역의 기온과 PC4의 주성분과의 높은 상관성 예상
#################################################################################################################################
Maxtemp = data_L1 %>%
   filter( year == 2016 ) 

output_L5 = data.frame(Maxtemp, EOF_data)
output_L6 = output_L5 %>%
   dplyr::select ( Chuncheon, Hongcheon, Wonju, Yeongwol, Donghae, Gangneung, Sokcho, Yangyang, 
                   PC1, PC2, PC3, PC4, PC5, PC6)

## FIG. 상관계수 행렬
COR = cor(output_L6)                  ## 상관계수 행렬 계산
COR_pvalue = cor_pmat(output_L6)      ## 상관계수 행렬에 대한 p-value 계산
ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.05, 
           colors = c("#6D9EC1", "white", "#E46726")) 
ggcorrplot(COR, outline.col = "white", lab = TRUE,  
           colors = c("#6D9EC1", "white", "#E46726")) 

## FIG. 산점도 행렬 
ggpairs(output_L6)

## 기여도가 높은 날짜에 대해서 선정
output_L7 = output_L6 %>%
   slice( c(3, 4, 5, 6, 7, 11, 13, 17, 21, 22) )

## FIG. 상관계수 행렬
COR = cor(output_L7)                  ## 상관계수 행렬 계산
COR_pvalue = cor_pmat(output_L7)      ## 상관계수 행렬에 대한 p-value 계산
ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.1, 
           colors = c("#6D9EC1", "white", "#E46726")) 
ggcorrplot(COR, outline.col = "white", lab = TRUE,  
           colors = c("#6D9EC1", "white", "#E46726")) 

## FIG. 산점도 행렬 
ggpairs(output_L7)




############################# 연구 영역에 따른 PC3 and PC4 설명력 ###############################
summary(EOF)

LAT_MIN = seq(0, 60, 5)
LON_MIN = seq(100, 180, 5)

LAT_MAX = seq(5, 60, 5)
LON_MAX = seq(105, 180, 5)

for (i in 1:length(LAT_MIN)) {
for (j in 1:length(LON_MIN)) {
for (k in 1:length(LAT_MAX)) {
for (l in 1:length(LON_MAX)) {

   output_L1 = output %>%
      dplyr::filter(LAT_MIN[i] <= lat   &  lat  <= LAT_MAX[k]) %>%
      dplyr::filter(LON_MIN[j] <= long  &  long <= LON_MAX[l])
   
   if (dim(output_L1)[1] > 0) {
      slp_matxix_L1 = output_L1[ ,-c(1,2)]
      EOF = prcomp(t(slp_matxix_L1), scale=F, center=T)
      Eigen_value  = EOF$sdev^2          # 고유근
      PC_var = (Eigen_value/sum(Eigen_value))*100.0

      # cat(LAT_MIN[i], LAT_MAX[k], LON_MIN[j], LON_MAX[l], PC_var[3:4], sum( PC_var[3:4]), "\n",
          # file = "OUTPUT/PC3 or PC4 variance according to research area.OUT", append=T)
   }
}
}
}
}

output_L8 = read.csv("OUTPUT/PC3 or PC4 variance according to research area.OUT", sep="", header=F)
colnames(output_L8) = c("lat_min", "lat_max", "lon_min", "lon_max", "PC3", "PC4", "PC34_sum")
head(output_L8)  # 파일의 첫 행


output_L8 %>% 
   dplyr::arrange( desc(PC34_sum) )  # 내림차순 정렬

기상청 최고 기온 자료를 이용한 영동 및 영서에 대한 주성분 분석

2012-2016년 일평균 최고 기온 자료를 이용한 공분산 행렬

주성분에 따른 설명력

2번째 주성분에 대한 고유벡터 가시화

2016년 04월 17일 영동 및 영서 지역에 바람장미 분포

2016년 04월 26일 영동 및 영서 지역에 바람장미 분포

영동과 영서에 대한 기온 및 바람과 연관성
- 2번째 고유 벡터 (PC2) > 0
  - 영동 > 영서 (푄 현상으로 인해 기온 상승) > 서풍 계열
- 2번째 고유벡터 (PC2) < 0
  - 영동 < 영서 (시베리안 고기압으로 인해 기온 하강) > 북풍 또는 동풍 계열

[기계 학습]

인공신경망 (Neural Networks)
- hidden = 1 : 은닉 노드 수 c(100,100)
- threshold : 편미분에 대한 종료기준 임계값 (기본값 0.01)
- rep : 훈련 과정 반복 수 (기본값 1)
- linear.output = FALSE : 출력 구간 [0, 1]의 활성화 함수에 일치되도록 보장 (기본값 TRUE)
- act.fct : 미분 활성화 함수(기본값 "logistic")
- err.fct : 미분 오차 함수(기본값 "sse")
- learningrate : 학습속도(step), backprop 사용시 설정
- algorithm : backprop > 전통적인 역전파(backpropagation)

딥러닝 (Deeplearning)
- training_frame : 학습 데이터 셋 h2o type
- nfolds : 교차검정에서 N-fold
- stopping_rounds : stops_metric의 수렴에 기초한 조기 정지 (기본값 5)
- stopping_metric : 조기 정지에 사용되는 미터법 ("MSE", "RMSE", "MAE", "RMSLE" "AUC")
- MSE : 평균제곱오차 (Mean Squared Error) > ML에서 cost Function으로 사용
- epochs : 학습과정에서 훈련데이터를 모두 소진한 횟수
- overwrite_with_best_model : 훈련된 최적의 모형으로 대체 (default : TRUE)
- activation : tanh (sigmoid 함수를 재활용, sigmoid의 범위를 -1에서 1로 넓힘)
- input_dropout_ratio : 입력 레이어 드롭 아웃 비율(일반화 향상, 0.1 또는 0.2, 기본값 0)

data = read.csv("INPUT/IEODO_Observation_QC_2017-06-16,17.OUT", sep="", head=T)
data = na.omit(data)  # NA값 제거

# 이어도 관측자료 (TrainDF: 2017-06-16, TestDF: 2017-06-17)
# 년, 월, 일, 시, 분, 풍향, 풍속, 온도, 상대습도, 기압, 일사량, 일조시간, 태양 방위각, 태양 천정각
head(data)

#================================================
#     Neural Networks
#================================================
set.seed(123)

OriDataFrame = data[ ,c(-1:-5)]  # 년,월,일,시,분 제거
str(OriDataFrame)

# Normalizing the data in interval [0, 1]
# Normalization is necessay so that each variable is scale properly
# and none of the variables overdominates
# scale function will give mean=0 and standard deviation=1 for each variables
maxValue = apply(OriDataFrame, 2, max)
minValue = apply(OriDataFrame, 2, min)
rangeInsValue = (maxValue["ins"] - minValue["ins"])

DataFrame = as.data.frame( scale(OriDataFrame, center=minValue, scale=maxValue-minValue) )

summary(DataFrame)

# Lets create the train and test data set
ind = sample(1:nrow(DataFrame), length(DataFrame[,1])/2)
# ind = 1:809
trainDF = DataFrame[ind, ]
testDF = DataFrame[-ind, ]

# Lets take some configuration for neural network
# say 13-4-2-1
# So number of hidden layers=2
# input layer had 10 units

# we need this as formula
# medv ~ crim + zn ...

allVars = colnames(DataFrame)
predictorVars = allVars[!allVars %in% "ins"]
predictorVars = paste(predictorVars, collapse="+")
form = as.formula(paste("ins~", predictorVars, collapse = "+"))
# ins ~ ws + wd + temp + rh + pre + sunshine + azimuth + zenith

# neuralModel = neuralnet(formula=form, data=trainDF)
# neuralModel = neuralnet(formula=form, hidden=c(10,10), data=trainDF)
# neuralModel = neuralnet(formula=form, hidden=c(20,20,20), data=trainDF)
neuralModel <- neuralnet(formula = form, hidden = c(4, 2), linear.output = T, data = trainDF)

# 3) NN모델 생성 - 은닉 노드 1개 
# 형식) neuralnet(formula, data, hidden) 
neuralModel = neuralnet(formula = form, data=trainDF, hidden = 1, threshold = 0.01,
                        linear.output=FALSE, act.fct = 'logistic')
# ?neuralnet
# hidden = 1 : 은닉노드수 c(100,100)
# threshold : 편미분에 대한 종료기준 임계값(기본값:0.01)
# rep : 훈련 과정 반복 수(기본값 1)
# linear.output=FALSE -> 출력 구간[0, 1]의 활성화 함수에 일치되도록 보장(기본값: TRUE) 
# act.fct : 미분 활성화 함수(기본값 "logistic")
# err.fct : 미분 오차 함수(기본값 "sse")
# learningrate : 학습속도(step), backprop 사용시 설정
# algorithm : backprop -> 전통적인 역전파(backpropagation)


# Plot the neural net
# plot(neuralModel)

# Predict for test data set
predictions = compute(neuralModel, testDF[ ,-7])
# str(predictions)

# predictions = predictions$net.result*(max(testDF$ins)-min(testDF$ins)) + min(testDF$ins)
# actualValues = (testDF$ins)*(max(testDF$ins)-min(testDF$ins)) + min(testDF$ins)

# predictions = predictions$net.result
# actualValues = testDF$ins

predictions = predictions$net.result * rangeInsValue
actualValues = testDF$ins * rangeInsValue

RMSE = sqrt(sum( (predictions-actualValues)^2 )/nrow(testDF))  ;  RMSE
cor(predictions, actualValues)

plot(actualValues, predictions, col='blue', main='Recl vs Predicted', pch=1, type="p", xlab="Actual", ylab="Predicted")
cor(testDF$ins, predictions)
abline(0, 1, col="black")


#================================================
#     Deeplearning
#================================================

# H2o is a fast and scalable opensource machine learning platform.
# Several algorithms are available.
# set.seed(123)

# initiallization
h2o.init()

# Defining x and y
y = "ins"
x = setdiff( colnames(DataFrame),y )

# Lets create the train and test data set
# ind = sample(1:nrow(DataFrame), 400)
# trainDF = DataFrame[ind, ]
# testDF = DataFrame[-ind, ]

# Fitting the model
# ?h2o.deeplearning

model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF))
# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), nfolds=5, hidden=c(10, 10))
# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), hidden=c(5000,5000))
# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), nfolds=5, hidden=c(1000,1000))
model

# model 생성 : network = 13-10-10-1 
model <- h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF),
                          nfolds = 3, #stopping_rounds = 7,
                          stopping_metric = 'RMSE',
                          epochs = 400, overwrite_with_best_model = TRUE,
                          activation = 'Tanh', input_dropout_ratio = 0.1,
                          hidden = c(10,10), l1 =0,
                          loss = 'Automatic', distribution = 'AUTO')
model  
#training_frame : 학습데이터셋-h2o type   
#nfolds : 교차검정에서 N-fold 
#stopping_rounds : stops_metric의 수렴에 기초한 조기 정지(기본값 5)
#stopping_metric : 조기 정지에 사용되는 미터법( "MSE", "RMSE", "MAE", "RMSLE" "AUC",) 
#MSE : 평균제곱오차(Mean Squared Error) -> ML에서 cost Function으로 사용 
#epochs(1에폭) : 학습과정에서 훈련데이터를 모두 소진한 횟수 
#overwrite_with_best_model : 훈련된 최적의 모형으로 대체(default : TRUE)
#activation : tanh(sigmoid 함수를 재활용, sigmoid의 범위를 -1에서 1로 넓힘)
#input_dropout_ratio : 입력 레이어 드롭 아웃 비율(일반화 향상, 0.1 또는 0.2, 기본값 0)
# 드롭 아웃 : 매개변수가 많아서 신경망이 복잡한 경우 뉴런을 임의 삭제하여 학습하는 방법 
# 훈련 시 은닉층의 뉴런을 무작위 삭제하여 신호 전달 차단  
# hidden : Hidden layer sizes(2층)

# Predict for test data set
predictions = as.data.frame(predict(model, as.h2o(testDF)))
# str(predictions)   


predictions = predictions$predict * rangeInsValue
actualValues = testDF$ins * rangeInsValue

RMSE = sqrt(sum( (predictions-actualValues)^2 )/nrow(testDF))  ; RMSE

plot(actualValues, predictions, col='blue', main='Recl vs Predicted', 
     pch=1, type="p", xlab="Actual", ylab="Predicted")
cor(testDF$ins, predictions)
abline(0, 1, col="black")

# h2o.shutdown(prompt=F)

인공신경망을 이용한 일사량 예측 결과
- 상관계수 (R) : 0.93
- 평균제곱근오차 (RMSE) : 140.84

딥러닝을 이용한 일사량 예측 결과
- 상관계수 (R) : 0.97
- 평균제곱근오차 (RMSE) : 89.09

[전체]

	#==========================================================================================================================================
	# Routine : Main R-project program
	#
	# Purpose : Advanced Meteorological Statistics R-project Practice
	# - R-project 기초
	# - Chapter 1. 자료의 특성
	# - Chapter 2. 자료의 상관성
	# - Chapter 3. 통계적 유의수준
	# - Chapter 4. 조화 분석
	# - Chapter 5. 주성분 분석
	# - Chapter 6. 인공신경망, 딥러닝
	#
	# Author : MS. Sang-Ho Lee
	#
	# Revisions: V1.0 June 03, 2017 First release (MS. Sang-Ho Lee)
	# V2.0 December 14, 2017 Add practice using Korea Meteological Adminstration (KMA) data and visualization (MS. Sang-Ho Lee)
	#==========================================================================================================================================

	#======================================================================================================
	# R 기초
	#======================================================================================================
	# 1993년 뉴질랜드 오클랜드 대학의 통계학과 교수 2명(Ross Ihaka, Robert Gentleman) 개발
	# (Source : https://en.wikipedia.org/wiki/R_(programming_language))

	# 특징 :
	# - 빠른 처리 속도(H/W 메모리 크기에 영향 받음)
	# - 데이터, 함수, 차트 등 모든 것이 object로 관리
	# - 최신의 알고리즘 및 방법론이 패키지 (Package)로 제공됨 (Neural Networks, Deeplearning et al)
	# - 분석에 통찰을 부여할 수 있는 그래픽에 대한 강력한 지원 (plotly : https://moderndata.plot.ly/interactive-r-visualizations-with-d3-ggplot2-rstudio/
	# leaflet : https://rstudio.github.io/leaflet/
	# ggplot2, ggvis et al)

	# Window 설치 : http://www.r-project.org/ → Download(CRAN)
	# http://www.rstudio.com/ → Desktop → Rstudio

	# Linux 설치 : (Source : http://m.blog.daum.net/empirical/17382805)

	# Rstudio 4가지 패널 창 기능 : 코딩 영역, 콘솔(실행결과), 작업내역/ 환경, 탐색시/그래프/도움말/패키지 구성

	#------------------ version와 패키지 및 session 보기 -----------------------
	sessionInfo() # 64 bit 권장 !
	?sessionInfo

	#----------------------- 단축키 ------------------------------
	# Ctrl + Enter or Ctrl + r : script 실행
	# Ctrl + s : 저장

	#------------------------ 변수명 작성 규칙 ------------------------------
	# - 첫자는 영문자, 두번째는 숫자,특수문자(_ .)
	# - 대소문자 구분(num, NUM)
	# - 명령어, 함수명 사용 불가
	# - 가장 최근값 저장
	# - R의 모든 변수 = 객체


	#------------------------ 패키지 ------------------------------
	# install.packages("moments") # 패키지 설치
	# library(moments) # 패키지 사용
	source("Library.R") # 패키지 사용 및 설치


	#=======================================
	# Chapter 1. 자료의 특성
	#=======================================
	X = c(1, 2, 3, 4, 5)
	Y = c(3, 6, 1, 2, 3)

	length(Y) # 자료의 개수
	mean(Y) # 평균 = sum(Y)/length(Y)

	median(Y) # 중간값
	names(which.max(table(Y))) # 최빈값
	var(Y) # 분산 = sum((Y-mean(Y))^2)/(length(Y)-1)
	sd(Y) # 표준편차 = sqrt(sum((Y-mean(Y))^2)/(length(Y)-1))
	max(Y) # 최대값
	min(Y) # 최소값
	skewness(Y) # 왜도 = sum((Y-mean(Y))^3)/(length(Y)*(sqrt(sum((Y-mean(Y))^2)/length(Y))^3))
	kurtosis(Y) # 첨도 = sum((Y-mean(Y))^4)/(length(Y)*(sqrt(sum((Y-mean(Y))^2)/length(Y))^4))

	summary(Y) # 요약 통계량 = Min./1st Qu./Median/Mean/3rd Qu./Max.

	Number = length(Y)
	Mean = mean(Y)
	Median = median(Y)
	Mode = names(which.max(table(Y)))
	Var = var(Y)
	Sd = sd(Y)
	Max = max(Y)
	Min = min(Y)
	Skew = skewness(Y)
	Kurt = kurtosis(Y)

	## 자료 출력
	cat(Number, Mean, Median, Mode, Var, Sd, Max, Min, Skew, Kurt, "\n")

	## 자료 저장
	cat(Number, Mean, Median, Mode, Var, Sd, Max, Min, Skew, Kurt, "\n", file='1.dat')


	#=======================================
	# Chapter 2. 자료의 상관성
	#=======================================
	X = c(1, 2, 3, 4, 5)
	Y = c(3, 6, 4, 3, 2)

	Number = length(Y)

	## 선형 회귀모형
	lm.fit = lm(Y ~ X) # Y : 종속변수, X : 독립변수
	summary(lm.fit)

	## (Yfit) = a(X) + b -> coef.
	a = coef(lm.fit)[2]
	b = coef(lm.fit)[1]
	# a = ( sum(XY)-(sum(X)sum(Y)/length(X)) ) / ( sum(X^2)-((sum(X)^2)/length(X)) )
	# b = mean(Y) - (a*mean(X))

	Yfit = (a*X) + b

	Cor = cor.test(X, Y)$estimate # 상관계수 (correlation coefficient)
	Cor = cor(X, Y)
	## Cor = ( sum(XY)-(sum(X)sum(Y)/length(X)) ) / ( sqrt(sum(X^2)-((sum(X)^2)/length(X)))*sqrt(sum(Y^2)-((sum(Y)^2)/length(Y))) )

	Rsq = (cor.test(X, Y)$estimate)^2 # 결정계수 (R-squared)
	## Rsq = var(Yfit)/var(Y)

	Pvalue = cor.test(X, Y)$p.value # 유의성 검정(p-value)

	cat(a, b, Cor, Rsq, Pvalue, "\n", file='2.dat')

	## FIG
	dev.off() # Initialization
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(X, Y, pch=21, bg="white", col='black', main="Relationship between X and Y",
	xlim=c(0, 10), ylim=c(0, 10), xlab="X lab", ylab="Y lab", xaxs="i", yaxs="i", cex=1.5)
	lines(X, Yfit, col="red")
	abline(a=0, b=1, lty=2, col="black")
	text(1, 9, paste0("Y = ", round(a,2), " X + ", round(b,2)), adj=0, col="red")
	text(1, 8, paste0("R = ", round(cor(X, Yfit), 2), " (p-value < ", round(Pvalue, 3), ")"), adj=0, col="red")
	text(1, 7, paste0("N = ", round(Number,2)), adj=0)


	## 비선형 회귀모형
	X2 = X^2
	lm.fit = lm(Y ~ X + X2)
	summary(lm.fit)

	# (Yfit) = a(X2) + b(X) + c -> coeff.
	a = coef(lm.fit)[3]
	b = coef(lm.fit)[2]
	c = coef(lm.fit)[1]
	Yfit = (aX2) + (bX) + c
	Cor = cor(Yfit, Y)
	Pvalue = cor.test(Yfit, Y)$p.value # 유의성 검정(p-value)

	# FIG
	dev.off() # Initialization
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(X, Y, pch=21, bg="white", col='black', main="Relationship between X and Y",
	xlim=c(0, 10), ylim=c(0, 10), xlab="X lab", ylab="Y lab", xaxs="i", yaxs="i", cex=1.5)
	lines(X, Yfit, col="red")
	abline(a=0, b=1, lty=2, col="black")
	text(1, 9, paste0("Y = ", round(a,2), " X2 + ", round(b,2), " X +", round(c, 2)), adj=0, col="red")
	text(1, 8, paste0("R = ", round(Cor, 2), " (p-value < ", round(Pvalue, 3), ")"), adj=0, col="red")
	text(1, 7, paste0("N = ", round(Number,2)), adj=0)



	#=============================================================================================================================
	# Chapter 3. 통계적 유의수준 (Statisical Significant)
	#
	# Monte Carlo Simulation = Boostrap Method = Resampling Simulation
	# 1990년 이후
	#=============================================================================================================================

	######################################## 모집단 및 표본집단의 평균/분산/표준편차 비교 ##############################
	## - 중심극한정리 : 동일한 확률분포를 가진 독립확률 변수 n개의 평균값은 n이 클수록 정규분포에 가까워짐
	## - 표본집단을 통하여 모집단의 정보를 알고싶음
	## - 즉 표본평균집단의 평균/분산/표준편차를 이용하여 모집단의 평균/분산/표준편차를 유추할 수 있음
	## -- 모집단의 평균 = 표본평균집단의 평균
	## -- 모집단의 분산/자료수 = 표본평균집단의 분산
	## -- 모집단의 표준편차/루트(자료수) = 표본평균집단의 표준편차
	## -> 표본 개수가 많을수록 모집단을 더 잘 유추할 수 있음 -> 정규분포에 가까워짐
	####################################################################################################################

	set.seed(1) # 난수로 생성된 수열을 고정시킴
	X = runif(10000, min=0, max=1) # 난수 생성 (모집단 생성)

	hist(X) # 모집단의 빈도분포
	cat(mean(X), var(X), sd(X), "\n") # 모집단의 평균/분산/표준편차


	################################################### 표본평균집단에 대해서 ###########################################################
	for (i in c(10, 50, 100, 250)) {

	DO = 100000 # Number of repetition
	N = i # Number of sample
	# N = 30

	## 비복원 추출(무작위 정렬) : 한번 뽑은 것을 다시 뽑을 수 없는 추출
	Sort = lapply(1:DO, function(i) sample(X, N, replace=F))
	Sort_mean <- mapply(mean, Sort)

	## FIG
	# nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0)
	hist(Sort_mean, breaks=50, xlab="Sample Distribution Mean", xlim=c(0.3, 0.7), col = "light grey", border = "grey", xaxs="i", yaxs="i",
	main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO)))

	XX1 = mean(Sort_mean)+sd(Sort_mean)
	XX2 = mean(Sort_mean)-sd(Sort_mean)
	XX3 = mean(Sort_mean)+2*sd(Sort_mean)
	XX4 = mean(Sort_mean)-2*sd(Sort_mean)
	XX5 = mean(Sort_mean)+3*sd(Sort_mean)
	XX6 = mean(Sort_mean)-3*sd(Sort_mean)
	YY = max(hist(Sort_mean, breaks=50, plot=F)$counts)
	lines( c(mean(Sort_mean), mean(Sort_mean)), c(0, YY), lty=1, col=4) ; text(mean(Sort_mean), YY/2, "Mean")
	lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
	lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
	lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
	lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
	lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
	lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
	lines( c(max(Sort_mean), max(Sort_mean)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_mean), YY/2, "Max")
	lines( c(min(Sort_mean), min(Sort_mean)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_mean), YY/2, "Min")


	## 자료수, 표본평균집단의 평균, 표본평균집단의 분산, 표본평균집단의 표준편차
	## (표본평균집단의 평균-모집단의 평균), (표본평균집단의 분산 - 모집단의 분산/자료수), (표본평균집단의 표준편차 - 모집단의 표준편차/루트(자료수))
	# cat(N, mean(Sort_mean), var(Sort_mean), sd(Sort_mean),
	# mean(Sort_mean)-mean(X), var(Sort_mean)-(var(X)/N), sd(Sort_mean)-(sd(X)/sqrt(N)), "\n")

	# 모집단의 평균, 분산/자료수, 표준편차/루트(자료수)
	# cat(mean(X), var(X)/length(X), sd(X)/sqrt(length(X)), "\n")
	}


	################################################### 표본표준편차집단에 대해서 ###########################################################

	X = runif(10000, min=0, max=1) # 난수 생성 (모집단 생성)
	hist(X) # 모집단의 빈도분포
	cat(mean(X), var(X), sd(X), "\n") # 모집단의 평균/분산/표준편차

	for (i in c(5, 10, 25, 50, 100, 250)) {

	DO = 100000 # Number of repetition
	N = i # Number of sample
	# N = 30

	Sort = lapply(1:DO, function(i) sample(X, N, replace=F)) # 비복원 추출(무작위 정렬) : 한번 뽑은 것을 다시 뽑을 수 없는 추출
	Sort_mean <- mapply(mean, Sort)
	Sort_sd <- mapply(sd, Sort)

	## FIG
	hist(Sort_sd, breaks=50, xlab="Sample Distribution Stdev", xlim=c(0.2, 0.4), col = "light grey", border = "grey", xaxs="i", yaxs="i",
	main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO)))

	XX1 = mean(Sort_sd)+sd(Sort_sd)
	XX2 = mean(Sort_sd)-sd(Sort_sd)
	XX3 = mean(Sort_sd)+2*sd(Sort_sd)
	XX4 = mean(Sort_sd)-2*sd(Sort_sd)
	XX5 = mean(Sort_sd)+3*sd(Sort_sd)
	XX6 = mean(Sort_sd)-3*sd(Sort_sd)
	YY = max(hist(Sort_sd, breaks=50, plot=F)$counts)
	lines( c(mean(Sort_sd), mean(Sort_sd)), c(0, YY), lty=1, col=4) ; text(mean(Sort_mean), YY/2, "Mean")
	lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
	lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
	lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
	lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
	lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
	lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
	lines( c(max(Sort_sd), max(Sort_sd)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_mean), YY/2, "Max")
	lines( c(min(Sort_sd), min(Sort_sd)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_mean), YY/2, "Min")


	## 샘플개수, 표본평균의 표준편차, 표본표준편차의 평균/sqrt(N), (표본평균의 표준편차-표본표준편차의 평균/sqrt(N)
	cat(N, sd(Sort_mean), mean(Sort_sd)/sqrt(N), sd(Sort_mean)-(mean(Sort_sd)/sqrt(N)), "\n")

	## 중심극한정리는 샘플 개수에 작으면 약해짐. 즉 샘플 개수가 작을수록 표준편차가 증가함.
	}




	################################################### 상관계수에 대한 유의성 검정 ###########################################################
	data = read.csv("INPUT/KMA/1904-2016_Yearly_Mean_Groud_Obs.KMA", sep="", header=TRUE)
	head(data) # 파일의 첫 행
	tail(data) # 파일을 마지막 행

	## 시간: 1904-2016년 연평균 자료 (지상관측)

	## 변수: 지점, 년, 기온, 기압, 상대습도, 풍속
	## id year temp pre rh ws

	## 지점: 강릉, 광주, 대구, 대전, 부산, 서울, 전주, 제주, 청주, 춘천, 원주, 대관령
	## 105 156 143 133 159 108 146 184 131 101 114 100

	## 출처: 기상청 국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do)

	## 지점마다 관측 기간이 다름
	### 대관령: 1972-2016
	### 춘천 : 1966-2016
	### 강릉 : 1912-2016
	### 서울 : 1908-2016
	### 원주 : 1972-2016
	### 청주 : 1979-2016
	### 대구 : 1909-2016
	### 전주 : 1919-2016
	### 광주 : 1940-2016
	### 제주 : 1924-2016

	## 1979-2016년에 대해 분석을 수행함.


	#### 강릉 지역에 대한 상관계수 및 유의성 검정
	data_L1 = data %>%
	filter( id == 105 ) %>%
	filter( year >= 1979 )


	#### FIG
	X = data_L1$year
	Y = data_L1$temp

	dev.off() # Initialization
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(X, Y, col='black', main="Time series of yearly averaged temperature for the year of 1979-2016",
	xlab="Time [Year]", ylab="Temperature [℃]", xlim=c(1979, 2016), type="l", ylim=c(10, 15),
	xaxs="i", yaxs="i", cex=1.5)
	lm.fit =lm(Y ~ X)
	abline(lm.fit, col="red")
	text(1980, 14.5, paste0("(Temperature) = ", round(coef(lm.fit)[2],2), " × (Year) ", round(coef(lm.fit)[1],2)), adj=0, col="red", cex=1.2)
	text(1980, 14, paste0("R = ", round(cor(X, Y), 2)), adj=0, col="red", cex=1.2)
	text(1980, 13.5, paste0("N = ", round(length(X),2)), adj=0, cex=1.2)



	################################################### 강릉 지역에 대한 유의성 검정 ###########################################################
	DO = 100000 # Number of repetition
	N = length(X) # Number of sample

	Sort_X = lapply(1:DO, function(i) X[1:N] )
	Sort_Y = lapply(1:DO, function(i) sample(Y, N, replace=F)) # 비복원추출(무작위 정렬)
	Sort_R = mapply(cor, Sort_X, Sort_Y)

	#### FIG
	# nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0)
	hist(Sort_R, breaks=50, xlab="Sample Distribution Coefficient Correlation", xlim=c(-1, 1), col = "light grey", xaxs="i", yaxs="i",
	border = "grey", main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO)))

	XX1 = mean(Sort_R)+sd(Sort_R)
	XX2 = mean(Sort_R)-sd(Sort_R)
	XX3 = mean(Sort_R)+2*sd(Sort_R)
	XX4 = mean(Sort_R)-2*sd(Sort_R)
	XX5 = mean(Sort_R)+3*sd(Sort_R)
	XX6 = mean(Sort_R)-3*sd(Sort_R)
	YY = max(hist(Sort_R, breaks=50, plot=F)$counts)
	lines( c(mean(Sort_R), mean(Sort_R)), c(0, YY), lty=1, col=4) ; text(mean(Sort_R), YY/2, "Mean")
	lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
	lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
	lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
	lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
	lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
	lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
	lines( c(max(Sort_R), max(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_R), YY/2, "Max")
	lines( c(min(Sort_R), min(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_R), YY/2, "Min")
	lines( c(cor(X, Y), cor(X, Y)), c(0, YY), lty=1, col=5, lw=2 ) ; text(cor(X, Y), YY/2, "R")


	## 상관계수에 대한 유의성 검정 -> 유의확률 (significance probability) = P값 (p-value)*100 %
	## 신뢰도 (1 - P값)

	Pval = sum (abs(Sort_R) >= abs(cor(X, Y))) /length(Sort_R) # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)
	Pval

	## 상관계수는 0.47로서 통계적으로 0.3 % 수준으로 유의하다.
	## 99.7 % 수준으로 신뢰할 만하다.
	## 1000번 중에 약 3번은 우연의 일치이다.
	## 1000번 중에 약 997번은 우연의 일치가 아니다.

	cor.test(X, Y)
	# n <- length(X) # How is the p-value calculated for pearson correlation ?
	# r <- cor(X, Y)
	# df <- n - 2
	# t = sqrt(df) * r/sqrt(1 - r^2)
	# pval = 2 * min(pt(t, df), pt(t, df, lower.tail = FALSE))


	################################################### 10 지점에 대한 상관계수 및 유의성 검정 ###########################################################
	for ( i in c(105, 156, 143, 133, 159, 108, 146, 184, 131, 101, 114, 100) ) {

	data_L1 = data %>%
	filter( id == i ) %>%
	# filter( year >= 1979 )
	filter( year >= 2000 )

	X = data_L1$year
	Y = data_L1$temp


	DO = 100000 # Number of repetition
	N = length(X) # Number of sample

	Sort_X = lapply(1:DO, function(i) X[1:N] )
	Sort_Y = lapply(1:DO, function(i) sample(Y, N, replace=F)) # 비복원추출(무작위 정렬)
	Sort_R = mapply(cor, Sort_X, Sort_Y)

	Pval = sum (abs(Sort_R) >= abs(cor(X, Y))) /length(Sort_R) # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)

	Pval2 = cor.test(X, Y)

	## 지점, 상관계수, 몬테카를로 시뮬레이션에 의한 유의성, 근사식에 의한 유의성
	cat(i, cor(X, Y), Pval, Pval2$p.value, "\n")
	}



	################################################### 이동평균를 이용하여 상관계수 및 유의성 검정 ###########################################################

	## 이동평균 예시
	rollmean(c(1:6), 5)

	data_L1 = data %>%
	filter( id == 105 ) %>%
	filter( year >= 1979 )


	X = data_L1$year
	Y = data_L1$temp

	## 5개의 이동평균
	X2 = rollmean(data_L1$year, 5)
	Y2 = rollmean(data_L1$temp, 5)


	#### FIG
	dev.off() # Initialization
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(X2, Y2, col='black', main="Time series of yearly averaged temperature for the year of 1979-2016",
	xlab="Time [Year]", ylab="Temperature [℃]", xlim=c(1979, 2016), type="l", ylim=c(10, 15),
	xaxs="i", yaxs="i", cex=1.5)
	lm.fit =lm(Y2 ~ X2)
	abline(lm.fit, col="red")
	text(1980, 14.5, paste0("(Temperature) = ", round(coef(lm.fit)[2],2), " × (Year) ", round(coef(lm.fit)[1],2)), adj=0, col="red", cex=1.2)
	text(1980, 14, paste0("R = ", round(cor(X2, Y2), 2)), adj=0, col="red", cex=1.2)
	text(1980, 13.5, paste0("N = ", round(length(X2),2)), adj=0, cex=1.2)


	#### FIG
	DO = 10000 # Number of repetition
	N = length(X) # Number of sample
	# N2 = length(X2) # Number of sample

	Sort_X = lapply(1:DO, function(i) rollmean(X[1:N], 5) ) # Sort_X = lapply(1:DO, function(i) X2[1:N2])
	Sort_Y = lapply(1:DO, function(i) rollmean(sample(Y, N, replace=F), 5)) # 비복원추출(무작위 정렬)
	Sort_R = mapply(cor, Sort_X, Sort_Y)

	# nf <- layout(matrix(c(1,1),1,byrow=T), c(1,1), c(1,1)) ; layout.show(nf) ; par(mar=c(5,5,5,5)) ; par(cex=1.0)
	hist(Sort_R, breaks=50, xlab="Sample Distribution Coefficient Correlation", xlim=c(-1, 1), col = "light grey", xaxs="i", yaxs="i",
	border = "grey", main=paste0("Central Limit Theorem : number of sample = ", N, ", number of repetition = ", sprintf("%d", DO)))

	XX1 = mean(Sort_R)+sd(Sort_R)
	XX2 = mean(Sort_R)-sd(Sort_R)
	XX3 = mean(Sort_R)+2*sd(Sort_R)
	XX4 = mean(Sort_R)-2*sd(Sort_R)
	XX5 = mean(Sort_R)+3*sd(Sort_R)
	XX6 = mean(Sort_R)-3*sd(Sort_R)
	YY = max(hist(Sort_R, breaks=50, plot=F)$counts)
	lines( c(mean(Sort_R), mean(Sort_R)), c(0, YY), lty=1, col=4) ; text(mean(Sort_R), YY/2, "Mean")
	lines( c(XX1, XX1), c(0, YY), lty=1, col=2 ) ; text( XX1, YY/2, "+1σ")
	lines( c(XX2, XX2), c(0, YY), lty=1, col=2 ) ; text( XX2, YY/2, "-1σ")
	lines( c(XX3, XX3), c(0, YY), lty=1, col=2 ) ; text( XX3, YY/2, "+2σ")
	lines( c(XX4, XX4), c(0, YY), lty=1, col=2 ) ; text( XX4, YY/2, "-2σ")
	lines( c(XX5, XX5), c(0, YY), lty=1, col=2 ) ; text( XX5, YY/2, "+3σ")
	lines( c(XX6, XX6), c(0, YY), lty=1, col=2 ) ; text( XX6, YY/2, "-3σ")
	lines( c(max(Sort_R), max(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(max(Sort_R), YY/2, "Max")
	lines( c(min(Sort_R), min(Sort_R)), c(0, YY), lty=1, col=3 ) ; text(min(Sort_R), YY/2, "Min")
	lines( c(cor(X2, Y2), cor(X2, Y2)), c(0, YY), lty=1, col=5, lw=2 ) ; text(cor(X, Y), YY/2, "R")


	## 상관계수에 대한 유의성 검정 --> 유의확률 (significance probability) = P값 (p-value)*100 %
	## 신뢰도 (1 - P값)

	Pval = sum (abs(Sort_R) >= abs(cor(X2, Y2))) /length(Sort_R) # (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)
	Pval

	## 상관계수는 0.78로서 통계적으로 3.3 % 수준으로 유의하다.
	## 96.7 % 수준으로 신뢰할 만하다.
	## 1000번 중에 약 33번은 우연의 일치이다.
	## 1000번 중에 약 967번은 우연의 일치가 아니다.



	################################################### 10 지점에서 이동평균을 이용하여 상관계수 및 유의성 검정 ###########################################################

	for ( i in c(105, 156, 143, 133, 159, 108, 146, 184, 131, 101, 114, 100) ) {

	data_L1 = data %>%
	filter( id == i ) %>%
	filter( year >= 1979 )
	# filter( year >= 2000 )

	X = data_L1$year
	Y = data_L1$temp

	X2 = rollmean(data_L1$year, 5)
	Y2 = rollmean(data_L1$temp, 5)

	DO = 100000 # Number of repetition
	N = length(X) # Number of sample
	# N2 = length(X2) # Number of sample

	Sort_X = lapply(1:DO, function(i) rollmean(X[1:N], 5) ) # Sort_X = lapply(1:DO, function(i) X2[1:N2])
	Sort_Y = lapply(1:DO, function(i) rollmean(sample(Y, N, replace=F), 5)) # 비복원추출(무작위 정렬)
	Sort_R = mapply(cor, Sort_X, Sort_Y)

	Pval = sum (abs(Sort_R) >= abs(cor(X2, Y2))) /length(Sort_R)
	# (Source: http://strata.uga.edu/8370/lecturenotes/resampling.html)

	## 지점, 상관계수, 몬테카를로 시뮬레이션에 의한 유의성
	cat(i, cor(X, Y), Pval, "\n")
	}





	#======================================================================================
	# Chapter 4. 조화 분석 (Harmonic Analysis)
	#======================================================================================

	## Example Data Sets
	X = seq(1, 12)
	Y = c(-0.10, 0.40, 5.70, 13.30, 17.60, 20.60, 25.30, 25.10, 20.20, 15.80, 7.80, -0.10)

	nn = length(X)
	nh = trunc((nn-1)/2+1)

	AA = numeric(nn)
	BB = numeric(nn)
	AAcos = matrix(rep(NA, nn*nh), nrow=nn, ncol=nh)
	BBsin = matrix(rep(NA, nn*nh), nrow=nn, ncol=nh)
	YY = numeric(nn)
	YY2 = numeric(nn)
	SYk = numeric(nn)
	SSYk = numeric(nn)

	SYt = 0 ; SSYt = 0 ; SYh = 0 ; SSYh = 0
	SYa = 0 ; SSYa = 0 ; SYk = 0 ; PA = 0

	for (i in 1:nh) {
	for (j in 1:nn) {
	AA[i] = AA[i] + (Y[j]cos(2.0piji/nn))
	BB[i] = BB[i] + (Y[j]sin(2.0piji/nn))
	}
	AA[i] = AA[i]*2/nn
	BB[i] = BB[i]*2/nn

	if(i==nh){
	if(nn%%2==0){
	AA[i] = AA[i]/2
	BB[i] = 0.0

	}else{
	AA[i]=0
	BB[i]=0
	}
	}
	}


	for (i in 1:nn) {
	for (j in 1:nh) {
	AAcos[i,j] = AA[j]cos(2piji/nn)
	BBsin[i,j] = BB[j]sin(2piji/nn)
	YY[i] = YY[i] + (AAcos[i,j] + BBsin[i,j])
	if (AA[j] > 0) { TT = atan(BB[j]/AA[j]) }
	if (AA[j] < 0) { TT = atan(BB[j]/AA[j]) + pi }
	if (AA[j] == 0) { TT = pi/2 }
	YY2[i] = YY2[i] + sqrt(AA[j]2 + BB[j]2)cos(2piij/nn-TT)
	}

	Mean = mean(Y)
	SYt = SYt + Y[i]
	SSYt = SSYt + Y[i]**2
	SYh = SYh + (YY[i] + Mean)
	SSYh = SSYh + (YY[i] - Mean)
	SYa = SYa + (YY2[i] + Mean)
	SSYa = SSYa + (YY2[i] - Mean)

	for (j in 1:nh) {
	SYk[j] = SYk[j] + AAcos[i,j]+BBsin[i,j]
	SSYk[j] = SSYk[j] + (AAcos[i,j]+BBsin[i,j])**2
	}
	}

	for (j in 1:nh) {
	PA[j] = atan(BB[j]/AA[j])*180/pi
	if (AA[j] < 0) { PA[j] = PA[j] + 180 }
	if (AA[j] == 0) { PA[j] = 90 }
	PA[j] = PA[j] - 360.0/nn*j
	if (PA[j] < 0) { PA[j] = PA[j] + 360 }
	if (AA[j] == 0 & BB[j]==0) { PA[j] = -999.0 }
	}


	data = data.frame(Y, AAcos+BBsin, YY+Mean, YY2+Mean)
	colnames(data) = c("Yt", "Y01", "Y02", "Y03", "Y04", "Y05", "Y06", "Yh", "Ya")
	summary(data)

	round( data, 2) # Time, Yt, Y01, Y02, Y03, Y04, Y05, Y06 \| Yh Ya
	round( apply(data, 2, mean), 2) # Mean
	round( apply(data, 2, var), 2) # Var = (CK*CK)/2
	round( apply(data, 2, var)/var(data$Yt)*100, 2) # Contribution [%] = R-squared/100
	round( sqrt(AA[1:nh]^2+BB[1:nh]^2), 2) # Amplitude (CK)
	round( ((sqrt(AA[1:nh]^2+BB[1:nh]^2))^2nn)/((nn-1)var(data$Yt))/2, 2) # R-squared
	round( PA[1:nh], 2) # Phase Anglea
	round( nn/(1:nh), 2) # Time of length
	round( 1+nn/360*PA[1:nh], 2) # Time of 1-6st Harm.


	data_L1 = round( data, 2)
	Mean = round( apply(data, 2, mean), 2)
	Var = round( apply(data, 2, stats::var), 2)
	Contri = round( apply(data, 2, stats::var)/var(data$Yt)*100, 2)
	CK = round( sqrt(AA[1:nh]^2+BB[1:nh]^2), 2)
	Rsq = round( ((sqrt(AA[1:nh]^2+BB[1:nh]^2))^2nn)/((nn-1)var(data$Yt))/2, 2)
	Phase_angle = round( PA[1:nh], 2)
	Time_length = round( nn/(1:nh), 2)
	Time_1st = round( 1+nn/360*PA[1], 2)


	# FIG.
	cols = brewer.pal(6, 'Set2')
	legend_text = paste(paste(c("1st", "2st", "3st", "4st", "5st", "6st"), Contri[2:7], sep=" = "), "%")
	dev.off() # Initialization
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot (X, data_L1[,1], xaxs="i", yavs="i", xlim=c(1, 12), ylim=c(-10, 30),
	xlab="X lab", ylab="Y lab")
	lines(X, data_L1[,2]+Mean[1], type='l', col=cols[1], lwd=2)
	lines(X, data_L1[,3]+Mean[1], type='l', col=cols[2], lwd=2)
	lines(X, data_L1[,4]+Mean[1], type='l', col=cols[3], lwd=2)
	lines(X, data_L1[,5]+Mean[1], type='l', col=cols[4], lwd=2)
	lines(X, data_L1[,6]+Mean[1], type='l', col=cols[5], lwd=2)
	lines(X, data_L1[,7]+Mean[1], type='l', col=cols[6], lwd=2)
	legend( "topleft", legend=c("Yt"), pch=c(21), col=c('black'),
	cex=1.2, pt.cex=1.3, text.font=1, bty="n", text.col=c('black'), y.intersp = 0)
	legend( "topright", legend=legend_text, col=cols, text.col=cols,
	cex=1.2, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, y.intersp = 0.3)

	## Fig. 2
	plot(Time_length, Contri[2:7], xlim=c(1, 12.5), ylim=c(-5, 100.5), type="b", col="blue",
	xlab="Time length", ylab="Contribution [%]")




	#=======================================================================================================================
	# Chapter 5. 주성분 분석 (통계학, Principal Component Analysis)
	# = 경험적 직교함수 (기상학/해양학, Empirical Orthogonal Function)
	#
	# - 여러 개의 독립변수로 얻어진 다변량 데이터에 대해 공분산/상관계수 구조를 선형결합식(주성분)으로 설명하고자 함
	# - 차원 축소 및 주성분을 통한 데이터의 해석 용이
	#
	#=======================================================================================================================

	## Example Data Sets

	## 시간: 1987년 1월 일평균 자료 (지상관측)
	## 변수: 최저 기온
	## 지점: 이타가, 캐넌다이과

	Ithaca = c(19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0, 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23)
	Canandaigua = c(28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23, 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23)
	data = data.frame(Ithaca, Canandaigua)
	plot(Ithaca, Canandaigua, xlim=c(-10, 35), ylim=c(-10, 35))

	cov(data)

	summary(data)


	nrow = dim(data)[1]
	ncol = dim(data)[2]
	anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
	for (i in 1:ncol) {
	anomaly[,i] = data[ ,i] - colMeans(data)[i]
	}
	colnames(anomaly) = c("anomaly_Ithaca", "anomaly_Canandaigua")

	# Mean : 13.0 Mean : 20.23
	anomaly_Ithaca = data$Ithaca - mean(data$Ithaca)
	anomaly_Canandaigua = data$Canandaigua - mean(data$Canandaigua)

	################################## 상관계수 (Correlation Coefficient), 공분산 (Covariance) ###########################################################

	################################## 공분산 (Covariance) ###############################################################################################
	EOF = prcomp(data, scale=FALSE, center=TRUE) # 표준화 미고려, 아노말리 고려
	# EOF = prcomp(data, scale=FALSE, center=FALSE) # 표준화 미고려, 아노말리 미고려
	Eigen_value = EOF$sdev^2 ; Eigen_value # 고유근
	Eigen_vector = EOF$rotation ; Eigen_vector # 고유벡터
	EOF_data = data.frame(data, anomaly_Ithaca, anomaly_Canandaigua, EOF$x) ; EOF_data
	## 이타카, 캐넌다이과의 최저기온과 아노말리 및 주성분 (PC1, PC2)
	## PC1 = -0.8478591anomaly(Ithaca) -0.5302216anomaly(Canandaigua)
	## PC2 = -0.5302216anomaly(Ithaca) -0.8478591anomaly(Canandaigua)


	summary(EOF) # 각 주성분의 설명력 및 표준편차

	## FIG. 스크리 그림 (Scree plot): 주성분에 따른 분산/설명력
	PC_var = (Eigen_value/sum(Eigen_value))*100.0
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(PC_var, xlim=c(1, 10), ylim=c(0, 100), type='b',
	xlab='Principal Component', ylab='Proportion of Variance, Cumulative Proportion')
	points(cumsum(Eigen_value)/sum(Eigen_value)*100., col='blue', type='l')


	## FIG. 행렬도 : 변수와 주성분(PC)의 관계를 그래프로 표현.
	# 화살표는 변수와 PC의 상관계수, PC와 평행수록 해당 PC에 큰 영향.
	# 화살표 벡터의 길이가 변수의 분산을 표현.
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	biplot(EOF, choices=1:2, scale=0, xlab="PC1 (96.85% explained var)", ylab="PC2 (0.03% explained var)" ) ; abline(h=0, v=0, lty=2)
	cor(data)

	## FIG. 변수와 주성분(PC)의 시계열
	cols = brewer.pal(4, 'Set2')
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	legend_text = c("PC1 (0.96%)", "PC2 (0.04%)", "Anomaly_Ithaca", "Anomaly_Canandaigua")
	plot (EOF_data$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-40, 40), ylab="Temperature [deg. C]", xlab="Date [Day]", lwd=2)
	lines(EOF_data$PC2, type='l', col=cols[2], lwd=2)
	lines(EOF_data$anomaly_Ithaca, type='l', col=cols[3], lwd=2)
	lines(EOF_data$anomaly_Canandaigua, type='l', col=cols[4], lwd=2)
	legend( "topleft", legend=legend_text, col=cols, text.col=cols,
	cex=1.5, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, y.intersp = 0.3)


	################################## 상관계수 (Correlation Coefficient) ################################################################################
	EOF = prcomp(data, scale=TRUE, center=FALSE) # 표준화 미고려, 아노말리 고려
	Eigen_value = EOF$sdev^2 ; Eigen_value
	Eigen_vector = EOF$rotation ; Eigen_vector
	EOF_data = data.frame(data, EOF$x) ; EOF_data
	## 이타카, 캐넌다이과의 최저기온과 주성분

	summary(EOF)

	# dev.off()

	## FIG. 주성분에 따른 분산
	PC_var = (Eigen_value/sum(Eigen_value))*100.0
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot(PC_var, xlim=c(1, 10), ylim=c(0, 100), type='b',
	xlab='Principal Component', ylab='Proportion of Variance, Cumulative Proportion')
	points(cumsum(Eigen_value)/sum(Eigen_value)*100., col='blue', type='l')


	## FIG. 행렬도 : 변수와 주성분(PC)의 관계를 그래프로 표현.
	# 화살표는 변수와 PC의 상관계수, PC와 평행수록 해당 PC에 큰 영향.
	# 화살표 벡터의 길이가 변수의 분산을 표현.
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	biplot(EOF, choices=1:2, scale=0, xlab="PC1 (95.06% explained var)", ylab="PC2 (0.05% explained var)" ) ; abline(h=0, v=0, lty=2)

	## FIG. 변수와 주성분(PC)의 시계열.
	cols = brewer.pal(4, 'Set2')
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	legend_text = c("PC1 (0.95%)", "PC2 (0.05%)")
	plot(EOF_data$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-3, 3), ylab="PC1, PC2", lwd=2)
	lines(EOF_data$PC2, type='l', col=cols[2], lwd=2)
	legend( "topleft", legend=legend_text, col=cols, text.col=cols,
	cex=1.5, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, y.intersp = 0.3)




	####################### 기상청 최저/최고 기온 자료를 이용하여 영동과 영서에 대한 주성분 분석 ###########################################
	data = read.csv("INPUT/KMA/2012-2016_Daily_Mean_Maximum_Air_Temperature_Gangwon.KMA", sep="", header=TRUE)
	data = read.csv("INPUT/KMA/2012-2016_Daily_Mean_Minimum_Air_Temperature_Gangwon.KMA", sep="", header=TRUE)
	head(data) # 파일의 첫 행
	tail(data) # 파일의 끝 행

	#################################### 자료 설명 ##############################################################
	##
	## 시간: 2012-2016년 일평균 자료 (지상관측)
	##
	## 변수: 년, 월, 일, 8개 지점별 최저/최고 기온
	## year month day
	##
	## 지점: 춘천, 홍천, 원주, 영월, 동해, 강릉, 속초, 양양
	## ~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~
	## 영서 지역 영동 지역
	##
	## 출처: 기상청 국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do)
	## Received data from master's course Se-Hun Rim
	##
	############################################################################################################

	## 시계열 그리기 위해서 "년, 월, 일"에 대한 변수를 고려한 xran 변수 추가 그리고 4월달 자료만 추출.
	MONTH = c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
	data_L1 = data %>%
	mutate( xran = year + ((month-1)/12.) + ((day-1)/(12.*MONTH[month])) ) %>%
	filter( year == 2016 ) %>%
	filter( month == 4 )

	head(data_L1)

	data_L1_FIG = data_L1 %>%
	filter( year == 2016 )

	## FIG. 영동 지역에서 최고/최저 기온 자료 시계열
	cols = brewer.pal(4, 'Set2')
	# display.brewer.all(n=10, exact.n=FALSE)
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot (data_L1_FIG$day, data_L1_FIG$Donghae, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-10, 40), ylab="Max / Min Air Temperature [deg. C]", xlab="Time [Year-Month-Day]", lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Gangneung, type='l', col=cols[2], lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Sokcho, type='l', col=cols[3], lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Yangyang, col=cols[4], lwd=2)
	legend( "top", legend=c("Donghae", "Gangneung", "Sokcho", "Yangyang"),
	col=cols, text.col=cols, cex=1.3, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, horiz=TRUE, y.intersp = 0.25)

	## FIG. 영서 지역에서 최고/최저 기온 자료 시계열.
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot (data_L1_FIG$day, data_L1_FIG$Chuncheon, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-10, 40), ylab="Max / Min Air Temperature [deg. C]", xlab="Time [Year-Month-Day]", lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Hongcheon, type='l', col=cols[2], lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Wonju, type='l', col=cols[4], lwd=2)
	lines(data_L1_FIG$day, data_L1_FIG$Yeongwol, type='l', col=cols[4], lwd=2)
	legend( "top", legend=c("Chuncheon", "Hongcheon", "Wonju", "Yeongwol"),
	col=cols, text.col=cols, cex=1.3, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, horiz=TRUE, y.intersp = 0.25)


	## 시간(년, 월, 일) 및 NA 제거
	data_L2 = na.omit(data_L1)
	head(data_L2)

	## year, month, day, xran 컬럼 제거
	data_L3 = data_L2[ ,-c(1:3, 12)]
	head(data_L3)

	## 공분산 행렬
	COV = cov(data_L3) ; COV ## 공분산 행렬
	ggcorrplot(COV, outline.col = "white", lab = TRUE) +
	scale_fill_gradientn(colours = matlab.like(10)) +
	labs(fill = "Cov")


	## 상관계수 행렬
	COR = cor(data_L3) ; COV ## 상관계수 행렬 계산
	COR_pvalue = cor_pmat(data_L3) ## 상관계수 행렬에 대한 p-value 계산
	ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.05,
	colors = c("#6D9EC1", "white", "#E46726"))


	## Anomaly calculation
	nrow = dim(data_L3)[1]
	ncol = dim(data_L3)[2]
	anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
	for (i in 1:ncol) {
	anomaly[,i] = data_L3[ ,i] - colMeans(data_L3)[i]
	}
	colnames(anomaly) = c("anomaly_chuncheon", "anomaly_hongcheon", "anomaly_wonju" , "anomaly_yeongwol",
	"anomaly_donghae" , "anomaly_gangneung", "anomaly_sokcho", "anomaly_yangyang")




	## 주성분 분석
	EOF = prcomp(data_L3, scale=FALSE, center=TRUE) ; EOF
	## scale = FALSE , center = FALSE : 공분산 행렬 (아노말리 미고려)
	## scale = FALSE , center = TRUR : 공분산 행렬 (아노말리 고려)
	## scale = TRUE , center = FALSE : 상관계수 행렬 (아노말리 미고려)
	## scale = TRUE , center = TRUE : 상관계수 행렬 (아노말리 고려)
	names(EOF)
	Eigen_value = EOF$sdev^2 ; Eigen_value # 고유근
	Eigen_vector = EOF$rotation ; Eigen_vector # 고유벡터
	EOF_data = data.frame(EOF$x) ; EOF_data # Principal Components
	summary(EOF)


	## Eigen_value Test
	sum(diag(COV)) # 공분산의 대각선 합
	sum(Eigen_value) # 고유근의 합

	## PC1 (Eigen_vector) Test (PC scores using R-project vs PC scores using calculator)
	EOF_data[1,1]
	## PC1[1,1] = 0.3316495anomaly(Chuncheon) + ... + 0.3798507anomaly(Yangyang)
	# sum(-16.909095890.3448694, -17.0932602740.3505063, -16.246630140.3380886, -16.9912876710.3409752, -16.44942466*0.3352016,
	# -1.219956e+010.2659710, -13.336328770.2883571, -11.743561640.2815221, -15.4073424660.3211499, -13.13715068*0.2810695)

	## FIG. 주성분에 따른 설명력
	PC_var = (Eigen_value/sum(Eigen_value))*100.0
	fviz_screeplot(EOF, addlabels = TRUE, title="", x="Principal Components", y="Percentage of Variance [%]", ncp = 8) +
	geom_point(aes(y=cumsum(PC_var)), col='red') + geom_line(aes(y=cumsum(PC_var)), col='red') +
	theme_bw(base_size = 15)

	## FIG. 고유벡터를 이용하여 2D contour mapping

	## 위도 및 경도 설정
	## 춘천, 홍천, 원주, 영월, 동해, 강릉, 속초, 양양
	lat = c( 37.90256000, 37.68358000, 37.33756000, 37.18126000, 37.50708000, 37.75147000, 38.25085000, 38.08724000)
	lon = c(127.73570000, 127.88040000, 127.94660000, 128.45743000, 129.12433000, 128.89098000, 128.56472000, 128.62966000)
	obs = c( "Chuncheon", "Hongcheon", "Wonju", "Yeongwol", "Donghae", "Gangneung", "Sokcho", "Yangyang")

	data_L4 = data.frame( obs, lon, lat, unname(Eigen_vector) )
	colnames(data_L4) = c("obs", "lon", "lat", "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")

	data_L5 = data_L4
	head(data_L5)

	coordinates(data_L5) = ~lon + lat
	plot(data_L5)

	# summary(data_L5)

	################################# 통계적 보간법 ##############################################################################
	##
	## 10개 지점별 고유벡터를 시각화하기 위해 공간분포의 통계적 보간법 (선형 내삽, 2차원 내삽, 역거리 가중치법, 크리킹 등) 이용
	##
	## 역거리 가중치법 (Inverse Distance Weighting, IDW)
	## - 거리에 따른 주변값들의 가중값을 산정하여 가중평균을 통해 지점별 관측의 값을 보간하는 방법 (Kim et al., 2010)
	## - 즉 IDW 보간법은 대표적인 공간분포 보간법으로 관측점간의 거리에 반비례하여 가중치를 할당하고,
	## 거리가 멀어짐에 따라 지수적으로 감소한다.
	##
	## Kim D, Ryu D, Choi Y, Lee W. 2010. Application of kriging and inverse distance weighting method for
	## the estimation of geo-layer of Songdo area in Incheon. Journal of Korean Geotechnical Society 26(1):5-19.
	##
	##############################################################################################################################

	## 1 km 해상도로 주변 격자를 구성
	y.range = as.numeric(c(35, 40)) # min/max latitude of the interpolation area
	x.range = as.numeric(c(127, 130)) # min/max longitude of the interpolation area

	grd = expand.grid(x = seq(from = x.range[1], to = x.range[2], by = 0.01),
	y = seq(from = y.range[1], to = y.range[2], by = 0.01)) # expand points to grid
	coordinates(grd) <- ~ x + y
	gridded(grd) <- TRUE

	# plot(grd, cex = 1.0, col = "grey")
	# points(data_L5, pch = 1, col = "red", cex = 1)

	idw = idw(formula = PC2 ~ 1, locations = data_L5, newdata = grd) # apply idw model for the data
	idw.output = as.data.frame(idw) # output is defined as a data table
	names(idw.output)[1:3] = c("long", "lat", "var1.pred") # give names to the modelled variables

	world = map_data("worldHires")

	longitude = seq(100, 150, 0.5)
	latitude = seq(20, 60, 0.5)
	x_longitude = unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
	y_latitude = unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

	ggplot() +
	coord_fixed(ratio = 0.75) +
	theme_bw() +
	geom_tile(data = idw.output, aes(x = long, y = lat, fill = var1.pred)) +
	scale_fill_gradient2() +
	# scale_fill_gradientn(colours = matlab.like(10), limits=c(-1.0, 1.0)) +
	guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position="right", title.vjust = 0.5, title.hjust = 0.5) ) +
	geom_point(aes(x = lon, y = lat, colour=obs), data = data_L4, size=5, alpha=0.3, show.legend = FALSE) +
	geom_text(aes(x=lon, y=lat, label=obs, colour=obs), hjust=-0.25, vjust=0.5, nudge_x=0, nudge_y=0, size=5,
	colour='black', data=data_L4) +
	# geom_label(aes(x=lon, y=lat, label=obs, colour=obs), hjust=-0.25, vjust=0.5, nudge_x=0, nudge_y=0, size=5,
	# data=data_L4, show.legend = FALSE) +
	geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
	# scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(125, 130)) +
	# scale_y_continuous(breaks = latitude, labels = y_latitude, expand=c(0,0), limits=c(32.5, 40)) +
	scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(127, 130)) +
	scale_y_continuous(breaks = latitude, labels = y_latitude, expand=c(0,0), limits=c(36, 39)) +
	theme(plot.title=element_text(face="bold", size=20, color="black")) +
	theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
	theme(axis.text.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.text.y = element_text(face="bold", size=15, colour="black")) +
	theme(legend.text=element_text(size=14, color="black")) +
	theme(legend.title=element_text(size=14, angle=90)) +
	labs(x = "") +
	labs(y = "") +
	labs(fill = "Eigen Vector") +
	labs(colour = "") +
	labs(title = paste0("PC2 (", sprintf("%.2f", PC_var)[2], "% Explanined Variance) \n")) +
	theme(legend.background=element_blank())


	## FIG. 각 지점에서 각 주성분들의 기여도
	var = get_pca_var(EOF)
	contrib = var$contrib
	ggcorrplot(contrib, outline.col = "white", lab = TRUE) +
	scale_fill_gradientn(colours = matlab.like(10)) +
	labs(fill = "Contrib")

	rowSums(contrib) # 행에 대한 합계
	colSums(contrib) # 열에 대한 합계


	## FIG. 각 지점별 주성분들의 기여도
	fviz_contrib(EOF, choice = "var", axes = 2)

	## FIG. 각 사례별 주성분들의 기여도 == 고유벡터의 절대값이 큰 경우
	fviz_contrib(EOF, choice = "ind", axes = 2) + coord_flip()

	## PC 1-2에 대해 고유벡터를 시각화
	fviz_pca_var( EOF, axes=c(2,2), col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") )

	## PC 1-2에 대해 주성분들를 시각화
	fviz_pca_ind(EOF, axes=c(2,2), col.ind = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

	## PC 1-2에 대해 고유벡터 및 주성분 Score를 시각화
	fviz_pca_biplot(EOF, axes = c(2, 2), col.var="contrib", col.ind ="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
	theme_bw(base_size = 15)


	## 년, 월, 일, 관측소별 최고/최저 기온 및 아노말리, 주성분 score
	data_L6 = data.frame(data_L2$year, data_L2$month, data_L2$day, data_L3, anomaly, EOF_data)
	head(data_L6)

	summary(EOF)

	data_L6_FIG = data_L6 %>%
	dplyr::rename( year = data_L2.year,
	month = data_L2.month,
	day = data_L2.day ) %>%
	dplyr::filter( year == 2016 )

	## FIG. 변수와 주성분(PC 1-5)의 시계열.
	cols = brewer.pal(5, 'Set2')
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot (data_L6_FIG$day, data_L6_FIG$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], ylim=c(-80, 80), ylab="Principal Component Scores", xlab="Time [Year-Month-Day]", lwd=2)
	points(data_L6_FIG$day, data_L6_FIG$PC2, type='l', col=cols[2], lwd=2)
	points(data_L6_FIG$day, data_L6_FIG$PC3, type='l', col=cols[3], lwd=2)
	points(data_L6_FIG$day, data_L6_FIG$PC4, type='l', col=cols[4], lwd=2)
	points(data_L6_FIG$day, data_L6_FIG$PC5, type='l', col=cols[5], lwd=2)
	legend_text = c("PC1 (80.94%)", "PC2 (15.62%)", "PC3 (1.47%)", "PC4 (0.70%)", "PC5 (0.55%)")
	legend("top", legend=legend_text, col=cols, text.col=cols,
	cex=1.2, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, y.intersp = 0.25, horiz=TRUE)


	############################# 영동 및 영서 지점들에 대한 바람장미 ##############################################
	## - 영동과 영서 지역에 온도 차가 클 때 (PC2 절대값 최대) 바람장미의 분포를 보고자 함.
	## - 2016년 4월 9, 17일 (영동 - 영서) > 0 (푄 현상으로 인해 기온 상승) -> 서풍 계열 예상
	## - 2016년 4월 26일 < 0 (시베리안 고기압으로 인해 기온 하강) -> 북풍 또는 동풍 계열 예상
	################################################################################################################

	wind_data = read.csv("INPUT/KMA/201604_Hourly_Mean_Wind_Gangwon.KMA", sep="", header=TRUE)
	head(wind_data) # 파일의 첫 행

	#################################### 자료 설명 ##############################################################
	##
	## 시간: 2016년 4월 시간별 평균된 자료 (지상관측)
	##
	## 변수: 년, 월, 일, 7개 지점별 풍향 및 풍속 자료
	## year month day _wd _ws
	##
	## 지점: 춘천, 홍천, 원주, 영월, 동해, 강릉, 속초
	## ~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
	## 영서 지역 영동 지역
	##
	## 출처: 기상청 국가기후데이터센터 (http://sts.kma.go.kr/jsp/home/contents/main/main.do)
	##
	############################################################################################################

	wind_data_L1 = na.omit(wind_data) # NA 제거

	wind_data_L2 = wind_data_L1 %>%
	filter( year == 2016 ) %>%
	filter( month == 4 ) %>%
	filter( day == 26 )

	summary(wind_data_L2)

	## 영동 지역에 대한 바람장미
	FIG_Gangneung_wr = ggplot_windrose(spd = wind_data_L2$Gangneung_ws, dir = wind_data_L2$Gangneung_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Gangneung")
	FIG_Donghae_wr = ggplot_windrose(spd = wind_data_L2$Donghae_ws, dir = wind_data_L2$Donghae_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Donghae")
	FIG_Sokcho_wr = ggplot_windrose(spd = wind_data_L2$Sokcho_ws, dir = wind_data_L2$Sokcho_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Sokcho")
	grid.arrange(FIG_Gangneung_wr, FIG_Donghae_wr, FIG_Sokcho_wr, nrow=1, ncol=3) ## 다중 그림


	## 영서 지역에 대한 바람장미
	FIG_Chuncheon_wr = ggplot_windrose(spd = wind_data_L2$Chuncheon_ws, dir = wind_data_L2$Chuncheon_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Chuncheon")
	FIG_Wonju_wr = ggplot_windrose(spd = wind_data_L2$Wonju_ws, dir = wind_data_L2$Wonju_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Wonju")
	FIG_Yeongwol_wr = ggplot_windrose(spd = wind_data_L2$Yeongwol_ws, dir = wind_data_L2$Yeongwol_wd, spdres = 2, dirres=30, spdmin = 0, spdmax = 12) +
	labs(title = "Yeongwol")
	grid.arrange(FIG_Chuncheon_wr, FIG_Wonju_wr, FIG_Yeongwol_wr, nrow=1, ncol=3) ## 다중 그림


	#################################### ECMWF 자료를 이용하여 주성분 분석 및 그래픽 표출 ###########################################

	#################################### 자료 설명 ##############################################################
	##
	## 시간: 2016년 4월 1-30일 일평균 자료 (ECMWF ERA-Interim)
	##
	## 변수: 위도 (181), 경도 (360), 시간 (30), 지표면 기압 (360, 181, 30) ------ 1 deg
	## longitude latitude time Surface pressure
	##
	## 위도 (1441), 경도 (2880), 시간 (30), 지표면 기압 (2880, 1441, 30) ------ 0.125 deg
	## longitude latitude time Surface pressure
	##
	## 출처: ECMWF era-interim (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim)
	## Received data from master's course Wenting Zhang
	##
	#############################################################################################################

	ncfile = open.nc("INPUT/ECMWF_mslp_1deg_201604.nc")
	print.nc(ncfile)

	ncdata = read.nc(ncfile)

	file_lines = dim.inq.nc(ncfile, "time")$length
	time_units = att.get.nc(ncfile, "time", "units")
	time = utcal.nc(time_units, ncdata$time)

	# slp_scale = att.get.nc(ncfile, "sp", "scale_factor")
	# slp_offset = att.get.nc(ncfile, "sp", "add_offset")
	# slp_fillvalue = att.get.nc(ncfile, "sp", "_FillValue")

	lon = ncdata[["longitude"]]
	lat = ncdata[["latitude"]]
	# slp = ncdata[["sp"]]/100.0 # (ncdata[["sp"]]*slp_scale) + slp_offset
	# Pa -> hPa
	msl = ncdata[["msl"]]/100.0 # (ncdata[["sp"]]*slp_scale) + slp_offset
	# Pa -> hPa

	## Convert 2D matrix to 1D matrix
	msl_matrix = matrix(msl, nrow = dim(lon)*dim(lat), ncol = file_lines)
	head(msl_matrix)

	##================================== slp_matxix ======================================================
	## 1일 2일 ... 29일 30일 (2017년 4월) => slp_matrix[1:65160, 30]
	## 360181 360181
	## (65,160) (65,160)
	##====================================================================================================


	## FIG. ECMWF 자료를 이용한 지표면 기압 맵핑 (-90°N-90°N, 0°E-360°E)
	lonlat = as.matrix(expand.grid(lon,lat))
	colnames(lonlat) = c("long", "lat")

	longitude = seq(-180, 180, 20)
	latitude = seq(-90, 90, 10)
	x_longitude = unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
	y_latitude = unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

	output = data.frame(lonlat, slp_matrix)

	## Convert longitude (0-360) to (-180 to 180 or 180°W-180°E)
	output$long = ( (output$long+180) %% 360 ) - 180.0

	## Convert longitude (-180 to 180 or 180°W-180°E) to (0-360)
	# output$long = ( output$long %% 360 )

	world = map_data("worldHires")

	ggplot() +
	coord_fixed(ratio = 0.75) +
	theme_bw() +
	geom_tile(data = output, aes(x = long, y = lat, fill = X1)) +
	scale_fill_gradientn(colours = matlab.like(10)) +
	# scale_fill_gradientn(colours = matlab.like(10), limits=c(500, 1100), na.value="#AA0000", title.vjust = 0.5, title.hjust = 0.5) +
	guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position = "right", title.vjust = 0.5, title.hjust = 0.5) ) +
	geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
	# scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(-180, 180)) +
	# scale_y_continuous(breaks = latitude, labels = y_latitude, expand=c(0,0), limits=c(-90, 90)) +
	scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(80, 160)) +
	scale_y_continuous(breaks = latitude, labels = y_latitude, expand=c(0,0), limits=c(10, 70)) +
	theme(plot.title=element_text(face="bold", size=20, color="black")) +
	theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
	theme(axis.text.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.text.y = element_text(face="bold", size=15, colour="black")) +
	theme(legend.text=element_text(size=14, color="black")) +
	theme(legend.title=element_text(size=14, angle=90)) +
	labs(x = "") +
	labs(y = "") +
	labs(fill = "Mean Sea Level Pressure [hPa]") +
	theme(legend.background=element_blank())



	## 동아시아 영역에 대한 주성분 분석 (20°N-60°N, 115°E-150°E)
	output_L1 = output %>%
	dplyr::filter(115 <= long & long <= 150) %>%
	dplyr::filter(20 <= lat & lat <= 60)

	head(output_L1)

	## 위도, 경도 추출
	mls_matxix_L1 = output_L1[ ,-c(1,2)]

	head(mls_matxix_L1)


	## 전치행렬로 변환
	##================================== slp_matxix_L1 ===================================================
	## 1일 2일 ... 29일 30일 (2017년 4월) => slp_matrix_L1[1:4941, 30]
	## 8161 8161
	## (4941) .. (4941)
	##====================================================================================================

	## ↓ transpose of a matrix

	##================================== t(slp_matxix_L1)================================================
	## 1일 81*61 (4941) => t(slp_matrix_L1[1:4941, 30])
	## 2일 => [30, 1:4941]
	## . .
	## . .
	## . .
	## 29일
	## 30일 81*61 (4941)
	## (2017년 4월)
	##====================================================================================================

	EOF = prcomp(t(mls_matxix_L1), scale=F, center=T)
	Eigen_value = EOF$sdev^2 # 고유근
	Eigen_vector = EOF$rotation # 고유벡터
	EOF_data = data.frame(EOF$x)
	summary(EOF)

	## Anomaly calculation
	nrow = dim(mls_matxix_L1)[1]
	ncol = dim(mls_matxix_L1)[2]
	anomaly = matrix(rep(NA, nrow*ncol), nrow=nrow, ncol=ncol)
	for (i in 1:ncol) {
	anomaly[,i] = mls_matxix_L1[ ,i] - colMeans(mls_matxix_L1)[i]
	}



	## FIG. 주성분에 따른 설명력
	PC_var = (Eigen_value/sum(Eigen_value))*100.0
	fviz_screeplot(EOF, addlabels = TRUE, title="", x="Principal Components", y="Percentage of Variance [%]", ncp = 30) +
	geom_point(aes(y=cumsum(PC_var)), col='red') + geom_line(aes(y=cumsum(PC_var)), col='red') +
	theme_bw(base_size = 15)


	## FIG. 주성분의 고유벡터를 이용하여 맵핑
	long = output_L1$long
	lat = output_L1$lat
	output_L3 = data.frame(long, lat, Eigen_vector)
	head(output_L3)

	# world = map_data("worldHires")
	world = map_data("world")

	longitude <- seq(80, 160, 5)
	latitude <- seq(10, 70, 5)
	x_longitude <- unlist(lapply(longitude, function(x) ifelse(x < 0, paste0(abs(x), "°W"), ifelse(x > 0, paste0(x, "°E"),x))))
	y_latitude <- unlist(lapply(latitude, function(x) ifelse(x < 0, paste0(x, "°S"), ifelse(x > 0, paste0(x, "°N"),x))))

	# summary(output_L3)

	ggplot() +
	coord_fixed(ratio = 0.75) +
	theme_bw() +
	geom_tile(data = output_L3, aes(x = long, y = lat, fill = -PC1)) +
	# scale_fill_gradient2(limits=c(-0.1, 0.1)) +
	scale_fill_gradient2(low = 'blue', mid = 'white', high = 'red') +
	guides( fill=guide_colorbar(barwidth=1.5, barheight=15, title.position = "right", title.vjust = 0.5, title.hjust = 0.5) ) +
	geom_path(aes(x=long, y=lat, group=group), data = world, colour="black") +
	scale_x_continuous(breaks = longitude, labels = x_longitude, expand=c(0,0), limits=c(115, 150)) +
	scale_y_continuous(breaks = latitude, labels = y_latitude, expand=c(0,0), limits=c(20, 60)) +
	theme(plot.title=element_text(face="bold", size=20, color="black")) +
	theme(axis.title.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.title.y = element_text(face="bold", size=15, colour="black", angle=90)) +
	theme(axis.text.x = element_text(face="bold", size=15, colour="black")) +
	theme(axis.text.y = element_text(face="bold", size=15, colour="black")) +
	theme(legend.text=element_text(size=14, color="black")) +
	theme(legend.title=element_text(size=14, angle=90)) +
	labs(x = "") +
	labs(y = "") +
	labs(fill = "Eigen Vector \n") +
	labs(title = paste0("PC1 (", sprintf("%.2f", PC_var)[1], "% Explanined Variance) \n" )) +
	theme(legend.background=element_blank()) +
	ggsave(filename=paste0('FIG/PC2.png'), width=12, height=6, dpi=600)



	## FIG. 각 지점별 주성분들의 기여도
	# fviz_contrib(EOF, choice = "var", axes = 3)

	## FIG. 각 사례별 주성분들의 기여도 == 고유벡터의 절대값이 큰 경우
	fviz_contrib(EOF, choice = "ind", axes = 3) + coord_flip()

	## PC 1-2에 대해 고유벡터를 시각화
	# fviz_pca_var( EOF, axes=c(2,2), col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") )

	## PC 1-2에 대해 주성분들를 시각화
	fviz_pca_ind(EOF, axes=c(2,2), col.ind = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

	## PC 1-2에 대해 고유벡터 및 주성분 Score를 시각화
	# fviz_pca_biplot(EOF, axes = c(2, 2), col.var="contrib", col.ind ="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
	# theme_bw(base_size = 15)



	## FIG. 변수와 주성분(PC)의 시계열.
	output_L4 = data.frame(time, EOF_data)
	head(output_L4)

	summary(output_L4)

	summary(EOF)

	cols = brewer.pal(5, 'Set2')
	par(cex.main=1.2, cex.lab=1.2, cex.axis=1.2, cex=1.2, font=1)
	plot (output_L4$day, output_L4$PC1, type='l', xaxs="i", yaxs="i", col=cols[1], xlim=c(1, 30), ylim=c(-600, 600), ylab="Principal Component Scores", xlab="Time [Day]", lwd=2)
	points(output_L4$day, output_L4$PC2, type='l', col=cols[2], lwd=2)
	points(output_L4$day, output_L4$PC3, type='l', col=cols[3], lwd=2)
	points(output_L4$day, output_L4$PC4, type='l', col=cols[4], lwd=2)
	points(output_L4$day, output_L4$PC5, type='l', col=cols[5], lwd=2)
	## ECMWF_slp_1deg_201704
	legend_text = c("PC1 (34.04%)", "PC2 (16.21%)", "PC3 (11.79%)", "PC4 (8.61%)", "PC5 (6.26%)")
	## ECMWF_slp_0.125deg_201704
	# legend_text = c("PC1 (34.52%)", "PC2 (15.82%)", "PC3 (11.67%)", "PC4 (8.61%)", "PC5 (6.28%)")
	legend( "top", legend=legend_text, col=cols, text.col=cols,
	cex=1.0, pt.cex=1.3, text.font=1, lwd=2, bty="n", xpd=T, y.intersp = 0.25, horiz=TRUE)


	########################### 영동 및 영서 지점의 기온과 주성분(PC)의 상관 분석 ###################################################
	## - 2016년 4월 1-30 일 영동 지역 (기온 하강 ~ 북풍 또는 동풍 계열) <- PC3의 고유벡터가 동해상에 고기압 위치
	## <- 영동 지역의 기온과 PC3의 주성분과의 높은 상관성 예상
	##
	## 영서 지역 (기온 상수 ~ 서풍 계열) <- PC4의 고유벡터가 동해상에 저기압 위치
	## <- 영서 지역의 기온과 PC4의 주성분과의 높은 상관성 예상
	#################################################################################################################################
	Maxtemp = data_L1 %>%
	filter( year == 2016 )

	output_L5 = data.frame(Maxtemp, EOF_data)
	output_L6 = output_L5 %>%
	dplyr::select ( Chuncheon, Hongcheon, Wonju, Yeongwol, Donghae, Gangneung, Sokcho, Yangyang,
	PC1, PC2, PC3, PC4, PC5, PC6)

	## FIG. 상관계수 행렬
	COR = cor(output_L6) ## 상관계수 행렬 계산
	COR_pvalue = cor_pmat(output_L6) ## 상관계수 행렬에 대한 p-value 계산
	ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.05,
	colors = c("#6D9EC1", "white", "#E46726"))
	ggcorrplot(COR, outline.col = "white", lab = TRUE,
	colors = c("#6D9EC1", "white", "#E46726"))

	## FIG. 산점도 행렬
	ggpairs(output_L6)

	## 기여도가 높은 날짜에 대해서 선정
	output_L7 = output_L6 %>%
	slice( c(3, 4, 5, 6, 7, 11, 13, 17, 21, 22) )

	## FIG. 상관계수 행렬
	COR = cor(output_L7) ## 상관계수 행렬 계산
	COR_pvalue = cor_pmat(output_L7) ## 상관계수 행렬에 대한 p-value 계산
	ggcorrplot(COR, outline.col = "white", lab = TRUE, p.mat = COR_pvalue, sig.level = 0.1,
	colors = c("#6D9EC1", "white", "#E46726"))
	ggcorrplot(COR, outline.col = "white", lab = TRUE,
	colors = c("#6D9EC1", "white", "#E46726"))

	## FIG. 산점도 행렬
	ggpairs(output_L7)




	############################# 연구 영역에 따른 PC3 and PC4 설명력 ###############################
	summary(EOF)

	LAT_MIN = seq(0, 60, 5)
	LON_MIN = seq(100, 180, 5)

	LAT_MAX = seq(5, 60, 5)
	LON_MAX = seq(105, 180, 5)

	for (i in 1:length(LAT_MIN)) {
	for (j in 1:length(LON_MIN)) {
	for (k in 1:length(LAT_MAX)) {
	for (l in 1:length(LON_MAX)) {

	output_L1 = output %>%
	dplyr::filter(LAT_MIN[i] <= lat & lat <= LAT_MAX[k]) %>%
	dplyr::filter(LON_MIN[j] <= long & long <= LON_MAX[l])

	if (dim(output_L1)[1] > 0) {
	slp_matxix_L1 = output_L1[ ,-c(1,2)]
	EOF = prcomp(t(slp_matxix_L1), scale=F, center=T)
	Eigen_value = EOF$sdev^2 # 고유근
	PC_var = (Eigen_value/sum(Eigen_value))*100.0

	# cat(LAT_MIN[i], LAT_MAX[k], LON_MIN[j], LON_MAX[l], PC_var[3:4], sum( PC_var[3:4]), "\n",
	# file = "OUTPUT/PC3 or PC4 variance according to research area.OUT", append=T)
	}
	}
	}
	}
	}

	output_L8 = read.csv("OUTPUT/PC3 or PC4 variance according to research area.OUT", sep="", header=F)
	colnames(output_L8) = c("lat_min", "lat_max", "lon_min", "lon_max", "PC3", "PC4", "PC34_sum")
	head(output_L8) # 파일의 첫 행


	output_L8 %>%
	dplyr::arrange( desc(PC34_sum) ) # 내림차순 정렬



	#======================================================================================
	# Chapter 6. 인공신경망(Neural Networks), 딥러닝(Deeplearning)
	#======================================================================================

	data = read.csv("INPUT/IEODO_Observation_QC_2017-06-16,17.OUT", sep="", head=T)
	data = na.omit(data) # NA값 제거

	# 이어도 관측자료 (TrainDF: 2017-06-16, TestDF: 2017-06-17)
	# 년, 월, 일, 시, 분, 풍향, 풍속, 온도, 상대습도, 기압, 일사량, 일조시간, 태양 방위각, 태양 천정각
	head(data)

	#================================================
	# Neural Networks
	#================================================
	set.seed(123)

	OriDataFrame = data[ ,c(-1:-5)] # 년,월,일,시,분 제거
	str(OriDataFrame)

	# Normalizing the data in interval [0, 1]
	# Normalization is necessay so that each variable is scale properly
	# and none of the variables overdominates
	# scale function will give mean=0 and standard deviation=1 for each variables
	maxValue = apply(OriDataFrame, 2, max)
	minValue = apply(OriDataFrame, 2, min)
	rangeInsValue = (maxValue["ins"] - minValue["ins"])

	DataFrame = as.data.frame( scale(OriDataFrame, center=minValue, scale=maxValue-minValue) )

	summary(DataFrame)

	# Lets create the train and test data set
	ind = sample(1:nrow(DataFrame), length(DataFrame[,1])/2)
	# ind = 1:809
	trainDF = DataFrame[ind, ]
	testDF = DataFrame[-ind, ]

	# Lets take some configuration for neural network
	# say 13-4-2-1
	# So number of hidden layers=2
	# input layer had 10 units

	# we need this as formula
	# medv ~ crim + zn ...

	allVars = colnames(DataFrame)
	predictorVars = allVars[!allVars %in% "ins"]
	predictorVars = paste(predictorVars, collapse="+")
	form = as.formula(paste("ins~", predictorVars, collapse = "+"))
	# ins ~ ws + wd + temp + rh + pre + sunshine + azimuth + zenith

	# neuralModel = neuralnet(formula=form, data=trainDF)
	# neuralModel = neuralnet(formula=form, hidden=c(10,10), data=trainDF)
	# neuralModel = neuralnet(formula=form, hidden=c(20,20,20), data=trainDF)
	neuralModel <- neuralnet(formula = form, hidden = c(4, 2), linear.output = T, data = trainDF)

	# 3) NN모델 생성 - 은닉 노드 1개
	# 형식) neuralnet(formula, data, hidden)
	neuralModel = neuralnet(formula = form, data=trainDF, hidden = 1, threshold = 0.01,
	linear.output=FALSE, act.fct = 'logistic')
	# ?neuralnet
	# hidden = 1 : 은닉노드수 c(100,100)
	# threshold : 편미분에 대한 종료기준 임계값(기본값:0.01)
	# rep : 훈련 과정 반복 수(기본값 1)
	# linear.output=FALSE -> 출력 구간[0, 1]의 활성화 함수에 일치되도록 보장(기본값: TRUE)
	# act.fct : 미분 활성화 함수(기본값 "logistic")
	# err.fct : 미분 오차 함수(기본값 "sse")
	# learningrate : 학습속도(step), backprop 사용시 설정
	# algorithm : backprop -> 전통적인 역전파(backpropagation)


	# Plot the neural net
	# plot(neuralModel)

	# Predict for test data set
	predictions = compute(neuralModel, testDF[ ,-7])
	# str(predictions)

	# predictions = predictions$net.result*(max(testDF$ins)-min(testDF$ins)) + min(testDF$ins)
	# actualValues = (testDF$ins)*(max(testDF$ins)-min(testDF$ins)) + min(testDF$ins)

	# predictions = predictions$net.result
	# actualValues = testDF$ins

	predictions = predictions$net.result * rangeInsValue
	actualValues = testDF$ins * rangeInsValue

	RMSE = sqrt(sum( (predictions-actualValues)^2 )/nrow(testDF)) ; RMSE
	cor(predictions, actualValues)

	plot(actualValues, predictions, col='blue', main='Recl vs Predicted', pch=1, type="p", xlab="Actual", ylab="Predicted")
	cor(testDF$ins, predictions)
	abline(0, 1, col="black")


	#================================================
	# Deeplearning
	#================================================

	# H2o is a fast and scalable opensource machine learning platform.
	# Several algorithms are available.
	# set.seed(123)

	# initiallization
	h2o.init()

	# Defining x and y
	y = "ins"
	x = setdiff( colnames(DataFrame),y )

	# Lets create the train and test data set
	# ind = sample(1:nrow(DataFrame), 400)
	# trainDF = DataFrame[ind, ]
	# testDF = DataFrame[-ind, ]

	# Fitting the model
	# ?h2o.deeplearning

	model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF))
	# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), nfolds=5, hidden=c(10, 10))
	# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), hidden=c(5000,5000))
	# model = h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF), nfolds=5, hidden=c(1000,1000))
	model

	# model 생성 : network = 13-10-10-1
	model <- h2o.deeplearning(x=x, y=y, training_frame = as.h2o(trainDF),
	nfolds = 3, #stopping_rounds = 7,
	stopping_metric = 'RMSE',
	epochs = 400, overwrite_with_best_model = TRUE,
	activation = 'Tanh', input_dropout_ratio = 0.1,
	hidden = c(10,10), l1 =0,
	loss = 'Automatic', distribution = 'AUTO')
	model
	#training_frame : 학습데이터셋-h2o type
	#nfolds : 교차검정에서 N-fold
	#stopping_rounds : stops_metric의 수렴에 기초한 조기 정지(기본값 5)
	#stopping_metric : 조기 정지에 사용되는 미터법( "MSE", "RMSE", "MAE", "RMSLE" "AUC",)
	#MSE : 평균제곱오차(Mean Squared Error) -> ML에서 cost Function으로 사용
	#epochs(1에폭) : 학습과정에서 훈련데이터를 모두 소진한 횟수
	#overwrite_with_best_model : 훈련된 최적의 모형으로 대체(default : TRUE)
	#activation : tanh(sigmoid 함수를 재활용, sigmoid의 범위를 -1에서 1로 넓힘)
	#input_dropout_ratio : 입력 레이어 드롭 아웃 비율(일반화 향상, 0.1 또는 0.2, 기본값 0)
	# 드롭 아웃 : 매개변수가 많아서 신경망이 복잡한 경우 뉴런을 임의 삭제하여 학습하는 방법
	# 훈련 시 은닉층의 뉴런을 무작위 삭제하여 신호 전달 차단
	# hidden : Hidden layer sizes(2층)

	# Predict for test data set
	predictions = as.data.frame(predict(model, as.h2o(testDF)))
	# str(predictions)


	predictions = predictions$predict * rangeInsValue
	actualValues = testDF$ins * rangeInsValue

	RMSE = sqrt(sum( (predictions-actualValues)^2 )/nrow(testDF)) ; RMSE

	plot(actualValues, predictions, col='blue', main='Recl vs Predicted',
	pch=1, type="p", xlab="Actual", ylab="Predicted")
	cor(testDF$ins, predictions)
	abline(0, 1, col="black")

	# h2o.shutdown(prompt=F)

view raw Advanced Meteorological Statistics.R hosted with ❤ by GitHub

참고 문헌

[논문]

없음

[보고서]

없음

[URL]

없음

문의사항

[기상학/프로그래밍 언어]

sangho.lee.1990@gmail.com

[해양학/천문학/빅데이터]

saimang0804@gmail.com

본 블로그는 파트너스 활동을 통해 일정액의 수수료를 제공받을 수 있음

저작자표시 비영리 변경금지

'프로그래밍 언어 > R' 카테고리의 다른 글

[R] R을 이용한 통계 분석 및 데이터 시각화 : 벡터 (0)	2020.03.24
[R] R을 이용한 통계 분석 및 데이터 시각화 : 데이터형 (0)	2020.03.23
[R] ECMWF Interim 수치 예측 모델로부터 지오포텐셜 고도 (Geopotential Height)를 이용한 주성분 분석 및 가시화 (0)	2020.03.16
[R] "rsoi" 패키지를 통해 남방 진동 지수 (Southern Oscillation Index) 자료 다운로드 및 가시화 (0)	2020.03.08
[R] "rsoi" 패키지를 통해 엘니뇨-남방진동 (EI Nino-Southern Oscillation) 자료 다운로드 및 가시화 (0)	2020.03.08

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`