R을 이용한 시계열 분석 연습 (인구, 출생율)

데이터 분석

R을 이용한 시계열 분석 연습 (인구, 출생율)

r-code-for-data-analysis 2023. 11. 4. 12:04

통계청 사이트에서 1980년도부터 월별 출생아수, 혼인건수, 이혼건수, 사망건수를 시계열 분석해보자

패키지, 데이터 로드

library(tidyverse)
library(lubridate)
df <- read.csv("population.csv")
> head(df)
# A tibble: 6 × 5
  V1         출생아수 혼인건수 이혼건수 사망자수
  <chr>         <int>    <int>    <int> <chr>   
1 1981.01 월    88151    49285     1827 -       
2 1981.02 월    93556    34481     1687 -       
3 1981.03 월    70421    47843     2094 -       
4 1981.04 월    66093    35956     2189 -       
5 1981.05 월    68940    35769     2059 -       
6 1981.06 월    64634    31132     2098 -

날짜 데이터 처리

df2 <- df %>% mutate(year = substr(V1, 1,4), month = substr(V1, 6,7), 
              date =paste0(year,month,"01" )) %>%
  mutate(date = ceiling_date(as.Date(date, "%Y%m%d"), unit="month")-1) %>% 
  select(date,year,month,출생아수, 혼인건수, 이혼건수, 사망자수)
  
  > head(df2)
# A tibble: 6 × 7
  date       year  month 출생아수 혼인건수 이혼건수 사망자수
  <date>     <chr> <chr>    <int>    <int>    <int> <chr>   
1 1981-01-31 1981  01       88151    49285     1827 -       
2 1981-02-28 1981  02       93556    34481     1687 -       
3 1981-03-31 1981  03       70421    47843     2094 -       
4 1981-04-30 1981  04       66093    35956     2189 -       
5 1981-05-31 1981  05       68940    35769     2059 -       
6 1981-06-30 1981  06       64634    31132     2098 -

결측데이터 처리 (사망자수가 1983년부터 데이터가 나옴)

df2 %>% 
  mutate(사망자수 = as.numeric(gsub("-", "NA", 사망자수))) %>% 
  pivot_longer(cols = 출생아수:사망자수, names_to = "종류", values_to = "건수") %>% 
  na.omit()
  
  # A tibble: 2,024 × 5
   date       year  month 종류      건수
   <date>     <chr> <chr> <chr>    <dbl>
 1 1981-01-31 1981  01    출생아수 88151
 2 1981-01-31 1981  01    혼인건수 49285
 3 1981-01-31 1981  01    이혼건수  1827
 4 1981-02-28 1981  02    출생아수 93556
 5 1981-02-28 1981  02    혼인건수 34481
 6 1981-02-28 1981  02    이혼건수  1687
 7 1981-03-31 1981  03    출생아수 70421
 8 1981-03-31 1981  03    혼인건수 47843
 9 1981-03-31 1981  03    이혼건수  2094
10 1981-04-30 1981  04    출생아수 66093

시간별 출생아수, 혼인건수, 이혼건수, 사망자수 그래프 그리기

df2 %>% 
  mutate(사망자수 = as.numeric(gsub("-", "NA", 사망자수))) %>% 
  pivot_longer(cols = 출생아수:사망자수, names_to = "종류", values_to = "건수") %>% 
  na.omit() %>% 
  ggplot(aes(x=date, y=건수, col=종류))+
  geom_point()+
  geom_line()+
  geom_smooth(aes(col="추세선"))+
  facet_wrap(~종류, scales = "free")+
  theme_minimal()+
  theme(legend.position = "none")

각각 데이터의 특이점을 살펴보자. 시계열 데이터이므로 시계열 데이터로 변환 후 분석해보자
시계열 데이터로 변환하기

birth <- ts(df1$출생아수, frequency = 12, start = c(1981, 1))
marriage <- ts(df1$혼인건수, frequency = 12, start = c(1981, 1))
divorce <- ts(df1$이혼건수, frequency = 12, start = c(1981, 1))
death <- ts(death2$사망자수, frequency = 12, start = c(1983, 1))

추세선, 계절 특성, random 특성을 보자

# 데이터 분해 - trend, seasonal, random 데이터 추세 확인
birth_comp <- decompose(birth)
plot(birth_comp)

출생아수 시계열 분석

1. 1990년까지 출생아수 신고가 1,2월에 집중되어 random 그래프에 특이점이 보임

2. 계절적 요인이 있음.

3. 2000년이후로 한번 크게 감소하고, 2016년 이후 다시 크게 감소함

혼인건수 시계열 분석

1. 1997년이후 크게 감소하고 2013년 이후 다시 감소함

2. 혼인건수 감소가 출생아수 감소에 영향을 미침

시계열 분석 그래프를 좀 더 쉽게 그릴 수 있는 패키지 설치 후 이혼건수와 사망자수를 분석해보자

library(aTSA)

decompose(divorce, type="multiplicative") %>% autoplot()+
  ggtitle("이혼건수 시계열 분석") 
decompose(death, type="multiplicative") %>% autoplot()+
  ggtitle("사망자수 시계열 분석")

1. 이혼건수가 계속 증가하다가 2002~2003년경 이혼건수가 높아지다가 약간 감소하면서 saturation 됨

2. 2009년 경 이혼 건수가 약간 감소한 특징이 있음

1. 사망자는 2010년 이후로 급격히 증가하다가 2022년 가장 높음 (코로나 영향성 추정)

2. 고령화 이후 사망자수 증가 기울기가 점점 증가함

혼인건수와 출생아수는 상관성이 있을까?
혼인건수가 적으면 출생아수도 적지만
많아진다고 계속 증가하지 않는다

df2 %>% 
  ggplot(aes(x=혼인건수, y=출생아수))+
  geom_point()+
  geom_smooth()+
  theme_minimal()

출생아수 시계열 분석으로 예측하기

# 시계열 데이터에서 계절성 요인 제거
birth_adjusted <- birth - birth_comp$seasonal
plot.ts(birth_adjusted, main = "birth - seasonal factor")


# 차분을 통해 정상성 확인
birth_diff1 <- diff(birth_adjusted, differences = 1)
plot.ts(birth_diff1, main = "1차 차분")       # --> 분산의 변동성이 크다

acf(birth_diff1, lag.max = 20) #2부터 안으로 들어옴 MA(1)
pacf(birth_diff1, lag.max = 20) # # lag 4에서 절단값 --> AR(3)



# PACF 절단값이 명확하지 않아 ARIMA 모형 확정이 어렵다.

# ---> Auto.Arima 함수 사용
auto.arima(birth)             # ARIMA(2,1,2)(1,1,1)[12]


#case1 211
birth_arima <- arima(birth, order = c(4,0,1), seasonal = list(order = c(2,1,2), period = 12))
birth_arima

birth_fcast <- forecast(birth_arima)
birth_fcast
plot(birth_fcast, main = "Forecasts 2024 & 2025")

2024~2025년의 출생아수 예측. 월 2만명 출생아수 아래로 내려가는 경우가 많아질 것이다.

> birth_fcast
         Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
Sep 2023       20137.00 17767.43 22506.58 16513.057 23760.95
Oct 2023       18356.65 15696.30 21017.00 14287.997 22425.31
Nov 2023       16884.47 13935.37 19833.58 12374.215 21394.73
Dec 2023       15408.49 12219.40 18597.59 10531.197 20285.79
Jan 2024       21733.73 18359.43 25108.02 16573.188 26894.27
Feb 2024       18473.21 14942.16 22004.26 13072.941 23873.49
Mar 2024       20274.82 16611.20 23938.44 14671.797 25877.84
Apr 2024       17940.09 14162.86 21717.33 12163.312 23716.87
May 2024       18123.56 14247.70 21999.41 12195.950 24051.17
Jun 2024       17749.04 13786.91 21711.18 11689.487 23808.60
Jul 2024       18456.83 14418.66 22494.99 12280.990 24632.66
Aug 2024       18648.59 14543.03 22754.15 12369.679 24927.51
Sep 2024       19271.76 14495.35 24048.16 11966.870 26576.64
Oct 2024       17784.99 12761.13 22808.84 10101.666 25468.31
Nov 2024       16545.02 11280.90 21809.14  8494.241 24595.80
Dec 2024       14577.57  9104.03 20051.11  6206.512 22948.63
Jan 2025       21452.46 15805.09 27099.83 12815.548 30089.36
Feb 2025       18065.78 12266.49 23865.07  9196.530 26935.03
Mar 2025       19858.16 13926.27 25790.06 10786.108 28930.22
Apr 2025       17542.58 11494.07 23591.08  8292.190 26792.96
May 2025       17539.30 11387.51 23691.10  8130.954 26947.66
Jun 2025       16971.76 10728.06 23215.47  7422.840 26520.69
Jul 2025       17924.92 11599.06 24250.78  8250.356 27599.49
Aug 2025       18429.08 12029.52 24828.63  8641.799 28216.35

728x90