데이터 분석

[R을 이용한 타이타닉 생존자 예측]

r-code-for-data-analysis 2023. 7. 9. 18:00

고전 문제다.

일단 데이터 다운로드

R에서 머신러닝 하는 예전 방법이 아닌 최근 Workflow로 하는 방식을 연습해보았다.

분류 문제를 XGBoost 를 통해 예측하는 연습 문제다.

XGBoost (eXtreme Gradient Boosting)는 그라디언트 부스팅 트리 알고리즘에서 유명하고 효율적인 오픈 소스 구현입니다. 그라디언트 부스팅은 더욱 단순하고 약한 모델 세트의 추정치의 앙상블을 결합하여 대상 변수를 정확하게 예측하려 시도하는 지도 학습 알고리즘입니다

일단 학술적인 원리는 넘어가고, 생존자를 예측하는 문제를 풀어보자

#데이터 불러오기

rm(list=ls())

library(tidyverse)
library(gt)
library(patchwork)
library(skimr)
library(embed)
library(finetune)
library(magrittr)
library(vip)


train <- read.csv("D:/R/data/titanic/train.csv")
test <-  read.csv("D:/R/data/titanic/test.csv")
subbmission <-  read.csv("D:/R/data/titanic/submission.csv")

데이터 변수와 형식을 보자. Survived 가 우리가 예측해야 하는 종속변수이고, 나머지 컬럼이 독립변수이다.

str(train)

데이터 정제

필요없는 컬럼은 삭제한다. 이때 train을 삭제하면, test도 삭제해야 한다.
문자형은 범주형으로 바꿔준다.
target 인 Survived는 숫자로 표시되어 있지만 범주형으로 바꿔준다.

PassengerId, Name, Ticket,Cabin 열은 삭제한다.
%<>% 오퍼레이터를 사용하면 편하다.

train%<>%
  select(-c("PassengerId","Name", "Ticket", "Cabin")) %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(Survived = as.factor(Survived))



test %<>% 
  select(colnames(train)[2:8]) %>% 
    mutate_if(is.character, as.factor) 


head(train)

모델구축

훈련/검증 데이터 나누기

train 데이터를 df_train과 df_test로 나눈다. df_train으로 모델 훈련을 하고, df_test로 검증한다.
나중에 test 데이터를 모델에 넣어 나온 값을 subbmssion 으로 제출한다.

library(tidymodels)

set.seed(123)

train_split <- train %>%  
  filter(!is.na(Survived)) %>%  
  initial_split(strata = Survived)

df_train <- training(train_split)# 데이터를 교육 및 테스트 세트로 분할하고
df_test <- testing(train_split)

교차 검증용 데이터 세트 10개 만들기기

set.seed(234)
df_folds <- vfold_cv(df_train, strata = Survived)
df_folds

> df_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits           id    
   <list>           <chr> 
 1 <split [599/68]> Fold01
 2 <split [600/67]> Fold02
 3 <split [600/67]> Fold03
 4 <split [600/67]> Fold04
 5 <split [600/67]> Fold05
 6 <split [600/67]> Fold06
 7 <split [601/66]> Fold07
 8 <split [601/66]> Fold08
 9 <split [601/66]> Fold09
10 <split [601/66]> Fold10

레시피 만들기

문자로 된 것은 dummy화 해서 숫자로 변환한다.

df_rec <- 
  recipe(Survived ~., data = df_train) %>%  
  step_dummy(all_nominal_predictors())

df_rec

> df_rec

── Recipe ───────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 7

── Operations 
• Dummy variables from: all_nominal_predictors()

모델링(XGBoost)

하이퍼파라미터를 고정하지 않고, tuning할 수 있게 만든다.

xgb_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")



xgb_wf <- workflow(df_rec, xgb_spec)
doParallel::registerDoParallel()
set.seed(3)

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  size = 5
)

##하이퍼파라미터 최적화
accuracy와 roc_auc 값이 나온다. 1에 가까울수록 fitting이 잘 된것이다.

xgb_rs <- tune_race_anova(
  xgb_wf,
  resamples = df_folds,
  grid = 15,
  control = control_race(verbose_elim = TRUE)
)

xgb_rs
collect_metrics(xgb_rs)
> collect_metrics(xgb_rs) 
# A tibble: 4 × 9
   mtry trees min_n .metric  .estimator  mean     n std_err .config          
  <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>            
1     4  1972    13 accuracy binary     0.804    10 0.00873 Preprocessor1_Mo…
2     4  1972    13 roc_auc  binary     0.866    10 0.0147  Preprocessor1_Mo…
3     9   421     5 accuracy binary     0.828    10 0.00855 Preprocessor1_Mo…
4     9   421     5 roc_auc  binary     0.873    10 0.0120  Preprocessor1_Mo…

plot_race(xgb_rs)

모델 성능 평가 하기

xgb_rs %>% collect_metrics(summarize = TRUE)
show_best(xgb_rs, metric = "roc_auc", n = 10)


> xgb_rs %>% collect_metrics(summarize = TRUE)
# A tibble: 4 × 9
   mtry trees min_n .metric  .estimator  mean     n std_err .config          
  <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>            
1     4  1972    13 accuracy binary     0.804    10 0.00873 Preprocessor1_Mo…
2     4  1972    13 roc_auc  binary     0.866    10 0.0147  Preprocessor1_Mo…
3     9   421     5 accuracy binary     0.828    10 0.00855 Preprocessor1_Mo…
4     9   421     5 roc_auc  binary     0.873    10 0.0120  Preprocessor1_Mo…
> show_best(xgb_rs, metric = "roc_auc", n = 10)
# A tibble: 2 × 9
   mtry trees min_n .metric .estimator  mean     n std_err .config           
  <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>             
1     9   421     5 roc_auc binary     0.873    10  0.0120 Preprocessor1_Mod…
2     4  1972    13 roc_auc binary     0.866    10  0.0147 Preprocessor1_Mod…

모델 검증하기

df_fit <- xgb_wf  %>% 
  finalize_workflow(select_best(xgb_rs, "roc_auc")) %>% 
  last_fit(train_split)

cm1 <- collect_predictions(df_fit) %>%  
  conf_mat(Survived, .pred_class)


autoplot(cm1, type = "mosaic")
autoplot(cm1, type = "heatmap")

collect_predictions(df_fit) %>% 
roc_curve(Survived, .pred_0) %>%
     ggplot(aes(x = 1 - specificity, y = sensitivity)) +
     geom_path() +
     geom_abline(lty = 3) +
     coord_equal() +
     theme_bw()

library(caret)
confusionMatrix(collect_predictions(df_fit)$.pred_class, collect_predictions(df_fit)$Survived, mode = "everything", positive="1")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 124  25
         1  14  61
                                          
               Accuracy : 0.8259          
                 95% CI : (0.7698, 0.8732)
    No Information Rate : 0.6161          
    P-Value [Acc > NIR] : 7.338e-12       
                                          
                  Kappa : 0.6229          
                                          
 Mcnemar's Test P-Value : 0.1093          
                                          
            Sensitivity : 0.7093          
            Specificity : 0.8986          
         Pos Pred Value : 0.8133          
         Neg Pred Value : 0.8322          
              Precision : 0.8133          
                 Recall : 0.7093          
                     F1 : 0.7578          
             Prevalence : 0.3839          
         Detection Rate : 0.2723          
   Detection Prevalence : 0.3348          
      Balanced Accuracy : 0.8039          
                                          
       'Positive' Class : 1

모델에 사용한 변수 중요도 보기

extract_workflow(df_fit) %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

생존을 위한 가장 중요 변수는 남성이냐, 여성이냐이다. 당시 여성 먼저 구출이 되었기 때문으로 보인다.

그 다음은 1등석 손님 위주로

그 다음은 어린 나이일수록 구조가 되었다.

이걸 data 전체 모습으로 두고 보면 조금 보인다.

library(GGally)
train %>%
  ggpairs()

모델 전파

위에서 만든 모델을 하나의 모델링 함수로 만들기기

library(vetiver)
v <- extract_workflow(df_fit) %>%  
  vetiver_model("titanic-xgb")
v

test 데이터에 모델 적용

p1 <- predict(v, test)

p1

과제 제출

subbmission$Survived <- p1$.pred_class
subbmission
write.csv(subbmission, "subbmission.csv", row.names = F)

728x90