DACON 영화관객수 예측(회귀)

2023년 10월 21일 11시 27분 17초에 업로드 된 글입니다.

작성자: r-code-for-data-analysis

영화관객수 예측

라이브러리 불러오기

오늘은 데이콘의 영화 관객수 예측 데이터를 이용하여 데이터 분석을 해 보겠습니다. 데이터를 다운받아서 R 코딩을 통해 시각화, 머신러닝(회귀) 연습입니다.
데이터 출처

데이터 불러오기

rm(list=ls())

train <- read.csv("D:/r/data/데이콘/영화 관객수/movies_train.csv")
test <-  read.csv("D:/r/data/데이콘/영화 관객수/movies_test.csv")
submission <-  read.csv("D:/r/data/데이콘/영화 관객수/submission.csv")

EDA(Exploratory Data Analysis, 탐색적 데이터 분석)

컬럼 설명

title : 영화제목
distributor : 배급사
genre : 장르
release_time : 개봉일
time : 상영시간(분)
screening_rat : 상영등급
director : 감독이름
dir_prev_bfnum : 해당 감독이 이 영화를 만들기 전 제작에 참여한 영화에서의 평균 관객수(단 관객수가 알려지지 않은 영화 제외)
dir_prev_num : 해당 감독이 이 영화를 만들기 전 제작에 참여한 영화의 개수(단 관객수가 알려지지 않은 영화 제외)
num_staff : 스탭수
num_actor : 주연배우수
box_off_num : 관객수

컬럼별 속성 확인 및 결측치가 있는 컬럼 확인

head(train)

##                      title            distributor  genre release_time time
## 1              개들의 전쟁       롯데엔터테인먼트   액션   2012-11-22   96
## 2                 내부자들             (주)쇼박스 느와르   2015-11-19  130
## 3        은밀하게 위대하게             (주)쇼박스   액션   2013-06-05  123
## 4          나는 공무원이다                (주)NEW 코미디   2012-07-12  101
## 5                 불량남녀 쇼박스(주)미디어플렉스 코미디   2010-11-04  108
## 6 강철대오 : 구국의 철가방       롯데엔터테인먼트 코미디   2012-10-25  113
##     screening_rat director dir_prev_bfnum dir_prev_num num_staff num_actor
## 1 청소년 관람불가   조병옥             NA            0        91         2
## 2 청소년 관람불가   우민호      1161602.5            2       387         3
## 3     15세 관람가   장철수       220775.2            4       343         4
## 4     전체 관람가   구자홍        23894.0            2        20         6
## 5     15세 관람가   신근호            1.0            1       251         2
## 6     15세 관람가   육상효       837969.0            2       262         4
##   box_off_num
## 1       23398
## 2     7072501
## 3     6959083
## 4      217866
## 5      483387
## 6      233211

str(train)

## 'data.frame':    600 obs. of  12 variables:
##  $ title         : chr  "개들의 전쟁" "내부자들" "은밀하게 위대하게" "나는 공무원이다" ...
##  $ distributor   : chr  "롯데엔터테인먼트" "(주)쇼박스" "(주)쇼박스" "(주)NEW" ...
##  $ genre         : chr  "액션" "느와르" "액션" "코미디" ...
##  $ release_time  : chr  "2012-11-22" "2015-11-19" "2013-06-05" "2012-07-12" ...
##  $ time          : int  96 130 123 101 108 113 104 96 129 94 ...
##  $ screening_rat : chr  "청소년 관람불가" "청소년 관람불가" "15세 관람가" "전체 관람가" ...
##  $ director      : chr  "조병옥" "우민호" "장철수" "구자홍" ...
##  $ dir_prev_bfnum: num  NA 1161603 220775 23894 1 ...
##  $ dir_prev_num  : int  0 2 4 2 1 2 0 3 0 0 ...
##  $ num_staff     : int  91 387 343 20 251 262 32 342 3 138 ...
##  $ num_actor     : int  2 3 4 6 2 4 5 2 5 3 ...
##  $ box_off_num   : int  23398 7072501 6959083 217866 483387 233211 53526 1110523 4778 868 ...

skim(train)

Data summary
Name	train
Number of rows	600
Number of columns	12
_______________________
Column type frequency:
character	6
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	1	30	600
distributor	1	2	15	169
genre	1	2	6	12
release_time	1	10	10	330
screening_rat	1	6	8	4
director	1	2	10	472

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
time	0	1.00	100.86	18.10	45	89.00	100.0	114.0	180	▁▇▇▁▁
dir_prev_bfnum	330	0.45	1050442.89	1791408.30	1	20380.00	478423.6	1286568.6	17615314	▇▁▁▁▁
dir_prev_num	0	1.00	0.88	1.18	0	0.00	0.0	2.0	5	▇▂▁▁▁
num_staff	0	1.00	151.12	165.65	0	17.00	82.5	264.0	869	▇▂▂▁▁
num_actor	0	1.00	3.71	2.45	0	2.00	3.0	4.0	25	▇▁▁▁▁
box_off_num	0	1.00	708181.75	1828005.85	1	1297.25	12591.0	479886.8	14262766	▇▁▁▁▁

colSums(is.na(train)) #결측치 데이터 확인

##          title    distributor          genre   release_time           time 
##              0              0              0              0              0 
##  screening_rat       director dir_prev_bfnum   dir_prev_num      num_staff 
##              0              0            330              0              0 
##      num_actor    box_off_num 
##              0              0

컬럼별 데이터 상관관계 비교

영화의 관객수를 예측하는데 가장 중요한 변수가 무엇일까? 그 감독이 만든 이전 영화의 관객수가가 가장 중요하지 않을까? 그럼 감독별로 만든 영화의 이전 영화의 관객수와 이번에 만든 영화의 관객수가 상관관계가 있는지 먼저 보자 그런데 이전 영화 관객수에 결측치가 있다. 결측치 제외하고 살펴보자. (로그 스케일로)

train %>% na.omit() %>% 
  ggplot(aes(x=dir_prev_bfnum , y=box_off_num, col=director))+
  geom_point()+
  labs(x= "감독이 만든 이전 영화 관객수", y="감독이 만든 이번 영화 관객수")+
  theme_minimal()+
  scale_x_log10(labels = scales::comma)+
  scale_y_log10(labels = scales::comma)+
  theme(legend.position = "none")+
    theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )

결과를 보니 상관성이 있어 보이기도 하고 아닌것 같기도 하다. 수치와 다른 시각화 방법으로 확인해보자

상관도를 보는 라이브러리 corrplot 사용하여 시각화

train %>% na.omit() %>% 
  select(dir_prev_bfnum,box_off_num) %>% 
  cor()

##                dir_prev_bfnum box_off_num
## dir_prev_bfnum      1.0000000   0.2831842
## box_off_num         0.2831842   1.0000000

28% 상관성을 가지는 것으로 나온다. 그럼 이번 영화의 관객수와 상관성이 가장 높은 변수는 무엇일까? 전부 해보자 이렇게 수치화 하려면 각 변수도 수치로 되어 있는 것만 된다. 문자형 데이터는 제외하고 보자.

library(corrplot)

## corrplot 0.92 loaded

M <- train %>% na.omit() %>% 
   select_if(is.numeric)  %>% 
  cor()


corrplot(M,method="number")

corrplot(M,type="upper")

영화 관객의 수는 상영시간과 staff의 수와 상관성이 가장 높게 나왔다. 두 변수 모두 영화 제작비와 상관성이 높은 것이리라. 그런데 명성이 높은 감독일 수록 제작비 펀드도 많이 받아서 영화를 만들고, 그러면 더 관객수도 많은것은 아닐까?

영화 감독별 제작 영화수와 평균 관객수

영화 평균 관객수가 많은 10위 영화 감독을 알아보자

train %>% 
  group_by(director) %>% 
  summarise(n= n(), avg.boxoff_n=mean(box_off_num)) %>% 
  arrange(-avg.boxoff_n) %>% 
  top_n(10) %>% 
  mutate(director = fct_reorder(director, avg.boxoff_n)) -> top_director

## Selecting by avg.boxoff_n

top_director%>%
  ggplot(aes(x=avg.boxoff_n, y=director, fill= director ))+geom_col()+
  theme_minimal()+
  labs(x= "영화 관객수", y="감독 이름")+
  theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )+
   scale_x_continuous(labels = scales::comma)+
  theme(legend.position = "none")

top5 관객수가 많은 감독이 만든 영화는 무엇이었는가?

library(knitr)
train %>% 
  filter(director %in% top_director$director[1:5]) %>% 
  group_by(director) %>% 
  summarise(title=title, 관객수 = box_off_num) %>% 
  arrange(-관객수) %>% kable()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'director'. You can override using the
## `.groups` argument.

director	title	관객수
윤제균	국제시장	14262766
최동훈	도둑들	12983841
최동훈	암살	12706663
양우석	변호인	11374879
봉준호	설국열차	9350351
한재림	관상	9135806

다 유명한 감독이지만 각 감독이 만든 영화 데이터 수가 적어 영화 감독별로 관객수를 예측해 보는 것은 의미가 없어 보인다.

그럼 본격적으로 머신러닝을 이용하여 회귀분석을 해 보자

머신러닝 (회귀분석)

훈련/검증 데이터 나누기

필요없는 컬럼을 제외하자.(title , distributor ,release_time, director,dir_prev_bfnum ) 그리고 문자형 컬럼을 범주형 컬럼으로 바꾸자

## 'data.frame':    420 obs. of  7 variables:
##  $ genre        : Factor w/ 12 levels "SF","공포","느와르",..: 3 12 12 4 11 7 5 9 12 5 ...
##  $ time         : int  130 101 108 104 96 129 113 107 124 98 ...
##  $ screening_rat: Factor w/ 4 levels "12세 관람가",..: 4 3 2 3 4 3 1 4 2 3 ...
##  $ dir_prev_num : int  2 2 1 0 3 0 1 2 1 1 ...
##  $ num_staff    : int  387 20 251 32 342 3 14 15 572 8 ...
##  $ num_actor    : int  3 6 2 5 2 5 2 4 11 4 ...
##  $ box_off_num  : int  7072501 217866 483387 53526 1110523 4778 5851 23052 7453715 5693 ...

## 'data.frame':    180 obs. of  7 variables:
##  $ genre        : Factor w/ 12 levels "SF","공포","느와르",..: 11 11 12 5 6 2 5 5 5 5 ...
##  $ time         : int  96 123 113 94 89 98 100 67 104 90 ...
##  $ screening_rat: Factor w/ 4 levels "12세 관람가",..: 4 2 2 2 4 1 1 4 1 4 ...
##  $ dir_prev_num : int  0 4 2 0 0 0 0 0 0 0 ...
##  $ num_staff    : int  91 343 262 138 0 0 152 52 359 29 ...
##  $ num_actor    : int  2 4 4 3 4 1 1 3 4 3 ...
##  $ box_off_num  : int  23398 6959083 233211 868 745 1079 92404 39317 814570 3460 ...

5개의 머신러닝 모델링 (modeling)

#로지스틱 회귀 모델
m1<-train(box_off_num~., data=df_train, method="glm") 

#랜덤포레스트 모델
m2<-randomForest(box_off_num~., data=df_train, ntree=100, proximity=T) 

#의사결정나무 모델 
tree <- rpart(box_off_num ~ ., data = df_train)

# Fast 램덤포레스트 모델
set.seed(101)
rfo <- ranger(box_off_num ~ ., data = df_train, importance = "impurity")

# XGBoost 모델
set.seed(102)
bst <- xgboost(
  data = data.matrix(subset(df_train, select = -box_off_num)),
  label = df_train$box_off_num, 
  objective = "reg:linear",
  nrounds = 100, 
  max_depth = 5, 
  eta = 0.3,
  verbose = 0  # suppress printing
)

## [21:39:41] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.

모델별 중요 변수 파악하기

#각 모델별 중요도 보기

library(vip)
i1 <- vip(m1) + ggtitle("Logistic regression")
i2 <- vip(m2)+ ggtitle("Random Forest")
i3 <- vip(tree)+ ggtitle("Descision tree")
i4 <- vip(rfo)+ggtitle("Fast Random Forest")
i5 <- vip(bst)+ggtitle("XGBoost")

i3+i1+i5+i2+i4

대부분의 모델이 staff의 수와 영화 시간을 관객수 예측에 가장 중요한 변수로 보았다.

##예측하기

p1<-predict(m1, df_test)
p2<-predict(m2, df_test)
p3<-predict(tree, df_test)
p4<- predict(rfo, data = df_test, predict.all = TRUE)
p4 <- p4$predictions[,2]

p5<-predict(bst, data.matrix(df_test[,-7]))

##평가하기

r1 <- caret::R2(df_test$box_off_num  , p1) #로지스틱 회귀분석
r2 <- caret::R2(df_test$box_off_num  , p2) #랜덤포레스트
r3 <- caret::R2(df_test$box_off_num  , p3) #의사결정나무
r4 <- caret::R2(df_test$box_off_num  , p4) #ranger
r5 <- caret::R2(df_test$box_off_num  , p5) #xgboost

name <- c("Logistic regression", "Random Forest", "Descision tree", "Fast Random Forest","XGBoost")
r_squre <- round(c(r1,r2,r3,r4,r5),2)
v <- as.data.frame(cbind(name, r_squre) )

v %>% 
  mutate(name = fct_reorder(name,desc(r_squre))) %>% 
  ggplot(aes(name, r_squre, fill=name))+geom_col() + 
  geom_text(data = v, aes(label = paste("R2=",r_squre)), y = r_squre, size=5)+
  ggtitle("영화 관객수 예측")+
  labs(y="R^2", x="M/L Models",caption = "data by DACON", subtitle="Analysis by Chum-mul")+
  theme_bw()+
  theme(axis.text.y = element_text(size=12), 
        axis.text.x = element_text(size=12))+
  theme(legend.position="none")

랜덤포레스트 모델이 가장 좋은 예측력을 보였다. 그러나 R^2가 32% 정도 밖에 되지 않는다. 하이퍼파라미터를 좀 더 tuning 해 보자.

tuning

library(tidymodels)

tree_rec <- recipe(box_off_num ~ ., data = df_train) 


tree_prep <- prep(tree_rec)
juiced <- juice(tree_prep)

tune_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_mode("regression") %>%
  set_engine("ranger")

tune_wf <- workflow() %>%
  add_recipe(tree_rec) %>%
  add_model(tune_spec)

set.seed(234)
trees_folds <- vfold_cv(df_train)

doParallel::registerDoParallel()

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20
)

tune_res

## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 × 4
##    splits           id     .metrics          .notes          
##    <list>           <chr>  <list>            <list>          
##  1 <split [378/42]> Fold01 <tibble [40 × 6]> <tibble [0 × 3]>
##  2 <split [378/42]> Fold02 <tibble [40 × 6]> <tibble [0 × 3]>
##  3 <split [378/42]> Fold03 <tibble [40 × 6]> <tibble [0 × 3]>
##  4 <split [378/42]> Fold04 <tibble [40 × 6]> <tibble [0 × 3]>
##  5 <split [378/42]> Fold05 <tibble [40 × 6]> <tibble [0 × 3]>
##  6 <split [378/42]> Fold06 <tibble [40 × 6]> <tibble [0 × 3]>
##  7 <split [378/42]> Fold07 <tibble [40 × 6]> <tibble [0 × 3]>
##  8 <split [378/42]> Fold08 <tibble [40 × 6]> <tibble [0 × 3]>
##  9 <split [378/42]> Fold09 <tibble [40 × 6]> <tibble [0 × 3]>
## 10 <split [378/42]> Fold10 <tibble [40 × 6]> <tibble [0 × 3]>

tune_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rsq")

rf_grid <- grid_regular(
  mtry(range = c(3, 5)),
  min_n(range = c(30, 50)),
  levels = 5
)

rf_grid

## # A tibble: 15 × 2
##     mtry min_n
##    <int> <int>
##  1     3    30
##  2     4    30
##  3     5    30
##  4     3    35
##  5     4    35
##  6     5    35
##  7     3    40
##  8     4    40
##  9     5    40
## 10     3    45
## 11     4    45
## 12     5    45
## 13     3    50
## 14     4    50
## 15     5    50

set.seed(456)
regular_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = rf_grid
)

regular_res

## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 × 4
##    splits           id     .metrics          .notes          
##    <list>           <chr>  <list>            <list>          
##  1 <split [378/42]> Fold01 <tibble [30 × 6]> <tibble [0 × 3]>
##  2 <split [378/42]> Fold02 <tibble [30 × 6]> <tibble [0 × 3]>
##  3 <split [378/42]> Fold03 <tibble [30 × 6]> <tibble [0 × 3]>
##  4 <split [378/42]> Fold04 <tibble [30 × 6]> <tibble [0 × 3]>
##  5 <split [378/42]> Fold05 <tibble [30 × 6]> <tibble [0 × 3]>
##  6 <split [378/42]> Fold06 <tibble [30 × 6]> <tibble [0 × 3]>
##  7 <split [378/42]> Fold07 <tibble [30 × 6]> <tibble [0 × 3]>
##  8 <split [378/42]> Fold08 <tibble [30 × 6]> <tibble [0 × 3]>
##  9 <split [378/42]> Fold09 <tibble [30 × 6]> <tibble [0 × 3]>
## 10 <split [378/42]> Fold10 <tibble [30 × 6]> <tibble [0 × 3]>

regular_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(mtry, mean, color = min_n)) +
  geom_line(alpha = 0.5, size = 1.5) +
  geom_point() +
  labs(y = "rsq")

best_rsq <- select_best(regular_res, "rsq")

final_rf <- finalize_model(
  tune_spec,
  best_rsq
)

final_rf

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 3
##   trees = 1000
##   min_n = 35
## 
## Computational engine: ranger

최종 tuning된 하이퍼 파라미터 넣고 모델링 하기

rand_forest(mode = "regression", mtry = varying())

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = varying()
## 
## Computational engine: ranger

model <- rand_forest(mtry = 3, min_n = 45, trees =1000, mode = "regression") %>% 
    set_engine("randomForest") %>%
  fit(box_off_num ~ ., data = df_train)

p_final <- predict(model, df_test)
caret::R2(df_test$box_off_num  , p_final) #랜덤포레스트

##            [,1]
## .pred 0.4685502

별로 R^2 35%? 별로 높아지지 않았다. 뭐가 문제인가? 중요한 변수가 빠져 있는 것은 아닌가? 영화 제목에서 단서를 찾아보자

영화 제목 분석

R에서 형태소 분석 참고 사이트 한국 R 사용자회

영화 제목 워드클라우드

영화 제목만 뽑아서 워드클라우드를 해 보자

library(bitTA)
library(dplyr) # load dplyr package
library(wordcloud2) # load wordcloud2 package
library(RcppMeCab)
library(tidytext)


words <- train %>% select(title) %>% 
   unnest_tokens( output = 분석_텍스트,
                 input = title, 
                 token = RcppMeCab::pos) %>% 
  separate(분석_텍스트, c("명사", "형태소"), sep = "/") %>% 
  filter(형태소 == "nng") %>% 
  count(명사, sort = TRUE, name = "빈도수") 


words %>% 
  wordcloud2::wordcloud2(fontFamily = "NanumSquare")

영화 제목에 많이 들어간 단어를 보니 “사랑”, “LOVE”이 가장 많았다.

영화 제목 길이와 관객수 상관성 분석

그렇다면 영화 제목 길이와 관객수는 상관성이 있을까?

train %>% mutate(l=nchar(str_trim(title))) %>% select(l,box_off_num ) %>% 
  ggplot(aes(x=l, box_off_num))+geom_point()+
  theme_minimal()+
  labs(x="영화제목 길이", y="관객수")+
    theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )+
  scale_y_log10(labels = scales::comma)

영화제목 길이가 10자 이상될 경우, 관객수가 줄어드는 것으로 보인다.

영화 제목이 짝수, 홀수와 관객수는 상관성이 있을까?

train %>% mutate(l=nchar(str_trim(title))) %>% select(l,box_off_num ) %>% 
  mutate(제목 = ifelse(l%%2 == 0, "even", "odd"))%>% 
  ggplot(aes(x=제목, box_off_num, fill=제목))+geom_boxplot()+
  theme_minimal()+
  scale_y_log10(labels = scales::comma)+
  labs(title = "영화제목 글자수와 관객수 상관비교", 
       x="영화제목 길이", y="관객수")+
    theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )

글자수 전체를 가지고 하니 상관성이 없어 보인다. 그러면 글자수 5자 이하로 해서 짝수와 홀수 글자 제목이 영화 관객수에 영향을 미치는지 보았다. 결과를 보니 홀수 글자수의 제목이 약간 더 관객수가 많았다.

train %>% mutate(l=nchar(str_trim(title))) %>% select(l,box_off_num ) %>% 
  filter(l<5) %>% 
  mutate(제목 = ifelse(l%%2 == 0, "even", "odd"))%>% 
  ggplot(aes(x=제목, box_off_num, fill=제목))+geom_boxplot()+
  theme_minimal()+
  scale_y_log10(labels = scales::comma)+
  labs(title = "영화제목 글자수와 관객수 상관비교", 
       subtitle = "영화제목 글자수 5이하",
       x="영화제목 길이", y="관객수")+
    theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )

장르별 관객수

느와르, 액션, SF가 평균 관객수는 많지만 그만큼 산포도 크게 나온것을 알 수 있다.

train %>%
  group_by(genre) %>%
  summarise(n= n(), avg.boxoff_n=mean(box_off_num), `σ`=sd(box_off_num)) %>%
  arrange(-avg.boxoff_n) %>%
  mutate(genre = fct_reorder(genre, -avg.boxoff_n)) %>% 
  ggplot(aes(x=genre , y=avg.boxoff_n, col=`σ`))+geom_point()+
  theme_minimal()+
  labs(y="평균 관객수", x="장르")+
  scale_y_log10(labels = scales::comma)+
      theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )

그렇다면 장르와 영화제목 길이는 상관성이 있을까?
서스펜스 이외에 나머지들은 장르별 영화 제목 글자수가 상관성이 보이지 않는다.

train %>% 
  mutate(l=nchar(str_trim(title))) %>% 
  select(genre, l ) %>% 
  mutate(genre = fct_reorder(genre, l, .fun = median)) %>% 
  ggplot(aes(x=genre , y=l, col=genre))+geom_point()+
  geom_boxplot()+
    labs(title = "영화 장르별 영화제목 글자수상관비교", 
       y="영화제목 길이", x="장르")+
  theme_minimal()+
      theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )

영화 제목 감성 분석

영화 제목에 감성적인 단어가 얼마나 사용되었는가를 분석해보았다.

한글 감성 분석

[감성 분석 참고 사이트]{“https://r2bit.com/book_tm/sentiment.html”} 군산대 감성 분석 사전 참고

먼저 감성 사전 파일을 다운로드 해보자

getwd()

## [1] "D:/r"

url_v <- "https://github.com/park1200656/KnuSentiLex/archive/refs/heads/master.zip"
dest_v <- "data/knusenti.zip"

download.file(url = url_v, 
              destfile = dest_v,
              mode = "wb")

list.files("data/.")

##  [1] "21265_transaction_price_data"                
##  [2] "bake-sale.xlsx"                              
##  [3] "f1"                                          
##  [4] "gapminder"                                   
##  [5] "knusenti.zip"                                
##  [6] "KnuSentiLex-master"                          
##  [7] "LARD_ADM_SECT_SGG_경기"                      
##  [8] "name"                                        
##  [9] "negative.txt"                                
## [10] "penguins.xlsx"                               
## [11] "positive.txt"                                
## [12] "rn_20230717222718.csv"                       
## [13] "sido_code.csv"                               
## [14] "sigungu_code.csv"                            
## [15] "store.csv"                                   
## [16] "Sunspots.csv"                                
## [17] "ta_20230717224158.csv"                       
## [18] "나라별_국가별_출산율.csv"                    
## [19] "데이콘"                                      
## [20] "재학생 충원율 (대학)_2023-07-29201343123.csv"
## [21] "주요국 통화의 대원화환율_08211138.csv"       
## [22] "주요국 통화의 대원화환율_08211510.csv"

unzip("data/knusenti.zip", exdir = "data")
list.files("data/KnuSentiLex-master/")

## [1] "data"                     "KnuSentiLex"             
## [3] "knusl.py"                 "neg_pol_word.txt"        
## [5] "obj_unknown_pol_word.txt" "pos_pol_word.txt"        
## [7] "README.md"                "ReadMe.txt"              
## [9] "SentiWord_Dict.txt"

senti_file_list <- list.files("data/KnuSentiLex-master/",
                              full.names = TRUE)

senti_dic_df <- read_tsv(senti_file_list[9], col_names = FALSE)

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 14855 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
## dbl (1): X2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

senti_dic_df <- senti_dic_df %>% 
  rename(word = X1, sScore = X2)

그러면 -2~+2 까지 단어별로 점수가 매겨져 있는 데이터프레임을 만들 수 있다. 음수면 부정적 단어, 양수면 긍정적 단어이다.

senti_dic_df %>% filter(sScore == 2) %>% arrange(word)

## # A tibble: 2,603 × 2
##    word              sScore
##    <chr>              <dbl>
##  1 가능성이 늘어나다      2
##  2 가능성이 있다고        2
##  3 가능하다               2
##  4 가볍고 상쾌하다        2
##  5 가볍고 상쾌한          2
##  6 가볍고 시원하게        2
##  7 가볍고 편안하게        2
##  8 가볍고 환하게          2
##  9 가운데에서 뛰어남      2
## 10 가장 거룩한            2
## # ℹ 2,593 more rows

senti_dic_df %>% filter(sScore == -2) %>% arrange(word)

## # A tibble: 4,799 × 2
##    word            sScore
##    <chr>            <dbl>
##  1 가난                -2
##  2 가난뱅이            -2
##  3 가난살이            -2
##  4 가난살이하다        -2
##  5 가난설음            -2
##  6 가난에              -2
##  7 가난에 쪼들려서     -2
##  8 가난하게            -2
##  9 가난하고            -2
## 10 가난하고 어렵다     -2
## # ℹ 4,789 more rows

knu_dic_df <- senti_dic_df %>% 
  mutate(word   = ifelse( is.na(sScore), "갈등", word),
         sScore = ifelse( is.na(sScore), -1, sScore) )
  
knu_dic_df %>%   
  count(sScore)

## # A tibble: 5 × 2
##   sScore     n
##    <dbl> <int>
## 1     -2  4799
## 2     -1  5031
## 3      0   154
## 4      1  2268
## 5      2  2603

knu_dic_df %>% 
  mutate(emotion = case_when( sScore >= 1 ~ "positive",
                              sScore <= -1 ~ "negative", 
                              TRUE         ~ "neutral")) %>% 
  count(emotion)

## # A tibble: 3 × 2
##   emotion      n
##   <chr>    <int>
## 1 negative  9830
## 2 neutral    154
## 3 positive  4871

knu_dic_df <- read_tsv(senti_file_list[9], col_names = FALSE) %>% 
  rename(word = X1, sScore = X2) %>% 
  filter(!is.na(sScore)) %>% 
  add_row(word = "갈등", sScore = -1)

그러면 간단하게 한 문장을 넣어보고, 그 안에 긍정적 단어와 부정적 단어수를 파악, 점수를 매겨보자.

con_v <- "유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에 항거한 4·19민주이념을 계승하고, 조국의 민주개혁과 평화적 통일의 사명에 입각하여 정의·인도와 동포애로써 민족의 단결을 공고히 하고, 모든 사회적 폐습과 불의를 타파하며, 자율과 조화를 바탕으로 자유민주적 기본질서를 더욱 확고히 하여 정치·경제·사회·문화의 모든 영역에 있어서 각인의 기회를 균등히 하고, 능력을 최고도로 발휘하게 하며, 자유와 권리에 따르는 책임과 의무를 완수하게 하여, 안으로는 국민생활의 균등한 향상을 기하고 밖으로는 항구적인 세계평화와 인류공영에 이바지함으로써 우리들과 우리들의 자손의 안전과 자유와 행복을 영원히 확보할 것을 다짐하면서 1948년 7월 12일에 제정되고 8차에 걸쳐 개정된 헌법을 이제 국회의 의결을 거쳐 국민투표에 의하여 개정한다."

emotion <- tibble(text = con_v) %>% 
  unnest_tokens(output = word, 
                input = text) %>% 
  inner_join(knu_dic_df)

emotion

## # A tibble: 2 × 2
##   word   sScore
##   <chr>   <dbl>
## 1 조화를      2
## 2 행복을      2

결과를 보면 대부분 중성적인 단어이고 조화, 행복 이 긍정적인 점수를 가지고 있다. 그럼 이제 준비가 되었으니, 영화 제목에 들어간 단어의 감성 분석을 해 보자.

영화 제목 감성 분석

워드 클라우드로 분석해보았다.
사이즈가 큰 단어가 긍정, 그다음은 중립, 가장 작은 사이즈가 부정 단어이다.

title_emotion <- train %>% 
  mutate(text=tokenize_noun_ngrams(title)) %>% 
  select(text) %>% unnest() %>% 
  unnest_tokens(output = word, 
                input = text) %>% 
  distinct(word) %>% 
  inner_join(knu_dic_df) %>% 
  arrange(sScore)


library(wordcloud)
library(RColorBrewer)
wordcloud(words = title_emotion$word, freq = title_emotion$sScore+4, min.freq = -2,           max.words=30, random.order=FALSE, scale=c(5, 0.1), rot.per=0,  colors=brewer.pal(5, "Dark2"))

영화 제목에 사용된 제목의 긍/부정 단어의 빈도 분석을 해서 시각화 해 보자.
영화제목에 사용된 단어들은 중립적인 단어들 보다 긍정/부정 단어가 많은데 긍정보다는 부정적인 단어를 많이 사용한다. 그것이 보다 자극적으로 관객의 시선을 끄는 것 때문이기 때문일 것이다.

title_emotion %>% 
  mutate(emotion = 
           case_when(
             sScore == 2 ~ "very positive",
             sScore == 1 ~ "positive",
             sScore == -2 ~ "very negative",
             sScore == -1 ~ "negative",
             TRUE ~ "neutral"

)) %>%
  mutate(emotion = fct_reorder(emotion, abs(sScore), .fun = mean)) %>% 
  ggplot(aes(x = emotion, fill=emotion))+geom_bar()+
      labs(title = "영화제목에 사용된 단어의 감정 정도 빈도수", 
       y="빈도수", x="감정 정도")+
  theme_minimal()+
  theme(legend.position = "none")+
        theme(axis.text.x = element_text(size=20,face='bold'),
        axis.text.y = element_text(size=20,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )+
   scale_y_continuous(labels = scales::comma)

그럼 긍정적인 단어가 들어간 제목이 부정적인 단어가 들어간 제목의 영화보다 관객수가 적을까?

library(tidytext)

title_emotion <- 
title_emotion %>% 
  mutate(emotion = 
           case_when(
             sScore == 2 ~ "very positive",
             sScore == 1 ~ "positive",
             sScore == -2 ~ "very negative",
             sScore == -1 ~ "negative",
             TRUE ~ "neutral"

))


  
train %>% 
   unnest_tokens( output = 분석_텍스트,
                 input = title, 
                 token = RcppMeCab::pos) %>% 
  separate(분석_텍스트, c("명사", "형태소"), sep = "/") %>% 
  filter(형태소 == "nng") %>% 
  left_join(title_emotion, by=c('명사'='word')) %>% 
  na.omit() %>% 
  group_by(emotion) %>% 
  summarise(avg_n = mean(box_off_num)) %>% 
  mutate(emotion = fct_reorder(emotion, -avg_n)) %>% 
  ggplot(aes(x=emotion , y=avg_n, fill=emotion ))+geom_col()+
  labs(title = "영화제목 감성별 관객수", 
       y="관객수", x="영화제목 감성")+
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()+
      theme(axis.text.x = element_text(size=10,face='bold'),
        axis.text.y = element_text(size=10,face='bold'),
        axis.title.x =element_text(size=20,face='bold'),
        axis.title.y =element_text(size=20,face='bold')  )+
  theme(legend.position = "none")

분석 해보니 가장 부정적인 단어들이 많이 들어간 제목이 관객수가 가장 많았다 .

그럼 개봉연도별 관객수 증가폭을 알아보자. 특별한 경향성이 보이지 않아 연도별, 월별 영화별 관객수를 살펴보아도 마찬가지다.

library(lubridate)

train %>% 
  mutate(date = as.Date(release_time, '%Y-%m-%d')  ) %>% 
  ggplot(aes(x=date, y=box_off_num))+geom_point()+geom_line()+
  scale_y_log10(labels = scales::comma)+
  geom_smooth()+
  theme_minimal()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

train %>% 
    mutate(year = year(release_time),
         month = month(release_time)) %>% 
  ggplot(aes(x=month, y=box_off_num, col=as.factor(year)))+geom_point()+
  scale_y_log10(labels = scales::comma)+
  geom_smooth()+
  theme_minimal()+
  facet_wrap(~year,)+
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

그럼 월별 일별 개봉일에 따는 관객수 경향을 보자.

train %>% 
    mutate(year = year(release_time),
         month = month(release_time),
         day = day(release_time)) %>% 
  ggplot(aes(x=day, y=box_off_num, col=as.factor(month)))+geom_point()+
  scale_y_log10(labels = scales::comma)+
  geom_smooth()+
  theme_minimal()+
  facet_wrap(~month,)+
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

뭔가 경향이 있어보이기도 하고 아닌것 같기도 해서 상관분석을 해 보았다.

M <- train %>% na.omit() %>% 
      mutate(year = year(release_time),
         month = month(release_time),
         day = day(release_time)) %>% 
   select_if(is.numeric)  %>% 
  cor()


corrplot(M,method="number")

corrplot(M,type="upper")

영화제목 감성 분석 내용과 개봉연, 월, 일을 넣어서 머신러닝 분석을 다시 한번 해 보자.

5개의 머신러닝 모델링 (modeling)

train %>% 
   unnest_tokens( output = 분석_텍스트,
                 input = title, 
                 token = RcppMeCab::pos) %>% 
  separate(분석_텍스트, c("명사", "형태소"), sep = "/") %>% 
  filter(형태소 == "nng") %>% 
  left_join(title_emotion, by=c('명사'='word')) %>% 
   mutate(year = year(release_time),
         month = month(release_time),
         day = day(release_time)) %>% 
  mutate_if(is.character, as.factor) %>% 
  select(-release_time, -director,-distributor, -명사,  -형태소 ) %>% 
  na.omit() -> train_df

set.seed(123)

idx <- createDataPartition(train_df$box_off_num, p=0.7, list=F)
df_train <- train_df[idx,]
df_test <- train_df[-idx,]

#로지스틱 회귀 모델
m1<-train(box_off_num~., data=df_train, method="glm") 

#랜덤포레스트 모델

m2<-randomForest(box_off_num~., data=df_train, ntree=100, proximity=T) 

#의사결정나무 모델 
tree <- rpart(box_off_num ~ ., data = df_train)

# Fast 램덤포레스트 모델
set.seed(101)
rfo <- ranger(box_off_num ~ ., data = df_train, importance = "impurity")

# XGBoost 모델
set.seed(102)
bst <- xgboost(
  data = data.matrix(subset(df_train, select = -box_off_num)),
  label = df_train$box_off_num, 
  objective = "reg:linear",
  nrounds = 100, 
  max_depth = 5, 
  eta = 0.3,
  verbose = 0  # suppress printing
)

## [21:40:25] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.

모델별 중요 변수 파악하기

#각 모델별 중요도 보기

library(vip)
i1 <- vip(m1) + ggtitle("Logistic regression")
i2 <- vip(m2)+ ggtitle("Random Forest")
i3 <- vip(tree)+ ggtitle("Descision tree")
i4 <- vip(rfo)+ggtitle("Fast Random Forest")
i5 <- vip(bst)+ggtitle("XGBoost")

i3+i1+i5+i2+i4

대부분의 모델이 staff의 수와 영화 시간을 관객수 예측에 가장 중요한 변수로 보았다.

##예측하기

p1<-predict(m1, df_test)
p2<-predict(m2, df_test)
p3<-predict(tree, df_test)
p4<- predict(rfo, data = df_test, predict.all = TRUE)
p4 <- p4$predictions[,2]
p5<-predict(bst, data.matrix(df_test[,-8]))

##평가하기

r1 <- caret::R2(df_test$box_off_num  , p1) #로지스틱 회귀분석
r2 <- caret::R2(df_test$box_off_num  , p2) #랜덤포레스트
r3 <- caret::R2(df_test$box_off_num  , p3) #의사결정나무
r4 <- caret::R2(df_test$box_off_num  , p4) #ranger
r5 <- caret::R2(df_test$box_off_num  , p5) #xgboost

name <- c("Logistic regression", "Random Forest", "Descision tree", "Fast Random Forest","XGBoost")
r_squre <- round(c(r1,r2,r3,r4,r5),2)
v <- as.data.frame(cbind(name, r_squre) )

v %>% 
  mutate(name = fct_reorder(name,desc(r_squre))) %>% 
  ggplot(aes(name, r_squre, fill=name))+geom_col() + 
  geom_text(data = v, aes(label = paste("R2=",r_squre)), y = r_squre, size=5)+
  ggtitle("영화 관객수 예측")+
  labs(y="R^2", x="M/L Models",caption = "data by DACON", subtitle="Analysis by Chum-mul")+
  theme_bw()+
  theme(axis.text.y = element_text(size=12), 
        axis.text.x = element_text(size=12))+
  theme(legend.position="none")

R^2가 좋아졌다. 그리고 가장 좋은 모델은 XGBoost로 바뀌었다.

728x90

'데이터 분석' 카테고리의 다른 글

R을 이용한 시계열 분석 연습 (인구, 출생율) (0)	2023.11.04
[머신러닝 모델링을 통한 타이타닉 데이터 분석] (0)	2023.07.09
[R을 이용한 타이타닉 생존자 예측] (0)	2023.07.09
R을 이용한 축구 데이터 분석 (MAP) (0)	2023.06.11
점탄성 특성 분석 (ANOVA, PCA분석) (0)	2023.05.30

영화관객수 예측

첨물

2023-10-14

라이브러리 불러오기

데이터 불러오기

EDA(Exploratory Data Analysis, 탐색적 데이터 분석)

컬럼 설명

컬럼별 속성 확인 및 결측치가 있는 컬럼 확인

컬럼별 데이터 상관관계 비교

상관도를 보는 라이브러리 corrplot 사용하여 시각화

영화 감독별 제작 영화수와 평균 관객수

머신러닝 (회귀분석)

훈련/검증 데이터 나누기

5개의 머신러닝 모델링 (modeling)

모델별 중요 변수 파악하기

tuning

최종 tuning된 하이퍼 파라미터 넣고 모델링 하기

영화 제목 분석

영화 제목 워드클라우드

영화 제목 길이와 관객수 상관성 분석

장르별 관객수

영화 제목 감성 분석

한글 감성 분석

영화 제목 감성 분석

5개의 머신러닝 모델링 (modeling)

모델별 중요 변수 파악하기

'데이터 분석' 카테고리의 다른 글

티스토리툴바