We conclude with some insights from the perspective of Marvel Studios.
In the past two years, Marvel has released several well-received superhero movies along with a number of Disney+ series. More importantly, Marvel has started to let characters appear in one another’s movies. We believe this is a good way to increase the popularity of Marvel works. The newer idea of the “multiverse” opens up even more plots and interesting interactions, not only between different characters but also between different versions of the same character. Building on this idea, we use big data to measure the popularity of different characters and to inform choices for Marvel’s next movie.
We can easily see fluctuations in how much each hero is discussed. Because Marvel fans on Reddit respond almost immediately, the volume of discussion gives us an idea of how popular each hero is. Wanda and Vision, Loki, and Spider-Man are the most popular characters. Since a single title can feature many heroes, Marvel could deliberately raise the discussion heat of less-discussed titles by adding popular characters to them. One example would be having Wanda appear in The Falcon and the Winter Soldier, which may increase discussion of the series among the audience. A further benefit of this approach is that it could draw more public attention to LGBTQ+ and minority representation, since our plots show that many titles centered on minority characters, such as Shang-Chi and Ms. Marvel, do not attract much discussion.
On top of that, discussion of the Disney+ series shows weekly seasonality. This keeps the discussion going for longer than it does for movies, and it reflects the sustained popularity of the Disney+ series. Investing in more Disney+ series would therefore be a good choice for the future.
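One simple way to check for the weekly seasonality described above is the autocorrelation of daily comment counts at a lag of seven days. The sketch below is only an illustration under stated assumptions: the `daily_counts` values are made up, and the function name `autocorrelation` is our own, not something taken from the project code.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    if not 0 < lag < len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no measurable autocorrelation
    num = sum((series[t] - mu) * (series[t - lag] - mu)
              for t in range(lag, len(series)))
    return num / denom


# Hypothetical daily comment counts for six weeks of a series:
# a spike on each episode day, then a decay until the next episode.
weekly_pattern = [900, 520, 310, 200, 150, 120, 110]
daily_counts = weekly_pattern * 6

# Autocorrelation for lags of 1 to 14 days; a weekly cycle should peak at 7.
acf = {lag: autocorrelation(daily_counts, lag) for lag in range(1, 15)}
best_lag = max(acf, key=acf.get)
print("lag with the strongest autocorrelation:", best_lag)
```

On real data the peak would be less clean than in this toy series, so it would be worth comparing the lag-7 value against its neighbors rather than relying on the maximum alone.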
Another finding that might be helpful for Marvel Studios is the correlation among a movie’s IMDb rating, the sentiment of its Reddit comments, the investment in the movie, and its box office. We plot these correlations in a heatmap to show how the variables are related.
As shown in the heatmap, the comment count is highly correlated with box office. This means that a movie’s discussion heat reflects its box office, and with more data we may be able to use this relationship to predict box office. It could also serve as an early signal: if we measure the discussion heat when Marvel releases spoilers, we can help producers anticipate how the audience will react when the movie is released. The reliability of this signal is supported by the IMDb rating, which is highly correlated with both the Reddit comment count and the box office. Using the comment count as an indicator could therefore be a promising method.
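The numbers behind such a heatmap are pairwise Pearson correlation coefficients. As a minimal sketch, assuming one value per movie for each variable, the matrix could be computed as follows. The column names and every value in `movies` are invented for illustration and are not the project’s data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when a variable is constant
    return cov / sqrt(vx * vy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up example values, one entry per hypothetical movie.
movies = {
    "comment_count": [12000, 4500, 30000, 8000, 15000],
    "box_office_musd": [850, 380, 1900, 430, 760],
    "imdb_rating": [7.3, 6.4, 8.2, 6.7, 7.4],
    "budget_musd": [200, 150, 250, 200, 180],
}

matrix = correlation_matrix(movies)
for name, row in matrix.items():
    print(f"{name:>16}", [round(row[other], 2) for other in matrix])
```

With only a handful of movies, any single coefficient is sensitive to one or two titles, which is one reason the prediction idea above depends on collecting more data.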
We also apply machine learning techniques in this project. For the first model, we use features of a comment to fit its score. The score of a Reddit comment is its number of upvotes minus its number of downvotes. We believe the score reflects what Marvel fans on Reddit value, so if it is driven by certain comment features, modeling it can be informative. We model the data with Random Forest and Linear Regression. The results are interesting: several of the features appear to have an impact on the score.
Comment sentiment plays an important role in the score in our Random Forest model. Whether a comment mentions Hulk, Thor, Ant-Man, and so on is also closely related to its score. This tells us that Marvel fans on Reddit pay attention to these heroes, so from Marvel Studios’ point of view it might be beneficial to focus on audience feedback about their movies and series. Since the Random Forest model is quite complex and hard to interpret, we also fit a linear model and examine its coefficients. The coefficients tell us that comments mentioning characters such as Kamala Khan, Nick Fury, Captain Carter, and Odin are more likely to receive a higher score, while comments mentioning Sersi, Casey, Gorr, and Red Skull usually score lower. Based on this, one suggestion for Marvel Studios is to focus more on supporting characters, since most high-scoring comments talk about them.
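To make the linear-model step concrete, the sketch below builds one 0/1 “mentions this character” feature per character and fits ordinary least squares with an intercept, so each coefficient is the estimated change in score when a comment mentions that character. It is a stand-in written with the standard library only, not the project’s actual pipeline: the helper names, the two-character list, and the toy comments and scores are all assumptions made for illustration.

```python
def mention_features(text, characters):
    """One 0/1 indicator per character: does the comment mention the name?"""
    lowered = text.lower()
    return [1.0 if name.lower() in lowered else 0.0 for name in characters]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ols(rows, targets):
    """Least-squares fit via the normal equations; returns [intercept, *coefs]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical comments with made-up scores (upvotes minus downvotes).
characters = ["Nick Fury", "Red Skull"]
comments = [
    ("Nick Fury stole every scene", 15),
    ("Red Skull felt wasted", 2),
    ("Great soundtrack", 5),
    ("Nick Fury versus Red Skull would be fun", 12),
    ("The pacing dragged", 5),
]

rows = [mention_features(text, characters) for text, _ in comments]
scores = [score for _, score in comments]
intercept, *coefs = fit_ols(rows, scores)
for name, coef in zip(characters, coefs):
    print(f"{name}: {coef:+.2f}")
```

Substring matching is a deliberate simplification; real comments would need handling for nicknames and misspellings, and a library such as scikit-learn would be a more practical choice at scale.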
In the second part of our machine learning analysis, we use the results of the sentiment analysis and the score data to fit external IMDb rating data. As mentioned above, the score reflects how much users like and endorse a post. After running the sentiment analysis, we calculated the number of positive, neutral, and negative posts for each title as additional measures of Reddit users’ attitude toward that title. We use all of these attributes as features and the IMDb rating of each title as the target, and we apply Linear Regression and Lasso Regression to find the relationship between the features and the target and to compare the prediction performance of the two models.
The results show that the score has the greatest positive impact on the IMDb rating, which means that in this respect the opinions of Reddit users are consistent with the IMDb rating. Surprisingly, however, the positive post count is negatively correlated with the IMDb rating, while the negative post count is positively correlated with it. One possible reason is that the sentiment of a post may target particular characters or plot points rather than the overall quality of the title, so it cannot fully represent users’ opinion of the title as a whole. Overall, from a marketing point of view, Reddit posts can serve as another way to evaluate Marvel works.
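For readers unfamiliar with Lasso Regression, the sketch below shows the idea with a small coordinate-descent implementation: it minimizes the mean squared error (halved) plus `alpha` times the sum of absolute coefficients, which shrinks weak coefficients to exactly zero. This is an illustration under assumptions, not the code used in the project: the function names are ours, the per-title feature values and ratings are invented, and the features are standardized first because the penalty is sensitive to scale.

```python
from math import sqrt


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(k)]
    return [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(rows, targets, alpha, n_iter=500):
    """Coordinate descent for (1/2n)*||y - b0 - Xb||^2 + alpha*||b||_1."""
    n, k = len(rows), len(rows[0])
    coefs = [0.0] * k
    intercept = sum(targets) / n
    for _ in range(n_iter):
        for j in range(k):
            rho = z = 0.0
            for row, y in zip(rows, targets):
                # Prediction from the intercept and every feature except j.
                others = intercept + sum(coefs[m] * row[m]
                                         for m in range(k) if m != j)
                rho += row[j] * (y - others)
                z += row[j] ** 2
            coefs[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
        # The intercept is not penalized: set it to the mean residual.
        intercept = sum(y - sum(c * x for c, x in zip(coefs, row))
                        for row, y in zip(rows, targets)) / n
    return intercept, coefs


# Made-up per-title features: [mean score, positive posts, negative posts].
features = [
    [12.0, 4000, 1500], [5.0, 1200, 600], [20.0, 9000, 3200],
    [8.0, 2500, 1100], [15.0, 5200, 1700], [6.0, 1800, 900],
]
ratings = [7.4, 6.3, 8.2, 6.8, 7.6, 6.5]  # hypothetical IMDb ratings

x = standardize(features)
for alpha in (0.0, 0.05, 10.0):
    b0, b = fit_lasso(x, ratings, alpha)
    print(alpha, round(b0, 3), [round(c, 3) for c in b])
```

In data like this, positive and negative post counts both grow with the total number of posts, so they are strongly correlated with each other. When features overlap this much, the sign of an individual coefficient can be unstable, which may be another factor behind the unexpected signs noted above.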
In the future, we can extend this research. We can continue to collect data on more movies; these need not be limited to Marvel movies and can include any movies that people like to discuss on other subreddits. Building on that, we could create a movie and series rating system based on Reddit comments, which would differ from other rating websites such as IMDb or Rotten Tomatoes. The system would rate a title based on users’ comments, their scores, and their sentiment. This work looks promising since we already have a linear model on the Marvel dataset. If we gain access to Reddit data for other movies, we could use it to compare reviews across genres. This would be a good window into which topics are popular with audiences.