Introduction
Marvel Studios is an American film and television production company and a subsidiary of Walt Disney Studios, a division of The Walt Disney Company. It produces films and series for the Marvel Cinematic Universe (MCU) based on characters appearing in Marvel Comics publications. Since 2008, Marvel Studios has released 30 films in the MCU, along with eight television series (since 2021) and two television specials. These films, series, and specials share continuity with the One-Shots short films produced by the studio. The television series produced by Marvel Television also acknowledge this continuity.
Reddit, the ninth most visited site in the world, is a popular discussion site in the United States. Registered users can submit content to the site and discuss topics with other users. One of its most popular subreddits, ranked among the site's largest communities, is "r/marvelstudios". About 3 million Reddit users have joined this subreddit to share their thoughts and opinions on Marvel Studios characters, movies, and series. This large user base makes Reddit a rich source for analyzing viewers' opinions on movies, characters, and series.

In this project, we collected all Reddit submissions and comments in the "r/marvelstudios" subreddit from January 2021 to August 2022. We applied natural language processing and machine learning techniques to these user comments to answer the business questions listed in the appendix below. The questions fall into two main categories: character-related and media-related. By analyzing this large volume of Reddit data, we hope to answer questions such as "Who are the most popular characters?" and "Which films or series should get more investment?" so that Marvel Studios can better understand its viewers and make sound business decisions and plans in the future.
To achieve this goal, we collected additional data from two external sources: one covering Marvel Studios release dates, and the other covering character settings, backgrounds, and abilities.
With these additional data sources and the IMDb website, we can run correlation tests and build predictive models that provide useful insights to Marvel Studios when making business decisions.
Appendix
Media
-
Business goal: Determine the most popular Marvel movie or Disney+ series from January 2021 to August 2022.
Technical proposal: Use NLP to identify the posts that mention one or more movies or series. Count which movies or series are mentioned the most. Analyze the counts over time to check for major deviations and identify the leaders. Conduct sentiment analysis of the posts to assign positive or negative values to movies or series. Present volume metrics and sentiment analysis for the top 5 movies or series to answer the "popular" question for the media class.
-
Business goal: Measure the correlation between IMDb ratings, box office, investment, and audience reviews.
Technical proposal: Use NLP to identify the posts that mention one or more movies or series. Count which movies or series are mentioned the most. Conduct sentiment analysis of the posts to assign positive or negative values to movies or series. Use external resources to obtain each title's investment, box office, and IMDb rating, and calculate the correlations between these variables to see how they affect one another.
-
Business goal: Compare the audience's expectations before the release of a movie or series with their reactions after it. Check for possible reasons.
Technical proposal: Use NLP to identify the posts that mention one or more movies or series. Count which movies or series are mentioned the most. Analyze the counts over time to look for significant outliers and pinpoint the leaders. Perform sentiment analysis on the comments to rate the movies and series positively or negatively. Report volume metrics and sentiment analysis for the movies and series to address the "popular" question for the media class. Use external resources to obtain release dates, and relate them to these variables to find out what viewers thought of each movie or series both before and after its release.
-
Business goal: Examine the relationship between users' opinions of each title, as reflected in the Reddit data, and its IMDb rating.
Technical proposal: Fit a linear regression model to relate the score and sentiment of each title to its IMDb rating. Use both linear regression and Lasso regression to predict the rating from these features, tune the hyperparameters of the Lasso regression, and evaluate all models on their prediction results and metrics.
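The mention-counting and correlation steps that recur in the media proposals above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the alias table, the helper names (`compile_patterns`, `find_mentions`, `monthly_mention_counts`, `pearson`, `ts`), the sample comments, and the rating values are all hypothetical, and simple keyword matching stands in for whatever NLP method was actually used.

```python
import math
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical alias table (not from the project): title -> phrases counted as a mention.
MEDIA_ALIASES = {
    "WandaVision": ["wandavision"],
    "Loki": ["loki"],  # ambiguous with the character; a real pipeline would need disambiguation
    "Spider-Man: No Way Home": ["no way home", "nwh"],
}


def compile_patterns(aliases):
    """Build one case-insensitive, word-bounded regex per title."""
    return {
        title: re.compile(
            r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
            re.IGNORECASE,
        )
        for title, names in aliases.items()
    }


def find_mentions(text, patterns):
    """Return the set of titles mentioned at least once in the text."""
    return {title for title, pattern in patterns.items() if pattern.search(text)}


def monthly_mention_counts(comments, patterns):
    """Count comments mentioning each title, keyed by 'YYYY-MM' month.

    `comments` is an iterable of (created_utc_seconds, body) pairs.
    """
    counts = defaultdict(Counter)
    for created_utc, body in comments:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        counts[month].update(find_mentions(body, patterns))
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ts(year, month, day):
    """Helper for the toy data: UTC epoch seconds for a calendar date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Toy comments, invented for illustration only.
comments = [
    (ts(2021, 1, 20), "WandaVision episode 3 was wild"),
    (ts(2021, 1, 25), "Still thinking about wandavision."),
    (ts(2021, 6, 10), "Loki is off to a strong start"),
    (ts(2021, 12, 18), "No Way Home was worth the wait, NWH spoilers below"),
]

patterns = compile_patterns(MEDIA_ALIASES)
counts = monthly_mention_counts(comments, patterns)
for month in sorted(counts):
    print(month, dict(counts[month]))

# Placeholder ratings (NOT real IMDb values), only to show the correlation call.
placeholder_ratings = {"WandaVision": 7.0, "Loki": 8.0, "Spider-Man: No Way Home": 9.0}
totals = Counter()
for month_counts in counts.values():
    totals.update(month_counts)
titles = sorted(placeholder_ratings)
print(pearson([totals[t] for t in titles], [placeholder_ratings[t] for t in titles]))
```

Counting each comment at most once per title keeps a single long post from dominating the monthly totals. With only a handful of titles, a correlation coefficient like this is descriptive at best, so it is worth reading alongside the raw counts.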
Character
-
Business goal: Determine the most popular Marvel character (by superhero/villain role and by gender) from January 2021 to August 2022.
Technical proposal: Use NLP to identify the posts that mention one or more Marvel characters. Count which characters are mentioned the most. Analyze the counts over time to check for major deviations and identify the leaders. Conduct sentiment analysis of the posts to assign positive or negative values to characters. Present volume metrics and sentiment analysis for the top 5 characters to answer the "popular" question for the character class.
-
Business goal: Identify the characters that are most often mentioned together; a stronger association suggests a larger shared audience for the two characters.
Technical proposal: Use NLP to identify the posts that mention one or more Marvel characters. Count which characters are mentioned the most. Analyze the counts over time to check for major deviations and identify the leaders. Conduct sentiment analysis of the posts to assign positive or negative values to characters. Present volume metrics and sentiment analysis for the top 5 characters to answer the "popular" question for the character class.
-
Business goal: Understand how Marvel fans value comments on Reddit. What are the major factors that affect comment scores on Marvel subreddits?
Technical proposal: Model the score of each comment as a function of comment features. Interpreting the fitted model helps us understand the major factors behind comment scores, which we believe is a good way to learn what Marvel fans value. We will try to identify features that affect the score positively or negatively. This requires models with high interpretability.
-
Business goal: Observe fluctuations in discussion volume, sentiment, and reviews of characters over time. Check for possible reasons.
Technical proposal: Use NLP to identify the posts that mention one or more Marvel characters. Count which characters are mentioned the most. Analyze the counts over time to look for significant outliers and pinpoint the leaders. Perform sentiment analysis on the comments to rate the characters positively or negatively. Report volume metrics and sentiment analysis for the characters to address the "popular" question for the character class. Use external resources to obtain movie and series release dates, and relate them to these variables to follow how opinions about the characters shift in discussions and reviews before and after each release.
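For the goal of finding characters that are mentioned together, one possible approach is to count how often each pair of characters appears in the same comment. The sketch below is an assumption about how this could be done, not a description of the project's method: the alias table, the function names (`characters_in`, `cooccurrence_counts`), and the sample comments are hypothetical, and keyword matching stands in for a fuller NLP step.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical alias table (not from the project): character -> phrases counted as a mention.
CHARACTER_ALIASES = {
    "Loki": ["loki"],
    "Thor": ["thor"],
    "Wanda Maximoff": ["wanda", "scarlet witch"],
}

# One case-insensitive, word-bounded regex per character.
CHARACTER_PATTERNS = {
    name: re.compile(
        r"\b(?:" + "|".join(re.escape(alias) for alias in aliases) + r")\b",
        re.IGNORECASE,
    )
    for name, aliases in CHARACTER_ALIASES.items()
}


def characters_in(text, patterns=CHARACTER_PATTERNS):
    """Return a sorted list of the characters mentioned in the text."""
    return sorted(name for name, pattern in patterns.items() if pattern.search(text))


def cooccurrence_counts(comments, patterns=CHARACTER_PATTERNS):
    """Count, for every pair of characters, the comments that mention both.

    Pairs are stored as alphabetically ordered tuples, so (A, B) and (B, A)
    are never counted separately.
    """
    pairs = Counter()
    for body in comments:
        pairs.update(combinations(characters_in(body, patterns), 2))
    return pairs


# Toy comments, invented for illustration only.
comments = [
    "Loki and Thor have the best sibling dynamic.",
    "Thor teaming up with Loki again would be great.",
    "Wanda as the Scarlet Witch was terrifying.",
    "I'd love to see Wanda meet Loki.",
]

for pair, count in cooccurrence_counts(comments).most_common():
    print(pair, count)
```

Raw pair counts favor characters that are popular overall, so a normalized measure (for example, dividing by each character's individual mention count) may give a fairer picture of which pairs are genuinely associated.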