Executive Summary

Superhero films and TV series have long held a place in the drama market. As the best-known studio in the superhero genre, Marvel Studios is widely discussed on Reddit.
In this part, we used Spark NLP to analyze Reddit data about Marvel Studios together with related external data. For a film and television production company, audience reaction and discussion are among the most important and valuable sources of information. We studied the popularity of movies, series, and characters using mention counts obtained by regex search. We also related other attributes, such as investment, release date, box office, gender, alignment, and race, to the sentiment analysis results for different characters and media on Reddit, to see whether any interesting correlations or effects emerge. The following sections describe the procedure and results for each of our NLP analysis goals.

The most popular Marvel character

Business goal: Determine the most popular Marvel character overall and by alignment (hero/villain) and gender, from 2021 to August 2022.

The table in this section identifies the most popular character in each category, based on the number of Reddit posts and comments that mention the character. Spider-Man and Wanda clearly dominate.

Spider-Man: No Way Home was the top-grossing film worldwide in 2021, and its popularity peaked between 2021 and August 2022. This directly made Spider-Man one of the most popular MCU characters. Spider-Man already had a very large fan base thanks to animation, and the rising popularity of the cast, together with Marvel's very large investment, brought the series to its peak.

As the first Marvel series on Disney+, WandaVision built up considerable audience anticipation before it aired. After its release, the novel narrative technique, the cast's strong performances, and the vivid special effects all contributed to its success. The series also gave Wanda a more complex personality and made her more popular.

Category   Character Name   Mention Count
Overall    Wanda            186333
Female     Wanda            186333
Male       Spider-Man       146361
Hero       Spider-Man       146361
Villain    Wanda            186333
Human      Spider-Man       146361
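
The mention counts above come from a regex search over posts and comments. A minimal sketch of that idea in plain Python follows. The alias patterns, the function name count_mentions, and the sample posts are all hypothetical, since the report does not list the expressions it actually used, and each post is counted at most once per character.

```python
import re
from collections import Counter

# Hypothetical alias patterns (assumption): the report does not list
# the regular expressions that were actually used.
CHARACTER_PATTERNS = {
    "Wanda": re.compile(r"\b(wanda|scarlet\s+witch)\b", re.IGNORECASE),
    "Spider-Man": re.compile(r"\b(spider[\s-]?man|peter\s+parker)\b", re.IGNORECASE),
}


def count_mentions(texts, patterns=CHARACTER_PATTERNS):
    """Count how many texts mention each character at least once."""
    counts = Counter()
    for text in texts:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


# Made-up sample posts, for illustration only.
sample_posts = [
    "Wanda was the best part of the show",
    "Spiderman and Wanda should team up",
    "Peter Parker deserves a break",
    "No spoilers please",
]
mention_counts = count_mentions(sample_posts)
```

Grouping several aliases under one character keeps nicknames and alternate spellings from splitting the count, at the cost of some false positives when an alias is ambiguous.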

Character Co-occurrence

Business goal: Identify the pairs of characters that people most often mention together. A higher co-occurrence rate indicates a larger shared audience for the two characters.

Over the past decade or so, Marvel Studios has created many popular and vivid superheroes and supervillains. Some are beloved by the public, some have appeared on screen many times, and some are widely discussed off screen. In this analysis, we looked for the most popular pairs of characters in 20 months of Reddit discussion, as a reference for planning future movies and series.

We applied NLP to Reddit submissions and comments from 2021 to August 2022 to identify posts that mention two different Marvel characters. We then built a co-occurrence matrix by counting how often each pair of characters is mentioned together. For normalization, each row of the matrix was divided by the count for that row's character, so that every entry lies in the range 0 to 1.
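
The counting and row normalization can be sketched as follows. This is an illustration under stated assumptions, not the project's Spark code: the function name cooccurrence_matrix and the sample posts are hypothetical, and the "count of that row's character" is taken to mean the number of posts mentioning that character.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_matrix(post_mentions):
    """Build a row-normalized co-occurrence matrix.

    post_mentions: iterable of sets, one set of character names per post.
    Returns a dict mapping (row_character, column_character) to the share of
    the row character's posts that also mention the column character.
    """
    totals = Counter()  # number of posts mentioning each character
    pairs = Counter()   # number of posts mentioning both characters
    for mentioned in post_mentions:
        totals.update(mentioned)
        # Ordered pairs, so both (a, b) and (b, a) are counted.
        pairs.update(permutations(sorted(mentioned), 2))
    # Assumption: each row is divided by the number of posts that
    # mention the row's character.
    return {(a, b): n / totals[a] for (a, b), n in pairs.items()}


# Made-up sample data, for illustration only.
posts = [
    {"Wanda", "Vision"},
    {"Wanda", "Agnes"},
    {"Wanda"},
    {"Falcon", "Bucky"},
    {"Wanda", "Vision"},
]
matrix = cooccurrence_matrix(posts)
```

Under this normalization the matrix is not symmetric: in the sample, every post about Vision also mentions Wanda, but only half of the posts about Wanda mention Vision.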

Based on the results, Agnes and Wanda, Wanda and Vision, and Falcon and Bucky are the pairs most frequently mentioned together. Marvel Studios should consider sequels in which these characters appear in the same movie or series.

Fluctuation of Discussion Heat of the Characters Before and After the Release Date

Business goal: Observe how the heat of discussion, sentiment, and reviews of characters fluctuates, and look for possible reasons.

The discussion heat of a Marvel character on Reddit is calculated from the total comment count combined with a weighted score. The time series visualization shows that the discussion heat of each character fluctuates greatly over time. The vertical dashed lines indicate when series or movies were released.
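
One way to read this heat measure is as a per-day sum in which every comment contributes a fixed amount plus a weighted share of its Reddit score. The sketch below assumes exactly that. The weights, the function name daily_heat, and the sample comments are hypothetical, because the report does not state the formula or the weight values it used.

```python
from collections import defaultdict
from datetime import date

# Hypothetical weights (assumption): the report does not give the real values.
COUNT_WEIGHT = 1.0
SCORE_WEIGHT = 0.1


def daily_heat(comments, count_weight=COUNT_WEIGHT, score_weight=SCORE_WEIGHT):
    """Aggregate a heat value per (day, character).

    comments: iterable of (day, character, reddit_score) tuples.
    Each comment adds count_weight plus score_weight times its score.
    """
    heat = defaultdict(float)
    for day, character, score in comments:
        heat[(day, character)] += count_weight + score_weight * score
    return dict(heat)


# Made-up sample comments, for illustration only.
comments = [
    (date(2021, 1, 15), "Wanda", 10),
    (date(2021, 1, 15), "Wanda", 30),
    (date(2021, 1, 16), "Wanda", 5),
]
heat = daily_heat(comments)
```

Plotting these values per character against the day, with a vertical line at each release date, gives the kind of time series described here.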

When a new movie or series is released, discussion of the related characters surges. This is especially clear for the Disney+ series: because episodes are released weekly, the discussion follows a distinctly weekly pattern. According to the plot, Wanda, Falcon, Winter Soldier, and Loki are all very popular characters from Disney+ series.

Discussion of movie characters also rises when a movie is released. Instead of following a weekly pattern, however, it lasts only a short time after the release date, then drops back to its normal level and stays there for a long while. Spider-Man is a good example: the lightning-fast response on Reddit suggests that Marvel fans were eagerly anticipating the character.

Although some of these patterns are obvious, many other characters from series or movies show no rise, or do not stand out on the plot at all. The plot shows only the top 10 characters by discussion heat, and many of them sit at the bottom of the plot. Either they have no obvious trend on Reddit, or the trend is hard to detect with our keywords. Wanda is a good example: when Dr. Strange was released, comments about Wanda shot up, while we do not see a similar rise in discussion of Dr. Strange himself.

Overall, we can confirm that discussion of a given hero is closely tied to media releases and that, for Disney+ series, it follows a weekly pattern. This gives us a better understanding of how Marvel fans react on Reddit, which supports further analysis.

Popularity of Marvel Movies and Series

Business goal: Determine the most popular Marvel movie or Disney+ series from 2021 to August 2022.

With an expanding timeline and a growing fan base, Marvel Studios has enjoyed tremendous success with releases spanning more than a decade. The movies and series produced under its banner are highly popular with audiences. The Marvel Universe has been popular from the beginning and has kept expanding, most recently by exploring the multiverse in ways not seen before. In the last two years, Marvel has released many new movies and series, and these have introduced new features that sparked discussion among moviegoers.

We therefore gathered extensive data on Reddit users' opinions of Marvel media from the previous two years, as well as external information on all Marvel media released during the same period. Using NLP, we extracted the number of mentions of each title from the comments left by Reddit users, and we generated an interactive bar plot using a JavaScript chart. The plot shows the popularity of each Marvel title among Reddit users over the last two years, measured by the number of mentions. The top five are "Spider Man", "Loki", "Thor", "WandaVision", and "Eternals". The last title in the chart is "The Falcon and the Winter Soldier".

The Correlation between Marvel Media's Related Elements

Business goal: Identify the correlation between box office, investment, and audience reviews, and check whether Reddit users' opinions of movies and series are consistent with IMDb rating trends.

With the release of Iron Man in 2008, the Marvel Cinematic Universe erupted into the public consciousness, and it has dominated the box office ever since. The Marvel franchise has not slowed down. With wildly popular titles coming out consistently, and a steady stream of shows released on streaming platforms like Netflix and Disney+, Marvel is poised to dominate the pop culture landscape for the foreseeable future. The influx of comic book movies has brought more attention to their box office and investment. Audience reviews are growing as well, because each new movie or series generates a lot of discussion around its content.

To explore the relationship between these elements, we used Spark NLP to extract the scores, sentiment, and mention counts from a large number of Reddit posts under different Marvel movie topics. We also categorized the titles as movies or TV series so that we could see the differences between the two. In addition, we combined these data with the box office and investment figures of the related movies and series. To make the analysis more accurate, we used the IMDb rating of each title as one of the indicators, because IMDb's movie data has wide coverage and high accuracy. The purpose is to understand the market for Marvel media through the correlations between these elements and releases, so that investment can be planned within a comfortable level and budget.

Registered IMDb users can cast a vote (from 1 to 10) for each released title in the database. Individual votes are aggregated and summarized into a single IMDb rating. Following this definition, we computed a comparable Reddit measure: the average Reddit score for each title, aggregated over the posts whose sentiment is positive.

We then generated a pairwise plot to find the set of features that best explains the relationship between two variables or forms the most clearly separated clusters. The subplots are arranged as a matrix in which the row name represents the x axis and the column name represents the y axis. The subplots on the main diagonal show the univariate distribution of each attribute. The heatmap in the upper triangle and the scatter plots in the lower triangle show the relationship (or lack thereof) between two variables. Even a default pairs plot often gives valuable insights. Count and investment (USD) are positively correlated, suggesting that titles that draw more interest also tend to receive larger investment. The correlation between IMDb rating and Reddit score does not appear to be very strong, probably because the two platforms serve different purposes.
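
The pairwise relationships discussed here can be summarized numerically with a Pearson correlation matrix. The sketch below uses plain Python. The function names, the column names, and every number in it are made up for illustration and are not the project's data or results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up values, one entry per title (assumption, not real figures).
features = {
    "mention_count": [100, 200, 300, 400],
    "investment_usd_m": [150, 180, 200, 250],
    "imdb_rating": [7.0, 6.5, 8.0, 7.2],
}
corr = correlation_matrix(features)
```

A coefficient near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. With only a handful of titles, any such coefficient should be read cautiously.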

Fluctuation of Discussion Heat of the Movies and Series Before and After the Release Date

Business goal: Compare the audience's expectations before release with their reactions after release for each movie or series, and look for possible reasons.

Because of promotion, star power, and other factors, the audience's attention, reviews, and feedback for many series and movies differ before and after release. We therefore combined the sentiment analysis results and score data from Reddit with release dates from the external movie data, to investigate whether the intensity of audience discussion of films and series is consistent before and after release.

We used Plotly to create this graph so that the results could be shown in a complete, vivid, dynamic, and detailed way. In the graph, curves of different colors represent different titles, and the vertical dashed lines indicate the release date of each title. The horizontal axis is the full timeline of the collected Reddit data. The vertical axis is the heat count, which is calculated from the sentiment analysis results and the Reddit score data using specific weights. Hovering over any curve shows the title, the heat index, and the corresponding point in time.

As the graph shows, most titles reached their highest heat around their release date. The hottest movie and series during this period were 'Spider-Man: No Way Home' and 'Loki Season 1' respectively. The main reason may be that these two titles feature some of the most popular characters in the MCU, especially the new Spider-Man movie. It was unprecedented for three actors who had played Spider-Man to appear together in the same movie, and this attracted a huge fan base. For movies, most heat trends are roughly symmetrical, with the peak on the day or week of release and about two months of discussion in total. For TV series, which were released weekly, heat fluctuated on a weekly basis over the one to two months of broadcasting and remained at a relatively high level. The hottest day of each week was the day a new episode was released.

Note that in this plot we dropped the heat data for WandaVision and The Falcon and the Winter Soldier, because the values we extracted were extraordinarily low. The reason is that the titles of these two series are formed from two characters' names, which interfered with extracting the title data. Ideally, the filtering rules would count any mention of part of a series title as valid data, but doing so produced too much meaningless and unrelated data. We could therefore only match the complete titles. As a result, very few discussions and comments were selected for these two series, and the calculated heat values were abnormal. For these two series, the analysis of the characters' heat levels is therefore more informative and meaningful.
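
The trade-off between matching complete titles and matching parts of a title can be seen in a small sketch. Both patterns and the sample comments are hypothetical and are not the filtering rules used in the project.

```python
import re

# Hypothetical patterns (assumption): a strict full-title match versus a
# loose match on parts of the title.
FULL_TITLE = re.compile(r"\bthe falcon and the winter soldier\b", re.IGNORECASE)
PARTIAL_TITLE = re.compile(r"\b(falcon|winter soldier)\b", re.IGNORECASE)

# Made-up sample comments, for illustration only.
sample_comments = [
    "The Falcon and the Winter Soldier finale was great",
    "Falcon's new suit looks amazing",            # about the character
    "Winter Soldier had the best fight scenes",   # ambiguous reference
]

full_hits = [c for c in sample_comments if FULL_TITLE.search(c)]
partial_hits = [c for c in sample_comments if PARTIAL_TITLE.search(c)]
```

The strict pattern keeps only the first comment, while the loose pattern keeps all three, including comments that may be about a character rather than the series.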