Personalized Movie Rating

Introduction

This semester I decided to do my Research and Publication project on writing an algorithm that would generate a personalized movie rating for each individual user. I decided to pursue this project as I am a big fan of movies and used to make short films myself. Now that I am pursuing an education in statistics and data science, I want to combine the analytical skills I am developing and my passion for film. My dream is to work for a streaming company like Netflix or Disney Plus, so this is my first step towards that.

Objective

My objective was first to write the algorithm based on already collected datasets, and then design a website that would allow the user to enter the title of the movie they wish to view, and then it would output their personalized rating metric for them. This rating metric will be a weighted average specific to each user of other common movie rating services (such as Rotten Tomatoes and IMDB). I will refer to this as “public ratings” from now on. Finally, I wanted to develop a quiz for new users on the website that, after filling out, would recommend users new movies to watch.

Data Cleaning and Processing

In the first part of the project I cleaned and explored my data, implementing the components of the algorithm on a small scale before projecting it onto all my data. The first dataset I had was one that had 13 public ratings per movie, as can be seen below.

However, this dataset only contained movies from 2016 to 2018. Still, I decided to go ahead with this data and see if I could get some good results.

The second dataset I used was one that contained 270,000 movies. For each movie, several users had given their own personal review out of 5. This would become my response variable.

I then merged these datasets together and dropped unnecessary columns until I achieved a data frame consisting of user ratings for movies out of 5 and the corresponding public rating scores. Quickly, I realized that I lost over 99% of my data on user ratings after joining the datasets. This means I had to either find a larger data set with public rating scores, or otherwise web scrape the rest. In the meantime, I decided to practice some of the steps I had mentioned in my objective paragraph on this smaller scale data frame before replicating them with my complete data frame.

Small Scale Linear Regression

I performed linear regression to generate a model that produced a weighted equation of the public rating scores for all movies in the data frame. This is what I would later apply to each individual user. I could not do this, however, until I had a significant amount movie entries for a given user.

Small Scale Movie Recommender

Next, I generated a movie recommendation system using a cosine similarity algorithm. The formula below takes takes two data sets and outputs a range of values between -1 and 1. Values of 1 are perfectly similar, whereas values of -1 are completely different.

For my project, I used the same dataset (the one with the 13 public ratings) as my A and B vectors. Using the following function, I ordered my movies by other movies most similar to them. I then select the top three and displayed them as the recommended movie for a given imputed title.

Web Scraping

I could not find another dataset to use, so I turned to web scraping. I did a bit of research and found that the best scraper to use for a dataset of this size was one called Scrapy. So after watching a couple of tutorials on Scrapy, I web scraped the IMDB page of a list of all movies from between 1972 and 2016. For each movie on this page, I had to scrape the title, the IMDB score, and the Metacritic score, which was also provided. The code I used was as follows.

The last part of this code was a little tricky. This part locates the “Next” button at the bottom of the page, then extract the URL corresponding to the following page and redirects the scraper to it. This allows the scraper to go through all the pages and scrap every movie.

Next I attempted to scrape Rotten Tomatoes the same way, but unfortunately this did not work due to the irregularity of the Rotten Tomatoes page. I then tried using Beautiful Soup, another (slower) scraper. Again, this is not work. So then I tried using Beautiful Soup along with a Chrome webdriver (which essentially opened a browser on my local computer and manually scraped through the website). This used the following code.

This actually did work! Well, almost. This was able to web scrape almost all the data, but it was simply too slow. Eventually, it crashed and I was not able to get all the data. Despite the lack of the Rotten Tomatoes data, I decided to continue with my project, and fortunately my results did not suffer too much from this loss.

Large Scale Linear Regression

At this point, I merged the web scraped data of my IMDB and Metacritic scores with my dataset of user ratings to get the data frame on the left. I then grouped this data frame by userId to get the data frame on the right.

Treating each group of userId as its own data frame, I performed a linear regression on each “sub-data frame”, with the public ratings as my explanatory variables and the user rating as my response variable. The code for this is below.

This was done to see how accurately my personalized movie rating algorithm would be. Based on the high mean squared error, one could argue that the algorithm would be quite poor, but upon further inspection I saw that this algorithm was fairly accurate for the most part, with a few outliers that could be potentially be accounted for using a more developed algorithm.

Personalized Movie Rating Algorithm

Finally, I designed my personalized movie rating algorithm. First I put together a list of ten popular movies. Here are the movies chosen and their corresponding public ratings.

I then wrote an algorithm that, after a user inputs their own rating out of 5 for each movie in the list, will be able to predict their personalized movie rating score for any movie released between the years 1972 and 2016. The code for doing so is as follows.

The initialize function “initializes” the user preferences such that for any movie title inputted into the personalized_rating function, the output would be the personalized rating for the initialized user. Here is an example:

Success!

Conclusion

All in all, this project was a success. My main objective was to design an algorithm that would be able to predict user’s personal movie rating for a given movie, and I accomplished exactly that. As a next step, I would find a way to properly scrape the Rotten Tomatoes data. I could do this by developing a better scraper, or alternatively by breaking the scraping process into multiple parts so that the program does not crash again. In addition, I could develop the website I wanted to make, allowing me to present my algorithm in a more visually appealing way. Lastly, I could implement the recommender algorithm I had already performed on a small scale onto the rest of the data. I could actually do this with one copy and paste, but it would be interesting to see if I could use my score generating algorithm to make this recommender algorithm more sophisticated. Overall I am quite happy with the work I have accomplished.

Bibliography

RDLongoria. “All U.S. Released Movies: 1972-2016.” IMDB, IMDB, 16 Mar. 2013, www.imdb.com/list/ls057823854/?sort=list_order,asc&st_dt=&mode=detail&page=1&ref_=ttls_vm_dtl.

“BROWSE ALL DVDS & STREAMING.” Top Movies - Netflix, Amazon, ITunes, DVD | Rotten Tomatoes, www.rottentomatoes.com/browse/dvd-streaming-all/.

Rounakbanik. “Movie Recommender Systems.” Kaggle, Kaggle, 6 Nov. 2017, www.kaggle.com/rounakbanik/movie-recommender-systems.

Semester

Fall 2019

Researcher

Aadiraj Batlaw

Navigation

Introduction
Objective
Data Cleaning and Processing
Small Scale Linear Regression
Small Scale Movie Recommender
Web Scraping
Large Scale Linear Regression
Personalized Movie Rating Algorithm
Conclusion
Bibliography

Executive / Directors

Member Profiles

Internal Affairs