As many of us can assume, the availability of movies is endless to the point that a person could watch a new movie every waking hour. However, we often find ourselves searching through the movie selection that our streaming services offer (Netflix, Amazon Prime Video, Hulu, etc.), hoping to find the right movie to fit our interests, if not just our mood. Unlike physically browsing through Blockbuster in the past, streaming services try to shorten our browsing time by presenting movie recommendations immediately after we log in. And to produce those recommendations, they use data science—specifically machine learning. In this article, I will explain, step by step, how I made my version of a recommendation system.
Click here to access the GitHub repository. To run the project, follow the instructions in the README or in the script. Quick model stats: the accuracy of this system was ~65% for predicting near the actual rating and ~73% for predicting whether a user would like or dislike a movie.
The first thing to cover in any data science project is the data source. There are many databases available for movie recommendation systems. I decided to build my system on the MovieLens 25M Dataset, provided for free by GroupLens, a research lab at the University of Minnesota. This dataset contains 25,000,095 movie ratings from 162,541 users, with ratings ranging from 0.5 to 5.0.
Though there are many files in the downloaded zip file, I will only be using movies.csv, ratings.csv, and tags.csv.
Warning!
Looking at just the sheer number of ratings in this dataset, a red flag should go up — 25 million rows in one CSV file is not easy on RAM during pre-processing! Depending on your system, loading in the movie ratings should be fine… until you start merging these CSVs… A simple early fix can bypass this issue and will be addressed later.
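As an aside, and separate from the fix addressed later, one common mitigation is to give Pandas narrower numeric dtypes when loading, which roughly halves the memory footprint. The dtype choices below are assumptions based on the dataset's documented ranges, not the code from the repository:

```python
import pandas as pd

# Smaller dtypes roughly halve the memory footprint of ratings.csv
ratings = pd.read_csv(
    "ratings.csv",
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)
print(ratings.memory_usage(deep=True).sum() / 1e9, "GB")
```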
A glance at the contents of the three CSV files immediately shows that movies.csv and tags.csv need a bit of string manipulation and filtering before merging can even begin. (Even though I processed additional information in the GitHub repository, not all of the processed data was used as model input, and that data will not be mentioned in this article.)
MOVIES.CSV
Luckily, GroupLens made this dataset easy to manage with uniform formatting, so all that needed to be done was to extract the genres and place them into individual columns. But first, I needed to determine the unique genres used in the dataset:
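(A minimal sketch of that step; the exact code in the repository may differ.)

```python
import pandas as pd

movies = pd.read_csv("movies.csv")

# Genres are pipe-separated strings, e.g. "Adventure|Animation|Children"
all_genres = []
for genre_string in movies["genres"]:
    all_genres.extend(genre_string.split("|"))

# A set keeps only unique genres; convert back to a list for list methods later
unique_genres = list(set(all_genres))
print(sorted(unique_genres))
```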
The code extracts all the genres in movies.csv into a list, changes the list to a set to keep only unique genres, and then transforms it back into a list so that list methods can be used later on if necessary. When comparing the unique genres obtained from the dataset against the genres listed in the README, the “IMAX” genre is missing from the README list.
Although the IMAX genre is missing from the README, that might have been for a good reason: most likely, IMAX wasn't part of the dataset in previous versions and the people at GroupLens simply forgot to update the README file. But the reason I bring this up is that IMAX isn't a genre… it's a viewing feature. So, I decided to remove this genre from my list of unique genres before making individual genre columns. I also changed the genre “(no genres listed)” to “None” just to make things a bit more uniform and easier to remember when coding.
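Continuing the sketch above, the two adjustments plus per-genre indicator columns might look like this (the substring-matching approach is an assumption on my part):

```python
# Drop the IMAX "genre" and rename "(no genres listed)" to "None"
unique_genres.remove("IMAX")
unique_genres[unique_genres.index("(no genres listed)")] = "None"
movies["genres"] = movies["genres"].str.replace(
    "(no genres listed)", "None", regex=False
)

# One binary indicator column per genre
for genre in unique_genres:
    movies[genre] = movies["genres"].str.contains(genre, regex=False).astype(int)
```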
TAGS.CSV
Now, this is where things become a bit tricky because the tags in this dataset were user-created, and some of them are more opinionated than others. But before determining which opinions should be kept, I lowercased all tags and removed any parentheses from them, as in “Oscar (Best Supporting Actress)”.
To remove the parentheses and the text enclosed in them, I used Python's Regular Expressions module (“import re”). Afterward, to do the simplest form of Natural Language Processing (NLP) without using any additional libraries, I gave the tags a brief look and decided on what should determine which tags were too opinionated.
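For reference, a minimal version of that clean-up step (the exact regex pattern in the repository may differ):

```python
import re
import pandas as pd

tags = pd.read_csv("tags.csv")

# Lowercase, then strip out any parenthesized text,
# e.g. "oscar (best supporting actress)" -> "oscar"
tags["tag"] = (
    tags["tag"]
    .astype(str)
    .str.lower()
    .apply(lambda t: re.sub(r"\(.*?\)", "", t).strip())
)
```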
Just by looking at the first five rows in tags.csv, I would say that “so bad it's good” was not a good description of a movie—and there are many tags similar to that phrase. Thankfully, what these overly opinionated tags have in common are words with one or two letters, such as “so”, “if”, and “a”. By removing all tags with short words, I expected to filter out many tags that were not helpful in describing the movie. However, it takes three separate if-statements just to do this one task, as in the sketch below. That is because tags containing the words “based” and “sci-fi” are useful but would otherwise have been removed: “based on” contains the two-letter word “on”, and the “fi” in “sci-fi” would have been caught by the last if-statement.
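A hedged reconstruction of that filter (the word-splitting details are assumptions; it continues from the previous snippet):

```python
def keep_tag(tag):
    # Exceptions: "based on ..." and "sci-fi" tags are useful descriptions
    if "based" in tag:
        return True
    if "sci-fi" in tag:
        return True
    # Drop tags containing any one- or two-letter word, e.g. "so bad it's good"
    if any(len(word) <= 2 for word in re.split(r"[\s\-]+", tag)):
        return False
    return True

tags = tags[tags["tag"].apply(keep_tag)]
```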
[Just to mention, most of the tags seemed to be spelled correctly.]
RATINGS.CSV: Defining Like and Dislike
To make a clear definition of which movies a user liked or disliked, I defined “like” to mean that the user gave the movie a 4.0+ rating—anything lower was a dislike. Since this was a simple distinction to code and ratings.csv was already large, I did not add new columns to ratings.csv to distinguish between like and dislike.
This is when it would be best to create and use a subset of ratings.csv if the system does not have a lot of RAM (<16 GB).
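A sketch of both steps; the subsample size here is an arbitrary assumption, and the like/dislike split is computed on the fly rather than stored:

```python
ratings = pd.read_csv("ratings.csv")

# Optional: subsample on low-RAM systems (the size is an arbitrary choice)
ratings = ratings.sample(n=5_000_000, random_state=42)

# "Like" = rating of 4.0 or higher; anything lower is a dislike
liked_mask = ratings["rating"] >= 4.0
```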
Here is where things might become more complicated: how to tell the machine learning models what a user likes and dislikes. The approaches I took differed between the genres and the tags of the movies that the users had previously watched. I decided on three models to generate a final prediction: a genres model, a tags model, and a combined model.
Genres Model: Scaling Genres Interests
To create the inputs for the genres model, I created a genres profile for each user in which the genre counts of all the movies they liked were scaled to a 0–1 range, such that the scaled genre values sum to a total of 1—with the same done for disliked movies. For the movie in question, since the genres were already processed into individual numeric-categorical genre columns, the movie profile was already ready for model input.
The reason for scaling was mainly to minimize any bias the models might develop toward users who have rated many movies over users who have rated only a few. For example, a straightforward approach would be to add up all the genres from the movies the user liked and directly feed that into the model. If Jessica rated 20 movies and Todd only rated 2, Jessica's profile might have a value of 20 in the Action genre while Todd has 2 in the same genre. The model will most likely infer a magnitude relationship between 20 and 2, thinking that Jessica likes Action movies a lot while Todd only somewhat likes them. With the scaling approach that I used, Jessica and Todd would both have high values for the Action genre and be judged on the same scale. But a caveat to my scaling is that Todd would be seen to REALLY(!) like Action movies while Jessica just likes them, since Jessica's profile will be much more diverse with other genres.
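A sketch of building one user's scaled genre profile (the helper name and the genre_columns list are assumptions):

```python
def genre_profile(user_ratings, movies, genre_columns, liked=True):
    # Select the user's liked (4.0+) or disliked movies
    mask = user_ratings["rating"] >= 4.0 if liked else user_ratings["rating"] < 4.0
    movie_ids = user_ratings.loc[mask, "movieId"]
    watched = movies[movies["movieId"].isin(movie_ids)]

    # Sum genre indicators, then normalize so the profile sums to 1
    counts = watched[genre_columns].sum()
    total = counts.sum()
    return counts / total if total > 0 else counts
```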
Tags Model: Phrase Vectorization
Numerically representing words is a requirement for machine learning models, and most NLP libraries have tools for word vectorization, but many of them do not vectorize phrases or sentences. At the start of the project, I did not think I would use the tags for one reason—could the models find the relationship between the vectors when the vectors were not organized in any specific way?
To help the models learn, I did a bit more pre-processing on tags.csv. I wanted to remove all uncommon tags to shrink the vector dictionary that would be created later. With the amazing Pandas GroupBy function, finding common tags was a simple task:
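(A sketch of that counting step; the frequency threshold of 100 is an assumption.)

```python
# Count how many times each tag was applied across all movies
tag_counts = tags.groupby("tag")["movieId"].count().sort_values(ascending=False)

# Keep only tags that appear often enough (the threshold is an assumption)
common_tags = tag_counts[tag_counts >= 100].index

# Vector dictionary: each common tag gets an integer id
tag_vectors = {tag: i for i, tag in enumerate(common_tags)}
```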
Then, I simply vectorized the common tags. When looping through the tags, if a tag was not in the vector dictionary, it was skipped because it was considered uncommon. To create the tags profile for each user, I added up all the tags for the movies the user liked and disliked separately and kept only the 20 highest tag counts. This allows for more general and uniform profiles across all users rather than focusing only on the tags that the user created. After [a long time of] processing, the result was a tag-profile DataFrame for each user and movie.
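A sketch of one user's tag profile, reusing the tag_vectors dictionary from above (the helper is hypothetical; the top-20 cutoff comes from the text):

```python
def tag_profile(user_movie_ids, tags, tag_vectors, top_n=20):
    # Tags attached to the movies this user rated (liked or disliked subset)
    user_tags = tags[tags["movieId"].isin(user_movie_ids)]["tag"]

    # Count occurrences, skipping tags outside the vector dictionary
    counts = user_tags.value_counts()
    counts = counts[counts.index.isin(tag_vectors)]

    # Keep only the most frequent tags for a uniform profile
    return counts.head(top_n)
```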
Now that the inputs were created for two of the machine learning models, the models needed to be created and trained.
Genres Model: Neural Network/Deep Learning
Here I used Keras/TensorFlow (GPU) for the neural network modeling:
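The full model lives in the repository; below is a minimal sketch of the architecture described next (the layer sizes and genre count are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_genres = 19  # assumption: number of genre columns after dropping IMAX

# Three input branches: liked profile, disliked profile, and the movie's genres
liked_in = keras.Input(shape=(n_genres,), name="liked_genres")
disliked_in = keras.Input(shape=(n_genres,), name="disliked_genres")
movie_in = keras.Input(shape=(n_genres,), name="movie_genres")

liked = layers.Dense(32, activation="relu")(liked_in)
disliked = layers.Dense(32, activation="relu")(disliked_in)
movie = layers.Dense(32, activation="relu")(movie_in)

# Concatenating layer combines the three branches
merged = layers.concatenate([liked, disliked, movie])
hidden = layers.Dense(64, activation="relu")(merged)

# Sigmoid caps the output; labels are ratings divided by 5 before training
output = layers.Dense(1, activation="sigmoid")(hidden)

model = keras.Model([liked_in, disliked_in, movie_in], output)
model.compile(optimizer="adam", loss="mse")
```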
As explained in the code comments, this model takes the user's liked and disliked genre profiles and the genre profile of the movie in question as three separate inputs. Then, using a concatenating layer, the three branches are combined. The output layer uses the sigmoid activation function because I wanted the predictions to be capped at 5 (the maximum on the rating scale). This means the labels/ratings need to be scaled before training (divided by 5) and rescaled back up after predicting.
Tags Model: Random Forest
For the tags model, I decided to use a random forest model since the input variables were ordered by descending popularity — a setting where random forest can determine the importance of each variable.
Warning!
Normally, optimizing the hyperparameters would be required. However, each tree with default parameters took a large amount of RAM. On my system with 48 GB of RAM, I was only able to max out at 100 trees, with occasional shutdowns due to Python running out of memory. This is when limiting the depth of each tree becomes necessary for systems with less RAM. From testing, using less data during the training phase does not largely impact the prediction results.
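A sketch of the tags model under those constraints (the depth limit and the training-array names X_train_tags/y_train are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor

# 100 trees was the practical ceiling at 48 GB of RAM; max_depth limits
# each tree's memory footprint on smaller systems (the value is an assumption)
tags_model = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1)
tags_model.fit(X_train_tags, y_train)  # hypothetical training arrays
```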
Combined Model: Linear Regression
The last model takes the predictions from both the genres and tags models and outputs a final prediction using linear regression.
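A sketch of this stacking step (the prediction-array names are assumptions; in practice the regression would be fit on training predictions and evaluated on held-out data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stack the two models' predictions as the features for the final model
X_combined = np.column_stack([genres_predictions, tags_predictions])

combined_model = LinearRegression()
combined_model.fit(X_combined, actual_ratings)
final_predictions = combined_model.predict(X_combined)
```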
Now the important part of this whole article — the results:
[Results images: accuracy statistics for the Genres Model, Tags Model, and Combined Model, each reported with the FLEX metrics explained below.]
First, I should explain the custom statistical term “FLEX” seen in all of the results above. FLEX for ratings means that a regression prediction needed to be within +/-0.5 of the actual label to be considered correct. For predicting like or dislike, the FLEX decision boundary for what counts as a liked movie is lowered to 3.5+ instead of 4.0+. The reasoning for making the statistics flexible is that the original rating scale uses 0.5 increments, but a regression model will by default predict values between or on those increments. And, to me, predicting a value close to the actual rating should still count, since predicting the actual rating vs. a close rating should not make much of a difference when suggesting movie recommendations.
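In code, the two FLEX metrics reduce to a couple of comparisons (a sketch reusing the hypothetical array names from the previous snippet):

```python
# FLEX rating accuracy: within +/- 0.5 of the actual rating counts as correct
flex_rating_acc = (np.abs(final_predictions - actual_ratings) <= 0.5).mean()

# FLEX like/dislike: predictions of 3.5+ count as "like"; labels use 4.0+
predicted_like = final_predictions >= 3.5
actual_like = actual_ratings >= 4.0
flex_like_acc = (predicted_like == actual_like).mean()
```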
When examining the performance of all three models, the statistics do show that the performance improves in the combined model—but not by as much as I had hoped. However, improvement is an improvement, especially at the training and prediction speed of a small linear regression model.
[Although I used regression models, I can infer like/dislike using the definition of like (a rating of 4.0+). Switching to classification models would yield slightly better results in predicting like/dislike, but would not be useful for determining which movies to recommend first, since categorical predictions provide no values to rank.]
Looking at user 6550 as an example, the combined model recommended various animation, drama, and war movies — matching 6550's like and dislike genre profiles. Oddly, the artwork of the four animation movies has a similar Japanese anime style—perhaps tied together by the tags? If so, the tags model might have been able to make the connections between the tag vectors that I feared it would not.
[Since the tag profile is not as intuitive as the genre profile without transforming the tag vectors back to the original tags, it is not shown here. User 6550 was randomly chosen and coincidentally was one of the users who had made many ratings.]
After seeing the results, I would say that my recommendation system worked well enough, with a ~73% chance of correctly predicting whether a person would like or dislike a movie.
To improve the predictive performance, I would first examine which users the models have trouble predicting correctly and see if there is a correlation among those users. One possibility is that the models have a low chance of predicting correctly for users who rated only a few movies. If so, collaborative filtering could help by projecting such users onto similar users who have rated many movies.
Lastly, how would a streaming service use this project? First, it can be used for its intended purpose of recommending movies to customers. Second, it can help determine which movies to add to and remove from the selection. Third, it can help the service understand current trends and the interests of its customers, and, if the streaming service produces movies, which genres to focus production on.
[I encourage readers to share their thoughts and experiences with recommendation systems! I’m still learning so any input would be helpful.]