Movie Recommendation System

Kanu Priya
Dec 20, 2020

#MachineLearning2020.

With so many movies available, recommending one that a particular viewer will enjoy is a real challenge. It is hard to tell quickly which movie a viewer will prefer or what kind of story they are looking for.

In this blog, we build a movie recommendation system using two techniques: content-based filtering and collaborative filtering.

Let’s start with an introduction to recommender systems.

What is a recommender system?

The recommender system aims to suggest relevant items to the user based on their personal interest or past preferences.

We have used two categories of recommender systems:

Collaborative filtering builds personalized recommendations from the relationship between users and items. It builds a model from a user’s past behavior and from similar decisions made by other users.

Content-based filtering depends entirely on the user’s choices and the descriptions of the items. Given a particular item, it suggests items similar to it.

About the dataset

I have used ‘The Movies Dataset’ to build the recommender system. We will use four of its files: movies_metadata.csv, keywords.csv, credits.csv, and ratings.csv.

Reading the dataset:

import pandas as pd
movies = pd.read_csv("../input/the-movies-dataset/movies_metadata.csv")
ratings = pd.read_csv("../input/the-movies-dataset/ratings.csv")
credits = pd.read_csv("../input/the-movies-dataset/credits.csv")
keywords = pd.read_csv("../input/the-movies-dataset/keywords.csv")

First, I explored the dataset by plotting a few visualizations.

Top Genres

Top 20 movies with highest rating

Now, let’s implement our first approach, the content-based filtering approach.

We built a content-based engine that uses metadata such as cast, crew, genre, popularity, and keywords to come up with recommendations. These features need to be preprocessed first.

• First, we stemmed the text features, stripped the spaces from them, and converted them to lowercase.

• Next, we used a CountVectorizer to create the count matrix.

• Then we used cosine similarity to compute a score that measures how similar two movies are.
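
To make the last two steps concrete, here is a minimal sketch of cosine similarity on word counts, written in plain Python. The three "soups" are made-up examples, not rows from the dataset, and the sketch counts single words only, whereas the vectorizer used in this blog also counts word pairs and drops English stop words.

```python
import math
from collections import Counter

def cosine_from_counts(soup_a: str, soup_b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(soup_a.split())
    counts_b = Counter(soup_b.split())
    # Dot product over the shared words (a missing word counts as 0).
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical soups for three made-up movies.
soup_space = "space alien action janedoe"
soup_sequel = "space alien horror janedoe"
soup_romcom = "romance comedy wedding"

sim_close = cosine_from_counts(soup_space, soup_sequel)  # 3 shared words out of 4
sim_far = cosine_from_counts(soup_space, soup_romcom)    # no shared words
print(sim_close, sim_far)
```

Two movies that share most of their metadata get a score close to 1, and two movies with nothing in common get 0.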

# Preprocess features.
# `content` is assumed to be the merged metadata DataFrame whose feature columns hold lists of strings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
features = ['keywords', 'genres', 'production_companies', 'cast', 'director']
for feature in features:
    # Stem every token, then remove spaces and convert to lowercase.
    content[feature] = content[feature].apply(lambda x: [stemmer.stem(i) for i in x])
    content[feature] = content[feature].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

# Combine all features into one space-separated string per movie.
content['soup'] = content['keywords'] + content['cast'] + content['director'] + content['genres'] + content['production_companies']
content['soup'] = content['soup'].apply(lambda x: ' '.join(x))

# Calculate cosine similarity on unigram and bigram counts.
# min_df=1 keeps every term, just as min_df=0 did; recent scikit-learn releases reject an integer min_df below 1.
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(content['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
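
Once the similarity matrix exists, recommending movies comes down to sorting one of its rows. The sketch below shows one possible way to do that lookup. The helper name `top_n_similar`, the titles, and the similarity values are all hypothetical and only illustrate the idea; this is not necessarily how the results in this blog were produced.

```python
import numpy as np

def top_n_similar(title, titles, sim_matrix, n=10):
    """Return the n titles most similar to `title`, best match first."""
    idx = titles.index(title)
    scores = sim_matrix[idx]
    # Sort indices by descending similarity and drop the movie itself.
    ranked = [i for i in np.argsort(-scores, kind="stable") if i != idx]
    return [titles[i] for i in ranked[:n]]

# Made-up titles and a made-up symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
])

recs = top_n_similar("Movie A", titles, sim, n=2)
print(recs)
```

With the real data, `titles` would be the list of movie titles in the same row order as the similarity matrix.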

Top 10 recommended movies for particular movie

The second approach is collaborative filtering.

A user-user similarity matrix is used to predict movies, based on the idea that users similar to me can help predict which movies I would like to watch.

  • The user-movie rating matrix is created using the ‘pivot’ function, with user id as the index, movie id as the columns, and ratings as the values.
  • Cosine similarity is used to calculate the similarity among users.

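# Pivot the ratings into a user-by-movie matrix.
# `rm` is assumed to be the ratings DataFrame merged with the movie ids.
from scipy.sparse import csr_matrix
df_movie_features = rm.pivot(
    index='userId',
    columns='id',
    values='rating'
).fillna(0)
mat_movie_features = csr_matrix(df_movie_features.values)
# Calculate user-user similarity.
# `user` and `other_users` are not defined in this excerpt; they are assumed to be
# the rating rows of the target user and of the remaining users.
similarities = cosine_similarity(user, other_users)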
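
One common way to turn these similarities into a predicted rating is a similarity-weighted average of the ratings that other users gave the movie. The sketch below illustrates that idea on a tiny made-up matrix. The function name `predict_rating` and the numbers are assumptions for illustration, and the exact scoring rule used in this project may differ.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    `ratings` is a users x movies float array where 0 means "not rated".
    """
    target = ratings[user]
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target)
    # Cosine similarity between the target user and every user.
    sims = np.divide(ratings @ target, norms,
                     out=np.zeros(len(ratings)), where=norms > 0)
    sims[user] = 0.0                  # ignore the target user
    rated = ratings[:, item] > 0      # only users who rated the movie
    weight = sims[rated].sum()
    if weight == 0:
        return 0.0                    # no similar user has rated the movie
    return float(sims[rated] @ ratings[rated, item] / weight)

# Made-up ratings: 3 users x 4 movies.
toy = np.array([
    [5, 4, 0, 1],
    [5, 5, 4, 1],
    [1, 1, 2, 5],
], dtype=float)

# User 0 has not rated movie 2. User 1 has similar taste and gave it a 4,
# while the dissimilar user 2 gave it a 2.
pred = predict_rating(toy, user=0, item=2)
print(round(pred, 2))
```

The prediction lands closer to the rating of the more similar user, which is exactly the behaviour the user-user approach relies on.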

Top 5 movie recommendations:

There is another collaborative filtering approach that uses the mighty algorithm Singular Value Decomposition (SVD).

It works on a matrix structure where each row represents a user, and each column represents a movie. The values of the matrix are the ratings.
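
In recommender libraries such as Surprise, "SVD" refers to a matrix factorization model that learns a short vector of latent factors for every user and every movie, plus bias terms, by stochastic gradient descent on the known ratings. Here is a minimal sketch of that idea in NumPy. It is not the library's implementation, and the toy ratings and hyperparameters (number of factors, learning rate, regularization, epochs) are assumptions chosen only for illustration.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD on the observed ratings.

    `ratings` is a list of (user_index, movie_index, rating) triples.
    """
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)     # user and movie biases
    p = rng.normal(0, 0.1, (n_users, n_factors))        # user factors
    q = rng.normal(0, 0.1, (n_items, n_factors))        # movie factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            # Update both factor vectors from their pre-update values.
            p[u], q[i] = (p[u] + lr * (err * q[i] - reg * p[u]),
                          q[i] + lr * (err * p[u] - reg * q[i]))

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + p[u] @ q[i]

    return predict, mu

def rmse(predict, ratings):
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings])))

# Made-up (user, movie, rating) triples.
toy_ratings = [(0, 0, 5), (0, 1, 4), (0, 3, 1), (1, 0, 5), (1, 1, 5),
               (1, 2, 4), (2, 0, 1), (2, 2, 2), (2, 3, 5)]

predict, mu = factorize(toy_ratings)
rmse_baseline = rmse(lambda u, i: mu, toy_ratings)  # always predict the mean
rmse_model = rmse(predict, toy_ratings)
print(rmse_baseline, rmse_model)
```

Because the model only learns from the ratings that exist, it can also estimate a rating for a user and movie pair that never appears in the data, for example `predict(0, 2)`.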

We divided the dataset into five folds; each fold is used once as the test set while the remaining k-1 folds are used for training. We applied the SVD model and calculated the root mean square error (RMSE) and mean absolute error (MAE).
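
The evaluation procedure can be sketched in plain Python as follows. The ratings are made up, the "model" is simply the mean rating of the training folds, and the folds are taken in order without shuffling, so this only illustrates how the folds and the two error measures work rather than reproducing the actual experiment.

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def rmse(errors):
    """Root mean square error: penalizes large errors more heavily."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical ratings; the placeholder model predicts the training mean.
ratings = [5, 4, 5, 5, 4, 1, 2, 5, 1, 3]
fold_scores = []
for train_idx, test_idx in k_fold_indices(len(ratings), k=5):
    train_mean = sum(ratings[i] for i in train_idx) / len(train_idx)
    errors = [ratings[i] - train_mean for i in test_idx]
    fold_scores.append((rmse(errors), mae(errors)))

for fold, (fold_rmse, fold_mae) in enumerate(fold_scores, start=1):
    print(f"fold {fold}: RMSE={fold_rmse:.3f}  MAE={fold_mae:.3f}")
```

Averaging the per-fold scores gives a more reliable estimate of the error than a single train/test split.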

from surprise import SVD, Dataset, Reader
from surprise.model_selection import KFold, cross_validate

svd = SVD()
reader = Reader()  # Used to parse the ratings
dd = Dataset.load_from_df(rm[['userId', 'id', 'rating']], reader)

# 5-fold cross-validation, reporting RMSE and MAE for each fold
kf = KFold(n_splits=5)
t = cross_validate(svd, dd, measures=['RMSE', 'MAE'], cv=kf)

# Retrain on the full dataset, then estimate the rating of user 2 for movie 115.
# The third argument is the true rating (r_ui), stored for reference only.
trainset = dd.build_full_trainset()
svd.fit(trainset)
prediction = svd.predict(2, 115, r_ui=3)

Estimated rating prediction :

Conclusion

This blog covered several techniques for building a movie recommendation system. The content-based recommender suggests movies that are close to a given movie, but it does not capture a user’s personal interests and preferences. Collaborative filtering, whether based on user-user similarity or on the SVD algorithm, does take personal interest into account and recommends movies based on how other users have rated them.

Contribution

This project has been implemented by me, Kanu Priya, and my teammate Avaneesh Kumar Patel. We both have contributed equally to this project.

Kanu Priya implemented the content-based and SVD filtering techniques on ‘The Movies Dataset’, while Avaneesh Kumar Patel implemented a collaborative filtering technique for the recommendation system.

Thanks to

Dr. Tanmoy Chakraborty

All the TAs: Shiv Kumar Gehlot, Vivek Reddy, Chhavi Jain, Shikha Singh, Pragya Srivastava, Nirav Diwan, Ishita Bajaj, Aanchal Mongia

#IIIT DELHI

#MachineLearning2020

Kanu Priya

M Tech student in IIIT DELHI | AI specialization