90s-Kids Movies Recommender System

Published in AI Leap, Jul 30, 2021

This time I am going to build a recommender system for movies that are popular among people born in the 90s. Let me ask you something before we dive in.

How do you usually decide which movies to watch, or which restaurants to go to with your friends or family?

You probably ask other people for suggestions, and sometimes you search the internet to make a good choice. Let me explain with two use cases.

[Case 1] Before trying out a new restaurant, we search for its ratings and check how the reviews look, right? That is why ratings matter so much when we want to make recommendations for users.

Okay! What about when we do not find enough ratings for what we are looking for?

[Case 2] We usually look for restaurants based on what we like. For example, if I like to eat Pizza, I am going to find restaurants where I can get Pizza. If I love restaurants where I can listen to unplugged music, I am going to find the one where I can get a similar vibe, a similar restaurant.

These two cases map directly onto the types of recommendation.

When we build a recommendation system, we follow the same reasoning. There are three different types of recommendation: collaborative filtering, content-based, and hybrid methods.

The collaborative filtering method uses a matrix of users and the items they prefer (the user-item matrix). We calculate the similarity between user profiles to find users with similar interests, and then recommend the items that the most similar users liked. (Similar to Case 1)

The content-based method builds the user profile from the content itself by extracting features of the items. We look for items similar to the ones the user has already liked, and the most similar ones are recommended. (Similar to Case 2)

I hope you understand the general idea of recommendation.

Okay! I am going to explain how we can build a recommender system using collaborative filtering and content-based methods. We are going to make a movie recommender system for the 90s-kids as I mentioned earlier. We can also use this recommender and recommend to younger generations the movies that 90s kids used to enjoy.

Let’s dive in…

We are going to use the MovieLens dataset, which has the following data:

user id, movie id, movie title, rating, tags, relevance, timestamp, and so on.

Let’s see what the data looks like. We need to understand the data before we work with it.

The average of rating values.

Average rating of each movie.

The most-rated genres in the dataset.

Data preprocessing is also crucial for a recommender system, as most of the data we can get is not ready to use as-is. For the preprocessing details, please check the notebook file.

Content-based Recommendation

In this work, ‘tag’ is the main feature used and we are going to find the most relevant ‘tag’ and extract the features of the tag content. There are clean tags with relevancy scoring. That makes it easier to get the most relevant tags for the movie.

Then we preprocess the tags so that all of a movie's tags are concatenated into one feature, and build a dataframe with movieId, title, genres, and tags as shown below.

Dataframe with movieId, title, genres, and tags
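To make the concatenation step concrete, here is a minimal sketch on toy data. The column names `tag` and `relevance`, the made-up rows, and the `MIN_RELEVANCE` cut-off are assumptions for illustration; the notebook may filter and join the tags differently.

```python
import pandas as pd

# Toy stand-in for the tag relevance data; the values are made up for illustration.
df_tags = pd.DataFrame({
    "movieId":   [1, 1, 1, 2, 2],
    "tag":       ["animation", "toys", "dark", "board game", "jungle"],
    "relevance": [0.98, 0.95, 0.10, 0.90, 0.85],
})

MIN_RELEVANCE = 0.8  # hypothetical cut-off, not taken from the article

# Keep only the highly relevant tags, then join them into one string per movie.
relevant = df_tags[df_tags["relevance"] >= MIN_RELEVANCE]
tags_per_movie = (
    relevant.groupby("movieId")["tag"]
    .apply(" ".join)
    .reset_index()
)
print(tags_per_movie)
```

Each movie ends up with a single text field, which is the shape a text vectorizer expects.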

For feature extraction, TF-IDF is used to measure how important a word is in the tags: it looks at how many times the word appears in one movie's tags while discounting words that also appear in the tags of many other movies.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
x = vectorizer.fit_transform(df_movies_with_tags['genres'] + df_movies_with_tags['tag'])

We can check the scores of each tag.

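import pandas as pd

# check the TF-IDF scores of the tags for the first movie
# (on scikit-learn older than 1.0, use vectorizer.get_feature_names() instead)
df = pd.DataFrame(x[0].T.todense(), index=vectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)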
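A toy corpus makes the weighting easier to see. In the sketch below the three tag strings are invented for illustration, and the vectorizer uses single words only to keep the vocabulary small. A word that appears in only one movie's tags should get a higher weight than a word shared by several movies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "movies", each described by its concatenated tags.
tag_docs = [
    "superhero action comic",
    "superhero action space",
    "romance comedy",
]

toy_vectorizer = TfidfVectorizer(analyzer="word")
toy_tfidf = toy_vectorizer.fit_transform(tag_docs)

vocab = toy_vectorizer.vocabulary_       # word -> column index
row0 = toy_tfidf[0].toarray().ravel()    # TF-IDF weights of the first movie

# "comic" is unique to the first movie, "superhero" is shared with the second.
print("comic:    ", row0[vocab["comic"]])
print("superhero:", row0[vocab["superhero"]])
print("romance:  ", row0[vocab["romance"]])
```

The unique word gets the larger weight, the shared word a smaller one, and a word that never appears in the movie's tags gets zero.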

Once we get the TF-IDF scores, we measure the similarity between each pair of movie vectors using the sigmoid kernel.

from sklearn.metrics.pairwise import sigmoid_kernel

tag_genre_model = sigmoid_kernel(x, x)
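For readers wondering what the sigmoid kernel computes: with scikit-learn's defaults it is tanh(gamma * <a, b> + coef0), where gamma defaults to 1 / n_features and coef0 defaults to 1. The sketch below checks that against a manual NumPy calculation on three made-up unit-length vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up unit-length vectors standing in for TF-IDF rows.
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],   # overlaps with the first vector
    [0.0, 0.0, 1.0],   # shares nothing with the first vector
])

kernel = sigmoid_kernel(vectors, vectors)

# Same computation by hand, using the library defaults.
gamma = 1.0 / vectors.shape[1]
manual = np.tanh(gamma * (vectors @ vectors.T) + 1.0)

print(np.allclose(kernel, manual))
print(kernel[0, 1], kernel[0, 2])
```

Because tanh is increasing, a larger dot product always gives a larger kernel value, so the ranking of similar movies follows the dot product of their TF-IDF vectors.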

After we have calculated the similarity, we can start to find the movie recommendation. Before going straight to the recommendation function, we need to write some helper functions.

Find similar movies by movieId.

# find similar movies by movieId
def get_similar_movies(movieId, n):
  # similarity of this movie to every movie, as (row index, score) pairs
  scores = list(enumerate(tag_genre_model[movieId_dict[movieId]]))
  scores.sort(key=lambda x: x[1], reverse=True)
  # skip position 0, which is expected to be the movie itself
  return [(id_movieId_dict[i], score) for i, score in scores[1: n + 1]]

Find movie title by movieId.

# find movie titles by movieId
def get_movie_titles(movieIds):
  movie_titles = []
  for movieId in movieIds:
    if movieId in movie_dict:
      movie_titles.append(movie_dict[movieId])
  return movie_titles
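The helpers above rely on three lookup dictionaries, `movieId_dict`, `id_movieId_dict`, and `movie_dict`, that are built in the notebook. As a sketch of what they plausibly contain, assuming they map between movieId, row position in the similarity matrix, and title, they could be built like this on a toy dataframe (the notebook's own construction may differ):

```python
import pandas as pd

# Toy stand-in for df_movies_with_tags; only the columns needed here.
df_movies_with_tags = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

movie_ids = df_movies_with_tags["movieId"].tolist()
titles = df_movies_with_tags["title"].tolist()

# movieId -> row position in the similarity matrix
movieId_dict = {movie_id: row for row, movie_id in enumerate(movie_ids)}
# row position -> movieId (the inverse mapping)
id_movieId_dict = {row: movie_id for movie_id, row in movieId_dict.items()}
# movieId -> title
movie_dict = dict(zip(movie_ids, titles))

print(movieId_dict[3], id_movieId_dict[2], movie_dict[1])
```

The row order has to match the order of the rows passed to the vectorizer, otherwise the similarity scores would be attributed to the wrong movies.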

We pick Spider-Man(2002) as an example movie to find similar movies. Here is what we got.

Find movies that have been watched by userId.

def get_movies(userid):
  # ids of all movies this user has rated (a set, so the order is not kept)
  movie_ids = set(df_ratings[df_ratings['userId'] == userid]
    .sort_values('rating', ascending = False)['movieId'].tolist())
  return movie_ids

At last, we can create a recommender function for a specific user. The function takes userId and the number of recommendations we wish to make as input and returns the movie titles that we recommended.

def recommendation(userid, m=10):
  similar_movies = []
  watched_movie_ids = get_movies(userid)
  # collect the 5 nearest movies for every movie the user has watched
  for movie in watched_movie_ids:
    movies = get_similar_movies(movie, 5)
    similar_movies.extend(movies)
  similar_movies = sorted(similar_movies, key = lambda x : x[1], reverse = True)
  similar_movie_ids = [movie for movie, _ in similar_movies
    if movie not in watched_movie_ids]
  # dict.fromkeys removes duplicates but keeps the score order, which set() would lose
  similar_movie_ids = list(dict.fromkeys(similar_movie_ids))
  return get_movie_titles(similar_movie_ids[:m])

Let’s make a recommendation for userId 755, who has watched the following movies:

‘Toy Story (1995)’, ‘Jumanji (1995)’, ‘Grumpier Old Men (1995)’, ‘Beyond the Valley of the Dolls (1970)’, ‘Hiroshima Mon Amour (1959)’, ‘Heat (1995)’, ‘Sabrina (1995)’, ‘Father of the Bride Part II (1995)’, ‘GoldenEye (1995)’, ‘American President, The (1995)’, ‘Dracula: Dead and Loving It (1995)’, ‘Nixon (1995)’

The top 10 recommended movies for userId 755 are the following:

Batman Begins (2005)
Leviathan (1989)
No Country for Old Men (2007)
Schindler’s List (1993)
Casino (1995)
Cutthroat Island (1995)
Following (1998)
City of Lost Children, The (Cité des enfants perdus, La) (1995)
Captain Phillips (2013)
First Time, The (2012)

Okay, cool! That is the end of the content-based recommendation system. Tired of reading yet? If you only wanted to know about the content-based approach, you can skip the rest.

Collaborative-Filtering based Recommendation

For collaborative filtering, we first need a user-item matrix. The function below takes the dataframe with userId, movieId, and rating as input and returns that matrix.

def create_user_movie_ratings_matrix(df):
  # one row per user, one column per movie, 0 where the user gave no rating
  user_movie_ratings_matrix = (
    df.groupby(by = ['userId','movieId'])['rating']
      .max().unstack().fillna(0))
  return user_movie_ratings_matrix.astype(float)
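To see what the matrix looks like, here is the same groupby and unstack chain applied to a tiny made-up ratings table (the ids and ratings are invented for illustration):

```python
import pandas as pd

# Made-up ratings: three users, three movies, not every pair rated.
df_toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

toy_user_item = (
    df_toy_ratings.groupby(["userId", "movieId"])["rating"]
    .max().unstack().fillna(0).astype(float)
)
print(toy_user_item)
```

Rows are users, columns are movies, and a 0 means the user did not rate that movie. Note that filling with 0 treats "not rated" the same as a very low rating, which is a simplification.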

We are going to use the rating of the movie given by each user to find the similarity between the two users. There are two different methods for collaborative filtering: Neighborhood-based and Model-based.

We are going to focus on the neighborhood-based method. Within it, there are similarity-based and distance-based ways to find similar users or items.

We use only the similarity-based one to calculate how similar the users' interests are. The higher the similarity score, the closer the user profiles are.

import numpy as np

# calculate the similarity score between two users
def find_similarity(userId1, userId2,
  user_movie_ratings_matrix=user_movie_ratings_matrix):
  # dot product of the two users' rating vectors
  similarity = np.dot(user_movie_ratings_matrix.loc[userId1, :],
    user_movie_ratings_matrix.loc[userId2, :])
  return similarity

# find similar users
def get_similar_users(userId,
  user_movie_ratings_matrix=user_movie_ratings_matrix, m=10):
  users = []
  for i in user_movie_ratings_matrix.index:
    if i != userId:
      similarity = find_similarity(userId, i, user_movie_ratings_matrix)
      users.append((i, similarity))
  users.sort(key=lambda x: x[1], reverse=True)
  return users[:m]
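The code above uses the raw dot product as the similarity score. As a side-by-side sketch of the similarity-based and distance-based options mentioned earlier, the snippet below compares the dot product with cosine similarity and Euclidean distance on three made-up rating vectors. The cosine and Euclidean functions are not part of the article's code; they are shown only for comparison.

```python
import numpy as np

# Made-up rating vectors over four movies (0 = not rated).
alice = np.array([5.0, 4.0, 0.0, 0.0])
bob = np.array([5.0, 4.0, 0.0, 0.0])     # identical taste to alice
carol = np.array([5.0, 5.0, 5.0, 5.0])   # rates everything highly

def dot_similarity(a, b):
    # what find_similarity computes: grows with the number and size of ratings
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    # dot product normalized by vector lengths, so heavy raters get no bonus
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # distance-based alternative: smaller means more similar
    return float(np.linalg.norm(a - b))

for name, other in [("bob", bob), ("carol", carol)]:
    print(name,
          dot_similarity(alice, other),
          round(cosine_similarity(alice, other), 3),
          round(euclidean_distance(alice, other), 3))
```

With the dot product, carol scores higher than bob even though bob's ratings are identical to alice's, because carol simply rated more movies. Cosine similarity and Euclidean distance both rank bob first.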

This time, the recommendation is based on the movies that similar users have rated. If you are the user we are recommending to, we are going to recommend the movies that your friends liked.

def get_recommendations(userId, df_ratings=df_ratings_sample, m=10):
  watched_movie_ids = get_movies(userId)
  similar_users = get_similar_users(userId)
  movies = []
  # gather every movie rated by the most similar users
  for (uId, _) in similar_users:
    movies.extend(list(get_movies(uId)))
  # drop duplicates and anything the user has already watched
  movies = list(set(movies))
  movies = [movie for movie in movies if movie
    not in watched_movie_ids]
  return get_movie_titles(movies[:m])

We pick the same userId, 138446, to compare the results with the first approach (content-based). The following are the recommended movies that we got.

Waiting to Exhale (1995)
Tom and Huck (1995)
Sudden Death (1995)
Money Train (1995)
Assassins (1995)
Now and Then (1995)
Across the Sea of Time (1995)
It Takes Two (1995)
Cry, the Beloved Country (1995)
Guardian Angel (1994)

What do you think about the results? Which recommendation do you think is more accurate?

Evaluations of Recommender System

There is one more important point about recommendation. The performance of a recommender cannot be judged only by error values like Mean Squared Error (MSE) or Mean Absolute Error (MAE).
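For reference, this is how the two error values are computed. The actual and predicted ratings below are made up purely to show the arithmetic.

```python
# Made-up ratings: what users gave vs. what a model predicted.
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

errors = [p - a for p, a in zip(predicted_ratings, actual_ratings)]

# MSE averages the squared errors, so large misses are punished more.
mse = sum(e ** 2 for e in errors) / len(errors)
# MAE averages the absolute errors, so every miss counts in proportion.
mae = sum(abs(e) for e in errors) / len(errors)

print("MSE:", mse)
print("MAE:", mae)
```

Both numbers only say how close the predicted ratings are to the known ones. They say nothing about whether the user actually enjoys the recommended list, which is why they are not enough on their own.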

We had better run online testing, or A/B testing, which is an experiment on two variants. For example, we make two user groups, A and B, then recommend movies to Group A using the content-based approach and to Group B using the collaborative one. At last, we compare the user satisfaction.
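One simple way to compare the two groups, assuming satisfaction is recorded as a yes/no outcome per user (for example, "watched at least one recommended movie"), is a two-proportion z-test. The counts below are hypothetical and the function name is made up for this sketch; a real experiment would need its own metric and sample sizes.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # pooled rate under the assumption that both groups behave the same
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

# Hypothetical outcome: Group A (content-based) vs. Group B (collaborative).
z = two_proportion_z(successes_a=120, n_a=400, successes_b=150, n_b=400)
print(round(z, 3))
```

A value of |z| above roughly 1.96 corresponds to a difference that is significant at the usual 5% level for a two-sided test. The sign shows which group did better.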

The approaches we built certainly still need optimization.

Further Improvement

For content-based, we should also extract information about the actors and directors of the movies that the users have watched, and the movie release dates. Currently, we don’t have these in our data, so we need to find other datasets and join them. The more useful features we can extract, the more accurate the recommendations will be.

For collaborative filtering, we used just 1000 users in this work because of limited computing power. We would probably find more similar users if we took the remaining data into account. We should also investigate how different user samples perform, e.g., users who rated between 100–500 movies, 500–1000 movies, and more than 1000 movies.

At the same time, we should also consider the timestamps of when the users watched the movies, so that the similarity between users also accounts for the temporal feature.
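One possible way to bring the timestamp in, shown here only as a sketch, is to down-weight old ratings before computing the similarity. The exponential decay, the `decay_weight` helper, and the one-year half-life are all assumptions for illustration, not something used in this work.

```python
import numpy as np

HALF_LIFE_DAYS = 365.0  # hypothetical: a rating loses half its weight per year

def decay_weight(age_days, half_life_days=HALF_LIFE_DAYS):
    # 1.0 for a rating made today, 0.5 after one half-life, 0.25 after two
    return 0.5 ** (np.asarray(age_days) / half_life_days)

# Made-up ratings and how many days ago each one was given.
ratings = np.array([5.0, 4.0, 3.0])
age_days = np.array([0.0, 365.0, 730.0])

weighted_ratings = ratings * decay_weight(age_days)
print(weighted_ratings)
```

The weighted vectors could then replace the raw rating vectors in the similarity calculation, so that recent tastes count more than old ones.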

Building a recommender system is more of an art than a technical challenge. I believe this field of data science still needs a lot of research and improvement. I also find it really hard to measure the performance, since we only get feedback after we have deployed to production. It was a great experience.

Thank you for reading this far!

You can get code here: https://github.com/AI-Leap/90s_kids_movie_recommender