Collaborative Filtering Recommendation System
Collaborative filtering is a fundamental technique in modern recommender systems, utilizing user interactions and preferences to deliver personalized recommendations. Its influence extends across various industries, fundamentally changing how users engage with digital platforms. This article delves into the principles of collaborative filtering, exploring both its theoretical foundations and real-world applications. It also offers valuable insights into the technology that drives personalized digital experiences. Before diving into this article, you may want to review Recommendation System Overview for additional context. By the end of this post, you’ll have a deeper understanding of how collaborative filtering shapes our digital choices.
- Introduction To Collaborative Filtering
- Understanding Collaborative Filtering
- Types of Collaborative Filtering
- CF Process
- CF Matrix Factorization with code
- Summary
Introduction
Collaborative filtering is a pivotal mechanism in recommender systems, leveraging user interactions to suggest items based on collective preferences. Essentially, collaborative filtering assumes that users whose preferences have aligned in the past are likely to align again in the future. The technique does not require explicit knowledge of the items; instead, it identifies patterns in user behavior. Collaborative filtering comes in two main types: user-based filtering, which recommends items liked by users with similar tastes, and item-based filtering, which suggests items similar to those previously preferred. The collaborative filtering process involves collecting data, creating a user-item matrix, calculating similarities, generating predictions, and ultimately providing personalized recommendations, thereby enhancing user engagement and satisfaction (Figure 1).
![Simple Overview of Collaborative Filtering](https://mlarchive.com/wp-content/uploads/2025/01/colabrativefiltering.drawio.png)
Understanding Collaborative Filtering
Collaborative filtering (CF) is a recommendation technique that leverages shared user preferences to make predictions. It assumes that users with similar tastes in the past will likely have similar preferences in the future. Unlike content-based methods, CF does not rely on item details but focuses on collective user behaviors. The two main types of CF are user-based, which suggests items based on similar user preferences, and item-based, which recommends items similar to those a user has liked before. The CF process typically involves several steps: collecting user interaction data, creating a user-item matrix, calculating similarities, generating predictions for missing preferences, and providing personalized recommendations.
To improve recommendation accuracy, CF can incorporate additional factors such as age, genre, or location preferences. For instance, users can be assigned a factor index based on attributes like age, and these indices can be normalized to help determine whether a movie is suitable for children or adults (Figure 2). Similarly, items like movies can have factor indices based on attributes such as genre; a rough sketch of this idea appears after this paragraph. This multi-factor approach makes recommendations more accurate and relevant. Additionally, techniques like matrix factorization can be used to refine predictions, making the recommender system more effective at offering personalized suggestions to users.
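As a rough, hypothetical illustration of such factor indices, the snippet below builds a toy age-based user factor and a genre-based movie factor; the column names, thresholds, and data are assumptions made purely for illustration.
import pandas as pd
# Toy user and movie tables; the columns and values are illustrative assumptions.
users_toy = pd.DataFrame({'userId': [1, 2, 3], 'age': [9, 34, 62]})
movies_toy = pd.DataFrame({'movieId': [10, 20], 'genres': ['Children|Animation', 'Thriller']})
# Normalize age to the [0, 1] range so it can serve as a simple user factor index.
users_toy['age_factor'] = (users_toy['age'] - users_toy['age'].min()) / (users_toy['age'].max() - users_toy['age'].min())
# Flag whether a movie targets children, as a simple item factor index.
movies_toy['child_friendly'] = movies_toy['genres'].str.contains('Children').astype(int)
print(users_toy)
print(movies_toy)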
Types of CF
There are two main types:
- User-Based (UBCF):
- Concept: Recommends items to a user based on the preferences and behaviors of users with similar tastes.
- Process: Calculate the similarity between users, identify users with comparable preferences, and recommend items favored by those similar users.
- Strengths: Effective in capturing user preferences in diverse domains.
- Item-Based (IBCF):
- Concept: Recommends items similar to those a user has liked or interacted with in the past.
- Process: Calculate the similarity between items, identify items similar to those the user has shown interest in, and recommend those items.
- Strengths: Less sensitive to changes in user preferences over time compared to user-based approaches.
Here are additional types:
1. Matrix Factorization (SVD, NMF):
Matrix factorization is a technique that decomposes a matrix into the product of two lower-rank matrices representing latent factors. In the context of CF, it is often used to predict missing values in a user-item interaction matrix, where each entry corresponds to a user’s rating of an item. The factorized matrices capture underlying patterns and relationships, enabling accurate recommendations. Matrix factorization learns two embedding matrices, one for users and one for items, such that their product approximates the observation matrix A. Each entry in A is modeled as the dot product of the embeddings corresponding to that user and item (Figure 3); a minimal sketch of this idea follows the list of types below.
- Singular Value Decomposition (SVD):
- Concept: A matrix factorization technique that decomposes the user-item interaction matrix into three matrices representing users, latent factors, and items.
- Process: Utilizes linear algebra to identify latent features and reconstruct the original matrix.
- Strengths: Provides a compact representation of user-item interactions, facilitating efficient computation.
- Non-Negative Matrix Factorization (NMF):
- Concept: Similar to SVD, but restricts the matrices to non-negative values, making the factors interpretable.
- Process: Factorizes the user-item matrix into non-negative matrices to enhance interpretability.
- Strengths: Useful when non-negativity constraints align with the nature of the data.
2. Deep learning-based collaborative filtering:
- Concept: Utilizes neural networks to learn complex patterns and representations from user-item interactions.
- Process: Employs deep learning architectures, such as autoencoders or neural collaborative filtering models, to capture intricate relationships.
- Strengths: Effective in handling large-scale and diverse datasets and capturing intricate patterns.
3. Hybrid Collaborative Filtering:
- Concept: Integrates collaborative filtering with other recommendation techniques, such as content-based filtering or demographic filtering.
- Process: Combines multiple recommendation approaches to leverage their complementary strengths.
- Strengths: Addresses the limitations of individual methods and provides more robust recommendations.
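To make the matrix factorization idea from type 1 concrete, here is a minimal sketch on a toy rating matrix. It uses scikit-learn's NMF purely as an illustration and is separate from the Keras model built later in this post; for simplicity the zeros are treated as observed values, whereas dedicated CF implementations handle missing entries explicitly.
import numpy as np
from sklearn.decomposition import NMF
# Toy user-item rating matrix (rows = users, columns = items); 0 marks an unobserved rating.
A = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)
# Factorize A into two low-rank, non-negative matrices: user factors W and item factors H.
nmf = NMF(n_components=2, init='random', random_state=42, max_iter=500)
W = nmf.fit_transform(A)   # shape: (n_users, n_factors)
H = nmf.components_        # shape: (n_factors, n_items)
# Each reconstructed rating is the dot product of a user factor row and an item factor column.
A_hat = W @ H
print(np.round(A_hat, 2))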
The Collaborative Filtering (CF) Process
The process varies across techniques, but here is a generalized overview incorporating the methods above; a minimal user-based sketch follows the list.
- User-Based (UBCF):
- Data Collection: Gather user-item interaction data.
- User-Item Matrix: Create a matrix representing user-item interactions.
- Similarity Computation: Calculate the similarity between users based on their preferences.
- Prediction Generation: Generate predictions for unseen items by aggregating preferences from similar users.
- Recommendation: Suggest items with the highest predicted ratings.
- Item-Based (IBCF):
- Data Collection: Gather user-item interaction data.
- User-Item Matrix: Create a matrix representing user-item interactions.
- Similarity Computation: Calculate the similarity between items based on user preferences.
- Prediction Generation: Generate predictions for items similar to those the user has liked in the past.
- Recommendation: Suggest items with the highest predicted ratings.
- Matrix Factorization (SVD, NMF):
- Data Collection: Gather user-item interaction data.
- User-Item Matrix: Create a matrix representing user-item interactions.
- Matrix Factorization: Decompose the user-item matrix into latent factors.
- Prediction Generation: Reconstruct the matrix using the extracted latent factors.
- Recommendation: Suggest items with high predicted ratings.
- Deep learning-based:
- Data Collection: Gather user-item interaction data.
- Preprocessing: Prepare data for neural network input.
- Training: Train a deep learning model (e.g., neural collaborative filtering).
- Prediction Generation: Use the trained model to predict user preferences for items.
- Recommendation: Suggest items with high predicted ratings.
- Hybrid:
- Data Collection: Gather user-item interaction data.
- Integration: Combine collaborative filtering with other recommendation techniques.
- Model Training: Train a hybrid model incorporating multiple approaches.
- Prediction Generation: Generate predictions using the hybrid model.
- Recommendation: Suggest items with high predicted ratings based on the combined approach.
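As a minimal sketch of the user-based flow above, assuming a small toy rating matrix, the similarity and prediction steps can be written with cosine similarity and a similarity-weighted average of the neighbours' ratings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Toy user-item rating matrix (rows = users, columns = items); 0 marks an unrated item.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
# Similarity computation: cosine similarity between users' rating vectors.
user_sim = cosine_similarity(R)
# Prediction generation: estimate user 0's rating for item 2 as a similarity-weighted
# average over the users who actually rated that item.
target_user, target_item = 0, 2
rated = R[:, target_item] > 0
weights = user_sim[target_user, rated]
prediction = np.dot(weights, R[rated, target_item]) / (np.abs(weights).sum() + 1e-8)
print(round(prediction, 2))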
Implementation Steps of Movie Recommendation with Matrix Factorization
Import the libraries
# Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import pairwise_kernels
import matplotlib.pyplot as plt
from pathlib import Path
import os
import re
import html
import string
import unicodedata
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import regularizers
from tensorflow.keras import metrics
from tensorflow.keras.utils import plot_model
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Make the data that we will use (Figure 4)
A user-item matrix (also known as a rating matrix) is created using the pd.crosstab function in pandas.
movies = pd.read_csv('path to movies') # Load the movies dataset.
ratings = pd.read_csv('path to ratings') # Load the ratings dataset.
# Create a cross-tabulation (pivot table) of userId vs movieId with default counts.
pd.crosstab(ratings.userId, ratings.movieId).head()
k = 15 # Define the number of top users and movies to filter.
# Group ratings by userId and count the number of ratings for each user.
g = ratings.groupby('userId')['rating'].count()
# Select the top 15 users with the most ratings.
top_users = g.sort_values(ascending=False)[:k]
# Group ratings by movieId and count the number of ratings for each movie.
g = ratings.groupby('movieId')['rating'].count()
# Select the top 15 movies with the most ratings.
top_movies = g.sort_values(ascending=False)[:k]
# Filter ratings to include only top users.
top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='userId')
# Further filter ratings to include only top movies.
top_r = top_r.join(top_movies, rsuffix='_r', how='inner', on='movieId')
# Create a cross-tabulation of filtered users and movies with aggregated ratings as values.
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc='sum')
Encode the user id, and movie id categorical variables, using sklearn LabelEncoder to be used with the Embedding layer.
# Encode userId into a continuous range of integers using LabelEncoder.
user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings.userId.values)
# Count the number of unique users.
n_users = ratings['user'].nunique()
# Encode movieId into a continuous range of integers using LabelEncoder.
item_enc = LabelEncoder()
ratings['movie'] = item_enc.fit_transform(ratings.movieId.values)
# Count the number of unique movies.
n_movies = ratings['movie'].nunique()
# Ensure the ratings column is of float32 type for computational efficiency.
ratings['rating'] = ratings['rating'].values.astype(np.float32)
# Determine the minimum and maximum rating values for scaling or normalization.
min_rating = min(ratings['rating'])
max_rating = max(ratings['rating'])
# Display the number of users, movies, and rating range.
n_users, n_movies, min_rating, max_rating
Extract the train and test data for MovieLens
X = ratings[['user', 'movie']].values
y = ratings['rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
Build the collaborative filtering model using the Keras Embedding and Dot layers
This model is an implementation of matrix factorization using neural networks (Figure 5).
User and item embeddings:
- The “embedding” layers create low-dimensional latent representations (embeddings) of users and movies. These embeddings correspond to the rows and columns of the user and item embedding matrices in traditional matrix factorization.
Dot product prediction:
- The “dot” layer computes the dot product between the user and movie embeddings. This is equivalent to computing an approximation of the original feedback matrix in matrix factorization.
Goal:
- The model is compiled using a loss function (MSE) that minimizes the difference between the predicted ratings (from the dot product) and the actual ratings. This is in line with the goal of matrix factorization, which is to find embedding matrices that minimize the reconstruction error of the feedback matrix.
- L2 regularization: helps prevent overfitting by adding a penalty to large embedding values.
- Trainable embeddings: Embeddings are learned directly through backpropagation.
# Define the number of latent factors for embedding dimensions.
emb_sz = 50
# Input for user IDs.
user = layers.Input(shape=(1,), name='user_id') # Define user input with shape (1,).
# Embedding layer for users with L2 regularization to reduce overfitting.
user_emb = layers.Embedding(n_users, emb_sz, embeddings_regularizer=regularizers.l2(1e-6), name='user_embedding_LUT')(user)
# Reshape user embedding to match the required dimensions.
user_emb = layers.Reshape((emb_sz,))(user_emb)
# Input for movie IDs.
movie = layers.Input(shape=(1,), name='movie_id') # Define movie input with shape (1,).
# Embedding layer for movies with L2 regularization to reduce overfitting.
movie_emb = layers.Embedding(n_movies, emb_sz, embeddings_regularizer=regularizers.l2(1e-6), name='movie_embedding_LUT')(movie)
# Reshape movie embedding to match the required dimensions.
movie_emb = layers.Reshape((emb_sz,))(movie_emb)
# Compute the dot product of user and movie embeddings to predict ratings.
rating = layers.Dot(axes=1, name='similarity_measure')([user_emb, movie_emb])
# Define the model with user and movie inputs and a predicted rating as output.
model = models.Model([user, movie], rating)
# Compile the model with Mean Squared Error (MSE) loss and RMSE metric.
model.compile(loss='mse', metrics=[metrics.RootMeanSquaredError()],
              optimizer=optimizers.Adam(learning_rate=0.001))
# Display the model's architecture and summary.
model.summary()
# Plot the model structure with shapes and layer names for visualization.
plot_model(model, show_shapes=True, show_layer_names=True)
Compile and train the model on the MovieLens dataset
# Compile the model
model.compile(loss='mse', metrics=[metrics.RootMeanSquaredError()],
              optimizer=optimizers.Adam(learning_rate=0.001))
model.fit(x=[X_train[:,0], X_train[:,1]], y=y_train,
batch_size=64, epochs=5, verbose=1,
validation_data=([X_test[:,0], X_test[:,1]], y_test))
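Optionally, evaluate the trained model on the held-out split to check the test RMSE before generating recommendations.
# Evaluate MSE loss and RMSE on the test split.
test_loss, test_rmse = model.evaluate([X_test[:, 0], X_test[:, 1]], y_test, verbose=0)
print(f"Test RMSE: {test_rmse:.4f}")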
Make predictions on the test set and get the top recommended movies for users (Figure 6)
- The model is used to predict ratings on the test set (X_test).
- The results are combined into a DataFrame with columns for user, movie, and predicted rating.
Adjust the number of recommendations and other parameters according to your specific requirements. This is a basic example, and depending on your use case, you might want to include additional information or post-processing steps to enhance the recommendation results.
# Use the trained model to predict ratings on the test set
predictions = model.predict([X_test[:, 0], X_test[:, 1]])
# Combine user, movie, and predicted ratings
results = np.column_stack((X_test, predictions))
# Create a DataFrame for better readability
results_df = pd.DataFrame(results, columns=['user', 'movie', 'predicted_rating'])
# Sort by user and predicted rating
results_df.sort_values(['user', 'predicted_rating'], ascending=[True, False], inplace=True)
# Group by user and get top recommendations for each user
top_recommendations = results_df.groupby('user').head(5) # Adjust '5' to the desired number of recommendations
# Display top recommendations
print("Top Recommendations:")
print(top_recommendations[['user', 'movie', 'predicted_rating']])
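If you want human-readable output, the encoded movie indices can be mapped back to their original IDs with the fitted LabelEncoder and joined to the movies dataset loaded earlier; this assumes the standard MovieLens movies file with movieId and title columns.
# Map encoded movie indices back to the original movieIds and attach titles.
top_recommendations = top_recommendations.copy()
top_recommendations['movieId'] = item_enc.inverse_transform(top_recommendations['movie'].astype(int))
top_recommendations = top_recommendations.merge(movies[['movieId', 'title']], on='movieId', how='left')
print(top_recommendations[['user', 'title', 'predicted_rating']])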
Summary
Collaborative filtering is a recommendation approach that analyzes user behavior to make predictions about their preferences. It comes in two main types: user-based and item-based. The process involves collecting data, creating a user-item matrix, computing similarities, generating predictions, and making recommendations.
Code
You can access the full source code on GitHub here.