In my linear algebra class this summer, I used the Netflix Prize challenge as a practical example of an application of singular value decomposition (SVD). To be more precise, I explained the term \(p_u^Tq_i\) in the simple SVD with bias model:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^Tq_i.$$


The above model can be found in Section 2.1 of this progress paper of the winning team. In this note, I will explain the model and give an implementation in Python. A C implementation of the model can be found in my GitHub repository here: https://github.com/wormtooth/netflix_svd.

import os
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

import matplotlib.pyplot as plt

Load Data

We can find the data on Kaggle. There are 100 million ratings (100480507 to be exact) for 480189 users and 17770 movies. All these ratings are stored in the 4 txt files with the name prefix combined_data. The raw data is roughly 2G, but if we are not careful, we would soon find that we need 16G+ of memory to load all of it. The trick here is not to use a list, but an np.array instead!

RATINGS_NUMBER = 100480507
MOVIES_NUMBER = 17770
USERS_NUMBER = 480189

def load_ratings(src_folder: str, cache_folder: str = '') -> np.ndarray:
    ratings_cache_path = os.path.join(cache_folder, 'ratings.npy')
    if os.path.exists(ratings_cache_path):
        return np.load(ratings_cache_path)
    ratings = np.empty(
        RATINGS_NUMBER,
        dtype=[('user', np.int32), ('movie', np.int32), ('rating', np.int32)]
    )

    # read all ratings
    src_path = os.path.join(src_folder, 'combined_data_{}.txt')
    n = m = 0
    for i in range(1, 5):
        path = src_path.format(i)
        if not os.path.exists(path):
            raise FileNotFoundError(f"{path} does not exist")
        with open(path, "r") as f:
            for line in f:
                line = line.strip()
                if line[-1] == ':':
                    m = int(line[:-1]) - 1 # make it from 0
                    continue
                u, r, _ = line.split(',')
                u, r = int(u), int(r)
                ratings[n] = u, m, r
                n += 1

    # remap user ids so that they are sequential from 0
    ratings = np.sort(ratings, order='user')
    ratings = ratings.view(np.int32).reshape(RATINGS_NUMBER, -1)
    users = []
    for i in range(RATINGS_NUMBER):
        if users and users[-1] == ratings[i, 0]:
            ratings[i, 0] = len(users) - 1
        else:
            users.append(ratings[i, 0])
            ratings[i, 0] = len(users) - 1
    users = np.array(users)

    np.save(ratings_cache_path, ratings)
    np.save(os.path.join(cache_folder, 'users.npy'), users)
    return ratings
# change the src_folder to wherever you store the decompressed data
# it caches loaded ratings to current folder by default
src_folder = '/data/netflix'
ratings = load_ratings(src_folder)

I remap both user ids and movie ids so that they are sequential starting from 0. Loading all the ratings takes a while the first time, but they are cached to the current folder so that subsequent loads are much faster. You can change cache_folder if you want.

Let's wrap the ratings into a dataframe and compute the average ratings for all movies.

ratings_df = pd.DataFrame(
    data=ratings,
    columns=['user', 'movie', 'rating']
)
ratings_df.head()
   user  movie  rating
0     0     29       3
1     0    156       3
2     0    172       4
3     0    174       5
4     0    190       2
avg_rating = ratings_df.rating.mean()
print('The average ratings for all movies: ', end='')
print(f'{avg_rating:.2f}')
The average ratings for all movies: 3.60
def RMSE(s):
    return np.sqrt(np.mean(s * s))

print('RMSE for using the average rating only: ', end='')
print(f'{RMSE(ratings_df.rating - avg_rating):.4f}')
RMSE for using the average rating only: 1.0852

The Baseline Model

Let \(\mu\) be the average rating, which is roughly 3.60. Ratings are subjective. Some users might be pickier than others, so they tend to give lower ratings. Some movies are better than others, so they receive higher ratings. These are the biases from users and movies:

$$\tilde{r}_{um} = \mu + b_u + b_m,$$


where \(\tilde{r}_{um}\) is the predicted rating for user \(u\) and movie \(m\), \(b_u\) is the bias for user \(u\) and \(b_m\) is the bias for movie \(m\). We use this model as the baseline model.

Let \(r_{um}\) be the true rating for user \(u\) and movie \(m\), and suppose \(r_{um} \ne 0\) whenever the rating for user \(u\) and movie \(m\) is not missing. Then we can find \(b_u\) and \(b_m\) to minimize

$$\sum_{r_{um} \ne 0} (\mu + b_u + b_m - r_{um})^2 + \lambda (\sum_{u} b_u^2 + \sum_{m} b_m^2).$$


The hyperparameter \(\lambda\) is the regularization factor to avoid overfitting. This is essentially a least-squares problem, and it can be solved using gradient descent (or stochastic gradient descent).
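For concreteness, here is a minimal sketch of how the two biases could be fit jointly with stochastic gradient descent; the learning rate lr, the number of epochs, and the regularization lam are hypothetical choices, not values used in the rest of this note, and a pure Python loop over 100M ratings would be slow, so this is for illustration only.

def fit_baseline_sgd(ratings, mu, lam=0.02, lr=0.005, epochs=3):
    # ratings: int array of shape (N, 3) with columns (user, movie, rating)
    b_u = np.zeros(USERS_NUMBER)
    b_m = np.zeros(MOVIES_NUMBER)
    for _ in range(epochs):
        np.random.shuffle(ratings)               # visit the ratings in a random order
        for u, m, r in ratings:
            err = r - (mu + b_u[u] + b_m[m])     # residual of the current prediction
            b_u[u] += lr * (err - lam * b_u[u])  # gradient step for the user bias
            b_m[m] += lr * (err - lam * b_m[m])  # gradient step for the movie bias
    return b_u, b_m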

One way to obtain an approximation of the baseline model is to separate the users' bias from the movies' bias. For example, we can first assume no bias for the movies and consider only the users' bias; then we just need to find \(b_u\) to minimize

$$\sum_{r_{um} \ne 0}(\mu + b_u - r_{um})^2 + \alpha \sum_{u}b_u^2.$$

The hyperparameter \(\alpha\) is the regularization factor replacing \(\lambda\). The above function can be easily minimized using calculus, i.e., by setting the partial derivatives equal to 0. We get

$$b_u = \frac{\sum_{m, r_{um} \ne 0} (r_{um} - \mu)}{R_u + \alpha}.$$


Here, \(R_u = \sum_{m, r_{um} \ne 0} 1\) is the number of movies user \(u\) rated.

Once we get \(b_u\), we then find \(b_m\) to minimize

$$\sum_{r_{um} \ne 0} (\mu + b_u + b_m - r_{um})^2 + \beta \sum_{m} b_m^2,$$


which gives us

$$b_m = \frac{\sum_{u, r_{um} \ne 0} (r_{um} - \mu - b_u)}{R_m + \beta}.$$


Here, \(R_m = \sum_{u, r_{um} \ne 0} 1\) is the number of users who rated movie \(m\).

We can also consider \(b_m\) first and then \(b_u\) in a similar way; the two orders give different approximations to the original baseline model.

from bisect import bisect_left, bisect_right

def get_baseline_model(df, alpha=0, beta=0):
    avg_rating = df.rating.mean()
    df.sort_values('user', inplace=True)
    ratings = df.values
    user_bias = np.empty(USERS_NUMBER, dtype=np.float64)
    for i in range(USERS_NUMBER):
        p = bisect_left(ratings[:, 0], i)
        q = bisect_right(ratings[:, 0], i)
        user_bias[i] = np.sum(ratings[p:q, 2] - avg_rating) / (alpha + q - p)

    df.sort_values('movie', inplace=True)
    ratings = df.values
    movie_bias = np.empty(MOVIES_NUMBER, dtype=np.float64)
    for i in range(MOVIES_NUMBER):
        p = bisect_left(ratings[:, 1], i)
        q = bisect_right(ratings[:, 1], i)
        movie_bias[i] = np.sum(
            ratings[p:q, 2] - avg_rating - user_bias[ratings[p:q, 0]]) / (beta + q - p)

    return user_bias, movie_bias

In get_baseline_model above, I compute \(b_u\) first and then \(b_m\). The implementation would look much cleaner using groupby on the dataframe, but that takes too much memory and crashes my computer with 8G of RAM, so I do it without groupby.

alpha = 25
beta = 10
user_bias, movie_bias = get_baseline_model(ratings_df, alpha, beta)

The choice of the regularization factors \(\alpha\) and \(\beta\) can be done using cross validation on the Probe set. Since that takes a while, I will just use \(\alpha = 25\) and \(\beta = 10\) from section III of this paper.
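If you do want to tune them, a minimal sketch of such a search could look like the following, assuming hypothetical train_df and valid_df dataframes obtained by holding out part of the ratings (they are not built in this note).

def baseline_rmse(train_df, valid_df, alpha, beta):
    # fit the baseline on the training split and score it on the validation split
    user_bias, movie_bias = get_baseline_model(train_df, alpha, beta)
    mu = train_df.rating.mean()
    v = valid_df.values
    resid = v[:, 2] - mu - user_bias[v[:, 0]] - movie_bias[v[:, 1]]
    return RMSE(resid)

# small grid search: keep the (alpha, beta) pair with the lowest validation RMSE
best_alpha, best_beta = min(
    ((a, b) for a in (10, 25, 50) for b in (5, 10, 25)),
    key=lambda ab: baseline_rmse(train_df, valid_df, *ab)
)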

ratings = ratings_df.values
avg_rating = ratings_df.rating.mean()
resid = ratings[:, 2] - avg_rating - user_bias[ratings[:, 0]] - movie_bias[ratings[:, 1]]
print(f'RMSE for the baseline model with alpha = {alpha}, beta={beta}: ', end='')
print(f'{RMSE(resid):.4f}')
RMSE for the baseline model with alpha = 25, beta=10: 0.9240

Simple SVD with Bias Model

The singular value decomposition, or SVD, states that any \(m \times n\) real matrix \(A\) can be decomposed as \(A = U \Sigma V^T\), where \(U\) is an \(m \times m\) orthogonal matrix, \(\Sigma\) is an \(m \times n\) diagonal matrix, and \(V\) is an \(n \times n\) orthogonal matrix. The nonzero values on the diagonal of \(\Sigma\) are the singular values of \(A\), and there are exactly \(r = \operatorname{rank} A\) of them. We are not going into the details of SVD, as we don't need the SVD itself, but rather the idea behind it.

One application of SVD is to approximate the matrix \(A\) using only the first \(k\) largest singular values. We write \(A = U\Sigma V^T\) in such a way that the diagonal of \(\Sigma\) is in descending order. Then we can take the upper left \(k \times k\) submatrix \(\Sigma_k\) of \(\Sigma\) instead of the full \(\Sigma\), and take the corresponding \(U_k\) and \(V_k\) so that

$$A \approx U_k \Sigma_k V_k^T,$$


where \(U_k\) is of size \(m \times k\) and \(V_k\) is of size \(n \times k\).

And now we use the idea of SVD: \(U\) and \(V\) contain "characteristics" of the matrix \(A\) (the columns of \(U\) are orthonormal eigenvectors of \(AA^T\) and the columns of \(V\) are orthonormal eigenvectors of \(A^TA\)). In other words, both \(U\) and \(V\) contain latent features related to \(A\). The approximation above now means that we use \(k\) latent features to reconstruct \(A\). Absorbing \(\Sigma_k\) into \(U_k\) or \(V_k\) or even both, we get an even simpler decomposition:

$$A \approx U_k V_k^T,$$


where \(k\) is the number of latent features we use to reconstruct \(A\).
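To make this concrete, here is a small numpy sketch on a random matrix (purely illustrative, not part of the Netflix data pipeline): it truncates the SVD to the top \(k\) singular values and then absorbs \(\Sigma_k\) symmetrically into both factors.

A = np.random.rand(6, 4)                          # a small dense example matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # singular values are returned in descending order

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation of A

# absorb Sigma_k into both factors: A_k = U_k @ V_k.T
U_k = U[:, :k] * np.sqrt(s[:k])
V_k = Vt[:k, :].T * np.sqrt(s[:k])
assert np.allclose(A_k, U_k @ V_k.T)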

It might be better understood if we put the idea in context. Let \(R\) be an \(s \times t\) matrix representing all the ratings. The \(u\)-th row of \(R\) contains all the ratings user \(u\) gives to movies, and the \(m\)-th column of \(R\) contains all the ratings movie \(m\) receives from users. Some entries of \(R\) are given, but a lot of them are missing. Our task is to reconstruct an approximation of \(R\) using the given entries. Now let

$$R \approx UM^T,$$


where \(U\) is an \(s \times k\) matrix, \(M\) is a \(t \times k\) matrix, and \(k\) represents the number of latent features.

Each row of \(U\) represents how much a user likes the corresponding \(k\) features, and each row of \(M\) says how well the movie presents those features. We can think of these features as genres, for example. I will give a simple example with \(k=3\) features: mystery, action and animation. Tom likes mystery movies a lot and is not much into animation, so his scores for these 3 features might be 2, 1, -0.5. Let's imagine that an animated mystery movie has 2, 0.5, 2 for the 3 features. Then using these features we predict that the rating Tom gives to the movie is

$$2 \times 2 + 1 \times 0.5 + (-0.5) \times 2 = 3.5.$$

In application though, we don't know what these latent features mean. There is usually no good interpretation for them.

class SimpleSVD(nn.Module):

    def __init__(self, num_features):
        super().__init__()
        self.num_features = num_features
        self.users = nn.Parameter(torch.empty((USERS_NUMBER, num_features)))
        self.movies = nn.Parameter(torch.empty((MOVIES_NUMBER, num_features)))
        self.reset_parameters()
        self.double()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.users)
        nn.init.xavier_uniform_(self.movies)

    def forward(self, inp):
        out = torch.sum(
            self.users[inp[:, 0].type(torch.long), :] * self.movies[inp[:, 1].type(torch.long), :],
            dim=1)
        return out

Above is an implementation of the simple SVD model in PyTorch. It takes an input of two columns, the first of which contains the user ids and the second the movie ids. It outputs the predicted rating for each pair of user id and movie id.

Instead of using the simple SVD model to predict the actual ratings, we use it to predict the residuals from the baseline model. That is,

$$\hat{r}_{um} = \mu + b_u + b_m + U_u M_m^T.$$


Here, \(U_u\) is the row vector representing features for the user \(u\), and \(M_m\) is the row vector for the features of movie \(m\).
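Once the factors are trained, a single prediction would be assembled as in the following sketch; it assumes a trained SimpleSVD instance svd together with the avg_rating, user_bias and movie_bias computed above.

def predict_rating(user, movie, svd):
    # predicted rating = global mean + user bias + movie bias + latent term
    with torch.no_grad():
        latent = torch.dot(svd.users[user], svd.movies[movie]).item()
    return avg_rating + user_bias[user] + movie_bias[movie] + latent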

# batch_size = 8
# num_features = 20
# lr = 0.001
# epoch = 2

# svd = SimpleSVD(num_features)

# # prepare train dataset
# trainset = torch.utils.data.TensorDataset(
#     torch.tensor(ratings[:, 0:2]), torch.tensor(resid)
# )
# trainloader = torch.utils.data.DataLoader(
#     trainset,
#     batch_size=batch_size,
#     shuffle=True,
#     drop_last=True
# )

# # criterion
# criterion = nn.MSELoss()
# optimizer = torch.optim.SGD(svd.parameters(), lr=lr)

# for e in range(epoch):
#     total_loss = 0.0
#     for i, (inp, tgt) in enumerate(trainloader):
#         optimizer.zero_grad()
#         pred = svd(inp)
#         loss = criterion(pred, tgt)
#         loss.backward()
#         optimizer.step()
#         total_loss += loss.item()
#     total_loss /= len(trainloader)
#     print(f'RMSE after {e+1} epoch: {np.sqrt(total_loss):.4f}')

I commented out the above code because it takes a long time to train the model, mainly because my laptop does not have enough RAM. I reimplemented the model in C, and it ran fine with less than 1G of memory. You can find the C implementation on GitHub: netflix_svd. Out of the 100M ratings, I reserved 1M for the test set (a Python sketch of such a split is shown after the log below). Here is some output from the C implementation of the model:

Got bias for the model with alpha = 25.00 and beta = 10.00
RMSE (train): 1.038225
RMSE (test): 1.040508
Shuffled ratings.
Epoch 1 done.
RMSE (train): 0.861172
RMSE (test): 0.874805
Shuffled ratings.
Epoch 2 done.
RMSE (train): 0.821142
RMSE (test): 0.847518
Shuffled ratings.
Epoch 3 done.
RMSE (train): 0.801727
RMSE (test): 0.838259
Shuffled ratings.
Epoch 4 done.
RMSE (train): 0.791160
RMSE (test): 0.834764
Shuffled ratings.
Epoch 5 done.
RMSE (train): 0.785267
RMSE (test): 0.833697
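
For reference, here is one way the 1M-rating hold-out could be done on the Python side; this is only a sketch with a hypothetical seed, and the C code performs its own split.

rng = np.random.default_rng(seed=42)    # fixed seed so the split is reproducible
perm = rng.permutation(RATINGS_NUMBER)
test_idx, train_idx = perm[:1_000_000], perm[1_000_000:]
train_ratings = ratings[train_idx]      # about 99.48M ratings for training
test_ratings = ratings[test_idx]        # 1M ratings held out for testing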