Recommender Systems in Practice

How companies make online recommendations

Houtao Deng | Feb 13

Online companies (Amazon, Netflix, Linkedin, Pandora, etc.) leverage recommender systems to help users discover new and relevant items (products, videos, jobs, music), creating a delightful user experience while driving incremental revenue.

Here we provide a practical overview of recommender systems.

First, we will introduce three types of systems: content-based, collaborative filtering, and hybrid methods.

Then we’ll discuss common problems for recommender systems: cold start, scalability, interpretability, and exploitation-exploration.

Content-based recommendation

At Pandora (a music streaming service), a team of musicians labeled each song with more than 400 attributes.

Then when a user selects a music station, songs are selected by matching the songs’ attributes to the description of the station’s music taste (Music Genome Project|Pandora, Howe|Pandora).

This is content-based recommendation. Users and items have profiles describing their characteristics, and the system recommends an item to a user when the two profiles match.

Stitch Fix is another example of content-based recommendation.

Users’ attributes (height, weight, etc.) are collected, and fashion products that match those attributes can be recommended (Stitch Fix|2013).

In some cases, the attributes of users/items are readily available: users on Linkedin provide their own work experience and skills, merchants on Amazon provide information about their products, and customers of Stitch Fix provide their preferences. In other cases, additional effort is needed to create those attributes, e.g., Pandora’s Music Genome Project.

A straightforward way of matching users and items is keyword matching.

For example, for job recommendations, one can match the keywords from a job description to the ones from job seekers’ resumes.

Term frequency-inverse document frequency (TF-IDF) is often used to give more weight to keywords that are distinctive for an item or user.
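For example, a rough sketch of TF-IDF matching with scikit-learn (the job description and resumes below are made-up examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical job description and candidate resumes
job = "senior data scientist with python and machine learning experience"
resumes = [
    "machine learning engineer, python, deep learning",
    "certified public accountant, ten years of auditing",
]

# Fit TF-IDF on all documents so rare, distinctive keywords get higher weight
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([job] + resumes)

# Rank resumes by cosine similarity to the job description
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
print(scores)  # higher score = better keyword match
```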

A more systematic way is to build a supervised model that predicts which types of users are more likely to like certain types of items.

In the model, the features are attributes of users and items (e.g., an indicator variable for whether a job and a job seeker are in the same industry), and the response variable is whether the user engaged with the item (e.g., whether the job seeker applied for the job).

The model is then applied to estimate each user’s propensity to engage with unseen items.
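A minimal sketch of such a model with scikit-learn’s logistic regression (the features and data below are illustrative, not from any of the cited systems):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one (job seeker, job) pair; features are illustrative:
# [same_industry, years_experience_gap, skills_overlap_ratio]
X = np.array([
    [1, 0, 0.8],
    [1, 2, 0.5],
    [0, 5, 0.1],
    [0, 1, 0.3],
])
# Response: did the job seeker apply for the job?
y = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# Propensity of a user engaging with an unseen item
new_pair = np.array([[1, 1, 0.6]])
print(model.predict_proba(new_pair)[0, 1])
```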

Content-based methods are computationally fast and interpretable.

They can be easily adapted to new items or new users.

However, businesses may need to employ people to manually create item/user attributes.

In addition, some characteristics may not be easy to capture or describe explicitly.

Stitch Fix addressed this by letting machine learning handle structured data and humans handle unstructured (or latent) data (Stitch Fix|2013).

Collaborative filtering

Collaborative filtering systems make recommendations based on users’ historical preferences for items (clicked, watched, purchased, liked, rated, etc.).

The preference can be presented as a user-item matrix.

Here is an example of a matrix describing the preferences of 4 users on 5 items, where p_{12} is user 1’s preference for item 2.
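Written out, the 4 × 5 preference matrix is:

$$
P =
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} & p_{15} \\
p_{21} & p_{22} & p_{23} & p_{24} & p_{25} \\
p_{31} & p_{32} & p_{33} & p_{34} & p_{35} \\
p_{41} & p_{42} & p_{43} & p_{44} & p_{45}
\end{bmatrix}
$$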

Although the entries can be numeric, e.g., in Netflix’s movie rating prediction challenge (ratings range from 1 to 5), in most applications they are binary (e.g., clicked, watched, purchased).

In reality, the user-item matrix can be larger than millions × millions (e.g., at Amazon or Youtube), and the majority of entries are missing. The goal of recommender systems is to fill in those missing entries.

Here we describe three collaborative filtering approaches: one operating in the original feature space (nearest-neighbor) and two operating in a latent space (matrix factorization and deep learning).

Nearest-neighbor

Nearest-neighbor methods are based on the similarity between pairs of items or users.

The missing values are commonly filled with zeros.

Cosine similarity is often used for measuring the distance.

The preference matrix can be represented as item vectors (its columns), where the similarity between items I1 and I2 is calculated as cos(I1, I2). It can also be represented as user vectors (its rows), where the similarity between users U1 and U2 is calculated as cos(U1, U2).

For user_i, we can recommend items liked by user_i’s most similar users (user-to-user), or items most similar to the ones user_i liked (item-to-item).
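A minimal numpy sketch of item-to-item similarity on a toy binary preference matrix (the values are made up for illustration):

```python
import numpy as np

# Toy user-item matrix: 4 users x 5 items, missing values filled with 0
P = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

# Items are the columns; normalize each item vector to unit length
items = P.T
norms = np.linalg.norm(items, axis=1, keepdims=True)
unit = items / norms

# Pairwise cosine similarity between all items
sim = unit @ unit.T

# For item 0, recommend the most similar other items
ranked = np.argsort(-sim[0])
print([i for i in ranked if i != 0])
```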

Item-to-item approaches are adopted more commonly in practice, e.g., by Amazon (Amazon|2003), Youtube (Youtube|2010), and Linkedin (Linkedin|2014).

When a customer engages with an item, an item-to-item system can quickly surface items similar to it (similar items for each item can be pre-calculated and saved in a key-value data store).

In addition, item-to-item recommendations can be more interpretable than user-to-user recommendations, e.g., the system can explain a recommendation with “because you liked X”.

It is possible that the number of items similar to an item is too small (after applying a threshold on the similarity scores).

One could expand the similar item list by including similar items’ similar items (Youtube|2010).

After getting the most similar items, a post-processing step can be useful.

(Youtube|2010) ranked the similar items according to video quality (e.g., measured by rating), diversity (e.g., limiting the number of recommendations from one channel), and user specificity (e.g., videos similar to a video the user watched longer should be ranked higher).

The three elements were combined with a linear model, providing a ranking.

Latent-factor methods

Latent-factor methods create a new, usually smaller, feature space for the original user or item vectors, reducing noise and speeding up real-time computation.

In the following, we introduce two ways — matrix factorization and deep learning.

Matrix factorization

Matrix factorization became popular during the Netflix recommendation challenge (Netflix|2012).

Here we introduce singular value decomposition and a more practical version for recommendations.

Singular value decomposition (SVD) decomposes the preference matrix as

$$P = U \Sigma V^T$$

where U and V are unitary matrices. For 4 users and 5 items, Σ is a diagonal matrix of singular values with sigma_1 > sigma_2 > sigma_3 > sigma_4. The preference of the first user for the first item can then be written as

$$p_{11} = \sum_{k=1}^{4} \sigma_k \, u_{1k} \, v_{1k}$$

that is, an entrywise product between the sigma vector and the first user vector, followed by a dot product with the first item vector. Note that u and v live in the same feature space, and the sigma vector represents the importance of each feature. Now let’s keep only the top two features based on the sigmas:

$$p_{11} \approx \sigma_1 u_{11} v_{11} + \sigma_2 u_{12} v_{12}$$

so that each item and user vector has a length of two.
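A small numpy sketch of this truncation (the ratings are made up; a real system would not run a dense SVD at production scale):

```python
import numpy as np

# Toy 4x5 preference matrix with missing entries filled with 0
P = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

U, sigma, Vt = np.linalg.svd(P, full_matrices=False)

# Keep the top-2 latent features
k = 2
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

# Low-rank reconstruction approximates the preference matrix
P_hat = U_k @ np.diag(sigma_k) @ Vt_k
print(np.round(P_hat, 2))

# User and item vectors now each have length two
user_vectors = U_k * sigma_k   # scaled by the importance of each feature
item_vectors = Vt_k.T
```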

Simon Funk’s SVD

Many entries in the preference matrix are missing, and regular SVD has the following issues: (1) how missing values are imputed can have an undesirable impact on the recommendations; (2) the computational complexity of training can be high when all entries are considered.

During the Netflix challenge, Simon Funk came up with a more practical formulation (Funk|2006):

$$\min_{U, V} \sum_{(i,j):\; p_{ij}\ \text{observed}} \left( p_{ij} - u_i \cdot v_j \right)^2$$

In the formula, only the non-missing entries p_{ij} are considered. The estimated score for the j-th item from the i-th user is

$$\hat{p}_{ij} = u_i \cdot v_j$$

Note that the user and item vectors do not have unit length as in SVD, but this doesn’t matter since the sum of squared errors is minimized.

Funk’s approach elegantly handles missing values.

It had great success in the Netflix challenge, and the idea was implemented by Netflix (Netflix|2012).
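A minimal stochastic gradient descent sketch of Funk’s formulation (learning rate, dimensions, and ratings are illustrative; practical implementations typically also add regularization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed (user, item, rating) triples only -- missing entries are skipped
ratings = [(0, 0, 5.0), (0, 3, 1.0), (1, 0, 4.0), (2, 3, 5.0), (3, 2, 5.0)]
n_users, n_items, k = 4, 5, 2

# Small random init for user and item latent vectors
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))

lr = 0.01
for epoch in range(200):
    for i, j, p in ratings:
        err = p - U[i] @ V[j]          # error on an observed entry
        U[i] += lr * err * V[j]        # gradient step on the user vector
        V[j] += lr * err * U[i]        # gradient step on the item vector

# Estimated score for user i, item j is the dot product u_i . v_j
print(U[0] @ V[3])
```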

Deep learning embedding

The sequence of items a user engaged with can be informative about their intention.

Deep learning can be used to model sequential information.

One idea is to use the skip-gram model (Airbnb|2018, Zillow|2018), originally developed for calculating word similarity.

Say a user’s item sequence is item1 -> item2 -> item3 -> item4 -> … The intuition is to use each item in a sequence to predict its neighboring items, i.e., to formulate a classification problem where each item is one class.

The training data include the neighboring K items of each item (the left K and right K items).

For example, with K = 1, item2 in the sequence above generates the training pairs (item2, item1) and (item2, item3).

Furthermore, each item is represented as a one-hot vector with a length equal to the number of items.

A neural network takes an item’s one-hot vector as the input and outputs the vector of one of its neighboring items; e.g., for the training example (item2, item1), the input is item2’s one-hot vector and the target output is item1’s.

The hidden layer is the new latent space, and each item can be mapped to the new feature space using the weights between the input layer and the hidden layer (essentially a linear combination of the original features).

In reality, there can be millions of items, and billions of examples are used to train the network.

To simplify the calculation, the negative sampling idea can be applied.

The idea is to update only the weights of the output item (item1 in the example above) and of a small number of other items randomly sampled, which makes the calculation much faster.

Once each item is represented in the new feature space, the similarity between items can be calculated, and recommendations can be made based on similarity scores.
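A hedged sketch using gensim’s Word2Vec, which implements skip-gram with negative sampling; each user’s item sequence plays the role of a sentence (the item IDs and parameters are illustrative):

```python
from gensim.models import Word2Vec

# Each user's engagement sequence is treated like a sentence of item IDs
sequences = [
    ["item1", "item2", "item3", "item4"],
    ["item2", "item3", "item5"],
    ["item1", "item4", "item5"],
]

model = Word2Vec(
    sentences=sequences,
    vector_size=32,   # dimension of the latent space
    window=1,         # K = 1: predict the left and right neighbor
    sg=1,             # skip-gram (not CBOW)
    negative=5,       # negative sampling: update 5 random items per example
    min_count=1,
)

# Items are now vectors; similar items fall close together
print(model.wv.most_similar("item2", topn=2))
```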

In some cases, users visit a sequence of items before converting, e.g., on Amazon, a user makes a purchase after a sequence of page views; on Airbnb, a user books a listing after viewing a few listings.

This information can be included by adding the converted item to every item’s training pairs (Airbnb|2018).

In this way, users can reach their favorite items more efficiently (potentially improving the conversion rate).
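A tiny sketch of generating training pairs with the converted item added to every pair (the sequence and window size are made up):

```python
def training_pairs(sequence, converted, k=1):
    """Skip-gram pairs plus a (center, converted_item) pair for every center."""
    pairs = []
    for idx, center in enumerate(sequence):
        # Neighboring items within the window of size k
        for j in range(max(0, idx - k), min(len(sequence), idx + k + 1)):
            if j != idx:
                pairs.append((center, sequence[j]))
        # The converted item is paired with every item in the sequence
        if center != converted:
            pairs.append((center, converted))
    return pairs

print(training_pairs(["item1", "item2", "item3"], converted="item9"))
```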

Hybrid approaches

Hybrid approaches use information from both user-item interactions and user/item characteristics.

The “companies you may want to follow” feature at Linkedin used both content and collaborative filtering information (Linkedin|2014).

To determine whether a user may want to follow a company, a logistic regression classifier is built on a set of features.

The collaborative filtering information is included as a feature indicating whether the company is similar to the companies the user already follows.

The content information includes whether the industry, location, etc., match between the user and the company.
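A sketch of such a hybrid classifier (the feature names are illustrative, not Linkedin’s actual feature set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (user, company) pair. Features mix both signal types:
# [cf_similar_to_followed, same_industry, same_location]
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
])
# Response: did the user follow the company?
y = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1, 1, 0]])[0, 1])  # follow propensity
```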

Deep learning models can be powerful in combining collaborative-filtering and content-based information.

The Youtube recommender system (Youtube|2016) built deep learning models to predict a user’s next watch given the user’s previous activity (search queries and videos watched) and static information (gender, location, etc.).

Watched videos and queries are represented as embeddings.

Since neural networks typically take fixed-length inputs, a user’s watched-video and query embeddings are averaged, then concatenated with the other static features.

It is recommended to embed categorical features into a much smaller space (with dimension roughly proportional to the logarithm of the number of unique values) and to normalize continuous features to between 0 and 1 (Youtube|2016).
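A numpy sketch of this input construction, under assumed dimensions: average the variable-length embedding lists, then concatenate them with normalized static features.

```python
import numpy as np

emb_dim = 4  # illustrative embedding dimension

# Variable-length histories: embeddings of watched videos and search queries
watched = np.random.rand(7, emb_dim)   # 7 watched videos
queries = np.random.rand(3, emb_dim)   # 3 past queries

# Average each list to get fixed-length vectors
watch_vec = watched.mean(axis=0)
query_vec = queries.mean(axis=0)

# Static features, with the continuous one (age) normalized to [0, 1]
age = 37
static = np.array([age / 100.0, 1.0])  # [normalized age, gender indicator]

# Fixed-length input for the network
x = np.concatenate([watch_vec, query_vec, static])
print(x.shape)  # (10,)
```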

Hybrid methods can fall back on content-based recommendations when a user/item has little or no activity, and become more accurate as more activity information is recorded.

It should also be noted that such hybrid systems can be more complex than either approach alone.

Cold start, Scalability, Interpretability, and Exploitation-Exploration

Accurate recommendations cannot be made for new users/items with no or little information.

This is referred to as the cold start problem.

This is a typical issue for collaborative filtering systems that rely on user-item interactions.

Some heuristics can be used.

For a new user, the most popular items in the user’s area could be recommended.

For a new item, some rule-based similarity criteria can be defined.

For example, Airbnb used the average of the 3 geographically closest listings of the same type and price range to approximate a new listing (Airbnb|2018).
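A short sketch of this heuristic (it assumes the neighbors have already been filtered by type and price range):

```python
import numpy as np

def cold_start_embedding(neighbor_embeddings):
    """Approximate a new listing's vector by averaging its 3 closest
    comparable listings, following the heuristic in (Airbnb|2018)."""
    return np.mean(neighbor_embeddings[:3], axis=0)

# Embeddings of the 3 geographically closest comparable listings (made up)
neighbors = np.array([[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]])
print(cold_start_embedding(neighbors))
```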

Scalability is a key factor when determining which type of recommender systems to use.

More complex systems need more people (who are potentially harder to hire) to build and maintain, along with larger hardware costs.

It can be a long-term commitment, so businesses should weigh the incremental business gain against the increased cost.

With that being said, here are a few key elements of building scalable systems.

Offline batch computing and online serving.

With a large number of users and items, one has to calculate easily fetchable recommendations offline by batch.

For example, Linkedin used Hadoop for batch processing the user-item event data, and the recommendations are then loaded into a key-value store for low-latency queries (Linkedin|2014).

Sampling.

When dealing with millions of users and items, sampling can be considered: randomly sampling items or users, or removing items without significant user engagement.

Leveraging sparsity.

In recommender systems, the user-item preference matrix is often very sparse with the majority of the entries being missing.

Leveraging the sparsity can significantly reduce computational complexity (Amazon|2003).

Multi-phase modeling.

The Youtube recommender system divided the modeling process into two steps.

In the first phase, only user-item activity data are used to prune millions of videos down to hundreds of candidates.

In the second phase, it is then feasible to use more information about the candidate videos for further selection and ranking (Youtube|2016).

Scaling deep networks.

Neural networks for large-scale recommender systems often consist of a small number of layers.

Only one hidden layer was used for Airbnb listing recommendations (Airbnb|2018), and one linear layer plus 3 ReLU layers were used for Youtube video recommendations (Youtube|2016).

In addition, although a softmax or other function is used in the output layer during training, probabilities don’t have to be calculated at serving time; a nearest-neighbor search can be run on the output of the last hidden layer instead.

Negative sampling mentioned earlier should also be considered so that for each training example, only a small number of classes' weights are updated.
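A numpy sketch of that serving shortcut: take the last hidden layer’s output as the user vector and run a dot-product nearest-neighbor search over the output-layer item weights (sizes are illustrative):

```python
import numpy as np

n_items, hidden_dim = 100_000, 64

# Output-layer weights learned during training: one row per item (video)
item_weights = np.random.rand(n_items, hidden_dim).astype(np.float32)

# At serving time, the last hidden layer's output acts as the user vector
user_vector = np.random.rand(hidden_dim).astype(np.float32)

# Rank by dot product; skipping the softmax doesn't change the ranking
scores = item_weights @ user_vector
top = np.argpartition(-scores, 10)[:10]   # unordered top-10 candidates
print(top[np.argsort(-scores[top])])      # sorted top-10 item ids
```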

Interpretability.

From the customer side, it can be helpful to state why a recommendation is made.

When recommending a video to a user, Youtube added a link to the watched video that triggered the recommendation (Youtube|2010).

From the modeling side, interpretability helps developers understand and debug the system.

Content-based approaches are easy to interpret, while collaborative filtering models are harder to understand, especially when a latent space is used.

One can cluster items or users based on their original feature space (nearest-neighbor approaches) or latent space (matrix factorization and deep learning approaches), and check whether objects from the same cluster share similar characteristics (e.g., location, Airbnb|2018).

In addition, the t-SNE algorithm (Maaten|2008) can be used to project a high-dimensional space onto a 2-dimensional space for visualization (Zillow|2018).
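A minimal scikit-learn sketch (random vectors stand in for learned item embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for learned item embeddings: 200 items in a 32-d latent space
embeddings = np.random.rand(200, 32)

# Project to 2-D for plotting; perplexity is a typical default-range value
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```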

It can also be useful to have a tool for quickly viewing recommendations as a sanity check, e.g., Airbnb developed an internal exploration tool for validating its recommendations (Airbnb|2018).

Exploitation-Exploration.

Recommender systems should not overfit the historical user-item preference data (pure exploitation), to avoid getting stuck in a local optimum and making recommendations with a repeated pattern.

First, one should avoid the training data being fully impacted by previous recommendations.

Youtube includes videos embedded in other sites in its training data; videos watched outside the Youtube site are not driven by the recommender system and can effectively surface new content (Youtube|2016).

One may also consider injecting some randomness into the system (e.g., making random recommendations).

Simple rules can be added to the system to increase the diversity of recommendations.

For example, in Youtube recommendation (Youtube|2010), videos too similar to each other are removed, and the number of videos coming from the same channel is limited.

Methods from multi-armed bandits may also be applied.

The upper confidence bound (UCB) was applied by Uber Eats to increase the diversity of recommended restaurants/dishes (Uber|2018).

The idea of the upper confidence bound is to rank items by the upper bound of the estimated success rate (e.g., order rate, click rate, watch rate) rather than by its point estimate.

When a new item arrives without any information, the confidence interval is [0,1], and so the upper bound is 1.

Therefore, the new item would have a higher chance to be recommended.

As the item gets more impressions, the estimation would be more accurate, and the upper bound would be closer to its true value.
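A minimal sketch of UCB1-style scoring (this is the standard UCB1 bound, not necessarily the exact variant Uber used):

```python
import math

def ucb_score(successes, impressions, total_impressions):
    """Mean success rate plus an exploration bonus that shrinks
    as the item accumulates impressions (UCB1)."""
    if impressions == 0:
        return 1.0  # no data: upper bound of the [0, 1] interval
    mean = successes / impressions
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions)
    return min(1.0, mean + bonus)

# Items as (successes, impressions); a brand-new item ranks near the top
items = {"dish_a": (90, 1000), "dish_b": (2, 10), "new_dish": (0, 0)}
total = sum(n for _, n in items.values())
ranked = sorted(items, key=lambda d: -ucb_score(*items[d], total))
print(ranked)  # the new dish gets a chance to be recommended
```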

Final words

This article discussed methodologies and key considerations for building a practical recommender system.

In practice, no single method should always be favored over the others.

Simple nearest-neighbor approaches did not prevent Amazon and Youtube from becoming the most successful e-commerce and video sites.

Companies should make choices based on multiple factors like accuracy, complexity, and business impact, under realistic resource constraints (e.g., engineers and hardware).

References

[1] Music Genome Project. Pandora | Wikipedia
[2] Pandora’s Music Recommender. Michael Howe | Pandora
[3] Amazon.com recommendations: Item-to-item collaborative filtering. Linden, Smith, and York | Amazon | 2003
[4] The YouTube video recommendation system. Davidson, Liebald, Liu, Nandy, Van Vleet, Gargi, Gupta, et al. | Youtube | 2010
[5] The Browsemaps: Collaborative Filtering at LinkedIn. Wu, Shah, Choi, Tiwari, and Posse | Linkedin | 2014
[6] Netflix Recommendations: Beyond the 5 stars. Netflix Technology Blog | 2012
[7] Netflix Update: Try This at Home. Simon Funk | 2006
[8] Listing Embeddings in Search Ranking. Mihajlo Grbovic | Airbnb | 2018
[9] Deep neural networks for YouTube recommendations. Paul Covington, Jay Adams, and Emre Sargin | Youtube | 2016
[10] Home Embeddings for Similar Home Recommendations. Sangdi Lin | Zillow | 2018
[11] Visualizing data using t-SNE. Laurens van der Maaten and Geoffrey Hinton | 2008
[12] Food Discovery with Uber Eats: Recommending for the Marketplace. Yuyan Wang, Yuanchi Ning, Isaac Liu, and Xian Xing Zhang | Uber | 2018
[13] Using Human and Machine Processing in Recommendation Systems. Eric Colson | Stitch Fix | 2013
