Interesting new working paper from Netflix that addresses the question: what’s the value of its recommendation system in terms of engagement?
To answer this, the authors build a multinomial logit choice model that attempts to causally quantify how much of a user’s choice of content at time t, given all of their previous content consumption, was influenced by what was recommended to them in Netflix’s various high-profile recommendation units (billboard, Top 25, Top 100) during that session.
To do this, the authors implement something conceptually similar to Two Towers but more computationally efficient: instead of generating per-user embeddings, they dynamically calculate a user preference vector at each timestep t from past watch history. The user’s watch history, which can vary in length from user to user, is represented as a sequence of content embeddings. Each of those content embeddings is run through a two-layer MLP, and each is given a weight by a mechanism similar to “attention” that determines its influence on the user’s overall preference vector. The user’s preference vector at time t is the weighted sum of those transformed vectors.
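A minimal sketch of that pooling step, in NumPy. This is an illustration under stated assumptions, not the paper's implementation: the ReLU activation, the layer sizes, and the scoring vector `q` used to produce the attention-like weights are all hypothetical choices made here for the example.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP applied row-wise (ReLU is an assumed activation)."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def user_preference(history, W1, b1, W2, b2, q):
    """Pool a variable-length watch history into one preference vector.

    history: (n_watched, d_in) content embeddings watched before time t.
    q:       (d_out,) hypothetical scoring vector for attention-like weights.
    """
    h = mlp(history, W1, b1, W2, b2)        # (n_watched, d_out)
    scores = h @ q                          # one scalar score per watched title
    scores = scores - scores.max()          # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ h, weights             # weighted sum of transformed vectors

# Randomly initialized parameters, for illustration only.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(size=(d_in, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_out))
b2 = np.zeros(d_out)
q = rng.normal(size=d_out)

# Histories of different lengths map to preference vectors of the same size.
short_history = rng.normal(size=(3, d_in))
long_history = rng.normal(size=(12, d_in))
pref_short, w_short = user_preference(short_history, W1, b1, W2, b2, q)
pref_long, w_long = user_preference(long_history, W1, b1, W2, b2, q)
```

Because the same small set of parameters is shared across all users, nothing has to be stored per user; the preference vector is recomputed from whatever history is available at time t.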
The dot product of this user preference vector with each item’s embedding (like a “Content Tower”) gives a base score, and a “recommendation boost” is added to it, realized with binary indicators of whether that item was recommended to the user in each of the high-profile placements during the session.
This yields a utility for every item, for every user, at every timestep. These utilities (raw logits) are converted into probabilities with a softmax, and the model is trained to maximize the log likelihood of the content the user actually watched at that time. Using natural variation in recommendation outputs, this approach can estimate the counterfactual: what would a user likely have watched if their recommendations had been different?
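The utility and likelihood described above can be sketched roughly as follows. The names (`boosts`, `rec_flags`) and all numbers are made up for illustration, and the sketch assumes one scalar boost per placement; the paper's actual parameterization may differ.

```python
import numpy as np

def utilities(pref, item_emb, rec_flags, boosts):
    """Utility per item: dot product with the preference vector plus boosts.

    pref:      (d,) user preference vector at time t.
    item_emb:  (n_items, d) item embeddings.
    rec_flags: (n_items, n_placements) 0/1, was item j shown in placement k.
    boosts:    (n_placements,) assumed per-placement boost parameters.
    """
    return item_emb @ pref + rec_flags @ boosts

def log_softmax(u):
    """Numerically stable log of the softmax probabilities."""
    shifted = u - u.max()
    return shifted - np.log(np.exp(shifted).sum())

def neg_log_likelihood(pref, item_emb, rec_flags, boosts, chosen):
    """Negative log likelihood of the item the user actually watched."""
    return -log_softmax(utilities(pref, item_emb, rec_flags, boosts))[chosen]

# Toy example: 4 items, 2 embedding dimensions, 2 placements.
pref = np.array([0.5, -0.2])
item_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-0.3, 0.4]])
boosts = np.array([1.2, 0.4])
rec_flags = np.array([[1, 0], [0, 0], [0, 1], [0, 0]], dtype=float)
no_recs = np.zeros_like(rec_flags)

probs = np.exp(log_softmax(utilities(pref, item_emb, rec_flags, boosts)))
nll_with_rec = neg_log_likelihood(pref, item_emb, rec_flags, boosts, chosen=0)
nll_without_rec = neg_log_likelihood(pref, item_emb, no_recs, boosts, chosen=0)
```

With a positive boost, an observed choice of a recommended item is better explained when its recommendation flag is on, so `nll_with_rec` is lower than `nll_without_rec`. Training would minimize the sum of such terms over all users and timesteps.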
With the model parameters learned, the authors can predict counterfactual choices by users at each timestep to measure the causal impact of recommendations, decomposed across three dimensions: selection, targeting, and exposure.
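As a rough illustration of the counterfactual step, one can hold the fitted parameters fixed, change the recommendation indicators, and compare the predicted choice probabilities. The sketch below uses made-up numbers and omits any outside option (not watching anything), which is a simplification assumed here; as a result it only shows how viewing share is reallocated across titles, not how total consumption changes.

```python
import numpy as np

def choice_probs(pref, item_emb, rec_flags, boosts):
    """Softmax choice probabilities under a given set of recommendations."""
    u = item_emb @ pref + rec_flags @ boosts
    e = np.exp(u - u.max())
    return e / e.sum()

# Hypothetical "learned" parameters for a single user and timestep.
pref = np.array([0.5, -0.2])
item_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
boosts = np.array([1.2, 0.4])

# Factual: item 0 shown in placement 0, item 2 shown in placement 1.
factual_flags = np.array([[1, 0], [0, 0], [0, 1]], dtype=float)
# Counterfactual: nothing is recommended.
counterfactual_flags = np.zeros_like(factual_flags)

p_factual = choice_probs(pref, item_emb, factual_flags, boosts)
p_counter = choice_probs(pref, item_emb, counterfactual_flags, boosts)

# Per-item change in choice probability attributable to the recommendations.
shift = p_factual - p_counter
```

Here the strongly boosted item gains share while the unrecommended item loses it. Averaging such differences over users and timesteps, under different counterfactual recommendation policies, is one way to build up the kind of decomposition the paper reports.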
The paper indicates that Netflix’s recommender system drives roughly 40% of all viewing, with most of the lift coming from selection and targeting, while exposure (salience) effects from top placements still produce measurable incremental hours. In counterfactual simulations without recommendations, overall consumption drops by about one-third. For context, the paper notes that matrix factorization was the classical, pre-deep-learning approach to recommender systems, including the one Netflix used during the Netflix Prize era.
Paper linked below.
