Towards Generalizable and Efficient Large-Scale Generative Recommenders
Source: Netflix Technology
Authors: Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya
Abstract
We present our approach to scaling generative recommendation models from O(1M) to O(1B) parameters, achieving substantial improvements on Netflix recommendation tasks. Successfully training such large-scale models requires coordinated advances in scaling law prediction, computational efficiency, training stability, and evaluation methodology. Our large-scale evaluations reveal significant alignment challenges when applying these models to downstream applications. To address this, we develop systematic alignment strategies, including multi-token prediction and cold-start adaptation, ensuring that scaling benefits translate into production improvements. In this tech blog, we highlight some of these techniques. Our work offers a new perspective on training and deploying large-scale generative recommendation models, demonstrating promising results that follow scaling laws.
1. Introduction
Inspired by the remarkable success of Large Language Models (LLMs), generative recommendation tasks treat user behavior data as compressed sequences and frame user understanding as a next-event prediction problem. However, training large-scale generative recommendation systems presents unique challenges distinct from those in natural language processing. Below, we outline these challenges and propose techniques to address them in this study.
Understanding Scaling Laws
A fundamental challenge is understanding the scaling laws unique to recommendation tasks. While prior work provides useful baselines, we observe novel scaling dynamics that differ from established laws such as the Chinchilla law [1] and the log-scale law [2]. We find that the scaling law reveals the performance ceiling and scaling potential of each task. These insights allow us to better allocate resources and improve personalization strategies.
Training and Inference Efficiency
Efficient scaling is critical, as recommendation systems must handle catalogs that are orders of magnitude larger than those in language models. For example, in the Netflix use case, our catalog is roughly 40 times larger than GPT-3's vocabulary, and our model periodically digests about 2 trillion tokens, compared with the 500B tokens used to train GPT-3. This leads to significant computational overhead during both training and inference. To address this, we employ techniques such as compressed heads and sampled softmax to reduce computational costs without substantial performance loss.
Cold-Start and Unseen Items
Recommendation models must also adapt to new items and evolving knowledge. As model size increases, so does the time required for training and inference, widening the gap between model updates and real-world deployment. Moreover, properly encoding a new item in an ID-based embedding table requires many interactions in the training data, and this data requirement grows further with the large embedding dimensions of scaled models. These factors exacerbate the cold-start problem for new items. To address this, we incorporate multi-modal semantic towers, enabling the model to infer properties of unseen items from their content and metadata.
Generalization Under Different Latency Profiles
Finally, serving large models in real time often requires caching, leading to a distribution shift between pre-training and downstream applications due to serving latency. Achieving strong performance across a range of serving latencies is a fundamental integration challenge. Our alignment strategies use multi-token objectives to help bridge this gap.
In the following, we present our approach for designing generalizable and efficient large-scale generative recommenders in detail.
2. Scaling Law In Recommendation
Understanding scaling laws is a fundamental research question when developing large generative recommendation models. To explore the relationship between model size and task performance, we conducted a series of ablation studies. Unlike prior work, we observed that user activities vary widely in predictability — some are inherently easier to predict, while others are more challenging. The distribution of these activity types can also fluctuate over time, complicating analysis. To minimize distribution shifts across different time periods, we categorized user activities into three broad types for simplicity.
To address the known issue of non-progressive metrics [3], we evaluated model performance using a continuous metric, Mean Reciprocal Rank (MRR), which better captures incremental improvements. Similar scaling patterns were observed with other metrics such as test loss, Hit Rate, and NDCG.
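For concreteness, MRR can be computed as follows; this is a generic NumPy sketch of the metric, not Netflix's evaluation code:

```python
import numpy as np

def mean_reciprocal_rank(scores: np.ndarray, targets: np.ndarray) -> float:
    """MRR over a batch. scores: (batch, catalog) model scores;
    targets: (batch,) index of the true next item per user."""
    target_scores = scores[np.arange(len(targets)), targets]
    # Rank of the target = 1 + number of items scored strictly higher.
    ranks = 1 + (scores > target_scores[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))

scores = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.5]])
targets = np.array([1, 2])  # true items ranked 1st and 2nd respectively
print(mean_reciprocal_rank(scores, targets))  # (1/1 + 1/2) / 2 = 0.75
```

Because reciprocal rank changes smoothly as the target climbs the ranking, small model improvements register as small metric gains, unlike hit-or-miss metrics.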

For all three tasks, model performance approximately follows a similar scaling law as described in [1]. However, in contrast to the original formulation, our observed scaling power law includes an intercept term P0:
P(N) = P0 − a · N^(−α)

where N is the number of non-embedding model parameters, a and α are fitted task-specific constants, and P0 is the intercept term discussed below.
Here, P0 represents the upper limit of achievable performance for a given task, effectively quantifying its inherent difficulty. For example, Task A's intercept is 0.311, indicating that, under current conditions, further scaling of model size would asymptotically approach a maximum performance of 0.311. Task C, by contrast, has an intercept of 1.075, suggesting that, given sufficient data and scale, it could theoretically achieve near-perfect accuracy (noting that MRR is capped at 1; values above this reflect estimation error). Readers may observe non-linearity in Figure B; this is largely noise amplified by the exponentially stretched y-axis.
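Such a power law with an intercept can be fitted with a simple grid search over the exponent; the (model size, MRR) pairs below are purely illustrative numbers, not our measured results:

```python
import numpy as np

# Illustrative (model size, MRR) pairs -- NOT measured results.
sizes = np.array([1e6, 5e6, 2.5e7, 1e8, 5e8, 1e9])
mrr = np.array([0.180, 0.242, 0.275, 0.290, 0.300, 0.303])

# Fit P(N) = P0 - a * N^(-alpha): grid-search alpha, solve (P0, a) linearly.
best = None
for alpha in np.linspace(0.05, 1.0, 96):
    x = sizes ** (-alpha)
    A = np.stack([np.ones_like(x), -x], axis=1)   # columns: [1, -N^-alpha]
    sol, *_ = np.linalg.lstsq(A, mrr, rcond=None)
    sse = float(np.sum((A @ sol - mrr) ** 2))
    if best is None or sse < best[0]:
        best = (sse, sol[0], sol[1], alpha)

_, p0, a, alpha = best
print(f"estimated ceiling P0 ~ {p0:.3f}, exponent alpha ~ {alpha:.2f}")
```

The fitted P0 is the estimated performance ceiling: the toy data above asymptote near 0.31, so the fit recovers a ceiling of roughly that value.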
We also compared our findings to the scaling law from [2], which models performance as P = a·log(N) + b. While this log-linear relationship fits some tasks, it fails for tasks whose performance approaches a natural ceiling. Our comparison in the figure shows that the power law with an intercept term fits better than the log law.
For parameter counting, we exclude the embedding and final decoding layers, as literature [1] suggests embedding has a secondary effect on the scaling law. Empirically, we observed that including them does not materially affect our conclusions.
3. Efficient Training and Inference
Efficiency is a critical requirement that distinguishes recommendation systems from typical language model deployment. Unlike LLMs, which are often trained once and deployed for long periods, recommendation systems must be retrained frequently to capture evolving user preferences, seasonal trends, and updates to the content catalog. This need for frequent updates makes training efficiency essential for operational feasibility.
Our training datasets comprise trillions of tokens, several times the corpora used for major LLMs like GPT-3 (500B tokens) or PaLM (780B tokens), and they must be processed far more frequently. As a result, the ongoing computational demands of recommendation systems far exceed the one-time training costs associated with most LLMs.
To address these challenges, we leverage standard optimizations such as mixed-precision training, gradient compression, and mixtures of experts. However, we find that the most significant gains come from innovations in efficient encoding and, especially, efficient decoding. Below, we focus on advances in decoding efficiency.
Efficient Decoding
A key difference from language models is that large recommendation models must manage much larger catalogs. While models like Llama 3 and GPT-4 (per open-source estimates) have vocabularies of about 100,000 tokens, recommendation systems often require vocabularies numbering in the millions or even billions to support diverse entities such as new shows and clips. The cost of decoding across such expansive vocabularies can dominate overall training costs. As shown in the figure, training FLOPs remain relatively flat with small vocabularies but increase sharply as vocabulary size grows, with decoding overhead becoming the primary bottleneck for vanilla transformers. This calculation assumes a standard transformer with 6 layers, a hidden dimension of 1024, and a sequence length of 512.
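A back-of-envelope estimate shows why decoding dominates as the vocabulary grows. The sketch below uses the standard ~6 FLOPs-per-parameter rule of thumb and approximate constants; it is not our exact accounting:

```python
def training_flops_per_token(layers=6, d=1024, seq=512, vocab=100_000):
    """Back-of-envelope forward+backward FLOPs per token, split into the
    transformer body and the decoding head (~6 FLOPs per parameter)."""
    body_params = layers * 12 * d * d        # attention + MLP weights per layer
    attn_extra = layers * 2 * seq * d        # score/context matmul work per token
    head_params = d * vocab                  # output projection onto the catalog
    return 6 * (body_params + attn_extra), 6 * head_params

for vocab in (100_000, 1_000_000, 100_000_000):
    body, head = training_flops_per_token(vocab=vocab)
    print(f"vocab={vocab:>11,}: decoding is {head / body:.1f}x the body cost")
```

With a 100K vocabulary the head roughly matches the body cost; at 100M items it is hundreds of times larger, which is why decoding becomes the bottleneck.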

To mitigate decoding costs, we adopt sampled softmax, which samples a subset of the vocabulary as negative candidates before computing logits. This approach effectively reduces vocabulary size during training. For illustration, we use uniform random sampling, though many other sampling strategies exist.
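A minimal NumPy sketch of sampled softmax with uniform negatives follows. It is illustrative only; production systems typically correct for accidental positive-negative collisions and may use learned samplers:

```python
import numpy as np

def sampled_softmax_loss(hidden, item_emb, positive_ids,
                         num_negatives=1000, seed=0):
    """Cross-entropy over {positives} + uniformly sampled negatives instead
    of the full catalog. hidden: (batch, d) penultimate outputs;
    item_emb: (catalog, d) output item embeddings."""
    rng = np.random.default_rng(seed)
    batch, catalog = hidden.shape[0], item_emb.shape[0]
    neg_ids = rng.integers(0, catalog, size=num_negatives)  # uniform negatives
    cand_ids = np.concatenate([positive_ids, neg_ids])
    logits = hidden @ item_emb[cand_ids].T       # (batch, batch + num_negatives)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Row i's positive sits at column i (its slot in positive_ids).
    return float(-log_probs[np.arange(batch), np.arange(batch)].mean())

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 32))
item_emb = rng.normal(size=(50_000, 32)) * 0.01   # toy 50k-item catalog
pos = rng.integers(0, 50_000, size=4)
loss = sampled_softmax_loss(hidden, item_emb, pos)
print(loss)
```

The softmax is computed over ~1,000 candidates rather than the full 50,000-item catalog, cutting the decoding matmul proportionally.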
While sampled softmax alleviates training costs, we further introduce a projected head to downsize the embedding dimension of each item. This involves adding an activation function with a linear projection layer, projecting the penultimate output to a smaller dimension before calculating logits. Notably, reducing the embedding size also facilitates easier downstream adoption by application teams.
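A sketch of the projected head, assuming a ReLU activation and illustrative dimensions (the specific sizes and initializations here are not our production configuration):

```python
import numpy as np

def projected_head_logits(hidden, proj_w, proj_b, item_emb_small):
    """Project the penultimate output to a smaller dimension (activation +
    linear projection), then compute logits against small item embeddings."""
    z = np.maximum(hidden @ proj_w + proj_b, 0.0)  # ReLU + projection
    return z @ item_emb_small.T                    # logits over the catalog

rng = np.random.default_rng(0)
d_model, d_small, catalog, batch = 1024, 128, 20_000, 8
hidden = rng.normal(size=(batch, d_model))
proj_w = rng.normal(size=(d_model, d_small)) * 0.02
proj_b = np.zeros(d_small)
item_emb_small = rng.normal(size=(catalog, d_small)) * 0.02

logits = projected_head_logits(hidden, proj_w, proj_b, item_emb_small)
print(logits.shape)  # (8, 20000)
```

Decoding now costs roughly d_small × catalog per token instead of d_model × catalog, an 8x saving in this sketch, and the smaller item embeddings are also easier for downstream teams to consume.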
We compare training FLOPs across different vocabulary sizes and training techniques. The x-axis in the Figure represents vocabulary size, and the y-axis represents required training FLOPs. Combining sampled softmax with projected heads can reduce training costs by one to two orders of magnitude.
It is also worth noting that our approach enables further scaling of model size, balancing performance improvements against serving costs. Currently, our model trains periodically on 80 A100 GPUs (80GB each) for 240 hours per cycle, with frequent fine-tuning.
Efficient Encoding
Unlike HSTU, we vectorize both actions and observations into a single token and leverage item towers for deeper item understanding. Semantic item towers are covered in a later section.
Efficient Development
Efficiency is also crucial throughout model experimentation during research and development. We begin with a comprehensive evaluation pipeline covering diverse application patterns and usage scenarios, and scale our test sets to enable meaningful analysis of differences. Initial experiments are conducted on small models, and promising changes are then integrated into large-scale training.
4. Multi-Token Prediction (MTP)
The Latency Issue in Next-Token Prediction
In real-world deployments, outputs from large recommender models are often cached and served minutes to hours after new user activities — either due to batch processing or to control costs via embedding-based serving. These caches are refreshed periodically, triggered by thresholds such as “N events since last refresh” or “every H hours.” In many systems, cached outputs also serve as input signals for downstream models, resulting in effective latencies that can range from minutes to days. While operationally necessary, such latency introduces a critical mismatch: the “next event” used as the training target may no longer align with what the user actually experiences when the cached recommendation is served.
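The refresh triggers described above can be sketched as a simple predicate; the threshold values here are illustrative, not production settings:

```python
def should_refresh(events_since_refresh, hours_since_refresh,
                   n_events=50, h_hours=6.0):
    """Refresh cached model outputs when either trigger fires:
    N user events since the last refresh, or H elapsed hours.
    The defaults are illustrative, not production values."""
    return events_since_refresh >= n_events or hours_since_refresh >= h_hours

print(should_refresh(3, 2.0))    # neither trigger fired -> False
print(should_refresh(75, 0.5))   # event trigger fired -> True
```

Whatever the trigger, everything served between two refreshes was computed against a stale user context, which is the source of the mismatch described next.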

Consider a user whose recent viewing history is highlighted in blue. A model trained with next-token prediction focuses on the immediate next title, A (outlined in red). However, due to cache latency, by the time the recommendation for A is served, the user may have already watched A and is now ready for title B. This causes a mismatch between training and downstream applications.
To quantify the impact of latency, we conducted ablation studies comparing user metrics under different serving delays. We observed that long-term tasks (Task A) are less affected by latency, while short-term tasks (Task B) degrade significantly. Task C degrades less than Task B, potentially because the task is easier.

Permutation Invariance in Recommendation
A second source of misalignment arises from the permutation invariance of recommendation targets. Unlike language, where word order is critical, the set of relevant next items in recommendation is often unordered. For example, if a user is likely to watch both A and B next, either order (A → B or B → A) is valid. Standard next-token prediction (NTP) forces the model to commit to a single ordered outcome, penalizing other valid sequences and introducing brittleness — especially when multiple equally relevant options exist, such as several shows matching a user’s long-term taste or interchangeable choices in short-term exploration.
To address both the latency and permutation invariance issues, we replace next-token prediction with multi-token prediction (MTP), while maintaining efficiency with a single decoding head. For each user context, we supervise the model to predict a multi-label set of future items drawn from a time window that matches the system’s serving latency. The model computes logits over the effective catalog in a single pass (optionally using our projected head for efficiency) and optimizes a weighted cross-entropy over the label set — without iterative decoding. We apply an exponential time decay to reweight labels based on their temporal proximity to the context.
Formally, the loss is defined as:
Loss = Σ_Label 2^(−ΔT(Label)/β) · Reward(Label) · CrossEntropy(Label)

where ΔT(Label) is the time between the user context and the labeled event, and:
- Reward(Label) encodes utility signals such as watch-time, completion, novelty, or diversity.
- CrossEntropy(Label) is the standard per-item cross-entropy, using either all items or sampled softmax negatives.
- β is a tunable half-life parameter that matches the cache’s refresh horizon.
In practice, we construct label sets from the next few tokens within a configurable window, tuning β based on domain knowledge and ablation results. Empirically, we find that appropriately weighting future events matters as much as weighting near-term ones, and that reasonable choices of β have limited impact on overall results.
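A minimal sketch of the MTP objective for a single user context, assuming the half-life decay form 2^(−ΔT/β); all item IDs, timestamps, and reward values are illustrative:

```python
import numpy as np

def mtp_loss(logits, label_ids, label_dt_hours, rewards, beta_hours=24.0):
    """Multi-token prediction loss for one user context: reward-weighted
    cross-entropy over a set of future items, down-weighted by a half-life
    time decay 0.5 ** (dt / beta). Uses a single decoding pass."""
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    decay = 0.5 ** (np.asarray(label_dt_hours) / beta_hours)
    ce = -log_probs[np.asarray(label_ids)]         # per-item cross-entropy
    return float(np.sum(decay * np.asarray(rewards) * ce))

rng = np.random.default_rng(0)
logits = rng.normal(size=10_000)                   # scores over a toy catalog
# Three future items observed 1h, 12h, and 40h after the context window.
loss = mtp_loss(logits, label_ids=[11, 42, 7],
                label_dt_hours=[1.0, 12.0, 40.0], rewards=[1.0, 0.8, 1.2])
print(loss)
```

Because every label in the window contributes to the same set of logits, the model is not penalized for producing the valid items in a different order, which addresses the permutation-invariance issue.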
We evaluated MTP by comparing large recommendation models in two serving scenarios:
Online serving (p95 latency < 1s)
Low-cost, long-latency serving (48-hour latency)
As shown in the following figure, MTP delivers substantial improvements across all tasks when serving with high latency. In online serving, MTP also improves performance for permutation-invariant tasks (like Task A), as swapping similar items does not disrupt user experience. However, for tasks with strong short-term dependencies (Tasks B and C), MTP can introduce a slight metric drop, reflecting the structural importance of event ordering in these contexts.

5. Understanding Unseen Tokens
Due to lengthy training and inference cycles, large recommendation models must be able to understand and recommend items that were not present in the pretraining data — addressing the cold-start problem for new content. To tackle this, we leverage multi-modal item towers that incorporate vision, language, and knowledge graph features, enriching the model’s understanding of newly introduced items.
The following figure illustrates a generative recommender architecture with item towers. In the left figure, each token input represents a user event — such as purchasing an item or watching a movie — with all relevant details compressed into the token.

In the right figure, to enhance semantic richness, we synthesize each token by fusing a learnable embedding with semantic data from a dedicated semantic tower. These two sources of information are combined by weighted summation, resulting in a unified, semantically enriched embedding. The sources of semantic information are artifacts such as synopses and box art. This information is processed by specialized models into embeddings and then fed into learnable projectors in the generative recommender. We include semantic item towers within both the embedding layers and the dense layer weights, injecting semantic information into both encoding and decoding.
During training, to prevent the collaborative information (learnable embedding) from overpowering the semantic signals, we randomly mask either the input or output collaborative embedding using an out-of-vocabulary (OOV) embedding. This encourages the model to infer item properties from semantic information alone, improving its ability to handle unseen or cold-start items. The masking probability is calibrated to match the likelihood of cold-start items appearing within the training-serving time gap.
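The fusion and masking step for one token can be sketched as follows, assuming a scalar fusion weight and a zero OOV embedding; both choices, and the masking rate, are illustrative:

```python
import numpy as np

def fuse_token_embedding(collab_emb, semantic_emb, alpha, oov_emb,
                         p_mask, rng):
    """Weighted sum of the learnable collaborative embedding and the semantic
    tower output. With probability p_mask the collaborative part is replaced
    by a shared OOV embedding, forcing reliance on semantics alone."""
    if rng.random() < p_mask:                 # simulate a cold-start item
        collab_emb = oov_emb
    return alpha * collab_emb + (1.0 - alpha) * semantic_emb

rng = np.random.default_rng(0)
d = 64
collab = rng.normal(size=d)       # ID-based learnable embedding
semantic = rng.normal(size=d)     # projected output of the multi-modal tower
oov = np.zeros(d)                 # shared out-of-vocabulary embedding
# p_mask would be calibrated to the expected cold-start rate within the
# training-serving gap; 0.1 here is purely illustrative, as is alpha=0.5.
token = fuse_token_embedding(collab, semantic, alpha=0.5, oov_emb=oov,
                             p_mask=0.1, rng=rng)
print(token.shape)  # (64,)
```

At serving time, a genuinely unseen item simply takes the OOV path deterministically, so its token is built entirely from the semantic tower output.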
6. Results
We demonstrate the effectiveness of our proposed designs by evaluating the scaling impact from a 50M-parameter model to a 1B-parameter model in real-world application integrations. Notably, the evaluation labels and tasks differ from those used during pre-training, providing a robust test of the large recommendation model’s generalization capabilities.
Across all tasks discussed above — and scenarios including cold start and high-latency serving — we observe consistent and substantial improvements in key metrics. This underscores the scalability and adaptability of our approach in practical deployment settings.

Acknowledgments (in alphabetical order):
We could not have delivered this model without the help of our stunning colleagues. We sincerely thank them for their contributions.
Baolin Li, Bella Nicholson, Dan Zheng, Dhaval Patel, Divya Gadde, Ishan Gupta, Mark Hsiao, Michael Tu, Moumita Bhattacharya, Qiuling Xu, Sejoon Oh, Sudarshan Lamkhede, Wei Wang, Yesu Feng, Zhe Zhang
References
[1] Training Compute-Optimal Large Language Models
[3] Are Emergent Abilities of Large Language Models a Mirage?
Towards Generalizable and Efficient Large-Scale Generative Recommenders was originally published in the Netflix TechBlog on Medium.