Predicting Content Launch Risk: How Data-Driven Insights Are Transforming Launch Planning

Source: Netflix Tech

by Emily Gill

Each year, we bring the Analytics Engineering community together for an Analytics Summit — a multi-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. This post is one of several topics presented at the Summit highlighting the breadth and impact of Analytics work across different areas of the business.

Understanding Risk in Content Launches

Every title you see on Netflix goes through several key phases: Development, Pre-Production, Production/Principal Photography, Post-Production, and finally, Launch Preparation, all leading up to the Title Launch. Once Principal Photography wraps, the focus shifts in Post-Production from content creation to quality assurance and visual effects (if needed).

At the end of Post-Production, Netflix receives the final audio and video files, often delivered as an IMF (Interoperable Master Format). This triggers a flurry of Launch Preparation activities: developing artwork and trailers, creating subtitles, assigning maturity ratings, and performing quality control. These tasks happen within a tight window and rely on having the finalized media assets in hand.

Some of this work can be kicked off earlier using a non-final version of the media called the Locked Cut, but since it’s not the absolute final deliverable, this presents a tradeoff: should our teams who prepare content for service wait for the more finalized IMF to begin their work, or start sooner with the non-final Locked Cut? Waiting for the IMF risks a compressed timeline if it arrives late, while starting with the Locked Cut means teams may need to do additional conformance work if there are significant changes between the Locked Cut and the final IMF.

Identifying Gaps in Schedule Accuracy

To help navigate the decision of when to start launch preparation, our teams rely on estimated delivery dates for both the Locked Cut and IMF media assets, which are manually provided by content partners in production schedules. However, these schedules often have gaps in coverage and lack accuracy for both asset types (see Figure 1).

Figure 1. At an asset level, we generally see that scheduled date accuracy and coverage are lower at horizons further from asset delivery. As we approach delivery (moving towards the right on this plot), schedules become more accurate (errors decrease) and coverage improves.

This isn’t unexpected — productions are dynamic, facing frequent changes, scheduling conflicts, and unforeseen obstacles that can shift timelines without warning. As a result, there’s a clear opportunity to leverage the wealth of production data we collect to predict the risk of schedule slips. By developing a predictive model, we aim to both fill in ETA gaps (providing asset delivery estimates when none exist) and improve the accuracy of existing ETAs compared to traditional manual schedules.

Correlation between Schedule Accuracy and Launch Misses

Our analysis reveals a strong correlation between scheduled inaccuracies and launch misses — instances where a title experiences delays. To quantify schedule inaccuracy, we created a metric called Accumulated Error Days (AED), which measures the cumulative deviation between estimated (scheduled or predicted) delivery dates and actual delivery dates over time. AED is calculated retrospectively as the area between the scheduled (grey line) or predicted (blue line) delivery dates and the actual delivery date (green line).
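To make the metric concrete, here is a minimal sketch of how AED could be computed for a single title: sum, over each snapshot day, the absolute gap in days between that day's estimated delivery date and the actual one. The dates and schedule below are hypothetical, not actual Netflix data.

```python
from datetime import date

def accumulated_error_days(estimates: dict[date, date], actual: date) -> int:
    """Accumulated Error Days (AED): the sum, over each snapshot day,
    of the absolute gap (in days) between that day's estimated delivery
    date and the actual delivery date. Geometrically, this is the area
    between the estimate curve and the flat line at the actual date."""
    return sum(abs((est - actual).days) for est in estimates.values())

# Hypothetical schedule that starts optimistic and converges over time.
actual = date(2024, 6, 1)
schedule = {
    date(2024, 3, 1): date(2024, 5, 15),  # 17 days off
    date(2024, 4, 1): date(2024, 5, 20),  # 12 days off
    date(2024, 5, 1): date(2024, 6, 1),   # exact
}
print(accumulated_error_days(schedule, actual))  # → 29
```

In practice the snapshots would be daily rather than monthly, but the area interpretation is the same.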

When we compare titles with at least one launch miss to those without, we find that mean AED is significantly higher in the group with launch misses. Notably, this effect is even more pronounced when we focus on the period closer to delivery — indicating that high AED (i.e., inaccurate schedules) in the final stretch before launch is especially correlated with launch misses, more so than AED accumulated over a longer timeline. These findings further motivate our efforts to improve schedule accuracy and reduce AED by leveraging rich production data and predictive modeling.

Modeling Time-to-Delivery

Our predictive models are boosted tree regression models that predict the number of days until delivery of either media asset (Locked Cut or IMF) for in-progress productions.

To power these models, we leverage a range of upstream data sources, including production-level signals of progress, title metadata, and seasonal signals. We train on daily update snapshots, meaning that for each feature we retain its value as of each day of a production. Modeling with this snapshotted data enables us to generate up-to-date predictions that reflect the latest state of each in-progress production, build a flexible model that works across all production phases, and seamlessly incorporate dynamic features that evolve over time (Figure 2).
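The snapshot-based setup can be sketched as follows: each training row is one (title, snapshot day) pair, and the target is the number of days remaining until asset delivery as of that snapshot. The feature names and synthetic data below are purely illustrative assumptions; the post does not disclose Netflix's actual features or model configuration, and we use scikit-learn's gradient boosting as a stand-in for whichever boosted tree library is used internally.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for daily production snapshots: one row per
# (title, snapshot_day) pair. Feature names are hypothetical.
n = 500
days_into_production = rng.integers(0, 300, n)
pct_scenes_shot = np.clip(days_into_production / 200 + rng.normal(0, 0.1, n), 0, 1)
is_series = rng.integers(0, 2, n)
X = np.column_stack([days_into_production, pct_scenes_shot, is_series])

# Target: days remaining until media asset delivery as of the snapshot.
y = 320 - days_into_production + 15 * is_series + rng.normal(0, 5, n)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# A fresh daily snapshot yields an up-to-date days-until-delivery estimate.
snapshot = np.array([[150, 0.7, 1]])
print(round(float(model.predict(snapshot)[0]), 1))
```

Because the model only sees feature values as they stood on the snapshot day, the same trained model can score a title at any phase of production.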

Figure 2. Hypothetical illustration of the evolving nature of production-related signals used in our models. Some signals are present throughout but dynamic, others are present at single moments in time during specific production phases. By capturing data in a snapshotted form, we’re able to build a flexible phase-agnostic model that leverages many different types of progress signals. This figure is illustrative only and does not depict actual Netflix financial or production data.

Evaluating Our Approach

Building a Comprehensive Metrics Suite

When evaluating the performance of the predictive models, we look across a suite of metrics to understand where and when predicted dates outperform scheduled dates. Among these are mean and median absolute error, relative to actual delivery, which capture the accuracy of our estimated dates. We also consider bias metrics, such as mean and median error, to understand whether we are consistently over- or under-predicting the actual delivery. We calculate the standard deviation of our errors to detect large shifts in the bulk of the error distribution. For the tails of our error distributions, we calculate the percentage of absolute errors that exceed x days.

For scheduled dates, we calculate coverage across various horizons to delivery. This is a key value proposition of the model: we’ve built it in such a way that we can always provide a predicted date, recouping any coverage gaps that exist from scheduled dates alone.

Benchmarking Against Manual Scheduling

In a backtest, we observed significant improvements across all of our metrics and across most horizons from delivery. As an example, see Figure 3 which plots global mean absolute error (MAE) and shows large reductions in errors (greater accuracy) in predicted IMF and Locked dates as compared to scheduled dates. Additionally, we see large reductions in outliers from scheduled to predicted dates as well.

Figure 3. This plot compares accuracy (measured as Mean Absolute Error) between predicted and scheduled dates. The horizontal axis plots time prior to delivery, which decreases from left to right until you reach the moment of delivery at the bottom right. For this particular asset, the predicted delivery dates on average are much more accurate than manually scheduled delivery dates throughout the full horizon to delivery.

Since our teams use these dates over a period of time and not at a single point in time, there is an additional benefit that we’re describing as an Earlier Accuracy Signal. By leveraging predictive dates, our teams gain a level of accuracy that, with scheduled dates alone, they would otherwise have to wait for. As an example, 6 months out from Locked Cut delivery, the predicted dates are better than scheduled dates on 76% of titles and reach a level of accuracy (6.1 weeks MAE) that scheduled dates don’t achieve until 11 weeks later.
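The Earlier Accuracy Signal can be read off a pair of MAE-by-horizon curves: find the accuracy of the predicted dates at some horizon, then find how much closer to delivery the scheduled dates must get before they match it. The MAE values below are hypothetical and chosen only to reproduce the shape of the example in the text; they are not the actual backtest numbers.

```python
# Hypothetical MAE (in weeks) keyed by horizon, in weeks-to-delivery.
predicted_mae = {26: 6.1, 22: 5.4, 18: 4.6, 15: 4.0, 11: 3.1}
scheduled_mae = {26: 9.0, 22: 8.2, 18: 7.3, 15: 6.0, 11: 5.5}

def weeks_of_lead(h0: int) -> int:
    """How many weeks later the scheduled curve first matches the
    predicted accuracy observed at horizon h0 weeks out."""
    target = predicted_mae[h0]
    for h in sorted(scheduled_mae, reverse=True):  # far horizons first
        if h <= h0 and scheduled_mae[h] <= target:
            return h0 - h
    return h0  # scheduled dates never catch up before delivery

print(weeks_of_lead(26))  # → 11
```

With these illustrative curves, the predicted accuracy available 26 weeks out is not matched by scheduled dates until 11 weeks later, mirroring the Locked Cut example above.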

Circling back to AED, which we mentioned earlier is correlated to launch misses, we find that in our backtested titles globally, and across most buying orgs and content types (i.e., series versus standalones), predicted IMF and Locked Cut dates reduce AED from scheduled dates when calculated across the 6 months leading up to delivery. We see similar patterns when we repeat this for shorter horizons to delivery as well.

Streamlining Workflows with Improved Scheduling

A key advantage of this predictive model is that estimated delivery dates are already integral to our stakeholders’ workflows, meaning we can introduce predictive dates without overhauling existing processes. However, this creates a new challenge: with both scheduled and predicted dates available, teams need to determine which is more reliable. While predictive dates are often more accurate on average, there are situations where scheduled dates perform better. To address this, we’ve built serving logic that defaults to scheduled dates in buying orgs where the model underperforms. Elsewhere, teams can view both dates side by side in dashboards, allowing them to apply their own judgment.

Additionally, our predictive models leverage features that are tied to scheduled dates, which underscores the importance of having upstream teams continue to input and update scheduled dates even in the presence of our predictions. We’re piloting these predictive signals in multiple ways, tailoring the approach to fit the diverse needs and tools of our various launch prep functions.