使ç¨JDK Vector APIäźĺć¨čçłťçť
Source: Netflix Tech
Ranker is one of the largest and most complex services at Netflix. Among many things, it powers the personalized rows you see on the Netflix homepage, and runs at an enormous scale. When we looked at CPU profiles for this service, one feature kept standing out: video serendipity scoringâââthe logic that answers a simple question:
âHow different is this new title from what youâve been watching so far?â
This single feature was consuming about 7.5% of total CPU on each node running the service. What started as a simple ideaââââjust batch the video scoring featureââââturned into a deeper optimization journey. Along the way we introduced batching, re-architected memory layout and tried various libraries to handle the scoring kernels.
Read on to learn how we achieved the same serendipity scores, but at a meaningfully lower CPU per request, resulting in a reduced cluster footprint.
Problem: The Hotspot in Ranker
At a high level, serendipity scoring works like this: A candidate title and each item in a memberâs viewing history are represented as embeddings in a vector space. For each candidate, we compute its similarity against the history embeddings, find the maximum similarity, and convert that into a ânoveltyâ score. That score becomes an input feature to the downstream recommendation logic.
The original implementation was straightforward but expensive. For each candidate we fetch its embedding, loop over the history to compute cosine similarity one pair at a time and track the maximum similarity score. Although it is easy to reason about, at Rankerâs scale, this results in significant sequential work, repeated embedding lookups, scattered memory access, and poor cache locality. Profiling confirmed this.

A flamegraph made it clear: One of the top hotspots in the service was Java dot products inside the serendipity encoder. Algorithmically, the hotspot was a nested loop structure of M candidates Ă N history items where each pair generates its own cosine similarity i.e. O(MĂN) separate dot product operations.
Solution
The Original Implementation: Single video cosine loop
In simplified form the code looked like this:
for (Video candidate : candidates) {
Vector c = embedding(candidate); // D-dimensional
double maxSim = -1.0;
for (Video h : history) {
Vector v = embedding(h); // D-dimensional
double sim = cosine(c, v); // dot(c, v) / (||c|| * ||v||)
maxSim = Math.max(maxSim, sim);
}
double serendipity = 1.0 - maxSim;
emitFeature(candidate, serendipity);
}The nested for loop with O(MĂN) separate dot products brought upon its own overheads. One interesting detail we learned by instrumenting traffic shapes: most requests (about 98%) were single-video, but the remaining 2% were large batch requests. Because those batches were so large, the total volume of videos processed ended up being roughly 50:50 between single and batch jobs. This made batching worth pursuing even if it didnât help the median request.
Step 1 : Batching, from Nested Loops to Matrix Multiply
The first idea was to stop thinking in terms of âmany small dot productsâ and instead treat the work as a matrix operation. i.e. For batch candidates, implement a data layout to parallelize the math in a single operation i.e. matrix multiply. If D is the embedding dimension:
- Pack all candidate embeddings into a matrix A of shape M x D
- Pack all history embeddings into a matrix B of shape N x D
- Normalize all rows to unit length.
- Compute: cosine similarities as
[ C = A x B^T ]; where C is an M x N matrix of cosine similarities.
In pseudoâcode:
// Build matrices
double[][] A = new double[M][D]; // candidates
double[][] B = new double[N][D]; // history
for (int i = 0; i < M; i++) {
A[i] = embedding(candidates[i]).toArray();
}
for (int j = 0; j < N; j++) {
B[j] = embedding(history[j]).toArray();
}
// Normalize rows to unit vectors
normalizeRows(A);
normalizeRows(B);
// Compute C = A * B^T
double[][] C = matmul(A, B);
C[i][j] = cosine(candidates[i], history[j])
// Derive serendipity
for (int i = 0; i < M; i++) {
double maxSim = max(C[i][0..N-1]);
double serendipity = 1.0 - maxSim;
emitFeature(candidates[i], serendipity);
}
This turns MĂN separate dot products into a single matrix multiply, which is exactly what CPUs and optimized kernels are built for. We integrated this into the existing framework by supporting both, encode()for single videos and batchEncode() for batches, while maintaining backward compatibility. At this point it seemed like we were âdoneâ, but we weren't.
Step 2: When Batching Isnât Enough
Once we had a batched implementation, we ran canaries and saw something surprising: about a 5% performance regression. The algorithm wasnât the issueâââturning MĂN separate dot products into a matrix multiplication is mathematically sound. The problem was the overhead we introduced in the first implementation.
- Our initial version built double[][] matrices for candidates, history, and results on every batch. Those large, short-lived allocations created GC pressure, and the double[][] layout itself is non-contiguous in memory, which meant extra pointer chasing and worse cache behavior.
- On top of that, the first-cut Java matrix multiply was a straightforward scalar implementation, so it couldnât take advantage of SIMD. In other words, we paid the cost of batching without getting the compute efficiency we were aiming for.
The lesson was immediate: algorithmic improvements donât matter if the implementation detailsâmemory layout, allocation strategy, and the compute kernelâwork against you. That set up the next step for making the data layout cache-friendly and eliminating per-batch allocations before revisiting the matrix multiply kernel.
Step 3: Flat Buffers & ThreadLocal Reuse
We reworked the data layout to be cache-friendly and allocation-light. Instead of double[m][n], we moved to flat double[] buffers in row-major order. That gave us contiguous memory and predictable access patterns. Then we introduced a ThreadLocal<BufferHolder> that owns reusable buffers for candidates, history, and any other scratch space. Buffers grow as needed but never shrink, which avoids per-request allocation while keeping each thread isolated (no contention). A simplified sketch:
class BufferHolder {
double[] candidatesFlat = new double[0];
double[] historyFlat = new double[0];
double[] getCandidatesFlat(int required) {
if (candidatesFlat.length < required) {
candidatesFlat = new double[required];
}
return candidatesFlat;
}
double[] getHistoryFlat(int required) {
if (historyFlat.length < required) {
historyFlat = new double[required];
}
return historyFlat;
}
}
private static final ThreadLocal<BufferHolder> threadBuffers =
ThreadLocal.withInitial(BufferHolder::new);This change alone made the batched path far more predictable: fewer allocations, less GC pressure, and better cache locality.
Now the remaining question was the one we originally thought we were answering: whatâs the best way to do the matrix multiply?
Step 4: BLAS: Great in Tests, Not in Production
The obvious next step was BLAS (Basic Linear Algebra Subprograms). In isolation, microbenchmarks looked promising. But once integrated into the real batch scoring path, the gains didnât materialize. A few things were working against us:
- The default netlib-java path was using F2J (Fortran-to-Java) BLAS rather than a truly native implementation.
- Even with native BLAS, we paid overhead for setup and JNI transitions.
- Javaâs row-major layout doesnât match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers.
- Those extra allocations and copies mattered in the full pipeline, especially alongside TensorFlow embedding work.
BLAS was still a useful experimentâââit clarified where time was being spent, but it wasnât the drop-in win we wanted. What we needed was something that stayed pure Java, fit our flat-buffer architecture, and could still exploit SIMD.
Step 5: JDK Vector API to the rescue
A Short Note on the JDK Vector API: The JDK Vector API is an incubating feature that provides a portable way to express data-parallel operations in Javaâââthink âSIMD without intrinsicsâ. You write in terms of vectors and lanes, and the JIT maps those operations to the best SIMD instructions available on the host CPU (SSE/AVX2/AVX-512), with a scalar fallback when needed. More crucially for us, itâs pure Java: no native dependencies, no JNI transitions, and a development model that looks like normal Java code rather than platform-specific assembly or intrinsics.
This was a particularly good match for our workload because we had already moved embeddings into flat, contiguous double[] buffers, and the hot loop was dominated by large numbers of dot products. The final step was to replace BLAS with a pure-Java SIMD implementation using the JDK Vector API. By this point we already had the right shape for high performanceâââbatching, flat buffers, and ThreadLocal reuse. So the remaining work was to swap out the compute kernel without introducing JNI overhead or platform-specific code. We did that behind a small factory. At class load time, MatMulFactory selects the best available implementation:
- If jdk.incubator.vector is available, use a Vector API implementation.
- Otherwise, fall back to a scalar implementation with a highly optimized loop-unrolled dot product (implemented by my colleague Patrick Strawderman, inspired by patterns used in Lucene)
In the Vector API implementation, the inner loop computes a dot product by accumulating a * b into a vector accumulator using fma() (fused multiply-add). DoubleVector.SPECIES_PREFERRED lets the runtime pick an appropriate lane width for the machine. Hereâs a simplified sketch of the inner loop:
// Vector API path (simplified)
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
DoubleVector acc = DoubleVector.zero(SPECIES);
int k = 0;
// SPECIES.length() (e.g. often 4 doubles on AVX2 and 8 doubles on AVX-512).
for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i*D + k);
DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j*D + k);
acc = a.fma(b, acc); // fused multiply-add
}
double dot = acc.reduceLanes(VectorOperators.ADD);
// handle tail k..D-1
similaritiesFlat[i*N + j] = dot;
}
}
Figure below shows how the Vector API utilizes SIMD hardware to process multiple doubles per instruction (e.g., 4 lanes on AVX2 and 8 lanes on AVXâ512). What used to be many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reductionâsame algorithm, much better use of the CPUâs vector units.

Fallbacks & Safety: When the Vector API Isnât Available
Because the Vector API is still incubating, it requires a runtime flag: --add-modules=jdk.incubator.vector We didnât want correctness or availability to depend on that flag. So we designed the fallback behavior explicitly: At startup, we detect Vector API support and use the SIMD batched matmul when available; otherwise we fall back to an optimized scalar path, with single-video requests continuing to use the per-item implementation.
That gives us a clean operational story: services can opt in to the Vector API for maximum performance, but the system remains safe and predictable without it.
Results in Production:
With the full design in place with batching, flat buffers, ThreadLocal reuse, and the Vector API, we ran canaries that run production traffic. We observed a ~7% drop in CPU utilization and ~12% drop in average latency. To normalize across any small throughput differences, we also tracked CPU/RPS (CPU consumed per request-per-second). That metric improved by roughly 10%, meaning we could handle the same traffic with about 10% less CPU, and we saw similar numbers hold after full production rollout.

At the function operator level, we saw the CPU drop from the initial 7.5% to a merely ~1% with the optimization in place. At the assembly level, the shift was clear: from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware.

Closing Thoughts
This optimization ended up being less about finding the âfastest libraryâ and more about getting the fundamentals right: choosing the right computation shape, keeping data layout cache-friendly, and avoiding overheads that can erase theoretical wins. Once those pieces were in place, the JDK Vector API was a great fit, as it let us express SIMD-style math in pure Java, without JNI, while still keeping a safe fallback path. Another bonus was the low developer overhead: compared to lower-level approaches, the Vector API let us replace a much larger, more complex implementation with a relatively small amount of readable Java code, which made it easier to review, maintain, and iterate on.
Have you tried the Vector API in a real service yet? Iâd love to hear what workloads it helped (or didnât), and what you learned about benchmarking and rollout in production.
Special thanks to Jason Koch, Patrick Strawderman, Daniel Huang, Fan Yang, and the Performance Engineering team at Netflix
Optimizing Recommendation Systems with JDKâs Vector API was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.