Roblox ĺŚä˝ä˝żç¨ AI ĺ¨ 100 母ç§ĺ çżťčŻ 16 ç§čŻč¨
Source: ByteByteGo
OpenClaw You Can Trust (Sponsored)
When your AI agent holds your API keys, reads your email, and runs shell commands, security isnât optional.
KiloClaw is a fully managed OpenClaw: a one-click deploy that gives you a 24/7 AI agent, without buying a Mac Mini.
Every instance runs in a dedicated Firecracker micro-VM, not a shared container, with five independent isolation layers protecting your data. An independent security assessment found zero cross-tenant vulnerabilities (read the full white paper).
Built on the same infrastructure serving 1.5M+ Kilo Code developers, with access to 500+ AI models through Kilo Gateway.
Translating between 16 languages means supporting 256 possible pairs, such as Korean to English, French to Thai, Portuguese to Japanese, and so on. One solution is to build a separate model for each pair. However, Roblox decided to build just one.
Roblox is a global platform where more than 70 million people play, create, and socialize every day across more than 15 million active experiences. Users span 180 countries and communicate constantly through in-experience text chat. And a single unified model now handles real-time chat translation across all of those users, at roughly 100 milliseconds per translation and over 5,000 chats per second.
However, Robloxâs real engineering challenge wasnât building a model that could translate. It was building a system that could translate at the speed of a conversation without breaking the user experience. In this article, we will look at what Roblox built and the trade-offs they made.
Disclaimer: This post is based on publicly shared details from the Roblox Engineering Team. Please comment if you notice any inaccuracies.
One Model Versus Many
Building a separate model for every language pair is the obvious starting point. One model for English to Korean, another for Korean to French, another for French to Thai, and so on.
With 16 languages, thatâs 16 times 16, or 256 individual models. Each one needs its own training data, its own infrastructure, its own maintenance. And when Roblox adds a 17th language, they donât need a new model. They need 32. The approach grows quadratically, and it collapses under its own weight long before you reach production.
Roblox went a different direction.
They built a single, unified transformer-based translation model that handles all 256 language directions. The key to making this work is an architecture called Mixture of Experts, or MoE. Instead of every translation request passing through every parameter in the model, a routing mechanism activates only a subset of specialized âexpertâ subnetworks depending on the input.
Different experts specialize in groups of similar languages. Given a source sentence and a target language, the system activates the relevant expert (or combination of experts) to generate the translation. Think of it as a team of specialist translators sitting behind a single reception desk. A request comes in, the routing layer sends it to the right specialist, and only that specialist does the work. The full team has broad expertise, but any single translation only activates a fraction of it.
This unified approach creates some great benefits. When all languages are trained together, similar languages actually help each other. For example, Spanish and Portuguese share enough structure that training them in the same model improves translation quality for both. The model also learns enough about each languageâs patterns that it can auto-detect the source language, even when the language setting is wrong or missing. It can even handle mixed-language input, where someone types in two languages within the same message, and still produce a reasonable translation into the target language.
However, thereâs a cost to consolidation. One model now carries the weight of all 256 directions. To handle that diversity with acceptable quality, Robloxâs model ended up with roughly 1 billion parameters. Running inference through a model that large is too slow and too expensive for real-time chat at scale. The architectural problem was solved, but the serving problem was just getting started.
Unblocked: Context that saves you time and tokens (Sponsored)
Stop babysitting your coding agents. Unblocked gives them the organizational knowledge to generate mergeable code without the back and forth. It pulls context from across your engineering stack, resolves conflicts, and cuts the rework cycle by delivering only what agents need for the task at hand.
Making a Billion Parameters Fast Enough for a Conversation
A 1-billion-parameter model produces good translations. It does not produce them fast enough for two people having a real-time conversation.
At 5,000+ chats per second, with a latency ceiling of roughly 100 milliseconds, Roblox needed to close a significant gap to make things production-ready. They did it with two moves. First, they made the model smaller. Then, they wrapped it in infrastructure that squeezes out every remaining millisecond.
Roblox used a technique called knowledge distillation, sometimes described as a teacher-student approach. The idea is straightforward. You train a large, high-quality model first (the teacher). Then you train a smaller model (the student) to mimic the teacherâs outputs. The key detail is what the student actually learns. It doesnât just learn the teacherâs final answers. It learns the teacherâs probability distributions, in other words, the teacherâs confidence levels across all possible translations for a given input.
Through this process, Roblox compressed the model from roughly 1 billion parameters to fewer than 650 million. Alongside distillation, they also applied quantization (reducing the numerical precision of model weights) and model compilation (optimizing the computation graph for specific hardware). These are additional layers of compression stacked on top of distillation.
However, the serving infrastructure is half the story. Even a distilled model doesnât hit 100ms on its own at this scale. The model is just one component in a longer pipeline, and most of the latency optimization happens outside the model itself.
When a chat message needs translation, it first passes through RCC (Robloxâs backend service). Before the model is ever involved, the system checks a translation cache. If this exact source-text-to-target-language translation has been done before, the cached result is returned immediately, and no model inference is needed.
If the cache misses, the request goes to the backend, where a dynamic batcher groups multiple translation requests together. Batching is critical because GPUs are far more efficient at processing many inputs at once than handling them one at a time. The batched requests then flow through the modelâs encoder stack, which converts the source text into a numerical representation.
Hereâs where a second, different cache comes in. Roblox added an embedding cache between the encoder and decoder. This matters for a specific scenario that happens constantly on the platform. Imagine a Korean speaker sends a message on a server with English, German, and French speakers. That single Korean message needs three separate translations. Without the embedding cache, the encoder would process the same Korean message three times. With it, the encoding happens once, the intermediate representation is cached, and the decoder generates all three translations from that single encoding. At the scale of Robloxâs chat traffic, this optimization is significant.
Finally, the decoded translation passes through Robloxâs trust and safety systems, the same scrutiny applied to all text on the platform, to catch anything that violates their policies. The translated message and the original are both sent to the recipientâs device, allowing users to toggle between the translation and the senderâs actual words.
Measuring Quality
A translation model is only as good as two things. The data it was trained on, and your ability to measure whether itâs working. For 256 language directions at Robloxâs scale, both of these are hard problems. And Roblox had to build custom solutions for each.
Standard translation quality metrics work by comparing the modelâs output against a âcorrectâ human translation, called a reference. But producing reference translations for all 256 language directions, across the volume and variety of Roblox chat messages, is impossible. Youâd need human translators producing ground truth for every combination, continuously. Roblox solved this by building its own quality estimation model that scores translations using only the source text and the machine translation output. No reference translation required.

This quality model evaluates the translations along multiple dimensions.
It checks accuracy (are there additions, omissions, or mistranslations?), fluency (grammar, spelling, punctuation), and reference consistency (does the translation make sense in context with the rest of the conversation?).
Errors are classified by severity into critical, major, and minor categories. The model operates at word-level granularity, not just sentence-level. It doesnât just flag a translation as âbad.â It pinpoints which words are wrong and how severely.
To build this, Roblox trained an ML model on human-labeled error types and scores, then fine-tuned a multilingual language model to predict these word-level errors and compute scores across their multidimensional criteria.
Common language pairs like English-Spanish have abundant parallel training data. Billions of translated sentence pairs exist on the web. However, there are also rare pairs like French-Thai. Itâs difficult to train a good translator between two languages if you donât have examples of translations between them.
Roblox addressed this with iterative back-translation. Take a French text, translate it into Thai using the current model, then translate that Thai text back into French. Compare the round-trip result to the original French. If it matches closely, the intermediate French-Thai pair is probably a good synthetic training example.
âThe critical word to note here is âiterative.â Roblox didnât do this once. They repeated the process across multiple rounds, using a mix of this synthetic back-translated data and human-labeled supervised data to progressively expand the training set. The ratio between synthetic and real data matters. Too much synthetic data degrades quality because the model starts learning from its own mistakes.
General translation data doesnât include words like âobbyâ (a Roblox obstacle course) or platform-specific slang and abbreviations. Roblox brought in human evaluators to translate popular and trending terms for each of the 16 languages, then fed those translations into the training data. This is an ongoing process because slang evolves faster than any model retraining.
Conclusion
Robloxâs unified translation model came with real costs, and understanding those costs matters as much as understanding the architecture.
Quality vs. latency is a permanent tension. The distilled student model is inherently less accurate than the teacher. Every time Roblox improves the teacher, they face the question of whether those gains survive compression. And 100 milliseconds is a hard ceiling that limits how large and accurate the serving model can get.
Low-resource pairs are still the weak link. Back-translation helps, but French-to-Thai will never be as good as English-to-Spanish. The model can handle mixed-language input, but accuracy drops. Unified doesnât mean uniform quality.
The maintenance burden is real. Building a custom translation model means owning the entire stack. Training, evaluation, serving, slang updates, safety integration, all of it. Using a commercial translation API means someone else handles that complexity. Robloxâs choice made sense because they needed domain-specific accuracy (their model outperforms commercial APIs on Roblox content, by their own metrics) and extreme latency at massive scale. Most companies should use off-the-shelf translation and spend their engineering effort elsewhere.
The reference-free quality estimation model, while clever, has an inherent limitation. It could have systematic biases that overlap with the translation modelâs own weaknesses. Itâs a pragmatic solution, not a perfect one.
References:




