Breaking the single-datacenter limit: geo-distributed AI operations in practice with the k0smos platform

Breaking the single datacenter assumption

Modern AI architectures are built on the assumption of centralized, homogeneous data centers. In reality, infrastructure is messy. For most organizations, compute resources are fragmented across private clouds, research environments, and mixed generations of on-prem and edge hardware. Because these resources are trapped in operational silos, using them for demanding AI workloads is difficult. Utilizing GPUs efficiently is no longer just a compute problem. It is fundamentally an infrastructure challenge.

Why geo-distributed AI becomes a Kubernetes problem

AI infrastructure has quietly crossed a threshold. What began as a machine learning challenge, i.e., training models faster, serving inference cheaper, and scaling compute on demand, has become broader and more structural. With players like OpenAI building their foundations on Kubernetes, and the CNCF formalizing this direction, Kubernetes has become the de facto orchestration layer for AI workloads. Geo-distributed AI is now fundamentally a cloud-native infrastructure problem.

However, when workloads break free from a single centralized datacenter to span on-prem clusters, cloud regions, and edge deployments, the complexity multiplies. You are no longer just scheduling a training job. You must manage cluster lifecycles across geographies, maintain cross-site connectivity, and integrate rapidly evolving hardware, from ultra-high-speed interconnects like NVLink to advanced memory innovations like HBM. These are fundamental distributed systems problems that sit squarely in Kubernetes territory.

This is where multi-cluster orchestration becomes non-negotiable. A single cluster cannot span these geographies, and a manually managed fleet will quickly break teams. What is required is a resilient platform layer that handles cross-site networking and heterogeneous hardware consistently, while remaining entirely Kubernetes-native. Ultimately, the question is no longer whether to run AI on Kubernetes. It is whether your Kubernetes platform is built to handle AI wherever it needs to run.

Using the k0smos stack as the foundation

As a cohesive set of open-source projects, the k0smos stack provides the architectural foundation for operating geo-distributed AI infrastructure by dividing responsibilities across three technical layers. At the core is k0s, a fully CNCF-conformant Kubernetes distribution packaged as a single, zero-dependency binary. By avoiding baked-in assumptions regarding specific CNIs, runtimes, or package managers, k0s runs natively on almost any Linux environment without host OS pollution. This lean execution model makes it a versatile underlying runtime capable of executing standard Kubernetes workloads across fragmented edge nodes, bare-metal servers, and resource-constrained VMs.

To manage these deployments at scale, k0smotron operates as the engine for hosted control planes (HCPs). It is a Kubernetes operator that deploys k0s control planes as isolated, versioned pods inside a central management cluster, completely decoupling the control plane from the worker nodes. By treating control planes as dynamically scheduled workloads rather than dedicated infrastructure, k0smotron significantly reduces resource overhead. It enables a remote machine model in which worker nodes located in any geo-distributed environment, whether cloud instances, on-prem hardware, or edge nodes, can be attached to the centralized management cluster.

Tying the system together is k0rdent, the declarative management plane for multi-cluster lifecycle orchestration. It abstracts the provisioning, configuration, and templating of the cluster fleet into Kubernetes-native APIs, establishing a GitOps-driven workflow where clusters are declared, versioned, and audited as infrastructure-as-code. Through its multi-provider support, k0rdent presents a consistent operational interface regardless of whether the underlying infrastructure relies on bare metal, OpenStack, AWS, vSphere, or any other compute resource provider, effectively standardizing highly heterogeneous hardware environments at the platform layer.

Field studies built on top of a geo-distributed heterogeneous AI infrastructure

Building on the k0smos stack described above, we are collaborating with the German Federal Agency for Disruptive Innovation (SPRIND). The objective of our joint exalsius project is to pool fragmented, heterogeneous GPU hardware resources into a unified compute system.

To validate this approach, we built an environment that reflects the fragmented reality of today’s AI infrastructure. As illustrated in the architecture diagram, it bridges Nvidia A100 nodes in Quebec with AMD MI300X nodes in Atlanta. The cluster control plane is hosted on CPU-only nodes in Frankfurt, Germany. This setup was designed to show that cross-border, cross-vendor GPU environments can function cohesively.

Because the k0smos stack handles the foundational cluster lifecycle, we were able to bypass building custom management infrastructure. Instead, we added components to automatically detect and profile available hardware (crucial for efficient training configurations) and focused our engineering on three core layers:

1. Provisioning: We used the k0smotron Cluster API provider to trigger deployments directly from our management cluster in Frankfurt. The workers in Quebec and Atlanta were provisioned with k0s and their respective vendor-specific GPU software stacks (the Nvidia GPU operator for the A100s, and the ROCm operator for the MI300Xs).

2. Operation: For cross-site connectivity, we deployed the CNCF project Cilium as our CNI, establishing secure, direct WireGuard P2P tunnels (~35 ms latency, ~600 MB/s) between the worker nodes. Data plane traffic bypasses centralized VPN gateways entirely, while cluster state remains centrally managed in Frankfurt. On top of this network, we integrated AI frameworks like PyTorch Elastic, Ray, and vLLM using custom k0rdent ServiceTemplates and Helm charts, provisioned via the k0rdent state manager (KSM) using Sveltos.

3. Orchestration: We added the operational abstraction and business logic required to execute distributed training and batch workloads reliably over the P2P network.
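The vendor-specific choice in the provisioning step can be sketched as a small mapping from a detected hardware profile to a GPU software stack. This is a minimal illustration, not our actual profiling component: the `NodeProfile` type, the node names, and the stack identifiers in `GPU_STACKS` are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical stack identifiers for illustration only. The real chart
# names and versions are not specified in this article.
GPU_STACKS = {
    "nvidia": "nvidia-gpu-operator",
    "amd": "rocm-operator",
}


@dataclass(frozen=True)
class NodeProfile:
    """Result of a (hypothetical) hardware detection step on one worker."""
    name: str
    site: str
    gpu_vendor: str
    gpu_model: str


def select_gpu_stack(profile: NodeProfile) -> str:
    """Return the GPU software stack matching the node's GPU vendor."""
    vendor = profile.gpu_vendor.strip().lower()
    if vendor not in GPU_STACKS:
        raise ValueError(f"no GPU stack known for vendor {profile.gpu_vendor!r}")
    return GPU_STACKS[vendor]


def plan_fleet(profiles):
    """Group node names by the GPU stack each node should receive."""
    plan = {}
    for profile in profiles:
        plan.setdefault(select_gpu_stack(profile), []).append(profile.name)
    return plan


if __name__ == "__main__":
    fleet = [
        NodeProfile("quebec-worker-0", "Quebec", "Nvidia", "A100"),
        NodeProfile("atlanta-worker-0", "Atlanta", "AMD", "MI300X"),
    ]
    for stack, nodes in plan_fleet(fleet).items():
        print(stack, "->", nodes)
```

Keeping the mapping explicit means an unknown vendor fails loudly at planning time rather than leaving a node half-configured.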

Our first field study validated this architecture by running stable, reproducible AI workloads across a static, geo-distributed setup. We successfully trained a diverse set of reference models, spanning GPT-NeoX for LLMs, ResNet for computer vision, GCN for graph learning, PPO for reinforcement learning, and Wav2Vec2 for audio, directly across the AMD and Nvidia nodes.

The critical enabler for this success was the co-design of the infrastructure and the training methodology. To prevent our long-distance P2P links from becoming a bottleneck, we implemented a distributed, low-communication training approach utilizing decoupled momentum optimization (detailed in our NeurIPS publications [add-links] and code repository [add-link]). While the underlying systems layer managed the heterogeneous hardware execution, this specialized training layer drastically reduced the cross-site communication demands.
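A back-of-the-envelope estimate shows why reducing communication matters on such links. The sketch below uses the measured link figures from above (~35 ms latency, ~600 MB/s) and treats everything else as an assumption: the model size of one billion parameters, 4-byte values, and a 100x reduction factor are hypothetical inputs chosen for illustration, and the model is a single point-to-point transfer rather than a full collective operation.

```python
def sync_seconds(num_params: int,
                 bytes_per_value: int,
                 bandwidth_bytes_per_s: float,
                 latency_s: float,
                 reduction_factor: float = 1.0) -> float:
    """Estimate the time to ship one model-sized update over a single link.

    The estimate is latency plus payload divided by bandwidth. It ignores
    protocol overhead, congestion, and the topology of a real all-reduce.
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    payload_bytes = num_params * bytes_per_value / reduction_factor
    return latency_s + payload_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    LINK_BANDWIDTH = 600e6   # ~600 MB/s, as measured between worker nodes
    LINK_LATENCY = 0.035     # ~35 ms, as measured between worker nodes

    # Hypothetical workload: 1e9 parameters stored as 4-byte floats.
    full = sync_seconds(1_000_000_000, 4, LINK_BANDWIDTH, LINK_LATENCY)
    # Hypothetical 100x reduction in transferred data per step.
    reduced = sync_seconds(1_000_000_000, 4, LINK_BANDWIDTH, LINK_LATENCY,
                           reduction_factor=100.0)
    print(f"full update:    {full:.2f} s per exchange")
    print(f"reduced update: {reduced:.2f} s per exchange")
```

Under these assumptions, an uncompressed exchange costs roughly 6.7 seconds per step, while the reduced one costs roughly 0.1 seconds, at which point the fixed latency accounts for about a third of the total.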

This study proved that physical distance and hardware heterogeneity are no longer absolute barriers to distributed model training. By pairing the k0smos stack with our custom orchestration components, workloads execute cohesively across sites, entirely agnostic to the underlying provider, physical location, or GPU vendor.

In our second field study, we relaxed the static environment assumption to reflect a more realistic operating model: a highly dynamic setting where GPU resources join and leave the training pool based on the availability of abundant electricity. As geographic sites enter and exit favorable energy windows, the active resource fabric constantly shifts.

Image of the flow between active sites, A to C

To manage this churn, we adopted a federated learning paradigm, treating each site as an independent training domain that synchronizes model state only when active. Building on our k0smos foundation, we engineered this dynamic lifecycle through three key implementations:

1. We exposed an API allowing the orchestration scheduler to provision and deprovision workers based on real-time energy abundance signals provided by our non-profit partner, WattTime. A custom k0smotron extension translates these signals, activating GPU capacity during favorable windows and releasing it as conditions change.

2. We developed a custom Kubernetes operator for the Flower AI framework (GitHub links in References). Deployed via the k0rdent state manager (KSM), this operator reconciles a declarative “Federation” custom resource. Newly provisioned nodes join the federation as eligible training sites as soon as they are ready, while deprovisioned nodes exit the reconciliation loop gracefully.

3. At runtime, the coordinator and active sites communicate via gRPC over our established secure P2P network. We implemented a custom server-side scheduling strategy, relying on a Redis Publish-Subscribe queue to reliably broadcast round completions and shutdown signals across the ephemeral fleet.
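The lifecycle described in the three steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our scheduler or operator: `SiteSignal` reduces the energy signal to a hypothetical boolean, `reconcile` stands in for the provision and deprovision decision, and `aggregate` uses plain weighted averaging in the style of FedAvg as a stand-in for the custom server-side strategy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSignal:
    """Hypothetical, simplified energy signal for one site."""
    site: str
    energy_abundant: bool


def reconcile(active: set[str],
              signals: list[SiteSignal]) -> tuple[set[str], set[str]]:
    """Compare active sites with the desired set derived from the signals.

    Returns (sites to provision, sites to release). An active site without
    a favorable signal is released.
    """
    desired = {s.site for s in signals if s.energy_abundant}
    return desired - active, active - desired


def aggregate(updates: dict[str, tuple[list[float], int]],
              active: set[str]) -> list[float]:
    """Weighted average of model weights over active sites only.

    `updates` maps a site to (weights, number of local examples). Updates
    from sites that are no longer active are ignored.
    """
    contributing = [u for site, u in updates.items() if site in active]
    total = sum(n for _, n in contributing)
    if total == 0:
        raise ValueError("no active site contributed any examples")
    result = [0.0] * len(contributing[0][0])
    for weights, n in contributing:
        for i, w in enumerate(weights):
            result[i] += w * n / total
    return result


if __name__ == "__main__":
    active = {"A", "B"}
    signals = [SiteSignal("A", True), SiteSignal("B", False),
               SiteSignal("C", True)]
    to_provision, to_release = reconcile(active, signals)
    active = (active | to_provision) - to_release
    print("provision:", sorted(to_provision), "release:", sorted(to_release))

    updates = {"A": ([1.0, 2.0], 1), "B": ([9.0, 9.0], 5),
               "C": ([3.0, 4.0], 3)}
    print("aggregated:", aggregate(updates, active))
```

Separating the membership decision from the aggregation keeps each round well defined even when the set of sites changes between rounds.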

Recently presented at the Flower AI Summit 2026 and EuroSys 2026, this study shows that our cloud-native platform extends from static geo-distributed training to dynamic, energy-aware orchestration. For a deeper dive into the technical details and experimental results, read the technical report or explore the code repository (GitHub, report, and presentation links in References).

Conclusion

While the k0smos stack provided a highly stable, cloud-native foundation, these field studies highlighted where friction lies in fragmented environments: GPU lifecycle management and cross-site networking. In practice, getting nodes into a clean, GPU-ready state across different sites is messy work. Despite the heavy lifting done by the Nvidia and ROCm operators, dealing with cloud-specific kernels, conflicting pre-installed drivers, and partially configured states requires deep operational awareness. Similarly, while WireGuard and Cilium handled secure cross-site connectivity with negligible bandwidth overhead, managing site-specific network restrictions and latency-sensitive synchronization for distributed training (like torch.distributed) remains a complex engineering challenge.

Yet, the most encouraging takeaway is that running AI workloads across geo-distributed, heterogeneous hardware is entirely viable today. By pooling isolated GPU capacity into a powerful, unified resource fabric, we can dynamically adapt to shifting execution models without needing to rebuild the underlying platform. To support this evolving ecosystem, we are actively feeding our customizations and tooling back as upstream contributions to the Mirantis k0smos projects, ensuring the wider community can continue to build upon this foundation.

References

Kubernetes repository: https://github.com/kubernetes/kubernetes
k0s repository: https://github.com/k0sproject/k0s
k0rdent repository: https://github.com/k0rdent/k0rdent
k0smotron repository: https://github.com/k0sproject/k0smotron
Cilium repository: https://github.com/cilium/cilium
Dynamic energy-aware AI workload orchestration technical report: https://arxiv.org/abs/2602.22760
Dynamic energy-aware AI workload orchestration repository: https://github.com/exalsius/curtail-llm
Dynamic energy-aware AI workload orchestration presentation: https://youtu.be/VKC5r0wBgm0?si=N_4EFo9QKgCLM7_y
A custom Flower AI Kubernetes operator: https://github.com/exalsius/flower-operator
Flower Framework: https://github.com/flwrlabs/flower