Building a Cloud-Native Internal Developer Platform with Kubernetes, GitOps, and Supply Chain Security

Source: CNCF


Modern software delivery is no longer constrained by application code — it is constrained by the platform that runs it. This article presents the design of a cloud-native Internal Developer Platform (IDP) built on Kubernetes and CNCF ecosystem tools, demonstrating how Infrastructure as Code (IaC), GitOps, and security-first pipelines can be combined into a cohesive, operationally consistent platform. While some implementations use managed AKS, the architectural patterns apply equally to any CNCF-conformant Kubernetes distribution.

Modern distributed systems commonly face the following operational challenges that motivated this platform design:

  • Deployment inconsistencies across environments caused by manual processes
  • Lack of infrastructure versioning and drift control, leading to environment divergence
  • Hardcoded secrets and weak security posture embedded in CI/CD pipelines
  • Inefficient scaling strategies that generate unnecessary cost overhead
  • Limited disaster recovery and rollback mechanisms when deployments fail
  • Fragmented observability, making root cause analysis slow and unreliable

The architecture described here directly addresses each of these gaps through declarative, automated, and policy-driven controls.

Design principles

The platform follows CNCF-aligned principles that guided every architectural decision:

  • Declarative infrastructure: all resources are version-controlled and reproducible
  • GitOps-based deployment using Argo CD: Git is the single source of truth for cluster state
  • Immutable infrastructure and containerised workloads: no manual changes to running systems
  • Security by design across design-time threat modeling, CI/CD, and runtime
  • Observability as a core platform capability, not an optional post-deployment module
  • Separation of concerns across infrastructure, platform, and application layers through modular design

High-level architecture

The platform is structured into three logical layers with clear separation of responsibilities; collapsing these layers early on introduced significant maintenance complexity. The same separation is reflected in the repository layout, with distinct source trees for infrastructure, platform, and applications. The infrastructure layer bootstraps the Argo CD GitOps controller. Once initialized, Argo CD continuously monitors and reconciles both platform components and application layer resources to match the desired state defined in Git.

Figure 1: End-to-End Cloud-Native Platform Architecture


1. Infrastructure layer

Responsible for provisioning all cloud resources using Terraform, structured into reusable modules:

  • Virtual Networks (VNet), subnets, and Network Security Groups
  • Managed Kubernetes Cluster 
  • Container Registry 
  • Identity and access configuration, and secret stores

2. Platform layer

Built on Kubernetes and CNCF ecosystem tools, installed and managed declaratively in a separate repository or in separate directories:

  • Argo CD — GitOps controller for continuous reconciliation
  • Istio — service mesh for traffic control, mTLS, and service-level observability
  • Prometheus — metrics collection and alerting
  • Grafana — dashboards and visualization
  • Loki — centralised log aggregation
  • Kyverno — Policy as Code enforcement at admission time

3. Application layer

Microservices deployed as containerised workloads, independently managed through Git:

  • Independently deployable services with no shared deployment schedules
  • Helm-based packaging for consistent environment promotion
  • Git-driven deployment lifecycle with full audit trail

End-to-end deployment workflow

The platform implements a multi-stage delivery workflow that enforces strict separation between application build, security validation, and infrastructure provisioning. This section illustrates how a change propagates from static code analysis through build to deployment.

 Figure 2: Cluster Architecture with End-to-End Pipeline Flow — Application, Security, and Infrastructure

Stage 1: Platform prerequisites

The workflow begins with a minimal set of foundational components required to run the automation and pipelines.

  • A container image registry for storing versioned and signed artifacts
  • A Terraform remote backend for state management and team collaboration
  • A secure cloud provider service connection for pipeline execution

Stage 2: Application pipeline

The application pipeline triggers on every commit to application repositories (Java or Angular services). Its responsibility is to produce a secure, validated, and deployable container image. Each change flows through the following stages:

  • Source code build and compilation
  • Unit and integration testing
  • Static Application Security Testing (SAST), e.g. static code analysis
  • Dependency vulnerability scanning using Trivy
  • Container image creation
  • Publishing the image to the container registry
  • Image signing using Cosign to ensure integrity and provenance (Cosign attaches the signature to the image already stored in the registry)

Only verified, versioned, and tamper-evident artifacts are introduced into the platform. The pipeline configuration below shows the Cosign signing steps used in the pipeline.

Cosign image signing

# Step 1: Build the container image
- task: Docker@2
  displayName: 'Build Container Image'
  inputs:
    command: build
    repository: $(ACR_NAME).azurecr.io/$(IMAGE_NAME)
    tags: $(Build.BuildId)

# Step 2: Fetch OIDC token via Workload Identity Federation
- task: AzureCLI@2
  displayName: 'Fetch OIDC Token'
  inputs:
    azureSubscription: '$(SERVICE_CONNECTION)'
    scriptType: bash
    scriptLocation: inlineScript
    addSpnToEnvironment: true
    inlineScript: |
      # addSpnToEnvironment exposes the federated token to this script as $idToken
      echo "##vso[task.setvariable variable=AZURE_FEDERATED_TOKEN;issecret=true]$idToken"

# Step 3: Sign image with Cosign (keyless via Azure Pipelines OIDC)
# Cosign stores the signature in the registry, so the image must be pushed before this step
- script: |
    cosign sign \
      --yes \
      --identity-token="$AZURE_FEDERATED_TOKEN" \
      $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
  displayName: 'Sign Image with Cosign'
  env:
    AZURE_FEDERATED_TOKEN: $(AZURE_FEDERATED_TOKEN)

Stage 3: Security validation pipeline

Before any deployment or infrastructure change is executed, a dedicated security validation pipeline enforces an additional trust boundary. This pipeline validates both artifacts and deployment configurations:

  • Verification of container image signatures using Cosign
  • Image vulnerability scanning using Trivy against a defined severity threshold
  • Kubernetes manifest validation using KubeSec to detect insecure configuration patterns

Only workloads that pass all three checks are considered compliant and eligible for deployment.
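The three checks above might be wired into pipeline steps roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline used by the platform: the certificate identity, the OIDC issuer, the severity threshold, and the manifest path are placeholders that do not come from the original, and the sketch assumes cosign, trivy, and kubesec are installed on the build agent and that the agent is already authenticated to the registry.

```yaml
# Hypothetical security-validation steps; values in angle brackets are placeholders.
steps:
  # 1. Verify the keyless Cosign signature before trusting the image
  - script: |
      cosign verify \
        --certificate-identity="<expected-signer-identity>" \
        --certificate-oidc-issuer="<expected-oidc-issuer-url>" \
        $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
    displayName: 'Verify Image Signature'

  # 2. Fail the step when HIGH or CRITICAL vulnerabilities are reported
  #    (the threshold shown here is an assumption)
  - script: |
      trivy image \
        --exit-code 1 \
        --severity HIGH,CRITICAL \
        $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
    displayName: 'Scan Image with Trivy'

  # 3. Score a Kubernetes manifest for insecure configuration patterns
  #    (manifests/deployment.yaml is a hypothetical path)
  - script: |
      kubesec scan manifests/deployment.yaml
    displayName: 'Validate Manifest with KubeSec'
```

How the KubeSec score is turned into a pass or fail decision is not described in the article, so the sketch leaves that gate out.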

Stage 4: Infrastructure provisioning pipeline

Once security validation succeeds, the infrastructure provisioning pipeline runs. This stage establishes the Kubernetes foundation:

  • Provisioning of virtual networking (VNets, subnets, routing)
  • Deployment of a managed Kubernetes cluster with auto-scaling node pools
  • Installation of Argo CD, one of the platform components, as the GitOps controller
  • Bootstrapping of Argo CD Application resources
  • Connecting infrastructure Git repositories to Argo CD

The Terraform cluster module below reflects the configuration used, including Key Vault integration via the CSI driver and Calico network policy enforcement:

Terraform Kubernetes cluster module (modules/aks/main.tf)

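resource "azurerm_kubernetes_cluster" "main" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
  # The provider also needs a location and a DNS prefix (dns_prefix or
  # dns_prefix_private_cluster). var.location and var.dns_prefix are assumed
  # variable names, not taken from the original module.
  location            = var.location
  dns_prefix          = var.dns_prefix

  default_node_pool {
    name                 = "system"
    vm_size              = var.system_node_vm_size # assumed variable name; node size is not shown in the original
    auto_scaling_enabled = true                    # azurerm 4.x name (enable_auto_scaling in 3.x)
    min_count            = 2
    max_count            = 10
  }

  # System-assigned managed identity for the cluster
  identity {
    type = "SystemAssigned"
  }

  # Azure CNI for pod networking, Calico for network policy enforcement
  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  # Enables the Key Vault Secrets Store CSI add-on with periodic rotation
  key_vault_secrets_provider {
    secret_rotation_enabled = true
  }
}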

Stage 5: GitOps deployment model

After infrastructure provisioning, the platform follows a GitOps model where Git is the single source of truth. Argo CD continuously reconciles platform and application layer components by watching the Kubernetes manifests and Helm charts stored in Git. Changes pushed to Git are automatically applied to the live clusters, keeping them in sync with the declared state. This enables:

  • Automated reconciliation without manual kubectl use 
  • Full auditability via Git history and sync status 
  • Easy rollbacks using standard Git workflows

The Argo CD Application resource below shows how a microservice is configured for automated sync with self-healing and pruning enabled:

Argo CD Application (automated GitOps sync)

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: microservice-api
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: internal-developer-platform
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: apps/microservice-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Remove resources deleted from Git
      selfHeal: true     # Revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Stage 6: Runtime request flow

Once the infrastructure and application workloads are deployed, external users send requests to the cloud load balancer, which forwards traffic to the API Gateway or Ingress layer. The gateway maps hostnames and URL paths to the appropriate Kubernetes Services, which then distribute requests across healthy application Pods for processing and response delivery.
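The mapping step in this flow can be illustrated with a standard Kubernetes Ingress. The article does not say which gateway or ingress controller is used, so the following is only a sketch: the hostname, path, ingress class, and Service port are hypothetical, and the Service name and namespace are borrowed from the Argo CD example earlier in the article.

```yaml
# Hypothetical Ingress; host, path, class, and port are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: microservice-api
  namespace: production
spec:
  ingressClassName: nginx        # assumption: depends on the installed controller
  rules:
    - host: api.example.com      # placeholder hostname
      http:
        paths:
          - path: /orders        # URL prefix routed to this service
            pathType: Prefix
            backend:
              service:
                name: microservice-api   # Kubernetes Service in front of the Pods
                port:
                  number: 80
```

The Service then load-balances across the Pods that match its selector and pass their readiness checks.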

Security architecture

Security is treated as a cross-cutting concern integrated throughout the entire platform lifecycle — not a layer applied after deployment. The approach spans supply chain integrity, policy enforcement, runtime protection, and secret management.

Figure 3: Security Controls Across the Delivery Lifecycle

1. Supply chain security

Security begins at the artifact level by ensuring only trusted and verified components enter the system:

  • Trivy scans container images and dependencies for known vulnerabilities 
  • KubeSec validates Kubernetes manifests to identify insecure configuration early in the lifecycle
  • Cosign provides cryptographic signing and verification of container images, ensuring integrity and provenance through keyless signing via OIDC

Together, these controls ensure only scanned, validated, and signed artifacts are eligible for deployment.

2. Policy enforcement with Kyverno

At the cluster level, Kyverno enforces policies at admission time, rejecting non-compliant workloads before they are created. The policy below enforces one of our baseline standards: every Pod image must carry an explicit tag, and that tag must not be latest.

Kyverno ClusterPolicy — Disallow Latest Tag

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
  annotations:
    policies.kyverno.io/title: Disallow Latest Tag
    policies.kyverno.io/description: >-
      Require image tags to be pinned to a specific version.
      The 'latest' tag is mutable and can lead to unpredictable deployments.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-image-tag
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        message: >-
          An explicit image tag is required. Specify a versioned tag.
        pattern:
          spec:
            containers:
            - image: "*:*"
    - name: disallow-latest-tag
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        message: "Image tag 'latest' is not permitted."
        pattern:
          spec:
            containers:
            - image: "!*:latest"

3. Runtime security

Pre-deployment controls are necessary but not sufficient. Runtime security mechanisms monitor system behaviour and detect anomalies during workload execution:

  • Falco provides real-time detection of suspicious activity within containers and the host environment, with alerts integrated into the observability stack
  • AppArmor enforces kernel-level security profiles to restrict container capabilities and reduce the attack surface
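As an illustration of the kind of runtime detection Falco performs, a custom rule could look like the sketch below. The rule itself is hypothetical and does not come from the platform; it assumes the `spawned_process` and `container` macros and the `shell_binaries` list from Falco's default ruleset are loaded alongside it.

```yaml
# Hypothetical Falco rule; relies on macros and lists from the default ruleset.
- rule: Shell Spawned in Container
  desc: Detect an interactive shell being started inside a running container
  condition: >
    spawned_process and container
    and proc.name in (shell_binaries)
  output: >
    Shell started in container
    (user=%user.name container=%container.name command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Alerts raised by rules like this one are what would be forwarded into the observability stack.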

4. Secrets management

Sensitive data is managed outside application and deployment artifacts to eliminate exposure risk:

  • Key Vault, integrated via the CSI Secrets Store driver, provides secure and dynamic secret injection into workloads at pod startup
  • Secrets are never stored in Git repositories or embedded within container images
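  • Secret rotation is handled centrally in Key Vault; with rotation enabled, the CSI driver periodically refreshes the mounted secret files, so running workloads pick up new values when they re-read them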

This approach ensures secret management remains centralised, auditable, and secure by design.
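A minimal sketch of this integration, assuming the Azure provider for the Secrets Store CSI driver and a managed identity with read access to the vault, might look like the following. Every name and value in angle brackets, the `app-secrets` class, the `db-password` object, and the mount path are placeholders rather than details from the original platform.

```yaml
# Hypothetical SecretProviderClass; names and identifiers are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<managed-identity-client-id>"
    keyvaultName: "<key-vault-name>"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
---
# Pod that mounts the secret as a file at startup
apiVersion: v1
kind: Pod
metadata:
  name: microservice-api
  namespace: production
spec:
  containers:
    - name: api
      image: "<registry>/<image>:<version>"
      volumeMounts:
        - name: secrets-store
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: app-secrets
```

With this layout the secret value exists only in Key Vault and in the Pod's mounted volume, never in Git or in the image.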

5. Networking and traffic management

The networking layer combines Kubernetes-native primitives with Istio’s service mesh capabilities to provide secure, observable, and policy-driven traffic management:

  • Kubernetes Services expose workloads internally with stable DNS-based discovery
  • Azure Load Balancer provides external ingress with DDoS protection at the network perimeter
  • Istio manages traffic routing, mTLS encryption between services, and service-level observability
  • Calico enforces Kubernetes network policies on top of the Azure CNI plugin, restricting lateral movement between namespaces

A key lesson from Istio mTLS was that enabling Strict mode cluster-wide too early caused connectivity issues because not all workloads had sidecars injected. Istio supports Permissive mode (accepts both plaintext and mTLS) and Strict mode (enforces only mTLS). The fix was to start in Permissive mode and then gradually apply PeerAuthentication in Strict mode per namespace, only after confirming full sidecar injection in each namespace.
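The staged rollout described above can be sketched with two PeerAuthentication resources. This is an illustration rather than the platform's actual configuration: it assumes `istio-system` is the mesh root namespace and uses `production` as an example of a namespace where sidecar injection has been confirmed.

```yaml
# Mesh-wide default: accept both plaintext and mTLS while sidecars roll out
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumption: the mesh root namespace
spec:
  mtls:
    mode: PERMISSIVE
---
# Namespace-level override: enforce mTLS only where every workload has a sidecar
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production     # example namespace
spec:
  mtls:
    mode: STRICT
```

Because the namespace-level policy takes precedence over the mesh-wide one, namespaces can be moved to Strict mode one at a time, and the mesh-wide default can be tightened last.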

Observability stack

Observability is implemented as a unified system with three complementary tools, with metrics and logs surfaced through a shared Grafana interface:

| Tool | Signal type | Primary use |
|---|---|---|
| Prometheus | Metrics | Resource utilisation, SLO tracking, alerting |
| Grafana | Visualisation | Dashboards, SLA reporting, incident response |
| Loki | Logs | Centralised log aggregation, correlation with traces |

We adopted Prometheus, Grafana, and Loki to align with a Kubernetes-native observability model. Prometheus handles metrics, Loki handles log aggregation using lightweight label-based indexing, and Grafana provides a unified visualization layer. This reduces operational cost and complexity compared to maintaining a separate Elasticsearch and Kibana stack.
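To make the alerting role of Prometheus concrete, the sketch below shows what an error-rate alert could look like in a Prometheus rule file. The group name, alert name, 5% threshold, and durations are illustrative assumptions; the expression relies on the `istio_requests_total` metric that Istio sidecars export.

```yaml
# Hypothetical Prometheus rule file; names and thresholds are illustrative.
groups:
  - name: platform-slo
    rules:
      - alert: HighServiceErrorRate
        # Share of 5xx responses per destination service over the last 5 minutes
        expr: |
          sum by (destination_service) (rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
            /
          sum by (destination_service) (rate(istio_requests_total{reporter="destination"}[5m]))
            > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests to {{ $labels.destination_service }} are failing"
```

The same expression can back a Grafana panel, so the dashboard and the alert are driven by one definition of the error rate.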

Infrastructure as code strategy

Terraform is structured into modular components that reflect the modular platform layers, enabling independent versioning and testing of each:

| Module | Responsibility |
|---|---|
| network | VNet, subnets, NSGs, peering configurations |
| Managed k8s Cluster | K8s cluster, node pools, RBAC, Key Vault integration |
| security | Policies, Defender for Containers, audit logging |
| platform-services | Argo CD, Istio, Prometheus, Grafana, Loki, Kyverno |

Environment separation is handled using per-environment variable files:

  • dev.tfvars — reduced node counts, relaxed policies, faster iteration
  • staging.tfvars — production-equivalent topology with synthetic load testing
  • prod.tfvars — full node pools, strict policies, backup schedules enabled

This structure ensures reusability, consistency across environments, and controlled environment-specific customisation without duplicating module code.

Key outcomes

The following outcomes were observed in our internal lab and staging environments after full platform adoption.

| Metric | Observed change |
|---|---|
| Deployment reliability | Improved to ~95% success rate (from ~70% with manual processes) |
| Infrastructure provisioning time | Reduced from hours/days to under 15 minutes via Terraform automation |
| Deployment frequency | Increased from weekly to multiple releases per day |
| Configuration drift incidents | Near-zero; eliminated by GitOps continuous reconciliation |
| Pre-production vulnerability detection | 80% of findings caught before reaching staging |
| Manual kubectl operations | Reduced to near-zero for routine deployments |

Challenges and lessons learned

Navigating the CNCF ecosystem showed the risk of adopting too many overlapping tools early. The key lesson was to let architecture drive tooling decisions and defer additions like OpenTelemetry until the platform stabilized.

Maintaining clear separation between infrastructure, platform, and application layers was essential for long-term maintainability. Early coupling of platform tools such as Argo CD and Istio with application code increased complexity and was later corrected by moving them into their own repositories or directories.

GitOps improved consistency and traceability but introduced synchronization issues during repository restructuring. These were resolved using the Argo CD app-of-apps pattern and application health checks.

Moving security checks earlier in the pipeline, by running Trivy and KubeSec immediately after build, improved feedback speed and reduced late-stage failures.
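The app-of-apps pattern mentioned above can be sketched as a single root Application whose source directory contains the child Application manifests. The `platform-root` name and the `argocd-apps` directory are hypothetical; the repository URL is the placeholder used in the earlier Application example.

```yaml
# Hypothetical root Application; child Application manifests live under argocd-apps/.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: argocd-apps          # assumed directory of child Application manifests
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd          # child Applications are created in the Argo CD namespace
  syncPolicy:
    automated:
      prune: true              # remove child Applications deleted from Git
      selfHeal: true
```

With this layout, restructuring the repository means updating the child manifests in one directory, and Argo CD reconciles the set of Applications from there.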

Conclusion

This architecture shows how Kubernetes and CNCF tools can be combined to build a secure, automated, and scalable platform, where the real value comes from how deployment, security, and observability work together as a system. The core design decisions are to establish clear layer separation early, integrate security from the start, and adopt GitOps with Argo CD from day one. Future improvements focus on multi-cluster management with Argo CD ApplicationSets, stronger policy enforcement using Kyverno, deeper zero-trust networking via Istio, and adding distributed tracing through OpenTelemetry integrated into the observability stack.