Gradual deployments in Amazon ECS with linear and canary strategies

Source: AWS - Containers

When deploying new application versions, you need confidence that changes won’t impact customers. Amazon Elastic Container Service (Amazon ECS) now supports linear and canary deployment strategies, complementing built-in blue/green deployments. With linear deployments, you shift traffic in equal increments with a bake time between each shift. With canary deployments, you route a small percentage of traffic to the new revision and monitor it before shifting the rest. Both strategies support Amazon CloudWatch alarms for failure detection and rollback, and lifecycle hooks for custom validation.

In this post, we walk through how linear and canary strategies work in Amazon ECS, how to configure each, and how to set up automatic rollbacks with CloudWatch alarms.

How Amazon ECS orchestrates gradual deployments

When you configure linear or canary deployments, Amazon ECS uses Elastic Load Balancing weighted target groups and CloudWatch alarms for traffic shifting and automated rollback.

Figure 1: Amazon ECS architecture for gradual deployments showing blue and green revisions connected to a load balancer (Application Load Balancer or Network Load Balancer) with weighted target groups monitored by CloudWatch alarms for automated rollback.


Architecture and traffic flow

Amazon ECS supports four deployment strategies, each with a different approach to traffic management. Your choice depends on your risk tolerance and required control over the rollout.

Rolling: task-by-task replacement

Rolling deployments replace tasks progressively without traffic shifting. Amazon ECS starts new tasks before stopping old ones to maintain availability (controlled by the minimumHealthyPercent and maximumPercent parameters). This is the default deployment type.

Figure 2: Rolling deployment showing four tasks being replaced progressively. New tasks (green) start before old tasks (blue) are stopped to maintain availability.
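
To make the two parameters concrete, the following is a minimal sketch of the task-count bounds they imply. It assumes Amazon ECS rounds the minimum up and the maximum down to whole tasks; the function name rolling_bounds and the sample values are illustrative only.

```shell
# Sketch: task-count bounds during a rolling deployment.
# Assumption: the minimum is rounded up and the maximum is rounded down.
rolling_bounds() (
  desired=$1; min_pct=$2; max_pct=$3
  min_tasks=$(( (desired * min_pct + 99) / 100 ))   # ceiling division
  max_tasks=$(( desired * max_pct / 100 ))          # floor division
  echo "min_running=$min_tasks max_total=$max_tasks"
)

rolling_bounds 4 100 200   # never drop below 4 tasks, allow up to 8 in flight
rolling_bounds 4 50 150    # may drop to 2 tasks, at most 6 in flight
```

With 100 and 200, the service keeps full capacity throughout but briefly doubles its task count. Lower values reduce cost at the expense of headroom during the rollout.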


Consider rolling deployments for cost-sensitive deployments where you want to avoid duplicate infrastructure, and workloads where simplicity is preferred over fine-grained traffic control.

Blue/green: full traffic switch

Blue/green deployments create a complete replacement environment (green) alongside the existing one (blue). After validation using a test listener, traffic switches instantly from blue to green. The blue environment remains available for rollback.

Figure 3: Blue/green deployment showing instant traffic switch from blue (100%) to green (100%) after health checks pass.


Blue/green deployments work well for database schema changes requiring synchronized cutover, major version upgrades where instant rollback is critical, and services where gradual rollout provides no additional validation benefit.

Linear: gradual shift in equal increments

Linear deployments shift traffic in equal increments, with a configurable bake time at each stage. If CloudWatch alarms breach or health checks fail, the deployment automatically rolls back.

Figure 4: Linear deployment shifting traffic in 10% increments across 10 steps, with a 5-minute validation period at each step.
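
The shape of a linear rollout can be sketched with a small shell function that prints the expected traffic split for a given step size and bake time. This is an illustration only: it assumes each increment lands one bake interval after the previous one, and the function name linear_schedule is hypothetical. Actual timings depend on how quickly tasks start and pass health checks.

```shell
# Sketch: expected traffic split for a linear rollout.
# Arguments: step percent, bake time per step in minutes.
# Assumption: each increment lands one bake interval after the previous one.
linear_schedule() (
  step=$1; bake=$2
  green=0; elapsed=0
  while :; do
    printf '%d:%02d blue=%d%% green=%d%%\n' \
      $((elapsed / 60)) $((elapsed % 60)) $((100 - green)) "$green"
    [ "$green" -ge 100 ] && break
    green=$((green + step))
    [ "$green" -gt 100 ] && green=100   # cap the final step at 100%
    elapsed=$((elapsed + bake))
  done
)

linear_schedule 25 5
```

Smaller steps give alarms more chances to catch a regression before most traffic has moved, but they lengthen the rollout and the time both revisions run side by side.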


Canary: small traffic slice first

Canary deployments route a small percentage of traffic to the new version for an extended observation period. If validation succeeds, the remaining traffic shifts in a single step.

Figure 5: Canary deployment routing 5% of traffic to the green revision for 15 minutes of extended validation, then shifting remaining traffic on success.
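
One way to reason about the canary percentage is to estimate how many requests reach the new revision before the full shift. The following back-of-the-envelope sketch assumes a steady request rate; the function name canary_exposure and the traffic figure of 1,200 requests per minute are hypothetical.

```shell
# Sketch: rough number of requests served by the canary during its bake time.
# Arguments: requests per minute, canary percent, bake time in minutes.
# Assumption: traffic is steady and split exactly by weight.
canary_exposure() (
  rpm=$1; pct=$2; bake=$3
  echo $(( rpm * pct * bake / 100 ))
)

canary_exposure 1200 5 15   # about 900 requests see the new revision
```

If that number is too small for your alarms to detect a meaningful error rate, consider a larger canary percentage or a longer bake time.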


Choosing your rollout strategy

The following table compares all four Amazon ECS deployment strategies to help you decide which approach fits your workload.

Strategy     Traffic control      Rollback speed        Cost impact          Best for
Rolling      No control           Slow (redeploy)       Low                  Cost-sensitive workloads, simplicity over traffic control
Blue/green   Instant switch       Instant               High (2x resources)  Critical updates, DB migrations
Linear       Gradual increments   Fast (traffic shift)  Medium               APIs, microservices
Canary       Small test first     Fast (traffic shift)  Medium               Changes requiring careful validation, machine learning models

Observability and rollbacks

Amazon ECS provides several mechanisms to monitor deployments and automatically roll back when issues arise.

Amazon CloudWatch alarms for automated failure detection

You can associate CloudWatch alarms with your Amazon ECS service for automatic rollback. If an alarm enters the ALARM state, the deployment rolls back.

Common alarm metrics include:

  • Error rate: HTTPCode_Target_5XX_Count, HTTPCode_Target_4XX_Count
  • Latency: TargetResponseTime (p50, p99, p99.9)
  • Availability: UnHealthyHostCount, HealthyHostCount
  • Custom metrics: Application-specific business metrics published to Amazon CloudWatch
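
As an illustration of a percentile-based latency alarm, the following sketch builds a put-metric-alarm command that uses the --extended-statistic option with p99. It is a dry run that only prints the command. The alarm name my-service-p99-latency, the 1.0-second threshold, and the function name p99_alarm_cmd are assumptions, and the xxx suffixes are placeholders for your own resource IDs.

```shell
# Dry run: prints the command instead of calling AWS.
# Remove the leading "echo" inside the function to create the alarm.
# Arguments: alarm name, target group dimension, load balancer dimension.
p99_alarm_cmd() {
  echo aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions "Name=TargetGroup,Value=$2" "Name=LoadBalancer,Value=$3" \
    --extended-statistic p99 \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1.0 \
    --comparison-operator GreaterThanThreshold
}

p99_alarm_cmd my-service-p99-latency targetgroup/green/xxx app/my-load-balancer/xxx
```

A p99 alarm reacts to tail latency that an average can hide, which is often where a regression in a new revision shows up first.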

Lifecycle hooks

Lifecycle hooks let you run custom validation logic at specific points during the deployment. You can use AWS Lambda functions to implement hooks for:

  • Pre-deployment validation: Verify prerequisites before creating new tasks
  • Post-deployment testing: Run automated tests against the new revision
  • Custom health checks: Validate application-specific health criteria
  • Integration testing: Test interactions with downstream dependencies

Hooks can return IN_PROGRESS to indicate ongoing validation, SUCCEEDED to proceed, or FAILED to trigger rollback.
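
Hooks are attached through the service's deployment configuration. The following sketch prints a lifecycleHooks fragment of the kind you might merge into the --deployment-configuration JSON. The Lambda function name validate-green and the role name ecs-hook-invoke-role are hypothetical placeholders, and the field and stage names reflect our reading of the Amazon ECS API reference, so verify them before use.

```shell
# Sketch: prints a lifecycleHooks fragment for --deployment-configuration.
# ARNs are placeholders; verify field and stage names in the ECS API reference.
hook_fragment() {
  cat <<'EOF'
{
  "lifecycleHooks": [
    {
      "hookTargetArn": "arn:aws:lambda:region:account:function:validate-green",
      "roleArn": "arn:aws:iam::account:role/ecs-hook-invoke-role",
      "lifecycleStages": ["PRE_SCALE_UP", "POST_SCALE_UP"]
    }
  ]
}
EOF
}

hook_fragment
```

In this sketch the same function runs before new tasks are created and again after they are running, which corresponds to the pre-deployment and post-deployment validation cases listed above.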

Bake time configuration

The deployment bake time is a buffer period after traffic shifting completes, during which the old (blue) revision remains running. This gives you instant rollback by shifting traffic back, without redeploying tasks.

Consider the following bake time configuration:

  • Most workloads: Set a baseline bake time that covers your typical health check intervals and allows enough time for error rates and latency metrics to surface in CloudWatch alarms.
  • Workloads requiring extended observation: Increase the bake time for workloads with longer feedback loops, such as batch processing, async workflows, or services where downstream effects take time to manifest.
  • Cost vs. safety tradeoff: Longer bake times increase costs (running both revisions) but improve rollback capability. Choose based on your service’s mean time to detect (MTTD) failures.
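
A useful lower bound for any bake time is the time an alarm needs to fire: the alarm period multiplied by the number of evaluation periods. The following sketch computes that window and adds a safety margin. The function name min_bake_minutes and the margin values are illustrative assumptions, not recommendations.

```shell
# Sketch: smallest bake time that still lets an alarm fire.
# Arguments: alarm period in seconds, evaluation periods, extra margin in minutes.
min_bake_minutes() (
  period_s=$1; eval_periods=$2; margin_min=$3
  window_s=$(( period_s * eval_periods ))
  echo $(( (window_s + 59) / 60 + margin_min ))   # round the window up to whole minutes
)

min_bake_minutes 60 2 3    # 60-second period, 2 evaluation periods: 2 + 3 = 5
min_bake_minutes 300 1 5   # 300-second period, 1 evaluation period: 5 + 5 = 10
```

A bake time shorter than the alarm window means traffic can move on before the alarm has had a chance to enter the ALARM state.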

Prerequisites

Before starting either walkthrough, verify that you have the following:

  • An Amazon ECS cluster (AWS Fargate or Amazon Elastic Compute Cloud (Amazon EC2))
  • An Application Load Balancer or Network Load Balancer with two target groups configured
  • A task definition for your application
  • A running Amazon ECS service configured with a load balancer and advancedConfiguration (two target groups and a production listener rule), along with an IAM role for traffic shifting. For the linear walkthrough, create a service named my-web-service; for the canary walkthrough, create a service named payment-service. See Creating an Amazon ECS linear deployment and Creating an Amazon ECS canary deployment for instructions.
  • AWS Command Line Interface (AWS CLI) installed and configured
  • Appropriate AWS Identity and Access Management (AWS IAM) permissions for Amazon ECS, Elastic Load Balancing, and Amazon CloudWatch
  • An IAM role (for example, ecsBlueGreenRole) that allows Amazon ECS to manage load balancer target group weights during traffic shifting
  • For the canary walkthrough: custom Amazon CloudWatch metrics for business logic validation (optional)

Walkthrough: linear strategy implementation

In this walkthrough, you deploy a sample application using the linear deployment strategy with automatic rollback capabilities.

Step 1: Create CloudWatch alarm for 5XX errors

# Create alarm for 5XX errors across both target groups
aws cloudwatch put-metric-alarm \
  --alarm-name my-service-5xx-errors \
  --alarm-description "Trigger on high 5XX error rate across both target groups" \
  --metrics '[
    {
      "Id": "blue5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/blue/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "green5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/green/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "total5xx",
      "Expression": "SUM([blue5xx, green5xx])",
      "Label": "Total 5XX Errors",
      "ReturnData": true
    }
  ]' \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold

Step 2: Create alarm for high latency

Create an alarm for high response time across both target groups.

# Create alarm for high latency across both target groups
aws cloudwatch put-metric-alarm \
  --alarm-name my-service-high-latency \
  --alarm-description "Trigger on high response time across both target groups" \
  --metrics '[
    {
      "Id": "blueLatency",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "TargetResponseTime",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/blue/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Average"
      },
      "ReturnData": false
    },
    {
      "Id": "greenLatency",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "TargetResponseTime",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/green/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Average"
      },
      "ReturnData": false
    },
    {
      "Id": "maxLatency",
      "Expression": "MAX([blueLatency, greenLatency])",
      "Label": "Max Response Time",
      "ReturnData": true
    }
  ]' \
  --evaluation-periods 2 \
  --threshold 1.0 \
  --comparison-operator GreaterThanThreshold

Step 3: Configure the linear strategy

Update your Amazon ECS service to use linear deployment with CloudWatch alarm integration. This configuration also enables the deployment circuit breaker for automatic rollback.

# Configure linear deployment strategy with CloudWatch alarm integration
aws ecs update-service \
  --cluster production-cluster \
  --service my-web-service \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "strategy": "LINEAR",
    "linearConfiguration": {
      "stepPercent": 10,
      "stepBakeTimeInMinutes": 5
    },
    "alarms": {
      "alarmNames": [
        "my-service-5xx-errors",
        "my-service-high-latency"
      ],
      "enable": true,
      "rollback": true
    }
  }'

Step 4: Deploy a new version

Trigger a deployment by forcing a new deployment of the existing task definition:

# Trigger a deployment by forcing a new deployment of the existing task definition
aws ecs update-service \
  --cluster production-cluster \
  --service my-web-service \
  --force-new-deployment

Step 5: Monitor rollout progress

Monitor deployment progress in real time using the DescribeServiceDeployments and DescribeServiceRevisions APIs, which provide detailed information about traffic shifting status, rollout state, and revision details.

# List service deployments to get the deployment ARN
aws ecs list-service-deployments \
  --service arn:aws:ecs:region:account:service/production-cluster/my-web-service

# Get detailed deployment status including traffic shifting progress
aws ecs describe-service-deployments \
  --service-deployment-arns arn:aws:ecs:region:account:service-deployment/production-cluster/my-web-service/xxx

# Get details about a specific service revision
aws ecs describe-service-revisions \
  --service-revision-arns arn:aws:ecs:region:account:service-revision/production-cluster/my-web-service/xxx
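
For unattended runs, such as in a pipeline, you might wrap the describe call in a polling loop. The following is a minimal sketch: the function names fetch_status and wait_for_deployment are hypothetical, the 30-second default interval is arbitrary, and the terminal status values reflect our reading of the ServiceDeployment API, so confirm them for your CLI version.

```shell
# Sketch: poll a service deployment until it reaches a terminal status.
# fetch_status is kept separate so it can be replaced with a stub in tests.
fetch_status() {
  aws ecs describe-service-deployments \
    --service-deployment-arns "$1" \
    --query 'serviceDeployments[0].status' \
    --output text
}

# Arguments: service deployment ARN, poll interval in seconds (default 30).
# Exit status: 0 on success, 1 on stop or rollback, 2 if the status call fails.
wait_for_deployment() (
  arn=$1; interval=${2:-30}
  while :; do
    status=$(fetch_status "$arn") || exit 2
    echo "$(date -u +%H:%M:%S) $status"
    case "$status" in
      SUCCESSFUL) exit 0 ;;
      STOPPED|ROLLBACK_SUCCESSFUL|ROLLBACK_FAILED) exit 1 ;;
    esac
    sleep "$interval"
  done
)

# Usage (placeholder ARN):
# wait_for_deployment arn:aws:ecs:region:account:service-deployment/production-cluster/my-web-service/xxx
```

The non-zero exit status on rollback lets a calling script or pipeline stage fail fast instead of reporting a rolled-back deployment as a success.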

Step 6: Observe traffic shifting

During the deployment, traffic shifts in 10% increments every 5 minutes:

Time    Blue (Old)    Green (New)    Status
0:00    100%          0%             Deployment started
0:05    90%           10%            Step 1 complete
0:10    80%           20%            Step 2 complete
0:15    70%           30%            Step 3 complete
...
0:50    0%            100%           Deployment complete

If Amazon CloudWatch alarms breach during deployment, Amazon ECS automatically pauses the deployment, shifts traffic back to the stable (blue) revision, terminates the new (green) tasks, and marks the deployment as FAILED.

Step 7: Verify rollout status

Check the deployment completed successfully:

# Check deployment status
aws ecs describe-services \
  --cluster production-cluster \
  --services my-web-service \
  --query "services[0].deployments[?status=='PRIMARY'].rolloutState" \
  --output text

Step 8: Verify alarm state

Confirm no alarms are in ALARM state:

# Confirm no alarms in ALARM state
aws cloudwatch describe-alarms \
  --alarm-names my-service-5xx-errors my-service-high-latency \
  --query "MetricAlarms[*].[AlarmName,StateValue]" \
  --output table

Walkthrough: canary strategy implementation

This walkthrough deploys an update using the canary strategy.

Step 1: Create CloudWatch alarm for HTTP errors

Create an alarm for HTTP 5XX errors using metric math across both target groups.

# Create alarm for 5XX errors across both target groups (blue and green)
aws cloudwatch put-metric-alarm \
  --alarm-name payment-service-errors \
  --alarm-description "Trigger on high 5XX error rate across both target groups" \
  --metrics '[
    {
      "Id": "blue5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/blue/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "green5xx",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [
            {"Name": "TargetGroup", "Value": "targetgroup/green/xxx"},
            {"Name": "LoadBalancer", "Value": "app/my-load-balancer/xxx"}
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "total5xx",
      "Expression": "SUM([blue5xx, green5xx])",
      "Label": "Total 5XX Errors",
      "ReturnData": true
    }
  ]' \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold

Step 2: Create alarm for business metrics

Create a business metric alarm for application monitoring:

# Business metric alarm (custom)
aws cloudwatch put-metric-alarm \
  --alarm-name payment-transaction-failure-rate \
  --metric-name TransactionFailureRate \
  --namespace CustomApp/Payments \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0.5 \
  --comparison-operator GreaterThanThreshold
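
The alarm above only works if your application publishes the metric. The following sketch shows one way to compute a failure rate and build the matching put-metric-data command. It is a dry run that only prints the command. It assumes TransactionFailureRate is expressed as a percentage, so the 0.5 threshold means 0.5%. The function names and sample counts are hypothetical.

```shell
# Sketch: compute a failure rate and print the publish command (dry run).
# Assumption: the metric is a percentage, so a threshold of 0.5 means 0.5%.
# Arguments: failed transactions, total transactions.
failure_rate_pct() (
  failed=$1; total=$2
  [ "$total" -gt 0 ] || { echo 0.00; exit 0; }   # avoid division by zero
  awk -v f="$failed" -v t="$total" 'BEGIN { printf "%.2f\n", f * 100 / t }'
)

# Remove the leading "echo" to publish the data point.
publish_cmd() {
  echo aws cloudwatch put-metric-data \
    --namespace CustomApp/Payments \
    --metric-name TransactionFailureRate \
    --unit Percent \
    --value "$(failure_rate_pct "$1" "$2")"
}

publish_cmd 3 1000
```

Whatever unit you choose, keep it consistent between the published values and the alarm threshold, otherwise the alarm either never fires or fires constantly.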

Step 3: Configure the canary strategy

Configure the canary with a small initial traffic percentage and an extended bake time:

# Configure canary deployment strategy with CloudWatch alarm integration
aws ecs update-service \
  --cluster production-cluster \
  --service payment-service \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "strategy": "CANARY",
    "canaryConfiguration": {
      "canaryPercent": 5,
      "canaryBakeTimeInMinutes": 20
    },
    "alarms": {
      "alarmNames": [
        "payment-service-errors",
        "payment-transaction-failure-rate"
      ],
      "enable": true,
      "rollback": true
    }
  }'

Step 4: Deploy new version

Trigger a deployment by forcing a new deployment of the existing task definition:

# Trigger a deployment by forcing a new deployment of the existing task definition
aws ecs update-service \
  --cluster production-cluster \
  --service payment-service \
  --force-new-deployment

Step 5: Observe canary traffic pattern

During canary deployment, traffic shifts in two phases:

Phase 1: Canary testing (20 minutes)
Time    Blue (Old)    Green (New)    Status
0:00    100%          0%             Canary deployment started
0:01    95%           5%             Canary phase - monitoring
0:05    95%           5%             Canary phase - monitoring
0:10    95%           5%             Canary phase - monitoring
0:15    95%           5%             Canary phase - monitoring
0:20    95%           5%             Canary validation complete

Phase 2: Full rollout (if canary succeeds)
0:21    0%            100%           Full traffic shift
0:21    0%            100%           Deployment complete

Step 6: Verify rollout status

Check the deployment completed successfully:

# Check deployment status
aws ecs describe-services \
  --cluster production-cluster \
  --services payment-service \
  --query "services[0].deployments[?status=='PRIMARY'].rolloutState" \
  --output text

Step 7: Verify alarm state

Confirm no alarms are in ALARM state:

# Confirm no alarms in ALARM state
aws cloudwatch describe-alarms \
  --alarm-names payment-service-errors payment-transaction-failure-rate \
  --query "MetricAlarms[*].[AlarmName,StateValue]" \
  --output table

Best practices

This section provides guidance on configuring alarms and setting bake times for your deployments.

Amazon CloudWatch alarm configuration

Consider configuring two tiers of alarms: critical alarms that trigger automatic rollback, and warning alarms for monitoring only. Set thresholds and evaluation periods based on your application’s baseline performance and acceptable error rates.

Critical alarms (immediate rollback):

  • HTTPCode_Target_5XX_Count
  • TargetResponseTime (p99)
  • UnHealthyHostCount

Warning alarms (monitor, don’t roll back):

  • TargetResponseTime (p50)
  • RequestCount (anomaly detection or percentage decrease)
  • CPUUtilization

Bake time guidelines

Canary bake time:

  • Low risk: shorter bake time
  • Medium risk: moderate bake time
  • High risk: extended bake time

Deployment bake time:

  • Set a baseline that exceeds your service’s mean time to detect (MTTD) failures
  • This gives you instant rollback without redeployment

Clean up resources

To avoid ongoing charges, delete all resources created during the walkthroughs. Load balancers and running Amazon ECS tasks are the primary cost drivers.

Warning: The --force flag immediately stops all running tasks without draining connections. This causes service disruption. Make sure no active traffic is being served and back up any necessary data before proceeding.

# Delete the linear walkthrough ECS service
aws ecs delete-service \
  --cluster production-cluster \
  --service my-web-service \
  --force

# Delete the canary walkthrough ECS service
aws ecs delete-service \
  --cluster production-cluster \
  --service payment-service \
  --force

# Delete CloudWatch alarms for linear walkthrough
aws cloudwatch delete-alarms \
  --alarm-names my-service-5xx-errors my-service-high-latency

# Delete CloudWatch alarms for canary walkthrough
aws cloudwatch delete-alarms \
  --alarm-names payment-service-errors payment-transaction-failure-rate

# Delete target groups (after service deletion completes and no listener or rule still references them)
aws elbv2 delete-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/blue/xxx
aws elbv2 delete-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/green/xxx

If you created the following resources specifically for this walkthrough and no longer need them, delete them:

# Delete the ECS cluster
aws ecs delete-cluster \
  --cluster production-cluster

# Delete the load balancer
aws elbv2 delete-load-balancer \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-load-balancer/xxx

# Deregister task definitions
aws ecs deregister-task-definition \
  --task-definition my-task-definition:1

# Delete the IAM role
aws iam delete-role \
  --role-name ecsBlueGreenRole

Conclusion

In this post, we showed you how to configure linear and canary deployment strategies in Amazon ECS with CloudWatch alarms for automatic rollback, providing native gradual rollout support with automated safety.

Next steps

To get started, try the linear deployment strategy with a non-production service first. Experiment with different step percentages and bake times to find optimal settings. After validating linear deployments, adopt canary deployments for your most sensitive services.

These strategies are available today in commercial AWS Regions. For pricing details, see the Amazon ECS pricing page.

For more information, see the Amazon ECS Developer Guide.

About the authors

Rishabh Dubey

Rishabh Dubey is a Delivery Consultant with AWS Professional Services, specializing in cloud-native development, distributed systems, and generative AI. He works closely with customers to design and build scalable, well-architected solutions on AWS. Outside of work, he enjoys spending time with close ones and engaging in conversations about life, career growth, and the future. Connect with him on LinkedIn

Ashish Tak

Ashish Tak is a Delivery Consultant at AWS Professional Services with over 9 years of experience in cloud-native application development, containerization, and infrastructure automation. He specializes in building scalable solutions and is passionate about leveraging AI/ML to enhance cloud-native workloads. Outside of work, Ashish enjoys spending time with his family and playing outdoor sports. Connect with him on LinkedIn

Muskan

Muskan is an Associate Delivery Consultant at AWS based in Gurugram, India. She joined Amazon 2 years ago and since then has been helping customers modernize their infrastructure, automate their DevOps workflows, and explore what Generative AI can do for their business. When she’s not working, she loves spending time with her pet Bruno and is always on the hunt for a really good cup of coffee. Connect with her on LinkedIn