Update operating a job, upgrade guide (#2913)
* Update operating a job, upgrade guide

  This PR updates the guide for updating a job to reflect the changes in Nomad 0.6

* Feedback changes
* Feedback
* Feedback
@@ -3,9 +3,8 @@ layout: "docs"
page_title: "Blue/Green & Canary Deployments - Operating a Job"
sidebar_current: "docs-operating-a-job-updating-blue-green-deployments"
description: |-
  Nomad has built-in support for doing blue/green and canary deployments to more
  safely update existing applications and services.
---

# Blue/Green & Canary Deployments

@@ -17,136 +16,438 @@ organizations prefer to put a "canary" build into production or utilize a
technique known as a "blue/green" deployment to ensure a safe application
rollout to production while minimizing downtime.

## Blue/Green Deployments

Blue/Green deployments have several other names including Red/Black or A/B, but
the concept is generally the same. In a blue/green deployment, there are two
application versions. Only one application version is active at a time, except
during the transition phase from one version to the next. The term "active"
tends to mean "receiving traffic" or "in service".

Imagine a hypothetical API server which has five instances deployed to
production at version 1.3, and we want to safely upgrade to version 1.4. We want
to create five new instances at version 1.4 and, if they are operating
correctly, promote them and take down the five instances running 1.3. In the
event of failure, we can quickly roll back to 1.3.

To start, we examine our job which is running in production:

```hcl
job "docs" {
  datacenters = ["dc1"]

  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 5
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

We see that the job has an `update` stanza with `canary` equal to the desired
count. This is what allows us to easily model blue/green deployments. When we
change the job to run the "api-server:1.4" image, Nomad will create 5 new
allocations without touching the original "api-server:1.3" allocations. Below we
can see how this works by changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad run -check-index 7 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

Assuming the plan output looks okay, we are ready to run these changes:

```shell
$ nomad run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 5 canaries that
are running the "api-server:1.4" image and ignore all the allocations running
the older image. Now if we examine the status of the job we can see that both
the blue ("api-server:1.3") and green ("api-server:1.4") sets are running.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         10       0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d8eec42  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now that we have the new set in production, we can route traffic to it and
validate that the new job version is working properly. Based on whether the new
version is functioning properly or improperly, we will either want to promote or
fail the deployment.

### Promoting the Deployment

After deploying the new image alongside the old version, we have determined it
is functioning properly and we want to transition fully to the new version.
Doing so is as simple as promoting the deployment:

```text
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```

If we look at the job's status, we see that after promotion, Nomad stopped the
older allocations and is only running the new ones. This completes our
blue/green deployment.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 32a080c1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6d8eec42  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
```

### Failing the Deployment

After deploying the new image alongside the old version, we have determined it
is not functioning properly and we want to roll back to the old version. Doing
so is as simple as failing the deployment:

```text
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.

==> Monitoring evaluation "6840f512"
    Evaluation triggered by job "example"
    Evaluation within deployment: "32a080c1"
    Allocation "0ccb732f" modified: node "36e7a123", group "cache"
    Allocation "64d4f282" modified: node "36e7a123", group "cache"
    Allocation "664e33c7" modified: node "36e7a123", group "cache"
    Allocation "a4cb6a4b" modified: node "36e7a123", group "cache"
    Allocation "fdd73bdd" modified: node "36e7a123", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```

If we now look at the job's status, we can see that after failing the
deployment, Nomad stopped the new allocations, is only running the old ones, and
has reverted the working copy of the job back to the original specification
running "api-server:1.3".

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 6f3f84b3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
cache       true         5        5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
27dc2a42  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
5b7d34bb  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
983b487d  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d1cbf45a  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d6b46def  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
0ccb732f  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
64d4f282  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
664e33c7  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
a4cb6a4b  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
fdd73bdd  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC

$ nomad job deployments docs
ID        Job ID   Job Version  Status      Description
6f3f84b3  example  2            successful  Deployment completed successfully
32a080c1  example  1            failed      Deployment marked as failed - rolling back to job version 0
c4c16494  example  0            successful  Deployment completed successfully
```

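The automatic rollback above ("Auto-reverted to job version 0") comes from the
`auto_revert` flag we set in the `update` stanza of the job file at the top of
this guide. The relevant fragment, repeated here for emphasis:

```hcl
update {
  # ...

  # When the deployment is failed, also revert the job to the last stable
  # version, which in this example is version 0.
  auto_revert = true
}
```
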
## Canary Deployments

Canary updates are a useful way to test a new version of a job before beginning
a rolling upgrade. The `update` stanza supports setting the number of canaries
the job operator would like Nomad to create when the job changes via the
`canary` parameter. When the job specification is updated, Nomad creates the
canaries without stopping any allocations from the previous job.

This pattern allows operators to achieve higher confidence in the new job
version because they can route traffic, examine logs, etc., to determine whether
the new application is performing properly.

```hcl
job "docs" {
  datacenters = ["dc1"]

  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

In the example above, the `update` stanza tells Nomad to create a single canary
when the job specification is changed. Below we can see how this works by
changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
   group "api" {
     task "api-server" {
       config {
-        image = "api-server:1.3"
+        image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad run -check-index 7 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 1 canary that
will run the "api-server:1.4" image and ignore all the allocations running
the older image. If we inspect the status we see that the canary is running
alongside the older version of the job:

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         6        0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now if we promote the canary, this will trigger a rolling update to replace the
remaining allocations running the older image. The rolling update will happen at
a rate of `max_parallel`, so in this case one allocation at a time:

```text
$ nomad deployment promote 37033151
==> Monitoring evaluation "37033151"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "ed28f6c2"
    Allocation "f5057465" created: node "f6646949", group "cache"
    Allocation "f5057465" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"

$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 20:28:59 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       2         0

Latest Deployment
ID          = ed28f6c2
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        1         2       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
f5057465  f6646949  api         1        run      running   07/26/17 20:29:23 UTC
b1c88d20  f6646949  api         1        run      running   07/26/17 20:28:59 UTC
1140bacf  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
1958a34a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
4bda385a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
62d96f06  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
f58abbb2  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
```

Alternatively, if the canary was not performing properly, we could abandon the
change using the `nomad deployment fail` command, similar to the blue/green
example.

@@ -12,10 +12,11 @@ description: |-

Most applications are long-lived and require updates over time. Whether you are
deploying a new version of your web application or upgrading to a new version of
Redis, Nomad has built-in support for rolling, blue/green, and canary updates.
When a job specifies a rolling update, Nomad uses task state and health check
information in order to detect allocation health and minimize or eliminate
downtime. This section and subsections will explore how to do so safely with
Nomad.

Please see one of the guides below or use the navigation on the left:

@@ -4,35 +4,71 @@ page_title: "Rolling Upgrades - Operating a Job"
sidebar_current: "docs-operating-a-job-updating-rolling-upgrades"
description: |-
  In order to update a service while reducing downtime, Nomad provides a
  built-in mechanism for rolling upgrades. Rolling upgrades incrementally
  transition jobs between versions and use health check information to
  reduce downtime.
---

# Rolling Upgrades

Nomad supports rolling updates as a first-class feature. To enable rolling
updates, a job or task group is annotated with a high-level description of the
update strategy using the [`update` stanza][update]. Under the hood, Nomad
handles limiting parallelism, interfacing with Consul to determine service
health, and even automatically reverting to an older, healthy job when a
deployment fails.

## Enabling Rolling Updates

Rolling updates are enabled by adding the [`update` stanza][update] to the job
specification. The `update` stanza may be placed at the job level or in an
individual task group. When placed at the job level, the update strategy is
inherited by all task groups in the job. When placed at both the job and group
level, the `update` stanzas are merged, with group stanzas taking precedence
over job-level stanzas. See the [`update` stanza
documentation](/docs/job-specification/update.html#upgrade-stanza-inheritance)
for an example.

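As a quick, hypothetical sketch of that inheritance (the group name and values
below are made up for illustration):

```hcl
job "docs" {
  # Job-level defaults inherited by every group in this job.
  update {
    max_parallel     = 1
    healthy_deadline = "5m"
  }

  group "api" {
    # Merged with the job-level stanza; group values take precedence, so this
    # group uses max_parallel = 2 while keeping the 5 minute healthy_deadline.
    update {
      max_parallel     = 2
      min_healthy_time = "30s"
    }

    # ...
  }
}
```

The example below, which the rest of this guide builds on, uses a single
group-level `update` stanza:
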
```hcl
job "geo-api-server" {
  # ...

  group "api-server" {
    count = 6

    # Add an update stanza to enable rolling updates of the service
    update {
      max_parallel     = 2
      min_healthy_time = "30s"
      healthy_deadline = "10m"
    }

    task "server" {
      driver = "docker"

      config {
        image = "geo-api-server:0.1"
      }

      # ...
    }
  }
}
```

In this example, by adding the simple `update` stanza to the "api-server" task
group, we inform Nomad that updates to the group should be handled with a
rolling update strategy.

Thus when a change is made to the job file that requires new allocations to be
made, Nomad will deploy 2 allocations at a time and require that the allocations
be running in a healthy state for 30 seconds before deploying more instances of
the new version.

By default Nomad determines allocation health by ensuring that all tasks in the
group are running and that any [service
checks](/docs/job-specification/service.html#check-parameters) the tasks
register are passing.

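For example, a task might register a service with an HTTP check along the lines
of the following sketch (the service name, port label, and path are illustrative
and assume the task exposes a network port labeled "http"):

```hcl
task "server" {
  driver = "docker"

  # ...

  service {
    name = "geo-api-server"
    port = "http"

    # The allocation only counts as healthy once this check is passing.
    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"
    }
  }
}
```
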
## Planning Changes
@@ -40,37 +76,36 @@ Suppose we make a change to a file to upgrade the version of a Docker container
that is configured with the same rolling update strategy from above.

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.1"
+        image = "geo-api-server:0.2"
```

The [`nomad plan` command](/docs/commands/plan.html) allows us to visualize the
series of steps the scheduler would perform. We can analyze this output to
confirm it is correct:

```text
$ nomad plan geo-api-server.nomad
```

Here is some sample output:

```text
+/- Job: "geo-api-server"
+/- Task Group: "api-server" (2 create/destroy update, 4 ignore)
  +/- Task: "server" (forces create/destroy update)
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad run -check-index 7 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

Here we can see that Nomad will begin a rolling update by creating and
destroying 2 allocations first, and, for the time being, ignoring 4 of the old
allocations, matching our configured `max_parallel`.

## Inspecting a Deployment

After running the plan we can submit the updated job by simply running `nomad
run`. Once run, Nomad will begin the rolling upgrade of our service by placing 2
allocations of the new version at a time and taking 2 of the old allocations
down.

We can inspect the current state of a rolling deployment using `nomad status`:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       4         0

Latest Deployment
ID          = c5b34665
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        4       2        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Here we can see that Nomad has created a deployment to conduct the rolling
upgrade from job version 0 to 1, has placed 4 instances of the new job, and has
stopped 4 of the old instances. If we look at the deployed allocations, we also
can see that Nomad has placed 4 instances of job version 1 but only considers 2
of them healthy. This is because the 2 newest placed allocations haven't been
healthy for the required 30 seconds yet.

If we wait for the deployment to complete and re-issue the command, we get the
following:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         6        0       6         0

Latest Deployment
ID          = c5b34665
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       6        6       6        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
d42a1656  f7b1ee08  api-server  1        run      running   07/26/17 18:10:10 UTC
401daaf9  f7b1ee08  api-server  1        run      running   07/26/17 18:10:00 UTC
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Nomad has successfully transitioned the group to running the updated canary and
did so with no downtime to our service by ensuring only two allocations were
changed at a time and that the newly placed allocations ran successfully. Had
any of the newly placed allocations failed their health check, Nomad would have
aborted the deployment and stopped placing new allocations. If configured, Nomad
can automatically revert back to the old job definition when the deployment
fails.

## Auto Reverting on Failed Deployments

In the case we do a deployment in which the new allocations are unhealthy, Nomad
will fail the deployment and stop placing new instances of the job. It
optionally supports automatically reverting back to the last stable job version
on deployment failure. Nomad keeps a history of submitted jobs and whether the
job version was stable. A job is considered stable if all its allocations are
healthy.

To enable this we simply add the `auto_revert` parameter to the `update` stanza:

```hcl
update {
  max_parallel     = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```

Now imagine we want to update our image to "geo-api-server:0.3" but we instead
update it to the below and run the job:

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.2"
+        image = "geo-api-server:0.33"
```

If we run `nomad job deployments` we can see that the deployment fails and Nomad
auto-reverts to the last stable job:

```text
$ nomad job deployments geo-api-server
ID        Job ID          Job Version  Status      Description
0c6f87a5  geo-api-server  3            successful  Deployment completed successfully
b1712b7f  geo-api-server  2            failed      Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce  geo-api-server  1            successful  Deployment completed successfully
72813fcf  geo-api-server  0            successful  Deployment completed successfully
```

Nomad job versions increment monotonically, so even though Nomad reverted to the
job specification at version 1, it creates a new job version. We can see the
differences between a job's versions and how Nomad auto-reverted the job using
the `job history` command:

```text
$ nomad job history -p geo-api-server
Version     = 3
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
        }

Version     = 2
Stable      = false
Submit Date = 07/26/17 18:45:21 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
        }

Version     = 1
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Version     = 0
Stable      = true
Submit Date = 07/26/17 18:43:43 UTC
```

We can see that Nomad considered the job versions running "geo-api-server:0.1"
and "geo-api-server:0.2" stable, but job version 2, which submitted the
incorrect image, is marked as unstable. This is because the placed allocations
failed to start. Nomad detected that the deployment failed and, as such, created
job version 3, which reverted back to the last healthy job.

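From here the fix is simply to correct the typo and resubmit the job with the
image we originally intended; a sketch of the corrected fragment:

```hcl
config {
  # The tag we meant to deploy, instead of the mistyped "geo-api-server:0.33".
  image = "geo-api-server:0.3"
}
```

Planning and running the job again would then start a new deployment from the
reverted, stable version.
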
[update]: /docs/job-specification/update.html "Nomad update Stanza"