Update operating a job, upgrade guide (#2913)

* Update operating a job, upgrade guide

This PR updates the guide for updating a job to reflect the changes in
Nomad 0.6

* Feedback changes

* Feedback

* Feedback
Alex Dadgar
2017-07-26 15:06:17 -07:00
committed by GitHub
parent b57b9af467
commit 6194ac4073
3 changed files with 651 additions and 109 deletions

View File

@@ -3,9 +3,8 @@ layout: "docs"
page_title: "Blue/Green & Canary Deployments - Operating a Job"
sidebar_current: "docs-operating-a-job-updating-blue-green-deployments"
description: |-
Nomad has built-in support for doing blue/green and canary deployments to more
safely update existing applications and services.
---
# Blue/Green & Canary Deployments
@@ -17,136 +16,438 @@ organizations prefer to put a "canary" build into production or utilize a
technique known as a "blue/green" deployment to ensure a safe application
rollout to production while minimizing downtime.
## Blue/Green Deployments
Blue/Green deployments have several other names including Red/Black or A/B, but
the concept is generally the same. In a blue/green deployment, there are two
application versions. Only one application version is active at a time, except
during the transition phase from one version to the next. The term "active"
tends to mean "receiving traffic" or "in service".
Imagine a hypothetical API server which has five instances deployed to
production at version 1.3, and we want to safely upgrade to version 1.4. We want
to create five new instances at version 1.4 and, once they are operating
correctly, promote them and take down the five instances running 1.3. In the
event of failure, we can quickly roll back to 1.3.
To start, we examine our job which is running in production:
```hcl
job "docs" {
  datacenters = ["dc1"]

  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 5
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```
We see that it has an `update` stanza with `canary` equal to the desired count.
This is what allows us to easily model blue/green deployments. When we change
the job to run the "api-server:1.4" image, Nomad will create 5 new allocations
without touching the original "api-server:1.3" allocations. Below we can see how
this works by changing the image to run the new version:
```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```
Next we plan and run these changes:
```text
$ nomad plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
+/- Task: "api-server" (forces create/destroy update)
+/- Config {
+/- image: "api-server:1.3" => "api-server:1.4"
}
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 7
To submit the job with version verification run:
nomad run -check-index 7 example.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad run docs.nomad
# ...
```
We can see from the plan output that Nomad is going to create 5 canaries running
the "api-server:1.4" image and ignore all the allocations running the older
image. Now if we examine the status of the job, we can see that both the blue
("api-server:1.3") and green ("api-server:1.4") sets are running.
```text
$ nomad status docs
ID = docs
Name = docs
Submit Date = 07/26/17 19:57:47 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
task "api-server" {
driver = "docker"
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 10 0 0 0
Latest Deployment
ID = 32a080c1
Status = running
Description = Deployment is running but requires promotion
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy
api true false 5 5 5 5 0
Allocations
ID Node ID Task Group Version Desired Status Created At
6d8eec42 087852e2 api 1 run running 07/26/17 19:57:47 UTC
7051480e 087852e2 api 1 run running 07/26/17 19:57:47 UTC
36c6610f 087852e2 api 1 run running 07/26/17 19:57:47 UTC
410ba474 087852e2 api 1 run running 07/26/17 19:57:47 UTC
85662a7a 087852e2 api 1 run running 07/26/17 19:57:47 UTC
3ac3fe05 087852e2 api 0 run running 07/26/17 19:53:56 UTC
4bd51979 087852e2 api 0 run running 07/26/17 19:53:56 UTC
2998387b 087852e2 api 0 run running 07/26/17 19:53:56 UTC
35b813ee 087852e2 api 0 run running 07/26/17 19:53:56 UTC
b53b4289 087852e2 api 0 run running 07/26/17 19:53:56 UTC
```
Now that we have the new set in production, we can route traffic to it and
validate that the new job version is working properly. Depending on whether the
new version is functioning properly or improperly, we will either want to
promote or fail the deployment.
### Promoting the Deployment
After deploying the new image alongside the old version we have determined it
is functioning properly and we want to transition fully to the new version.
Doing so is as simple as promoting the deployment:
```text
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
Evaluation triggered by job "docs"
Evaluation within deployment: "32a080c1"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```
If we look at the job's status we see that after promotion, Nomad stopped the
older allocations and is only running the new ones. This now completes our
blue/green deployment.
```text
$ nomad status docs
ID = docs
Name = docs
Submit Date = 07/26/17 19:57:47 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 5 0 5 0
Latest Deployment
ID = 32a080c1
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy
api true true 5 5 5 5 0
Allocations
ID Node ID Task Group Version Desired Status Created At
6d8eec42 087852e2 api 1 run running 07/26/17 19:57:47 UTC
7051480e 087852e2 api 1 run running 07/26/17 19:57:47 UTC
36c6610f 087852e2 api 1 run running 07/26/17 19:57:47 UTC
410ba474 087852e2 api 1 run running 07/26/17 19:57:47 UTC
85662a7a 087852e2 api 1 run running 07/26/17 19:57:47 UTC
3ac3fe05 087852e2 api 0 stop complete 07/26/17 19:53:56 UTC
4bd51979 087852e2 api 0 stop complete 07/26/17 19:53:56 UTC
2998387b 087852e2 api 0 stop complete 07/26/17 19:53:56 UTC
35b813ee 087852e2 api 0 stop complete 07/26/17 19:53:56 UTC
b53b4289 087852e2 api 0 stop complete 07/26/17 19:53:56 UTC
```
### Failing the Deployment
After deploying the new image alongside the old version we have determined it
is not functioning properly and we want to roll back to the old version. Doing
so is as simple as failing the deployment:
```text
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.
==> Monitoring evaluation "6840f512"
Evaluation triggered by job "docs"
Evaluation within deployment: "32a080c1"
Allocation "0ccb732f" modified: node "36e7a123", group "api"
Allocation "64d4f282" modified: node "36e7a123", group "api"
Allocation "664e33c7" modified: node "36e7a123", group "api"
Allocation "a4cb6a4b" modified: node "36e7a123", group "api"
Allocation "fdd73bdd" modified: node "36e7a123", group "api"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```
If we now look at the job's status, we can see that after failing the
deployment, Nomad stopped the new allocations and is only running the old ones,
and that it reverted the working copy of the job back to the original
specification running "api-server:1.3".
```text
$ nomad status docs
ID = docs
Name = docs
Submit Date = 07/26/17 19:57:47 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 5 0 5 0
Latest Deployment
ID = 6f3f84b3
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy
api true 5 5 5 0
Allocations
ID Node ID Task Group Version Desired Status Created At
27dc2a42 36e7a123 api 1 stop complete 07/26/17 20:07:31 UTC
5b7d34bb 36e7a123 api 1 stop complete 07/26/17 20:07:31 UTC
983b487d 36e7a123 api 1 stop complete 07/26/17 20:07:31 UTC
d1cbf45a 36e7a123 api 1 stop complete 07/26/17 20:07:31 UTC
d6b46def 36e7a123 api 1 stop complete 07/26/17 20:07:31 UTC
0ccb732f 36e7a123 api 2 run running 07/26/17 20:06:29 UTC
64d4f282 36e7a123 api 2 run running 07/26/17 20:06:29 UTC
664e33c7 36e7a123 api 2 run running 07/26/17 20:06:29 UTC
a4cb6a4b 36e7a123 api 2 run running 07/26/17 20:06:29 UTC
fdd73bdd 36e7a123 api 2 run running 07/26/17 20:06:29 UTC
$ nomad job deployments docs
ID Job ID Job Version Status Description
6f3f84b3 docs 2 successful Deployment completed successfully
32a080c1 docs 1 failed Deployment marked as failed - rolling back to job version 0
c4c16494 docs 0 successful Deployment completed successfully
```
## Canary Deployments
Canary updates are a useful way to test a new version of a job before beginning
a rolling upgrade. The `update` stanza supports setting the number of canaries
the job operator would like Nomad to create when the job changes via the
`canary` parameter. When the job specification is updated, Nomad creates the
canaries without stopping any allocations from the previous job.

This pattern allows operators to achieve higher confidence in the new job
version because they can route traffic, examine logs, etc., to determine whether
the new application is performing properly.
```hcl
job "docs" {
  datacenters = ["dc1"]

  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```
In the example above, the `update` stanza tells Nomad to create a single canary
when the job specification is changed. Below we can see how this works by
changing the image to run the new version:
```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```
Next we plan and run these changes:
```text
$ nomad plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
+/- Task: "api-server" (forces create/destroy update)
+/- Config {
+/- image: "api-server:1.3" => "api-server:1.4"
}
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 7
To submit the job with version verification run:
nomad run -check-index 7 example.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad run docs.nomad
# ...
```
We can see from the plan output that Nomad is going to create 1 canary that
will run the "api-server:1.4" image and ignore all the allocations running
the older image. If we inspect the status we see that the canary is running
alongside the older version of the job:
```text
$ nomad status docs
ID = docs
Name = docs
Submit Date = 07/26/17 19:57:47 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 6 0 0 0
Latest Deployment
ID = 32a080c1
Status = running
Description = Deployment is running but requires promotion
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy
api true false 5 1 1 1 0
Allocations
ID Node ID Task Group Version Desired Status Created At
85662a7a 087852e2 api 1 run running 07/26/17 19:57:47 UTC
3ac3fe05 087852e2 api 0 run running 07/26/17 19:53:56 UTC
4bd51979 087852e2 api 0 run running 07/26/17 19:53:56 UTC
2998387b 087852e2 api 0 run running 07/26/17 19:53:56 UTC
35b813ee 087852e2 api 0 run running 07/26/17 19:53:56 UTC
b53b4289 087852e2 api 0 run running 07/26/17 19:53:56 UTC
```
Now if we promote the canary, this will trigger a rolling update to replace the
remaining allocations running the older image. The rolling update will happen at
a rate of `max_parallel`, so in this case one allocation at a time:
```text
$ nomad deployment promote ed28f6c2
==> Monitoring evaluation "37033151"
Evaluation triggered by job "docs"
Evaluation within deployment: "ed28f6c2"
Allocation "f5057465" created: node "f6646949", group "cache"
Allocation "f5057465" status changed: "pending" -> "running"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"
$ nomad status docs
ID = docs
Name = docs
Submit Date = 07/26/17 20:28:59 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 5 0 2 0
Latest Deployment
ID = ed28f6c2
Status = running
Description = Deployment is running
Deployed
Task Group Auto Revert Promoted Desired Canaries Placed Healthy Unhealthy
api true true 5 1 2 1 0
Allocations
ID Node ID Task Group Version Desired Status Created At
f5057465 f6646949 api 1 run running 07/26/17 20:29:23 UTC
b1c88d20 f6646949 api 1 run running 07/26/17 20:28:59 UTC
1140bacf f6646949 api 0 run running 07/26/17 20:28:37 UTC
1958a34a f6646949 api 0 run running 07/26/17 20:28:37 UTC
4bda385a f6646949 api 0 run running 07/26/17 20:28:37 UTC
62d96f06 f6646949 api 0 stop complete 07/26/17 20:28:37 UTC
f58abbb2 f6646949 api 0 stop complete 07/26/17 20:28:37 UTC
```
Alternatively, if the canary was not performing properly, we could abandon the
change using the `nomad deployment fail` command, similar to the blue/green
example.
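For instance, using the deployment ID from the status output above, abandoning
this canary would mirror the earlier blue/green example (a sketch; output
omitted):

```text
$ nomad deployment fail ed28f6c2
```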

View File

@@ -12,10 +12,11 @@ description: |-
Most applications are long-lived and require updates over time. Whether you are
deploying a new version of your web application or upgrading to a new version of
Redis, Nomad has built-in support for rolling, blue/green, and canary updates.
When a job specifies a rolling update, Nomad uses task state and health check
information in order to detect allocation health and minimize or eliminate
downtime. This section and subsections will explore how to do so safely with
Nomad.
Please see one of the guides below or use the navigation on the left:

View File

@@ -4,35 +4,71 @@ page_title: "Rolling Upgrades - Operating a Job"
sidebar_current: "docs-operating-a-job-updating-rolling-upgrades"
description: |-
In order to update a service while reducing downtime, Nomad provides a
built-in mechanism for rolling upgrades. Rolling upgrades incrementally
transition jobs between versions, using health check information to
reduce downtime.
---
# Rolling Upgrades
Nomad supports rolling updates as a first-class feature. To enable rolling
updates, a job or task group is annotated with a high-level description of the
update strategy using the [`update` stanza][update]. Under the hood, Nomad
handles limiting parallelism, interfacing with Consul to determine service
health and even automatically reverting to an older, healthy job when a
deployment fails.
## Enabling Rolling Updates
Rolling updates are enabled by adding the [`update` stanza][update] to the job
specification. The `update` stanza may be placed at the job level or in an
individual task group. When placed at the job level, the update strategy is
inherited by all task groups in the job. When placed at both the job and group
level, the `update` stanzas are merged, with group stanzas taking precedence
over job level stanzas. See the [`update` stanza
documentation](/docs/job-specification/update.html#upgrade-stanza-inheritance)
for an example.
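As a minimal sketch of that inheritance (hypothetical job name and values), a
group can override a single parameter while inheriting the rest from the job
level:

```hcl
job "docs-sketch" {
  # Job-level defaults inherited by every task group in the job.
  update {
    max_parallel     = 2
    min_healthy_time = "30s"
    healthy_deadline = "10m"
  }

  group "api" {
    # Group-level override: merged with the job-level stanza, so this group
    # updates one allocation at a time but keeps the other settings above.
    update {
      max_parallel = 1
    }

    # ...
  }
}
```

The guide's running example below uses a single group-level stanza: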
```hcl
job "geo-api-server" {
  # ...

  group "api-server" {
    count = 6

    # Add an update stanza to enable rolling updates of the service
    update {
      max_parallel     = 2
      min_healthy_time = "30s"
      healthy_deadline = "10m"
    }

    task "server" {
      driver = "docker"

      config {
        image = "geo-api-server:0.1"
      }

      # ...
    }
  }
}
```
In this example, by adding the simple `update` stanza to the "api-server" task
group, we inform Nomad that updates to the group should be handled with a
rolling update strategy.
Thus when a change is made to the job file that requires new allocations to be
made, Nomad will deploy 2 allocations at a time and require that the allocations
be running in a healthy state for 30 seconds before deploying more allocations
of the new version.
By default Nomad determines allocation health by ensuring that all tasks in the
group are running and that any [service
checks](/docs/job-specification/service.html#check-parameters) the tasks
register are passing.
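For example, a task might register a service with an HTTP check similar to the
following sketch (the service name, port label, and health endpoint are
illustrative placeholders):

```hcl
task "server" {
  driver = "docker"

  config {
    image = "geo-api-server:0.1"
  }

  resources {
    network {
      mbits = 10
      port "http" {}
    }
  }

  service {
    name = "geo-api-server"
    port = "http"

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"
    }
  }
}
```

With this in place, an allocation only counts as healthy once the task is
running and this check is passing.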
## Planning Changes
@@ -40,37 +76,36 @@ Suppose we make a change to a file to upgrade the version of a Docker container
that is configured with the same rolling update strategy from above.
```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
  group "api-server" {
    task "server" {
      driver = "docker"

      config {
-       image = "geo-api-server:0.1"
+       image = "geo-api-server:0.2"
```
The [`nomad plan` command](/docs/commands/plan.html) allows
us to visualize the series of steps the scheduler would perform. We can analyze
this output to confirm it is correct:
```text
$ nomad plan geo-api-server.nomad
```
Here is some sample output:
```text
+/- Job: "geo-api-server"
+/- Task Group: "api-server" (2 create/destroy update, 4 ignore)
+/- Task: "server" (forces create/destroy update)
+/- Config {
+/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
}
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 7
To submit the job with version verification run:
@@ -83,8 +118,213 @@ changed, another user has modified the job and the plan's results are
potentially invalid.
```
Here we can see that Nomad will begin a rolling update by creating and
destroying 2 allocations first and, for the time being, ignoring 4 of the old
allocations, matching our configured `max_parallel`.
For more details on the `update` block, see the
[job specification documentation](/docs/job-specification/update.html).
## Inspecting a Deployment
After running the plan we can submit the updated job by simply running `nomad
run`. Once run, Nomad will begin the rolling upgrade of our service by placing
2 allocations of the new version at a time and stopping 2 of the old
allocations.
We can inspect the current state of a rolling deployment using `nomad status`:
```text
$ nomad status geo-api-server
ID = geo-api-server
Name = geo-api-server
Submit Date = 07/26/17 18:08:56 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api-server 0 0 6 0 4 0
Latest Deployment
ID = c5b34665
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy
api-server 6 4 2 0
Allocations
ID Node ID Task Group Version Desired Status Created At
14d288e8 f7b1ee08 api-server 1 run running 07/26/17 18:09:17 UTC
a134f73c f7b1ee08 api-server 1 run running 07/26/17 18:09:17 UTC
a2574bb6 f7b1ee08 api-server 1 run running 07/26/17 18:08:56 UTC
496e7aa2 f7b1ee08 api-server 1 run running 07/26/17 18:08:56 UTC
9fc96fcc f7b1ee08 api-server 0 run running 07/26/17 18:04:30 UTC
2521c47a f7b1ee08 api-server 0 run running 07/26/17 18:04:30 UTC
6b794fcb f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
9bc11bd7 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
691eea24 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
af115865 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
```
Here we can see that Nomad has created a deployment to conduct the rolling
upgrade from job version 0 to 1; it has placed 4 instances of the new job
version and stopped 4 of the old instances. Looking at the deployed allocations,
we can also see that Nomad has placed 4 instances of job version 1 but only
considers 2 of them healthy. This is because the 2 most recently placed
allocations haven't been healthy for the required 30 seconds yet.
If we wait for the deployment to complete and re-issue the command, we get the
following:
```text
$ nomad status geo-api-server
ID = geo-api-server
Name = geo-api-server
Submit Date = 07/26/17 18:08:56 UTC
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api-server 0 0 6 0 6 0
Latest Deployment
ID = c5b34665
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy
api-server 6 6 6 0
Allocations
ID Node ID Task Group Version Desired Status Created At
d42a1656 f7b1ee08 api-server 1 run running 07/26/17 18:10:10 UTC
401daaf9 f7b1ee08 api-server 1 run running 07/26/17 18:10:00 UTC
14d288e8 f7b1ee08 api-server 1 run running 07/26/17 18:09:17 UTC
a134f73c f7b1ee08 api-server 1 run running 07/26/17 18:09:17 UTC
a2574bb6 f7b1ee08 api-server 1 run running 07/26/17 18:08:56 UTC
496e7aa2 f7b1ee08 api-server 1 run running 07/26/17 18:08:56 UTC
9fc96fcc f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
2521c47a f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
6b794fcb f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
9bc11bd7 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
691eea24 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
af115865 f7b1ee08 api-server 0 stop complete 07/26/17 18:04:30 UTC
```
Nomad has successfully transitioned the group to running the updated version and
did so with no downtime to our service, by ensuring only two allocations were
changed at a time and that the newly placed allocations ran successfully. Had
any of the newly placed allocations failed their health check, Nomad would have
aborted the deployment and stopped placing new allocations. If configured, Nomad
can automatically revert back to the old job definition when the deployment
fails.
## Auto Reverting on Failed Deployments
If we do a deployment in which the new allocations are unhealthy, Nomad will
fail the deployment and stop placing new instances of the job. It optionally
supports automatically reverting back to the last stable job version on
deployment failure. Nomad keeps a history of submitted jobs and whether each
job version was stable. A job version is considered stable if all its
allocations are healthy.
To enable this we simply add the `auto_revert` parameter to the `update` stanza:
```hcl
update {
  max_parallel     = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```
Now imagine we want to update our image to "geo-api-server:0.3" but we instead
update it to the incorrect image below and run the job:
```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
  group "api-server" {
    task "server" {
      driver = "docker"

      config {
-       image = "geo-api-server:0.2"
+       image = "geo-api-server:0.33"
```
If we run `nomad job deployments` we can see that the deployment fails and Nomad
auto-reverts to the last stable job:
```text
$ nomad job deployments geo-api-server
ID Job ID Job Version Status Description
0c6f87a5 geo-api-server 3 successful Deployment completed successfully
b1712b7f geo-api-server 2 failed Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce geo-api-server 1 successful Deployment completed successfully
72813fcf geo-api-server 0 successful Deployment completed successfully
```
Nomad job versions increment monotonically, so even though Nomad reverted to the
job specification at version 1, it creates a new job version. We can see the
differences between a job's versions and how Nomad auto-reverted the job using
the `job history` command:
```text
$ nomad job history -p geo-api-server
Version = 3
Stable = true
Submit Date = 07/26/17 18:44:18 UTC
Diff =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
+/- Task: "server"
+/- Config {
+/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
}
Version = 2
Stable = false
Submit Date = 07/26/17 18:45:21 UTC
Diff =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
+/- Task: "server"
+/- Config {
+/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
}
Version = 1
Stable = true
Submit Date = 07/26/17 18:44:18 UTC
Diff =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
+/- Task: "server"
+/- Config {
+/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
}
Version = 0
Stable = true
Submit Date = 07/26/17 18:43:43 UTC
```
We can see that Nomad considered the job running "geo-api-server:0.1" and
"geo-api-server:0.2" as stable, but job version 2, which submitted the incorrect
image, is marked as unstable. This is because the placed allocations failed to
start. Nomad detected that the deployment failed and, as a result, created job
version 3 that reverted back to the last healthy job.
[update]: /docs/job-specification/update.html "Nomad update Stanza"