Merge pull request #4148 from hashicorp/f-operating-job-guide

Added section on failure recovery under operating a job
This commit is contained in:
Alex Dadgar
2018-04-12 16:03:36 -07:00
committed by GitHub
5 changed files with 306 additions and 0 deletions

View File

@@ -0,0 +1,79 @@
---
layout: "docs"
page_title: "Check Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-check-restart"
description: |-
  Nomad can restart tasks that have a failing health check, based on
  configuration specified in the `check_restart` stanza. Restarts are done locally on the
  node running the task, according to the task group's `restart` policy.
---
# Check Restart Stanza
The [`check_restart` stanza][check restart] instructs Nomad when to restart tasks with unhealthy service checks.
When a health check in Consul has been unhealthy for the limit specified in the `check_restart` stanza,
the task is restarted according to its task group's restart policy.
The `limit` field specifies the number of times a failing health check is seen before local restarts are attempted.
Operators can also specify a `grace` duration to wait after a task restarts before checking its health.

We recommend configuring `check_restart` on services when a restart is likely to resolve the failure,
for example in the case of transient memory issues in the service.
## Example
The following `check_restart` stanza waits for two consecutive health check failures with a
10 second grace period, and considers both `critical` and `warning` statuses as failures:
```text
check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}
```
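In a job file, `check_restart` is nested inside a `check` (or `service`) stanza of a task. The following
sketch shows that placement; the service and port names match the CLI output below, while the check type,
path, interval, and timeout are illustrative assumptions:
```text
task "test" {
  # driver and config omitted

  service {
    name = "demo-service-test"
    port = "p1"              # assumes a network port labeled "p1"

    check {
      type     = "http"
      path     = "/health"   # placeholder health endpoint
      interval = "10s"
      timeout  = "2s"

      # Restart the task after 2 consecutive failing checks, waiting
      # 10s after each restart before health checks resume.
      check_restart {
        limit           = 2
        grace           = "10s"
        ignore_warnings = false
      }
    }
  }
}
```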
The following CLI output shows health check failures triggering restarts until the task's
restart limit is reached:
```text
$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                  = e1b43128
Eval ID             = 249cbfe9
Name                = demo.demo[0]
Node ID             = 221e998e
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m59s ago
Modified            = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client
```
[check restart]: /docs/job-specification/check_restart.html "Nomad check restart Stanza"

View File

@@ -0,0 +1,25 @@
---
layout: "docs"
page_title: "Handling Failures - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies"
description: |-
  This section describes features in Nomad that automate recovering from failed tasks.
---
# Failure Recovery Strategies
Most applications deployed on Nomad are either long-running services or one-time batch jobs.
They can fail for various reasons, such as:

- A temporary error in the service that resolves when it is restarted.
- An upstream dependency that is not available, leading to a health check failure.
- Disk, memory, or CPU contention on the node that the application is running on.
- The application uses Docker and the Docker daemon on that node is unresponsive.
Nomad provides configurable options for recovering failed tasks and avoiding downtime. Nomad will
try to restart a failed task on the node it is running on, and will also try to reschedule it on another node.
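Both behaviors are configured in the job file. A minimal sketch of a task group that combines local
restarts with rescheduling (the values are illustrative, not Nomad's defaults):
```text
group "demo" {
  # Restart a failed task locally up to 2 times within 5 minutes,
  # then mark it failed so it becomes eligible for rescheduling.
  restart {
    attempts = 2
    interval = "5m"
    delay    = "15s"
    mode     = "fail"
  }

  # Keep trying to place failed allocations on other nodes, waiting
  # 30s at first and backing off exponentially up to 1 hour.
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  task "demo" {
    # driver, config, service checks, etc.
  }
}
```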
Please see one of the guides below or use the navigation on the left for details on each option:
1. [Local Restarts](/docs/operating-a-job/failure-handling-strategies/restart.html)
1. [Check Restarts](/docs/operating-a-job/failure-handling-strategies/check-restart.html)
1. [Rescheduling](/docs/operating-a-job/failure-handling-strategies/rescheduling.html)

View File

@@ -0,0 +1,92 @@
---
layout: "docs"
page_title: "Reschedule Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-reschedule"
description: |-
  Nomad can reschedule failing tasks after any local restart attempts have been
  exhausted. This is useful for recovering from failures caused by problems on the node
  running the task.
---
# Reschedule Stanza
Tasks can sometimes fail due to network, CPU, or memory issues on the node running the task. In such situations,
Nomad can reschedule the task on another node. The [`reschedule` stanza][reschedule] configures how
Nomad should try placing failed tasks on another node in the cluster. Nomad waits between reschedule attempts,
and this delay can be configured to grow between successive attempts according to a configurable
`delay_function`. See the [`reschedule` stanza][reschedule] documentation for more information.

Service jobs are configured with unlimited reschedule attempts by default. We recommend using the `reschedule`
stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
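For example, a `reschedule` stanza like the following sketch (placed at the job or group level;
the values are illustrative) limits Nomad to five reschedule attempts per hour, with a delay that
grows exponentially between attempts:
```text
reschedule {
  # Allow up to 5 reschedule attempts within a 1 hour window.
  attempts = 5
  interval = "1h"

  # Wait 30s before the first reschedule attempt, growing the delay
  # exponentially after each attempt, up to a maximum of 10 minutes.
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "10m"

  # Stop rescheduling once the attempts above are exhausted.
  unlimited = false
}
```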
## Example
The following CLI output shows the job and allocation status for a task being rescheduled by Nomad.
When there is a limit on the number of reschedule attempts, the output shows how many attempts have
already been made, as well as when the next reschedule will be attempted.
```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago
```
```text
$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client
```
[reschedule]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza"

View File

@@ -0,0 +1,96 @@
---
layout: "docs"
page_title: "Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-local-restarts"
description: |-
  Nomad can restart a task on the node it is running on to recover from
  failures. Task restarts can be limited to a number of attempts within
  a specific interval.
---
# Restart Stanza
To enable restarting a failed task on the node it is running on, the task group can be annotated
with configurable options using the [`restart` stanza][restart]. Nomad will restart the failed task
up to `attempts` times within a provided `interval`. Operators can also choose, via the `mode` parameter,
whether to keep attempting restarts on the same node or to fail the task so that it can be rescheduled
on another node.

We recommend setting `mode` to `"fail"` in the restart stanza to allow rescheduling the task on another node.
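A `restart` stanza consistent with the behavior shown in the example below might look like the
following sketch (the values are illustrative, not Nomad's defaults):
```text
restart {
  # Allow 3 local restart attempts within a 5 minute window.
  attempts = 3
  interval = "5m"

  # Wait roughly 10s (plus a small jitter) between restarts.
  delay = "10s"

  # Once the attempts are exhausted, mark the task as failed so it
  # becomes eligible for rescheduling on another node.
  mode = "fail"
}
```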
## Example
The following CLI output shows the job status and allocation status for a failed task that is being restarted by Nomad.
Allocations are in the `pending` state while restarts are attempted. The `Recent Events` section in the
allocation status output shows the ongoing restart attempts.
```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago
```
In the following example, the allocation `ce5bf1d1` is restarted by Nomad approximately
every ten seconds, with a small random jitter. It eventually reaches its limit of three attempts and
transitions into a `failed` state, after which it becomes eligible for [rescheduling][rescheduling].
```text
$ nomad alloc-status ce5bf1d1
ID                  = ce5bf1d1
Eval ID             = 64e45d11
Name                = demo.demo[1]
Node ID             = a0ccdd8b
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 56s ago
Modified            = 22s ago

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Task Events:
Started At     = 2018-04-12T22:29:08Z
Finished At    = 2018-04-12T22:29:08Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:28:57-05:00

Recent Events:
Time                       Type            Description
2018-04-12T17:29:08-05:00  Not Restarting  Exceeded allowed attempts 3 in interval 5m0s and mode is "fail"
2018-04-12T17:29:08-05:00  Terminated      Exit Code: 127
2018-04-12T17:29:08-05:00  Started         Task started by client
2018-04-12T17:28:57-05:00  Restarting      Task restarting in 10.364602876s
2018-04-12T17:28:57-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:57-05:00  Started         Task started by client
2018-04-12T17:28:47-05:00  Restarting      Task restarting in 10.666963769s
2018-04-12T17:28:47-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:47-05:00  Started         Task started by client
2018-04-12T17:28:35-05:00  Restarting      Task restarting in 11.777324721s
```
[restart]: /docs/job-specification/restart.html "Nomad restart Stanza"
[rescheduling]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza"

View File

@@ -132,6 +132,20 @@
</li>
</ul>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/index.html">Failure Recovery Strategies</a>
<ul class="nav">
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-local-restarts") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/restart.html">Local Restarts</a>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-check-restart") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/check-restart.html">Check Restarts</a>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-reschedule") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/reschedule.html">Rescheduling</a>
</li>
</ul>
</li>
</ul>
</li>