Merge pull request #4148 from hashicorp/f-operating-job-guide

Added section on failure recovery under operating a job
This commit is contained in:
Alex Dadgar
2018-04-12 16:03:36 -07:00
committed by GitHub
5 changed files with 306 additions and 0 deletions

View File

@@ -0,0 +1,79 @@
---
layout: "docs"
page_title: "Check Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-check-restart"
description: |-
  Nomad can restart tasks that have a failing health check, based on
  configuration specified in the `check_restart` stanza. Restarts are done locally on the
  node running the task, according to the task group's `restart` policy.
---
# Check Restart Stanza
The [`check_restart` stanza][check restart] instructs Nomad when to restart tasks with unhealthy service checks.
When a health check in Consul has been unhealthy for the limit specified in the `check_restart` stanza,
the task is restarted according to its task group's restart policy.
The `limit` field specifies the number of times a failing health check is seen before local restarts are attempted.
Operators can also specify a `grace` duration to wait after a task restarts before checking its health.

We recommend configuring `check_restart` on services when a restart is likely to resolve the failure,
for example in the case of transient memory issues in the service.
## Example
The following `check_restart` stanza waits for two consecutive health check failures with a
10 second grace period, and considers both `critical` and `warning` statuses as failures:
```text
check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}
```
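In a job file, `check_restart` is nested inside a `check` (or `service`) stanza of a task. The following
sketch shows that placement; the service and port names match the CLI output below, while the check type,
path, interval, and timeout are illustrative assumptions:
```text
task "test" {
  # driver and config omitted

  service {
    name = "demo-service-test"
    port = "p1"              # assumes a network port labeled "p1"

    check {
      type     = "http"
      path     = "/health"   # placeholder health endpoint
      interval = "10s"
      timeout  = "2s"

      # Restart the task after 2 consecutive failing checks, waiting
      # 10s after each restart before health checks resume.
      check_restart {
        limit           = 2
        grace           = "10s"
        ignore_warnings = false
      }
    }
  }
}
```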
The following CLI output shows health check failures triggering restarts until the task's
restart limit is reached:
```text
$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                  = e1b43128
Eval ID             = 249cbfe9
Name                = demo.demo[0]
Node ID             = 221e998e
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m59s ago
Modified            = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client
```
[check restart]: /docs/job-specification/check_restart.html "Nomad check restart Stanza"

View File

@@ -0,0 +1,25 @@
---
layout: "docs"
page_title: "Handling Failures - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies"
description: |-
  This section describes features in Nomad that automate recovering from failed tasks.
---
# Failure Recovery Strategies
Most applications deployed on Nomad are either long-running services or one-time batch jobs.
They can fail for various reasons, such as:

- A temporary error in the service that resolves when it is restarted.
- An upstream dependency that is not available, leading to a health check failure.
- Disk, memory, or CPU contention on the node that the application is running on.
- The application uses Docker and the Docker daemon on that node is unresponsive.
Nomad provides configurable options for recovering failed tasks and avoiding downtime. Nomad will
try to restart a failed task on the node it is running on, and will also try to reschedule it on another node.
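Both behaviors are configured in the job file. A minimal sketch of a task group that combines local
restarts with rescheduling (the values are illustrative, not Nomad's defaults):
```text
group "demo" {
  # Restart a failed task locally up to 2 times within 5 minutes,
  # then mark it failed so it becomes eligible for rescheduling.
  restart {
    attempts = 2
    interval = "5m"
    delay    = "15s"
    mode     = "fail"
  }

  # Keep trying to place failed allocations on other nodes, waiting
  # 30s at first and backing off exponentially up to 1 hour.
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }

  task "demo" {
    # driver, config, service checks, etc.
  }
}
```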
Please see one of the guides below or use the navigation on the left for details on each option:
1. [Local Restarts](/docs/operating-a-job/failure-handling-strategies/restart.html)
1. [Check Restarts](/docs/operating-a-job/failure-handling-strategies/check-restart.html)
1. [Rescheduling](/docs/operating-a-job/failure-handling-strategies/rescheduling.html)

View File

@@ -0,0 +1,92 @@
---
layout: "docs"
page_title: "Reschedule Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-reschedule"
description: |-
  Nomad can reschedule failing tasks after any local restart attempts have been
  exhausted. This is useful for recovering from failures caused by problems on the node
  running the task.
---
# Reschedule Stanza
Tasks can sometimes fail due to network, CPU, or memory issues on the node running the task. In such situations,
Nomad can reschedule the task on another node. The [`reschedule` stanza][reschedule] configures how
Nomad should try placing failed tasks on another node in the cluster. Nomad waits between reschedule attempts,
and this delay can be configured to grow between successive attempts according to a configurable
`delay_function`. See the [`reschedule` stanza][reschedule] documentation for more information.

Service jobs are configured with unlimited reschedule attempts by default. We recommend using the `reschedule`
stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
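For example, a `reschedule` stanza like the following sketch (placed at the job or group level;
the values are illustrative) limits Nomad to five reschedule attempts per hour, with a delay that
grows exponentially between attempts:
```text
reschedule {
  # Allow up to 5 reschedule attempts within a 1 hour window.
  attempts = 5
  interval = "1h"

  # Wait 30s before the first reschedule attempt, growing the delay
  # exponentially after each attempt, up to a maximum of 10 minutes.
  delay          = "30s"
  delay_function = "exponential"
  max_delay      = "10m"

  # Stop rescheduling once the attempts above are exhausted.
  unlimited = false
}
```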
## Example
The following CLI output shows the job and allocation status for a task being rescheduled by Nomad.
When there is a limit on the number of reschedule attempts, the output shows how many attempts have
already been made, as well as when the next reschedule will be attempted.
```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago
```
```text
$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client
```
[reschedule]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza"

View File

@@ -0,0 +1,96 @@
---
layout: "docs"
page_title: "Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-local-restarts"
description: |-
  Nomad can restart a task on the node it is running on to recover from
  failures. Task restarts can be limited to a number of attempts within
  a specific interval.
---
# Restart Stanza
To enable restarting a failed task on the node it is running on, the task group can be annotated
with configurable options using the [`restart` stanza][restart]. Nomad will restart the failed task
up to `attempts` times within a provided `interval`. Operators can also choose, via the `mode` parameter,
whether to keep attempting restarts on the same node or to fail the task so that it can be rescheduled
on another node.

We recommend setting `mode` to `"fail"` in the restart stanza to allow rescheduling the task on another node.
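A `restart` stanza consistent with the behavior shown in the example below might look like the
following sketch (the values are illustrative, not Nomad's defaults):
```text
restart {
  # Allow 3 local restart attempts within a 5 minute window.
  attempts = 3
  interval = "5m"

  # Wait roughly 10s (plus a small jitter) between restarts.
  delay = "10s"

  # Once the attempts are exhausted, mark the task as failed so it
  # becomes eligible for rescheduling on another node.
  mode = "fail"
}
```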
## Example
The following CLI output shows the job status and allocation status for a failed task that is being restarted by Nomad.
Allocations are in the `pending` state while restarts are attempted. The `Recent Events` section in the
allocation status output shows the ongoing restart attempts.
```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago
```
In the following example, the allocation `ce5bf1d1` is restarted by Nomad approximately
every ten seconds, with a small random jitter. It eventually reaches its limit of three attempts and
transitions into a `failed` state, after which it becomes eligible for [rescheduling][rescheduling].
```text
$ nomad alloc-status ce5bf1d1
ID                  = ce5bf1d1
Eval ID             = 64e45d11
Name                = demo.demo[1]
Node ID             = a0ccdd8b
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 56s ago
Modified            = 22s ago

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Task Events:
Started At     = 2018-04-12T22:29:08Z
Finished At    = 2018-04-12T22:29:08Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:28:57-05:00

Recent Events:
Time                       Type            Description
2018-04-12T17:29:08-05:00  Not Restarting  Exceeded allowed attempts 3 in interval 5m0s and mode is "fail"
2018-04-12T17:29:08-05:00  Terminated      Exit Code: 127
2018-04-12T17:29:08-05:00  Started         Task started by client
2018-04-12T17:28:57-05:00  Restarting      Task restarting in 10.364602876s
2018-04-12T17:28:57-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:57-05:00  Started         Task started by client
2018-04-12T17:28:47-05:00  Restarting      Task restarting in 10.666963769s
2018-04-12T17:28:47-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:47-05:00  Started         Task started by client
2018-04-12T17:28:35-05:00  Restarting      Task restarting in 11.777324721s
```
[restart]: /docs/job-specification/restart.html "Nomad restart Stanza"
[rescheduling]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza"

View File

@@ -132,6 +132,20 @@
</li>
</ul>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/index.html">Failure Recovery Strategies</a>
<ul class="nav">
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-local-restarts") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/restart.html">Local Restarts</a>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-check-restart") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/check-restart.html">Check Restarts</a>
</li>
<li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-reschedule") %>>
<a href="/docs/operating-a-job/failure-handling-strategies/reschedule.html">Rescheduling</a>
</li>
</ul>
</li>
</ul>
</li>