From c3243c89c4b6199dd8f2e5f9312a26649f45d0f6 Mon Sep 17 00:00:00 2001 From: Preetha Appan Date: Thu, 12 Apr 2018 15:57:06 -0500 Subject: [PATCH 1/4] Added section on failure recovery under operating a job with details and examples of different restarts. --- .../check-restart.html.md | 23 +++++ .../failure-handling-strategies/index.html.md | 25 +++++ .../reschedule.html.md | 92 +++++++++++++++++++ .../restart.html.md | 91 ++++++++++++++++++ website/source/layouts/docs.erb | 14 +++ 5 files changed, 245 insertions(+) create mode 100644 website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md create mode 100644 website/source/docs/operating-a-job/failure-handling-strategies/index.html.md create mode 100644 website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md create mode 100644 website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md new file mode 100644 index 000000000..a8ae06c11 --- /dev/null +++ b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md @@ -0,0 +1,23 @@ +--- +layout: "docs" +page_title: "Check Restart Stanza - Operating a Job" +sidebar_current: "docs-operating-a-job-failure-handling-strategies-check-restart" +description: |- + Nomad can restart service job tasks if they have a failing health check based on + configuration specified in the `check_restart` stanza. Restarts are done locally on the node + running the task based on its `restart` policy. +--- + +# Check Restart Stanza + +The [`check_restart` stanza][check restart] instructs Nomad when to restart tasks with unhealthy service checks. +When a health check in Consul has been unhealthy for the limit specified in a `check_restart` stanza, +the task is restarted according to the task group's restart policy.
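For orientation, the `check_restart` stanza nests inside a service's `check` block. A minimal sketch (the service name, check type, path, and timing values below are illustrative, not defaults):

```hcl
service {
  name = "demo-service"
  port = "http"

  check {
    type     = "http"
    path     = "/health" # hypothetical health endpoint
    interval = "10s"
    timeout  = "2s"

    # Restart the task after the check has failed this many times in a row
    check_restart {
      limit           = 2
      grace           = "10s"
      ignore_warnings = false
    }
  }
}
```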
+ +The `limit` field specifies the number of times a failing health check is seen before local restarts are attempted. +Operators can also specify a `grace` duration to wait after a task restarts before checking its health. + +We recommend configuring the check restart on services if it's likely that a restart would resolve the failure. This +is applicable in cases like temporary memory issues on the service. + +[check restart]: /docs/job-specification/check_restart.html "Nomad check restart Stanza" \ No newline at end of file diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md new file mode 100644 index 000000000..087040c3c --- /dev/null +++ b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md @@ -0,0 +1,25 @@ +--- +layout: "docs" +page_title: "Handling Failures - Operating a Job" +sidebar_current: "docs-operating-a-job-failure-handling-strategies" +description: |- + This section describes features in Nomad that automate recovering from failed tasks. +--- + +# Failure Recovery Strategies + +Most applications deployed in Nomad are either long-running services or one-time batch jobs. +They can fail for various reasons like: + +- A temporary error in the service that resolves when it's restarted. +- An upstream dependency might not be available, leading to a health check failure. +- Disk, Memory or CPU contention on the node that the application is running on. +- The application uses Docker and the Docker daemon on that node is no longer running. + +Nomad provides configurable options to enable recovering failed tasks to avoid downtime. Nomad will +try to restart a failed task on the node it is running on, and also try to reschedule it on another node. +Please see one of the guides below or use the navigation on the left for details on each option: + +1.
[Local Restarts](/docs/operating-a-job/failure-handling-strategies/restart.html) +1. [Check Restarts](/docs/operating-a-job/failure-handling-strategies/check-restart.html) +1. [Rescheduling](/docs/operating-a-job/failure-handling-strategies/reschedule.html) diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md new file mode 100644 index 000000000..e3e3f2f80 --- /dev/null +++ b/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md @@ -0,0 +1,92 @@ +--- +layout: "docs" +page_title: "Reschedule Stanza - Operating a Job" +sidebar_current: "docs-operating-a-job-failure-handling-strategies-reschedule" +description: |- + Nomad can reschedule failing tasks after any local restart attempts have been + exhausted. This is useful to recover from failures stemming from problems on the node + running the task. +--- + +# Reschedule Stanza + +Tasks can sometimes fail due to network, CPU or memory issues on the node running the task. In such situations, +Nomad can reschedule the task on another node. The [`reschedule` stanza][reschedule] can be used to configure how +Nomad should try placing failed tasks on another node in the cluster. Reschedule attempts have a delay between +each attempt, and the delay can be configured to increase between each rescheduling attempt according to a configurable +`delay-function`. See the [documentation][reschedule] for more information on all the options for rescheduling. + +Service jobs are configured by default to have unlimited reschedule attempts. We recommend using the reschedule +stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention. + +## Example +The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad.
The CLI shows the number of previous attempts if there is a limit on reschedule attempts, as well as when the next reschedule will be attempted. + +```text +$nomad job status demo +ID = demo +Name = demo +Submit Date = 2018-04-12T15:48:37-05:00 +Type = service +Priority = 50 +Datacenters = dc1 +Status = pending +Periodic = false +Parameterized = false + +Summary +Task Group Queued Starting Running Failed Complete Lost +demo 0 0 0 2 0 0 + +Future Rescheduling Attempts +Task Group Eval ID Eval Time +demo ee3de93f 5s from now + +Allocations +ID Node ID Task Group Version Desired Status Created Modified +39d7823d f2c2eaa6 demo 0 run failed 5s ago 5s ago +fafb011b f2c2eaa6 demo 0 run failed 11s ago 10s ago + +``` + +```text +$nomad alloc status 3d0b +ID = 3d0bbdb1 +Eval ID = 79b846a9 +Name = demo.demo[0] +Node ID = 8a184f31 +Job ID = demo +Job Version = 0 +Client Status = failed +Client Description = +Desired Status = run +Desired Description = +Created = 15s ago +Modified = 15s ago +Reschedule Attempts = 3/5 +Reschedule Eligibility = 25s from now + +Task "demo" is "dead" +Task Resources +CPU Memory Disk IOPS Addresses +100 MHz 300 MiB 300 MiB 0 p1: 127.0.0.1:27646 + +Task Events: +Started At = 2018-04-12T20:44:25Z +Finished At = 2018-04-12T20:44:25Z +Total Restarts = 0 +Last Restart = N/A + +Recent Events: +Time Type Description +2018-04-12T15:44:25-05:00 Not Restarting Policy allows no restarts +2018-04-12T15:44:25-05:00 Terminated Exit Code: 127 +2018-04-12T15:44:25-05:00 Started Task started by client +2018-04-12T15:44:25-05:00 Task Setup Building Task Directory +2018-04-12T15:44:25-05:00 Received Task received by client + +``` + +[reschedule]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza" \ No newline at end of file diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md new file mode 100644 index
000000000..1cff939c8 --- /dev/null +++ b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md @@ -0,0 +1,91 @@ +--- +layout: "docs" +page_title: "Restart Stanza - Operating a Job" +sidebar_current: "docs-operating-a-job-failure-handling-strategies-local-restarts" +description: |- + Nomad can restart a task on the node it is running on to recover from + failures. Task restarts can be limited to a number of attempts within + a specific interval. +--- + +# Restart Stanza + +To enable restarting a failed task on the node it is running on, the task group can be annotated +with configurable options using the [`restart` stanza][restart]. Nomad will restart the failed task +upto `attempts` times within a provided `interval`. Operators can also choose whether to +keep attempting restarts on the same node, or to fail the task so that it can be rescheduled +on another node, via the `mode` parameter. + +We recommend setting `mode` to `fail` in the `restart` stanza to allow rescheduling the task on another node. + + +## Example +The following CLI example shows job status and allocation status for a failed task that is being restarted by Nomad. +Allocations are in the `pending` state while restarts are attempted. The `Recent Events` section in the CLI +shows ongoing restart attempts.
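A `restart` stanza that produces behavior like the CLI output below might look something like this (a sketch; the values are illustrative, not Nomad's defaults):

```hcl
restart {
  attempts = 3      # allow up to three local restarts...
  interval = "5m"   # ...within a rolling five-minute window
  delay    = "10s"  # wait between restart attempts
  mode     = "fail" # then fail the task so it can be rescheduled elsewhere
}
```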
+ +```text +$nomad job status demo +ID = demo +Name = demo +Submit Date = 2018-04-12T14:37:18-05:00 +Type = service +Priority = 50 +Datacenters = dc1 +Status = running +Periodic = false +Parameterized = false + +Summary +Task Group Queued Starting Running Failed Complete Lost +demo 0 3 0 0 0 0 + +Allocations +ID Node ID Task Group Version Desired Status Created Modified +ce5bf1d1 8a184f31 demo 0 run pending 27s ago 5s ago +d5dee7c8 8a184f31 demo 0 run pending 27s ago 5s ago +ed815997 8a184f31 demo 0 run pending 27s ago 5s ago +``` + +```text +$nomad alloc-status ce5b +ID = ce5bf1d1 +Eval ID = 05681b90 +Name = demo.demo[1] +Node ID = 8a184f31 +Job ID = demo +Job Version = 0 +Client Status = pending +Client Description = +Desired Status = run +Desired Description = +Created = 31s ago +Modified = 9s ago + +Task "demo" is "pending" +Task Resources +CPU Memory Disk IOPS Addresses +100 MHz 300 MiB 300 MiB 0 + +Task Events: +Started At = 2018-04-12T19:37:40Z +Finished At = N/A +Total Restarts = 3 +Last Restart = 2018-04-12T14:37:40-05:00 + +Recent Events: +Time Type Description +2018-04-12T14:37:40-05:00 Restarting Task restarting in 11.686056069s +2018-04-12T14:37:40-05:00 Terminated Exit Code: 127 +2018-04-12T14:37:40-05:00 Started Task started by client +2018-04-12T14:37:29-05:00 Restarting Task restarting in 10.97348449s +2018-04-12T14:37:29-05:00 Terminated Exit Code: 127 +2018-04-12T14:37:29-05:00 Started Task started by client +2018-04-12T14:37:19-05:00 Restarting Task restarting in 10.619985509s +2018-04-12T14:37:19-05:00 Terminated Exit Code: 127 +2018-04-12T14:37:19-05:00 Started Task started by client +2018-04-12T14:37:19-05:00 Task Setup Building Task Directory +``` + + +[restart]: /docs/job-specification/restart.html "Nomad restart Stanza" diff --git a/website/source/layouts/docs.erb b/website/source/layouts/docs.erb index 4fd16fcb7..7e3eacacc 100644 --- a/website/source/layouts/docs.erb +++ b/website/source/layouts/docs.erb @@ -132,6 +132,20 @@ + > + Failure 
Recovery Strategies + + From 73265bede53f2f30d819d25cfe06e5df04720204 Mon Sep 17 00:00:00 2001 From: Preetha Appan Date: Thu, 12 Apr 2018 17:27:11 -0500 Subject: [PATCH 2/4] address some review comments --- .../failure-handling-strategies/check-restart.html.md | 2 +- .../operating-a-job/failure-handling-strategies/index.html.md | 2 +- .../failure-handling-strategies/reschedule.html.md | 2 +- .../operating-a-job/failure-handling-strategies/restart.html.md | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md index a8ae06c11..54b590e71 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md @@ -3,7 +3,7 @@ layout: "docs" page_title: "Check Restart Stanza - Operating a Job" sidebar_current: "docs-operating-a-job-failure-handling-strategies-check-restart" description: |- - Nomad can restart service job tasks if they have a failing health check based on + Nomad can restart tasks if they have a failing health check based on configuration specified in the `check_restart` stanza. Restarts are done locally on the node running the task based on its `restart` policy. --- diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md index 087040c3c..985e4618f 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md @@ -14,7 +14,7 @@ They can fail for various reasons like: - A temporary error in the service that resolves when it's restarted. - An upstream dependency might not be available, leading to a health check failure.
- Disk, Memory or CPU contention on the node that the application is running on. -- The application uses Docker and the Docker daemon on that node is no longer running. +- The application uses Docker and the Docker daemon on that node is unresponsiveS. Nomad provides configurable options to enable recovering failed tasks to avoid downtime. Nomad will try to restart a failed task on the node it is running on, and also try to reschedule it on another node. diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md index e3e3f2f80..ba9c91d1d 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/reschedule.html.md @@ -14,7 +14,7 @@ Tasks can sometimes fail due to network, CPU or memory issues on the node runnin Nomad can reschedule the task on another node. The [`reschedule` stanza][reschedule] can be used to configure how Nomad should try placing failed tasks on another node in the cluster. Reschedule attempts have a delay between each attempt, and the delay can be configured to increase between each rescheduling attempt according to a configurable -`delay-function`. See the [documentation][reschedule] for more information on all the options for rescheduling. +`delay_function`. See the [`reschedule` stanza][reschedule] for more information. Service jobs are configured by default to have unlimited reschedule attempts. We recommend using the reschedule stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention. 
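As a concrete sketch of the stanza being described, a reschedule policy with exponentially increasing delays might be written as follows (values are illustrative, not defaults; see the `reschedule` stanza documentation for the full option list):

```hcl
reschedule {
  delay          = "30s"         # wait before the first reschedule attempt
  delay_function = "exponential" # 30s, 1m, 2m, ... between attempts
  max_delay      = "1h"          # cap on the growing delay
  unlimited      = true          # keep rescheduling rather than giving up
}
```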
diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md index 1cff939c8..11632d45c 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md @@ -12,7 +12,7 @@ description: |- To enable restarting a failed task on the node it is running on, the task group can be annotated with configurable options using the [`restart` stanza][restart]. Nomad will restart the failed task -upto `attempts` times within a provided `interval`. Operators can also choose whether to +up to `attempts` times within a provided `interval`. Operators can also choose whether to keep attempting restarts on the same node, or to fail the task so that it can be rescheduled on another node, via the `mode` parameter. From f588f74d16c665946934f614779776f387e28f7f Mon Sep 17 00:00:00 2001 From: Preetha Appan Date: Thu, 12 Apr 2018 17:55:43 -0500 Subject: [PATCH 3/4] more examples --- .../check-restart.html.md | 56 +++++++++++++++++ .../restart.html.md | 63 ++++++++++--------- 2 files changed, 90 insertions(+), 29 deletions(-) diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md index 54b590e71..ae1a62905 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/check-restart.html.md @@ -20,4 +20,60 @@ Operators can also specify a `grace` duration to wait after a task restarts befo We recommend configuring the check restart on services if it's likely that a restart would resolve the failure. This is applicable in cases like temporary memory issues on the service.
+## Example + +The following `check_restart` stanza waits for two consecutive health check failures with a +grace period and considers both `critical` and `warning` statuses as failures. + +```hcl +check_restart { + limit = 2 + grace = "10s" + ignore_warnings = false +} +``` + +The following CLI output shows health check failures triggering restarts until the +task's restart limit is reached. + +```text +$nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690 +ID = e1b43128 +Eval ID = 249cbfe9 +Name = demo.demo[0] +Node ID = 221e998e +Job ID = demo +Job Version = 0 +Client Status = failed +Client Description = +Desired Status = run +Desired Description = +Created = 2m59s ago +Modified = 39s ago + +Task "test" is "dead" +Task Resources +CPU Memory Disk IOPS Addresses +100 MHz 300 MiB 300 MiB 0 p1: 127.0.0.1:28422 + +Task Events: +Started At = 2018-04-12T22:50:32Z +Finished At = 2018-04-12T22:50:54Z +Total Restarts = 3 +Last Restart = 2018-04-12T17:50:15-05:00 + +Recent Events: +Time Type Description +2018-04-12T17:50:54-05:00 Not Restarting Exceeded allowed attempts 3 in interval 30m0s and mode is "fail" +2018-04-12T17:50:54-05:00 Killed Task successfully killed +2018-04-12T17:50:54-05:00 Killing Sent interrupt. Waiting 5s before force killing +2018-04-12T17:50:54-05:00 Restart Signaled healthcheck: check "service: \"demo-service-test\" check" unhealthy +2018-04-12T17:50:32-05:00 Started Task started by client +2018-04-12T17:50:15-05:00 Restarting Task restarting in 16.887291122s +2018-04-12T17:50:15-05:00 Killed Task successfully killed +2018-04-12T17:50:15-05:00 Killing Sent interrupt.
Waiting 5s before force killing +2018-04-12T17:50:15-05:00 Restart Signaled healthcheck: check "service: \"demo-service-test\" check" unhealthy +2018-04-12T17:49:53-05:00 Started Task started by client +``` + [check restart]: /docs/job-specification/check_restart.html "Nomad check restart Stanza" \ No newline at end of file diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md index 11632d45c..7c959b0ff 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/restart.html.md @@ -47,45 +47,50 @@ d5dee7c8 8a184f31 demo 0 run pending 27s ago 5s ago ed815997 8a184f31 demo 0 run pending 27s ago 5s ago ``` -```text -$nomad alloc-status ce5b -ID = ce5bf1d1 -Eval ID = 05681b90 -Name = demo.demo[1] -Node ID = 8a184f31 -Job ID = demo -Job Version = 0 -Client Status = pending -Client Description = -Desired Status = run -Desired Description = -Created = 31s ago -Modified = 9s ago +In the following example, the allocation `ce5bf1d1` is restarted by Nomad approximately +every ten seconds, with a small random jitter. It eventually reaches its limit of three attempts and +transitions into a `failed` state, after which it becomes eligible for [rescheduling][rescheduling]. 
-Task "demo" is "pending" +```text +$nomad alloc-status ce5bf1d1 +ID = ce5bf1d1 +Eval ID = 64e45d11 +Name = demo.demo[1] +Node ID = a0ccdd8b +Job ID = demo +Job Version = 0 +Client Status = failed +Client Description = +Desired Status = run +Desired Description = +Created = 56s ago +Modified = 22s ago + +Task "demo" is "dead" Task Resources CPU Memory Disk IOPS Addresses 100 MHz 300 MiB 300 MiB 0 Task Events: -Started At = 2018-04-12T19:37:40Z -Finished At = N/A +Started At = 2018-04-12T22:29:08Z +Finished At = 2018-04-12T22:29:08Z Total Restarts = 3 -Last Restart = 2018-04-12T14:37:40-05:00 +Last Restart = 2018-04-12T17:28:57-05:00 Recent Events: -Time Type Description -2018-04-12T14:37:40-05:00 Restarting Task restarting in 11.686056069s -2018-04-12T14:37:40-05:00 Terminated Exit Code: 127 -2018-04-12T14:37:40-05:00 Started Task started by client -2018-04-12T14:37:29-05:00 Restarting Task restarting in 10.97348449s -2018-04-12T14:37:29-05:00 Terminated Exit Code: 127 -2018-04-12T14:37:29-05:00 Started Task started by client -2018-04-12T14:37:19-05:00 Restarting Task restarting in 10.619985509s -2018-04-12T14:37:19-05:00 Terminated Exit Code: 127 -2018-04-12T14:37:19-05:00 Started Task started by client -2018-04-12T14:37:19-05:00 Task Setup Building Task Directory +Time Type Description +2018-04-12T17:29:08-05:00 Not Restarting Exceeded allowed attempts 3 in interval 5m0s and mode is "fail" +2018-04-12T17:29:08-05:00 Terminated Exit Code: 127 +2018-04-12T17:29:08-05:00 Started Task started by client +2018-04-12T17:28:57-05:00 Restarting Task restarting in 10.364602876s +2018-04-12T17:28:57-05:00 Terminated Exit Code: 127 +2018-04-12T17:28:57-05:00 Started Task started by client +2018-04-12T17:28:47-05:00 Restarting Task restarting in 10.666963769s +2018-04-12T17:28:47-05:00 Terminated Exit Code: 127 +2018-04-12T17:28:47-05:00 Started Task started by client +2018-04-12T17:28:35-05:00 Restarting Task restarting in 11.777324721s ``` [restart]: 
/docs/job-specification/restart.html "Nomad restart Stanza" +[rescheduling]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza" From 2267483405e0e93f682805a59f708c4719eb50a0 Mon Sep 17 00:00:00 2001 From: Alex Dadgar Date: Thu, 12 Apr 2018 16:02:03 -0700 Subject: [PATCH 4/4] fix spelling --- .../operating-a-job/failure-handling-strategies/index.html.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md index 985e4618f..0b6aa3deb 100644 --- a/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md +++ b/website/source/docs/operating-a-job/failure-handling-strategies/index.html.md @@ -14,7 +14,7 @@ They can fail for various reasons like: - A temporary error in the service that resolves when it's restarted. - An upstream dependency might not be available, leading to a health check failure. - Disk, Memory or CPU contention on the node that the application is running on. -- The application uses Docker and the Docker daemon on that node is unresponsiveS. +- The application uses Docker and the Docker daemon on that node is unresponsive. Nomad provides configurable options to enable recovering failed tasks to avoid downtime. Nomad will try to restart a failed task on the node it is running on, and also try to reschedule it on another node.
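Taken together, the strategies documented in this patch series can be combined in one job specification. The following sketch shows local restarts, check restarts, and rescheduling side by side (the job name, driver, image, ports, and all values are hypothetical illustrations, not recommendations):

```hcl
job "demo" {
  datacenters = ["dc1"]
  type        = "service"

  group "demo" {
    # Local restarts: up to three attempts in five minutes, then fail
    restart {
      attempts = 3
      interval = "5m"
      delay    = "10s"
      mode     = "fail"
    }

    # Rescheduling: retry on other nodes with exponential backoff
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }

    task "demo" {
      driver = "docker"

      config {
        image = "redis:3.2" # hypothetical image
      }

      service {
        name = "demo-service"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"

          # Check restarts: restart after two consecutive check failures
          check_restart {
            limit = 2
            grace = "10s"
          }
        }
      }

      resources {
        cpu    = 100
        memory = 300

        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}
```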