From 19bac3caa8dc5e98e473f96fb6a73fcb0d9ee040 Mon Sep 17 00:00:00 2001
From: Michael Schurter
Date: Thu, 14 Apr 2022 16:09:33 -0700
Subject: [PATCH] docs: add plan for node rejected details and more (#12564)

- Moved federation docs to the bottom since *everyone* is potentially affected
  by the other sections on the page, but only users of federation are affected
  by it.
- Added a section on the plan-for-node-rejected bug since it is fairly easy to
  diagnose, and removing affected nodes is a fairly reliable workaround.
- Mentioned the 5s cliff for wait_for_index.
- Removed the outdated claim that we do not have job status metrics.
- Reinforced the importance of monitoring basic system resources.
---
 .../docs/operations/monitoring-nomad.mdx | 133 +++++++++++-------
 1 file changed, 86 insertions(+), 47 deletions(-)

diff --git a/website/content/docs/operations/monitoring-nomad.mdx b/website/content/docs/operations/monitoring-nomad.mdx
index 690b6afc2..0dccbbefd 100644
--- a/website/content/docs/operations/monitoring-nomad.mdx
+++ b/website/content/docs/operations/monitoring-nomad.mdx
@@ -8,18 +8,22 @@ description: |-
 
 # Monitoring Nomad
 
-The Nomad client and server agents collect a wide range of runtime metrics
-related to the performance of the system. Operators can use this data to gain
-real-time visibility into their cluster and improve performance. Additionally,
-Nomad operators can set up monitoring and alerting based on these metrics in
-order to respond to any changes in the cluster state.
+The Nomad client and server agents collect a wide range of runtime metrics.
+These metrics are useful for monitoring the health and performance of Nomad
+clusters. Careful monitoring can spot trends before they cause problems and
+help debug issues when they arise.
 
-On the server side, leaders and
-followers have metrics in common as well as metrics that are specific to their
-roles. Clients have separate metrics for the host metrics and for
-allocations/tasks, both of which have to be [explicitly
-enabled][telemetry-stanza]. There are also runtime metrics that are common to
-all servers and clients.
+All Nomad agents, both servers and clients, report basic system and Go runtime
+metrics.
+
+All Nomad servers report many metrics, but some are specific to the leader
+server. Since leadership may change at any time, these metrics should be
+monitored on all servers. Missing (or 0) metrics from non-leaders may be safely
+ignored.
+
+Nomad clients have separate metrics for the host they are running on as well as
+for each allocation being run. Both of these metric types [must be explicitly
+enabled][telemetry-stanza].
 
 By default, the Nomad agent collects telemetry data at a [1 second
 interval][collection-interval]. Note that Nomad supports [gauges, counters, and
@@ -27,19 +31,16 @@ timers][metric-types].
 
 There are three ways to obtain metrics from Nomad:
 
-- Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
-  the current Nomad process (as of Nomad 0.7). This endpoint supports Prometheus
-  formatted metrics.
+- Query the [/v1/metrics API endpoint][metrics-api-endpoint] to return metrics
+  for the current Nomad process. This endpoint supports Prometheus-formatted
+  metrics, as shown in the example below.
 
 - Send the USR1 signal to the Nomad process. This will dump the current
   telemetry information to STDERR (on Linux).
 
-- Configure Nomad to automatically forward metrics to a third-party provider. 
-
-Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
-integrations with [DataDog][datadog-telem] and [Prometheus][prometheus-telem].
-Metrics can also be forwarded to [Statsite][statsite-telem],
-[StatsD][statsd-telem], and [Circonus][circonus-telem].
+- Configure Nomad to automatically forward metrics to a third-party provider
+  such as [DataDog][datadog-telem], [Prometheus][prometheus-telem],
+  [statsd][statsd-telem], and [Circonus][circonus-telem].
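+
+For a quick check, the first two methods can be exercised from a shell. A
+minimal sketch, assuming a local agent on the default HTTP port (4646) with
+[Prometheus metrics enabled][prometheus-telem]:
+
+```
+# Fetch Prometheus-formatted metrics from the local agent
+$ curl 'http://localhost:4646/v1/metrics?format=prometheus'
+
+# Dump current telemetry to the Nomad process's STDERR (Linux)
+$ kill -USR1 "$(pgrep -x nomad)"
+```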
 
 ## Alerting
 
@@ -71,7 +72,12 @@ patterns.
 
 # Key Performance Indicators
 
-The sections below cover a number of important metrics
+Nomad servers' memory, CPU, disk, and network usage all scale linearly with
+cluster size and scheduling throughput. The most important part of monitoring
+Nomad is ensuring its servers are not encountering constraints on these basic
+system resources.
+
+The sections below cover a number of other important metrics.
 
 ## Consensus Protocol (Raft)
 
@@ -111,28 +117,46 @@ The `nomad.raft.fsm.apply` metric is an indicator of the time it takes for
 a server to apply Raft entries to the internal state
 machine. If this number trends upwards, look at the
 `nomad.nomad.fsm.*` metrics to see if a specific Raft entry is
 increasing in latency. You can compare
-this to warn-level logs on the Nomad servers for "attempting to apply
-large raft entry". If a specific type of message appears here, there
+this to warn-level logs on the Nomad servers for `attempting to apply
+large raft entry`. If a specific type of message appears here, there
 may be a job with a large job specification or dispatch payload that
-is increasing the time it takes to apply raft messages.
-
-## Federated Deployments (Serf)
-
-Nomad uses the membership and failure detection capabilities of the Serf library
-to maintain a single, global gossip pool for all servers in a federated
-deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
-that membership is unstable.
-
-If these metrics increase, look at CPU load on the servers and network
-latency and packet loss for the [Serf] address.
+is increasing the time it takes to apply Raft messages. Try shrinking the size
+of the job by splitting distinct task groups into separate jobs, downloading
+templates instead of embedding them (see the sketch below), or reducing the
+`count` on task groups.
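+
+For example, a template can be fetched at run time with an `artifact` block
+instead of being embedded in the job specification. A minimal sketch, using a
+placeholder URL:
+
+```
+task "app" {
+  # Download the template when the task starts rather than embedding it in
+  # the job specification, keeping the job (and its Raft entries) small.
+  artifact {
+    source      = "https://example.com/templates/app.conf.tpl"
+    destination = "local/"
+  }
+
+  # Render the downloaded file instead of an inline `data` block.
+  template {
+    source      = "local/app.conf.tpl"
+    destination = "local/app.conf"
+  }
+}
+```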
 
 ## Scheduling
 
-The [Scheduling] documentation describes the workflow of how
-evaluations become scheduled plans and placed allocations. The
-following metrics, listed in the order they are emitted, allow an
-operator to observe changes in throughput at the various points in the
-scheduling process.
+The [Scheduling] documentation describes the workflow of how evaluations become
+scheduled plans and placed allocations.
+
+### Progress
+
+There is a class of bug possible in Nomad where the two parts of the scheduling
+pipeline, the workers and the leader's plan applier, *disagree* about the
+validity of a plan. In the pathological case this can cause a job to never
+finish scheduling, as workers produce the same plan and the plan applier
+repeatedly rejects it.
+
+While this class of bug is very rare, it can be detected by repeated log lines
+on the Nomad servers containing `plan for node rejected`:
+
+```
+nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5
+```
+
+These log lines may occur infrequently due to normal cluster conditions, but
+they should not appear repeatedly and prevent the job from eventually running
+(look up the logged evaluation ID to find the affected job).
+
+If this log line *does* appear repeatedly with the same `node_id`, try
+[draining] the node and shutting it down. Misconfigurations not caught by
+validation can cause nodes to enter this state (see [#11830][gh-11830]).
+
+### Performance
+
+The following metrics allow observing changes in throughput at the various
+points in the scheduling process.
 
 - **nomad.worker.invoke_scheduler.<type>** - The time to run the
   scheduler of the given type. Each scheduler worker handles one
@@ -169,9 +193,11 @@ scheduling process.
   entirely in memory on the leader. If this metric increases, examine
   the CPU and memory resources of the leader.
 
-- **nomad.plan.wait_for_index** - The time required for the planner to
-  wait for the Raft index of the plan to be processed. If this metric
-  increases, refer to the [Consensus Protocol (Raft)] section above.
+- **nomad.plan.wait_for_index** - The time required for the planner to wait for
+  the Raft index of the plan to be processed. If this metric increases, refer
+  to the [Consensus Protocol (Raft)] section above. If this metric approaches 5
+  seconds, scheduling operations may fail and be retried. If possible, reduce
+  scheduling load until the metric improves.
 
 - **nomad.plan.submit** - The time to submit a scheduler plan from the
   worker to the leader. This operation requires writing to Raft and
@@ -215,8 +241,8 @@ when the CPU is at or above the reserved resources for the task.
 
 ## Job and Task Status
 
-We do not currently surface metrics for job and task/allocation status, although
-we will consider adding metrics where it makes sense.
+See [Job Summary Metrics] for monitoring the health and status of workloads
+running on Nomad.
 
 ## Runtime Metrics
 
@@ -230,6 +256,16 @@ general indicators of load and memory pressure. It is recommended to alert on
 upticks in any of the above, server memory usage
 in particular.
 
+## Federated Deployments (Serf)
+
+Nomad uses the membership and failure detection capabilities of the Serf
+library to maintain a single, global gossip pool for all servers in a federated
+deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable
+indicator that membership is unstable, and is worth alerting on (see the
+example below).
+
+If these metrics increase, check CPU load on the servers as well as network
+latency and packet loss for the [Serf] address.
+
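+For example, an alert on membership flapping might look like the following
+Prometheus alerting rule. This is a minimal sketch: the metric name assumes
+Prometheus-formatted metrics from the `/v1/metrics` endpoint, and the window
+and threshold are illustrative, not recommendations:
+
+```
+groups:
+  - name: nomad-serf
+    rules:
+      - alert: NomadGossipMembershipUnstable
+        # A sustained rate of member flaps over 15 minutes suggests gossip
+        # membership is unstable; tune the window for your environment.
+        expr: rate(nomad_serf_member_flap[15m]) > 0
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Nomad server gossip membership is flapping"
+```
+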
 [alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
 [alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
 [allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
 [collection-interval]: /docs/configuration/telemetry#collection_interval
 [datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
 [datadog-telem]: /docs/configuration/telemetry#datadog
-[prometheus-telem]: /docs/configuration/telemetry#prometheus
-[metrics-api-endpoint]: /api-docs/metrics
+[draining]: https://learn.hashicorp.com/tutorials/nomad/node-drain
+[gh-11830]: https://github.com/hashicorp/nomad/pull/11830
 [metric-types]: /docs/telemetry/metrics#metric-types
+[metrics-api-endpoint]: /api-docs/metrics
+[prometheus-telem]: /docs/configuration/telemetry#prometheus
+[serf]: /docs/configuration#serf-1
 [statsd-exporter]: https://github.com/prometheus/statsd_exporter
 [statsd-telem]: /docs/configuration/telemetry#statsd
 [statsite-telem]: /docs/configuration/telemetry#statsite
 [tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
 [telemetry-stanza]: /docs/configuration/telemetry
-[serf]: /docs/configuration#serf-1
 [Consensus Protocol (Raft)]: /docs/operations/telemetry#consensus-protocol-raft
+[Job Summary Metrics]: /docs/operations/metrics-reference#job-summary-metrics
 [Scheduling]: /docs/internals/scheduling/scheduling