14 Commits

Author SHA1 Message Date
Michael Schurter
19bac3caa8 docs: add plan for node rejected details and more (#12564)
- Moved federation docs to the bottom since *everyone* is potentially
  affected by the other sections on the page, but only users of
  federation are affected by it.
- Added section on the plan for node rejected bug since it is fairly
  easy to diagnose and removing affected nodes is a fairly reliable
  workaround.
- Mention 5s cliff for wait_for_index.
- Remove the lie that we do not have job status metrics! How old was
  that?!
- Reinforce the importance of monitoring basic system resources
2022-04-14 16:09:33 -07:00
Jasmine Dahilig
ccaaadf493 docs: add token_last_renewal and token_next_renewal to server metrics and key metrics #12435 (#12505) 2022-04-07 15:12:41 -07:00
Derek Strickland
5b5c853597 disconnected clients: Observability plumbing (#12141)
* Add disconnects/reconnect to log output and emit reschedule metrics

* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics
2022-04-05 17:12:23 -04:00
Seth Hoenig
16efcf4e71 core: switch to go.etcd.io/bbolt
This PR swaps the underlying BoltDB implementation from boltdb/bolt
to go.etcd.io/bbolt.

In addition, the Server has a new configuration option for disabling
NoFreelistSync on the underlying database.

Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81
Consul equivalent PR: https://github.com/hashicorp/consul/pull/11720
2022-02-23 14:26:41 -06:00
Luiz Aoqui
a0c0b808af docs: add nomad.plan.node_rejected metric (#11860) 2022-01-18 13:47:20 -05:00
Tim Gross
7fad4b9169 docs: new scheduler metrics (#11790)
* Fixed name of `nomad.scheduler.allocs.reschedule` metric
* Added new metrics to metrics reference documentation
* Expanded definitions of "waiting" metrics
* Changelog entry for #10236 and #10237
2022-01-07 09:51:15 -05:00
Tim Gross
95fa1b30f4 docs: improve docs for troubleshooting and monitoring scheduler (#11623)
This changeset adds more specific recommendations as to what metrics
to monitor, and what resources should be examined during incident
response.

It also renames the "Telemetry" section to "Monitoring Nomad" to
surface the material better and distinguish it from the "Metric
Reference".

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2021-12-07 15:52:13 -05:00
James Rasell
d2132b96b4 docs: add license expiry metric to metrics website doc. 2021-12-07 10:31:51 +00:00
kfenech1
6bbcb180f2 docs: nomad.client.unallocated.memory is in Megabytes not bytes (#11468) 2021-11-08 11:05:11 -05:00
Michael Schurter
594ceb7022 docs: improve wait_for_index metrics description (#10717)
Old description of `{plan,worker}.wait_for_index` described the metric
in terms of waiting for a snapshot which has two problems:

1. "Snapshot" is an overloaded term in Nomad and operators can't be
   expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on*
   before taking a snapshot: the raft index of the object to be
   processed (plan or eval).

The new description tries to cram all of that context into the tiny
space provided.

See #5791 for details about the `wait_for_index` mechanism in general.
2021-06-09 08:53:06 -04:00
Luiz Aoqui
c7114921fa Add metrics for blocked eval resources (#10454)
* add metrics for blocked eval resources

* docs: add new blocked_evals metrics

* fix to call `pruneStats` instead of `stats.prune` directly
2021-04-29 15:03:45 -04:00
Bryce Kalow
ee79587a67 feat(website): migrates to new nav data format (#10264) 2021-03-31 08:43:17 -05:00
Tim Gross
3b42d75225 docs: add metrics from raft leadership transitions 2021-01-27 11:50:11 -05:00
Jeff Escalante
0eae603a86 implement mdx remote 2021-01-05 19:02:39 -05:00