Commit Graph

33 Commits

Author SHA1 Message Date
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
Tim Gross
92effde870 docs: add more warnings about running agent as root on Linux (#15926) 2023-01-27 15:22:18 -05:00
Ashlee M Boyer
3444ece549 docs: Migrate link formats (#15779)
* Adding check-legacy-links-format workflow

* Adding test-link-rewrites workflow

* chore: updates link checker workflow hash

* Migrating links to new format

Co-authored-by: Kendall Strautman <kendallstrautman@gmail.com>
2023-01-25 09:31:14 -08:00
Tim Gross
9bdb6a5b7d Rename nomad.broker.total_blocked metric (#15835)
This changeset fixes a long-standing point of confusion in metrics emitted by
the eval broker. The eval broker has a queue of "blocked" evals that are waiting
for an in-flight ("unacked") eval of the same job to be completed. But this
"blocked" state is not the same as the `blocked` status that we write to raft
and expose in the Nomad API to end users. There's a second metric
`nomad.blocked_eval.total_blocked` that refers to evaluations in that
state. This has caused ongoing confusion in major customer incidents and even in
our own documentation! (Fixed in this PR.)

There's little functional change in this PR aside from the name of the metric
emitted, but there's a bit refactoring to clean up the names in `eval_broker.go`
so that there aren't name collisions and multiple names for the same
state. Changes included are:
* Everything that was previously called "pending" referred to entities that were
  associated witht he "ready" metric. These are all now called "ready" to match
  the metric.
* Everything named "blocked" in `eval_broker.go` is now named "pending", except
  for a couple of comments that actually refer to blocked RPCs.
* Added a note to the upgrade guide docs for 1.5.0.
* Fixed the scheduling performance metrics docs because the description for
  `nomad.broker.total_blocked` was actually the description for
  `nomad.blocked_eval.total_blocked`.
2023-01-20 14:23:56 -05:00
Karel
1ded25fbf1 docs: fix conflict metric documentation, fix typo (#15805)
The description for the `nomad.nomad.blocked_evals.total_blocked` states that this could include evals blocked due to reached quota limits, but the `total_quota_limit` mentions being exclusive to its own metric.  I personally interpret `total_blocked` as encompassing any blocked evals for any reason, as written in the docs. Though someone will have to verify the validity of that statement and possibly rectify the other metric description.

Fixed a typo: `limtis` vs `limits`.
2023-01-20 13:54:11 -05:00
Tim Gross
21c2d1593a remove deprecated AllocUpdateRequestType raft entry (#15285)
After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft
log entry was no longer in use. Mark this as deprecated, remove the associated
dead code, and remove references to the metrics it emits from the docs. We'll
leave the entry itself just in case we encounter old raft logs that we need to
be able to safely load.
2022-11-17 12:08:04 -05:00
Bryce Kalow
f49b3a95dd website: fixes redirected links (#14918) 2022-10-18 10:31:52 -05:00
Kevin Wang
57dc7c2ab1 fix: website broken links (#14904)
* fix: website broken links

* fix up keyring-rotate link

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-17 11:32:10 -04:00
Bryce Kalow
67d39725b1 website: content updates for developer (#14473)
Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com>
Co-authored-by: Anthony <russo555@gmail.com>
Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com>
Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com>
Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com>
Co-authored-by: Kevin Wang <kwangsan@gmail.com>
2022-09-16 10:38:39 -05:00
Tim Gross
b7fea76f7f keyring: wrap root key in key encryption key (#14388)
Update the on-disk format for the root key so that it's wrapped with a unique
per-key/per-server key encryption key. This is a bit of security theatre for the
current implementation, but it uses `go-kms-wrapping` as the interface for
wrapping the key. This provides a shim for future support of external KMS such
as cloud provider APIs or Vault transit encryption.

* Removes the JSON serialization extension we had on the `RootKey` struct; this
  struct is now only used for key replication and not for disk serialization, so
  we don't need this helper.

* Creates a helper for generating cryptographically random slices of bytes that
  properly accounts for short reads from the source.

* No observable functional changes outside of the on-disk format, so there are
  no test updates.
2022-08-30 10:59:25 -04:00
Tim Gross
e1e5bb1dce docs: rename Secure Variables to Variables (#14352) 2022-08-29 11:37:08 -04:00
Charlie Voiselle
279631eeb5 Sweep of docs for repeated words; minor edits (#14032) 2022-08-05 16:45:30 -04:00
Tim Gross
9384ba19ad docs: concepts for secure variables and workload identity (#13764)
Includes concept docs for secure variables, concept docs for workload
identity, and an operations docs for keyring management.
2022-08-02 10:06:26 -04:00
Will Jordan
662a12a41e Return 429 response on HTTP max connection limit (#13621)
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.

Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.

Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections
exceeding concurrency limit.
2022-07-20 14:12:21 -04:00
Michael Schurter
875cf8db51 Improve metrics reference documentation (#13769)
* docs: tighten up parameterized job metrics docs

* docs: improve alloc status descriptions

Remove `nomad.client.allocations.start` as it doesn't exist.
2022-07-15 14:22:39 -07:00
Michael Schurter
717e92d3aa docs: clarify blocked_evals metrics (#13751)
Related to #13740

- blocked_evals.total_blocked is the number of evals blocked for *any*
  reason
- blocked_evals.total_quota_limit is the number of evals blocked by
  quota limits, but critically: their resources are *not* counted in the
  cpu/memory
2022-07-14 11:32:33 -07:00
Luiz Aoqui
d456cc1e7f Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Tim Gross
f295396ef8 docs: rename Internals to Concepts (#13696) 2022-07-11 16:55:33 -04:00
Michael Schurter
c52741ae1b docs: clarify total_escaped is just an optimization (#13460) 2022-06-22 11:39:56 -07:00
Michael Schurter
19bac3caa8 docs: add plan for node rejected details and more (#12564)
- Moved federation docs to the bottom since *everyone* is potentially
  affected by the other sections on the page, but only users of
  federation are affected by it.
- Added section on the plan for node rejected bug since it is fairly
  easy to diagnose and removing affected nodes is a fairly reliable
  workaround.
- Mention 5s cliff for wait_for_index.
- Remove the lie that we do not have job status metrics! How old was
  that?!
- Reinforce the importance of monitoring basic system resources
2022-04-14 16:09:33 -07:00
Jasmine Dahilig
ccaaadf493 docs: add token_last_renewal and token_next_renewal to server metrics and key metrics #12435 (#12505) 2022-04-07 15:12:41 -07:00
Derek Strickland
5b5c853597 disconnected clients: Observability plumbing (#12141)
* Add disconnects/reconnect to log output and emit reschedule metrics

* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics
2022-04-05 17:12:23 -04:00
Seth Hoenig
16efcf4e71 core: switch to go.etc.io/bbolt
This PR swaps the underlying BoltDB implementation from boltdb/bolt
to go.etc.io/bbolt.

In addition, the Server has a new configuration option for disabling
NoFreelistSync on the underlying database.

Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81
Consul equivelent PR: https://github.com/hashicorp/consul/pull/11720
2022-02-23 14:26:41 -06:00
Luiz Aoqui
a0c0b808af docs: add nomad.plan.node_rejected metric (#11860) 2022-01-18 13:47:20 -05:00
Tim Gross
7fad4b9169 docs: new scheduler metrics (#11790)
* Fixed name of `nomad.scheduler.allocs.reschedule` metric
* Added new metrics to metrics reference documentation
* Expanded definitions of "waiting" metrics
* Changelog entry for #10236 and #10237
2022-01-07 09:51:15 -05:00
Tim Gross
95fa1b30f4 docs: improve docs for troubleshooting and monitoring scheduler (#11623)
This changeset adds more specific recommendations as to what metrics
to monitor, and what resources should be examined during incident
response.

It also renames the "Telemetry" section to "Monitoring Nomad" to
surface the material better and distinguish it from the "Metric
Reference".

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2021-12-07 15:52:13 -05:00
James Rasell
d2132b96b4 docs: add license expiry metric to metrics website doc. 2021-12-07 10:31:51 +00:00
kfenech1
6bbcb180f2 docs: nomad.client.unallocated.memory is in Megabytes not bytes (#11468) 2021-11-08 11:05:11 -05:00
Michael Schurter
594ceb7022 docs: improve wait_for_index metrics description (#10717)
Old description of `{plan,worker}.wait_for_index` described the metric
in terms of waiting for a snapshot which has two problems:

1. "Snapshot" is an overloaded term in Nomad and operators can't be
   expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on*
   before taking a snapshot: the raft index of the object to be
   processed (plan or eval).

The new description tries to cram all of that context into the tiny
space provided.

See #5791 for details about the `wait_for_index` mechanism in general.
2021-06-09 08:53:06 -04:00
Luiz Aoqui
c7114921fa Add metrics for blocked eval resources (#10454)
* add metrics for blocked eval resources

* docs: add new blocked_evals metrics

* fix to call `pruneStats` instead of `stats.prune` directly
2021-04-29 15:03:45 -04:00
Bryce Kalow
ee79587a67 feat(website): migrates to new nav data format (#10264) 2021-03-31 08:43:17 -05:00
Tim Gross
3b42d75225 docs: add metrics from raft leadership transitions 2021-01-27 11:50:11 -05:00
Jeff Escalante
0eae603a86 implement mdx remote 2021-01-05 19:02:39 -05:00