The `nomad operator debug` command saves a CPU profile for each interval, and
names these files based on the interval.
The same functions takes a goroutine profile, heap profile, etc. but is missing
the logic to interpolate the file name with the interval. This results in the
operator debug command making potentially many expensive profile requests, and
then overwriting the data. Update the command to save every profile it scrapes,
and number them similarly to the existing CPU profile.
Additionally, the command flags for `-pprof-interval` and `-pprof-duration` were
validated backwards, which meant that we always coerced the `-pprof-interval` to
be the same as the `-pprof-duration`, which always resulted in a single profile
being taken at the start of the bundle. Correct the check as well as change the
defaults to be more sensible.
Fixes: https://github.com/hashicorp/nomad/issues/20151
The `operator snapshot` commands and agent don't back up Nomad's key
material. Add some warnings about this to places where users might be looking
for information on cluster recovery.
Fixes: https://github.com/hashicorp/nomad/issues/19389
When cluster administrators restore from Raft snapshot, they also need to ensure the
keyring is in place. For on-prem users doing in-place upgrades this is less of a
concern but for typical cloud workflows where the whole host is replaced, it's
an important warning (at least until #14852 has been implemented).
Document the secure variables keyring commands, document the aliased
gossip keyring commands, and note that the old gossip keyring commands
are deprecated.
The sidebar navigation tree for the `operator` sub-sub commands is
getting cluttered and we have a new set of commands coming to support
secure variables keyring as well. Move these all under their own
subtrees.
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes:
* Make first pprof collection synchronous to preserve the existing
behavior for the common case where the pprof interval matches the
duration.
* Clamp `operator debug` pprof timing to that of the command. The
`pprof-duration` should be no more than `duration` and the
`pprof-interval` should be no more than `pprof-duration`. Clamp the
values rather than throwing errors, which could change the commands
that existing users might already have in debugging scripts
* Testing: remove test parallelism
The `operator debug` tests that stand up servers can't be run in
parallel, because we don't have a way of canceling the API calls for
pprof. The agent will still be running the last pprof when we exit,
and that breaks the next test that talks to that same agent.
(Because you can only run one pprof at a time on any process!)
We could split off each subtest into its own server, but this test
suite is already very slow. In future work we should fix this "for
real" by making the API call cancelable.
* Testing: assert against unexpected errors in `operator debug` tests.
If we assert there are no unexpected error outputs, it's easier for
the developer to debug when something is going wrong with the tests
because the error output will be presented as a failing test, rather
than just a failing exit code check. Or worse, no failing exit code
check!
This also forces us to be explicit about which tests will return 0
exit codes but still emit (presumably ignorable) error outputs.
Additional minor bug fixes (mostly in tests) and test refactorings:
* Fix text alignment on pprof Duration in `operator debug` output
* Remove "done" channel from `operator debug` event stream test. The
goroutine we're blocking for here already tells us it's done by
sending a value, so block on that instead of an extraneous channel
* Event stream test timer should start at current time, not zero
* Remove noise from `operator debug` test log output. The `t.Logf`
calls already are picked out from the rest of the test output by
being prefixed with the filename.
* Remove explicit pprof args so we use the defaults clamped from
duration/interval
The `nomad operator raft` and `nomad operator snapshot state`
subcommands for inspecting on-disk raft state were hidden and
undocumented. Expose and document these so that advanced operators
have support for these tools.