* Basic implementation for server members and node status
* Commands for alloc status and job status
* -ui flag for most commands
* url hints for variables
* url hints for job dispatch, evals, and deployments
* agent config ui.cli_url_links to disable
* Fix an issue where path prefix was presumed for variables
* driver uncomment and general cleanup
* -ui flag on the generic status endpoint
* Job run command gets namespaces, and no longer gets ui hints for --output flag
* Dispatch command hints get a namespace, and bunch o tests
* Lots of tests depend on specific output, so let's not mess with them
* figured out what flagAddress is all about for testServer, oof
* Parallel outside of test instances
* Browser-opening test, sorta
* Env var for disabling/enabling CLI hints
* Addressing a few PR comments
* CLI docs available flags now all have -ui
* PR comments addressed; switched the env var to be consistent and scrunched monitor-adjacent hints a bit more
* ui.Output -> ui.Warn; moves hints from stdout to stderr
* isTerminal check and parseBool on command option
* terminal.IsTerminal check removed for test-runner-not-being-terminal reasons
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending large batches of IDs. Although the command's batch size
was carefully tuned, we still need to JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.
To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look at the failed options first:
* A naive solution here would be to just send the filter as the raft request and
let the FSM apply delete the whole set in a single operation. Benchmarking with
1M evals on a 3-node cluster demonstrated this can block the FSM apply for
several minutes, which puts the cluster at risk if there's a leadership
failover (the barrier write can't be made while this apply is in-flight).
* A less naive but still bad solution would be to have the RPC handler filter
and paginate, and then hand a list of IDs to the existing raft log
entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
took roughly an hour to complete.
Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only about 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completed in less than 4 minutes.
Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
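A minimal sketch of the filter-plus-page-token flow described above, using a toy in-memory store rather than Nomad's state store, paginator, or raft. All names here (`Eval`, `Store`, `DeleteRequest`, `DeleteByFilter`, the status-only filter) are hypothetical and chosen for illustration; the point is only that each log entry carries a filter and a token instead of a list of IDs, and that the apply step rebuilds the page itself.

```go
package main

import (
	"fmt"
	"sort"
)

// Eval is a hypothetical, trimmed-down stand-in for an evaluation.
type Eval struct {
	ID     string
	Status string
}

// Store is a toy in-memory state store keyed by eval ID.
type Store struct {
	evals map[string]Eval
}

// DeleteRequest models what would go into the raft log:
// the filter and the page token, not the IDs themselves.
type DeleteRequest struct {
	Status    string // hypothetical filter: match evals with this status
	NextToken string // ID at which this page starts
	PerPage   int
}

// newStore creates n evals; every third one is "complete", the rest "pending".
func newStore(n int) *Store {
	s := &Store{evals: make(map[string]Eval, n)}
	for i := 0; i < n; i++ {
		status := "pending"
		if i%3 == 0 {
			status = "complete"
		}
		id := fmt.Sprintf("eval-%03d", i)
		s.evals[id] = Eval{ID: id, Status: status}
	}
	return s
}

// page returns up to perPage matching evals with ID >= token, in ID order,
// plus the ID that starts the next page ("" when there is no next page).
func (s *Store) page(status, token string, perPage int) ([]Eval, string) {
	ids := make([]string, 0, len(s.evals))
	for id := range s.evals {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var out []Eval
	for _, id := range ids {
		e := s.evals[id]
		if id < token || e.Status != status {
			continue
		}
		if len(out) == perPage {
			return out, id
		}
		out = append(out, e)
	}
	return out, ""
}

// Apply plays the role of the FSM apply: it recreates the page from the
// filter and token, then deletes only that page.
func (s *Store) Apply(req DeleteRequest) int {
	evals, _ := s.page(req.Status, req.NextToken, req.PerPage)
	for _, e := range evals {
		delete(s.evals, e.ID)
	}
	return len(evals)
}

// DeleteByFilter plays the role of the RPC handler: it paginates only to
// find each page token and submits one small request per page.
func DeleteByFilter(s *Store, status string, perPage int) (deleted, batches int) {
	token := ""
	for {
		evals, next := s.page(status, token, perPage)
		if len(evals) == 0 {
			return
		}
		deleted += s.Apply(DeleteRequest{Status: status, NextToken: token, PerPage: perPage})
		batches++
		if next == "" {
			return
		}
		token = next
	}
}

func main() {
	s := newStore(10)
	deleted, batches := DeleteByFilter(s, "pending", 4)
	fmt.Printf("deleted %d evals in %d batches, %d left\n", deleted, batches, len(s.evals))
}
```

Because the token only moves forward, an eval whose ID sorts before the current token and is inserted mid-run is never revisited in this sketch, which mirrors the consistency caveat above.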
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.
Include the eval namespace in all output formats and always include the
job ID in `eval status`, since even `node-update` evals are related to a
job.
Add Node ID to the evals table output to help differentiate
`node-update` evals.
Co-authored-by: James Rasell <jrasell@hashicorp.com>
Use the new filtering and pagination capabilities of the `Eval.List`
RPC to provide filtering and pagination at the command line.
Also includes a note that `nomad eval status -json` is deprecated and
will be replaced with a single evaluation view in a future version of
Nomad.