Commit Graph

278 Commits

Author SHA1 Message Date
Dao Thanh Tung
30b235345d cli: Add a nomad operator client state command (#15469)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-01-11 10:03:31 -05:00
dgotlieb
089e6802d6 docs: nomad eval delete typo fix (#15667)
Status instead of Stauts
2023-01-03 14:18:03 -05:00
Piotr Kazmierczak
f452441542 ACL Binding Rules CLI documentation (#15584) 2022-12-22 16:36:25 +01:00
Danish Prakash
16401b864e command/job_stop: accept multiple jobs, stop concurrently (#12582)
* command/job_stop: accept multiple jobs, stop concurrently

Signed-off-by: danishprakash <grafitykoncept@gmail.com>

* command/job_stop_test: add test for multiple job stops

Signed-off-by: danishprakash <grafitykoncept@gmail.com>

* improve output, add changelog and docs

Signed-off-by: danishprakash <grafitykoncept@gmail.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-12-16 15:46:58 -08:00
Piotr Kazmierczak
758bc68925 acl: SSO auth methods CLI documentation (#15538)
This PR provides documentation for the ACL Auth Methods CLI commands.

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-12-14 13:35:26 +01:00
Tim Gross
b23f435624 docs: update plugin status docs with capabilities and topology (#15448)
The `plugin status` command supports displaying CSI capabilities and topology
accessibility, but this was missing from the documentation. Extend the
`-verbose` example to show that info.
2022-12-01 12:18:56 -05:00
Jack
66c61e4fd2 cli: wait flag for use with deployment status -monitor (#15262) 2022-11-23 16:36:13 -05:00
Lance Haig
8667dc2607 Add command "nomad tls" (#14296) 2022-11-22 14:12:07 -05:00
Tim Gross
65b3d01aab eval delete: move batching of deletes into RPC handler and state (#15117)
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending larges batches of IDs. Although the command's batch size
was carefully tuned, we still need to be JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look a the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only abut 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
2022-11-14 14:08:13 -05:00
Douglas Jose
1217a96edf Fix wrong reference to vault (#15228) 2022-11-14 10:49:09 +01:00
Kevin Wang
57dc7c2ab1 fix: website broken links (#14904)
* fix: website broken links

* fix up keyring-rotate link

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-17 11:32:10 -04:00
Damian Czaja
e4efedbbe4 cli: add nomad fmt (#14779) 2022-10-06 17:00:29 -04:00
Giovani Avelar
2b9158b73e Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
Derek Strickland
58e76c64d5 Merge pull request #14664 from hashicorp/docs-multiregion-dispatch
multiregion: Added a section for multiregion parameterized job dispatch
2022-09-28 15:40:11 -04:00
Derek Strickland
3c63967107 link from dispatch command 2022-09-28 08:30:22 -04:00
Tim Gross
e7e9713d2e variables: document restrictions on path and size (#14687) 2022-09-26 11:40:53 -04:00
Bryce Kalow
67d39725b1 website: content updates for developer (#14473)
Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com>
Co-authored-by: Anthony <russo555@gmail.com>
Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com>
Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com>
Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com>
Co-authored-by: Kevin Wang <kwangsan@gmail.com>
2022-09-16 10:38:39 -05:00
Tim Gross
93a147e482 docs: include path in ACL requirements for variables (#14561)
Also add links to the ACL policy reference and variables concepts docs near the
top of the page.
2022-09-13 10:21:29 -04:00
Charlie Voiselle
df04cd15d6 Variables CLI documentation (#14249) 2022-09-12 16:44:31 -04:00
Tim Gross
0c82b1dec9 remove root keyring install API (#14514)
* keyring rotate API should require put/post method
* remove keyring install API
2022-09-09 08:50:35 -04:00
Luiz Aoqui
99ebd0ab26 cli: set -hcl2-strict to false if -hcl1 is defined (#14426)
These options are mutually exclusive but, since `-hcl2-strict` defaults
to `true` users had to explicitily set it to `false` when using `-hcl1`.

Also return `255` when job plan fails validation as this is the expected 
code in this situation.
2022-09-01 10:42:08 -04:00
James Rasell
20f71cf3e4 docs: add documentation for ACL token expiration and ACL roles. (#14332)
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.

The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.

The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
2022-08-31 16:13:47 +02:00
Tim Gross
e1e5bb1dce docs: rename Secure Variables to Variables (#14352) 2022-08-29 11:37:08 -04:00
Luiz Aoqui
f74f50804a Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Tim Gross
2eaf3d7270 allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
Seth Hoenig
e96d52d87f cli: respect vault token in plan command
This PR fixes a regression where the 'job plan' command would not respect
a Vault token if set via --vault-token or $VAULT_TOKEN.

Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062

Fixes https://github.com/hashicorp/nomad/issues/13939
2022-08-11 08:54:08 -05:00
Charlie Voiselle
279631eeb5 Sweep of docs for repeated words; minor edits (#14032) 2022-08-05 16:45:30 -04:00
Tim Gross
587360543b docs: keyring commands (#13690)
Document the secure variables keyring commands, document the aliased
gossip keyring commands, and note that the old gossip keyring commands
are deprecated.
2022-07-20 14:14:10 -04:00
Tim Gross
f295396ef8 docs: rename Internals to Concepts (#13696) 2022-07-11 16:55:33 -04:00
Tim Gross
b209fc47da docs: move operator subcommands under their own trees (#13677)
The sidebar navigation tree for the `operator` sub-sub commands is
getting cluttered and we have a new set of commands coming to support
secure variables keyring as well. Move these all under their own
subtrees.
2022-07-11 14:00:24 -04:00
Luiz Aoqui
52389ff726 cli: improve output of eval commands (#13581)
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.

Include the eval namespace in all output formats and always include the
job ID in `eval status` since, even `node-update` evals are related to a
job.

Add Node ID to the evals table output to help differentiate
`node-update` evals.

Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 13:13:34 -04:00
James Rasell
11cb4c6d82 core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell
24220d0a02 core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Luiz Aoqui
3737fb3c7d docs: create volume spec page (#13353)
In addition to jobs, there are other objects in Nomad that have a
specific format and can be provided to commands and API endpoints.

This commit creates a new menu section to hold the specification for
volumes and update the command pages to point to the new centralized
definition.

Redirecting the previous entries is not possible with `redirect.js`
because they are done server-side and URL fragments are not accessible
to detect a match. So we provide hidden anchors with a link to the new
page to guide users towards the new documentation.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-14 14:08:25 -04:00
Michael Schurter
34959b26df docs: explain behavior of system gc command (#13342) 2022-06-13 09:54:23 +02:00
Lance Haig
eafc93902b Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
PinkLolicorn
b181919ce6 docs: mount_flags takes a slice of strings (#13087)
The description of `mount_flags` provides incorrect example
of the accepted value format.

This fixes the issue by changing the example from a string
`ro,noatime` to a slice of strings `["ro", "noatime"]`.
2022-05-20 09:16:17 -04:00
Seth Hoenig
0a5992bd20 cli: correctly use and validate job with vault token set
This PR fixes `job validate` to respect '-vault-token', '$VAULT_TOKEN',
'-vault-namespace' if set.
2022-05-19 12:13:34 -05:00
Seth Hoenig
d91e4160da cli: update default redis and use nomad service discovery
Closes #12927
Closes #12958

This PR updates the version of redis used in our examples from 3.2 to 7.
The old version is very not supported anymore, and we should be setting
a good example by using a supported version.

The long-form example job is now fixed so that the service stanza uses
nomad as the service discovery provider, and so now the job runs without
a requirement of having Consul running and configured.
2022-05-17 10:24:19 -05:00
Tim Gross
9d5c7b5d94 CSI: node drain should end once only plugins remain (#12846)
In #12324 we made it so that plugins wait until the node drain is
complete, as we do for system jobs. But we neglected to mark the node
drain as complete once only plugins (or system jobs) remaining, which
means that the node drain is left in a draining state until the
`deadline` time expires. This was incorrectly documented as expected
behavior in #12324.
2022-05-03 10:20:22 -04:00
Matus Goljer
89a794c905 nomad can also install autocomplete for fish shell (#12834) 2022-05-02 09:26:55 -04:00
Tim Gross
342a4ee735 docs: clarify capacity_min/max for volumes (#12825)
The capacity fields for `create volume` set bounds on the resulting
size of the volume, but the ultimate size of the volume will be
determined by the storage provider (between the min and max). Clarify
this in the documentation and provide a suggestion for how to set a
exact size.
2022-04-29 13:38:30 -04:00
Michael Schurter
e4d6d51035 docs: update json jobs docs (#12766)
* docs: update json jobs docs

Did you know that Nomad has not 1 but 2 JSON formats for jobs? 2½ if you
want to acknowledge that sometimes our JSON job representations have a
Job top-level wrapper and sometimes do not.

The 2½ formats are:
```
 1.   HCL JSON
 2.   Input API JSON (top-level Job field)
 2.5. Output API JSON (lacks top-level Job field)
```

`#2` is what our docs consider our API JSON. `#2.5` seems to be an
accident of history we can't fix with breaking API compatibility.

`#1` is an even more interesting accident of history: the `jobspec2`
package automatically detects if the input to Parse is JSON and switches
to a JSON parser. This behavior is undocumented, the format is
unspecified, and there is no official HashiCorp tooling to produce this
JSON from HCL. The plot thickens when you discover popular third party
tools like hcl2json.com and https://github.com/tmccombs/hcl2json seem to
produce JSON that `nomad run` accepts!

Since we have no telemetry around whether or not anyone passes HCL JSON
to `nomad run`, and people don't file bugs around features that Just
Work, I'm choosing to leave that code path in place and *acknowledged
but not suggested* in documentation.

See https://github.com/hashicorp/hcl/issues/498 for a more comprehensive
discussion of what officially supporting HCL JSON in Nomad would look
like.

(I also added some of the missing fields to the (Input API flavor) JSON
Job documentation, but it still needs a lot of work to be
comprehensive.)

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-04-22 15:57:27 -07:00
Luiz Aoqui
0abe5a6c79 vault: revert support for entity aliases (#12723)
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
2022-04-22 10:46:34 -04:00
James Rasell
2c6966c61a cli: add pagination flags to service info command. (#12730) 2022-04-22 10:32:40 +02:00
Michael Schurter
7af0c3c9e5 cli: add -json flag to support job commands (#12591)
* cli: add -json flag to support job commands

While the CLI has always supported running JSON jobs, its support has
been via HCLv2's JSON parsing. I have no idea what format it expects the
job to be in, but it's absolutely not in the same format as the API
expects.

So I ignored that and added a new -json flag to explicitly support *API*
style JSON jobspecs.

The jobspecs can even have the wrapping {"Job": {...}} envelope or not!

* docs: fix example for `nomad job validate`

We haven't been able to validate inside driver config stanzas ever since
the move to task driver plugins. 😭
2022-04-21 13:20:36 -07:00
Tim Gross
42bcb74a51 cli: detect directory when applying namespace spec file (#12738)
The new `namespace apply` feature that allows for passing a namespace
specification file detects the difference between an empty namespace
and a namespace specification by checking if the file exists. For most
cases, the file will have an extension like `.hcl` and so there's
little danger that a user will apply a file spec when they intended to
apply a file name.

But because directory names typically don't include an extension,
you're much more likely to collide when trying to `namespace apply` by
name only, and then you get a confusing error message of the form:

   Failed to read file: read $namespace: is a directory

Detect the case where the namespace name collides with a directory in
the current working directory, and skip trying to load the directory.
2022-04-21 14:53:45 -04:00
James Rasell
15e6e5befc api: Add support for filtering and pagination to the node list endpoint (#12727) 2022-04-21 17:04:33 +02:00
Tim Gross
ab6f13db1d Fix flaky operator debug test (#12501)
We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes:

* Make first pprof collection synchronous to preserve the existing
  behavior for the common case where the pprof interval matches the
  duration.

* Clamp `operator debug` pprof timing to that of the command. The
  `pprof-duration` should be no more than `duration` and the
  `pprof-interval` should be no more than `pprof-duration`. Clamp the
  values rather than throwing errors, which could change the commands
  that existing users might already have in debugging scripts

* Testing: remove test parallelism

  The `operator debug` tests that stand up servers can't be run in
  parallel, because we don't have a way of canceling the API calls for
  pprof. The agent will still be running the last pprof when we exit,
  and that breaks the next test that talks to that same agent.
  (Because you can only run one pprof at a time on any process!)

  We could split off each subtest into its own server, but this test
  suite is already very slow. In future work we should fix this "for
  real" by making the API call cancelable.


* Testing: assert against unexpected errors in `operator debug` tests.

  If we assert there are no unexpected error outputs, it's easier for
  the developer to debug when something is going wrong with the tests
  because the error output will be presented as a failing test, rather
  than just a failing exit code check. Or worse, no failing exit code
  check!

  This also forces us to be explicit about which tests will return 0
  exit codes but still emit (presumably ignorable) error outputs.

Additional minor bug fixes (mostly in tests) and test refactorings:

* Fix text alignment on pprof Duration in `operator debug` output

* Remove "done" channel from `operator debug` event stream test. The
  goroutine we're blocking for here already tells us it's done by
  sending a value, so block on that instead of an extraneous channel

* Event stream test timer should start at current time, not zero

* Remove noise from `operator debug` test log output. The `t.Logf`
  calls already are picked out from the rest of the test output by
  being prefixed with the filename.

* Remove explicit pprof args so we use the defaults clamped from
  duration/interval
2022-04-07 15:00:07 -04:00
Jasmine Dahilig
c6583b27c8 docs: update vault-token note in job run command #8040 (#12385) 2022-04-06 10:01:38 -07:00