Commit Graph

4213 Commits

Author SHA1 Message Date
Tim Gross
786dc5ff94 fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Tim Gross
d1e90a17d6 cli: remove deprecated eval status -json list behavior (#14651)
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
2022-09-22 10:56:32 -04:00
Bryce Kalow
67d39725b1 website: content updates for developer (#14473)
Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com>
Co-authored-by: Anthony <russo555@gmail.com>
Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com>
Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com>
Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com>
Co-authored-by: Kevin Wang <kwangsan@gmail.com>
2022-09-16 10:38:39 -05:00
Kyle Rarey
b8c872fd49 docs: Correct driver name for 'Nomad Task Group' autoscaler target (#14576) 2022-09-14 09:40:00 +02:00
Mahmood Ali
757c3c94f2 scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Tim Gross
06686c84ed docs: tweak some copy in the concept docs (#14566) 2022-09-13 13:21:09 -04:00
Seth Hoenig
9cc39de738 Merge pull request #14559 from hashicorp/docs-nsd-check-watcher
docs: add documentation for nomad service check restarts
2022-09-13 10:52:01 -05:00
Ashlee M Boyer
faa5b4cf65 docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549)
* Fixing heading order, adding text for links

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Applying more suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-13 10:59:02 -04:00
Seth Hoenig
37906ca213 docs: update docs for NSD check restart 2022-09-13 09:59:02 -05:00
Tim Gross
93a147e482 docs: include path in ACL requirements for variables (#14561)
Also add links to the ACL policy reference and variables concepts docs near the
top of the page.
2022-09-13 10:21:29 -04:00
Tim Gross
9ea8ca4aa5 docs: variables HTTP API documentation (#14516) 2022-09-13 10:18:26 -04:00
Tim Gross
0437bf384b docs: keyring HTTP API documentation (#14513) 2022-09-13 09:46:54 -04:00
Charlie Voiselle
df04cd15d6 Variables CLI documentation (#14249) 2022-09-12 16:44:31 -04:00
Tim Gross
6574717c55 docs: update template for Nomad Variables (#14527) 2022-09-12 16:36:18 -04:00
Tim Gross
0c82b1dec9 remove root keyring install API (#14514)
* keyring rotate API should require put/post method
* remove keyring install API
2022-09-09 08:50:35 -04:00
Tim Gross
f2186be02c CSI: failed allocation should not block its own controller unpublish (#14484)
A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check has a bug where the allocation belonging
to the claim being freed was included in the check incorrectly. During a normal
allocation stop for job stop or a new version of the job, the allocation is
terminal. But allocations that fail are not yet marked terminal at the point in
time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.

* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
2022-09-08 13:30:05 -04:00
James Rasell
11496d1816 hcl2: add strlen function and update docs. (#14463) 2022-09-06 18:42:40 +02:00
Luiz Aoqui
7d88937751 connect: interpolate task env in config values (#14445)
When configuring Consul Service Mesh, it's sometimes necessary to
provide dynamic value that are only known to Nomad at runtime. By
interpolating configuration values (in addition to configuration keys),
user are able to pass these dynamic values to Consul from their Nomad
jobs.
2022-09-02 15:00:28 -04:00
Luiz Aoqui
4f9beb0b39 docs: add warning about changing region config (#14443) 2022-09-01 16:47:06 -04:00
Luiz Aoqui
99ebd0ab26 cli: set -hcl2-strict to false if -hcl1 is defined (#14426)
These options are mutually exclusive but, since `-hcl2-strict` defaults
to `true` users had to explicitily set it to `false` when using `-hcl1`.

Also return `255` when job plan fails validation as this is the expected 
code in this situation.
2022-09-01 10:42:08 -04:00
Tim Gross
3a4345ef32 docs: clarify CSI plugin compatibility (#14434)
Nomad is generally compliant with the CSI specification for Container
Orchestrators (CO), except for unimplemented features. However, some storage
vendors have built CSI plugins that are not compliant with the specification or
which expect that they're only deployed on Kubernetes. Nomad cannot vouch for
the compatibility of any particular plugin, so clarify this in the docs.

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
2022-09-01 10:06:44 -04:00
Brett Larson
baa3b05d00 Update ephemeral_disk.mdx (#14356)
It is really unclear on how to use this feature. it took me a while to find this, so I thought I would purpose how to use this.
2022-08-31 20:17:41 -04:00
James Rasell
20f71cf3e4 docs: add documentation for ACL token expiration and ACL roles. (#14332)
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.

The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.

The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
2022-08-31 16:13:47 +02:00
Tim Gross
b7fea76f7f keyring: wrap root key in key encryption key (#14388)
Update the on-disk format for the root key so that it's wrapped with a unique
per-key/per-server key encryption key. This is a bit of security theatre for the
current implementation, but it uses `go-kms-wrapping` as the interface for
wrapping the key. This provides a shim for future support of external KMS such
as cloud provider APIs or Vault transit encryption.

* Removes the JSON serialization extension we had on the `RootKey` struct; this
  struct is now only used for key replication and not for disk serialization, so
  we don't need this helper.

* Creates a helper for generating cryptographically random slices of bytes that
  properly accounts for short reads from the source.

* No observable functional changes outside of the on-disk format, so there are
  no test updates.
2022-08-30 10:59:25 -04:00
Tim Gross
8dfa6b06e7 docs: fixing a few more places we missed "secure" during rename (#14395) 2022-08-30 10:08:50 -04:00
quoing
92680a83f9 docs: template change script example correction (#14368)
"path" parameter doesn't work, should be command
2022-08-30 12:09:55 +02:00
Tim Gross
e1e5bb1dce docs: rename Secure Variables to Variables (#14352) 2022-08-29 11:37:08 -04:00
Luiz Aoqui
f74f50804a Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Piotr Kazmierczak
34e4b080f6 template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00
Piotr Kazmierczak
2d4acce3da docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212)
This PR documents a change made in the enterprise version of nomad that addresses the following issue:

When a user tries to filter audit logs, they do so with a stanza that looks like the following:

audit {
  enabled = true

  filter "remove deletes" {
    type = "HTTPEvent"
    endpoints  = ["*"]
    stages = ["OperationComplete"]
    operations = ["DELETE"]
  }
}

When specifying both an "endpoint" and a "stage", the events with both matching a "endpoint" AND a matching "stage" will be filtered.

When specifying both an "endpoint" and an "operation" the events with both matching a "endpoint" AND a matching "operation" will be filtered.

When specifying both a "stage" and an "operation" the events with a matching a "stage" OR a matching "operation" will be filtered.

The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".
2022-08-24 16:31:49 +02:00
Tim Gross
e4e445ed1d docs: fix an anchor link in secure vars docs (#14231) 2022-08-23 10:46:24 -04:00
Seth Hoenig
e49bb89093 Merge pull request #14215 from hashicorp/docs-update-checks-for-nsd
docs: update check documentation with NSD specifics
2022-08-23 09:23:53 -05:00
Seth Hoenig
701c926b45 docs: fix checks doc typo
Co-authored-by: Piotr Kazmierczak <phk@mm.st>
2022-08-23 09:23:36 -05:00
Tim Gross
2eaf3d7270 allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
Luiz Aoqui
934bafb922 template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
Seth Hoenig
c7ab7a8313 docs: update check documentation with NSD specifics
This PR updates the checks documentation to mention support for checks
when using the Nomad service provider. There are limitations of NSD
compared to Consul, and those configuration options are now noted as
being Consul-only.
2022-08-22 10:50:26 -05:00
Phil Renaud
3bf5633446 [ui] general keyboard navigation: 1.3.4 release (#14138)
* Initialized keyboard service

Neat but funky: dynamic subnav traversal

👻

generalized traverseSubnav method

Shift as special modifier key

Nice little demo panel

Keyboard shortcuts keycard

Some animation styles on keyboard shortcuts

Handle situations where a link is deeply nested from its parent menu item

Keyboard service cleanup

helper-based initializer and teardown for new contextual commands

Keyboard shortcuts modal component added and demo-ghost removed

Removed j and k from subnav traversal

Register and unregister methods for subnav plus new subnavs for volumes and volume

register main nav method

Generalizing the register nav method

12762 table keynav (#12975)

* Experimental feature: shortcut visual hints

* Long way around to a custom modifier for keyboard shortcuts

* dynamic table and list iterative shortcuts

* Progress with regular old tether

* Delogging

* Table Keynav tether fix, server and client navs, and fix to shiftless on modified arrow keys

Go to Optimize keyboard link and storage key changed to g r

parameterized jobs keyboard nav

Dynamic numeric keynav for multiple tables (#13482)

* Multiple tables init

* URL-bind enumerable keyboard commands and add to more taskRow and allocationRows

* Type safety and lint fixes

* Consolidated push to keyCommands

* Default value when removing keyCommands

* Remove the URL-based removal method and perform a recompute on any add

Get tests passing in Keynav: remove math helpers and a few other defensive moves (#13761)

* Remove ember math helpers

* Test fixes for jobparts/body

* Kill an unneeded integration helper test

* delog

* Trying if disabling percy lets this finish

* Okay so its not percy; try parallelism in circle

* Percyless yet again

* Trying a different angle to not have percy

* Upgrade percy to 1.6.1

[ui] Keyboard nav: "u" key to go up a level (#13754)

* U to go up a level

* Mislabelled my conditional

* Custom lint ignore rule

* Custom lint ignore rule, this time with commas

* Since we're getting rid of ember math helpers elsewhere, do the math ourselves here

Replace ArrowLeft etc. with an ascii arrow (#13776)

* Replace ArrowLeft etc. with an ascii arrow

* non-mutative helper cleanup

Keyboard Nav: let users rebind their shortcuts (#13781)

* click-outside and shortcuts enabled/disabled toggle

* Trap focus when modal open

* Enabled/disabled saved to localStorage

* Autofocus edit button on variable index

* Modal overflow styles

* Functional rebind

* Saving rebinds to localStorage for all majors

* Started on defaultCommandBindings

* Modal header style and cancel rebind on escape

* keyboardable keybindings w buttons instead of spans

* recording and defaultvalues

* Enter short-circuits rebind

* Only some commands are rebindable, and dont show dupes

* No unused get import

* More visually distinct header on modal

* Disallowed keys for rebind, showing buffer as you type, and moving dedupe to modal logic

willDestroy hook to prevent tests from doubling/tripling up addEventListener on kb events

remove unused tests

Keyboard Navigation acceptance tests (#13893)

* Acceptance tests for keyboard modal

* a11y audit fix and localStorage clear

* Bind/rebind/localStorage tests

* Keyboard tests for dynamic nav and tables

* Rebinder and assert expectation

* Second percy snapshot showing hints no longer relevant

Weird issue where linktos with query props specifically from the task-groups page would fail to route / hit undefined.shouldSuperCede errors

Adds the concept of exclusivity to a keycommand, removing peers that also share its label

Lintfix

Changelog and PR feedback

Changelog and PR feedback

Fix to rebinding in firefox by blurring the now-disabled button on rebind (#14053)

* Secure Variables shortcuts removed

* Variable index route autofocus removed

* Updated changelog entry

* Updated changelog entry

* Keynav docs (#14148)

* Section added to the API Docs UI page

* Added a note about disabling

* Prev and Next order

* Remove dev log and unneeded comments
2022-08-17 12:59:33 -04:00
Kerim Satirli
c1abf47e16 adds link for Nomad-Pack GitHub action (#14118) 2022-08-16 08:34:26 +02:00
Tim Gross
ed55b97f30 secure vars: filter by path in List RPCs (#14036)
The List RPCs only checked the ACL for the Prefix argument of the request. Add
an ACL filter to the paginator for the List RPC.

Extend test coverage of ACLs in the List RPC and in the `acl` package, and add a
"deny" capability so that operators can deny specific paths or prefixes below an
allowed path.
2022-08-15 11:38:20 -04:00
Mike Nomitch
24bd3378fe Add notes about DAS being prometheus only (#14040) 2022-08-15 10:17:31 +02:00
Seth Hoenig
ea570b293b Merge pull request #14086 from Morantron/patch-1
Fix typo in /tools/autoscaling
2022-08-12 09:07:12 -05:00
Seth Hoenig
d3222e24c3 Merge pull request #14088 from hashicorp/b-plan-vault-token
cli: support vault token in plan command
2022-08-12 09:05:34 -05:00
Seth Hoenig
9504a4537b Merge pull request #14089 from hashicorp/f-docker-disable-healthchecks
docker: configuration for disable docker healthcheck
2022-08-12 09:00:31 -05:00
James Rasell
3f5bc76df1 docs: correctly state RPC port is used by servers and clients. (#14091) 2022-08-12 10:14:14 +02:00
Seth Hoenig
0bd42a501c docker: create a docker task config setting for disable built-in healthcheck
This PR adds a docker driver task configuration setting for turning off
built-in HEALTHCHECK of a container.

References)
https://docs.docker.com/engine/reference/builder/#healthcheck
https://github.com/docker/engine-api/blob/master/types/container/config.go#L16

Closes #5310
Closes #14068
2022-08-11 10:33:48 -05:00
Seth Hoenig
e96d52d87f cli: respect vault token in plan command
This PR fixes a regression where the 'job plan' command would not respect
a Vault token if set via --vault-token or $VAULT_TOKEN.

Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062

Fixes https://github.com/hashicorp/nomad/issues/13939
2022-08-11 08:54:08 -05:00
Morantron
0eb2194924 Update index.mdx 2022-08-11 09:03:52 +02:00
Seth Hoenig
a2d3c29005 Merge pull request #14065 from hashicorp/b-fwd-vtoken-validation
cli: forward request for job validation to nomad leader
2022-08-10 15:14:01 -05:00
Seth Hoenig
7b4652440a cli: forward request for job validation to nomad leader
This PR changes the behavior of 'nomad job validate' to forward the
request to the nomad leader, rather than responding from any server.

This is because we need the leader when validating Vault tokens, since
the leader is the only server with an active vault client.
2022-08-10 14:34:04 -05:00
dgotlieb
1d9e6f94f8 doc typo fix
docker and podman don't suck 🤣
2022-08-10 15:04:07 +03:00