The Nomad garbage collector can be triggered manually, which among
other things removes down nodes from state. If a garbage-collected node
then attempts to reconnect, it will be rejected when the cluster is
running strict enforcement, even if it has a valid node identity token.
This change fixes the issue by allowing nodes to reconnect with a
node identity, even if their state object has been removed by the
GC process. This only works if the node identity has not
expired. If it has expired and strict enforcement is enabled, the operator
will have to re-introduce the node to the cluster, which is the
expected and correct behaviour.
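A minimal sketch of that decision, using hypothetical, simplified types and names (the real handler, node, and identity structures in Nomad are more involved):

```go
// Illustrative sketch only: hypothetical, simplified types and names.
package main

import (
	"errors"
	"time"
)

type nodeState struct{ ID string }

type nodeIdentity struct{ Expiry time.Time }

// allowReconnect: a node whose state was removed by the garbage collector
// may still re-register under strict enforcement, as long as its node
// identity has not expired.
func allowReconnect(existing *nodeState, identity *nodeIdentity, strictEnforcement bool) error {
	if existing != nil {
		return nil // node still known; normal registration path
	}
	if !strictEnforcement {
		return nil // without strict enforcement, treat it as a new node
	}
	if identity == nil || time.Now().After(identity.Expiry) {
		return errors.New("node identity missing or expired: operator must re-introduce the node")
	}
	return nil
}

func main() {
	// A GC'd node with a still-valid identity is allowed back in.
	id := &nodeIdentity{Expiry: time.Now().Add(time.Hour)}
	if err := allowReconnect(nil, id, true); err != nil {
		panic(err)
	}
}
```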
The RPC handler function is quite long, so moving the argument
validation into its own function reduces its length and makes sense
from an organisational standpoint.
In https://github.com/hashicorp/nomad/issues/15459 we've had a bit of
back-and-forth as a result of applying Nomad environment variables where they
typically should not be used. Clarify that the env vars are for the CLI and
mostly not for the agent. Also move the `NOMAD_CLI_SHOW_HINTS` description into
the correct section.
The docs for the `template` block accurately describe the template configuration's
default function denylist in the body text, but the parameter defaults are missing
the values. The equivalent docs in the `client` configuration are also missing
`executeTemplate`.
When a node misses a heartbeat and is marked down, Nomad deletes service
registration instances for that node. But if the node then successfully
heartbeats before its allocations are marked lost, the services are never
restored. The node is unaware that it has missed a heartbeat and there's no
anti-entropy on the node in any case.
We already delete services when the plan applier marks allocations as stopped,
so deleting the services when the node goes down is only an optimization to more
quickly divert service traffic. But because the state after a plan apply is the
"canonical" view of allocation health, this breaks correctness.
Remove the code path that deletes services from nodes when nodes go down. Retain
the state store code that deletes services when allocs are marked terminal by
the plan applier. Also add a path in the state store to delete services when
allocs are marked terminal by the client. This gets back some of the
optimization but avoids the correctness bug because marking the allocation
client-terminal is a one-way operation.
Fixes: https://github.com/hashicorp/nomad/issues/16983
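A minimal Go sketch of the new state store path, with toy types standing in for Nomad's real allocation and service registration structures:

```go
package main

import "fmt"

// Simplified, illustrative types; Nomad's state store, allocation, and
// service registration structures are more involved.
type alloc struct {
	ID           string
	ClientStatus string
}

type serviceRegistration struct {
	AllocID string
	Name    string
}

// clientTerminal: "complete", "failed", and "lost" are one-way client
// statuses, which is what makes this path safe.
func clientTerminal(a *alloc) bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// updateAllocsFromClient sketches the state store path: when the client
// reports an allocation as terminal, drop its service registrations. The
// node-down path no longer deletes services at all.
func updateAllocsFromClient(allocs []*alloc, services []*serviceRegistration) []*serviceRegistration {
	terminal := map[string]bool{}
	for _, a := range allocs {
		if clientTerminal(a) {
			terminal[a.ID] = true
		}
	}
	var kept []*serviceRegistration
	for _, s := range services {
		if !terminal[s.AllocID] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	allocs := []*alloc{{ID: "a1", ClientStatus: "failed"}, {ID: "a2", ClientStatus: "running"}}
	services := []*serviceRegistration{{AllocID: "a1", Name: "web"}, {AllocID: "a2", Name: "api"}}
	fmt.Println(len(updateAllocsFromClient(allocs, services))) // 1: only a2's service remains
}
```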
* Update UI, code comment, and README links to docs, tutorials
* Fix typo in ephemeral disks learn more link URL
* Address review feedback on the typo fix
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This change implements the client -> server workflow for Nomad
node introduction. A Nomad node can optionally be started with an
introduction token, which is a signed JWT containing claims for
the node registration. The server handles this according to the
enforcement configuration.
The introduction token can be provided by env var, CLI flag, or
by placing it within a default filesystem location. The latter
option does not override the CLI flag or env var.
The region claim has been removed from the initial claims set of
the intro identity. This boundary is guarded by mTLS and aligns
with the node identity.
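A minimal sketch of the resolution order, with made-up flag, env var, and file names (the real names live in Nomad's agent configuration code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Hypothetical names for illustration only.
const (
	introTokenEnv         = "NOMAD_NODE_INTRO_TOKEN"  // assumed env var name
	defaultIntroTokenFile = "node-introduction-token" // assumed default file name
)

// resolveIntroToken sketches the precedence: an explicit CLI flag or
// environment variable wins; the default file location is only consulted
// when neither is set, so it never overrides them.
func resolveIntroToken(flagValue, dataDir string) (string, error) {
	if flagValue != "" {
		return flagValue, nil
	}
	if v := os.Getenv(introTokenEnv); v != "" {
		return v, nil
	}
	raw, err := os.ReadFile(filepath.Join(dataDir, defaultIntroTokenFile))
	if os.IsNotExist(err) {
		return "", nil // no introduction token provided; server enforcement decides
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	tok, err := resolveIntroToken("", "/var/lib/nomad")
	fmt.Println(tok, err)
}
```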
The state store test for Variables check-and-set behavior for deletes uses the
same state store for a set of parallel tests. But one of the tests overlaps
another by using the same path, and this can cause spurious test failures by
hitting the CAS conflict error. This overlap doesn't appear to be intentional,
so change the test to use a different path.
Also cleaned up some unused test helpers in the same file.
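A minimal sketch of the pattern, using a toy check-and-set store rather than Nomad's state store: each parallel subtest owns a unique path, so its CAS operations can never collide with another subtest's.

```go
package vars_test

import (
	"fmt"
	"sync"
	"testing"
)

// casStore is a toy stand-in for the state store's check-and-set semantics
// on Variables; it is illustrative only, not Nomad's implementation.
type casStore struct {
	mu   sync.Mutex
	data map[string]uint64 // path -> modify index
}

func (s *casStore) deleteCAS(path string, index uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur := s.data[path]; cur != index {
		return fmt.Errorf("cas conflict: expected index %d, found %d", index, cur)
	}
	delete(s.data, path)
	return nil
}

// Each parallel subtest owns a unique path, so one subtest's operations can
// never trigger a spurious CAS conflict in another.
func TestVariables_DeleteCAS(t *testing.T) {
	store := &casStore{data: map[string]uint64{}}
	for i := 0; i < 3; i++ {
		path := fmt.Sprintf("test/delete-cas/%d", i) // unique per subtest
		store.data[path] = 1
		t.Run(path, func(t *testing.T) {
			t.Parallel()
			if err := store.deleteCAS(path, 1); err != nil {
				t.Fatalf("unexpected CAS conflict: %v", err)
			}
		})
	}
}
```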
* Add -log-file-export and -log-lookback commands to add historical logs to the debug capture
* Use the monitor.PrepFile() helper for other historical log tests
When cgroup setup fails, the executor panics and dies, leaving an orphaned process still running.
The panic fix:
* don't `panic()`
* return an empty, but non-nil, func on cgroup error (see the sketch below)
The feature fix:
* allow a non-root agent to proceed with exec when cgroups are off
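A minimal sketch of the non-panicking error path, with a toy stand-in for the executor's cgroup setup:

```go
package main

import (
	"errors"
	"fmt"
)

// setupCgroup is a toy stand-in for the executor's cgroup setup; the real
// code lives in Nomad's executor and talks to the host cgroup hierarchy.
func setupCgroup(available bool) (cleanup func(), err error) {
	if !available {
		// Instead of panicking, report the error and hand back a no-op
		// cleanup func, so callers can always `defer cleanup()` safely.
		return func() {}, errors.New("cgroups unavailable")
	}
	return func() { fmt.Println("cgroup removed") }, nil
}

func main() {
	cleanup, err := setupCgroup(false)
	if err != nil {
		// A non-root agent without cgroups logs the error and proceeds
		// with the exec task instead of dying.
		fmt.Println("continuing without cgroups:", err)
	}
	defer cleanup() // never nil, so no panic on the error path
}
```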
Whenever we add a new Raft message type, we almost always need to add a new
version check to ensure that leaders aren't trying to write unknown Raft entries
to older followers. Leave a note about this where the edits happen to reduce the
risk of this unfortunately common bug.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2973
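A minimal sketch of the version-gate pattern the note is reminding us about, with a made-up minimum version ("1.10.0" is illustrative) and a simplified stand-in for the real membership check:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/hashicorp/go-version"
)

// Every new Raft message type needs a minimum-version constant like this
// one; the value here is made up for illustration.
var minVersionNewEntry = version.Must(version.NewVersion("1.10.0"))

// serversMeetMinimumVersion is a simplified stand-in for Nomad's own
// version-gating helper, which inspects the serf member list.
func serversMeetMinimumVersion(servers []*version.Version, min *version.Version) bool {
	for _, v := range servers {
		if v.LessThan(min) {
			return false
		}
	}
	return true
}

// applyNewEntry sketches the guard a leader should run before submitting a
// new Raft entry type, so that older followers never see unknown messages.
func applyNewEntry(servers []*version.Version) error {
	if !serversMeetMinimumVersion(servers, minVersionNewEntry) {
		return errors.New("all servers must be upgraded before writing this Raft entry type")
	}
	fmt.Println("safe to apply the new Raft entry type")
	return nil
}

func main() {
	servers := []*version.Version{version.Must(version.NewVersion("1.9.3"))}
	fmt.Println(applyNewEntry(servers))
}
```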
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit (see the sketch after this list)
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
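A minimal, illustrative sketch of the flush-before-exit pattern only, not Nomad's StreamFramer or StreamReader: frames are written by a background goroutine, and Flush blocks until everything queued so far has been written, so the exporting function can safely return.

```go
package main

import (
	"fmt"
	"os"
)

// framer is an illustrative sketch: frames are written by a background
// goroutine, and Flush blocks until everything queued so far is written.
type framer struct {
	frames chan []byte
	flush  chan chan struct{}
}

func newFramer() *framer {
	f := &framer{frames: make(chan []byte, 16), flush: make(chan chan struct{})}
	go f.run()
	return f
}

func (f *framer) run() {
	for {
		select {
		case data := <-f.frames:
			os.Stdout.Write(data)
		case done := <-f.flush:
			// Drain anything already queued, then signal the caller.
			for drained := false; !drained; {
				select {
				case data := <-f.frames:
					os.Stdout.Write(data)
				default:
					drained = true
				}
			}
			close(done)
		}
	}
}

func (f *framer) Send(data []byte) { f.frames <- data }

func (f *framer) Flush() {
	done := make(chan struct{})
	f.flush <- done
	<-done
}

func main() {
	f := newFramer()
	f.Send([]byte("historical log line\n"))
	// Without the Flush, this function could return (and the stream close)
	// before the last frames had been written.
	f.Flush()
	fmt.Println("export complete")
}
```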
Improved the `acl policy self` CLI command to handle both management and client tokens.
Management tokens now display a clear message indicating global access with no individual policies.
Fixes: https://github.com/hashicorp/nomad/issues/26389
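A minimal sketch of that behavior, with a simplified token shape instead of Nomad's API types:

```go
package main

import "fmt"

// aclToken is a simplified token shape for illustration; the real command
// uses Nomad's API client and token types.
type aclToken struct {
	Type     string // "management" or "client"
	Policies []string
}

// formatSelfPolicies sketches the behavior: a management token has no
// individual policies to list, so print a clear message instead of an
// empty (and confusing) policy list.
func formatSelfPolicies(tok *aclToken) string {
	if tok.Type == "management" {
		return "Management token: global access, no individual policies"
	}
	return fmt.Sprintf("Policies: %v", tok.Policies)
}

func main() {
	fmt.Println(formatSelfPolicies(&aclToken{Type: "management"}))
	fmt.Println(formatSelfPolicies(&aclToken{Type: "client", Policies: []string{"readonly"}}))
}
```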
Affinities and constraints use similar feasibility checking logic to determine if
a given node matches (although affinities don't support all the same
operators). Most operators don't allow `value` to be unset. Update the docs to
reflect this.
Fixes: https://github.com/hashicorp/nomad/issues/24983
During the big docs rearchitecture, we split up the task driver pages into
separate job declaration and driver configuration pages. The link for the
`raw_exec` driver to the configuration page is a self-reference.
The documentation for CSI and DHV has a list of the available access modes, but
doesn't explain what they mean in terms of what jobs can request, the scheduler
behavior, or the CSI plugin behavior. Expand on the information available in the
CSI specification and provide a description of DHV's behavior as well.
Ref: https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume
Update our E2E compatibility test for Consul and Vault to only include versions back to
the oldest-supported LTS versions of Consul and Vault. This will still leave
a few unsupported non-LTS versions in the matrix between the two oldest LTS versions, but
this is a small number of tests and fixing it would mean hard-coding the LTS
support matrix in our tests.
It seems the tool requires a little attention and does not run
well across our enterprise codebase. Roll back that makefile
change so it does not block enterprise work, backports, CI, etc.