When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.
Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
that returns an error, to make the workflow more clear and reduce nesting. Log
the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
on initializing the keyring, so there's a race condition in the keyring tests
where we can test for the existence of the root key before the keyring has
been initialize. Change this to an "eventually" test.
But these fixes aren't enough to fix#14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
The existing docs on required capabilities are a little sparse and have been the
subject of a lots of questions. Expand on this information and provide a pointer
to the ongoing design discussion around rootless Nomad.
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.
Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
This PR removes the assertion around when the 'task' field of
a check may be set. Starting in Nomad 1.4 we automatically set
the task field on all checks in support of the NSD checks feature.
This is causing validation problems elsewhere, e.g. when a group
service using the Consul provider sets 'task' it will fail
validation that worked previously.
The assertion of leaving 'task' unset was only about making sure
job submitters weren't expecting some behavior, but in practice
is causing bugs now that we need the task field for more than it
was originally added for.
We can simply update the docs, noting when the task field set by
job submitters actually has value.
* docs: clarify nomad vars vs vault
I think we should make the difference in root key management between
Nomad and Vault clear in the concept docs. I didn't see anywhere else in
the docs we compared it.
I also s/secrets/variables everywhere except the first sentence since
the feature is intended to be more generic than secrets. Right now it's
more of a compliment to Consul's kv than Vault due to root key handling
and featureset.
* Update website/content/docs/concepts/variables.mdx
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This is probably undocumented for a reason, but the `enabled` toggle in the
`periodic` stanza is very useful so I figured I try adding it to the docs.
The feature has been secretly avaliable since #9142 and was called out in that
PR as being a dubious addition, only added to avoid regressions.
The use case for disabling a periodic job in this way is to prevent it from
running without modifying the schedule. Ideally Nomad would make it more clear
that this was the case, and allow you to force a run of the job, but even with
those rough edges I think users would benefit from knowing about this toggle.
This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website because
the material is not required for operators and includes a lot of diagrams that
we can cheaply maintain with mermaid syntax but would involve art assets to have
up on the main site that would become quickly out of date as code changes happen
and be extremely expensive to maintain. However, these should be suitable to use
as points of conversation with expert end users.
Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
broken into digestible chunks and without multi-region deployments, which I'd
like to cover in a future doc.
Also includes adding Deployments to our public-facing glossary.
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
This reverts PR #12416 and commit 6668ce022a.
While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
* cleanup: fixup linter warnings in schedular/feasible.go
* core: numeric operands comparisons in constraints
This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.
Inspiration #4856Closes#4729Closes#14719
* fix: always parse as int64
* docs: write a lot of words about heartbeats
Alternative to #14670
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* use descriptive title for link
* rework example of high failover ttl
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* fingerprint: add node attr for reserverable cores
Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.
Hopefully clarifies some confusion in #14676
* add changelog
* num_reservable_cores -> reservablecores
Extension of #14673
Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.
Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.
In the case of Vault fingerprinting, the original behavior might cause
the following:
1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again
Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:
(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.
(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.
This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
* scheduler: stopped-yet-running allocs are still running
* scheduler: test new stopped-but-running logic
* test: assert nonoverlapping alloc behavior
Also add a simpler Wait test helper to improve line numbers and save few
lines of code.
* docs: tried my best to describe #10446
it's not concise... feedback welcome
* scheduler: fix test that allowed overlapping allocs
* devices: only free devices when ClientStatus is terminal
* test: output nicer failure message if err==nil
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Fixing heading order, adding text for links
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Applying more suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When configuring Consul Service Mesh, it's sometimes necessary to
provide dynamic value that are only known to Nomad at runtime. By
interpolating configuration values (in addition to configuration keys),
user are able to pass these dynamic values to Consul from their Nomad
jobs.
These options are mutually exclusive but, since `-hcl2-strict` defaults
to `true` users had to explicitily set it to `false` when using `-hcl1`.
Also return `255` when job plan fails validation as this is the expected
code in this situation.
Nomad is generally compliant with the CSI specification for Container
Orchestrators (CO), except for unimplemented features. However, some storage
vendors have built CSI plugins that are not compliant with the specification or
which expect that they're only deployed on Kubernetes. Nomad cannot vouch for
the compatibility of any particular plugin, so clarify this in the docs.
Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.
The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.
The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
Update the on-disk format for the root key so that it's wrapped with a unique
per-key/per-server key encryption key. This is a bit of security theatre for the
current implementation, but it uses `go-kms-wrapping` as the interface for
wrapping the key. This provides a shim for future support of external KMS such
as cloud provider APIs or Vault transit encryption.
* Removes the JSON serialization extension we had on the `RootKey` struct; this
struct is now only used for key replication and not for disk serialization, so
we don't need this helper.
* Creates a helper for generating cryptographically random slices of bytes that
properly accounts for short reads from the source.
* No observable functional changes outside of the on-disk format, so there are
no test updates.