The `sids_hook` serves the legacy Connect workflow, and we want to bypass it
when using workload identities. So the hook checks that there's not already a
Consul token in the alloc hook resources derived from the Workload
Identity. This check was looking for the wrong key. This would cause the hook to
ignore the Consul token we already have and then fail to derive a SI token
unless the Nomad agent has its own token with `acl:write` permission.
Fix the lookup and add tests covering the bypass behavior.
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm rstore
Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
In Nomad Enterprise, a task may connect to a non-default Vault cluster,
requiring `consul-template` to be configured with a specific client
`vault` block.
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.
Ref: https://github.com/hashicorp/team-nomad/issues/404
Includes changes to WID Manager that make it request signed identities for
services, as well as a few improvements to WIHandle introduced in #18672.
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
When clients are restarted and the identity hook runs when we restore
allocations, the running allocations are likely to have already-signed Workload
Identities that are unexpired. Save these to the client's local state DB so that
we can avoid a thundering herd of RPCs during client restart. When we restore,
we'll check if there's at least one expired signed WI before making any initial
signing request.
Included:
* Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear.
* Renames the existing `widmgr_test.go` file of integration tests, which is in its
own package to avoid circular imports to `widmgr_int_test.go`
Any code that tracks workloads and their identities should not rely on string
comparisons, especially since we support 2 types of workload identities: those
that identify tasks and those that identify services. This means we cannot rely
on task.Name for workload-identity pairs.
The new type structs.WIHandle solves this problem by providing a uniform way of
identifying workloads and their identities.
Since the allocation in the task runner is updated in a separate
goroutine, a race condition may happen where the task is started but the
prestart hooks are skipped because the allocation became terminal.
Checking for a terminal allocation before proceeding with the task start
ensures the task only runs if the prestart hooks are also executed.
Since `shouldShutdown()` only uses terminal allocation status, it
remains `true` after the first transition, so it's safe to check it
again after the prestart hooks as it will never revert to `false`.
When Workload Identity is being used with Consul, the `consul_hook` will add
Consul tokens to the alloc hook resources. Update the `group_service_hook` and
`service_hook` to use those tokens when available for registering and
deregistering Consul workloads.
The `sids_hook` runs for Connect sidecar/gateway tasks and gets Consul Service
Identity (SI) tokens for use by the Envoy bootstrap hook. When Workload Identity
is being used with Consul, the `consul_hook` will have already added these
tokens to the alloc hook resources. Update the `sids_hook` to use those tokens
instead and write them to the expected area of the taskdir.
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
This PR introduces a new allocrunner-level consul_hook which iterates over
services and tasks, if their provider is consul, fetches consul tokens for all of
them, stores them in AllocHookResources and in task secret dirs.
Ref: hashicorp/team-nomad#404
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* vault: update identity name to start with `vault_`
In the original proposal, workload identities used to derive Vault
tokens were expected to be called just `vault`. But in order to support
multiple Vault clusters it is necessary to associate identities with
specific Vault cluster configuration.
This commit implements a new proposal to have Vault identities named as
`vault_<cluster>`.
This commit splits identity_hook between the allocrunner and taskrunner. The
allocrunner-level part of the hook signs each task identity, and the
taskrunner-level part picks it up and stores secrets for each task.
The code revamps the WIDMgr, which is now split into 2 interfaces:
IdentityManager which manages renewals of signatures and handles sending
updates to subscribers via Watch method, and IdentitySigner which only does the
signing.
This work is necessary for having a unified Consul login workflow that comes
with the new Consul integration. A new, allocrunner-level consul_hook will now
be the only hook doing Consul authentication.
Nomad Enterprise will support configuring multiple Vault clients. Instead of
having a single Vault client field in the Nomad client, we'll have a function
that callers can parameterize by the Vault cluster name that returns the
correctly configured Vault API client wrapper.
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test slit tables to be possible
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
* lang: note that Stack is not concurrency-safe
* client: use more descriptive name for wrangler hook in logs
* numalib: use correct name for receiver parameter
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.
Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.
Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21
There are some refactorings that have to be made in the getter and state
where the api changed in `slices`
* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function
This feature is necessary when user want to explicitly re-render all templates on task restart.
E.g. to fetch all new secrets from Vault, even if the lease on the existing secrets has not been expired.
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using
`image` filesystem isolation. As a result, more powerful tokens can be used
in a job definition, allowing it to use template stanzas to issue all kinds of
secrets (database secrets, Vault tokens with very specific policies, etc.),
without sharing that issuing power with the task itself.
This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.
If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.
This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.
Fixes: #14636#15686
This PR fixes a bug where nodes configured with populated
/etc/ssh/ssh_known_hosts files would be unable to read them during
artifact downloading.
Fixes#17086
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.
now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.
removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting
their shutdown delay, and retrying as needed
* also includes task preKill hooks
* clearDriverHandle() to destroy the task
and associated resources
* task exited hooks
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.
This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.
This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.
Fixes#17076
This PR does some cleanup of an old code path for versions of Consul that
did not support reporting the supported versions of Envoy in its API. Those
versions are no longer supported for years at this point, and the fallback
version of envoy hasn't been supported by any version of Consul for almost
as long. Remove this code path that is no longer useful.
This PR modifies references to the envoyproxy/envoy docker image to
explicitly include the docker.io prefix. This does not affect existing
users, but makes things easier for Podman users, who otherwise need to
specify the full name because Podman does not default to docker.io
This PR updates the envoy_bootstrap_hook to no longer disable itself if
the task driver in use is not docker. In other words, make it work for
podman and other image based task drivers. The hook now only checks that
1. the task is a connect sidecar
2. the task.config block contains an "image" field
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.
This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.
---
Fixes: #11175
Co-authored-by: Thomas Weber <towe75@googlemail.com>
The first start of a Consul Connect proxy sidecar triggers a run
of the envoy_version hook which modifies the task config image
entry. The modification takes into account a number of factors to
correctly populate this. Importantly, once the hook has run, it
marks itself as done so the taskrunner will not execute it again.
When the client receives a non-destructive update for the
allocation which the proxy sidecar is a member of, it will update
and overwrite the task definition within the taskerunner. In doing
so it overwrite the modification performed by the hook. If the
allocation is restarted, the envoy_version hook will be skipped as
it previously marked itself as done, and therefore the sidecar
config image is incorrect and causes a driver error.
The fix removes the hook in marking itself as done to the view of
the taskrunner.