Currently every time a client starts, it creates a new consul token per service or task,. This PR changes the behaviour , it persists consul ACL token to the client state and it starts by looking up a token before creating a new one.
Fixes: #20184Fixes: #20185
Some of our allocrunner hooks require a task environment for interpolating values based on the node or allocation. But several of the hooks accept an already-built environment or builder and then keep that in memory. Both of these retain a copy of all the node attributes and allocation metadata, which balloons memory usage until the allocation is GC'd.
While we'd like to look into ways to avoid keeping the allocrunner around entirely (see #25372), for now we can significantly reduce memory usage by creating the task environment on-demand when calling allocrunner methods, rather than persisting it in the allocrunner hooks.
In doing so, we uncover two other bugs:
* The WID manager, the group service hook, and the checks hook have to interpolate services for specific tasks. They mutated a taskenv builder to do so, but each time they mutate the builder, they write to the same environment map. When a group has multiple tasks, it's possible for one task to set an environment variable that would then be interpolated in the service definition for another task if that task did not have that environment variable. Only the service definition interpolation is impacted. This does not leak env vars across running tasks, as each taskrunner has its own builder.
To fix this, we move the `UpdateTask` method off the builder and onto the taskenv as the `WithTask` method. This makes a shallow copy of the taskenv with a deep clone of the environment map used for interpolation, and then overwrites the environment from the task.
* The checks hook interpolates Nomad native service checks only on `Prerun` and not on `Update`. This could cause unexpected deregistration and registration of checks during in-place updates. To fix this, we make sure we interpolate in the `Update` method.
I also bumped into an incorrectly implemented interface in the CSI hook. I've pulled that and some better guardrails out to https://github.com/hashicorp/nomad/pull/25472.
Fixes: https://github.com/hashicorp/nomad/issues/25269
Fixes: https://hashicorp.atlassian.net/browse/NET-12310
Ref: https://github.com/hashicorp/nomad/issues/25372
The Nomad client can now optionally emit telemetry data from the
prerun and prestart hooks. This allows operators to monitor and
alert on failures and time taken to complete.
The new datapoints are:
- nomad.client.alloc_hook.prerun.success (counter)
- nomad.client.alloc_hook.prerun.failed (counter)
- nomad.client.alloc_hook.prerun.elapsed (sample)
- nomad.client.task_hook.prestart.success (counter)
- nomad.client.task_hook.prestart.failed (counter)
- nomad.client.task_hook.prestart.elapsed (sample)
The hook execution time is useful to Nomad engineering and will
help optimize code where possible and understand job specification
impacts on hook performance.
Currently only the PreRun and PreStart hooks have telemetry
enabled, so we limit the number of new metrics being produced.
Nomad creates Consul ACL tokens and service registrations to support Consul
service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks
to the local Consul agent and never directly to the Consul servers. But the
local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a tokens or services that have not yet replicated to the
follower that the local client is connected to. This request gets a 404 on the
ACL token and that negative entry gets cached, preventing any retries from
succeeding.
To workaround this, we'll use a method described by our friends over on
`consul-k8s` where after creating the objects in Consul we try to read them from
the local agent in stale consistency mode (which prevents a failed read from
being cached). This cannot completely eliminate this source of error because
it's possible that Consul cluster replication is unhealthy at the time we need
it, but this should make Envoy bootstrap significantly more robust.
This changset adds preflight checks for the objects we create in Consul:
* We add a preflight check for ACL tokens after we login via via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We do this check early because we also want to use this token for
registering group services in the allocrunner hooks.
* We add a preflight check for services right before we bootstrap Envoy in the
taskrunner hook, so that we have time for our service client to batch updates
to the local Consul agent in addition to the local agent sync.
We've added the timeouts to be configurable via node metadata rather than the
usual static configuration because for most cases, users should not need to
touch or even know these values are configurable; the configuration is mostly
available for testing.
Fixes: https://github.com/hashicorp/nomad/issues/9307
Fixes: https://github.com/hashicorp/nomad/issues/10451
Fixes: https://github.com/hashicorp/nomad/issues/20516
Ref: https://github.com/hashicorp/consul-k8s/pull/887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
When logging into a JWT auth method, we need to explicitly supply the Consul
admin partition if the local Consul agent is in a partition. We can't derive
this from agent configuration because the Consul agent's configuration is
canonical, so instead we get the partition from the fingerprint (if
available). This changeset updates the Consul client constructor so that we
close over the partition from the fingerprint.
Ref: https://hashicorp.atlassian.net/browse/NET-9451
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.
Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.
This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
isn't necessary so long as we provide the real WID manager with the mock
signer and never call `Run` on it. It wasn't feasible to exercise the correct
behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.
Fixes: https://github.com/hashicorp/nomad/issues/20025
* client/allocdir: use an interface in place of AllocDir structs
This PR replace *allocdir.AllocDir with allocdir.Interface such that we
may eventually have another implementation of alloc directories. This is
in support of the exec2 driver, which will need an implementation of the
alloc directory incompatibile with the current version.
* use rlock
The allocrunner has a service registration handler that proxies various API
calls to Consul. With multi-cluster support (for ENT), the service registration
handler is what selects the correct Consul client. The name of this field in the
allocrunner and taskrunner code base looks like it's referring to the actual
Consul API client. This was actually the case before Nomad native service
discovery was implemented, but now the name is misleading.
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).
This is two of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.
As part of this work I discovered and fixed two bugs:
* The gRPC proxy socket that we create for Envoy is only ever created using the
default Consul cluster's configuration. This will prevent Connect from being
used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
Consul cluster's configuration, but will use the correct Consul token for the
non-default cluster. This will prevent templates from being used with the
non-default cluster.
Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
When Workload Identity is being used with Consul, the `consul_hook` will add
Consul tokens to the alloc hook resources. Update the `group_service_hook` and
`service_hook` to use those tokens when available for registering and
deregistering Consul workloads.
This PR introduces a new allocrunner-level consul_hook which iterates over
services and tasks, if their provider is consul, fetches consul tokens for all of
them, stores them in AllocHookResources and in task secret dirs.
Ref: hashicorp/team-nomad#404
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This commit splits identity_hook between the allocrunner and taskrunner. The
allocrunner-level part of the hook signs each task identity, and the
taskrunner-level part picks it up and stores secrets for each task.
The code revamps the WIDMgr, which is now split into 2 interfaces:
IdentityManager which manages renewals of signatures and handles sending
updates to subscribers via Watch method, and IdentitySigner which only does the
signing.
This work is necessary for having a unified Consul login workflow that comes
with the new Consul integration. A new, allocrunner-level consul_hook will now
be the only hook doing Consul authentication.
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.
now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.
removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting
their shutdown delay, and retrying as needed
* also includes task preKill hooks
* clearDriverHandle() to destroy the task
and associated resources
* task exited hooks
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).
The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.
Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
* client: run alloc pre-kill hooks on last pass despite no live tasks
This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.
Fixes#15477
* client: do not run ar cleanup hooks if client is shutting down
* client: accommodate Consul 1.14.0 gRPC and agent self changes.
Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.
The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.
The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.
The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.
* docs: modify Consul Connect entries to detail 1.14.0 changes.
* changelog: add entry for #15309
* fixup: tidy tests and clean version match from review feedback.
* fixup: use strings tolower func.
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.
Currently only HTTP and TCP checks are supported, though more types
could be added later.
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.
Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.
Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.
In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.
The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
Most allocation hooks don't need to know when a single task within the
allocation is restarted. The check watcher for group services triggers the
alloc runner to restart all tasks, but the alloc runner's `Restart` method
doesn't trigger any of the alloc hooks, including the group service hook. The
result is that after the first time a check triggers a restart, we'll never
restart the tasks of an allocation again.
This commit adds a `RunnerTaskRestartHook` interface so that alloc runner
hooks can act if a task within the alloc is restarted. The only implementation
is in the group service hook, which will force a re-registration of the
alloc's services and fix check restarts.
* consul: advertise cni and multi host interface addresses
* structs: add service/check address_mode validation
* ar/groupservices: fetch networkstatus at hook runtime
* ar/groupservice: nil check network status getter before calling
* consul: comment network status can be nil
Before, Connect Native Tasks needed one of these to work:
- To be run in host networking mode
- To have the Consul agent configured to listen to a unix socket
- To have the Consul agent configured to listen to a public interface
None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.
Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.
Fixes#8290
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.
This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.
Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.
Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
This commit is an initial (read: janky) approach to forwarding state
from an allocrunner hook to a taskrunner using a similar `hookResources`
approach that tr's use internally.
It should eventually probably be replaced with something a little bit
more message based, but for things that only come from pre-run hooks,
and don't change, it's probably fine for now.
This commit introduces the first stage of volume mounting for an
allocation. The csimanager.VolumeMounter interface manages the blocking
and actual minutia of the CSI implementation allowing this hook to do
the minimal work of volume retrieval and creating mount info.
In the future the `CSIVolume.Get` request should be replaced by
`CSIVolume.Claim(Batch?)` to minimize the number of RPCs and to handle
external triggering of a ControllerPublishVolume request as required.
We also need to ensure that if pre-run hooks fail, we still get a full
unwinding of any publish and staged volumes to ensure that there are no hanging
references to volumes. That is not handled in this commit.
As part of introducing support for CSI, AllocRunner hooks need to be
able to communicate with Nomad Servers for validation of and interaction
with storage volumes. Here we create a small RPCer interface and pass
the client (rpc client) to the AR in preparation for making these RPCs.
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.
It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.
It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
copy struct values
ensure groupserviceHook implements RunnerPreKillhook
run deregister first
test that shutdown times are delayed
move magic number into variable
* client: improve group service stanza interpolation and check_restart support
Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.
The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
* ar: refactor network bridge config to use go-cni lib
* ar: use eth as the iface prefix for bridged network namespaces
* vendor: update containerd/go-cni package
* ar: update network hook to use TODO contexts when calling configurator
* unnecessary conversion
* connect: add unix socket to proxy grpc for envoy
Fixes#6124
Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.
* connect: pointer receiver on structs with mutexes
* connect: warn on all proxy errors