This enables checks for the ContainerAdmin user in Docker images on Windows.
The check only runs when Docker uses process isolation rather than Hyper-V,
because Hyper-V provides its own, proper sandboxing.
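
As a rough illustration of the rule above, here is a minimal sketch. The
helper name `validateImageUser`, its signature, and the choice to return an
error are assumptions made for illustration, not the driver's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// validateImageUser is a hypothetical helper (not the driver's real API).
// It rejects images that run as ContainerAdmin, but only under process
// isolation; other isolation modes are left alone.
func validateImageUser(isolation, user string) error {
	if isolation != "process" {
		// Hyper-V isolation sandboxes the container itself, so skip the check.
		return nil
	}
	if strings.EqualFold(user, "ContainerAdmin") {
		return fmt.Errorf("image runs as %s under process isolation", user)
	}
	return nil
}

func main() {
	fmt.Println(validateImageUser("process", "ContainerAdmin"))
	fmt.Println(validateImageUser("hyperv", "ContainerAdmin"))
	fmt.Println(validateImageUser("process", "ContainerUser"))
}
```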
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Nomad creates Consul ACL tokens and service registrations to support Consul
service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks
to the local Consul agent and never directly to the Consul servers. But the
local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with tokens or services that have not yet replicated to the
follower that the local client is connected to. This request gets a 404 on the
ACL token and that negative entry gets cached, preventing any retries from
succeeding.
To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the objects in Consul we try to read them from
the local agent in stale consistency mode (which prevents a failed read from
being cached). This cannot completely eliminate this source of error because
it's possible that Consul cluster replication is unhealthy at the time we need
it, but this should make Envoy bootstrap significantly more robust.
This changeset adds preflight checks for the objects we create in Consul:
* We add a preflight check for ACL tokens after we log in via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We do this check early because we also want to use this token for
registering group services in the allocrunner hooks.
* We add a preflight check for services right before we bootstrap Envoy in the
taskrunner hook, so that we have time for our service client to batch updates
to the local Consul agent in addition to the local agent sync.
We've made the timeouts configurable via node metadata rather than the
usual static configuration because for most cases, users should not need to
touch or even know these values are configurable; the configuration is mostly
available for testing.
Fixes: https://github.com/hashicorp/nomad/issues/9307
Fixes: https://github.com/hashicorp/nomad/issues/10451
Fixes: https://github.com/hashicorp/nomad/issues/20516
Ref: https://github.com/hashicorp/consul-k8s/pull/887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
* Upgrade consul-template to 0.39.0 to allow template queries of admin
partitions and sameness groups.
* Upgrade our Consul API to 1.29.1 because it's required for CT, and to remove
the replacement pinned version we were using to pick up some newer Consul API
features we needed in 1.7.0.
Ref: https://hashicorp.atlassian.net/browse/NET-10153
The changelog is slightly misleading in that recent Enterprise-only backports
following our LTS release have titles that don't call out they're for Enterprise
only. Updating the title brings us in line with what Consul has done.
Fixes a bug in the nodeResources.Comparable method, where CPU resources were
accidentally offset with reserved resources, whereas functions that use this
field expect total CPU resources.
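
To see why the offset matters, consider this toy sketch. The `nodeCPU` type
and the `available` consumer are hypothetical and greatly simplified; the
assumption is that a consumer subtracts reserved CPU itself, so a
pre-offset value gets reserved CPU subtracted twice.

```go
package main

import "fmt"

// nodeCPU is a hypothetical, simplified view of a node's CPU in MHz.
type nodeCPU struct {
	TotalMHz    int64
	ReservedMHz int64
}

// comparableBuggy mirrors the described bug: CPU is offset by reserved.
func comparableBuggy(n nodeCPU) int64 { return n.TotalMHz - n.ReservedMHz }

// comparableFixed reports total CPU, which is what consumers expect.
func comparableFixed(n nodeCPU) int64 { return n.TotalMHz }

// available is a hypothetical consumer that expects total CPU and
// subtracts reserved CPU itself.
func available(comparableCPU, reservedMHz int64) int64 {
	return comparableCPU - reservedMHz
}

func main() {
	n := nodeCPU{TotalMHz: 4000, ReservedMHz: 500}
	fmt.Println(available(comparableBuggy(n), n.ReservedMHz)) // 3000: reserved counted twice
	fmt.Println(available(comparableFixed(n), n.ReservedMHz)) // 3500
}
```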
When setting up the timer for heartbeat invalidation, there's no control that
allows us to remove that timer when the node is GC'd. If the GC window is narrow
enough, it's possible to GC a node that has a waiting heartbeat timer. In this
case, we hit a bug where querying for the node returns `nil` and this is
incorrectly handled when checking for disconnect/reconnect state. Fix this bug
by correctly handling a `nil` node and allowing the `Node.Update` RPC to fire
normally (which then errors correctly).
Fixes: https://github.com/hashicorp/nomad/issues/23376
Ref: https://hashicorp.atlassian.net/browse/NET-10109
* Stopped status passed through to the statuses endpoint and observed on job model and steady-state panel
* Status passed to statuses endpoint and test for FE model statuses
Update `runc` to 1.1.13 to pick up build support for Go 1.22.4+, in order to
ensure we've resolved errors cloning processes into Linux namespaces for
libcontainer (`exec` driver) with new versions of Go and older but still
supported versions of glibc.
This changeset has two minor quirks:
* Testing shows that the reported issue is already resolved on `main` by
upgrading to Go 1.22.4 without this dependency bump, at least for glibc 2.31.
Upgrading the dependency should make sure there isn't another glibc version
where the problem will still appear.
* This version of `runc` refers to fields in `cilium/ebpf` which are not present
in more recent versions of that library. So in order to build, we have to
downgrade `cilium/ebpf`. Fortunately, `runc` is the only consumer of that
transitive dependency.
Closes: https://github.com/hashicorp/nomad/issues/20212
Ref: https://hashicorp.atlassian.net/browse/NET-10078