* First pass at global token creation and regional awareness at token fetch time
* Reset and refetch token when you switch region but stay in place
* Ugly and functional global token save
* Tests and log cleanup
The ratio of optimized/unoptimized log size in TestPlanNormalize
has been increased several times as people have added to various
structs and coincidentally bumped into the magic limit.
We encountered another such case when adding to NetworkResource,
but here we add `omitempty` to the struct instead of bumping the limit
in the test.
This has the added benefit of reducing the serialized struct size,
which was the original intent behind this test in the first place :P
The actual value of the ratio is now 0.628..., but the
test value is only dropped down to 0.66 to leave some wiggle room.
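As an analogous illustration of the effect using `encoding/json` (the field names below are invented and Nomad's serialization uses its own codec, so this is not the actual NetworkResource change):

```go
// Analogous illustration (not the actual NetworkResource change): adding
// `omitempty` keeps zero-valued fields out of the serialized form, shrinking
// the encoded struct.
package main

import (
	"encoding/json"
	"fmt"
)

type withoutOmitEmpty struct {
	Device string
	CIDR   string
	MBits  int
}

type withOmitEmpty struct {
	Device string `json:",omitempty"`
	CIDR   string `json:",omitempty"`
	MBits  int    `json:",omitempty"`
}

func main() {
	a, _ := json.Marshal(withoutOmitEmpty{Device: "eth0"})
	b, _ := json.Marshal(withOmitEmpty{Device: "eth0"})
	fmt.Println(len(a), string(a)) // zero-valued CIDR and MBits are still serialized
	fmt.Println(len(b), string(b)) // only the non-zero field is serialized
}
```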
Several commands that inspect objects where the names are user-controlled share
a bug where the user cannot inspect the object if it has a name that is an exact
prefix of the name of another object (in the same namespace, where
applicable). For example, the object "test" can't be inspected if there's an
object with the name "testing".
Copy existing logic we have for jobs, node pools, etc. to the impacted commands:
* `plugin status`
* `quota inspect`
* `quota status`
* `scaling policy info`
* `service info`
* `volume deregister`
* `volume detach`
* `volume status`
If we get multiple objects for the prefix query, we check whether any of them is an
exact match and use that object instead of returning an error. Where the prefix
query signatures are the same, we use a generic function that can be shared
across multiple commands.
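A rough sketch of the shared pattern (the helper name and signature here are illustrative, not the actual function added to the CLI):

```go
// Illustrative sketch of the shared pattern, not the actual CLI helper:
// given the objects returned by a prefix query, prefer an exact name match
// before treating multiple results as ambiguous.
package main

import (
	"errors"
	"fmt"
)

// getByPrefix resolves a user-supplied name against prefix-query results:
// exactly one result wins, an exact match among several wins, and anything
// else is ambiguous.
func getByPrefix[T any](name string, results []T, getName func(T) string) (T, error) {
	var zero T
	switch len(results) {
	case 0:
		return zero, fmt.Errorf("no object found with prefix %q", name)
	case 1:
		return results[0], nil
	default:
		for _, obj := range results {
			if getName(obj) == name {
				return obj, nil
			}
		}
		return zero, errors.New("prefix matched multiple objects")
	}
}

type quota struct{ Name string }

func main() {
	quotas := []quota{{"test"}, {"testing"}}
	q, err := getByPrefix("test", quotas, func(q quota) string { return q.Name })
	fmt.Println(q, err) // {test} <nil>: the exact match wins over the ambiguity
}
```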
Fixes: https://github.com/hashicorp/nomad/issues/13920
Fixes: https://github.com/hashicorp/nomad/issues/17132
Fixes: https://github.com/hashicorp/nomad/issues/23236
Ref: https://hashicorp.atlassian.net/browse/NET-10054
Ref: https://hashicorp.atlassian.net/browse/NET-10055
After the changes introduced in #23284 we no longer need the
`if !st.SupportsNUMA()` check in the `GetNodes()` topology method. In fact, this check
will now cause a panic in the `nomadTopologyToProto` method on systems that don't
support NUMA.
* Generalized namespace handling, generalized facet searching, node pools facet search
* Test fixes for namespace facet on jobs list
* Filter or not, need to watch for * namespaces
The Vault "logical" API doesn't allow configuring the namespace on a per-request
basis. Instead, it's set on the client. Our `vaultclient` wrapper locks access
to the API client and sets the namespace (and token, if applicable) for each
request, and then resets the namespace and unlocks the API client.
The logic for resetting the namespace incorrectly assumed that if the Vault
configuration didn't set the namespace, it was canonicalized to the
non-empty string `"default"`. This results in the API client's namespace getting
"stuck" whenever a job uses a non-default namespace while the configuration value
is empty. Update the logic to always go back to the configuration, rather than
accepting the "previous" namespace from the caller.
This changeset also removes some long-dead code in the Vault client wrapper.
Fixes: https://github.com/hashicorp/nomad/issues/22230
Ref: https://hashicorp.atlassian.net/browse/NET-10207
As part of the work for 1.7.0 we moved portions of the task cgroup setup down
into the executor. This requires that the executor constructor get the
`TaskConfig.Resources` struct, and this was missing from the `qemu` driver. We
fixed a panic caused by this change in #19089 before we shipped, but that fix
was effectively undone after we added plumbing for custom cgroups for `raw_exec`
in 1.8.0. As a result, `qemu` tasks always fail on Linux.
This was undetected in testing because our CI environment doesn't have QEMU
installed. I've got all the unit tests running locally again and have added QEMU
installation when we're running the drivers tests.
Fixes: https://github.com/hashicorp/nomad/issues/23250
The RPC handler for scaling a job passes flags to enforce that the job modify index
is unchanged when it makes the write to Raft. But it only checks against the
existing job modify index at the time the RPC handler snapshots the state store,
so it can only enforce consistency for its own validation.
In clusters with automated scaling, it would be useful to expose the enforce
index options to the API, so that cluster admins can enforce that scaling only
happens when the job state is consistent with a state they've previously seen in
other API calls. Add these options to the CLI and API and have the RPC handler
check them when asked.
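As a hypothetical usage sketch of the check-index pattern against the scale endpoint (the `EnforceIndex` and `JobModifyIndex` field names are assumptions by analogy with the job register endpoint; check the API docs for the exact names added):

```go
// Hypothetical usage sketch: EnforceIndex and JobModifyIndex are assumed field
// names by analogy with the job register endpoint, not confirmed names for the
// scale endpoint.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	payload := map[string]any{
		"Target":  map[string]string{"Group": "web"},
		"Count":   3,
		"Message": "scale out",
		// Only apply the change if the job hasn't been modified since we last
		// observed it (index captured from an earlier read of the job).
		"EnforceIndex":   true,
		"JobModifyIndex": 117,
	}
	body, _ := json.Marshal(payload)
	resp, err := http.Post("http://127.0.0.1:4646/v1/job/example/scale",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect an error response if the index is stale
}
```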
Fixes: https://github.com/hashicorp/nomad/issues/23444
The job statuses endpoint does not filter jobs by the namespace query parameter
unless the user passes a management token. The RPC handler creates a filter
based on all the allowed namespaces, but improperly makes narrowing this down
to the requested set conditional on there being a management token. Note this does not
give the user access to jobs they shouldn't have; it only ignores the parameter.
Remove the RPC handler's extra condition that prevents using the requested
namespace. This is safe because we specifically check the ACL for that namespace
earlier in the handler.
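A self-contained illustration of the corrected filtering behavior (the names here are invented; this is not the actual RPC handler code):

```go
// Hypothetical, self-contained illustration of the fixed filtering logic (not
// Nomad's handler code): restrict the allowed-namespace set to the requested
// namespace whenever one is given, regardless of token type.
package main

import "fmt"

const allNamespaces = "*"

// filterNamespaces narrows the set of namespaces the token may read down to
// the single requested namespace, unless the wildcard was requested.
func filterNamespaces(allowed map[string]struct{}, requested string) map[string]struct{} {
	if requested == allNamespaces || requested == "" {
		return allowed
	}
	if _, ok := allowed[requested]; ok {
		return map[string]struct{}{requested: {}}
	}
	// the ACL check earlier in the handler should already have rejected this
	return map[string]struct{}{}
}

func main() {
	allowed := map[string]struct{}{"default": {}, "prod": {}}
	fmt.Println(filterNamespaces(allowed, "prod")) // map[prod:{}]
}
```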
Fixes: https://github.com/hashicorp/nomad/issues/23370
This enables checks for the ContainerAdmin user on Docker images on Windows. The check
only runs if users run Docker with process isolation and not Hyper-V,
because Hyper-V provides its own, proper sandboxing.
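A simplified sketch of the decision being made (not the Docker driver's actual code; in practice the user and isolation mode come from the Docker image config and the task config):

```go
// Simplified sketch of the check, not the Docker driver's actual code: the
// image's configured user and the task's isolation mode would come from the
// Docker API; here they're plain parameters.
package main

import (
	"fmt"
	"strings"
)

// validateWindowsUser rejects images that run as ContainerAdmin when the
// container uses process isolation; Hyper-V isolated containers are exempt
// because Hyper-V provides its own sandboxing.
func validateWindowsUser(imageUser, isolation string) error {
	if strings.EqualFold(isolation, "hyperv") {
		return nil
	}
	if strings.EqualFold(imageUser, "ContainerAdmin") {
		return fmt.Errorf("image runs as ContainerAdmin under process isolation")
	}
	return nil
}

func main() {
	fmt.Println(validateWindowsUser("ContainerAdmin", "process")) // error
	fmt.Println(validateWindowsUser("ContainerAdmin", "hyperv"))  // nil
}
```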
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Nomad creates Consul ACL tokens and service registrations to support Consul
service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks
to the local Consul agent and never directly to the Consul servers. But the
local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token or service that has not yet replicated to the
follower that the local client is connected to. This request gets a 404 on the
ACL token and that negative entry gets cached, preventing any retries from
succeeding.
To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the objects in Consul we try to read them from
the local agent in stale consistency mode (which prevents a failed read from
being cached). This cannot completely eliminate this source of error because
it's possible that Consul cluster replication is unhealthy at the time we need
it, but this should make Envoy bootstrap significantly more robust.
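A rough sketch of the stale-read retry idea for the ACL token preflight (this is not Nomad's actual hook code, and retry backoff is reduced to a fixed one-second interval):

```go
// Rough sketch of the stale-read preflight idea, not Nomad's actual hook code:
// read the freshly created token back through the local Consul agent with
// AllowStale set, so a miss is not negatively cached, and retry until the
// token has replicated or the context expires.
package preflightsketch

import (
	"context"
	"fmt"
	"time"

	capi "github.com/hashicorp/consul/api"
)

func waitForTokenReplication(ctx context.Context, client *capi.Client, secretID string) error {
	opts := (&capi.QueryOptions{
		AllowStale: true,     // stale reads avoid caching a negative entry
		Token:      secretID, // read the new token back as itself
	}).WithContext(ctx)

	for {
		if _, _, err := client.ACL().TokenReadSelf(opts); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("consul ACL token not replicated in time: %w", ctx.Err())
		case <-time.After(time.Second):
		}
	}
}
```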
This changeset adds preflight checks for the objects we create in Consul:
* We add a preflight check for ACL tokens after we log in via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We do this check early because we also want to use this token for
registering group services in the allocrunner hooks.
* We add a preflight check for services right before we bootstrap Envoy in the
taskrunner hook, so that we have time for our service client to batch updates
to the local Consul agent in addition to the local agent sync.
We've made the timeouts configurable via node metadata rather than the
usual static configuration because in most cases users should not need to
touch or even know these values are configurable; the configuration is mostly
available for testing.
Fixes: https://github.com/hashicorp/nomad/issues/9307
Fixes: https://github.com/hashicorp/nomad/issues/10451
Fixes: https://github.com/hashicorp/nomad/issues/20516
Ref: https://github.com/hashicorp/consul-k8s/pull/887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
* Upgrade consul-template to 0.39.0 to allow template queries of admin
partitions and sameness groups.
* Upgrade our Consul API to 1.29.1 because it's required for CT, and to remove
the replacement pinned version we were using to pick up some newer Consul API
features we needed in 1.7.0.
Ref: https://hashicorp.atlassian.net/browse/NET-10153