Nomad's default serf configuration has a full sync interval of 60s (the
library's WAN default). If tests need to join nodes and the leader is not in
the join set, the test can hang for up to twice that interval waiting for the
new node to be seen by the leader and added to Raft.
This changeset includes the following tweaks to improve test timings:
* Ensure that nodes introduced later in the keyring replication test are joined
to all peers. (Also migrates the test to `shoenig/test`.)
* Update the `TestJoin` helper so that all servers passed are joined to the full
set, instead of a set that's offset by 1, and use a single `Join` call for
each server to reduce the number of messages sent.
* Reduce the `PushPullInterval` from 60s to 500ms in our unit test
configuration, to force faster full syncs (see the sketch below).
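As a rough sketch of that last change, assuming the stock `hashicorp/serf` and
`hashicorp/memberlist` config types (this is not Nomad's actual test helper):
```go
package config

import (
	"time"

	"github.com/hashicorp/memberlist"
	"github.com/hashicorp/serf/serf"
)

// testSerfConfig is a hypothetical helper: start from serf's defaults and
// tighten the memberlist push/pull (full sync) interval so test clusters
// converge much faster than the 60s WAN default.
func testSerfConfig() *serf.Config {
	conf := serf.DefaultConfig()
	conf.MemberlistConfig = memberlist.DefaultLANConfig()
	conf.MemberlistConfig.PushPullInterval = 500 * time.Millisecond
	return conf
}
```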
Fixes a bug where variable values in job submissions that contained newlines
weren't encoded correctly, so jobs containing such values couldn't be resumed
from the UI once stopped.
Internal ref: https://hashicorp.atlassian.net/browse/NET-9966
* First pass at global token creation and regional awareness at token fetch time
* Reset and refetch token when you switch region but stay in place
* Ugly and functional global token save
* Tests and log cleanup
the ratio of optimized/unoptimized log size in TestPlanNormalize
has been increased several times as people have added to various
structs and coincidentally bumped into the magic limit.
we encountered another such case when adding to NetworkResource,
but here we add `omitempty` to the struct's fields instead of bumping
the limit in the test.
this has the added benefit of reducing the serialized struct size!
which was the original intent behind this test in the first place :P
the actual value of the ratio is now 0.628... but here the
test value is only dropped down to 0.66 to leave some wiggle room.
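for illustration, here is a standalone sketch of the effect using
`encoding/json` and a hypothetical struct loosely modeled on NetworkResource
(Nomad's wire format is msgpack, but `omitempty` behaves the same way):
zero-value fields are dropped from the encoding, which shrinks the serialized
struct.
```go
package main

import (
	"encoding/json"
	"fmt"
)

// Two versions of the same struct: one encodes every field, the other drops
// zero-value fields from the output via omitempty tags.
type Fat struct {
	Device string `json:"Device"`
	CIDR   string `json:"CIDR"`
	MBits  int    `json:"MBits"`
}

type Slim struct {
	Device string `json:"Device,omitempty"`
	CIDR   string `json:"CIDR,omitempty"`
	MBits  int    `json:"MBits,omitempty"`
}

func main() {
	fat, _ := json.Marshal(Fat{Device: "eth0"})
	slim, _ := json.Marshal(Slim{Device: "eth0"})
	fmt.Println(len(fat), string(fat))   // every field, zero values included
	fmt.Println(len(slim), string(slim)) // only the non-zero field
}
```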
Several commands that inspect objects with user-controlled names share a bug:
the user cannot inspect an object whose name is an exact prefix of another
object's name (in the same namespace, where applicable). For example, the
object "test" can't be inspected if there's an object with the name "testing".
Copy existing logic we have for jobs, node pools, etc. to the impacted commands:
* `plugin status`
* `quota inspect`
* `quota status`
* `scaling policy info`
* `service info`
* `volume deregister`
* `volume detach`
* `volume status`
If we get multiple objects for the prefix query, we check whether any of them
is an exact match and use that object instead of returning an error. Where
possible, because the prefix query signatures are the same, use a generic
function that can be shared across multiple commands (sketched below).
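A rough sketch of that shared pattern (the generic signature and names here
are illustrative, not Nomad's actual helper):
```go
package main

import (
	"fmt"
	"strings"
)

// getByPrefix runs a prefix query and, if it returns several objects,
// prefers an exact name match over returning an error.
func getByPrefix[T any](objName, prefix string,
	query func(prefix string) ([]T, error),
	name func(T) string) (*T, error) {

	matches, err := query(prefix)
	if err != nil {
		return nil, err
	}
	switch len(matches) {
	case 0:
		return nil, fmt.Errorf("no %s with prefix %q found", objName, prefix)
	case 1:
		return &matches[0], nil
	default:
		// Several matches: only succeed if one is an exact match, so "test"
		// stays addressable even when "testing" also exists.
		for i := range matches {
			if name(matches[i]) == prefix {
				return &matches[i], nil
			}
		}
		return nil, fmt.Errorf("prefix %q matched multiple %ss", prefix, objName)
	}
}

func main() {
	names := []string{"test", "testing"}
	query := func(prefix string) ([]string, error) {
		var out []string
		for _, n := range names {
			if strings.HasPrefix(n, prefix) {
				out = append(out, n)
			}
		}
		return out, nil
	}
	match, _ := getByPrefix("volume", "test", query, func(s string) string { return s })
	fmt.Println(*match) // "test", even though "testing" also matches the prefix
}
```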
Fixes: https://github.com/hashicorp/nomad/issues/13920
Fixes: https://github.com/hashicorp/nomad/issues/17132
Fixes: https://github.com/hashicorp/nomad/issues/23236
Ref: https://hashicorp.atlassian.net/browse/NET-10054
Ref: https://hashicorp.atlassian.net/browse/NET-10055
After the changes introduced in #23284 we no longer need the
`if !st.SupportsNUMA()` check in the `GetNodes()` topology method. In fact,
this check will now cause a panic in the `nomadTopologyToProto` method on
systems that don't support NUMA.
* Generalized namespace handling, generalized facet searching, node pools facet search
* Test fixes for namespace facet on jobs list
* Filter or not, need to watch for * namespaces
The Vault "logical" API doesn't allow configuring the namespace on a per-request
basis. Instead, it's set on the client. Our `vaultclient` wrapper locks access
to the API client and sets the namespace (and token, if applicable) for each
request, and then resets the namespace and unlocks the API client.
The logic for resetting the namespace incorrectly assumed that if the Vault
configuration didn't set the namespace, it was canonicalized to the non-empty
string `"default"`. This results in the API client's namespace getting "stuck"
whenever a job uses a non-default namespace and the configuration value is
empty. Update the logic to always go back to the configuration, rather than
accepting the "previous" namespace from the caller.
This changeset also removes some long-dead code in the Vault client wrapper.
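To make the set/reset pattern concrete, here's a simplified sketch using the
Vault API client; the wrapper type and field names are illustrative, not the
actual `vaultclient` code:
```go
package vaultclient

import (
	"sync"

	vaultapi "github.com/hashicorp/vault/api"
)

// wrapper stands in for Nomad's Vault client wrapper: the "logical" API sets
// the namespace on the client rather than per request, so the wrapper
// serializes access and swaps the namespace around each call.
type wrapper struct {
	lock   sync.Mutex
	client *vaultapi.Client
	config struct {
		Namespace string // namespace from the agent's Vault config; may be ""
	}
}

func (w *wrapper) withNamespace(ns string, fn func(c *vaultapi.Client) error) error {
	w.lock.Lock()
	defer w.lock.Unlock()

	if ns != "" {
		w.client.SetNamespace(ns)
	}
	// Always restore the namespace from the configuration after the call.
	// The old logic accepted the "previous" namespace from the caller and
	// assumed an unset config canonicalized to "default", which left the
	// client stuck on a job's namespace when the config value was empty.
	defer w.client.SetNamespace(w.config.Namespace)

	return fn(w.client)
}
```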
Fixes: https://github.com/hashicorp/nomad/issues/22230
Ref: https://hashicorp.atlassian.net/browse/NET-10207
As part of the work for 1.7.0 we moved portions of the task cgroup setup down
into the executor. This requires that the executor constructor get the
`TaskConfig.Resources` struct, and this was missing from the `qemu` driver. We
fixed a panic caused by this change in #19089 before we shipped, but this fix
was effectively undone after we added plumbing for custom cgroups for
`raw_exec` in 1.8.0. As a result, `qemu` tasks always fail on Linux.
This was undetected in testing because our CI environment doesn't have QEMU
installed. I've got all the unit tests running locally again and have added QEMU
installation when we're running the drivers tests.
Fixes: https://github.com/hashicorp/nomad/issues/23250
The RPC handler for scaling a job passes flags to enforce that the job modify
index is unchanged when it makes the write to Raft. But it's only checking
against the existing job modify index at the time the RPC handler snapshots the
state store, so it can only enforce consistency for its own validation.
In clusters with automated scaling, it would be useful to expose the
enforce-index options to the API, so that cluster admins can enforce that
scaling only happens when the job state is consistent with a state they've
previously seen in other API calls. Add these options to the CLI and API and
have the RPC handler check them if asked (see the sketch below).
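To illustrate the intended check-and-set behavior (the request fields and
function names here are hypothetical, not the actual RPC structs):
```go
package main

import (
	"errors"
	"fmt"
)

// scaleRequest sketches the shape of the guard: the caller supplies the job
// modify index it last observed, and the handler refuses the scale if the
// job has been modified since.
type scaleRequest struct {
	JobID          string
	Count          int
	EnforceIndex   bool
	JobModifyIndex uint64
}

var errCheckIndex = errors.New("job modify index did not match")

func applyScale(currentModifyIndex uint64, req scaleRequest) error {
	if req.EnforceIndex && req.JobModifyIndex != currentModifyIndex {
		return fmt.Errorf("%w: job is at %d, caller expected %d",
			errCheckIndex, currentModifyIndex, req.JobModifyIndex)
	}
	// ... proceed to submit the scaling write to Raft ...
	return nil
}

func main() {
	// A scale based on stale information is rejected instead of silently
	// racing with whatever changed the job in the meantime.
	err := applyScale(42, scaleRequest{
		JobID: "web", Count: 5, EnforceIndex: true, JobModifyIndex: 40})
	fmt.Println(err)
}
```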
Fixes: https://github.com/hashicorp/nomad/issues/23444
The job statuses endpoint does not filter jobs by the namespace query parameter
unless the user passes a management token. The RPC handler creates a filter
based on all the allowed namespaces, but incorrectly only reduces this down to
the requested set when the token is a management token. Note this does not give
the user access to jobs they shouldn't have; it only ignores the parameter.
Remove the RPC handler's extra condition that prevents using the requested
namespace. This is safe because we specifically check the ACL for that namespace
earlier in the handler.
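A simplified sketch of the namespace selection before and after the fix (the
names here are illustrative, not the actual handler code):
```go
package main

import "fmt"

// namespacesToFilter picks which namespaces the statuses filter should cover.
// The allowed set comes from the ACL check; the requested namespace comes
// from the query parameter.
func namespacesToFilter(allowed []string, requested string, isManagement bool) []string {
	// Buggy version: only honored the requested namespace for management
	// tokens, so for ordinary tokens the query parameter was ignored:
	//
	//   if isManagement && requested != "" { return []string{requested} }
	//
	// Fixed version: the requested namespace was already ACL-checked earlier
	// in the handler, so it's safe to narrow to it unconditionally.
	if requested != "" {
		return []string{requested}
	}
	return allowed
}

func main() {
	fmt.Println(namespacesToFilter([]string{"default", "prod"}, "prod", false))
}
```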
Fixes: https://github.com/hashicorp/nomad/issues/23370