When performing a graceful shutdown, the client drain configuration is
checked for a deadline, which is added to the shutdown timeout. When the
agent runs as a server, the client is not set, and attempting to read the
drain deadline results in a panic. This change checks that the client is
available before fetching the deadline value.
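A minimal sketch of the nil guard, using stand-in types rather than Nomad's real agent and client structs:

```go
package main

import (
	"fmt"
	"time"
)

// client is a stand-in for the Nomad client; the real accessor for the drain
// deadline differs, this is illustration only.
type client struct {
	drainDeadline time.Duration
}

// agent holds a client only when client mode is enabled; on a server-only
// agent this field is nil.
type agent struct {
	client *client
}

// shutdownTimeout adds the drain deadline to the base timeout, but only when
// a client is actually configured, avoiding the nil dereference.
func (a *agent) shutdownTimeout(base time.Duration) time.Duration {
	if a.client == nil {
		return base
	}
	return base + a.client.drainDeadline
}

func main() {
	server := &agent{} // server-only agent, no client configured
	node := &agent{client: &client{drainDeadline: 5 * time.Minute}}
	fmt.Println(server.shutdownTimeout(30*time.Second), node.shutdownTimeout(30*time.Second))
}
```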
The `killTasks` function kills all of the alloc runner's task runners. If a
task runner's task has already completed, killing the task runner can cause
confusion because the resulting task event shows the task was signaled even
though it had already finished. To prevent this, a check is performed when
creating the task event to determine whether the task has completed. If it
has, no task event is created, so killing the task runner adds no extra
task event.
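A rough sketch of the check, with simplified stand-ins for the task state and event types (the real Nomad structs differ):

```go
package main

import "fmt"

// taskState is a simplified stand-in for a task runner's state.
type taskState struct {
	dead   bool // task has finished running
	failed bool // task finished with a failure
}

type taskEvent struct{ kind string }

// killEventFor returns the event to record when killing a task runner, or nil
// when the task already completed so no "signaled" event should be added.
func killEventFor(state taskState) *taskEvent {
	if state.dead && !state.failed {
		// The task finished on its own; emitting a kill event here would make
		// it look like it was signaled after completing.
		return nil
	}
	return &taskEvent{kind: "Killing"}
}

func main() {
	fmt.Println(killEventFor(taskState{dead: true}))  // <nil>: no extra event
	fmt.Println(killEventFor(taskState{dead: false})) // &{Killing}
}
```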
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.
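Roughly the shape of the change, using `go-hclog` (which Nomad logs with); the field names below are illustrative, not the exact keys the reconcilers emit:

```go
package main

import (
	"os"

	"github.com/hashicorp/go-hclog"
)

func main() {
	logger := hclog.New(&hclog.LoggerOptions{
		Name:   "reconciler",
		Level:  hclog.Debug,
		Output: os.Stderr,
	})

	// Instead of formatting a multi-line summary string, emit the result
	// counts as key/value pairs that operators can grep and filter on.
	logger.Debug("reconciliation complete",
		"place", 3,
		"stop", 1,
		"inplace_update", 0,
		"ignore", 12,
	)
}
```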
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API call and having the command query for the
evaluation's allocations automatically. Emit this data as a pair of new tables,
and expose fields like quota limits and the previous/next/blocked evaluations
without requiring the `-verbose` flag.
Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.
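For reference, a hedged sketch of what the command does against the HTTP API via the Go client: pass `related=true` as a query parameter and fetch the evaluation's allocations separately (the exact field names on the API structs may differ slightly):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	evalID := "example-eval-id" // hypothetical evaluation ID

	// Ask the API to include related evaluations via the "related" query param.
	eval, _, err := client.Evaluations().Info(evalID, &api.QueryOptions{
		Params: map[string]string{"related": "true"},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, related := range eval.RelatedEvals {
		fmt.Println("related eval:", related.ID, related.Status)
	}

	// Fetch the allocations placed by this evaluation for the second table.
	allocs, _, err := client.Evaluations().Allocations(evalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, alloc := range allocs {
		fmt.Println("alloc:", alloc.ID, "on node", alloc.NodeID)
	}
}
```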
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).
The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.
Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.
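A small sketch of the kind of generator involved, using `pgregory.net/rapid`; the scenario fields here are placeholders, not the real ~40 dimensions:

```go
package reconciler

import (
	"testing"

	"pgregory.net/rapid"
)

// scenario is a simplified stand-in for the reconciler test inputs; the real
// generators build full job, node, and allocation structs.
type scenario struct {
	groupCount   int
	desiredTotal int
	taintedNodes int
}

// genScenario randomizes only fields of interest so that rapid's shrinker
// spends its cycles on dimensions that actually matter to the reconciler.
func genScenario() *rapid.Generator[scenario] {
	return rapid.Custom(func(t *rapid.T) scenario {
		return scenario{
			groupCount:   rapid.IntRange(1, 5).Draw(t, "groupCount"),
			desiredTotal: rapid.IntRange(0, 20).Draw(t, "desiredTotal"),
			taintedNodes: rapid.IntRange(0, 3).Draw(t, "taintedNodes"),
		}
	})
}

func TestReconcilerScenarios(t *testing.T) {
	rapid.Check(t, func(t *rapid.T) {
		sc := genScenario().Draw(t, "scenario")
		// Placeholder property: a real test would run the reconciler on the
		// generated scenario and assert invariants over its results.
		if sc.desiredTotal < 0 || sc.groupCount < 1 {
			t.Fatalf("impossible scenario: %+v", sc)
		}
	})
}
```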
Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
Restoring scaling policies when starting a stopped job did not account for
jobs without any scaling policies, which led to a panic when users tried to
restart such jobs.
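The fix amounts to a guard along these lines (stand-in types; the real code operates on the job struct's scaling policies):

```go
package main

import "fmt"

// scalingPolicy and job are stand-ins for the real Nomad structs.
type scalingPolicy struct{ Target string }

type job struct {
	ScalingPolicies []*scalingPolicy
}

// restoreScalingPolicies skips jobs that carry no scaling policies; the fix
// adds a guard of this shape before touching any per-policy state.
func restoreScalingPolicies(j *job) {
	if j == nil || len(j.ScalingPolicies) == 0 {
		return
	}
	for _, p := range j.ScalingPolicies {
		fmt.Println("restoring policy for", p.Target)
	}
}

func main() {
	restoreScalingPolicies(&job{}) // safe no-op for a job without policies
}
```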
When a test starts an agent with the client enabled, we can wait within
the setup method until the client reaches the ready state. This mimics
what we already do for leadership and the root keyring, and should reduce
flaky tests that assume the client is ready as soon as the setup function
returns, which is not guaranteed.
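The wait itself is just a bounded poll, roughly like the sketch below (the real setup uses Nomad's test utilities and checks the client node's status with the server):

```go
package main

import (
	"fmt"
	"time"
)

// waitForClientReady polls until the ready check passes or the deadline hits;
// it stands in for the testutil-style helpers the test agent setup uses.
func waitForClientReady(ready func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if ready() {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("client did not become ready within %s", timeout)
}

func main() {
	start := time.Now()
	// In the real setup, the ready func would ask the server whether the
	// client node has reached status "ready".
	err := waitForClientReady(func() bool {
		return time.Since(start) > 300*time.Millisecond
	}, 2*time.Second)
	fmt.Println("client ready:", err == nil)
}
```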
The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
Regardless of the region identifier passed, the CLI always added
"<role>.global.nomad" to the certificate DNS names. This is not what we
expect, and the behavior has been removed.
While here, the long-deprecated cluster-region flag has been removed.
This removal only impacts CLI functionality, so it is safe to do.
The Nomad server uses an authenticator backend for RPC handling which
includes TLS verification. This verification setting is configured based
on the server's TLS configuration object and is built when a new server
is constructed.
The bug occurs when a server's TLS configuration is reloaded, which can
change the desired TLS verification handling. In this case the
authenticator is not updated, meaning the RPC mTLS verification is not
modified even when the configuration indicates it should be.
This change adds a new function on the authenticator to allow updating
its TLS verification rule. This new function is called when a server's
TLS configuration is reloaded.
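A simplified sketch of the shape of the new hook; the real authenticator type and method names in Nomad differ:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// authenticator stands in for the server's RPC authenticator backend.
type authenticator struct {
	verifyTLS atomic.Bool
}

// SetVerifyTLS lets the server push a new verification setting into the
// authenticator when its TLS configuration is reloaded, instead of keeping
// the value captured at construction time.
func (a *authenticator) SetVerifyTLS(verify bool) { a.verifyTLS.Store(verify) }

func (a *authenticator) VerifyTLS() bool { return a.verifyTLS.Load() }

func main() {
	auth := &authenticator{}
	auth.SetVerifyTLS(true) // server constructed with mTLS verification enabled

	// Later, a TLS config reload disables verification; the reload path now
	// updates the authenticator rather than leaving it stale.
	auth.SetVerifyTLS(false)
	fmt.Println("verify mTLS:", auth.VerifyTLS())
}
```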
In hashicorp/nomad-enterprise#2592 we introduced a divergence in how Nomad CE
and ENT build their binaries. Nomad CE used a more sophisticated approach,
setting the uid, gid, and home environment variables in the docker run
command. Despite my (and others') best efforts, we were not able to do the
same in the ENT repo, which relies on special git settings that allow it to
pull dependencies from private repositories, so we left a different docker run
command there that simply inherited the GHA runner user and copied the
resulting tarball instead of moving it. #26090 then attempted to remedy
#25910, which resulted from the docker run command ignoring ${{ env.GO_TAGS }}
when run with a custom --env, but the resulting backport broke ENT builds.
This PR restores the ENT behavior of building Nomad with the GHA runner user,
thus inheriting the runner's environment on ENT.
For reasons of backwards compatibility, Nomad uses an older branch of
HCL1 (`v1.0.1-nomad`) and HCL2 (`v2.20.2-nomad-1`) and backports a limited set
of changes to those branches.
But the Vault API also has its own HCL1 branch, currently tagged as
`v1.0.1-vault-7`. Normally this isn't a problem because Nomad pins to our own
branch and we don't call any of the Vault API package's HCL code anyway. But in
Vault's branch some functions were changed that break our build unless we
backport them.
We've backported enough of Vault's changes to make our HCL1 branch build, and
now have tags on the HCL repo so that we can pin to specific tags instead of
random commits.
Fixes: https://hashicorp.atlassian.net/browse/NMD-850
Fixes: https://github.com/hashicorp/nomad/pull/26006
Ref: https://github.com/hashicorp/hcl/pull/760
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
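Illustratively, the change moves in this direction (field names are made up for the sketch):

```go
package main

import "fmt"

// reconcilerState groups the fields that describe a single reconciliation
// run, so tests can construct and inspect them directly.
type reconcilerState struct {
	deploymentPaused bool
	deploymentFailed bool
	existingAllocs   int
}

type allocReconciler struct {
	jobID string          // configuration that does not change during a run
	state reconcilerState // per-run state, now grouped in its own sub-struct
}

func main() {
	r := allocReconciler{jobID: "example", state: reconcilerState{existingAllocs: 3}}
	fmt.Printf("%+v\n", r)
}
```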
The RPC is only ever called from a Nomad client, which means we can move
it away from the generic Authenticate function to the tighter
AuthenticateClientOnly one. An additional check to ensure the ACL object
allows client operations is performed, mimicking other endpoints of this
nature.
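In outline, the handler ends up looking something like the sketch below; the ACL type and `AllowClientOp` check are simplified stand-ins for Nomad's real types:

```go
package main

import (
	"errors"
	"fmt"
)

// acl is a stand-in for Nomad's ACL object.
type acl struct{ clientOps bool }

func (a *acl) AllowClientOp() bool { return a.clientOps }

var errPermissionDenied = errors.New("Permission denied")

// handleClientRPC authenticates with the client-only path and then verifies
// the resulting ACL permits client operations before doing any work.
func handleClientRPC(authenticateClientOnly func() (*acl, error)) error {
	aclObj, err := authenticateClientOnly()
	if err != nil {
		return err
	}
	if !aclObj.AllowClientOp() {
		return errPermissionDenied
	}
	return nil // proceed with the RPC body
}

func main() {
	err := handleClientRPC(func() (*acl, error) { return &acl{clientOps: true}, nil })
	fmt.Println("allowed:", err == nil)
}
```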
Cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. Even
for top-level methods this makes the code hard to follow, and it gets really
gnarly with lower-level methods (of which there are many). This changeset
proposes a refactoring that makes the vast majority of these methods return
explicit values and avoid mutating object fields.
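A toy before/after to show the direction of the refactoring; these are not the real reconciler methods:

```go
package main

import "fmt"

type results struct{ place, stop int }

// Mutating style: the outcome is spread across receiver fields, so the reader
// has to track which method changed what.
type mutatingReconciler struct{ result results }

func (r *mutatingReconciler) computeStops(excess int) { r.result.stop += excess }

// Explicit style: the method returns its contribution and the caller combines
// results, which is easier to follow and to unit test in isolation.
func computeStops(excess int) results { return results{stop: excess} }

func main() {
	m := &mutatingReconciler{}
	m.computeStops(2)
	fmt.Println(m.result, computeStops(2))
}
```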
In #25963 we added normalization of CPU shares for large hosts where the total
compute was larger than the maximum CPU shares. But if the result after
normalization is less than 2, runc will have an integer overflow. We prevent
this in the shared executor for the `exec`/`rawexec` driver by clamping to the
safe minimum value. Do this for the `docker` driver as well and add test
coverage of it for the shared executor too.
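The clamp is conceptually simple; a sketch under the assumption that 2 is the smallest share value runc handles safely (the normalization formula and limits here are illustrative):

```go
package main

import "fmt"

// minCPUShares is the smallest value we allow after normalization, since
// values below 2 trigger the runc integer overflow described above.
const minCPUShares = 2

// normalizeCPUShares scales shares down on hosts whose total compute exceeds
// the maximum share value, then clamps to the safe minimum.
func normalizeCPUShares(shares, totalCompute, maxShares int64) int64 {
	if totalCompute > maxShares {
		shares = shares * maxShares / totalCompute
	}
	if shares < minCPUShares {
		return minCPUShares
	}
	return shares
}

func main() {
	// A tiny reservation on a very large host would normalize below 2
	// without the clamp.
	fmt.Println(normalizeCPUShares(5, 1_000_000, 262_144)) // prints 2
}
```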
Fixes: https://github.com/hashicorp/nomad/issues/26080
Ref: https://github.com/hashicorp/nomad/pull/25963
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.
> consul.service: start operation timed out. Terminating.
This appears to be a known issue in Consul, and we'll try to help hunt down
the cause if they want it, but in the meantime let's remove it from our
systemd unit files for the Consul agents.
Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
* E2E: fix scaling test assertion for extra Windows host
The scaling test assumes that all nodes will receive the system job. But the job
can only run on Linux hosts, so the count will be wrong if we're running a
Windows host as part of the cluster. Filter the expected count by the OS.
While we're touching this test, let's also migrate it off the legacy framework.
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.
Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows will let us drop some of the extra PowerShell scripts we
were using as well.
Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since
the initial Vault client implementation in #1606, but before the switch to
workload identity most (all?) tokens being created were periodic, so this was
never detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
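The renewal loop now looks roughly like the sketch below (using the Vault API client's `RenewSelf`; backoff, stop channels, and error handling are elided):

```go
package main

import (
	"log"
	"time"

	vaultapi "github.com/hashicorp/vault/api"
)

// renewLoop keeps the increment passed to each renewal in step with the lease
// duration returned by the previous one, instead of leaving it at the
// initial value.
func renewLoop(client *vaultapi.Client, initialLease int) {
	increment := initialLease
	for {
		secret, err := client.Auth().Token().RenewSelf(increment)
		if err != nil {
			log.Printf("renewal failed: %v", err)
			return
		}
		lease := secret.Auth.LeaseDuration
		increment = lease // the fix: request a TTL matching the current lease

		// Renew again roughly halfway through the lease.
		time.Sleep(time.Duration(lease/2) * time.Second)
	}
}

func main() {
	client, err := vaultapi.NewClient(vaultapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	renewLoop(client, 30)
}
```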
Also switch out a `time.After` call in the derive-token caller's backoff with
a safe timer, so that we don't spawn a new goroutine per loop iteration and
have tighter control over when the timer is GC'd.
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Tests running in CI are starting to bump up against this timeout, forcing
re-runs. Add an additional five minutes to the timeout to help prevent this
from occurring.
Batch job allocations that are drained from a node will be moved to an
eligible node. However, when no eligible nodes are available to place the
draining allocations, the tasks end up marked complete and will not be
placed when an eligible node becomes available. This occurs because the
drained allocations are simultaneously stopped on the draining node while
attempting to be placed on an eligible node. Stopping the allocations on
the draining node results in tasks being killed, but importantly this kill
does not fail the task. The result is tasks reporting as complete because
their state is dead but not failed. As such, when an eligible node becomes
available, all tasks show as complete and no allocations need to be placed.
To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
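The check itself is small; a sketch with simplified stand-ins for the allocation fields involved (in Nomad these come from the job type and the allocation's desired transition):

```go
package main

import "fmt"

// alloc is a simplified stand-in for the allocation as seen by the alloc
// runner when it kills its tasks.
type alloc struct {
	jobType       string
	shouldMigrate bool // desired transition indicates a drain migration
}

// failOnKill reports whether killing this allocation's tasks should also mark
// them failed, so a drained batch alloc is replaced once a node is available.
func failOnKill(a alloc) bool {
	return a.jobType == "batch" && a.shouldMigrate
}

func main() {
	fmt.Println(failOnKill(alloc{jobType: "batch", shouldMigrate: true}))   // true
	fmt.Println(failOnKill(alloc{jobType: "service", shouldMigrate: true})) // false
}
```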
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyway.
When performing a graceful shutdown, a channel is used to wait for the agent
to leave. The channel is closed when the agent leaves successfully, but it is
also closed within a deferred call. If the agent successfully leaves and
closes the channel, a panic occurs when the channel is closed a second time
in the deferral. To prevent this, the channel close is wrapped in a
`sync.OnceFunc` so the channel is only closed once.
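The pattern in the standard library looks like this (a minimal, self-contained example of closing a channel exactly once with `sync.OnceFunc`):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	left := make(chan struct{})

	// Wrapping the close in sync.OnceFunc means both the success path and the
	// deferred cleanup can call it without risking a double-close panic.
	closeOnce := sync.OnceFunc(func() { close(left) })
	defer closeOnce()

	// Success path: the agent left, so signal any waiters.
	closeOnce()

	<-left
	fmt.Println("channel closed exactly once")
}
```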