The cluster reconciler code is notoriously hard to follow because most of its
methods continuously mutate the fields of the allocReconciler object. This makes
even the top-level methods hard to follow, and it gets really gnarly with the
lower-level methods (of which there are many). This changeset proposes a
refactoring that makes the vast majority of these methods return explicit
values and avoid mutating object fields.
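As a rough illustration of the direction (the names below are simplified
stand-ins rather than the reconciler's real API), lower-level helpers return
their results and only the top-level method assembles them:

```
// Simplified stand-in types; not the real reconciler definitions.
type allocStopResult struct{ allocID string }

type allocReconciler struct {
	stop []allocStopResult // assembled only by the top-level method
}

// Before this change, helpers like this appended to a.stop as a side
// effect; now they return an explicit value instead.
func (a *allocReconciler) computeStop(group string) []allocStopResult {
	// ...selection logic elided...
	return []allocStopResult{{allocID: group}}
}

// The top-level method is the one place that mutates reconciler state.
func (a *allocReconciler) computeGroup(group string) {
	a.stop = append(a.stop, a.computeStop(group)...)
}
```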
In #25963 we added normalization of CPU shares for large hosts where the total
compute exceeds the maximum CPU shares. But if the result after normalization
is less than 2, runc hits an integer overflow. We already prevent this in the
shared executor for the `exec`/`raw_exec` drivers by clamping to the safe
minimum value. Do the same for the `docker` driver, and add test coverage for
the shared executor as well.
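A minimal sketch of the clamping, with a helper and constant named here only
for illustration (the real driver code differs):

```
// minCPUShares is the smallest CPU shares value runc accepts without an
// integer overflow; the constant name here is illustrative.
const minCPUShares = 2

// clampCPUShares guards against normalization rounding the shares below
// runc's safe minimum on very large hosts.
func clampCPUShares(shares int64) int64 {
	if shares < minCPUShares {
		return minCPUShares
	}
	return shares
}
```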
Fixes: https://github.com/hashicorp/nomad/issues/26080
Ref: https://github.com/hashicorp/nomad/pull/25963
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.
> consul.service: start operation timed out. Terminating.
This appears to be a known issue in Consul, and we'll try to help hunt down the
cause upstream, but in the meantime let's remove the systemd notification from
the unit files for the Consul agents.
Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
* E2E: fix scaling test assertion for extra Windows host
The scaling test assumes that all nodes will receive the system job. But the job
can only run on Linux hosts, so the count will be wrong if a Windows host is
part of the cluster. Filter the expected count by node OS, roughly as sketched
below.
While we're touching this test, let's also migrate it off the legacy framework.
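A hedged sketch of the assertion change; the attribute key and test helpers
here are assumptions rather than the exact test code:

```
// Count only Linux nodes when computing the expected number of system
// job allocations, since the job is constrained to Linux.
expected := 0
for _, node := range nodes {
	if node.Attributes["kernel.name"] == "linux" {
		expected++
	}
}
must.Eq(t, expected, len(allocs))
```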
* address comments from code review
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.
Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows version will let us drop some of the extra PowerShell
scripts we were using as well.
Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606, but before the switch to workload
identity most (all?) tokens being created were periodic, so this was never
detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
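A hedged sketch of the change, using the Vault API client's token renewal
call; the surrounding request type and field names are illustrative:

```
// Renew with the current increment, then carry the returned lease
// duration forward so the next renewal asks for the full TTL again.
secret, err := client.Auth().Token().RenewSelf(req.increment)
if err != nil {
	return err
}
// Previously req.increment stayed at its initial 30s, which kept the TTL
// of non-periodic tokens from being extended properly.
req.increment = secret.Auth.LeaseDuration
```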
Also swap out a `time.After` call in the derive token caller's backoff loop for
a safe timer, so that we don't leave a new timer behind on every iteration and
have tighter control over when it's cleaned up.
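The backoff loop roughly follows this shape after the change, using Nomad's
`helper.NewSafeTimer`; the surrounding context is illustrative:

```
// NewSafeTimer returns a timer plus a stop function we can defer,
// unlike time.After whose timer lingers until it fires.
timer, stopTimer := helper.NewSafeTimer(backoff)
defer stopTimer()

select {
case <-ctx.Done():
	return nil, ctx.Err()
case <-timer.C:
	// fall through and retry the token derivation
}
```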
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
Tests running in CI are starting to bump up against this timeout, forcing
re-runs. Add an additional five minutes to the timeout to help prevent this
from occurring.
Batch job allocations that are drained from a node will be moved
to an eligible node. However, when no eligible nodes are available
to place the draining allocations, the tasks end up marked
complete and will not be placed once an eligible node becomes
available. This occurs because the drained allocations are
simultaneously stopped on the draining node while attempting to
be placed on an eligible node. Stopping the allocations on the
draining node results in their tasks being killed, but
importantly this kill does not fail the tasks. The result is
tasks reporting as complete because their state is dead but not
failed. As such, when an eligible node becomes available, all
tasks show as complete and no allocations need to be placed.
To prevent the behavior described above, a check is performed when
the alloc runner kills its tasks. If the allocation's job type is
batch, and the allocation has a desired transition of migrate, the
task will be failed when it is killed. This ensures the task does
not report as complete, and when an eligible node becomes available
the allocations are placed as expected.
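In rough terms, the check looks like the following; the call site and type
accessors are approximations of the alloc runner internals, not the exact
code:

```
// When killing tasks, fail them if this is a batch allocation being
// migrated off a draining node, so a replacement is placed later
// instead of the allocation being treated as complete.
alloc := ar.Alloc()
if alloc.Job.Type == structs.JobTypeBatch && alloc.DesiredTransition.ShouldMigrate() {
	taskEvent.SetFailsTask()
}
```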
We have a description of the order of shutdown in the `task.leader` docs, but
the `lifecycle` block is an intuitive place to look for this same information,
and the behavior is largely governed by that feature anyway.
When performing a graceful shutdown, a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it is also closed within a deferral. If the
agent successfully leaves and closes the channel, a panic occurs
when the channel is closed a second time by the deferral. To
prevent this, the channel close is wrapped in a `sync.OnceFunc` so
the channel is only closed once.
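A minimal sketch of the guard using `sync.OnceFunc`; the agent and channel
names here are illustrative:

```
leaveCh := make(chan struct{})
closeLeaveCh := sync.OnceFunc(func() { close(leaveCh) })
// Safe even if the success path below has already closed the channel.
defer closeLeaveCh()

go func() {
	if err := agent.Leave(); err == nil {
		closeLeaveCh()
	}
}()
```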
While waiting for the agent to leave during a graceful shutdown,
the wait can be interrupted immediately if another signal is
received. It is common for a `SIGPIPE` from journald to arrive
during this wait, causing it to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to
interrupt the wait, the received signal is checked and, if it is a
`SIGPIPE`, we continue waiting.
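Roughly, with illustrative channel names:

```
for {
	select {
	case sig := <-signalCh:
		// journald commonly sends SIGPIPE during shutdown; keep
		// waiting for the leave to finish rather than bailing out.
		if sig == syscall.SIGPIPE {
			continue
		}
		return
	case <-leaveCh:
		return
	}
}
```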
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Previously, when getting ACL policies by job, the search
performed a prefix-based lookup on the index. This can result in
policies being applied incorrectly when used for workload
identities. For example, if a `custom-test` policy is created
like so:
```
nomad acl policy apply -namespace=default -job=test-job custom-test ./policy.hcl
```
A job named `test-job` will properly get this ACL policy. However,
due to the lookup being prefix-based on the index, a job named
`test-job-1` will also get this ACL policy.
To prevent this, the lookup on the index is modified to be an
exact match.
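A hedged sketch of the go-memdb lookup change; the table and index names
below are illustrative rather than Nomad's exact schema identifiers:

```
// Before: a prefix scan on the job index also matched "test-job-1":
//   iter, err := txn.Get("acl_policy", "job_prefix", ns, jobID)
//
// After: an exact match on the index returns only policies attached
// to this job ID.
iter, err := txn.Get("acl_policy", "job", ns, jobID)
if err != nil {
	return nil, err
}
```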
* sec: add sprig template functions to denylists
* remove explicit set which is no longer needed
* go mod tidy
* add changelog
* better changelog and filtered denylist
* go mod tidy with 1.24.4
* edit changelog and remove htpasswd and derive
* fix tests
* Update client/allocrunner/taskrunner/template/template_test.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* edit changelog
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In an effort to improve the readability and maintainability of the
nomad/scheduler package, we begin with a README file that describes its
operation in more detail than the official documentation does. This PR will be
followed by a few small ones that move code around within that package, improve
variable naming, and keep that README up to date.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The server RPC handler and RPC connection pool both use a shared
configuration object for custom yamux configuration. Both
sub-systems were modifying the shared object, which could cause a
data race. The passed object is now cloned before being modified.
This change also moves the cloning and modification of the yamux
configuration into the relevant constructor function. This avoids
performing a clone per handled connection or per new connection
generated in the RPC pool.
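The clone now happens once, in the constructor, along these lines; the
function shape and field choices are illustrative:

```
// newYamuxConfig copies the shared configuration before modifying it,
// so neither the RPC handler nor the connection pool ever mutates the
// object they share.
func newYamuxConfig(shared *yamux.Config, logOutput io.Writer) *yamux.Config {
	cfg := yamux.DefaultConfig()
	if shared != nil {
		cloned := *shared // copy, so the shared object is never mutated
		cfg = &cloned
	}
	cfg.LogOutput = logOutput
	return cfg
}
```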
Fix for:
> This is a scheduled Windows Server 2019 brownout.
> The Windows Server 2019 image will be removed on 2025-06-30.
> For more details, see actions/runner-images#12045
Some test cases were writing the same allocation object (memory
pointer) to Nomad state in subsequent upsert calls. This causes a
race condition with the drainer's job watcher, which reads the
same object from Nomad state to perform conditional checks.
The data race is fixed by ensuring the allocation is copied
between writes.
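The test fix is roughly the following; the field chosen and helper names are
illustrative:

```
// Upsert a copy rather than reusing the same *structs.Allocation
// pointer, so the drainer's watcher never reads an object the test is
// still mutating.
updated := alloc.Copy()
updated.ClientStatus = structs.AllocClientStatusRunning
must.NoError(t, store.UpsertAllocs(structs.MsgTypeTestSetup, index,
	[]*structs.Allocation{updated}))
```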