Nomad can only use cgroups to control resource requirements if all the cgroups
controllers are actually enabled. Add this to our requirements documentation as
well as the impacted `exec` and `java` task drivers.
The Nomad state store function was recently updated to validate
certain parameters, fixing a panic condition. This change meant
dummy FSM used for the snapshot state command was always failing
this validation and the command no longer worked.
This change adds the required parameter to pass validation and
therefore makes the CLI command functional again.
* Move group into a separate helper module for reuse
* Add shutdownCh to worker
The shutdown channel is used to signal that worker has stopped.
* Make server shutdown block on workers' shutdownCh
* Fix waiting for eval broker state change blocking indefinitely
There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.
This commit fixes it by unblocking if the notifier has been stopped.
* Bound the amount of time server shutdown waits on worker completion
* Fix lostcancel linter error
* Fix worker test using unexpected worker constructor
* Add changelog
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
* Add OomKilled field to executor proto format
* Teach linux executor to detect and report OOMs
* Teach exec driver to propagate OOMKill information
* Fix data race
* use tail /dev/zero to create oom condition
* use new test framework
* minor tweaks to executor test
* add cl entry
* remove type conversion
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
In Nomad 1.7 we updated our JWT library to go-jose, but this changed the wire
format of the embedded struct we have in the `IdentityClaims` struct that we
return as part of the `WhoAmI` RPC response. This wasn't originally intended to
be sent over the wire but other changes in Nomad 1.5+ added a caller to the
client. The library change causes a deserialization error on Nomad 1.5 and 1.6
clients, which prevents access to Nomad Variables and SD via template blocks.
Removed the incompatible fields from the response, which are unused by any
current caller. In a future version of Nomad, we'll likely remove the `WhoAmI`
callers from the client in lieu of using the public keys the clients have to
check auth.
Fixes: https://github.com/hashicorp/nomad/issues/19555
Nomad CI checks for copywrite headers using multiple config files
for specific exemption paths. This means the top-level config file
does not take effect when running the copywrite script within
these sub-folders. Exempt files therefore need to be added to the
sub-config files, along with the top level.
When the server's `vault` block has a default identity, we don't check the
user's Vault token (and in fact, we warn them on job submit if they've provided
one). But the validation hook still checks for a token if
`allow_unauthenticated` is set to true. This is a misconfiguration but there's
no reason for Nomad not to do the expected thing here.
Fixes: https://github.com/hashicorp/nomad/issues/19565
* drivers/executor: set oom_score_adj for raw_exec
This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.
We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.
Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.
I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.
We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
* drivers/executor: minor cleanup of setting oom adjustment
* e2e: add test for raw_exec oom adjust score
* e2e: set oom score adjust to -999
* cl: add cl
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* Sign-in page now hides token secret by default (toggleable) and updates components to Helios
* General helios-ification
* All the notifications get dismissal buttons
* token-details grid for spacing
* node eligibilty taken into consideration when clients list filtered to 'ready'
* A working draft of complex positive querying
* tags and filter badge
* CompositeStatus -> Status
* Buttons within a Helios SegmentedGroup
* Convert the other dropdowns to helios on clients index
* A bunch of client index test fixes
* Remaining clients list acceptance tests for State facet modified
The allocation table header sometimes conditionally renders the
`Actions` table column, but the allocation row would render it
unconditionally, resulting in broken tables when rendering allocations
for jobs without actions, where rows had more columns than the header.
Also fix the conditional class for the deployments allocation table to
read `length` from the right value.