* Add OomKilled field to executor proto format
* Teach linux executor to detect and report OOMs
* Teach exec driver to propagate OOMKill information
* Fix data race
* use tail /dev/zero to create oom condition
* use new test framework
* minor tweaks to executor test
* add cl entry
* remove type conversion
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
In Nomad 1.7 we updated our JWT library to go-jose, but this changed the wire
format of the embedded struct we have in the `IdentityClaims` struct that we
return as part of the `WhoAmI` RPC response. This wasn't originally intended to
be sent over the wire but other changes in Nomad 1.5+ added a caller to the
client. The library change causes a deserialization error on Nomad 1.5 and 1.6
clients, which prevents access to Nomad Variables and SD via template blocks.
Removed the incompatible fields from the response, which are unused by any
current caller. In a future version of Nomad, we'll likely remove the `WhoAmI`
callers from the client in lieu of using the public keys the clients have to
check auth.
Fixes: https://github.com/hashicorp/nomad/issues/19555
Nomad CI checks for copywrite headers using multiple config files
for specific exemption paths. This means the top-level config file
does not take effect when running the copywrite script within
these sub-folders. Exempt files therefore need to be added to the
sub-config files, along with the top level.
When the server's `vault` block has a default identity, we don't check the
user's Vault token (and in fact, we warn them on job submit if they've provided
one). But the validation hook still checks for a token if
`allow_unauthenticated` is set to true. This is a misconfiguration but there's
no reason for Nomad not to do the expected thing here.
Fixes: https://github.com/hashicorp/nomad/issues/19565
* drivers/executor: set oom_score_adj for raw_exec
This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.
We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.
Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.
I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.
We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
* drivers/executor: minor cleanup of setting oom adjustment
* e2e: add test for raw_exec oom adjust score
* e2e: set oom score adjust to -999
* cl: add cl
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* Sign-in page now hides token secret by default (toggleable) and updates components to Helios
* General helios-ification
* All the notifications get dismissal buttons
* token-details grid for spacing
* node eligibilty taken into consideration when clients list filtered to 'ready'
* A working draft of complex positive querying
* tags and filter badge
* CompositeStatus -> Status
* Buttons within a Helios SegmentedGroup
* Convert the other dropdowns to helios on clients index
* A bunch of client index test fixes
* Remaining clients list acceptance tests for State facet modified
The allocation table header sometimes conditionally renders the
`Actions` table column, but the allocation row would render it
unconditionally, resulting in broken tables when rendering allocations
for jobs without actions, where rows had more columns than the header.
Also fix the conditional class for the deployments allocation table to
read `length` from the right value.
The keys of `meta` fields have all characters outside of `[A-Za-z0-9_.]`
replaced by underscores when we create `NOMAD_META` environment variables. Make
sure this replacement is documented.
Fixes: https://github.com/hashicorp/nomad/issues/15359
Consul Enterprise agents all belong to an admin partition. Fingerprint this
attribute when available. When a Consul agent is not explicitly configured with
"default" it is in the default partition but will not report this in its
`/v1/agent/self` endpoint. Fallback to "default" when missing only for Consul
Enterprise.
This feature provides users the ability to add constraints for jobs to land on
Nomad nodes that have a Consul in that partition. Or it can allow cluster
administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also
have the option to implement a future `partition` field in the jobspec's
`consul` block to create an implicit constraint.
Ref: https://github.com/hashicorp/nomad/issues/13139#issuecomment-1856479581
Lower cased the title and headings in line with our company-wide style since this is being linked in an upcoming blog I was editing. I also lowercased words such as "Auth Method" and other primitives/components when mentioned in prose - this is in line with our style guide as well where we don't capitalize auth method and we only capitalize components that are SKU/product-like in their separateness/importance.
https://docs.google.com/document/d/1MRvGd6tS5JkIwl_GssbyExkMJqOXKeUE00kSEtFi8m8/edit
Adam Trujilo should be in agreement with changes like this based on our past discussions, but feel free to bring in stake holders if you're not sure about accepting and we can discuss.
* numalib: provide a fallback for topology scanning on linux
* numalib: better package var names
* cl: add cl
* lint: fix my sloppy code
* cl: fixup wording