4 Commits

Author SHA1 Message Date
Seth Hoenig
a58f0eca8e e2e: move rawexec oversub tests into oversubscription e2e test suite (#19717)
* e2e: move rawexec oversub tests into oversubscription e2e test suite

This PR moves two tests for raw_exec and memory oversubscription into
the oversubscription test suite, which has the necessary plumbing to
activate and restore the oversubscription configuration of the scheduler
during the test.

* cr: rename files for better readability
2024-01-11 14:27:05 -06:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver behaved before 1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Seth Hoenig
ccfb13a72d e2e: add test for raw_exec memory_max configuration (#19596)
* e2e: add test for raw_exec memory_max configuration

* docs: note raw_exec supports memory_max in resources documentation
2024-01-04 08:25:56 -06:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM-killed. This value is inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we were also shocked to find this value inherited
by every child process, and we believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix it in one of our configurations.

In production, we have been running our tasks wrapped in a script that
runs `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
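
For illustration only, the wrapper workaround described above could look
roughly like the following Python sketch. It is not the code from this
change, and the function names and the `path` parameter are hypothetical,
introduced here so the logic can be exercised against a file other than
the real procfs entry. It assumes Linux, where the kernel accepts values
from -1000 to 1000 in `/proc/self/oom_score_adj`.

```python
# Hypothetical helper names; this is a sketch, not Nomad's implementation.
OOM_SCORE_ADJ_PATH = "/proc/self/oom_score_adj"


def set_oom_score_adj(score: int = 0, path: str = OOM_SCORE_ADJ_PATH) -> None:
    """Write an OOM score adjustment for the calling process (Linux only)."""
    # The kernel only accepts values in the range [-1000, 1000].
    if not -1000 <= score <= 1000:
        raise ValueError(f"oom_score_adj out of range: {score}")
    with open(path, "w") as f:
        f.write(f"{score}\n")


def read_oom_score_adj(path: str = OOM_SCORE_ADJ_PATH) -> int:
    """Read back the current OOM score adjustment."""
    with open(path) as f:
        return int(f.read().strip())
```

A wrapper would call `set_oom_score_adj(0)` and then replace itself with
the real task (for example via `os.execvp`), so the task starts with the
reset value instead of the one inherited from the agent. Note that
lowering the value below its current setting generally requires
additional privileges, while raising it does not.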

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00