Commit Graph

798 Commits

Author SHA1 Message Date
Tim Gross
eedbd36fef qemu: pass task resources into driver for cgroup setup (#23466)
As part of the work for 1.7.0 we moved portions of the task cgroup setup down
into the executor. This requires that the executor constructor get the
`TaskConfig.Resources` struct, and this was missing from the `qemu` driver. We
fixed a panic caused by this change in #19089 before we shipped, but this fix
was effectively undo after we added plumbing for custom cgroups for `raw_exec`
in 1.8.0. As a result, running `qemu` tasks always fail on Linux.

This was undetected in testing because our CI environment doesn't have QEMU
installed. I've got all the unit tests running locally again and have added QEMU
installation when we're running the drivers tests.

Fixes: https://github.com/hashicorp/nomad/issues/23250
2024-07-01 11:41:10 -04:00
Piotr Kazmierczak
d5e1515e80 docker: default to hyper-v isolation on Windows (#23452) 2024-07-01 08:56:43 +02:00
Piotr Kazmierczak
0ece7b5c16 docker: validate that containers do not run as ContainerAdmin on Windows (#23443)
This enables checks for ContainerAdmin user on docker images on Windows. It's
only checked if users run docker with process isolation and not hyper-v,
because hyper-v provides its own, proper sandboxing.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-06-27 16:22:24 +02:00
Piotr Kazmierczak
85430be6dd raw_exec: oom_score_adj support (#23308) 2024-06-14 11:36:27 +02:00
Luke Palmer
75874136ac fix cgroup setup for non-default devices (#22518) 2024-06-13 09:27:19 -04:00
Piotr Kazmierczak
830297bcf0 docker: update image in TestDockerDriver_Start_Image_HTTPS (#23309) 2024-06-12 16:13:39 +02:00
Piotr Kazmierczak
0e8a67f0e1 docker: oom_score_adj support (#23297) 2024-06-12 10:49:20 +02:00
Tim Gross
71fd5c2474 testing: pull Docker images from mirror (#23190)
In https://github.com/hashicorp/nomad/pull/17401 we added test helpers that
would allow `docker` driver tests to pull from a mirror of the Docker Hub
registry. Extend the use of this helper a test that recently hit
rate-limiting.

Fixes: https://github.com/hashicorp/nomad/issues/23174
2024-06-06 11:21:45 -04:00
Piotr Kazmierczak
307fd590d7 docker: new container_exists_attempts configuration field (#22419)
This allows users to set a custom value of attempts that will be made to purge
an existing (not running) container if one is found during task creation.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-05-30 19:22:14 +02:00
Piotr Kazmierczak
bf11e39ac8 docker: add a unit test for "container already exists" error when creating containers (#22238) 2024-05-30 11:24:28 +02:00
Seth Hoenig
7d00a494d9 windows: fix inefficient gathering of task processes (#20619)
* windows: fix inefficient gathering of task processes

* return set of just executor pid in case of ps error
2024-05-17 09:46:23 -05:00
Juana De La Cuesta
169818b1bd [gh-6980] Client: clean up old allocs before running new ones using the exec task driver. (#20500)
Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
2024-05-14 09:51:27 +02:00
Tim Gross
623486b302 deps: vendor containernetworking/plugins functions for net NS utils (#20556)
We bring in `containernetworking/plugins` for the contents of a single file,
which we use in a few places for running a goroutine in a specific network
namespace. This code hasn't needed an update in a couple of years, and a good
chunk of what we need was previously vendored into `client/lib/nsutil`
already.

Updating the library via dependabot is causing errors in Docker driver tests
because it updates a lot of transient dependencies, and it's bringing in a pile
of new transient dependencies like opentelemetry. Avoid this problem going
forward by vendoring the remaining code we hadn't already.

Ref: https://github.com/hashicorp/nomad/pull/20146
2024-05-13 09:10:16 -04:00
Seth Hoenig
14a022cbc0 drivers/raw_exec: enable setting cgroup override values (#20481)
* drivers/raw_exec: enable setting cgroup override values

This PR enables configuration of cgroup override values on the `raw_exec`
task driver. WARNING: setting cgroup override values eliminates any
gauruntee Nomad can make about resource availability for *any* task on
the client node.

For cgroup v2 systems, set a single unified cgroup path using `cgroup_v2_override`.
The path may be either absolute or relative to the cgroup root.

config {
  cgroup_v2_override = "custom.slice/app.scope"
}

or

config {
  cgroup_v2_override = "/sys/fs/cgroup/custom.slice/app.scope"
}

For cgroup v1 systems, set a per-controller path for each controller using
`cgroup_v1_override`. The path(s) may be either absolute or relative to
the controller root.

config {
  cgroup_v1_override = {
    "pids": "custom/app",
    "cpuset": "custom/app",
  }
}

or

config {
  cgroup_v1_override = {
    "pids": "/sys/fs/cgroup/pids/custom/app",
    "cpuset": "/sys/fs/cgroup/cpuset/custom/app",
  }
}

* drivers/rawexec: ensure only one of v1/v2 cgroup override is set

* drivers/raw_exec: executor should error if setting cgroup does not work

* drivers/raw_exec: create cgroups in raw_exec tests

* drivers/raw_exec: ensure we fail to start if custom cgroup set and non-root

* move custom cgroup func into shared file

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2024-05-07 16:46:27 -07:00
Luiz Aoqui
9d4f7bcb68 mock_driver: fix fingreprint key (#20351)
The `mock_driver` is an internal task driver used mostly for testing and
simulating workloads. During the allocrunner v2 work (#4792) its name
changed from `mock_driver` to just `mock` and then back to
`mock_driver`, but the fingreprint key was kept as `driver.mock`.

This results in tasks configured with `driver = "mock"` to be scheduled
(because Nomad thinks the client has a task driver called `mock`), but
fail to actually run (because the Nomad client can't find a driver
called `mock` in its catalog).

Fingerprinting the right name prevents the job from being scheduled in
the first place.

Also removes mentions of the mock driver from documentation since its an
internal driver and not available in any production release.
2024-04-16 07:16:55 +01:00
Seth Hoenig
825efc3925 docker: use correct effective cpuset filename on legacy cgroups v1 systems (#20294) 2024-04-05 08:05:51 -05:00
Yorick Gersie
6124ee8afb cpuset fixer: use correct cgroup path for updates (#20276)
* cpuset fixer: use correct cgroup path for updates

fixes #20275

* docker: flatten switch statement and add test cases

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-04-04 15:54:10 -05:00
Seth Hoenig
05937ab75b exec2: add client support for unveil filesystem isolation mode (#20115)
* exec2: add client support for unveil filesystem isolation mode

This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.

* actually create alloc-mounts-dir directory

* fix doc strings about alloc mount dir paths
2024-03-13 08:24:17 -05:00
carrychair
5f5b34db0e remove repetitive words (#20110)
Signed-off-by: carrychair <linghuchong404@gmail.com>
2024-03-11 08:52:08 +00:00
Seth Hoenig
4d83733909 tests: swap testify for test in more places (#20028)
* tests: swap testify for test in plugins/csi/client_test.go

* tests: swap testify for test in testutil/

* tests: swap testify for test in host_test.go

* tests: swap testify for test in plugin_test.go

* tests: swap testify for test in utils_test.go

* tests: swap testify for test in scheduler/

* tests: swap testify for test in parse_test.go

* tests: swap testify for test in attribute_test.go

* tests: swap testify for test in plugins/drivers/

* tests: swap testify for test in command/

* tests: fixup some test usages

* go: run go mod tidy

* windows: cpuset test only on linux
2024-02-29 12:11:35 -06:00
James Rasell
e4648551e5 test: fix test datarace within driver shared eventer. (#19975) 2024-02-15 07:39:43 +00:00
Luiz Aoqui
b52a44717e executor: limit the value of CPU shares (#19935)
The value for the executor cgroup CPU weight must be within the limits
imposed by the Linux kernel.

Nomad used the task `resource.cpu`, an unbounded value, directly as the
cgroup CPU weight, causing it to potentially go outside the imposed
values.

This commit clamps the CPU shares values to be within the limits
allowed.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-09 16:29:14 -05:00
Tim Gross
110d93ab25 windows: remove LazyDLL calls for system modules (#19925)
On Windows, Nomad uses `syscall.NewLazyDLL` and `syscall.LoadDLL` functions to
load a few system DLL files, which does not prevent DLL hijacking
attacks. Hypothetically a local attacker on the client host that can place an
abusive library in a specific location could use this to escalate privileges to
the Nomad process. Although this attack does not fall within the Nomad security
model, it doesn't hurt to follow good practices here.

We can remove two of these DLL loads by using wrapper functions provided by the
stdlib in `x/sys/windows`

Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2024-02-09 08:47:48 -05:00
James Rasell
10324566ae driver/rawexec: populate OOM killed exit result. (#19829) 2024-01-29 08:54:52 +00:00
James Rasell
8d6067e987 driver/qemu: populate OOM killed exit result. (#19830) 2024-01-29 07:34:27 +00:00
James Rasell
34fe96a420 driver/java: populate OOM killed exit result. (#19818) 2024-01-26 08:09:16 +00:00
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Marvin Chin
d75293d2ab Add OOM detection for exec driver (#19563)
* Add OomKilled field to executor proto format

* Teach linux executor to detect and report OOMs

* Teach exec driver to propagate OOMKill information

* Fix data race

* use tail /dev/zero to create oom condition

* use new test framework

* minor tweaks to executor test

* add cl entry

* remove type conversion

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-03 09:50:27 -06:00
James Rasell
91cba75f5c copywrite: fix and add copywrite config enterprise comments. (#19590)
Nomad CI checks for copywrite headers using multiple config files
for specific exemption paths. This means the top-level config file
does not take effect when running the copywrite script within
these sub-folders. Exempt files therefore need to be added to the
sub-config files, along with the top level.
2024-01-03 08:58:53 +00:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.

We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00
hc-github-team-es-release-engineering
a4ecc2fbc8 Merge pull request #19283 from hashicorp/RELENG-960-EOY-license-fixes
[DO NOT MERGE UNTIL EOY] update year in LICENSE and copywrite files
2024-01-02 09:38:54 -08:00
Luiz Aoqui
e4e70b086a ci: run linter in ./api package (#19513) 2023-12-19 15:59:47 -05:00
Morgan Drake
c5b36b500b move license to 2024 2023-12-01 12:26:27 -08:00
Tim Gross
0236bd0907 qemu: fix panic from missing resources block (#19089)
The `qemu` driver uses our universal executor to run the qemu command line
tool. Because qemu owns the resource isolation, we don't pass in the resource
block that the universal executor uses to configure cgroups and core
pinning. This resulted in a panic.

Fix the panic by returning early in the cgroup configuration in the universal
executor. This fixes `qemu` but also any third-party drivers that might exist
and are using our executor code without passing in the resource block.

In future work, we should ensure that the `resources` block is being translated
into qemu equivalents, so that we have support for things like NUMA-aware
scheduling for that driver.

Fixes: https://github.com/hashicorp/nomad/issues/19078
2023-11-14 16:26:44 -05:00
modrake
51ffe4208e workaround and fixes for MPL and copywrite bot (#18775) 2023-10-17 08:02:13 +01:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
Seth Hoenig
591394fb62 drivers: plumb hardware topology via grpc into drivers (#18504)
* drivers: plumb hardware topology via grpc into drivers

This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.

* cr: use enum instead of bool for core grade

* cr: fix test slit tables to be possible
2023-09-18 08:58:07 -05:00
Seth Hoenig
2e1974a574 client: refactor cpuset partitioning (#18371)
* client: refactor cpuset partitioning

This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.

Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.

* cr: tweaks for feedback
2023-09-12 09:11:11 -05:00
James Rasell
a9d5beb141 test: use correct parallel test setup func (#18326) 2023-08-25 13:51:36 +01:00
Seth Hoenig
f5b0da1d55 all: swap exp packages for maps, slices (#18311) 2023-08-23 15:42:13 -05:00
Piotr Kazmierczak
53ef6391a5 drivers/docker: fix a hostConfigMemorySwappiness panic (#18238)
cgroupslib.MaybeDisableMemorySwappiness returned an incorrect type, and was
incorrectly typecast to int64 causing a panic on non-linux and non-windows hosts.
2023-08-17 14:45:31 +02:00
Tim Gross
f00bff09f1 fix multiple overflow errors in exponential backoff (#18200)
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
2023-08-15 14:38:18 -04:00
Seth Hoenig
6747ef8803 drivers/raw_exec: restore ability to run tasks without nomad running as root (#18206)
Although nomad officially does not support running the client as a non-root
user, doing so has been more or less possible with the raw_exec driver as
long as you don't expect features to work like networking or running tasks
as specific users. In the cgroups refactoring I bulldozed right over the
special casing we had in place for raw_exec to continue working if the cgroups
were unable to be created. This PR restores that behavior - you can now
(as before) run the nomad client as a non-root user and make use of the
raw_exec task driver.
2023-08-15 11:22:30 -05:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
hashicorp-copywrite[bot]
89e24d7405 Adding explicit MPL license for sub-package
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.
2023-08-10 17:27:01 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
Patric Stout
e190eae395 Use config "cpu_total_compute" (if set) for all CPU statistics (#17628)
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolved
the issue.

This issue was most noticeable on ARM64, as there auto-detection
always failed.
2023-07-19 13:30:47 -05:00
James Rasell
0015d25344 qemu: add test to cover task store functions. (#17967) 2023-07-19 15:35:16 +01:00
James Rasell
81aa274551 qemu: fix log lines to use correct QEMU capitalization. (#17961) 2023-07-19 15:21:15 +01:00