nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-02 00:15:43 +03:00

Author	SHA1	Message	Date
Tim Gross	77c8acb422	telemetry: fix excessive CPU consumption in executor (#25870) Collecting metrics from processes is expensive, especially on platforms like Windows. The executor code has a 5s cache of stats to ensure that we don't thrash syscalls on nodes running many allocations. But the timestamp used to calculate TTL of this cache was never being set, so we were always treating it as expired. This causes excess CPU utilization on client nodes. Ensure that when we fill the cache, we set the timestamp. In testing on Windows, this reduces executor CPU overhead by roughly 75%. This changeset includes two other related items: * The `telemetry.publish_allocation_metrics` field correctly prevents a node from publishing metrics, but the stats hook on the taskrunner still collects the metrics, which can be expensive. Thread the configuration value into the stats hook so that we don't collect if `telemetry.publish_allocation_metrics = false`. * The `linuxProcStats` type in the executor's `procstats` package is misnamed as a result of a couple rounds of refactoring. It's used by all task executors, not just Linux. Rename this and move a comment about how Windows processes are listed so that the comment is closer to where the logic is implemented. Fixes: https://github.com/hashicorp/nomad/issues/23323 Fixes: https://hashicorp.atlassian.net/browse/NMD-455	2025-05-19 09:24:13 -04:00
Piotr Kazmierczak	0fa0624576	exec: Fix incorrect `HOME` and `USER` env variables for tasks that have `user` set (#25859) Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-05-16 15:02:45 +02:00
Tim Gross	374e987b9b	metrics: emit cache and rss stats on cgroup v2 (#25751) In cgroups v2, a different map of memory stats is available from the kernel than in v1. The Docker API reflects this change. But there are equivalent values in the map for RSS (anonymously mapped memory) and cache (filesystem cache and tmpfs), which the Docker driver is not currently emitting. Fall back to these alternate values when the cgroups v1 values are not available. Include the anonymous mapping in the "measured" allocation stats as "RSS" so that they both show up in allocation metrics. We can do this on both the `docker` driver and the Linux executor for `exec` and `java` drivers. Fixes: https://github.com/hashicorp/nomad/issues/19185 Ref: https://hashicorp.atlassian.net/browse/NMD-437 Ref: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files Ref: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt	2025-04-24 12:48:18 -04:00
Piotr Kazmierczak	e9ebbed32c	drivers: unflake `TestExecutor_OOMKilled` (#25521) Every now and then TestExecutor_OOMKilled would fail with: "unable to start container process: container init was OOM-killed (memory limit too low?)" which started happening after we upgraded libcontainer. This PR removes manual (and arbitrary) resource limits on the test task, since it should be OOMd with resources inherited from the testExecutorCommandWithChroot, and it fixes a small possible goroutine leak in the OOM checker in the exec driver.	2025-03-28 11:35:02 +01:00
Piotr Kazmierczak	16bbdd9833	drivers: adapt shared executor code to use opencontainers/runc 1.2 (#25138) Co-authored-by: Michael Smithhisler <michael.smithhisler@hashicorp.com>	2025-03-17 14:32:16 +01:00
Michael Smithhisler	658c429d75	Drivers: add work_dir config to exec/raw_exec/java drivers (#24249) --------- Co-authored-by: wurosh <uros.m.perisic@gmail.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-11-01 11:04:40 -04:00
Seth Hoenig	51215bf102	deps: update to go-set/v3 and refactor to use custom iterators (#23971) * deps: update to go-set/v3 * deps: use custom set iterators for looping	2024-09-16 13:40:10 -05:00
Luke Palmer	75874136ac	fix cgroup setup for non-default devices (#22518)	2024-06-13 09:27:19 -04:00
Seth Hoenig	7d00a494d9	windows: fix inefficient gathering of task processes (#20619) * windows: fix inefficient gathering of task processes * return set of just executor pid in case of ps error	2024-05-17 09:46:23 -05:00
Juana De La Cuesta	169818b1bd	[gh-6980] Client: clean up old allocs before running new ones using the `exec` task driver. (#20500) Whenever the "exec" task driver is being used, Nomad runs a plugin that in turn runs the task in a container under the hood. If for any reason the executor is killed, the task is reparented to the init service and won't be stopped by Nomad when the job is updated or stopped. This commit introduces two mechanisms to avoid this behaviour: * Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task. * Adds a pre-start cleanup of the processes in the container, ensuring only the ones the executor runs are present at any given time.	2024-05-14 09:51:27 +02:00
Luiz Aoqui	b52a44717e	executor: limit the value of CPU shares (#19935) The value for the executor cgroup CPU weight must be within the limits imposed by the Linux kernel. Nomad used the task `resource.cpu`, an unbounded value, directly as the cgroup CPU weight, causing it to potentially go outside the imposed values. This commit clamps the CPU shares values to be within the limits allowed. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-02-09 16:29:14 -05:00
Marvin Chin	d75293d2ab	Add OOM detection for exec driver (#19563) * Add OomKilled field to executor proto format * Teach linux executor to detect and report OOMs * Teach exec driver to propagate OOMKill information * Fix data race * use tail /dev/zero to create oom condition * use new test framework * minor tweaks to executor test * add cl entry * remove type conversion --------- Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2024-01-03 09:50:27 -06:00
Seth Hoenig	e3c8700ded	deps: upgrade to go-set/v2 (#18638) No functional changes, just cleaning up deprecated usages that are removed in v2 and replacing one call of .Slice with .ForEach to avoid making the intermediate copy.	2023-10-05 11:56:17 -05:00
Seth Hoenig	591394fb62	drivers: plumb hardware topology via grpc into drivers (#18504) * drivers: plumb hardware topology via grpc into drivers This PR swaps out the temporary use of detecting system hardware manually in each driver for using the Client's detected topology by plumbing the data over gRPC. This ensures that Client configuration is taken into account consistently in all references to system topology. * cr: use enum instead of bool for core grade * cr: fix test slit tables to be possible	2023-09-18 08:58:07 -05:00
Seth Hoenig	2e1974a574	client: refactor cpuset partitioning (#18371) * client: refactor cpuset partitioning This PR updates the way Nomad client manages the split between tasks that make use of resources.cpus vs. resources.cores. Previously, each task was explicitly assigned which CPU cores they were able to run on. Every time a task was started or destroyed, all other tasks' cpusets would need to be updated. This was inefficient and would crush the Linux kernel when a client would try to run ~400 or so tasks. Now, we make use of cgroup hierarchy and cpuset inheritance to efficiently manage cpusets. * cr: tweaks for feedback	2023-09-12 09:11:11 -05:00
Seth Hoenig	a4cc76bd3e	numa: enable numa topology detection (#18146) * client: refactor cgroups management in client * client: fingerprint numa topology * client: plumb numa and cgroups changes to drivers * client: cleanup task resource accounting * client: numa client and config plumbing * lib: add a stack implementation * tools: remove ec2info tool * plugins: fixup testing for cgroups / numa changes * build: update makefile and package tests and cl	2023-08-10 17:05:30 -05:00
Patric Stout	e190eae395	Use config "cpu_total_compute" (if set) for all CPU statistics (#17628) Before this commit, it was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting the cpu_total_compute didn't resolve the issue. This issue was most noticeable on ARM64, as there auto-detection always failed.	2023-07-19 13:30:47 -05:00
Seth Hoenig	33ac5ed1df	client: do not disable memory swappiness if kernel does not support it (#17625) * client: do not disable memory swappiness if kernel does not support it This PR adds a workaround for very old Linux kernels which do not support the memory swappiness interface file. Normally we write a "0" to the file to explicitly disable swap. In the case the kernel does not support it, give libcontainer a nil value so it does not write anything. Fixes #17448 * client: detect swappiness by writing to the file * fixup changelog Co-authored-by: James Rasell <jrasell@users.noreply.github.com> --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-06-22 09:36:31 -05:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Elvis Pranskevichus	70faebbbb8	drivers/exec: Fix handling of capabilities for unprivileged tasks (#16643) Currently, the `exec` driver is only setting the Bounding set, which is not sufficient to actually enable the requisite capabilities for the task process. In order for the capabilities to survive `execve` performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient` sets must also be set. Per CAPABILITIES (7): "Ambient: This is a set of capabilities that are preserved across an execve(2) of a program that is not privileged. The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable."	2023-03-28 12:12:55 -04:00
Tim Gross	11a5f79084	exec: allow running commands from host volume (#14851) The exec driver and other drivers derived from the shared executor check the path of the command before handing off to libcontainer to ensure that the command doesn't escape the sandbox. But we don't check any host volume mounts, which should be safe to use as a source for executables if we're letting the user mount them to the container in the first place. Check the mount config to verify the executable lives in the mount's host path, but then return an absolute path within the mount's task path so that we can hand that off to libcontainer to run. Includes a good bit of refactoring here because the anchoring of the final task path has different code paths for inside the task dir vs inside a mount. But I've fleshed out the test coverage of this a good bit to ensure we haven't created any regressions in the process.	2022-11-11 09:51:15 -05:00
Seth Hoenig	6d9e179338	deps: update opencontainers/runc to v1.1.3	2022-08-04 12:56:49 -05:00
Seth Hoenig	be7ec8de3e	raw_exec: make raw exec driver work with cgroups v2 This PR adds support for the raw_exec driver on systems with only cgroups v2. The raw exec driver is able to use cgroups to manage processes. This happens only on Linux, when exec_driver is enabled, and the no_cgroups option is not set. The driver uses the freezer controller to freeze processes of a task, issue a sigkill, then unfreeze. Previously the implementation assumed cgroups v1, and now it also supports cgroups v2. There is a bit of refactoring in this PR, but the fundamental design remains the same. Closes #12351 #12348	2022-04-04 16:11:38 -05:00
Seth Hoenig	5da1a31e94	client: enable support for cgroups v2 This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users. Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing cpuset cgroups. Before, Nomad has been making use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, and so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents. Nomad will only activate the v2 logic when it detects cgroups2 is mounted at /sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2 mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to use the v1 logic, and should operate as before. Systems that do not support cgroups v2 are also not affected. When v2 is activated, Nomad will create a parent called nomad.slice (unless otherwise configured in Client config), and create cgroups for tasks using naming convention <allocID>-<task>.scope. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected. Client nodes now export a new fingerprint attribute, unique.cgroups.version which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by Nomad. The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak". In cgroups v2, the PIDs are started in the cgroup they will always live in, and thus the cause of the leak is eliminated. [1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html Closes #11289 Fixes #11705 #11773 #11933	2022-03-23 11:35:27 -05:00
Seth Hoenig	8492c6576e	build: upgrade and speed up circleci configuration This PR upgrades our CI images and fixes some affected tests. - upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02) - eliminate go-machine-recent-image (no longer necessary) - manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174) - fix tcp dial error check (message seems to be OS specific) - spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2) - use safe MkdirTemp for generating tmpfiles NOT applied: (too flaky) - eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting) - upgrade resource type for all images to large (2C -> 4C)	2022-01-24 08:28:14 -06:00
Seth Hoenig	87dbc7162b	deps: upgrade docker and runc This PR upgrades - docker dependency to the latest tagged release (v20.10.12) - runc dependency to the latest tagged release (v1.0.3) Docker does not abide by [semver](https://github.com/moby/moby/issues/39302), so it is marked +incompatible, and transitive dependencies are upgraded manually. Runc made three relevant breaking changes * cgroup manager .Set changed to accept Resources instead of Cgroup `3f65946756` * config.Device moved to devices.Device https://github.com/opencontainers/runc/pull/2679 * mountinfo.Mounted now returns an error if the specified path does not exist https://github.com/moby/sys/blob/mountinfo/v0.5.0/mountinfo/mountinfo.go#L16	2022-01-18 08:35:26 -06:00
Alessandro De Blasis	759397533a	metrics: added `mapped_file` metric (#11500) Signed-off-by: Alessandro De Blasis <alex@deblasis.net> Co-authored-by: Nate <37554478+servusdei2018@users.noreply.github.com>	2022-01-10 15:35:19 -05:00
Mahmood Ali	feb450a393	executor: set CpuWeight in cgroup-v2 (#11287) Cgroup-v2 uses `cpu.weight` property instead of cpu shares: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu-interface-files . And it uses a different range (i.e. `[1, 10000]`) from cpu.shares (i.e. `[2, 262144]`) to make things more interesting. Luckily, libcontainer provides a helper function to perform the conversion [`ConvertCPUSharesToCgroupV2Value`](https://pkg.go.dev/github.com/opencontainers/runc@v1.0.2/libcontainer/cgroups#ConvertCPUSharesToCgroupV2Value). I have confirmed that docker/libcontainer performs the conversion as well in https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/specconv/spec_linux.go#L536-L541 , and that CpuShares is ignored by libcontainer in https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/cgroups/fs2/cpu.go#L24-L29 .	2021-10-14 08:46:07 -04:00
Mahmood Ali	6c414cd5f9	gofmt all the files mostly to handle build directives in 1.17.	2021-10-01 10:14:28 -04:00
Mahmood Ali	0be58d72f4	drivers/exec: Don't inherit Nomad oom_score_adj value (#10698) Explicitly set the `oom_score_adj` value for `exec` and `java` tasks. We recommend that the Nomad service have a low oom_score_adj value (e.g. -1000) to avoid having the Nomad agent OOM-killed if the node is oversubscribed. However, Nomad's workloads should not inherit the Nomad process's value, which is the default behavior. Fixes #10663	2021-06-03 14:15:50 -04:00
Seth Hoenig	595cef8136	drivers/exec: pass capabilities through executor RPC Add capabilities to the LaunchRequest proto so that the capabilities set actually gets plumbed all the way through to task launch.	2021-05-17 12:37:40 -06:00
Seth Hoenig	191144c3bf	drivers/exec: enable setting allow_caps on exec driver This PR enables setting allow_caps on the exec driver plugin configuration, as well as cap_add and cap_drop in exec task configuration. These options replicate the functionality already present in the docker task driver. Important: this change also reduces the default set of capabilities enabled by the exec driver to match the default set enabled by the docker driver. Until v1.0.5 the exec task driver would enable all capabilities supported by the operating system. v1.0.5 removed NET_RAW from that list of default capabilities, but left many others which could potentially also be leveraged by compromised tasks. Important: the "root" user is still special cased when used with the exec driver. Older versions of Nomad enabled all capabilities supported by the operating system for tasks set with the root user. To maintain compatibility with existing clusters we continue supporting this "feature", however we maintain support for the legacy set of capabilities rather than enabling all capabilities now supported on modern operating systems.	2021-05-17 12:37:40 -06:00
Seth Hoenig	003d68fe6d	drivers/docker+exec+java: disable net_raw capability by default The default Linux Capabilities set enabled by the docker, exec, and java task drivers includes CAP_NET_RAW (for making ping just work), which has the side effect of opening an ARP DoS/MiTM attack between tasks using bridge networking on the same host network. https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities This PR disables CAP_NET_RAW for the docker, exec, and java task drivers. The previous behavior can be restored for docker using the allow_caps docker plugin configuration option. A future version of nomad will enable similar configurability for the exec and java task drivers.	2021-05-12 13:22:09 -07:00
Nick Ethier	e0a599ed9c	nit: code cleanup/organization	2021-04-16 15:14:29 -04:00
Nick Ethier	5377be43ff	executor: add support for cpuset cgroup	2021-04-15 10:24:31 -04:00
Yoan Blanc	a814f0253f	chore: bump golangci-lint from v1.24 to v1.39 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2021-04-03 09:50:23 +02:00
Mahmood Ali	4532272931	drivers/exec: Account for cgroup-v2 memory stats If the host is running with cgroup-v2, RSS and Max Usage don't get reported anymore.	2021-04-01 12:13:21 -04:00
zhsj	46b335d652	deps: update runc to v1.0.0-rc93 includes updates for breaking changes in runc v1.0.0-rc93	2021-03-31 10:57:02 -04:00
Mahmood Ali	43549b46fc	driver/exec: set soft memory limit Linux offers soft memory limits: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html#soft-limits , and https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?highlight=memory.low . We can set soft memory limits through libcontainer `Resources.MemoryReservation`: https://pkg.go.dev/github.com/opencontainers/runc@v0.1.1/libcontainer/configs#Resources	2021-03-30 16:55:58 -04:00
Mahmood Ali	5e3fbd5774	oversubscription: driver/exec to honor MemoryMaxMB	2021-03-30 16:55:58 -04:00
Seth Hoenig	836ee9e4a2	drivers/exec+java: Add task configuration to restore previous PID/IPC isolation behavior This PR adds pid_mode and ipc_mode options to the exec and java task driver config options. By default these will defer to the default_pid_mode and default_ipc_mode agent plugin options created in #9969. Setting these values to "host" mode disables isolation for the task. Doing so is not recommended, but may be necessary to support legacy job configurations. Closes #9970	2021-02-08 14:26:35 -06:00
Seth Hoenig	6dd5de4b69	docs: fixup comments, var names	2021-02-08 10:58:44 -06:00
Seth Hoenig	b682371a22	drivers/exec+java: Add configuration to restore previous PID/IPC namespace behavior. This PR adds default_pid_mode and default_ipc_mode options to the exec and java task drivers. These default to "private" mode, enabling PID and IPC isolation for tasks. Setting them to "host" mode disables isolation. Doing so is not recommended, but may be necessary to support legacy job configurations. Closes #9969	2021-02-05 15:52:11 -06:00
Chris Baker	7f06adf1af	Merge tag 'v1.0.3' into post-release-1.0.3 Version 1.0.3	2021-01-29 19:30:08 +00:00
Chris Baker	109fb53e50	put exec process in a new IPC namespace	2021-01-28 12:03:19 +00:00
Kris Hicks	677353a205	Add PID namespacing and e2e test	2021-01-28 12:03:19 +00:00
Kris Hicks	bcd4752fc9	executor_linux: Remove unreachable PATH= code (#9778) This has to have been unused because the HasPrefix operation is backwards, meaning a Command.Env that includes PATH= never would have worked; the default path was always used.	2021-01-15 11:19:09 -08:00
Kris Hicks	071f4c7596	Add gocritic to golangci-lint config (#9556)	2020-12-08 12:47:04 -08:00
Shengjing Zhu	274bf2ee1c	Adjust cgroup change in libcontainer	2020-08-20 00:31:07 +08:00
Thomas Lefebvre	5a017acd0b	client: support no_pivot_root in exec driver configuration	2020-02-18 09:27:16 -08:00


109 Commits