nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
James Rasell	1916a16311	exec: Set LOGNAME env var on exec based drivers. (#26703 ) Typically the `LOGNAME` environment variable should be set according to the values within `/etc/passwd` and represents the name of the logged in user. This should be set, where possible, alongside the USER and HOME variables for all drivers that use the shared executor and do not use a sub-shell.	2025-09-05 14:07:27 +01:00
Michael Schurter	ee5059a6a7	docs: revert to labels={"foo.bar": "baz"} style (#26535 ) * docs: revert to labels={"foo.bar": "baz"} style Back in #24074 I thought it was necessary to wrap labels in a list to support quoted keys in hcl2. This... doesn't appear to be true at all? The simpler `labels={...}` syntax appears to work just fine. I updated the docs and a test (and modernized it a bit). I also switched some other examples to the `labels = {}` format from the old `labels{}` format. * copywronged * fmtd	2025-08-20 09:26:42 -07:00
Daniel Bennett	7c633f8109	exec: don't panic on rootless raw_exec tasks (#26401 ) the executor dies, leaving an orphaned process still running. the panic fix: * don't `panic()` * and return an empty, but non-nil, func on cgroup error feature fix: * allow non-root agent to proceed with exec when cgroups are off	2025-08-04 13:58:35 -04:00
James Rasell	5989d5862a	ci: Update golangci-lint to v2 and fix highlighted issues. (#26334 )	2025-07-25 10:44:08 +01:00
Tim Gross	c8dcd3c2db	docker: clamp CPU shares to minimum of 2 (#26081 ) In #25963 we added normalization of CPU shares for large hosts where the total compute was larger than the maximum CPU shares. But if the result after normalization is less than 2, runc will have an integer overflow. We prevent this in the shared executor for the `exec`/`rawexec` driver by clamping to the safe minimum value. Do this for the `docker` driver as well and add test coverage of it for the shared executor too. Fixes: https://github.com/hashicorp/nomad/issues/26080 Ref: https://github.com/hashicorp/nomad/pull/25963	2025-06-19 13:48:06 -04:00
Conor Mongey	f7096fb9d6	docker: add cgroupns task config (#25927 )	2025-06-11 13:50:44 -04:00
dependabot[bot]	6a35c1b8ea	chore(deps): bump github.com/docker/docker from 28.1.1+incompatible to 28.2.2+incompatible (#25954 ) * chore(deps): bump github.com/docker/docker Bumps [github.com/docker/docker](https://github.com/docker/docker) from 28.1.1+incompatible to 28.2.2+incompatible. - [Release notes](https://github.com/docker/docker/releases) - [Commits](https://github.com/docker/docker/compare/v28.1.1...v28.2.2) --- updated-dependencies: - dependency-name: github.com/docker/docker dependency-version: 28.2.2+incompatible dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * deps: containerd/errdefs instead of docker/errdefs moby's errdefs are deprecated as of `f1bb44aeee` and now merely point to containerd's --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2025-06-05 10:26:18 -04:00
Tim Gross	34e96932a1	drivers: normalize CPU shares/weights to fit large hosts (#25963 ) The `resources.cpu` field is scheduled in MHz. On most Linux task drivers, this value is then mapped to a `cpu.share` (cgroups v1) or `cpu.weight` (cgroups v2). But this means on very large hosts where the total compute is greater than the Linux kernel defined maximum CPU shares, you can't set a `resources.cpu` value large enough to consume the entire host. The `cpu.share`/`cpu.weight` value is relative within the parent cgroup's slice, which is owned by Nomad. So we can fix this by re-normalizing the weight on very large hosts such that the maximum `resources.cpu` matches up with largest possible CPU share. This happens in the task driver so that the rest of Nomad doesn't need to be aware of this implementation detail. Note that these functions will result in bad share config if the request is more than the available, but that's supposed to be caught in the scheduler so by not catching it here we intentionally hit the runc error. Fixes: https://hashicorp.atlassian.net/browse/NMD-297 Fixes: https://github.com/hashicorp/nomad/issues/7731 Ref: https://go.hashi.co/rfc/nmd-211	2025-06-03 15:57:40 -04:00
Tim Gross	77c8acb422	telemetry: fix excessive CPU consumption in executor (#25870) Collecting metrics from processes is expensive, especially on platforms like Windows. The executor code has a 5s cache of stats to ensure that we don't thrash syscalls on nodes running many allocations. But the timestamp used to calculate TTL of this cache was never being set, so we were always treating it as expired. This causes excess CPU utilization on client nodes. Ensure that when we fill the cache, we set the timestamp. In testing on Windows, this reduces executor CPU overhead by roughly 75%. This changeset includes two other related items: * The `telemetry.publish_allocation_metrics` field correctly prevents a node from publishing metrics, but the stats hook on the taskrunner still collects the metrics, which can be expensive. Thread the configuration value into the stats hook so that we don't collect if `telemetry.publish_allocation_metrics = false`. * The `linuxProcStats` type in the executor's `procstats` package is misnamed as a result of a couple rounds of refactoring. It's used by all task executors, not just Linux. Rename this and move a comment about how Windows processes are listed so that the comment is closer to where the logic is implemented. Fixes: https://github.com/hashicorp/nomad/issues/23323 Fixes: https://hashicorp.atlassian.net/browse/NMD-455	2025-05-19 09:24:13 -04:00
Piotr Kazmierczak	0fa0624576	exec: Fix incorrect `HOME` and `USER` env variables for tasks that have `user` set (#25859 ) Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-05-16 15:02:45 +02:00
Tim Gross	374e987b9b	metrics: emit cache and rss stats on cgroup v2 (#25751) In cgroups v2, a different map of memory stats is available from the kernel than in v1. The Docker API reflects this change. But there are equivalent values in the map for RSS (anonymously mapped memory) and cache (filesystem cache and tmpfs), which the Docker driver is not currently emitting. Fall back to these alternate values when the cgroups v1 values are not available. Include the anonymous mapping in the "measured" allocation stats as "RSS" so that they both show up in allocation metrics. We can do this on both the `docker` driver and the Linux executor for `exec` and `java` drivers. Fixes: https://github.com/hashicorp/nomad/issues/19185 Ref: https://hashicorp.atlassian.net/browse/NMD-437 Ref: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files Ref: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt	2025-04-24 12:48:18 -04:00
Tim Gross	c7cb49f205	testing: fix a panic in docker stats collection test (#25747 ) When the context closes, the stats emitter closes its channel. It's possible for the channel to be closed in the stats emitter goroutine before the `select` in the test sees that the context has closed, which can result in a panic in the test when we try to read the empty value off the channel.	2025-04-24 10:41:03 -04:00
Piotr Kazmierczak	3ad0df71a8	docker: correct stat response for rss, cache and swap memory in cgroups v1 (#25741 ) #25138 refactoring accidentally removed some of the memory stats that weren't available as concrete types in containerapi.	2025-04-24 15:17:56 +02:00
Tim Gross	4d7ed88a8d	testing: use Docker Hub registry mirror for additional tests (#25733) This image was missed in https://github.com/hashicorp/nomad/pull/25703 and is being rate limited in tests.	2025-04-24 08:50:32 -04:00
Tim Gross	88dc842729	testing: use Docker Hub registry mirror for CI (#25703 ) As of April 1, Docker Hub rate limits tightened. With only 10 pulls/hr/IP, we're likely to encounter test failures. Switch all Docker images getting pulled from this repository to use the HashiCorp managed registry mirror. Note that most of our tests in `drivers/docker` don't pull from the remote registry but load a local image, while others will need to pull from the remote and fetch different images depending on OS/arch. Refactor the definition of test task configuration to make it clear which is which, and de-factor some false sharing of setup functions. Updates the E2E tests to use that registry by configuring the Docker daemon. This required changing out a few container images that we don't have in the registry, but these new images are all smaller. There are a couple of tests that still use explicitly-tagged `docker.io` images or other third-party registries, which have been left in place. Ref: https://hashicorp.atlassian.net/browse/NET-12233 update E2E images to those in the registry mirror fix windows and docklog test build fix stopsignal test mop-up more mop-up	2025-04-18 14:21:49 -04:00
James Rasell	c85c723336	ci: Run core tests groups workflow on amd64 and arm64 runners. (#25695 )	2025-04-17 15:16:29 +01:00
Tim Gross	48f304d0ca	java: only set nobody user on Unix (#25648) In #25496 we introduced the ability to have `task.user` set on Windows, so long as the user ID fits a particular shape. But this uncovered a 7 year old bug in the `java` driver introduced in #5143, where we set the `task.user` to the non-existent Unix user `nobody`, even if we're running on Windows. Prior to the change in #25496 we always ignored the `task.user`, so this was not a problem. We don't set the `task.user` in the `raw_exec` driver, and the otherwise very similar `exec` driver is Linux-only, so we never see the problem there. Fix the bug in the `java` driver by gating the change to the `task.user` on not being Windows. Also add a check to the new code path that the user is non-empty before parsing it, so that any third party drivers that might be borrowing the executor code don't hit the same problem on Windows. Ref: https://github.com/hashicorp/nomad/pull/5143 Ref: https://github.com/hashicorp/nomad/pull/25496 Fixes: https://github.com/hashicorp/nomad/issues/25638	2025-04-10 10:34:34 -04:00
Denis Rodin	aca0ff438a	raw_exec windows: add support for setting the task user (#25496 )	2025-04-03 11:21:13 -04:00
tehut	27b1d470a8	modify rawexec TaskConfig and Config to accept envvar denylist (#25511 ) * modify rawexec TaskConfig and Config to accept envvar denylist * update rawexec driver docs to include deniedEnvars options Co-authored-by: Daniel Bennett <dbennett@hashicorp.com> --------- Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2025-04-02 12:25:28 -07:00
Piotr Kazmierczak	e9ebbed32c	drivers: unflake `TestExecutor_OOMKilled` (#25521 ) Every now and then TestExecutor_OOMKilled would fail with: "unable to start container process: container init was OOM-killed (memory limit too low?)" which started happening since we upgraded libcontainer. This PR removes manual (and arbitrary) resource limits on the test task, since it should be OOMd with resources inherited from the testExecutorCommandWithChroot, and it fixes a small possible goroutine leak in the OOM checker in exec driver.	2025-03-28 11:35:02 +01:00
Allison Larson	d1d8945d2e	Add docker plugin config option image_pull_timeout value for default timeout (#25489 ) * Add docker plugin config image_pull_timeout value for default timeout * Add image_pull_timeout docker plugin config to docs * Add changelog	2025-03-24 13:03:14 -07:00
Piotr Kazmierczak	cb8f4ea452	drivers: set -1 exit code in case executor gets killed (#25453 ) Nomad driver handles incorrectly set exit code 0 in case of executor failure. This corrects that behavior. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-03-20 15:06:39 +01:00
Piotr Kazmierczak	e249a6197f	docker: TestDockerDriver_OOMKilled should now run on cgroups v2 (#25443) Docker driver's TestDockerDriver_OOMKilled should run on cgroups v2 now, since we're running the docker v27 client library and our runners run docker v26, which contains the containerd fix containerd/containerd#6323.	2025-03-19 16:53:37 +01:00
dependabot[bot]	459f95ce3f	chore(deps): bump github.com/docker/docker from 27.4.1+incompatible to 28.0.1+incompatible (#25405 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2025-03-18 08:32:37 +00:00
Piotr Kazmierczak	16bbdd9833	drivers: adapt shared executor code to use opencontainers/runc 1.2 (#25138 ) Co-authored-by: Michael Smithhisler <michael.smithhisler@hashicorp.com>	2025-03-17 14:32:16 +01:00
Simon Zou	73ceacd236	ListProcesses through PID when cgroup is not found in Linux (#25198 ) * ListProcesses through PID when cgroup is not found * add changelog entry * update the ListByPid for windows	2025-03-06 17:41:51 +01:00
Juana De La Cuesta	6ffe441983	[gh-24931] Return dummy function for moving processes when running rootless (#24944) * fix: stop executor launch if nomad doesn't have permissions * func: return move function if cgroup is not enabled	2025-03-06 10:34:21 +01:00
Juana De La Cuesta	5605f9630d	Fix the docker image parser to account for private repos (#24926) * fix: fix the docker image parser to account for private repos * style: change the local regex for docker image identifiers and use docker package instead * func: return early when no repo found on the image name * func: return error if no path found in image * Update drivers/docker/utils.go Co-authored-by: Tim Gross <tgross@hashicorp.com> * Update coordinator.go * Update driver.go * Update network.go --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-03-04 16:53:20 +01:00
Jorge Marey	25426f0777	fingerprint: add config option to disable dmidecode (#25108 )	2025-02-13 11:20:48 -05:00
Daniel Bennett	91194b3cc2	docker: refactor to handle futures more easily (#24992 ) at least one bug has been created because it's easy to miss a future.set() in pullImageImpl() this pulls future.set() out to PullImage(), the same level where it's created and wait()ed	2025-02-07 12:45:17 -06:00
Daniel Bennett	62ef621582	docker: respect image_pull_timeout (#24991 ) I believe the docker driver stopped respecting image_pull_timeout in Nomad 1.9.0 in `981ca36049` this makes the timeout apply again	2025-02-07 11:36:31 -06:00
Daniel Bennett	3493551c38	docker: surface image pull progress error (#24981 ) set() on the future, so the caller can handle it instead of wait()ing forever and causing the allocation to get stuck "pending"	2025-02-07 10:36:09 -06:00
Michael Smithhisler	47c14ddf28	remove remote task execution code (#24909 )	2025-01-29 08:08:34 -05:00
James Rasell	0726e4cc3e	driver/docker: Fix container CPU stats collection (#24768 ) The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. This results in the PreCPUStats values being zero and therefore breaking the CPU calculations which rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data which can be seen in the added unit test. The most recent change also modified the behaviour of the collectStats go routine, so that any error encountered results in the routine exiting. In the event this was a transient error, the container will continue to run, however, no stats will be collected until the task is stopped and replaced. This PR reverts the behaviour, so that an error encountered during a stats collection run results in the error being logged but the collection process continuing with a backoff used.	2025-01-07 07:42:31 +00:00
Vincent Ducamps	6469b59a0a	docker: Fix a bug where images with port number and no tags weren't parsed correctly	2025-01-03 11:38:43 +01:00
Michael Smithhisler	11ae64acb0	drivers: defer executor cleanup func to fix executor leak (#24495 )	2024-12-02 12:25:32 -05:00
Michael Smithhisler	4e2d9675e7	executor: fail early on reattach if listener is not executor (#24538 )	2024-12-02 09:56:00 -05:00
Piotr Kazmierczak	3a18f22c18	goflags: go:build linux for tests that won't compile on other platforms (#24559 ) I'm a heavy LSP user and I frequently goto:next_error. This confuses my editor on macOS.	2024-11-28 15:05:00 +01:00
Juana De La Cuesta	a9e7166b6b	[gh-24339] Move from streaming stats to polling for docker (#24525) * fix: don't stream the docker stats, read them one by one * func: add a NewSafeTicker to the helper functions * style: remove commented code	2024-11-21 17:36:53 +01:00
Seth Hoenig	dd396a3900	windows: revert process listing logic to that of v1.6.10 (#24494 ) * windows: revert process listing logic to that of v1.6.10 In Nomad 1.7 much of the process management code was refactored, including a rewrite of how the process tree of an executor was determined on Windows machines. Unfortunately that rewrite has been cursed with performance issues and bugs. Instead, revert to the logic used in v1.6.10. * changelog	2024-11-20 11:20:20 -06:00
Piotr Kazmierczak	5dfb38d806	drivers: fix capabilities on non-linux systems (#24450) Recently we moved from github.com/syndtr/gocapability to github.com/moby/sys/capability due to the former package no longer being maintained. The new package's capability function works differently: the known/supported functionality is split now, and the .ListSupported() call will always return an empty list on non-linux systems. This means Nomad agents won't start on darwin or windows.	2024-11-13 15:58:25 +01:00
Kir Kolyshkin	d09c8ddf21	deps: switch to moby/sys/capability (#24093 ) github.com/moby/sys/capability is a fork of the (no longer maintained) github.com/syndtr/gocapability package. For changes since the fork took place, see https://github.com/moby/sys/blob/main/capability/CHANGELOG.md Note that the "workaround for RHEL6" is removed for a number of reasons. Feel free to choose the one you like the most, either is sufficient: 1. /proc/sys/kernel/cap_last_cap is available since RHEL 6.7 (kernel 2.6.32-573.el6), released 9 years ago (2015-07-22). 2. It incorrectly returns CAP_BLOCK_SUSPEND (36), which was only added in kernel v3.5 and was never backported to RHEL6 kernels. The correct value for RHEL6 would be CAP_MAC_ADMIN (33). 3. As far as upstream kernels go, /proc/sys/kernel/cap_last_cap was added in kernel v3.2, and a correct value depends on the kernel version. It could be CAP_WAKE_ALARM (35), added to kernel v3.0, or CAP_SYSLOG (34), added to kernel v2.6.38, or possibly a lesser value for even older kernels. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-11-11 14:07:31 -05:00
Seth Hoenig	a0ff07393b	drivers: provide empty implementations of cgroup helpers for non-root nomad (#24392 )	2024-11-07 12:24:37 -06:00
Seth Hoenig	b58abf48c1	drivers: move executor process out of v1 task cgroup after process starts (#24340 ) * drivers: move executor process out of v1 task cgroup after process starts This PR changes the behavior of the raw exec task driver on old cgroups v1 systems such that the executor process is no longer a member of the cgroups created for the task. Now, the executor process is placed into those cgroups and starts the task child process (just as before), but now then exits those cgroups and exists in the nomad parent cgroup. This change makes the behavior sort of similar to cgroups v2 systems, where we never have the executor enter the task cgroup to begin with (because we can directly clone(3) the task process into it). Fixes #23951 * executor: handle non-linux case * cgroups: add test case for no executor process in task cgroup (v1) * add changelog * drivers: also move executor out of cpuset cgroup	2024-11-07 07:31:38 -06:00
Michael Smithhisler	0f97574eae	test: fix rawexec driver unix test imports (#24352 )	2024-11-01 12:10:03 -04:00
Michael Smithhisler	658c429d75	Drivers: add work_dir config to exec/raw_exec/java drivers (#24249 ) --------- Co-authored-by: wurosh <uros.m.perisic@gmail.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-11-01 11:04:40 -04:00
Juanadelacuesta	80e398bbf7	test: add tests for validateBounds	2024-10-31 14:54:27 +01:00
Juanadelacuesta	8752bb0a65	func: move the user lookup into the validation, it's used everywhere the function is called	2024-10-31 10:34:26 +01:00
Juana De La Cuesta	f1439f54f7	Update drivers/shared/validators/validators.go Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2024-10-31 09:32:51 +01:00
Juanadelacuesta	3f884bb3fa	fix: remove the setConfig and modify the test driver to include idValidator to avoid panics	2024-10-30 17:38:54 +01:00
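
The CPU share handling described in commits 34e96932a1 (#25963) and c8dcd3c2db (#26081) can be sketched roughly as follows. This is a minimal illustration, not the code from those commits: the function name `normalizeCPUShares` and its parameters are hypothetical, and only the cgroups v1 `cpu.shares` bounds (2 and 262144) are taken from the Linux kernel. The sketch rescales the request only when the host's total compute exceeds the kernel maximum, then clamps the result to the safe minimum.

```go
package main

import "fmt"

// Bounds the Linux kernel places on cgroups v1 cpu.shares.
const (
	minCPUShares int64 = 2
	maxCPUShares int64 = 262144
)

// normalizeCPUShares is a hypothetical helper (not Nomad's actual function).
// requestedMHz is the task's resources.cpu value; totalComputeMHz is the
// total compute of the host, both in MHz.
func normalizeCPUShares(requestedMHz, totalComputeMHz int64) int64 {
	shares := requestedMHz
	if totalComputeMHz > maxCPUShares {
		// Very large host: rescale so that a request for the whole host
		// maps onto the largest possible share value.
		shares = requestedMHz * maxCPUShares / totalComputeMHz
	}
	if shares < minCPUShares {
		// Clamp to the safe minimum so tiny requests on huge hosts do not
		// produce a value below what runc accepts.
		return minCPUShares
	}
	return shares
}

func main() {
	fmt.Println(normalizeCPUShares(500, 4000))      // 500: small host, unchanged
	fmt.Println(normalizeCPUShares(262144, 524288)) // 131072: half of a large host
	fmt.Println(normalizeCPUShares(1, 524288))      // 2: clamped to the minimum
}
```

The sketch deliberately applies no upper clamp: as the commit message for #25963 notes, a request larger than the available compute is expected to be rejected by the scheduler rather than by the driver.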


897 Commits