nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
tehut	d709accaf5	Add nomad monitor export command (#26178 ) * Add MonitorExport command and handlers * Implement autocomplete * Require nomad in serviceName * Fix race in StreamReader.Read * Add and use framer.Flush() to coordinate function exit * Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer * Parameterize StreamFixed stream size	2025-08-01 10:26:59 -07:00
James Rasell	5989d5862a	ci: Update golangci-lint to v2 and fix highlighted issues. (#26334 )	2025-07-25 10:44:08 +01:00
Juana De La Cuesta	3b44090156	Avoid panic during startup with 1.10.2 (#26219 ) * fix: initialize the topology of the processors to avoid nil pointers * func: initialize topology to avoid nil pointers * fix: update the new public method for NodeProcessorResources	2025-07-08 16:07:14 +02:00
Michael Smithhisler	6036ab8b40	client: close namespace file handle and defensively lazy unmount (#25714 )	2025-04-21 16:25:05 -04:00
James Rasell	85c30dfd1e	test: Remove use of "mitchellh/go-testing-interface" for stdlib. (#25640 ) The stdlib testing package now includes this interface, so we can remove our dependency on the external library.	2025-04-14 07:43:49 +01:00
Martijn Vegter	736103aa54	client: fix JSON formatted logs when failing to reserve cores (#25523 ) Fixed a bug where JSON formatted logs would not show the requested and overlapping cores when failing to reserve cores	2025-03-27 08:52:32 -04:00
Jorge Marey	25426f0777	fingerprint: add config option to disable dmidecode (#25108 )	2025-02-13 11:20:48 -05:00
Martijn Vegter	997da25cdb	scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle (#24304 ) Fixes a bug in the AllocatedResources.Comparable method, where the scheduler would only take into account the cpusets of the tasks in the largest lifecycle. This could result in overlapping cgroup cpusets. Now we make the distinction between reserved and fungible resources throughout the lifespan of the alloc. In addition, added logging in case of future regressions thus not requiring manual inspection of cgroup files.	2024-11-21 13:21:48 -05:00
Martijn Vegter	bfb714144e	client: fixed a bug where AMD CPUs were not correctly fingerprinting base speed (#24415 ) Relates to: #19468	2024-11-21 09:08:47 -06:00
Gabi	89c3d69d79	nsutil: wrap error that comes from the syscall so caller can do errors.As (#24480 ) User of `nsutil` library should be able to do the following and for it to work: ``` var errno syscall.Errno if errors.As(err, &errno) { if errno == unix.EBUSY { ... } } ``` This commit fixes that issue.	2024-11-19 10:24:49 +01:00
Juanadelacuesta	d0b015ec01	func: move the user and group type declarations	2024-10-31 10:34:26 +01:00
Juanadelacuesta	0cd1b5ff13	func: move the validation to a dependency and use id sets	2024-10-28 18:59:51 +01:00
Seth Hoenig	51215bf102	deps: update to go-set/v3 and refactor to use custom iterators (#23971 ) * deps: update to go-set/v3 * deps: use custom set iterators for looping	2024-09-16 13:40:10 -05:00
Juana De La Cuesta	9c5f962940	Update client/lib/cgroupslib/partition_linux.go Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-09-06 10:56:47 +02:00
Juana De La Cuesta	426c225dc2	Update client/lib/cgroupslib/partition_linux.go Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-09-06 10:56:41 +02:00
Juana De La Cuesta	8e6d85b66f	Update client/lib/cgroupslib/partition_linux.go Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-09-06 10:56:36 +02:00
Juanadelacuesta	a65d05ff51	fix: keep a register of the usable cores to avoid using more than that	2024-09-05 17:02:54 +02:00
Seth Hoenig	8b093a6a5d	scheduler: support for device-aware numa scheduling (#1760 ) (#23837 ) (CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files)) * scheduler: enhance numa aware scheduling with support for devices * cr: add comments	2024-08-20 07:53:04 -05:00
Tim Gross	682c8c0c81	cgroupslib: allow initial controller check with delegated cgroups v2 (#23803 ) During Nomad client initialization with cgroups v2, we assert that the required cgroup controllers are available in the root `cgroup.subtree_control` file by idempotently writing to the file. But if Nomad is running with delegated cgroups, this will fail file permissions checks even if the subtree control file already has the controllers we need. Update the initialization to first check if the controllers are missing before attempting to write to them. This allows cgroup delegation so long as the cluster administrator has pre-created a Nomad owned cgroups tree and set the `Delegate` option in a systemd override. If not, initialization fails in the existing way. Although this is one small step along the way to supporting a rootless Nomad client, running Nomad as non-root is still unsupported. I've intentionally not documented setting up cgroup delegation in this PR, as this PR is insufficient by itself to have a secure and properly-working rootless Nomad client. Ref: https://github.com/hashicorp/nomad/issues/18211 Ref: https://github.com/hashicorp/nomad/issues/13669	2024-08-14 16:58:21 -04:00
Piotr Kazmierczak	7772711c89	plugins: fix nomadTopologyToProto panic on systems that don't support NUMA (#23399 ) After changes introduced in #23284 we no longer need to make a if !st.SupportsNUMA() check in the GetNodes() topology method. In fact this check will now cause panic in nomadTopologyToProto method on systems that don't support NUMA.	2024-07-09 08:41:52 +02:00
Tim Gross	7d73065066	numa: fix scheduler panic due to topology serialization bug (#23284 ) The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public members. As a result, this field is never serialized via msgpack and persisted in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil field and panics the scheduler worker. Ideally we would fix this by adding a msgpack serialization extension, but because the field already exists and is just always empty, this breaks RPC wire compatibility across upgrades. Instead, create a new field that's populated at the same time we populate the more useful `idset.Set`, and repopulate the set on demand. Fixes: https://hashicorp.atlassian.net/browse/NET-9924	2024-06-11 08:55:00 -04:00
Deniz Onur Duzgun	1cc99cc1b4	bug: resolve type conversion alerts (#20553 )	2024-05-15 13:22:10 -04:00
Juana De La Cuesta	169818b1bd	[gh-6980] Client: clean up old allocs before running new ones using the `exec` task driver. (#20500 ) Whenever the "exec" task driver is used, Nomad runs a plugin that in turn runs the task in a container under the hood. If the executor is killed for any reason, the task is reparented to the init service and won't be stopped by Nomad when the job is updated or stopped. This commit introduces two mechanisms to avoid this behaviour: * Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task. * Adds a pre-start cleanup of the processes in the container, ensuring only the ones the executor runs are present at any given time.	2024-05-14 09:51:27 +02:00
Tim Gross	623486b302	deps: vendor containernetworking/plugins functions for net NS utils (#20556 ) We bring in `containernetworking/plugins` for the contents of a single file, which we use in a few places for running a goroutine in a specific network namespace. This code hasn't needed an update in a couple of years, and a good chunk of what we need was previously vendored into `client/lib/nsutil` already. Updating the library via dependabot is causing errors in Docker driver tests because it updates a lot of transient dependencies, and it's bringing in a pile of new transient dependencies like opentelemetry. Avoid this problem going forward by vendoring the remaining code we hadn't already. Ref: https://github.com/hashicorp/nomad/pull/20146	2024-05-13 09:10:16 -04:00
Seth Hoenig	14a022cbc0	drivers/raw_exec: enable setting cgroup override values (#20481 ) * drivers/raw_exec: enable setting cgroup override values This PR enables configuration of cgroup override values on the `raw_exec` task driver. WARNING: setting cgroup override values eliminates any guarantee Nomad can make about resource availability for any task on the client node. For cgroup v2 systems, set a single unified cgroup path using `cgroup_v2_override`. The path may be either absolute or relative to the cgroup root. config { cgroup_v2_override = "custom.slice/app.scope" } or config { cgroup_v2_override = "/sys/fs/cgroup/custom.slice/app.scope" } For cgroup v1 systems, set a per-controller path for each controller using `cgroup_v1_override`. The path(s) may be either absolute or relative to the controller root. config { cgroup_v1_override = { "pids": "custom/app", "cpuset": "custom/app", } } or config { cgroup_v1_override = { "pids": "/sys/fs/cgroup/pids/custom/app", "cpuset": "/sys/fs/cgroup/cpuset/custom/app", } } * drivers/rawexec: ensure only one of v1/v2 cgroup override is set * drivers/raw_exec: executor should error if setting cgroup does not work * drivers/raw_exec: create cgroups in raw_exec tests * drivers/raw_exec: ensure we fail to start if custom cgroup set and non-root * move custom cgroup func into shared file --------- Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2024-05-07 16:46:27 -07:00
Luiz Aoqui	db5ffde2b7	client: prevent start on cgroups init error (#19915 ) The Nomad client expects certain cgroups paths to exist in order to manage tasks. These paths are created when the agent first starts, but if this process fails the agent would just log the error and proceed with its initialization, despite not being able to run tasks. This commit surfaces the errors back to the client initialization so the process can stop early and make clear to operators that something went wrong.	2024-02-09 13:45:29 -05:00
David Ventura	fb43b14fb0	Mark CGroups as off when missing essential controllers (#19176 )	2023-12-15 11:20:52 -05:00
Seth Hoenig	6e4d57b330	numalib: provide a fallback for topology scanning on linux (#19457 ) * numalib: provide a fallback for topology scanning on linux * numalib: better package var names * cl: add cl * lint: fix my sloppy code * cl: fixup wording	2023-12-13 13:06:30 -06:00
Piotr Kazmierczak	b6dd376100	numa: account for incorrect core number on topology.insert (#19383 ) Unsupported environments like containers or guest OSes inside LXD can incorrectly report the number of available cores, thus leading to numalib having trouble detecting cores and panicking. This code adds tests for linux sysfs detection methods and fixes the panic.	2023-12-13 17:40:26 +01:00
Seth Hoenig	1604dba508	client: fingerprint cpu on raspberry pi (#18982 ) This PR tweaks the linux cpu fingerprinter to handle the case where no NUMA node data is found under /sys/devices/system/, in which case we need to assume just one node, one socket.	2023-11-02 15:52:37 -05:00
Seth Hoenig	5b56a5c5d1	client: fix cpu core/freq calculation on intel macs (#18934 )	2023-11-01 07:16:26 -05:00
Seth Hoenig	8ed82416e3	client: fix detection of cpuset.mems on cgroups v1 systems (#18868 )	2023-10-26 09:42:10 -05:00
Seth Hoenig	0020139440	core: port common code changes from ENT for numa scheduling (#18818 ) Some additional changes were made in the ENT PR to the common code in support of numa scheduling; this PR copies those changes back to CE.	2023-10-20 13:19:02 -05:00
Seth Hoenig	83720740f5	core: plumbing to support numa aware scheduling (#18681 ) * core: plumbing to support numa aware scheduling * core: apply node resources compatibility upon fsm restore Handle the case where an upgraded server dequeues an evaluation before a client triggers a new fingerprint - which would be needed to cause the compatibility fix to run. By running the compat fix on restore the server will immediately have the compatible pseudo topology to use. * lint: learn how to spell pseudo	2023-10-19 15:09:30 -05:00
Seth Hoenig	e3c8700ded	deps: upgrade to go-set/v2 (#18638 ) No functional changes, just cleaning up deprecated usages that are removed in v2 and replace one call of .Slice with .ForEach to avoid making the intermediate copy.	2023-10-05 11:56:17 -05:00
Seth Hoenig	591394fb62	drivers: plumb hardware topology via grpc into drivers (#18504 ) * drivers: plumb hardware topology via grpc into drivers This PR swaps out the temporary use of detecting system hardware manually in each driver for using the Client's detected topology by plumbing the data over gRPC. This ensures that Client configuration is taken into account consistently in all references to system topology. * cr: use enum instead of bool for core grade * cr: fix test slit tables to be possible	2023-09-18 08:58:07 -05:00
Seth Hoenig	2e1974a574	client: refactor cpuset partitioning (#18371 ) * client: refactor cpuset partitioning This PR updates the way Nomad client manages the split between tasks that make use of resources.cpus vs. resources.cores. Previously, each task was explicitly assigned which CPU cores they were able to run on. Every time a task was started or destroyed, all other tasks' cpusets would need to be updated. This was inefficient and would crush the Linux kernel when a client would try to run ~400 or so tasks. Now, we make use of cgroup hierarchy and cpuset inheritance to efficiently manage cpusets. * cr: tweaks for feedback	2023-09-12 09:11:11 -05:00
Seth Hoenig	f5b0da1d55	all: swap exp packages for maps, slices (#18311 )	2023-08-23 15:42:13 -05:00
Piotr Kazmierczak	53ef6391a5	drivers/docker: fix a hostConfigMemorySwappiness panic (#18238 ) cgroupslib.MaybeDisableMemorySwappiness returned an incorrect type, and was incorrectly typecast to int64 causing a panic on non-linux and non-windows hosts.	2023-08-17 14:45:31 +02:00
Seth Hoenig	8833452d44	followup to numa/cgroups refactor (#18214 ) * lang: note that Stack is not concurrency-safe * client: use more descriptive name for wrangler hook in logs * numalib: use correct name for receiver parameter	2023-08-15 14:12:17 -05:00
Seth Hoenig	6747ef8803	drivers/raw_exec: restore ability to run tasks without nomad running as root (#18206 ) Although nomad officially does not support running the client as a non-root user, doing so has been more or less possible with the raw_exec driver as long as you don't expect features to work like networking or running tasks as specific users. In the cgroups refactoring I bulldozed right over the special casing we had in place for raw_exec to continue working if the cgroups were unable to be created. This PR restores that behavior - you can now (as before) run the nomad client as a non-root user and make use of the raw_exec task driver.	2023-08-15 11:22:30 -05:00
hashicorp-copywrite[bot]	2d35e32ec9	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:15 -05:00
Seth Hoenig	a4cc76bd3e	numa: enable numa topology detection (#18146 ) * client: refactor cgroups management in client * client: fingerprint numa topology * client: plumb numa and cgroups changes to drivers * client: cleanup task resource accounting * client: numa client and config plumbing * lib: add a stack implementation * tools: remove ec2info tool * plugins: fixup testing for cgroups / numa changes * build: update makefile and package tests and cl	2023-08-10 17:05:30 -05:00
Patric Stout	e190eae395	Use config "cpu_total_compute" (if set) for all CPU statistics (#17628 ) Before this commit, it was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting the cpu_total_compute didn't resolve the issue. This issue was most noticeable on ARM64, as there auto-detection always failed.	2023-07-19 13:30:47 -05:00
Seth Hoenig	33ac5ed1df	client: do not disable memory swappiness if kernel does not support it (#17625 ) * client: do not disable memory swappiness if kernel does not support it This PR adds a workaround for very old Linux kernels which do not support the memory swappiness interface file. Normally we write a "0" to the file to explicitly disable swap. In the case the kernel does not support it, give libcontainer a nil value so it does not write anything. Fixes #17448 * client: detect swappiness by writing to the file * fixup changelog Co-authored-by: James Rasell <jrasell@users.noreply.github.com> --------- Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2023-06-22 09:36:31 -05:00
Patric Stout	a1a5241606	Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 (#17535 ) * Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 This meant that if any allocation was created or removed, all active DevicesSets were removed from all cgroups of all tasks. This was most noticeable with "exec" and "raw_exec", as it meant they no longer had access to /dev files. * e2e: add test for verifying cgroups do not interfere with access to devices --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-06-15 09:39:36 -05:00
Seth Hoenig	225693ad28	client: fix client panic during drain caused by shutdown (#17450 ) During shutdown of a client with drain_on_shutdown there is a race between the Client ending the cgroup and the task's cpuset manager cleaning up the cgroup. During the path traversal, skip anything we cannot read, which avoids the nil DirEntry we try to dereference now.	2023-06-07 15:12:44 -05:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Seth Hoenig	a42a33fa6b	cgv1: do not disable cpuset manager if reserved interface already exists (#16467 ) * cgv1: do not disable cpuset manager if reserved interface already exists This PR fixes a bug where restarting a Nomad Client on a machine using cgroups v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself. This is being caused by incorrectly interpreting a "file exists" error as problematic when ensuring the reserved cpuset exists. If we get a "file exists" error, that just means the Client was likely restarted. Note that a machine reboot would fix the issue - the groups interfaces are ephemeral. * cl: add cl	2023-03-13 17:00:17 -05:00
Lance Haig	48e7d70fcd	deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318 ) * Update ioutil deprecated library references to os and io respectively * Deal with the errors produced. Add error handling to filEntry info Add error handling to info	2023-03-08 13:25:10 -06:00
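
Commit a42a33fa6b above describes treating a "file exists" error as harmless when ensuring the reserved cpuset exists. The following is a minimal sketch of that idea, not the code from the commit: `ensureDir` and `demo` are hypothetical names, and a temporary directory stands in for the cgroup filesystem.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ensureDir (hypothetical helper) creates dir and treats "file exists"
// as success, so calling it again after a restart is not an error.
func ensureDir(dir string) error {
	err := os.Mkdir(dir, 0o755)
	if err != nil && !errors.Is(err, fs.ErrExist) {
		return fmt.Errorf("failed to create %s: %w", dir, err)
	}
	return nil
}

// demo calls ensureDir twice on the same path inside a scratch directory,
// simulating a first start followed by a restart.
func demo() (first, second error) {
	parent, err := os.MkdirTemp("", "cgroup-example-")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(parent)
	dir := filepath.Join(parent, "reserved")
	return ensureDir(dir), ensureDir(dir)
}

func main() {
	first, second := demo()
	fmt.Println(first, second) // <nil> <nil>
}
```

One caveat of this sketch: `os.Mkdir` also reports "file exists" when the path is a regular file, so a stricter version might stat the path afterwards.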

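Commit 89c3d69d79 above wraps the syscall error so that callers can use `errors.As`. A minimal sketch of why wrapping matters, assuming a hypothetical `unmountNS` function in place of the real `nsutil` code and using the standard library's `syscall.EBUSY` where the commit message uses `unix.EBUSY`:

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// unmountNS is a hypothetical stand-in for an nsutil operation whose
// underlying syscall fails with EBUSY.
func unmountNS(path string) error {
	errno := syscall.EBUSY // simulated syscall failure
	// %w keeps errno in the error chain; %v would flatten it to a string.
	return fmt.Errorf("failed to unmount %s: %w", path, errno)
}

// isBusy reports whether err carries EBUSY anywhere in its chain.
func isBusy(err error) bool {
	var errno syscall.Errno
	return errors.As(err, &errno) && errno == syscall.EBUSY
}

func main() {
	err := unmountNS("/var/run/netns/example")
	fmt.Println(isBusy(err)) // true
}
```

Had the error been formatted with `%v` instead of `%w`, `errors.As` would find no `syscall.Errno` in the chain and `isBusy` would return false.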

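Commit 682c8c0c81 above changes initialization to check whether controllers are missing from `cgroup.subtree_control` before writing to it. A rough sketch of such a check follows; `missingControllers` is a hypothetical helper, the required controller list is an assumption, and the file contents are passed in as a string rather than read from `/sys/fs/cgroup`.

```go
package main

import (
	"fmt"
	"strings"
)

// missingControllers (hypothetical helper) returns the required controllers
// that do not appear in the contents of a cgroup.subtree_control file, which
// lists enabled controllers separated by spaces.
func missingControllers(subtreeControl string, required []string) []string {
	enabled := make(map[string]bool)
	for _, c := range strings.Fields(subtreeControl) {
		enabled[c] = true
	}
	var missing []string
	for _, c := range required {
		if !enabled[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	// Assumed set of required controllers, for illustration only.
	required := []string{"cpuset", "cpu", "io", "memory", "pids"}

	fmt.Println(missingControllers("cpuset cpu io memory pids\n", required)) // []
	fmt.Println(missingControllers("cpu memory\n", required))                // [cpuset io pids]
}
```

With a check like this, a write to the file is attempted only when the result is non-empty, so a delegated cgroup tree that already has every controller enabled never triggers the permission failure.
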
96 Commits