nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
Tim Gross	7add04eb0f	refactor: volume request modes to be generic between DHV/CSI (#24896 ) When we implemented CSI, the types of the fields for access mode and attachment mode on volume requests were defined with a prefix "CSI". This gets confusing now that we have dynamic host volumes using the same fields. Fortunately the original was a typedef on string, and the Go API in the `api` package just uses strings directly, so we can change the name of the type without breaking backwards compatibility for the msgpack wire format. Update the names to `VolumeAccessMode` and `VolumeAttachmentMode`. Keep the CSI and DHV specific value constant names for these fields (they aren't currently 1:1), so that we can easily differentiate in a given bit of code which values are valid. Ref: https://github.com/hashicorp/nomad/pull/24881#discussion_r1920702890	2025-01-24 10:37:48 -05:00
Piotr Kazmierczak	3d7e4fd634	client: always initialize node.HostVolumes map (#24910 ) The default node configuration in the client should always set an empty HostVolumes map. Otherwise callers can panic, e.g.: goroutine 179 [running]: github.com/hashicorp/nomad/client/hostvolumemanager.UpdateVolumeMap({0x36042b0, 0xc000c62a80}, 0x0, {0xc000a802a0, 0xd}, 0xc000691940) github.com/hashicorp/nomad/client/hostvolumemanager/volume_fingerprint.go:43 +0x1b2 github.com/hashicorp/nomad/client.(*Client).batchFirstFingerprints.func1({0xc000a802a0, 0xd}, 0xc000691940) github.com/hashicorp/nomad/client/node_updater.go:54 +0xd7 github.com/hashicorp/nomad/client.(*batchNodeUpdates).batchHostVolumeUpdates(0xc000912608?, 0xc0009f2f88) github.com/hashicorp/nomad/client/node_updater.go:417 +0x152 github.com/hashicorp/nomad/client.(*Client).batchFirstFingerprints(0xc000c2d188) github.com/hashicorp/nomad/client/node_updater.go:53 +0x1c5 created by github.com/hashicorp/nomad/client.NewClient in goroutine 1 github.com/hashicorp/nomad/client/client.go:557 +0x2069 is a panic of the HVM when restarting a client that doesn't have any static host volumes, but does have a dynamic host volume.	2025-01-21 20:45:04 +01:00
James Rasell	689f935e0a	services: Support TLS Skip Verify within Nomad service checks. (#24781 ) Checks within a service using the Nomad provider can now utilise the `tls_skip_verify` parameter.	2025-01-15 07:39:39 +00:00
Daniel Bennett	985eb53c65	dynamic host volumes: plugin spec tweaks (#24848 ) * prefix plugin env vars with DHV_ * add env: DHV_VOLUME_ID, DHV_PLUGIN_DIR * 5s timeout on fingerprint calls	2025-01-13 14:18:10 -06:00
Tim Gross	cca9a5320d	testing: fix test flake in dynamic host volume client tests (#24836 ) The output of `GetDynamicHostVolumes` is a slice but that slice is constructed from iterating over a map and isn't sorted. Sort the output in the test to eliminate a test flake.	2025-01-10 14:48:05 -05:00
Michael Smithhisler	606ce9dd90	deps: upgrade aws-sdk-go from v1 to v2 (#24720 )	2025-01-09 17:27:19 -05:00
Tim Gross	4a65b21aab	dynamic host volumes: send register to client for fingerprint (#24802 ) When we register a volume without a plugin, we need to send a client RPC so that the node fingerprint can be updated. The registered volume also needs to be written to client state so that we can restore the fingerprint after a restart. Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2025-01-08 16:58:58 -05:00
Piotr Kazmierczak	7726ae68c6	client: move 'waiting for previous alloc to terminate' log messages to info (#24804 )	2025-01-08 15:44:35 +01:00
Michael Smithhisler	34a34e7233	plugins: validate logmon process during reattach (#24798 )	2025-01-08 08:50:33 -05:00
Tim Gross	08a6f870ad	cni: use check command when restoring from restart (#24658 ) When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: https://github.com/hashicorp/nomad/issues/24292 Ref: https://github.com/hashicorp/nomad/pull/24650	2025-01-07 09:38:39 -05:00
Daniel Bennett	a9ee66a6ef	dynamic host volumes: unique volume name per node (#24748 ) A node can have only one volume with a given name. The scheduler prevents duplicates, but can only do so after the server knows about the volume. This prevents multiple concurrent creates being called faster than the fingerprint/heartbeat interval. Users may still modify an existing volume only if they set the `id` in the volume spec and re-issue `nomad volume create`. If a static vol is added to config with a name already being used by a dynamic volume, the dynamic takes precedence, but log a warning.	2025-01-06 15:37:20 -06:00
Daniel Bennett	459453917e	dynamic host volumes: client-side tests, comments, tidying (#24747 )	2025-01-06 13:20:07 -06:00
Charles Z.	f7b12dc54e	add noswap to secretdir tmpfs (#24645 )	2025-01-06 09:44:43 -05:00
Daniel Bennett	af967184a6	dynamic host volumes: tweak plugin fingerprint (#24711 ) Instead of a plugin `version` subcommand that responds with a string (established in #24497), respond to a `fingerprint` command with a data structure that we may extend in the future (such as plugin capabilities, like size constraint support?). In the immediate term, it's still just the version: `{"version": "0.0.1"}` In addition to leaving the door open for future expansion, I think it will also avoid false positives detecting executables that just happen to respond to a `version` command. This also reverses the ordering of the fingerprint string parts from `plugins.host_volume.version.mkdir` (which aligned with CNI) to `plugins.host_volume.mkdir.version` (makes more sense to me)	2024-12-19 09:25:55 -05:00
Daniel Bennett	e76f5e0b4c	dynamic host volumes: volume fingerprinting (#24613 ) and expand the demo a bit	2024-12-19 09:25:54 -05:00
Daniel Bennett	05f1cda594	dynamic host volumes: client state (#24595 ) store dynamic host volume creations in client state, so they can be "restored" on agent restart. restore works by repeating the same Create operation as initial creation, and expecting the plugin to be idempotent. this is (potentially) especially important after host restarts, which may have dropped mount points or such.	2024-12-19 09:25:54 -05:00
Daniel Bennett	46a39560bb	dynamic host volumes: fingerprint client plugins (#24589 )	2024-12-19 09:25:54 -05:00
Daniel Bennett	2b04d47ac2	dynamic host volumes: test client RPC and plugins (#24535 ) also ensure that volume ID is uuid-shaped so user-provided input like `id = "../../../"` which is used as part of the target directory can not find its way very far into the volume submission process	2024-12-19 09:25:54 -05:00
Daniel Bennett	c2dd97dee7	HostVolumePlugin interface and two implementations (#24497 ) * mkdir: HostVolumePluginMkdir: just creates a directory * example-host-volume: HostVolumePluginExternal: plugin script that does mkfs and mount loopback Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-12-19 09:25:54 -05:00
Tim Gross	6a3803c31e	dynamic host volumes: RPC handlers (#24373 ) This changeset implements the RPC handlers for Dynamic Host Volumes, including the plumbing needed to forward requests to clients. The client-side implementation is stubbed and will be done under a separate PR. Ref: https://hashicorp.atlassian.net/browse/NET-11549	2024-12-19 09:25:54 -05:00
Tim Gross	30e57c39b0	discovery: correctly handle IPv6 addresses from go-discover (#24649 ) Nomad sets a default port when resolving server addresses that don't have one. When we get a "bare" IPv6 address without a port, we end up with an unexpected error "too many colons in address" when we try to split the address and host, because the standard library function expects IPv6 addresses to be wrapped in brackets as recommended by RFC5952. User-configured addresses avoid this problem by accepting IP address and port as separate configuration values, but go-discover emits "bare" IPv6 addresses without a port in IPv6 environments. Fix this by adding brackets to IPv6 addresses when we get the "too many colons" error from the stdlib. This will still give erroneous results if the address includes the port but is missing brackets, but there's no way to unambiguously parse that address. Ref: https://www.rfc-editor.org/rfc/rfc5952 Fixes: https://github.com/hashicorp/nomad/issues/24608	2024-12-17 15:49:40 -05:00
Tim Gross	24fa7439df	cni: use tmpfs location for ipam plugin (#24650 ) When a Nomad host reboots, the network namespace files in the tmpfs in `/var/run` are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this creates potential conflicts with the IPAM plugin which has written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot because all virtual interfaces need to be recreated too. Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state directory. We already expect this location to be created by CNI because the netns files are hard-coded to be created there too in `libcni`. Note this does not fix the problem described for Docker in #24292 because that appears to be related to the netns itself being restored unexpectedly from Docker's state. Ref: https://github.com/hashicorp/nomad/issues/24292#issuecomment-2537078584 Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files	2024-12-16 09:36:35 -05:00
James Rasell	7d48aa2667	client: emit optional telemetry from prerun and prestart hooks. (#24556 ) The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete. The new datapoints are: - nomad.client.alloc_hook.prerun.success (counter) - nomad.client.alloc_hook.prerun.failed (counter) - nomad.client.alloc_hook.prerun.elapsed (sample) - nomad.client.task_hook.prestart.success (counter) - nomad.client.task_hook.prestart.failed (counter) - nomad.client.task_hook.prestart.elapsed (sample) The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance. Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.	2024-12-12 14:43:14 +00:00
Piotr Kazmierczak	3a18f22c18	goflags: go:build linux for tests that won't compile on other platforms (#24559 ) I'm a heavy LSP user and I frequently goto:next_error. This confuses my editor on macOS.	2024-11-28 15:05:00 +01:00
Piotr Kazmierczak	f7a4ded2c0	security: add CT executeTemplate to default function_denylist (#24541 ) This PR adds Consul Template's executeTemplate function to the denylist by default, in order to prevent accidental or malicious infinitely recursive execution. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-11-22 19:33:56 +01:00
Martijn Vegter	997da25cdb	scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle (#24304 ) Fixes a bug in the AllocatedResources.Comparable method, where the scheduler would only take into account the cpusets of the tasks in the largest lifecycle. This could result in overlapping cgroup cpusets. Now we make the distinction between reserved and fungible resources throughout the lifespan of the alloc. In addition, added logging in case of future regressions thus not requiring manual inspection of cgroup files.	2024-11-21 13:21:48 -05:00
Martijn Vegter	bfb714144e	client: fixed a bug where AMD CPUs were not correctly fingerprinting base speed (#24415 ) Relates to: #19468	2024-11-21 09:08:47 -06:00
James Rasell	beb4097e81	client: mark the remote_task hook as deprecated. (#24505 )	2024-11-20 15:32:50 +00:00
Florian Apolloner	0a343798b6	Add NOMAD_* variables to CNI args. Fixes #23830 (#24319 ) Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2024-11-19 12:48:48 -08:00
Tim Gross	a420732424	consul: allow non-root Nomad to rewrite token (#24410 ) When a task restarts, the Nomad client may need to rewrite the Consul token, but it's created with permissions that prevent a non-root agent from writing to it. While Nomad clients should be run as root (currently), it's harmless to allow whatever user the Nomad agent is running as to be able to write to it, and that's one less barrier to rootless Nomad. Ref: https://github.com/hashicorp/nomad/issues/23859#issuecomment-2465757392	2024-11-19 10:21:14 -05:00
Gabi	89c3d69d79	nsutil: wrap error that comes from the syscall so caller can do errors.As (#24480 ) User of `nsutil` library should be able to do the following and for it to work: ``` var errno syscall.Errno if errors.As(err, &errno) { if errno == unix.EBUSY { ... } } ``` This commit fixes that issue.	2024-11-19 10:24:49 +01:00
Tim Gross	6be9a50626	vault: catch expired lease as fatal error (#24409 ) When a Vault lease expires, it's revoked on the server and cannot be removed, so this error should be treated as fatal. The errors we get aren't wrapped by the Vault SDK, so unfortunately we have to read the error messages and can't easily enumerate non-fatal error messages (which might be bubbling up from the stdlib). I've audited the errors currently used and have documented their source. Ref `52ba156d47/vault/expiration.go (L1327)` Fixes: https://github.com/hashicorp/nomad/issues/23859	2024-11-18 09:12:35 -05:00
Michael Smithhisler	0714353324	fix: handle template re-renders on client restart (#24399 ) When multiple templates with api functions are included in a task, it's possible for consul-template to re-render templates as it creates watchers, overwriting render event data. This change uses event fields that do not get overwritten, and only executes the change mode for templates that were actually written to disk. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-11-08 12:49:38 -05:00
Seth Hoenig	4ef4bebd1f	connect: handle grpc_address as gosockaddr/template string (#24280 ) * connect: handle grpc_address as gosockaddr/template string This PR fixes a bug where the consul.grpc_address could not be set using a go-sockaddr/template string. This was inconsistent with how we do accept such strings for consul.address values. * add changelog	2024-11-07 09:04:58 -06:00
James Rasell	c44f933aeb	test: ensure RPC only test client sets enterprise specific config. (#24376 )	2024-11-06 13:43:25 +00:00
Tim Gross	a8b84a6eed	testing: RPC-only test client helper (#24371 ) In #10193 we introduced a testing helper that spins up a client RPC server without the rest of the client operations so that we can make server-side client RPC tests lighter. But this wasn't actually ever wired up to the intended target. While working on Dynamic Host Volumes I noticed that this would be useful for RPC tests. This changeset fixes some bugs in the helper that arose from client code drift, and makes it used by the client RPC tests for CSI. This will also get used for the DHV RPC tests. Ref: https://github.com/hashicorp/nomad/pull/10193	2024-11-05 14:59:53 -05:00
Juanadelacuesta	d0b015ec01	func: move the user and group type declarations	2024-10-31 10:34:26 +01:00
Juanadelacuesta	0cd1b5ff13	func: move the validation to a dependency and use id sets	2024-10-28 18:59:51 +01:00
Rodrigo Lourenço	cdebf96b0e	fingerprint gce: collect preemptibility	2024-10-23 15:19:20 +02:00
Seth Hoenig	f1ce127524	jobspec: add a chown option to artifact block (#24157 ) * jobspec: add a chown option to artifact block This PR adds a boolean 'chown' field to the artifact block. It indicates whether the Nomad client should chown the downloaded files and directories to be owned by the task.user. This is useful for drivers like raw_exec and exec2 which are subject to the host filesystem user permissions structure. Before, these drivers might not be able to use or manage the downloaded artifacts since they would be owned by the root user on a typical Nomad client configuration. * api: no need for pointer of chown field	2024-10-11 11:30:27 -05:00
Tim Gross	b7595c646d	alloc fs: use case-insensitive check for reads of secret/private dir (#24125 ) When using the Client FS APIs, we check to ensure that reads don't traverse into the allocation's secret dir and private dir. But this check can be bypassed on case-insensitive file systems (ex. Windows, macOS, and Linux with obscure ext4 options enabled). This allows a user with `read-fs` permissions but not `alloc-exec` permissions to read from the secrets dir. This changeset updates the check so that it's case-insensitive. This risks false positives for escape (see linked Go issue), but only if a task without filesystem isolation deliberately writes into the task working directory to do so, which is a fail-safe failure mode. Ref: https://github.com/golang/go/issues/18358 Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>	2024-10-03 14:20:24 -04:00
Martijn Vegter	3ecf0d21e2	metrics: introduce client config to include alloc metadata as part of the base labels (#23964 )	2024-10-02 10:55:44 -04:00
Juliano Martinez	4a74fda8ce	Allow client template config block to be parsed when using json config (#24007 ) - Adds tests - Adds sample test data for parsing hcl and json - Adds changelog	2024-10-01 15:44:36 -04:00
Piotr Kazmierczak	981ca36049	docker: use official client instead of fsouza/go-dockerclient (#23966 ) This PR replaces fsouza/go-dockerclient 3rd party docker client library with docker's official SDK. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2024-09-26 18:41:44 +02:00
Tim Gross	cc9227b858	template: fix panic in change_mode=script on client restart (#24057 ) When we introduced change_mode=script to templates, we passed the driver handle down into the template manager so we could call its `Exec` method directly. But the lifecycle of the driver handle is managed by the taskrunner and isn't available when the template manager is first created. This has led to a series of patches trying to fixup the behavior (#15915, #15192, #23663, #23917). Part of the challenge in getting this right is using an interface to avoid the circular import of the driver handle. But the taskrunner already has a way to deal with this problem using a "lazy handle". The other template change modes already use this indirectly through the `Lifecycle` interface. Change the driver handle `Exec` call in the template manager to a new `Lifecycle.Exec` call that reuses the existing behavior. This eliminates the need for the template manager to know anything at all about the handle state. Fixes: https://github.com/hashicorp/nomad/issues/24051	2024-09-25 08:59:01 -04:00
Michael Smithhisler	338487c159	fix: add node pool attribute to interpretable values in task env (#24052 )	2024-09-24 13:23:16 -04:00
Michael Smithhisler	6b6aa7cc26	identity: adds ability to specify custom filepath for saving workload identities (#24038 )	2024-09-23 10:27:00 -04:00
Tim Gross	b7f1800657	fingerprint: update landlock test to accept v4+ APIs (#23979 ) The landlock fingerprint test assumes there's no version of the landlock API >3. Update the test assertion to allow for the current v4 and any future versions.	2024-09-17 15:07:44 -04:00
Seth Hoenig	51215bf102	deps: update to go-set/v3 and refactor to use custom iterators (#23971 ) * deps: update to go-set/v3 * deps: use custom set iterators for looping	2024-09-16 13:40:10 -05:00
Daniel Bennett	5e1fae2856	networking: set alloc NetworkStatus.AddressIPv6 (#23959 ) when a CNI result includes an IPv6 address, set it on the alloc's NetworkStatus for reference. e.g.: $ nomad alloc status -json 3dca | jq '.NetworkStatus' { "Address": "172.26.64.14", "AddressIPv6": "fd00:a110:c8::b", "DNS": null, "InterfaceName": "eth0" }	2024-09-16 10:21:52 -05:00


5126 Commits