nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-05 09:55:44 +03:00

Author	SHA1	Message	Date
hc-github-team-nomad-core	e1333eb9f6	Prepare for next release	2024-05-07 07:06:12 +00:00
hc-github-team-nomad-core	e1a176c120	Generate files for 1.8.0-beta.1 release	2024-05-07 07:06:07 +00:00
Tim Gross	f41bc468eb	consul: provide `CONSUL_HTTP_TOKEN` env var to tasks (#20519 ) When available, we provide an environment variable `CONSUL_TOKEN` to tasks, but this isn't the environment variable expected by the Consul CLI. Job specifications like deploying an API Gateway become noticeably nicer if we can instead provide the expected env var.	2024-05-03 11:30:33 -04:00
Seth Hoenig	5f64e42d73	client: fixup how alloc mounts directory are setup (#20463 )	2024-04-26 07:29:52 -05:00
Tim Gross	6d58acd897	WI: ensure tasks within same alloc get different Consul tokens (#20411 ) The `consul_hook` in the allocrunner gets a separate Consul token for each task, even if the tasks' identities have the same name, but used the identity name as the key to the alloc hook resources map. This means the last task in the group overwrites the Consul tokens of all other tasks. Fix this by adding the task name to the key in the allocrunner's `consul_hook`. And update the taskrunner's `consul_hook` to expect the task name in the key. Fixes: https://github.com/hashicorp/nomad/issues/20374 Fixes: https://hashicorp.atlassian.net/browse/NOMAD-614	2024-04-17 11:29:58 -04:00
Luiz Aoqui	9d4f7bcb68	mock_driver: fix fingreprint key (#20351 ) The `mock_driver` is an internal task driver used mostly for testing and simulating workloads. During the allocrunner v2 work (#4792) its name changed from `mock_driver` to just `mock` and then back to `mock_driver`, but the fingreprint key was kept as `driver.mock`. This results in tasks configured with `driver = "mock"` to be scheduled (because Nomad thinks the client has a task driver called `mock`), but fail to actually run (because the Nomad client can't find a driver called `mock` in its catalog). Fingerprinting the right name prevents the job from being scheduled in the first place. Also removes mentions of the mock driver from documentation since its an internal driver and not available in any production release.	2024-04-16 07:16:55 +01:00
Tim Gross	9cb1ef3e3d	CNI: fix bugs in parsing strings to port number integers (#20379 ) Ports are a maximum of uint16, but we have a few places in the recent tproxy code where we were parsing them as 64-bit wide integers and then downcasting them to `int`, which is technically unsafe and triggers code scanning alerts. In practice we've validated the range elsewhere and don't build for 32-bit platforms. This changeset fixes the parsing to make everything a bit more robust and silence the alert. Fixes: https://github.com/hashicorp/nomad-enterprise/security/code-scanning/444	2024-04-12 13:31:25 -04:00
Seth Hoenig	ae6c4c8e3f	deps: purge use of old x/exp packages (#20373 )	2024-04-12 08:29:00 -05:00
astudentofblake	7b7ed12326	func: Allow custom paths to be added the the getter landlock (#20349 ) * func: Allow custom paths to be added the the getter landlock Fixes: 20315 * fix: slices imports fix: more meaningful examples fix: improve documentation fix: quote error output	2024-04-11 15:17:33 -05:00
Tim Gross	d56e8ad1aa	WI: ensure Consul hook and WID manager interpolate services (#20344 ) Services can have some of their string fields interpolated. The new Workload Identity flow doesn't interpolate the services before requesting signed identities or using those identities to get Consul tokens. Add support for interpolation to the WID manager and the Consul tokens hook by providing both with a taskenv builder. Add an "interpolate workload" field to the WI handle to allow passing the original workload name to the server so the server can find the correct service to sign. This changeset also makes two related test improvements: * Remove the mock WID manager, which was only used in the Consul hook tests and isn't necessary so long as we provide the real WID manager with the mock signer and never call `Run` on it. It wasn't feasible to exercise the correct behavior without this refactor, as the mocks were bypassing the new code. * Fixed swapped expect-vs-actual assertions on the `consul_hook` tests. Fixes: https://github.com/hashicorp/nomad/issues/20025	2024-04-11 15:40:28 -04:00
Tim Gross	8298d39e78	Connect transparent proxy support Add support for Consul Connect transparent proxies Fixes: https://github.com/hashicorp/nomad/issues/10628	2024-04-10 11:00:18 -04:00
Tim Gross	4fef82e8e2	tproxy: refactor `getPortMapping` The `getPortMapping` method forces callers to handle two different data structures, but only one caller cares about it. We don't want to return a single map or slice because the `cni.PortMapping` object doesn't include a label field that we need for tproxy. Return a new datastructure that closes over both a slice of `cni.PortMapping` and a map of label to index in that slice.	2024-04-10 10:16:13 -04:00
Tim Gross	8eaf176868	client: fix IPv6 parsing for `client.servers` block (#20324 ) When the `client.servers` block is parsed, we split the port from the address. This does not correctly handle IPv6 addresses when they are in URL format (wrapped in brackets), which we require to disambiguate the port and address. Fix the parser to correctly split out the port and handle a missing port value for IPv6. Update the documentation to make the URL format requirement clear. Fixes: https://github.com/hashicorp/nomad/issues/20310	2024-04-08 15:06:27 -04:00
Tim Gross	76009d89af	tproxy: networking hook changes (#20183 ) When `transparent_proxy` block is present and the network mode is `bridge`, use a different CNI configuration that includes the `consul-cni` plugin. Before invoking the CNI plugins, create a Consul SDK `iptables.Config` struct for the allocation. This includes: * Use all the `transparent_proxy` block fields * The reserved ports are added to the inbound exclusion list so the alloc is reachable from outside the mesh * The `expose` blocks and `check` blocks with `expose=true` are added to the inbound exclusion list so health checks work. The `iptables.Config` is then passed as a CNI argument to the `consul-cni` plugin. Ref: https://github.com/hashicorp/nomad/issues/10628	2024-04-04 17:01:07 -04:00
Seth Hoenig	6ad648bec8	networking: Inject implicit constraints on CNI plugins when using bridge mode (#15473 ) This PR adds a job mutator which injects constraints on the job taskgroups that make use of bridge networking. Creating a bridge network makes use of the CNI plugins: bridge, firewall, host-local, loopback, and portmap. Starting with Nomad 1.5 these plugins are fingerprinted on each node, and as such we can ensure jobs are correctly scheduled only on nodes where they are available, when needed.	2024-03-27 16:11:39 -04:00
Tim Gross	15162917c1	cni: fix regression in falling back to DNS owned by `dockerd` (#20189 ) In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not threaded through to the task configuration. This resulted in a regression where a DNS override set by `dockerd` was not respected for `bridge` mode networking. Our existing handling of CNI DNS incorrectly assumed that the DNS field would be empty, when in fact it contains a single empty DNS struct. Handle this case correctly by checking whether the DNS struct we get back from CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of this case. Fixes: https://github.com/hashicorp/nomad/issues/20174	2024-03-22 10:54:16 -04:00
Michael Schurter	23e4b7c9d2	Upgrade go-msgpack to v2 (#20173 ) Replaces #18812 Upgraded with: ``` find . -name '.go' -exec sed -i s/"github.com\/hashicorp\/go-msgpack\/codec"/"github.com\/hashicorp\/go-msgpack\/v2\/codec/" '{}' ';' find . -name '.go' -exec sed -i s/"github.com\/hashicorp\/net-rpc-msgpackrpc"/"github.com\/hashicorp\/net-rpc-msgpackrpc\/v2/" '{}' ';' go get go get -v -u github.com/hashicorp/raft-boltdb/v2 go get -v github.com/hashicorp/serf@5d32001edfaa18d1c010af65db707cdb38141e80 ``` see https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0 for details	2024-03-21 11:44:23 -07:00
Tim Gross	7b9bce2d08	config: fix `client.template` config merging with defaults (#20165 ) When loading the client configuration, the user-specified `client.template` block was not properly merged with the default values. As a result, if the user set any `client.template` field, all the other field defaulted to their zero values instead of the documented defaults. This changeset: * Adds the missing `Merge` method for the client template config and ensures it's called. * Makes a single source of truth for the default template configuration, instead of two different constructors. * Extends the tests to cover the merge of a partial block better. Fixes: https://github.com/hashicorp/nomad/issues/20164	2024-03-20 10:18:56 -04:00
Charlie Voiselle	7b27bc344b	[refactor] Move task directory destroy logic from alloc_dir.go to task_dir.go (#20006 ) * Move task directory destroy logic from alloc_dir to task_dir * Update errors to wrap error cause * Use constants for file permissions * Make multierror handling consistent. * Make helpers for directory creation * Move mount dir unlink to task_dir Unlink method * Make constant for file mode 710 Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2024-03-19 13:49:09 -04:00
Tim Gross	13617eee4b	template: improve internal documentation around shutdown (#20134 ) While investigating a report around possible consul-template shutdown issues, which didn't bear fruit, I found that some of the logic around template runner shutdown is unintuitive. * Add some doc strings to the places where someone might think we should be obviously stopping the runner or returning early. * Mark context argument for `Poststart`, `Stop`, and `Update` hooks as unused. No functional code changes.	2024-03-14 15:33:32 -04:00
Amir Abbas	40b8f17717	Support insecure flag on artifact (#20126 )	2024-03-14 10:59:20 -05:00
Seth Hoenig	bb54d16e4a	exec2: setup RPC plumbing for dynamic workload users (#20129 ) And pass the dynamic users pool from the client into the hook.	2024-03-13 14:06:52 -05:00
Seth Hoenig	05937ab75b	exec2: add client support for unveil filesystem isolation mode (#20115 ) * exec2: add client support for unveil filesystem isolation mode This PR adds support for a new filesystem isolation mode, "Unveil". The mode introduces a "alloc_mounts" directory where tasks have user-owned directory structure which are bind mounts into the real alloc directory structure. This enables a task driver to use landlock (and maybe the real unveil on openbsd one day) to isolate a task to the task owned directory structure, providing sandboxing. * actually create alloc-mounts-dir directory * fix doc strings about alloc mount dir paths	2024-03-13 08:24:17 -05:00
carrychair	5f5b34db0e	remove repetitive words (#20110 ) Signed-off-by: carrychair <linghuchong404@gmail.com>	2024-03-11 08:52:08 +00:00
Seth Hoenig	286dce7a2a	exec2: add a client.users configuration block (#20093 ) * exec: add a client.users configuration block For now just add min/max dynamic user values; soon we can also absorb the "user.denylist" and "user.checked_drivers" options from the deprecated client.options map. * give the no-op pool implementation a better name * use explicit error types to make referencing them cleaner in tests * use import alias to not shadow package name	2024-03-08 16:02:32 -06:00
Seth Hoenig	2c1f5daad7	more test refactoring (#20092 ) * tests: swap testify for test in client/config * tests: swap testify for test in logmon/	2024-03-07 11:04:16 -06:00
Seth Hoenig	67554b8f91	exec2: implement dynamic workload users taskrunner hook (#20069 ) * exec2: implement dynamic workload users taskrunner hook This PR impelements a TR hook for allocating dynamic workload users from a pool managed by the Nomad client. This adds a new task driver Capability, DynamicWorkloadUsers - which a task driver must indicate in order to make use of this feature. The client config plumbing is coming in a followup PR - in the RFC we realized having a client.users block would be nice to have, with some additional unrelated options being moved from the deprecated client.options config. * learn to spell	2024-03-06 09:34:27 -06:00
Tim Gross	45b2c34532	cni: add DNS set by CNI plugins to task configuration (#20007 ) CNI plugins may set DNS configuration, but this isn't threaded through to the task configuration so that we can write it to the `/etc/resolv.conf` file as needed. Add the `AllocNetworkStatus` to the alloc hook resources so they're accessible from the taskrunner. Any DNS entries provided by the user will override these values. Fixes: https://github.com/hashicorp/nomad/issues/11102	2024-02-20 10:17:27 -05:00
Juana De La Cuesta	20cfbc82d3	Introduces `Disconnect` block into the `TaskGroup` configuration (#19886 ) This PR is the first on two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.	2024-02-19 16:41:35 +01:00
Tim Gross	a74775814c	fingerprint: add DNS address and port to Consul fingerprint (#19969 ) In order to provide a DNS address and port to Connect tasks configured for transparent proxy, we need to fingerprint the Consul DNS address and port. The client will pass this address/port to the iptables configuration provided to the `consul-cni` plugin. Ref: https://github.com/hashicorp/nomad/issues/10628	2024-02-14 12:15:58 -05:00
Cedric Le Roux	994a2b1036	client: fixed a bug where corrupt client state could panic the client (#19972 )	2024-02-14 11:14:11 -05:00
Luiz Aoqui	62b7d6ffe9	vault: revert #18998 to fix potential deadlock (#19963 ) * Revert "vault: always renew tokens using the renewal loop (#18998)" This reverts commit `7054fe1a8c`. * test: add case for concurrent Vault token renewal	2024-02-13 09:50:46 -05:00
Tim Gross	a54657899c	CNI: fix deprecation warnings (#19954 ) We updated our `go-cni` dependency in #17582 but this left deprecation warnings on the `cni.CNIResult` type (now `cni.Result`).	2024-02-12 15:35:43 -05:00
Luiz Aoqui	db5ffde2b7	client: prevent start on cgroups init error (#19915 ) The Nomad client expects certain cgroups paths to exist in order to manage tasks. These paths are created when the agent first starts, but if process fails the agent would just log the error and proceed with its initialization, despite not being able to run tasks. This commit surfaces the errors back to the client initialization so the process can stop early and make clear to operators that something went wrong.	2024-02-09 13:45:29 -05:00
Tim Gross	62c57d208b	fingerprint: eliminate spurious warning logs with Consul CE (#19923 ) Support for fingerprinting the Consul admin partition was added in #19485. But when the client fingerprints Consul CE, it gets a valid fingerprint and working Consul but with a warn-level log. Return "ok" from the partition extractor, but also ensure that we only add the Consul attribute if it actually has a value. Fixes: https://github.com/hashicorp/nomad/issues/19756	2024-02-09 08:19:00 -05:00
hc-github-team-nomad-core	33f0a5b268	Prepare for next release	2024-02-08 10:40:24 -05:00
hc-github-team-nomad-core	875e96cccc	Generate files for 1.7.4 release	2024-02-08 10:40:24 -05:00
Tim Gross	df86503349	template: sandbox template rendering The Nomad client renders templates in the same privileged process used for most other client operations. During internal testing, we discovered that a malicious task can create a symlink that can cause template rendering to read and write to arbitrary files outside the allocation sandbox. Because the Nomad agent can be restarted without restarting tasks, we can't simply check that the path is safe at the time we write without encountering a time-of-check/time-of-use race. To protect Nomad client hosts from this attack, we'll now read and write templates in a subprocess: * On Linux/Unix, this subprocess is sandboxed via chroot to the allocation directory. This requires that Nomad is running as a privileged process. A non-root Nomad agent will warn that it cannot sandbox the template renderer. * On Windows, this process is sandboxed via a Windows AppContainer which has been granted access to only to the allocation directory. This does not require special privileges on Windows. (Creating symlinks in the first place can be prevented by running workloads as non-Administrator or non-ContainerAdministrator users.) Both sandboxes cause encountered symlinks to be evaluated in the context of the sandbox, which will result in a "file not found" or "access denied" error, depending on the platform. This change will also require an update to Consul-Template to allow callers to inject a custom `ReaderFunc` and `RenderFunc`. This design is intended as a workaround to allow us to fix this bug without creating backwards compatibility issues for running tasks. A future version of Nomad may introduce a read-only mount specifically for templates and artifacts so that tasks cannot write into the same location that the Nomad agent is. Fixes: https://github.com/hashicorp/nomad/issues/19888 Fixes: CVE-2024-1329	2024-02-08 10:40:24 -05:00
Tim Gross	0d3cd1427f	migration: check symlink sources during archive unpack During allocation directory migration, the client was not checking that any symlinks in the archive aren't pointing to somewhere outside the allocation directory. While task driver sandboxing will protect against processes inside the task from reading/writing thru the symlink, this doesn't protect against the client itself from performing unintended operations outside the sandbox. This changeset includes two changes: * Update the archive unpacking to check the source of symlinks and require that they fall within the sandbox. * Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't work for paths in the sibling directories of the sandbox directory. This bug doesn't appear to be exploitable but caused errors in testing. Fixes: https://github.com/hashicorp/nomad/issues/19887	2024-02-08 10:40:24 -05:00
Juana De La Cuesta	120c3ca3c9	Add granular control of SELinux labels for host mounts (#19839 ) Add new configuration option on task's volume_mounts, to give a fine grained control over SELinux "z" label * Update website/content/docs/job-specification/volume_mount.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * fix: typo * func: make volume mount verification happen even on mounts with no volume --------- Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-02-05 10:05:33 +01:00
Tim Gross	334c383eb6	template: run template tests on Windows where possible (#19856 ) We don't run the whole suite of unit tests on all platforms to keep CI times reasonable, so the only things we've been running on Windows are platform-specific. I'm working on some platform-specific `template` related work and having these tests run on Windows will reduce the risk of regressions. Our Windows CI box doesn't have Consul or Vault, so I've skipped those tests for the time being, and can follow up with that later. There's also a test with assertions looking for specific paths, and the results are different on Windows. I've skipped those for the moment as well and will follow up under a separate PR. Also swap `testify` for `shoenig/test`	2024-02-02 09:22:03 -05:00
Michael Schurter	8f564182ef	connect: rewrite envoy bootstrap on every restart (#19787 ) Fixes #19781 Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781). This fixes it by always rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact. This commit doesn't necessarily fix every use of Done by hooks, but hopefully improves the situation. The comment on Done has been expanded to hopefully avoid misuse in the future. Done assertions were removed from tests as they add more noise than value. Alternative 1: Use a regular file An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is unsafe as when Consul ACLs are enabled the file will contain a secret token: https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap Alternative 2: Detect if file is already written An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't. This is just a more complicated form of the current fix. I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.	2024-01-24 11:26:31 -08:00
Seth Hoenig	5b7f4746ce	client/allocdir: use an interface in place of AllocDir structs (#19703 ) * client/allocdir: use an interface in place of AllocDir structs This PR replace allocdir.AllocDir with allocdir.Interface such that we may eventually have another implementation of alloc directories. This is in support of the exec2 driver, which will need an implementation of the alloc directory incompatibile with the current version. use rlock	2024-01-12 14:13:29 -06:00
Tim Gross	0935f443dc	vault: support allowing tokens to expire without refresh (#19691 ) Some users with batch workloads or short-lived prestart tasks want to derive a Vaul token, use it, and then allow it to expire without requiring a constant refresh. Add the `vault.allow_token_expiration` field, which works only with the Workload Identity workflow and not the legacy workflow. When set to true, this disables the client's renewal loop in the `vault_hook`. When Vault revokes the token lease, the token will no longer be valid. The client will also now automatically detect if the Vault auth configuration does not allow renewals and will disable the renewal loop automatically. Note this should only be used when a secret is requested from Vault once at the start of a task or in a short-lived prestart task. Long-running tasks should never set `allow_token_expiration=true` if they obtain Vault secrets via `template` blocks, as the Vault token will expire and the template runner will continue to make failing requests to Vault until the `vault_retry` attempts are exhausted. Fixes: https://github.com/hashicorp/nomad/issues/8690	2024-01-10 14:49:02 -05:00
Marvin Chin	be8575a8a2	Fix server shutdown not waiting for worker run completion (#19560 ) * Move group into a separate helper module for reuse * Add shutdownCh to worker The shutdown channel is used to signal that worker has stopped. * Make server shutdown block on workers' shutdownCh * Fix waiting for eval broker state change blocking indefinitely There was a race condition in the GenericNotifier between the Run and WaitForChange functions, where WaitForChange blocks trying to write to a full unsubscribeCh, but the Run function never reads from the unsubscribeCh as it has already stopped. This commit fixes it by unblocking if the notifier has been stopped. * Bound the amount of time server shutdown waits on worker completion * Fix lostcancel linter error * Fix worker test using unexpected worker constructor * Add changelog --------- Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>	2024-01-05 08:45:07 -06:00
David Ventura	fb43b14fb0	Mark CGroups as off when missing essential controllers (#19176 )	2023-12-15 11:20:52 -05:00
Piotr Kazmierczak	f1fb51422b	client: consul hook not called for templates (#19490 ) Due to some refactoring mishap, task-level Consul hook was never triggered and thus never wrote any secrets in task secret dirs.	2023-12-15 17:16:00 +01:00
Tim Gross	2e33115c15	consul: fingerprint Consul Enterprise admin partitions (#19485 ) Consul Enterprise agents all belong to an admin partition. Fingerprint this attribute when available. When a Consul agent is not explicitly configured with "default" it is in the default partition but will not report this in its `/v1/agent/self` endpoint. Fallback to "default" when missing only for Consul Enterprise. This feature provides users the ability to add constraints for jobs to land on Nomad nodes that have a Consul in that partition. Or it can allow cluster administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also have the option to implement a future `partition` field in the jobspec's `consul` block to create an implicit constraint. Ref: https://github.com/hashicorp/nomad/issues/13139#issuecomment-1856479581	2023-12-15 09:26:25 -05:00
Seth Hoenig	6e4d57b330	numalib: provide a fallback for topology scanning on linux (#19457 ) * numalib: provide a fallback for topology scanning on linux * numalib: better package var names * cl: add cl * lint: fix my sloppy code * cl: fixup wording	2023-12-13 13:06:30 -06:00
Piotr Kazmierczak	b6dd376100	numa: account for incorrect core number on topology.insert (#19383 ) Unsupported environments like containers or guests OSs inside LXD can incorrectly number of available cores thus leading to numalib having trouble detecting cores and panicking. This code adds tests for linux sysfs detection methods and fixes the panic.	2023-12-13 17:40:26 +01:00

1 2 3 4 5 ...

4934 Commits