nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-06 10:25:42 +03:00

Author	SHA1	Message	Date
Michael Schurter	e49ca3c431	identity: Implement `change_mode` (#18943 ) * identity: support change_mode and change_signal wip - just jobspec portion * test struct * cleanup some insignificant boogs * actually implement change mode * docs tweaks * add changelog * test identity.change_mode operations * use more words in changelog * job endpoint tests * address comments from code review --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-11-01 09:41:11 -05:00
Tim Gross	d62213a135	consul: fix lookups of default cluster across upgrades (#18945 ) Allocations that were created before Nomad 1.7 will not have the cluster field set for their Consul blocks. While this can be corrected server-side, that doesn't help allocations already on clients.	2023-11-01 10:11:54 -04:00
Tim Gross	c1fa145765	vault: fix lookups of default cluster across upgrades (#18940 ) Allocations that were created before Nomad 1.7 will not have the `cluster` field set for their Vault blocks. While this can be corrected server-side, that doesn't help allocations already on clients. Also add extra safety on Consul cluster lookup too	2023-10-31 17:30:01 -04:00
Tim Gross	6fd3143fe7	services: fix lookup for Consul tokens (#18914 ) The `group_service_hook` needs to supply the Consul service client with Consul tokens for its services. The lookup in the hook resources was looking for the wrong key. This would cause the service client to ignore the Consul token we've received and use the agent's own token. This changeset also moves the prefix formatting into `MakeUniqueIdentityName` method to reduce the risk of this kind of bug in the future.	2023-10-30 13:42:18 -04:00
Tim Gross	f0330d6df1	`identity_hook`: implement PreKill hook, not TaskStop hook (#18913 ) The allocrunner's `identity_hook` implements the interface for TaskStop, but this interface is only ever called for task-level hooks. This results in a leaked goroutine that tries to periodically renew WIs until the client shuts down gracefully. Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so that whenever an allocation is stopped or garbage collected we stop renewing its Workload Identities. This also requires making the `Shutdown` method of `WIDMgr` safe to call multiple times.	2023-10-30 10:54:22 -04:00
Luiz Aoqui	347389f9f9	vault: derive token using `create_from_role` (#18880 ) Fallback to the ACL role defined in the client's `create_from_role` configuration when using the JWT flow and the task does not specify a role to use.	2023-10-27 13:03:44 -04:00
Kerim Satirli	5e1bbf90fc	docs: update all URLs to `developer.hashicorp.com` (#16247 )	2023-10-24 11:00:11 -04:00
Tim Gross	4d9cc73ed2	sids_hook: fix check for Consul token derived from WI (#18821 ) The `sids_hook` serves the legacy Connect workflow, and we want to bypass it when using workload identities. So the hook checks that there's not already a Consul token in the alloc hook resources derived from the Workload Identity. This check was looking for the wrong key. This would cause the hook to ignore the Consul token we already have and then fail to derive a SI token unless the Nomad agent has its own token with `acl:write` permission. Fix the lookup and add tests covering the bypass behavior.	2023-10-23 08:57:02 -04:00
Seth Hoenig	83720740f5	core: plumbing to support numa aware scheduling (#18681 ) * core: plumbing to support numa aware scheduling * core: apply node resources compatibility upon fsm rstore Handle the case where an upgraded server dequeus an evaluation before a client triggers a new fingerprint - which would be needed to cause the compatibility fix to run. By running the compat fix on restore the server will immediately have the compatible pseudo topology to use. * lint: learn how to spell pseudo	2023-10-19 15:09:30 -05:00
Luiz Aoqui	8b9a5fde4e	vault: add multi-cluster support on templates (#18790 ) In Nomad Enterprise, a task may connect to a non-default Vault cluster, requiring `consul-template` to be configured with a specific client `vault` block.	2023-10-18 20:45:01 -04:00
Piotr Kazmierczak	16d71582f6	client: `consul_hook` tests (#18780 ) ref https://github.com/hashicorp/team-nomad/issues/404	2023-10-18 20:02:35 +02:00
Tim Gross	d0957eb109	Consul: agent config updates for WI (#18774 ) This changeset makes two changes: * Removes the `consul.use_identity` field from the agent configuration. This behavior is properly covered by the presence of `consul.service_identity` / `consul.task_identity` blocks. * Adds a `consul.task_auth_method` and `consul.service_auth_method` fields to the agent configuration. This allows the cluster administrator to choose specific Consul Auth Method names for their environment, with a reasonable default.	2023-10-17 14:42:14 -04:00
Tim Gross	ac56855f07	consul: add multi-cluster support to client constructors (#18624 ) When agents start, they create a shared Consul client that is then wrapped as various interfaces for testability, and used in constructing the Nomad client and server. The interfaces that support workload services (rather than the Nomad agent itself) need to support multiple Consul clusters for Nomad Enterprise. Update these interfaces to be factory functions that return the Consul client for a given cluster name. Update the `ServiceClient` to split workload updates between clusters by creating a wrapper around all the clients that delegates to the cluster-specific `ServiceClient`. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-10-17 13:46:49 -04:00
Luiz Aoqui	349c032369	vault: update task runner vault hook to support workload identity (#18534 )	2023-10-16 19:37:57 -04:00
Piotr Kazmierczak	299f3bf74b	client: use WI-issued consul tokens in the template_hook (#18752 ) ref https://github.com/hashicorp/team-nomad/issues/404	2023-10-16 09:39:20 +02:00
Piotr Kazmierczak	b697de9dda	client: correct consul block validation in the consul_hook (#18751 )	2023-10-13 15:15:04 +02:00
Piotr Kazmierczak	91753308b3	WI: set the right identity name for Consul tasks (#18742 ) Consul tasks should only have 1 identity of the form consul/{consul_cluster_name}.	2023-10-12 20:34:15 +02:00
Tim Gross	c7f97722ef	consul hook: get WIs only for own task group (#18732 ) The WID manager will only sign WI tokens for the allocation's task group. We're accidentally looping over all the task groups, which for jobs with multiple task groups results in a failure in the `consul_hook`.	2023-10-11 17:01:28 -04:00
Tim Gross	e22c5b82f3	WID manager: request signed identities for services (#18650 ) Includes changes to WID Manager that make it request signed identities for services, as well as a few improvements to WIHandle introduced in #18672. --------- Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>	2023-10-11 12:07:16 +02:00
Tim Gross	928a82a184	WID manager: save and restore signed WIs from client state DB (#18661 ) When clients are restarted and the identity hook runs when we restore allocations, the running allocations are likely to have already-signed Workload Identities that are unexpired. Save these to the client's local state DB so that we can avoid a thundering herd of RPCs during client restart. When we restore, we'll check if there's at least one expired signed WI before making any initial signing request. Included: * Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear. * Renames the existing `widmgr_test.go` file of integration tests, which is in its own package to avoid circular imports to `widmgr_int_test.go`	2023-10-09 09:16:23 -04:00
Piotr Kazmierczak	597d835220	wi: introduce workload identity handler (#18672 ) Any code that tracks workloads and their identities should not rely on string comparisons, especially since we support 2 types of workload identities: those that identify tasks and those that identify services. This means we cannot rely on task.Name for workload-identity pairs. The new type structs.WIHandle solves this problem by providing a uniform way of identifying workloads and their identities.	2023-10-06 18:32:47 +02:00
Luiz Aoqui	ed204e0fd9	client: ensure task only runs with prestart hooks (#18662 ) Since the allocation in the task runner is updated in a separate goroutine, a race condition may happen where the task is started but the prestart hooks are skipped because the allocation became terminal. Checking for a terminal allocation before proceeding with the task start ensures the task only runs if the prestart hooks are also executed. Since `shouldShutdown()` only uses terminal allocation status, it remains `true` after the first transition, so it's safe to check it again after the prestart hooks as it will never revert to `false`.	2023-10-05 10:16:57 -04:00
Tim Gross	bf65e44a09	consul: only fetch Consul tokens for Consul-specific identities (#18649 ) Only the workload identities signed specifically for Consul, named for the task or service, should result in authenticating to Consul to get tokens.	2023-10-04 11:12:50 -04:00
Matthew Salsamendi	aa9ff3a5b3	fix: use interpolated address when performing health checks (#18584 ) * fix: use interpolated address when performing health checks * Fix tests, add changelog * Update .changelog/18584.txt Co-authored-by: Seth Hoenig <shoenig@duck.com> --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-10-04 07:58:55 -05:00
Tim Gross	fb7582d596	services: get Consul token from hook resources (#18600 ) When Workload Identity is being used with Consul, the `consul_hook` will add Consul tokens to the alloc hook resources. Update the `group_service_hook` and `service_hook` to use those tokens when available for registering and deregistering Consul workloads.	2023-10-04 08:35:18 -04:00
Tim Gross	52ef476a72	sids_hook: read tokens from consul_hook when available (#18594 ) The `sids_hook` runs for Connect sidecar/gateway tasks and gets Consul Service Identity (SI) tokens for use by the Envoy bootstrap hook. When Workload Identity is being used with Consul, the `consul_hook` will have already added these tokens to the alloc hook resources. Update the `sids_hook` to use those tokens instead and write them to the expected area of the taskdir.	2023-10-03 09:12:13 -04:00
Piotr Kazmierczak	3d62438876	consul: consul taskrunner hook should only write tokens that belong to its task (#18635 ) Ref hashicorp/team-nomad#404	2023-10-02 19:49:02 +02:00
Tim Gross	aaee3076c2	consul: allow `consul` block in task scope (#18597 ) To support Workload Identity with Consul for templates, we want templates to be able to use the WI created at the task scope (either implicitly or set by the user). But to allow different tasks within a group to be assigned to different clusters as we're doing for Vault, we need to be able to set the `consul` block with its `cluster` field at the task level to override the group.	2023-09-29 15:03:48 -04:00
Piotr Kazmierczak	5dab41881b	client: new consul_hook (#18557 ) This PR introduces a new allocrunner-level consul_hook which iterates over services and tasks, if their provider is consul, fetches consul tokens for all of them, stores them in AllocHookResources and in task secret dirs. Ref: hashicorp/team-nomad#404 --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-09-29 17:41:48 +02:00
Luiz Aoqui	868aba57bb	vault: update identity name to start with `vault_` (#18591 ) * vault: update identity name to start with `vault_` In the original proposal, workload identities used to derive Vault tokens were expected to be called just `vault`. But in order to support multiple Vault clusters it is necessary to associate identities with specific Vault cluster configuration. This commit implements a new proposal to have Vault identities named as `vault_<cluster>`.	2023-09-27 15:53:28 -03:00
Daniel Bennett	7bd5c6e84e	test: Refactor mock CSI manager (#18554 ) and MockCSIManager to support the call counting that csi_hook_test expects instead of implementing csimanager interfaces in two separate places: * client/allocrunner/csi_hook_test * client/csi_endpoint_test they can both use the same mocks defined in client/pluginmanager/csimanager/ alongside the actual implementations of them. also refactor TestCSINode_DetachVolume to use use it like Node_ExpandVolume so we can also test the happy path there	2023-09-21 16:03:53 -05:00
Piotr Kazmierczak	86d2cdcf80	client: split identity_hook across allocrunner and taskrunner (#18431 ) This commit splits identity_hook between the allocrunner and taskrunner. The allocrunner-level part of the hook signs each task identity, and the taskrunner-level part picks it up and stores secrets for each task. The code revamps the WIDMgr, which is now split into 2 interfaces: IdentityManager which manages renewals of signatures and handles sending updates to subscribers via Watch method, and IdentitySigner which only does the signing. This work is necessary for having a unified Consul login workflow that comes with the new Consul integration. A new, allocrunner-level consul_hook will now be the only hook doing Consul authentication.	2023-09-21 17:31:27 +02:00
Tim Gross	fdc6c2151d	vault: select Vault API client by cluster name (#18533 ) Nomad Enterprise will support configuring multiple Vault clients. Instead of having a single Vault client field in the Nomad client, we'll have a function that callers can parameterize by the Vault cluster name that returns the correctly configured Vault API client wrapper.	2023-09-19 14:35:01 -04:00
Daniel Bennett	4895d708b4	csi: implement NodeExpandVolume (#18522 ) following ControllerExpandVolume in `c6dbba7cde`, which expands the disk at e.g. a cloud vendor, the controller plugin may say that we also need to issue NodeExpandVolume for the node plugin to make the new disk space available to task(s) that have claims on the volume by e.g. expanding the filesystem on the node. csi spec: https://github.com/container-storage-interface/spec/blob/c918b7f/spec.md#nodeexpandvolume	2023-09-18 10:30:15 -05:00
Seth Hoenig	591394fb62	drivers: plumb hardware topology via grpc into drivers (#18504 ) * drivers: plumb hardware topology via grpc into drivers This PR swaps out the temporary use of detecting system hardware manually in each driver for using the Client's detected topology by plumbing the data over gRPC. This ensures that Client configuration is taken to account consistently in all references to system topology. * cr: use enum instead of bool for core grade * cr: fix test slit tables to be possible	2023-09-18 08:58:07 -05:00
Shantanu Gadgil	12580c345a	bubble up the error message from go-getter (#18444 )	2023-09-13 09:36:39 -04:00
Seth Hoenig	2e1974a574	client: refactor cpuset partitioning (#18371 ) * client: refactor cpuset partitioning This PR updates the way Nomad client manages the split between tasks that make use of resources.cpus vs. resources.cores. Previously, each task was explicitly assigned which CPU cores they were able to run on. Every time a task was started or destroyed, all other tasks' cpusets would need to be updated. This was inefficient and would crush the Linux kernel when a client would try to run ~400 or so tasks. Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently manage cpusets. * cr: tweaks for feedback	2023-09-12 09:11:11 -05:00
Michael Schurter	ef24e40b39	identity: support jwt expiration and rotation (#18262 ) Implements expirations and renewals for alternate workload identity tokens.	2023-09-08 14:50:34 -07:00
Daniel Bennett	22cbb913db	csi: rename volume Mounter to Manager (#18434 ) to align with its broader purpose, and the volumeManager implementation	2023-09-08 15:33:46 -05:00
James Rasell	a9d5beb141	test: use correct parallel test setup func (#18326 )	2023-08-25 13:51:36 +01:00
Seth Hoenig	f5b0da1d55	all: swap exp packages for maps, slices (#18311 )	2023-08-23 15:42:13 -05:00
Seth Hoenig	8833452d44	followup to numa/cgroups refactor (#18214 ) * lang: note that Stack is not concurrency-safe * client: use more descriptive name for wrangler hook in logs * numalib: use correct name for receiver parameter	2023-08-15 14:12:17 -05:00
Tim Gross	f00bff09f1	fix multiple overflow errors in exponential backoff (#18200 ) We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see if the backoff is greater than the limit, but this check happens after the bitshift and we always increment the number of attempts. This causes an overflow with a fairly small number of failures (ex. at one place I tested it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault. Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as the number of iterations is reasonable. Introduce a helper with a check on the cap before the bitshift to avoid overflow in all places this can occur. Fixes: #18199 Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>	2023-08-15 14:38:18 -04:00
Michael Schurter	0e22fc1a0b	identity: add support for multiple identities + audiences (#18123 ) Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods. Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-08-15 09:11:53 -07:00
Seth Hoenig	d9341f0664	update go1.21 (#18184 ) * build: update to go1.21 * go: eliminate helpers in favor of min/max * build: run go mod tidy * build: swap depguard for semgrep * command: fixup broken tls error check on go1.21	2023-08-14 08:43:27 -05:00
hashicorp-copywrite[bot]	2d35e32ec9	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:15 -05:00
Seth Hoenig	a4cc76bd3e	numa: enable numa topology detection (#18146 ) * client: refactor cgroups management in client * client: fingerprint numa topology * client: plumb numa and cgroups changes to drivers * client: cleanup task resource accounting * client: numa client and config plumbing * lib: add a stack implementation * tools: remove ec2info tool * plugins: fixup testing for cgroups / numa changes * build: update makefile and package tests and cl	2023-08-10 17:05:30 -05:00
Charlie Voiselle	585b0533c0	[dep] bump golang.org/x/exp (#18102 ) There are some refactorings that have to be made in the getter and state where the api changed in `slices` * Bump golang.org/x/exp * Bump golang.org/x/exp in api * Update job_endpoint_test * [feedback] unexport sort function	2023-08-01 11:50:17 -04:00
Gerard Nguyen	9e98d694a6	feature: Add new field render_templates on restart block (#18054 ) This feature is necessary when user want to explicitly re-render all templates on task restart. E.g. to fetch all new secrets from Vault, even if the lease on the existing secrets has not been expired.	2023-07-28 11:53:32 -07:00
Ville Vesilehto	2c463bb038	chore(lint): use Go stdlib variables for HTTP methods and status codes (#17968 )	2023-07-26 15:28:09 +01:00

1 2 3 4 5 ...

891 Commits