nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-05 09:55:44 +03:00

Author	SHA1	Message	Date
Chris Roberts	362690ddd1	client: suppress kill task event on completed tasks (#26075 ) The `killTasks` function will kill all the alloc runner's task runners. If the task of a task runner has already completed, the killing of the task runner can cause confusion due to the task event showing that the task was signaled even though it is already complete. To prevent this, a check is done when creating the task event to determine if the task has completed. If it has, no task event is created, and when the task runner is killed no extra task event is added.	2025-07-01 13:30:52 -07:00
Chris Roberts	dfa07e10ed	client: fix batch job drain behavior (#26025 ) Batch job allocations that are drained from a node will be moved to an eligible node. However, when no eligible nodes are available to place the draining allocations, the tasks will end up being complete and will not be placed when an eligible node becomes available. This occurs because the drained allocations are simultaneously stopped on the draining node while attempting to be placed on an eligible node. The stopping of the allocations on the draining node results in tasks being killed, but importantly this kill does not fail the task. The result is tasks reporting as complete due to their state being dead and not being failed. As such, when an eligible node becomes available, all tasks will show as complete and no allocations will need to be placed. To prevent the behavior described above, a check is performed when the alloc runner kills its tasks. If the allocation's job type is batch, and the allocation has a desired transition of migrate, the task will be failed when it is killed. This ensures the task does not report as complete, and when an eligible node becomes available the allocations are placed as expected.	2025-06-13 08:28:31 -07:00
Tim Gross	e168548341	provide allocrunner hooks with prebuilt taskenv and fix mutation bugs (#25373 ) Some of our allocrunner hooks require a task environment for interpolating values based on the node or allocation. But several of the hooks accept an already-built environment or builder and then keep that in memory. Both of these retain a copy of all the node attributes and allocation metadata, which balloons memory usage until the allocation is GC'd. While we'd like to look into ways to avoid keeping the allocrunner around entirely (see #25372), for now we can significantly reduce memory usage by creating the task environment on-demand when calling allocrunner methods, rather than persisting it in the allocrunner hooks. In doing so, we uncover two other bugs: * The WID manager, the group service hook, and the checks hook have to interpolate services for specific tasks. They mutated a taskenv builder to do so, but each time they mutate the builder, they write to the same environment map. When a group has multiple tasks, it's possible for one task to set an environment variable that would then be interpolated in the service definition for another task if that task did not have that environment variable. Only the service definition interpolation is impacted. This does not leak env vars across running tasks, as each taskrunner has its own builder. To fix this, we move the `UpdateTask` method off the builder and onto the taskenv as the `WithTask` method. This makes a shallow copy of the taskenv with a deep clone of the environment map used for interpolation, and then overwrites the environment from the task. * The checks hook interpolates Nomad native service checks only on `Prerun` and not on `Update`. This could cause unexpected deregistration and registration of checks during in-place updates. To fix this, we make sure we interpolate in the `Update` method. I also bumped into an incorrectly implemented interface in the CSI hook. I've pulled that and some better guardrails out to https://github.com/hashicorp/nomad/pull/25472. Fixes: https://github.com/hashicorp/nomad/issues/25269 Fixes: https://hashicorp.atlassian.net/browse/NET-12310 Ref: https://github.com/hashicorp/nomad/issues/25372	2025-03-24 12:05:04 -04:00
Michael Smithhisler	5c4d0e923d	consul: Remove legacy token based authentication workflow (#25217 )	2025-03-05 15:38:11 -05:00
Matt Keeler	833e240597	Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 ) * Upgrade to using hashicorp/go-metrics@v0.5.4 This also requires bumping the dependencies for: * memberlist * serf * raft * raft-boltdb * (and indirectly hashicorp/mdns due to the memberlist or serf update) Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.	2025-01-31 15:22:00 -05:00
James Rasell	7d48aa2667	client: emit optional telemetry from prerun and prestart hooks. (#24556 ) The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete. The new datapoints are: - nomad.client.alloc_hook.prerun.success (counter) - nomad.client.alloc_hook.prerun.failed (counter) - nomad.client.alloc_hook.prerun.elapsed (sample) - nomad.client.task_hook.prestart.success (counter) - nomad.client.task_hook.prestart.failed (counter) - nomad.client.task_hook.prestart.elapsed (sample) The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance. Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.	2024-12-12 14:43:14 +00:00
Tim Gross	b25f1b66ce	resources: allow job authors to configure size of secrets tmpfs (#23696 ) On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks need larger space for downloading large secrets. This is especially the case for tasks using `templates`, which need extra room to write a temporary file to the secrets directory that gets renamed to the old file atomically. This changeset allows increasing the size of the tmpfs in the `resources` block. Because this is a memory resource, we need to include it in the memory we allocate for scheduling purposes. The task is already prevented from using more memory in the tmpfs than the `resources.memory` field allows, but can bypass that limit by writing to the tmpfs via `template` or `artifact` blocks. Therefore, we need to account for the size of the tmpfs in the allocation resources. Simply adding it to the memory needed when we create the allocation allows it to be accounted for in all downstream consumers, and then we'll subtract that amount from the memory resources just before configuring the task driver. For backwards compatibility, the default value of 1MiB is "free" and ignored by the scheduler. Otherwise we'd be increasing the allocated resources for every existing alloc, which could cause problems across upgrades. If a user explicitly sets `resources.secrets = 1` it will no longer be free. Fixes: https://github.com/hashicorp/nomad/issues/2481 Ref: https://hashicorp.atlassian.net/browse/NET-10070	2024-08-05 16:06:58 -04:00
Seth Hoenig	09bd11383c	client: alloc_mounts directory must be sibling of data directory (#22199 ) This PR adjusts the default location of -alloc-mounts-dir path to be a sibling of the -data-dir path rather than a child. This is because on production-hardened systems the data dir is supposed to be chmod 0700 owned by root - preventing the exec2 task driver (and others using unveil file system isolation features) from working properly. For reference the directory structure from -data-dir now looks like this after running an example job. Under the alloc_mounts directory, task-specific directories are mode 0710 and owned by the task user (which may be a dynamic user UID/GID). ➜ sudo tree -p -d -u /tmp/mynomad [drwxrwxr-x shoenig ] /tmp/mynomad ├── [drwx--x--x root ] alloc_mounts │ └── [drwx--x--- 80552 ] c753b71d-c6a1-3370-1f59-47ab838fd8a6-mytask │ ├── [drwxrwxrwx nobody ] alloc │ │ ├── [drwxrwxrwx nobody ] data │ │ ├── [drwxrwxrwx nobody ] logs │ │ └── [drwxrwxrwx nobody ] tmp │ ├── [drwxrwxrwx nobody ] local │ ├── [drwxr-xr-x root ] private │ ├── [drwx--x--- 80552 ] secrets │ └── [drwxrwxrwt nobody ] tmp └── [drwx------ root ] data ├── [drwx--x--x root ] alloc │ └── [drwxr-xr-x root ] c753b71d-c6a1-3370-1f59-47ab838fd8a6 │ ├── [drwxrwxrwx nobody ] alloc │ │ ├── [drwxrwxrwx nobody ] data │ │ ├── [drwxrwxrwx nobody ] logs │ │ └── [drwxrwxrwx nobody ] tmp │ └── [drwx--x--- 80552 ] mytask │ ├── [drwxrwxrwx nobody ] alloc │ │ ├── [drwxrwxrwx nobody ] data │ │ ├── [drwxrwxrwx nobody ] logs │ │ └── [drwxrwxrwx nobody ] tmp │ ├── [drwxrwxrwx nobody ] local │ ├── [drwxrwxrwx nobody ] private │ ├── [drwx--x--- 80552 ] secrets │ └── [drwxrwxrwt nobody ] tmp ├── [drwx------ root ] client └── [drwxr-xr-x root ] server ├── [drwx------ root ] keystore ├── [drwxr-xr-x root ] raft │ └── [drwxr-xr-x root ] snapshots └── [drwxr-xr-x root ] serf 32 directories	2024-05-22 13:14:34 -05:00
James Rasell	04ba358266	client: expose network namespace CNI config as task env vars. (#11810 ) This change exposes CNI configuration details of a network namespace as environment variables. This allows a task to use these values to configure itself; a potential use case is to run a Raft application binding to IP and Port details configured using the bridge network mode.	2024-05-14 09:02:06 +01:00
Seth Hoenig	ae6c4c8e3f	deps: purge use of old x/exp packages (#20373 )	2024-04-12 08:29:00 -05:00
Tim Gross	d56e8ad1aa	WI: ensure Consul hook and WID manager interpolate services (#20344 ) Services can have some of their string fields interpolated. The new Workload Identity flow doesn't interpolate the services before requesting signed identities or using those identities to get Consul tokens. Add support for interpolation to the WID manager and the Consul tokens hook by providing both with a taskenv builder. Add an "interpolate workload" field to the WI handle to allow passing the original workload name to the server so the server can find the correct service to sign. This changeset also makes two related test improvements: * Remove the mock WID manager, which was only used in the Consul hook tests and isn't necessary so long as we provide the real WID manager with the mock signer and never call `Run` on it. It wasn't feasible to exercise the correct behavior without this refactor, as the mocks were bypassing the new code. * Fixed swapped expect-vs-actual assertions on the `consul_hook` tests. Fixes: https://github.com/hashicorp/nomad/issues/20025	2024-04-11 15:40:28 -04:00
Seth Hoenig	05937ab75b	exec2: add client support for unveil filesystem isolation mode (#20115 ) * exec2: add client support for unveil filesystem isolation mode This PR adds support for a new filesystem isolation mode, "Unveil". The mode introduces an "alloc_mounts" directory where tasks have user-owned directory structures which are bind mounts into the real alloc directory structure. This enables a task driver to use landlock (and maybe the real unveil on openbsd one day) to isolate a task to the task-owned directory structure, providing sandboxing. * actually create alloc-mounts-dir directory * fix doc strings about alloc mount dir paths	2024-03-13 08:24:17 -05:00
Seth Hoenig	67554b8f91	exec2: implement dynamic workload users taskrunner hook (#20069 ) * exec2: implement dynamic workload users taskrunner hook This PR implements a TR hook for allocating dynamic workload users from a pool managed by the Nomad client. This adds a new task driver Capability, DynamicWorkloadUsers - which a task driver must indicate in order to make use of this feature. The client config plumbing is coming in a followup PR - in the RFC we realized having a client.users block would be nice to have, with some additional unrelated options being moved from the deprecated client.options config. * learn to spell	2024-03-06 09:34:27 -06:00
Tim Gross	45b2c34532	cni: add DNS set by CNI plugins to task configuration (#20007 ) CNI plugins may set DNS configuration, but this isn't threaded through to the task configuration so that we can write it to the `/etc/resolv.conf` file as needed. Add the `AllocNetworkStatus` to the alloc hook resources so they're accessible from the taskrunner. Any DNS entries provided by the user will override these values. Fixes: https://github.com/hashicorp/nomad/issues/11102	2024-02-20 10:17:27 -05:00
Seth Hoenig	5b7f4746ce	client/allocdir: use an interface in place of AllocDir structs (#19703 ) * client/allocdir: use an interface in place of AllocDir structs This PR replaces allocdir.AllocDir with allocdir.Interface such that we may eventually have another implementation of alloc directories. This is in support of the exec2 driver, which will need an implementation of the alloc directory incompatible with the current version. use rlock	2024-01-12 14:13:29 -06:00
Tim Gross	7191c78928	refactor: rename allocrunner's Consul service reg handler (#19019 ) The allocrunner has a service registration handler that proxies various API calls to Consul. With multi-cluster support (for ENT), the service registration handler is what selects the correct Consul client. The name of this field in the allocrunner and taskrunner code base looks like it's referring to the actual Consul API client. This was actually the case before Nomad native service discovery was implemented, but now the name is misleading.	2023-11-08 15:39:32 -05:00
Tim Gross	ac56855f07	consul: add multi-cluster support to client constructors (#18624 ) When agents start, they create a shared Consul client that is then wrapped as various interfaces for testability, and used in constructing the Nomad client and server. The interfaces that support workload services (rather than the Nomad agent itself) need to support multiple Consul clusters for Nomad Enterprise. Update these interfaces to be factory functions that return the Consul client for a given cluster name. Update the `ServiceClient` to split workload updates between clusters by creating a wrapper around all the clients that delegates to the cluster-specific `ServiceClient`. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-10-17 13:46:49 -04:00
Tim Gross	928a82a184	WID manager: save and restore signed WIs from client state DB (#18661 ) When clients are restarted and the identity hook runs when we restore allocations, the running allocations are likely to have already-signed Workload Identities that are unexpired. Save these to the client's local state DB so that we can avoid a thundering herd of RPCs during client restart. When we restore, we'll check if there's at least one expired signed WI before making any initial signing request. Included: * Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear. * Renames the existing `widmgr_test.go` file of integration tests, which is in its own package to avoid circular imports to `widmgr_int_test.go`	2023-10-09 09:16:23 -04:00
Piotr Kazmierczak	86d2cdcf80	client: split identity_hook across allocrunner and taskrunner (#18431 ) This commit splits identity_hook between the allocrunner and taskrunner. The allocrunner-level part of the hook signs each task identity, and the taskrunner-level part picks it up and stores secrets for each task. The code revamps the WIDMgr, which is now split into 2 interfaces: IdentityManager which manages renewals of signatures and handles sending updates to subscribers via Watch method, and IdentitySigner which only does the signing. This work is necessary for having a unified Consul login workflow that comes with the new Consul integration. A new, allocrunner-level consul_hook will now be the only hook doing Consul authentication.	2023-09-21 17:31:27 +02:00
Tim Gross	fdc6c2151d	vault: select Vault API client by cluster name (#18533 ) Nomad Enterprise will support configuring multiple Vault clients. Instead of having a single Vault client field in the Nomad client, we'll have a function that callers can parameterize by the Vault cluster name that returns the correctly configured Vault API client wrapper.	2023-09-19 14:35:01 -04:00
Seth Hoenig	2e1974a574	client: refactor cpuset partitioning (#18371 ) * client: refactor cpuset partitioning This PR updates the way Nomad client manages the split between tasks that make use of resources.cpus vs. resources.cores. Previously, each task was explicitly assigned which CPU cores they were able to run on. Every time a task was started or destroyed, all other tasks' cpusets would need to be updated. This was inefficient and would crush the Linux kernel when a client would try to run ~400 or so tasks. Now, we make use of cgroup hierarchy and cpuset inheritance to efficiently manage cpusets. * cr: tweaks for feedback	2023-09-12 09:11:11 -05:00
Michael Schurter	0e22fc1a0b	identity: add support for multiple identities + audiences (#18123 ) Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods. Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-08-15 09:11:53 -07:00
hashicorp-copywrite[bot]	2d35e32ec9	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:15 -05:00
Seth Hoenig	a4cc76bd3e	numa: enable numa topology detection (#18146 ) * client: refactor cgroups management in client * client: fingerprint numa topology * client: plumb numa and cgroups changes to drivers * client: cleanup task resource accounting * client: numa client and config plumbing * lib: add a stack implementation * tools: remove ec2info tool * plugins: fixup testing for cgroups / numa changes * build: update makefile and package tests and cl	2023-08-10 17:05:30 -05:00
stswidwinski	b9a388f5df	Retain task states for post stop tasks at the time of node GC (#18005 ) * Retain task states for post stop tasks at the time of node GC	2023-07-21 10:55:00 -07:00
Tim Gross	0ba7d0036b	CSI: persist previous mounts on client to restore during restart (#17840 ) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028	2023-07-10 13:20:15 -04:00
Tim Gross	78f4f76520	adjust prioritized client updates (#17541 ) In #17354 we made client updates prioritized to reduce client-to-server traffic. When the client has no previously-acknowledged update we assume that the update is of typical priority; although we don't know that for sure in practice an allocation will never become healthy quickly enough that the first update we send is the update saying the alloc is healthy. But that doesn't account for allocations that quickly fail in an unrecoverable way because of allocrunner hook failures, and it'd be nice to be able to send those failure states to the server more quickly. This changeset does so and adds some extra comments on reasoning behind priority.	2023-06-26 09:14:24 -04:00
Tim Gross	893d4a77c8	prioritized client updates (#17354 ) The allocrunner sends several updates to the server during the early lifecycle of an allocation and its tasks. Clients batch-up allocation updates every 200ms, but experiments like the C2M challenge have shown that even with this batching, servers can be overwhelmed with client updates during high volume deployments. Benchmarking done in #9451 has shown that client updates can easily represent ~70% of all Nomad Raft traffic. Each allocation sends many updates during its lifetime, but only those that change the `ClientStatus` field are critical for progressing a deployment or kicking off a reschedule to recover from failures. Add a priority to the client allocation sync and update the `syncTicker` receiver so that we only send an update if there's a high priority update waiting, or on every 5th tick. This means when there are no high priority updates, the client will send updates at most every 1s instead of 200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as well as reduce client-to-server RPC traffic. This changeset also switches from a channel-based collection of updates to a shared buffer, so as to split batching from sending and prevent backpressure onto the allocrunner when the RPC is slow. This doesn't have a major performance benefit in the benchmarks but makes the implementation of the prioritized update simpler. Fixes: #9451	2023-05-31 15:34:16 -04:00
Seth Hoenig	3cc25949fa	client: ignore restart issued to terminal allocations (#17175 ) * client: ignore restart issued to terminal allocations This PR fixes a bug where issuing a restart to a terminal allocation would cause the allocation to run its hooks anyway. This was particularly apparent with group_service_hook who would then register services but then never deregister them - as the allocation would be effectively in a "zombie" state where it is prepped to run tasks but never will. * e2e: add e2e test for alloc restart zombies * cl: tweak text Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-05-16 10:19:41 -05:00
Tim Gross	88323bab4a	allocrunner: provide factory function so we can build mock ARs (#17161 ) Tools like `nomad-nodesim` are unable to implement a minimal implementation of an allocrunner so that we can test the client communication without having to lug around the entire allocrunner/taskrunner code base. The allocrunner was implemented with an interface specifically for this purpose, but there were circular imports that made it challenging to use in practice. Move the AllocRunner interface into an inner package and provide a factory function type. Provide a minimal test that exercises the new function so that consumers have some idea of what the minimum implementation required is.	2023-05-12 13:29:44 -04:00
Tim Gross	116f24d768	client: de-duplicate alloc updates and gate during restore (#17074 ) When client nodes are restarted, all allocations that have been scheduled on the node have their modify index updated, including terminal allocations. There are several contributing factors: * The `allocSync` method that updates the servers isn't gated on first contact with the servers. This means that if a server updates the desired state while the client is down, the `allocSync` races with the `Node.ClientGetAlloc` RPC. This will typically result in the client updating the server with "running" and then immediately thereafter "complete". * The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even if it's possible to assert that the server has definitely seen the client state. The allocrunner may queue-up updates even if we gate sending them. So then we end up with a race between the allocrunner updating its internal state to overwrite the previous update and `allocSync` sending the bogus or duplicate update. This changeset adds tracking of server-acknowledged state to the allocrunner. This state gets checked in the `allocSync` before adding the update to the batch, and updated when `Node.UpdateAlloc` returns successfully. To implement this we need to be able to equality-check the updates against the last acknowledged state. We also need to add the last acknowledged state to the client state DB, otherwise we'd drop unacknowledged updates across restarts. The client restart test has been expanded to cover a variety of allocation states, including allocs stopped before shutdown, allocs stopped by the server while the client is down, and allocs that have been completely GC'd on the server while the client is down. I've also bench tested scenarios where the task workload is killed while the client is down, resulting in a failed restore. Fixes #16381	2023-05-11 09:05:24 -04:00
Daniel Bennett	c2dc1c58dd	full task cleanup when alloc prerun hook fails (#17104 ) to avoid leaking task resources (e.g. containers, iptables) if allocRunner prerun fails during restore on client restart. now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event, mark the task as failed, and cancel the tr's killCtx, so then ar.runTasks() -> tr.Run() can take care of the actual cleanup. removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run(): * set task state as dead * save task runner local state * task stop hooks also done in tr.Run() now that it's not skipped: * handleKill() to kill tasks while respecting their shutdown delay, and retrying as needed * also includes task preKill hooks * clearDriverHandle() to destroy the task and associated resources * task exited hooks	2023-05-08 13:17:10 -05:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	f3fc54adcf	CSI: set mounts in alloc hook resources atomically (#16722 ) The allocrunner has a facility for passing data written by allocrunner hooks to taskrunner hooks. Currently the only consumers of this facility are the allocrunner CSI hook (which writes data) and the taskrunner volume hook (which reads that same data). The allocrunner hook for CSI volumes doesn't set the alloc hook resources atomically. Instead, it gets the current resources and then writes a new version back. Because the CSI hook is currently the only writer and all readers happen long afterwards, this should be safe but #16623 shows there's some sequence of events during restore where this breaks down. Refactor hook resources so that hook data is accessed via setters and getters that hold the mutex.	2023-04-03 11:03:36 -04:00
Seth Hoenig	d30e34261e	client: always run alloc cleanup hooks on final update (#15855 ) * client: run alloc pre-kill hooks on last pass despite no live tasks This PR fixes a bug where alloc pre-kill hooks were not run in the edge case where there are no live tasks remaining, but it is also the final update to process for the (terminal) allocation. We need to run cleanup hooks here, otherwise they will not run until the allocation gets garbage collected (i.e. via Destroy()), possibly at a distant time in the future. Fixes #15477 * client: do not run ar cleanup hooks if client is shutting down	2023-01-27 09:59:31 -06:00
Luiz Aoqui	f74f50804a	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle` otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Luiz Aoqui	6070fa0c8d	allocrunner: refactor task coordinator (#14009 ) The current implementation for the task coordinator unblocks tasks by performing destructive operations over its internal state (like closing channels and deleting keys from maps). This presents a problem in situations where we would like to revert the state of a task, such as when restarting an allocation with tasks that have already exited. With this new implementation the task coordinator behaves more like a finite state machine where a task may be blocked/unblocked multiple times by performing a state transition. This initial part of the work only refactors the task coordinator and is functionally equivalent to the previous implementation. Future work will build upon this to provide bug fixes and enhancements.	2022-08-22 18:38:49 -04:00
Piotr Kazmierczak	c4be2c6078	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
Seth Hoenig	b2861f2a9b	client: add support for checks in nomad services This PR adds support for specifying checks in services registered to the built-in nomad service provider. Currently only HTTP and TCP checks are supported, though more types could be added later.	2022-07-12 17:09:50 -05:00
Seth Hoenig	dbcccc7a68	client: enforce max_kill_timeout client configuration This PR fixes a bug where client configuration max_kill_timeout was not being enforced. The feature was introduced in `9f44780` but seems to have been removed during the major drivers refactoring. We can make sure the value is enforced by plumbing it through the DriverHandler, which now uses the lesser of the task.killTimeout or client.maxKillTimeout. Also updates Event.SetKillTimeout to require both the task.killTimeout and client.maxKillTimeout so that we don't make the mistake of using the wrong value - as it was being given only the task.killTimeout before.	2022-07-06 15:29:38 -05:00
Derek Strickland	ec3b7150e4	alloc_runner: stop sidecar tasks last (#13055 ) alloc_runner: stop sidecar tasks last	2022-06-07 11:35:19 -04:00
Michael Schurter	3968509886	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00
Derek Strickland	8863d1e45a	disconnected clients: Support operator manual interventions (#12436 ) * allocrunner: Remove Shutdown call in Reconnect * Node.UpdateAlloc: Stop orphaned allocs. * reconciler: Stop failed reconnects. * Apply feedback from code review. Handle rebase conflict. * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-04-06 09:33:32 -04:00
Derek Strickland	8ac3e642e6	reconciler: 2 phase reconnects and tests (#12333 ) * structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by. * node_endpoint: Emit new eval for reconnecting unknown allocs. * filterByTainted: handle 2 phase commit filtering rules. * reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects. * allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.	2022-04-05 17:13:10 -04:00
Derek Strickland	35752655b0	disconnected clients: Add reconnect task event (#12133 ) * Add TaskClientReconnectedEvent constant * Add allocRunner.Reconnect function to manage task state manually * Removes server-side push	2022-04-05 17:12:23 -04:00
James Rasell	f0be952cb5	client: hookup service wrapper for use within client hooks.	2022-03-21 10:29:57 +01:00
James Rasell	6e8f32a290	client: refactor common service registration objects from Consul. This commit performs refactoring to pull out common service registration objects into a new `client/serviceregistration` package. This new package will form the base point for all client specific service registration functionality. The Consul specific implementation is not moved as it also includes non-service registration implementations; this reduces the blast radius of the changes as well.	2022-03-15 09:38:30 +01:00
Tim Gross	35c22bcb6c	provide `-no-shutdown-delay` flag for job/alloc stop (#11596 ) Some operators use very long group/task `shutdown_delay` settings to safely drain network connections to their workloads after service deregistration. But during incident response, they may want to cause that drain to be skipped so they can quickly shed load. Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and `nomad job stop` commands that bypasses the delay. This sets a new desired transition state on the affected allocations that the allocation/task runner will identify during pre-kill on the client. Note (as documented here) that using this flag will almost always result in failed inbound network connections for workloads as the tasks will exit before clients receive updated service discovery information and won't be gracefully drained.	2021-12-13 14:54:53 -05:00
Michael Schurter	37f053ff89	client: never embed alloc_dir in chroot Fixes #2522 Skip embedding client.alloc_dir when building chroot. If a user configures a Nomad client agent so that the chroot_env will embed the client.alloc_dir, Nomad will happily infinitely recurse while building the chroot until something horrible happens. The best case scenario is the filesystem's path length limit is hit. The worst case scenario is disk space is exhausted. A bad agent configuration will look something like this: ```hcl data_dir = "/tmp/nomad-badagent" client { enabled = true chroot_env { # Note that the source matches the data_dir "/tmp/nomad-badagent" = "/ohno" # ... } } ``` Note that `/ohno/client` (the state_dir) will still be created but not `/ohno/alloc` (the alloc_dir). While I cannot think of a good reason why someone would want to embed Nomad's client (and possibly server) directories in chroots, there should be no cause for harm. chroots are only built when Nomad runs as root, and Nomad disables running exec jobs as root by default. Therefore even if client state is copied into chroots, it will be inaccessible to tasks. Skipping the `data_dir` and `{client,server}.state_dir` is possible, but this PR attempts to implement the minimum viable solution to reduce risk of unintended side effects or bugs. When running tests as root in a vm without the fix, the following error occurs: ``` === RUN TestAllocDir_SkipAllocDir alloc_dir_test.go:520: Error Trace: alloc_dir_test.go:520 Error: Received unexpected error: Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long Test: TestAllocDir_SkipAllocDir --- FAIL: TestAllocDir_SkipAllocDir (22.76s) ``` Also removed unused Copy methods on AllocDir and TaskDir structs. 
Thanks to @eveld for not letting me forget about this!	2021-10-18 09:22:01 -07:00
Nick Ethier	4a25ec9410	testing fixes	2021-04-14 10:17:28 -04:00

129 Commits