nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Lance Haig	48e7d70fcd	deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318) * Update ioutil deprecated library references to os and io respectively * Deal with the errors produced. Add error handling to filEntry info Add error handling to info	2023-03-08 13:25:10 -06:00
Charlie Voiselle	55df5af4aa	client: Add option to enable hairpinMode on Nomad bridge (#15961) * Add `bridge_network_hairpin_mode` client config setting * Add node attribute: `nomad.bridge.hairpin_mode` * Changed format string to use `%q` to escape user provided data * Add test to validate template JSON for developer safety Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2023-02-02 10:12:15 -05:00
Seth Hoenig	d30e34261e	client: always run alloc cleanup hooks on final update (#15855) * client: run alloc pre-kill hooks on last pass despite no live tasks This PR fixes a bug where alloc pre-kill hooks were not run in the edge case where there are no live tasks remaining, but it is also the final update to process for the (terminal) allocation. We need to run cleanup hooks here, otherwise they will not run until the allocation gets garbage collected (i.e. via Destroy()), possibly at a distant time in the future. Fixes #15477 * client: do not run ar cleanup hooks if client is shutting down	2023-01-27 09:59:31 -06:00
Luiz Aoqui	f74f50804a	Task lifecycle restart (#14127) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle`; otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Luiz Aoqui	6070fa0c8d	allocrunner: refactor task coordinator (#14009) The current implementation for the task coordinator unblocks tasks by performing destructive operations over its internal state (like closing channels and deleting keys from maps). This presents a problem in situations where we would like to revert the state of a task, such as when restarting an allocation with tasks that have already exited. With this new implementation the task coordinator behaves more like a finite state machine where tasks may be blocked/unblocked multiple times by performing a state transition. This initial part of the work only refactors the task coordinator and is functionally equivalent to the previous implementation. Future work will build upon this to provide bug fixes and enhancements.	2022-08-22 18:38:49 -04:00
Seth Hoenig	b2861f2a9b	client: add support for checks in nomad services This PR adds support for specifying checks in services registered to the built-in nomad service provider. Currently only HTTP and TCP checks are supported, though more types could be added later.	2022-07-12 17:09:50 -05:00
Seth Hoenig	dbcccc7a68	client: enforce max_kill_timeout client configuration This PR fixes a bug where client configuration max_kill_timeout was not being enforced. The feature was introduced in `9f44780` but seems to have been removed during the major drivers refactoring. We can make sure the value is enforced by plumbing it through the DriverHandler, which now uses the lesser of the task.killTimeout or client.maxKillTimeout. Also updates Event.SetKillTimeout to require both the task.killTimeout and client.maxKillTimeout so that we don't make the mistake of using the wrong value - as it was being given only the task.killTimeout before.	2022-07-06 15:29:38 -05:00
Derek Strickland	ec3b7150e4	alloc_runner: stop sidecar tasks last (#13055)	2022-06-07 11:35:19 -04:00
Radek Simko	e2e635a87d	client/allochealth: add healthy_deadline as context to error messages (#13214)	2022-06-06 10:11:08 -04:00
Derek Strickland	b3fb9430bb	Fix client test reconnect test; Remove guard test (#12173) * Update reconnect test to new algorithm and interface; remove guard test	2022-04-05 17:12:23 -04:00
Derek Strickland	35752655b0	disconnected clients: Add reconnect task event (#12133) * Add TaskClientReconnectedEvent constant * Add allocRunner.Reconnect function to manage task state manually * Removes server-side push	2022-04-05 17:12:23 -04:00
James Rasell	d49cf2388a	Merge branch 'main' into f-1.3-boogie-nights	2022-03-23 09:41:25 +01:00
James Rasell	f0be952cb5	client: hookup service wrapper for use within client hooks.	2022-03-21 10:29:57 +01:00
Seth Hoenig	b242957990	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
James Rasell	6e8f32a290	client: refactor common service registration objects from Consul. This commit performs refactoring to pull out common service registration objects into a new `client/serviceregistration` package. This new package will form the base point for all client specific service registration functionality. The Consul specific implementation is not moved as it also includes non-service registration implementations; this reduces the blast radius of the changes as well.	2022-03-15 09:38:30 +01:00
Jasmine Dahilig	b85cce42fe	lifecycle: add poststop hook (#8194)	2020-11-12 08:01:42 -08:00
Jasmine Dahilig	81cad55d40	task lifecycle poststart: code review fixes	2020-08-31 13:22:41 -07:00
Michael Schurter	599b56e054	test: add allocrunner test for poststart hooks	2020-08-12 09:54:14 -07:00
Jasmine Dahilig	9cf4429518	lifecycle: add allocrunner and task hook coordinator unit tests	2020-07-29 12:39:42 -07:00
Mahmood Ali	73f19eb3b8	allocrunner: terminate sidecars in the end This fixes a bug where a batch allocation fails to complete if it has sidecars. If the only remaining running tasks in an allocation are sidecars - we must kill them and mark the allocation as complete.	2020-06-29 15:12:15 -04:00
Mahmood Ali	55db937f16	tests: update AR task restart policy	2020-03-24 17:00:42 -04:00
Jasmine Dahilig	db7e8614f3	remove debugging test code from TestAllocRunner_TaskLeader_StopRestoredTG	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	60671f880d	fix bug in lifecycle restore tests after refactor	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	88d3e232a2	refactor task hook coordinator helper method and tests	2020-03-21 17:52:53 -04:00
Jasmine Dahilig	0031b6777f	clean up restore test	2020-03-21 17:52:52 -04:00
Jasmine Dahilig	aced15ea27	partial test for restore functionality	2020-03-21 17:52:52 -04:00
Drew Bailey	3b033b2ef5	allow only positive shutdown delay more explicit test case, remove select statement	2019-12-16 11:38:30 -05:00
Drew Bailey	672b76056b	shutdown delay for task groups copy struct values ensure groupserviceHook implements RunnerPreKillhook run deregister first test that shutdown times are delayed move magic number into variable	2019-12-16 11:38:16 -05:00
Nick Ethier	387b016ac4	client: improve group service stanza interpolation and check_re… (#6586) * client: improve group service stanza interpolation and check_restart support Interpolation can now be done on group service stanzas. Note that some task runtime specific information that was previously available when the service was registered poststart of a task is no longer available. The check_restart stanza for checks defined on group services will now properly restart the allocation upon check failures if configured.	2019-11-18 13:04:01 -05:00
Mahmood Ali	a80643e46d	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Preetha Appan	7de4018656	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	26652d7a6b	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Preetha Appan	b4ecb448b3	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Michael Schurter	8d409a6e39	client: test logmon cleanup The test is sadly quite complicated and peeks into things (logmon's reattach config) AR doesn't normally have access to. However, I couldn't find another way of asserting logmon got cleaned up without resorting to smaller unit tests. Smaller unit tests risk re-implementing dependencies in an unrealistic way, so I opted for an ugly integration test.	2019-03-04 13:15:15 -08:00
Preetha Appan	ad58ba3e18	More alloc runner tests ported from 0.8.7	2019-02-22 17:58:06 -06:00
Mahmood Ali	8b7f66499f	address review comments	2019-02-22 15:56:14 -05:00
Mahmood Ali	4c30b03879	tests: port TestAllocRunner_RetryArtifact Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672 I changed the test name because it doesn't actually test that artifact hooks is retried	2019-02-22 15:50:39 -05:00
Mahmood Ali	69906bade4	tests: port TestAllocRunner_MoveAllocDir test	2019-02-22 15:50:39 -05:00
Michael Schurter	159266ccec	tests: port TestAllocRunner_Destroy from 0.8 Also add destroy(ar) helper to fix a bunch of shutdown races in AR tests.	2019-02-20 12:35:09 -08:00
Michael Schurter	7b8ec414a3	client: fix setting alloc unhealthy at deadline During the 0.9 client refactor the code to fail a deployment when the deadline was reached was broken. This restores and tests that behavior.	2019-02-19 07:44:14 -08:00
Michael Schurter	7445e418ca	test: port some pre-0.9 DeploymentHealth tests Skipping a failing one as I need to move to some other work and don't want to leave this work orphaned on my machine.	2019-01-14 09:56:53 -08:00
Alex Dadgar	296141bb58	Merge pull request #5002 from hashicorp/b-task-config-resources Convert driver resource to AllocatedTaskResource	2018-12-18 16:46:34 -08:00
Alex Dadgar	517bf1c35f	Fix unit tests + upgrade pathing resources	2018-12-18 15:50:44 -08:00
Danielle Tomlinson	502f36335e	allocrunner: Drop and log updates after closing waitCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	69fc73767a	allocrunner: Handle updates asynchronously This creates a new buffered channel and goroutine on the allocrunner for serializing updates to allocations. This allows us to take updates off the routine that is used for processing updates from the server, without having complicated machinery for tracking update lifetimes, or other external synchronization. This results in a nice performance improvement and significantly better throughput on batch changes such as preempting a large number of jobs for a larger placement.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	800bd57333	allocrunner: Async shutdown and destroy This commit reduces the locking required to shutdown or destroy allocrunners, and allows parallel shutdown and destroy of allocrunners during shutdown.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	62ac40ab09	allocrunner: Basic test alloc runner	2018-12-06 12:28:23 +01:00
Alex Dadgar	429c5bb885	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	31f113ba4d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00

62 Commits