nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-07 19:05:42 +03:00

Author	SHA1	Message	Date
Tim Gross	047b93c253	client: don't run alloc postrun during shutdown	2019-09-25 14:58:17 -04:00
Michael Schurter	aa60b03d7b	client: reword error message	2019-09-04 12:40:09 -07:00
Mahmood Ali	ff3dedd534	Write to client store while holding lock Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.	2019-08-26 13:45:58 -04:00
Mahmood Ali	a80643e46d	Don't persist allocs of destroyed alloc runners This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information non-transactionally and non-atomically, concurrently while the alloc runner is running and potentially changing state, is a recipe for bugs. Fixes https://github.com/hashicorp/nomad/issues/5984 Related to https://github.com/hashicorp/nomad/pull/5890	2019-08-25 11:21:28 -04:00
Nick Ethier	dc08ec8783	ar: plumb client config for networking into the network hook	2019-07-31 01:04:06 -04:00
Nick Ethier	35de444e9b	ar: plumb error handling into alloc runner hook initialization	2019-07-31 01:03:18 -04:00
Nick Ethier	c39e8dca6e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	4a8a96fa1a	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Preetha Appan	26652d7a6b	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Mahmood Ali	380262613d	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks as having failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that's better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840 .	2019-07-02 18:35:47 +08:00
Mahmood Ali	41a7fe8530	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Michael Schurter	6a2792ad90	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	e7042b674b	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Mahmood Ali	5abbee5d39	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	979a6a1778	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Michael Schurter	5c43a16b03	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	96d69022df	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b4ecb448b3	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Danielle Lancashire	023d0dff31	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Danielle Lancashire	419d70c5f9	allocs: Add nomad alloc restart This adds a `nomad alloc restart` command and api that allows a job operator with the alloc-lifecycle acl to perform an in-place restart of a Nomad allocation, or a given subtask.	2019-04-11 14:25:49 +02:00
Michael Schurter	db9daf6631	client: ensure task is cleaned up when terminal This commit is a significant change. TR.Run is now always executed, even for terminal allocations. This was changed to allow TR.Run to clean up (run stop hooks) if a handle was recovered. This is intended to handle the case of Nomad receiving a DesiredStatus=Stop allocation update, persisting it, but crashing before stopping AR/TR. The commit also renames task runner hook data as it was very easy to accidentally set state on Requests instead of Responses using the old field names.	2019-03-01 14:00:23 -08:00
Preetha Appan	80919bf713	Modified destroy failure handling to rely on allocrunner's destroy method Added a unit test with custom statedb implementation that errors, to use to verify destroy errors	2019-01-12 10:37:12 -06:00
Michael Schurter	1ae8261139	client: emit Killing/Killed task events We were just emitting Killed/Terminated events before. In v0.8 we emitted Killing/Killed, but lacked Terminated when explicitly stopping a task. This change makes it so Terminated is always included, whether explicitly stopping a task or it exiting on its own. New output: 2019-01-04T14:58:51-08:00 Killed Task successfully killed 2019-01-04T14:58:51-08:00 Terminated Exit Code: 130, Signal: 2 2019-01-04T14:58:51-08:00 Killing Sent interrupt 2019-01-04T14:58:51-08:00 Leader Task Dead Leader Task in Group dead 2019-01-04T14:58:49-08:00 Started Task started by client 2019-01-04T14:58:49-08:00 Task Setup Building Task Directory 2019-01-04T14:58:49-08:00 Received Task received by client Old (v0.8.6) output: 2019-01-04T22:14:54Z Killed Task successfully killed 2019-01-04T22:14:54Z Killing Sent interrupt. Waiting 5s before force killing 2019-01-04T22:14:54Z Leader Task Dead Leader Task in Group dead 2019-01-04T22:14:53Z Started Task started by client 2019-01-04T22:14:53Z Task Setup Building Task Directory 2019-01-04T22:14:53Z Received Task received by client	2019-01-08 07:20:54 -08:00
Danielle Tomlinson	acf2c524f3	allocrunner: Standardised discard logs Follow up from https://github.com/hashicorp/nomad/pull/5007#pullrequestreview-186739124	2019-01-03 14:04:31 +01:00
Michael Schurter	784706a1e5	client/state: support upgrading from 0.8->0.9 Also persist and load DeploymentStatus to avoid rechecking health after client restarts.	2018-12-19 10:39:27 -08:00
Danielle Tomlinson	b63095db60	allocrunner: Close updates routine correctly	2018-12-19 18:32:51 +01:00
Nick Ethier	6951ca487d	drivermanager: use allocID and task name to route task events	2018-12-18 23:01:51 -05:00
Nick Ethier	39ca1b00dd	client/drivermanager: add driver manager The driver manager is modeled after the device manager and is started by the client. It's responsible for handling driver lifecycle and reattachment state, as well as processing the incoming fingerprint and task events from each driver. The manager exposes a method for registering event handlers for task events that is used by the task runner to update the server when a task has been updated with an event. Since driver fingerprinting has been implemented by the driver manager, it is no longer needed in the fingerprint manager and has been removed.	2018-12-18 22:55:18 -05:00
Danielle Tomlinson	502f36335e	allocrunner: Drop and log updates after closing waitCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	5464a9565a	allocrunner: Documentation for ShutdownCh/DestroyCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	9f1b53f2a8	fixup: Log when we detect out of order updates	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	69fc73767a	allocrunner: Handle updates asynchronously This creates a new buffered channel and goroutine on the allocrunner for serializing updates to allocations. This allows us to take updates off the routine that is used for processing updates from the server, without having complicated machinery for tracking update lifetimes, or other external synchronization. This results in a nice performance improvement and significantly better throughput on batch changes such as preempting a large number of jobs for a larger placement.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	800bd57333	allocrunner: Async shutdown and destroy This commit reduces the locking required to shutdown or destroy allocrunners, and allows parallel shutdown and destroy of allocrunners during shutdown.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	d44d4b57de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	a4cf83d00c	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there's no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00
Alex Dadgar	429c5bb885	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	6d49163b12	client: emit last sent alloc to new listeners Fixes a deadlock where the allocwatcher would block forever waiting for an update from a terminal alloc. Made the broadcaster easier to debug as well.	2018-11-27 14:06:08 -08:00
Michael Schurter	5d6d4bf290	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	134c04744e	client/ar: remove useless wait ch from runTasks Arguably this makes task.WaitCh() useless, but I think exposing a wait chan from TaskRunners is a generically useful API.	2018-11-26 12:51:18 -08:00
Michael Schurter	021c0cc4bf	client: document how AR/TR Run methods behave	2018-11-26 12:50:35 -08:00
Michael Schurter	31f113ba4d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	58cbafe913	Populate alloc stats API with device stats This change makes a few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is called rarely generally (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatenation, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Michael Schurter	e58a91b701	client: update alloc status when terminating Defensively update alloc status whenever killing all tasks.	2018-11-05 15:11:10 -08:00
Michael Schurter	740ca8e6ca	client: fix tr lifecycle logic and shutdown delay ShutdownDelay must be honored whenever the task is killed or restarted. Services were not being deregistered prior to restarting.	2018-11-05 12:32:05 -08:00
Michael Schurter	9b82025608	client: do not run terminal allocs	2018-11-05 12:32:05 -08:00
Michael Schurter	fdbe446ea6	client: first pass at implementing task restoring Task restoring works but dead tasks may be restarted	2018-11-05 12:32:05 -08:00
Michael Schurter	d71e7666bd	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Michael Schurter	2aed3e8527	ar: refactor task killing into 1 method Update comments and address some PR comments from #4775	2018-10-17 10:06:59 -07:00
Michael Schurter	2417ec5621	ar: fix task leader, update, and stop handling	2018-10-17 10:06:59 -07:00
Nick Ethier	4f9522dd54	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
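
Commit 69fc73767a above describes serializing allocation updates through a buffered channel drained by a dedicated goroutine. The following is a minimal sketch of that general pattern, not the actual allocrunner code: the `Alloc` and `Runner` types, their fields, and the method names are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Alloc is a hypothetical stand-in for an allocation update.
type Alloc struct {
	ID          string
	ModifyIndex uint64
}

// Runner is a hypothetical, heavily simplified alloc runner.
type Runner struct {
	updateCh chan *Alloc   // buffered queue of pending updates
	doneCh   chan struct{} // closed once the handler goroutine exits

	mu      sync.Mutex
	applied []uint64 // ModifyIndex of each applied update, in order
}

// NewRunner starts the single goroutine that applies updates.
func NewRunner(buffer int) *Runner {
	r := &Runner{
		updateCh: make(chan *Alloc, buffer),
		doneCh:   make(chan struct{}),
	}
	go r.handleUpdates()
	return r
}

// Update enqueues an update. The caller blocks only if the buffer is full,
// so the goroutine receiving updates from the server is rarely held up.
func (r *Runner) Update(a *Alloc) { r.updateCh <- a }

// Close stops accepting updates and waits for queued ones to be applied.
func (r *Runner) Close() {
	close(r.updateCh)
	<-r.doneCh
}

// handleUpdates applies updates one at a time, in the order they arrived.
func (r *Runner) handleUpdates() {
	defer close(r.doneCh)
	for a := range r.updateCh {
		r.mu.Lock()
		r.applied = append(r.applied, a.ModifyIndex)
		r.mu.Unlock()
	}
}

// Applied returns a copy of the applied indexes.
func (r *Runner) Applied() []uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]uint64(nil), r.applied...)
}

func main() {
	r := NewRunner(4)
	for i := uint64(1); i <= 3; i++ {
		r.Update(&Alloc{ID: "example", ModifyIndex: i})
	}
	r.Close()
	fmt.Println(r.Applied()) // prints [1 2 3]
}
```

Because a single goroutine drains the channel, updates are applied in arrival order without any further locking between callers; the buffer size trades memory for how long a burst of updates can be absorbed before `Update` blocks.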

60 Commits