This PR improves how killing a task is handled. Before the kill function
directly orchestrated the killing and was only valid while the task was
running. The new behavior is to mark the desired state and wait for the
task runner to converge to that state.
We were just emitting Killed/Terminated events before. In v0.8 we
emitted Killing/Killed, but lacked Terminated when explicitly stopping
a task. This change makes it so Terminated is always included, whether
explicitly stopping a task or it exiting on its own.
New output:
2019-01-04T14:58:51-08:00 Killed Task successfully killed
2019-01-04T14:58:51-08:00 Terminated Exit Code: 130, Signal: 2
2019-01-04T14:58:51-08:00 Killing Sent interrupt
2019-01-04T14:58:51-08:00 Leader Task Dead Leader Task in Group dead
2019-01-04T14:58:49-08:00 Started Task started by client
2019-01-04T14:58:49-08:00 Task Setup Building Task Directory
2019-01-04T14:58:49-08:00 Received Task received by client
Old (v0.8.6) output:
2019-01-04T22:14:54Z Killed Task successfully killed
2019-01-04T22:14:54Z Killing Sent interrupt. Waiting 5s before force killing
2019-01-04T22:14:54Z Leader Task Dead Leader Task in Group dead
2019-01-04T22:14:53Z Started Task started by client
2019-01-04T22:14:53Z Task Setup Building Task Directory
2019-01-04T22:14:53Z Received Task received by client
Simplify allocDir.Build() function to avoid depending on client/structs,
and remove a parameter that's always set to `false`.
The motivation here is to avoid a dependency cycle between
drivers/cstructs and alloc_dir.
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.
Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations as it is not monotonic, and interacts
with the restart stanza in a users config, so conflates restarts due to
task failures, with restarts due to enviromental changes, such as consul
template or vault secrets changing.
Here we instead use a substring from a uuid, which is more random than
we strictly need, but is nicer than rolling our own random string
generator here.
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources
(from which req.TaskResources is set) is intended to be nil at times or
if the TODO in NewTaskRunner is intended to ensure it is always non-nil.
The old approach was incomplete. Hook env vars are now:
* persisted and restored between agent restarts
* deterministic (LWW if 2 hooks set the same key)
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.
TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
We were incorrectly returning a 0 duration to the taskrunner when
determining when a task should restart. This would cause tasks to be
restarted immediately, ignoring the restart {} stanza in a users
configuration.
This commit causes us to return the restart duration to the task runner
so it may correctly delay further execution.