Commit Graph

181 Commits

Author SHA1 Message Date
Michael Schurter
a05862dbdf Destroy partially migrated alloc dirs
The test that snapshot errors don't return a valid tar currently fails.
2017-11-29 17:26:11 -08:00
Michael Schurter
0de0e1d342 Handle leader task being dead in RestoreState
Fixes the panic mentioned in
https://github.com/hashicorp/nomad/issues/3420#issuecomment-341666932

While a leader task dying serially stops all follower tasks, the
synchronizing of state is asynchronous. Nomad can shut down before all
follower tasks have updated their state to dead, thus saving the state
necessary to hit this panic: *have a non-terminal alloc with a dead
leader.*

The actual fix is a simple nil check so that we no longer assume a
non-terminal alloc's leader has a TaskRunner.
2017-11-15 10:36:13 -08:00
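
The nil check described in this commit can be illustrated with a minimal Go sketch. It is not the actual Nomad code: the `AllocRunner` and `TaskRunner` types here are simplified stand-ins, and the `leaderRunner` helper and its field names are assumptions made for the example.

```go
package main

import "fmt"

// TaskRunner is a hypothetical stand-in for the runner that manages one task.
type TaskRunner struct{ Name string }

// AllocRunner is a simplified, hypothetical model: tasks maps task names to
// runners, and a task that was already dead has no runner after a restore.
type AllocRunner struct {
	leader string
	tasks  map[string]*TaskRunner
}

// leaderRunner returns the leader's TaskRunner, or false when the leader is
// dead and therefore has no runner. Callers must check the boolean instead of
// assuming a non-terminal alloc always has a live leader.
func (ar *AllocRunner) leaderRunner() (*TaskRunner, bool) {
	tr := ar.tasks[ar.leader]
	if tr == nil {
		return nil, false
	}
	return tr, true
}

func main() {
	// The leader "web" died before shutdown; only a follower was restored.
	ar := &AllocRunner{
		leader: "web",
		tasks:  map[string]*TaskRunner{"logger": {Name: "logger"}},
	}
	if _, ok := ar.leaderRunner(); !ok {
		fmt.Println("leader is dead; skipping leader-specific restore work")
	}
}
```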
Alex Dadgar
c15f49ae8d Alloc Runner doesn't panic on restoration. 2017-11-02 16:14:13 -07:00
Diptanu Choudhury
5d36408475 Added the node_id as a tag 2017-11-02 13:29:10 -07:00
Diptanu Choudhury
103ff5526e Added support for tagged metrics 2017-11-02 10:07:57 -07:00
Diptanu Choudhury
9593e12672 Incrementing the start counter when we are actually starting a container 2017-11-02 09:51:20 -07:00
Diptanu Choudhury
0bade76fd5 Recording counter for dead allocs properly 2017-11-02 09:51:20 -07:00
Diptanu Choudhury
45583d757e Added metrics to track task/alloc start/restarts/dead events 2017-11-02 09:51:20 -07:00
Michael Schurter
fb3a780b7a Trigger GCs after alloc changes
GC much more aggressively by triggering GCs when allocations become
terminal as well as after new allocations are added.
2017-11-01 15:16:38 -05:00
Michael Schurter
9c1e595e2e Fix GC'd alloc tracking
The Client.allocs map now contains all AllocRunners again, not just
un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs
allocs.

Also stops logging "marked for GC" twice.
2017-11-01 15:16:38 -05:00
Alex Dadgar
a9e3a41407 Enable more linters 2017-09-26 15:26:33 -07:00
Michael Schurter
ebbf87f979 Use existing restart policy infrastructure 2017-09-14 16:46:54 -07:00
Alex Dadgar
c26ecb7092 Add version package
This PR adds a version package and consolidates version strings into a
Version struct.
2017-08-16 15:44:21 -07:00
Michael Schurter
85b9dd9cce Move migrating state into prevAllocWatcher 2017-08-14 16:02:28 -07:00
Michael Schurter
537d0e5ab5 Soft fail on migration errors 2017-08-11 16:50:30 -07:00
Michael Schurter
113d8e3667 Set failed status instead of panic'ing
Fix up some TODOs and formatting left from the new prevAllocWatcher code.
2017-08-11 16:21:35 -07:00
Michael Schurter
8c1811911e switch from alloc blocker to new interface
The interface has 3 implementations:

1. local for blocking and moving data locally
2. remote for blocking and moving data from another node
3. noop for allocs that don't need to block
2017-08-11 16:21:35 -07:00
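
One way to picture an interface with local, remote, and noop implementations is the Go sketch below. The interface name `prevAllocWatcher` appears in later commit messages, but the method set, the struct names, and the `newWatcher` selection logic are all assumptions made for illustration, not the real Nomad API.

```go
package main

import "fmt"

// prevAllocWatcher is modeled on the name used in the commit log; the single
// Describe method is a hypothetical placeholder for blocking and migrating.
type prevAllocWatcher interface {
	Describe() string
}

// localWatcher blocks on and moves data from an alloc on the same node.
type localWatcher struct{ allocID string }

func (w localWatcher) Describe() string {
	return "wait for " + w.allocID + " and move its data locally"
}

// remoteWatcher blocks on and moves data from an alloc on another node.
type remoteWatcher struct{ allocID, nodeID string }

func (w remoteWatcher) Describe() string {
	return "wait for " + w.allocID + " and fetch its data from node " + w.nodeID
}

// noopWatcher is used when there is no previous alloc to wait for.
type noopWatcher struct{}

func (noopWatcher) Describe() string { return "nothing to wait for" }

// newWatcher picks an implementation (hypothetical selection rules).
func newWatcher(prevAllocID, prevNodeID, localNodeID string) prevAllocWatcher {
	switch {
	case prevAllocID == "":
		return noopWatcher{}
	case prevNodeID == localNodeID:
		return localWatcher{allocID: prevAllocID}
	default:
		return remoteWatcher{allocID: prevAllocID, nodeID: prevNodeID}
	}
}

func main() {
	fmt.Println(newWatcher("", "", "node-a").Describe())
	fmt.Println(newWatcher("alloc-1", "node-a", "node-a").Describe())
	fmt.Println(newWatcher("alloc-1", "node-b", "node-a").Describe())
}
```

Putting the choice behind a constructor means the caller only ever handles the interface and never branches on where the previous alloc lives.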
Michael Schurter
0f584a0143 initial attempt at refactoring blocked/migrating 2017-08-11 16:21:35 -07:00
Michael Schurter
de15046cd6 Only set alloc status if it's not already terminal 2017-08-11 16:21:35 -07:00
Alex Dadgar
5ad955ef07 Unmount task directories when alloc is terminal
This PR unmounts directories from tasks when the alloc is terminal
rather than when it is garbage collected.

/cc @angrycub
2017-08-10 13:28:17 -07:00
Alex Dadgar
1e7ae913e2 Template emits events explaining why it is blocked
This PR does the following:
* Adds a mechanism to emit events in the TaskRunner
* Vendors a new version of Consul-Template that allows extraction of
missing dependencies
* Adds logic to our consul_template.go to determine missing events and
emit them in a batched fashion.
* Refactors the consul_template code to split the run method and take in
a config struct rather than many parameters.

Fixes https://github.com/hashicorp/nomad/issues/2578
2017-08-09 18:01:27 -07:00
Alex Dadgar
43d2c425d1 Emit generic task events 2017-08-07 21:26:04 -07:00
Michael Schurter
e271c28de2 Merge branch 'master' into fix-pending-state 2017-08-03 17:27:03 -07:00
Michael Schurter
fd1d8a9e1d Don't attempt to restore tasks that never sync'd 2017-07-24 15:58:46 -07:00
Michael Schurter
2569c58cb7 Fix race by not accessing tr.task from ar 2017-07-21 16:16:53 -07:00
Michael Schurter
cf62d02378 Remove unneeded saveTaskRunnerState method
Collapse it into the one place it's called
2017-07-21 16:16:02 -07:00
Alex Dadgar
bb958ba745 Destroy tasks that are part of terminal alloc 2017-07-20 12:02:04 -07:00
Alex Dadgar
ae2ac8ab58 Should not persist state after alloc_runner is garbage collected 2017-07-19 17:31:30 -07:00
Michael Schurter
9150135b50 Use broadcast send retry logic everywhere 2017-07-18 14:36:32 -07:00
Alex Dadgar
4f376d08ed Merge pull request #2853 from hashicorp/b-watcher
Improve alloc health watcher
2017-07-18 14:12:28 -07:00
Alex Dadgar
459ddf63ec Save deployment status 2017-07-18 12:37:52 -07:00
Alex Dadgar
386557da73 Small fixes 2017-07-18 12:19:57 -07:00
Michael Schurter
8c4b760803 Fix deadlock caused by syncing during destroy
When replacing an alloc, the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:

1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to clean up
4. The GC waits for an alloc to stop completely before GC'ing

If the GC chooses the currently-being-destroyed alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen, the GC is wedged
until the Nomad process is restarted.

Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shut down and been destroyed.
2017-07-18 11:12:56 -07:00
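
The shape of this deadlock and of the asynchronous workaround can be reduced to a small Go sketch. Everything here is a simplified model, not Nomad's code: the `runner` type, its channels, and the `sync` and `destroy` methods are hypothetical names chosen for the example.

```go
package main

import "fmt"

// runner is a hypothetical, stripped-down alloc runner.
type runner struct {
	done   chan struct{} // closed once the runner has stopped completely
	synced chan struct{} // closed once the final status sync has finished
}

// sync models a status sync that triggers a GC, where the GC in turn waits
// for this same runner to stop completely before collecting it.
func (r *runner) sync() {
	<-r.done
	close(r.synced)
}

// destroy performs the final sync in its own goroutine. Calling r.sync()
// directly here would block forever, because done is only closed after the
// sync returns.
func (r *runner) destroy() {
	go r.sync()
	close(r.done)
}

func main() {
	r := &runner{done: make(chan struct{}), synced: make(chan struct{})}
	r.destroy()
	<-r.synced
	fmt.Println("final sync completed after the runner shut down")
}
```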
Michael Schurter
dc5ea4acb9 Add AllocRunner.allocID for ease-of-use
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.

Let's make the race detector a little happier by storing the ID in a
single assignment field.
2017-07-17 15:46:54 -07:00
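
A minimal Go sketch of the single-assignment idea follows. The field name `allocID` comes from the commit title; the rest (the `Allocation` fields, `NewAllocRunner`, `Update`, `Status`, and `logPrefix`) is assumed for illustration and does not mirror Nomad's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// Allocation is a hypothetical, trimmed-down alloc.
type Allocation struct {
	ID     string
	Status string
}

// AllocRunner keeps the mutable alloc behind a mutex, but stores the ID in a
// field that is written once at construction and never modified afterwards.
type AllocRunner struct {
	allocID string // single assignment; safe to read without holding mu

	mu    sync.Mutex
	alloc *Allocation // may be replaced by Update; guarded by mu
}

func NewAllocRunner(a *Allocation) *AllocRunner {
	return &AllocRunner{allocID: a.ID, alloc: a}
}

// Update swaps in a new version of the alloc under the lock.
func (ar *AllocRunner) Update(a *Allocation) {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	ar.alloc = a
}

// Status reads mutable state, so it must take the lock.
func (ar *AllocRunner) Status() string {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	return ar.alloc.Status
}

// logPrefix needs only the immutable ID, so no lock is required.
func (ar *AllocRunner) logPrefix() string {
	return "alloc " + ar.allocID + ": "
}

func main() {
	ar := NewAllocRunner(&Allocation{ID: "abc123", Status: "pending"})
	ar.Update(&Allocation{ID: "abc123", Status: "running"})
	fmt.Println(ar.logPrefix() + "status is " + ar.Status())
}
```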
Michael Schurter
802a99749c Fix log level 2017-07-17 15:46:54 -07:00
Michael Schurter
427a0ae1db Don't fail if task dirs don't exist on creation
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
2017-07-17 15:46:54 -07:00
Michael Schurter
12d9e91f65 Ensure allocDir is never nil and persisted safely
Fixes #2834
2017-07-17 15:46:54 -07:00
Michael Schurter
c5c3b3103b Merge branch 'master' into fix-pending-state 2017-07-10 10:43:23 -07:00
unknown
5f9cd4f329 #2563 fixed pending state for allocations with terminal status 2017-07-09 16:18:06 +03:00
Alex Dadgar
3beaafca9a Vet and small improvement on watcher failure detection 2017-07-07 14:53:01 -07:00
Alex Dadgar
e1c631064a @jippi Changed my mind! Good suggestion 2017-07-07 12:12:48 -07:00
Alex Dadgar
f72bbaa370 Client watches for allocation health using task state and Consul checks
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running on time and also based on
the Consul checks passing.
2017-07-07 12:10:04 -07:00
Alex Dadgar
d165f65013 watcher per alloc 2017-07-07 12:07:08 -07:00
Alex Dadgar
da82a6e814 initial watcher 2017-07-07 12:07:08 -07:00
Michael Schurter
89e5971bc7 Merge pull request #2732 from hashicorp/b-persist-alloc-updates
Persist Alloc when EvalID changes
2017-07-03 14:46:43 -07:00
Michael Schurter
11863660a0 Destroy task group leader first
Before this commit all tasks in a task group were destroyed
concurrently. This meant logging sidecars might be stopped before the
leader task whose logs still need to be shipped.

This commit blocks on the leader shutting down before signalling
followers to shut down.
2017-07-03 13:56:56 -07:00
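
The ordering described above (stop the leader, wait, then stop followers) might look roughly like this in Go. The `task` type and the `stopAll` function are hypothetical; the `stop` callback stands in for whatever actually shuts a task down.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical, minimal description of a task in a task group.
type task struct {
	name   string
	leader bool
}

// stopAll stops the leader first and blocks until that call returns, then
// stops all followers concurrently and waits for them to finish.
func stopAll(tasks []task, stop func(name string)) {
	for _, t := range tasks {
		if t.leader {
			stop(t.name) // blocks until the leader has shut down
		}
	}

	var wg sync.WaitGroup
	for _, t := range tasks {
		if t.leader {
			continue
		}
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			stop(name)
		}(t.name)
	}
	wg.Wait()
}

func main() {
	tasks := []task{
		{name: "log-shipper"},
		{name: "web", leader: true},
		{name: "metrics"},
	}
	order := make(chan string, len(tasks))
	stopAll(tasks, func(name string) { order <- name })
	close(order)
	for name := range order {
		fmt.Println("stopped", name)
	}
}
```

The leader is always reported first; the order of the followers is not deterministic because they are stopped concurrently.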
Michael Schurter
bb432213a6 Fix spelling & re-add immutable state struct 2017-06-23 13:01:39 -07:00
Michael Schurter
da4fb1f293 Rename immutable -> alloc
meh; naming is hard
2017-06-23 10:58:36 -07:00
Michael Schurter
38cb0f4af1 Persist Alloc when EvalID changes 2017-06-22 17:33:12 -07:00
Alex Dadgar
aa15ad51d1 Fix nil job on allocation
The alloc_runner copied the alloc by temporarily setting alloc.Job to
nil, copying, and then restoring it. Because the alloc is shared (in
server/client mode and between alloc_runner/task_runner), this created
race conditions that could cause a panic.

Fixes https://github.com/hashicorp/nomad/issues/2605
2017-05-17 14:07:06 -04:00
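
The commit message does not spell out the replacement approach, so the Go sketch below only shows one race-free alternative under stated assumptions: copy the struct value first and clear the Job on the private copy, never on the shared original. The `Allocation` and `Job` types and the `copyWithoutJob` helper are hypothetical.

```go
package main

import "fmt"

// Job and Allocation are hypothetical, trimmed-down versions of the structs.
type Job struct{ Name string }

type Allocation struct {
	ID  string
	Job *Job
}

// copyWithoutJob returns a copy of the alloc with Job unset. The shared
// original is only read, never written, so concurrent readers of a.Job can
// never observe a temporary nil.
func copyWithoutJob(a *Allocation) *Allocation {
	c := *a     // shallow copy of the struct value
	c.Job = nil // clear the field on the private copy only
	return &c
}

func main() {
	shared := &Allocation{ID: "abc123", Job: &Job{Name: "web"}}
	c := copyWithoutJob(shared)
	fmt.Println("copy has job:", c.Job != nil)
	fmt.Println("original still has job:", shared.Job != nil)
}
```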