Commit Graph

156 Commits

Author SHA1 Message Date
Michael Schurter
fd1d8a9e1d Don't attempt to restore tasks that never sync'd 2017-07-24 15:58:46 -07:00
Michael Schurter
2569c58cb7 Fix race by not accessing tr.task from ar 2017-07-21 16:16:53 -07:00
Michael Schurter
cf62d02378 Remove unneeded saveTaskRunnerState method
Collapse it into the one place it's called
2017-07-21 16:16:02 -07:00
Alex Dadgar
bb958ba745 Destroy tasks that are part of terminal alloc 2017-07-20 12:02:04 -07:00
Alex Dadgar
ae2ac8ab58 Should not persist state after alloc_runner is garbage collected 2017-07-19 17:31:30 -07:00
Michael Schurter
9150135b50 Use broadcast send retry logic everywhere 2017-07-18 14:36:32 -07:00
Alex Dadgar
4f376d08ed Merge pull request #2853 from hashicorp/b-watcher
Improve alloc health watcher
2017-07-18 14:12:28 -07:00
Alex Dadgar
459ddf63ec Save deployment status 2017-07-18 12:37:52 -07:00
Alex Dadgar
386557da73 Small fixes 2017-07-18 12:19:57 -07:00
Michael Schurter
8c4b760803 Fix deadlock caused by syncing during destroy
When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:

1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to cleanup
4. The GC waits for an alloc to stop completely before GC'ing

If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.

Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shutdown and been destroyed.
2017-07-18 11:12:56 -07:00
Michael Schurter
dc5ea4acb9 Add AllocRunner.allocID for ease-of-use
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.

Let's make the race detector a little happier by storing the ID in a
single assignment field.
2017-07-17 15:46:54 -07:00
Michael Schurter
802a99749c Fix log level 2017-07-17 15:46:54 -07:00
Michael Schurter
427a0ae1db Don't fail if task dirs don't exist on creation
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
2017-07-17 15:46:54 -07:00
Michael Schurter
12d9e91f65 Ensure allocDir is never nil and persisted safely
Fixes #2834
2017-07-17 15:46:54 -07:00
Alex Dadgar
3beaafca9a Vet and small improvement on watcher failure detection 2017-07-07 14:53:01 -07:00
Alex Dadgar
e1c631064a @jippi Changed my mind! Good suggestion 2017-07-07 12:12:48 -07:00
Alex Dadgar
f72bbaa370 Client watches for allocation health using task state and Consul checks
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running on time and also based on
the consul checks passing.
2017-07-07 12:10:04 -07:00
Alex Dadgar
d165f65013 watcher per alloc 2017-07-07 12:07:08 -07:00
Alex Dadgar
da82a6e814 initial watcher 2017-07-07 12:07:08 -07:00
Michael Schurter
89e5971bc7 Merge pull request #2732 from hashicorp/b-persist-alloc-updates
Persist Alloc when EvalID changes
2017-07-03 14:46:43 -07:00
Michael Schurter
11863660a0 Destroy task group leader first
Before this commit all tasks in a task group were destroyed
concurrently. This meant logging sidecars might be stopped before the
leader task whose logs still need to be shipped.

This commit blocks on the leader shutting down before signalling to
followers to shutdown.
2017-07-03 13:56:56 -07:00
Michael Schurter
bb432213a6 Fix spelling & re-add immutable state struct 2017-06-23 13:01:39 -07:00
Michael Schurter
da4fb1f293 Rename immutable -> alloc
meh; naming is hard
2017-06-23 10:58:36 -07:00
Michael Schurter
38cb0f4af1 Persist Alloc when EvalID changes 2017-06-22 17:33:12 -07:00
Alex Dadgar
aa15ad51d1 Fix nil job on allocation
The way the copying was happening on the alloc_runner was by temporarily
setting the alloc.Job to nil, copying and then restoring it. This
created an issue in which when the alloc was shared (which it is in
server/client mode and between alloc_runner/task_runner) there were race
conditions that could create a panic.

Fixes https://github.com/hashicorp/nomad/issues/2605
2017-05-17 14:07:06 -04:00
Alex Dadgar
997390b04c Fix test 2017-05-09 11:35:48 -07:00
Alex Dadgar
e47be9f771 Merge branch 'master' into f-bolt-db 2017-05-09 11:11:55 -07:00
Alex Dadgar
3f1ccf7278 Respond to comments 2017-05-09 10:50:24 -07:00
Alex Dadgar
098b02d5bf Helpful comment 2017-05-03 11:27:33 -07:00
Alex Dadgar
5952dae3e0 Fix tests 2017-05-03 11:15:30 -07:00
Alex Dadgar
e22393aeb8 Restore state + upgrade path 2017-05-02 18:21:49 -07:00
Alex Dadgar
85a81f47de Revert "metrics"
This reverts commit 4d6a012c6f.
2017-05-02 09:28:11 -07:00
Alex Dadgar
4d6a012c6f metrics 2017-05-01 14:51:27 -07:00
Alex Dadgar
5aa6e18807 Use batching 2017-05-01 14:50:34 -07:00
Alex Dadgar
7614feddbd boltDB database for client state 2017-05-01 14:50:34 -07:00
Alex Dadgar
9def7e1a14 Don't deepcopy job when retrieving copy of Alloc
This PR removes deepcopying of the job attached to the allocation in the
alloc runner. This operation is called very often so removing reflect
from the code path and the potentially large number of mallocs need to
create a job reduced memory and cpu pressure.
2017-05-01 14:50:34 -07:00
Michael Schurter
90f5e232a5 Switch java/exec to use Exec in Executor 2017-04-21 16:25:49 -07:00
Michael Schurter
fe2b3e382d Restart tasks on upgrade with script checks and old executors 2017-04-21 16:25:49 -07:00
Michael Schurter
10cb924b2c Refactor Consul Syncer into new ServiceClient
Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
abritrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.
2017-04-19 12:42:47 -07:00
Alex Dadgar
3018ae07bb Sync allocation state before waiting for a destroy
This change ensures that the client syncs allocation state with the
servers before entering its wait loop for the allocation to be
destroyed.

Fixes https://github.com/hashicorp/nomad/issues/2563
2017-04-14 13:09:54 -07:00
Alex Dadgar
c934063713 FinishedAt only records when the task has actually started 2017-03-31 17:06:05 -07:00
Alex Dadgar
d212f6fe94 Track task start/finish time & improve logs errors
This PR adds tracking to when a task starts and finishes and the logs
API takes advantage of this and returns better errors when asking for
logs that do not exist.
2017-03-31 16:14:11 -07:00
Alex Dadgar
07973793b8 Address comment 2017-03-09 21:05:34 -08:00
Alex Dadgar
4fbe182372 Add metrics to show allocations on the client
This PR adds the following metrics to the client:
client.allocations.migrating
client.allocations.blocked
client.allocations.pending
client.allocations.running
client.allocations.terminal

Also adds some missing fields to the API version of the evaluation.
2017-03-09 12:37:41 -08:00
Alex Dadgar
07f7e19578 Fix vet script and fix vet problems
This PR fixes our vet script and fixes all the missed vet changes.

It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.
2017-02-27 16:00:19 -08:00
Alex Dadgar
edbc84087c Add Leader support to client 2017-02-10 17:55:19 -08:00
Michael Schurter
9bfa8dc2bd Add COMPAT comment 2017-01-06 11:39:17 -08:00
Michael Schurter
e25274b775 Put a logger in AllocDir/TaskDir 2017-01-05 16:31:56 -08:00
Michael Schurter
3011df1c89 Fix upgrade path for #2132
AllocRunner's state dropped the Context struct which needs to be
converted to the new AllocDir+TaskDir structs in RestoreState.

TaskRunner added a TaskDirBuilt flag, but it's safe to just let that
default to `false` and rebuild all task dirs once on upgrade.
2017-01-05 16:31:55 -08:00
Michael Schurter
d87dcce722 Fail fast on taskdir errors 2017-01-05 16:31:55 -08:00