2410 Commits

Author SHA1 Message Date
Michael Schurter
a96fc052dd Fix tr race by not sharing alloc/task
prestart only needs the original alloc/task so pass their pointers in.
Task updates may concurrently replace the pointer on tr.
2017-07-21 16:17:42 -07:00
Michael Schurter
96baafebd3 Minor test race fix 2017-07-21 16:17:23 -07:00
Michael Schurter
2569c58cb7 Fix race by not accessing tr.task from ar 2017-07-21 16:16:53 -07:00
Michael Schurter
cf62d02378 Remove unneeded saveTaskRunnerState method
Collapse it into the one place it's called
2017-07-21 16:16:02 -07:00
Michael Schurter
a59d3a80ba Fix test race by locking around ar.tasks access 2017-07-21 14:25:51 -07:00
Michael Schurter
96127527a1 Fix handle race 2017-07-21 14:00:32 -07:00
Michael Schurter
a04f5016a5 Fix more test races 2017-07-21 14:00:21 -07:00
Michael Schurter
8fa599c4a5 Fixup a few more even rarer test races 2017-07-21 13:43:32 -07:00
Michael Schurter
55713e2a61 Always interpolate task before calling with Consul
Also switch to returning a copy of the task to avoid races between
altering the Task and persistence.
2017-07-21 13:37:16 -07:00
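The copy-on-read idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not Nomad's code: `Task`, `TaskRunner`, `Copy`, and `TaskCopy` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical, pared-down task definition.
type Task struct {
	Name string
	Env  map[string]string
}

// Copy returns a deep copy so callers can alter it without touching the original.
func (t *Task) Copy() *Task {
	c := &Task{Name: t.Name, Env: make(map[string]string, len(t.Env))}
	for k, v := range t.Env {
		c.Env[k] = v
	}
	return c
}

// TaskRunner is a hypothetical runner that owns the task it persists.
type TaskRunner struct {
	mu   sync.Mutex
	task *Task
}

// TaskCopy hands out a copy under the lock; the runner's own task is never shared.
func (r *TaskRunner) TaskCopy() *Task {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.task.Copy()
}

func main() {
	r := &TaskRunner{task: &Task{Name: "web", Env: map[string]string{"PORT": "8080"}}}
	c := r.TaskCopy()
	c.Env["PORT"] = "9090" // alters only the copy
	fmt.Println(r.task.Env["PORT"], c.Env["PORT"])
}
```

Because the caller only ever holds a copy, code that persists the runner's task cannot observe a half-altered value.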
Michael Schurter
3974dfa98c Fix TestAllocRunner_TaskLeader_StopTG
Also make alloc runner tests less racy. Basically every alloc runner
test used to have races with `upd.{Count,Allocs}`
2017-07-21 13:37:16 -07:00
Alex Dadgar
700147c90e Speed up client startup 2017-07-20 22:34:24 -07:00
Michael Schurter
a7a830a980 Merge pull request #2878 from hashicorp/b-save-state
Fix state handling on restart
2017-07-20 17:16:46 -07:00
Karel Malec
bafe9276ec Pass task group name as NOMAD_GROUP_NAME environment variable 2017-07-21 01:22:54 +02:00
Alex Dadgar
bb958ba745 Destroy tasks that are part of terminal alloc 2017-07-20 12:02:04 -07:00
Michael Schurter
738321efa3 Don't save task runner state if it is destroyed 2017-07-20 10:17:41 -07:00
Alex Dadgar
ae2ac8ab58 Should not persist state after alloc_runner is garbage collected 2017-07-19 17:31:30 -07:00
Michael Schurter
9150135b50 Use broadcast send retry logic everywhere 2017-07-18 14:36:32 -07:00
Alex Dadgar
4f376d08ed Merge pull request #2853 from hashicorp/b-watcher
Improve alloc health watcher
2017-07-18 14:12:28 -07:00
Alex Dadgar
459ddf63ec Save deployment status 2017-07-18 12:37:52 -07:00
Alex Dadgar
386557da73 Small fixes 2017-07-18 12:19:57 -07:00
Michael Schurter
8c4b760803 Fix deadlock caused by syncing during destroy
When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:

1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to clean up
4. The GC waits for an alloc to stop completely before GC'ing

If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.

Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shut down and been destroyed.
2017-07-18 11:12:56 -07:00
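The shape of the deadlock described above, and of the asynchronous workaround, can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual client code: `runner`, `syncStatus`, and `destroy` are hypothetical names, and the GC is reduced to "wait until the runner is destroyed".

```go
package main

import "fmt"

// runner is a hypothetical stand-in for an alloc runner.
type runner struct {
	destroyed chan struct{} // closed once the runner has fully stopped
}

// syncStatus models steps 1-4: syncing triggers a GC, and the GC waits
// for the chosen alloc (here, this same runner) to stop completely.
func (r *runner) syncStatus() {
	<-r.destroyed
}

// destroy performs the final sync asynchronously. Calling r.syncStatus()
// inline at this point would block forever, because r.destroyed is only
// closed further down in this same function.
func (r *runner) destroy() <-chan struct{} {
	synced := make(chan struct{})
	go func() {
		r.syncStatus()
		close(synced)
	}()
	close(r.destroyed) // the runner is now destroyed; the sync can finish
	return synced
}

func main() {
	r := &runner{destroyed: make(chan struct{})}
	<-r.destroy()
	fmt.Println("final sync completed after destroy")
}
```

The returned channel exists only so the example can wait for the background sync; whether the real code tracks completion is not stated in the commit message.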
Michael Schurter
b0eae2f002 Test AllocDir.Copy 2017-07-17 15:46:54 -07:00
Michael Schurter
dc5ea4acb9 Add AllocRunner.allocID for ease-of-use
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.

Let's make the race detector a little happier by storing the ID in a
single assignment field.
2017-07-17 15:46:54 -07:00
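A single-assignment ID field of the kind described above might look like the sketch below. The struct layout and method names (`NewAllocRunner`, `Update`, `LogPrefix`) are assumptions made for illustration; only the idea of storing the ID once at construction comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// Alloc is a hypothetical, pared-down allocation.
type Alloc struct {
	ID     string
	Status string
}

// AllocRunner is a simplified, hypothetical runner.
type AllocRunner struct {
	// allocID is written once in NewAllocRunner and never again,
	// so it can be read without holding allocLock.
	allocID string

	allocLock sync.Mutex
	alloc     *Alloc
}

func NewAllocRunner(alloc *Alloc) *AllocRunner {
	return &AllocRunner{allocID: alloc.ID, alloc: alloc}
}

// Update replaces the mutable alloc pointer under the lock.
func (r *AllocRunner) Update(a *Alloc) {
	r.allocLock.Lock()
	defer r.allocLock.Unlock()
	r.alloc = a
}

// LogPrefix reads only the immutable ID, so it needs no lock.
func (r *AllocRunner) LogPrefix() string {
	return fmt.Sprintf("alloc %q", r.allocID)
}

func main() {
	r := NewAllocRunner(&Alloc{ID: "abc123", Status: "pending"})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		r.Update(&Alloc{ID: "abc123", Status: "running"})
	}()
	fmt.Println(r.LogPrefix()) // safe to call concurrently with Update
	wg.Wait()
}
```

Since `allocID` is never written after construction, concurrent reads from logging paths should not be flagged by the race detector, while `alloc` itself stays guarded by the lock.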
Michael Schurter
802a99749c Fix log level 2017-07-17 15:46:54 -07:00
Michael Schurter
427a0ae1db Don't fail if task dirs don't exist on creation
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
2017-07-17 15:46:54 -07:00
Michael Schurter
12d9e91f65 Ensure allocDir is never nil and persisted safely
Fixes #2834
2017-07-17 15:46:54 -07:00
Alex Dadgar
2b917380eb Fix alloc broadcaster panic on double close 2017-07-17 14:09:05 -07:00
Michael Schurter
56f697580b Fix nil panic in Docker error condition
Fixes #2835

Yet another bug caused by overwriting container and then trying to
reference container.ID in the err handling block. Did a quick audit of
docker.go and it seems to be the last offender. See #2804 for previous
bug.
2017-07-14 10:48:19 -07:00
Alex Dadgar
d06fa455b7 Small fixes 2017-07-07 17:34:50 -07:00
Michael Schurter
0ce0973d3a Merge pull request #2793 from hashicorp/b-2776-ct-vault-servername
Propagate vault.tls_server_name to consul-template
2017-07-07 16:44:19 -07:00
Michael Schurter
ef98449b07 Merge pull request #2787 from hashicorp/f-docker-test-mac
Test #2652 - Docker MAC Address option
2017-07-07 16:22:10 -07:00
Michael Schurter
5ab252fe43 Merge pull request #2797 from hashicorp/f-2785-docker-bridge-ip
Add driver.docker.bridge_ip node attribute
2017-07-07 16:20:20 -07:00
Michael Schurter
024d5a8edc Remove debug logging 2017-07-07 16:19:42 -07:00
Michael Schurter
faa4da7c29 Merge pull request #2804 from hashicorp/b-2802-docker-panic
Don't panic in container list/remove/inspect race
2017-07-07 15:35:51 -07:00
Michael Schurter
c47860f928 Don't panic in container list/remove/inspect race
Fixes #2802

While it's hard to reproduce, the theoretical race is:

1. This goroutine calls ListContainers()
2. Another goroutine removes a container X
3. This goroutine attempts to InspectContainer(X)

However, this bug could be hit in the much simpler case of
InspectContainer() timing out.

In those cases an error is returned and the old code attempted to wrap
the error with the now-nil container.ID. Storing the container ID fixes
that panic.
2017-07-07 15:10:59 -07:00
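The fix described above, saving the container ID before the variable is overwritten, can be illustrated with a small sketch. It deliberately avoids any real Docker client API: `Container`, `inspect`, and `refresh` are hypothetical stand-ins, and `inspect` simply simulates a call that returns a nil container together with an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Container is a hypothetical, pared-down container record.
type Container struct{ ID string }

// inspect is a hypothetical stand-in for a client call that returns
// a nil container alongside an error (removed container, timeout, ...).
func inspect(id string, fail bool) (*Container, error) {
	if fail {
		return nil, errors.New("no such container")
	}
	return &Container{ID: id}, nil
}

// refresh re-inspects a previously listed container.
func refresh(listed *Container, fail bool) (*Container, error) {
	// Store the ID first: after the call below, container may be nil.
	id := listed.ID

	container, err := inspect(id, fail)
	if err != nil {
		// Using container.ID here would dereference nil and panic.
		return nil, fmt.Errorf("failed to inspect container %s: %v", id, err)
	}
	return container, nil
}

func main() {
	_, err := refresh(&Container{ID: "x1"}, true)
	fmt.Println(err)
}
```

The same pattern applies to any error path that needs to name the object whose lookup just failed.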
Alex Dadgar
3beaafca9a Vet and small improvement on watcher failure detection 2017-07-07 14:53:01 -07:00
Alex Dadgar
beb01f1754 test fixes 2017-07-07 14:11:27 -07:00
Alex Dadgar
e1c631064a @jippi Changed my mind! Good suggestion 2017-07-07 12:12:48 -07:00
Alex Dadgar
b8ba29bf93 Warn log 2017-07-07 12:10:04 -07:00
Alex Dadgar
f72bbaa370 Client watches for allocation health using task state and Consul checks
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running on time and also based on
the consul checks passing.
2017-07-07 12:10:04 -07:00
Alex Dadgar
d165f65013 watcher per alloc 2017-07-07 12:07:08 -07:00
Alex Dadgar
8e58ddcceb Update index 2017-07-07 12:07:08 -07:00
Alex Dadgar
da82a6e814 initial watcher 2017-07-07 12:07:08 -07:00
Alex Dadgar
85e0d6fccd assign names 2017-07-07 12:03:11 -07:00
Michael Schurter
38bf16d2ef Add driver.docker.bridge_ip node attribute
Fixes #2785
2017-07-07 10:14:10 -07:00
Michael Schurter
3bf148bc05 Propagate vault.tls_server_name to consul-template
Fixes #2776
2017-07-06 16:56:50 -07:00
Michael Schurter
4f150e1650 Merge pull request #2786 from hashicorp/f-docker-auth-soft-fail
Default to auth hard fail but optionally soft fail
2017-07-06 13:25:56 -07:00
Michael Schurter
eaab2b2eee Test #2652
Also clean up docker config opts docs
2017-07-06 12:46:25 -07:00
Michael Schurter
58186bfb88 Merge branch 'master' into master 2017-07-06 12:09:36 -07:00
Michael Schurter
3aae173432 Default to auth hard fail but optionally soft fail 2017-07-06 11:35:34 -07:00