Commit Graph

268 Commits

Author SHA1 Message Date
Chris Baker
03500494b2 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker
7050e14eb5 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker
240d68765c metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Danielle
f6757ff106 Merge pull request #5821 from hashicorp/dani/b-5770
trhooks: Add TaskStopHook interface to services
2019-06-12 17:30:49 +02:00
Danielle Lancashire
1aa6bc80d8 trt: Fix test 2019-06-12 17:06:11 +02:00
Danielle Lancashire
5933528404 trhooks: Add TaskStopHook interface to services
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
2019-06-12 16:00:21 +02:00
Mahmood Ali
7a01a96cc2 Fallback to alloc.TaskResources for old allocs
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.

Fixes https://github.com/hashicorp/nomad/issues/5810
2019-06-11 10:32:53 -04:00
Mahmood Ali
41a7fe8530 client/allocrunner: depend on internal task state
Alloc runner already tracks tasks associated with alloc.  Here, we
become defensive by relying on the alloc runner tracked tasks, rather
than depend on server never updating the job unexpectedly.
2019-06-10 18:42:51 -04:00
Mahmood Ali
6ccfd47cdd Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Mahmood Ali
d9ac7c25d5 Merge pull request #5737 from fwkz/fix-restart-attempts
Fix restart attempts of `restart` stanza in `delay` mode.
2019-06-05 19:05:07 -04:00
Danielle Lancashire
92527c6b4e client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali
d06aeab087 tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal
Given that Signal may be called multiple times, blocking for `SignalCh`
isn't sufficient to synchornizing access to Signals field.
2019-05-21 13:56:58 -04:00
Mahmood Ali
af48f7d3dc client: synchronize access to ar.alloc
`allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's
use `allocRunner.Alloc()` helper function to access it.
2019-05-21 09:55:05 -04:00
fwkz
e2d793cfd9 Fix restart attempts of restart stanza.
Number of restarts during 2nd interval is off by one.
2019-05-21 13:27:19 +02:00
Michael Schurter
abd809d60a docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter
796c05b9b8 client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter
6a2792ad90 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
Michael Schurter
e7042b674b client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter
3c19af4bfd client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Mahmood Ali
5abbee5d39 Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base
nomad exec part 1: plumbing and docker driver
2019-05-09 18:09:27 -04:00
Mahmood Ali
979a6a1778 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Chris Baker
4b54e27841 stale allocation data leads to incorrect (and even negative) metrics (#5637)
* client: was not using up-to-date client state in determining which alloc count towards allocated resources

* Update client/client.go

Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
2019-05-07 15:54:36 -04:00
Michael Schurter
5c43a16b03 Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:01:30 -05:00
Michael Schurter
96d69022df Remove unnecessary boolean clause
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:00:17 -05:00
Preetha Appan
b4ecb448b3 Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Danielle
91fa55f4cc Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Mahmood Ali
a321901ad8 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali
658a734912 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali
1f1551a4ae add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali
ba373fee2a try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali
978fc65a2b add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali
b21849cb02 logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Danielle Lancashire
023d0dff31 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Michael Schurter
0f91277d85 tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire
269e2c00fb loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter
eeb282ca2f Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker
377c1d694b vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Danielle Lancashire
419d70c5f9 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker
312721427d vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker
401c9fdd16 taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker
e09badbe8b client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker
3a28763455 taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Michael Schurter
3bad050bf1 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter
21e895e2e7 Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e.
2019-04-02 11:17:12 -07:00
Michael Schurter
cb36f4537e executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter
254901a51e executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Michael Schurter
6674b270ff Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter
2dbc06de61 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Nick Ethier
c62f9a0f58 logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier
a28a67d263 logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00