Commit Graph

3693 Commits

Author SHA1 Message Date
Chris Baker
7b4ac71d2f Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali
151e0ae772 Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter
0f91277d85 tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker
7d8fa4c045 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Michael Schurter
8a0df4034d Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali
8041b0cbe2 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali
9a2f46f332 client: log detected driver health state
Noticed that `detected drivers` log line was misleading - when a driver
doesn't fingerprint before timeout, their health status is empty string
`""` which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Mahmood Ali
9dcebcd8a3 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` almost immediately after
`registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5
seconds.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register node for new node events, not for
initial registration.
2019-04-19 09:12:50 -04:00
Mahmood Ali
7a68d76160 client: wait for batched driver updated
Here we retain 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for node occur at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but raising it to 1 minute as it's
closer to indefinitely blocked than 1 second.  We need to keep the value
high enough to capture as much drivers/devices, but low enough that
doesn't risk blocking too long due to misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
Yorick Gersie
77a8fda87c fix nil pointer in fingerprinting AWS env leading to crash
HTTP Client returns a nil response if an error has occured. We first
  need to check for an error before being able to check the HTTP response
  code.
2019-04-19 11:07:13 +02:00
Danielle Lancashire
269e2c00fb loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter
b135d28450 vault: fix data races 2019-04-16 11:22:44 -07:00
Michael Schurter
0e6da17a8f vault: fix renewal time
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)

This commit fixes the renewal time by calculating it as:

	(lease/2) +/- 10s

For a lease of 60s this means the renewal will occur in [20s, 40s).
2019-04-16 11:22:44 -07:00
Michael Schurter
eeb282ca2f Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker
377c1d694b vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Lang Martin
c45652ab8c Revert accidental merge of pr #5482
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f.

Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993.

Revert "structs missed applying a style change from the review"
This reverts commit 658916e327.

Revert "client, structs comments"
This reverts commit be2838d6ba.

Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430.

Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9.

Revert "client Networks Equals is set equality"
This reverts commit f8d432345b.

Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411ca.

Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665e.

Revert "fix client-test, avoid hardwired platform dependecy on lo0"
This reverts commit e89dbb2ab1.

Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6.

Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e.

Revert "refactor struts.RequestedDevice to have its own Equals"
This reverts commit 6897825240.

Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77b.

Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb.

Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace52.

Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6.

Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23.
2019-04-11 10:29:40 -04:00
Lang Martin
7857ac5993 client tests assert the independent handling of interface and speed 2019-04-11 09:56:22 -04:00
Lang Martin
be2838d6ba client, structs comments 2019-04-11 09:56:22 -04:00
Lang Martin
fc309cb430 client fingerprint updateNetworks preserves the network configuration 2019-04-11 09:56:22 -04:00
Lang Martin
bc0bf4efb9 client_test cleanup comments from review 2019-04-11 09:56:22 -04:00
Lang Martin
e89dbb2ab1 fix client-test, avoid hardwired platform dependecy on lo0 2019-04-11 09:56:22 -04:00
Lang Martin
a7fed726c6 refactor error in client fingerprint to include the offending data 2019-04-11 09:56:22 -04:00
Lang Martin
84bd433c7e add client updateNodeResources to merge but preserve manual config 2019-04-11 09:56:22 -04:00
Lang Martin
8ee02ddd23 test that fingerprint resources are updated, net not clobbered 2019-04-11 09:56:21 -04:00
Danielle Lancashire
419d70c5f9 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker
2022db72b6 vault client test: minor formatting
vendor: using upstream circonus-gometrics
2019-04-10 10:34:10 -05:00
Chris Baker
312721427d vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker
401c9fdd16 taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker
20a3884559 docs: -vault-namespace, VAULT_NAMESPACE, and config
agent: added VAULT_NAMESPACE env-based configuration
2019-04-10 10:34:10 -05:00
Chris Baker
e09badbe8b client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker
3a28763455 taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Chris Baker
1349497152 config/docs: added namespace to vault config
server/client: process `namespace` config, setting on the instantiated vault client
2019-04-10 10:34:10 -05:00
Michael Schurter
8caa1c5b0d Bump to 0.9.1-dev 2019-04-09 09:01:48 -07:00
Nomad Release bot
18dd59056e Generate files for 0.9.0 release 2019-04-09 01:56:00 +00:00
Michael Schurter
3bad050bf1 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter
b51e9e09fc Remove 0.9.0-rc2 generated files 2019-04-03 07:41:09 -07:00
Nomad Release bot
6a838b8c3b Generate files for 0.9.0-rc2 release 2019-04-03 01:54:29 +00:00
Michael Schurter
800bd848c1 Merge pull request #5504 from hashicorp/b-exec-path
executor/linux: make chroot binary paths absolute
2019-04-02 14:09:50 -07:00
Michael Schurter
21e895e2e7 Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e.
2019-04-02 11:17:12 -07:00
Michael Schurter
cb36f4537e executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter
254901a51e executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Mahmood Ali
714c41185c rename fifo methods for clarity 2019-04-01 16:52:58 -04:00
Mahmood Ali
7661857c6e clarify closeDone blocking and field name 2019-04-01 16:10:34 -04:00
Mahmood Ali
3c68c946c4 no requires in a test goroutine 2019-04-01 15:38:39 -04:00
Mahmood Ali
9f1ee37687 log when fifo fails to open 2019-04-01 13:18:03 -04:00
Mahmood Ali
5ca9b6eb37 fifo: Use plain fifo file in Unix
This PR switches to using plain fifo files instead of golang structs
managed by containerd/fifo library.

The library main benefit is management of opening fifo files.  In Linux,
a reader `open()` request would block until a writer opens the file (and
vice-versa).  The library uses goroutines so that it's the first IO
operation that blocks.

This benefit isn't really useful for us: Given that logmon simply
streams output in a separate process, blocking of opening or first read
is effectively the same.

The library additionally makes further complications for managing state
and tracking read/write permission that seems overhead for our use,
compared to using a file directly.

Looking here, I made the following incidental changes:
* document that we do handle if fifo files are already created, as we
rely on that behavior for logmon restarts
* use type system to lock read vs write: currently, fifo library returns
`io.ReadWriteCloser` even if fifo is opened for writing only!
2019-04-01 13:18:03 -04:00
Michael Schurter
6674b270ff Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter
2dbc06de61 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Michael Schurter
a77f769f1e Bump to dev post-0.9.0-rc1 release 2019-03-22 08:26:30 -07:00
Nomad Release bot
7c00ab4f3f Generate files for 0.9.0-rc1 release 2019-03-21 19:06:13 +00:00