Commit Graph

14885 Commits

Author SHA1 Message Date
Michael Lange
5aa938e121 Test coverage for preemption on the allocation detail page 2019-04-22 16:40:09 -07:00
Michael Lange
c7e1598ed3 Preemption modeling as page objects 2019-04-22 16:40:08 -07:00
Michael Lange
d4ae0a2819 Integration test for the alloc row icon 2019-04-22 16:40:07 -07:00
Michael Lange
4c773a1f3c Add preemption properties to Mirage allocation factory 2019-04-22 16:40:07 -07:00
Michael Lange
4752950cae Show which allocations an allocation preempted on the alloc page 2019-04-22 16:40:06 -07:00
Michael Lange
400deae4ce Show which alloc, if any, preempted an alloc on the alloc detail page 2019-04-22 16:40:05 -07:00
Michael Lange
7ae2081282 Preemptions count and filtering on client detail page
Show the count in the allocations table next to the existing total alloc
count badge. Clicking either will filter by all or by preemptions.
2019-04-22 16:40:04 -07:00
Michael Lange
a33b105181 Add preempted icon to alloc row 2019-04-22 16:40:04 -07:00
Michael Lange
dca386ca70 Make sure tooltips show up over the top of the side bar 2019-04-22 16:40:03 -07:00
Michael Lange
384a0e5a54 Add wasPreempted bool to allocs 2019-04-22 16:40:02 -07:00
Michael Lange
c456c5eed0 Show preemptions on the job plan phase of job submission 2019-04-22 16:40:01 -07:00
Michael Lange
cf1d4a3a1e Data modeling for preemptions 2019-04-22 16:40:00 -07:00
Chris Baker
09c998a4a1 Merge pull request #5591 from hashicorp/cgbaker/changelog
changelog: added entry for #5540 fix
2019-04-22 15:31:22 -04:00
Michael Schurter
95bc6fe301 Merge pull request #5586 from hashicorp/docs-deploy-ver
docs: bump deployment guide to 0.9.0
2019-04-22 12:29:22 -07:00
Chris Baker
184e171e11 changelog: added entry for #5540 fix 2019-04-22 19:27:40 +00:00
Chris Baker
7b4ac71d2f Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali
151e0ae772 Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter
0f91277d85 tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker
7d8fa4c045 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Lang Martin
f5c621979e tests over setwise equality of fingerprinted parts 2019-04-19 15:49:24 -04:00
Michael Schurter
a3e8f51643 docs: bump deployment guide to 0.9.0 2019-04-19 12:39:38 -07:00
Lang Martin
5c7e10e0b9 structs need to keep assert Equal interface implementation for tests 2019-04-19 15:23:49 -04:00
Lang Martin
228a7d6124 structs equals use labeled continue for clarity 2019-04-19 15:23:48 -04:00
Lang Martin
3e1c6ac890 struct equals use a working pattern for setwise comparison 2019-04-19 15:23:48 -04:00
Lang Martin
583ae3722c client fingerprinter doesn't overwrite manual configuration
Revert "Revert accidental merge of pr #5482"
This reverts commit c45652ab8c.
2019-04-19 15:23:48 -04:00
Michael Schurter
8a0df4034d Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali
54e1e0760b Merge pull request #5437 from hashicorp/r-upstream-libcontainer-plain
Use upstream libcontainer package
2019-04-19 10:15:13 -04:00
Mahmood Ali
6747195682 comment on using init() for libcontainer handling 2019-04-19 09:49:04 -04:00
Mahmood Ali
9bf54eae97 comment what refer to 2019-04-19 09:49:04 -04:00
Mahmood Ali
b6af5c9dca Move libcontainer helper to executor package 2019-04-19 09:49:04 -04:00
Mahmood Ali
0088f40fd4 vendor upstream opencontainers/runc 2019-04-19 09:49:04 -04:00
Mahmood Ali
9050f5f611 Merge pull request #5585 from hashicorp/b-drivers-node-registration
client: wait for batched driver updates before registering nodes
2019-04-19 09:47:21 -04:00
Mahmood Ali
8041b0cbe2 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali
9a2f46f332 client: log detected driver health state
Noticed that the `detected drivers` log line was misleading: when a driver
doesn't fingerprint before the timeout, its health status is the empty string
`""`, which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Mahmood Ali
9dcebcd8a3 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` almost immediately after
`registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5
seconds.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register node for new node events, not for
initial registration.
2019-04-19 09:12:50 -04:00
Preetha
92a4033a1a Update CHANGELOG.md 2019-04-19 08:02:48 -05:00
Mahmood Ali
7a68d76160 client: wait for batched driver updates
Here we retain the 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for a node occurs at node registration, and the
race might mean that a system job does not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but this raises it to 1 minute, which
is closer to blocking indefinitely than 1 second is.  We need to keep the value
high enough to capture as many drivers/devices as possible, but low enough that
it doesn't risk blocking too long due to a misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
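The wait-with-timeout behaviour described in the commit above can be sketched roughly as follows. This is an illustration only, not Nomad's client code: `waitForDrivers`, its channel of driver names, and the millisecond timeout parameter are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// waitForDrivers is a hypothetical helper (not part of Nomad). It collects
// driver names from updates until `expected` drivers have reported or the
// timeout elapses, whichever comes first, and returns what it has seen.
func waitForDrivers(updates <-chan string, expected int, timeoutMs int) []string {
	seen := make([]string, 0, expected)
	// A single deadline covers the whole wait, not each individual update.
	deadline := time.After(time.Duration(timeoutMs) * time.Millisecond)

	for len(seen) < expected {
		select {
		case name, ok := <-updates:
			if !ok {
				// The channel was closed: no more drivers will report.
				return seen
			}
			seen = append(seen, name)
		case <-deadline:
			// Give up waiting so a misbehaving plugin cannot block forever.
			return seen
		}
	}
	return seen
}

func main() {
	updates := make(chan string, 2)
	updates <- "docker"
	updates <- "exec"

	// Both drivers have already reported, so this returns without waiting.
	fmt.Println(waitForDrivers(updates, 2, 1000))
}
```

The trade-off named in the commit message shows up in the timeout argument: a larger value gives slow drivers more time to report before registration, while a smaller one bounds how long a stuck plugin can delay it.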
Yorick Gersie
77a8fda87c fix nil pointer in fingerprinting AWS env leading to crash
The HTTP client returns a nil response if an error has occurred. We first
need to check for an error before being able to check the HTTP response
code.
2019-04-19 11:07:13 +02:00
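The ordering described in the commit above (check the error first, then read the response) looks roughly like this in Go. It is a minimal sketch, not the fingerprinter's actual code: `fetchStatus` and the two-second timeout are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchStatus is a hypothetical helper (not Nomad's AWS fingerprinter).
// It returns the HTTP status code for url, or an error if the request failed.
func fetchStatus(url string) (int, error) {
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		// On a failed request resp is typically nil, so reading
		// resp.StatusCode before this check would panic.
		return 0, err
	}
	defer resp.Body.Close()

	return resp.StatusCode, nil
}

func main() {
	// An unsupported scheme makes the request fail without network access.
	if _, err := fetchStatus("bogus://metadata"); err != nil {
		fmt.Println("request failed:", err)
	}
}
```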
Preetha
83a2e693b7 Merge pull request #5580 from hashicorp/f-api-preemption-info
Add preemption related fields to AllocationListStub
2019-04-18 18:38:25 -07:00
Preetha Appan
ad77c18c87 Add preemption related fields to AllocationListStub 2019-04-18 10:36:44 -05:00
Danielle
11388ab992 Merge pull request #5572 from hashicorp/dani/b-docker-volumes
Switch to pre-0.9 behaviour for handling volumes
2019-04-18 15:48:23 +02:00
Danielle
4789948ba8 Merge pull request #5573 from hashicorp/dani/update-vol-docs
docs: Clarify docker volume behaviour
2019-04-18 14:30:16 +02:00
Danielle Lancashire
ccce364cbd Switch to pre-0.9 behaviour for handling volumes
In Nomad 0.9, we made volume driver handling the same for `""` and
`"local"` volumes. Prior to Nomad 0.9, however, these had slightly different
behaviour for relative paths and named volumes.

Prior to 0.9 the empty string would expand relative paths within the task
dir, and `"local"` volumes that are not absolute paths would be treated
as docker named volumes.

This commit reverts to the previous behaviour as follows:

| Nomad Version | Driver  |   Volume Spec    | Behaviour                 |
|---------------|---------|------------------|---------------------------|
| all           | ""      | testing:/testing | allocdir/testing          |
| 0.8.7         | "local" | testing:/testing | "testing" as named volume |
| 0.9.0         | "local" | testing:/testing | allocdir/testing          |
| 0.9.1         | "local" | testing:/testing | "testing" as named volume |
2019-04-18 14:28:45 +02:00
Danielle Lancashire
269e2c00fb logging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment separately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Chris Baker
15c64875d1 Merge pull request #5559 from ArangoGutierrez/website_docs_singularity
list singularity as a community driver
2019-04-17 12:42:29 -04:00
Charlie Voiselle
4a0da839a9 fixed header level 2019-04-17 10:12:43 -04:00
Danielle Lancashire
acf8ab8665 docs: Clarify docker volume behaviour 2019-04-17 11:31:55 +02:00
Mahmood Ali
c07b72959d Merge pull request #5568 from hashicorp/b-nomad-logger-restart
Fixes #5566 .

Fix a case where docker logging process may lock up nomad agent restart.

Looks like we have a case where the docker logger is started even though logmon isn't. In such a case, the fifo writer blocks indefinitely, and because the open operation happens in the main goroutine, the nomad agent blocks indefinitely.

This fixes the issue by performing the fifo open operation in a separate goroutine instead of the main goroutine.

We should follow up independently to ensure logmon <-> dockerlogger ordering and consider having task recovery happen in non-main goroutine with some sensible timeouts.
2019-04-16 19:34:37 -04:00
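The idea in the commit above, keeping a potentially blocking open off the main goroutine, can be sketched as follows. This is not the docker logger's code: `openWithTimeout` and `ErrOpenTimeout` are hypothetical, and the timeout is an addition that illustrates the follow-up the message suggests rather than something the commit itself introduces.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrOpenTimeout is a hypothetical sentinel error for this sketch.
var ErrOpenTimeout = errors.New("open timed out")

// openWithTimeout is a hypothetical helper (not Nomad's implementation).
// It runs a potentially blocking open call in its own goroutine so the
// caller is never stuck on it, and gives up after timeoutMs milliseconds.
func openWithTimeout(open func() error, timeoutMs int) error {
	// Buffered so the goroutine can still send and exit after a timeout.
	done := make(chan error, 1)
	go func() { done <- open() }()

	select {
	case err := <-done:
		return err
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		return ErrOpenTimeout
	}
}

func main() {
	// Simulates a fifo open with no peer on the other end: it never returns.
	never := make(chan struct{})
	err := openWithTimeout(func() error { <-never; return nil }, 50)
	fmt.Println("result:", err)
}
```

Note that a timed-out open leaves its goroutine blocked in the background; the caller is unblocked, but the underlying call is not cancelled.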
Eduardo Arango
9f97da0956 resolve merge conflicts
Signed-off-by: Eduardo Arango <eduardo@sylabs.io>
2019-04-16 17:01:22 -05:00
Eduardo Arango
bd0d641a5e address @cgbaker comments
Signed-off-by: Eduardo Arango <eduardo@sylabs.io>
2019-04-16 16:59:59 -05:00