22967 Commits

Author SHA1 Message Date
Michael Schurter
e7924e35cb client: fix waiting on preempted alloc (#12779)
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high-throughput, batch-scheduling-heavy
systems.

However, the RPC error made no sense! `previous_alloc` in the logs was a
valid 36-character UUID!

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher constructor used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed-in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, which is why the logged ID looked valid and
the error seemed to make no sense!
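
To make the mix-up concrete, here is a minimal sketch of the two constructor variants. The `Alloc`, `watcher`, `newWatcherBuggy`, `newWatcherFixed`, and `lookup` names are hypothetical stand-ins for illustration, not the actual types in `client/allocwatcher`; only the field choice (`PreviousAllocation` vs. `watchedAllocID`) comes from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// Alloc is a hypothetical, trimmed-down stand-in for an allocation.
type Alloc struct {
	ID                 string
	PreviousAllocation string // empty when the alloc preempts others
}

// watcher is a hypothetical stand-in for the remote previous-alloc watcher.
type watcher struct {
	prevAllocID string
}

// newWatcherBuggy mirrors the pre-fix behavior: watchedAllocID is ignored.
func newWatcherBuggy(a Alloc, watchedAllocID string) watcher {
	return watcher{prevAllocID: a.PreviousAllocation}
}

// newWatcherFixed mirrors the fix: the ID passed in by the caller is used.
func newWatcherFixed(a Alloc, watchedAllocID string) watcher {
	return watcher{prevAllocID: watchedAllocID}
}

// lookup imitates a server-side check that rejects IDs that are not
// exactly 36 characters long.
func lookup(id string) error {
	if len(id) != 36 {
		return errors.New("UUID must be 36 characters")
	}
	return nil
}

func main() {
	// The preemptor has no PreviousAllocation; it preempts two allocs.
	preemptor := Alloc{ID: "28000000-0000-0000-0000-000000000000"}
	preempted := []string{
		"8e000000-0000-0000-0000-000000000000",
		"9f000000-0000-0000-0000-000000000000",
	}
	for _, id := range preempted {
		fmt.Println("buggy:", lookup(newWatcherBuggy(preemptor, id).prevAllocID))
		fmt.Println("fixed:", lookup(newWatcherFixed(preemptor, id).prevAllocID))
	}
}
```

In this sketch the buggy variant always queries with an empty ID, so the lookup fails even though the caller supplied a valid one.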

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.

First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
2022-04-26 13:14:43 -07:00
Tim Gross
059c89dff0 E2E: move volume mounts test to use golang's stdlib test runner (#12788)
Part of ongoing work to remove the old E2E framework code.
2022-04-26 14:28:20 -04:00
Tim Gross
26b0e04717 E2E: remove old CLI for driving provisioning (#12787)
We moved off the old provisioning process for nightly E2E to one driven
entirely by Terraform quite a while back now. We're in the slow
process of removing the framework code for this test-by-test, but this
chunk of code no longer has any callers.
2022-04-26 13:43:25 -04:00
Tim Gross
b32722a6a6 CSI: enforce one plugin supervisor loop via sync.Once (#12785)
We enforce exactly one plugin supervisor loop by checking whether
`running` is set and returning early. This works but is fairly
subtle. It can briefly result in two goroutines where one quickly
exits before doing any work. Clarify the intent by using
`sync.Once`. The goroutine we've spawned only exits when the entire
task runner is being torn down, and not when the task driver restarts
the workload, so it should never be re-run.
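
As a rough illustration of the pattern (not the actual plugin supervisor code), the sketch below guards a long-running loop with `sync.Once`. The `supervisor` type and `ensureSupervisorLoop` method are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// supervisor is a hypothetical stand-in for the plugin supervisor hook.
type supervisor struct {
	runOnce sync.Once
	starts  int           // number of loops actually started
	done    chan struct{} // closed when the whole task runner is torn down
}

func newSupervisor() *supervisor {
	return &supervisor{done: make(chan struct{})}
}

// ensureSupervisorLoop may be called on every task (re)start. Only the
// first call spawns the goroutine; later calls are no-ops.
func (s *supervisor) ensureSupervisorLoop() {
	s.runOnce.Do(func() {
		s.starts++
		go func() {
			<-s.done // placeholder for the real supervision work
		}()
	})
}

func main() {
	s := newSupervisor()

	// Simulate several concurrent restarts of the workload.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.ensureSupervisorLoop()
		}()
	}
	wg.Wait()

	fmt.Println("loops started:", s.starts) // prints "loops started: 1"
	close(s.done)
}
```

Unlike a `running` flag check, `Do` never lets a second goroutine start and then exit early, because the second caller never spawns one in the first place.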
2022-04-26 10:38:50 -04:00
Michael Schurter
aeff83b77a api: add ParseHCLOpts helper method (#12777)
The existing ParseHCL func didn't allow setting HCLv1=true.
2022-04-25 11:51:52 -07:00
Tim Gross
3aa520e0bd CSI: plugin config updates should always be destructive (#12774) 2022-04-25 12:59:25 -04:00
Luiz Aoqui
a01e219b1c update LAST_RELEASE comment to match new release branches structure (#12773) 2022-04-25 11:57:55 -04:00
Michael Schurter
e4d6d51035 docs: update json jobs docs (#12766)
* docs: update json jobs docs

Did you know that Nomad has not 1 but 2 JSON formats for jobs? 2½ if you
want to acknowledge that sometimes our JSON job representations have a
Job top-level wrapper and sometimes do not.

The 2½ formats are:
```
 1.   HCL JSON
 2.   Input API JSON (top-level Job field)
 2.5. Output API JSON (lacks top-level Job field)
```

`#2` is what our docs consider our API JSON. `#2.5` seems to be an
accident of history we can't fix without breaking API compatibility.

`#1` is an even more interesting accident of history: the `jobspec2`
package automatically detects if the input to Parse is JSON and switches
to a JSON parser. This behavior is undocumented, the format is
unspecified, and there is no official HashiCorp tooling to produce this
JSON from HCL. The plot thickens when you discover popular third party
tools like hcl2json.com and https://github.com/tmccombs/hcl2json seem to
produce JSON that `nomad run` accepts!

Since we have no telemetry around whether or not anyone passes HCL JSON
to `nomad run`, and people don't file bugs around features that Just
Work, I'm choosing to leave that code path in place and *acknowledged
but not suggested* in documentation.

See https://github.com/hashicorp/hcl/issues/498 for a more comprehensive
discussion of what officially supporting HCL JSON in Nomad would look
like.

(I also added some of the missing fields to the (Input API flavor) JSON
Job documentation, but it still needs a lot of work to be
comprehensive.)

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-04-22 15:57:27 -07:00
Jai
5660d889a4 bug: fix filter and search (#12587)
* chore:  remove commented out code and skipped tests

* refact:  triggeredBy requires filter expression not qp

* refact:  use filter expression dsl instead of named params

* fix:  add  type

* docs:  add in-line reference to filter expression DSL

* fix:  update filter copy for non-matches

* fix:  correct conditional logic to render no match copy
2022-04-22 15:40:13 -04:00
Phil Renaud
34add685e5 Sets up a new z-modal z-index and assigns it to the sidebar (#12758) 2022-04-22 15:23:49 -04:00
Phil Renaud
30bc79d654 Accidentally added while setting lint rules elsewhere (#12759) 2022-04-22 15:04:21 -04:00
Tim Gross
b1ce392972 CSI: plugin supervisor prestart should not mark itself done (#12752)
The task runner hook `Prestart` response object includes a `Done`
field that's intended to tell the client not to run the hook
again. The plugin supervisor creates mount points for the task during
prestart and saves these mounts in the hook resources. But if a client
restarts, the hook resources will not be populated. If the plugin task
restarts at any time after the client restarts, it will fail to have
the correct mounts and crash loop until restart attempts run out.

Fix this by not returning `Done` in the response, just as we do for
the `volume_mount_hook`.
2022-04-22 13:07:47 -04:00
James Rasell
ff9c9acc99 deps: update consul-template to v0.29.0 (#12747)
* deps: update consul-template to v0.29.0

* changelog: add entry for #12747
2022-04-22 09:58:54 -07:00
Phil Renaud
6b7cefb96f Adding changelog note (#12753) 2022-04-22 12:38:49 -04:00
Phil Renaud
cabe05705c [ui] Disconnected Clients: "Unknown" allocations in the UI (#12544)
* Unknown status for allocations accounted for

* Canary string removed

* Test cleanup

* Generate unknown in mirage

* aacidentally oovervoowled

* Update ui/app/components/allocation-status-bar.js

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>

* Disconnected state on job status in client

* Renaming Disconnected to Unknown in the job-status-in-client

* Unknown accounted for on job rows filtering and tests fix

* Adding lostAllocs as a computed dependency

* Unknown client status within acceptance test

* Swatches updated and PR comments addressed

* Unknown and disconnected added to test fixtures

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
2022-04-22 11:25:02 -04:00
Luiz Aoqui
0abe5a6c79 vault: revert support for entity aliases (#12723)
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading as another Vault
entity.
2022-04-22 10:46:34 -04:00
Seth Hoenig
c8bd0904cf Merge pull request #12720 from hashicorp/f-arbitrary-addresses
services: enable setting arbitrary address value in service registrations
2022-04-22 09:34:02 -05:00
Seth Hoenig
24431745e2 services: fix imports 2022-04-22 09:15:51 -05:00
Seth Hoenig
ed37d2116d services: cr followup 2022-04-22 09:14:29 -05:00
Seth Hoenig
2e26098614 services: format ipv6 in nomad service info output
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-04-22 09:14:29 -05:00
Seth Hoenig
890d4a91b7 services: enable setting arbitrary address value in service registrations
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.

The address can be an IP address or domain name. If the `address` field is
set, `service.address_mode` must be set to `auto`.
2022-04-22 09:14:29 -05:00
Tim Gross
512338f0aa E2E: remove platform specific realpath code from UI run script (#12750)
We don't need the absolute path for any of the commands in this script
so long as we `cd` into the source directory path. Doing this removes
the need for weird platform-specific tricks we have to do with
realpath vs GNU realpath.
2022-04-22 10:10:18 -04:00
James Rasell
89b74632d4 docs: add upgrade note for Consul implicit constraint. (#12749) 2022-04-22 15:53:27 +02:00
Tim Gross
0b9a85f5c2 CSI: handle nil topologies safely in command line (#12751) 2022-04-22 09:25:04 -04:00
Tim Gross
a29023ef69 E2E: fix debug logging on disconnected clients test (#12621) 2022-04-22 09:07:05 -04:00
James Rasell
2c6966c61a cli: add pagination flags to service info command. (#12730) 2022-04-22 10:32:40 +02:00
Tim Gross
cf913ba66b E2E: make UIs runnable from any working directory (#12739)
The E2E test runner is running from the root of the Nomad
repository. Make this run independent of the working directory for
convenience of developers and the test runner.
2022-04-21 17:00:01 -04:00
Michael Schurter
7af0c3c9e5 cli: add -json flag to support job commands (#12591)
* cli: add -json flag to support job commands

While the CLI has always supported running JSON jobs, its support has
been via HCLv2's JSON parsing. I have no idea what format it expects the
job to be in, but it's absolutely not in the same format as the API
expects.

So I ignored that and added a new -json flag to explicitly support *API*
style JSON jobspecs.

The jobspecs can even have the wrapping {"Job": {...}} envelope or not!
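
One way to accept both shapes is to try the wrapped form first and fall back to the bare form. The sketch below is an assumption about how that could be done, not the CLI's actual implementation; `Job` is a hypothetical, heavily trimmed struct and `decodeJob` is a made-up helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Job is a hypothetical, trimmed-down job struct for illustration only.
type Job struct {
	ID   string
	Name string
}

// decodeJob accepts API-style JSON with or without the top-level
// {"Job": {...}} wrapper.
func decodeJob(data []byte) (*Job, error) {
	// First try the wrapped form.
	var envelope struct {
		Job *Job
	}
	if err := json.Unmarshal(data, &envelope); err != nil {
		return nil, err
	}
	if envelope.Job != nil {
		return envelope.Job, nil
	}

	// No wrapper found, so treat the document itself as the job.
	var job Job
	if err := json.Unmarshal(data, &job); err != nil {
		return nil, err
	}
	return &job, nil
}

func main() {
	wrapped := []byte(`{"Job": {"ID": "example", "Name": "example"}}`)
	bare := []byte(`{"ID": "example", "Name": "example"}`)

	for _, doc := range [][]byte{wrapped, bare} {
		job, err := decodeJob(doc)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(job.ID, job.Name)
	}
}
```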

* docs: fix example for `nomad job validate`

We haven't been able to validate inside driver config stanzas ever since
the move to task driver plugins. 😭
2022-04-21 13:20:36 -07:00
Tim Gross
42bcb74a51 cli: detect directory when applying namespace spec file (#12738)
The new `namespace apply` feature that allows for passing a namespace
specification file detects the difference between an empty namespace
and a namespace specification by checking if the file exists. For most
cases, the file will have an extension like `.hcl` and so there's
little danger that a user will apply a file spec when they intended to
apply a namespace by name.

But because directory names typically don't include an extension,
you're much more likely to collide when trying to `namespace apply` by
name only, and then you get a confusing error message of the form:

   Failed to read file: read $namespace: is a directory

Detect the case where the namespace name collides with a directory in
the current working directory, and skip trying to load the directory.
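
A minimal sketch of that check, assuming a hypothetical helper named `isSpecFile` (the real command's code may be structured differently): an argument counts as a specification file only if it exists and is not a directory.

```go
package main

import (
	"fmt"
	"os"
)

// isSpecFile reports whether arg should be read as a specification file.
// A missing path or a directory is treated as a plain namespace name.
func isSpecFile(arg string) bool {
	info, err := os.Stat(arg)
	if err != nil {
		return false // does not exist (or is unreadable): treat as a name
	}
	return !info.IsDir()
}

func main() {
	// "." is always a directory, so it is never treated as a spec file.
	fmt.Println(isSpecFile("."))
	// A path that does not exist falls through to "namespace name".
	fmt.Println(isSpecFile("no-such-namespace-spec.hcl"))
}
```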
2022-04-21 14:53:45 -04:00
Phil Renaud
a977577e44 [ui, bugfix] Link fix for volumes where per_alloc=true (#12713)
* Allocation page linkfix

* fix added to task page and computed prop moved to allocation model

* Fallback query added to task group when specific volume isn't knowable

* Delog

* link text reflects alloc suffix

* Helper instead of in-template conditionals

* formatVolumeName unit test

* Removing unused helper import
2022-04-21 13:57:18 -04:00
Seth Hoenig
f1fcd50938 Merge pull request #12736 from hashicorp/build-update-go-1.17.9
build: update golang to 1.17.9
2022-04-21 12:13:07 -05:00
Seth Hoenig
7637a6c9c2 build: update golang version script to use .go-version file 2022-04-21 12:10:14 -05:00
Seth Hoenig
8206621833 Merge pull request #12737 from hashicorp/buid-update-ec2-instances
build: update ec2 instance profiles
2022-04-21 11:57:40 -05:00
Seth Hoenig
96b6a8d985 build: update ec2 instance profiles
using tools/ec2info
2022-04-21 11:47:40 -05:00
Seth Hoenig
91d91e28e4 build: update golang to 1.17.9 2022-04-21 11:43:01 -05:00
Tim Gross
55ca76e205 docker: back out cgroup v2 OOM detection (#12735)
When shutting down an allocation that ends up needing to be
force-killed, we're getting a spurious "OOM Killed (137)" message on
the task termination event. We introduced this as part of cgroups v2
support because the Docker daemon isn't detecting the container status
correctly. Although exit code 137 is the exit code we get for
OOM-killed processes, that's because OOM kill is a `SIGKILL`. So any
sigkilled process will get that exit code.
2022-04-21 12:31:34 -04:00
Tim Gross
5c17f91117 E2E: set longer timeout for CSI plugin alloc start (#12732)
The CSI plugin allocations take a while to be marked healthy,
sometimes causing E2E test flakes during the setup phase of the
tests. There's nothing CSI specific about marking plugin allocs
healthy, as the plugin supervisor hook does all the fingerprinting in
the postrun hook (the prestart hook just makes a couple of empty
directories). The timeouts we're seeing may be because of where we're
pulling the images from; most of our jobs pull from a CDN-backed public
registry whereas these are pulling from ECR. Set a 1min timeout for
these to make sure we have enough time to pull the image and start the
task.
2022-04-21 11:11:43 -04:00
James Rasell
15e6e5befc api: Add support for filtering and pagination to the node list endpoint (#12727) 2022-04-21 17:04:33 +02:00
Tim Gross
1f1c970135 docs: fix broken link from template to client config (#12733) 2022-04-21 11:04:04 -04:00
Derek Strickland
5b1413a1ae reconciler: Handle canaries when client disconnects (#12539)
* plan_apply: Allow node updates in disconnected node plans
* plan: Keep the job when persisting unknown allocs
* reconciler: stop unknown allocs when stopping all
* reconcile_util: reorder filtering to handle canaries; skip rescheduling unknown
* heartbeat: Fix bug in node heartbeating
2022-04-21 10:05:58 -04:00
Tim Gross
d1aa801407 E2E: playwright configuration and smoke test (#12721)
Scripts for running playwright tests in a Docker container that has
chromium and webkit preinstalled. Includes a basic smoke test for
authentication so that we can be sure the test rig is working
end-to-end. Wiring this up in CI will be in an upcoming PR.
2022-04-21 09:13:10 -04:00
James Rasell
a911d83cf4 docs: update HCL2 dynamic example to use block with label. (#12715) 2022-04-21 10:18:04 +02:00
James Rasell
61ec5f0456 autopilot: correctly return errors within state functions. (#12714) 2022-04-21 08:54:50 +02:00
Luiz Aoqui
7f1b838abb ui: fix bug that prevented files streaming (#12719)
During the Ember dependency upgrade work,
https://github.com/hashicorp/nomad/commit/ce8c039f4ce7359d60ede5dee36b9cef82
moved the `isSupported` method from using Ember's `reopenClass` to a
getter, but `reopenClass` creates a static method, so the getter must be
static as well.
2022-04-20 14:39:18 -04:00
Gowtham
f601cc39b1 Add Concurrent Download Support for artifacts (#11531)
* add concurrent download support - resolves #11244

* format imports

* mark `wg.Done()` via `defer`

* added tests for successful and failure cases and resolved some goleak

* docs: add changelog for #11531

* test typo fixes and improvements
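
For readers unfamiliar with the pattern named in the bullets above, here is a generic sketch of concurrent fetching with `sync.WaitGroup` and a deferred `wg.Done()`. It is not the artifact hook's code; `fetchAll` and `fakeFetch` are hypothetical names, and no real downloads happen.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchAll runs fetch for every source concurrently and returns one error
// slot per source, in the same order as the input.
func fetchAll(sources []string, fetch func(string) error) []error {
	errs := make([]error, len(sources))
	var wg sync.WaitGroup
	for i, src := range sources {
		wg.Add(1)
		go func(i int, src string) {
			defer wg.Done() // runs on every return path of this goroutine
			errs[i] = fetch(src)
		}(i, src)
	}
	wg.Wait()
	return errs
}

// fakeFetch is a hypothetical downloader that fails for one known source.
func fakeFetch(src string) error {
	if src == "bad" {
		return errors.New("download failed: " + src)
	}
	return nil
}

func main() {
	errs := fetchAll([]string{"a", "bad", "c"}, fakeFetch)
	for i, err := range errs {
		fmt.Println(i, err)
	}
}
```

Each goroutine writes only to its own slice index, so no extra locking is needed to collect the results.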

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-04-20 10:15:56 -07:00
James Rasell
8eb569faf4 job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
James Rasell
4c55339cc6 client: add NOMAD_SHORT_ALLOC_ID allocation env var. (#12603) 2022-04-20 10:30:48 +02:00
Tim Gross
aafcf97984 E2E: provide options for reverse proxy for web UI (#12671)
Our E2E test environment is deployed with mTLS, but it's impractical
for us to use mTLS in headless browsers for automated testing (or even
in manual testing). Provide certificates for proxying the web UI via
Nginx. This proxy uses client certs for proxying to the HTTP endpoint
and a self-signed cert for the browser-facing endpoint. We can accept
certificate errors in the automated tests we'll be adding in the next
step of this work.
2022-04-19 16:55:05 -04:00
Tim Gross
e2a8d45f2d E2E: terraform provisioner upgrades (#12652)
While working on infrastructure for testing the UI in E2E, we needed
to upgrade the certificate provider. Performing a provider upgrade via
the TF `init -upgrade` brought in updates for the file and AWS
providers as well. These updates include deprecating the use of
`sensitive_content` fields, removing CA algorithm parameters that can
be inferred from keys, and removing the requirement to manually
specify AWS assume role parameters in the provider config if they're
available in the calling environment's AWS config file (as they are
via doormat or our E2E environment).
2022-04-19 14:27:14 -04:00
Seth Hoenig
19c9779d57 Merge pull request #12604 from hashicorp/b-fixup-chroot-test
ci: fixup task runner chroot test
2022-04-19 12:58:03 -05:00