Commit Graph

25304 Commits

Author SHA1 Message Date
Michael Schurter
2def3bb2b9 Prepare release 1.7.0-beta.2 2023-11-15 14:42:22 -08:00
Adriano Caloiaro
f66eb83fc0 Add go-netaddrs support to retry_join (#18745) 2023-11-15 10:07:18 -05:00
Phil Renaud
bb6c86d2a4 Shows the client/node name alongside alloc short ID if the job is sys/sysbatch (#19051) 2023-11-15 10:05:12 -05:00
Luiz Aoqui
26746a4093 cli: add zero nodes message to node status (#19082)
Display a message to indicate that there are no nodes registered when
`node status` returns zero values.
2023-11-14 23:00:12 -05:00
Tim Gross
98e9fb4698 docs: clarify when "all" is not permitted for cap_add (#19091)
Linux capabilities configurable by the task must be a subset of those configured
in the plugin configuration. Clarify this implies that `"all"` is not permitted
if the plugin is not also configured to allow all capabilities.

Fixes: https://github.com/hashicorp/nomad/issues/19059
2023-11-14 16:33:55 -05:00
Tim Gross
0236bd0907 qemu: fix panic from missing resources block (#19089)
The `qemu` driver uses our universal executor to run the qemu command line
tool. Because qemu owns the resource isolation, we don't pass in the resource
block that the universal executor uses to configure cgroups and core
pinning. This resulted in a panic.

Fix the panic by returning early in the cgroup configuration in the universal
executor. This fixes `qemu` but also any third-party drivers that might exist
and are using our executor code without passing in the resource block.

In future work, we should ensure that the `resources` block is being translated
into qemu equivalents, so that we have support for things like NUMA-aware
scheduling for that driver.

Fixes: https://github.com/hashicorp/nomad/issues/19078
2023-11-14 16:26:44 -05:00
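A minimal sketch of the early-return guard described in 0236bd0907. The `ExecCommand` and `Resources` types and the `configureCgroups` function are hypothetical stand-ins for illustration, not Nomad's actual executor code:

```go
package main

import "fmt"

// Resources is a hypothetical stand-in for the resource block a driver
// passes to the executor; it is not Nomad's real type.
type Resources struct {
	CPUShares int
	MemoryMB  int
}

// ExecCommand is a hypothetical stand-in for the executor's launch request.
type ExecCommand struct {
	Cmd       string
	Resources *Resources // nil when the driver owns resource isolation
}

// configureCgroups returns early when no resource block was supplied,
// instead of dereferencing a nil pointer and panicking.
func configureCgroups(cmd *ExecCommand) (string, error) {
	if cmd == nil {
		return "", fmt.Errorf("nil command")
	}
	if cmd.Resources == nil {
		// Nothing to configure: the driver handles isolation itself.
		return "skipped", nil
	}
	return fmt.Sprintf("cpu=%d mem=%dMB", cmd.Resources.CPUShares, cmd.Resources.MemoryMB), nil
}

func main() {
	// A qemu-style command with no resource block.
	out, _ := configureCgroups(&ExecCommand{Cmd: "qemu-system-x86_64"})
	fmt.Println(out)

	// A command that does carry a resource block.
	out, _ = configureCgroups(&ExecCommand{
		Cmd:       "sleep",
		Resources: &Resources{CPUShares: 500, MemoryMB: 256},
	})
	fmt.Println(out)
}
```

Guarding in the shared executor rather than in each driver is what lets the same fix cover third-party drivers that omit the block.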
dependabot[bot]
9bc4a8df59 chore(deps): bump debug from 4.1.1 to 4.3.4 in /scripts/screenshots/src (#18636)
Bumps [debug](https://github.com/debug-js/debug) from 4.1.1 to 4.3.4.
- [Release notes](https://github.com/debug-js/debug/releases)
- [Commits](https://github.com/debug-js/debug/compare/4.1.1...4.3.4)

---
updated-dependencies:
- dependency-name: debug
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-14 14:36:02 -05:00
Phil Renaud
12e43aa07f Re-add wildcard for test-ui path restrictions (#19085) 2023-11-14 11:28:53 -05:00
Tim Gross
42f0540f9a docs: fix link to dynamic node metadata API (#19086) 2023-11-14 11:16:12 -05:00
Tim Gross
8fac70c92c E2E: refactor vaultcompat to allow for ENT tests (#19081)
We want to run the Vault compatibility E2E test with Vault Enterprise binaries
and use Vault namespaces. Refactor the `vaultcompat` test so as to parameterize
most of the test setup logic with the namespace, and add the appropriate build
tag for the CE version of the test.
2023-11-14 09:54:47 -05:00
Tim Gross
b5af87ebf3 set Vault namespace from task in vault_hook JWT login (#19080)
The JWT login codepath for the `vault_hook` was missing the Vault namespace, so
the login request for non-default namespaces would fail.
2023-11-14 09:54:36 -05:00
Juana De La Cuesta
bae82b14b4 docs: Add section for disable restart (#19083)
* docs: add section for disable restart that mirrors what is on disable reschedule

* Update restart.mdx
2023-11-14 14:53:43 +01:00
Tim Gross
1c9c75cc83 E2E: refactor consulcompat to allow for ENT tests (#19068)
We want to run the Consul compatibility E2E test with Consul Enterprise binaries
and use Consul namespaces. Refactor the `consulcompat` test so as to
parameterize most of the test setup logic with the namespace, and add the
appropriate build tag for the CE version of the test.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1305
2023-11-10 15:05:51 -05:00
Seth Hoenig
5987ba434f e2ev3: wait for logs to become ready (#19067)
Just because an alloc is running does not mean nomad is ready to serve
task logs. In a test case where you immediately read logs after starting
a task, it could be that nomad responds with "no logs found" when you
try to read logs, in which case you just need to wait longer. Do so in
the v3 TaskLogs helper function.
2023-11-10 12:43:16 -06:00
Luiz Aoqui
f0acf72ae7 client: fix Consul token retrieval for templates (#19058)
The template hook must use the Consul token for the cluster defined in
the task-level `consul` block or, if `nil`, in the group-level `consul`
block.

The Consul tokens are generated by the allocrunner consul hook, but
during the transition period we must fallback to the Nomad agent token
if workload identities are not being used.

So an empty token returned from `GetConsulTokens()` is not enough to
determine if we should use the legacy flow (either this is an old task
or the cluster is not configured for Consul WI), or if there is a
misconfiguration (the task or group `consul` block is using a cluster
that doesn't have an `identity` set).

In order to distinguish between the two scenarios we must iterate over
the task identities looking for one suitable for the Consul cluster
being used.
2023-11-10 13:42:30 -05:00
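One way to read the decision described in f0acf72ae7 is sketched below. Everything here is illustrative: the `Identity` type, `pickConsulToken`, and the identity name `consul_default` are assumptions, and the real template hook works with Nomad's own structs.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity is a hypothetical stand-in for a task's workload identity.
type Identity struct{ Name string }

var errMissingToken = errors.New("task has a Consul identity but no token was generated")

// pickConsulToken chooses the token the template hook would use.
// tokens maps identity name -> token for the selected Consul cluster.
func pickConsulToken(identities []Identity, wantIdentity string, tokens map[string]string, agentToken string) (string, error) {
	for _, id := range identities {
		if id.Name != wantIdentity {
			continue
		}
		// Workload identity flow: a matching identity exists, so a token
		// must have been generated for it.
		if tok := tokens[wantIdentity]; tok != "" {
			return tok, nil
		}
		return "", errMissingToken
	}
	// No suitable identity: legacy flow, fall back to the agent token.
	return agentToken, nil
}

func main() {
	ids := []Identity{{Name: "consul_default"}}
	tokens := map[string]string{"consul_default": "wi-token"}

	// Identity and token present: use the workload identity token.
	tok, err := pickConsulToken(ids, "consul_default", tokens, "agent-token")
	fmt.Println(tok, err)

	// No identity at all: legacy flow.
	tok, err = pickConsulToken(nil, "consul_default", nil, "agent-token")
	fmt.Println(tok, err)

	// Identity present but no token: reported as an error.
	tok, err = pickConsulToken(ids, "consul_default", nil, "agent-token")
	fmt.Println(tok, err)
}
```

The point of looking at the identities first is that an empty token alone cannot separate the legacy case from the error case.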
Phil Renaud
62007e3b18 [ui] Small fix to let UI actions passing use job.name instead of job.id, since namespace is passed as an explicit param afterward (#19061) 2023-11-10 10:55:00 -05:00
Seth Hoenig
c17333d74a e2e refactor oversubscription (#19060)
* e2e: remove old oversubscription test

* e2e: fixup and cleanup oversubscription test suite

Fix and cleanup this old oversubscription test.

* use t.Cleanup instead of defer in tests
2023-11-10 09:25:32 -06:00
Tim Gross
5d0008a9b4 tools: bump version of hc-install (#19063)
The version we have of `hc-install` doesn't allow installing Enterprise
binaries. Upgrade so that this is available to the development team and to our
E2E tests in the Enterprise repo.
2023-11-10 09:57:29 -05:00
Tim Gross
4e38b41d9d E2E: add template block to consulcompat test (#19055)
The Consul compatibility test focuses on Connect, but it'd be a good idea to
ensure we can successfully get template data out of Consul as well.

Also tightens up the test's Consul ACL policy for the Nomad agent.
2023-11-10 09:25:37 -05:00
Seth Hoenig
1f957947b4 e2e: refactor nomadexec test suite (#19054) 2023-11-10 07:09:24 -06:00
Seth Hoenig
2f8d94ae3e e2e: more cpu and memory for java tasks and some scripts (#19057) 2023-11-10 07:08:14 -06:00
Tim Gross
5ad715b281 fix taskrunner test after broken signature (#19056)
PRs #19034 and #19040 accidentally conflicted with each other without a merge
conflict when #19034 changed the method signature of `SetConsulTokens`. Because
CI doesn't rebase, both PRs tested fine and only were broken once they landed on
`main`. Fix that.
2023-11-09 15:53:25 -05:00
Seth Hoenig
f211a0ab7c e2e: update terraform lock file for 1.6.3 (#19049)
Using the latest version of terraform, the lock file is not the same
as when it was generated. Seems like the http module is not needed?
versioned? present? anymore.
2023-11-09 10:49:49 -06:00
Luiz Aoqui
b61a31c38f chore: remove comment about WI change mode (#19047)
Identity change mode was implemented in #18943 and handles the update at
the task level, so workload identity manager receives the update as
expected.
2023-11-09 11:06:03 -05:00
Luiz Aoqui
85d923b759 cli: fix Consul env var URL reference (#19041) 2023-11-09 10:58:03 -05:00
Luiz Aoqui
6d8417014f client: pass alloc hook resources to template hook (#19040)
The task template hook uses the alloc resource to retrieve Consul
tokens, so it must be passed from the allocation.
2023-11-09 10:55:35 -05:00
Seth Hoenig
402540f7fb e2e: bump packer build instances because faster (#19046) 2023-11-09 09:33:30 -06:00
Tim Gross
c7c3b3ae33 revoke Consul tokens obtained via WI when alloc stops (#19034)
Add a `Postrun` and `Destroy` hook to the allocrunner's `consul_hook` to ensure
that Consul tokens we've created via WI get revoked via the logout API when
we're done with them. Also add the logout to the `Prerun` hook if we've hit an
error.
2023-11-09 10:08:09 -05:00
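A rough sketch of the revocation flow described in c7c3b3ae33, assuming a simplified hook that only tracks token strings. The `consulHook` struct, its `logout` callback, and the retry-on-failure behavior are hypothetical and do not mirror the real `consul_hook` implementation:

```go
package main

import "fmt"

// consulHook is a hypothetical, simplified stand-in for the allocrunner's
// consul_hook. It only tracks the tokens obtained via workload identity.
type consulHook struct {
	tokens []string
	logout func(token string) error // assumed wrapper around Consul's logout API
}

// revoke logs out every tracked token. Tokens that fail to revoke stay
// tracked so that a later hook can retry them.
func (h *consulHook) revoke() error {
	var firstErr error
	remaining := h.tokens[:0]
	for _, tok := range h.tokens {
		if err := h.logout(tok); err != nil {
			if firstErr == nil {
				firstErr = err
			}
			remaining = append(remaining, tok)
		}
	}
	h.tokens = remaining
	return firstErr
}

// Postrun runs once the allocation's tasks have stopped.
func (h *consulHook) Postrun() error { return h.revoke() }

// Destroy runs when the allocation is being destroyed.
func (h *consulHook) Destroy() error { return h.revoke() }

func main() {
	var revoked []string
	h := &consulHook{
		tokens: []string{"token-a", "token-b"},
		logout: func(tok string) error {
			revoked = append(revoked, tok)
			return nil
		},
	}
	errPost := h.Postrun()
	errDestroy := h.Destroy()
	fmt.Println(errPost, errDestroy, revoked)
}
```

Because successfully revoked tokens are dropped from the tracked list, calling both hooks does not log the same token out twice.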
Luke Kysow
36c9aee3f0 Bump consul-template to 0.35.0 (#19032)
* Bump consul-template to 0.35.0

* run go mod tidy
2023-11-09 09:48:33 -05:00
Seth Hoenig
a28e5b6965 e2e: refactor metrics test to use NSD and WI (#19022)
* e2e: remove old metrics suite

* e2e: install stress on e2e jammy image

* e2e: overhaul metrics test to use nomad service discovery, workload identity

* e2e: format metrics hcl files and copywrite

* e2e: undo tf lock file

* e2e: undo reg auth file perms

* e2e: format cpustress.hcl
2023-11-09 08:21:16 -06:00
Phil Renaud
f322bb7efb Nicer comment styles in example jobs (#19037) 2023-11-08 20:13:34 -05:00
Phil Renaud
6cd706f460 Only run test-ui, and percy, in the event that a push/pr touches the ui directory (#19038) 2023-11-08 20:12:54 -05:00
Piotr Kazmierczak
128c71b579 cli: simplify conditionals in setup commands (#19011) 2023-11-08 19:41:15 -05:00
Tim Gross
7191c78928 refactor: rename allocrunner's Consul service reg handler (#19019)
The allocrunner has a service registration handler that proxies various API
calls to Consul. With multi-cluster support (for ENT), the service registration
handler is what selects the correct Consul client. The name of this field in the
allocrunner and taskrunner code base looks like it's referring to the actual
Consul API client. This was actually the case before Nomad native service
discovery was implemented, but now the name is misleading.
2023-11-08 15:39:32 -05:00
Luiz Aoqui
6761f1f98c cli: fix setup consul binding rule config (#19033)
When creating the binding rule, `BindName` must match the pattern used
for the role name, otherwise the task will not be able to login to
Consul.

Also update the equality check for the binding rule to ensure this
property is held even if the auth method already has existing binding
rules attached.
2023-11-08 15:13:16 -05:00
Michael Schurter
c4ae91f8be Fix WorkloadIdentity.TTL handling, jobspec2 testing, and hcl1 vs 2 parsing (#19024)
* make the little dots consistent
* don't trim delimiter as that over matches
* test jobspec2 package
* copy api/WorkloadIdentity.TTL -> structs
* test ttl parsing
* fix hcl1 v 2 parsing mismatch
* make jobspec(1) tests match jobspec2 tests
2023-11-08 09:01:16 -08:00
Tim Gross
9d075c44b2 config: remove old Vault/Consul config blocks from parser (#18997)
Remove the now-unused original configuration blocks for Consul and Vault from
the agent configuration parsing. When the agent needs to refer to a Consul or
Vault block it will always be for a specific cluster for the task/service (or
the default cluster for the agent's own use).

This is the third of three changesets for this work.

Fixes: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Ref: https://github.com/hashicorp/nomad/pull/18994
2023-11-08 09:30:08 -05:00
Seth Hoenig
63da22063b e2e: update pledge driver to 0.3.0 (#19020) 2023-11-08 06:58:59 -06:00
hc-github-team-es-release-engineering
57d3019879 REPLAT-962 Update LICENSE text (#19023) 2023-11-08 11:54:54 +00:00
Luiz Aoqui
ab36cf031c vault: avoid continual renewal of invalid token (#18985)
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been changed directly on Vault
out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

2d0cde4ccc/vault/expiration.go (L636-L639)

But https://github.com/hashicorp/vault/pull/5346 introduced an earlier
`nil` check that generates a different error message:

750ab337ea/vault/expiration.go (L1362-L1364)

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
2023-11-07 19:50:19 -05:00
Luiz Aoqui
7054fe1a8c vault: always renew tokens using the renewal loop (#18998)
Previously, a Vault token could be renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

78f0c6b2a9/client/vaultclient/vaultclient.go (L557-L564)

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

1604dba508/client/vaultclient/vaultclient.go (L505-L510)

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
2023-11-07 19:49:33 -05:00
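The invariant from 7054fe1a8c, that `renew()` acts only on tokens still tracked for renewal, is illustrated below. This sketch uses a mutex-protected map as a stand-in for the renewal heap; the `renewer` type and its methods are hypothetical and much simpler than the real Vault client.

```go
package main

import (
	"fmt"
	"sync"
)

// renewer is a hypothetical, simplified renewal tracker.
type renewer struct {
	mu      sync.Mutex
	tracked map[string]bool // stand-in for the renewal heap
	renewed map[string]int  // number of renewals per token
}

func newRenewer() *renewer {
	return &renewer{tracked: map[string]bool{}, renewed: map[string]int{}}
}

// Track registers a token for renewal (the heap push in the real client).
func (r *renewer) Track(token string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tracked[token] = true
}

// Stop removes a token, for example when its task stops.
func (r *renewer) Stop(token string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.tracked, token)
}

// renew is a no-op for tokens that are no longer tracked, so a renewal
// that races with Stop cannot bring a stopped token back.
func (r *renewer) renew(token string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.tracked[token] {
		return false
	}
	r.renewed[token]++
	return true
}

func main() {
	r := newRenewer()
	r.Track("token-a")
	fmt.Println(r.renew("token-a")) // tracked: renewed

	r.Stop("token-a")
	fmt.Println(r.renew("token-a")) // stopped: skipped
}
```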
Phil Renaud
783572de7d [ui] Actions implementation in the web UI (#18793)
* runAction model and adapter funcs

* Hacky but functional action running from job index

* remove proxy hack

* runAction added to taskSubRow

* Added tty and ws_handshake to job action endpoint call

* delog

* Bunch of streaming work

* action started, running, and finished notification titles, neutral color, and ansi escape

* Handle random alloc selection in the web ui

* Run on All implementation in web ui

* [ui] Helios two-step button and uniform title bar for Actions (#18912)

* Initial pass at title bar button uniformity

* Vertical align on actions dropdown toggle and small edits to prevent keynav overflow issue

* We represent loading state w text and disable now

* Pageheader component to align buttons

* Buttons standardized

* Actions dropdown reveal for multi-alloc job

* Notification code styles

* An action-having single alloc job

* Mirageed

* Actions-laden jobs in mirage

* Separating allocCount and taskCount in mirage mocks

* Unbreak stop job tests

* Permissions for actions dropdown

* tests for running actions from the job index page

* running from a task row actions tests

* some todocleanup

* PR feedback addressed, including page helper for actions
2023-11-07 15:29:43 -05:00
Seth Hoenig
cf2f48efd4 build: update to Go 1.21.4 (#19013) 2023-11-07 13:18:07 -06:00
Seth Hoenig
a2f7ab2645 e2e disable windows (#19012)
* e2e: disable windows client

* e2e: disable windows artifact test
2023-11-07 09:34:18 -06:00
Tim Gross
50f0ce5412 config: remove old Vault/Consul config blocks from client (#18994)
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).

This is two of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.

As part of this work I discovered and fixed two bugs:

* The gRPC proxy socket that we create for Envoy is only ever created using the
  default Consul cluster's configuration. This will prevent Connect from being
  used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
  Consul cluster's configuration, but will use the correct Consul token for the
  non-default cluster. This will prevent templates from being used with the
  non-default cluster.

Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
2023-11-07 09:15:37 -05:00
Tim Gross
1998004483 move deprecation warning for Vault/Consul token to admission hook (#18995)
Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and
intended for removal in Nomad 1.9. We added a deprecation warning to the CLI
when the user passes in the appropriate flag or environment variable. However,
the warning also appears when the job does not use Vault or Consul but you
happen to have the appropriate environment variable set in your environment.
While this is generally a bad practice (because
the token is leaked to Nomad), it's also the existing practice for some users.

Move the warning to the job admission hook. This will allow us to warn only when
appropriate, and that will also help the migration process by producing warnings
only for the relevant jobs.
2023-11-07 08:37:06 -05:00
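A sketch of the job-scoped warning described in 1998004483, assuming a trimmed-down `Job` type. The field names, the `tokenDeprecationWarnings` function, and the warning text are hypothetical and do not reflect Nomad's admission hook API:

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down job with only the relevant fields.
type Job struct {
	Name        string
	UsesVault   bool
	UsesConsul  bool
	VaultToken  string
	ConsulToken string
}

// tokenDeprecationWarnings warns only when a token was submitted for an
// integration the job actually uses.
func tokenDeprecationWarnings(j Job) []string {
	var warnings []string
	if j.UsesVault && j.VaultToken != "" {
		warnings = append(warnings,
			fmt.Sprintf("job %q: submitting a Vault token with a job is deprecated", j.Name))
	}
	if j.UsesConsul && j.ConsulToken != "" {
		warnings = append(warnings,
			fmt.Sprintf("job %q: submitting a Consul token with a job is deprecated", j.Name))
	}
	return warnings
}

func main() {
	// A token is present in the environment, but the job does not use Vault.
	fmt.Println(len(tokenDeprecationWarnings(Job{Name: "web", VaultToken: "s.abc"})))

	// The job uses Vault and a token was submitted with it.
	fmt.Println(len(tokenDeprecationWarnings(Job{Name: "db", UsesVault: true, VaultToken: "s.abc"})))
}
```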
Seth Hoenig
3ba364e42f deps: update some dependencies (#19002)
* deps: update shoenig/test to 1.7.0

* deps: update go-set/v2 to v2.1.0

* deps: update shoenig/go-landlock to v1.2.0
2023-11-07 07:34:40 -06:00
Piotr Kazmierczak
7c6863b479 cli: setup vault command (#18910)
An interactive setup helper for configuring Vault to accept Nomad WI-enabled
workloads.

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-11-07 10:42:00 +01:00
Dave May
e4f98a8d1d docs: fix broken links in docker.mdx (#19003) 2023-11-07 07:34:47 +00:00
Tim Gross
1ef99f0536 config: remove old Vault/Consul config blocks from server (#18991)
Remove the now-unused original configuration blocks for Consul and Vault from
the server. When the server needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the server's own use).

This is one of three changesets for this work. The remainder will implement the
same changes in the `client` package and on the `command/agent` package.

As part of this work I discovered that the job submission hook for Vault only
checks the enabled flag on the default cluster, rather than the clusters that
are used by the job being submitted. This will return an error on job
registration saying that Vault is disabled. Fix that to check only the
cluster(s) used by the job.

Ref: https://github.com/hashicorp/nomad/issues/18947
Fixes: https://github.com/hashicorp/nomad/issues/18990
2023-11-06 10:26:20 -05:00
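The per-cluster check described in 1ef99f0536 might be sketched as follows. `ClusterConfig` and `validateVaultClusters` are hypothetical names for illustration; the real job submission hook operates on Nomad's own configuration structs.

```go
package main

import "fmt"

// ClusterConfig is a hypothetical stand-in for a per-cluster Vault config.
type ClusterConfig struct{ Enabled bool }

// validateVaultClusters checks only the clusters the job actually uses.
// Clusters the job does not reference are ignored, including the default.
func validateVaultClusters(configs map[string]ClusterConfig, used []string) error {
	for _, name := range used {
		cfg, ok := configs[name]
		if !ok {
			return fmt.Errorf("vault cluster %q is not configured", name)
		}
		if !cfg.Enabled {
			return fmt.Errorf("vault is disabled for cluster %q", name)
		}
	}
	return nil
}

func main() {
	configs := map[string]ClusterConfig{
		"default": {Enabled: false},
		"infra":   {Enabled: true},
	}
	// The job uses only "infra", so the disabled default cluster is irrelevant.
	fmt.Println(validateVaultClusters(configs, []string{"infra"}))
	// A job that uses the disabled default cluster is rejected.
	fmt.Println(validateVaultClusters(configs, []string{"default"}))
}
```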