26627 Commits

Author SHA1 Message Date
Tim Gross
3f2d4000a6 E2E: dynamic host volume tests for sticky volumes (#24869)
Add tests for dynamic host volumes where the claiming jobs have `volume.sticky =
true`. Includes a test for forced rescheduling and a test for node drain.

This changeset includes a new `e2e/v3`-style package for creating dynamic host
volumes, so we can reuse that across other tests.
2025-02-07 15:50:54 -05:00
Michael Smithhisler
a6523be478 state store: fix logic for evaluating job status (#24974) 2025-02-07 15:34:14 -05:00
Daniel Bennett
91194b3cc2 docker: refactor to handle futures more easily (#24992)
at least one bug has been created because it's
easy to miss a future.set() in pullImageImpl()

this pulls future.set() out to PullImage(),
the same level where it's created and wait()ed
2025-02-07 12:45:17 -06:00
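The refactor described in this commit can be sketched roughly as follows. This is a minimal illustration, not the docker driver's actual code: the `pullFuture` type, its fields, and the placeholder image ID are all hypothetical. The point it shows is that when the implementation function only returns values, the future is created, set, and waited on in one place, so an early return cannot forget to complete it.

```go
package main

import (
	"errors"
	"fmt"
)

// pullFuture is a hypothetical one-shot result holder, not the driver's real type.
type pullFuture struct {
	done    chan struct{}
	imageID string
	err     error
}

func newPullFuture() *pullFuture {
	return &pullFuture{done: make(chan struct{})}
}

// set records the result and unblocks every waiter. It must be called exactly once.
func (f *pullFuture) set(imageID string, err error) {
	f.imageID, f.err = imageID, err
	close(f.done)
}

// wait blocks until set has been called.
func (f *pullFuture) wait() (string, error) {
	<-f.done
	return f.imageID, f.err
}

// pullImageImpl only returns values. It has no access to the future,
// so none of its return paths can leave the future unset.
func pullImageImpl(image string) (string, error) {
	if image == "" {
		return "", errors.New("image name is empty")
	}
	return "sha256:" + image, nil // placeholder ID for this sketch
}

// PullImage creates, sets, and waits on the future at the same level.
func PullImage(image string) (string, error) {
	f := newPullFuture()
	go func() { f.set(pullImageImpl(image)) }()
	return f.wait()
}

func main() {
	id, err := PullImage("redis:7")
	fmt.Println(id, err)

	_, err = PullImage("")
	fmt.Println(err)
}
```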
Daniel Bennett
62ef621582 docker: respect image_pull_timeout (#24991)
I believe the docker driver stopped respecting image_pull_timeout
in Nomad 1.9.0 in 981ca36049

this makes the timeout apply again
2025-02-07 11:36:31 -06:00
Piotr Kazmierczak
611452e1af stateful deployments: use TaskGroupVolumeClaim table to associate volume requests with volume IDs (#24993)
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store
2025-02-07 17:41:01 +01:00
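The flow in the diagram can be approximated with a small in-memory table. This is only a sketch under stated assumptions: the real claims live in Nomad's state store, and the names `claimKey`, `claimTable`, `upsertClaim`, `hasVolume`, and `deleteJob` are hypothetical stand-ins for `upsertAllocsImpl()`, `hasVolumes()`, and `DeleteJobTxn()`.

```go
package main

import "fmt"

// claimKey holds the namespace, job ID, and task group name from the diagram.
type claimKey struct {
	Namespace, JobID, TaskGroup string
}

// claimTable is a hypothetical stand-in for the TaskGroupVolumeClaim table.
// The map value is the claimed volume ID.
type claimTable map[claimKey]string

// upsertClaim creates a claim only if none exists yet, as the diagram
// describes for upsertAllocsImpl().
func (t claimTable) upsertClaim(k claimKey, volID string) {
	if _, ok := t[k]; !ok {
		t[k] = volID
	}
}

// hasVolume returns true if the candidate volume matches the existing claim,
// or if there is no previous claim at all.
func (t claimTable) hasVolume(k claimKey, candidateVolID string) bool {
	claimed, ok := t[k]
	return !ok || claimed == candidateVolID
}

// deleteJob removes every claim that belongs to the job.
func (t claimTable) deleteJob(namespace, jobID string) {
	for k := range t {
		if k.Namespace == namespace && k.JobID == jobID {
			delete(t, k)
		}
	}
}

func main() {
	table := claimTable{}
	key := claimKey{Namespace: "default", JobID: "web", TaskGroup: "cache"}

	fmt.Println(table.hasVolume(key, "vol-a")) // no claim yet
	table.upsertClaim(key, "vol-a")
	fmt.Println(table.hasVolume(key, "vol-a")) // matches the claim
	fmt.Println(table.hasVolume(key, "vol-b")) // claimed by another volume

	table.deleteJob("default", "web")
	fmt.Println(table.hasVolume(key, "vol-b")) // claim removed with the job
}
```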
Daniel Bennett
3493551c38 docker: surface image pull progress error (#24981)
set() on the future, so the caller can handle it
instead of wait()ing forever and causing the
allocation to get stuck "pending"
2025-02-07 10:36:09 -06:00
Tim Gross
d0a6424844 enos: improve documentation around required variables (#25051)
The variables definitions for Enos upgrade scenarios have a couple of unused
variables and some of the documentation strings are ambiguous:

* `nomad_region` and `binary_local_path` variables are unused and can be removed.
* `nomad_local_binary` refers to the directory where the binaries will be
  downloaded, not the binaries themselves. Rename to make it clear this belongs to
  the artifactory fetch and not the provisioning step (which uses the
  artifactory fetch outputs).
2025-02-07 11:35:50 -05:00
James Rasell
4fbacee328 sec: Remove yamux suppression as vuln has been revoked. (#25044) 2025-02-07 15:15:15 +00:00
Tim Gross
5d09d7ad07 reset max query time of blocking queries in client after retries (#25039)
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.

Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.

Fixes: https://github.com/hashicorp/nomad/issues/25033
2025-02-07 08:45:56 -05:00
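The fix described above amounts to saving the caller's max query time and restoring it when the `rpc` method returns. The following is a minimal sketch of that idea, not the client's real code: `queryRequest`, `errRetryable`, the retry loop, and both timeout values are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Both values are assumptions chosen for this sketch.
const (
	defaultMaxQueryTime = 5 * time.Minute
	rpcHoldTimeout      = 5 * time.Second
)

var errRetryable = errors.New("retryable error")

// queryRequest is a hypothetical request struct that callers reuse between calls.
type queryRequest struct {
	MaxQueryTime time.Duration
}

// rpc shortens MaxQueryTime while retrying, then restores the original value
// on return so a reused request does not keep polling at the faster rate.
func rpc(req *queryRequest, call func(*queryRequest) error, maxRetries int) error {
	original := req.MaxQueryTime
	defer func() { req.MaxQueryTime = original }()

	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = call(req); err == nil {
			return nil
		}
		if !errors.Is(err, errRetryable) {
			return err
		}
		if req.MaxQueryTime > rpcHoldTimeout {
			req.MaxQueryTime = rpcHoldTimeout
		}
	}
	return err
}

// simulate fails the first attempt with a retryable error and succeeds on the
// second, reporting the max query time seen during the retry and after the call.
func simulate() (duringRetry, afterCall time.Duration, err error) {
	req := &queryRequest{MaxQueryTime: defaultMaxQueryTime}
	attempts := 0
	err = rpc(req, func(r *queryRequest) error {
		attempts++
		if attempts == 1 {
			return errRetryable
		}
		duringRetry = r.MaxQueryTime
		return nil
	}, 3)
	return duringRetry, req.MaxQueryTime, err
}

func main() {
	during, after, err := simulate()
	fmt.Println(during, after, err)
}
```

Restoring the value inside `rpc` rather than in each caller matches the reasoning in the commit message: callers that reuse a request struct do not have to remember the detail.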
Tim Gross
b5faeff233 vault: fix bug in logging logic around renewals (#25040)
In #24409 we fixed a bug where some of the error messages we get from Vault
weren't being caught correctly. This fix itself contains a bug where we changed
the logic that logged the non-fatal errors so that it logs when there is no
renewal error.

Ref: https://github.com/hashicorp/nomad/pull/24409
Fixes: https://github.com/hashicorp/nomad/issues/24933
2025-02-07 08:45:33 -05:00
Juana De La Cuesta
cf0a046364 Module to upgrade servers (#24971)
* func: add initial enos skeleton

* style: add headers

* func: change the variables input to a map of objects to simplify the workloads creation

* style: formatting

* Add tests for servers and clients

* style: separate the tests in different scripts

* style: add missing headers

* func: add tests for allocs

* style: improve output

* func: add step to copy remote upgrade version

* style: hcl formatting

* fix: remove the terraform nomad provider

* fix: Add clean token to remove extra new line added in provision

* fix: add missing license headers

* style: hcl fmt

* style: rename variables and fix format

* func: remove the template step on the workloads module and chop the nomad token output on the provide module

* fix: correct the jobspec path on the workloads module

* fix: add missing variable definitions on job specs for workloads

* style: formatting

* fix: Add clean token to remove extra new line added in provision

* func: add module to upgrade servers

* style: missing headers

* func: add upgrade module

* func: add install for windows as well

* func: add an intermediate module that runs the upgrade server for each server

* fix: add missing license headers

* fix: remove extra input variables and connect upgrade servers to the scenario

* fix: rename missing env variables for cluster health scripts

* func: move the cluster health test outside of the modules and into the upgrade scenario

* fix: fix the regex to ignore snap files on the gitignore file

* fix: Add clean token to remove extra new line added in provision

* fix: remove extra input variables and connect upgrade servers to the scenario

* style: formatting

* fix: move taking and restoring snapshots out of the upgrade_single_server to avoid possible race conditions

* fix: rename variable in health test

* fix: Add clean token to remove extra new line added in provision

* func: add an intermediate module that runs the upgrade server for each server

* fix: Add clean token to remove extra new line added in provision

* func: fix the last_log_index check and add a versions check

* func: don't use for_each when upgrading the servers; hardcode each one to ensure they are upgraded one by one

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: make snapshot by calling every server and allowing stale data

* style: formatting

* fix: make the source for the upgrade binary unknown until apply

* func: use enos bundle to install remote upgrade version, enos_files is not meant for dynamic files

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-07 10:26:03 +01:00
salehjafarli
a914888c2c docs: Corrected meta keys example from sidecar_service documentation (#25042) 2025-02-07 08:43:13 +00:00
James Rasell
aef33e264a build: update to go 1.23.6 (#25041) 2025-02-07 08:09:31 +00:00
Juana De La Cuesta
d53b8a7e98 func: remove triggers from resources that copy the binaries into the remote instances (#25036) 2025-02-06 17:11:19 +01:00
dependabot[bot]
d9e2cb23b9 chore(deps): bump go.etcd.io/bbolt from 1.3.9 to 1.3.11 (#24841) 2025-02-06 09:57:10 +00:00
James Rasell
b394a76b89 jobspec2: isolate package from Nomad core and BUSL. (#25021) 2025-02-06 08:42:34 +00:00
dependabot[bot]
9fef959daf chore(deps): bump github.com/docker/cli (#24949) 2025-02-05 13:00:00 +00:00
Juana De La Cuesta
caeee0f238 Fix the last_log_index check and add a versions check (#24989)
* func: fix the last_log_index check and add a versions check

* fix: add small window to consider raft index equal
2025-02-05 10:34:11 +01:00
dependabot[bot]
21b53c85c2 chore(deps): bump github.com/aws/aws-sdk-go-v2/config (#24995) 2025-02-05 08:05:26 +00:00
Robert C. Ewing
824d362226 Fix: inaccurate docs (#25023)
Internally, sizes are always in binary units; this documentation is misleading and implies that they work in decimal units.

Without going through and replacing _every_ "MB" -> "MiB" this is the best way to hint to developers that binary sizes are used.
2025-02-04 13:42:13 -06:00
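To make the difference concrete, here is a small sketch comparing the two interpretations of "MB". The helper name `megabytesToBytes` is hypothetical, and which settings the commit's documentation change covers is not stated here.

```go
package main

import "fmt"

const (
	mib = 1024 * 1024 // binary unit, which the commit says is used internally
	mb  = 1000 * 1000 // decimal unit, which the docs could be read as implying
)

// megabytesToBytes converts a size written as "MB" using the binary
// interpretation described in the commit message.
func megabytesToBytes(n int64) int64 {
	return n * mib
}

func main() {
	fmt.Println(megabytesToBytes(256)) // binary: 268435456
	fmt.Println(int64(256) * mb)       // decimal: 256000000
}
```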
Phil Renaud
9367929d87 [cli] Adds Actions to job status command output (#24959)
* Adds Actions to job status command output

* Status documentation updated to show actions and formatJobActions no longer cares about pipe delineation
2025-02-04 09:34:49 -05:00
Phil Renaud
389f4612b6 [ui] Multi-condition start/revert/edit buttons when a job isn't running (#24985)
* Multi-condition start/revert/edit buttons when a job isn't running

* mirage-mocked revertable jobs and acceptance tests

* Remove version-watching from job index route
2025-02-03 22:36:50 -05:00
Marcel Johannesmann
ec073d0eab Update acl.mdx (#25013) 2025-02-03 12:52:45 -06:00
Tim Gross
7929939116 volume delete: allow prefix for ID (#24997)
The `volume delete` command doesn't allow using a prefix for the volume ID for
either CSI or dynamic host volumes. Use a prefix search and wildcard namespace
as we do for other CLI commands.

Ref: https://hashicorp.atlassian.net/browse/NET-12057
2025-02-03 11:29:43 -05:00
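A prefix lookup of the kind this commit describes might look like the sketch below. It is illustrative only: `resolveByPrefix` is a hypothetical helper, the IDs are made up, and rejecting an ambiguous prefix with an error is an assumption about the desired behavior rather than a description of the actual CLI code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolveByPrefix returns the single ID that starts with prefix.
// An exact match always wins; an ambiguous prefix is an error.
func resolveByPrefix(ids []string, prefix string) (string, error) {
	var matches []string
	for _, id := range ids {
		if id == prefix {
			return id, nil
		}
		if strings.HasPrefix(id, prefix) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no volume with ID prefix %q", prefix)
	case 1:
		return matches[0], nil
	default:
		sort.Strings(matches)
		return "", fmt.Errorf("prefix %q matched multiple volumes: %s",
			prefix, strings.Join(matches, ", "))
	}
}

func main() {
	ids := []string{"7f3a9c12", "7f3b0041", "c2d4e6f8"}

	fmt.Println(resolveByPrefix(ids, "c2"))
	fmt.Println(resolveByPrefix(ids, "7f3"))
	fmt.Println(resolveByPrefix(ids, "ff"))
}
```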
Aimee Ukasick
d9bb241b43 Docs SEO: Update runtime, networking, Nomad vs K8s, Nomad Enterprise, upgrading, release notes, and sectionless pages (#24764)
* Docs SEO: Updates

CE-781,782,785,788

* CE-791 single pages

* CE-786 enterprise section

* CE-789 release notes

* fix content-check error

* Update description and add intro body paragraph when appropriate

* fix typo

* Apply suggestions from Jeff's code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2025-02-03 10:03:36 -06:00
Aimee Ukasick
03faedbc69 Docs SEO: Update Concepts for search (#24757)
* Update for search engine optimization

* Update descriptions and add intro body summary paragraph

* Apply suggestions from code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2025-02-03 09:26:51 -06:00
Tim Gross
cc99e8f0a2 dynamic host volumes: add -id arg for updates of existing volumes (#24996)
If you create a volume via `volume create/register` and want to update it later,
you need to change the volume spec to add the ID that was returned. This isn't a
very nice UX, so let's add an `-id` argument that allows you to update existing
volumes that have that ID.

Ref: https://hashicorp.atlassian.net/browse/NET-12083
2025-02-03 10:26:30 -05:00
James Rasell
e4659970b1 sec: Suppress additional yamux advisory and AWS v1 indirect dep. (#25003) 2025-02-03 14:52:27 +00:00
dependabot[bot]
fd20f666ef chore(deps): bump github.com/hashicorp/memberlist from 0.5.2 to 0.5.3 (#24994) 2025-02-03 09:20:32 +00:00
Matt Keeler
833e240597 Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856)
* Upgrade to using hashicorp/go-metrics@v0.5.4

This also requires bumping the dependencies for:

* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)

Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
2025-01-31 15:22:00 -05:00
James Rasell
3d6de7fa6b docs: Update CNI install detail to use 1.6.2 (#24976)
CNI had release problems which meant 1.6.1 got pulled and 1.6.2 is
identical.
2025-01-31 07:30:15 +00:00
Juana De La Cuesta
3861c40220 func: add initial enos skeleton (#24787)
* func: add initial enos skeleton

* style: add headers

* func: change the variables input to a map of objects to simplify the workloads creation

* style: formatting

* Add tests for servers and clients

* style: separate the tests in different scripts

* style: add missing headers

* func: add tests for allocs

* style: improve output

* func: add step to copy remote upgrade version

* style: hcl formatting

* fix: remove the terraform nomad provider

* fix: Add clean token to remove extra new line added in provision

* fix: add missing license headers

* style: hcl fmt

* style: rename variables and fix format

* func: remove the template step on the workloads module and chop the nomad token output on the provide module

* fix: correct the jobspec path on the workloads module

* fix: add missing variable definitions on job specs for workloads

* style: formatting

* fix: rename variable in health test
2025-01-30 16:37:55 +01:00
James Rasell
0d57e91282 sec: Suppress yamux OSV alert in CRT. (#24978)
The change also removes an old suppression which has now been
resolved.
2025-01-30 15:27:19 +00:00
James Rasell
bfd5f38761 ui: Remove unrequired node read from task log streaming page. (#24973)
Co-authored-by: Phil Renaud <phil@riotindustries.com>
2025-01-30 07:42:27 +00:00
Michael Smithhisler
47c14ddf28 remove remote task execution code (#24909) 2025-01-29 08:08:34 -05:00
Daniel Bennett
dcf6201d2b dynamic host volumes: CE side of quota tweaks (#24972)
* quota spec:
  if `region_limit.storage.host_volumes` is set,
  do not require that `variables` also be set,
  and vice versa.
* subtract from quota usage on volume delete
* stub CE quota subtraction method
2025-01-28 17:27:25 -06:00
Juana De La Cuesta
1b1ad896ec Add the path to the ssh key to connect to the cluster's instances as an output (#24969)
* fix: add the ssh key pem path to the outputs and fix the message with the correct path

* func: add ssh pem key as output
2025-01-28 18:25:02 +01:00
James Rasell
c8d7e741c8 e2e: Fix TF output SSH key path. (#24965) 2025-01-28 16:29:56 +00:00
Deniz Onur Duzgun
bfcbe83ab5 sec: sanitize identity token from events (#24966)
* bug: sanitize identity token from events

* add changelog
2025-01-28 10:57:06 -05:00
James Rasell
7a450f5499 build: update to go 1.23.5 (#24963) 2025-01-28 15:47:00 +00:00
James Rasell
8859cfa3f5 e2e: Ensure Consul client is running before starting Nomad service. (#24964) 2025-01-28 15:28:12 +00:00
Tim Gross
09eb473189 dynamic host volumes: set status unavailable on failed restore (#24962)
When a client restarts but can't restore a volume (e.g., the plugin is now
missing), it's removed from the node fingerprint. So we won't allow future
scheduling of the volume, but we were not updating the volume state field to
report this reasoning to operators. Make debugging easier and the state field
more meaningful by setting the value to "unavailable".

Also, remove the unused "deleted" field. We did not implement soft deletes and
aren't planning on it for Nomad 1.10.0.

Ref: https://hashicorp.atlassian.net/browse/NET-11551
2025-01-27 16:35:53 -05:00
Michael Smithhisler
b7aabb11be changelog: add entry for PR #24739 (#24961) 2025-01-27 13:48:37 -05:00
Gabi
e107d84c78 taskrunner: fix panic when a task that has a dynamic user is recovered (#24739) 2025-01-27 13:05:55 -05:00
Phil Renaud
7106ac1462 Update playwright to 1.50.0 for e2e ui tests (#24956) 2025-01-27 12:03:59 -05:00
Daniel Bennett
49c147bcd7 dynamic host volumes: change env vars, fixup auto-delete (#24943)
* plugin env: DHV_HOST_PATH->DHV_VOLUMES_DIR
* client config: host_volumes_dir
* plugin env: add namespace+nodepool
* only auto-delete after error saving client state
  on *initial* create
2025-01-27 10:36:53 -06:00
Judith Malnick
890daba432 Remove web team from CODEOWNERS for content directories (#24946) 2025-01-27 08:57:58 -05:00
Seth Hoenig
1356880962 fingerprint: convert consul and vault fingerprinters to be reloadable (#24526)
This PR changes the Consul and Vault fingerprint implementations to be
reloadable rather than periodic. Reasons described in the issue.
2025-01-27 09:20:01 +00:00
Tim Gross
7add04eb0f refactor: volume request modes to be generic between DHV/CSI (#24896)
When we implemented CSI, the types of the fields for access mode and attachment
mode on volume requests were defined with a prefix "CSI". This gets confusing
now that we have dynamic host volumes using the same fields. Fortunately the
original was a typedef on string, and the Go API in the `api` package just uses
strings directly, so we can change the name of the type without breaking
backwards compatibility for the msgpack wire format.

Update the names to `VolumeAccessMode` and `VolumeAttachmentMode`. Keep the CSI
and DHV specific value constant names for these fields (they aren't currently
1:1), so that we can easily differentiate in a given bit of code which values
are valid.

Ref: https://github.com/hashicorp/nomad/pull/24881#discussion_r1920702890
2025-01-24 10:37:48 -05:00
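The compatibility argument in this commit can be checked with a small sketch: a defined type over `string` encodes as a plain string, so renaming the type does not change the bytes on the wire. The example below uses `encoding/json` from the standard library as a stand-in for msgpack, and the old type name, struct shapes, and sample value are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CSIVolumeAccessMode stands in for the old, CSI-prefixed type name (assumed).
type CSIVolumeAccessMode string

// VolumeAccessMode is the new generic name from the commit message.
type VolumeAccessMode string

// oldRequest and newRequest are hypothetical structs that differ only in
// the name of the field's type.
type oldRequest struct {
	AccessMode CSIVolumeAccessMode
}

type newRequest struct {
	AccessMode VolumeAccessMode
}

// encode serializes v. JSON is used here only because it is in the standard
// library; the commit itself is about the msgpack wire format.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	before := encode(oldRequest{AccessMode: "single-node-writer"})
	after := encode(newRequest{AccessMode: "single-node-writer"})

	fmt.Println(before)
	fmt.Println(after)
	fmt.Println(before == after)
}
```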
James Rasell
b4d71f6693 changelog: add entry for #24919 (#24939) 2025-01-24 14:29:41 +00:00