14 Commits

Author SHA1 Message Date
Tim Gross
6c9f2fdd29 reduce upgrade testing flakes (#25839)
This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems more understandable:

* When a node is drained prior to the 3rd client upgrade, it's entirely
  possible the 3rd client to be upgraded is the drained node. This results in
  miscounting the expected number of allocations because many of them will be
  "complete" (service/batch) or "pending" (system). Leave the system jobs running
  during drains and only count the running allocations at that point as the
  expected set. Move the inline script that gets this count into a script file for
  legibility.

* When the last initial workload is deployed, it's possible for it to be
  briefly still in "pending" when we move to the next step. Poll for a short
  window for the expected count of jobs.

* Make sure that any scripts that run right after a server or client comes back
  up can handle temporary unavailability gracefully.

* Change the debugging output of several scripts so that it no longer runs into
  the error message (e.g., with "some allocs are not running", the first running
  allocation looked like it was the missing allocation).

* Add some notes to the README about running locally with `-dev` builds and
  tagging a cluster with your own name.
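
The polling and temporary-unavailability points above can be sketched with a small retry helper. This is a minimal illustration, not the code from the changeset: the function name `retry_until` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash

# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the attempt
# budget is exhausted, sleeping DELAY_SECONDS between attempts.
retry_until() {
    local max_attempts=$1
    local delay=$2
    shift 2

    local attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            # Keep the failure message on its own line, separate from
            # any debug output the command itself printed.
            echo "gave up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A script that runs right after a client restarts could wrap its first API call, for example `retry_until 10 2 nomad node status`, so that a briefly unreachable agent or a job that is still "pending" does not fail the step immediately.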

Ref: https://hashicorp.atlassian.net/browse/NMD-162
2025-05-13 08:40:22 -04:00
Juana De La Cuesta
695ba2c159 Fix the verify alloc script (#25837)
* fix: use the raw option on jq to avoid treating the " like a char

* Update verify_allocs.sh
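
As an illustration of the quoting problem fixed here, the sketch below compares jq output with and without `-r` (`--raw-output`). The JSON document and variable names are made up for the example and do not come from `verify_allocs.sh`.

```shell
#!/usr/bin/env bash

# Hypothetical sample input; the real script reads Nomad output.
json='{"ID":"abc123"}'

# Without -r, jq prints a JSON string, so the double quotes are part
# of the captured value.
quoted=$(echo "$json" | jq '.ID')

# With -r, jq prints the bare string, which is what a shell comparison
# against an ID usually expects.
raw=$(echo "$json" | jq -r '.ID')

echo "quoted: $quoted"
echo "raw:    $raw"
```

Comparing `$quoted` against a plain ID fails because of the embedded quotes, while `$raw` matches.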
2025-05-12 14:53:28 +02:00
Juana De La Cuesta
cb09696b1c Nojira upgrade3 (#25817)
* fix: typo

* fix: correct the script for unbound var

* fix: typo

* fix: typo
2025-05-06 18:21:33 +02:00
Juana De La Cuesta
f68203549b Fix the verify allocs, missing echo (#25816)
* fix: typo

* fix: correct the script for unbound var

* fix: typo
2025-05-06 17:16:56 +02:00
Juana De La Cuesta
42d4067d55 Nojira upgrade3 (#25815)
* fix: typo

* fix: correct the script for unbound var
2025-05-06 16:57:44 +02:00
Juana De La Cuesta
da0ea9935d fix: typo (#25814)
2025-05-06 16:44:25 +02:00
Juana De La Cuesta
22921418b6 Check for allocs running before checking for IDs after a client upgrade (#25790)
* fix: wait for all allocs to be running before checking for their IDs after client upgrade

* style: linter fix

* fix: filter running allocs per client ID when checking for allocs after upgrade
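
The per-client filtering described above might look roughly like the following. This is a sketch under assumptions: the sample JSON is invented, and it relies only on the `ID`, `NodeID`, and `ClientStatus` fields of an allocation listing. The variable names are hypothetical and not taken from the actual script.

```shell
#!/usr/bin/env bash

# Hypothetical allocation listing; a real script would get this from Nomad.
allocs='[
  {"ID":"a1","NodeID":"n1","ClientStatus":"running"},
  {"ID":"a2","NodeID":"n1","ClientStatus":"complete"},
  {"ID":"a3","NodeID":"n2","ClientStatus":"running"}
]'

# The client that was just upgraded (assumed value for the example).
client_id="n1"

# IDs of allocations that are running on that client only.
running_ids=$(echo "$allocs" | jq -r --arg node "$client_id" \
    '.[] | select(.NodeID == $node and .ClientStatus == "running") | .ID')

# Number of such allocations, computed by jq so an empty result is 0.
running_count=$(echo "$allocs" | jq --arg node "$client_id" \
    '[.[] | select(.NodeID == $node and .ClientStatus == "running")] | length')

echo "running on $client_id: $running_count ($running_ids)"
```

Filtering on both fields keeps completed allocations and allocations on other clients out of the expected set.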
2025-05-06 16:22:45 +02:00
Juanadelacuesta
2f02c90391 func: expand on some logs to get more info in case of a failure
2025-04-15 14:37:57 -04:00
Juanadelacuesta
332e859da0 Typo: Wrong function name
2025-03-26 10:06:40 +01:00
Juanadelacuesta
c3258ab0f6 fix: reuse client_id when checking for running allocs
2025-03-24 15:11:33 +01:00
Juanadelacuesta
ce261be358 style: linter fix
2025-03-21 15:26:25 +01:00
Juanadelacuesta
82fcc62c46 func: add verification for allocs correctly reattaching after client restarts
2025-03-21 15:14:00 +01:00
Tim Gross
f528022e3a upgrade testing: add missing dependency during client upgrades (#25306)
The check to read back node metadata depends on a resource that waits for the
Nomad API, but that resource doesn't wait for the metadata to be written in the
first place (and the client subsequently upgraded). Add this dependency so that
we're reading back the node metadata as the last step.

Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13690355150/job/38282457406
2025-03-07 09:06:04 -05:00
Juana De La Cuesta
2dadf9fe6c Improve stability (#25244)
* func: add dependencies to avoid race conditions and move the update to each client to the main upgrade scenario

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-04 16:23:07 +01:00