61 Commits

Author SHA1 Message Date
James Rasell
d3e077a78e enos: Modify Windows TF variable to match new 2022 value. (#26067) 2025-06-17 08:13:36 +01:00
Tim Gross
9ee2582379 upgrade test: remove change mode from Vault workload (#25861)
During the upgrade test we can trigger a re-render of the Vault secret due to
client restart before the allocrunner has marked the task as running, which
triggers the change mode on the template and restarts the task. This results in
a race where the alloc is still "pending" when we go to check it. We never
change the value of this secret in upgrade testing, so paper over this race
condition by setting a "noop" change mode.
2025-05-15 10:10:58 -04:00
Tim Gross
6c9f2fdd29 reduce upgrade testing flakes (#25839)
This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems more understandable:

* When a node is drained prior to the 3rd client upgrade, it's entirely
  possible the 3rd client to be upgraded is the drained node. This results in
  miscounting the expected number of allocations because many of them will be
  "complete" (service/batch) or "pending" (system). Leave the system jobs running
  during drains and only count the running allocations at that point as the
  expected set. Move the inline script that gets this count into a script file for
  legibility.

* When the last initial workload is deployed, it's possible for it to be
  briefly still in "pending" when we move to the next step. Poll for a short
  window for the expected count of jobs.

* Make sure that any scripts that are being run right after a server or client
  is coming back up can handle temporary unavailability gracefully.

* Change the debugging output of several scripts to avoid having the debug
  output run into the error message (Ex. "some allocs are not running" looked like
  the first allocation running was the missing allocation).

* Add some notes to the README about running locally with `-dev` builds and
  tagging a cluster with your own name.

Ref: https://hashicorp.atlassian.net/browse/NMD-162
2025-05-13 08:40:22 -04:00
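The polling and "tolerate temporary unavailability" points in the changeset above can be sketched as follows. This is a minimal illustration in Python, not the project's actual shell scripts: the function name `wait_for_running_count` and the injected `fetch_running_count` callable are hypothetical, and the real check would query the Nomad API or CLI instead.

```python
import time
from typing import Callable, Optional


def wait_for_running_count(
    fetch_running_count: Callable[[], int],  # hypothetical: returns the current number of running allocs
    expected: int,
    attempts: int = 5,
    interval_s: float = 1.0,
) -> int:
    """Poll until the running count matches `expected`, or raise TimeoutError."""
    last_error: Optional[str] = None
    for _ in range(attempts):
        try:
            count = fetch_running_count()
        except ConnectionError as err:
            # A server or client may still be coming back up: record the error and retry.
            last_error = f"API unavailable: {err}"
        else:
            if count == expected:
                return count
            # Keep the most recent mismatch so the final message explains the failure.
            last_error = f"expected {expected} running allocs, found {count}"
        time.sleep(interval_s)
    raise TimeoutError(last_error)
```

Keeping the last observed error separate from the debug output is one way to avoid the confusion the changeset describes, where debug lines ran into the error message.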
Juana De La Cuesta
695ba2c159 Fix the verify alloc script (#25837)
* fix: use the raw option on jq to avoid treating the " as a literal character

* Update verify_allocs.sh
2025-05-12 14:53:28 +02:00
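The quoting problem behind the fix above can be illustrated in Python. Without `-r`, jq prints string values JSON-encoded, so the surrounding quotes become part of the output and a plain string comparison fails. The allocation data below is made up for illustration, and this sketch only mirrors the behaviour; it is not the content of `verify_allocs.sh`.

```python
import json

# Made-up allocation record, for illustration only.
alloc = json.loads('{"ID": "8f3c", "ClientStatus": "running"}')

# Comparable to `jq '.ClientStatus'`: the value stays JSON-encoded, quotes included.
json_encoded = json.dumps(alloc["ClientStatus"])

# Comparable to `jq -r '.ClientStatus'`: the raw string, without quotes.
raw = alloc["ClientStatus"]

print(json_encoded == "running")  # False: the encoded value carries literal quotes
print(raw == "running")           # True
```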
Juana De La Cuesta
cb09696b1c Nojira upgrade3 (#25817)
* fix: typo

* fix: correct the script for unbound var

* fix: typo

* fix: typo
2025-05-06 18:21:33 +02:00
Juana De La Cuesta
f68203549b Fix the verify allocs, missing echo (#25816)
* fix: typo

* fix: correct the script for unbound var

* fix: typo
2025-05-06 17:16:56 +02:00
Juana De La Cuesta
42d4067d55 Nojira upgrade3 (#25815)
* fix: typo

* fix: correct the script for unbound var
2025-05-06 16:57:44 +02:00
Juana De La Cuesta
da0ea9935d fix: typo (#25814) 2025-05-06 16:44:25 +02:00
Juana De La Cuesta
22921418b6 Check for allocs running before checking for IDs after a client upgrade (#25790)
* fix: wait for all allocs to be running before checking for their IDs after client upgrade

* style: linter fix

* fix: filter running allocs per client ID when checking for allocs after upgrade
2025-05-06 16:22:45 +02:00
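The check described in the commit above (collect the running allocations for one client, then compare their IDs across the upgrade) can be sketched as below. This is an illustrative Python version under assumptions: the helper name `running_alloc_ids` is hypothetical, the sample data is invented, and the field names `ID`, `NodeID` and `ClientStatus` are assumed to match the allocation JSON the scripts consume.

```python
from typing import Dict, List


def running_alloc_ids(allocs: List[Dict[str, str]], client_id: str) -> List[str]:
    """Return the sorted IDs of allocations running on the given client node."""
    return sorted(
        a["ID"]
        for a in allocs
        if a["NodeID"] == client_id and a["ClientStatus"] == "running"
    )


# Invented sample data: two clients, one allocation already complete.
before = [
    {"ID": "a1", "NodeID": "client-1", "ClientStatus": "running"},
    {"ID": "a2", "NodeID": "client-1", "ClientStatus": "complete"},
    {"ID": "b1", "NodeID": "client-2", "ClientStatus": "running"},
]
after = [
    {"ID": "a1", "NodeID": "client-1", "ClientStatus": "running"},
    {"ID": "b1", "NodeID": "client-2", "ClientStatus": "running"},
]

# Allocations reattached correctly if the same IDs are running on the client afterwards.
reattached = running_alloc_ids(before, "client-1") == running_alloc_ids(after, "client-1")
```

Filtering by client first keeps allocations on other nodes, or ones that already completed, from skewing the comparison.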
Juanadelacuesta
2f02c90391 func: expand on some logs to get more info in case of a failure 2025-04-15 14:37:57 -04:00
Tim Gross
e4d2fc93cd upgrade testing: temporarily disable CSI workload (#25589)
The CSI workload we're using for upgrade testing seems to be flaky when coming
up. The plugin jobs don't launch in a timely fashion despite several
attempts. In order to not block running the rest of the upgrade testing, let's
disable this workload temporarily. We'll fix this in NET-12430.

Ref: https://hashicorp.atlassian.net/browse/NET-12430
2025-04-03 08:53:20 -04:00
Juanadelacuesta
332e859da0 Typo: Wrong function name 2025-03-26 10:06:40 +01:00
Juana De La Cuesta
2bd5dc5970 Merge pull request #25479 from hashicorp/NET-11546-enos-same-allocs
Add a test for reattaching allocs after client restart
2025-03-24 16:03:57 +01:00
Juanadelacuesta
c3258ab0f6 fix: reuse client_id when checking for running allocs 2025-03-24 15:11:33 +01:00
Juanadelacuesta
ce261be358 style: linter fix 2025-03-21 15:26:25 +01:00
Juanadelacuesta
b1dbc14499 func: make the csi_plugin health timeout a little longer to help the test run better locally 2025-03-21 15:16:00 +01:00
Juanadelacuesta
82fcc62c46 func: add verification for allocs correctly reattaching after client restarts 2025-03-21 15:14:00 +01:00
Juanadelacuesta
e0d3be81da fix: declare license inputs as sensitive variables 2025-03-19 19:53:32 +01:00
Juanadelacuesta
cd1640e59a style: linter fix 2025-03-17 16:19:29 +01:00
Juana De La Cuesta
9b9d16421e Merge branch 'main' into NET-11546-enos-drain 2025-03-17 16:14:18 +01:00
Juana De La Cuesta
9d5359886e Update drain.sh 2025-03-17 14:37:23 +01:00
Juana De La Cuesta
9574a0d319 Update enos/modules/drain_nodes/scripts/drain.sh
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-17 14:36:57 +01:00
Juanadelacuesta
134441b4a7 func: add .gitignore entry to avoid committing the rendered vault job spec 2025-03-17 14:29:35 +01:00
Juanadelacuesta
0239e0e915 fix: add missing command to enable drain eligibility 2025-03-17 13:56:28 +01:00
Juanadelacuesta
cfd4ee1756 fix: add missing variables for drain module 2025-03-14 17:57:26 +01:00
Juanadelacuesta
fba2efa728 func: add a step to drain a node as part of the upgrade process 2025-03-14 17:43:36 +01:00
Juanadelacuesta
4b0903789e func: add check script for vault workload 2025-03-14 17:03:35 +01:00
Juanadelacuesta
4c1ba45d48 func: add workload to test vault workload identity 2025-03-13 17:55:59 +01:00
Tim Gross
8cf34bde62 upgrade testing: allow configurable artifactory repo (#25350)
Prerelease builds are in a different Artifactory repository than release
builds. Make this a variable option so we can test prerelease builds in the
nightly/weekly runs.
2025-03-13 10:32:02 -04:00
Juana De La Cuesta
ad7dc7a4eb Merge pull request #25348 from hashicorp/NET-11546-enos-linux
Add instructions to add new workloads to the tests.
2025-03-13 10:38:47 +01:00
Juanadelacuesta
ebeb3047c8 docs: add note about workloads life expectancy 2025-03-12 16:51:03 +01:00
Juana De La Cuesta
3de2a6b1d6 Update README.md 2025-03-11 17:51:56 +01:00
Juana De La Cuesta
b1ea04a4d1 Update README.md 2025-03-11 17:50:26 +01:00
Juana De La Cuesta
859f257d32 Update README.md 2025-03-11 17:48:45 +01:00
Juanadelacuesta
08f386e8e5 docs: Add section of readme to add workloads 2025-03-11 17:48:14 +01:00
Tim Gross
61bbff9c24 upgrade testing: Variables, Workload Identity, and Task API (#25229)
Add an upgrade test workload that continuously writes to a Nomad
Variable. In order to run this workload, we'll need to deploy a
Workload-Associated ACL policy. So this extends the `run_workloads` module to
allow for a "pre script" to be run before a given job is deployed. We can use
that as a model for other test workloads.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-03-11 08:48:40 -04:00
Tim Gross
5cc1b4e606 upgrade tests: add transparent proxy workload (#25176)
Add an upgrade test workload for Consul service mesh with transparent
proxy. Note this breaks from the "countdash" demo. The dashboard application
can only verify the backend is up by making a websocket connection, which we
can't do as a health check, and the health check it exposes for that purpose
only passes once the websocket connection has been made. So replace the
dashboard with a minimal nginx reverse proxy to the count-api instead.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-03-07 15:25:26 -05:00
Tim Gross
f528022e3a upgrade testing: add missing dependency during client upgrades (#25306)
The check to read back node metadata depends on a resource that waits for the
Nomad API, but that resource doesn't wait for the metadata to be written in the
first place (and the client subsequently upgraded). Add this dependency so that
we're reading back the node metadata as the last step.

Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13690355150/job/38282457406
2025-03-07 09:06:04 -05:00
Tim Gross
694b10d71c upgrade testing: commit missing volume specification (#25305)
In #25285 we converted the CSI workload for upgrade testing to use a self-hosted
NFS. But the volume spec name got changed to `volume.hcl` in the process, which
is in our `.gitignore` file for the repo. We missed this during testing because
the file existed locally, but it fails in nightly runs.

Ref: https://github.com/hashicorp/nomad/pull/25285
Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13703979647/job/38324786351
2025-03-06 14:36:34 -05:00
Tim Gross
916fe2c7fa upgrade testing: rework CSI test to use self-contained workload (#25285)
Getting the CSI test to work with AWS EFS or EBS has proven to be awkward
because we're having to deal with external APIs with their own consistency
guarantees, as well as challenges around teardown. Make the CSI test entirely
self-contained by using a userland NFS server and the rocketduck CSI plugin.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
Ref: https://gitlab.com/rocketduck/csi-plugin-nfs
2025-03-05 11:48:19 -05:00
Tim Gross
7a051991bd upgrade testing: temporarily disable CSI test (#25283)
The CSI workload is failing and creating complications for teardown, so I'm
reworking it. But this work is taking a while to finish, so while that's in
progress let's disable the CSI workload so that we're running the upgrade tests
all the way through to the end. I expect to be able to revert this in the next
couple days.
2025-03-04 11:21:45 -05:00
Tim Gross
9cc0e2eae0 upgrade testing: make cluster name prefix a variable (#25281)
During initial development of upgrade testing, we had a hard-coded prefix to
distinguish between clusters created for this vs those created by GHA
runners. Update the prefix to be a variable so that developers can add their own
prefix during test workload development.
2025-03-04 11:11:02 -05:00
Juana De La Cuesta
2dadf9fe6c Improve stability (#25244)
* func: add dependencies to avoid race conditions and move the update to each client to the main upgrade scenario

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/enos-scenario-upgrade.hcl

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-04 16:23:07 +01:00
Tim Gross
4a62d1b75c upgrade tests: add CSI workload (#25223)
Add an upgrade test workload for CSI with the AWS EFS plugin. In order to
validate this workload, we'll need to deploy the plugin job and then register a
volume with it. So this extends the `run_workloads` module to allow for "pre
scripts" and "post scripts" to be run before and after a given job has been
deployed. We can use that as a model for other test workloads.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-02-27 15:16:04 -05:00
Tim Gross
6ae1444cf4 upgrade testing: debugging assistance (#25232)
Enos buries the Terraform output from provisioning. Add a shell script to load
the environment from provisioning for debugging Nomad during development of
upgrade tests.
2025-02-27 08:35:45 -05:00
Juana De La Cuesta
461d4268e2 func: add python servers to raw exec workloads (#25230) 2025-02-26 18:05:46 +01:00
Juana De La Cuesta
b13132043b Add new workloads (#25106)
* func: Add more workloads

* Update jobs.sh

* Update versions.sh

* style: format

* Update enos/modules/test_cluster_health/scripts/allocs.sh

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: improve outputs descriptions

* func: change docker workloads to be redis boxes and add healthchecks

* func: register the services on consul

* style: format

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-26 17:02:27 +01:00
Tim Gross
8c95f5f17e upgrade testing: make sure we capture last error if not exiting (#25186)
While testing #25172 I found a few spots where #25152 wasn't capturing the
errors from transient failures correctly, or was exiting early instead of
retrying.

Ref: https://hashicorp.atlassian.net/browse/NET-11546
2025-02-24 09:37:17 -05:00
Juana De La Cuesta
0529c0247d Only take one snapshot when upgrading servers (#25187)
* func: add possibility of having different binaries for server and clients

* style: rename binaries modules

* func: remove the check for last configuration log, and only take one snapshot when upgrading the servers

* Update enos/modules/upgrade_servers/main.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-24 15:06:16 +01:00
Juana De La Cuesta
4a75d2de63 Adjust the servers to be always linux instances (#25172)
* func: add possibility of having different binaries for server and clients

* style: rename binaries modules

* docs: update comments

* fix: correct the token input variable for fetch binaries
2025-02-24 13:09:57 +01:00