This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems easier to understand:
* When a node is drained prior to the 3rd client upgrade, it's entirely
possible that the 3rd client to be upgraded is the drained node. This results
in miscounting the expected number of allocations, because many of them will be
"complete" (service/batch) or "pending" (system). Leave the system jobs running
during drains and only count the running allocations at that point as the
expected set (see the sketch after this list). Move the inline script that gets
this count into a script file for legibility.
* When the last initial workload is deployed, it's possible for it to still be
briefly in "pending" when we move to the next step. Poll for a short window
until the expected count of jobs is reached.
* Make sure that any scripts run right after a server or client comes back up
can handle temporary unavailability gracefully.
* Change the debugging output of several scripts so the debug output doesn't
run into the error message (e.g. "some allocs are not running" made it look
like the first running allocation was the missing one).
* Add some notes to the README about running locally with `-dev` builds and
tagging a cluster with your own name.
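As an illustration of the counting change, here's a minimal sketch (not the
actual script added here); it assumes `NOMAD_ADDR` and `NOMAD_TOKEN` are
exported, `jq` is installed, and `NODE_ID` is a placeholder for the node being
drained:

```bash
#!/usr/bin/env bash
set -eo pipefail

# Drain the node but leave system jobs in place so their allocations stay
# "running" through the drain.
nomad node drain -enable -ignore-system -yes -deadline 5m "$NODE_ID"

# Count only the allocations that are actually running at this point and use
# that count as the expected set for the post-upgrade checks.
expected_running=$(curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN:-}" \
  "$NOMAD_ADDR/v1/allocations" |
  jq '[.[] | select(.ClientStatus == "running")] | length')

echo "expecting $expected_running running allocations after the upgrade"
```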
Ref: https://hashicorp.atlassian.net/browse/NMD-162
* func: add possibility of having different binaries for servers and clients
* style: rename binaries modules
* func: remove the check for last configuration log, and only take one snapshot when upgrading the servers
* Update enos/modules/upgrade_servers/main.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
We're using `set -eo pipefail` everywhere in the Enos scripts, but several of
the scripts used for checking assertions didn't account for pipefail in a way
that avoided early exits on transient errors. This meant that if a server was
slightly late to come back up, we'd hit an error and exit the whole script
instead of polling as expected.
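As a hypothetical illustration of the failure mode (assuming `NOMAD_ADDR` is
set), a transient connection error in the piped check aborts the whole script
on the first failure instead of letting the loop retry:

```bash
#!/usr/bin/env bash
set -eo pipefail

# Wait for the cluster to elect a leader after a server restart.
for _ in {1..10}; do
    # If the server is still coming back up, curl exits non-zero; with -e and
    # pipefail the script dies right here instead of sleeping and retrying.
    leader=$(curl -sf "$NOMAD_ADDR/v1/status/leader" | tr -d '"')
    if [ -n "$leader" ]; then
        echo "leader is $leader"
        exit 0
    fi
    sleep 5
done
echo "no leader elected in time" 1>&2
exit 1
```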
While fixing this, I've made a number of other improvements to the shell scripts:
* I've changed the design of the polling loops so that we call a function that
returns an exit code and sets a `last_error` value, along with any global
variables required by downstream functions (see the sketch after this list).
This makes the loops more readable by reducing the number of global variables,
and helped identify some places where we were exiting instead of returning into
the loop.
* Using `shellcheck -s bash`, I fixed some unused and undefined variables that
we'd missed because they were only used on the error paths.
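As a rough sketch of that pattern (the names `checkAllocsRunning`,
`EXPECTED_ALLOCS`, and `MAX_WAIT_TIME` are illustrative, not the repository's
actual identifiers; `NOMAD_ADDR` and `jq` are assumed to be available):

```bash
#!/usr/bin/env bash
set -eo pipefail

# The check lives in a function that returns an exit code and records the
# reason for failure in `last_error`; the loop decides whether to retry.
checkAllocsRunning() {
    local running
    # On a transient API error, record why and return non-zero so the loop
    # can retry, rather than exiting the whole script.
    running=$(curl -sf "$NOMAD_ADDR/v1/allocations" |
        jq '[.[] | select(.ClientStatus == "running")] | length') || {
        last_error="could not query allocations (server may still be starting)"
        return 1
    }
    if [ "$running" -ne "$EXPECTED_ALLOCS" ]; then
        last_error="expected $EXPECTED_ALLOCS running allocs, found $running"
        return 1
    fi
    return 0
}

elapsed=0
while ! checkAllocsRunning; do
    if [ "$elapsed" -ge "${MAX_WAIT_TIME:-120}" ]; then
        # Print the error on its own line so it doesn't run into debug output.
        echo "$last_error" 1>&2
        exit 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
done
```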
* fix: change the value of the version used for testing to account for ent versions
* func: add a more specific test for server stability
* func: change the criteria we use to verify the cluster stability after server upgrades
* style: syntax
* func: add initial enos skeleton
* style: add headers
* func: change the variables input to a map of objects to simplify the workloads creation
* style: formatting
* Add tests for servers and clients
* style: separate the tests into different scripts
* style: add missing headers
* func: add tests for allocs
* style: improve output
* func: add step to copy remote upgrade version
* style: hcl formatting
* fix: remove the terraform nomad provider
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* fix: add missing license headers
* style: hcl fmt
* style: rename variables and fix format
* func: remove the template step on the workloads module and chop the nomad token output on the provision module
* fix: correct the jobspec path on the workloads module
* fix: add missing variable definitions on job specs for workloads
* style: formatting
* fix: Add clean token to remove extra new line added in provision
* func: add module to upgrade servers
* style: missing headers
* func: add upgrade module
* func: add install for windows as well
* func: add an intermediate module that runs the upgrade server for each server
* fix: add missing license headers
* fix: remove extra input variables and connect upgrade servers to the scenario
* fix: rename missing env variables for cluster health scripts
* func: move the cluster health test outside of the modules and into the upgrade scenario
* fix: fix the regex to ignore snap files on the gitignore file
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* fix: remove extra input variables and connect upgrade servers to the scenario
* style: formatting
* fix: move taking and restoring snapshots out of the upgrade_single_server to avoid possible race conditions
* fix: rename variable in health test
* fix: Add clean token to remove extra new line added in provision
* func: add an intermediate module that runs the upgrade server for each server
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* fix: Add clean token to remove extra new line added in provision
* func: fix the last_log_index check and add a versions check
* func: don't use for_each when upgrading the servers; hardcode each one to ensure they are upgraded one by one
* Update enos/modules/upgrade_instance/variables.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update enos/modules/upgrade_instance/variables.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update enos/modules/upgrade_instance/variables.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: make snapshot by calling every server and allowing stale data
* style: formatting
* fix: make the source for the upgrade binary unknown until apply
* func: use enos bundle to install remote upgrade version, enos_files is not meant for dynamic files
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>