reduce upgrade testing flakes (#25839)

This changeset includes several adjustments to the upgrade testing scripts to
reduce flakes and make problems more understandable:

* When a node is drained prior to the 3rd client upgrade, it's entirely
  possible that the 3rd client to be upgraded is the drained node. This results
  in miscounting the expected number of allocations, because many of them will
  be "complete" (service/batch) or "pending" (system). Leave the system jobs
  running during drains and only count the allocations that are running at that
  point as the expected set. Move the inline script that gets this count into a
  script file for legibility.

* When the last initial workload is deployed, it's possible for it to still be
  briefly in "pending" when we move to the next step. Poll for a short window
  until the expected count of jobs is running.

* Make sure that any scripts being run right after a server or client comes
  back up can handle temporary unavailability gracefully (see the sketch after
  this list).

* Change the debugging output of several scripts to avoid having the debug
  output run into the error message (e.g. "some allocs are not running" made it
  look like the first running allocation listed was the missing one).

* Add some notes to the README about running locally with `-dev` builds and
  tagging a cluster with your own name.
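
The recurring fix across these scripts is a small "tolerant check plus retry
loop" pattern. Below is a condensed sketch of it, not the exact code in the
changed files: `ALLOC_COUNT` is assumed to come from the environment, and the
timeouts and messages are illustrative placeholders.

```bash
#!/usr/bin/env bash
# Sketch of the retry pattern used by the upgrade test scripts: the check
# function records its failure in last_error and returns non-zero so the
# outer loop can retry, instead of exiting on the first transient error.
# ALLOC_COUNT is assumed to be provided by the environment.
set -euo pipefail

MAX_WAIT_TIME=120
POLL_INTERVAL=5
elapsed_time=0
last_error=

checkAllocsCount() {
    local allocs running
    # The Nomad API may be briefly unavailable right after a server or client
    # restart, so treat a failed query as "retry later", not a fatal error.
    allocs=$(nomad alloc status -json) || {
        last_error="Failed to check alloc status"
        return 1
    }
    # Only "running" allocations count toward the expected set; allocs left
    # "complete" or "pending" by an earlier drain are ignored.
    running=$(echo "$allocs" | jq '[.[] | select(.ClientStatus == "running")] | length')
    if [ "$running" -eq "$ALLOC_COUNT" ]; then
        return 0
    fi
    last_error="expected $ALLOC_COUNT running allocs, found $running"
    return 1
}

while true; do
    checkAllocsCount && break
    if [ "$elapsed_time" -ge "$MAX_WAIT_TIME" ]; then
        echo "Error: $last_error within ${elapsed_time}s" >&2
        exit 1
    fi
    echo "$last_error. Retrying in ${POLL_INTERVAL}s..."
    sleep "$POLL_INTERVAL"
    elapsed_time=$((elapsed_time + POLL_INTERVAL))
done
```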

Ref: https://hashicorp.atlassian.net/browse/NMD-162
Tim Gross
2025-05-13 08:40:22 -04:00
committed by GitHub
parent 695ba2c159
commit 6c9f2fdd29
10 changed files with 135 additions and 53 deletions

View File

@@ -66,6 +66,7 @@ Next you'll need to populate the Enos variables file `enos.vars.hcl` (unlike
Terraform, Enos doesn't accept variables on the command line):
```hcl
prefix = "<your first name or initials>"
artifactory_username = "<your email address>"
artifactory_token = "<your ARTIFACTORY_TOKEN from above>"
product_version = "1.8.9" # starting version
@@ -76,6 +77,15 @@ consul_license = "<your Consul Enterprise license, currently always requir
aws_region = "us-east-1"
```
If you want to test "dev" builds, you'll need to adjust the above as follows:
```hcl
product_version = "1.8.9-dev" # starting version
upgrade_version = "1.9.4-dev" # version to upgrade to
artifactory_repo_start = "hashicorp-crt-dev-local*" # Artifactory repo to search
artifactory_repo_upgrade = "hashicorp-crt-dev-local*" # Artifactory repo to search
```
When the variables file is placed in the enos root folder with the name
`enos.vars.hcl`, it is automatically picked up by Enos. If a different variables
file is used, it can be passed with the `--var-file` flag.
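
For example, assuming the scenario is named `upgrade` (as defined in
`enos-scenario-upgrade.hcl`), a custom variables file can be passed like this;
the file name is a placeholder and the exact flag syntax may differ (see
`enos scenario launch --help`):

```bash
# Launch the upgrade scenario with an explicit variables file instead of the
# default enos.vars.hcl (placeholder file name).
enos scenario launch upgrade --var-file ./my-upgrade.vars.hcl
```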
@@ -127,53 +137,53 @@ along to bootstrap the Nomad cluster.
## Adding New Workloads
As part of the testing process, some test workloads are dispatched and are
expected to keep running throughout the upgrade process. They are stored under
`enos/modules/run_workloads/jobs` and must be defined with the following
attributes:
### Required Attributes
- **`job_spec`** *(string)*: Path to the job specification for your workload.
The path should be relative to the `run_workloads` module.
For example: `jobs/raw-exec-service.nomad.hcl`.
- **`alloc_count`** *(number)*: This variable serves two purposes:
1. Every workload must define the `alloc_count` variable, regardless of
whether it is actively used.
This is because jobs are executed using [this command](https://github.com/hashicorp/nomad/blob/1ffb7ab3fb0dffb0e530fd3a8a411c7ad8c72a6a/enos/modules/run_workloads/main.tf#L66):
```hcl
variable "alloc_count" {
type = number
}
```
This is done to force the job spec author to provide a value for `alloc_count`
(see the sketch after this list).
2. It is used to calculate the expected number of allocations in the cluster
once all jobs are running.
If the variable is missing or left undefined, the job will fail to run,
which will impact the upgrade scenario.
For `system` jobs, the number of allocations is determined by the number
of nodes. In such cases, `alloc_count` is conventionally set to `0`,
as it is not directly used.
- **`type`** *(string)*: Specifies the type of workload—`service`, `batch`, or
`system`. Setting the correct type is important, as it affects the calculation
of the total number of expected allocations in the cluster.
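For reference, the command linked above (used to start each workload) boils
down to something like the following; this is only a sketch with placeholder
values, see `main.tf` for the exact invocation and flags:

```bash
# Roughly how run_workloads starts each job, passing alloc_count as an HCL2
# job variable (placeholder job path and count).
nomad job run -var "alloc_count=3" jobs/raw-exec-service.nomad.hcl
```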
### Optional Attributes
The following attributes are only required if your workload has prerequisites
or final configurations before it is fully operational. For example, a job using
`tproxy` may require a new intention to be configured in Consul.
- **`pre_script`** *(optional, string)*: Path to a script that should be
executed before the job runs.
- **`post_script`** *(optional, string)*: Path to a script that should be
executed after the job runs.
All scripts are located in `enos/modules/run_workloads/scripts`.
Similar to `job_spec`, the path should be relative to the `run_workloads`
module. Example: `scripts/wait_for_nfs_volume.sh`.
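For illustration, a `post_script` is typically a small shell script that blocks
until the workload's prerequisite or final state is in place. The following is
a hypothetical example, not one of the scripts in the repo; the job name and
timeout are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical post_script: block until the workload's allocations are running
# before the scenario moves on (illustrative only).
set -euo pipefail

JOB_NAME="example-workload"

for _ in $(seq 1 30); do
    running=$(nomad alloc status -json | \
        jq --arg job "$JOB_NAME" \
           '[.[] | select(.JobID == $job and .ClientStatus == "running")] | length') \
        || running=0
    if [ "$running" -gt 0 ]; then
        echo "$running allocs for $JOB_NAME are running"
        exit 0
    fi
    sleep 2
done

echo "Error: no running allocs for $JOB_NAME after 60 seconds" >&2
exit 1
```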
@@ -182,8 +192,8 @@ module. Example: `scripts/wait_for_nfs_volume.sh`.
If you want to add a new workload to test a specific feature, follow these steps:
1. Modify the `run_initial_workloads` [step](https://github.com/hashicorp/nomad/blob/04db81951fd0f6b7cc543410585a4da0d70a354a/enos/enos-scenario-upgrade.hcl#L139)
in `enos-scenario-upgrade.hcl` and include your workload in the `workloads`
variable.
2. Add the job specification and any necessary pre/post scripts to the
@@ -191,19 +201,18 @@ appropriate directories:
- [`enos/modules/run_workloads/jobs`](https://github.com/hashicorp/nomad/tree/main/enos/modules/run_workloads/jobs)
- [`enos/modules/run_workloads/scripts`](https://github.com/hashicorp/nomad/tree/main/enos/modules/run_workloads/scripts)
**Important:**
* Ensure that the `alloc_count` variable is included in the job
specification. If it is missing or undefined, the job will fail to run,
potentially disrupting the upgrade scenario.
* During normal execution of the test, the number of jobs and allocs running is
verified multiple times at different stages of the process to confirm the
health of the cluster. Make sure your job has a health check so it will be
restarted in case of unexpected failures and, if it is a batch job, that it
will not exit before the test has concluded.
If you want to verify your workload without having to run the whole scenario,
you can manually pass values to variables with flags or a `.tfvars`
file and run the module from the `run_workloads` directory like you would any
other Terraform module.
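For example, a minimal sketch of exercising the module on its own (the
`.tfvars` file name and its contents are placeholders and must match the
module's declared variables):

```bash
# Exercise only the run_workloads module against an existing cluster
# (placeholder var file; supply whatever variables the module declares).
cd enos/modules/run_workloads
terraform init
terraform plan -var-file=my-workloads.tfvars
terraform apply -var-file=my-workloads.tfvars
```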

View File

@@ -15,8 +15,10 @@ nodes=$(nomad node status -json | jq -r "[.[] | select(.Status == \"ready\") | .
for node in $nodes; do
echo "Draining the node $node"
nomad node drain --enable --deadline "$DRAIN_DEADLINE" "$node" \
# we --ignore-system both to exercise the feature and make sure we won't
# have to reschedule system jobs and wait for them again
nomad node drain --enable --ignore-system --deadline "$DRAIN_DEADLINE" "$node" \
|| error_exit "Failed to drain node $node"
allocs=$(nomad alloc status -json | jq --arg node "$node" '[.[] | select(.NodeID == $node and .ClientStatus == "running")] | length')

View File

@@ -2,7 +2,7 @@
# SPDX-License-Identifier: BUSL-1.1
output "jobs_count" {
description = "The number of jobs thar should be running in the cluster"
description = "The number of jobs that should be running in the cluster"
value = length(var.workloads) + tonumber(coalesce(chomp(enos_local_exec.get_jobs.stdout)))
}

View File

@@ -6,6 +6,8 @@ set -euo pipefail
error_exit() {
printf 'Error: %s' "${1}"
echo "All allocs:"
nomad alloc status -json
exit 1
}
@@ -18,10 +20,14 @@ elapsed_time=0
running_allocs=
allocs_length=
last_error=
checkAllocsCount() {
local allocs
allocs=$(nomad alloc status -json) || error_exit "Failed to check alloc status"
allocs=$(nomad alloc status -json) || {
last_error="Failed to check alloc status"
return 1
}
running_allocs=$(echo "$allocs" | jq '[.[] | select(.ClientStatus == "running")]')
allocs_length=$(echo "$running_allocs" | jq 'length') \
@@ -31,6 +37,7 @@ checkAllocsCount() {
return 0
fi
last_error="Some allocs are not running"
return 1
}
@@ -38,10 +45,10 @@ while true; do
checkAllocsCount && break
if [ "$elapsed_time" -ge "$MAX_WAIT_TIME" ]; then
error_exit "Some allocs are not running: $(nomad alloc status -json | jq -r '.[] | "\(.ID) \(.Name) \(.ClientStatus)"')"
error_exit "$last_error within $elapsed_time seconds."
fi
echo "Running allocs: $allocs_length, expected $ALLOC_COUNT. Waiting for $elapsed_time Retrying in $POLL_INTERVAL seconds..."
echo "Running allocs: $allocs_length, expected ${ALLOC_COUNT}. Have been waiting for ${elapsed_time}. Retrying in $POLL_INTERVAL seconds..."
sleep $POLL_INTERVAL
elapsed_time=$((elapsed_time + POLL_INTERVAL))
done

View File

@@ -21,8 +21,10 @@ last_error=
checkReadyClients() {
local clients_length
ready_clients=$(nomad node status -json | jq '[.[] | select(.Status == "ready" and .SchedulingEligibility == "eligible")]') ||
error_exit "Could not query node status"
ready_clients=$(nomad node status -json | jq '[.[] | select(.Status == "ready" and .SchedulingEligibility == "eligible")]') || {
last_error="Could not query node status"
return 1
}
clients_length=$(echo "$ready_clients" | jq 'length')
if [ "$clients_length" -eq "$CLIENT_COUNT" ]; then

View File

@@ -6,19 +6,49 @@ set -euo pipefail
error_exit() {
printf 'Error: %s' "${1}"
nomad job status
exit 1
}
# Quality: nomad_job_status: A GET call to /v1/jobs returns the correct number of jobs and they are all running.
# Quality: nomad_job_status: A GET call to /v1/jobs returns the correct number
# of jobs and they are all running.
jobs_length=$(nomad job status| awk '$4 == "running" {count++} END {print count+0}')
# jobs should move from "pending" to "running" fairly quickly
MAX_WAIT_TIME=30
POLL_INTERVAL=2
elapsed_time=0
last_error=
if [ -z "$jobs_length" ]; then
error_exit "No jobs found"
fi
checkRunningJobsCount() {
jobs_length=$(nomad job status| awk '$4 == "running" {count++} END {print count+0}') || {
last_error="Could not query job status"
return 1
}
if [ "$jobs_length" -ne "$JOB_COUNT" ]; then
error_exit "The number of running jobs ($jobs_length) does not match the expected count ($JOB_COUNT) $(nomad job status | awk 'NR > 1 && $4 != "running" {print $4}') "
fi
if [ -z "$jobs_length" ]; then
last_error="No running jobs found"
return 1
fi
echo "All $JOB_COUNT JOBS are running."
if [ "$jobs_length" -ne "$JOB_COUNT" ]; then
last_error="The number of running jobs ($jobs_length) does not match the expected count ($JOB_COUNT)"
return 1
fi
}
while true; do
# reset
jobs_length=
checkRunningJobsCount && break
if [ "$elapsed_time" -ge "$MAX_WAIT_TIME" ]; then
error_exit "$last_error within $elapsed_time seconds."
fi
echo "Expected $JOB_COUNT running jobs, found ${jobs_length}. Retrying in $POLL_INTERVAL seconds..."
sleep $POLL_INTERVAL
elapsed_time=$((elapsed_time + POLL_INTERVAL))
done
echo "Expected number of jobs ($JOB_COUNT) are running."

View File

@@ -63,8 +63,10 @@ checkServerHealth() {
ip=$1
echo "Checking server health for $ip"
node_info=$(nomad agent-info -address "https://$ip:4646" -json) \
|| error_exit "Unable to get info for node at $ip"
node_info=$(nomad agent-info -address "https://$ip:4646" -json) || {
last_error="Unable to get info for node at $ip"
return 1
}
last_log_index=$(echo "$node_info" | jq -r '.stats.raft.last_log_index')
last_log_term=$(echo "$node_info" | jq -r '.stats.raft.last_log_term')

View File

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Copyright (c) HashiCorp, Inc.
# SPDX-License-Identifier: BUSL-1.1
set -euo pipefail
# note: the stdout from this script gets read in as JSON to a later step, so
# it's critical we only emit other text if we're failing anyways
error_exit() {
printf 'Error: %s' "${1}"
exit 1
}
# we have a client IP and not a node ID, so query that node via 'node status
# -self' to get its ID
NODE_ID=$(nomad node status \
-allocs -address="https://${CLIENT_IP}:4646" -self -json | jq -r '.ID')
# dump the allocs for this node only, keeping only client-relevant data and not
# the full jobspec. We only want the running allocations because we might have
# previously drained this node, which will mess up our expected counts.
nomad alloc status -json | \
jq -r --arg NODE_ID "$NODE_ID" \
'[ .[] | select(.NodeID == $NODE_ID and .ClientStatus == "running") | {ID: .ID, Name: .Name, ClientStatus: .ClientStatus, TaskStates: .TaskStates}]'

View File

@@ -6,6 +6,9 @@ set -euo pipefail
error_exit() {
printf 'Error: %s' "${1}"
echo "Allocs on node ${client_id}:"
nomad alloc status -json | \
jq -r --arg client_id "$client_id" '[.[] | select(.NodeID == $client_id)]'
exit 1
}
@@ -48,15 +51,16 @@ done
echo "Client $client_id at $CLIENT_IP is ready"
allocs_count=$(echo $ALLOCS |jq '[ .[] | select(.ClientStatus == "running")] | length')
allocs_count=$(echo $ALLOCS | jq '[ .[] | select(.ClientStatus == "running")] | length')
echo "$allocs_count allocs found before upgrade $ALLOCS"
# Quality: "nomad_alloc_reconnect: A GET call to /v1/allocs will return the same IDs for running allocs before and after a client upgrade on each client"
checkAllocsCount() {
local allocs
running_allocs=$(nomad alloc status -json | jq -r --arg client_id "$client_id" '[.[] | select(.ClientStatus == "running" and .NodeID == $client_id)]') \
|| error_exit "Failed to check alloc status"
running_allocs=$(nomad alloc status -json | jq -r --arg client_id "$client_id" '[.[] | select(.ClientStatus == "running" and .NodeID == $client_id)]') || {
last_error="Failed to check alloc status"
return 1
}
allocs_length=$(echo "$running_allocs" | jq 'length') \
|| error_exit "Invalid alloc status -json output"
@@ -64,6 +68,7 @@ checkAllocsCount() {
return 0
fi
last_error="Some allocs are not running"
return 1
}
@@ -74,10 +79,10 @@ while true; do
checkAllocsCount && break
if [ "$elapsed_time" -ge "$MAX_WAIT_TIME" ]; then
error_exit "Some allocs are not running: $(nomad alloc status -json | jq -r '.[] | "\(.ID) \(.Name) \(.ClientStatus)"')"
error_exit "$last_error within $elapsed_time seconds."
fi
echo "Running allocs: $allocs_length, expected $allocs_count. Waiting for $elapsed_time Retrying in $POLL_INTERVAL seconds..."
echo "Running allocs: $allocs_length, expected ${allocs_count}. Have been waiting for ${elapsed_time}. Retrying in $POLL_INTERVAL seconds..."
sleep $POLL_INTERVAL
elapsed_time=$((elapsed_time + POLL_INTERVAL))
@@ -99,8 +104,7 @@ sorted_input=($(printf "%s\n" "${INPUT_ARRAY[@]}" | sort))
sorted_running=($(printf "%s\n" "${RUNNING_ARRAY[@]}" | sort))
if [[ "${sorted_input[*]}" != "${sorted_running[*]}" ]]; then
full_current_allocs=$(nomad alloc status -json | jq -r --arg client_id "$client_id" '[.[] | select(.NodeID == $client_id) | { ID: .ID, Name: .Name, ClientStatus: .ClientStatus}]')
error_exit "Different allocs found, expected: ${sorted_input[*]} found: ${sorted_running[*]}. Current allocs info: $full_current_allocs"
error_exit "Different allocs found, expected: ${sorted_input[*]} found: ${sorted_running[*]}"
fi
echo "All allocs reattached correctly for node at $CLIENT_IP"

View File

@@ -53,8 +53,10 @@ checkServerHealth() {
ip=$1
echo "Checking server $ip is up to date"
node_info=$(nomad agent-info -address "https://$ip:4646" -json) \
|| error_exit "Unable to get info for node at $ip"
node_info=$(nomad agent-info -address "https://$ip:4646" -json) || {
last_error="Unable to get info for node at $ip"
return 1
}
last_log_index=$(echo "$node_info" | jq -r '.stats.raft.last_log_index')
last_log_term=$(echo "$node_info" | jq -r '.stats.raft.last_log_term')