# Upgrade Testing with Enos

We're using [Enos](https://github.com/hashicorp/enos) to perform upgrade
testing. These tests are run via GitHub Actions from the private `nomad-e2e`
repository. This document describes how you can run these tests from your local
development environment if you're a HashiCorp developer.

There are two major components to be aware of:
* This directory includes the upgrade scenario and the Terraform modules and
  shell scripts needed to execute that scenario.
* The scenario uses the same cluster provisioning infrastructure as the E2E
  tests in the `e2e/` directory in the root of this repo. So to run the upgrade
  scenario you also have to have all the credentials set up to run the E2E
  tests. (We may try to fold these together in the future.)

The `e2e/terraform/` folder has provisioning code to spin up a Nomad cluster on
AWS. You'll need both Terraform and AWS credentials to set up the AWS instances
on which the E2E tests will run. See the
[README](https://github.com/hashicorp/nomad/blob/main/e2e/terraform/README.md)
for details. The number of servers and clients is configurable, as is the
specific build of Nomad to deploy and the configuration file for each client
and server.

## Setup

You'll need a recent version of Terraform, the most current version of Enos, as
well as AWS credentials to create the Nomad cluster and credentials for HCP. The
Terraform configurations assume that an appropriate instance role has been
configured elsewhere and that you have the ability to `AssumeRole` into the AWS
account.

Configure the following environment variables. For HashiCorp Nomad developers,
this configuration can be found in 1Password in the Nomad team's vault under
`nomad-e2e`.

```
export HCP_CLIENT_ID=
export HCP_CLIENT_SECRET=
```

The Vault admin token will expire after 6 hours. If you haven't created one
already, use the separate Terraform configuration found in the
`$REPO/e2e/terraform/hcp-vault-auth` directory. Running the following from that
directory will set the correct values for `VAULT_TOKEN`, `VAULT_ADDR`, and
`VAULT_NAMESPACE`:

```sh
terraform init
terraform apply --auto-approve
$(terraform output --raw environment)
```

Make sure your AWS credentials have been refreshed with the appropriate IAM role:

```sh
$ doormat login --force
$ doormat aws cred-file add-profile --role "$ROLE" --set-default
```

Next you'll need to obtain an Artifactory token via Doormat.

```
export ARTIFACTORY_TOKEN=$(doormat artifactory create-token | jq -r .access_token)
```

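Before going further, it can help to confirm that the environment variables
named in this section are all set in your current shell. The function below is
a minimal sketch and not part of this repo: `missing_enos_env` is a
hypothetical name, and the list of variables is simply the ones mentioned
above.

```sh
# Hypothetical helper (not in the repo): print the names of any environment
# variables from this section that are unset or empty. Returns non-zero if
# at least one is missing.
missing_enos_env() {
  missing=""
  for name in HCP_CLIENT_ID HCP_CLIENT_SECRET VAULT_TOKEN VAULT_ADDR \
    VAULT_NAMESPACE ARTIFACTORY_TOKEN; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Print the missing names on one line, without the leading space.
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
}
```

For example, `missing_enos_env || echo "set the variables listed above first"`
prints the missing names followed by the reminder.
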
Then populate the Enos variables file `enos.vars.hcl` (unlike Terraform, Enos
doesn't accept variables on the command line):

```hcl
prefix = "<your first name or initials>"
artifactory_username = "<your email address>"
artifactory_token = "<your ARTIFACTORY_TOKEN from above>"
product_version = "1.8.9" # starting version
upgrade_version = "1.9.4" # version to upgrade to
download_binary_path = "/home/foo/Downloads/nomad" # directory on your machine to download binaries
nomad_license = "<your Nomad Enterprise license, when running Nomad ENT>"
consul_license = "<your Consul Enterprise license, currently always required>"
aws_region = "us-east-1"
```

If you want to test "dev" builds, you'll need to adjust the above as follows:

```hcl
product_version = "1.8.9-dev" # starting version
upgrade_version = "1.9.4-dev" # version to upgrade to
artifactory_repo_start = "hashicorp-crt-dev-local*" # Artifactory repo to search
artifactory_repo_upgrade = "hashicorp-crt-dev-local*" # Artifactory repo to search
```

When the variables file is named `enos.vars.hcl` and placed in the Enos root
folder, Enos picks it up automatically. To use a different variables file, pass
its path with the `--var-file` flag.

## Reviewing Enos

You can quickly validate the Enos scenario configuration without running it:

```sh
$ enos scenario validate upgrade --var-file /tmp/enos.vars
$ echo $?
0
```

You can also review what Enos will do by generating an outline you can read in
your browser:

```sh
$ enos scenario outline upgrade --var-file /tmp/enos.vars --format=html > /tmp/outline.html
$ open /tmp/outline.html
```

## Running Enos

Run the Enos scenario end-to-end:

```sh
$ enos scenario run upgrade --var-file /tmp/enos.vars --timeout 2h
```

Enos will not clean up after itself automatically if interrupted. If you have to
interrupt it, you may need to run
`enos scenario destroy upgrade --var-file /tmp/enos.vars`.

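If you tend to interrupt runs, one option is to wrap the two commands above so
that the destroy step runs when the shell receives an interrupt. This is a
sketch under assumptions, not something the repo provides:
`run_upgrade_scenario` and the `ENOS_BIN` override are hypothetical names, and
whether a destroy is actually needed depends on how far the scenario got.

```sh
# Hypothetical wrapper (not in the repo). ENOS_BIN lets you point at a
# different enos binary; it defaults to "enos" on your PATH.
ENOS_BIN="${ENOS_BIN:-enos}"

run_upgrade_scenario() {
  var_file="$1"
  # Use a subshell so the trap does not leak into your interactive shell.
  (
    # On Ctrl-C or TERM, try to tear down whatever was provisioned.
    trap '"$ENOS_BIN" scenario destroy upgrade --var-file "$var_file"; exit 130' INT TERM
    "$ENOS_BIN" scenario run upgrade --var-file "$var_file" --timeout 2h
  )
}
```

You would call it as `run_upgrade_scenario /tmp/enos.vars`. On a normal exit
the function returns the exit status of `enos scenario run` and does not
destroy anything.
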
## Debugging

Enos builds Terraform state in the `.enos` directory, in a subdirectory named
with a hash. If you're working on Enos scenarios or test workloads and want to
connect to the Nomad cluster you create, you can use the `debug-environment`
script in this directory to set your Nomad environment variables by passing it
the path to that subdirectory. For example:

```sh
$ $(./debug-environment .enos/c545bbc25c5eec0ca86c99595a9034b5451a91aa10b586da2baab435df65be2e)
```

Note that this won't be fully populated until the Enos scenario is far enough
along to bootstrap the Nomad cluster.

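Because the subdirectory name is a hash, it can be tedious to copy by hand. The
helper below is a sketch that picks the most recently modified subdirectory of
`.enos`. `latest_enos_dir` is a hypothetical name, and the approach assumes the
scenario you care about is the one whose state directory changed last.

```sh
# Hypothetical helper (not in the repo): print the most recently modified
# subdirectory of an Enos state directory (defaults to ./.enos).
latest_enos_dir() {
  state_root="${1:-.enos}"
  # ls -t sorts newest first; the trailing slash in the glob limits the
  # matches to directories.
  newest=$(ls -td "$state_root"/*/ 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  # Strip the trailing slash that came from the glob.
  printf '%s\n' "${newest%/}"
}
```

With that defined, `$(./debug-environment "$(latest_enos_dir)")` should be
equivalent to passing the path by hand, provided only one scenario has been run
recently.
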
## Adding New Workloads

As part of the testing process, some test workloads are dispatched and are
expected to keep running throughout the upgrade process. They are stored under
`enos/modules/run_workloads/jobs` and must be defined with the following
attributes:

### Required Attributes

- **`job_spec`** *(string)*: Path to the job specification for your workload.
  The path should be relative to the `run_workloads` module.
  For example: `jobs/raw-exec-service.nomad.hcl`.

- **`alloc_count`** *(number)*: This variable serves two purposes:
  1. Every job specification must declare the `alloc_count` variable, regardless
     of whether it is actively used. This is because jobs are executed using
     [this command](https://github.com/hashicorp/nomad/blob/1ffb7ab3fb0dffb0e530fd3a8a411c7ad8c72a6a/enos/modules/run_workloads/main.tf#L66),
     so the job specification needs a matching declaration:

     ```hcl
     variable "alloc_count" {
       type = number
     }
     ```
     This forces the workload author to provide a value for `alloc_count`.
  2. It is used to calculate the expected number of allocations in the cluster
     once all jobs are running.

  If the variable is missing or left undefined, the job will fail to run,
  which will impact the upgrade scenario.

  For `system` jobs, the number of allocations is determined by the number
  of nodes. In such cases, `alloc_count` is conventionally set to `0`,
  as it is not directly used.

- **`type`** *(string)*: Specifies the type of workload: `service`, `batch`, or
  `system`. Setting the correct type is important, as it affects the calculation
  of the total number of expected allocations in the cluster.

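To make the interaction between `type` and `alloc_count` concrete, here is a
rough sketch of how an expected total could be derived from the rules above.
It is illustrative only: `expected_allocs` and its `TYPE:ALLOC_COUNT` argument
format are hypothetical, it assumes every client node is eligible for every
`system` job, and the scenario's own scripts remain the source of truth.

```sh
# Illustrative only (not in the repo).
# Usage: expected_allocs CLIENT_COUNT TYPE:ALLOC_COUNT [TYPE:ALLOC_COUNT...]
expected_allocs() {
  clients="$1"
  shift
  total=0
  for workload in "$@"; do
    kind="${workload%%:*}"   # text before the colon
    count="${workload##*:}"  # text after the colon
    case "$kind" in
      # system jobs: one allocation per client node, alloc_count is ignored
      system) total=$((total + clients)) ;;
      # service and batch jobs contribute their alloc_count
      service|batch) total=$((total + count)) ;;
      *) echo "unknown workload type: $kind" >&2; return 1 ;;
    esac
  done
  echo "$total"
}
```

For a cluster with 4 clients, `expected_allocs 4 service:3 batch:2 system:0`
prints `9`: three service allocations, two batch allocations, and one system
allocation per client.
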
### Optional Attributes

The following attributes are only required if your workload has prerequisites
or final configurations before it is fully operational. For example, a job using
`tproxy` may require a new intention to be configured in Consul.

- **`pre_script`** *(optional, string)*: Path to a script that should be
  executed before the job runs.
- **`post_script`** *(optional, string)*: Path to a script that should be
  executed after the job runs.

All scripts are located in `enos/modules/run_workloads/scripts`.
Similar to `job_spec`, the path should be relative to the `run_workloads`
module. Example: `scripts/wait_for_nfs_volume.sh`.

### Adding a New Workload

If you want to add a new workload to test a specific feature, follow these steps:

1. Modify the `run_initial_workloads` [step](https://github.com/hashicorp/nomad/blob/04db81951fd0f6b7cc543410585a4da0d70a354a/enos/enos-scenario-upgrade.hcl#L139)
   in `enos-scenario-upgrade.hcl` and include your workload in the `workloads`
   variable.

2. Add the job specification and any necessary pre/post scripts to the
   appropriate directories:
   - [`enos/modules/run_workloads/jobs`](https://github.com/hashicorp/nomad/tree/main/enos/modules/run_workloads/jobs)
   - [`enos/modules/run_workloads/scripts`](https://github.com/hashicorp/nomad/tree/main/enos/modules/run_workloads/scripts)

**Important:**
* Ensure that the `alloc_count` variable is included in the job
  specification. If it is missing or undefined, the job will fail to run,
  potentially disrupting the upgrade scenario.

* To verify the health of the cluster, the test checks the number of running
  jobs and allocations several times at different stages of the process. Make
  sure your job has a health check so that it is restarted after an unexpected
  failure and, if it is a batch job, that it does not exit before the test has
  concluded.

If you want to verify your workload without having to run the whole scenario,
you can manually pass values to the module's variables with flags or a `.tfvars`
file and run the module from the `run_workloads` directory like you would any
other Terraform module.