# Terraform infrastructure
This folder contains Terraform resources for provisioning a Nomad cluster on EC2 instances on AWS to use as the target of end-to-end tests.
Terraform provisions the AWS infrastructure assuming that EC2 AMIs have already been built via Packer and an HCP Vault cluster is already running. It deploys a build of Nomad from your local machine along with configuration files, as well as a single-node Consul server cluster.
## Setup
You'll need a recent version of Terraform (1.1+ recommended), as well
as AWS credentials to create the Nomad cluster and credentials for
HCP. This Terraform stack assumes that an appropriate instance role
has been configured elsewhere and that you have the ability to
AssumeRole into the AWS account.
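How you obtain those credentials depends on your account setup. As a minimal sketch, assuming a role named `nomad-e2e` exists in your account (the account ID and role ARN below are placeholders), you could export short-lived credentials like this:

```sh
# Hypothetical role ARN; substitute the role configured for your account.
creds=$(aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/nomad-e2e" \
  --role-session-name "nomad-e2e" \
  --query 'Credentials' --output json)

export AWS_ACCESS_KEY_ID=$(echo "$creds" | jq -r '.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | jq -r '.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$creds" | jq -r '.SessionToken')
```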
If you're trying to provision the cluster from macOS on Apple Silicon hardware, you will also need Nomad Linux binaries for the x86_64 architecture. Since it's currently impossible to cross-compile Nomad for Linux on macOS, you need to grab a Nomad binary from the [releases page](https://releases.hashicorp.com/nomad/) and put it in the `../pkg/linux_amd64` directory before running Terraform.
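As a sketch, assuming you want the 1.8.0 release (a placeholder; use whichever version you're testing), the download looks like this:

```sh
# Placeholder version; substitute the release you intend to test.
NOMAD_VERSION=1.8.0
curl -fsSL -o /tmp/nomad.zip \
  "https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip"
mkdir -p ../pkg/linux_amd64
unzip -o -d ../pkg/linux_amd64 /tmp/nomad.zip
```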
Configure the following environment variables. For HashiCorp Nomad developers, this configuration can be found in 1Pass in the Nomad team's vault under `nomad-e2e`.
```sh
export HCP_CLIENT_ID=
export HCP_CLIENT_SECRET=
```
The Vault admin token will expire after 6 hours. If you haven't created one already, use the separate Terraform configuration found in the `hcp-vault-auth` directory. The following will set the correct values for `VAULT_TOKEN`, `VAULT_ADDR`, and `VAULT_NAMESPACE`:
```sh
cd ./hcp-vault-auth
terraform init
terraform apply --auto-approve
$(terraform output --raw environment)
cd ../
```
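To confirm the token took effect, a quick sanity check with the Vault CLI (assuming it's installed locally) is:

```sh
# Fails if VAULT_ADDR, VAULT_NAMESPACE, or VAULT_TOKEN are not set correctly.
vault token lookup
```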
Optionally, edit the `terraform.tfvars` file to change the number of Linux clients or Windows clients:
```hcl
region                    = "us-east-1"
instance_type             = "t2.medium"
server_count              = "3"
client_count_linux        = "4"
client_count_windows_2022 = "1"
```
You will also need a Consul Enterprise license file, a Nomad Enterprise license file, and a local Consul binary to provision Consul.
Optionally, edit the `nomad_local_binary` variable in the `terraform.tfvars` file to change the path to the local Nomad binary you'd like to upload, but keep in mind it has to match the OS and CPU architecture of the nodes (linux/amd64).
NOTE: If you want a cluster with mixed CPU architectures, you need to specify the count and also provide the corresponding binaries using `var.nomad_local_binary_client_ubuntu_jammy` and/or `var.nomad_local_binary_client_windows_2022`, as sketched below.
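As a minimal sketch, assuming these variables take one binary path per client (the `linux_arm64` path is a placeholder for wherever your arm64 build lives):

```sh
# Appends hypothetical mixed-architecture settings to terraform.tfvars.
cat >> terraform.tfvars <<'EOF'
client_count_linux                     = "2"
nomad_local_binary_client_ubuntu_jammy = ["../pkg/linux_amd64/nomad", "../pkg/linux_arm64/nomad"]
EOF
```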
Run `terraform apply` to deploy the infrastructure:
```sh
cd e2e/terraform/
terraform init
terraform apply -var="consul_license=$(cat full_path_to_consul.hclic)" -var="nomad_license=$(cat full_path_to_nomad.hclic)"
```
Alternatively, you can run `make apply_full` from the `terraform` directory:
```sh
export NOMAD_LICENSE_PATH=./nomad.hclic
export CONSUL_LICENSE_PATH=./consul.hclic
make apply_full
```
Note: You will likely see "Connection refused" or "Permission denied" errors in the logs as the provisioning script run by Terraform hits an instance where the ssh service isn't yet ready. That's ok and expected; they'll get retried. In particular, Windows instances can take a few minutes before ssh is ready.
Also note: When ACLs are being bootstrapped, you may see "No cluster leader" in the output several times while the ACL bootstrap script polls the cluster to start and elect a leader.
## Configuration
The files in `etc` are template configuration files for Nomad and the Consul agent. Terraform will render these files to the `uploads` folder and upload them to the cluster during provisioning.
- `etc/nomad.d` are the Nomad configuration files.
  - `base.hcl`, `tls.hcl`, `consul.hcl`, and `vault.hcl` are shared.
  - `server-linux.hcl`, `client-linux.hcl`, and `client-windows.hcl` are role and platform specific.
  - `client-linux-0.hcl`, etc. are specific to individual instances.
- `etc/consul.d` are the Consul agent configuration files.
- `etc/acls` are ACL policy files for Consul and Vault.
## Web UI
To access the web UI, deploy a reverse proxy to the cluster. All clients have a TLS proxy certificate at `/etc/nomad.d/tls_proxy.crt` and a self-signed cert at `/etc/nomad.d/self_signed.crt`. See `../ui/input/proxy.nomad` for an example of using this. Deploy as follows:
```sh
nomad namespace apply proxy
nomad job run ../ui/input/proxy.nomad
```
You can get the public IP for the proxy allocation from the following nested query:
```sh
nomad node status -json -verbose \
    $(nomad operator api '/v1/allocations?namespace=proxy' | jq -r '.[] | select(.JobID == "nomad-proxy") | .NodeID') \
    | jq '.Attributes."unique.platform.aws.public-ipv4"'
```
## Outputs
After deploying the infrastructure, you can get connection information about the cluster:
- `$(terraform output --raw environment)` will set your current shell's `NOMAD_ADDR` and `CONSUL_HTTP_ADDR` to point to one of the cluster's server nodes, and set the `NOMAD_E2E` variable.
- `terraform output servers` will output the list of server node IPs.
- `terraform output linux_clients` will output the list of Linux client node IPs.
- `terraform output windows_clients` will output the list of Windows client node IPs.
- `cluster_unique_identifier` will output the random name used to identify the cluster's resources.
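For example, to point your current shell at the new cluster and confirm it's reachable:

```sh
# Loads NOMAD_ADDR and CONSUL_HTTP_ADDR into this shell, then checks
# that the servers have formed a quorum.
$(terraform output --raw environment)
nomad server members
```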
## SSH
You can use the Terraform outputs above to access nodes via ssh:
```sh
ssh -i keys/${CLUSTER_UNIQUE_IDENTIFIER}/nomad-e2e-*.pem ubuntu@${EC2_IP_ADDR}
```
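As a sketch, assuming the `linux_clients` output is a JSON list of IP addresses, you can fill in both variables straight from Terraform:

```sh
# Pulls the cluster identifier and the first Linux client IP from the
# Terraform outputs shown above.
CLUSTER_UNIQUE_IDENTIFIER=$(terraform output -raw cluster_unique_identifier)
EC2_IP_ADDR=$(terraform output -json linux_clients | jq -r '.[0]')
ssh -i keys/${CLUSTER_UNIQUE_IDENTIFIER}/nomad-e2e-*.pem ubuntu@${EC2_IP_ADDR}
```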
The Windows client runs OpenSSH for convenience, but has a different user and will drop you into a PowerShell session instead of bash:
```sh
ssh -i keys/${CLUSTER_UNIQUE_IDENTIFIER}/nomad-e2e-*.pem Administrator@${EC2_IP_ADDR}
```
## Teardown
The Terraform state file stores all the information needed to tear down the cluster:
```sh
cd e2e/terraform/
terraform destroy
```
## FAQ
### E2E Provisioning Goals
1. The provisioning process should be able to run a nightly build against a variety of OS targets.
2. The provisioning process should be able to support update-in-place tests. (See #7063)
3. A developer should be able to quickly stand up a small E2E cluster and provision it with a version of Nomad they've built on their laptop. The developer should be able to send updated builds to that cluster with a short iteration time, rather than having to rebuild the cluster.
### Why not just drop all the provisioning into the AMI?
While that's the "correct" production approach for cloud infrastructure, it creates a few pain points for testing:
- Creating a Linux AMI takes >10 minutes, and creating a Windows AMI can take 15-20 minutes. This interferes with goal (3) above.
- We won't be able to do in-place upgrade testing without having an in-place provisioning process anyways. This interferes with goal (2) above.
### Why not just drop all the provisioning into the user data?
- User data scripts are executed only on boot, which prevents using them for in-place upgrade testing.
- User data scripts are not very observable, and it's painful to determine whether they've failed or simply haven't finished yet before trying to run tests.