When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking), we know that no other node of that
same class will be feasible, so we add the hash to a map and reject those nodes
early. This hash cannot include any values that are unique to a given node;
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduced the `nomad.advertise.address` attribute and in #19969 we
introduced the `consul.dns.addr` attribute. Both of these are unique per node
and break the hash.
Additionally, when checking whether a node escaped its class, we were not
correctly filtering out attributes that start with `unique.`. The test for this,
introduced in #708, had an inverted assertion, which allowed the bug to pass
unnoticed since the early days of Nomad.
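As a rough illustration, here is a minimal sketch of the idea (simplified
placeholder code, not Nomad's actual implementation): the computed class is a
stable hash that skips the `unique.` namespace and any other attributes known
to be unique per node.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// uniqueAttrs lists attributes that are unique per node even though they don't
// carry the `unique.` prefix; hashing them would break class grouping.
var uniqueAttrs = map[string]bool{
	"nomad.advertise.address": true,
	"consul.dns.addr":         true,
}

// computedClass hashes the class-relevant inputs in a stable order.
func computedClass(nodeClass, datacenter string, attrs map[string]string) uint64 {
	h := fnv.New64a()
	fmt.Fprint(h, nodeClass, datacenter)

	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		// Skip attributes that can never be shared across nodes.
		if strings.HasPrefix(k, "unique.") || uniqueAttrs[k] {
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order so equal inputs hash equally

	for _, k := range keys {
		fmt.Fprint(h, k, attrs[k])
	}
	return h.Sum64()
}

func main() {
	a := map[string]string{
		"cpu.arch":                "amd64",
		"unique.hostname":         "node-a",
		"nomad.advertise.address": "10.0.0.1:4646",
	}
	b := map[string]string{
		"cpu.arch":                "amd64",
		"unique.hostname":         "node-b",
		"nomad.advertise.address": "10.0.0.2:4646",
	}
	// Both nodes land in the same computed class despite their unique values.
	fmt.Println(computedClass("batch", "dc1", a) == computedClass("batch", "dc1", b))
}
```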
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used by the Nomad agent have
all been removed except for "token", which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Add an upgrade test workload for CSI with the AWS EFS plugin. In order to
validate this workload, we'll need to deploy the plugin job and then register a
volume with it. So this extends the `run_workloads` module to allow for "pre
scripts" and "post scripts" to be run before and after a given job has been
deployed. We can use that as a model for other test workloads.
Ref: https://hashicorp.atlassian.net/browse/NET-12217
Enos buries the Terraform output from provisioning. Add a shell script to load
the environment from provisioning for debugging Nomad during development of
upgrade tests.
* func: Add more workloads
* Update jobs.sh
* Update versions.sh
* style: format
* Update enos/modules/test_cluster_health/scripts/allocs.sh
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* docs: improve outputs descriptions
* func: change docker workloads to be redis boxes and add healthchecks
* func: register the services on consul
* style: format
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In #25185 we changed the output of `volume status` to include both DHV and CSI
volumes by default. When the E2E test parses the output, it's not expecting the
new section header.
Ref: https://github.com/hashicorp/nomad/pull/25185
* Changes the behaviour of system/batch/sysbatch jobs not to look for a latest stable version, as their versions never become stable
* Don't show job stability on versions page for system/sysbatch/batch jobs
* Tests that depend on jobs to revert specify that they are Service jobs
* Batch jobs added to detail-restart test loop
* Right, they're not stable, they're just versions
Dependabot can update actions to versions that are not in the TSCCR
allowlist. The TSCCR check doesn't happen in CE, which means we don't learn we
have a problem until after we've spent the effort to backport them. Remove the
automated action updates until this issue is resolved on the security team's
side.
Fixes a bug where connections would not be closed on write errors in the
msgpack encoder, which would cause the reader end of RPC connections to hang
indefinitely. This resulted in clients in widely-distributed geographies being
unable to poll for allocation updates.
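The shape of the fix, as a minimal sketch with placeholder types rather than
Nomad's actual RPC plumbing: if encoding a response fails, close the underlying
connection so the remote reader unblocks instead of waiting forever.

```go
package main

import (
	"fmt"
	"net"

	"github.com/hashicorp/go-msgpack/v2/codec"
)

// connEncoder pairs a msgpack encoder with the connection it writes to.
type connEncoder struct {
	conn net.Conn
	enc  *codec.Encoder
}

func newConnEncoder(conn net.Conn) *connEncoder {
	return &connEncoder{
		conn: conn,
		enc:  codec.NewEncoder(conn, &codec.MsgpackHandle{}),
	}
}

// send encodes v onto the connection; on a write error it closes the
// connection rather than leaving the remote reader hanging on a reply that
// will never arrive.
func (c *connEncoder) send(v interface{}) error {
	if err := c.enc.Encode(v); err != nil {
		_ = c.conn.Close() // unblock the remote reader
		return err
	}
	return nil
}

func main() {
	client, server := net.Pipe()
	client.Close() // simulate a broken peer so the server-side write fails

	enc := newConnEncoder(server)
	if err := enc.send(map[string]string{"hello": "world"}); err != nil {
		fmt.Println("encode failed, connection closed:", err)
	}
}
```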
Fixes: https://github.com/hashicorp/nomad/issues/23305
The `-type` option for `volume status` is a UX papercut because for many
clusters there will be only one sort of volume in use. Update the CLI so that
the default behavior is to query CSI and/or DHV.
This behavior is subtly different depending on whether the user provides an ID.
If the user doesn't provide an ID, we query both CSI and DHV and show both
tables. If the user provides an ID, we query DHV first and then CSI, and show
only the appropriate volume. Because DHV IDs are UUIDs, we're sure we won't have
collisions between the two. We only show errors if both queries return an error.
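A simplified sketch of that lookup order; the volume type and the lookup
functions here are hypothetical stand-ins, not the real CLI or API code.

```go
package main

import (
	"errors"
	"fmt"
)

type volume struct {
	ID   string
	Kind string // "host" or "csi"
}

var errNotFound = errors.New("volume not found")

// statusByID tries dynamic host volumes first, then CSI. Because DHV IDs are
// UUIDs, an ID can't match both, so the first hit is the right one.
func statusByID(id string, dhvLookup, csiLookup func(string) (*volume, error)) (*volume, error) {
	vol, dhvErr := dhvLookup(id)
	if dhvErr == nil {
		return vol, nil
	}
	vol, csiErr := csiLookup(id)
	if csiErr == nil {
		return vol, nil
	}
	// Only report an error when both queries failed.
	return nil, errors.Join(dhvErr, csiErr)
}

func main() {
	dhv := func(id string) (*volume, error) { return nil, errNotFound }
	csi := func(id string) (*volume, error) { return &volume{ID: id, Kind: "csi"}, nil }

	vol, err := statusByID("ebs-vol0", dhv, csi)
	fmt.Println(vol.Kind, err) // csi <nil>
}
```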
Fixes: https://hashicorp.atlassian.net/browse/NET-12214
* func: add the possibility of having different binaries for servers and clients
* style: rename binaries modules
* func: remove the check for last configuration log, and only take one snapshot when upgrading the servers
* Update enos/modules/upgrade_servers/main.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add the possibility of having different binaries for servers and clients
* style: rename binaries modules
* docs: update comments
* fix: correct the token input variable for fetch binaries
We update the status of a volume when the node fingerprint changes. But if a
node goes down, we still show the volume as available. The scheduler behavior is
correct because a down node can never have work scheduled on it, but it might be
confusing for job authors who are looking at volumes that are showing as
available.
Update the volume logic that runs on node updates to mark the volume as
unavailable when its node goes down.
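A rough sketch of the intent, using simplified placeholder types rather than
Nomad's real state store schema: the node status update also flips the state of
that node's volumes.

```go
package main

import "fmt"

type nodeStatus string

const (
	nodeStatusReady nodeStatus = "ready"
	nodeStatusDown  nodeStatus = "down"
)

type hostVolume struct {
	Name   string
	NodeID string
	State  string // "ready" or "unavailable"
}

// applyNodeStatus updates volume state alongside the node status change so
// that volumes on a down node no longer show as available.
func applyNodeStatus(nodeID string, status nodeStatus, vols []*hostVolume) {
	for _, v := range vols {
		if v.NodeID != nodeID {
			continue
		}
		if status == nodeStatusDown {
			v.State = "unavailable"
		} else {
			v.State = "ready"
		}
	}
}

func main() {
	vols := []*hostVolume{{Name: "scratch", NodeID: "node-1", State: "ready"}}
	applyNodeStatus("node-1", nodeStatusDown, vols)
	fmt.Println(vols[0].State) // unavailable
}
```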
Fixes: https://hashicorp.atlassian.net/browse/NET-12068
In #24526 we updated Consul and Vault fingerprinting so that we no longer
periodically fingerprint. In #25102 we made it so that we fingerprint
periodically on start until the first successful fingerprint, in order to tolerate Consul
or Vault not being available on start. For clusters not running Consul, this
leads to a warn-level log every 15s. This same log exists for Vault, but Vault
support is opt-in via `vault.enable = true` whereas you have to manually disable
the fingerprinter for Consul.
Make it so that we only log a failed Consul fingerprint once per Consul
cluster. Reset this gate once we have a successful fingerprint, so that we get
the logs again after a reload if Consul becomes unavailable.
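A minimal sketch of the gating logic, with a simplified fingerprinter type
standing in for the real one: warn on the first failure per Consul cluster,
suppress the repeats, and reset after a success.

```go
package main

import (
	"errors"
	"log"
)

var errUnavailable = errors.New("consul unavailable")

type consulFingerprinter struct {
	// logged tracks, per Consul cluster, whether we've already warned about a
	// failed fingerprint since the last success.
	logged map[string]bool
}

func (f *consulFingerprinter) fingerprint(cluster string, query func() error) {
	if f.logged == nil {
		f.logged = map[string]bool{}
	}
	if err := query(); err != nil {
		if !f.logged[cluster] {
			log.Printf("[WARN] failed to fingerprint Consul cluster %q: %v", cluster, err)
			f.logged[cluster] = true // suppress the repeated warning every 15s
		}
		return
	}
	// Success: reset the gate so a later outage is logged again.
	f.logged[cluster] = false
}

func main() {
	f := &consulFingerprinter{}
	fail := func() error { return errUnavailable }
	ok := func() error { return nil }

	f.fingerprint("default", fail) // warns once
	f.fingerprint("default", fail) // suppressed
	f.fingerprint("default", ok)   // resets the gate
	f.fingerprint("default", fail) // warns again
}
```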
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/pull/25102
Fixes: https://github.com/hashicorp/nomad/issues/25181
* Add factory hooks for jobs to have previously stable versions and stopped status
* Since #24973 node-read isn't presupposed, so the regex should match only on the common URL parts
* Job detail tests for title buttons are now bimodal and default to having previously-stable version in history
* prettier plz
* Breaking a thing on purpose to see if my other broken thing is broken
* continue-on-error set to false to get things red when appropriate
* OK what if continue-on-error=true but we do a separate failure reporting after the fact
* fail-fast are you the magic incantation that I need?
* Re-fix my test now that fast-fail is off
* Fix to server-leader by adding a region first, and always()-append to uploading partition results
* Express failure step lists failing tests so you don't have to click back into ember-exam step
* temporary snapshot and logging for flakey test in service job detail
* Bunch of region and tasklogs test fixups
* using allocStatusDistribution to ensure service job always has a non-queued alloc
Job authors need to be able to review what capabilities a dynamic host volume or
CSI volume has so that they can set the correct access mode and attachment mode
in their job. Add these to the CLI output of `volume status`.
Ref: https://hashicorp.atlassian.net/browse/NET-12063
Nomad 1.10.0 is removing the legacy Vault token-based workflow,
which means the legacy e2e compatibility tests will no longer
work.
The Nomad e2e cluster was using the legacy Vault token-based
workflow for the initial cluster build. This change migrates to
using the workload identity flow, which uses authentication
methods, roles, and policies.
The Nomad server network has been modified to allow traffic from
the HCP Vault HVN, which is a private network peered into our AWS
account. This is required so that Vault can pull JWKS
information from the Nomad API without going over the public
internet.
The cluster build will now also configure a Vault KV v2 mount at
a unique identifier for the e2e cluster. This allows all Nomad
workloads and tests to use it if required.
The vaultsecrets suite has been updated to accommodate the new
changes and extended to test the default workload ID flow for
allocations which use Vault for secrets.
We're using `set -eo pipefail` everywhere in the Enos scripts, but several of
the scripts used for checking assertions didn't account for pipefail in a way
that avoids early exits from transient errors. This meant that if a server was
slightly late to come back up, we'd hit an error and exit the whole script
instead of polling as expected.
While fixing this, I've made a number of other improvements to the shell scripts:
* I've changed the design of the polling loops so that we're calling a function
that returns an exit code and sets a `last_error` value, along with any global
variables required by downstream functions. This makes the loops more readable
by reducing the number of global variables, and helped identify some places
where we were exiting instead of returning into the loop.
* Using `shellcheck -s bash` I fixed some unused variables and undefined
variables that we were missing because they were only used on the error paths.
The nightly E2E run only builds a new AMI when required by changes to the
build. The AMI is tagged with the SHA of the commit that forced that build,
which may not be the commit that's spawning a particular test run. So we have a
resource in the `provision-infra` module that finds that SHA.
But when we run upgrade testing via Enos, we're running the E2E Terraform
configuration from outside the `e2e/terraform` folder. So the script that
resource runs will fail and prevent us from getting the AMI. Fix the script so
it can be run from any folder.
We also have duplicate resources for the "ubuntu jammy" AMI, but this is because
the Enos matrix might (in the near future) test with ARM64. For now, we'll pin
the Consul server to AMD64. Rename the resource appropriately to make the source
of the duplicate obvious.
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover (summarized in a sketch below):
* restart: when the tasks of an allocation fail and we try to restart the tasks
in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation
fails before tasks even start), and we need to move
the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
allocations. These are not failures, so we don't want to propagate the
reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule`
tracker for the allocations on the node (it's not the allocation's "fault",
after all). We don't want to run the `migrate` machinery here either, as we
can't contact the down node. To the scheduler, this is effectively the same as
if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the
replacement is intended to be temporary, so we propagate the reschedule tracker.
Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.
Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
* func: remove the lists to override the nomad_local_binary for servers and clients
* docs: add a note to the terraform e2e readme
* fix: remove the extra 'windows' from the aws_ami filter
* style: hcl fmt