Dependabot can update actions to versions that are not in the TSCCR
allowlist. The TSCCR check doesn't happen in CE, which means we don't learn we
have a problem until after we've spent the effort to backport the updates. Remove
the automated updates of actions until this issue is resolved on the security
team's side.
Fixes a bug where connections would not be closed on write errors in the
msgpack encoder, which would cause the reader end of RPC connections to hang
indefinitely. This resulted in clients in widely-distributed geographies being
unable to poll for allocation updates.
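A minimal sketch of the shape of the fix, with illustrative names rather than the actual Nomad encoder types:

```go
package rpcsketch

import (
	"fmt"
	"net"
)

// wireEncoder stands in for the msgpack encoder used on RPC connections.
type wireEncoder interface {
	Encode(v interface{}) error
}

// rpcConn pairs an encoder with its underlying connection.
type rpcConn struct {
	raw net.Conn
	enc wireEncoder
}

// send shows the important part of the fix: if the encode (which writes to
// the socket) fails, close the connection so the remote reader unblocks with
// an EOF instead of hanging forever on a half-written frame.
func (c *rpcConn) send(v interface{}) error {
	if err := c.enc.Encode(v); err != nil {
		c.raw.Close() // unblock the peer's read loop
		return fmt.Errorf("rpc write failed: %w", err)
	}
	return nil
}
```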
Fixes: https://github.com/hashicorp/nomad/issues/23305
The `-type` option for `volume status` is a UX papercut because for many
clusters there will be only one sort of volume in use. Update the CLI so that
the default behavior is to query CSI and/or DHV.
This behavior differs subtly depending on whether the user provides an ID. If the
user doesn't provide an ID, we query both CSI and DHV and show both tables. If
the user provides an ID, we query DHV first and then CSI, and show only the
matching volume. Because DHV IDs are UUIDs, we're sure we won't have
collisions between the two. We only show an error if both queries return an error.
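An illustrative sketch of the lookup order described above; the types and functions here are placeholders rather than the actual CLI code:

```go
package volstatus

import "errors"

// volume is a placeholder for whatever the CLI renders.
type volume struct {
	ID   string
	Kind string // "host" (DHV) or "csi"
}

// lookup is a placeholder for an API call that returns an error when the ID
// doesn't match a volume of that kind.
type lookup func(id string) (*volume, error)

// statusByID mirrors the behavior described above: try dynamic host volumes
// first, fall back to CSI, and only surface an error if both lookups fail.
// DHV IDs are UUIDs, so an ID can never refer to both kinds at once.
func statusByID(id string, dhv, csi lookup) (*volume, error) {
	vol, dhvErr := dhv(id)
	if dhvErr == nil {
		return vol, nil
	}
	vol, csiErr := csi(id)
	if csiErr == nil {
		return vol, nil
	}
	return nil, errors.Join(dhvErr, csiErr) // only error when both queries fail
}
```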
Fixes: https://hashicorp.atlassian.net/browse/NET-12214
* func: add the possibility of having different binaries for servers and clients
* style: rename binaries modules
* func: remove the check for last configuration log, and only take one snapshot when upgrading the servers
* Update enos/modules/upgrade_servers/main.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add the possibility of having different binaries for servers and clients
* style: rename binaries modules
* docs: update comments
* fix: correct the token input variable for fetch binaries
We update the status of a volume when the node fingerprint changes. But if a
node goes down, we still show the volume as available. The scheduler behavior is
correct because a down node can never have work scheduled on it, but it might be
confusing for job authors who are looking at volumes that are showing as
available.
Update the volume logic we run on node updates to also mark volumes as
unavailable when a node goes down.
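A rough sketch of the shape of the change, with placeholder types rather than the real state store code:

```go
package volwatch

// Placeholder types; the real logic lives in Nomad's state store and these
// names are illustrative only.
type nodeStatus string

const nodeStatusDown nodeStatus = "down"

type hostVolume struct {
	NodeID string
	State  string // e.g. "ready" or "unavailable"
}

// onNodeUpdate sketches the change: in addition to reacting to fingerprint
// updates, mark every volume on a node that has gone down as unavailable so
// operators aren't misled by a stale "available" status.
func onNodeUpdate(nodeID string, status nodeStatus, vols []*hostVolume) {
	if status != nodeStatusDown {
		return
	}
	for _, v := range vols {
		if v.NodeID == nodeID {
			v.State = "unavailable"
		}
	}
}
```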
Fixes: https://hashicorp.atlassian.net/browse/NET-12068
In #24526 we updated Consul and Vault fingerprinting so that we no longer
fingerprint periodically. In #25102 we made it so that we fingerprint
periodically at startup until the first successful fingerprint, in order to
tolerate Consul or Vault not being available when the agent starts. For clusters
not running Consul, this leads to a warn-level log every 15s. The same log exists
for Vault, but Vault support is opt-in via `vault.enabled = true`, whereas for
Consul you have to manually disable the fingerprinter.
Make it so that we only log a failed Consul fingerprint once per Consul
cluster. Reset the gate once we have a successful fingerprint, so that if Consul
becomes unavailable after a reload we still get the logs.
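A minimal sketch of the gating described above (illustrative names, not the actual fingerprinter code):

```go
package fingerprint

import (
	"log"
	"sync"
)

// failureLogGate keeps the "Consul fingerprint failed" warning to a single
// line per Consul cluster until a successful fingerprint (or agent reload)
// resets it. Illustrative only, not the actual fingerprinter implementation.
type failureLogGate struct {
	mu     sync.Mutex
	logged map[string]bool // keyed by Consul cluster name
}

func newFailureLogGate() *failureLogGate {
	return &failureLogGate{logged: make(map[string]bool)}
}

// warnOnce logs the failure only the first time it is seen for a cluster.
func (g *failureLogGate) warnOnce(cluster string, err error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.logged[cluster] {
		return // already warned since the last successful fingerprint
	}
	g.logged[cluster] = true
	log.Printf("[WARN] consul fingerprint failed for cluster %q: %v", cluster, err)
}

// reset re-arms the gate after a successful fingerprint or a reload, so a
// later outage is logged again.
func (g *failureLogGate) reset(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.logged, cluster)
}
```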
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/pull/25102
Fixes: https://github.com/hashicorp/nomad/issues/25181
* Add factory hooks for jobs to have previously stable versions and stopped status
* Since #24973 node-read isn't presupposed and so should regex match only on the common url parts
* Job detail tests for title buttons are now bimodal and default to having previously-stable version in history
* prettier plz
* Breaking a thing on purpose to see if my other broken thing is broken
* continue-on-error set to false to get things red when appropriate
* OK what if continue-on-error=true but we do a separate failure reporting after the fact
* fail-fast are you the magic incantation that I need?
* Re-fix my test now that fast-fail is off
* Fix to server-leader by adding a region first, and always()-append to uploading partition results
* Express failure step lists failing tests so you don't have to click back into ember-exam step
* temporary snapshot and logging for flakey test in service job detail
* Bunch of region and tasklogs test fixups
* using allocStatusDistribution to ensure service job always has a non-queued alloc
Job authors need to be able to review what capabilities a dynamic host volume or
CSI volume has so that they can set the correct access mode and attachment mode
in their job. Add these to the CLI output of `volume status`.
Ref: https://hashicorp.atlassian.net/browse/NET-12063
Nomad 1.10.0 removes the legacy Vault token-based workflow, which means
the legacy e2e compatibility tests will no longer work.
The Nomad e2e cluster was using the legacy Vault token-based workflow for
the initial cluster build. This change migrates to the workload identity
flow, which uses authentication methods, roles, and policies.
The Nomad server network has been modified to allow traffic from the HCP
Vault HVN, which is a private network peered into our AWS account. This is
required so that Vault can pull JWKS information from the Nomad API without
going over the public internet.
The cluster build will now also configure a Vault KV v2 mount at a unique
identifier for the e2e cluster. This allows all Nomad workloads and tests
to use it if required.
The vaultsecrets suite has been updated to accommodate the new
changes and extended to test the default workload ID flow for
allocations which use Vault for secrets.
We use `set -eo pipefail` everywhere in the Enos scripts, but several of the
scripts used for checking assertions weren't written with pipefail in mind, so
transient errors caused early exits. This meant that if a server was slightly
late to come back up, we'd hit an error and exit the whole script instead of
polling as expected.
While fixing this, I've made a number of other improvements to the shell scripts:
* I've changed the design of the polling loops so that they call a function
that returns an exit code and sets a `last_error` value, along with any global
variables required by downstream functions. This makes the loops more readable
by reducing the number of global variables, and helped identify some places
where we were exiting instead of returning into the loop (see the sketch after
this list).
* Using `shellcheck -s bash`, I fixed some unused and undefined variables that
we were missing because they were only used on the error paths.
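The scripts themselves are bash, but the shape of the loop is easier to see in a short sketch; this is a Go rendering of the same pattern with illustrative names, not code from the repository:

```go
package pollsketch

import (
	"errors"
	"fmt"
	"time"
)

// poll retries check until it succeeds or the timeout elapses, remembering
// the last error so the final failure explains why the condition never held,
// instead of aborting the whole run on the first transient error.
func poll(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		if err := check(); err != nil {
			lastErr = err // remember it, but keep polling
			time.Sleep(interval)
			continue
		}
		return nil
	}
	if lastErr == nil {
		lastErr = errors.New("no check attempts completed")
	}
	return fmt.Errorf("timed out waiting for condition: %w", lastErr)
}
```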
The nightly E2E run only builds a new AMI when required by changes to the
build. The AMI is tagged with the SHA of the commit that forced that build,
which may not be the commit that's spawning a particular test run. So we have a
resource in the `provision-infra` module that finds that SHA.
But when we run upgrade testing via Enos, we're running the E2E Terraform
configuration from outside the `e2e/terraform` folder, so the script that the
resource runs will fail and prevent us from getting the AMI. Fix the script so
it can be run from any folder.
We also have duplicate resources for the "ubuntu jammy" AMI, but this is because
the Enos matrix might (in the near future) test with ARM64. For now, we'll pin
the Consul server to AMD64. Rename the resource appropriately to make the source
of the duplicate obvious.
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:
* restart: when the tasks of an allocation fail and we try to restart the tasks
in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation
fails before tasks even start), and we need to move
the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
allocations. These are not failures, so we don't want to propagate the
reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule`
tracker for the allocations on the node (it's not the allocation's "fault",
after all). We don't want to run the `migrate` machinery here either, as we
can't contact the down node. To the scheduler, this is effectively the same as
if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the
replacement is intended to be temporary, so we propagate the reschedule tracker.
Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.
Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
* func: remove the lists to override the nomad_local_binary for servers and clients
* docs: add a note to the terraform e2e readme
* fix: remove the extra 'windows' from the aws_ami filter
* style: hcl fmt
Similar to #6732, this removes checking affinity and spread for in-place updates.
Both affinity and spread should be soft preferences for the Nomad scheduler rather than strict constraints, so modifying them should not trigger job reallocation.
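A rough sketch of the idea with trimmed-down placeholder types, not the actual scheduler code: the comparison that decides whether an update can be applied in place deliberately ignores affinity and spread.

```go
package schedsketch

import "reflect"

// taskGroup is a trimmed-down illustrative stand-in for Nomad's task group.
type taskGroup struct {
	Tasks      []string // placeholder for the fields that do force replacement
	Affinities []string // placeholder for affinity rules
	Spreads    []string // placeholder for spread rules
}

// requiresDestructiveUpdate reports whether a change can't be applied in
// place. Affinity and spread are soft placement preferences, so they are
// deliberately excluded from the comparison and editing them no longer
// triggers replacement of running allocations.
func requiresDestructiveUpdate(old, updated *taskGroup) bool {
	// Affinities and Spreads are intentionally not compared here.
	return !reflect.DeepEqual(old.Tasks, updated.Tasks)
}
```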
Fixes #25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Add dead (stopped) to status mapping to clarify Stopped
CE-816
* Pull status mapping into partial and include in job status command
* change `complete` to `dead` in table after discussing with Michael
* added clarifications; add CLI status definitions
* fixed line endings
* fixed typo
In #20165 we fixed a bug where a partially configured `client.template` retry
block would set any unset fields to nil instead of their default values. But
this patch introduced a regression in the default values, so we were now
defaulting to unlimited retries if the retry block was unset. Restore the
correct behavior and add better test coverage in both the config parsing and
the template configuration code.
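A minimal sketch of the shape of the fix; the type, field name, and default value below are assumptions for illustration, not the actual Nomad configuration code:

```go
package configsketch

// retryConfig is an illustrative stand-in for the client template retry
// block; nil fields mean "not set by the user".
type retryConfig struct {
	Attempts *int
}

// withDefaults shows the shape of the fix: an unset or partially set retry
// block falls back to a finite default attempt count rather than silently
// becoming "retry forever".
func withDefaults(user *retryConfig) *retryConfig {
	defaultAttempts := 12 // hypothetical default, for illustration only

	if user == nil {
		return &retryConfig{Attempts: &defaultAttempts}
	}
	out := *user
	if out.Attempts == nil {
		out.Attempts = &defaultAttempts
	}
	return &out
}
```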
Ref: https://github.com/hashicorp/nomad/pull/20165
Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565
In #24526 we updated the Consul and Vault fingerprints so that they are no
longer periodic. This fixed a problem that cluster admins reported where rolling
updates of Vault or Consul would cause a thundering herd of fingerprint updates
across the whole cluster.
But if Consul/Vault is not available during the initial fingerprint, it will
never get fingerprinted again. This is challenging for cluster updates and black
starts because the implicit service startup ordering may require
reloads. Instead, have the fingerprinter run periodically but mark that it has
made its first successful fingerprint of all Consul/Vault clusters. At that
point, we can skip further periodic updates. The `Reload` method will reset the
mark and allow the subsequent fingerprint to run normally.
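A rough sketch of the gating described above, with illustrative names rather than the actual fingerprinter state:

```go
package fingerprint

import "sync"

// firstFingerprintGate tracks whether every configured Consul/Vault cluster
// has been fingerprinted successfully at least once. Illustrative only; the
// real state lives inside Nomad's fingerprinters.
type firstFingerprintGate struct {
	mu      sync.Mutex
	pending map[string]bool // clusters still waiting on a first success
}

// reset (re)arms the gate for all clusters; it is called at startup and from
// Reload so the next fingerprint pass runs normally again.
func (g *firstFingerprintGate) reset(clusters []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending = make(map[string]bool, len(clusters))
	for _, c := range clusters {
		g.pending[c] = true
	}
}

// markSuccess records a successful fingerprint for one cluster.
func (g *firstFingerprintGate) markSuccess(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.pending, cluster)
}

// done reports whether periodic fingerprinting can stop: every cluster has
// had at least one successful fingerprint.
func (g *firstFingerprintGate) done() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.pending) == 0
}
```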
Fixes: https://github.com/hashicorp/nomad/issues/25097
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/issues/24049
* fix: change the value of the version used for testing to account for ent versions
* func: add more specific test for servers stability
* func: change the criteria we use to verify the cluster stability after server upgrades
* style: syntax
In #24650 we switched to using ephemeral state for CNI plugins, so that when a
host reboots and we lose all the allocations we don't end up trying to use IPs
we created in network namespaces we just destroyed. Unfortunately upgrade
testing missed that in a non-reboot scenario, the existing CNI state was being
used by plugins like the ipam plugin to hand out the "next available" IP
address. So with no state carried over, we might allocate new addresses that
conflict with existing allocations. (This can be avoided by draining the node
first.)
As a compatibility shim, copy the old CNI state directory to the new CNI state
directory during agent startup, if the new CNI state directory doesn't already
exist.
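A minimal sketch of the shim, assuming hypothetical old/new state directory paths passed in by the caller; the copy strategy is illustrative, not the actual agent code:

```go
package cnisketch

import (
	"os"
	"os/exec"
)

// migrateCNIState sketches the compatibility shim: if the new (ephemeral) CNI
// state directory doesn't exist yet but the old one does, copy the old state
// over so IPAM doesn't re-issue addresses still held by existing allocations.
// The paths and the copy strategy are assumptions for illustration.
func migrateCNIState(oldDir, newDir string) error {
	if _, err := os.Stat(newDir); err == nil {
		return nil // new state dir already exists; nothing to do
	} else if !os.IsNotExist(err) {
		return err
	}
	if _, err := os.Stat(oldDir); os.IsNotExist(err) {
		return nil // fresh install: no old state to carry over
	}
	if err := os.MkdirAll(newDir, 0o700); err != nil {
		return err
	}
	// Copy the directory contents recursively; a pure-Go walk-and-copy would
	// work just as well.
	return exec.Command("cp", "-r", oldDir+"/.", newDir).Run()
}
```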
Ref: https://github.com/hashicorp/nomad/pull/24650
Add a README describing the setup required for running upgrade testing via
Enos. Also fix the authorization header of our `wget` to use the proper header
for short-lived tokens, and fix the output path variable of the Artifactory step.
Co-authored-by: Juanadelacuesta <8647634+Juanadelacuesta@users.noreply.github.com>