When Nomad registers a service within Consul it is regarded as a
node service. In order for Nomad workloads to read these services,
they must have an ACL policy which includes node_prefix read. If
they do not, the service is filtered out of the results.
This change adds the required permission to the Consul setup
command.
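A rough sketch of the shape of that permission, using the Consul API client. This is not the literal policy the setup command writes; the policy name and exact rules below are illustrative:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

// Illustrative rules only: service registration plus the node_prefix read
// rule this change adds. Without node_prefix read, node services registered
// by Nomad are filtered out of service query results.
const workloadRules = `
service_prefix "" { policy = "write" }
node_prefix    "" { policy = "read" }
`

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	_, _, err = client.ACL().PolicyCreate(&api.ACLPolicy{
		Name:        "nomad-workloads", // illustrative name
		Description: "Policy used by Nomad workload identities",
		Rules:       workloadRules,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
}
```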
Add an upgrade test workload for Consul service mesh with transparent
proxy. Note this breaks from the "countdash" demo. The dashboard application
can only verify the backend is up by making a websocket connection, which we
can't do as a health check, and the health check it exposes for that purpose
only passes once the websocket connection has been made. So we replace the
dashboard with a minimal nginx reverse proxy to the count-api instead.
Ref: https://hashicorp.atlassian.net/browse/NET-12217
* Basic implementation for server members and node status
* Commands for alloc status and job status
* -ui flag for most commands
* url hints for variables
* url hints for job dispatch, evals, and deployments
* agent config ui.cli_url_links to disable
* Fix an issue where path prefix was presumed for variables
* driver uncomment and general cleanup
* -ui flag on the generic status endpoint
* Job run command gets namespaces, and no longer gets ui hints for --output flag
* Dispatch command hints get a namespace, and a bunch of tests
* Lots of tests depend on specific output, so let's not mess with them
* figured out what flagAddress is all about for testServer, oof
* Parallel outside of test instances
* Browser-opening test, sorta
* Env var for disabling/enabling CLI hints
* Addressing a few PR comments
* CLI docs: available flags lists now all include -ui
* PR comments addressed; switched the env var to be consistent and scrunched monitor-adjacent hints a bit more
* ui.Output -> ui.Warn; moves hints from stdout to stderr (see the sketch after this list)
* isTerminal check and parseBool on command option
* terminal.IsTerminal check removed for test-runner-not-being-terminal reasons
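A minimal sketch of the hint behavior described above, assuming a hypothetical helper and env var name rather than the real CLI code:

```go
package command // illustrative placement

import (
	"fmt"
	"io"
	"os"
	"strconv"
)

// maybePrintUIHint is a hypothetical helper: hints are written to stderr
// (ui.Warn-style output) so stdout stays clean for scripts, and they can be
// suppressed either by the agent's ui.cli_url_links config or by an env var
// on the CLI side.
func maybePrintUIHint(stderr io.Writer, serverHintsEnabled bool, url string) {
	if !serverHintsEnabled {
		return
	}
	if v := os.Getenv("NOMAD_CLI_SHOW_HINTS"); v != "" { // hypothetical env var name
		if show, err := strconv.ParseBool(v); err == nil && !show {
			return
		}
	}
	fmt.Fprintf(stderr, "\n==> View this in the Web UI: %s\n", url)
}
```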
When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout
expires (by default 30s). But if the plugin never becomes healthy, we're not
restarting the task as documented.
Update the plugin supervisor to trigger a restart instead. We still exit the
supervisor loop at that point to avoid having the supervisor send probes to a
task that isn't running yet. This requires reworking the poststart hook to allow
the supervisor loop to be restarted when the task restarts.
In doing so, I identified that we weren't respecting the task kill context from
the poststart hook, which would leave the supervisor running in the window
between when a task was killed because it failed and when its stop hooks were
triggered. Combine the two contexts to make sure we stop the supervisor
whichever context gets closed first.
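A minimal sketch of the combined-context idea, assuming Go 1.21+ and an illustrative helper name rather than the actual hook code:

```go
package csi // illustrative placement

import "context"

// combineContexts is a hypothetical helper showing the idea: derive a context
// that closes when either the task kill context or the hook's shutdown context
// is done, so the supervisor stops whichever closes first.
func combineContexts(killCtx, shutdownCtx context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(killCtx)
	stop := context.AfterFunc(shutdownCtx, cancel) // requires Go 1.21+
	return ctx, func() {
		stop()
		cancel()
	}
}
```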
Fixes: https://github.com/hashicorp/nomad/issues/25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264
The check to read back node metadata depends on a resource that waits for the
Nomad API, but that resource doesn't wait for the metadata to be written in the
first place (and the client subsequently upgraded). Add this dependency so that
we're reading back the node metadata as the last step.
Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13690355150/job/38282457406
When upgrading from older versions of Nomad, the reschedule policy block may be
nil. There is logic to handle this safely in the `NextRescheduleTimeByTime`
method used for allocs on disconnected clients, but it's missing from the
`NextRescheduleTime` method used by more typical allocations. Return an empty
time object in this case.
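A sketch of the nil-safe behavior, using a trimmed-down stand-in for the real struct:

```go
package structs // illustrative placement

import "time"

// ReschedulePolicy is a trimmed-down stand-in for the real struct.
type ReschedulePolicy struct {
	Delay time.Duration
}

// nextRescheduleTime sketches the nil-safe behavior: jobs upgraded from older
// Nomad versions may carry a nil reschedule policy, so return the zero time
// value instead of dereferencing the nil pointer.
func nextRescheduleTime(failTime time.Time, p *ReschedulePolicy) (time.Time, bool) {
	if p == nil {
		return time.Time{}, false
	}
	return failTime.Add(p.Delay), true
}
```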
Fixes: https://github.com/hashicorp/nomad/issues/24846
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by fields in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Getting the CSI test to work with AWS EFS or EBS has proven to be awkward
because we're having to deal with external APIs with their own consistency
guarantees, as well as challenges around teardown. Make the CSI test entirely
self-contained by using a userland NFS server and the rocketduck CSI plugin.
Ref: https://hashicorp.atlassian.net/browse/NET-12217
Ref: https://gitlab.com/rocketduck/csi-plugin-nfs
The CSI workload is failing and creating complications for teardown, so I'm
reworking it. But this work is taking a while to finish, so while that's in
progress let's disable the CSI workload so that we're running the upgrade tests
all the way through to the end. I expect to be able to revert this in the next
couple days.
During initial development of upgrade testing, we had a hard-coded prefix to
distinguish between clusters created for this purpose vs. those created by GHA
runners. Update the prefix to be a variable so that developers can add their own
prefix during test workload development.
* fix: fix the docker image parser to account for private repos
* style: change the local regex for docker image identifiers and use the docker package instead (see the sketch after this list)
* func: return early when no repo is found in the image name
* func: return an error if no path is found in the image
* Update drivers/docker/utils.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update coordinator.go
* Update driver.go
* Update network.go
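Sketch of the approach; the import path and helper name are assumptions, not the driver's actual code:

```go
package docker // illustrative placement

import (
	"fmt"

	"github.com/distribution/reference" // import path is an assumption
)

// parseImage sketches leaning on the reference package instead of a local
// regex, so private registry images (e.g. registry.example.com/team/app:v1)
// split cleanly into repository and tag.
func parseImage(image string) (repo, tag string, err error) {
	named, err := reference.ParseNormalizedNamed(image)
	if err != nil {
		return "", "", fmt.Errorf("failed to parse image %q: %w", image, err)
	}
	named = reference.TagNameOnly(named) // default to :latest when no tag is given
	repo = reference.FamiliarName(named)
	if tagged, ok := named.(reference.Tagged); ok {
		tag = tagged.Tag()
	}
	return repo, tag, nil
}
```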
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* func: add dependencies to avoid race conditions and move the update of each client into the main upgrade scenario
* Update enos/enos-scenario-upgrade.hcl
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update enos/enos-scenario-upgrade.hcl
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Before the fixes in #20165, the wait feature was disabled by
default. After these changes, it's always enabled, which - at
least on some platforms - leads to a significant increase in
load (5-7x).
This patch allows disabling the wait feature in the client
stanza of the configuration file by setting min and max to 0:
wait {
  min = "0"
  max = "0"
}
Per-template wait blocks in the task description still work like
one would expect.
* docs: Add a note to the alloc stop docs to make sure system allocs are not rescheduled
* Update stop.mdx
* Update website/content/docs/commands/alloc/stop.mdx
Co-authored-by: Tim Gross <tgross@hashicorp.com>
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The paginator was developed before generics were available, so we've had to work
around a lack of compile-time safety by creating configuration objects at
runtime that require a lot of branching and type casts. This results in a lot of
added boilerplate in the RPC handlers.
Refactor the paginator to take advantage of generics.
* Move all decision making around tokenization to compile-time by providing
pre-built generic functions that close over target tokens.
* Remove the `appendFunc` parameter in favor of a `Stub` function parameter that
will accept existing `Stub` functions in most cases (with the addition of an
extra `error` return value).
* Generally remove boilerplate in the RPC handlers as a result, except where a
given handler wants more complex filtering.
This doesn't reduce the boilerplate we need at the top of many blocking queries
to define the iterator we want based on arguments, which we're typically doing
to decide upon which memdb index we want. That's a query optimization problem
and way beyond the scope of this PR.
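For illustration, a rough sketch of the generic shape this refactor moves toward (not Nomad's exact API):

```go
package paginator // illustrative placement

// Paginator sketches the generics-based shape: the tokenizer and stub
// conversion functions are supplied as typed parameters, so no runtime type
// casts are needed in the RPC handlers.
type Paginator[T, S any] struct {
	pageSize int
	tokenFn  func(T) string     // builds the pagination token for an item
	stubFn   func(T) (S, error) // converts a raw object into its API stub
}

// Page returns up to pageSize stubs starting at fromToken, plus the token of
// the next page ("" when there are no more items).
func (p *Paginator[T, S]) Page(items []T, fromToken string) ([]S, string, error) {
	out := make([]S, 0, p.pageSize)
	for _, item := range items {
		tok := p.tokenFn(item)
		if tok < fromToken {
			continue // not yet at the requested page
		}
		if len(out) == p.pageSize {
			return out, tok, nil // this token starts the next page
		}
		stub, err := p.stubFn(item)
		if err != nil {
			return nil, "", err
		}
		out = append(out, stub)
	}
	return out, "", nil
}
```

In most handlers the stub function can simply wrap an existing `Stub` method and add the extra `error` return value.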
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduced the `nomad.advertise.address` attribute and in #19969 we
introduced the `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.
Additionally, when checking whether a node escaped its computed class, we were
not filtering out attributes that start with `unique.`. The test for this,
introduced in #708, had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.
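Illustrative sketch of the intended filtering; the attribute names come from this message, but the helper itself is hypothetical:

```go
package scheduler // illustrative placement

import "strings"

// classAttributes sketches the filtering this fix is about: when computing the
// node class hash, skip anything in the "unique." namespace plus known
// per-node attributes such as nomad.advertise.address and consul.dns.addr, so
// that two nodes of the same class produce the same hash.
func classAttributes(attrs map[string]string) map[string]string {
	perNode := map[string]bool{
		"nomad.advertise.address": true,
		"consul.dns.addr":         true,
	}
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if strings.HasPrefix(k, "unique.") || perNode[k] {
			continue
		}
		out[k] = v
	}
	return out
}
```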
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used by the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Add an upgrade test workload for CSI with the AWS EFS plugin. In order to
validate this workload, we'll need to deploy the plugin job and then register a
volume with it. So this extends the `run_workloads` module to allow for "pre
scripts" and "post scripts" to be run before and after a given job has been
deployed. We can use that as a model for other test workloads.
Ref: https://hashicorp.atlassian.net/browse/NET-12217
Enos buries the Terraform output from provisioning. Add a shell script to load
the environment from provisioning for debugging Nomad during development of
upgrade tests.
* func: Add more workloads
* Update jobs.sh
* Update versions.sh
* style: format
* Update enos/modules/test_cluster_health/scripts/allocs.sh
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* docs: improve outputs descriptions
* func: change docker workloads to be redis boxes and add healthchecks
* func: register the services on consul
* style: format
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In #25185 we changed the output of `volume status` to include both DHV and CSI
volumes by default. When the E2E test parses the output, it's not expecting the
new section header.
Ref: https://github.com/hashicorp/nomad/pull/25185
* Changes the behaviour of system/batch/sysbatch jobs not to look for a latest stable version, as their versions never go to stable
* Don't show job stability on versions page for system/sysbatch/batch jobs
* Tests that depend on jobs to revert specify that they are Service jobs
* Batch jobs added to detail-restart test loop
* Right, they're not stable, they're just versions
Dependabot can update actions to versions that are not in the TSCCR
allowlist. The TSCCR check doesn't happen in CE, which means we don't learn we
have a problem until after we've spent the effort to backport them. Remove the
automation that updates actions until this issue is resolved on
the security team's side.