The docs for ephemeral disk migration use the term "best effort" without
outlining the requirements or the cases under which the migration can
fail. Update the docs to make it obvious that ephemeral disk migration is
subject to data loss.
Fixes: https://github.com/hashicorp/nomad/issues/20355
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.
Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.
This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
isn't necessary so long as we provide the real WID manager with the mock
signer and never call `Run` on it. It wasn't feasible to exercise the correct
behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.
Fixes: https://github.com/hashicorp/nomad/issues/20025
The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.
When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.
Fixes: https://github.com/hashicorp/nomad/issues/19988
The `getPortMapping` method forces callers to handle two different data
structures, but only one caller cares about it. We don't want to return a single
map or slice because the `cni.PortMapping` object doesn't include a label field
that we need for tproxy. Return a new datastructure that closes over both a
slice of `cni.PortMapping` and a map of label to index in that slice.
When a node is set to drain, the state store reads the auth token off the
request to record `LastDrain` metadata about the token used to drain the
node. This code path in the state store can't correctly handle signed Workload
Identity tokens or bearer tokens that may have expired (for example, while
restarting a server and applying uncompacted Raft logs).
Rather than re-authenticating the request at the time of FSM apply, record the
string derived from the authenticated identity as part of the Raft log
entry.
Fixes: https://github.com/hashicorp/nomad/issues/17471
When the `client.servers` block is parsed, we split the port from the
address. This does not correctly handle IPv6 addresses when they are in URL
format (wrapped in brackets), which we require to disambiguate the port and
address.
Fix the parser to correctly split out the port and handle a missing port value
for IPv6. Update the documentation to make the URL format requirement clear.
Fixes: https://github.com/hashicorp/nomad/issues/20310
When `nomad fmt` writes to stdout instead of overwriting a file, the command was
using the `UI` output, which appends an extra newline. This results in extra
trailing newlines when using `nomad fmt` as part of a pipeline or editor plugin.
Update the command to write directly to stdout when in the stdout mode.
Fixes: https://github.com/hashicorp/nomad/issues/20307
Add the `consul-cni` plugin to the Linux AMI for E2E, and add a test case that
covers the transparent proxy feature. Add test assertions to the Connect tests
for upstream reachability
Ref: https://github.com/hashicorp/nomad/pull/20175
Add a constraint on job submission that requires the `consul-cni` plugin
fingerprint whenever transparent proxy is used.
Add a validation that the `network.dns` cannot be set when transparent proxy is
used, unless the `no_dns` flag is set.
* enable uint64 pagination tokens, so they can be compared as numbers instead of strings
* tokenize job ModifyIndex as uint64, so an new upcoming state index can paginate properly
* test require->must
The E2E test for periodic dispatch jobs has a `cron` trigger for once a
minute. If the test happens to run at the top of the minute, it's possible for
the forced dispatch to run from the test code, then the periodic timer triggers
and leaves a running child job. This fails the test because it expects only a
single job in the "dead" state.
Make it so that the `cron` expression is implausible to run during our test
window, and migrate the test off the old framework while we're at it.
Update the service mesh integration docs to explain how Consul needs to be
configured for transparent proxy. Update the walkthrough to assume that
`transparent_proxy` mode is the best approach, and move the manually-configured
`upstreams` to a separate section for users who don't want to use Consul DNS.
Ref: https://github.com/hashicorp/nomad/pull/20175
Ref: https://github.com/hashicorp/nomad/pull/20241
When `transparent_proxy` block is present and the network mode is `bridge`, use
a different CNI configuration that includes the `consul-cni` plugin. Before
invoking the CNI plugins, create a Consul SDK `iptables.Config` struct for the
allocation. This includes:
* Use all the `transparent_proxy` block fields
* The reserved ports are added to the inbound exclusion list so the alloc is
reachable from outside the mesh
* The `expose` blocks and `check` blocks with `expose=true` are added to the
inbound exclusion list so health checks work.
The `iptables.Config` is then passed as a CNI argument to the `consul-cni`
plugin.
Ref: https://github.com/hashicorp/nomad/issues/10628
Add a transparent proxy block to the existing Connect sidecar service proxy
block. This changeset is plumbing required to support transparent proxy
configuration on the client.
Ref: https://github.com/hashicorp/nomad/issues/10628
Vault 1.16.1 has a known issue around the JWT auth configuration that will
prevent this test from ever passing. Skip testing the JWT code path on
1.16.1. Once 1.16.2 ships it will no longer get skipped.
Ref: https://github.com/hashicorp/nomad/issues/20298
Migrate our E2E tests for Connect off the old framework in preparation for
writing E2E tests for transparent proxy and the updated workload identity
workflow. Mark the tests that cover the legacy Consul token submitted workflow.
Ref: https://github.com/hashicorp/nomad/pull/20175
The JSON response for the Read Stats client API includes an `AllocDirStats`
field. This field is missing in the `api` package, so consumers of the Go API
can't use it to read the values we're getting back from the HTTP server.
Fixes: https://github.com/hashicorp/nomad/issues/20246
If a E2E cluster is destroyed after a different one has been created, the role
and policy we create in Vault for the cluster will be deleted and Vault-related
tests will fail. Note that before 1.9, we should figure out a way to give HCP
Vault access to the JWKS endpoint and have a different set of policies, but
we'll need to have a role-per-cluster in that case as well.
Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul
token workflow, but they are limited to running single node tests. The E2E
cluster is network isolated, so using our HCP Consul cluster runs into a
problem validating WI tokens because it can't reach the JWKS endpoint. In real
production environments, you'd solve this with a CNAME pointing to a public IP
pointing to a proxy with a real domain name. But that's logisitcally
impractical for our ephemeral nightly cluster.
Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our
Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can
reach each other. This will allow us to update our Consul tests so they can use
Workload Identity, in a separate PR.
Ref: #19698
The `DNSConfig.IsZero` method incorrectly returns true if any of the fields are
empty, rather than if all of them are empty.
The only code path that consumes this method is on the client, where it's used
as part of equality checks on the allocation network status to set the priority
of allocation updates to the server. Hypothetically, if the network hook
modified only the DNS configuration and no task states were emitted, it would be
possible to miss an allocation update. In practice this appears to be
impossible, but we should fix the bug so that there aren't errors in future
consumers.
This PR adds a job mutator which injects constraints on the job taskgroups
that make use of bridge networking. Creating a bridge network makes use of the
CNI plugins: bridge, firewall, host-local, loopback, and portmap. Starting
with Nomad 1.5 these plugins are fingerprinted on each node, and as such we
can ensure jobs are correctly scheduled only on nodes where they are available,
when needed.
Version of Nomad and Consul that were known not to be compatible are no longer
supported in general. Update the compatibility matrix for Consul to match.
A Nomad user reported an issue where template runner `View.poll` goroutines were
being leaked when using templates with many dependencies. This resource leak was
fixed in consul-template 0.37.4.
Fixes: https://github.com/hashicorp/nomad/issues/20163
Although it's not recommended, it's possible to federate regions without ACLs
enabled. In this case, ACL-related objects such as namespaces and node pools can
be written independently in each region and won't be replicated. If you use
commands like `namespace apply` or `node pool delete`, the RPC is supposed to be
forwarded to the authoritative region. But when ACLs are disabled, there is no
authoritative region and so the RPC will always be applied to the local region
even if the `-region` flag is passed.
Remove the change to the RPC region for the namespace and node pool write RPC
whenver ACLs are disabled, so that forwarding works.
Fixes: https://github.com/hashicorp/nomad/issues/20197
Ref: https://github.com/hashicorp/nomad/issues/20128
Also add an explicit exit code to subproc package for when a child
process is instructed to run an unrunnable command (i.e. cannot be
found or is not executable) - with the 127 return code folks using bash
are familiar with
The pprof `allocs` profile is identical to the `heap` profile, just with a
different default view. Collecting only one of the two is sufficient to view all
of `alloc_objects`, `alloc_space`, `inuse_objects`, and `inuse_space`, and
collecting only one means that both views will be of the same profile.
Also improve the docstrings on the goroutine profiles explaining what's in each
so that it's clear why we might want all of debug=0, debug=1, and debug=2.