The RPC TLS enforcement test creates network connections to a server, and these
occasionally fail in testing with `write: broken pipe` errors. This has
been an ongoing issue where it appears to get fixed, then recurs, and no one
seems to be able to reproduce it outside of CI. The test assertion itself is
reliable, which is why it's been hard to justify spending effort to hunt this down.
The failing test cases are ones that are never supposed to work because they fail
our TLS cert role validation. The error message comes from the TLS handshake
error. The RPC connection handler closes the connection immediately on getting
the error from the TLS handshake. The stdlib's TLS library flushes the
connection's buffer before returning the error. So the theory is that in the
failing case we don't get the error message before the connection is closed, but
do get the error return that allows the client to move on to a write, which
tries to write on the closed pipe.
I've been unable to reproduce this exactly, as the race is effectively between
the OS and the runtime. The equivalent test of the Raft TLS enforcement includes
handling of an EOF instead of the certificate error, so this appears to be
expected (or at least known) behavior. Because the code under test is operating
as expected, this changeset updates the assertion to accept the error.
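As a rough illustration (not the test's actual code), the relaxed assertion accepts
any of the errors the race can surface, since all of them mean the handshake was
rejected; `assertTLSRejected` is a hypothetical helper:

```go
package rpc_test

import (
	"errors"
	"io"
	"strings"
	"testing"
)

// assertTLSRejected passes if the connection failed for any of the reasons
// the handshake race can produce.
func assertTLSRejected(t *testing.T, err error) {
	t.Helper()
	if err == nil {
		t.Fatal("expected the connection to be rejected")
	}
	switch {
	case strings.Contains(err.Error(), "certificate"): // role validation error was flushed in time
	case errors.Is(err, io.EOF): // server closed before the alert reached us
	case strings.Contains(err.Error(), "broken pipe"): // we wrote after the server closed
	default:
		t.Fatalf("unexpected error: %v", err)
	}
}
```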
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.
Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.
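A simplified sketch of the extra filtering step, with placeholder types rather than
the scheduler's real structures:

```go
package scheduler

// deviceInstance is a stand-in for the scheduler's per-instance data.
type deviceInstance struct {
	ID      string
	Healthy bool
}

// pickInstance selects the first healthy instance whose ID also satisfies the
// ${device.ids} constraint, instead of blindly taking the first available
// instance on a matching device.
func pickInstance(instances []deviceInstance, idAllowed func(id string) bool) (string, bool) {
	for _, inst := range instances {
		if !inst.Healthy {
			continue
		}
		if idAllowed != nil && !idAllowed(inst.ID) {
			continue // node's device matched, but this instance's ID does not
		}
		return inst.ID, true
	}
	return "", false
}
```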
Fixes: #18112
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.
There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.
For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!
There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.
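As a sketch of the intended local workflow (paths and directory layout here are
simplified assumptions, not the client's actual structure):

```go
package allocdir

import (
	"os"
	"path/filepath"
)

// migrateLocalEphemeralDisk moves the previous allocation's ephemeral disk
// data into the new allocation's directory, but leaves the previous allocdir
// itself in place so its logs remain readable until the client's normal
// garbage collection removes it.
func migrateLocalEphemeralDisk(prevAllocDir, newAllocDir, taskName string) error {
	src := filepath.Join(prevAllocDir, taskName, "local")
	dst := filepath.Join(newAllocDir, taskName, "local")

	entries, err := os.ReadDir(src)
	if err != nil {
		return err
	}
	for _, entry := range entries {
		if err := os.Rename(
			filepath.Join(src, entry.Name()),
			filepath.Join(dst, entry.Name()),
		); err != nil {
			return err
		}
	}
	// Note: no os.RemoveAll(prevAllocDir) here -- deleting the whole previous
	// allocdir is what destroyed the logs.
	return nil
}
```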
Fixes: #18034
There are some refactorings that have to be made in the getter and state
packages where the API changed in `slices`.
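One example of the kind of change the bump forces, assuming it picks up the
`slices.SortFunc` switch from a less function to a comparison function (the type
here is a placeholder):

```go
package sorting

import "golang.org/x/exp/slices"

type allocStub struct{ CreateIndex uint64 }

func sortByCreateIndex(allocs []allocStub) {
	// Before the bump, SortFunc took a less function returning bool:
	//   slices.SortFunc(allocs, func(a, b allocStub) bool { return a.CreateIndex < b.CreateIndex })
	// After the bump, it takes a comparison function returning an int.
	slices.SortFunc(allocs, func(a, b allocStub) int {
		switch {
		case a.CreateIndex < b.CreateIndex:
			return -1
		case a.CreateIndex > b.CreateIndex:
			return 1
		default:
			return 0
		}
	})
}
```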
* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function
* Bones of a component that has job variable awareness
* Got vars listed woo
* Variables as its own subnav and some pathLinkedVariable perf fixes
* Automatic Access to Variables alerter
* Helper and component to conditionally render the right link
* A bit of cleanup post-template stuff
* testfix for looping right-arrow keynav because we have a new subnav section
* A very roundabout way of ensuring that, if a job exists when saving a variable with a pathLinkedEntity of that job, it's saved right through to the job itself
* hacky but an async version of pathLinkedVariable
* model-driven and async fetcher driven with cleanup
* Only run the update-job func if jobname is detected in var path
* Test cases begun
* Management token for variables to appear in tests
* It's a management token so it gets to see the clients tab under system jobs
* Pre-review cleanup
* More tests
* Number of requests test and small fix to groups-by-way-of-resource-arrays elsewhere
* Variable intro text tests
* Variable name re-use
* Simplifying our wording a bit
* parse json vs plainId
* Addressed PR feedback, including de-waterfalling
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).
Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.
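Roughly, the new verification step looks like the following sketch; `jobStub` and
`verifySingleJob` are placeholders, not the actual handler code:

```go
package command

import "fmt"

// jobStub is a stand-in for the job list stub returned by a prefix search
// across namespaces.
type jobStub struct {
	ID        string
	Namespace string
}

// verifySingleJob requires exactly one job to match the (possibly prefix)
// -job argument so the RPC has an unambiguous namespace and job ID to query.
func verifySingleJob(matches []jobStub, prefix string) (jobStub, error) {
	switch len(matches) {
	case 0:
		return jobStub{}, fmt.Errorf("no job matches prefix %q", prefix)
	case 1:
		return matches[0], nil
	default:
		return jobStub{}, fmt.Errorf("prefix %q matched %d jobs; provide a more specific prefix", prefix, len(matches))
	}
}
```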
Fixes: #12097
* Attempt at a varied end-result when sorting and searching
* Consider sort direction as well
* computed property dep update
* prioritizeSearchOrder and test
* Side-effecty but resets sort on search etc
* changelog
In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously, changes to the `restart` block were always non-destructive in
the scheduler, but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
caused panics in our scheduler tests.
This feature is necessary when users want to explicitly re-render all templates
on task restart, e.g. to fetch new secrets from Vault even if the lease on the
existing secrets has not expired.
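A minimal sketch of the nil-safe check, using a trimmed-down stand-in for the
restart block rather than the real struct:

```go
package structs

// restartBlock is a stand-in for the jobspec restart block.
type restartBlock struct {
	RenderTemplates bool
	// other fields elided
}

// renderTemplatesChanged treats a nil block as the default (false) so that
// comparing task groups without a restart block no longer panics.
func renderTemplatesChanged(prev, next *restartBlock) bool {
	prevVal := prev != nil && prev.RenderTemplates
	nextVal := next != nil && next.RenderTemplates
	return prevVal != nextVal
}
```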
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday, and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to whatever's the new
hotness to maximize the time-to-live on these until we have automated pinning
set up.
Also bumps our chromedriver action, which broke upstream today.
Add a JWKS endpoint to the HTTP API for exposing the root public signing keys used to sign workload identity JWTs.
Part 1 of N components as part of making workload identities consumable by third-party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with third parties and are still to come, so this merge does not yet document this endpoint.
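For illustration only (the endpoint is intentionally undocumented for now, so the
path below is an assumption and may change), a consumer could fetch and decode the
key set like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jwks mirrors the standard JSON Web Key Set shape; only a few fields shown.
type jwks struct {
	Keys []struct {
		KeyID     string `json:"kid"`
		KeyType   string `json:"kty"`
		Algorithm string `json:"alg"`
	} `json:"keys"`
}

func main() {
	// Assumed path on a local agent; check the eventual documentation.
	resp, err := http.Get("http://localhost:4646/.well-known/jwks.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var keySet jwks
	if err := json.NewDecoder(resp.Body).Decode(&keySet); err != nil {
		panic(err)
	}
	for _, k := range keySet.Keys {
		fmt.Printf("kid=%s kty=%s alg=%s\n", k.KeyID, k.KeyType, k.Algorithm)
	}
}
```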
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When accessing a region running a version of Nomad without node pools, an
error was thrown because the request is handled by the nodes endpoint,
which fails because it interprets `pools` as a node ID.
When a request is made to an RPC service that doesn't exist (for
example, a cross-region request from a newer version of Nomad to an
older version that doesn't implement the endpoint), the application
should return an empty list as well.
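A rough sketch of that fallback, with a hypothetical error check standing in for
however the handler actually detects the missing RPC:

```go
package api

import "strings"

// NodePool is a stand-in for the node pool API object.
type NodePool struct{ Name string }

// listNodePools returns an empty list instead of an error when the forwarded
// region predates node pools and rejects the RPC as unknown.
func listNodePools(rpcList func() ([]*NodePool, error)) ([]*NodePool, error) {
	pools, err := rpcList()
	if err != nil {
		// Hypothetical detection of the "unknown service" error an older
		// region returns for an RPC it does not implement.
		if strings.Contains(err.Error(), "can't find service") {
			return []*NodePool{}, nil
		}
		return nil, err
	}
	return pools, nil
}
```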
The upgrade path to Nomad 1.6.0 requires canonicalizing namespaces in order to
set the default scheduler configuration values.
The previous implementation only canonicalized on namespace upsert operations,
which works for recent namespaces because those Raft transactions are reapplied
on upgrade.
But for older namespaces restored from a snapshot, that code path did not
canonicalize them, leaving the scheduler configuration set as `nil`.
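In sketch form (the `Namespace` fields here are placeholders, not the real struct),
the fix is to canonicalize during snapshot restore as well as on upsert:

```go
package state

// Namespace is a trimmed-down stand-in with a hypothetical config field.
type Namespace struct {
	Name            string
	SchedulerConfig *NamespaceSchedulerConfig // nil in snapshots taken before the upgrade
}

type NamespaceSchedulerConfig struct {
	// default values elided
}

// Canonicalize fills in defaults for fields added in newer versions.
func (n *Namespace) Canonicalize() {
	if n.SchedulerConfig == nil {
		n.SchedulerConfig = &NamespaceSchedulerConfig{}
	}
}

// restoreNamespace is called for each namespace read from a snapshot.
// Canonicalizing here covers older namespaces whose upsert transactions are
// not replayed on upgrade.
func restoreNamespace(ns *Namespace) *Namespace {
	ns.Canonicalize()
	return ns
}
```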
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
https://github.com/container-storage-interface/spec/issues/512)
These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.
This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
makes a destructive CSI Controller RPC, we send the RPC through this serializer
and only one RPC is sent at a time; any other RPCs in flight will block (a
sketch of the idea follows this list).
* Ensure that requests go to the same controller plugin instance whenever
possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
node, not just ones eligible for scheduling (which CSI controllers don't care
about)
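A simplified sketch of the serializer idea, shown here keyed by plugin ID (the
actual granularity and handler wiring differ):

```go
package csi

import "sync"

// controllerSerializer hands out one lock per plugin so destructive
// controller RPCs for that plugin run one at a time.
type controllerSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (s *controllerSerializer) lockFor(pluginID string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks == nil {
		s.locks = map[string]*sync.Mutex{}
	}
	l, ok := s.locks[pluginID]
	if !ok {
		l = &sync.Mutex{}
		s.locks[pluginID] = l
	}
	return l
}

// serializedControllerRPC wraps a destructive controller RPC; any other
// callers for the same plugin block until it returns.
func (s *controllerSerializer) serializedControllerRPC(pluginID string, rpc func() error) error {
	l := s.lockFor(pluginID)
	l.Lock()
	defer l.Unlock()
	return rpc()
}
```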
Fixes: #15415