When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first available instance. If the jobspec uses a
constraint on the device ID, this can lead to surprising and buggy behavior
where the node's device matches the constraint but the individual device
instance does not.
Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.
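A minimal sketch of that second filter, using hypothetical names (`pickInstance`, `idConstraint`) rather than the scheduler's real types:

```go
package main

import "fmt"

// idConstraint is an illustrative stand-in for a jobspec constraint on
// ${device.ids}; the real scheduler types are more involved.
type idConstraint struct {
	allowed []string
}

func (c idConstraint) matches(id string) bool {
	for _, want := range c.allowed {
		if want == id {
			return true
		}
	}
	return false
}

// pickInstance sketches the fix: once the node's device group is feasible,
// each free instance ID is also checked against the ${device.ids}
// constraints instead of taking the first free instance blindly.
func pickInstance(instanceIDs []string, used map[string]bool, constraints []idConstraint) (string, bool) {
	for _, id := range instanceIDs {
		if used[id] {
			continue // instance already claimed by another allocation
		}
		ok := true
		for _, c := range constraints {
			if !c.matches(id) {
				ok = false // the node's device matched, but this instance ID does not
				break
			}
		}
		if ok {
			return id, true
		}
	}
	return "", false
}

func main() {
	ids := []string{"GPU-aaaa", "GPU-bbbb"}
	constraints := []idConstraint{{allowed: []string{"GPU-bbbb"}}}
	fmt.Println(pickInstance(ids, map[string]bool{}, constraints)) // GPU-bbbb true
}
```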
Fixes: #18112
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.
There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.
For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!
There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.
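A rough sketch of the shape of the change in the local workflow, with illustrative names rather than the real allocdir API:

```go
package migrate

import (
	"os"
	"path/filepath"
)

// migrateLocal sketches the "local" workflow described above. Task data is
// moved into the new allocation, but the previous allocdir itself is no
// longer removed, so its logs stay readable until normal client garbage
// collection destroys it.
func migrateLocal(prevAllocDir, newAllocDir string, taskDirs []string) error {
	for _, dir := range taskDirs {
		src := filepath.Join(prevAllocDir, dir)
		dst := filepath.Join(newAllocDir, dir)
		if err := os.Rename(src, dst); err != nil {
			return err
		}
	}
	// Previously something like os.RemoveAll(prevAllocDir) ran here, which
	// also deleted the previous allocation's logs.
	return nil
}
```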
Fixes: #18034
There are some refactorings that have to be made in the getter and state
code where the API changed in `slices` (see the sketch after the list below).
* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function
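For context, a hedged example (not Nomad code) of the kind of `slices` API change involved: newer `golang.org/x/exp` versions of `SortFunc` take a comparison function returning an `int` instead of a `less` function returning a `bool`.

```go
package main

import (
	"fmt"

	"golang.org/x/exp/slices"
)

type alloc struct{ CreateIndex uint64 }

func main() {
	allocs := []alloc{{3}, {1}, {2}}

	// Older x/exp/slices:
	//   slices.SortFunc(allocs, func(a, b alloc) bool { return a.CreateIndex < b.CreateIndex })
	// Newer x/exp/slices takes a cmp function returning an int:
	slices.SortFunc(allocs, func(a, b alloc) int {
		switch {
		case a.CreateIndex < b.CreateIndex:
			return -1
		case a.CreateIndex > b.CreateIndex:
			return 1
		default:
			return 0
		}
	})

	fmt.Println(allocs) // [{1} {2} {3}]
}
```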
* Bones of a component that has job variable awareness
* Got vars listed woo
* Variables as its own subnav and some pathLinkedVariable perf fixes
* Automatic Access to Variables alerter
* Helper and component to conditionally render the right link
* A bit of cleanup post-template stuff
* Test fix for looping right-arrow keynav because we have a new subnav section
* A very roundabout way of ensuring that, if a job exists when saving a variable with a pathLinkedEntity of that job, it's saved right through to the job itself
* hacky but an async version of pathLinkedVariable
* model-driven and async fetcher driven with cleanup
* Only run the update-job function if the job name is detected in the variable path
* Test cases begun
* Management token for variables to appear in tests
* It's a management token, so it gets to see the clients tab under system jobs
* Pre-review cleanup
* More tests
* Number of requests test and small fix to groups-by-way-of-resource-arrays elsewhere
* Variable intro text tests
* Variable name re-use
* Simplifying our wording a bit
* parse json vs plainId
* Addressed PR feedback, including de-waterfalling
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).
Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.
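A hedged sketch of that verification step, assuming the `api` package's job list stub exposes `ID` and `Namespace`; the helper itself is illustrative, not the actual command code:

```go
package cli

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// lookupSingleJob lists jobs by prefix (namespace may be the wildcard "*")
// and only proceeds if exactly one job matches, mirroring the ambiguity
// check described above.
func lookupSingleJob(client *api.Client, jobID, namespace string) (string, string, error) {
	opts := &api.QueryOptions{Namespace: namespace, Prefix: jobID}
	jobs, _, err := client.Jobs().List(opts)
	if err != nil {
		return "", "", err
	}
	switch len(jobs) {
	case 0:
		return "", "", fmt.Errorf("no job matches prefix %q", jobID)
	case 1:
		return jobs[0].ID, jobs[0].Namespace, nil
	default:
		return "", "", fmt.Errorf("prefix %q matched %d jobs, please narrow it down", jobID, len(jobs))
	}
}
```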
Fixes: #12097
* Attempt at a varied end-result when sorting and searching
* Consider sort direction as well
* computed property dep update
* prioritizeSearchOrder and test
* Side-effecty but resets sort on search etc
* changelog
In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
caused panics in our scheduler tests.
This feature is necessary when users want to explicitly re-render all templates on task restart,
e.g. to fetch new secrets from Vault even if the lease on the existing secrets has not expired.
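A minimal sketch of the nil-safe check, with stand-in types rather than the scheduler's actual diff code:

```go
package scheduler

// RestartPolicy is a stand-in for the jobspec's restart block.
type RestartPolicy struct {
	RenderTemplates bool
	// other restart fields elided
}

// renderTemplatesChanged treats a missing restart block the same as the
// zero value, so comparing a group or task without a restart block no
// longer panics.
func renderTemplatesChanged(from, to *RestartPolicy) bool {
	var fromVal, toVal bool
	if from != nil {
		fromVal = from.RenderTemplates
	}
	if to != nil {
		toVal = to.RenderTemplates
	}
	return fromVal != toVal
}
```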
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to the latest releases to
maximize the time-to-live on these pins until we have automated pinning
set up.
Also bumps the version of our chromedriver action, which broke upstream today.
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.
Part 1 of N components as part of making workload identities consumable by third-party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with third parties, so this merge does not yet document this endpoint.
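A hedged example of consuming the endpoint; the `/.well-known/jwks.json` path is an assumption here, since this change does not yet document the endpoint:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Assumed JWKS path on a local agent; adjust for your cluster address.
	resp, err := http.Get("http://127.0.0.1:4646/.well-known/jwks.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var jwks struct {
		Keys []map[string]any `json:"keys"` // root public signing keys for workload identity JWTs
	}
	if err := json.NewDecoder(resp.Body).Decode(&jwks); err != nil {
		panic(err)
	}
	fmt.Printf("got %d public signing keys\n", len(jwks.Keys))
}
```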
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When accessing a region running a version of Nomad without node pools, an
error was thrown because the request was handled by the nodes endpoint,
which failed because it assumed `pools` was a node ID.
When a request is made to an RPC service that doesn't exist (for
example, a cross-region request from a newer version of Nomad to an
older version that doesn't implement the endpoint), the application
should return an empty list as well.
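A hedged sketch of that fallback; the error-string check mirrors Go's generic `net/rpc` "can't find service/method" errors rather than any specific Nomad helper:

```go
package nodepool

import "strings"

// NodePool is a stand-in for the real struct; the whole helper is an
// illustration of the fallback described above.
type NodePool struct{ Name string }

func listNodePools(call func(reply *[]*NodePool) error) ([]*NodePool, error) {
	var pools []*NodePool
	if err := call(&pools); err != nil {
		msg := err.Error()
		if strings.Contains(msg, "can't find service") || strings.Contains(msg, "can't find method") {
			// Older region that doesn't implement the endpoint: treat it as
			// having no node pools instead of surfacing an error.
			return []*NodePool{}, nil
		}
		return nil, err
	}
	return pools, nil
}
```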
The upgrade path to Nomad 1.6.0 requires canonicalizing namespaces in
order to set the default scheduler configuration values.
The previous implementation only canonicalized namespaces on upsert
operations, which works for recent namespaces because those Raft transactions
are reapplied on upgrade.
But for older namespaces restored from a snapshot, the code path did not
canonicalize them, leaving the scheduler configuration set to `nil`.
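A hedged sketch of the shape of the fix; the `SchedulerConfig` field and default value here are hypothetical stand-ins for the per-namespace defaults, not Nomad's actual types:

```go
package state

// SchedulerConfig and Namespace are illustrative stand-ins only.
type SchedulerConfig struct{ Algorithm string }

type Namespace struct {
	Name            string
	SchedulerConfig *SchedulerConfig
}

func (n *Namespace) Canonicalize() {
	if n.SchedulerConfig == nil {
		n.SchedulerConfig = &SchedulerConfig{Algorithm: "binpack"}
	}
}

// restoreNamespaces shows the point of the change: canonicalize each
// namespace as it is restored from a snapshot, not only when it is upserted.
func restoreNamespaces(restored []*Namespace) {
	for _, ns := range restored {
		ns.Canonicalize()
	}
}
```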
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
https://github.com/container-storage-interface/spec/issues/512)
These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.
This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
makes a destructive CSI Controller RPC, we send the RPC through this serializer
so that only one RPC is sent at a time. Any other RPCs in flight will block (a
minimal sketch follows this list).
* Ensure that requests go to the same controller plugin instance whenever
possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
node, not just ones eligible for scheduling (which CSI controllers don't care
about)
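A minimal sketch of the serializer idea, assuming a per-plugin lock keyed by plugin ID; this is not the actual implementation:

```go
package csi

import "sync"

// controllerSerializer funnels destructive controller RPCs for the same
// plugin through one lock so only one is in flight at a time and any
// others block until it completes.
type controllerSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // keyed by plugin ID (assumption)
}

func (s *controllerSerializer) do(pluginID string, rpc func() error) error {
	s.mu.Lock()
	if s.locks == nil {
		s.locks = make(map[string]*sync.Mutex)
	}
	l, ok := s.locks[pluginID]
	if !ok {
		l = &sync.Mutex{}
		s.locks[pluginID] = l
	}
	s.mu.Unlock()

	l.Lock() // other in-flight RPCs for this plugin wait here
	defer l.Unlock()
	return rpc()
}
```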
Fixes: #15415
* Boot the user off the job if it gets deleted
* de-yoink
* watching the job watcher
* Unload record so history.back has to refire a (failing) request
* Acceptance tests for boot-out and notification
* e2e: add tests for using private registry with podman driver
This PR adds e2e tests that stand up a private docker registry
and have a podman task run a container from an image in that private
registry.
Tests (a sketch of the registry auth file format follows this list):
- user:password set in task config
- auth_soft_fail works for public images when auth is set in driver
- credentials helper is set in driver auth config
- config auth.json file is set in driver auth config
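For reference, a hedged illustration of the docker/podman-style auth file format these tests depend on; the registry host is a made-up example:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

func main() {
	// docker/podman registry auth entries are base64("user:password").
	auth := base64.StdEncoding.EncodeToString([]byte("user:password"))
	cfg := map[string]any{
		"auths": map[string]any{
			"registry.example.local:5000": map[string]string{"auth": auth},
		},
	}
	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out)) // the kind of content written to the auth.json referenced by the driver config
}
```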
* packer: use nomad-driver-podman v0.5.0
* e2e: eliminate unnecessary chmod
* cr: no need to install nomad twice
* cl: no need to install docker twice
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>