Adds new topics to the event stream for CSI volumes and CSI plugins. We'll emit
event when either is created or deleted, and when CSI volumes are claimed.
Add a new topic to the event stream for host volumes. We'll emit events when a
dynamic host volume is registered or deregistered, and whenever a node
fingerprints with a changed volume.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
When we feasibility check a dynamic host volume against a volume request, we
check the attachment mode and access mode. This only ensures that the
capabilities match, but doesn't enforce the semantics of the capabilities
against other claims that may be made on the allocation.
Add support for checking the requested capability against other allocations that
the volume claimed.
Ref: https://github.com/hashicorp/nomad/pull/24479
Instead of a plugin `version` subcommand that responds with a string
(established in #24497), respond to a `fingerprint` command with a data
structure that we may extend in the future (such as plugin capabilities,
like size constraint support?). In the immediate term, it's still just the
version: `{"version": "0.0.1"}`
In addition to leaving the door open for future expansion, I think it will
also avoid false positives detecting executables that just happen to
respond to a `version` command.
This also reverses the ordering of the fingerprint string parts
from `plugins.host_volume.version.mkdir` (which aligned with CNI)
to `plugins.host_volume.mkdir.version` (makes more sense to me)
CSI volumes support multi-node access patterns on the same volume ID, but
dynamic host volumes by nature do not. The underlying volume may actually be
multi-node (ex. NFS), but Nomad is ignorant of this. Remove the CSI-specific
multi-node access modes and instead include the single-node access modes
intended that are currently in the alpha edition of the CSI spec but which are
better suited for DHV.
This PR has been extracted from #24684 to keep reviews manageable.
Ref: https://github.com/hashicorp/nomad/pull/24479
Ref: https://github.com/hashicorp/nomad/pull/24684
Adds a `-type` flag to the `volume init` command that generates an example
volume specification with only those fields relevant to dynamic host
volumes. This changeset also moves the string literals into uses of `go:embed`
Ref: https://github.com/hashicorp/nomad/pull/24479
Static host volumes have a simple readonly toggle, but dynamic host volumes have
a more complex set of capabilities similar to CSI volumes. Update the
feasibility checker to account for these capabilities and volume readiness.
Also fixes a minor bug in the state store where a soft-delete (not yet
implemented) could cause a volume to be marked ready again. This is needed to
support testing the readiness checking in the scheduler.
Ref: https://github.com/hashicorp/nomad/pull/24479
The List Volumes API was originally written for CSI but assumed we'd have future
volume types, dispatched on a query parameter. Dynamic host volumes uses this,
but the resulting code has host volumes concerns comingled in the CSI volumes
endpoint. Refactor this so that we have a top-level `GET /v1/volumes` route that's
shared between CSI and DHV, and have it dispatch to the appropriate handler in
the type-specific endpoints.
Ref: https://github.com/hashicorp/nomad/pull/24479
Node fingerprints include attributes for the host volume plugins, including the
built-in plugins. Add an implicit constraint on this fingerprint during volume
placement to ensure we only place volumes on hosts with the right plugins.
Ref: https://github.com/hashicorp/nomad/pull/24479
store dynamic host volume creations in client state,
so they can be "restored" on agent restart. restore works
by repeating the same Create operation as initial creation,
and expecting the plugin to be idempotent.
this is (potentially) especially important after host restarts,
which may have dropped mount points or such.
When we add a Sentinel scope for dynamic host volumes, having a default `-scope`
value for `sentinel apply` risks accidentally adding policies for volumes to the
job scope. This would immediately prevent any job from being submitted. Forcing
the administrator to pass a `-scope` will prevent accidental misuse.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2087
Ref: https://github.com/hashicorp/nomad/pull/24479
The create/register volume RPCs support a policy override flag for
soft-mandatory Sentinel policies, but the CLI and Go API were missing support
for it.
Also add support for Sentinel warnings to the Go API and CLI.
Ref: https://github.com/hashicorp/nomad/pull/24479
In #24528 we added monitoring to the CLI for dynamic host volume creation. But
when the volume's namespace is set by the volume specification instead of the
`-namespace` flag, the API client doesn't have the right namespace and gets a
404 when setting up the monitoring. The specification always overrides the
`-namespace` flag, so use that when available for all subsequent API calls.
Ref: https://github.com/hashicorp/nomad/pull/24479
Adds dynamic host volumes to argument autocomplete for the `volume status` and
`volume delete` commands. Adds flag autocompletion for those commands plus
`volume create`.
Ref: https://github.com/hashicorp/nomad/pull/24479
Most Nomad upsert RPCs accept a single object with the notable exception of
CSI. But in CSI we don't actually expose this to users except through the Go
API. It deeply complicates how we present errors to users, especially once
Sentinel policy enforcement enters the mix.
Refactor the `HostVolume.Create` and `HostVolume.Register` RPCs to take a single
volume instead of a slice of volumes.
Add a stub function for Enterprise policy enforcement. This requires splitting
out placement from the `createVolume` function so that we can ensure we've
completed placement before trying to enforce policy.
Ref: https://github.com/hashicorp/nomad/pull/24479
Add support for dynamic host volumes to the search endpoint. Like many other
objects with UUID identifiers, we're not supporting fuzzy search here, just
prefix search on the fuzzy search endpoint.
Because the search endpoint only returns IDs, we need to seperate CSI volumes
and host volumes for it to be useful. The new context is called `"host_volumes"`
to disambiguate it from `"volumes"`. In future versions of Nomad we should
consider deprecating the `"volumes"` context in lieu of a `"csi_volumes"`
context.
Ref: https://github.com/hashicorp/nomad/pull/24479
also ensure that volume ID is uuid-shaped so user-provided input
like `id = "../../../"` which is used as part of the target directory
can not find its way very far into the volume submission process
When dynamic host volumes are created, they're written to the state store in a
"pending" state. Once the client fingerprints the volume it's eligible for
scheduling, so we mark the state as ready at that point.
Because the fingerprint could potentially be returned before the RPC handler has
a chance to write to the state store, this changeset adds test coverage to
verify that upserts of pending volumes check the node for a
previously-fingerprinted volume as well.
Ref: https://github.com/hashicorp/nomad/pull/24479
When making a request to create a dynamic host volumes, users can pass a node
pool and constraints instead of a specific node ID.
This changeset implements a node scheduling logic by instantiating a filter by
node pool and constraint checker borrowed from the scheduler package. Because
host volumes with the same name can't land on the same host, we don't need to
support `distinct_hosts`/`distinct_property`; this would be challenging anyways
without building out a much larger node iteration mechanism to keep track of
usage across multiple hosts.
Ref: https://github.com/hashicorp/nomad/pull/24479
* mkdir: HostVolumePluginMkdir: just creates a directory
* example-host-volume: HostVolumePluginExternal:
plugin script that does mkfs and mount loopback
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Add several validation steps in the create/register RPCs for dynamic host
volumes. We first check that submitted volumes are self-consistent (ex. max
capacity is more than min capacity), then that any updates we've made are
valid. And we validate against state: preventing claimed volumes from being
updated and preventing placement requests for nodes that don't exist.
Ref: https://github.com/hashicorp/nomad/issues/15489
The `HostVolumeByID` state store method didn't add a watch channel to the
watchset, which meant that it would never unblock. The tests missed this because
they were racy, so move the updates for unblocking tests into a `time.After`
call to ensure the queries are blocked before the update happens.
This changeset implements the HTTP API endpoints for Dynamic Host Volumes.
The `GET /v1/volumes` endpoint is shared between CSI and DHV with a query
parameter for the type. In the interest of getting some working handlers
available for use in development (and minimizing the size of the diff to
review), this changeset doesn't do any sort of refactoring of how the existing
List Volumes CSI endpoint works. That will come in a later PR, as will the
corresponding `api` package updates we need to support the CLI.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
This changeset implements the RPC handlers for Dynamic Host Volumes, including
the plumbing needed to forward requests to clients. The client-side
implementation is stubbed and will be done under a separate PR.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
This changeset implements the ACLs required for dynamic host volumes RPCs:
* `host-volume-write` is a coarse-grained policy that implies all operations.
* `host-volume-register` is the highest fine-grained privilege because it
potentially bypasses quotas.
* `host-volume-create` is implicitly granted by `host-volume-register`
* `host-volume-delete` is implicitly granted only by `host-volume-write`
* `host-volume-read` is implicitly granted by `policy = "read"`,
These are namespaced operations, so the testing here is predominantly around
parsing and granting of implicit capabilities rather than the well-tested
`AllowNamespaceOperation` method.
This changeset does not include any changes to the `host_volumes` policy which
we'll need for claiming volumes on job submit. That'll be covered in a later PR.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
The parameters used for the reusable action were incorrect since
the 5.0.1 update. The permissions were also incorrect as the
workflow needs to write to issues and PRs.
This change creates a reusable workflow for notifying Slack on CI
failures. The message will include useful links and information
about the failure, so product engineers can investigate and fix
any problems.
The new workflow is used by selected workflows which trigger on
merges to main or release/* branches. The notification is only
sent on failure and when the event was a push (PR merge) meaning
the number of notifications should be minimal.
The aim is to help identify and draw attention to failure across
our release branches, in particular when automated processes
happen.
Nomad sets a default port when resolving server addresses that don't have
one. When we get a "bare" IPv6 address without a port, we end up with an
unexpected error "too many colons in address" when we try to split the address
and host, because the standard library function expects IPv6 addresses to be
wrapped in brackets as recommended by RFC5952. User-configured addresses avoid
this problem by accepting IP address and port as separate configuration values,
but go-discover emits "bare" IPv6 addresses without a port in IPv6 environments.
Fix this by adding brackets to IPv6 addresses when we get the "too many colons"
error from the stdlib. This will still give erroneous results if the address
includes the port but is missing brackets, but there's no way to unambiguously
parse that address.
Ref: https://www.rfc-editor.org/rfc/rfc5952
Fixes: https://github.com/hashicorp/nomad/issues/24608
* Experimenting with a generic meta job-part component
* Taskstate.task gets me every time
* continue-on-error false test
* continue-on-error back in, but explicit success check after exam
* Testfixes for new meta structure on tasks and groups
* Clean up test and dev code