The group-level fields `stop_after_client_disconnect`,
`max_client_disconnect`, and `prevent_reschedule_on_lost` were deprecated in
Nomad 1.8 and replaced by fields in the `disconnect` block. This change
removes the logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used by the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
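As a minimal sketch of the migration using the Go `api` package (this assumes
the `Vault.Role` field as the replacement, and the role name is a placeholder):

```go
package main

import "github.com/hashicorp/nomad/api"

func main() {
	task := api.NewTask("web", "docker")
	// vault.policies is gone; reference a Vault role instead when not using
	// the default workload identity ("nomad-workloads" is a placeholder name).
	task.Vault = &api.Vault{Role: "nomad-workloads"}
}
```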
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
The `-type` option for `volume status` is a UX papercut because for many
clusters there will be only one sort of volume in use. Update the CLI so that
the default behavior is to query both CSI and dynamic host volumes (DHV).
The behavior differs subtly depending on whether the user provides an ID. If
the user doesn't provide an ID, we query both CSI and DHV and show both tables.
If the user provides an ID, we query DHV first and then CSI, and show only the
matching volume. Because DHV IDs are UUIDs, we can be sure there are no
collisions between the two. We only show errors if both queries return an error.
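A minimal sketch of that lookup order, with hypothetical query helpers standing
in for the real API calls:

```go
package main

import (
	"errors"
	"fmt"
)

// queryDHV and queryCSI are hypothetical stand-ins for the real API calls.
func queryDHV(id string) (string, error) { return "", errors.New("no DHV volume") }
func queryCSI(id string) (string, error) { return "example-csi-volume", nil }

// statusByID queries DHV first, then CSI, and returns an error only when
// both lookups fail.
func statusByID(id string) (string, error) {
	vol, dhvErr := queryDHV(id)
	if dhvErr == nil {
		return vol, nil
	}
	vol, csiErr := queryCSI(id)
	if csiErr == nil {
		return vol, nil
	}
	return "", errors.Join(dhvErr, csiErr)
}

func main() {
	vol, err := statusByID("c9ff8f0d-9713-465c-8b50-24cc7c0a2c7d")
	fmt.Println(vol, err)
}
```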
Fixes: https://hashicorp.atlassian.net/browse/NET-12214
Job authors need to be able to review what capabilities a dynamic host volume or
CSI volume has so that they can set the correct access mode and attachment mode
in their job. Add these to the CLI output of `volume status`.
Ref: https://hashicorp.atlassian.net/browse/NET-12063
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:
* restart: when the tasks of an allocation fail and we try to restart the tasks
in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation
fails before tasks even start), and we need to move
the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the
allocations. These are not failures, so we don't want to propagate the
reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule`
tracker for the allocations on the node (it's not the allocation's "fault",
after all). We don't want to run the `migrate` machinery here either, as we
can't contact the down node. To the scheduler, this is effectively the same as
if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the
replacement is intended to be temporary, so we propagate the reschedule tracker.
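As a sketch (not Nomad's actual code, and with illustrative names) of which of
these behaviors propagate the reschedule tracker to the next allocation:

```go
package main

import "fmt"

type allocEvent int

const (
	eventTaskFailedInPlace allocEvent = iota // restart: retry tasks on the same node
	eventRestartsExhausted                   // reschedule: move to another node after failure
	eventNodeDrain                           // migrate: operator-initiated, not a failure
	eventNodeLost                            // replacement: node down, like bumping group.count
	eventDisconnectReplace                   // temporary replacement while node is disconnected
)

func propagatesRescheduleTracker(e allocEvent) bool {
	switch e {
	case eventRestartsExhausted:
		return true // a genuine failure, so keep counting attempts
	case eventDisconnectReplace:
		return true // the replacement is temporary, so carry the tracker along
	default:
		return false // in-place restarts, drains, and lost nodes don't count
	}
}

func main() {
	fmt.Println(propagatesRescheduleTracker(eventNodeDrain)) // false
}
```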
Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.
Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
In #20165 we fixed a bug where a partially configured `client.template` retry
block would set any unset fields to nil instead of their default values. But
this patch introduced a regression in the default values, so we were now
defaulting to unlimited retries if the retry block was unset. Restore the
correct behavior and add better test coverage in both the config parsing and
the template configuration code.
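A sketch of the intended defaulting semantics, with hypothetical names (the
real values live in the client and consul-template runner configs):

```go
package main

import "fmt"

// RetryConfig mimics a config block whose fields are pointers so that
// "unset" can be distinguished from "explicitly zero".
type RetryConfig struct {
	Attempts *int // 0 means unlimited retries
}

const defaultRetryAttempts = 12 // hypothetical default for illustration

// finalize fills unset fields with defaults instead of leaving them nil, so
// an unset or partially-set retry block no longer implies unlimited retries.
func (r *RetryConfig) finalize() {
	if r.Attempts == nil {
		d := defaultRetryAttempts
		r.Attempts = &d
	}
}

func main() {
	var r RetryConfig // entirely unset block
	r.finalize()
	fmt.Println(*r.Attempts) // 12, not unlimited
}
```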
Ref: https://github.com/hashicorp/nomad/pull/20165
Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565
* Adds Actions to job status command output
* Status documentation updated to show actions, and `formatJobActions` no longer cares about pipe delimiting
The `volume delete` command doesn't allow using a prefix for the volume ID for
either CSI or dynamic host volumes. Use a prefix search and wildcard namespace
as we do for other CLI commands.
Ref: https://hashicorp.atlassian.net/browse/NET-12057
If you create a volume via `volume create/register` and want to update it later,
you need to change the volume spec to add the ID that was returned. This isn't a
very nice UX, so let's add an `-id` argument that allows you to update existing
volumes that have that ID.
Ref: https://hashicorp.atlassian.net/browse/NET-12083
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means it needs to be treated more like our libraries and upgraded to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
* quota spec: if `region_limit.storage.host_volumes` is set, do not require that `variables` also be set, and vice versa.
* subtract from quota usage on volume delete
* stub CE quota subtraction method
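A sketch of the relaxed validation, with illustrative types and field names:

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaStorage mirrors the shape of region_limit.storage; the field names
// and units here are illustrative.
type QuotaStorage struct {
	HostVolumesMB int // region_limit.storage.host_volumes
	VariablesMB   int // region_limit.storage.variables
}

// validate no longer requires that both limits be set together; a zero
// value simply means no limit was configured for that resource.
func (s *QuotaStorage) validate() error {
	if s == nil {
		return nil
	}
	if s.HostVolumesMB < 0 || s.VariablesMB < 0 {
		return errors.New("storage limits must be non-negative")
	}
	return nil
}

func main() {
	fmt.Println((&QuotaStorage{HostVolumesMB: 1000}).validate()) // <nil>
}
```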
When a client restarts but can't restore a volume (e.g. the plugin is now
missing), it's removed from the node fingerprint. So we won't allow future
scheduling of the volume, but we were not updating the volume state field to
report this reasoning to operators. Make debugging easier and the state field
more meaningful by setting the value to "unavailable".
Also, remove the unused "deleted" field. We did not implement soft deletes and
aren't planning on it for Nomad 1.10.0.
Ref: https://hashicorp.atlassian.net/browse/NET-11551
When we implemented CSI, the types of the fields for access mode and attachment
mode on volume requests were defined with a prefix "CSI". This gets confusing
now that we have dynamic host volumes using the same fields. Fortunately the
original was a typedef on string, and the Go API in the `api` package just uses
strings directly, so we can change the name of the type without breaking
backwards compatibility for the msgpack wire format.
Update the names to `VolumeAccessMode` and `VolumeAttachmentMode`. Keep the
CSI- and DHV-specific constant names for the values of these fields (they
aren't currently 1:1), so that we can easily differentiate in a given bit of
code which values are valid.
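A sketch of why the rename is wire-compatible (the constant values shown are
illustrative):

```go
package main

import "fmt"

// VolumeAccessMode replaces the old CSI-prefixed type name. Because it is a
// typedef on string, msgpack encodes it exactly as it did before the rename.
type VolumeAccessMode string

// The CSI- and host-volume-specific constant names are kept so code can
// signal which values are valid in a given context.
const (
	CSIVolumeAccessModeMultiNodeReader   VolumeAccessMode = "multi-node-reader-only"
	HostVolumeAccessModeSingleNodeWriter VolumeAccessMode = "single-node-writer"
)

func main() {
	fmt.Println(CSIVolumeAccessModeMultiNodeReader, HostVolumeAccessModeSingleNodeWriter)
}
```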
Ref: https://github.com/hashicorp/nomad/pull/24881#discussion_r1920702890
The `volume init` command creates example volume specifications, but one of the
example values for `capability.access_mode` is invalid. Correct the example to
match the validation logic.
The Nomad agent used a log filter to ensure logs were written at the expected
level. Since the adoption of hclog this is no longer required, as hclog acts as
the gatekeeper and filter for logging. All log writers accept messages from
hclog, which has already done the filtering.
The agent syslog write handler was unable to handle JSON log lines correctly,
meaning that when using the JSON log format, all syslog entries showed at the
NOTICE level.
This change adds a new handler to the Nomad agent which can parse JSON log
lines and correctly determine the intended log level.
The change also removes the use of a filter from the default log
format handler. This is not needed as the logs are fed into the
syslog handler via hclog, which is responsible for level
filtering.
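A sketch of the level extraction, assuming hclog's JSON field names
(`@level`, `@message`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJSONLevel pulls the level out of an hclog-style JSON log line so a
// syslog handler can map it to the right priority instead of a flat NOTICE.
func parseJSONLevel(line []byte) (string, bool) {
	var entry struct {
		Level string `json:"@level"`
	}
	if err := json.Unmarshal(line, &entry); err != nil || entry.Level == "" {
		return "", false
	}
	return entry.Level, true
}

func main() {
	level, ok := parseJSONLevel([]byte(`{"@level":"debug","@message":"starting"}`))
	fmt.Println(level, ok) // debug true
}
```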
We can reduce the amount of volume specification configuration many users will
need by setting a default capability on a dynamic host volume if none is
set. The default capability will allow using the volume in read/write mode on
its node, with no further restrictions except those that might be set in the
jobspec.
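A sketch of the defaulting, with illustrative types and values:

```go
package main

import "fmt"

// HostVolumeCapability is an illustrative stand-in for the real struct.
type HostVolumeCapability struct {
	AttachmentMode string
	AccessMode     string
}

// withDefaultCapability fills in a read/write-on-its-node capability when
// the volume specification sets none (the values shown are assumptions).
func withDefaultCapability(caps []HostVolumeCapability) []HostVolumeCapability {
	if len(caps) > 0 {
		return caps
	}
	return []HostVolumeCapability{{
		AttachmentMode: "file-system",
		AccessMode:     "single-node-writer",
	}}
}

func main() {
	fmt.Println(withDefaultCapability(nil))
}
```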
In anticipation of having quotas for dynamic host volumes, we want the user
experience of the storage limits to feel integrated with the other resource
limits. This is currently prevented by reusing the `Resources` type instead of
having a specific type for `QuotaResources`.
Update the quota limit/usage types to use a `QuotaResources` that includes a new
storage resources quota block. The wire format for the two types is compatible
such that we can migrate the existing variables limit in the FSM.
Also fix improper parallelism in the quota init test: we change the working
directory to avoid file write conflicts, but this breaks when multiple tests
are executed in the same process.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2096
When we register a volume without a plugin, we need to send a client RPC so that
the node fingerprint can be updated. The registered volume also needs to be
written to client state so that we can restore the fingerprint after a restart.
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
Adds an additional check in the Keyring.Delete RPC to make sure we're not
trying to delete a key that's been used to encrypt a variable. It also adds a
`-force` flag for the CLI/API to sidestep that check.
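A sketch of the in-use check, with illustrative types:

```go
package main

import "fmt"

// Variable is an illustrative stand-in that records which key encrypted it.
type Variable struct {
	Path  string
	KeyID string
}

// checkKeyUnused refuses to delete a key that still encrypts a variable,
// unless force is set (mirroring the new -force flag).
func checkKeyUnused(vars []Variable, keyID string, force bool) error {
	if force {
		return nil
	}
	for _, v := range vars {
		if v.KeyID == keyID {
			return fmt.Errorf("key %s in use by variable %s", keyID, v.Path)
		}
	}
	return nil
}

func main() {
	vars := []Variable{{Path: "nomad/jobs/web", KeyID: "k1"}}
	fmt.Println(checkKeyUnused(vars, "k1", false)) // error
	fmt.Println(checkKeyUnused(vars, "k1", true))  // <nil>
}
```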
Adds a `-type` flag to the `volume init` command that generates an example
volume specification with only those fields relevant to dynamic host
volumes. This changeset also moves the string literals into uses of `go:embed`.
Ref: https://github.com/hashicorp/nomad/pull/24479
The List Volumes API was originally written for CSI but assumed we'd have future
volume types, dispatched on a query parameter. Dynamic host volumes use this,
but the resulting code has host volume concerns commingled with the CSI volumes
endpoint. Refactor this so that we have a top-level `GET /v1/volumes` route that's
shared between CSI and DHV, and have it dispatch to the appropriate handler in
the type-specific endpoints.
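A standalone sketch of the dispatch (the handler names and the empty-type
fallback to CSI are assumptions):

```go
package main

import (
	"fmt"
	"net/http"
)

func csiVolumesList(w http.ResponseWriter, r *http.Request)  { fmt.Fprintln(w, "csi volumes") }
func hostVolumesList(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "host volumes") }

func main() {
	// GET /v1/volumes is shared; the ?type= query parameter selects the
	// type-specific handler.
	http.HandleFunc("/v1/volumes", func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("type") {
		case "host":
			hostVolumesList(w, r)
		case "csi", "":
			csiVolumesList(w, r)
		default:
			http.Error(w, "unknown volume type", http.StatusBadRequest)
		}
	})
	http.ListenAndServe(":8080", nil)
}
```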
Ref: https://github.com/hashicorp/nomad/pull/24479
When we add a Sentinel scope for dynamic host volumes, having a default `-scope`
value for `sentinel apply` risks accidentally adding policies for volumes to the
job scope. This would immediately prevent any job from being submitted. Forcing
the administrator to pass a `-scope` will prevent accidental misuse.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2087
Ref: https://github.com/hashicorp/nomad/pull/24479
The create/register volume RPCs support a policy override flag for
soft-mandatory Sentinel policies, but the CLI and Go API were missing support
for it.
Also add support for Sentinel warnings to the Go API and CLI.
Ref: https://github.com/hashicorp/nomad/pull/24479
In #24528 we added monitoring to the CLI for dynamic host volume creation. But
when the volume's namespace is set by the volume specification instead of the
`-namespace` flag, the API client doesn't have the right namespace and gets a
404 when setting up the monitoring. The specification always overrides the
`-namespace` flag, so use that when available for all subsequent API calls.
Ref: https://github.com/hashicorp/nomad/pull/24479
Adds dynamic host volumes to argument autocomplete for the `volume status` and
`volume delete` commands. Adds flag autocompletion for those commands plus
`volume create`.
Ref: https://github.com/hashicorp/nomad/pull/24479
Most Nomad upsert RPCs accept a single object with the notable exception of
CSI. But in CSI we don't actually expose this to users except through the Go
API. It deeply complicates how we present errors to users, especially once
Sentinel policy enforcement enters the mix.
Refactor the `HostVolume.Create` and `HostVolume.Register` RPCs to take a single
volume instead of a slice of volumes.
Add a stub function for Enterprise policy enforcement. This requires splitting
out placement from the `createVolume` function so that we can ensure we've
completed placement before trying to enforce policy.
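Illustratively (type and field names here are assumptions), the request shape
changes like so:

```go
package main

import "fmt"

type HostVolume struct{ ID, Name string }

// Before: the RPC took a slice, though callers only ever sent one volume.
type hostVolumeCreateRequestOld struct {
	Volumes []*HostVolume
}

// After: a single volume, matching most other Nomad upsert RPCs and
// simplifying how errors (and Sentinel enforcement) are reported.
type hostVolumeCreateRequestNew struct {
	Volume *HostVolume
}

func main() {
	fmt.Println(hostVolumeCreateRequestNew{Volume: &HostVolume{Name: "example"}})
}
```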
Ref: https://github.com/hashicorp/nomad/pull/24479
When making a request to create a dynamic host volume, users can pass a node
pool and constraints instead of a specific node ID.
This changeset implements the node scheduling logic by instantiating a node
pool filter and a constraint checker borrowed from the scheduler package. Because
host volumes with the same name can't land on the same host, we don't need to
support `distinct_hosts`/`distinct_property`; this would be challenging anyway
without building out a much larger node iteration mechanism to keep track of
usage across multiple hosts.
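A sketch of the placement filtering, with illustrative types (the real
constraint checker comes from the scheduler package and supports far more than
equality):

```go
package main

import "fmt"

// Node and Constraint are illustrative stand-ins for the real structs.
type Node struct {
	ID       string
	NodePool string
	Attrs    map[string]string
}

type Constraint struct {
	Attr, Value string // only "=" semantics are sketched here
}

// filterNodes keeps nodes in the requested pool that satisfy every
// constraint, mirroring the filter-then-check flow described above.
func filterNodes(nodes []*Node, pool string, constraints []Constraint) []*Node {
	var out []*Node
nodeLoop:
	for _, n := range nodes {
		if pool != "" && n.NodePool != pool {
			continue
		}
		for _, c := range constraints {
			if n.Attrs[c.Attr] != c.Value {
				continue nodeLoop
			}
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []*Node{
		{ID: "a", NodePool: "default", Attrs: map[string]string{"os": "linux"}},
		{ID: "b", NodePool: "gpu", Attrs: map[string]string{"os": "linux"}},
	}
	fmt.Println(len(filterNodes(nodes, "default", nil))) // 1
	fmt.Println(len(filterNodes(nodes, "gpu", []Constraint{{Attr: "os", Value: "linux"}}))) // 1
}
```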
Ref: https://github.com/hashicorp/nomad/pull/24479