In 1.6.0 we shipped the ability to review the original HCL in the web UI, but
didn't follow-up with an equivalent in the command line. Add a `-hcl` flag to
the `job inspect` command.
Closes: https://github.com/hashicorp/nomad/issues/6778
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.
Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.
This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:
* Periodic root key rotation would never happen because the default
`root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
time table. We now compare the `CreateTime` against the wall clock time instead
of the time table. (We expect to remove the time table in future work, ref
https://github.com/hashicorp/nomad/issues/16359)
* Root key garbage collection could GC keys that were used to sign
identities. We now wait until `root_key_rotation_threshold` +
`root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
the rekey was complete.
Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: https://github.com/hashicorp/nomad/issues/19669
Fixes: https://github.com/hashicorp/nomad/issues/23528
Fixes: https://github.com/hashicorp/nomad/issues/19368
The RPC handler for scaling a job passes flags to enforce the job modify index
is unchanged when it makes the write to Raft. But its only checking against the
existing job modify index at the time the RPC handler snapshots the state store,
so it can only enforce consistency for its own validation.
In clusters with automated scaling, it would be useful to expose the enforce
index options to the API, so that cluster admins can enforce that scaling only
happens when the job state is consistent with a state they've previously seen in
other API calls. Add this option to the CLI and API and have the RPC handler
check them if asked.
Fixes: https://github.com/hashicorp/nomad/issues/23444
Our documentation for the `node drain` command doesn't include a treatment of
batch jobs, which are not migrated. The user is left to piece this behavior
together from the `migrate` documentation and the tutorial. Instead, let's
explicitly list the behaviors per job type.
Fixes: https://github.com/hashicorp/nomad/issues/17563
When setting up auth methods for Consul and Vault in production environments, we
can typically assume that the CA certificate for the JWKS endpoint will be in
the host certificate store (as part of the usual configuration management
cluster admins needs to do). But for quick demos with `-dev` agents, this won't
be the case.
Add a `-jwks-ca-file` parameter to the setup commands so that we can use this
tool to quickly setup WI with `-dev` agents running TLS.
The `nomad operator debug` command saves a CPU profile for each interval, and
names these files based on the interval.
The same functions takes a goroutine profile, heap profile, etc. but is missing
the logic to interpolate the file name with the interval. This results in the
operator debug command making potentially many expensive profile requests, and
then overwriting the data. Update the command to save every profile it scrapes,
and number them similarly to the existing CPU profile.
Additionally, the command flags for `-pprof-interval` and `-pprof-duration` were
validated backwards, which meant that we always coerced the `-pprof-interval` to
be the same as the `-pprof-duration`, which always resulted in a single profile
being taken at the start of the bundle. Correct the check as well as change the
defaults to be more sensible.
Fixes: https://github.com/hashicorp/nomad/issues/20151
Our documentation has a hidden assumption that users know that federation
replication requires ACLs to be enabled and bootstrapped. Add notes at some of
the places users are likely to look for it.
A separate follow-up PR to the federation tutorial should point to the ACL
multi-region tutorial as well.
Fixes: https://github.com/hashicorp/nomad/issues/20128
The new `nomad setup vault -check` commmand can be used to retrieve
information about the changes required before a cluster is migrated from
the deprecated legacy authentication flow with Vault to use only
workload identities.
The `operator snapshot` commands and agent don't back up Nomad's key
material. Add some warnings about this to places where users might be looking
for information on cluster recovery.
Fixes: https://github.com/hashicorp/nomad/issues/19389
The `-dev-consul` and `-dev-vault` flags add default identities and
configuration to the Nomad agent to connect and use the workload
identity integration with Consul and Vault.
* API command and jobspec docs
* PR comments addressed
* API docs for job/jobid/action socket
* Removing a perhaps incorrect origin of job_id across the jobs api doc
* PR comments addressed
The `-reschedule` flag stops allocations and assumes the Nomad scheduler
will create new allocations to replace them. But this is only true for
service and batch jobs.
Restarting non-service jobs with the `-reschedule` flag causes the
command to loop forever waiting for the allocations to be replaced,
which never happens.
Allocations for system jobs may be replaced by triggering an evaluation
after each stop to cause the reconciler to run again.
Sysbatch jobs should not be allowed to be rescheduled as they are never
replaced by the scheduler.
With a default value set to `client`, the `nomad acl token update`
command can silently downgrade a management token to client on update if
the command does not specify `-type=management` on every update.
It includes the work over the state store, the PRC server, the HTTP server, the go API package and the CLI's command. To read more on the actuall functionality, refer to the RFCs [NMD-178] Locking with Nomad Variables and [NMD-179] Leader election using locking mechanism for the Autoscaler.
This feature will help operator to remove a failed/left node from Serf layer immediately
without waiting for 24 hours for the node to be reaped
* Update CLI with prune flag
* Update API /v1/agent/force-leave with prune query string parameter
* Update CLI and API doc
* Add unit test
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).
Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.
Fixes: #12097