Commit Graph

3974 Commits

Author SHA1 Message Date
tehut
d709accaf5 Add nomad monitor export command (#26178)
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
2025-08-01 10:26:59 -07:00
Gautam Kumar
6f81222ec8 CL: improve acl policy self output for management tokens (#26396)
Improved the acl policy self CLI command to handle both management and client tokens.
Management tokens now display a clear message indicating global access with no individual policies.

Fixes: https://github.com/hashicorp/nomad/issues/26389
2025-08-01 09:02:47 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
James Rasell
2ef837f02f cli: Ensure all no argument console messages are the same. (#26331)
Use a constant to ensure consistency across the CLI when displaying
a console message indicating the command was passed arguments when
it takes none.
2025-07-25 07:05:10 +01:00
Tim Gross
35f3f6ce41 scheduler: add disconnect and reschedule info to reconciler output (#26255)
The `DesiredUpdates` struct that we send to the Read Eval API doesn't include
information about disconnect/reconnect and rescheduling. Annotate the
`DesiredUpdates` with this data, and adjust the `eval status` command to display
only those fields that have non-zero values in order to make the output width
manageable.

Ref: https://hashicorp.atlassian.net/browse/NMD-815
2025-07-16 08:46:38 -04:00
Tim Gross
b23ab5ac15 docs: clarify requirements for deleting volumes (#26240)
If you delete a CSI volume, the volume cannot be currently claimed by an
allocation or in the process of being unpublished. This is documented in the CLI
but not the API. Also, the documentation incorrectly says that the `volume
delete` command silently returns without error if the volume doesn't exist, but
that's incorrect.

Fixes: https://github.com/hashicorp/nomad/issues/24756
2025-07-11 15:01:06 -04:00
hc-github-team-nomad-core
ccba3ae6a2 Generate files for 1.10.3 release 2025-07-08 16:47:39 -07:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
James Rasell
d6757609dc cli: Fix a bug where self token lookups via token CLI flag failed. (#26183)
The meta client looks for both an environment variable and a CLI
flag when generating a client. The CLI UUID checker needs to do
this also, so we account for users using both env vars and CLI
flag tokens.
2025-07-03 13:50:42 +01:00
Chris Roberts
493e7b2faa command: prevent server panic on graceful shutdown (#26171)
When performing a graceful shutdown the client drain configuration
is checked for a deadline which is appended to the timeout. When
running as a server the client will not be set. Attempting to get
the drain deadline will result in a panic. This checks for the
client being available prior to fetching the deadline value.
2025-07-01 15:54:03 -07:00
Allison Larson
63f0788747 Expose Kind field for Consul Service Registrations (#26170)
* consul: Add service kind to jobspec

* consul: Add kind to service docs

* Add changelog
2025-06-30 14:32:23 -07:00
Tim Gross
aa3c08d069 eval status: enrich with related evals and placed allocs tables (#26156)
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API, and having the command query for the
evaluations allocations automatically. Emit this data as a pair of new tables
and expose fields like quota limits, and previous/next/blocked eval without the
`-verbose` flag.

Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-06-30 09:23:36 -04:00
Piotr Kazmierczak
7647491588 cli: fix panic when starting stopped jobs with no scaling policies (#26131)
Restoring scaling policies during the start of a stopped job did not account for
jobs that didn't have any scaling policies, and led to a panic when users tried
to restart such jobs.
2025-06-25 11:19:56 +02:00
James Rasell
7a5f5750b0 test: Wait for client when enabled in test agent if possible. (#26129)
When a test starts an agent and the client is enabled, we can
wait until this reaches the ready state within the set up method.
This mimics what we already do with leadership and the root
keyring and should reduce flakey tests where it assume the client
is ready as soon as the set up function returns, which is not
guaranteed.

The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
2025-06-25 10:00:28 +01:00
James Rasell
216140255d cli: Do not always add global DNS name to certificate DNS names. (#26086)
No matter the passed region identifier, the CLI was always adding
"<role>.global.nomad" to the certificate DNS names. This is not
what we expect and has been removed.

While here, the long deprecated cluster-region flag has been
removed. This removal only impacts CLI functionality, so is safe
to do.
2025-06-25 07:35:56 +01:00
Chris Roberts
4dbf645bf7 command: prevent panic on graceful shutdown (#26018)
When performing a graceful shutdown a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it also is closed within a deferral. If the
agent successfully leaves and closes the channel, a panic will
occur when the channel is closed the second time within the
deferral. To prevent this from occurring, the channel closing
is wrapped within a `OnceFunc` so the channel is only closed
once.
2025-06-12 09:35:57 -07:00
Chris Roberts
eeec603975 command: prevent early exit from graceful shutdown (#26023)
While waiting for the agent to leave during a graceful shutdown
the wait can be interrupted immediately if another signal is
received. It is common that while waiting a `SIGPIPE` is received
from journald causing the wait to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to interrupt
the wait, the signal is checked for a `SIGPIPE` and if matched will
continue waiting.
2025-06-12 08:56:55 -07:00
hc-github-team-nomad-core
1e49d9eb44 Generate files for 1.10.2 release 2025-06-10 14:35:25 -07:00
James Rasell
e95148c10d consul: Fix data race within test by using mutex to read map. (#25977) 2025-06-04 15:09:37 +01:00
Juana De La Cuesta
bdfd573fc4 Update the scaling policies when deregistering a job (#25911)
* func: Update the scaling policies when deregistering a job

* func: Add tests for updating the policy

* docs: add changelog

* func: set back the old order

* style: rearrange for clarity and to reuse the watchset

* func: set the policies to teh last submitted when starting a job

* func: expand tests  of teh start job command to include job submission

* func: Expand the tests to verify the correct state of the scaling policy after job start

* Update command/job_start.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update nomad/fsm_test.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: add warning when there is no previous job submission

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-02 16:11:38 +02:00
Michael Smithhisler
4c8257d0c7 client: add once mode to template block (#25922) 2025-05-28 11:45:11 -04:00
Tim Gross
3f59860254 host volumes: add configuration to GC on node GC (#25903)
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.

Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-27 10:22:08 -04:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Daniel Bennett
15c01e5a49 ipv6: normalize addrs per RFC-5942 §4 (#25921)
https://datatracker.ietf.org/doc/html/rfc5952#section-4

* copy NormalizeAddr func from vault
  * PRs hashicorp/vault#29228 & hashicorp/vault#29517
* normalize bind/advertise addrs
* normalize consul/vault addrs
2025-05-22 14:21:30 -04:00
Chris Roberts
1aa416e2f2 Support applying policy to all jobs within namespace (#25871)
Workflow identities currently support ACL policies being applied
to a job ID within a namespace. With this update an ACL policy
can be applied to a namespace. This results in the ACL policy
being applied to all jobs within the namespace.
2025-05-21 07:44:14 -07:00
Tim Gross
41cf1b03b4 host volumes: -force flag for delete (#25902)
When a node is garbage collected, we leave behind the dynamic host volume in the
state store. We don't want to automatically garbage collect the volumes and risk
data loss, but we should allow these to be removed via the API.

Fixes: https://github.com/hashicorp/nomad/issues/25762
Fixes: https://hashicorp.atlassian.net/browse/NMD-705
2025-05-21 08:55:52 -04:00
Piotr Kazmierczak
cdc308a0eb wi: new endpoint for listing workload attached ACL policies (#25588)
This introduces a new HTTP endpoint (and an associated CLI command) for querying
ACL policies associated with a workload identity. It allows users that want
to learn about the ACL capabilities from within WI-tasks to know what sort of
policies are enabled.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-05-19 19:54:12 +02:00
Martina Santangelo
18eddf53a4 commands: adds job start command to start stopped jobs (#24150)
---------

Co-authored-by: Michael Smithhisler <michael.smithhisler@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-14 15:17:44 -04:00
James Rasell
ef25c3d55a cli: Fix help indentation format on node meta commands. (#25851) 2025-05-14 14:53:48 +01:00
Tim Gross
8a5a057d88 offline license utilization reporting (#25844)
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.

This is the CE portion of the work required for implemention in the Enterprise
product. Nomad CE does not perform utilization reporting.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
2025-05-14 09:51:13 -04:00
hc-github-team-nomad-core
9ef42e9807 Generate files for 1.10.1 release 2025-05-13 14:26:48 +02:00
James Rasell
0b265d2417 encrypter: Track initial tasks for is ready calculation. (#25803)
The server startup could "hang" to the view of an operator if it
had a key that could not be decrypted or replicated loaded from
the FSM at startup.

In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. This bubbling of error message will also
flush to logs which will provide addition operator feedback.

The server only cares about keys loaded from the FSM snapshot and
trailing logs before the encrypter should be classed as ready. So
that the encrypter ready function does not get blocked by keys
added outside of the initial Raft load, we take a snapshot of the
decryption tasks as we enter the blocking call, and class these as
our barrier.
2025-05-07 15:38:16 +01:00
Tim Gross
da592ab1b7 testing: fix vault setup test's reliance on specific Raft index (#25806)
The test for `nomad setup vault` command expects a specific `CreateIndex` for the
job it creates. Any Raft write when a server comes up or establishes leadership
can cause this test to break. Interpolate the expected index as we've done for
other indexes on the job to make this test less brittle.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673#issuecomment-2847619747
2025-05-02 14:30:10 -04:00
Juanadelacuesta
9288a3141a func and docs: Use the config from the client and not from the agent that is already parsed. Add the breaking change to the release notes 2025-04-30 10:53:02 +02:00
Juanadelacuesta
949571e313 func: read the config from the agent, dont reparse 2025-04-24 05:01:53 +02:00
Juanadelacuesta
46343ee56e func: use the client's configured drain deadline to calculate the graceful timeout when terminating an agent 2025-04-23 23:59:50 +02:00
Juanadelacuesta
c91f24681d style: add changelog 2025-04-23 23:28:54 +02:00
Juana De La Cuesta
9778a31e29 Update command/agent/command.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-23 23:18:09 +02:00
Juana De La Cuesta
39b3d63172 Update command/agent/command.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-23 23:18:02 +02:00
Juana De La Cuesta
313f430fdd Update command/agent/command.go
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2025-04-23 23:17:36 +02:00
Juanadelacuesta
b47b962439 fix: use a timer instead of a time.After to avoid memory leaks 2025-04-23 16:26:46 +02:00
Juanadelacuesta
38c27b7e7f fix: make teh disctintion when for os.Interrupt 2025-04-23 16:20:04 +02:00
Juanadelacuesta
adf038b495 fix: correct the logic for LeaveOnTerm or LeaveOnInt depending on the incoming signal 2025-04-23 16:03:12 +02:00
Juanadelacuesta
b375974bc3 style: add comments 2025-04-23 15:47:37 +02:00
Juanadelacuesta
c5c4272aee func: force agent return if there is an error on reload 2025-04-23 15:14:48 +02:00
Piotr Kazmierczak
df3b00bce0 acl: use WhoAmI RPC endpoint in /acl/token/self (#25547)
ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-04-22 17:53:39 +02:00
Daniel Bennett
c46521a80d cli: operator debug: respect NOMAD_REGION env var (#25716)
properly filter out regions other than the one specified
like the -namespace flag does
2025-04-21 17:06:50 -04:00
tehut
b11619010e Add priority flag to Dispatch CLI and API (#25622)
* Add priority flag to Dispatch CLI and DispatchOpts() helper to HTTP API
2025-04-18 13:24:52 -07:00
Arian van Putten
d28af58cbb agent: implement sd-notify reload correctly (#25636)
First of all, we should not send the unix time, but the monotonic time.
Second of all, RELOADING= and MONOTONIC_USEC fields should be sent in
*single* message not two separate messages.

From the man page of [systemd.service](https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Type=)

> notification message via sd_notify(3) that contains the "RELOADING=1" field in
> combination with "MONOTONIC_USEC=" set to the current monotonic time (i.e.
> CLOCK_MONOTONIC in clock_gettime(2)) in μs, formatted as decimal string.

[sd_notify](https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html)
now has code samples of the protocol to clarify.

Without these changes, if you'd set
Type=notify-reload on the agen'ts systemd unit, systemd
would kill the service due to the service not responding to reload
correctly.
2025-04-14 11:38:56 -04:00
Michael Schurter
c5451cf300 Merge pull request #25635 from hashicorp/post-1.10.0-release
Post 1.10.0 release
2025-04-10 10:32:24 -07:00