27215 Commits

Author SHA1 Message Date
hc-github-team-nomad-core
26e16febad Prepare for next release 2025-07-08 16:47:39 -07:00
hc-github-team-nomad-core
ccba3ae6a2 Generate files for 1.10.3 release 2025-07-08 16:47:39 -07:00
Hazmei Abdul Rahman
c2d8424e3f fix: website task driver virt link (#26222) 2025-07-08 11:36:55 -05:00
Juana De La Cuesta
3b44090156 Avoid panic during startup with 1.10.2 (#26219)
* fix: initialize the topology of the processors to avoid nil pointers

* func: initialize topology to avoid nil pointers

* fix: update the new public method for NodeProcessorResources
2025-07-08 16:07:14 +02:00
Tim Gross
e13ceab855 host volumes: require allocs to be client terminal to delete vols (#26213)
The RPC handler for deleting dynamic host volumes has a check that any
allocations associated with a volume are client-terminal before deleting the
volume. But the state store delete that happens after we send client RPCs to the
plugin checks that the allocs are non-terminal on both server and client.

This can improperly allow deleting a volume from a client but then not being
able to delete it from the state store because of a time-of-check / time-of-use
bug. If the allocation fails/completes on the client before the server marks its
desired status as terminal, or if the allocation is marked server-terminal
during the client RPC, we can get a volume that passes the first check but not
the second check that happens in the state store and cannot be deleted.

Update the state store delete method to require that any allocation for a volume
is client terminal in order to delete the volume, not just server terminal.

Fixes: https://github.com/hashicorp/nomad/issues/26140
Ref: https://hashicorp.atlassian.net/browse/NMD-883
2025-07-07 14:48:06 -04:00
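The check described in e13ceab855 can be illustrated with a minimal sketch. The `Alloc` type and `checkVolumeDeletable` function below are hypothetical simplifications for illustration, not Nomad's actual state store code. The point is that the delete path uses the same client-terminal predicate as the earlier RPC check, so an allocation that is only server-terminal still blocks deletion.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified stand-in for a Nomad allocation.
type Alloc struct {
	ID             string
	ServerTerminal bool // server has marked the desired status as terminal
	ClientTerminal bool // client reports the alloc as complete, failed, or lost
}

// checkVolumeDeletable returns an error unless every allocation that claims
// the volume is client-terminal. Server-terminal alone is not enough.
func checkVolumeDeletable(allocs []Alloc) error {
	for _, a := range allocs {
		if !a.ClientTerminal {
			return fmt.Errorf("volume still in use by allocation %s", a.ID)
		}
	}
	return nil
}

func main() {
	// The alloc is server-terminal but may still be running on the client.
	allocs := []Alloc{{ID: "a1", ServerTerminal: true, ClientTerminal: false}}
	fmt.Println(checkVolumeDeletable(allocs))
}
```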
Tim Gross
c043d1c850 scheduler: property testing of reconcile reconnecting (#26180)
To help break down the larger property tests we're doing in #26167 and #26172
into more manageable chunks, pull out a property test for just the
`reconcileReconnecting` method. This method helpfully already defines its
important properties, so we can implement those as test assertions.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/hashicorp/nomad/pull/26167
Ref: https://github.com/hashicorp/nomad/pull/26172
2025-07-07 09:40:49 -04:00
Tim Gross
d4ab277154 docs: add missing metrics for Consul service client (#26186)
Nomad agents emit metrics for Consul service and check operations, but these
were not documented. Update the metrics reference table to include these
metrics. Note that the metrics are prefixed `nomad.client` but are present on
all agents, because the server registers itself in Consul as well.
2025-07-07 09:40:32 -04:00
Tim Gross
5c909213ce scheduler: add reconciler annotations to completed evals (#26188)
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.

Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564
2025-07-07 09:40:21 -04:00
Tim Gross
60a953ca00 docs: add upgrade guide note for publish_allocation_metrics (#26187)
In #25870 we fixed a longstanding bug where allocation metrics were being
collected and published even if `telemetry.publish_allocation_metrics` was
disabled (the default). This change is unexpected enough that we should surface
it in the upgrade guide.

Ref: https://github.com/hashicorp/nomad/pull/25870
Ref: https://github.com/hashicorp/nomad/issues/26166
2025-07-07 09:40:00 -04:00
dependabot[bot]
53e2855f47 chore(deps): bump github.com/docker/docker (#26205) 2025-07-07 08:29:23 +00:00
dependabot[bot]
605daee759 chore(deps): bump github.com/docker/cli (#26158) 2025-07-04 11:21:48 +01:00
dependabot[bot]
8e407c7070 chore(deps): bump github.com/docker/docker (#26160) 2025-07-04 10:49:07 +01:00
James Rasell
e158356dd2 client: Remove created directory when mkdir plugin fails to chown. (#26194)
The mkdir plugin creates the directory and then chowns it. In the
event the chown command fails, we should attempt to remove the
directory. Without this, we leave directories on the client in
partial failure situations.
2025-07-04 08:36:36 +01:00
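The create-then-chown cleanup described in e158356dd2 might look roughly like the sketch below. The function names and the injectable `chownFunc` are assumptions made so the failure path can be exercised without root privileges; the real plugin code may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// chownFunc matches the signature of os.Chown so a failing version can be
// injected in the example.
type chownFunc func(path string, uid, gid int) error

// createOwnedDir creates path and chowns it. If the chown fails, it attempts
// to remove the directory so that no partial state is left behind.
func createOwnedDir(path string, uid, gid int, chown chownFunc) error {
	if err := os.Mkdir(path, 0o700); err != nil {
		return fmt.Errorf("mkdir %s: %w", path, err)
	}
	if err := chown(path, uid, gid); err != nil {
		if rmErr := os.Remove(path); rmErr != nil {
			return fmt.Errorf("chown %s: %w (cleanup failed: %v)", path, err, rmErr)
		}
		return fmt.Errorf("chown %s: %w", path, err)
	}
	return nil
}

// dirLeftBehind runs createOwnedDir with a chown that always fails and
// reports whether the directory still exists afterwards.
func dirLeftBehind() bool {
	parent, err := os.MkdirTemp("", "mkdir-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(parent)

	target := filepath.Join(parent, "vol")
	failing := func(string, int, int) error { return errors.New("operation not permitted") }
	_ = createOwnedDir(target, 1000, 1000, failing)

	_, statErr := os.Stat(target)
	return statErr == nil
}

func main() {
	fmt.Println("directory left behind:", dirLeftBehind())
}
```

In real use, `os.Chown` would be passed as the last argument.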
Allison Larson
004fa6132b docs: Fix link in service page documentation (#26174)
* docs: fix link in service page

* docs: correct indentation
2025-07-03 09:42:52 -07:00
dependabot[bot]
6cfef21cce chore(deps): bump go.etcd.io/bbolt from 1.4.1 to 1.4.2 (#26159) 2025-07-03 14:51:13 +01:00
James Rasell
d6757609dc cli: Fix a bug where self token lookups via token CLI flag failed. (#26183)
The meta client looks for both an environment variable and a CLI
flag when generating a client. The CLI UUID checker needs to do
this also, so we account for users using both env vars and CLI
flag tokens.
2025-07-03 13:50:42 +01:00
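A minimal sketch of the lookup described in d6757609dc follows. `resolveToken` is a hypothetical helper, and the precedence shown (CLI flag over the `NOMAD_TOKEN` environment variable) is an assumption for illustration. The idea is only that every code path needing the token consults both sources.

```go
package main

import "fmt"

// resolveToken returns the token from the CLI flag if one was given, and
// otherwise falls back to the NOMAD_TOKEN environment variable. The getenv
// parameter is injected so the sketch does not depend on the real environment.
func resolveToken(flagToken string, getenv func(string) string) string {
	if flagToken != "" {
		return flagToken
	}
	return getenv("NOMAD_TOKEN")
}

func main() {
	// Simulated environment with a token set.
	env := func(key string) string {
		if key == "NOMAD_TOKEN" {
			return "token-from-env"
		}
		return ""
	}
	fmt.Println(resolveToken("", env))
	fmt.Println(resolveToken("token-from-flag", env))
}
```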
dependabot[bot]
ae47231304 chore(deps): bump github.com/klauspost/cpuid/v2 from 2.2.10 to 2.2.11 (#26161) 2025-07-03 13:18:36 +01:00
dependabot[bot]
d73d3a1542 chore(deps): bump github.com/prometheus/common from 0.64.0 to 0.65.0 (#26157) 2025-07-03 11:48:49 +01:00
Chris Roberts
4c66930a6e drainer: respect max parallel setting when draining (#26175)
When draining nodes, allocs are checked for a healthy state and
marked to be drained, with the value in the max parallel setting
determining how many allocs will be migrated. Depending on the
circumstances, however, the max parallel setting may not be
properly respected.

Given a job with max parallel set to one, a group count greater
than one, and allocs on multiple nodes: draining a single node
will result in one alloc being marked to drain. If another
node is immediately drained, the alloc running on the first
node will be seen as "healthy" and another alloc will be
marked to be drained, resulting in two allocs being marked
for migration at the same time. This can lead to issues with
service availability.

To prevent this, allocs can only be marked as healthy when the
alloc has not been marked for migration. This prevents migrating
allocs from being seen as healthy, which results in the max parallel
setting being properly respected.
2025-07-02 12:43:45 -07:00
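The arithmetic behind 4c66930a6e can be sketched as follows. The types and the `numToDrain` helper are hypothetical simplifications, assuming the drainer allows draining only the healthy allocs in excess of `groupCount - maxParallel`. Excluding allocs already marked for migration from the healthy count is what keeps a second node drain from exceeding max parallel.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified allocation for this sketch.
type Alloc struct {
	ID        string
	Healthy   bool
	Migrating bool // already marked for migration by an earlier drain
}

// numToDrain returns how many more allocs may be marked for migration.
// Allocs that are already migrating do not count as healthy.
func numToDrain(allocs []Alloc, groupCount, maxParallel int) int {
	healthy := 0
	for _, a := range allocs {
		if a.Healthy && !a.Migrating {
			healthy++
		}
	}
	// At least this many healthy allocs must remain at all times.
	threshold := groupCount - maxParallel
	if n := healthy - threshold; n > 0 {
		return n
	}
	return 0
}

func main() {
	allocs := []Alloc{
		{ID: "a1", Healthy: true},
		{ID: "a2", Healthy: true},
		{ID: "a3", Healthy: true},
	}
	fmt.Println(numToDrain(allocs, 3, 1)) // first node drain

	allocs[0].Migrating = true            // a1 was marked by the first drain
	fmt.Println(numToDrain(allocs, 3, 1)) // second node drain
}
```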
Chris Roberts
493e7b2faa command: prevent server panic on graceful shutdown (#26171)
When performing a graceful shutdown, the client drain configuration
is checked for a deadline, which is added to the timeout. When
running as a server, the client will not be set, so attempting to get
the drain deadline will result in a panic. This change checks that the
client is available prior to fetching the deadline value.
2025-07-01 15:54:03 -07:00
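The guard described in 493e7b2faa amounts to a nil check before dereferencing the client. The sketch below uses hypothetical `Agent`, `Client`, and `DrainConfig` types rather than Nomad's real ones.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified types for this sketch.
type DrainConfig struct {
	Deadline time.Duration
}

type Client struct {
	Drain *DrainConfig
}

type Agent struct {
	client *Client // nil when the agent runs as a server only
}

// shutdownTimeout adds the client's drain deadline to the base timeout, but
// only when a client (and a drain configuration) is actually present.
func shutdownTimeout(a *Agent, base time.Duration) time.Duration {
	if a.client == nil || a.client.Drain == nil {
		return base
	}
	return base + a.client.Drain.Deadline
}

func main() {
	serverOnly := &Agent{}
	withClient := &Agent{client: &Client{Drain: &DrainConfig{Deadline: 30 * time.Second}}}

	fmt.Println(shutdownTimeout(serverOnly, 5*time.Second))
	fmt.Println(shutdownTimeout(withClient, 5*time.Second))
}
```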
Chris Roberts
362690ddd1 client: suppress kill task event on completed tasks (#26075)
The `killTasks` function will kill all of the alloc runner's
task runners. If the task of a task runner has already
completed, the killing of the task runner can cause
confusion due to the task event showing that the task
was signaled even though it is already complete.

To prevent this, a check is done when creating the
task event to determine if the task has completed. If
it has, no task event is created, and when the task
runner is killed no extra task event is added.
2025-07-01 13:30:52 -07:00
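The behavior described in 362690ddd1 can be sketched with a hypothetical, much-simplified `TaskRunner`. Only the `killTasks` name comes from the commit message; everything else is an assumption for illustration.

```go
package main

import "fmt"

// TaskRunner is a hypothetical, simplified task runner.
type TaskRunner struct {
	Name     string
	Finished bool     // the task has already completed
	Events   []string // task events shown to the user
}

// Kill records a kill event only if the task is still running, so completed
// tasks do not appear to have been signaled.
func (tr *TaskRunner) Kill(reason string) {
	if tr.Finished {
		return
	}
	tr.Events = append(tr.Events, "Killing: "+reason)
	tr.Finished = true
}

// killTasks kills every task runner that belongs to the alloc runner.
func killTasks(runners []*TaskRunner, reason string) {
	for _, tr := range runners {
		tr.Kill(reason)
	}
}

func main() {
	done := &TaskRunner{Name: "init", Finished: true}
	running := &TaskRunner{Name: "web"}
	killTasks([]*TaskRunner{done, running}, "alloc stopped")
	fmt.Println(len(done.Events), len(running.Events))
}
```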
Tim Gross
9a29df2292 scheduler: emit structured logs from reconciliation (#26169)
Both the cluster reconciler and node reconciler emit a debug-level log line with
their results, but these are unstructured multi-line logs that are annoying for
operators to parse. Change these to emit structured key-value pairs like we do
everywhere else.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-07-01 10:37:44 -04:00
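To illustrate the structured key-value logging described in 9a29df2292, here is a small sketch that uses Go's standard `log/slog` package as a stand-in for Nomad's own logger. The `Results` type and its field names are hypothetical. The timestamp is dropped only to keep the output stable.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// Results is a hypothetical summary of what a reconciler decided.
type Results struct {
	Place, Stop, Migrate int
}

// renderResults emits one debug-level line with key-value pairs instead of a
// free-form multi-line message, and returns it as a string.
func renderResults(r Results) string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		Level: slog.LevelDebug,
		// Drop the top-level time attribute so the output is deterministic.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}
	logger := slog.New(slog.NewTextHandler(&buf, opts))
	logger.Debug("reconciler results",
		"place", r.Place, "stop", r.Stop, "migrate", r.Migrate)
	return buf.String()
}

func main() {
	fmt.Print(renderResults(Results{Place: 2, Stop: 1}))
}
```

A single line of key-value pairs like this can be filtered with ordinary log tooling, which is the operator benefit the commit message points to.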
Piotr Kazmierczak
36e7148247 scheduler: doc.go files for new packages (#26177) 2025-07-01 16:28:33 +02:00
Allison Larson
63f0788747 Expose Kind field for Consul Service Registrations (#26170)
* consul: Add service kind to jobspec

* consul: Add kind to service docs

* Add changelog
2025-06-30 14:32:23 -07:00
Tim Gross
aa3c08d069 eval status: enrich with related evals and placed allocs tables (#26156)
When debugging an evaluation, you almost always want to know about all the
related evaluations and what allocations were placed by that evaluation (and
where), not just failed placements. We can enrich the command by adding the
`related` query parameter to the API, and having the command query for the
evaluation's allocations automatically. Emit this data as a pair of new tables
and expose fields like quota limits, and previous/next/blocked eval without the
`-verbose` flag.

Update the docs to include the full output and remove references to long-removed
behavior of the `-json` flag.

Ref: https://hashicorp.atlassian.net/browse/NMD-818
Ref: https://go.hashi.co/rfc/nmd-212
2025-06-30 09:23:36 -04:00
Piotr Kazmierczak
0c2fcb3e30 docs: explicitly list all schedulers enabled by default (#26150)
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-26 17:37:26 +02:00
Tim Gross
ec8250ed30 property test generation for reconciler (#26142)
As part of ongoing work to make the scheduler more legible and more robustly
tested, we're implementing property testing of at least the reconciler. This
changeset provides some infrastructure we'll need for generating the test cases
using `pgregory.net/rapid`, without building out any of the property assertions
yet (that'll be in upcoming PRs over the next couple weeks).

The alloc reconciler generator produces a job, a previous version of the job, a
set of tainted nodes, and a set of existing allocations. The node reconciler
generator produces a job, a set of nodes, and allocations on those
nodes. Reconnecting allocs are not yet well-covered by these generators, and
with ~40 dimensions covered so far we may need to pull those out to their own
tests in order to get good coverage.

Note the scenarios only randomize fields of interest; fields like the job name
that don't impact the reconciler would use up available shrink cycles on failed
tests without actually reducing the scope of the scenario.

Ref: https://hashicorp.atlassian.net/browse/NMD-814
Ref: https://github.com/flyingmutant/rapid
2025-06-26 11:09:53 -04:00
Juana De La Cuesta
0a84587c65 Add the Datadog rate limiter to the autoscaler docs (#26130)
* func: add documentation for the Datadog rate limiter

* Update datadog.mdx

* Update website/content/tools/autoscaling/plugins/apm/datadog.mdx

Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>

* Update website/content/tools/autoscaling/plugins/apm/datadog.mdx

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-26 12:51:12 +02:00
Mattias Fjellström
8e6b2e1b63 docs: adding note on azure msi for server join (#26141) 2025-06-26 10:29:06 +02:00
Elijah Wright
f76d9e0cec jobspec: define DiffID for Constraint and Affinity (#26134) 2025-06-25 17:42:25 +02:00
Piotr Kazmierczak
7647491588 cli: fix panic when starting stopped jobs with no scaling policies (#26131)
Restoring scaling policies during the start of a stopped job did not account for
jobs that didn't have any scaling policies, and led to a panic when users tried
to restart such jobs.
2025-06-25 11:19:56 +02:00
James Rasell
7a5f5750b0 test: Wait for client when enabled in test agent if possible. (#26129)
When a test starts an agent and the client is enabled, we can
wait until this reaches the ready state within the set up method.
This mimics what we already do with leadership and the root
keyring and should reduce flaky tests that assume the client
is ready as soon as the set up function returns, which is not
guaranteed.

The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
2025-06-25 10:00:28 +01:00
James Rasell
30b5e91f3c test: Fix TLS reload tests. (#26135)
The tests ran fine in CI but were done before #26107 was raised
and merged. This then altered the test behavior on merge to the
main branch.
2025-06-25 09:15:14 +01:00
James Rasell
216140255d cli: Do not always add global DNS name to certificate DNS names. (#26086)
No matter the passed region identifier, the CLI was always adding
"<role>.global.nomad" to the certificate DNS names. This is not
what we expect and has been removed.

While here, the long deprecated cluster-region flag has been
removed. This removal only impacts CLI functionality, so is safe
to do.
2025-06-25 07:35:56 +01:00
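The fix in 216140255d can be pictured with the sketch below. `certDNSNames` is a hypothetical helper, not the CLI's actual function. It assumes the role-based name follows the `<role>.<region>.nomad` pattern quoted in the commit message and is built only from the region that was passed in.

```go
package main

import "fmt"

// certDNSNames builds the DNS names for a certificate from the given role
// and region, followed by any additional names supplied by the caller.
// It does not add a "<role>.global.nomad" entry unless region is "global".
func certDNSNames(role, region string, additional []string) []string {
	names := []string{fmt.Sprintf("%s.%s.nomad", role, region)}
	return append(names, additional...)
}

func main() {
	fmt.Println(certDNSNames("server", "eu1", []string{"localhost"}))
}
```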
Piotr Kazmierczak
27da75044e scheduler: move tests that depend on calling schedulers into integration package (#26037) 2025-06-24 09:31:10 +02:00
James Rasell
a3e096b0c9 tls: Reset server TLS authenticator when TLS config reloaded. (#26107)
The Nomad server uses an authenticator backend for RPC handling
which includes TLS verification. This verification setting is
configured based on the server's TLS configuration object and is
built when a new server is constructed.

The bug occurs when a server's TLS configuration is reloaded, which
can change the desired TLS verification handling. In this case,
the authenticator is not updated, meaning the RPC mTLS verification
is not modified, even if the configuration indicates it should.

This change adds a new function on the authenticator to allow
updating its TLS verification rule. This new function is called
when a server's TLS configuration is reloaded.
2025-06-24 08:30:15 +01:00
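A minimal sketch of the pattern described in a3e096b0c9 follows: the TLS verification setting is stored so that it can be swapped safely at reload time instead of being fixed at construction. The `Authenticator` type and its method names here are hypothetical, not Nomad's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Authenticator is a hypothetical, simplified RPC authenticator.
type Authenticator struct {
	verifyTLS atomic.Bool // whether peers must present a verified certificate
}

// NewAuthenticator builds an authenticator from the initial TLS setting.
func NewAuthenticator(verify bool) *Authenticator {
	a := &Authenticator{}
	a.verifyTLS.Store(verify)
	return a
}

// SetVerifyTLS updates the verification rule; it is meant to be called
// whenever the TLS configuration is reloaded.
func (a *Authenticator) SetVerifyTLS(verify bool) {
	a.verifyTLS.Store(verify)
}

// Authenticate rejects unverified peers when verification is enabled.
func (a *Authenticator) Authenticate(peerCertVerified bool) error {
	if a.verifyTLS.Load() && !peerCertVerified {
		return errors.New("mTLS verification required")
	}
	return nil
}

func main() {
	auth := NewAuthenticator(false)
	fmt.Println(auth.Authenticate(false)) // verification disabled

	auth.SetVerifyTLS(true) // simulated TLS config reload
	fmt.Println(auth.Authenticate(false))
}
```

An atomic value is used here because a reload can happen while RPCs are being handled concurrently.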
dependabot[bot]
9cbadf3e34 chore(deps): bump google.golang.org/grpc from 1.72.2 to 1.73.0 (#26102)
---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.73.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-23 21:06:14 +02:00
Paweł Bęza
1e328e8341 Docs: fix indentation in job annotations description for /v1/job/:job_id/plan response (#26115) 2025-06-23 13:16:35 -05:00
Daniel Bennett
949b23602c e2e: ui: bump playwright version (#26119) 2025-06-23 13:31:11 -04:00
dependabot[bot]
cda267814f chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 (#26101)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.38.0 to 0.39.0.
- [Commits](https://github.com/golang/crypto/compare/v0.38.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-23 17:51:26 +02:00
dependabot[bot]
13e32429b2 chore(deps): bump github.com/aws/aws-sdk-go-v2/config (#26098)
Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.29.16 to 1.29.17.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Changelog](https://github.com/aws/aws-sdk-go-v2/blob/main/changelog-template.json)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.29.16...config/v1.29.17)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.29.17
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-23 17:39:57 +02:00
Piotr Kazmierczak
05c3b5050c ci: align CE build command with ENT (#26108)
In hashicorp/nomad-enterprise#2592 we introduced a
divergence in how Nomad CE and ENT build their binaries. Nomad CE used a more
sophisticated approach, setting uid, gid and home environment variables in the
docker run command. Despite my (and others') best efforts, we were not able
to do the same in the ENT repo, which relies on special git settings that allow
it to pull dependencies from private repositories, and left a different docker
run command there, that just inherited GHA runner user and copied the resulting
tarball instead of moving it. #26090 then attempted to remedy #25910, which
resulted from the docker run command ignoring ${{ env.GO_TAGS }} when run with
a custom --env, but the resulting backport broke ent builds.

This PR restores ENT behavior of building Nomad builds with GHA runner user,
thus inheriting runner's environment on ent.
2025-06-23 17:13:22 +02:00
Tim Gross
74389cc306 update Vault API dependency and pin HCL dependencies (#26089)
For reasons of backwards compatibility, Nomad uses an older branch of
HCL1 (`v1.0.1-nomad`) and HCL2 (`v2.20.2-nomad-1`) and backports a limited set
of changes to those branches.

But the Vault API also has their own HCL1 branch, currently tagged as
`v1.0.1-vault-7`. Normally this isn't a problem because Nomad pins to our own
branch and we don't call any of the Vault API package's HCL code anyways. But in
Vault's branch some functions were changed that break our build unless we
backport them.

We've backported enough of Vault's changes to make our HCL1 branch build, and
now have tags on the HCL repo so that we can pin to specific tags instead of
random commits.

Fixes: https://hashicorp.atlassian.net/browse/NMD-850
Fixes: https://github.com/hashicorp/nomad/pull/26006
Ref: https://github.com/hashicorp/hcl/pull/760
2025-06-23 10:02:12 -04:00
Piotr Kazmierczak
12ddb6db94 scheduler: capture reconciler state in ReconcilerState object (#26088)
This changeset separates reconciler fields into their own sub-struct to make
testing easier and the code more explicit about what fields relate to which
state.
2025-06-23 15:36:39 +02:00
Mattias Fjellström
e2a30df14c docs: clarified azure cloud join requirements (#26091) 2025-06-23 08:34:56 -05:00
Piotr Kazmierczak
8f98dca8f8 ci: docker GO_TAGS must be quoted (#26105)
ent builds use multiple tags
2025-06-23 10:14:47 +02:00
James Rasell
d1f77a48ab rpc: Use client only auth for node get client allocs endpoint. (#26084)
The RPC is only ever called from a Nomad client which means we
can move it away from the generic Authenticate function to the
tighter AuthenticateClientOnly one. An additional check to ensure
the ACL object allows client operations is performed, mimicking
other endpoints of this nature.
2025-06-23 07:44:32 +01:00
Aimee Ukasick
cdde082362 Docs bug: Fix broken link on concepts/job.mdx (#26093) 2025-06-20 17:16:33 -05:00
Allison Larson
732a671da6 ci: pass go_tags to linux docker builder (#26090) 2025-06-20 11:54:50 -07:00
Piotr Kazmierczak
1030760d3f scheduler: adjust method comments and names to reflect recent refactoring (#26085)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-20 17:23:31 +02:00