Commit Graph

149 Commits

Author SHA1 Message Date
Tim Gross
ac86225e09 metrics: reduce heap usage of eval broker metrics (#26737)
The metrics on the eval broker include labels for the job ID, but under a high
volume of dispatch workloads, this results in excessive heap usage on the
leader. Dispatch workloads should use their parent ID rather than their child ID
for any metrics we collect.

Also, eliminate an extra copy of the labels. And remove the extremely high
cardinality `"eval_id"` label from the `nomad.broker.eval_waiting` metric.

Fixes: https://github.com/hashicorp/nomad/issues/26657
2025-09-12 08:29:46 -04:00
Aimee Ukasick
bb7114e518 Docs Chore: Add release notes for 1.10.1-1.10.3 (#26593)
* add 1.10.3

* add 1.10.2

* Add 1.10.1 release notes; add partials to share

* address feedback
2025-08-25 09:38:15 -05:00
Tim Gross
279775082c sysbatch: correctly validate that reschedule policy is not allowed (#26279)
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.

Validate that sysbatch jobs don't have a reschedule block.
2025-07-15 10:47:02 -04:00
Aimee Ukasick
53b083b8c5 Docs: Nomad IA (#26063)
* Move commands from docs to its own root-level directory

* temporarily use modified dev-portal branch with nomad ia changes

* explicitly clone nomad ia exp branch

* retrigger build, fixed dev-portal broken build

* architecture, concepts and get started individual pages

* fix get started section destinations

* reference section

* update repo comment in website-build.sh to show branch

* docs nav file update capitalization

* update capitalization to force deploy

* remove nomad-vs-kubernetes dir; move content to what is nomad pg

* job section

* Nomad operations category, deploy section

* operations category, govern section

* operations - manage

* operations/scale; concepts scheduling fix

* networking

* monitor

* secure section

* remote auth-methods folder and move up pages to sso; linkcheck

* Fix install2deploy redirects

* fix architecture redirects

* Job section: Add missing section index pages

* Add section index pages so breadcrumbs build correctly

* concepts/index fix front matter indentation

* move task driver plugin config to new deploy section

* Finish adding full URL to tutorials links in nav

* change SSO to Authentication in nav and file system

* Docs NomadIA: Move tutorials into NomadIA branch (#26132)

* Move governance and policy from tutorials to docs

* Move tutorials content to job-declare section

* run jobs section

* stateful workloads

* advanced job scheduling

* deploy section

* manage section

* monitor section

* secure/acl and secure/authorization

* fix example that contains an unseal key in real format

* remove images from sso-vault

* secure/traffic

* secure/workload-identities

* vault-acl change unseal key and root token in command output sample

* remove lines from sample output

* fix front matter

* move nomad pack tutorials to tools

* search/replace /nomad/tutorials links

* update acl overview with content from deleted architecture/acl

* fix spelling mistake

* linkcheck - fix broken links

* fix link to Nomad variables tutorial

* fix link to Prometheus tutorial

* move who uses Nomad to use cases page; move spec/config shortcuts

add dividers

* Move Consul out of Integrations; move namespaces to govern

* move integrations/vault to secure/vault; delete integrations

* move ref arch to docs; rename Deploy Nomad back to Install Nomad

* address feedback

* linkcheck fixes

* Fixed raw_exec redirect

* add info from /nomad/tutorials/manage-jobs/jobs

* update page content with newer tutorial

* link updates for architecture sub-folders

* Add redirects for removed section index pages. Fix links.

* fix broken links from linkcheck

* Revert to use dev-portal main branch instead of nomadIA branch

* build workaround: add intro-nav-data.json with single entry

* fix content-check error

* add intro directory to get around Vercel build error

* workound for emtpry directory

* remove mdx from /intro/ to fix content-check and git snafu

* Add intro index.mdx so Vercel build should work

---------

Co-authored-by: Tu Nguyen <im2nguyen@gmail.com>
2025-07-08 19:24:52 -05:00
Tim Gross
60a953ca00 docs: add upgrade guide note for publish_allocation_metrics (#26187)
In #25870 we fixed a longstanding bug where allocation metrics were being
collected and published even if `telemetry.publish_allocation_metrics` was
disabled (the default). This change is unexpected enough that we should surface
it in the upgrade guide.

Ref: https://github.com/hashicorp/nomad/pull/25870
Ref: https://github.com/hashicorp/nomad/issues/26166
2025-07-07 09:40:00 -04:00
Piotr Kazmierczak
5dd880ad61 docs: upgrade guide entry for /v1/acl/token/self changes (#25940)
During #25547 and #25588 work, incorrect response codes from
/v1/acl/token/self were changed, but we did not make a note about this in the
upgrade guide.
2025-05-28 16:22:37 +02:00
James Rasell
0b265d2417 encrypter: Track initial tasks for is ready calculation. (#25803)
The server startup could "hang" to the view of an operator if it
had a key that could not be decrypted or replicated loaded from
the FSM at startup.

In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. This bubbling of error message will also
flush to logs which will provide addition operator feedback.

The server only cares about keys loaded from the FSM snapshot and
trailing logs before the encrypter should be classed as ready. So
that the encrypter ready function does not get blocked by keys
added outside of the initial Raft load, we take a snapshot of the
decryption tasks as we enter the blocking call, and class these as
our barrier.
2025-05-07 15:38:16 +01:00
Juana De La Cuesta
dcaa96f0e5 Update website/content/docs/upgrade/upgrade-specific.mdx
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-04-30 15:03:49 +02:00
Juana De La Cuesta
e8fb36f4d3 Style: typo 2025-04-30 13:01:57 +02:00
Juanadelacuesta
9288a3141a func and docs: Use the config from the client and not from the agent that is already parsed. Add the breaking change to the release notes 2025-04-30 10:53:02 +02:00
Aimee Ukasick
d293684d3d Update rel notes, upgrade links to point to correct previous ver (#25652) 2025-04-11 10:22:23 -05:00
Tim Gross
27caae2b2a api: make attempting to remove peer by address a no-op (#25599)
In Nomad 1.4.0 we removed support for Raft Protocol v2 entirely. But the
`Operator.RemoveRaftPeerByAddress` RPC handler was left in place, along with its
supporting HTTP API and command line flags. Using this API will always result in
the Raft library error "operation not supported with current protocol version".

Unfortunately it's still possible in unit tests to exercise this code path, and
these tests are quite flaky. This changeset turns the RPC handler and HTTP API
into a no-op, removes the associated command line flags, and removes the flaky
tests. I've also cleaned up the test for `RemoveRaftPeerByID` to consolidate
test servers and use `shoenig/test`.

Fixes: https://hashicorp.atlassian.net/browse/NET-12413
Ref: https://github.com/hashicorp/nomad/pull/13467
Ref: https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#raft-protocol-version-2-unsupported
Ref: https://github.com/hashicorp/nomad-enterprise/actions/runs/13201513025/job/36855234398?pr=2302
2025-04-10 09:19:25 -04:00
Aimee Ukasick
87aabc9af2 Docs: 1.10 release notes, some factoring, sentinel apply update (#25433)
* Docs: 1.10 release notes and upgrade factoring

* Update based on code review suggestions

* add CLI for disabling UI URL hints

* fix indentation

* nav: list release notes in reverse order

fix broken link to v1.6.x docs

* Update PKCE section from Daniel's latest PR

* update pkce per daniel's suggestion

* Add dynamic host volumes governance section from blog
2025-04-09 15:43:58 -07:00
Daniel Bennett
6a0c4f5a3d auth: oidc: enable pkce only on new auth methods (#25593)
trying not to violate the principle of least astonishment.

we want to only auto-enable PKCE on *new* auth methods,
rather than *new or updated* auth methods, to avoid a
scenario where a Nomad admin updates an auth method
sometime in the future -- something innocent like a new
client secret -- and their OIDC provider doesn't like PKCE.

the main concern is that the provider won't like PKCE
in a totally confusing way. error messages rarely
say PKCE directly, so why the user's auth method
suddenly broke would be a big mystery.

this means that to enable it on existing auth methods,
you would set `OIDCDisablePKCE = false`, and the double-
negative doesn't feel right, so instead, swap the language,
so enabling it on *existing* methods reads sensibly, and to
disable it on *new* methods reads ok-enough:
`OIDCEnablePKCE = false`
2025-04-03 10:56:17 -05:00
Aimee Ukasick
9778fa4912 Docs: Fix broken links in main for 1.10 release (#25540)
* Docs: Fix broken links in main for 1.10 release

* Implement Tim's suggestions

* Remove link to Portworx from ecosystem page

* remove "Portworx" since Portworx 3.2 no longer supports Nomad
2025-04-01 09:09:44 -05:00
Daniel Bennett
8c609ad762 docs: oidc client assertions and pkce (#25375) 2025-03-20 09:14:17 -05:00
James Rasell
c53ba3e7d1 consul: Remove implicit workload identity when task has a template. (#25298)
When a task included a template block, Nomad was adding a Consul
identity by default which allowed the template to use Consul API
template functions even when they were not needed or desired.

This change removes the implict addition of Consul identities to
tasks when they include a template block. Job specification
authors will now need to add a Consul identity or Consul block to
their task if they have a template which uses Consul API functions.

This change also removes the default addition of a Consul block to
all task groups registered and processed by the API package.
2025-03-10 13:49:50 +00:00
Michael Smithhisler
5c4d0e923d consul: Remove legacy token based authentication workflow (#25217) 2025-03-05 15:38:11 -05:00
Michael Smithhisler
f2b761f17c disconnected: removes deprecated disconnect fields (#25284)
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-05 14:46:02 -05:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Paweł Bęza
43885f6854 Allow for in-place update when affinity or spread was changed (#25109)
Similarly to #6732 it removes checking affinity and spread for inplace update.
Both affinity and spread should be as soft preference for Nomad scheduler rather than strict constraint. Therefore modifying them should not trigger job reallocation.

Fixes #25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-14 14:33:18 -05:00
stswidwinski
871585ee90 18529 nomad executes any file in plugins (#18530)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2025-02-10 16:08:22 +00:00
Aimee Ukasick
d9bb241b43 Docs SEO: Update runtime, networking, Nomad vs K8s, Nomad Enterprise, upgrading, release notes, and sectionless pages (#24764)
* Docs SEO: Updates

CE-781,782,785,788

* CE-791 single pages

* CE-786 enterprise section

* CE-789 release notes

* fix content-check error

* Update description and add intro body paragraph when appropriate

* fix typo

* Apply suggestions from Jeff's code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2025-02-03 10:03:36 -06:00
Michael Smithhisler
47c14ddf28 remove remote task execution code (#24909) 2025-01-29 08:08:34 -05:00
Tim Gross
3a11a0b1e1 quotas: refactor storage limit specification (#24785)
In anticipation of having quotas for dynamic host volumes, we want the user
experience of the storage limits to feel integrated with the other resource
limits. This is currently prevented by reusing the `Resources` type instead of
having a specific type for `QuotaResources`.

Update the quota limit/usage types to use a `QuotaResources` that includes a new
storage resources quota block. The wire format for the two types are compatible
such that we can migrate the existing variables limit in the FSM.

Also fixes improper parallelism in the quota init test where we change working
directory to avoid file write conflicts but this breaks when multiple tests are
executed in the same process.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/2096
2025-01-13 09:25:00 -05:00
Tim Gross
08a6f870ad cni: use check command when restoring from restart (#24658)
When the Nomad client restarts and restores allocations, the network namespace
for an allocation may exist but no longer be correctly configured. For example,
if the host is rebooted and the task was a Docker task using a pause container,
the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any
existing network namespace matches the expected configuration. This requires CNI
plugins of at least version 1.2.0 to avoid a bug in older plugin versions that
would cause the check to fail.

If the check fails, destroy the network namespace and try to recreate it from
scratch once. If that fails in the second pass, fail the restore so that the
allocation can be recreated (rather than silently having networking fail).

This should fix the gap left #24650 for Docker task drivers and any other
drivers with the `MustInitiateNetwork` capability.

Fixes: https://github.com/hashicorp/nomad/issues/24292
Ref: https://github.com/hashicorp/nomad/pull/24650
2025-01-07 09:38:39 -05:00
Piotr Kazmierczak
f7a4ded2c0 security: add CT executeTemplate to default function_denylist (#24541)
This PR adds Consul Template's executeTemplate function to the denylist by
default, in order to prevent accidental or malicious infinitely recursive
execution.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-11-22 19:33:56 +01:00
Piotr Kazmierczak
368241dbf2 security: a more comprehensive env.denylist (#24540)
A more comprehensive env.denylist that now includes more token, token file and
license variables. 

---------

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2024-11-22 18:54:18 +01:00
Michael Schurter
8dd570d6ca docs: upgrade docs should point at real version (#24438)
Let users know what happened to 1.9.2 but label the gc change as the
first working release (1.9.3).
2024-11-12 11:05:27 -08:00
Piotr Kazmierczak
f7847c6e5b state: remove TimeTable and rely on objects' modify times instead (#24112)
Core scheduler relies on a special table in the state store—the TimeTable—to
figure out which objects can be GC'd. The TimeTable correlates Raft indices
with objects insertion time, a solution we used before most of the objects we
store in the state contained timestamps. This introduced a bit of a memory
overhead and complexity, but most importantly meant that any GC threshold users
set greater than timeTableLimit = 72 * time.Hour was ignored. This PR removes
the TimeTable and relies on object timestamps to determine whether they could
be GCd or not.
2024-11-01 19:38:04 +01:00
R.B. Boyer
4e8f596311 docs: update broken consul acl token links (#24287) 2024-10-23 13:34:21 -04:00
Michael Schurter
34cb05d297 docs: explain how to use dots in docker labels (#24074)
Nomad v1.9.0 (finally!) removes support for HCL1 and the `-hcl1` flag.
See #23912 for details.

One of the uses of HCL1 over HCL2 was that HCL1 allowed quoted keys in
blocks such as env, meta, and Docker's labels:

```hcl
some_block {
  "foo.bar" = "baz"
}
```

This works in HCL1 but is invalid HCL2. In HCL2 you must use a map
instead of a block:

```hcl
some_map = {
  "eggs.spam" = "works!"
}
```

This was such a hassle for users we special cased the `env` and `meta`
blocks to be accepted as blocks or maps in #9936.

However Docker `labels`, being a task config option, is much harder to
special case and commonly needs dots-in-keys for things like DataDog
autodiscovery via Docker container labels:
https://docs.datadoghq.com/containers/docker/integrations/?tab=labels

Luckily `labels` can be specified as a list-of-maps instead:

```hcl
labels = [
  {
    "com.datadoghq.ad.check_names"  = "[\"openmetrics\"]"
    "com.datadoghq.ad.init_configs" = "[{}]"
  }
]
```

So instead of adding more awkward hcl1/2 backward compat code to Nomad,
I just updated the docs to hopefully help people hit by this.

The only other known workaround is dropping HCL in favor of JSON
jobspecs altogether, but that forces a huge migration and maintenance
burden on users:
https://discuss.hashicorp.com/t/docker-based-autodiscovery-with-datadog-how-can-we-make-it-work/18870
2024-09-27 10:02:50 -07:00
Tim Gross
a3a2028837 docs: update key management docs for keyring-in-Raft (#24026)
In #23977 we moved the keyring into Raft. This changeset documents the
operational changes and adds notes to the upgrade guide.
2024-09-25 10:48:14 -04:00
Tim Gross
192d70cee7 docker: update infra_image to new registry (#23927)
The gcr.io container registry is shutting down in March. Update the default
`image_image` for Docker's "pause" containers to point to the new location
hosted by the k8s project.

Fixes: https://github.com/hashicorp/nomad/issues/23911
Ref: https://hashicorp.atlassian.net/browse/NET-10942
2024-09-06 14:34:03 -04:00
Tim Gross
06f5fbc5d6 auth: enforce use of node secret and remove legacy auth (#23838)
As of Nomad 1.6.0, Nomad client agents send their secret with all the
RPCs (other than registration). But for backwards compatibility we had to keep
a legacy auth method that didn't require the node secret. We've previously
announced that this legacy auth method would be removed and that nodes older
than 1.6.0 would not be supported with Nomad 1.9.0.

This changeset removes the legacy auth method.

Ref: https://developer.hashicorp.com/nomad/docs/release-notes/nomad/upcoming#nomad-1-9-0
2024-09-05 14:24:28 -04:00
Tim Gross
a9beef7edd jobspec: remove HCL1 support (#23912)
This changeset removes support for parsing jobspecs via the long-deprecated
HCLv1.

Fixes: https://github.com/hashicorp/nomad/issues/20195
Ref: https://hashicorp.atlassian.net/browse/NET-10220
2024-09-05 09:02:45 -04:00
Austin Culter
ce3e159ee8 docs: update upgrade-specific.mdx (#23906) 2024-09-04 08:42:27 -04:00
Tim Gross
d5ca07a247 docs: notices of upcoming deprecations and backports (#23683)
Add a section to the docs describing planned upcoming deprecations and
removals. Also added some missing upgrade guide sections missed during the last
release.
2024-07-25 10:20:18 -04:00
Tim Gross
2f4353412d keyring: support prepublishing keys (#23577)
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  https://github.com/hashicorp/nomad/issues/16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: https://github.com/hashicorp/nomad/issues/19669
Fixes: https://github.com/hashicorp/nomad/issues/23528
Fixes: https://github.com/hashicorp/nomad/issues/19368
2024-07-19 13:29:41 -04:00
Piotr Kazmierczak
d5e1515e80 docker: default to hyper-v isolation on Windows (#23452) 2024-07-01 08:56:43 +02:00
Piotr Kazmierczak
863d42bc4b docs: upgrade guide updates for backported Docker windows changes (#23453)
Upgrade guide should be uniform across all supported versions, otherwise
backporting breaking changes is tedious.
2024-06-27 19:35:56 +02:00
Piotr Kazmierczak
0ece7b5c16 docker: validate that containers do not run as ContainerAdmin on Windows (#23443)
This enables checks for ContainerAdmin user on docker images on Windows. It's
only checked if users run docker with process isolation and not hyper-v,
because hyper-v provides its own, proper sandboxing.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-06-27 16:22:24 +02:00
Charlie Voiselle
07516c8159 [docs] Add Sentinel info to version-specific upgrade page (#23173)
The upgrade to sentinel v0.26 is a breaking change, requiring users of
custom Sentinel plugins to rebuild them using sentinel-sdk v4
2024-06-26 10:46:38 -04:00
Seth Hoenig
6ad648bec8 networking: Inject implicit constraints on CNI plugins when using bridge mode (#15473)
This PR adds a job mutator which injects constraints on the job taskgroups
that make use of bridge networking. Creating a bridge network makes use of the
CNI plugins: bridge, firewall, host-local, loopback, and portmap. Starting
with Nomad 1.5 these plugins are fingerprinted on each node, and as such we
can ensure jobs are correctly scheduled only on nodes where they are available,
when needed.
2024-03-27 16:11:39 -04:00
Juana De La Cuesta
56bf253474 Add docs for disconnected block (#20147)
Expand the job settings to include the disconnect block and set as deprecated the fields that will be replaced by it.
2024-03-20 10:08:16 +01:00
Michael Schurter
3193ac204f docs: skipping a major release is fine (#20075)
Nomad has always placed an extremely high priority on backward
compatibility. We have always aimed to support N-2 major releases and
usually gone above and beyond that.

The new https://www.hashicorp.com/long-term-support policy also mentions
that N-2 is what we have always supported, so it's probably time for our
docs to reflect that reality.
2024-03-06 08:57:12 -08:00
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
Seth Hoenig
4b3ee77d6b docs: update raw_exec driver docs and 1.7 upgrade notes (#19598) 2024-01-04 08:26:46 -06:00
Etienne Bruines
f18d5c7c32 docs: fix migration to workload identity links (#19508)
Fixes #19507
2023-12-18 21:27:38 -05:00
Tim Gross
0e42569ffb docs: note that 1.7.2 yanks 1.7.0-1.7.1 due to CPU fingeprint bug (#19474) 2023-12-14 11:32:13 -05:00