389 Commits

Author SHA1 Message Date
Tim Gross
3432b0a2d6 consul: only add fingerprint link if unique.consul.name is set (#26787)
In Nomad Enterprise we can fingerprint multiple Consul datacenters. If neither
is `"default"` then we end up with warning logs about adding a "link".

The `Link` field on the `Node` struct is a map of attributes that only
contributes to the node's computed hash. The `"consul"` key's value is derived
from the `unique.consul.name` attribute, which only exists if there's a default
Consul cluster.

Update the fingerprint to skip setting the link field if there's no
`unique.consul.name`, and lower the warning log for malformed fields to debug;
this is a minor scheduling optimization largely captured by existing Consul
fields in the node computed class. The only reason not to remove it entirely is
to avoid changing computed classes on existing large clusters.

Fixes: https://github.com/hashicorp/nomad/issues/26781
Ref: https://hashicorp.atlassian.net/browse/NMD-998
2025-09-17 13:23:01 -04:00
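A minimal sketch of the guard described in 3432b0a2d6, assuming a pared-down `node` type with `Attributes` and `Links` maps. The type, the `setConsulLink` helper, and the way the link value is derived are illustrative assumptions, not Nomad's actual code.

```go
package main

import "fmt"

// node is a hypothetical, pared-down stand-in for the client's node record.
type node struct {
	Attributes map[string]string
	Links      map[string]string
}

// setConsulLink adds the "consul" link only when the unique.consul.name
// attribute exists and is non-empty. It reports whether a link was set.
func setConsulLink(n *node) bool {
	name, ok := n.Attributes["unique.consul.name"]
	if !ok || name == "" {
		return false // no default Consul cluster: skip quietly
	}
	if n.Links == nil {
		n.Links = make(map[string]string)
	}
	// Assumption: the real link value may be derived differently.
	n.Links["consul"] = name
	return true
}

func main() {
	n := &node{Attributes: map[string]string{"consul.version": "1.21.0"}}
	fmt.Println(setConsulLink(n)) // false
	n.Attributes["unique.consul.name"] = "client-1"
	fmt.Println(setConsulLink(n), n.Links["consul"]) // true client-1
}
```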
Olli Janatuinen
6398ef9475 secrets: Support custom plugins in Windows (#26751)
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2025-09-16 09:14:50 -04:00
Michael Smithhisler
10ed46cbd4 secrets: pass key/value config data to plugins as env (#26455)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 16:08:24 -04:00
Michael Smithhisler
9950ef515c secrets: validate name and update client config (#26447) 2025-09-05 16:08:23 -04:00
Michael Smithhisler
00ef9cacab secrets: add common secrets plugins impl (#26335)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2025-09-05 16:08:23 -04:00
Michael Smithhisler
ac32b0864d scheduler: adds implicit constraint for secrets plugin node attributes (#26303) 2025-09-05 16:08:23 -04:00
James Rasell
c85c723336 ci: Run core tests groups workflow on amd64 and arm64 runners. (#25695) 2025-04-17 15:16:29 +01:00
Daniel Bennett
99c25fc635 dhv: mkdir plugin parameters: uid,guid,mode (#25533)
also remove Error logs from client rpc and promote plugin Debug logs to Error (since they have more info in them)
2025-03-28 10:13:13 -05:00
Tim Gross
1788bfb42e remove addresses from node class hash (#24942)
When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.

In #4390 we introduced the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.

Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.

Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
2025-03-03 09:28:32 -05:00
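To illustrate why per-node values break the class hash in 1788bfb42e, here is a rough sketch. The `classHash` function and its SHA-256 scheme are assumptions for illustration only; Nomad computes its class hash differently. The only behavior taken from the commit message is that attributes prefixed with `unique.` are excluded.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// classHash hashes a node's attributes in sorted key order, skipping any
// key that starts with "unique." so per-node values do not affect the class.
func classHash(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		if strings.HasPrefix(k, "unique.") {
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, attrs[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := map[string]string{"kernel.name": "linux", "unique.hostname": "a"}
	b := map[string]string{"kernel.name": "linux", "unique.hostname": "b"}
	fmt.Println(classHash(a) == classHash(b)) // true: same class

	// A per-node value without the "unique." prefix splits the class.
	a["consul.dns.addr"] = "10.0.0.1"
	b["consul.dns.addr"] = "10.0.0.2"
	fmt.Println(classHash(a) == classHash(b)) // false
}
```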
Tim Gross
e02ef73abf fingerprint: only log failed Consul fingerprint once (#25182)
In #24526 we updated Consul and Vault fingerprinting so that we no longer
periodically fingerprint. In #25102 we made it so that we fingerprint
periodically on start until the first fingerprint, in order to tolerate Consul
or Vault not being available on start. For clusters not running Consul, this
leads to a warn-level log every 15s. This same log exists for Vault, but Vault
support is opt-in via `vault.enable = true` whereas you have to manually disable
the fingerprinter for Consul.

Make it so that we only log a failed Consul fingerprint once per Consul
cluster. Reset the gate on this once we have a successful fingerprint, so that
we get the logs after a reload if Consul is unavailable.

Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/pull/25102
Fixes: https://github.com/hashicorp/nomad/issues/25181
2025-02-21 13:09:34 -05:00
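The log-once gate from e02ef73abf could look roughly like the sketch below. The `logGate` type and its method names are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// logGate remembers which clusters have already had a failure logged.
type logGate struct {
	logged map[string]bool
}

func newLogGate() *logGate {
	return &logGate{logged: make(map[string]bool)}
}

// onFailure reports whether the caller should log this failure. It returns
// true only for the first failure seen since the last success.
func (g *logGate) onFailure(cluster string) bool {
	if g.logged[cluster] {
		return false
	}
	g.logged[cluster] = true
	return true
}

// onSuccess resets the gate so a later failure is logged again.
func (g *logGate) onSuccess(cluster string) {
	delete(g.logged, cluster)
}

func main() {
	g := newLogGate()
	fmt.Println(g.onFailure("default")) // true: log it
	fmt.Println(g.onFailure("default")) // false: already logged
	g.onSuccess("default")
	fmt.Println(g.onFailure("default")) // true: gate was reset
}
```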
Tim Gross
8c57fd5eb0 fingerprint: initial fingerprint of Vault/Consul should be periodic (#25102)
In #24526 we updated the Consul and Vault fingerprints so that they are no
longer periodic. This fixed a problem that cluster admins reported where rolling
updates of Vault or Consul would cause a thundering herd of fingerprint updates
across the whole cluster.

But if Consul/Vault is not available during the initial fingerprint, it will
never get fingerprinted again. This is challenging for cluster updates and black
starts because the implicit service startup ordering may require
reloads. Instead, have the fingerprinter run periodically but mark that it has
made its first successful fingerprint of all Consul/Vault clusters. At that
point, we can skip further periodic updates. The `Reload` method will reset the
mark and allow the subsequent fingerprint to run normally.

Fixes: https://github.com/hashicorp/nomad/issues/25097
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/issues/24049
2025-02-13 14:26:04 -05:00
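A small sketch of the "periodic until first success" behavior described in 8c57fd5eb0. The `initialFingerprinter` type, its `probe` callback, and `NeedsPeriodic` are hypothetical names; only the `Reload` reset comes from the commit message.

```go
package main

import "fmt"

// initialFingerprinter keeps polling until every cluster has been
// fingerprinted successfully once.
type initialFingerprinter struct {
	clusters []string
	probe    func(cluster string) bool // hypothetical: true when the cluster responds
	done     map[string]bool
}

// Fingerprint probes every cluster that has not yet succeeded.
func (f *initialFingerprinter) Fingerprint() {
	if f.done == nil {
		f.done = make(map[string]bool)
	}
	for _, c := range f.clusters {
		if !f.done[c] && f.probe(c) {
			f.done[c] = true
		}
	}
}

// NeedsPeriodic reports whether any cluster still lacks a first success.
func (f *initialFingerprinter) NeedsPeriodic() bool {
	for _, c := range f.clusters {
		if !f.done[c] {
			return true
		}
	}
	return false
}

// Reload clears the marks so the next fingerprint runs normally.
func (f *initialFingerprinter) Reload() {
	f.done = make(map[string]bool)
}

func main() {
	up := false
	f := &initialFingerprinter{
		clusters: []string{"default"},
		probe:    func(string) bool { return up },
	}
	f.Fingerprint()
	fmt.Println(f.NeedsPeriodic()) // true: Consul/Vault not up yet
	up = true
	f.Fingerprint()
	fmt.Println(f.NeedsPeriodic()) // false: stop periodic updates
	f.Reload()
	fmt.Println(f.NeedsPeriodic()) // true: reload resets the mark
}
```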
Jorge Marey
25426f0777 fingerprint: add config option to disable dmidecode (#25108) 2025-02-13 11:20:48 -05:00
Matt Keeler
833e240597 Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856)
* Upgrade to using hashicorp/go-metrics@v0.5.4

This also requires bumping the dependencies for:

* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)

Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
2025-01-31 15:22:00 -05:00
Daniel Bennett
49c147bcd7 dynamic host volumes: change env vars, fixup auto-delete (#24943)
* plugin env: DHV_HOST_PATH->DHV_VOLUMES_DIR
* client config: host_volumes_dir
* plugin env: add namespace+nodepool
* only auto-delete after error saving client state
  on *initial* create
2025-01-27 10:36:53 -06:00
Seth Hoenig
1356880962 fingerprint: convert consul and vault fingerprinters to be reloadable (#24526)
This PR changes the Consul and Vault fingerprint implementations to be
reloadable rather than periodic. Reasons described in the issue.
2025-01-27 09:20:01 +00:00
Daniel Bennett
985eb53c65 dynamic host volumes: plugin spec tweaks (#24848)
* prefix plugin env vars with DHV_
* add env: DHV_VOLUME_ID, DHV_PLUGIN_DIR
* 5s timeout on fingerprint calls
2025-01-13 14:18:10 -06:00
Michael Smithhisler
606ce9dd90 deps: upgrade aws-sdk-go from v1 to v2 (#24720) 2025-01-09 17:27:19 -05:00
Daniel Bennett
af967184a6 dynamic host volumes: tweak plugin fingerprint (#24711)
Instead of a plugin `version` subcommand that responds with a string
(established in #24497), respond to a `fingerprint` command with a data
structure that we may extend in the future (such as plugin capabilities,
like size constraint support?). In the immediate term, it's still just the
version: `{"version": "0.0.1"}`

In addition to leaving the door open for future expansion, I think it will
also avoid false positives detecting executables that just happen to
respond to a `version` command.

This also reverses the ordering of the fingerprint string parts
from `plugins.host_volume.version.mkdir` (which aligned with CNI)
to `plugins.host_volume.mkdir.version` (makes more sense to me)
2024-12-19 09:25:55 -05:00
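A sketch of how the structured `fingerprint` response from af967184a6 might be parsed. Only the `{"version": "0.0.1"}` payload and the `plugins.host_volume.mkdir.version` attribute layout come from the commit message; the function names and error handling are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// pluginFingerprint mirrors the response shown in the commit message.
type pluginFingerprint struct {
	Version string `json:"version"`
}

// parseFingerprint decodes a plugin's fingerprint output. Output that is not
// a JSON object with a version (for example, a bare version string printed
// by an unrelated executable) is rejected.
func parseFingerprint(out []byte) (string, error) {
	var fp pluginFingerprint
	if err := json.Unmarshal(out, &fp); err != nil {
		return "", fmt.Errorf("invalid fingerprint response: %w", err)
	}
	if fp.Version == "" {
		return "", errors.New("fingerprint response has no version")
	}
	return fp.Version, nil
}

// attributeKey builds the node attribute name in the reordered form.
func attributeKey(pluginID string) string {
	return "plugins.host_volume." + pluginID + ".version"
}

func main() {
	v, err := parseFingerprint([]byte(`{"version": "0.0.1"}`))
	fmt.Println(attributeKey("mkdir"), v, err)

	_, err = parseFingerprint([]byte("v1.2.3"))
	fmt.Println(err != nil) // true: a plain version string is not accepted
}
```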
Daniel Bennett
46a39560bb dynamic host volumes: fingerprint client plugins (#24589) 2024-12-19 09:25:54 -05:00
Rodrigo Lourenço
cdebf96b0e fingerprint gce: collect preemptibility 2024-10-23 15:19:20 +02:00
Tim Gross
b7f1800657 fingerprint: update landlock test to accept v4+ APIs (#23979)
The landlock fingerprint test assumes there's no version of the landlock API
>3. Update the test assertion to allow for the current v4 and any future
versions.
2024-09-17 15:07:44 -04:00
Piotr Kazmierczak
0bc9796d3b client: log an error message if total detected cpu is zero (#23827) 2024-08-15 18:31:27 +02:00
Seth Hoenig
db0642099e build: update golangci-lint to 1.60.1 (#23807)
* build: update golangci-lint to 1.60.1

* ci: update golangci-lint to v1.60.1

Helps with go1.23 compatibility. Introduces some breaking changes / newly
enforced linter patterns so those are fixed as well.
2024-08-14 10:09:31 -05:00
guifran001
1c44521543 client: Add a preferred address family option for network-interface (#23389)
to prefer ipv4 or ipv6 when deducing IP from network interface

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2024-07-12 15:30:38 -05:00
Tim Gross
7d73065066 numa: fix scheduler panic due to topology serialization bug (#23284)
The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
2024-06-11 08:55:00 -04:00
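The serialization problem in 7d73065066 can be reproduced in miniature. This sketch uses `encoding/json` from the standard library in place of msgpack (an assumption made to keep the example self-contained), and the `idSet`, `topology`, and `Nodes` names are hypothetical. It shows a set with no exported fields coming back empty after a round trip, and a plain shadow field used to repopulate it on demand.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// idSet has no exported fields, so reflection-based encoders write nothing.
type idSet struct {
	items map[int]struct{}
}

func newIDSet(ids ...int) idSet {
	s := idSet{items: make(map[int]struct{}, len(ids))}
	for _, id := range ids {
		s.items[id] = struct{}{}
	}
	return s
}

func (s idSet) sorted() []int {
	out := make([]int, 0, len(s.items))
	for id := range s.items {
		out = append(out, id)
	}
	sort.Ints(out)
	return out
}

type topology struct {
	NodeIDs idSet // lost on serialization
	Nodes   []int // serializable copy, populated at the same time
}

func (t *topology) setNodes(ids ...int) {
	t.NodeIDs = newIDSet(ids...)
	t.Nodes = t.NodeIDs.sorted()
}

// nodeIDs rebuilds the set from the shadow field when it is empty.
func (t *topology) nodeIDs() idSet {
	if len(t.NodeIDs.items) == 0 && len(t.Nodes) > 0 {
		t.NodeIDs = newIDSet(t.Nodes...)
	}
	return t.NodeIDs
}

// roundTrip encodes and decodes a topology, as persisting state would.
func roundTrip(in topology) (topology, error) {
	var out topology
	raw, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	var t topology
	t.setNodes(1, 0)
	got, err := roundTrip(t)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(got.NodeIDs.items)) // 0: the set did not survive
	fmt.Println(got.nodeIDs().sorted()) // [0 1]: rebuilt from Nodes
}
```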
Tim Gross
a74775814c fingerprint: add DNS address and port to Consul fingerprint (#19969)
In order to provide a DNS address and port to Connect tasks configured for
transparent proxy, we need to fingerprint the Consul DNS address and port. The
client will pass this address/port to the iptables configuration provided to the
`consul-cni` plugin.

Ref: https://github.com/hashicorp/nomad/issues/10628
2024-02-14 12:15:58 -05:00
Tim Gross
62c57d208b fingerprint: eliminate spurious warning logs with Consul CE (#19923)
Support for fingerprinting the Consul admin partition was added in #19485. But
when the client fingerprints Consul CE, it gets a valid fingerprint and working
Consul but with a warn-level log. Return "ok" from the partition extractor, but
also ensure that we only add the Consul attribute if it actually has a value.

Fixes: https://github.com/hashicorp/nomad/issues/19756
2024-02-09 08:19:00 -05:00
Tim Gross
2e33115c15 consul: fingerprint Consul Enterprise admin partitions (#19485)
Consul Enterprise agents all belong to an admin partition. Fingerprint this
attribute when available. When a Consul agent is not explicitly configured with
"default" it is in the default partition but will not report this in its
`/v1/agent/self` endpoint. Fallback to "default" when missing only for Consul
Enterprise.

This feature provides users the ability to add constraints for jobs to land on
Nomad nodes that have a Consul in that partition. Or it can allow cluster
administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also
have the option to implement a future `partition` field in the jobspec's
`consul` block to create an implicit constraint.

Ref: https://github.com/hashicorp/nomad/issues/13139#issuecomment-1856479581
2023-12-15 09:26:25 -05:00
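The fallback rule in 2e33115c15 reduces to a small decision, sketched below. The `partitionAttr` helper and its inputs are hypothetical; how the agent's response is inspected to decide whether it is Consul Enterprise is left out.

```go
package main

import "fmt"

// partitionAttr returns the partition to fingerprint and whether the
// attribute should be set at all. A missing partition falls back to
// "default" only for Consul Enterprise; otherwise no attribute is added.
func partitionAttr(reported string, enterprise bool) (string, bool) {
	if reported != "" {
		return reported, true
	}
	if enterprise {
		return "default", true
	}
	return "", false
}

func main() {
	fmt.Println(partitionAttr("team-a", true)) // team-a true
	fmt.Println(partitionAttr("", true))       // default true
	fmt.Println(partitionAttr("", false))      //  false
}
```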
Tim Gross
50f0ce5412 config: remove old Vault/Consul config blocks from client (#18994)
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).

This is two of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.

As part of this work I discovered and fixed two bugs:

* The gRPC proxy socket that we create for Envoy is only ever created using the
  default Consul cluster's configuration. This will prevent Connect from being
  used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
  Consul cluster's configuration, but will use the correct Consul token for the
  non-default cluster. This will prevent templates from being used with the
  non-default cluster.

Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
2023-11-07 09:15:37 -05:00
Seth Hoenig
951cde4e3b numa: fix cpu topology conversion for non linux systems (#18843) 2023-10-24 09:12:34 -05:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm restore

Handle the case where an upgraded server dequeues an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.

* lint: learn how to spell pseudo
2023-10-19 15:09:30 -05:00
Tim Gross
5001bf4547 consul: use constant instead of "default" literal (#18611)
Use the constant `structs.ConsulDefaultCluster` instead of the string literal
"default", as we've done for Vault.
2023-09-28 16:50:21 -04:00
Luiz Aoqui
868aba57bb vault: update identity name to start with vault_ (#18591)
* vault: update identity name to start with `vault_`

In the original proposal, workload identities used to derive Vault
tokens were expected to be called just `vault`. But in order to support
multiple Vault clusters it is necessary to associate identities with
specific Vault cluster configuration.

This commit implements a new proposal to have Vault identities named as
`vault_<cluster>`.
2023-09-27 15:53:28 -03:00
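The `vault_<cluster>` naming from 868aba57bb can be sketched as a pair of helpers. The function names are hypothetical; only the `vault_` prefix convention comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

const vaultIdentityPrefix = "vault_"

// vaultIdentityName builds the workload identity name for a Vault cluster.
func vaultIdentityName(cluster string) string {
	return vaultIdentityPrefix + cluster
}

// clusterFromIdentity extracts the cluster name from an identity name.
// It returns false when the name does not follow the vault_<cluster> form.
func clusterFromIdentity(name string) (string, bool) {
	if !strings.HasPrefix(name, vaultIdentityPrefix) {
		return "", false
	}
	cluster := strings.TrimPrefix(name, vaultIdentityPrefix)
	return cluster, cluster != ""
}

func main() {
	fmt.Println(vaultIdentityName("default"))         // vault_default
	fmt.Println(clusterFromIdentity("vault_default")) // default true
	fmt.Println(clusterFromIdentity("vault"))         //  false
}
```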
Tim Gross
20eadc7b29 config: move Consul getter out of fingerprinter (#18556) 2023-09-22 10:58:39 -04:00
Tim Gross
fdc6c2151d vault: select Vault API client by cluster name (#18533)
Nomad Enterprise will support configuring multiple Vault clients. Instead of
having a single Vault client field in the Nomad client, we'll have a function
that callers can parameterize by the Vault cluster name that returns the
correctly configured Vault API client wrapper.
2023-09-19 14:35:01 -04:00
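A rough sketch of the "function parameterized by cluster name" shape described in fdc6c2151d. The `vaultClient` wrapper, `vaultClientFunc` type, and constructor are hypothetical stand-ins, not the types used in Nomad.

```go
package main

import "fmt"

// vaultClient is a hypothetical stand-in for the Vault API client wrapper.
type vaultClient struct {
	addr string
}

// vaultClientFunc returns the client configured for the named cluster.
type vaultClientFunc func(cluster string) (*vaultClient, error)

// newVaultClientFunc builds one client per configured cluster and returns a
// lookup function that callers parameterize by cluster name.
func newVaultClientFunc(addrs map[string]string) vaultClientFunc {
	clients := make(map[string]*vaultClient, len(addrs))
	for name, addr := range addrs {
		clients[name] = &vaultClient{addr: addr}
	}
	return func(cluster string) (*vaultClient, error) {
		c, ok := clients[cluster]
		if !ok {
			return nil, fmt.Errorf("no Vault client configured for cluster %q", cluster)
		}
		return c, nil
	}
}

func main() {
	get := newVaultClientFunc(map[string]string{
		"default": "https://vault.example.com:8200",
	})
	c, err := get("default")
	fmt.Println(c.addr, err)

	_, err = get("other")
	fmt.Println(err)
}
```

Passing a lookup function rather than a single client field lets each task or service name the cluster it needs without the caller holding the whole map.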
Seth Hoenig
591394fb62 drivers: plumb hardware topology via grpc into drivers (#18504)
* drivers: plumb hardware topology via grpc into drivers

This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken into account
consistently in all references to system topology.

* cr: use enum instead of bool for core grade

* cr: fix test slit tables to be possible
2023-09-18 08:58:07 -05:00
Seth Hoenig
2e1974a574 client: refactor cpuset partitioning (#18371)
* client: refactor cpuset partitioning

This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.

Now, we make use of cgroup hierarchy and cpuset inheritance to efficiently
manage cpusets.

* cr: tweaks for feedback
2023-09-12 09:11:11 -05:00
Tim Gross
b022346575 fingerprint: backoff on Consul fingerprint after initial success (#18426)
In the original design of Consul fingerprinting, we would poll every period so
that we could change the client's fingerprint if Consul became unavailable. As
of 1.4.0 (ref #14673) we no longer update the fingerprint in order to avoid
excessive `Node.Register` RPCs when someone's Consul cluster is flapping.

This allows us to safely backoff Consul fingerprinting on success, just as we
have with Vault.
2023-09-08 08:17:07 -04:00
Tim Gross
a8e68e6479 fingerprint: add support for fingerprinting multiple Consul clusters (#18392)
Add fingerprinting we'll need to accept multiple Consul clusters in upcoming
Nomad Enterprise features. The fingerprinter will create a map of Consul clients
by cluster name. In Nomad CE, all but the default cluster will be ignored and
there will be no visible behavior change.

Ref: https://github.com/hashicorp/team-nomad/issues/404
2023-09-07 14:05:35 -04:00
Tim Gross
c145e8b30f fingerprint: add warning in CE when there are multiple vaults (#18412)
Nomad CE only supports a single (default) Vault cluster, so log a warning if the
user has configured multiple Vaults.
2023-09-07 09:51:48 -04:00
Tim Gross
b51b2a2705 fingerprint: add support for fingerprinting multiple Vault clusters (#18253)
Add fingerprinting we'll need to accept multiple Vault clusters in upcoming
Nomad Enterprise features. The fingerprinter will create a map of Vault clients
by cluster name. In Nomad CE, all but the default cluster will be ignored and
there will be no visible behavior change.
2023-08-18 15:33:22 -04:00
James Rasell
6108f5c4c3 admin: rename _oss files to _ce (#18209) 2023-08-18 07:47:24 +01:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
Kevin Schoonover
4841791c86 fingerprint: fix 'default' alias not added to interface specified by network_interface (#18096) 2023-08-01 08:35:31 -04:00
Ville Vesilehto
2c463bb038 chore(lint): use Go stdlib variables for HTTP methods and status codes (#17968) 2023-07-26 15:28:09 +01:00
Patric Stout
e190eae395 Use config "cpu_total_compute" (if set) for all CPU statistics (#17628)
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolve
the issue.

This issue was most noticeable on ARM64, as there auto-detection
always failed.
2023-07-19 13:30:47 -05:00
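The precedence described in e190eae395 can be sketched as a single helper that every CPU statistic would share. The `totalCompute` function, its units, and its error text are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// totalCompute picks the node's total CPU compute. A configured
// cpu_total_compute value wins when set; otherwise the auto-detected value
// is used. Zero from both sources is an error.
func totalCompute(configured, detected uint64) (uint64, error) {
	if configured > 0 {
		return configured, nil
	}
	if detected > 0 {
		return detected, nil
	}
	return 0, errors.New("total cpu compute is zero: set cpu_total_compute")
}

func main() {
	fmt.Println(totalCompute(8000, 0))    // 8000 <nil>: detection failed, config used
	fmt.Println(totalCompute(0, 12000))   // 12000 <nil>
	fmt.Println(totalCompute(8000, 12000)) // 8000 <nil>: config overrides detection
	_, err := totalCompute(0, 0)
	fmt.Println(err != nil) // true
}
```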
Seth Hoenig
100c460467 env/aws: updates from ec2info (#17835) 2023-07-07 10:12:05 -05:00
VishnuJin
102f73274b fingerprint: added windows os.build attribute to host fingerprint (#17576) 2023-06-21 10:53:50 -04:00
Jerome Eteve
0d41fb6747 client checks kernel module in /sys/module for WSL2 bridge networking (#17306) 2023-06-06 10:26:50 -04:00