When a node is fingerprinted, we calculate a "computed class" from a hash over a
subset of its fields and attributes. In the scheduler, when a given node fails
feasibility checking (before fit checking) we know that no other node of that
same class will be feasible, and we add the hash to a map so we can reject them
early. This hash cannot include any values that are unique to a given node,
otherwise no other node will have the same hash and we'll never save ourselves
the work of feasibility checking those nodes.
In #4390 we introduce the `nomad.advertise.address` attribute and in #19969 we
introduced `consul.dns.addr` attribute. Both of these are unique per node and
break the hash.
Additionally, we were not correctly filtering attributes out when checking if a
node escaped the class by not filtering for attributes that start with
`unique.`. The test for this introduced in #708 had an inverted assertion, which
allowed this to pass unnoticed since the early days of Nomad.
Ref: https://github.com/hashicorp/nomad/pull/708
Ref: https://github.com/hashicorp/nomad/pull/4390
Ref: https://github.com/hashicorp/nomad/pull/19969
In #24526 we updated the Consul and Vault fingerprints so that they are no
longer periodic. This fixed a problem that cluster admins reported where rolling
updates of Vault or Consul would cause a thundering herd of fingerprint updates
across the whole cluster.
But if Consul/Vault is not available during the initial fingerprint, it will
never get fingerprinted again. This is challenging for cluster updates and black
starts because the implicit service startup ordering may require
reloads. Instead, have the fingerprinter run periodically but mark that it has
made its first successful fingerprint of all Consul/Vault clusters. At that
point, we can skip further periodic updates. The `Reload` method will reset the
mark and allow the subsequent fingerprint to run normally.
Fixes: https://github.com/hashicorp/nomad/issues/25097
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/issues/24049
In order to provide a DNS address and port to Connect tasks configured for
transparent proxy, we need to fingerprint the Consul DNS address and port. The
client will pass this address/port to the iptables configuration provided to the
`consul-cni` plugin.
Ref: https://github.com/hashicorp/nomad/issues/10628
Support for fingerprinting the Consul admin partition was added in #19485. But
when the client fingerprints Consul CE, it gets a valid fingerprint and working
Consul but with a warn-level log. Return "ok" from the partition extractor, but
also ensure that we only add the Consul attribute if it actually has a value.
Fixes: https://github.com/hashicorp/nomad/issues/19756
Consul Enterprise agents all belong to an admin partition. Fingerprint this
attribute when available. When a Consul agent is not explicitly configured with
"default" it is in the default partition but will not report this in its
`/v1/agent/self` endpoint. Fallback to "default" when missing only for Consul
Enterprise.
This feature provides users the ability to add constraints for jobs to land on
Nomad nodes that have a Consul in that partition. Or it can allow cluster
administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also
have the option to implement a future `partition` field in the jobspec's
`consul` block to create an implicit constraint.
Ref: https://github.com/hashicorp/nomad/issues/13139#issuecomment-1856479581
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).
This is two of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.
As part of this work I discovered and fixed two bugs:
* The gRPC proxy socket that we create for Envoy is only ever created using the
default Consul cluster's configuration. This will prevent Connect from being
used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
Consul cluster's configuration, but will use the correct Consul token for the
non-default cluster. This will prevent templates from being used with the
non-default cluster.
Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
In the original design of Consul fingerprinting, we would poll every period so
that we could change the client's fingerprint if Consul became unavailable. As
of 1.4.0 (ref #14673) we no longer update the fingerprint in order to avoid
excessive `Node.Register` RPCs when someone's Consul cluster is flapping.
This allows us to safely backoff Consul fingerprinting on success, just as we
have with Vault.
fingerprint: add support for fingerprinting multiple Consul clusters
Add fingerprinting we'll need to accept multiple Consul clusters in upcoming
Nomad Enterprise features. The fingerprinter will create a map of Consul clients
by cluster name. In Nomad CE, all but the default cluster will be ignored and
there will be no visible behavior change.
Ref: https://github.com/hashicorp/team-nomad/issues/404
Consul v1.13.8 was released with a breaking change in the /v1/agent/self
endpoint version where a line break was being returned.
This caused the Nomad finterprint to fail because `NewVersion` errors on
parse.
This commit removes any extra space from the Consul version returned by
the API.
* Update ioutil deprecated library references to os and io respectively
* Deal with the errors produced.
Add error handling to filEntry info
Add error handling to info
* client: accommodate Consul 1.14.0 gRPC and agent self changes.
Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.
The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.
The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.
The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.
* docs: modify Consul Connect entries to detail 1.14.0 changes.
* changelog: add entry for #15309
* fixup: tidy tests and clean version match from review feedback.
* fixup: use strings tolower func.
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:
(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.
(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.
This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.
Previously Nomad would check for a 404 status code when makeing a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.
Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.
Fixes https://github.com/hashicorp/nomad-enterprise/issues/575
Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-egg situation between
bootstrapping the agent and setting up the consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is no
[currently] no way to pass the agent's client through the fingerprint factory.
This PR adds new probes for detecting these new Consul related attributes:
Consul namespaces are a Consul enterprise feature that may be disabled depending
on the enterprise license associated with the Consul servers. Having this attribute
available will enable Nomad to properly decide whether to query the Consul Namespace
API.
Consul connect must be explicitly enabled before Connect APIs will work. Currently
Nomad only checks for a minimum Consul version. Having this attribute available will
enable Nomad to properly schedule Connect tasks only on nodes with a Consul agent that
has Connect enabled.
Consul connect requires the grpc port to be explicitly set before Connect APIs will work.
Currently Nomad only checks for a minimal Consul version. Having this attribute available
will enable Nomad to schedule Connect tasks only on nodes with a Consul agent that has
the grpc listener enabled.
This PR refactors the ConsulFingerprint implementation, breaking individual attributes
into individual functions to make testing them easier. This is in preparation for
additional extractors about to be added. Behavior should be otherwise unchanged.
It adds the attribute consul.sku, which can be used to differentiate between Consul
OSS vs Consul ENT.
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.
It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
Change the unit test to only test if the consul link exists, not the
value of the link. The old test was hostname specific and therefore
would always be different based on the environment running the tests.
Reduce future confusion by introducing a minor version that is gossiped out
via the `mvn` Serf tag (Minor Version Number, `vsn` is already being used for
to communicate `Major Version Number`).
Background: hashicorp/consul/issues/1346#issuecomment-151663152