Nomad servers, if upgraded, can return node identities as part of
the register and update/heartbeat response objects. The Nomad
client will now handle this and store it as appropriate within its
memory and statedb.
The client will now use any stored identity for RPC authentication
with a fallback to the secretID. This supports upgrades paths where
the Nomad clients are updated before the Nomad servers.
When the allocation is stopped, we deregister the service in the alloc runner's
`PreKill` hook. This ensures we delete the service registration and wait for the
shutdown delay before shutting down the tasks, so that workloads can drain their
connections. However, the call to remove the workload only logs errors and never
retries them.
Add a short retry loop to the `RemoveWorkload` method for Nomad services, so
that transient errors give us an extra opportunity to deregister the service
before the tasks are stopped, before we need to fall back to the data integrity
improvements implemented in #20590.
Ref: https://github.com/hashicorp/nomad/issues/16616
When a Connect service is registered with Consul, Nomad includes the nested
`Connect.SidecarService` field that includes health checks for the Envoy
proxy. Because these are not part of the job spec, the alloc health tracker
created by `health_hook` doesn't know to read the value of these checks.
In many circumstances this won't be noticed, but if the Envoy health check
happens to take longer than the `update.min_healthy_time` (perhaps because it's
been set low), it's possible for a deployment to progress too early such that
there will briefly be no healthy instances of the service available in Consul.
Update the Consul service client to find the nested sidecar service in the
service catalog and attach it to the results provided to the tracker. The
tracker can then check the sidecar health checks.
Fixes: https://github.com/hashicorp/nomad/issues/19269
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.
Ref: https://github.com/hashicorp/team-nomad/issues/404
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
This is a work-in-progress changeset to provide workload-specific Consul tokens
that are created by the `consul_hook` and attached to workload registration
requests by the `group_service_hook` and `service_hook`.
This requires unreleased updates to Consul's `api` package, so this changeset
includes a temporary `replace` directive in the go.mod file.
* Revert "client: include response body in output for successful HTTP checks (#18345)"
This reverts commit d0a93f12d1.
* cr: add comment about dropping ok output
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
---------
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
* services: always set deregister flag after deregistration of group
This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).
In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.
This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.
* services: cleanup group service hook tests
* nsd: block on removal of services
This PR uses a WaitGroup to ensure workload removals are complete
before returning from ServiceRegistrationHandler.RemoveWorkload of
the nomad service provider. The de-registration of individual services
still occurs asynchrously, but we must block on the parent removal
call so that we do not race with further operations on the same set
of services - e.g. in the case of a task restart where we de-register
and then re-register the services in quick succession.
Fixes#15032
* nsd: add e2e test for initial failing check and restart
* consul: correctly understand missing consul checks as unhealthy
This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.
In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.
Fixes#15536
* consul: minor CR refactor using maps not sets
* consul: observe transition from healthy to unhealthy checks
* consul: spell healthy correctly
* consul: register checks along with service on initial registration
This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.
Fixes#3935
* cr: followup to fix cause of extra consul logging
* cr: fix another bug
* cr: fixup changelog
* cleanup: refactor MapStringStringSliceValueSet to be cleaner
* cleanup: replace SliceStringToSet with actual set
* cleanup: replace SliceStringSubset with real set
* cleanup: replace SliceStringContains with slices.Contains
* cleanup: remove unused function SliceStringHasPrefix
* cleanup: fixup StringHasPrefixInSlice doc string
* cleanup: refactor SliceSetDisjoint to use real set
* cleanup: replace CompareSliceSetString with SliceSetEq
* cleanup: replace CompareMapStringString with maps.Equal
* cleanup: replace CopyMapStringString with CopyMap
* cleanup: replace CopyMapStringInterface with CopyMap
* cleanup: fixup more CopyMapStringString and CopyMapStringInt
* cleanup: replace CopySliceString with slices.Clone
* cleanup: remove unused CopySliceInt
* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice
* cleanup: replace CopyMap with maps.Clone
* cleanup: run go mod tidy
This PR implements support for check_restart for checks registered
in the Nomad service provider.
Unlike Consul, Nomad service checks never report a "warning" status,
and so the check_restart.ignore_warnings configuration is not valid
for Nomad service checks.
This PR refactors agent/consul/check_watcher into client/serviceregistration,
and abstracts away the Consul-specific check lookups.
In doing so we should be able to reuse the existing check watcher logic for
also watching NSD checks in a followup PR.
A chunk of consul/unit_test.go is removed - we'll cover that in e2e tests
in a follow PR if needed. In the long run I'd like to remove this whole file.
This PR enables setting of the headers block on services registered
into Nomad's service provider. Works just like the existing support
in Consul checks.
This PR hopefully fixes a race condition of our little test tcp server
that the check observer is making connections against for test cases.
The tcp listener would either startup too slow or exit too soon.
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.
Currently only HTTP and TCP checks are supported, though more types
could be added later.
This PR introduces the `address` field in the `service` block so that Nomad
or Consul services can be registered with a custom `.Address.` to advertise.
The address can be an IP address or domain name. If the `address` field is
set, the `service.address_mode` must be set in `auto` mode.
The service registration wrapper handles sending requests to
backend providers without the caller needing to know this
information. This will be used within the task and alloc runner
service hooks when performing service registration activities.
This commit performs refactoring to pull out common service
registration objects into a new `client/serviceregistration`
package. This new package will form the base point for all
client specific service registration functionality.
The Consul specific implementation is not moved as it also
includes non-service registration implementations; this reduces
the blast radius of the changes as well.