Files
nomad/.changelog/26424.txt
Tim Gross 6563d0ec3c wait for service registration cleanup until allocs marked lost (#26424)
When a node misses a heartbeat and is marked down, Nomad deletes service
registration instances for that node. But if the node then successfully
heartbeats before its allocations are marked lost, the services are never
restored. The node is unaware that it has missed a heartbeat and there's no
anti-entropy on the node in any case.

We already delete services when the plan applier marks allocations as stopped,
so deleting the services when the node goes down is only an optimization to more
quickly divert service traffic. But because the state after a plan apply is the
"canonical" view of allocation health, this breaks correctness.

Remove the code path that deletes services from nodes when nodes go down. Retain
the state store code that deletes services when allocs are marked terminal by
the plan applier. Also add a path in the state store to delete services when
allocs are marked terminal by the client. This gets back some of the
optimization but avoids the correctness bug because marking the allocation
client-terminal is a one way operation.

Fixes: https://github.com/hashicorp/nomad/issues/16983
2025-08-06 13:40:37 -04:00

4 lines
150 B
Plaintext

```release-note:bug
services: Fixed a bug where Nomad services were deleted if a node missed heartbeats and recovered before allocs were migrated
```