10 Commits

Author SHA1 Message Date
Tim Gross
48b1b01e69 prevent client deadlock and incorrect timing on stop_on_client_after (#25946)
The `disconnect.stop_on_client_after` feature is implemented as a loop on the
client that's intended to wait on the shortest timeout of all the allocations on
the node and then check whether the interval since the last heartbeat has been
longer than the timeout. It uses a buffered channel of allocations written and
read from the same goroutine to push "stops" from the timeout expiring to the
next pass through the loop. Unfortunately if there are multiple allocations that
need to be stopped in the same timeout event, or even if a previous event has
not yet been dequeued, then sending on the channel will block and the entire
goroutine deadlocks itself.

While fixing this, I also discovered that the `stop_on_client_after` and
heartbeat loops can synchronize in a pathological way that extends the
`stop_on_client_after` window. If a heartbeat fails close to the beginning of
the shortest `stop_on_client_after` window, the loop will end up waiting until
almost 2x the intended wait period.

While fixing both of those issues, I discovered that the existing tests had a
bug such that we were asserting that an allocrunner was being destroyed when it
had already exited.

This commit includes the following:
* Rework the watch loop so that we handle the stops in the same case as the
  timer expiration, rather than using a channel in the method scope.
* Remove the alloc intervals map field from the struct and keep it in the
  method scope, in order to discourage writing racy tests that read its value.
* Reset the timer whenever we receive a heartbeat, which forces the two
  intervals to synchronize correctly.
* Minor refactoring of the disconnect timeout lookup to improve brevity.

Fixes: https://github.com/hashicorp/nomad/issues/24679
Ref: https://hashicorp.atlassian.net/browse/NMD-407
2025-05-29 15:05:33 -04:00
Michael Schurter
92de40b00d tests: fixes a few data races in tests (#25455)
* test: use statedb factory

Swapping fields on Client after it has been created is a race.

* test: lock before checking heartbeat state

Fixes races

* test: fix races by copying fsm objects

A common source of data races in tests is when they insert a fixture
directly into memdb and then later mutate the object. Since objects in
the state store are readonly, any later mutation is a data race.

* test: lock when peeking at eval stats

* test: lock when peeking at serf state

* test: lock when looking at stats

* test: fix default eval broker state test

The test was not applying the config callback. In addition the test
raced against the configuration being applied. Waiting for the keyring
to be initialized resolved the race in my testing, but given the high
concurrency of the various leadership subsystems it's possible it may
still flake.
2025-03-20 10:56:17 -07:00
Michael Smithhisler
f2b761f17c disconnected: removes deprecated disconnect fields (#25284)
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-03-05 14:46:02 -05:00
Juana De La Cuesta
20cfbc82d3 Introduces Disconnect block into the TaskGroup configuration (#19886)
This PR is the first on two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
2024-02-19 16:41:35 +01:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Michael Schurter
b2dea4ca5b docs: s/hearbeat/heartbeat and fix link
Also fixed the same typo in a test. Fixing the typo fixes the link, but
the link was still broken when running the website locally due to the
trailing slash. It would have worked in prod thanks to redirects, but
using the canonical URL seems ideal.
2020-07-23 11:33:34 -07:00
Lang Martin
3477f2e87a client/heartbeatstop: don't store client state, use timeout
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the intial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.

The more complex state behavior might be justified later, but we
should learn about failure modes first.
2020-05-01 12:35:49 -04:00
Lang Martin
7405961144 client/heartbeatstop: destroy allocs when disconnected from servers
- track lastHeartbeat, the client local time of the last successful
  heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
  servers are still partioned
2020-05-01 12:35:49 -04:00