nomad/client at cf9f269ccfcb4e6592fa675625b7abfaafe8b18f

kemko/nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Tim Gross 48b1b01e69 prevent client deadlock and incorrect timing on stop_on_client_after (#25946)

The `disconnect.stop_on_client_after` feature is implemented as a loop on the
client that's intended to wait on the shortest timeout of all the allocations on
the node and then check whether the interval since the last heartbeat has been
longer than the timeout. It uses a buffered channel of allocations written and
read from the same goroutine to push "stops" from the timeout expiring to the
next pass through the loop. Unfortunately, if multiple allocations need to be
stopped in the same timeout event, or if a previous event has not yet been
dequeued, then sending on the channel blocks and the goroutine deadlocks.
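
The blocking behavior described above can be reproduced in isolation. The
following is a minimal sketch, not the Nomad code: `trySend` and the channel of
allocation IDs are hypothetical names, and the capacity of 1 is an assumption.
It uses a non-blocking send to show the point at which a plain send from the
only goroutine that reads the channel would block forever.

```go
package main

import "fmt"

// trySend is a hypothetical helper: it attempts a non-blocking send and
// reports whether the channel had room for the value.
func trySend(ch chan string, allocID string) bool {
	select {
	case ch <- allocID:
		return true
	default:
		// The buffer is full. A plain `ch <- allocID` here would block, and
		// if this goroutine is also the only reader, it would never wake up.
		return false
	}
}

func main() {
	// Assumed capacity of 1, standing in for the buffered "stops" channel.
	stopCh := make(chan string, 1)

	fmt.Println(trySend(stopCh, "alloc-1")) // true
	fmt.Println(trySend(stopCh, "alloc-2")) // false
}
```

The second call fails because nothing has dequeued the first value, which is
the same situation as two allocations expiring in one timeout event.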

While fixing this, I also discovered that the `stop_on_client_after` and
heartbeat loops can synchronize in a pathological way that extends the
`stop_on_client_after` window. If a heartbeat fails close to the beginning of
the shortest `stop_on_client_after` window, the loop will end up waiting until
almost 2x the intended wait period.

While fixing both of those issues, I discovered that the existing tests had a
bug such that we were asserting that an allocrunner was being destroyed when it
had already exited.

This commit includes the following:
* Rework the watch loop so that we handle the stops in the same case as the
  timer expiration, rather than using a channel in the method scope.
* Remove the alloc intervals map field from the struct and keep it in the
  method scope, in order to discourage writing racy tests that read its value.
* Reset the timer whenever we receive a heartbeat, which forces the two
  intervals to synchronize correctly.
* Minor refactoring of the disconnect timeout lookup to improve brevity.

Fixes: https://github.com/hashicorp/nomad/issues/24679
Ref: https://hashicorp.atlassian.net/browse/NMD-407

2025-05-29 15:05:33 -04:00

test: Remove use of "mitchellh/go-testing-interface" for stdlib. (#25640 )

2025-04-14 07:43:49 +01:00

provide allocrunner hooks with prebuilt taskenv and fix mutation bugs (#25373 )

2025-03-24 12:05:04 -04:00

client: add once mode to template block (#25922 )

2025-05-28 11:45:11 -04:00

client: move 'waiting for previous alloc to terminate' log messages to info (#24804 )

2025-01-08 15:44:35 +01:00

host volumes: add configuration to GC on node GC (#25903 )

2025-05-27 10:22:08 -04:00

test: fix go 1.24 test complaints (#25346 )

2025-03-11 11:01:39 -05:00

fix multiple overflow errors in exponential backoff (#18200 )

2023-08-15 14:38:18 -04:00

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00

ci: Run core tests groups workflow on amd64 and arm64 runners. (#25695 )

2025-04-17 15:16:29 +01:00

metrics: prevent negative counter from iowait decrease (#18835 )

2023-10-24 09:58:25 -04:00

hostvolumemanager

dhv: mkdir plugin parameters: uid,guid,mode (#25533 )

2025-03-28 10:13:13 -05:00

jobspec: add a chown option to artifact block (#24157 )

2024-10-11 11:30:27 -05:00

client: close namespace file handle and defensively lazy unmount (#25714 )

2025-04-21 16:25:05 -04:00

plugins: validate logmon process during reattach (#24798 )

2025-01-08 08:50:33 -05:00

csi: fix CSI ExpandVolume stagingPath (#25253 )

2025-03-25 12:36:46 -05:00

test: Move client server manager tests to use must library. (#25569 )

2025-04-01 14:23:08 +01:00

serviceregistration

services: Support TLS Skip Verify within Nomad service checks. (#24781 )

2025-01-15 07:39:39 +00:00

dynamic host volumes: client state (#24595 )

2024-12-19 09:25:54 -05:00

dynamic host volumes: change env vars, fixup auto-delete (#24943 )

2025-01-27 10:36:53 -06:00

provide allocrunner hooks with prebuilt taskenv and fix mutation bugs (#25373 )

2025-03-24 12:05:04 -04:00

exec: Fix incorrect HOME and USER env variables for tasks that have user set (#25859 )

2025-05-16 15:02:45 +02:00

vault: Remove legacy token based authentication workflow. (#25155 )

2025-02-28 07:40:02 +00:00

agent: Fix misaligned contextual k/v logging arguments. (#25629 )

2025-04-10 14:40:21 +01:00

acl_test.go

client: unflake TestClient_ACL_ResolveToken_InvalidClaims (#25758 )

2025-04-25 14:53:09 +02:00

acl.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

agent_endpoint_test.go

Upgrade go-msgpack to v2 (#20173 )

2024-03-21 11:44:23 -07:00

agent_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

alloc_endpoint_test.go

Upgrade go-msgpack to v2 (#20173 )

2024-03-21 11:44:23 -07:00

alloc_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

alloc_watcher_e2e_test.go

auth: use ACLsDisabledACL when ACLs are disabled (#18754 )

2023-10-16 09:30:24 -04:00

client_interface_test.go

jobspec: time based task execution (#22201 )

2024-05-22 15:40:25 -05:00

client_stats_endpoint_test.go

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00

client_stats_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

client_test.go

tests: fixes a few data races in tests (#25455 )

2025-03-20 10:56:17 -07:00

client.go

host volumes: add configuration to GC on node GC (#25903 )

2025-05-27 10:22:08 -04:00

csi_endpoint_test.go

refactor: volume request modes to be generic between DHV/CSI (#24896 )

2025-01-24 10:37:48 -05:00

csi_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

drain_test.go

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00

drain.go

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00

driver_manager_test.go

fingerprint: add config option to disable dmidecode (#25108 )

2025-02-13 11:20:48 -05:00

enterprise_client_ce.go

admin: rename _oss files to _ce (#18209 )

2023-08-18 07:47:24 +01:00

fingerprint_manager_test.go

core: plumbing to support numa aware scheduling (#18681 )

2023-10-19 15:09:30 -05:00

fingerprint_manager.go

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00

fs_endpoint_test.go

Upgrade go-msgpack to v2 (#20173 )

2024-03-21 11:44:23 -07:00

fs_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

gc_test.go

client: fix client blocking during garbage collection (#25123 )

2025-03-19 14:32:46 -04:00

gc.go

client: fix client blocking during garbage collection (#25123 )

2025-03-19 14:32:46 -04:00

heartbeatstop_test.go

prevent client deadlock and incorrect timing on stop_on_client_after (#25946 )

2025-05-29 15:05:33 -04:00

heartbeatstop.go

prevent client deadlock and incorrect timing on stop_on_client_after (#25946 )

2025-05-29 15:05:33 -04:00

host_volume_endpoint_test.go

dynamic host volumes: change env vars, fixup auto-delete (#24943 )

2025-01-27 10:36:53 -06:00

host_volume_endpoint.go

dhv: mkdir plugin parameters: uid,guid,mode (#25533 )

2025-03-28 10:13:13 -05:00

meta_endpoint_test.go

client: remove null dynamic metadata keys (#18664 )

2023-10-05 11:41:44 -04:00

meta_endpoint.go

Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 )

2025-01-31 15:22:00 -05:00

node_updater.go

dynamic host volumes: unique volume name per node (#24748 )

2025-01-06 15:37:20 -06:00

rpc_test.go

reset max query time of blocking queries in client after retries (#25039 )

2025-02-07 08:45:56 -05:00

rpc.go

reset max query time of blocking queries in client after retries (#25039 )

2025-02-07 08:45:56 -05:00

testing.go

test: Remove use of "mitchellh/go-testing-interface" for stdlib. (#25640 )

2025-04-14 07:43:49 +01:00

util.go

Update copyright file headers to BUSL-1.1

2023-08-10 17:27:15 -05:00