Commit Graph

23 Commits

Tim Gross
77c8acb422 telemetry: fix excessive CPU consumption in executor (#25870)
Collecting metrics from processes is expensive, especially on platforms like
Windows. The executor code keeps a 5s cache of stats to ensure that we don't
thrash syscalls on nodes running many allocations. But the timestamp used to
calculate the TTL of this cache was never being set, so we always treated it
as expired. This caused excess CPU utilization on client nodes.

Ensure that when we fill the cache, we set the timestamp. In testing on Windows,
this reduces executor CPU overhead by roughly 75%.
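
A minimal sketch of the pattern, using hypothetical names rather than the real
`procstats` types: the cache only works if we stamp it when we fill it.

```go
package procstats

import (
	"sync"
	"time"
)

const statsCacheTTL = 5 * time.Second

// procUsage stands in for whatever per-process stats the executor collects.
type procUsage struct{ cpuPercent float64 }

type statsCache struct {
	lock      sync.Mutex
	stats     map[int]procUsage
	timestamp time.Time // never assigned before the fix, so the TTL always read as expired
}

// get returns the cached stats while they are fresh and re-collects otherwise.
func (c *statsCache) get(collect func() map[int]procUsage) map[int]procUsage {
	c.lock.Lock()
	defer c.lock.Unlock()
	if time.Since(c.timestamp) < statsCacheTTL {
		return c.stats // cache hit: skip the expensive per-process syscalls
	}
	c.stats = collect()      // expensive, especially on Windows
	c.timestamp = time.Now() // the fix: set the timestamp when filling the cache
	return c.stats
}
```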

This changeset includes two other related items:

* The `telemetry.publish_allocation_metrics` field correctly prevents a node
  from publishing metrics, but the stats hook on the taskrunner still collects
  the metrics, which can be expensive. Thread the configuration value into the
  stats hook so that we don't collect if `telemetry.publish_allocation_metrics =
  false` (see the sketch after this list).

* The `linuxProcStats` type in the executor's `procstats` package is misnamed as
  a result of a couple rounds of refactoring. It's used by all task executors,
  not just Linux. Rename this and move a comment about how Windows processes are
  listed so that the comment is closer to where the logic is implemented.
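
A rough sketch of the first item, gating collection on the telemetry setting;
the hook structure and field names are illustrative, not Nomad's exact
internals:

```go
package taskrunner

import (
	"context"
	"time"
)

type statsHook struct {
	publishMetrics bool // threaded in from telemetry.publish_allocation_metrics
}

func (h *statsHook) collectLoop(ctx context.Context, interval time.Duration) {
	if !h.publishMetrics {
		return // skip collection entirely when nothing will be published
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// ... gather and emit task resource usage ...
		}
	}
}
```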

Fixes: https://github.com/hashicorp/nomad/issues/23323
Fixes: https://hashicorp.atlassian.net/browse/NMD-455
2025-05-19 09:24:13 -04:00
Tim Gross
f00bff09f1 fix multiple overflow errors in exponential backoff (#18200)
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (e.g., in one place I tested, it occurs after
only 24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.
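
A minimal sketch of such a helper, with illustrative names; checking against
the cap before shifting means the shift itself can never overflow:

```go
package helper

import "time"

// Backoff returns base << attempt, capped at limit. For positive values,
// base<<attempt > limit is equivalent to base > limit>>attempt, and the
// right-shift form cannot overflow no matter how large attempt grows.
// The attempt >= 63 guard covers shifts that would exceed int64 outright.
func Backoff(base, limit time.Duration, attempt uint64) time.Duration {
	if attempt >= 63 || base > limit>>attempt {
		return limit
	}
	return base << attempt
}
```

Callers then compute, e.g., `Backoff(time.Second, time.Minute, attempt)` on
every iteration instead of shifting inline.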

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
2023-08-15 14:38:18 -04:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Mahmood Ali
354b2ee1a6 tests: deflake TestTaskRunner_StatsHook_Periodic (#9734)
This PR deflakes the TestTaskRunner_StatsHook_Periodic test and adds backoff when the driver closes the stats channel.

TestTaskRunner_StatsHook_Periodic is currently the flakiest test, failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .

This change has two components:

First, it updates the StatsHook so that it backs off when the stats channel is closed. In the context of the test, where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands of updates during the period. In a real context, if a driver doesn't implement the stats handler properly, or when a task finishes, we may generate far too many Stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .

Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting in a spurious update.
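
A simplified sketch of the first change, with illustrative names for the hook
internals: once the driver closes the channel, wait before requesting a new
one instead of spinning.

```go
package taskrunner

import (
	"context"
	"time"
)

type usage struct{} // stand-in for the driver's resource-usage sample

func collectLoop(ctx context.Context,
	stats func(context.Context) (<-chan *usage, error),
	publish func(*usage)) {

	const retryWait = time.Second
	for ctx.Err() == nil {
		ch, err := stats(ctx)
		if err != nil {
			return // error-path retries are a separate concern (see the retry-logic commit below)
		}
		for u := range ch {
			publish(u) // one update per sample emitted by the driver
		}
		// Channel closed (task exited, or the driver's stats handler is
		// misbehaving): back off instead of re-requesting in a tight loop.
		select {
		case <-ctx.Done():
		case <-time.After(retryWait):
		}
	}
}
```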
2021-01-06 16:03:00 -05:00
Kris Hicks
7747124ef0 Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Tim Gross
db4c88f71b stats_hook: log normal shutdown condition as debug, not error (#8028)
The `stats_hook` writes an Error log every time an allocation becomes
terminal. This is a normal condition, not an error; a real error condition,
such as a failure to collect the stats, is logged later. The Error log just
creates noise, and it makes for a particularly bad operator experience on
heavy batch workloads.
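
In hclog terms the change is roughly the following; `errTaskTerminal` and the
surrounding structure are hypothetical stand-ins for the hook's real logic:

```go
package taskrunner

import (
	"errors"

	"github.com/hashicorp/go-hclog"
)

var errTaskTerminal = errors.New("task is terminal") // hypothetical sentinel

func logStatsResult(logger hclog.Logger, err error) {
	switch {
	case err == nil:
		// happy path: nothing to log
	case errors.Is(err, errTaskTerminal):
		// normal shutdown of a finished allocation: debug, not error, so
		// heavy batch workloads don't flood operators with noise
		logger.Debug("stopped collecting stats", "reason", "task is terminal")
	default:
		logger.Error("failed to collect stats", "error", err)
	}
}
```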
2020-05-20 10:28:30 -04:00
Mahmood Ali
66bef39dd5 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali
bbf8f90ecb client/taskrunner: fix stats retry logic
Previously, if the channel was closed, we retried the Stats call. But if that
call failed, we went into a backoff loop without ever calling Stats again.

Here, we use a utility function for calling driverHandle.Stats that retries
as one would expect.

I aimed to preserve the logging formats but made small improvements as I saw fit.
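
The utility's shape, sketched with illustrative names: every pass through the
backoff loop re-invokes Stats, so a failed call can no longer strand us
waiting forever.

```go
package taskrunner

import (
	"context"
	"time"
)

type usage struct{} // stand-in for the driver's resource-usage sample

func callStatsWithRetry(ctx context.Context,
	stats func(context.Context) (<-chan *usage, error)) <-chan *usage {

	backoff := time.Second
	const limit = 30 * time.Second
	for {
		ch, err := stats(ctx) // retried on every iteration, as one would expect
		if err == nil {
			return ch
		}
		select {
		case <-ctx.Done():
			return nil
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > limit {
			backoff = limit
		}
	}
}
```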
2019-07-11 13:58:07 +08:00
Nick Ethier
f51b7f9942 tr: use context in a select statement 2019-01-22 20:11:39 -05:00
Michael Schurter
903779769d Update client/allocrunner/taskrunner/stats_hook.go
Co-Authored-By: nickethier <ncethier@gmail.com>
2019-01-14 12:31:27 -05:00
Nick Ethier
091cdbcb12 tr: stop stats collection on Exited hook 2019-01-14 12:30:14 -05:00
Nick Ethier
fc84e8f2bc tr: add retry w/ backoff to stats_hook failure 2019-01-12 12:18:24 -05:00
Nick Ethier
9904463da2 executor: fix failing stats related test 2019-01-12 12:18:23 -05:00
Nick Ethier
fbf9a4c772 executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
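
Sketched with illustrative type names (not the exact plugin interfaces), the
streaming shape this commit describes looks like:

```go
package drivers

import (
	"context"
	"time"
)

type TaskResourceUsage struct{} // stand-in for the real usage payload

// Instead of returning one sample per call, the driver returns a channel
// that emits samples on the given interval until ctx is canceled, and
// closes the channel when the task exits.
type DriverStats interface {
	TaskStats(ctx context.Context, taskID string, interval time.Duration) (<-chan *TaskResourceUsage, error)
}
```

The TODO items above (errors and closed channels during streaming) are what
the stats_hook backoff and retry commits higher up in this log address.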
2019-01-12 12:18:22 -05:00
Alex Dadgar
6bb99c93d0 Review comments 2019-01-07 14:50:28 -08:00
Alex Dadgar
b300306c4a comments 2019-01-07 14:49:40 -08:00
Alex Dadgar
3257eb6d86 Fix hooks 2019-01-07 14:49:40 -08:00
Alex Dadgar
437f03d877 recover 2019-01-07 14:49:40 -08:00
Danielle Tomlinson
756325bcbd client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson
bacd6175f5 client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Michael Schurter
31f113ba4d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
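
A condensed sketch of the blocking behavior, with illustrative names: signal
every runner, then wait for each Run loop to exit.

```go
package client

type allocRunner interface {
	WaitCh() <-chan struct{} // closed when the runner's Run loop returns
}

type Client struct {
	shutdownCh   chan struct{}
	allocRunners []allocRunner
}

// Shutdown blocks until every runner's Run loop has exited. The tasks those
// runners manage are deliberately left running.
func (c *Client) Shutdown() {
	close(c.shutdownCh) // tell AllocRunners and TaskRunners to stop
	for _, ar := range c.allocRunners {
		<-ar.WaitCh()
	}
}
```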
2018-11-19 16:39:30 -08:00
Alex Dadgar
3a492bb33f allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00