The recent change that moved stats collection to a "one-shot" Docker API
call did not update the stream boolean argument. As a result, the
PreCPUStats values are zero, which breaks the CPU calculations that rely on
this data. The base fix is to update the passed boolean parameter to match
the desired non-streaming behaviour: the non-streaming API call correctly
returns the PreCPUStats data, as shown by the added unit test and the
sketch below.
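As a minimal sketch of the distinction (not the driver's actual code; type
names follow recent versions of the official Docker SDK), a non-streaming
call with `stream=false` still yields a populated `PreCPUStats`, while the
one-shot endpoint does not:

```go
package stats

import (
	"context"
	"encoding/json"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// fetchStatsOnce takes a single, non-streaming stats sample. With
// stream=false the daemon samples twice, so PreCPUStats is populated; the
// one-shot endpoint returns a single sample with PreCPUStats zeroed, which
// is what broke the CPU delta calculation.
func fetchStatsOnce(ctx context.Context, cli *client.Client, id string) (*container.StatsResponse, error) {
	resp, err := cli.ContainerStats(ctx, id, false) // stream=false: non-streaming
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var s container.StatsResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return nil, err
	}
	// With PreCPUStats populated, this delta is meaningful rather than
	// "current minus zero".
	_ = s.CPUStats.CPUUsage.TotalUsage - s.PreCPUStats.CPUUsage.TotalUsage
	return &s, nil
}
```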
The most recent change also modified the behaviour of the collectStats
goroutine so that any error encountered causes the routine to exit. If the
error was transient, the container continues to run, but no stats are
collected until the task is stopped and replaced. This PR reverts that
behaviour: an error encountered during a stats collection run is logged,
and collection continues with a backoff, as sketched below.
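A rough sketch of the restored loop shape (function and helper names are
hypothetical): a failed collection is logged and retried with a crude capped
backoff, and only context cancellation stops the loop.

```go
package stats

import (
	"context"
	"time"

	"github.com/hashicorp/go-hclog"
)

// collectStats keeps collecting until the context is cancelled. Errors are
// logged and retried with a capped backoff instead of exiting the goroutine
// and silently losing stats for the rest of the task's life.
func collectStats(ctx context.Context, logger hclog.Logger, interval time.Duration, collect func(context.Context) error) {
	var attempt uint64
	timer := time.NewTimer(0) // fire immediately on the first pass
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
		}
		if err := collect(ctx); err != nil {
			attempt++
			// Capped backoff: 2s, 4s, 8s, 16s, 16s, ...
			logger.Debug("error collecting stats from container", "error", err)
			timer.Reset(time.Second << min(attempt, 4))
			continue
		}
		attempt = 0
		timer.Reset(interval)
	}
}
```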
In #23966 we switched to the official Docker SDK for the `docker` driver. In the
process we refactored code around stats collection to use the "one-shot" version
of stats. Unfortunately this "one-shot" stats collection does not include the
`PreCPU` stats, which are the stats from the previous read. This breaks the
calculation we use to determine CPU ticks, because now we're subtracting 0 from
the current value to get the delta.
Switch back to using the streaming stats collection. Add a test that fully
exercises the `TaskStats` API.
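Roughly, the streaming path holds one connection open and decodes successive
samples, each carrying `PreCPU` from the previous read (a sketch, not the
driver's exact code; type names follow recent SDK versions):

```go
package stats

import (
	"context"
	"encoding/json"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// streamStats decodes successive samples from a single streaming stats
// call. Every decoded sample includes PreCPUStats from the previous read,
// so the tick calculation always has a real delta.
func streamStats(ctx context.Context, cli *client.Client, id string, handle func(*container.StatsResponse)) error {
	resp, err := cli.ContainerStats(ctx, id, true) // stream=true
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)
	for {
		var s container.StatsResponse
		if err := dec.Decode(&s); err != nil {
			return err // io.EOF, or the body closing on ctx cancellation
		}
		handle(&s)
	}
}
```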
Fixes: https://github.com/hashicorp/nomad/issues/24224
Ref: https://hashicorp.atlassian.net/browse/NET-11348
In #23966, when we switched to using the official Docker SDK client, we had to
rework the stats collection loop for the new client. But we missed resetting the
timer on the collection loop, which meant that we'd only collect stats once and
then never again.
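In sketch form (hypothetical names), the fixed loop looks like this; without
the `Reset`, the timer fires once and the loop then blocks forever:

```go
package stats

import (
	"context"
	"time"
)

// runStatsLoop collects on every tick. The timer must be re-armed after
// each collection; the missed Reset meant stats were collected exactly once.
func runStatsLoop(ctx context.Context, interval time.Duration, collectOnce func()) {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			collectOnce()
			timer.Reset(interval) // the missing call
		}
	}
}
```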
* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=550814)
* This has shipped in Nomad 1.9.0-beta.1 but not in a production release yet.
This PR replaces the third-party fsouza/go-dockerclient Docker client
library with Docker's official SDK.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken into account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test split tables to be possible
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow after a
fairly small number of failures (e.g. in one place I tested, it occurs after
only 24 iterations), resulting in a negative backoff which never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.
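A sketch of that helper's shape (names assumed, not the exact
implementation): the overflow check happens before the shift, so the int64
backing time.Duration can never wrap negative.

```go
package helper

import (
	"math"
	"time"
)

// Backoff returns base << attempt, clamped to limit. The cap is checked
// before the bitshift: if shifting base by attempt could exceed what an
// int64 can hold, we return the limit instead of overflowing.
func Backoff(base, limit time.Duration, attempt uint64) time.Duration {
	if base <= 0 {
		return limit
	}
	// Guard the shift itself: shifting by 63+ bits, or shifting a base with
	// too little remaining headroom, would overflow.
	if attempt > 62 || base > time.Duration(math.MaxInt64)>>attempt {
		return limit
	}
	backoff := base << attempt
	if backoff > limit {
		return limit
	}
	return backoff
}
```

Call sites then only track the attempt counter; the helper owns all of the
overflow-sensitive arithmetic.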
Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
* drivers/docker: refactor use of clients in docker driver
This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients:
one with a hard-coded timeout that applies to all operations no matter what,
intended for short-lived / async calls to docker; the other has no timeout,
and it is the caller's responsibility to set a context that ensures the call
eventually terminates.
The use of these two clients has been confusing, and mistakes were made in a
number of places where calls used the wrong client.
This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.
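A hypothetical sketch of that shape (names invented; the concrete Docker
client type is stubbed out): each accessor encodes its timeout contract, so
a call site has to choose deliberately.

```go
package docker

import (
	"sync"
	"time"
)

// apiClient stands in for the real Docker client type (placeholder).
type apiClient struct {
	timeout time.Duration // zero means no client-side timeout
}

type driver struct {
	once       sync.Once
	infinite   *apiClient
	shortLived *apiClient
}

func (d *driver) init() {
	d.once.Do(func() {
		d.infinite = &apiClient{}                            // caller bounds calls via ctx
		d.shortLived = &apiClient{timeout: 30 * time.Second} // hard-coded cap on every call
	})
}

// dockerClient returns the client with no timeout; callers must bound each
// call with their own context so it eventually terminates.
func (d *driver) dockerClient() *apiClient {
	d.init()
	return d.infinite
}

// shortDockerClient returns the client whose operations all carry the
// hard-coded timeout, for short-lived / async calls.
func (d *driver) shortDockerClient() *apiClient {
	d.init()
	return d.shortLived
}
```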
Fixes#17023
* cr: followup items
This PR replaces use of time.After with a safe helper function that
creates a time.Timer to use instead. The new function returns both a
time.Timer and a Stop function that the caller must handle. Unlike a bare
time.Timer, the helper guards against durations <= 0 by clamping them to a
minimal positive value.
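A minimal sketch of that helper (assumed name, mirroring the description):

```go
package helper

import "time"

// NewSafeTimer returns a timer and a Stop function the caller should defer.
// Non-positive durations are clamped to the smallest positive value, which
// behaves close to time.After(0) while keeping the timer well defined.
func NewSafeTimer(d time.Duration) (*time.Timer, func()) {
	if d <= 0 {
		d = 1 // one nanosecond: fires almost immediately
	}
	t := time.NewTimer(d)
	return t, func() { t.Stop() }
}
```

Call sites replace `<-time.After(d)` with `t, stop := NewSafeTimer(d);
defer stop(); <-t.C`, which lets the timer be stopped and reclaimed promptly.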
destCh was being written to by one goroutine and closed by another
goroutine. This panic occurred in Travis:
```
=== FAIL: drivers/docker TestDockerCoordinator_ConcurrentPulls (117.66s)
=== PAUSE TestDockerCoordinator_ConcurrentPulls
=== CONT TestDockerCoordinator_ConcurrentPulls
panic: send on closed channel
goroutine 5358 [running]:
github.com/hashicorp/nomad/drivers/docker.dockerStatsCollector(0xc0003a4a20, 0xc0003a49c0, 0x3b9aca00)
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats.go:108 +0x167
created by github.com/hashicorp/nomad/drivers/docker.TestDriver_DockerStatsCollector
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats_test.go:33 +0x1ab
```
The two ways to fix this kind of error are to either (1) add extra
coordination around multiple goroutines writing to a chan, or (2) make it
so only one goroutine writes to a chan.
I implemented (2) first as it's simpler, but @notnoop pointed out that
since the same destCh is reused in the stats loop, a double-close panic is
now possible!
So this implements (1) by adding a *usageSender struct for handling
concurrent senders and closing.
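A sketch of that struct (the payload type is a placeholder): one mutex
guards both send and close, and close is idempotent, so neither "send on
closed channel" nor a double close is possible.

```go
package docker

import "sync"

// TaskResourceUsage is a placeholder for the real stats payload type.
type TaskResourceUsage struct{}

// usageSender serializes sends and close behind one mutex.
type usageSender struct {
	mu     sync.Mutex
	closed bool
	destCh chan<- *TaskResourceUsage
}

func (u *usageSender) send(tru *TaskResourceUsage) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.closed {
		return
	}
	select {
	case u.destCh <- tru:
	default:
		// Receiver is behind; drop the sample rather than block under lock.
	}
}

func (u *usageSender) close() {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.closed {
		return // idempotent: safe even when destCh is reused by the stats loop
	}
	u.closed = true
	close(u.destCh)
}
```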
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_usage_in_bytes` and friends. This number is closer to what
Docker reports.
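For illustration only (cgroup v1 file names; the helper is hypothetical),
the current figure and the high-water mark come from different files:

```go
package stats

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readMemBytes reads a single cgroup v1 memory counter.
func readMemBytes(cgroupDir, file string) (uint64, error) {
	raw, err := os.ReadFile(filepath.Join(cgroupDir, file))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

// usage is current consumption (close to what Docker reports);
// maxUsage is only the historical high-water mark.
//
//	usage, _ := readMemBytes(dir, "memory.usage_in_bytes")
//	maxUsage, _ := readMemBytes(dir, "memory.max_usage_in_bytes")
```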
Related to https://github.com/hashicorp/nomad/issues/5165.
plugins/driver: update driver interface to support streaming stats
client/tr: use streaming stats api
TODO:
* how to handle errors and closed channel during stats streaming
* prevent tight loop if Stats(ctx) returns an error
drivers: update drivers TaskStats RPC to handle streaming results
executor: better error handling in stats rpc
docker: better control and error handling of stats rpc
driver: allow stats to return a recoverable error
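Taken together, the interface ends up shaped roughly like this (a sketch:
the payload type is a placeholder and the real plugin API carries more
context around it):

```go
package drivers

import (
	"context"
	"time"
)

// TaskResourceUsage is a placeholder for the real stats struct.
type TaskResourceUsage struct{}

// DriverPlugin (excerpt): TaskStats returns a channel of samples instead of
// a single snapshot. The driver owns the channel and closes it when ctx is
// cancelled or an unrecoverable error occurs; recoverable errors should be
// surfaced to the caller without tearing down the stream.
type DriverPlugin interface {
	TaskStats(ctx context.Context, taskID string, interval time.Duration) (<-chan *TaskResourceUsage, error)
}
```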