The recent change that moved stats collection to a "one-shot" Docker API
call did not update the stream boolean argument. As a result, the
PreCPUStats values are zero, which breaks the CPU calculations that rely on
this data. The base fix is to update the passed boolean parameter to match
the desired non-streaming behaviour: the non-streaming API call correctly
returns the PreCPUStats data, as shown by the added unit test and the
sketch below.
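As a minimal sketch of the distinction (not the driver's actual code; type
names follow recent versions of the official Docker SDK), a non-streaming
call with `stream=false` still yields a populated `PreCPUStats`, while the
one-shot endpoint does not:

```go
package stats

import (
	"context"
	"encoding/json"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// fetchStatsOnce takes a single, non-streaming stats sample. With
// stream=false the daemon samples twice, so PreCPUStats is populated; the
// one-shot endpoint returns a single sample with PreCPUStats zeroed, which
// is what broke the CPU delta calculation.
func fetchStatsOnce(ctx context.Context, cli *client.Client, id string) (*container.StatsResponse, error) {
	resp, err := cli.ContainerStats(ctx, id, false) // stream=false: non-streaming
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var s container.StatsResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return nil, err
	}
	// With PreCPUStats populated, this delta is meaningful rather than
	// "current minus zero".
	_ = s.CPUStats.CPUUsage.TotalUsage - s.PreCPUStats.CPUUsage.TotalUsage
	return &s, nil
}
```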
The most recent change also modified the behaviour of the collectStats
goroutine so that any error encountered causes the routine to exit. If the
error was transient, the container continues to run, but no stats are
collected until the task is stopped and replaced. This PR reverts that
behaviour: an error encountered during a stats collection run is logged,
and collection continues with a backoff, as sketched below.
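A rough sketch of the restored loop shape (function and helper names are
hypothetical): a failed collection is logged and retried with a crude capped
backoff, and only context cancellation stops the loop.

```go
package stats

import (
	"context"
	"time"

	"github.com/hashicorp/go-hclog"
)

// collectStats keeps collecting until the context is cancelled. Errors are
// logged and retried with a capped backoff instead of exiting the goroutine
// and silently losing stats for the rest of the task's life.
func collectStats(ctx context.Context, logger hclog.Logger, interval time.Duration, collect func(context.Context) error) {
	var attempt uint64
	timer := time.NewTimer(0) // fire immediately on the first pass
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
		}
		if err := collect(ctx); err != nil {
			attempt++
			// Capped backoff: 2s, 4s, 8s, 16s, 16s, ...
			logger.Debug("error collecting stats from container", "error", err)
			timer.Reset(time.Second << min(attempt, 4))
			continue
		}
		attempt = 0
		timer.Reset(interval)
	}
}
```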
In #23966 we switched to the official Docker SDK for the `docker` driver. In the
process we refactored code around stats collection to use the "one-shot" version
of stats. Unfortunately this "one-shot" stats collection does not include the
`PreCPU` stats, which are the stats from the previous read. This breaks the
calculation we use to determine CPU ticks, because now we're subtracting 0 from
the current value to get the delta.
Switch back to using the streaming stats collection. Add a test that fully
exercises the `TaskStats` API.
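Roughly, the streaming path holds one connection open and decodes successive
samples, each carrying `PreCPU` from the previous read (a sketch, not the
driver's exact code; type names follow recent SDK versions):

```go
package stats

import (
	"context"
	"encoding/json"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// streamStats decodes successive samples from a single streaming stats
// call. Every decoded sample includes PreCPUStats from the previous read,
// so the tick calculation always has a real delta.
func streamStats(ctx context.Context, cli *client.Client, id string, handle func(*container.StatsResponse)) error {
	resp, err := cli.ContainerStats(ctx, id, true) // stream=true
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)
	for {
		var s container.StatsResponse
		if err := dec.Decode(&s); err != nil {
			return err // io.EOF, or the body closing on ctx cancellation
		}
		handle(&s)
	}
}
```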
Fixes: https://github.com/hashicorp/nomad/issues/24224
Ref: https://hashicorp.atlassian.net/browse/NET-11348
In #23966, when we switched to using the official Docker SDK client, we had to
rework the stats collection loop for the new client. But we missed resetting the
timer on the collection loop, which meant that we'd only collect stats once and
then never again.
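In sketch form (hypothetical names), the fixed loop looks like this; without
the `Reset`, the timer fires once and the loop then blocks forever:

```go
package stats

import (
	"context"
	"time"
)

// runStatsLoop collects on every tick. The timer must be re-armed after
// each collection; the missed Reset meant stats were collected exactly once.
func runStatsLoop(ctx context.Context, interval time.Duration, collectOnce func()) {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			collectOnce()
			timer.Reset(interval) // the missing call
		}
	}
}
```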
* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=550814)
* This has shipped in Nomad 1.9.0-beta.1 but not in a production release yet.
This PR replaces the third-party fsouza/go-dockerclient Docker client
library with Docker's official SDK.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken into account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test split tables to be possible
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow after a
fairly small number of failures (e.g. in one place I tested, it occurs after
only 24 iterations), resulting in a negative backoff which never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.
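A sketch of that helper's shape (names assumed, not the exact
implementation): the overflow check happens before the shift, so the int64
backing time.Duration can never wrap negative.

```go
package helper

import (
	"math"
	"time"
)

// Backoff returns base << attempt, clamped to limit. The cap is checked
// before the bitshift: if shifting base by attempt could exceed what an
// int64 can hold, we return the limit instead of overflowing.
func Backoff(base, limit time.Duration, attempt uint64) time.Duration {
	if base <= 0 {
		return limit
	}
	// Guard the shift itself: shifting by 63+ bits, or shifting a base with
	// too little remaining headroom, would overflow.
	if attempt > 62 || base > time.Duration(math.MaxInt64)>>attempt {
		return limit
	}
	backoff := base << attempt
	if backoff > limit {
		return limit
	}
	return backoff
}
```

Call sites then only track the attempt counter; the helper owns all of the
overflow-sensitive arithmetic.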
Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
* drivers/docker: refactor use of clients in docker driver
This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients:
one with a hard-coded timeout that applies to all operations no matter what,
intended for short-lived / async calls to docker; the other has no timeout,
and it is the caller's responsibility to set a context that ensures the call
eventually terminates.
The use of these two clients has been confusing, and mistakes were made in a
number of places where calls used the wrong client.
This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.
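A hypothetical sketch of that shape (names invented; the concrete Docker
client type is stubbed out): each accessor encodes its timeout contract, so
a call site has to choose deliberately.

```go
package docker

import (
	"sync"
	"time"
)

// apiClient stands in for the real Docker client type (placeholder).
type apiClient struct {
	timeout time.Duration // zero means no client-side timeout
}

type driver struct {
	once       sync.Once
	infinite   *apiClient
	shortLived *apiClient
}

func (d *driver) init() {
	d.once.Do(func() {
		d.infinite = &apiClient{}                            // caller bounds calls via ctx
		d.shortLived = &apiClient{timeout: 30 * time.Second} // hard-coded cap on every call
	})
}

// dockerClient returns the client with no timeout; callers must bound each
// call with their own context so it eventually terminates.
func (d *driver) dockerClient() *apiClient {
	d.init()
	return d.infinite
}

// shortDockerClient returns the client whose operations all carry the
// hard-coded timeout, for short-lived / async calls.
func (d *driver) shortDockerClient() *apiClient {
	d.init()
	return d.shortLived
}
```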
Fixes#17023
* cr: followup items
This PR replaces use of time.After with a safe helper function that
creates a time.Timer to use instead. The new function returns both a
time.Timer and a Stop function that the caller must handle. Unlike a bare
time.Timer, the helper guards against durations <= 0 by clamping them to a
minimal positive value.
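A minimal sketch of that helper (assumed name, mirroring the description):

```go
package helper

import "time"

// NewSafeTimer returns a timer and a Stop function the caller should defer.
// Non-positive durations are clamped to the smallest positive value, which
// behaves close to time.After(0) while keeping the timer well defined.
func NewSafeTimer(d time.Duration) (*time.Timer, func()) {
	if d <= 0 {
		d = 1 // one nanosecond: fires almost immediately
	}
	t := time.NewTimer(d)
	return t, func() { t.Stop() }
}
```

Call sites replace `<-time.After(d)` with `t, stop := NewSafeTimer(d);
defer stop(); <-t.C`, which lets the timer be stopped and reclaimed promptly.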
destCh was being written to by one goroutine and closed by another
goroutine. This panic occurred in Travis:
```
=== FAIL: drivers/docker TestDockerCoordinator_ConcurrentPulls (117.66s)
=== PAUSE TestDockerCoordinator_ConcurrentPulls
=== CONT TestDockerCoordinator_ConcurrentPulls
panic: send on closed channel
goroutine 5358 [running]:
github.com/hashicorp/nomad/drivers/docker.dockerStatsCollector(0xc0003a4a20, 0xc0003a49c0, 0x3b9aca00)
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats.go:108 +0x167
created by github.com/hashicorp/nomad/drivers/docker.TestDriver_DockerStatsCollector
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats_test.go:33 +0x1ab
```
The two ways to fix this kind of error are to either (1) add extra
coordination around multiple goroutines writing to a chan, or (2) make it
so only one goroutine writes to a chan.
I implemented (2) first as it's simpler, but @notnoop pointed out that
since the same destCh is reused in the stats loop, a double-close panic is
now possible!
So this implements (1) by adding a *usageSender struct for handling
concurrent senders and closing.
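A sketch of that struct (the payload type is a placeholder): one mutex
guards both send and close, and close is idempotent, so neither "send on
closed channel" nor a double close is possible.

```go
package docker

import "sync"

// TaskResourceUsage is a placeholder for the real stats payload type.
type TaskResourceUsage struct{}

// usageSender serializes sends and close behind one mutex.
type usageSender struct {
	mu     sync.Mutex
	closed bool
	destCh chan<- *TaskResourceUsage
}

func (u *usageSender) send(tru *TaskResourceUsage) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.closed {
		return
	}
	select {
	case u.destCh <- tru:
	default:
		// Receiver is behind; drop the sample rather than block under lock.
	}
}

func (u *usageSender) close() {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.closed {
		return // idempotent: safe even when destCh is reused by the stats loop
	}
	u.closed = true
	close(u.destCh)
}
```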
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_usage_in_bytes` and friends. This number is closer to what
Docker reports.
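For illustration only (cgroup v1 file names; the helper is hypothetical),
the current figure and the high-water mark come from different files:

```go
package stats

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readMemBytes reads a single cgroup v1 memory counter.
func readMemBytes(cgroupDir, file string) (uint64, error) {
	raw, err := os.ReadFile(filepath.Join(cgroupDir, file))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

// usage is current consumption (close to what Docker reports);
// maxUsage is only the historical high-water mark.
//
//	usage, _ := readMemBytes(dir, "memory.usage_in_bytes")
//	maxUsage, _ := readMemBytes(dir, "memory.max_usage_in_bytes")
```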
Related to https://github.com/hashicorp/nomad/issues/5165.
plugins/driver: update driver interface to support streaming stats
client/tr: use streaming stats api
TODO:
* how to handle errors and closed channel during stats streaming
* prevent tight loop if Stats(ctx) returns an error
drivers: update drivers TaskStats RPC to handle streaming results
executor: better error handling in stats rpc
docker: better control and error handling of stats rpc
driver: allow stats to return a recoverable error
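Taken together, the interface ends up shaped roughly like this (a sketch:
the payload type is a placeholder and the real plugin API carries more
context around it):

```go
package drivers

import (
	"context"
	"time"
)

// TaskResourceUsage is a placeholder for the real stats struct.
type TaskResourceUsage struct{}

// DriverPlugin (excerpt): TaskStats returns a channel of samples instead of
// a single snapshot. The driver owns the channel and closes it when ctx is
// cancelled or an unrecoverable error occurs; recoverable errors should be
// surfaced to the caller without tearing down the stream.
type DriverPlugin interface {
	TaskStats(ctx context.Context, taskID string, interval time.Duration) (<-chan *TaskResourceUsage, error)
}
```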