Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).
Since restoring is synchronous, start the registration goroutine first.
For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162
Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!
The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
Fixes#1795
Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.
This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
* client: was not using up-to-date client state in determining which alloc count towards allocated resources
* Update client/client.go
Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed upon retries when a single server
fails. Also made a drive by fix to call the periodic server shuffler goroutine.
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.
Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
I noticed that `watchNodeUpdates()` almost immediately after
`registerAndHeartbeat()` calls `retryRegisterNode()`, well after 5
seconds.
This call is unnecessary and made debugging a bit harder. So here, we
ensure that we only re-register node for new node events, not for
initial registration.
Here we retain 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout. This is needed for system jobs,
as system job scheduling for node occur at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.
The timeout isn't strictly necessary, but raising it to 1 minute as it's
closer to indefinitely blocked than 1 second. We need to keep the value
high enough to capture as much drivers/devices, but low enough that
doesn't risk blocking too long due to misbehaving plugin.
Fixes https://github.com/hashicorp/nomad/issues/5579
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f.
Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993.
Revert "structs missed applying a style change from the review"
This reverts commit 658916e327.
Revert "client, structs comments"
This reverts commit be2838d6ba.
Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430.
Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9.
Revert "client Networks Equals is set equality"
This reverts commit f8d432345b.
Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411ca.
Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665e.
Revert "fix client-test, avoid hardwired platform dependecy on lo0"
This reverts commit e89dbb2ab1.
Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6.
Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e.
Revert "refactor struts.RequestedDevice to have its own Equals"
This reverts commit 6897825240.
Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77b.
Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb.
Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace52.
Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6.
Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23.
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.
Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.
This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.
This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.
This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
Fixes a bug where a driver health and attributes are never updated from
their initial status. If a driver started unhealthy, it may never go
into a healthy status.