7 Commits

Author SHA1 Message Date
Derek Strickland
bc03aadf3b csi_hook: valid if any driver supports csi (#13446)
* csi_hook: valid if any driver supports csi volumes
2022-06-22 10:43:43 -04:00
Tim Gross
a8d5e5e7a3 CSI: don't block client shutdown for node unmount (#12457)
When we unmount a volume we need to be able to recover from cases
where the plugin has been shut down before the allocation that needs
it, so in #11892 we blocked shutting down the alloc runner hook. But
this blocks client shutdown if we're in the middle of unmounting. The
client won't be able to communicate with the plugin or send the
unpublish RPC anyway, so we should cancel the context and assume that
we'll resume the unmounting process when the client restarts.

For `-dev` mode we don't call the graceful `Shutdown()` method and
instead destroy all the allocations. In this case, we'll never be able
to communicate with the plugin but also never close the context we
need to prevent the hook from blocking. To fix this, move the retries
into their own goroutine that doesn't block the main `Postrun`.
2022-04-05 13:05:10 -04:00
Seth Hoenig
6f37b28b87 cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Tim Gross
649f1e3967 CSI: retry claims from client when max claims are reached (#12113)
When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry this error until we succeed or until a very long timeout
expires, to give operators a chance to recover broken plugins.

Make the alloc runner hook tolerant of temporary RPC failures.
2022-02-24 10:39:07 -05:00
Tim Gross
8364eda1d7 CSI: node unmount from the client before unpublish RPC (#11892)
When an allocation stops, the `csi_hook` sends an unpublish RPC to the
servers, which unpublish via the CSI RPCs: first to the node plugins and
then the controller plugins. The controller RPCs must happen after the
node RPCs so that the node has had a chance to unmount the volume
before the controller tries to detach the associated device.

But the client has local access to the node plugins and can
independently determine if it's safe to send unpublish RPCs to those
plugins. This will allow the server to treat the node plugin as
abandoned if a client is disconnected and `stop_on_client_disconnect`
is set. This will let the server try to send unpublish RPCs to the
controller plugins, under the assumption that the client will be
trying to unmount the volume on its end first.

Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can
return ignorable errors in the case where the volume has already been
unmounted from the node. Handle all other errors by retrying until we
get success so as to give operators the opportunity to reschedule a
failed node plugin (e.g. in the case where they accidentally drained a
node without `-ignore-system`). Fan out the work for each volume into
its own goroutine so that we can release a subset of volumes if only
one is stuck.
2022-01-28 08:30:31 -05:00
Tim Gross
d27b1370ae CSI: tests to exercise csi_hook (#11788)
Small refactoring of the allocrunner hook for CSI to make it more
testable, and a unit test that covers most of its logic.
2022-01-07 15:23:47 -05:00