113 Commits

Author SHA1 Message Date
Charlie Voiselle
2c1dcc8cd2 Use ExternalID in NodeStageVolume RPC (#7754) 2020-04-20 17:13:46 -04:00
Tim Gross
a11fb6a323 csi: fix unpublish workflow ID mismatches
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.

The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
2020-04-06 10:15:55 -04:00
Tim Gross
414caf76e5 CSI: move node unmount to server-driven RPCs (#7596)
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.

This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
2020-04-02 16:04:56 -04:00
Lang Martin
bc750d8bb0 csi: add node events to report progress mounting and unmounting volumes (#7547)
* nomad/structs/structs: new NodeEventSubsystemCSI

* client/client: pass triggerNodeEvent in the CSIConfig

* client/pluginmanager/csimanager/instance: add eventer to instanceManager

* client/pluginmanager/csimanager/manager: pass triggerNodeEvent

* client/pluginmanager/csimanager/volume: node event on [un]mount

* nomad/structs/structs: use storage, not CSI

* client/pluginmanager/csimanager/volume: use storage, not CSI

* client/pluginmanager/csimanager/volume_test: eventer

* client/pluginmanager/csimanager/volume: event on error

* client/pluginmanager/csimanager/volume_test: check event on error

* command/node_status: remove an extra space in event detail format

* client/pluginmanager/csimanager/volume: use snake_case for details

* client/pluginmanager/csimanager/volume_test: snake_case details
2020-03-31 17:13:52 -04:00
Tim Gross
74e5c90b42 csi: annotate remaining missing cancellation contexts (#7552) 2020-03-30 16:46:43 -04:00
Tim Gross
ffa13adf90 csi: add grpc retries to client controller RPCs (#7549)
The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset:

* adds retries and backoffs to the where they were needed but not implemented
* annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally
* added a timeout and cancellation context to the `Probe` call, which didn't have one.
2020-03-30 16:26:03 -04:00
Lang Martin
1bef8b8879 csi: add mount_options to volumes and volume requests (#7398)
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.

Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.

closes #7007

* nomad/structs/volumes: add MountOptions to volume request

* jobspec/test-fixtures/basic.hcl: add mount_options to volume block

* jobspec/parse_test: add expected MountOptions

* api/tasks: add mount_options

* jobspec/parse_group: use hcl decode not mapstructure, mount_options

* client/allocrunner/csi_hook: pass MountOptions through

client/allocrunner/csi_hook: add a VolumeMountOptions

client/allocrunner/csi_hook: drop Options

client/allocrunner/csi_hook: use the structs options

* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions

* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities

* plugins/csi/plugin: remove todo 7007 comment

* nomad/structs/csi: MountOptions

* api/csi: add options to the api for parsing, match structs

* plugins/csi/plugin: move VolumeMountOptions to structs

* api/csi: use specific type for mount_options

* client/allocrunner/csi_hook: merge MountOptions here

* rename CSIOptions to CSIMountOptions

* client/allocrunner/csi_hook

* client/pluginmanager/csimanager/volume

* nomad/structs/csi

* plugins/csi/fake/client: add PrevVolumeCapability

* plugins/csi/plugin

* client/pluginmanager/csimanager/volume_test: remove debugging

* client/pluginmanager/csimanager/volume: fix odd merging logic

* api: rename CSIOptions -> CSIMountOptions

* nomad/csi_endpoint: remove a 7007 comment

* command/alloc_status: show mount options in the volume list

* nomad/structs/csi: include MountOptions in the volume stub

* api/csi: add MountOptions to stub

* command/volume_status_csi: clean up csiVolMountOption, add it

* command/alloc_status: csiVolMountOption lives in volume_csi_status

* command/node_status: display mount flags

* nomad/structs/volumes: npe

* plugins/csi/plugin: npe in ToCSIRepresentation

* jobspec/parse_test: expand volume parse test cases

* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions

* command/volume_status_csi: copy paste error

* jobspec/test-fixtures/basic: hclfmt

* command/volume_status_csi: clean up csiVolMountOption
2020-03-23 13:59:25 -04:00
Tim Gross
a280cf06eb csi: stub fingerprint on instance manager shutdown (#7388)
Run the plugin fingerprint one last time with a closed client during
instance manager shutdown. This will return quickly and will give us a
correctly-populated `PluginInfo` marked as unhealthy so the Nomad
client can update the server about plugin health.
2020-03-23 13:59:25 -04:00
Tim Gross
0f9983f230 csi: dynamically update plugin registration (#7386)
Allow for faster updates to plugin status when allocations become
terminal by listening for register/deregister events from the dynamic
plugin registry (which in turn are triggered by the plugin supervisor
hook).

The deregistration function closures that we pass up to the CSI plugin
manager don't properly close over the name and type of the
registration, causing monolith-type plugins to deregister only one of
their two plugins on alloc shutdown. Rebind plugin supervisor 
deregistration targets to fix that.

Includes log message and comment improvements
2020-03-23 13:59:25 -04:00
Tim Gross
42323c41d9 csi: add dynamicplugins registry to client state store (#7330)
In order to correctly fingerprint dynamic plugins on client restarts,
we need to persist a handle to the plugin (that is, connection info)
to the client state store.

The dynamic registry will sync automatically to the client state
whenever it receives a register/deregister call.
2020-03-23 13:58:30 -04:00
Lang Martin
13dea1e5bb csi: use ExternalID, when set, to identify volumes for outside RPC calls (#7326)
* nomad/structs/csi: new RemoteID() uses the ExternalID if set

* nomad/csi_endpoint: pass RemoteID to volume request types

* client/pluginmanager/csimanager/volume: pass RemoteID to NodePublishVolume
2020-03-23 13:58:30 -04:00
Tim Gross
17e3e882b8 csi: docstring and log message fixups (#7327)
Fix some docstring typos and fix noisy log message during client restarts.
A log for the common case where the plugin socket isn't ready yet
isn't actionable by the operator so having it at info is just noise.
2020-03-23 13:58:30 -04:00
Tim Gross
d6c9952d84 csi: add Provider field to CSI CLIs and APIs (#7285)
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
2020-03-23 13:58:30 -04:00
Lang Martin
056a1dc2ee csi add allocation context to fingerprinting results (#7133)
* structs: CSIInfo include AllocID, CSIPlugins no Jobs

* state_store: eliminate plugin Jobs, delete an empty plugin

* nomad/structs/csi: detect empty plugins correctly

* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID

* client/pluginmanager/csimanager/instance: allocID

* client/pluginmanager/csimanager/fingerprint: set AllocID

* client/node_updater: split controller and node plugins

* api/csi: remove Jobs

The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.

Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.

* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint

* client/dynamicplugins: lift AllocID into the struct from Options

* api/csi_test: remove Jobs test

* nomad/structs/csi: CSIPlugins has an array of allocs

* nomad/state/state_store: implement CSIPluginDenormalize

* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc

* nomad/csi_endpoint_test: defer deleteNodes for clarity

* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
2020-03-23 13:58:30 -04:00
Danielle Lancashire
5cbcabb343 csimanager/volume: Update MountVolume docstring 2020-03-23 13:58:30 -04:00
Danielle Lancashire
4a105a6679 csi: Claim CSI Volumes during csi_hook.Prerun
This commit is the initial implementation of claiming volumes from the
server and passes through any publishContext information as appropriate.

There's nothing too fancy here.
2020-03-23 13:58:30 -04:00
Danielle Lancashire
d95858dc47 csi: Basic volume usage tracking 2020-03-23 13:58:30 -04:00
Danielle Lancashire
6c0cbbb96d csi: Add comment to UsageOptions.ToFS() 2020-03-23 13:58:30 -04:00
Danielle Lancashire
ae4f59746c csi: Move VolumeCapabilties helper to package 2020-03-23 13:58:30 -04:00
Danielle Lancashire
4fe9c74f28 csi: Pass through usage options to the csimanager
The CSI Spec requires us to attach and stage volumes based on different
types of usage information when it may effect how they are bound. Here
we pass through some basic usage options in the CSI Hook (specifically
the volume aliases ReadOnly field), and the attachment/access mode from
the volume. We pass the attachment/access mode seperately from the
volume as it simplifies some handling and doesn't necessarily force
every attachment to use the same mode should more be supported (I.e if
we let each `volume "foo" {}` specify an override in the future).
2020-03-23 13:58:30 -04:00
Danielle Lancashire
01b69ef7bb csi: Unpublish volumes during ar.Postrun
This commit introduces initial support for unmounting csi volumes.

It takes a relatively simplistic approach to performing
NodeUnpublishVolume calls, optimising for cleaning up any leftover state
rather than terminating early in the case of errors.

This is because it happens during an allocation's shutdown flow and may
not always have a corresponding call to `NodePublishVolume` that
succeeded.
2020-03-23 13:58:30 -04:00
Danielle Lancashire
5291c3acd0 csi: Fix broken call to newVolumeManager 2020-03-23 13:58:29 -04:00
Danielle Lancashire
b3a1110c50 csi: Provide plugin-scoped paths during RPCs
When providing paths to plugins, the path needs to be in the scope of
the plugins container, rather than that of the host.

Here we enable that by providing the mount point through the plugin
registration and then use it when constructing request target paths.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
30527e3e51 csimanager: Cleanup volumemanager setup 2020-03-23 13:58:29 -04:00
Danielle Lancashire
c8e21f4547 csimanager: Instantiate fingerprint manager's csiclient 2020-03-23 13:58:29 -04:00
Danielle Lancashire
5d9f560f59 volume_manager: cleanup of mount detection
No functional changes, but makes ensure.*Dir follow a nicer return
style.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
bead8dc8fc volume_manager: Add support for publishing volumes 2020-03-23 13:58:29 -04:00
Danielle Lancashire
2e3599d35c volume_manager: Initial support for unstaging volumes 2020-03-23 13:58:29 -04:00
Danielle Lancashire
9bc0ff16f6 volume_manager: NodeStageVolume Support
This commit introduces support for staging volumes when a plugin
implements the STAGE_UNSTAGE_VOLUME capability.

See the following for further reference material:
 4731db0e0b/spec.md (nodestagevolume)
2020-03-23 13:58:29 -04:00
Danielle Lancashire
6b28db98b6 volume_manager: Introduce helpers for staging
This commit adds helpers that create and validate the staging directory
for a given volume. It is currently missing usage options as the
interfaces are not yet in place for those.

The staging directory is only required when a volume has the
STAGE_UNSTAGE Volume capability and has to live within the plugin root
as the plugin needs to be able to create mounts inside it from within
the container.
2020-03-23 13:58:29 -04:00
Lang Martin
5c26cdf08b csi: pluginmanager use PluginID instead of Driver 2020-03-23 13:58:29 -04:00
Danielle Lancashire
1250d56333 csi: Add VolumeManager (#6920)
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.

It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.

It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
cd0c2a6df0 csi: Setup gRPC Clients with a logger 2020-03-23 13:58:29 -04:00
Danielle Lancashire
3f36dae246 csimanager: Fingerprint Node Service capabilities 2020-03-23 13:58:29 -04:00
Danielle Lancashire
406984ca8d csimanager: Fingerprint controller capabilities 2020-03-23 13:58:29 -04:00
Danielle Lancashire
d296efd2c6 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Nick Ethier
64b7c91538 drivermanager: attempt dispense on reattachment failure 2020-02-15 00:50:06 -05:00
Mahmood Ali
0cfea7d978 client: don't retry fingerprinting on shutdown
At shutdown, driver manager context expires and the fingerprinting
channel closes.  Thus it is undeterministic which clause of The select
statement gets executed, and we may keep retrying until the
`i.ctx.Done()` block is executed.

Here, we check always check ctx expiration before retrying again.
2019-10-21 08:54:11 -04:00
Mahmood Ali
979a6a1778 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Mahmood Ali
9a2f46f332 client: log detected driver health state
Noticed that `detected drivers` log line was misleading - when a driver
doesn't fingerprint before timeout, their health status is empty string
`""` which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Preetha Appan
8a24660941 s/mananger/manager 2019-03-04 12:25:54 -06:00
Michael Schurter
4119767be8 fingerprint: improve initial fingerpint message
The initial fingerprint message is actually fairly useful, so I bumped
it to Debug and fixed the output formatting.
2019-02-21 15:32:18 -08:00
Nick Ethier
1400012cbc drivermanager: don't store nil reattach configs 2019-01-25 23:07:04 -05:00
Michael Schurter
158c74887e goimports until make check is happy 2019-01-23 06:27:14 -08:00
Michael Schurter
0d61ff0fb9 move pluginutils -> helper/pluginutils
I wanted a different color bikeshed, so I get to paint it
2019-01-22 15:50:08 -08:00
Alex Dadgar
2d23f4a038 move reattach config 2019-01-22 15:11:58 -08:00
Alex Dadgar
c19cd2e5cf loader and singleton 2019-01-22 15:11:57 -08:00
Alex Dadgar
b9f36134dc move catalog + grpcutils 2019-01-22 15:11:57 -08:00
Michael Schurter
838c8fcc75 Merge pull request #5034 from hashicorp/test-fix-races
Test fix races
2019-01-08 07:04:09 -08:00
Danielle Tomlinson
994057abf4 drivers: Add internal interface for Shutdown
This allows us to correctly terminate internal state during runs of the
nomad test suite, e.g closing eventer contexts correctly.
2019-01-08 13:48:49 +01:00