Commit Graph

327 Commits

Author SHA1 Message Date
Seth Hoenig
abd38b3a86 consul/connect: fixup some comments and context timeout 2020-08-26 13:17:16 -05:00
Seth Hoenig
db8020f4eb consul/connect: fixup tests to use new consul sdk 2020-08-24 12:02:41 -05:00
Seth Hoenig
9ffdeed904 consul/connect: add initial support for ingress gateways
This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza.

```hcl
service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      ingress {
        // ingress-gateway configuration entry
      }
    }
  }
}
```

A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value.

Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver.

Aims to address #8294 and tangentially #8647
2020-08-21 16:21:54 -05:00
Nick Ethier
c11dbcd001 docker: support group allocated ports and host_networks (#8623)
* docker: support group allocated ports

* docker: add new ports driver config to specify which group ports are mapped

* docker: update port mapping docs
2020-08-11 18:30:22 -04:00
Seth Hoenig
8ec3aa1716 consul/connect: add support for bridge networks with connect native tasks
Before, Connect Native Tasks needed one of these to work:

- To be run in host networking mode
- To have the Consul agent configured to listen to a unix socket
- To have the Consul agent configured to listen to a public interface

None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.

Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.

Fixes #8290
2020-07-29 09:26:01 -05:00
Drew Bailey
19810365f6 oss compoments for multi-vault namespaces
adds in oss components to support enterprise multi-vault namespace feature

upgrade specific doc on vault multi-namespaces

vault docs

update test to reflect new error
2020-07-24 10:14:59 -04:00
Seth Hoenig
ef52328eb4 connect/native: doc and comment tweaks from PR 2020-06-24 10:13:22 -05:00
Seth Hoenig
cd3ec2f783 connect/native: check for pre-existing consul token 2020-06-24 09:16:28 -05:00
Seth Hoenig
193e355ec3 connect/native: update connect native hook tests 2020-06-23 12:07:35 -05:00
Seth Hoenig
a87130c0e8 connect/native: give tls files an extension 2020-06-23 12:06:28 -05:00
Seth Hoenig
7e8d5c2392 consul/connect: add support for running connect native tasks
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.

The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.

There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.

If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.

Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.

Fixes #6083
2020-06-22 14:07:44 -05:00
Nick Ethier
e9ff8a8daa Task DNS Options (#7661)
Co-Authored-By: Tim Gross <tgross@hashicorp.com>
Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>
2020-06-18 11:01:31 -07:00
Tim Gross
8860b72bc3 volumes: return better error messages for unsupported task drivers (#8030)
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.

This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.

Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.

Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
2020-05-21 09:18:02 -04:00
Tim Gross
db4c88f71b stats_hook: log normal shutdown condition as debug, not error (#8028)
The `stats_hook` writes an Error log every time an allocation becomes
terminal. This is a normal condition, not an error. A real error
condition like a failure to collect the stats is logged later. It just
creates log noise, and this is a particularly bad operator experience
for heavy batch workloads.
2020-05-20 10:28:30 -04:00
Mahmood Ali
fcddfa4971 Update hcl2 vendoring
The hcl2 library has moved from http://github.com/hashicorp/hcl2 to https://github.com/hashicorp/hcl/tree/hcl2.

This updates Nomad's vendoring to start using hcl2 library.  Also
updates some related libraries (e.g. `github.com/zclconf/go-cty/cty` and
`github.com/apparentlymart/go-textseg`).
2020-05-19 15:00:03 -04:00
Tim Gross
eb2f77a011 csi: use a blocking initial connection with timeout
The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).

This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
2020-05-14 15:59:19 -04:00
Mahmood Ali
8e655086ed Deflake TestTaskTemplateManager_BlockedEvents test
This change deflakes TestTaskTemplateManager_BlockedEvents test, because
it is expecting a number of events without accounting for transitional
state.

The test TestTaskTemplateManager_BlockedEvents attempts to ensure that a
template rendering emits blocked events for missing template ksys.

It works by setting a template that requires keys 0,1,2,3,4 and then
eventually sets keys 0,1,2,3 and ensures that we get a final event indicating
that keys 3 and 4 are still missing.

The test waits to get a blocked event for the final state, but it can
fail if receives a blocked event for a transitional state (e.g. one
reporting 2,3,4,5 are missing).

This fixes the test by ensuring that it waits until the final message
before assertion.

Also, it clarifies the intent of the test with stricter assertions and
additional comments.
2020-05-09 14:09:39 -04:00
Drew Bailey
3af2d05f6b Run task shutdown_delay regardless of service registration
task shutdown_delay will currently only run if there are registered
services for the task. This implementation detail isn't explicity stated
anywhere and is defined outside of the service stanza.

This change moves shutdown_delay to be evaluated after prekill hooks are
run, outside of any task runner hooks.

just use time.sleep
2020-04-10 11:06:26 -04:00
Nick Ethier
9df9e5e122 Merge pull request #7600 from hashicorp/b-5767
tr/service_hook: prevent Update from running before Poststart finish
2020-04-06 16:52:42 -04:00
Nick Ethier
8a8bd9b02d tr/service_hook: reset initialized flag during deregister 2020-04-06 16:05:36 -04:00
Nick Ethier
d4a3524064 tr/service_hook: update hook fields during update when poststart hasn't finished 2020-04-02 12:48:19 -04:00
Seth Hoenig
fb0bd3c25f connect: set consul TLS options on envoy bootstrap
Fixes #6594 #6711 #6714 #7567

e2e testing is still TBD in #6502

Before, we only passed the Nomad agent's configured Consul HTTP
address onto the `consul connect envoy ...` bootstrap command.
This meant any Consul setup with TLS enabled would not work with
Nomad's Connect integration.

This change now sets CLI args and Environment Variables for
configuring TLS options for communicating with Consul when doing
the envoy bootstrap, as described in
https://www.consul.io/docs/commands/connect/envoy.html#usage
2020-04-02 10:30:50 -06:00
Nick Ethier
88438e8982 tr/service_hook: prevent Update from running before Poststart has finished 2020-04-02 12:17:36 -04:00
Mahmood Ali
d155e4d412 tests: restart restartpolicy for all tasks in tests 2020-03-24 21:52:48 -04:00
Mahmood Ali
df08c6c399 tests: populate task restart policy properly 2020-03-24 21:44:37 -04:00
Mahmood Ali
c55f3ed084 per-task restart policy 2020-03-24 17:00:41 -04:00
Tim Gross
0f9983f230 csi: dynamically update plugin registration (#7386)
Allow for faster updates to plugin status when allocations become
terminal by listening for register/deregister events from the dynamic
plugin registry (which in turn are triggered by the plugin supervisor
hook).

The deregistration function closures that we pass up to the CSI plugin
manager don't properly close over the name and type of the
registration, causing monolith-type plugins to deregister only one of
their two plugins on alloc shutdown. Rebind plugin supervisor 
deregistration targets to fix that.

Includes log message and comment improvements
2020-03-23 13:59:25 -04:00
Tim Gross
8b38fc2183 volumes: add task environment interpolation to volume_mount (#7364) 2020-03-23 13:59:25 -04:00
Tim Gross
17e3e882b8 csi: docstring and log message fixups (#7327)
Fix some docstring typos and fix noisy log message during client restarts.
A log for the common case where the plugin socket isn't ready yet
isn't actionable by the operator so having it at info is just noise.
2020-03-23 13:58:30 -04:00
Tim Gross
d6c9952d84 csi: add Provider field to CSI CLIs and APIs (#7285)
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
2020-03-23 13:58:30 -04:00
Lang Martin
056a1dc2ee csi add allocation context to fingerprinting results (#7133)
* structs: CSIInfo include AllocID, CSIPlugins no Jobs

* state_store: eliminate plugin Jobs, delete an empty plugin

* nomad/structs/csi: detect empty plugins correctly

* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID

* client/pluginmanager/csimanager/instance: allocID

* client/pluginmanager/csimanager/fingerprint: set AllocID

* client/node_updater: split controller and node plugins

* api/csi: remove Jobs

The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.

Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.

* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint

* client/dynamicplugins: lift AllocID into the struct from Options

* api/csi_test: remove Jobs test

* nomad/structs/csi: CSIPlugins has an array of allocs

* nomad/state/state_store: implement CSIPluginDenormalize

* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc

* nomad/csi_endpoint_test: defer deleteNodes for clarity

* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
2020-03-23 13:58:30 -04:00
Danielle Lancashire
08ad2b97e4 csi: Add /dev mounts to CSI Plugins
CSI Plugins that manage devices need not just access to the CSI
directory, but also to manage devices inside `/dev`.

This commit introduces a `/dev:/dev` mount to the container so that they
may do so.
2020-03-23 13:58:30 -04:00
Danielle Lancashire
8a8ed6036d taskrunner/volume_hook: Cleanup arg order of prepareHostVolumes 2020-03-23 13:58:30 -04:00
Danielle Lancashire
9e1d99e49b taskrunner/volume_hook: Mounts for CSI Volumes
This commit implements support for creating driver mounts for CSI
Volumes.

It works by fetching the created mounts from the allocation resources
and then iterates through the volume requests, creating driver mount
configs as required.

It's a little bit messy primarily because there's _so_ much terminology
overlap and it's a bit difficult to follow.
2020-03-23 13:58:30 -04:00
Danielle Lancashire
01ff8960b5 volume_hook: Loosen validation in host volume prep 2020-03-23 13:58:30 -04:00
Danielle Lancashire
7d044a340f allocrunner: Push state from hooks to taskrunners
This commit is an initial (read: janky) approach to forwarding state
from an allocrunner hook to a taskrunner using a similar `hookResources`
approach that tr's use internally.

It should eventually probably be replaced with something a little bit
more message based, but for things that only come from pre-run hooks,
and don't change, it's probably fine for now.
2020-03-23 13:58:30 -04:00
Danielle Lancashire
b3a1110c50 csi: Provide plugin-scoped paths during RPCs
When providing paths to plugins, the path needs to be in the scope of
the plugins container, rather than that of the host.

Here we enable that by providing the mount point through the plugin
registration and then use it when constructing request target paths.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
1250d56333 csi: Add VolumeManager (#6920)
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.

It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.

It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
cd0c2a6df0 csi: Setup gRPC Clients with a logger 2020-03-23 13:58:29 -04:00
Danielle Lancashire
d296efd2c6 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Mahmood Ali
83b08ab158 tr: proceed to mark other tasks as dead if alloc fails 2020-03-21 17:52:58 -04:00
Mahmood Ali
4558fa6aec fix test 2020-03-21 17:52:57 -04:00
Jasmine Dahilig
6c1474398f change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig
dcd317745d fix restart policy for system jobs with no lifecycle 2020-03-21 17:52:56 -04:00
Jasmine Dahilig
90fa242d83 fix failing ci test: TestTaskRunner_UnregisterConsul_Retries 2020-03-21 17:52:54 -04:00
Jasmine Dahilig
c6cd7b523b clean up restart conditions and restart tests for task lifecycle 2020-03-21 17:52:50 -04:00
Jasmine Dahilig
262d204096 incorporate lifecycle into restart tracker 2020-03-21 17:52:40 -04:00
Mahmood Ali
5377b4cb58 Add a coordinator for alloc runners 2020-03-21 17:52:38 -04:00
Fredrik Hoem Grelland
26cca14f27 Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Mahmood Ali
a3b0b25acb update rest of consul packages 2020-02-16 16:25:04 -06:00