Commit Graph

290 Commits

Author SHA1 Message Date
Danielle Lancashire
1250d56333 csi: Add VolumeManager (#6920)
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.

It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.

It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
2020-03-23 13:58:29 -04:00
Danielle Lancashire
cd0c2a6df0 csi: Setup gRPC Clients with a logger 2020-03-23 13:58:29 -04:00
Danielle Lancashire
d296efd2c6 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Mahmood Ali
83b08ab158 tr: proceed to mark other tasks as dead if alloc fails 2020-03-21 17:52:58 -04:00
Mahmood Ali
4558fa6aec fix test 2020-03-21 17:52:57 -04:00
Jasmine Dahilig
6c1474398f change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig
dcd317745d fix restart policy for system jobs with no lifecycle 2020-03-21 17:52:56 -04:00
Jasmine Dahilig
90fa242d83 fix failing ci test: TestTaskRunner_UnregisterConsul_Retries 2020-03-21 17:52:54 -04:00
Jasmine Dahilig
c6cd7b523b clean up restart conditions and restart tests for task lifecycle 2020-03-21 17:52:50 -04:00
Jasmine Dahilig
262d204096 incorporate lifecycle into restart tracker 2020-03-21 17:52:40 -04:00
Mahmood Ali
5377b4cb58 Add a coordinator for alloc runners 2020-03-21 17:52:38 -04:00
Fredrik Hoem Grelland
26cca14f27 Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Mahmood Ali
a3b0b25acb update rest of consul packages 2020-02-16 16:25:04 -06:00
Seth Hoenig
1f8e31770c tests: set consul token for nomad client for testing SIDS TR hook 2020-01-31 19:06:15 -06:00
Seth Hoenig
04b526662c e2e: setup consul ACLs a little more correctly 2020-01-31 19:06:11 -06:00
Seth Hoenig
0f285b840e tests: skip some SIDS hook tests if running tests as root 2020-01-31 19:05:32 -06:00
Seth Hoenig
08951ac759 client: additional test cases around failures in SIDS hook 2020-01-31 19:05:27 -06:00
Seth Hoenig
91c7dbaa8d client: PR cleanup - improved logging around kill task in SIDS hook 2020-01-31 19:05:23 -06:00
Seth Hoenig
f8949dde35 client: PR cleanup - shadow context variable 2020-01-31 19:05:19 -06:00
Seth Hoenig
0589b656b7 nomad: make TaskGroup.UsesConnect helper a public helper 2020-01-31 19:05:11 -06:00
Seth Hoenig
40de85867d client: manage TR kill from parent on SI token derivation failure
Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.
2020-01-31 19:05:02 -06:00
Seth Hoenig
1fca495a85 client: set context timeout around SI token derivation
The derivation of an SI token needs to be safegaurded by a context
timeout, otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.
2020-01-31 19:04:56 -06:00
Seth Hoenig
bbedeb670d nomad,client: apply more comment/style PR tweaks 2020-01-31 19:04:52 -06:00
Seth Hoenig
cc7b768907 nomad,client: apply smaller PR suggestions
Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2020-01-31 19:04:40 -06:00
Seth Hoenig
d24d470775 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig
e825a0f769 client: skip task SI token file load failure if testing as root
The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-priveleged user, since it deliberately tries to read
an un-readable file to simulate a failure loading the SI token file.
2020-01-31 19:04:30 -06:00
Seth Hoenig
4b4dfacda5 client: remove unused indirection for referencing consul executable
Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.
2020-01-31 19:04:25 -06:00
Seth Hoenig
d85cccc8d0 nomad: fixup token policy validation 2020-01-31 19:04:08 -06:00
Seth Hoenig
6bc6a52f99 client: enable envoy bootstrap hook to set SI token
When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.
2020-01-31 19:04:01 -06:00
Seth Hoenig
674ccaa122 nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig
f8666bb1f9 client: enable nomad client to request and set SI tokens for tasks
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
2020-01-31 19:03:38 -06:00
Mahmood Ali
b789b507d1 Merge pull request #6922 from hashicorp/b-alloc-canoncalize
Handle Upgrades and Alloc.TaskResources modification
2020-01-28 15:12:41 -05:00
Mahmood Ali
3291523d8c address review comments 2020-01-15 08:57:05 -05:00
Nick Ethier
4b6f9e800b Merge pull request #6816 from hashicorp/b-multiple-envoy
connect: configure envoy to support multiple sidecars in the same alloc
2020-01-09 23:25:39 -05:00
Mahmood Ali
058076afd0 client: stop using alloc.TaskResources
Now that alloc.Canonicalize() is called in all alloc sources in the
client (i.e. on state restore and RPC fetching), we no longer need to
check alloc.TaskResources.

alloc.AllocatedResources is always non-nil through alloc runner.
Though, early on, we check for alloc validity, so NewTaskRunner and
TaskEnv must still check.  `TestClient_AddAllocError` test validates
that behavior.
2020-01-09 09:25:07 -05:00
Tim Gross
f31482ae8a interpolate environment for services in script checks (#6916)
In 0.10.2 (specifically 387b016) we added interpolation to group
service blocks and centralized the logic for task environment
interpolation. This wasn't also added to script checks, which caused a
regression where the IDs for script checks for services w/
interpolated fields (ex. the service name) didn't match the service ID
that was registered with Consul.

This changeset calls the same taskenv interpolation logic during
`script_check` configuration, and adds tests to reduce the risk of
future regressions by comparing the IDs of service hook and the check hook.
2020-01-09 08:12:54 -05:00
Nick Ethier
55217423c7 tr: initialize envoybootstrap prestart hook response.Env field 2020-01-08 13:41:38 -05:00
Nick Ethier
a44490182b tr: expose envoy sidecar admin port as environment variable 2020-01-06 21:53:45 -05:00
Nick Ethier
04a1623b85 connect: configure envoy such that multiple sidecars can run in the same alloc 2020-01-06 11:26:27 -05:00
Mahmood Ali
20f8227c0a Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali
e82dad732b address review comments 2019-12-13 11:21:00 -05:00
Chris Dickson
bbb6b2af09 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Mahmood Ali
943854469d driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Preetha
d4f801d188 Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin
c47d52e865 getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier
387b016ac4 client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Michael Schurter
0fcb0d4016 client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Michael Schurter
43909b1374 Revert "Revert "Use joint context to cancel prestart hooks"" 2019-10-08 11:34:09 -07:00
Michael Schurter
680e30457f Revert "Use joint context to cancel prestart hooks" 2019-10-08 11:27:08 -07:00
Drew Bailey
12be12020e simplify logic to check for vault read event
defer shutdown to cleanup after failed run

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

update comment to include ctx note for shutdown
2019-09-30 11:02:14 -07:00