nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
Juan Larriba	65f09ed119	Run Linux Images (LCOW) and Windows Containers side by side (#7850) Makes it possible to run Linux Containers On Windows with Nomad alongside Windows Containers. The fingerprint only prevents running Nomad on Windows 10 with Linux Containers.	2020-05-04 13:08:47 -04:00
Lang Martin	3477f2e87a	client/heartbeatstop: don't store client state, use timeout In order to minimize this change while keeping a simple version of the behavior, we set `lastOk` to the current time less the initial server connection timeout. If the client starts and never contacts the server, it will stop all configured tasks after the initial server connection grace period, on the assumption that we've been out of touch longer than any configured `stop_after_client_disconnect`. The more complex state behavior might be justified later, but we should learn about failure modes first.	2020-05-01 12:35:49 -04:00
Lang Martin	7405961144	client/heartbeatstop: destroy allocs when disconnected from servers - track lastHeartbeat, the client local time of the last successful heartbeat round trip - track allocations with `stop_after_client_disconnect` configured - trigger allocation destroy (which handles cleanup) - restore heartbeat/killable allocs tracking when allocs are recovered from disk - on client restart, stop those allocs after a grace period if the servers are still partitioned	2020-05-01 12:35:49 -04:00
Tim Gross	5731be4b79	csi: restore long timeout for controller plugins (#7840) During MVP development, we reduced the timeout for controller plugins to avoid long hangs in GC workers. But now that this work has been moved to the volume watcher, we can restore the original timeout which is better suited for the characteristic timescales of some cloud provider APIs and better matches the behavior of k8s.	2020-04-30 17:12:05 -04:00
Seth Hoenig	a869394a03	env_aws: combine 3 log lines into 1	2020-04-29 10:47:36 -06:00
Seth Hoenig	0d5d1781d3	env_aws: downgrade log line Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2020-04-29 10:34:26 -06:00
Seth Hoenig	f47c57fa2d	env_aws: fixup log line Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2020-04-29 10:33:53 -06:00
Seth Hoenig	9230fa9eff	env_aws: use best-effort lookup table for CPU performance in EC2 Fixes #7681 The current behavior of the CPU fingerprinter in AWS is that it reads the current speed from `/proc/cpuinfo` (`CPU MHz` field). This is because the max CPU frequency is not available by reading anything on the EC2 instance itself. Normally on Linux one would look at e.g. `/sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq` or perhaps parse the values from the `CPU max MHz` field in `/proc/cpuinfo`, but those values are not available. Furthermore, no metadata about the CPU is made available in the EC2 metadata service. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html Since `go-psutil` cannot determine the max CPU speed it defaults to the current CPU speed, which could be basically any number between 0 and the true max. This is particularly bad on large, powerful reserved instances which often idle at ~800 MHz while Nomad does its fingerprinting (typically IO bound), which Nomad then uses as the max, which results in severe loss of available resources. Since the CPU specification is unavailable programmatically (at least not without sudo) use a best-effort lookup table. This table was generated by going through every instance type in AWS documentation and copy-pasting the numbers. https://aws.amazon.com/ec2/instance-types/ This approach obviously is not ideal as future instance types will need to be added as they are introduced to AWS. However, using the table should only be an improvement over the status quo since right now Nomad miscalculates available CPU resources on all instance types.	2020-04-28 19:01:33 -06:00
Mahmood Ali	1fd22623cd	Harmonize go-msgpack/codec/codecgen Use v1.1.5 of go-msgpack/codec/codecgen, so go-msgpack codecgen matches the library version. We branched off earlier to pick up `f51b518921` , but apparently that's not needed as we could customize the package via `-c` argument.	2020-04-28 17:12:31 -04:00
Tim Gross	22d4b88b69	csi: checkpoint volume claim garbage collection (#7782) Adds a `CSIVolumeClaim` type to be tracked as current and past claims on a volume. Allows for a client RPC failure during node or controller detachment without having to keep the allocation around after the first garbage collection eval. This changeset lays groundwork for moving the actual detachment RPCs into a volume watching loop outside the GC eval.	2020-04-23 11:06:23 -04:00
Charlie Voiselle	2c1dcc8cd2	Use ExternalID in NodeStageVolume RPC (#7754)	2020-04-20 17:13:46 -04:00
Anthony Scalisi	e1287846ae	fix spelling errors (#6985)	2020-04-20 09:28:19 -04:00
Drew Bailey	3af2d05f6b	Run task shutdown_delay regardless of service registration task shutdown_delay will currently only run if there are registered services for the task. This implementation detail isn't explicitly stated anywhere and is defined outside of the service stanza. This change moves shutdown_delay to be evaluated after prekill hooks are run, outside of any task runner hooks. just use time.sleep	2020-04-10 11:06:26 -04:00
Nick Ethier	18de6c4e41	ar/bridge: use cni.IsCNINotInitialized helper	2020-04-06 21:44:01 -04:00
Nick Ethier	b078d7855b	ar/bridge: better cni status err handling	2020-04-06 21:21:42 -04:00
Nick Ethier	f68b85b86d	ar/bridge: ensure cni configuration is always loaded	2020-04-06 21:02:26 -04:00
Nick Ethier	9df9e5e122	Merge pull request #7600 from hashicorp/b-5767 tr/service_hook: prevent Update from running before Poststart finish	2020-04-06 16:52:42 -04:00
Nick Ethier	8a8bd9b02d	tr/service_hook: reset initialized flag during deregister	2020-04-06 16:05:36 -04:00
Drew Bailey	004d200c17	Merge pull request #7618 from hashicorp/b-shutdown-delay-updates Fixes bug that prevented group shutdown_delay updates	2020-04-06 13:05:20 -04:00
Drew Bailey	d45fc506e5	ensure shutdown delay can be removed	2020-04-06 11:33:04 -04:00
Drew Bailey	b81a0018b4	Group shutdown delay fixes Group shutdown delay updates were not properly handled in Update hook. This commit also ensures that plan output is displayed.	2020-04-06 11:29:12 -04:00
Tim Gross	b946906865	csi: make volume GC in job deregister safely async The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.	2020-04-06 10:15:55 -04:00
Tim Gross	a11fb6a323	csi: fix unpublish workflow ID mismatches The CSI plugins use the external volume ID for all operations, but the Client CSI RPCs use the Nomad volume ID (human-friendly) for the mount paths. Pass the External ID as an arg in the RPC call so that the unpublish workflows have it without calling back to the server to find the external ID. The controller CSI plugins need the CSI node ID (or in other words, the storage provider's view of node ID like the EC2 instance ID), not the Nomad node ID, to determine how to detach the external volume.	2020-04-06 10:15:55 -04:00
Seth Hoenig	222886e488	Merge pull request #7602 from hashicorp/b-connect-bootstrap-tls-config connect: set consul TLS options on envoy bootstrap	2020-04-03 08:50:36 -06:00
Tim Gross	414caf76e5	CSI: move node unmount to server-driven RPCs (#7596) If a volume-claiming alloc stops and the CSI Node plugin that serves that alloc's volumes is missing, there's no way for the allocrunner hook to send the `NodeUnpublish` and `NodeUnstage` RPCs. This changeset addresses this issue with a redesign of the client-side for CSI. Rather than unmounting in the alloc runner hook, the alloc runner hook will simply exit. When the server gets the `Node.UpdateAlloc` for the terminal allocation that had a volume claim, it creates a volume claim GC job. This job will make client RPCs to a new node plugin RPC endpoint, and only once that succeeds, move on to making the client RPCs to the controller plugin. If the node plugin is unavailable, the GC job will fail and be requeued.	2020-04-02 16:04:56 -04:00
Nick Ethier	d4a3524064	tr/service_hook: update hook fields during update when poststart hasn't finished	2020-04-02 12:48:19 -04:00
Seth Hoenig	fb0bd3c25f	connect: set consul TLS options on envoy bootstrap Fixes #6594 #6711 #6714 #7567 e2e testing is still TBD in #6502 Before, we only passed the Nomad agent's configured Consul HTTP address onto the `consul connect envoy ...` bootstrap command. This meant any Consul setup with TLS enabled would not work with Nomad's Connect integration. This change now sets CLI args and Environment Variables for configuring TLS options for communicating with Consul when doing the envoy bootstrap, as described in https://www.consul.io/docs/commands/connect/envoy.html#usage	2020-04-02 10:30:50 -06:00
Nick Ethier	88438e8982	tr/service_hook: prevent Update from running before Poststart has finished	2020-04-02 12:17:36 -04:00
Mahmood Ali	e625f07b57	fix codegen for ugorji/go When generating ugorji/go package, we should use github.com/hashicorp/go-msgpack/codec instead. Also fix the reference for codegen_generated	2020-03-31 21:30:21 -04:00
Seth Hoenig	2a9749c41c	connect: enable proxy.passthrough configuration Enable configuration of HTTP and gRPC endpoints which should be exposed by the Connect sidecar proxy. This changeset is the first "non-magical" pass that lays the groundwork for enabling Consul service checks for tasks running in a network namespace because they are Connect-enabled. The changes here provide for full configuration of the connect { sidecar_service { proxy { expose { paths = [{ path = <exposed endpoint> protocol = <http or grpc> local_path_port = <local endpoint port> listener_port = <inbound mesh port> }, ... ] } } } stanza. Everything from `expose` and below is new, and partially implements the precedent set by Consul: https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference Combined with a task-group level network port-mapping in the form: port "exposeExample" { to = -1 } it is now possible to "punch a hole" through the network namespace to a specific HTTP or gRPC path, with the anticipated use case of creating Consul checks on Connect enabled services. A future PR may introduce more automagic behavior, where we can do things like 1) auto-fill the 'expose.path.local_path_port' with the default value of the 'service.port' value for task-group level connect-enabled services. 2) automatically generate a port-mapping 3) enable an 'expose.checks' flag which automatically creates exposed endpoints for every compatible consul service check (http/grpc checks on connect enabled services).	2020-03-31 17:15:27 -06:00
Lang Martin	bc750d8bb0	csi: add node events to report progress mounting and unmounting volumes (#7547) * nomad/structs/structs: new NodeEventSubsystemCSI * client/client: pass triggerNodeEvent in the CSIConfig * client/pluginmanager/csimanager/instance: add eventer to instanceManager * client/pluginmanager/csimanager/manager: pass triggerNodeEvent * client/pluginmanager/csimanager/volume: node event on [un]mount * nomad/structs/structs: use storage, not CSI * client/pluginmanager/csimanager/volume: use storage, not CSI * client/pluginmanager/csimanager/volume_test: eventer * client/pluginmanager/csimanager/volume: event on error * client/pluginmanager/csimanager/volume_test: check event on error * command/node_status: remove an extra space in event detail format * client/pluginmanager/csimanager/volume: use snake_case for details * client/pluginmanager/csimanager/volume_test: snake_case details	2020-03-31 17:13:52 -04:00
Mahmood Ali	137a94fdd0	Merge pull request #7560 from hashicorp/vendor-go-msgpack-v1.1.5 vendor: explicit use of hashicorp/go-msgpack	2020-03-31 10:09:05 -04:00
Tim Gross	3f110d2019	client: use NewNodeEvent builder for consistency (#7559)	2020-03-31 10:02:16 -04:00
Yoan Blanc	c3928fe360	fixup! vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:48:07 -04:00
Yoan Blanc	887f23a351	vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:45:21 -04:00
Tim Gross	74e5c90b42	csi: annotate remaining missing cancellation contexts (#7552)	2020-03-30 16:46:43 -04:00
Tim Gross	ffa13adf90	csi: add grpc retries to client controller RPCs (#7549) The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset: * adds retries and backoffs to the RPCs where they were needed but not implemented * annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally * adds a timeout and cancellation context to the `Probe` call, which didn't have one.	2020-03-30 16:26:03 -04:00
Seth Hoenig	a86e575670	Merge pull request #7524 from hashicorp/docs-consul-acl-minimums consul: annotate Consul interfaces with ACLs	2020-03-30 13:27:27 -06:00
Seth Hoenig	0957c24646	docs: remove erroneous characters from comment	2020-03-30 13:26:48 -06:00
Seth Hoenig	dfb55132d3	Merge pull request #7542 from jorgemarey/b-fix-lockedUpstreamsUpdate Add new setUpstreamsLocked function to avoid blocking on Update	2020-03-30 11:27:32 -06:00
Seth Hoenig	7a7701a4eb	consul: annotate Consul interfaces with ACLs	2020-03-30 10:17:28 -06:00
Mahmood Ali	81073ff88e	tests: deflake TestAllocGarbageCollector_MakeRoomFor_MaxAllocs The test inserts an alloc in the server state, but expects the client to start the alloc runner for it almost immediately. Here, we add a retry loop to check that the client starts all expected alloc runners eventually.	2020-03-30 07:06:53 -04:00
Jorge Marey	a3aa03acf0	Add new setUpstreamsLocked function to avoid lock	2020-03-29 20:34:04 +02:00
Mahmood Ali	6199e96972	fixup! tests: Add tests for EC2 Metadata imitation cases	2020-03-26 11:37:54 -04:00
Mahmood Ali	0c1dd0e75b	fixup! tests: Add tests for EC2 Metadata imitation cases	2020-03-26 11:33:44 -04:00
Mahmood Ali	e37f7af811	fingerprint: handle incomplete AWS imitation APIs Fix a regression where we accidentally started treating non-AWS environments as AWS environments, resulting in bad networking settings. Two factors are at play: First, in [1], we accidentally switched the ultimate AWS test from checking `ami-id` to `instance-id`. This means that nomad started treating more environments as AWS; e.g. Hetzner implements `instance-id` but not `ami-id`. Second, some of these environments return empty values instead of errors! Hetzner returns an empty 200 response for `local-ipv4`, resulting in bad networking configuration. This change fixes the situation by restoring the check to `ami-id` and ensuring that we only set network configuration when the ip address is non-empty. Also, be more defensive around response whitespace input. [1] https://github.com/hashicorp/nomad/pull/6779	2020-03-26 11:23:15 -04:00
Mahmood Ali	500c3c2d87	tests: Add tests for EC2 Metadata imitation cases Test that nomad doesn't set empty/bad network configuration when in an environment that does an incomplete imitation of the EC2 Metadata API.	2020-03-26 11:13:21 -04:00
Mahmood Ali	4a27cddec8	Merge pull request #7383 from hashicorp/b-health-detect-failing-tasks health: detect failing tasks	2020-03-25 06:30:05 -04:00
Mahmood Ali	d155e4d412	tests: restart restartpolicy for all tasks in tests	2020-03-24 21:52:48 -04:00
Mahmood Ali	df08c6c399	tests: populate task restart policy properly	2020-03-24 21:44:37 -04:00

4149 Commits