nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
Michael Schurter	2411d3afd2	core: remove all traces of unused protocol version Nomad inherited protocol version numbering configuration from Consul and Serf, but unlike those projects Nomad has never used it. Nomad's `protocol_version` has always been `1`. While the code is effectively unused and therefore poses no runtime risks to leave, I felt like removing it was best because: 1. Nomad's RPC subsystem has been able to evolve extensively without needing to increment the version number. 2. Nomad's HTTP API has evolved extensively without incrementing `API{Major,Minor}Version`. If we want to version the HTTP API in the future, I doubt this is the mechanism we would choose. 3. The presence of the `server.protocol_version` configuration parameter is confusing since `server.raft_protocol` is an important parameter for operators to consider. Even more confusing is that there is a distinct Serf protocol version which is included in `nomad server members` output under the heading `Protocol`. `raft_protocol` is the only protocol version relevant to Nomad developers and operators. The other protocol versions are either dead code or have never changed (Serf). 4. If we were to need to version the RPC, HTTP API, or Serf protocols, I don't think these configuration parameters and variables are the best choice. If we come to that point we should choose a versioning scheme based on the use case and modern best practices -- not this 6+ year old dead code.	2022-02-18 16:12:36 -08:00
Tim Gross	b775a73ded	CSI: make gRPC client creation more robust (#12057 ) Nomad communicates with CSI plugin tasks via gRPC. The plugin supervisor hook uses this to ping the plugin for health checks which it emits as task events. After the first successful health check the plugin supervisor registers the plugin in the client's dynamic plugin registry, which in turn creates a CSI plugin manager instance that has its own gRPC client for fingerprinting the plugin and sending mount requests. If the plugin manager instance fails to connect to the plugin on its first attempt, it exits. The plugin supervisor hook is unaware that connection failed so long as its own pings continue to work. A transient failure during plugin startup may mislead the plugin supervisor hook into thinking the plugin is up (so there's no need to restart the allocation) but no fingerprinter is started. * Refactors the gRPC client to connect on first use. This provides the plugin manager instance the ability to retry the gRPC client connection until success. * Add a 30s timeout to the plugin supervisor so that we don't poll forever waiting for a plugin that will never come back up. Minor improvements: * The plugin supervisor hook creates a new gRPC client for every probe and then throws it away. Instead, reuse the client as we do for the plugin manager. * The gRPC client constructor has a 1 second timeout. Clarify that this timeout applies to the connection and not the rest of the client lifetime.	2022-02-15 16:57:29 -05:00
Conor Evans	31978a0366	replace 'a alloc' with 'an alloc' where appropriate (#11792 )	2022-01-10 11:59:46 -05:00
James Rasell	ab9ba35e6a	chore: fixup inconsistent method receiver names. (#11704 )	2021-12-20 11:44:21 +01:00
Tim Gross	d38266aef8	client: respect `client_auto_join` after connection loss (#11585 ) The `consul.client_auto_join` configuration block tells the Nomad client whether to use Consul service discovery to find Nomad servers. By default it is set to `true`, but contrary to the documentation it was only respected during the initial client registration. If a client missed a heartbeat, failed a `Node.UpdateStatus` RPC, or if there was no Nomad leader, the client would fallback to Consul even if `client_auto_join` was set to `false`. This changeset returns early from the client's trigger for Consul discovery if the `client_auto_join` field is set to `false`.	2021-11-30 13:20:42 -05:00
Danish Prakash	e70b0b7727	client: emit max_memory metric (#11490 )	2021-11-17 08:34:22 -05:00
Alessandro De Blasis	9a5248b932	cli: show `host_network` in `nomad status` (#11432 ) Enhance the CLI in order to return the host network in two flavors (default, verbose) of the `node status` command. Fixes: #11223. Signed-off-by: Alessandro De Blasis <alex@deblasis.net>	2021-11-05 09:02:46 -04:00
Michael Schurter	c615870911	client: defensively log reserved ports - Fix test broken due to being improperly setup. - Include min/max ports in default client config.	2021-10-04 15:43:35 -07:00
Michael Schurter	2968c01295	client: output reserved ports with min/max ports Also add a little more min/max port testing and add the consts back that had been removed: but unexported and as defaults.	2021-09-30 17:05:46 -07:00
Aleksandr Zagaevskiy	e3b6f62198	Support configurable dynamic port range	2021-09-10 11:52:47 +03:00
James Rasell	e26f1c4591	lint: mark false positive or fix gocritic append lint errors.	2021-09-06 10:49:44 +02:00
James Rasell	3bffe443ac	chore: fix incorrect docstring formatting.	2021-08-30 11:08:12 +02:00
Mahmood Ali	b1d10ff69d	Speed up client startup and registration (#11005 ) Speed up client startup, by retrying more until the servers are known. Currently, if client fingerprinting is fast and finishes before the client connect to a server, node registration may be delayed by 15 seconds or so! Ideally, we'd wait until the client discovers the servers and then retry immediately, but that requires significant code changes. Here, we simply retry the node registration request every second. That's basically the equivalent of check if the client discovered servers every second. Should be a cheap operation. When testing this change on my local computer and where both servers and clients are co-located, the time from startup till node registration dropped from 34 seconds to 8 seconds!	2021-08-10 17:06:18 -04:00
Mahmood Ali	3165ae8112	client: avoid acting on stale data after launch (#10907 ) When the client launches, use a consistent read to fetch its own allocs, but allow stale reads afterwards as long as reads don't revert into older state. This change addresses an edge case affecting restarting clients. When a client restarts, it may fetch stale data concerning its allocs: allocs that have completed prior to the client shutdown may still have "run/running" desired/client status, and the client may attempt to re-run them. An alternative approach is to track the indices such that the client sets MinQueryIndex to the maximum index the client ever saw, or to compare received allocs against locally restored client state. Garbage collection complicates this approach (local knowledge is not complete), and the approach still risks starting "dead" allocations (e.g. the allocation may have been placed when the client just restarted and have already been rescheduled by the time the client started). The approach here is effective against all kinds of staleness problems with small overhead.	2021-07-20 15:13:28 -04:00
Mahmood Ali	7badf0fda2	tests: deflake CSI forwarding tests This updates `client.Ready()` so it returns once the client node is registered at the servers. Previously, it returned when the fingerprinters' first batch completed, without ensuring that the node is stored in the Raft data. The tests may then fail later with unknown node errors. `client.Ready()` seems to be only called in CSI and some client stats now. This class of bug, assuming the client is registered without checking, is a source of flakiness elsewhere. Other tests use other mechanisms for checking node readiness, though not consistently.	2021-06-10 21:26:34 -04:00
Michael Schurter	d50fb2a00e	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Nick Ethier	9003717ae3	client: disable cpuset cgroup management if init fails	2021-04-14 14:44:08 -04:00
Nick Ethier	d5f97c11a5	another testing fix	2021-04-14 10:37:03 -04:00
Nick Ethier	f897ac79e8	client/ar: thread through cpuset manager	2021-04-13 13:28:36 -04:00
Nick Ethier	84e44d53d0	Apply suggestions from code review Co-authored-by: Drew Bailey <drewbailey5@gmail.com>	2021-04-13 13:28:15 -04:00
Nick Ethier	b8397a712d	fingerprint: implement client fingerprinting of reservable cores on Linux systems this is derived from the configure cpuset cgroup parent (defaults to /nomad) for non Linux systems and Linux systems where cgroups are not enabled, the client defaults to using all cores	2021-04-13 13:28:15 -04:00
Tim Gross	bb194cb91d	test infrastructure for mock client RPCs (#10193 ) This commit includes a new test client that allows overriding the RPC protocols. Only the RPCs that are passed in are registered, which lets you implement a mock RPC in the server tests. This commit includes an example of this for the ClientCSI RPC server.	2021-03-31 16:37:09 -04:00
Tim Gross	14568b3e00	deps: bump gopsutil to v3.21.2	2021-03-30 16:02:51 -04:00
Kris Hicks	2cd7136bc7	Fix some errcheck errors (#9811 ) * Throw away result of multierror.Append When given a multierror.Error, it is mutated, therefore the return value is not needed. Simplify MergeMultierrorWarnings, use StringBuilder * Hash.Write() never returns an error * Remove error that was always nil * Remove error from Resources.Add signature When this was originally written it could return an error, but that was refactored away, and callers of it as of today never handle the error. * Throw away results of io.Copy during Bridge * Handle errors when computing node class in test	2021-01-14 12:46:35 -08:00
Seth Hoenig	803cd312b1	consul/connect: fix panic during in-place upgrade with connect jobs When upgrading from Nomad v0.12.x to v1.0.x, Nomad client will panic on startup if the node is running Connect enabled jobs. This is caused by a missing piece of plumbing of the Consul Proxies API interface during the client restore process. Fixes #9738	2021-01-07 13:24:24 -06:00
Seth Hoenig	f0f6f3a18f	consul/connect: fix regression where client connect images ignored Nomad v1.0.0 introduced a regression where the client configurations for `connect.sidecar_image` and `connect.gateway_image` would be ignored despite being set. This PR restores that functionality. There was a missing layer of interpolation that needs to occur for these parameters. Since Nomad 1.0 now supports dynamic envoy versioning through the ${NOMAD_envoy_version} pseudo variable, we basically need to first interpolate ${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version} then use Consul at runtime to resolve to a real image, e.g. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0 Of course, if the version of Consul is too old to provide an envoy version preference, we then need to know to fall back to the old version of envoy that we used before. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09 Beyond that, we also need to continue to support jobs that set the sidecar task themselves, e.g. sidecar_task { config { image: "custom/envoy" } } which itself could include the pseudo envoy version variable.	2020-12-14 09:47:55 -06:00
Kris Hicks	85ed8ddd4f	Add gosimple linter (#9590 )	2020-12-09 11:05:18 -08:00
Kris Hicks	071f4c7596	Add gocritic to golangci-lint config (#9556 )	2020-12-08 12:47:04 -08:00
Seth Hoenig	4d6a166989	Merge pull request #9487 from hashicorp/f-connect-sidecar-concurrency consul/connect: default envoy concurrency to 1	2020-12-01 15:51:41 -06:00
Seth Hoenig	2a397dbda2	consul/connect: default envoy concurrency to 1 Previously, every Envoy Connect sidecar would spawn as many worker threads as logical CPU cores. That is Envoy's default behavior when `--concurrency` is not explicitly set. Nomad now sets the concurrency flag to 1, which is sensible for the default cpu = 250 Mhz resources allocated for sidecar proxies. The concurrency value can be configured in Client configuration by setting `meta.connect.proxy_concurrency`. Closes #9341	2020-12-01 13:12:45 -06:00
Michael Schurter	e6fd2583fa	client: always wait 200ms before sending updates Always wait 200ms before calling the Node.UpdateAlloc RPC to send allocation updates to servers. Prior to this change we only reset the update ticker when an error was encountered. This meant the 200ms ticker was running while the RPC was being performed. If the RPC was slow due to network latency or server load and took >=200ms, the ticker would tick during the RPC. Then on the next loop only the select would randomly choose between the two viable cases: receive an update or fire the RPC again. If the RPC case won it would immediately loop again due to there being no updates to send. When the update chan receive is selected a single update is added to the slice. The odds are then 50/50 that the subsequent loop will send the single update instead of receiving any more updates. This could cause a couple of problems: 1. Since only a small number of updates are sent, the chan buffer may fill, applying backpressure, and slowing down other client operations. 2. The small number of updates sent may already be stale and not represent the current state of the allocation locally. A risk here is that it's hard to reason about how this will interact with the 50ms batches on servers when the servers under load. A further improvement would be to completely remove the alloc update chan and instead use a mutex to build a map of alloc updates. I wanted to test the lowest risk possible change on loaded servers first before making more drastic changes.	2020-11-25 11:36:51 -08:00
Michael Schurter	f44c04ecd1	s/0.13/1.0/g 1.0 here we come!	2020-10-14 15:17:47 -07:00
Chris Baker	797543ad4b	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Seth Hoenig	bdeb73cd2c	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fallback to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Nick Ethier	756aa11654	client: add NetworkStatus to Allocation (#8657 )	2020-10-12 13:43:04 -04:00
Yoan Blanc	c14c616194	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Pete Woods	f40e6eed65	Add node "status", "scheduling eligibility" to all client metrics (#8925 ) - We previously added these to the client host metrics, but it's useful to have them on all client metrics. - e.g. so you can exclude draining nodes from charts showing your fleet size.	2020-09-22 13:53:50 -04:00
Seth Hoenig	9ffdeed904	consul/connect: add initial support for ingress gateways This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza. ```hcl service { connect { gateway { proxy { // envoy proxy configuration } ingress { // ingress-gateway configuration entry } } } } ``` A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value. Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver. Aims to address #8294 and tangentially #8647	2020-08-21 16:21:54 -05:00
Drew Bailey	19810365f6	oss components for multi-vault namespaces adds in oss components to support enterprise multi-vault namespace feature upgrade specific doc on vault multi-namespaces vault docs update test to reflect new error	2020-07-24 10:14:59 -04:00
Mahmood Ali	c70f2a1269	Revert "client: defensive against getting stale alloc updates"	2020-06-19 15:39:44 -04:00
Nick Ethier	33ce12cda9	CNI Implementation (#7518 )	2020-06-18 11:05:29 -07:00
Drew Bailey	5be192fac3	give enterpriseclient a logger (#8072 )	2020-05-28 15:43:16 -04:00
Drew Bailey	7fc495e30e	Oss license support for ent builds (#8054 ) * changes necessary to support oss licensing shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Lang Martin	3477f2e87a	client/heartbeatstop: don't store client state, use timeout In order to minimize this change while keeping a simple version of the behavior, we set `lastOk` to the current time less the initial server connection timeout. If the client starts and never contacts the server, it will stop all configured tasks after the initial server connection grace period, on the assumption that we've been out of touch longer than any configured `stop_after_client_disconnect`. The more complex state behavior might be justified later, but we should learn about failure modes first.	2020-05-01 12:35:49 -04:00
Lang Martin	7405961144	client/heartbeatstop: destroy allocs when disconnected from servers - track lastHeartbeat, the client local time of the last successful heartbeat round trip - track allocations with `stop_after_client_disconnect` configured - trigger allocation destroy (which handles cleanup) - restore heartbeat/killable allocs tracking when allocs are recovered from disk - on client restart, stop those allocs after a grace period if the servers are still partitioned	2020-05-01 12:35:49 -04:00
Lang Martin	bc750d8bb0	csi: add node events to report progress mounting and unmounting volumes (#7547 ) * nomad/structs/structs: new NodeEventSubsystemCSI * client/client: pass triggerNodeEvent in the CSIConfig * client/pluginmanager/csimanager/instance: add eventer to instanceManager * client/pluginmanager/csimanager/manager: pass triggerNodeEvent * client/pluginmanager/csimanager/volume: node event on [un]mount * nomad/structs/structs: use storage, not CSI * client/pluginmanager/csimanager/volume: use storage, not CSI * client/pluginmanager/csimanager/volume_test: eventer * client/pluginmanager/csimanager/volume: event on error * client/pluginmanager/csimanager/volume_test: check event on error * command/node_status: remove an extra space in event detail format * client/pluginmanager/csimanager/volume: use snake_case for details * client/pluginmanager/csimanager/volume_test: snake_case details	2020-03-31 17:13:52 -04:00
Tim Gross	42323c41d9	csi: add dynamicplugins registry to client state store (#7330 ) In order to correctly fingerprint dynamic plugins on client restarts, we need to persist a handle to the plugin (that is, connection info) to the client state store. The dynamic registry will sync automatically to the client state whenever it receives a register/deregister call.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	69cbb964e1	client: Pass an RPC Client to AllocRunners As part of introducing support for CSI, AllocRunner hooks need to be able to communicate with Nomad Servers for validation of and interaction with storage volumes. Here we create a small RPCer interface and pass the client (rpc client) to the AR in preparation for making these RPCs.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	1250d56333	csi: Add VolumeManager (#6920 ) This changeset is some pre-requisite boilerplate that is required for introducing CSI volume management for client nodes. It extracts out fingerprinting logic from the csi instance manager. This change is to facilitate reusing the csimanager to also manage the node-local CSI functionality, as it is the easiest place for us to guarantee health checking and to provide additional visibility into the running operations through the fingerprinter mechanism and goroutine. It also introduces the VolumeMounter interface that will be used to manage staging/publishing unstaging/unpublishing of volumes on the host.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	cd0c2a6df0	csi: Setup gRPC Clients with a logger	2020-03-23 13:58:29 -04:00


660 Commits