Commit Graph

390 Commits

Mahmood Ali
6c414cd5f9 gofmt all the files
mostly to handle build directives in Go 1.17.
2021-10-01 10:14:28 -04:00
James Rasell
3bffe443ac chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Seth Hoenig
61ee443ee6 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

As the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short-lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete once it has run to a
terminal state (success, or failure after retries) on every compatible node.
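
For illustration, a minimal job using the new scheduler type might look
like this (job, group, and task names are hypothetical):

job "node-sweep" {
  datacenters = ["dc1"]
  type        = "sysbatch" // run once to completion on every compatible node

  group "sweep" {
    task "cleanup" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "sh"
        args    = ["-c", "rm -rf /tmp/scratch/*"]
      }
    }
  }
}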

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is still
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Michael Schurter
179e88c52b Merge pull request #10849 from benbuzbee/benbuz/fix-destroy
Don't treat a failed recover + successful destroy as a successful recover
2021-07-19 10:49:31 -07:00
Seth Hoenig
a18d901bb0 consul/connect: add missing import statements 2021-07-12 09:28:16 -05:00
Seth Hoenig
3ac8d4e7a6 consul/connect: use join host port
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2021-07-12 09:04:54 -05:00
Seth Hoenig
ee7d32fb98 consul/connect: fix bug causing high cpu with multiple connect sidecars in group
This PR fixes a bug where the underlying Envoy process of a Connect gateway
would consume a full core of CPU if there is more than one sidecar or gateway
in a group. The utilization was caused by Consul injecting an envoy_ready_listener
on 127.0.0.1:8443, which only one of the Envoys would be able to bind.
The others would spin in a hot loop trying to bind the listener.

As a workaround, we now specify -address during the Envoy bootstrap config
step, which is how Consul maps this ready listener. Because the
envoy_admin_listener already exists, and we need to continue supporting
gateways running in host networking mode (where we want to use the same
port value coming from the service.port field), we now bind the admin
listener to the 127.0.0.2 loopback interface, and the ready listener
takes 127.0.0.1.

This shouldn't make a difference in the 99.999% use case where Envoy is
being run in its official Docker container. Advanced users can reference
${NOMAD_ENVOY_ADMIN_ADDR_<service>} (as they ought to) if needed,
as well as the new variable ${NOMAD_ENVOY_READY_ADDR_<service>} for the
envoy_ready_listener.
2021-07-09 14:34:44 -05:00
Seth Hoenig
b937e7baf4 consul: avoid triggering unnecessary sync when removing workload
There are bits of logic in callers of RemoveWorkload on group/task
cleanup hooks which call RemoveWorkload with the "Canary" version
of the workload, in case the alloc is marked as a Canary. This logic
triggers an extra sync with Consul and also doesn't achieve the intended
behavior - for which no special casing is necessary anyway. When the
workload is marked for removal, all associated services and checks
will be removed regardless of the Canary status, because the service
and check IDs do not incorporate canary-ness in the first place.

The only place where canary-ness matters is when updating a workload,
where we need to compute the hash of the services and checks to determine
whether they have been modified; the Canary flag is part of that hash.

Fixes #10842
2021-07-06 14:08:42 -05:00
Ben Buzbee
baea4716b7 Don't treat a failed recover + successful destroy as a successful
recover

This code just seems incorrect. As it stands today it reports a
successful restore if RecoverTask fails and then DestroyTask succeeds.

This can result in a really annoying bug where it then calls RecoverTask
again, whereby it will probably get ErrTaskNotFound and call DestroyTask
once more.

I think the only reason this has not been noticed so far is because most
drivers like Docker will return success, then Nomad will call
RecoverTask, get an error (not found), call DestroyTask again, and
get an ErrTaskNotFound error.
2021-07-03 01:46:36 +00:00
Seth Hoenig
37729bb027 consul/connect: automatically set consul tls sni name for connect native tasks
This PR makes it so that Nomad will automatically set the CONSUL_TLS_SERVER_NAME
environment variable for Connect native tasks running in bridge networking mode
where Consul has TLS enabled. Because of the use of a unix domain socket for
communicating with Consul when in bridge networking mode, the server name is
a file name instead of something compatible with the mTLS certificate Consul
will authenticate against. "localhost" is by default a compatible name, so Nomad
will set the environment variable to that.

Fixes #10804
2021-06-28 08:36:53 -05:00
James Rasell
a4156c3e94 tests: remove duplicate import statements. 2021-06-11 09:39:22 +02:00
Seth Hoenig
0bc8a33084 consul: probe consul namespace feature before using namespace api
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.

Previously Nomad would check for a 404 status code when making a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.

Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.

Fixes https://github.com/hashicorp/nomad-enterprise/issues/575

Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-and-egg situation between
bootstrapping the agent and setting up the Consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is
currently no way to pass the agent's client through the fingerprint factory.
2021-06-07 12:19:25 -05:00
Seth Hoenig
312161c5fc consul/connect: add support for connect mesh gateways
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.

Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway

The following group level service block can be used to establish
a Connect mesh gateway.

service {
  connect {
    gateway {
      mesh {
        // no configuration
      }
    }
  }
}

Services can make use of a mesh gateway by configuring it in their
upstream blocks, e.g.

service {
  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "<service>"
          local_bind_port  = <port>
          datacenter       = "<datacenter>"
          mesh_gateway {
            mode = "<mode>"
          }
        }
      }
    }
  }
}

Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.

client {
  host_network "public" {
    interface = "eth1"
  }
}

Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.

network {
  mode = "bridge"
  port "mesh_wan" {
    host_network = "public"
  }
}

Use this port label for the service.port of the mesh gateway, e.g.

service {
  name = "mesh-gateway"
  port = "mesh_wan"
  connect {
    gateway {
      mesh {}
    }
  }
}

Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.
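
For example, the default image could be overridden in agent config via
client meta (the image reference here is hypothetical):

client {
  meta {
    // hypothetical pinned image; otherwise Nomad asks the local
    // Consul agent for its preferred Envoy version
    "connect.gateway_image" = "envoyproxy/envoy:v1.18.3"
  }
}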

Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.

Closes #9446
2021-06-04 08:24:49 -05:00
Mahmood Ali
61a3b73d44 drivers: Capture exit code when task is killed (#10494)
This commit ensures Nomad captures the task exit code more reliably even when the task is killed. This issue affects the `raw_exec` driver, as noted in https://github.com/hashicorp/nomad/issues/10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask`, which should return the task exit code on completion. However, it could also return a "context canceled" error if the agent/executor is shut down.

Previously, when a task was to be stopped, the killTask path made two WaitTask calls, and the second occasionally returned "context canceled" because of a race between task shutdown and, depending on the driver, how quickly it shuts down after the task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shut down or the context expires.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
2021-05-04 10:54:00 -04:00
Michael Schurter
2e6eb84a57 Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Michael Schurter
d50fb2a00e core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
2021-04-27 15:07:03 -07:00
Seth Hoenig
6884455750 connect: use exp backoff when waiting on consul envoy bootstrap
This PR wraps the use of the consul envoy bootstrap command in
an exponential backoff closure, configured to time out after 60
seconds. This is an increase over the current behavior of making
3 attempts over 6 seconds.

Should help with #10451
2021-04-27 09:21:50 -06:00
Seth Hoenig
d86cd6c1c2 Merge pull request #10401 from hashicorp/cp-cns-ent-test-fixes
cherry-pick fixes from cns ent tests
2021-04-20 08:45:15 -06:00
Seth Hoenig
94399fff08 client: always set script checks hook
Similar to a bugfix made for the services hook, we need to always
set the script checks hook, in case a task is initially launched
without script checks, but then updated to include script checks.

The script checks hook is the thing that handles that new registration.
2021-04-19 15:37:42 -06:00
Seth Hoenig
5173a12d81 e2e: consul namespace tests from nomad ent
(cherry-picked from ent without _ent things)

This is part 2/4 of e2e tests for Consul Namespaces. Took a
first pass at what the parameterized tests can look like, but
only on the ENT side for this PR. Will continue to refactor
in the next PRs.

Also fixes 2 bugs:
 - Config Entries registered by Nomad Server on job registration
   were not getting Namespace set
 - Group level script checks were not getting Namespace set

Those changes will need to be copied back to Nomad OSS.

Nomad OSS + no ACLs (previously, needs refactor)
Nomad ENT + no ACLs (this)
Nomad OSS + ACLs (todo)
Nomad ENT + ACLs (todo)
2021-04-19 15:35:31 -06:00
Nick Ethier
e834a60de1 plugins/drivers: fix deprecated fields 2021-04-16 14:13:29 -04:00
Nick Ethier
a7f079d5b9 tr: set cpuset cpus if reserved 2021-04-15 13:31:51 -04:00
Nick Ethier
f897ac79e8 client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
Mahmood Ali
fa95eb6e1c only publish measured metrics (#10376) 2021-04-13 11:39:33 -04:00
Seth Hoenig
a97254fa20 consul: plumbing for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
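
As a sketch of what this plumbing enables, a group-level Consul namespace
might eventually be specified like so (namespace name hypothetical):

group "api" {
  consul {
    // hypothetical Consul Enterprise namespace for this group's
    // service and check registrations
    namespace = "team-platform"
  }
}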
2021-04-05 10:03:19 -06:00
Mahmood Ali
b383f92188 oversubscription: set the linux memory limit
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease
driver implementation and adoption of the feature: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.

Drivers that use NomadResources will need to be updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
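
In jobspec terms, the new field pairs with the reserved memory like so
(values illustrative):

resources {
  cpu        = 100
  memory     = 256 // reserved; what the scheduler bin-packs (MemoryMB)
  memory_max = 512 // oversubscription ceiling (MemoryMaxMB); now also the Linux memory limit
}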
2021-03-30 16:55:58 -04:00
Tim Gross
14568b3e00 deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
Florian Apolloner
f21ab14690 Automatically populate CONSUL_HTTP_ADDR for connect native tasks in host networking mode. Fixes #10239 2021-03-28 14:34:31 +02:00
Seth Hoenig
7b8364d459 Merge pull request #10103 from AndrewChubatiuk/service-portlabel-interpolation-fix
fixed service interpolation for sidecar tasks
2021-03-17 10:40:48 -05:00
Adrian Todorov
2748d2a895 driver/docker: add extra labels (job name, task and task group name) 2021-03-08 08:59:52 -05:00
AndrewChubatiuk
5e81f0da3a fixed service interpolation for sidecar tasks 2021-03-01 10:39:14 +02:00
AndrewChubatiuk
b860cc44a0 allocate sidecar task port on host_network interface 2021-02-13 02:42:13 +02:00
Seth Hoenig
ceae8ad1cf consul/connect: Add support for Connect terminating gateways
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.

https://www.consul.io/docs/connect/gateways/terminating-gateway

These gateways are declared as part of a task group level service
definition within the connect stanza.

service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      terminating {
        // terminating-gateway configuration entry
      }
    }
  }
}
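
A filled-in sketch, with hypothetical service names, might look like:

service {
  name = "api-tgw"
  connect {
    gateway {
      terminating {
        // expose this non-mesh service to the mesh via the gateway
        service {
          name = "legacy-billing"
        }
      }
    }
  }
}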

Currently Envoy is the only supported gateway implementation in
Consul. The gateway task can be customized by configuring the
connect.sidecar_task block.

When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.

Gateways require Consul 1.8.0+, checked by a node constraint.

Closes #9445
2021-01-25 10:36:04 -06:00
Seth Hoenig
0019f2cfed consul/connect: ensure proxyID in test case 2021-01-20 09:48:12 -06:00
Seth Hoenig
071060f19d client: use closed variable in append 2021-01-20 09:20:50 -06:00
Seth Hoenig
7ff2f9c1bc consul/connect: Enable running multiple ingress gateways per Nomad agent
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.

Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.

Fixes #9834
2021-01-19 12:58:36 -06:00
Kris Hicks
2cd7136bc7 Fix some errcheck errors (#9811)
* Throw away result of multierror.Append

When given a *multierror.Error, it is mutated, therefore the return
value is not needed.

* Simplify MergeMultierrorWarnings, use StringBuilder

* Hash.Write() never returns an error

* Remove error that was always nil

* Remove error from Resources.Add signature

When this was originally written it could return an error, but that was
refactored away, and callers of it as of today never handle the error.

* Throw away results of io.Copy during Bridge

* Handle errors when computing node class in test
2021-01-14 12:46:35 -08:00
Mahmood Ali
354b2ee1a6 tests: deflake TestTaskRunner_StatsHook_Periodic (#9734)
This PR deflakes TestTaskRunner_StatsHook_Periodic tests and adds backoff when the driver closes the channel.

TestTaskRunner_StatsHook_Periodic is currently the most flaky test - failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .

This change has two components:

First, it updates the StatsHook so that it backs off when the stats channel is closed. In the context of the test where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands of updates during the period. In a real context, if a driver doesn't implement the stats handler properly or when a task finishes, we may generate far too many Stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .

Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting in a spurious update.
2021-01-06 16:03:00 -05:00
Seth Hoenig
5e156d55d3 consul: always include task services hook
Previously, Nomad would optimize out the services task runner
hook for tasks which were initially submitted with no services
defined. This causes a problem when the job is later updated to
include service(s) on that task: nothing happens, because the
hook is not present to handle the service registration in .Update().

Instead, always enable the services hook. The group services
alloc runner hook is already always enabled.

Fixes #9707
2021-01-05 08:47:19 -06:00
Chris Baker
6419966e88 enabled broken test that is no longer broken 2021-01-04 22:25:35 +00:00
Chris Baker
11f0fed935 update template and artifact interpolation to use client-relative paths
resolves #9839
resolves #6929
resolves #6910

e2e: template env interpolation path testing
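
A sketch of the interpolation this enables (source URL hypothetical):

artifact {
  source      = "https://releases.example.com/bundle.tar.gz"
  destination = "${NOMAD_ALLOC_DIR}/bundle" // now resolved relative to the client's alloc dir
}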
2021-01-04 22:25:34 +00:00
Tim Gross
004f1c972f template: trigger change_mode for dynamic secrets on restore (#9636)
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.

This changeset uses the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template has changed, which it only will
have for dynamic secrets.
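
The behavior matters for templates like this sketch (secret path, common
name, and signal are hypothetical):

template {
  // hypothetical PKI secret: a restore creates a new lease, so the
  // rendered content changes and change_mode should fire
  data          = "{{ with secret \"pki/issue/web\" \"common_name=web.example.com\" }}{{ .Data.certificate }}{{ end }}"
  destination   = "secrets/cert.pem"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}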
2020-12-16 13:36:19 -05:00
Seth Hoenig
f0f6f3a18f consul/connect: fix regression where client connect images ignored
Nomad v1.0.0 introduced a regression where the client configurations
for `connect.sidecar_image` and `connect.gateway_image` would be
ignored despite being set. This PR restores that functionality.

There was a missing layer of interpolation that needs to occur for
these parameters. Since Nomad 1.0 now supports dynamic envoy versioning
through the ${NOMAD_envoy_version} pseudo variable, we basically need
to first interpolate

  ${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version}

then use Consul at runtime to resolve to a real image, e.g.

  envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0

Of course, if the version of Consul is too old to provide an envoy
version preference, we then need to know to fallback to the old
version of envoy that we used before.

  envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09

Beyond that, we also need to continue to support jobs that set the
sidecar task themselves, e.g.

  sidecar_task { config { image = "custom/envoy" } }

which itself could include the pseudo envoy version variable.
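
The restored behavior means a client configuration roughly like this
works again (registry path hypothetical):

client {
  meta {
    // hypothetical custom image; ${NOMAD_envoy_version} is still resolved
    // against the local Consul agent's preferred Envoy versions
    "connect.sidecar_image" = "registry.example.com/envoy:v${NOMAD_envoy_version}"
  }
}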
2020-12-14 09:47:55 -06:00
Kris Hicks
7747124ef0 Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Kris Hicks
85ed8ddd4f Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Michael Schurter
5b83ca0b5d client: skip broken test and fix assertion 2020-11-18 10:01:02 -08:00
Michael Schurter
cd7226d398 client: fix interpolation in template source
While Nomad v0.12.8 fixed `NOMAD_{ALLOC,TASK,SECRETS}_DIR` use in
`template.destination`, interpolating these variables in
`template.source` caused a path escape error.

**Why not apply the destination fix to source?**

The destination fix forces destination to always be relative to the task
directory. This makes sense for the destination as a destination outside
the task directory would be unreachable by the task. There's no reason
to ever render a template outside the task directory. (Using `..` does
allow destinations to escape the task directory if
`template.disable_file_sandbox = true`. That's just awkward and unsafe
enough I hope no one uses it.)

There is a reason to source a template outside a task
directory. At least if there weren't then I can't think of why we
implemented `template.disable_file_sandbox`. So v0.12.8 left the
behavior of `template.source` the more straightforward "Interpolate and
validate."

However, since outside of `raw_exec` every other driver uses absolute
paths for `NOMAD_*_DIR` interpolation, this means those variables are
unusable unless `disable_file_sandbox` is set.

**The Fix**

The variables are now interpolated as relative paths *only for the
purpose of rendering templates.* This is an unfortunate special case,
but reflects the fact that the template runner's view of the filesystem is
completely different (unconstrained) from the task's view (chrooted).
Arguably the values of these variables *should be context-specific.*
I think it's more reasonable to think of the "hack" as the template
runner executing uncontainerized than to call giving templates different
paths a hack.
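
With the fix, a stanza like the following sketch should render
(paths hypothetical):

template {
  // interpolated as a task-relative path for rendering purposes only
  source      = "${NOMAD_TASK_DIR}/templates/app.conf.tpl"
  destination = "local/app.conf"
}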

**TODO**

- [ ] E2E tests
- [ ] Job validation may still be broken and prevent my fix from
      working?

**raw_exec**

`raw_exec` is actually broken _a different way_ as exercised by tests in
this commit. I think we should probably remove these tests and fix that
in a followup PR/release, but I wanted to leave them in for the initial
review and discussion. Since non-containerized source paths are broken
anyway, perhaps there's another solution to this entire problem I'm
overlooking?
2020-11-17 22:03:04 -08:00
Seth Hoenig
459112b41d Merge pull request #9352 from hashicorp/f-artifact-headers
jobspec: add support for headers in artifact stanza
2020-11-13 14:04:27 -06:00
Seth Hoenig
6c7578636c jobspec: add support for headers in artifact stanza
This PR adds the ability to set HTTP headers when downloading
an artifact from an `http` or `https` resource.

The implementation in `go-getter` is such that a new `HTTPGetter`
must be created for each artifact that sets headers (as opposed
to conveniently setting headers per-request). This PR maintains
the memoization of the default Getter objects, creating new ones
only for artifacts where headers are set.
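
A jobspec sketch using the new capability (URL and header values
hypothetical):

artifact {
  source = "https://artifacts.example.com/releases/app.tar.gz"
  headers {
    Authorization   = "Bearer s3cr3t-token" // hypothetical token
    "X-Environment" = "staging"
  }
}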

Closes #9306
2020-11-13 12:03:54 -06:00
Jasmine Dahilig
b85cce42fe lifecycle: add poststop hook (#8194) 2020-11-12 08:01:42 -08:00
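
A sketch of the new hook in use (task name, image, and endpoint
hypothetical):

task "report-completion" {
  lifecycle {
    hook = "poststop" // runs after the main tasks have stopped
  }

  driver = "docker"
  config {
    image = "curlimages/curl:7.73.0"
    args  = ["-X", "POST", "https://hooks.example.com/job-done"]
  }
}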