nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 09:25:46 +03:00

Author	SHA1	Message	Date
Seth Hoenig	b937e7baf4	consul: avoid triggering unnecessary sync when removing workload There are bits of logic in callers of RemoveWorkload on group/task cleanup hooks which call RemoveWorkload with the "Canary" version of the workload, in case the alloc is marked as a Canary. This logic triggers an extra sync with Consul, and also doesn't do the intended behavior - for which no special casing is necessary anyway. When the workload is marked for removal, all associated services and checks will be removed regardless of the Canary status, because the service and check IDs do not incorporate the canary-ness in the first place. The only place where canary-ness matters is when updating a workload, where we need to compute the hash of the services and checks to determine whether they have been modified, the Canary flag of which is a part of that. Fixes #10842	2021-07-06 14:08:42 -05:00
James Rasell	a4156c3e94	tests: remove duplicate import statements.	2021-06-11 09:39:22 +02:00
Seth Hoenig	0bc8a33084	consul: probe consul namespace feature before using namespace api This PR changes Nomad's wrapper around the Consul NamespaceAPI so that it will detect if the Consul Namespaces feature is enabled before making a request to the Namespaces API. Namespaces are not enabled in Consul OSS, and require a suitable license to be used with Consul ENT. Previously Nomad would check for a 404 status code when makeing a request to the Namespaces API to "detect" if Consul OSS was being used. This does not work for Consul ENT with Namespaces disabled, which returns a 500. Now we avoid requesting the namespace API altogether if Consul is detected to be the OSS sku, or if the Namespaces feature is not licensed. Since Consul can be upgraded from OSS to ENT, or a new license applied, we cache the value for 1 minute, refreshing on demand if expired. Fixes https://github.com/hashicorp/nomad-enterprise/issues/575 Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688. This turns out not to be possible due to a chicken-egg situation between bootstrapping the agent and setting up the consul client. Also fun: the Consul fingerprinter creates its own Consul client, because there is no [currently] no way to pass the agent's client through the fingerprint factory.	2021-06-07 12:19:25 -05:00
Mahmood Ali	61a3b73d44	drivers: Capture exit code when task is killed (#10494 ) This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in https://github.com/hashicorp/nomad/issues/10430 . We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown. Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes. By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired. I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update. Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.	2021-05-04 10:54:00 -04:00
Seth Hoenig	a97254fa20	consul: plubming for specifying consul namespace in job/group This PR adds the common OSS changes for adding support for Consul Namespaces, which is going to be a Nomad Enterprise feature. There is no new functionality provided by this changeset and hopefully no new bugs.	2021-04-05 10:03:19 -06:00
Mahmood Ali	b383f92188	oversubscription: set the linux memory limit Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease drivers implementation and adoption of the features: drivers that use `resources.LinuxResources.MemoryLimitBytes` don't need to be updated. Drivers that use NomadResources will need to updated to track the new field value. Given that tasks aren't guaranteed to use up the excess memory limit, this is a reasonable compromise.	2021-03-30 16:55:58 -04:00
Mahmood Ali	d155e4d412	tests: restart restartpolicy for all tasks in tests	2020-03-24 21:52:48 -04:00
Mahmood Ali	df08c6c399	tests: populate task restart policy properly	2020-03-24 21:44:37 -04:00
Jasmine Dahilig	90fa242d83	fix failing ci test: TestTaskRunner_UnregisterConsul_Retries	2020-03-21 17:52:54 -04:00
Seth Hoenig	04b526662c	e2e: setup consul ACLs a little more correctly	2020-01-31 19:06:11 -06:00
Seth Hoenig	0f285b840e	tests: skip some SIDS hook tests if running tests as root	2020-01-31 19:05:32 -06:00
Seth Hoenig	08951ac759	client: additional test cases around failures in SIDS hook	2020-01-31 19:05:27 -06:00
Seth Hoenig	40de85867d	client: manage TR kill from parent on SI token derivation failure Re-orient the management of the tr.kill to happen in the parent of the spawned goroutine that is doing the actual token derivation. This makes the code a little more straightforward, making it easier to reason about not leaking the worker goroutine.	2020-01-31 19:05:02 -06:00
Seth Hoenig	bbedeb670d	nomad,client: apply more comment/style PR tweaks	2020-01-31 19:04:52 -06:00
Seth Hoenig	d24d470775	comments: cleanup some leftover debug comments and such	2020-01-31 19:04:35 -06:00
Seth Hoenig	d85cccc8d0	nomad: fixup token policy validation	2020-01-31 19:04:08 -06:00
Seth Hoenig	6bc6a52f99	client: enable envoy bootstrap hook to set SI token When creating the envoy bootstrap configuration, we should append the "-token=<token>" argument in the case where the sidsHook placed the token in the secrets directory.	2020-01-31 19:04:01 -06:00
Seth Hoenig	674ccaa122	nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to recieve RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task.	2020-01-31 19:03:53 -06:00
Seth Hoenig	f8666bb1f9	client: enable nomad client to request and set SI tokens for tasks When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks.	2020-01-31 19:03:38 -06:00
Michael Schurter	43909b1374	Revert "Revert "Use joint context to cancel prestart hooks""	2019-10-08 11:34:09 -07:00
Michael Schurter	680e30457f	Revert "Use joint context to cancel prestart hooks"	2019-10-08 11:27:08 -07:00
Drew Bailey	12be12020e	simplify logic to check for vault read event defer shutdown to cleanup after failed run Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> update comment to include ctx note for shutdown	2019-09-30 11:02:14 -07:00
Drew Bailey	e0dbbb0950	Use joint context to cancel prestart hooks fixes https://github.com/hashicorp/nomad/issues/6382 The prestart hook for templates blocks while it resolves vault secrets. If the secret is not found it continues to retry. If a task is shutdown during this time, the prestart hook currently does not receive shutdownCtxCancel, causing it to hang. This PR joins the two contexts so either killCtx or shutdownCtx cancel and stop the task.	2019-09-30 10:48:01 -07:00
Chris Baker	03500494b2	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	7050e14eb5	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	240d68765c	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Danielle Lancashire	1aa6bc80d8	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	5933528404	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they every correctly started running, which is not true when you recieve an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Michael Schurter	796c05b9b8	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	6a2792ad90	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	e7042b674b	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	21e895e2e7	Revert "executor/linux: add defensive checks to binary path" This reverts commit `cb36f4537e`.	2019-04-02 11:17:12 -07:00
Michael Schurter	cb36f4537e	executor/linux: add defensive checks to binary path	2019-04-02 09:40:53 -07:00
Michael Schurter	254901a51e	executor/linux: make chroot binary paths absolute Avoid libcontainer.Process trying to lookup the binary via $PATH as the executor has already found where the binary is located.	2019-04-01 15:45:31 -07:00
Michael Schurter	2dbc06de61	tests: port pre-0.9 task env tests I chose to make them more of integration tests since there's a lot more plumbing involved. The internal implementation details of how we craft task envs can now change and these tests will still properly assert the task runtime environment is setup properly.	2019-03-25 09:46:53 -07:00
Michael Schurter	8d409a6e39	client: test logmon cleanup The test is sadly quite complicated and peeks into things (logmon's reattach config) AR doesn't normally have access to. However, I couldn't find another way of asserting logmon got cleaned up without resorting to smaller unit tests. Smaller unit tests risk re-implementing dependencies in an unrealistic way, so I opted for an ugly integration test.	2019-03-04 13:15:15 -08:00
Michael Schurter	db9daf6631	client: ensure task is cleaned up when terminal This commit is a significant change. TR.Run is now always executed, even for terminal allocations. This was changed to allow TR.Run to cleanup (run stop hooks) if a handle was recovered. This is intended to handle the case of Nomad receiving a DesiredStatus=Stop allocation update, persisting it, but crashing before stopping AR/TR. The commit also renames task runner hook data as it was very easy to accidently set state on Requests instead of Responses using the old field names.	2019-03-01 14:00:23 -08:00
Mahmood Ali	e1e5053936	emit TaskRestartSignal event on vault restart When Vault token expires and task is restarted, emit `TaskRestartSignal` similar to v0.8.7	2019-02-22 15:56:14 -05:00
Mahmood Ali	8b7f66499f	address review comments	2019-02-22 15:56:14 -05:00
Mahmood Ali	d80774fde0	tests: port TestTaskRunner_VaultManager_Signal From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427	2019-02-22 15:53:04 -05:00
Mahmood Ali	b8c74ff6ca	tests: port TestTaskRunner_VaultManager_Restart From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352	2019-02-22 15:53:04 -05:00
Mahmood Ali	ce04bb7440	tests: port TestTaskRunner_UnregisterConsul_Retries From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620	2019-02-22 15:53:04 -05:00
Mahmood Ali	d6a5a1c5a5	tests: port TestTaskRunner_Template_NewVaultToken From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275	2019-02-22 15:53:04 -05:00
Mahmood Ali	90ca1ab5a3	tests: port TestTaskRunner_Template_Artifact From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195	2019-02-22 15:52:59 -05:00
Michael Schurter	cf66e25e57	client: restart on recoverable StartTask errors Fixes restarting on recoverable errors from StartTask. Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered the bug.	2019-02-21 15:30:49 -08:00
Michael Schurter	414532adab	test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	d4a17ae71f	test: port TestTaskRunner_DriverNetwork from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	c51a54cfee	client: artifact errors are retry-able 0.9.0beta2 contains a regression where artifact download errors would not cause a task restart and instead immediately fail the task. This restores the pre-0.9 behavior of retrying all artifact errors and adds missing tests.	2019-02-20 07:21:27 -08:00
Michael Schurter	83979252cd	tests: add new task runner test helper Adds a new helper and removes a duplicated test.	2019-02-20 07:21:27 -08:00
Mahmood Ali	eb8b19ec82	test: improve readability of duration Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-02-14 08:12:06 -08:00

1 2

80 Commits