nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
Luiz Aoqui	f33bb5ec4e	client: retry RPC call when no server is available (#15140 ) When a Nomad service starts it tries to establish a connection with servers, but it also runs alloc runners to manage whatever allocations it needs to run. The alloc runner will invoke several hooks to perform actions, with some of them requiring access to the Nomad servers, such as Native Service Discovery Registration. If the alloc runner starts before a connection is established the alloc runner will fail, causing the allocation to be shut down. This is particularly problematic for disconnected allocations that are reconnecting, as they may fail as soon as the client reconnects. This commit changes the RPC request logic to retry it, using the existing retry mechanism, if there are no servers available.	2022-11-04 14:09:39 -04:00
Charlie Voiselle	52a254ba22	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Ethan	ca06227c88	fix: batchFirstFingerprints does not update device on node after v1.3.5 (#15125 ) * fix: update device in batch first footprint * cl: add cl note Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-11-03 16:31:39 -05:00
Tim Gross	5a5b4b04cb	WI: set identity to client secret if missing (#15121 ) Allocations created before 1.4.0 will not have a workload identity token. When the client running these allocs is upgraded to 1.4.x, the identity hook will run and replace the node secret ID token used previously with an empty string. This causes service discovery queries to fail. Fall back to the node's secret ID when the allocation doesn't have a signed identity. Note that pre-1.4.0 allocations won't have templates that read Variables, so there's no threat that this new node ID secret will be able to read data that the allocation shouldn't have access to.	2022-11-03 11:10:11 -04:00
Seth Hoenig	ee2880ceaf	build: update linters (#15063 ) Remove dead linters and add some interesting new ones.	2022-10-27 15:02:30 -05:00
Seth Hoenig	d978e7711a	client: ensure minimal cgroup controllers enabled (#15027 ) * client: ensure minimal cgroup controllers enabled This PR fixes a bug where Nomad could not operate properly on operating systems that set the root cgroup.subtree_control to a set of controllers that do not include the minimal set of controllers needed by Nomad. Nomad needs these controllers enabled to operate: - cpuset - cpu - io - memory - pids Now, Nomad will ensure these controllers are enabled during Client initialization, adding them to cgroup.subtree_control as necessary. This should be particularly helpful on the RHEL/CentOS/Fedora family of systems. Ubuntu systems should be unaffected as they enable all controllers by default. Fixes: https://github.com/hashicorp/nomad/issues/14494 * docs: cleanup doc string * client: cleanup controller writes, enhance log messages	2022-10-24 16:08:54 -05:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Seth Hoenig	faac908a81	consul: register checks along with service on initial registration (#14944 ) * consul: register checks along with service on initial registration This PR updates Nomad's Consul service client to include checks in an initial service registration, so that the checks associated with the service are registered "atomically" with the service. Before, we would only register the checks after the service registration, which causes problems where the service is deemed healthy, even if one or more checks are unhealthy - especially problematic in the case where SuccessBeforePassing is configured. Fixes #3935 * cr: followup to fix cause of extra consul logging * cr: fix another bug * cr: fixup changelog	2022-10-19 12:40:56 -05:00
Seth Hoenig	0b69a52a40	e2e: convert flaky exec download in chroot unit test into e2e test (#14949 ) Similar to https://github.com/hashicorp/nomad/pull/14710, convert flaky test into e2e test.	2022-10-19 08:22:32 -05:00
Michael Schurter	f91100bda3	client: remove unused LogOutput and LogLevel (#14867 ) * client: remove unused LogOutput * client: remove unused config.LogLevel	2022-10-11 09:24:40 -07:00
Seth Hoenig	9e9ddbdd3b	helpers: lockfree lookup of nobody user on unix systems (#14866 ) * helpers: lockfree lookup of nobody user on linux and darwin This PR continues the nobody user lookup saga, by making the nobody user lookup lock-free on linux and darwin. By doing the lookup in an init block this originally broke on Windows, where we must avoid doing the lookup at all. We can get around that breakage by only doing the lookup on linux/darwin where the nobody user is going to exist. Also return the nobody user by value so that a copy is created that cannot be modified by callers of Nobody(). * helper: move nobody code into unix file	2022-10-11 08:38:05 -05:00
Seth Hoenig	fed329883e	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
Hemanth Krishna	a1388217aa	enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775 ) Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>	2022-10-06 16:33:28 -04:00
Pablo Ruiz García	536260c7bb	Invoke FingerprintManager's Reload() func during agent's SIGHUP (#14615 ) Fixes #14614	2022-10-06 16:22:59 -04:00
Tim Gross	d3a55915f5	client: defer `nobody` user lookup so Windows doesn't panic (#14790 ) In #14742 we introduced a cached lookup of the `nobody` user, which is only ever called on Unixish machines. But the initial caching was being done in an `init` block, which meant it was being run on Windows as well. This prevents the Nomad agent from starting on Windows. An alternative fix here would be to have a separate `init` block for Windows and Unix, but this potentially masks incorrect behavior if we accidentally added a call to the `Nobody()` method on Windows later. This way we're forced to handle the error in the caller.	2022-10-04 11:52:12 -04:00
Luiz Aoqui	88b61cb5b4	template: apply splay value on change_mode script (#14749 ) Previously, the splay timeout was only applied if a template re-render caused a restart or a signal action. The `change_mode = "script"` was running after the `if restart || len(signals) != 0` check, so it was invoked at all times. This change refactors the logic so it's easier to notice that new `change_mode` options should start only after `splay` is applied.	2022-09-30 12:04:22 -04:00
Seth Hoenig	e4e5bc5cef	client: protect user lookups with global lock (#14742 ) * client: protect user lookups with global lock This PR updates Nomad client to always do user lookups while holding a global process lock. This is to guard against concurrency-unsafe implementations of NSS, while still enabling NSS lookups of users (i.e. cannot use osusergo). * cl: add cl	2022-09-29 09:30:13 -05:00
Michael Schurter	0eb711925e	test: skip chown test if nonroot (#14738 ) CI always runs this as root, so it worked there and always scared me when I ran it locally.	2022-09-28 14:45:38 -07:00
Seth Hoenig	32be86831f	e2e: convert chroot env unit tests into e2e tests (#14710 ) This PR translates two of our most flaky unit tests into e2e tests, where they fit much more naturally.	2022-09-26 15:40:29 -05:00
Michael Schurter	2e059c624f	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Luiz Aoqui	1b831f3da4	client: recover from getter panics (#14696 ) The artifact getter uses the go-getter library to fetch files from different sources. Any bug in this library that results in a panic can cause the entire Nomad client to crash due to a single file download attempt. This change aims to guard against these types of crashes by recovering from panics when the getter attempts to download an artifact. The resulting panic is converted to an error that is stored as a task event for operator visibility and the panic stack trace is logged to the client's log.	2022-09-26 15:16:26 -04:00
Michael Schurter	d677b48625	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new behavior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Seth Hoenig	211ac8ec23	deps: update set and test (#14680 ) This PR updates go-set and shoenig/test, which introduced some breaking API changes.	2022-09-26 08:28:03 -05:00
Tim Gross	786dc5ff94	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Jorge Marey	451ecf358c	connect: add nomad env to envoy bootstrap (#12959 ) * Add nomad env to envoy bootstrap * Add changelog file	2022-09-22 13:18:18 -05:00
Jorge Marey	3aa184b544	Add Namespace, Job and Group to envoy stats (#14311 )	2022-09-22 10:38:21 -04:00
Seth Hoenig	ff1a30fe8d	cleanup more helper updates (#14638 ) * cleanup: refactor MapStringStringSliceValueSet to be cleaner * cleanup: replace SliceStringToSet with actual set * cleanup: replace SliceStringSubset with real set * cleanup: replace SliceStringContains with slices.Contains * cleanup: remove unused function SliceStringHasPrefix * cleanup: fixup StringHasPrefixInSlice doc string * cleanup: refactor SliceSetDisjoint to use real set * cleanup: replace CompareSliceSetString with SliceSetEq * cleanup: replace CompareMapStringString with maps.Equal * cleanup: replace CopyMapStringString with CopyMap * cleanup: replace CopyMapStringInterface with CopyMap * cleanup: fixup more CopyMapStringString and CopyMapStringInt * cleanup: replace CopySliceString with slices.Clone * cleanup: remove unused CopySliceInt * cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice * cleanup: replace CopyMap with maps.Clone * cleanup: run go mod tidy	2022-09-21 14:53:25 -05:00
Luiz Aoqui	a993931edf	test: remove flaky Gate test (#14575 ) The concurrent gate access test is flaky since it depends on the order of operations of two concurrent goroutines. Despite the heavy bias towards one of the results, it's still possible to end the execution with a closed gate. I believe this case was created to test an earlier implementation where the gate state was stored and mutated internally, so the access had to be protected by a lock. However, the final implementation changed this approach to be only channel-based, so there is no need for this flaky test anymore.	2022-09-19 11:31:03 -04:00
Seth Hoenig	9274677423	cleanup: create interface for check watcher and mock it in nsd tests (#14577 ) * cleanup: create interface for check watcher and mock it in nsd tests * cleanup: add comments for check watcher interface	2022-09-14 08:25:20 -05:00
Michael Schurter	2a9a361a35	2 small data race fixes in logmon and check tests (#14538 ) * logmon: fix data race around oldestLogFileIdx * checks: fix 2 data races in tests * logmon: move & rename lock to logically group	2022-09-13 12:54:06 -07:00
Seth Hoenig	e94782527a	servicedisco: implement check_restart for nomad service checks This PR implements support for check_restart for checks registered in the Nomad service provider. Unlike Consul, Nomad service checks never report a "warning" status, and so the check_restart.ignore_warnings configuration is not valid for Nomad service checks.	2022-09-13 08:59:23 -05:00
Seth Hoenig	a3971f89a7	Merge pull request #14546 from hashicorp/f-refactor-check-watcher client: refactor check watcher to be reusable	2022-09-13 07:32:32 -05:00
Seth Hoenig	aab1ae646e	client: refactor check watcher to be reusable This PR refactors agent/consul/check_watcher into client/serviceregistration, and abstracts away the Consul-specific check lookups. In doing so we should be able to reuse the existing check watcher logic for also watching NSD checks in a followup PR. A chunk of consul/unit_test.go is removed - we'll cover that in e2e tests in a follow PR if needed. In the long run I'd like to remove this whole file.	2022-09-12 10:13:31 -05:00
Tim Gross	8ff79d8a2d	CI: make `make check` clean on macOS (#14528 ) Running `make check` on macOS identifies some dead code because the code is used only with the Linux build tag. Move this code into appropriately-tagged code files.	2022-09-09 12:26:34 -04:00
Seth Hoenig	52b3273b5d	cleanup: consolidate interfaces for workload restarting This PR combines two of the same interface definitions around workload restarting	2022-09-09 08:59:04 -05:00
Charlie Voiselle	61a6dbcfcb	Add client scheduling eligibility to heartbeat (#14483 )	2022-09-08 14:31:36 -04:00
Tiernan	df043d747c	Fix error handling in Client consulDiscoveryImpl (#14431 ) Added a missing `continue` on non-nil error to avoid accidentally using a bad peer.	2022-09-02 15:13:03 -04:00
Luiz Aoqui	7d88937751	connect: interpolate task env in config values (#14445 ) When configuring Consul Service Mesh, it's sometimes necessary to provide dynamic values that are only known to Nomad at runtime. By interpolating configuration values (in addition to configuration keys), users are able to pass these dynamic values to Consul from their Nomad jobs.	2022-09-02 15:00:28 -04:00
James Rasell	25e7c2ffa4	chore: remove use of "err" as a log line context key for errors. (#14433 ) Log lines which include an error should use the full term "error" as the context key. This provides consistency across the codebase and avoids a Go style which operators might not be aware of.	2022-09-01 15:06:10 +02:00
Charlie Voiselle	015e4617b2	Vars: Update CT dependency to support variables. (#14399 ) * Update Consul Template dep to support Nomad vars * Remove `Peering` config for Consul Testservers Upgrading to the 1.14 Consul SDK introduces an additional default configuration, `Peering`, that is not compatible with versions of Consul before v1.13.0. Because Nomad tests against Consul v1.11.1, this configuration has to be nil'ed out before passing it to the Consul binary.	2022-08-30 15:26:01 -04:00
Tim Gross	13bc6d6d8a	testing: setting env var incompatible with parallel tests (#14405 ) Neither the `os.Setenv` nor the `t.Setenv` helper is safe to use in parallel tests because environment variables are process-global. The stdlib panics if you try to do this. Remove the `ci.Parallel()` call from all tests where we're setting environment variables.	2022-08-30 14:49:03 -04:00
Seth Hoenig	a147cf9893	Merge pull request #14385 from hashicorp/f-cg-use-kill cgroups: refactor v2 kill path to use cgroups.kill interface file	2022-08-30 09:02:02 -05:00
Seth Hoenig	9206952d1c	Merge pull request #14290 from hashicorp/cleanup-more-helper-cleanup cleanup: tidy up helper package some more	2022-08-30 08:19:48 -05:00
Seth Hoenig	0ff431563f	cgroups: refactor v2 kill path to use cgroups.kill interface file This PR refactors the cgroups v2 group kill code path to use the cgroups.kill interface file for destroying the cgroup. Previously we copied the freeze + sigkill + unfreeze pattern from the v1 code, but v2 provides a more efficient and more race-free way to handle this. Closes #14371	2022-08-29 14:55:13 -05:00
Seth Hoenig	13bd08b15b	client: refactor cpuset manager initialization This PR refactors the code path in Client startup for setting up the cpuset cgroup manager (non-linux systems not affected). Before, there was a logic bug where we would try to read the cpuset.cpus.effective cgroup interface file before ensuring nomad's parent cgroup existed. Therefore that file would not exist, and the list of useable cpus would be empty. Tasks started thereafter would not have a value set for their cpuset.cpus. The refactoring fixes some less than ideal coding style. Instead we now bootstrap each cpuset manager type (v1/v2) within its own constructor. If something goes awry during bootstrap (e.g. cgroups not enabled), the constructor returns the noop implementation and logs a warning. Fixes #14229	2022-08-25 11:18:43 -05:00
Luiz Aoqui	f74f50804a	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle`; otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Seth Hoenig	1b1a68e42f	cleanup: move fs helpers into escapingfs	2022-08-24 14:45:34 -05:00
Seth Hoenig	24a1c48f47	client/logmon: acquire executable in init block This PR causes the logmon task runner to acquire the binary of the Nomad executable in an 'init' block, so as to almost certainly get the name while the nomad file still exists. This is an attempt at fixing the case where a deleted Nomad file (e.g. during upgrade) may be getting renamed with a mysterious suffix first. If this doesn't work, as a last resort we can literally just trim the mystery string. Fixes: #14079	2022-08-24 13:17:20 -05:00
Piotr Kazmierczak	34e4b080f6	template: custom change_mode scripts (#13972 ) This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707	2022-08-24 17:43:01 +02:00
Seth Hoenig	7f5dfe4478	cleanup: remove more copies of min/max from helper	2022-08-24 09:56:15 -05:00


4641 Commits