nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-03 17:05:43 +03:00

Author	SHA1	Message	Date
Tim Gross	d1faead371	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
Tim Gross	91c81ba984	file rename	2022-08-26 16:06:24 -04:00
Vladimir Sokolov	b665da6c5f	cli: force periodic job if its id equals search prefix	2022-08-26 10:54:37 -04:00
Luiz Aoqui	119bb50151	Post 1.3.4 release (#14329 ) * Generate files for 1.3.4 release * Prepare for next release * Update CHANGELOG.md Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>	2022-08-26 10:09:13 -04:00
Charlie Voiselle	a36ae675ff	SV API: return upserted variable to caller (#14325 ) * Return created variable to caller in HTTP and Go APIs * Update tests for returned values	2022-08-25 17:38:15 -04:00
Seth Hoenig	7cd0c35060	Merge pull request #14301 from hashicorp/b-fix-check-status-test-racey testing: fix flakey check status test	2022-08-25 08:30:46 -05:00
Luiz Aoqui	f74f50804a	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpart, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be cleared cleanly using `clearDriverHandle`; otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loop starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting a taskRunner into two methods: - `Restart` will retain the current behaviour and will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Tim Gross	e886d5d055	vault: detect namespace change in config reload (#14298 ) The `namespace` field was not included in the equality check between old and new Vault configurations, which meant that a Vault config change that only changed the namespace would not be detected as a change and the clients would not be reloaded. Also, the comparison for boolean fields such as `enabled` and `allow_unauthenticated` was on the pointer and not the value of that pointer, which results in spurious reloads in real config reload that is easily missed in typical test scenarios. Includes a minor refactor of the order of fields for `Copy` and `Merge` to match the struct fields in hopes it makes it harder to make this mistake in the future, as well as additional test coverage.	2022-08-24 17:03:29 -04:00
Seth Hoenig	3ae6db666a	testing: fix flakey check status test This PR fixes a flakey test where we did not wait on the check status to actually become failing (go too fast and you just get a pending check). Instead add a helper for waiting on any check in the alloc to become the state we are looking for.	2022-08-24 15:11:41 -05:00
Piotr Kazmierczak	34e4b080f6	template: custom change_mode scripts (#13972 ) This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707	2022-08-24 17:43:01 +02:00
Luiz Aoqui	43fe45d972	fix minor issues found during ENT merge (#14250 )	2022-08-23 17:22:18 -04:00
Luiz Aoqui	6070fa0c8d	allocrunner: refactor task coordinator (#14009 ) The current implementation for the task coordinator unblocks tasks by performing destructive operations over its internal state (like closing channels and deleting keys from maps). This presents a problem in situations where we would like to revert the state of a task, such as when restarting an allocation with tasks that have already exited. With this new implementation the task coordinator behaves more like a finite state machine where tasks may be blocked/unblocked multiple times by performing a state transition. This initial part of the work only refactors the task coordinator and is functionally equivalent to the previous implementation. Future work will build upon this to provide bug fixes and enhancements.	2022-08-22 18:38:49 -04:00
Tim Gross	2eaf3d7270	allow ACL policies to be associated with workload identity (#14140 ) The original design for workload identities and ACLs allows for operators to extend the automatic capabilities of a workload by using a specially-named policy. This has shown to be potentially unsafe because of naming collisions, so instead we'll allow operators to explicitly attach a policy to a workload identity. This changeset adds workload identity fields to ACL policy objects and threads that all the way down to the command line. It also adds a new secondary index to the ACL policy table on namespace and job so that claim resolution can efficiently query for related policies.	2022-08-22 16:41:21 -04:00
Luiz Aoqui	934bafb922	template: use pointer values for gid and uid (#14203 ) When a Nomad agent starts and loads jobs that already existed in the cluster, the default template uid and gid was being set to 0, since this is the zero value for int. This caused these jobs to fail in environments where it was not possible to use 0, such as in Windows clients. In order to differentiate between an explicit 0 and a template where these properties were not set we need to use a pointer.	2022-08-22 16:25:49 -04:00
Seth Hoenig	5694999c61	cli: display nomad service check status output in CLI commands This PR adds some NSD check status output to the CLI. 1. The 'nomad alloc status' command produces nsd check summary output (if present) 2. The 'nomad alloc checks' sub-command is added to produce complete nsd check output (if present)	2022-08-19 09:18:29 -05:00
Michael Schurter	01648e615a	client: fix data races in config handling (#14139 ) Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked. This change takes the following approach to safely handling configs in the client: 1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex 2. Since the mutex only protects the config pointer itself and not fields inside the Config struct: all config mutation must be done on a copy of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates. 3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion. 4. To facilitate deep copying I made an internally backward incompatible API change: our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"	2022-08-18 16:32:04 -07:00
Seth Hoenig	9cc27b3c2c	cleanup: fixing warnings and refactoring of command package, part 2 This PR continues the cleanup of the command package, removing linter warnings, refactoring to use helpers, making tests easier to read, etc.	2022-08-18 09:43:39 -05:00
Seth Hoenig	6baf6a1f8f	cleanup: first pass at fixing command package warnings This PR is the first of several for cleaning up warnings, and refactoring bits of code in the command package. First pass is over acl_ files and gets some helpers in place.	2022-08-17 15:33:37 -05:00
Piotr Kazmierczak	c4be2c6078	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
Seth Hoenig	4e3c3d472e	Merge pull request #14132 from hashicorp/build-update-go1.19 build: update to go1.19	2022-08-16 11:20:27 -05:00
Seth Hoenig	0c62f445c3	build: run gofmt on all go source files Go 1.19 will forcefully format all your doc strings. To get this out of the way, here is one big commit with all the changes gofmt wants to make.	2022-08-16 11:14:11 -05:00
Seth Hoenig	8e6bff2d0f	Merge pull request #14102 from hashicorp/cleanup-mesh-gateway-value cleanup: consul mesh gateway type need not be pointer	2022-08-16 10:07:16 -05:00
Charlie Voiselle	22194d437a	SV CLI: var init (#13820 ) * Nomad dep: add muesli/reflow * SV CLI: var init	2022-08-15 13:43:29 -04:00
Tim Gross	3af6937cf3	move secure variable conflict resolution to state store (#13922 ) Move conflict resolution implementation into the state store with a new Apply RPC. This also makes the RPC for secure variables much more similar to Consul's KV, which will help us support soft deletes in a post-1.4.0 version of Nomad. Reimplement quotas in the state store functions. Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>	2022-08-15 11:19:53 -04:00
Seth Hoenig	47d44d62bb	cleanup: consul mesh gateway type need not be pointer This PR changes the use of structs.ConsulMeshGateway to value types instead of via pointers. This will help in a follow up PR where we cleanup a lot of custom comparison code with helper functions instead.	2022-08-13 11:26:58 -05:00
Seth Hoenig	e96d52d87f	cli: respect vault token in plan command This PR fixes a regression where the 'job plan' command would not respect a Vault token if set via --vault-token or $VAULT_TOKEN. Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062 Fixes https://github.com/hashicorp/nomad/issues/13939	2022-08-11 08:54:08 -05:00
Seth Hoenig	169211251d	Merge pull request #14069 from brian-athinkingape/cli-fix-memstats-cgroupsv2 cli: for systems with cgroups v2, fix allocation resource utilization showing 0 memory used	2022-08-11 07:27:48 -05:00
Luiz Aoqui	939d643fec	Post 1.3.3 release (#14064 ) * Generate files for 1.3.3 release * Prepare for next release * Merge release 1.3.3 files Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>	2022-08-09 17:27:29 -04:00
Brian Chau	0464889edc	cli: for systems with cgroups v2, fix allocation resource utilization showing 0 memory used	2022-08-09 14:09:14 -07:00
Derek Strickland	696deb9600	Add Nomad RetryConfig to agent template config (#13907 ) * add Nomad RetryConfig to agent template config	2022-08-03 16:56:30 -04:00
Piotr Kazmierczak	2e0b875b14	client: enable specifying user/group permissions in the template stanza (#13755 ) * Adds Uid/Gid parameters to template. * Updated diff_test * fixed order * update jobspec and api * removed obsolete code * helper functions for jobspec parse test * updated documentation * adjusted API jobs test. * propagate uid/gid setting to job_endpoint * adjusted job_endpoint tests * making uid/gid into pointers * refactor * updated documentation * updated documentation * Update client/allocrunner/taskrunner/template/template_test.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update website/content/api-docs/json-jobs.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * propagating documentation change from Luiz * formatting * changelog entry * changed changelog entry Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-08-02 22:15:38 +02:00
James Rasell	581390bed1	cli: do not import structs, use API package only. (#13938 )	2022-08-02 16:33:08 +02:00
Eric Weber	07bbf1f91e	Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919 ) * Allow specification of CSI staging and publishing directory path * Add website documentation for stage_publish_dir * Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir * Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir	2022-08-02 09:42:44 -04:00
Tim Gross	82861ae8d7	secure vars: enforce ENT quotas (OSS work) (#13951 ) Move the secure variables quota enforcement calls into the state store to ensure quota checks are atomic with quota updates (in the same transaction). Switch to a machine-size int instead of a uint64 for quota tracking. The ENT-side quota spec is described as int, and negative values have a meaning as "not permitted at all". Using the same type for tracking will make it easier to do the math around checks, and uint64 is infeasibly large anyways. Add secure vars to quota HTTP API and CLI outputs and API docs.	2022-08-02 09:32:09 -04:00
Tim Gross	e4cceab4f0	fix flaky `TestAgent_ProxyRPC_Dev` test (#13925 ) This test is a fairly trivial test of the agent RPC, but the test setup waits for a short fixed window after the node starts to send the RPC. After looking at detailed logs for recent test failures, it looks like the node registration for the first node doesn't get a chance to happen before we make the RPC call. Use `WaitForResultUntil` to give the test more time to run in slower test environments, while allowing it to finish quickly if possible.	2022-07-28 14:47:15 -04:00
Lars Lehtonen	5d8258ecab	testing: fix dropped test errors in command/agent (#13926 )	2022-07-28 11:04:31 -04:00
Seth Hoenig	61e885dfb3	cleanup: use constants for on_update values	2022-07-21 13:09:47 -05:00
Seth Hoenig	4508af8160	Merge pull request #13715 from hashicorp/dev-nsd-checks client: add support for checks in nomad services	2022-07-21 10:22:57 -05:00
Seth Hoenig	9f37b84db4	Merge pull request #13870 from hashicorp/exp-fp-optimization client: use test timeouts for network fingerprinters in dev mode	2022-07-21 08:18:02 -05:00
Tim Gross	33f4f50044	search: use secure vars ACL policy for secure vars context (#13788 ) The search RPC used a placeholder policy for searching within the secure variables context. Now that we have ACL policies built for secure variables, we can use them for search. Requires a new loose policy for checking if a token has any secure variables access within a namespace, so that we can filter on specific paths in the iterator.	2022-07-21 08:39:36 -04:00
Seth Hoenig	74bc3dd120	devmode: use minimal network timeouts for network fingerprinters in dev mode	2022-07-20 15:13:14 -05:00
Tim Gross	69c9dc140d	keyring: use nanos for `CreateTime` in key metadata (#13849 ) Most of our objects use int64 timestamps derived from `UnixNano()` instead of `time.Time` objects. Switch the keyring metadata to use `UnixNano()` for consistency across the API.	2022-07-20 14:46:57 -04:00
Tim Gross	587360543b	docs: keyring commands (#13690 ) Document the secure variables keyring commands, document the aliased gossip keyring commands, and note that the old gossip keyring commands are deprecated.	2022-07-20 14:14:10 -04:00
Will Jordan	662a12a41e	Return 429 response on HTTP max connection limit (#13621 ) Return 429 response on HTTP max connection limit. Instead of silently closing the connection, return a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached. Set a 10-millisecond write timeout and rate limiter for connection-limit 429 response to prevent writing the HTTP response from consuming too many server resources. Add `nomad.agent.http.exceeded` metric counting the number of HTTP connections exceeding concurrency limit.	2022-07-20 14:12:21 -04:00
hc-github-team-nomad-core	5f8889d522	Generate files for 1.3.2 release	2022-07-13 19:33:41 -04:00
Michael Schurter	d857be3c45	http: only log alloc/exec errors when non-nil (#13730 )	2022-07-13 09:44:51 -07:00
Luiz Aoqui	d456cc1e7f	Track plan rejection history and automatically mark clients as ineligible (#13421 ) Plan rejections occur when the scheduler worker and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at once, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
Seth Hoenig	b2861f2a9b	client: add support for checks in nomad services This PR adds support for specifying checks in services registered to the built-in nomad service provider. Currently only HTTP and TCP checks are supported, though more types could be added later.	2022-07-12 17:09:50 -05:00
Michael Schurter	f998a2b77b	core: merge reserved_ports into host_networks (#13651 ) Fixes #13505 This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports). As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility: Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me. So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.	2022-07-12 14:40:25 -07:00
Charlie Voiselle	b949ee690c	SV: CLI: var list command (#13707 ) * SV CLI: var list * Fix wildcard prefix filtering Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-07-12 12:49:39 -04:00


3331 Commits