When we feasibility check a dynamic host volume against a volume request, we
check the attachment mode and access mode. This only ensures that the
capabilities match, but doesn't enforce the semantics of those capabilities
against other claims that may already be held on the volume.
Add support for checking the requested capability against the other allocations
that have claimed the volume.
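As a rough sketch of the semantics being enforced (the types and helper below
are illustrative stand-ins, not Nomad's real structs), a single-writer volume
should reject a second writer claim while still allowing read-only claims:
```go
// Illustrative sketch only; AccessMode, Claim, and canClaim are hypothetical
// stand-ins, not Nomad's real types.
package main

import "fmt"

type AccessMode string

const (
	SingleNodeReader AccessMode = "single-node-reader-only"
	SingleNodeWriter AccessMode = "single-node-writer"
)

type Claim struct {
	AllocID  string
	ReadOnly bool
}

// canClaim reports whether a new claim with the requested access mode is
// compatible with the claims other allocations already hold on the volume.
func canClaim(mode AccessMode, readOnly bool, existing []Claim) error {
	if mode == SingleNodeReader && !readOnly {
		return fmt.Errorf("volume only supports read-only access")
	}
	if mode == SingleNodeWriter && !readOnly {
		for _, c := range existing {
			if !c.ReadOnly {
				return fmt.Errorf("volume already has a writer: %s", c.AllocID)
			}
		}
	}
	return nil
}

func main() {
	existing := []Claim{{AllocID: "alloc-1", ReadOnly: false}}
	fmt.Println(canClaim(SingleNodeWriter, false, existing)) // rejected: second writer
	fmt.Println(canClaim(SingleNodeWriter, true, existing))  // allowed: read-only claim
}
```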
Ref: https://github.com/hashicorp/nomad/pull/24479
Static host volumes have a simple readonly toggle, but dynamic host volumes have
a more complex set of capabilities similar to CSI volumes. Update the
feasibility checker to account for these capabilities and volume readiness.
Also fixes a minor bug in the state store where a soft-delete (not yet
implemented) could cause a volume to be marked ready again. This is needed to
support testing the readiness checking in the scheduler.
Ref: https://github.com/hashicorp/nomad/pull/24479
When making a request to create a dynamic host volume, users can pass a node
pool and constraints instead of a specific node ID.
This changeset implements the node scheduling logic by instantiating a node
pool filter and a constraint checker borrowed from the scheduler package. Because
host volumes with the same name can't land on the same host, we don't need to
support `distinct_hosts`/`distinct_property`; this would be challenging anyways
without building out a much larger node iteration mechanism to keep track of
usage across multiple hosts.
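A loose sketch of the shape of that selection (the helper and types below are
hypothetical, not the server's real code): candidate nodes are filtered by node
pool and then by the request's constraints before one is picked.
```go
// Hypothetical sketch of node selection for a volume create request; the
// Node/Constraint types and pickNodeForVolume are illustrative only.
package placement

import (
	"errors"
	"math/rand"
)

type Node struct {
	ID       string
	NodePool string
	Attrs    map[string]string
}

// Constraint is a predicate derived from the request's constraint block.
type Constraint func(*Node) bool

func pickNodeForVolume(nodes []*Node, pool string, constraints []Constraint) (*Node, error) {
	var feasible []*Node
OUTER:
	for _, n := range nodes {
		if pool != "" && n.NodePool != pool {
			continue // wrong node pool
		}
		for _, c := range constraints {
			if !c(n) {
				continue OUTER // failed a constraint check
			}
		}
		feasible = append(feasible, n)
	}
	if len(feasible) == 0 {
		return nil, errors.New("no node satisfies the volume's node pool and constraints")
	}
	// Any feasible node will do; pick one at random.
	return feasible[rand.Intn(len(feasible))], nil
}
```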
Ref: https://github.com/hashicorp/nomad/pull/24479
The core scheduler relies on a special table in the state store—the TimeTable—to
figure out which objects can be GC'd. The TimeTable correlates Raft indices
with object insertion times, a solution we used before most of the objects we
store in state contained timestamps. This introduced a bit of memory overhead
and complexity, but most importantly it meant that any GC threshold users set
greater than timeTableLimit = 72 * time.Hour was ignored. This PR removes the
TimeTable and relies on object timestamps to determine whether they can be GC'd
or not.
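For illustration, the eligibility decision reduces to a timestamp comparison
rather than a TimeTable lookup (a minimal sketch, not the actual GC code):
```go
// Minimal sketch: GC eligibility from the object's own timestamp, with no
// TimeTable involved. Not the actual core scheduler code.
package main

import (
	"fmt"
	"time"
)

// eligibleForGC reports whether an object created at objTime has aged past
// the configured GC threshold.
func eligibleForGC(objTime time.Time, threshold time.Duration) bool {
	return time.Since(objTime) > threshold
}

func main() {
	created := time.Now().Add(-96 * time.Hour)
	// A 90h threshold now behaves as configured, even though it exceeds the
	// old 72h timeTableLimit.
	fmt.Println(eligibleForGC(created, 90*time.Hour)) // true
}
```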
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
(CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files))
* scheduler: enhance numa aware scheduling with support for devices
* cr: add comments
A change to the network{cni{}} block means that the user wants the network
config to change, and that only happens during initial alloc setup, so we need
to replace the alloc(s) to get fresh network(s) to reconfigure from scratch.
E.g. a job plan diff like this:
```
+/- Task Group: "g" (1 in-place update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
should instead be
```
+/- Task Group: "g" (1 create/destroy update)
+ Network {
+ CNIConfig {
+ a: "ayy"
}
```
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `template` blocks, which need extra room to write a temporary file
to the secrets directory that is then atomically renamed over the old file.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
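A minimal sketch of that accounting, assuming illustrative field names rather
than Nomad's real resource structs:
```go
// Illustrative sketch only; the Resources struct and helpers below are not
// Nomad's actual types.
package main

import "fmt"

const defaultSecretsMB = 1 // the 1MiB default stays "free" for compatibility

type Resources struct {
	MemoryMB  int
	SecretsMB int
}

// schedulerMemoryMB is the memory the scheduler reserves for the allocation:
// task memory plus any explicitly requested secrets tmpfs.
func schedulerMemoryMB(r Resources) int {
	if r.SecretsMB <= defaultSecretsMB {
		return r.MemoryMB
	}
	return r.MemoryMB + r.SecretsMB
}

// driverMemoryMB is what gets handed to the task driver, with the tmpfs size
// subtracted back out just before the task is configured.
func driverMemoryMB(r Resources) int {
	if r.SecretsMB <= defaultSecretsMB {
		return r.MemoryMB
	}
	return schedulerMemoryMB(r) - r.SecretsMB
}

func main() {
	r := Resources{MemoryMB: 256, SecretsMB: 16}
	fmt.Println(schedulerMemoryMB(r), driverMemoryMB(r)) // 272 256
}
```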
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
The NUMA topology struct field `NodeIDs` is an `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack or persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.
Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and has always been serialized as empty, doing
so would break RPC wire compatibility across upgrades. Instead, create a new
field that's populated at the same time we populate the more useful `idset.Set`,
and repopulate the set on demand.
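Roughly, the approach looks like the following (a simplified stand-in; the real
field lives on the NUMA topology struct and uses `idset.Set`):
```go
// Simplified stand-in for the real topology struct; illustrates keeping an
// exported, serializable slice alongside the unexported set representation.
package topology

type Topology struct {
	// nodeIDs stands in for the idset.Set field, which msgpack drops because
	// the set has no exported members.
	nodeIDs map[uint8]struct{}

	// NodeIDsSlice is populated at the same time as the set and survives
	// msgpack serialization.
	NodeIDsSlice []uint8
}

// SetNodes populates both representations together.
func (t *Topology) SetNodes(ids ...uint8) {
	t.nodeIDs = make(map[uint8]struct{}, len(ids))
	for _, id := range ids {
		t.nodeIDs[id] = struct{}{}
	}
	t.NodeIDsSlice = append([]uint8(nil), ids...)
}

// Nodes rebuilds the set on demand if it was lost across the wire, so the
// scheduler never dereferences a nil set.
func (t *Topology) Nodes() map[uint8]struct{} {
	if t.nodeIDs == nil {
		t.SetNodes(t.NodeIDsSlice...)
	}
	return t.nodeIDs
}
```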
Fixes: https://hashicorp.atlassian.net/browse/NET-9924
When an allocation fails it triggers an evaluation. The evaluation is processed
and the scheduler sees it needs to reschedule, which triggers a follow-up
eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement
alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).
But in the case where the follow-up eval can't place all allocs (there aren't
enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then
creates a blocked eval. The plan applier stops the failed alloc. Then when the
blocked eval is processed, the job is missing an allocation, so the scheduler
creates a new allocation. This allocation is _not_ a replacement from the
perspective of the scheduler, so it isn't handed the previous allocation's
`RescheduleTracker`.
This changeset fixes this by annotating the reschedule tracker whenever the
scheduler can't place a replacement allocation. We check this annotation for
allocations that have the `stop` desired status when filtering out allocations
to pass to the reschedule tracker. I've also included tests that cover this case
and expand coverage of the relevant area of the code.
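A rough sketch of that filtering decision (the field names here are
illustrative, not the actual annotation):
```go
// Illustrative only: the Alloc shape and FollowupPlacementFailed annotation
// are stand-ins, not Nomad's real structs.
package reschedule

type RescheduleEvent struct {
	RescheduleTime int64
	PrevAllocID    string
}

type Alloc struct {
	DesiredStatus     string
	RescheduleTracker []RescheduleEvent

	// FollowupPlacementFailed marks an alloc that was stopped while the
	// scheduler could not place its replacement.
	FollowupPlacementFailed bool
}

// keepsRescheduleHistory reports whether a previous allocation's tracker
// should still be carried over to a new placement.
func keepsRescheduleHistory(a *Alloc) bool {
	if a.DesiredStatus == "stop" {
		// Normally stopped allocs are filtered out; the annotation keeps the
		// history alive when the replacement never got placed.
		return a.FollowupPlacementFailed
	}
	return true
}
```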
Fixes: https://github.com/hashicorp/nomad/issues/12147
Fixes: https://github.com/hashicorp/nomad/issues/17072
Some of our scheduler tests use the `AllocName` function from the structs
package incorrectly. This function should always receive the `Job.ID` and not
the `Job.Name`. Fix these usages so the mistake doesn't get copy-pasted into
future tests.
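For reference, a sketch of the distinction (the local `allocName` helper below
just mimics the naming format and is not the real `structs.AllocName`):
```go
// allocName is a local stand-in that mimics the "<job>.<group>[<index>]"
// naming format; it is not the real structs.AllocName.
package main

import "fmt"

func allocName(jobID, group string, idx uint) string {
	return fmt.Sprintf("%s.%s[%d]", jobID, group, idx)
}

func main() {
	jobID, jobName := "example/dispatch-1731", "example"
	fmt.Println(allocName(jobID, "web", 0))   // correct: built from Job.ID
	fmt.Println(allocName(jobName, "web", 0)) // the misuse the tests had
}
```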
While working on #20462, I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:
* Migrate a few tests from `testify` to `shoenig/test` that I'm going to touch
in #12319 anyways.
* Add test names to the node down test.
* Update the disconnected client test so that we always re-process the
pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
test setup rather than implied by whether we re-process the pending/blocked
eval.
Ref: https://github.com/hashicorp/nomad/issues/20462
Ref: https://github.com/hashicorp/nomad/pull/12319
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients were making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.
In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
This commit introduces new options for reconciling a reconnecting allocation and its replacement:
* Best score (the current implementation)
* Keep original
* Keep replacement
* Keep the one that has run the longest
This is achieved by adding a new dependency to the allocReconciler that calls the corresponding function depending on the task group's disconnect strategy. For more detailed information, refer to the RFC for the new disconnected clients stanza.
Resolves #15144.
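A minimal sketch of the strategy dispatch (names are illustrative, not the real
allocReconciler API):
```go
// Illustrative sketch of choosing between a reconnecting alloc and its
// replacement; not the real allocReconciler code.
package reconcile

type Strategy string

const (
	BestScore       Strategy = "best_score" // current behavior
	KeepOriginal    Strategy = "keep_original"
	KeepReplacement Strategy = "keep_replacement"
	LongestRunning  Strategy = "longest_running"
)

type Alloc struct {
	ID        string
	StartedAt int64
	Score     float64
}

// PickReconnecting returns the allocation to keep when a disconnected alloc
// reconnects and finds a replacement already running.
func PickReconnecting(s Strategy, original, replacement Alloc) Alloc {
	switch s {
	case KeepOriginal:
		return original
	case KeepReplacement:
		return replacement
	case LongestRunning:
		if original.StartedAt <= replacement.StartedAt {
			return original // started earlier, has run longer
		}
		return replacement
	default: // BestScore preserves the previous behavior
		if original.Score >= replacement.Score {
			return original
		}
		return replacement
	}
}
```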
* tests: swap testify for test in plugins/csi/client_test.go
* tests: swap testify for test in testutil/
* tests: swap testify for test in host_test.go
* tests: swap testify for test in plugin_test.go
* tests: swap testify for test in utils_test.go
* tests: swap testify for test in scheduler/
* tests: swap testify for test in parse_test.go
* tests: swap testify for test in attribute_test.go
* tests: swap testify for test in plugins/drivers/
* tests: swap testify for test in command/
* tests: fixup some test usages
* go: run go mod tidy
* windows: cpuset test only on linux
This PR is the first of two that will implement the new Disconnect block. In this PR the new block is introduced in a way that is backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
When an allocation can't be placed because of a port collision, the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metric was not being emitted when
preemption was enabled.
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.
Fixes: https://github.com/hashicorp/nomad/issues/13139
This commit introduces the parameter preventRescheduleOnLost, which indicates that the task group can't afford to have multiple instances running at the same time. When a node goes down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run.
If max_client_disconnect is also enabled and the group has a reschedule policy, an error will be returned.
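A hedged sketch of that validation (simplified fields, not the real jobspec structs):
```go
// Simplified validation sketch; the TaskGroup fields here are stand-ins for
// the real jobspec structs.
package jobspec

import (
	"errors"
	"time"
)

type TaskGroup struct {
	PreventRescheduleOnLost bool
	MaxClientDisconnect     *time.Duration
	RescheduleAttempts      int
}

// Validate rejects the combination of prevent-reschedule-on-lost,
// max_client_disconnect, and a reschedule policy.
func Validate(tg *TaskGroup) error {
	if tg.PreventRescheduleOnLost && tg.MaxClientDisconnect != nil && tg.RescheduleAttempts > 0 {
		return errors.New("max_client_disconnect cannot be combined with prevent_reschedule_on_lost and a reschedule policy")
	}
	return nil
}
```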
Implements issue #10366
Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Clients prior to Nomad 1.7 cannot support the new workload identity-based
authentication to Consul and Vault. Add an implicit Nomad version constraint on
job submission for task groups that use the new workflow.
Includes a constraint test showing same-version prerelease handling.
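Something along these lines (the exact version string and operand are
assumptions for illustration, not taken from the change itself):
```go
// Assumed shape of the implicit constraint; the version string and operand
// here are illustrative guesses, not copied from the change.
package constraints

type Constraint struct {
	LTarget string
	RTarget string
	Operand string
}

// ImplicitIdentityConstraint filters out clients that predate workload
// identity support for Consul and Vault.
func ImplicitIdentityConstraint() *Constraint {
	return &Constraint{
		LTarget: "${attr.nomad.version}",
		// A "-a" prerelease suffix is a common way to make ">= 1.7.0" also
		// match 1.7.0 prereleases, the "same-version prerelease" case the
		// test covers.
		RTarget: ">= 1.7.0-a",
		Operand: "semver",
	}
}
```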
* Update distinct_host feasibility checking to honor the job's namespace. Fixes #9792
* Added test to verify original condition and that fix resolved it.
* Added documentation
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placement calculation's assumption that the names of allocations
being replaced can be blindly copied to their replacements. This
is not correct in all cases, particularly when dealing with
canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicates are found and a new name is picked before the
plan is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so it does
not have to be regenerated or rebuilt in another data structure.
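A rough sketch of the duplicate tracking idea (this is not the real alloc name
index type, just an illustration):
```go
// Not the real alloc name index tracker; a sketch of detecting duplicate
// indexes and picking a fresh one before the plan is submitted.
package names

type NameIndex struct {
	used map[uint]struct{}
}

func New() *NameIndex {
	return &NameIndex{used: map[uint]struct{}{}}
}

// Claim records an index and reports whether it was already in use.
func (n *NameIndex) Claim(idx uint) (duplicate bool) {
	if _, ok := n.used[idx]; ok {
		return true
	}
	n.used[idx] = struct{}{}
	return false
}

// Next claims and returns the lowest index not yet in use, e.g. for a
// placement whose copied name turned out to be a duplicate.
func (n *NameIndex) Next() uint {
	for i := uint(0); ; i++ {
		if !n.Claim(i) {
			return i
		}
	}
}
```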
This PR fixes a long-lived bug where disconnecting allocations were never rescheduled according to their reschedule policy, only replaced because the group count was short. The default reschedule delay for service and batch jobs is 30 and 5 seconds respectively; to properly reschedule disconnected allocs, they need to be able to be queued for a later reschedule, a path that was not handled before. This PR introduces a way to handle such allocations.
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm restore
Handle the case where an upgraded server dequeues an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
Host volume checks were treated as regular feasibility checks. This had two
unintended consequences.
The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.
If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.
```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
	feasible := check.Feasible(option)
	if !feasible {
		// If the job hasn't escaped, set it to be ineligible since it
		// failed a job check.
		if !jobEscaped {
			evalElig.SetJobEligibility(false, option.ComputedClass)
		}
		continue OUTER
	}
}
```
This results in all nodes with the same computed class to be skipped,
even if they do have the desired host volume.
```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
	// Fast path the ineligible case
	metrics.FilterNode(option, "computed class ineligible")
	continue
```
The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.
But when the first allocation found a node match, it registered the node
class as eligible for the task group.
```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
	evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```
This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.
```go
case EvalComputedClassEligible:
	// Fast path the eligible case
	if w.available(option) {
		return option
	}

	// We match the class but are temporarily unavailable
	continue OUTER
```
These problems did not happen with CSI volumes, somewhat by accident.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).
And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).
By placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers`, we avoid these problems for host volume feasibility as well.
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replacing one call of `.Slice` with `.ForEach` to avoid
making the intermediate copy.
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.
Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21