nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-02 08:25:43 +03:00

Author	SHA1	Message	Date
Tim Gross	c043d1c850	scheduler: property testing of reconcile reconnecting (#26180 ) To help break down the larger property tests we're doing in #26167 and #26172 into more manageable chunks, pull out a property test for just the `reconcileReconnecting` method. This method helpfully already defines its important properties, so we can implement those as test assertions. Ref: https://hashicorp.atlassian.net/browse/NMD-814 Ref: https://github.com/hashicorp/nomad/pull/26167 Ref: https://github.com/hashicorp/nomad/pull/26172	2025-07-07 09:40:49 -04:00
Tim Gross	5c909213ce	scheduler: add reconciler annotations to completed evals (#26188 ) The output of the reconciler stage of scheduling is only visible via debug-level logs, typically accessible only to the cluster admin. We can give job authors better ability to understand what's happening to their jobs if we expose this information to them in the `eval status` command. Add the reconciler's desired updates to the evaluation struct so it can be exposed in the API. This increases the size of evals by roughly 15% in the state store, or a bit more when there are preemptions (but we expect this will be a small minority of evals). Ref: https://hashicorp.atlassian.net/browse/NMD-818 Fixes: https://github.com/hashicorp/nomad/issues/15564	2025-07-07 09:40:21 -04:00
Tim Gross	9a29df2292	scheduler: emit structured logs from reconciliation (#26169 ) Both the cluster reconciler and node reconciler emit a debug-level log line with their results, but these are unstructured multi-line logs that are annoying for operators to parse. Change these to emit structured key-value pairs like we do everywhere else. Ref: https://hashicorp.atlassian.net/browse/NMD-818 Ref: https://go.hashi.co/rfc/nmd-212	2025-07-01 10:37:44 -04:00
Piotr Kazmierczak	36e7148247	scheduler: doc.go files for new packages (#26177 )	2025-07-01 16:28:33 +02:00
Tim Gross	ec8250ed30	property test generation for reconciler (#26142 ) As part of ongoing work to make the scheduler more legible and more robustly tested, we're implementing property testing of at least the reconciler. This changeset provides some infrastructure we'll need for generating the test cases using `pgregory.net/rapid`, without building out any of the property assertions yet (that'll be in upcoming PRs over the next couple weeks). The alloc reconciler generator produces a job, a previous version of the job, a set of tainted nodes, and a set of existing allocations. The node reconciler generator produces a job, a set of nodes, and allocations on those nodes. Reconnecting allocs are not yet well-covered by these generators, and with ~40 dimensions covered so far we may need to pull those out to their own tests in order to get good coverage. Note the scenarios only randomize fields of interest; fields like the job name that don't impact the reconciler would use up available shrink cycles on failed tests without actually reducing the scope of the scenario. Ref: https://hashicorp.atlassian.net/browse/NMD-814 Ref: https://github.com/flyingmutant/rapid	2025-06-26 11:09:53 -04:00
Piotr Kazmierczak	27da75044e	scheduler: move tests that depend on calling schedulers into `integration` package (#26037 )	2025-06-24 09:31:10 +02:00
Piotr Kazmierczak	12ddb6db94	scheduler: capture reconciler state in ReconcilerState object (#26088 ) This changeset separates reconciler fields into their own sub-struct to make testing easier and the code more explicit about what fields relate to which state.	2025-06-23 15:36:39 +02:00
Piotr Kazmierczak	1030760d3f	scheduler: adjust method comments and names to reflect recent refactoring (#26085 ) Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-06-20 17:23:31 +02:00
Piotr Kazmierczak	b82fd2e159	scheduler: refactor cluster reconciler to avoid hidden state mutation (#26042 ) Cluster reconciler code is notoriously hard to follow because most of its methods continuously mutate the fields of the allocReconciler object. Even for top-level methods this makes the code hard to follow, and it gets really gnarly with lower-level methods (of which there are many). This changeset proposes a refactoring that makes the vast majority of said methods return explicit values and avoid mutating object fields.	2025-06-20 07:37:16 +02:00
Piotr Kazmierczak	0ddbc548a3	scheduler: rename reconciliation package to `reconciler` (#26038 ) nouns are better than verbs for package names	2025-06-12 14:36:09 +02:00
Piotr Kazmierczak	199d12865f	scheduler: isolate `feasibility` (#26031 ) This change isolates all the code that deals with node selection in the scheduler into its own package called feasible. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-06-11 20:11:04 +02:00
Piotr Kazmierczak	76e3c2961a	scheduler: isolate reconciliation code (#26002 ) This moves all the code of service/batch and system/sysbatch reconciliation into a new reconcile package.	2025-06-10 15:46:39 +02:00
Piotr Kazmierczak	ce054aae96	scheduler: add a readme and start documenting low level implementation details (#25986 ) In an effort to improve the readability and maintainability of nomad/scheduler package, we begin with a README file that describes its operation in more detail than the official documentation does. This PR will be followed by a few small ones that move the code around that package, improve variable naming and also keep that readme up to date. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-06-05 15:36:17 +02:00
Piotr Kazmierczak	648bacda77	testing: migrate nomad/scheduler off of testify (#25968 ) In the spirit of #25909, this PR removes testify dependencies from the scheduler package, along with reflect.DeepEqual removal. This is again a combination of semgrep and hx editing magic. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-06-04 09:29:28 +02:00
tehut	55523ecf8e	Add NodeMaxAllocations to client configuration (#25785 ) * Set MaxAllocations in client config Add NodeAllocationTracker struct to Node struct Evaluate MaxAllocations in AllocsFit function Set up cli config parsing Integrate maxAllocs into AllocatedResources view Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-05-22 12:49:27 -07:00
Tim Gross	456d95a19e	scheduler: account for affinity value of zero in score normalization (#25800 ) If there are no affinities on a job, we don't want to count an affinity score of zero in the number of scores we divide the normalized score by. This is how we handle other scoring components like node reschedule penalties on nodes that weren't running the previous allocation. But we also exclude counting the affinity in the case where we have affinity but the value is zero. In pathological cases, this can result in a node with a low affinity being picked over a node with no affinity, because the denominator is 1 larger. Include zero-value affinities in the count of scores if the job has affinities but the value just happens to be zero. Fixes: https://github.com/hashicorp/nomad/issues/25621	2025-05-19 14:10:00 -04:00
Allison Larson	fd16f80b5a	Only error on constraints if no allocs are running (#25850 ) * Only error on constraints if no allocs are running When running `nomad job run <JOB>` multiple times with constraints defined, there should be no error as a result of filtering out nodes that do not/have not ever satisfied the constraints. When running a system job with constraints, any run after an initial startup returns an exit(2) and a warning about unplaced allocations due to constraints. This error is not encountered on the initial run, even though the constraint stays the same. This is because the node that satisfies the condition is already running the allocation, and the placement is ignored. Another placement is attempted, but the only node(s) left are the ones that do not satisfy the constraint. Nomad views this case (no allocations that were attempted to be placed could be placed successfully) as an error, and reports it as such. In reality, no allocations should be placed or updated in this case, but it should not be treated as an error. This change uses the `ignored` placements from diffSystemAlloc to attempt to determine if the case encountered is an error (no ignored placements means that nothing is already running, and is an error), or is not one (an ignored placement means that the task is already running somewhere on a node). It does this at the point where `failedTGAlloc` is populated, so placement functionality isn't changed, just the field that populates the error. There is functionality that should be preserved which (correctly) notifies a user if a job is attempted that cannot be run on any node due to the constraints filtering out all available nodes. This should still behave as expected. * Add changelog entry * Handle in-place updates for constrained system jobs * Update .changelog/25850.txt Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com> * Remove conditionals --------- Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>	2025-05-15 15:14:03 -07:00
Tim Gross	5208ad4c2c	scheduler: allow canaries to be migrated on node drain (#25726 ) When a node is drained that has canaries that are not yet healthy, the canaries may not be properly migrated and the deployment will halt. This happens only if there are more than `migrate.max_parallel` canaries on the node and the canaries are not yet healthy (ex. they have a long `update.min_healthy_time`). In this circumstance, the first batch of canaries are marked for migration by the drainer correctly. But then the reconciler counts these migrated canaries against the total number of expected canaries and no longer progresses the deployment. Because an insufficient number of allocations have reported they're healthy, the deployment cannot be promoted. When the reconciler looks for canaries to cancel, it leaves in the list any canaries that are already terminal (because there shouldn't be any work to do). But this ends up skipping the creation of a new canary to replace terminal canaries that have been marked for migration. Add a conditional for this case to cause the canary to be removed from the list of active canaries so we can replace it. Ref: https://hashicorp.atlassian.net/browse/NMD-560 Fixes: https://github.com/hashicorp/nomad/issues/17842	2025-04-24 09:24:28 -04:00
Allison Larson	50513a87b7	Preserve core resources during inplace service alloc updates (#25705 ) * Preserve core resources during inplace service alloc updates When an alloc is running with the core resources specified, and the alloc is able to be updated in place, the cores it is running on should be preserved. This fixes a bug where the allocation's task's core resources (CPU.ReservedCores) would be recomputed each time the reconciler checked that the allocation could continue to run on the given node. Under circumstances where a different core on the node became available before this check was made, the selection process could compute this new core as the core to run on, regardless of the core the allocation was already running on. The check takes into account other allocations running on the node with reserved cores, but cannot check itself. When this would happen for multiple allocations being evaluated in a single plan, the selection process would see the other cores being previously reserved but be unaware of the one it ran on, resulting in the same core being chosen over and over for each allocation that was being checked, and updated in the state store (but not on the node). Once those cores were chosen and committed for multiple allocs, the node would appear to be exhausted on the cores dimension, and it would prevent any additional allocations from being started on the node. The reconciler check/computation for allocations that are being updated in place and have resources.cores defined is effectively a check that the node has the available cores to run on, not a computation that should be changed. The fix still performs the check, but once it is successful any existing ReservedCores are preserved. Because any change to this resource is considered a "destructive change", this can be confidently preserved during the in-place update. * Adjust reservedCores scheduler test * Add changelog entry	2025-04-23 10:38:47 -07:00
Tim Gross	c205688857	scheduler: fix state corruption from reschedule tracker updates (#25698 ) In #12319 we fixed a bug where updates to the reschedule tracker would be dropped if the follow-up allocation failed to be placed by the scheduler in the later evaluation. We did this by mutating the previous allocation's reschedule tracker. But we did this without copying the previous allocation first and then making sure the updated copy was in the plan. This is unfortunately unsafe and corrupts the state store on the server where the scheduler ran; it may cause a race condition in RPC handlers and it causes the server to be out of sync with the other servers. This was discovered while trying to make all our tests race-free, but likely impacts production users. Copy the previous allocation before updating the reschedule tracker, and swap out the updated allocation in the plan. This also requires that we include the reschedule tracker in the "normalized" (stripped-down) allocations we send to the leader as part of a plan. Ref: https://github.com/hashicorp/nomad/pull/12319 Fixes: https://hashicorp.atlassian.net/browse/NET-12357	2025-04-18 08:42:54 -04:00
Tim Gross	5c89b07f11	CI: run copywrite on PRs, not just after merges (#25658 ) * CI: run copywrite on PRs, not just after merges * fix a missing copyright header	2025-04-10 17:01:34 -04:00
Carlos Galdino	048c5bcba9	Use core ID when selecting cores (#25340 ) * Use core ID when selecting cores If the available cores are not a contiguous set, the core selector might panic when trying to select cores. For example, consider a scenario where the available cores for the selector are the following: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47] This list contains 46 cores, because cores with IDs 0 and 24 are not included in the list. Before this patch, if we requested 46 cores, the selector would panic trying to access the item with index 46 in `cs.topology.Cores`. This patch changes the selector to use the core ID instead when looking for a core inside `cs.topology.Cores`. This prevents an out-of-bounds access that was causing the panic. Note: This patch takes the straightforward approach. Perhaps a better long-term solution would be to restructure the `numalib.Topology.Cores` field to be a `map[ID]Core`, but that is a much larger change that is more difficult to land. Also, the number of cores in our case is small (at most 192), so a search won't have any noticeable impact. * Add changelog entry * Build list of IDs inline	2025-04-10 13:04:15 -07:00
James Rasell	4c4cb2c6ad	agent: Fix misaligned contextual k/v logging arguments. (#25629 ) Arguments passed to hclog log lines should always have an even number to provide the expected k/v output.	2025-04-10 14:40:21 +01:00
Michael Smithhisler	f2b761f17c	disconnected: removes deprecated disconnect fields (#25284 ) The group-level fields stop_after_client_disconnect, max_client_disconnect, and prevent_reschedule_on_lost were deprecated in Nomad 1.8 and replaced by fields in the disconnect block. This change removes any logic related to those deprecated fields. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-03-05 14:46:02 -05:00
Tim Gross	1788bfb42e	remove addresses from node class hash (#24942 ) When a node is fingerprinted, we calculate a "computed class" from a hash over a subset of its fields and attributes. In the scheduler, when a given node fails feasibility checking (before fit checking) we know that no other node of that same class will be feasible, and we add the hash to a map so we can reject them early. This hash cannot include any values that are unique to a given node, otherwise no other node will have the same hash and we'll never save ourselves the work of feasibility checking those nodes. In #4390 we introduced the `nomad.advertise.address` attribute and in #19969 we introduced the `consul.dns.addr` attribute. Both of these are unique per node and break the hash. Additionally, we were not correctly filtering attributes out when checking if a node escaped the class by not filtering for attributes that start with `unique.`. The test for this introduced in #708 had an inverted assertion, which allowed this to pass unnoticed since the early days of Nomad. Ref: https://github.com/hashicorp/nomad/pull/708 Ref: https://github.com/hashicorp/nomad/pull/4390 Ref: https://github.com/hashicorp/nomad/pull/19969	2025-03-03 09:28:32 -05:00
James Rasell	7268053174	vault: Remove legacy token based authentication workflow. (#25155 ) The legacy workflow for Vault whereby servers were configured using a token to provide authentication to the Vault API has now been removed. This change also removes the workflow where servers were responsible for deriving Vault tokens for Nomad clients. The deprecated Vault config options used by the Nomad agent have all been removed except for "token", which is still in use by the Vault Transit keyring implementation. Job specification authors can no longer use the "vault.policies" parameter and should instead use "vault.role" when not using the default workload identity. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-28 07:40:02 +00:00
Piotr Kazmierczak	58c6387323	stateful deployments: task group host volume claims API (#25114 ) This PR introduces API endpoints /v1/volumes/claims/ and /v1/volumes/claim/:id for listing and deleting task group host volume claims, respectively.	2025-02-25 15:51:59 +01:00
Tim Gross	dc58f247ed	docs: clarify reschedule, migrate, and replacement terminology (#24929 ) Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover: * restart: when the tasks of an allocation fail and we try to restart the tasks in place. * reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again. * migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker. * replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count`. * replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker. Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa. Fixes: https://github.com/hashicorp/nomad/issues/24918 Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-18 09:31:03 -05:00
Paweł Bęza	43885f6854	Allow for in-place update when affinity or spread was changed (#25109 ) Similarly to #6732, this removes the affinity and spread checks for in-place updates. Both affinity and spread should act as soft preferences for the Nomad scheduler rather than strict constraints. Therefore, modifying them should not trigger job reallocation. Fixes #25070 Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-02-14 14:33:18 -05:00
Piotr Kazmierczak	5468829260	stateful deployments: fix return in the `hasVolumes` feasibility check (#25084 ) A return statement was missing in the sticky volume check—when we weren't able to find a suitable volume, we did not return false. This was caught by e2e test. This PR fixes the issue, and corrects and expands the unit test.	2025-02-11 18:57:48 +01:00
Piotr Kazmierczak	611452e1af	stateful deployments: use `TaskGroupVolumeClaim` table to associate volume requests with volume IDs (#24993 ) We introduce an alternative solution to the one presented in #24960, which is based on the state store and not previous-next allocation tracking in the reconciler. This new solution reduces cognitive complexity of the scheduler code at the cost of slightly more boilerplate code, but also opens up new possibilities in the future, e.g., allowing users to explicitly "un-stick" volumes with workloads still running. The new logic, split between the scheduler and the state store: in the scheduler, SetVolumes() sets ns, job, and tg. In the state store, upsertAllocsImpl() checks if the alloc requests sticky vols and consults the state; if there is no claim, it creates one. A TaskGroupVolumeClaim records the namespace, jobID, tg name, and vol ID, which uniquely identify the vol. Back in the scheduler, hasVolumes() consults the state and returns true if there's a match or if there is no previous claim. In the state store, DeleteJobTxn() removes the claim from the state.	2025-02-07 17:41:01 +01:00
Matt Keeler	833e240597	Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856 ) * Upgrade to using hashicorp/go-metrics@v0.5.4 This also requires bumping the dependencies for: * memberlist * serf * raft * raft-boltdb * (and indirectly hashicorp/mdns due to the memberlist or serf update) Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.	2025-01-31 15:22:00 -05:00
Michael Smithhisler	47c14ddf28	remove remote task execution code (#24909 )	2025-01-29 08:08:34 -05:00
Tim Gross	09eb473189	dynamic host volumes: set status unavailable on failed restore (#24962 ) When a client restarts but can't restore a volume (ex. the plugin is now missing), it's removed from the node fingerprint. So we won't allow future scheduling of the volume, but we were not updating the volume state field to report this reasoning to operators. Make debugging easier and the state field more meaningful by setting the value to "unavailable". Also, remove the unused "deleted" field. We did not implement soft deletes and aren't planning on it for Nomad 1.10.0. Ref: https://hashicorp.atlassian.net/browse/NET-11551	2025-01-27 16:35:53 -05:00
Tim Gross	7add04eb0f	refactor: volume request modes to be generic between DHV/CSI (#24896 ) When we implemented CSI, the types of the fields for access mode and attachment mode on volume requests were defined with a prefix "CSI". This gets confusing now that we have dynamic host volumes using the same fields. Fortunately the original was a typedef on string, and the Go API in the `api` package just uses strings directly, so we can change the name of the type without breaking backwards compatibility for the msgpack wire format. Update the names to `VolumeAccessMode` and `VolumeAttachmentMode`. Keep the CSI and DHV specific value constant names for these fields (they aren't currently 1:1), so that we can easily differentiate in a given bit of code which values are valid. Ref: https://github.com/hashicorp/nomad/pull/24881#discussion_r1920702890	2025-01-24 10:37:48 -05:00
Tim Gross	a7b5970d49	dynamic host volumes: cleanup comments (#24830 ) Some comment cleanups as we're wrapping up dynamic host volumes work: * We're not going to implement mount_options for host volumes, as the dynamic host volumes don't have the equivalent of the stage/publish phase that CSI volumes do. Users who want that sort of thing will pass them as `parameter` field during volume create/register. * The scheduler feasibility check prevents a dynamic host volume being claimed by a job in the wrong namespace, but the comment incorrectly identifies that code path as only being about the race between fingerprint and delete. Update the comment to make the intent clear so that we don't accidentally remove this behavior in the future.	2025-01-10 11:30:47 -05:00
Tim Gross	fea846189f	dynamic host volumes: account for other claims in capability check (#24684 ) When we feasibility check a dynamic host volume against a volume request, we check the attachment mode and access mode. This only ensures that the capabilities match, but doesn't enforce the semantics of the capabilities against other claims that may be made on the volume. Add support for checking the requested capability against other allocations that have claimed the volume. Ref: https://github.com/hashicorp/nomad/pull/24479	2024-12-19 09:25:55 -05:00
Piotr Kazmierczak	8cbb74786c	stateful deployments: find feasible node for sticky host volumes (#24558 ) This changeset implements node feasibility checks for sticky host volumes.	2024-12-19 09:25:55 -05:00
Tim Gross	3143019d85	dynamic host volumes: capabilities check during scheduling (#24617 ) Static host volumes have a simple readonly toggle, but dynamic host volumes have a more complex set of capabilities similar to CSI volumes. Update the feasibility checker to account for these capabilities and volume readiness. Also fixes a minor bug in the state store where a soft-delete (not yet implemented) could cause a volume to be marked ready again. This is needed to support testing the readiness checking in the scheduler. Ref: https://github.com/hashicorp/nomad/pull/24479	2024-12-19 09:25:54 -05:00
Tim Gross	bbf49a9050	dynamic host volumes: node selection via constraints (#24518 ) When making a request to create a dynamic host volume, users can pass a node pool and constraints instead of a specific node ID. This changeset implements node scheduling logic by instantiating a filter by node pool and constraint checker borrowed from the scheduler package. Because host volumes with the same name can't land on the same host, we don't need to support `distinct_hosts`/`distinct_property`; this would be challenging anyway without building out a much larger node iteration mechanism to keep track of usage across multiple hosts. Ref: https://github.com/hashicorp/nomad/pull/24479	2024-12-19 09:25:54 -05:00
Piotr Kazmierczak	c5249c6ca4	gc: be consistent with setting create/modify timestamp tz (#24389 ) Whenever setting objects' creation/modify time, we should always use UTC. #24112 introduced some inconsistencies in this area, and this PR fixes it.	2024-11-07 22:53:54 +01:00
Piotr Kazmierczak	f7847c6e5b	state: remove TimeTable and rely on objects' modify times instead (#24112 ) The core scheduler relies on a special table in the state store, the TimeTable, to figure out which objects can be GC'd. The TimeTable correlates Raft indices with objects' insertion time, a solution we used before most of the objects we store in the state contained timestamps. This introduced some memory overhead and complexity, but most importantly meant that any GC threshold users set greater than timeTableLimit = 72 * time.Hour was ignored. This PR removes the TimeTable and relies on object timestamps to determine whether they can be GC'd or not.	2024-11-01 19:38:04 +01:00
Michael Smithhisler	436ff75f15	scheduler: fix reconnecting allocations getting rescheduled (#24165 ) * scheduler: fix reconnecting allocations getting rescheduled	2024-10-14 09:00:58 -04:00
Daniel Bennett	7526c91ccd	scheduler: non-nil err when no devices match (#24118 )	2024-10-03 10:29:36 -05:00
Tim Gross	a7f2cb879e	command line tools for redacting keyring from snapshots (#24023 ) In #23977 we moved the keyring into Raft, which can expose key material in Raft snapshots when using the less-secure AEAD keyring instead of KMS. This changeset adds tools for redacting this material from snapshots: * The `operator snapshot state` command gains the ability to display key metadata (only), which respects the `-filter` option. * The `operator snapshot save` command gains a `-redact` option that removes key material from the snapshot after it's downloaded. * A new `operator snapshot redact` command allows removing key material from an existing snapshot.	2024-09-20 15:30:14 -04:00
Seth Hoenig	51215bf102	deps: update to go-set/v3 and refactor to use custom iterators (#23971 ) * deps: update to go-set/v3 * deps: use custom set iterators for looping	2024-09-16 13:40:10 -05:00
Seth Hoenig	8b093a6a5d	scheduler: support for device-aware numa scheduling (#1760 ) (#23837 ) (CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files)) * scheduler: enhance numa aware scheduling with support for devices * cr: add comments	2024-08-20 07:53:04 -05:00
Daniel Bennett	d131c41943	cni: network.cni job updates should replace allocs (#23764 ) a change to the network{cni{}} block means that the user wants the network config to change, and that only happens during initial alloc setup, so we need to replace the alloc(s) to get fresh network(s) to reconfigure from scratch. e.g. a job plan diff like this ``` +/- Task Group: "g" (1 in-place update) + Network { + CNIConfig { + a: "ayy" } ``` should instead be ``` +/- Task Group: "g" (1 create/destroy update) + Network { + CNIConfig { + a: "ayy" } ```	2024-08-07 12:13:11 -05:00
Tim Gross	b25f1b66ce	resources: allow job authors to configure size of secrets tmpfs (#23696 ) On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks need larger space for downloading large secrets. This is especially the case for tasks using `templates`, which need extra room to write a temporary file to the secrets directory that gets renamed to the old file atomically. This changeset allows increasing the size of the tmpfs in the `resources` block. Because this is a memory resource, we need to include it in the memory we allocate for scheduling purposes. The task is already prevented from using more memory in the tmpfs than the `resources.memory` field allows, but can bypass that limit by writing to the tmpfs via `template` or `artifact` blocks. Therefore, we need to account for the size of the tmpfs in the allocation resources. Simply adding it to the memory needed when we create the allocation allows it to be accounted for in all downstream consumers, and then we'll subtract that amount from the memory resources just before configuring the task driver. For backwards compatibility, the default value of 1MiB is "free" and ignored by the scheduler. Otherwise we'd be increasing the allocated resources for every existing alloc, which could cause problems across upgrades. If a user explicitly sets `resources.secrets = 1` it will no longer be free. Fixes: https://github.com/hashicorp/nomad/issues/2481 Ref: https://hashicorp.atlassian.net/browse/NET-10070	2024-08-05 16:06:58 -04:00
Piotr Kazmierczak	78bc8e7843	scheduler: fix TestAllocSet_filterByTainted (#23648 )	2024-07-19 17:41:06 +02:00
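The score-normalization fix in #25800 comes down to which components count toward the denominator. The Go sketch below is a simplified, hypothetical model that averages a node's scoring components. `normalizeScore` and `normalizeScoreBeforeFix` are illustrative names, not Nomad's scheduler code. With the made-up numbers in `main`, the old rule lets a node whose affinity score is zero outscore a node that matches the affinity, and the fixed rule reverses that.

```go
package main

import "fmt"

// normalizeScore is a hypothetical, simplified model of averaging a node's
// scoring components. jobHasAffinities reports whether the job declares any
// affinities; affinity is this node's affinity score, which may be zero.
func normalizeScore(components []float64, jobHasAffinities bool, affinity float64) float64 {
	sum, count := 0.0, 0
	for _, c := range components {
		sum += c
		count++
	}
	// Fixed rule: a zero affinity still counts toward the denominator
	// whenever the job has affinities at all.
	if jobHasAffinities {
		sum += affinity
		count++
	}
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

// normalizeScoreBeforeFix models the old rule, which skipped the affinity
// component whenever its value was exactly zero.
func normalizeScoreBeforeFix(components []float64, jobHasAffinities bool, affinity float64) float64 {
	if affinity == 0 {
		return normalizeScore(components, false, 0)
	}
	return normalizeScore(components, jobHasAffinities, affinity)
}

func main() {
	binpack := []float64{0.8}
	// Node A matches no affinity (score 0). Node B matches one (score 0.5).
	fmt.Println(normalizeScoreBeforeFix(binpack, true, 0) > normalizeScoreBeforeFix(binpack, true, 0.5)) // true: A wins (0.8 vs 0.65)
	fmt.Println(normalizeScore(binpack, true, 0) < normalizeScore(binpack, true, 0.5))                   // true: B wins (0.4 vs 0.65)
}
```

When the job has no affinities, no affinity component is counted at all. This matches how the commit describes handling other optional components, such as node reschedule penalties.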
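The state-corruption fix in #25698 follows a copy-before-mutate pattern: never modify an object the state store still holds, but update a copy and place that copy in the plan. The Go sketch below illustrates the pattern under stated assumptions. The `Allocation`, `RescheduleTracker`, `Plan`, and `recordReschedule` shown here are hypothetical, heavily simplified stand-ins, not Nomad's real types or functions.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins; Nomad's real types have many more fields.
type RescheduleEvent struct {
	PrevAllocID string
}

type RescheduleTracker struct {
	Events []RescheduleEvent
}

type Allocation struct {
	ID                string
	RescheduleTracker *RescheduleTracker
}

// Copy returns a deep copy (including a fresh Events backing array), so the
// caller can mutate it without touching the object held by the state store.
func (a *Allocation) Copy() *Allocation {
	c := *a
	if a.RescheduleTracker != nil {
		c.RescheduleTracker = &RescheduleTracker{
			Events: append([]RescheduleEvent(nil), a.RescheduleTracker.Events...),
		}
	}
	return &c
}

// Plan is a hypothetical stand-in that records updated allocations by ID.
type Plan struct {
	Updates map[string]*Allocation
}

// recordReschedule copies prev, updates the copy's tracker, and puts the copy
// in the plan. prev itself, which is shared with the state store, is never mutated.
func recordReschedule(plan *Plan, prev *Allocation, ev RescheduleEvent) {
	updated := prev.Copy()
	if updated.RescheduleTracker == nil {
		updated.RescheduleTracker = &RescheduleTracker{}
	}
	updated.RescheduleTracker.Events = append(updated.RescheduleTracker.Events, ev)
	plan.Updates[updated.ID] = updated
}

func main() {
	stored := &Allocation{ID: "a1"}
	plan := &Plan{Updates: map[string]*Allocation{}}
	recordReschedule(plan, stored, RescheduleEvent{PrevAllocID: "a0"})
	fmt.Println(stored.RescheduleTracker == nil)                  // true
	fmt.Println(len(plan.Updates["a1"].RescheduleTracker.Events)) // 1
}
```

The deep copy of the `Events` slice matters. A shallow struct copy would still share the tracker pointer, so appending through it would mutate the stored allocation anyway.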
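The core-selection panic in #25340 came from treating a core ID as a slice index when the set of available IDs has gaps. The Go sketch below reproduces the commit's example set and shows the lookup-by-ID approach. `Core`, `availableCores`, and `coreByID` are hypothetical, simplified names and are not `numalib` types. The commit notes that a linear search is acceptable because core counts are small.

```go
package main

import "fmt"

// Core is a hypothetical, simplified stand-in for a CPU topology entry;
// it is not numalib's actual type.
type Core struct {
	ID int
}

// availableCores reproduces the commit's example: IDs 1..47 without 24,
// i.e. 46 cores whose IDs no longer line up with their slice indices.
func availableCores() []Core {
	var cores []Core
	for id := 1; id <= 47; id++ {
		if id != 24 {
			cores = append(cores, Core{ID: id})
		}
	}
	return cores
}

// coreByID finds a core by its ID with a linear scan instead of using the
// ID as a slice index.
func coreByID(cores []Core, id int) (Core, bool) {
	for _, c := range cores {
		if c.ID == id {
			return c, true
		}
	}
	return Core{}, false
}

func main() {
	cores := availableCores()
	fmt.Println(len(cores)) // 46
	// cores[47] would panic (index out of range [47] with length 46),
	// but a lookup by ID succeeds.
	c, ok := coreByID(cores, 47)
	fmt.Println(c.ID, ok) // 47 true
	_, ok = coreByID(cores, 24)
	fmt.Println(ok) // false
}
```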
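The node class fix in #24942 depends on leaving per-node values out of the "computed class" hash, so that nodes which differ only in those values share a class. The Go sketch below illustrates the idea. It makes several assumptions: it hashes attributes with `crypto/sha256` over sorted keys, which is a swapped-in technique rather than Nomad's actual hashing. It also treats the two attribute names from the commit as a hypothetical skip list, alongside the `unique.` prefix.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// perNodeAttrs is a hypothetical skip list of attributes whose values differ
// on every node (the two names come from the commit message).
var perNodeAttrs = map[string]bool{
	"nomad.advertise.address": true,
	"consul.dns.addr":         true,
}

// computedClass hashes only the attributes that nodes can share, so nodes
// that differ solely in per-node values end up in the same class.
func computedClass(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		if strings.HasPrefix(k, "unique.") || perNodeAttrs[k] {
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for a stable hash
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%q=%q\n", k, attrs[k]) // quoting keeps each pair unambiguous
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"kernel.name": "linux", "unique.hostname": "node-a", "nomad.advertise.address": "10.0.0.1:4646"}
	b := map[string]string{"kernel.name": "linux", "unique.hostname": "node-b", "nomad.advertise.address": "10.0.0.2:4646"}
	c := map[string]string{"kernel.name": "windows", "unique.hostname": "node-c"}
	fmt.Println(computedClass(a) == computedClass(b)) // true
	fmt.Println(computedClass(a) == computedClass(c)) // false
}
```

If any per-node value leaked into the hash, every node would get its own class. The scheduler's early rejection of infeasible classes would then never skip any work.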
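The secrets tmpfs change in #23696 describes a specific memory-accounting rule:

- An explicitly set secrets size is added to the memory the scheduler allocates.
- That same size is subtracted again just before the task driver is configured.
- The unset 1 MiB default is treated as free.

The Go sketch below models that rule. `TaskResources`, `schedulingMemoryMB`, and `driverMemoryMB` are hypothetical names, and representing "not set" as a nil pointer is an assumption, not Nomad's actual struct layout.

```go
package main

import "fmt"

// TaskResources is a hypothetical, simplified view of a task's resources
// block. SecretsMB is nil when the job author did not set resources.secrets.
type TaskResources struct {
	MemoryMB  int
	SecretsMB *int
}

// schedulingMemoryMB is the memory the scheduler accounts for: an explicitly
// set secrets tmpfs size (even 1) is added on top of resources.memory, while
// the unset 1 MiB default is treated as free.
func schedulingMemoryMB(r TaskResources) int {
	if r.SecretsMB == nil {
		return r.MemoryMB
	}
	return r.MemoryMB + *r.SecretsMB
}

// driverMemoryMB subtracts the tmpfs size again before the task driver is
// configured, so the memory handed to the driver matches resources.memory.
func driverMemoryMB(allocatedMB int, r TaskResources) int {
	if r.SecretsMB == nil {
		return allocatedMB
	}
	return allocatedMB - *r.SecretsMB
}

func main() {
	secrets := 64
	custom := TaskResources{MemoryMB: 256, SecretsMB: &secrets}
	allocated := schedulingMemoryMB(custom)
	fmt.Println(allocated, driverMemoryMB(allocated, custom))     // 320 256
	fmt.Println(schedulingMemoryMB(TaskResources{MemoryMB: 256})) // 256
}
```

Adding the size once when the allocation is created means every downstream consumer of the allocated resources sees it. The commit gives this as the reason for choosing this approach.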


931 Commits