Commit Graph

881 Commits

Author SHA1 Message Date
Tim Gross
7d73065066 numa: fix scheduler panic due to topology serialization bug (#23284)
The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
2024-06-11 08:55:00 -04:00
Tim Gross
fa70267787 scheduler: RescheduleTracker dropped if follow-up fails placements (#12319)
When an allocation fails it triggers an evaluation. The evaluation is processed
and the scheduler sees it needs to reschedule, which triggers a follow-up
eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement
alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).

But in the case where the follow-up eval can't place all allocs (there aren't
enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then
creates a blocked eval. The plan applier stops the failed alloc. Then when the
blocked eval is processed, the job is missing an allocation, so the scheduler
creates a new allocation. This allocation is _not_ a replacement from the
perspective of the scheduler, so it's not handed off a `RescheduleTracker`.

This changeset fixes this by annotating the reschedule tracker whenever the
scheduler can't place a replacement allocation. We check this annotation for
allocations that have the `stop` desired status when filtering out allocations
to pass to the reschedule tracker. I've also included tests that cover this case
and expands coverage of the relevant area of the code.

Fixes: https://github.com/hashicorp/nomad/issues/12147
Fixes: https://github.com/hashicorp/nomad/issues/17072
2024-06-10 11:15:40 -04:00
Tim Gross
67967c99a7 scheduler: stack test should use job.ID and not job.Name (#23169)
Some of our scheduler tests use the `AllocName` function from the structs
package incorrectly. This function should always receive the `Job.ID` and not
the `Job.Name`. Fix this to prevent future bugs from copy-pasting usage around.
2024-06-05 08:34:04 -04:00
Tim Gross
5bfb500932 refactor scheduler tests for node down/disconnected (#22198)
While working on #20462 #12319 I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch
  in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the
  pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
  test setup rather than implied by whether we re-process the pending/blocked
  eval.

Ref: https://github.com/hashicorp/nomad/issues/20462
Ref: https://github.com/hashicorp/nomad/pull/12319
2024-05-22 10:23:08 -04:00
Tim Gross
5666065131 tests: update disconnected client scheduler tests to avoid blocking (#20615)
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.

In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
2024-05-16 12:16:27 -04:00
Michael Schurter
3aefc010d7 test: remove spurious print statements (#20503) 2024-05-01 09:47:56 -07:00
Seth Hoenig
ae6c4c8e3f deps: purge use of old x/exp packages (#20373) 2024-04-12 08:29:00 -05:00
Gabi
ca22f34373 fix exhausted node metrics reporting in preemption (#20346) 2024-04-11 14:49:56 -04:00
Juana De La Cuesta
c7e7fdfa84 [f-gh-208] Force recreation and redeployment of task if volume label changes (#20074)
Scheduler: Force recreation and redeployment of task if volume mount labels in the task definitions changes
2024-03-27 11:43:31 +01:00
Juana De La Cuesta
f2965cad36 [gh-19729] Fix logic for updating terminal allocs on clients with max client disconnect (#20181)
Only ignore allocs on terminal states that are updated
---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-03-26 10:31:58 +01:00
Juana De La Cuesta
ff72248c86 func: add new picker dependency (#20029)
This commit introduces the new options for reconciling a reconnecting allocation and its replacement:

Best score (Current implementation)
Keep original
Keep replacement
Keep the one that has run the longest time
It is achieved by adding a new dependency to the allocReconciler that calls the corresponding function depending on the task group's disconnect strategy. For more detailed information, refer to the new stanza for disconnected clientes RFC.

It resolves 15144
2024-03-15 13:42:08 +01:00
carrychair
5f5b34db0e remove repetitive words (#20110)
Signed-off-by: carrychair <linghuchong404@gmail.com>
2024-03-11 08:52:08 +00:00
Seth Hoenig
4d83733909 tests: swap testify for test in more places (#20028)
* tests: swap testify for test in plugins/csi/client_test.go

* tests: swap testify for test in testutil/

* tests: swap testify for test in host_test.go

* tests: swap testify for test in plugin_test.go

* tests: swap testify for test in utils_test.go

* tests: swap testify for test in scheduler/

* tests: swap testify for test in parse_test.go

* tests: swap testify for test in attribute_test.go

* tests: swap testify for test in plugins/drivers/

* tests: swap testify for test in command/

* tests: fixup some test usages

* go: run go mod tidy

* windows: cpuset test only on linux
2024-02-29 12:11:35 -06:00
Juana De La Cuesta
20cfbc82d3 Introduces Disconnect block into the TaskGroup configuration (#19886)
This PR is the first on two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
2024-02-19 16:41:35 +01:00
Luiz Aoqui
4a8b01430b scheduler: retain eval metrics on port collision (#19933)
When an allocation can't be placed because of a port collision the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metrics was not being emitted when
preemption was enabled.
2024-02-09 18:18:48 -05:00
Tim Gross
d3e5cae1eb consul: support admin partitions (#19665)
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: https://github.com/hashicorp/nomad/issues/13139
2024-01-10 10:41:29 -05:00
Luiz Aoqui
099ee06a60 Revert "deps: update go-metrics to v0.5.3 (#19190)" (#19374)
* Revert "deps: update go-metrics to v0.5.3 (#19190)"

This reverts commit ddb060d8b3.

* changelog: add entry for #19374
2023-12-08 08:46:55 -05:00
Juana De La Cuesta
cf539c405e Add a new parameter to avoid starting a replacement for lost allocs (#19101)
This commit introduces the parameter preventRescheduleOnLost which indicates that the task group can't afford to have multiple instances running at the same time. In the case of a node going down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run.

In case of max_client_disconnect also being enabled, if there is a reschedule policy, an error will be returned.
Implements issue #10366

Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-12-06 12:28:42 +01:00
Tim Gross
0bc2ea8d98 client version constraints for implicit identities for WI (#18932)
Clients prior to Nomad 1.7 cannot support the new workload identity-based
authentication to Consul and Vault. Add an implicit Nomad version constraint on
job submission for task groups that use the new workflow.

Includes a constraint test showing same-version prelease handling.
2023-12-01 13:51:21 -05:00
Luiz Aoqui
59aa860c13 scheduler: fix task-level consul diff (#19230)
Fix `tasksUpdated()` to compare the task level `consul` blocks instead
of the group.
2023-11-30 10:13:17 -05:00
Luiz Aoqui
ddb060d8b3 deps: update go-metrics to v0.5.3 (#19190)
Update `go-metrics` to v0.5.3 to pick
https://github.com/hashicorp/go-metrics/pull/146.
2023-11-28 12:37:57 -05:00
Charlie Voiselle
659c0945fc [core] Honor job's namespace when checking distinct_hosts feasibility (#19004)
* Update distinct_host feasibility checking to honor the job's namespace. Fixes #9792
* Added test to verify original condition and that fix resolved it.
* Added documentation
2023-11-17 11:25:10 -05:00
James Rasell
3c8eb54dfc scheduler: ensure dup alloc names are fixed before plan submit. (#18873)
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placements calculation notion that names of allocations being
replaced are blindly copied to their replacement. This is not
correct in all cases, particularly when dealing with canaries.

The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicate are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
2023-10-27 14:16:41 +01:00
Juana De La Cuesta
e8efe2d251 fix: handling non reschedule disconnecting and reconnecting allocs (#18701)
This PR fixes a long lived bug, where disconnecting allocations where never rescheduled by their policy but because the group count was short. The default reschedule time for services and batches is 30 and 5 seconds respectively, in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
2023-10-27 13:14:39 +02:00
James Rasell
b3e41bec2d scheduler: remove unused alloc index functions. (#18846) 2023-10-25 09:09:47 +01:00
Seth Hoenig
0020139440 core: port common code changes from ENT for numa scheduling (#18818)
Some additional changes were made in the ENT PR to the common code in
support of numa scheduling; this PR copies those changes back to CE.
2023-10-20 13:19:02 -05:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm rstore

Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.

* lint: learn how to spell pseudo
2023-10-19 15:09:30 -05:00
Luiz Aoqui
0ccf942b26 scheduler: fix host volume feasibility check (#18679)
Host volumes were considered regular feasibility checks. This had two
unintended consequences.

The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.

If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.

```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
	feasible := check.Feasible(option)
	if !feasible {
		// If the job hasn't escaped, set it to be ineligible since it
		// failed a job check.
		if !jobEscaped {
			evalElig.SetJobEligibility(false, option.ComputedClass)
		}
		continue OUTER
	}
}
```

This results in all nodes with the same computed class to be skipped,
even if they do have the desired host volume.

```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
	// Fast path the ineligible case
	metrics.FilterNode(option, "computed class ineligible")
	continue
```

The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.

But when the first allocation found a node match, it registered the node
class as eligible for the task group.

```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
	evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```

This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.

```go
case EvalComputedClassEligible:
	// Fast path the eligible case
	if w.available(option) {
		return option
	}
	// We match the class but are temporarily unavailable
	continue OUTER
```

These problems did not happen with CSI volumes kind of accidentally.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).

And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).

By also placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers` we also avoid these problems on host volume feasibility.
2023-10-06 11:00:48 -04:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
James Rasell
673a7713a8 scheduler: remove unused changes reconciler function. (#18656) 2023-10-05 08:10:01 +01:00
Tim Gross
aaee3076c2 consul: allow consul block in task scope (#18597)
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
2023-09-29 15:03:48 -04:00
Tim Gross
7863d7bcbb jobspec: support cluster field for Consul and Service blocks (#18409)
This field supports the upcoming ENT-only multiple Consul clusters feature. The
job validation and mutation hooks will come in a separate PR.

Ref: https://github.com/hashicorp/team-nomad/issues/404
2023-09-07 09:48:49 -04:00
Seth Hoenig
f5b0da1d55 all: swap exp packages for maps, slices (#18311) 2023-08-23 15:42:13 -05:00
James Rasell
6108f5c4c3 admin: rename _oss files to _ce (#18209) 2023-08-18 07:47:24 +01:00
Michael Schurter
0e22fc1a0b identity: add support for multiple identities + audiences (#18123)
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.

Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-08-15 09:11:53 -07:00
Seth Hoenig
d9341f0664 update go1.21 (#18184)
* build: update to go1.21

* go: eliminate helpers in favor of min/max

* build: run go mod tidy

* build: swap depguard for semgrep

* command: fixup broken tls error check on go1.21
2023-08-14 08:43:27 -05:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Tim Gross
b1742c7015 scheduler: filter device instance IDs by constraints (#18141)
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
2023-08-03 14:58:30 -04:00
James Rasell
9707aafc5b test: add tests for allocNameIndex core funcs (#18136) 2023-08-03 15:43:50 +01:00
Tim Gross
1ef8ad8176 scheduler: fix panic in render_templates destructive update check (#18100)
In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
causes panics in our scheduler tests.
2023-07-31 11:52:51 -04:00
Gerard Nguyen
9e98d694a6 feature: Add new field render_templates on restart block (#18054)
This feature is necessary when user want to explicitly re-render all templates on task restart.
E.g. to fetch all new secrets from Vault, even if the lease on the existing secrets has not been expired.
2023-07-28 11:53:32 -07:00
Luiz Aoqui
13ee343853 core: remove unnecessary call to SetNodes and adds DC downgrade test (#17655) 2023-06-22 13:26:14 -04:00
Luiz Aoqui
f4c7182873 node pools: apply node pool scheduler configuration (#17598) 2023-06-21 20:31:50 -04:00
Tim Gross
6ea36f248e node pools: support node.pool constraint in scheduler (#17548)
Although most of the time jobs will be assigned to a single node pool, users may
want to set the node pool to "all" and then constraint to a subset of node
pools. Add support for setting a contraint like `${node.pool}`.
2023-06-16 13:31:46 -04:00
Tim Gross
9a6078a2ae node pools: implement support in scheduler (#17443)
Implement scheduler support for node pool:

* When a scheduler is invoked, we get a set of the ready nodes in the DCs that
  are allowed for that job. Extend the filter to include the node pool.
* Ensure that changes to a job's node pool are picked up as destructive
  allocation updates.
* Add `NodesInPool` as a metric to all reporting done by the scheduler.
* Add the node-in-pool the filter to the `Node.Register` RPC so that we don't
  generate spurious evals for nodes in the wrong pool.
2023-06-07 10:39:03 -04:00
Tim Gross
feb34c738d scheduler: count implicit spread targets as a single target (#17195)
When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.

This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.

Fixes: #11823
2023-05-17 10:25:00 -04:00
Tim Gross
c80c8ab43d scheduler: prevent -Inf in spread scoring (#17198)
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
2023-05-16 16:01:32 -04:00
Tim Gross
2aa3c746c4 logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00
Tim Gross
30bc456f03 logs: allow disabling log collection in jobspec (#16962)
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.

This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.

---

Fixes: #11175

Co-authored-by: Thomas Weber <towe75@googlemail.com>
2023-04-24 10:00:27 -04:00
Charlie Voiselle
84cd58db27 [scheduler] Honor false for distinct hosts constraint (#16907)
* Honor value for distinct_hosts constraint
* Add test for feasibility checking for `false`
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-04-17 17:43:56 -04:00