Commit Graph

61 Commits

Author SHA1 Message Date
Piotr Kazmierczak
648bacda77 testing: migrate nomad/scheduler off of testify (#25968)
In the spirit of #25909, this PR removes testify dependencies from the scheduler
package and removes the remaining reflect.DeepEqual usage. This is again a
combination of semgrep and hx editing magic.
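As a rough illustration of the conversion pattern (not code from the PR; the
replacement assertion library shown here, `github.com/shoenig/test/must`, is an
assumption based on Nomad's other test packages):
```
package scheduler

import (
	"testing"

	"github.com/shoenig/test/must"
)

func TestExample(t *testing.T) {
	expected := []string{"web", "cache"}
	actual := []string{"web", "cache"}

	// Before: require.True(t, reflect.DeepEqual(expected, actual))
	// After: must.Eq does a deep comparison and prints a readable diff,
	// removing both the testify and reflect.DeepEqual dependencies.
	must.Eq(t, expected, actual)
}
```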

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-06-04 09:29:28 +02:00
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
* Add NodeAllocationTracker struct to Node struct
* Evaluate MaxAllocations in the AllocsFit function (sketched below)
* Set up CLI config parsing
* Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>
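A rough sketch of where such a check could live in an AllocsFit-style function
(the types and field names below are illustrative, not the PR's exact API):
```
package main

import "fmt"

// Illustrative stand-ins for the real structs in nomad/structs.
type Allocation struct{ ID string }

type Node struct {
	// Hypothetical field: caps how many allocations may be placed on
	// this node; zero or negative means no limit.
	MaxAllocations int
}

// allocsFit sketches a max-allocations feasibility check that runs
// before the usual CPU/memory/port comparisons.
func allocsFit(node *Node, proposed []*Allocation) (bool, string) {
	if node.MaxAllocations > 0 && len(proposed) > node.MaxAllocations {
		return false, fmt.Sprintf("max allocations exceeded (%d > %d)",
			len(proposed), node.MaxAllocations)
	}
	// ... existing resource checks would follow here ...
	return true, ""
}

func main() {
	ok, reason := allocsFit(&Node{MaxAllocations: 2},
		[]*Allocation{{ID: "a"}, {ID: "b"}, {ID: "c"}})
	fmt.Println(ok, reason) // false max allocations exceeded (3 > 2)
}
```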

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Tim Gross
456d95a19e scheduler: account for affinity value of zero in score normalization (#25800)
If there are no affinities on a job, we don't want to count an affinity score of
zero toward the number of scores we divide the normalized score by. This is how
we handle other scoring components, like node reschedule penalties on nodes that
weren't running the previous allocation.

But we were also excluding the affinity from that count when the job has
affinities and the computed value happens to be zero. In pathological cases,
this can result in a node with a low affinity score being picked over a node
with no affinity, because its denominator is 1 larger. Now we include zero-value
affinities in the count of scores whenever the job has affinities, even if the
value happens to be zero.
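A condensed sketch of the normalization logic in question (the real iterator
code is more involved; these names are simplified):
```
package scheduler

// normScore sketches the fix: the affinity component is counted in the
// denominator whenever the job declares affinities, even if its value
// happens to be zero.
func normScore(otherScores []float64, affinity float64, jobHasAffinities bool) float64 {
	sum, count := 0.0, 0
	for _, s := range otherScores {
		sum += s
		count++
	}
	if jobHasAffinities {
		// Before the fix, a zero affinity was skipped here, leaving
		// this node's denominator one smaller than a node whose
		// affinity was nonzero, which skewed the comparison.
		sum += affinity
		count++
	}
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}
```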

Fixes: https://github.com/hashicorp/nomad/issues/25621
2025-05-19 14:10:00 -04:00
Seth Hoenig
8b093a6a5d scheduler: support for device-aware numa scheduling (#1760) (#23837)
(CE backport of ENT 59433d56c7215c0b8bf33764f41b57d9bd30160f (without ent files))

* scheduler: enhance numa-aware scheduling with support for devices

* cr: add comments
2024-08-20 07:53:04 -05:00
Tim Gross
b25f1b66ce resources: allow job authors to configure size of secrets tmpfs (#23696)
On supported platforms, the secrets directory is a 1MiB tmpfs, but some tasks
need more space for downloading large secrets. This is especially the case for
tasks using `templates`, which need extra room to write a temporary file to the
secrets directory that then gets renamed over the old file atomically.

This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.

Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.

For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
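A minimal sketch of that accounting, assuming a flag that distinguishes an
explicit `resources.secrets = 1` from the implicit default (names illustrative):
```
package structs

// The backwards-compatible "free" tmpfs size in MiB.
const defaultSecretsMB = 1

// scheduledMemoryMB is what the scheduler reserves: configured memory
// plus the secrets tmpfs, unless the tmpfs is the implicit free default.
func scheduledMemoryMB(memoryMB, secretsMB int, explicitlySet bool) int {
	if secretsMB == defaultSecretsMB && !explicitlySet {
		return memoryMB // implicit default stays free across upgrades
	}
	return memoryMB + secretsMB
}

// driverMemoryMB subtracts the tmpfs back out just before the task
// driver is configured, so the task's limit matches resources.memory.
func driverMemoryMB(allocatedMB, secretsMB int, explicitlySet bool) int {
	if secretsMB == defaultSecretsMB && !explicitlySet {
		return allocatedMB
	}
	return allocatedMB - secretsMB
}
```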

Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
2024-08-05 16:06:58 -04:00
Tim Gross
7d73065066 numa: fix scheduler panic due to topology serialization bug (#23284)
The NUMA topology struct field `NodeIDs` is an `idset.Set`, which has no public
members. As a result, the field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and has always been empty on the wire, that
would break RPC compatibility across upgrades. Instead, create a new field
that's populated at the same time we populate the more useful `idset.Set`, and
repopulate the set on demand.
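A simplified sketch of the pattern; `Set`, the field names, and the accessor
below are stand-ins, not the real `idset`/topology API:
```
package numalib

// Set stands in for idset.Set: only unexported members, so msgpack
// serializes it as empty and it comes back nil after a state restore.
type Set struct{ ids map[uint8]struct{} }

func NewSet(ids ...uint8) *Set {
	s := &Set{ids: make(map[uint8]struct{})}
	for _, id := range ids {
		s.ids[id] = struct{}{}
	}
	return s
}

type Topology struct {
	nodeIDs *Set // convenient form, lost on restore

	// GivenNodeIDs mirrors nodeIDs with exported, serializable data and
	// is written whenever nodeIDs is populated. (Hypothetical name.)
	GivenNodeIDs []uint8
}

// NodeIDs repopulates the set on demand so the scheduler never
// dereferences a nil set after a restore.
func (t *Topology) NodeIDs() *Set {
	if t.nodeIDs == nil {
		t.nodeIDs = NewSet(t.GivenNodeIDs...)
	}
	return t.nodeIDs
}
```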

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
2024-06-11 08:55:00 -04:00
Gabi
ca22f34373 fix exhausted node metrics reporting in preemption (#20346) 2024-04-11 14:49:56 -04:00
Seth Hoenig
0020139440 core: port common code changes from ENT for numa scheduling (#18818)
Some additional changes were made in the ENT PR to the common code in
support of numa scheduling; this PR copies those changes back to CE.
2023-10-20 13:19:02 -05:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm rstore

Handle the case where an upgraded server dequeues an evaluation before a client
triggers a new fingerprint, which would be needed to cause the compatibility fix
to run. By running the compat fix on restore, the server will immediately have
the compatible pseudo topology to use (see the sketch below).

* lint: learn how to spell pseudo
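A rough sketch of the restore-time hook, with simplified stand-in types; the
real change runs the node-resources compatibility fix inside the FSM's snapshot
restore path:
```
package nomad

// Node stands in for structs.Node.
type Node struct{ /* fields elided */ }

// applyCompat stands in for the node-resources compatibility fix that
// builds the pseudo topology from legacy fields.
func applyCompat(n *Node) {}

// restoreNodes applies the compat fix to every node as it is read back
// from a snapshot, so an upgraded server can schedule immediately
// instead of waiting for each client to fingerprint again.
func restoreNodes(snapshot []*Node, upsert func(*Node) error) error {
	for _, n := range snapshot {
		applyCompat(n)
		if err := upsert(n); err != nil {
			return err
		}
	}
	return nil
}
```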
2023-10-19 15:09:30 -05:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Luiz Aoqui
f4c7182873 node pools: apply node pool scheduler configuration (#17598) 2023-06-21 20:31:50 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Michael Schurter
f998a2b77b core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

1. Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
2. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a host_network?! This seemed really confusing and subtle for users to me.

So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
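A minimal sketch of the "merging down", assuming reserved_ports values are the
usual port-spec strings like "22,80,8000-9000" (names simplified):
```
package config

// mergeReservedPorts merges the global client.reserved.reserved_ports
// spec "down" into a host_network's own reserved_ports, so both sets of
// reservations apply to that network's addresses.
func mergeReservedPorts(global, specific string) string {
	switch {
	case global == "":
		return specific
	case specific == "":
		return global
	default:
		return global + "," + specific
	}
}
```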
2022-07-12 14:40:25 -07:00
Luiz Aoqui
8a427a470a scheduler: detect and log unexpected scheduling collisions (#11793) 2022-01-14 20:09:14 -05:00
Mahmood Ali
28b8767b27 Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler) or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to make memory oversubscription opt-in, but I'm happy to change it. To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```

If memory oversubscription is disabled, submitting jobs that specify
`memory_max` will produce a "Memory oversubscription is not enabled" warning,
but the jobs will be accepted without access to the additional memory.

The warning message looks like this:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Andrii Chubatiuk
6f7171d50f add support for host network interpolation 2021-04-13 09:53:05 -04:00
Nick Ethier
94c7aec159 scheduler: implement scheduling of reserved cores 2021-03-19 00:29:07 -04:00
Drew Bailey
7ce0b5017c Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Nick Ethier
ad8ced3873 multi-interface network support 2020-06-19 09:42:10 -04:00
Mahmood Ali
60aa516db8 missed fixing one invocation 2020-05-01 13:38:46 -04:00
Mahmood Ali
5078e0cfed tests and some clean up 2020-05-01 13:13:30 -04:00
Michael Schurter
2781717ad6 core: fix node reservation scoring
The BinPackIter accounted for node reservations twice when scoring nodes,
which could bias scores toward nodes with reservations.

Pseudo-code for previous algorithm:
```
	proposed  = reservedResources + sum(allocsResources)
	available = nodeResources - reservedResources
	score     = 1 - (proposed / available)
```

The node's reserved resources are added to the total resources used by
allocations, and then the node's reserved resources are later
subtracted from the node's overall resources.

The new algorithm is:
```
	proposed  = sum(allocResources)
	available = nodeResources - reservedResources
	score     = 1 - (proposed / available)
```

The node's reserved resources are no longer added to the total resources
used by allocations.

My guess as to how this bug happened is that the resource utilization
variable (`util`) is calculated and returned by the `AllocsFit` function,
which needs to take reserved resources into account as a basic
feasibility check.

To avoid re-calculating alloc resource usage (because there may be a
large number of allocs), we reused `util` in the `ScoreFit` function.
`ScoreFit` properly accounts for reserved resources by subtracting them
from the node's overall resources. However, since `util` _also_ took
reserved resources into account, the score would be incorrect.
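The corrected arithmetic as a tiny function, matching the pseudo-code above
(the real ScoreFit works across multiple resource dimensions, not one number):
```
package scheduler

// scoreFit sketches the fixed calculation: reservations reduce the
// node's capacity exactly once, and proposed usage is allocations only.
func scoreFit(nodeMB, reservedMB, allocsMB float64) float64 {
	proposed := allocsMB             // no longer includes reservedMB
	available := nodeMB - reservedMB // reservations subtracted exactly once
	return 1 - (proposed / available)
}
```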

Prior to the fix, the added test outputs:
```
Node: reserved     Score: 1.0000
Node: reserved2    Score: 1.0000
Node: no-reserved  Score: 0.9741
```

The scores being 1.0 for *both* nodes with reserved resources is a good
hint something is wrong, as they should receive different scores. Upon
further inspection, the double accounting of reserved resources caused
their scores to be >1.0 and then clamped.

After the fix, the added test outputs:
```
Node: no-reserved  Score: 0.9741
Node: reserved     Score: 0.9480
Node: reserved2    Score: 0.8717
```
2020-04-15 15:13:30 -07:00
Michael Schurter
f12bfdb193 scheduler: update tests with modern error helper 2019-12-02 20:25:52 -08:00
Preetha Appan
5a1dd79179 Code review feedback 2019-07-31 01:04:08 -04:00
Preetha Appan
b561816343 Scheduler changes to support network at task group level
Also includes unit tests for the binpacker and preemption.
The tests verify that network resources specified at the
task group level are properly accounted for.
2019-07-31 01:04:08 -04:00
Alex Dadgar
bc42873e07 Change types of weights on spread/affinity 2019-01-30 12:20:38 -08:00
Alex Dadgar
8264f50c52 convert driver to device for device constraint/attributes 2019-01-23 10:58:45 -08:00
Preetha Appan
66670c0f02 Fixes device scheduling unit tests
Also changes the scoring logic when there is more than one task
requesting a device. Since inter-task affinities are already normalized,
we take the average of the scores across tasks.
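As a sketch, with the per-task scores already normalized:
```
package scheduler

// averageDeviceScore combines the normalized per-task device scores for
// a node into a single score by taking their mean.
func averageDeviceScore(taskScores []float64) float64 {
	if len(taskScores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range taskScores {
		sum += s
	}
	return sum / float64(len(taskScores))
}
```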
2018-11-08 10:31:19 -06:00
Alex Dadgar
ad4c26a1e3 review comments 2018-11-07 11:31:52 -08:00
Alex Dadgar
7ab3ce4bde affinities 2018-11-07 10:32:03 -08:00
Alex Dadgar
77ad27de60 assign devices 2018-11-07 10:32:03 -08:00
Alex Dadgar
e30b20e65e renames 2018-10-04 14:57:25 -07:00
Alex Dadgar
0f2f4797cb fixing tests 2018-10-04 14:26:19 -07:00
Preetha Appan
5cd8d1fe82 Some minor changes from code review 2018-09-04 16:10:11 -05:00
Preetha Appan
a236342caa Address some review feedback 2018-09-04 16:10:11 -05:00
Preetha Appan
00924555a8 Implement affinity support in generic scheduler 2018-09-04 16:10:11 -05:00
Preetha Appan
c6c0741bd8 Add helper methods, use require and other code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan
5ecb7895bb Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Michael Schurter
04b8f8e7fc Remove structs import from api
Goes a step further and removes the structs import from api's tests as well
by moving GenerateUUID to its own package.
2017-09-29 10:36:08 -07:00
Alex Dadgar
a9e3a41407 Enable more linters 2017-09-26 15:26:33 -07:00
Alex Dadgar
ac1539d5d9 Sync namespace changes 2017-09-07 17:04:21 -07:00
Alex Dadgar
741a71e0b3 Fix tests 2017-05-01 13:54:26 -07:00
Diptanu Choudhury
396e45629b Renaming LocalDisk to EphemeralDisk (#1710)
Renaming LocalDisk to EphemeralDisk
2016-09-14 15:43:42 -07:00
Diptanu Choudhury
7da66e169c Making the scheduler use LocalDisk instead of Resources.DiskMB 2016-08-25 12:27:42 -05:00
Diptanu Choudhury
230a59ca16 Fixed some more tests 2016-07-25 17:26:38 -07:00
Alex Dadgar
bffcf5bd9f ProposedAllocs dedups in-place updated allocations 2016-03-21 18:09:32 -07:00
Alex Dadgar
bc13dcaf48 merge 2015-12-16 15:01:15 -08:00
Alex Dadgar
17bc13bfe5 Add garbage collection to jobs 2015-12-16 15:00:45 -08:00
Armon Dadgar
924bf123a1 scheduler: binpacker makes network offers 2015-09-13 14:31:32 -07:00
Armon Dadgar
8a02dbc481 Use a single implementation of GenerateUUID 2015-09-07 15:23:03 -07:00