Commit Graph

741 Commits

Author SHA1 Message Date
Tim Gross
2b63a093ac quotas: evaluate quota feasibility last in scheduler (#10753)
The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, and as a result we count nodes which
will later be filtered out by constraints. Therefore, for jobs with
constraints, nodes that are feasibility checked but fail have already been
counted against quotas. This failure mode is order-dependent: if all the
unfiltered nodes happen to be quota-checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
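
In miniature, the ordering fix looks like this; a minimal sketch with hypothetical types, not the scheduler's actual iterator wiring:

```go
package main

import "fmt"

type node struct{ name string }

// checker filters a candidate node set; quota checking must run last so
// it only counts nodes that survived every other feasibility check.
type checker func([]node) []node

// buildPipeline places the quota check after all constraint checks
// (but before ranking), mirroring the fix described above.
func buildPipeline(constraintChecks []checker, quotaCheck checker) []checker {
	return append(constraintChecks, quotaCheck)
}

func main() {
	constraint := func(ns []node) []node {
		out := []node{}
		for _, n := range ns {
			if n.name != "filtered" { // toy constraint
				out = append(out, n)
			}
		}
		return out
	}
	quota := func(ns []node) []node {
		// Only unfiltered nodes reach this point, so resources are
		// counted against the quota correctly.
		fmt.Printf("quota sees %d node(s)\n", len(ns))
		return ns
	}
	nodes := []node{{"ok"}, {"filtered"}}
	for _, check := range buildPipeline([]checker{constraint}, quota) {
		nodes = check(nodes)
	}
}
```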
2021-06-14 10:11:40 -04:00
Mahmood Ali
122a4cb844 tests: use standard library testing.TB
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to its testing.T interface. This switches to the standard library
testing.TB, which doesn't have a Parallel() method.
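
For illustration, a test helper written against the standard library interface; a sketch, with a hypothetical helper name:

```go
package scheduler

import "testing"

// setupHarness accepts the standard library's testing.TB, so it works
// with both *testing.T and *testing.B and is insulated from changes to
// third-party testing interfaces. Hypothetical helper for illustration.
func setupHarness(tb testing.TB) func() {
	tb.Helper()
	// ... create fixtures ...
	return func() { /* tear down fixtures */ }
}

func TestHarness(t *testing.T) {
	cleanup := setupHarness(t)
	defer cleanup()
}
```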
2021-06-09 16:18:45 -07:00
Tim Gross
e6b3123758 scheduler: test for reconciler's in-place rollback behavior
The reconciler has some complicated behavior when there are already running
allocations from a previous version of the job that we want to keep, as
happens during a rollback. Document this behavior with a test.
2021-06-03 10:02:19 -04:00
Michael Schurter
2e6eb84a57 Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Michael Schurter
d3d6c60e63 clarify docs from pr comments 2021-04-30 08:31:31 -07:00
Mahmood Ali
28b8767b27 Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler) or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to make memory oversubscription opt-in, but I'm happy to change it. To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```
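
The same toggle can be applied from Go; a minimal sketch using the `api` package's operator client, assuming a reachable local agent and the scheduler configuration endpoints documented above:

```go
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

// Sketch: flip the scheduler's MemoryOversubscriptionEnabled flag via
// the Go API client, using a read-modify-write so the other scheduler
// settings are preserved.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	resp, _, err := client.Operator().SchedulerGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}
	cfg := resp.SchedulerConfig
	cfg.MemoryOversubscriptionEnabled = true

	if _, _, err := client.Operator().SchedulerSetConfiguration(cfg, nil); err != nil {
		log.Fatal(err)
	}
}
```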

If memory oversubscription is disabled, jobs that specify `memory_max` will
get a "Memory oversubscription is not enabled" warning, but the jobs will be
accepted without their tasks having access to the additional memory.

The warning message looks like:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Luiz Aoqui
c7114921fa Add metrics for blocked eval resources (#10454)
* add metrics for blocked eval resources

* docs: add new blocked_evals metrics

* fix to call `pruneStats` instead of `stats.prune` directly
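
As a rough sketch of the emission side with armon/go-metrics (the metric keys and labels here are illustrative, not necessarily the exact names added by this PR):

```go
package main

import metrics "github.com/armon/go-metrics"

// publishBlockedStats sketches emitting gauges for resources requested
// by blocked evals, labeled by node class. Metric keys are illustrative,
// not necessarily the exact names added in #10454.
func publishBlockedStats(cpuShares, memoryMB int, nodeClass string) {
	labels := []metrics.Label{{Name: "node_class", Value: nodeClass}}
	metrics.SetGaugeWithLabels(
		[]string{"nomad", "blocked_evals", "cpu"}, float32(cpuShares), labels)
	metrics.SetGaugeWithLabels(
		[]string{"nomad", "blocked_evals", "memory"}, float32(memoryMB), labels)
}

func main() {
	publishBlockedStats(500, 256, "batch")
}
```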
2021-04-29 15:03:45 -04:00
Michael Schurter
d50fb2a00e core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
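
A sketch of how a task driver plugin might advertise the new capability (stub types, other capability fields omitted; see hashicorp/nomad-driver-ecs for a real implementation):

```go
package ecs

import "github.com/hashicorp/nomad/plugins/drivers"

// Driver is a stub standing in for a real remote task driver.
type Driver struct{}

// Capabilities advertises RemoteTasks: with it set, Nomad propagates the
// task's TaskHandle to the server in TaskState and on to replacement
// allocations, instead of assuming the task dies with the node.
func (d *Driver) Capabilities() (*drivers.Capabilities, error) {
	return &drivers.Capabilities{
		RemoteTasks: true, // task lifecycle is decoupled from the node
	}, nil
}
```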
2021-04-27 15:07:03 -07:00
Andrii Chubatiuk
6f7171d50f add support for host network interpolation 2021-04-13 09:53:05 -04:00
Seth Hoenig
a97254fa20 consul: plumbing for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
2021-04-05 10:03:19 -06:00
Chris Baker
80066ac798 Merge branch 'main' into f-node-drain-api 2021-04-01 15:22:57 -05:00
Mahmood Ali
80faab0f79 oversubscription: Add MemoryMaxMB to internal structs
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use on the client. This allows tasks to specify a memory reservation (to be
used by the scheduler when placing the task) while consuming excess memory on
the client if any is available.

This commit adds the server-side tracking for the value, and ensures that the
allocations' AllocatedResources fields include the value.
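
A simplified sketch of the shape this takes in the resource structs (the real definitions in `nomad/structs` carry more fields; the helper method here is hypothetical):

```go
package structs

// Simplified sketch of the memory tracking described above.
type AllocatedMemoryResources struct {
	MemoryMB    int64 // reservation used by the scheduler for placement
	MemoryMaxMB int64 // ceiling the task may use on the client, if excess exists
}

// MaxAllowedMemoryMB is a hypothetical helper showing the intended
// semantics: enforce the max when set, otherwise the plain reservation.
func (m *AllocatedMemoryResources) MaxAllowedMemoryMB() int64 {
	if m.MemoryMaxMB > 0 {
		return m.MemoryMaxMB
	}
	return m.MemoryMB
}
```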
2021-03-30 16:55:58 -04:00
Nick Ethier
0a25e153c7 Merge pull request #10203 from hashicorp/f-cpu-cores
Reserved Cores [1/4]: Structs and scheduler implementation
2021-03-29 14:05:54 -04:00
Chris Baker
a52f32dedc restored Node.Sanitize() for RPC endpoints
multiple other updates from code review
2021-03-26 17:03:15 +00:00
Chris Baker
93d5187e7b removed deprecated fields from Drain structs and API
node drain: use msgtype on txn so that events are emitted
wip: encoding extension to add Node.Drain field back to API responses

new approach for hiding Node.SecretID in the API, using `json` tag
documented this approach in the contributing guide
refactored the JSON handlers with extensions
modified event stream encoding to use the go-msgpack encoders with the extensions
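
A sketch of the `json`-tag approach for `Node.SecretID` mentioned above (field set trimmed for illustration; the encoding extensions then re-add derived fields such as `Node.Drain` to API responses):

```go
package structs

// Sketch: sensitive fields are hidden from encoded API responses with a
// `json:"-"` tag, while encoding extensions add derived fields back.
// Field set trimmed for illustration.
type Node struct {
	ID       string
	SecretID string `json:"-"` // never serialized to API consumers
	// ...
}
```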
2021-03-21 15:30:11 +00:00
Nick Ethier
39a8a172eb scheduler: detect job change in cores resource 2021-03-19 22:25:50 -04:00
Nick Ethier
94c7aec159 scheduler: implement scheduling of reserved cores 2021-03-19 00:29:07 -04:00
Tim Gross
7c7569674c CSI: unique volume per allocation
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (e.g. `[0]`), rather than the exact source ID.

Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (e.g. `[0]`) should be added to the
volume source ID.
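
In miniature, the per-allocation source ID expansion looks like this (hypothetical helper; the real logic lives in the volume request and claim paths):

```go
package main

import "fmt"

// volumeSourceID appends the allocation index suffix (e.g. "[0]") when
// the volume request sets PerAlloc, so each allocation claims its own
// uniquely named volume. Hypothetical helper for illustration.
func volumeSourceID(source string, allocIndex uint, perAlloc bool) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}

func main() {
	fmt.Println(volumeSourceID("csi-volume", 0, true))  // csi-volume[0]
	fmt.Println(volumeSourceID("csi-volume", 0, false)) // csi-volume
}
```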
2021-03-18 15:35:11 -04:00
Tim Gross
a1eaad9cf7 CSI: remove prefix matching from CSIVolumeByID and fix CLI prefix matching (#10158)
Callers of `CSIVolumeByID` generally assume they will receive a single
volume. This potentially results in feasibility checking being performed
against the wrong volume when a volume's ID is a prefix of another volume's
ID (for example: "test" and "testing").

Removing the incorrect prefix matching from `CSIVolumeByID` breaks prefix
matching in the command line client. Add the required elements for prefix
matching to the commands and API.
2021-03-18 14:32:40 -04:00
Tim Gross
b66a341505 scheduler/csi: fix early return when multiple volumes are requested
When multiple CSI volumes are requested, the feasibility check could return
early for read/write volumes with free claims, even if a later volume in the
request was not feasible for any other reason (including not existing at
all). Because Go randomizes map iteration order, this caused feasibility
checking to pass or fail intermittently, depending on how the map of volumes
happened to be ordered at runtime.

Remove the early return from the feasibility check. Add a test to verify that
missing volumes in the map will cause a failure; this test will not catch a
regression every test run because of the random map ordering, but any failure
will be caught over the course of several CI runs.
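
The bug pattern in miniature: a success return from inside a map iteration lets one feasible volume mask a later infeasible one, and Go's randomized map ordering makes the failure intermittent. A generic sketch, not the actual checker code:

```go
package main

import "fmt"

// Buggy shape: returning success from inside the loop can skip
// checking the remaining volumes; map iteration order is randomized,
// so the bug only surfaces on some runs.
func allFeasibleBuggy(volumes map[string]bool) bool {
	for _, feasible := range volumes {
		if feasible {
			return true // WRONG: later volumes never get checked
		}
	}
	return false
}

// Fixed shape: every requested volume must pass.
func allFeasible(volumes map[string]bool) bool {
	for _, feasible := range volumes {
		if !feasible {
			return false
		}
	}
	return true
}

func main() {
	vols := map[string]bool{"vol-a": true, "vol-missing": false}
	fmt.Println(allFeasibleBuggy(vols)) // true or false, depending on order
	fmt.Println(allFeasible(vols))      // always false
}
```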
2021-03-10 15:18:36 -05:00
Seth Hoenig
05c3ffde9d consul/connect: correctly detect when connect tasks not updated
This PR fixes a bug where tasks with Connect services could be
triggered to destructively update (i.e. placed in a new alloc)
when no update should be necessary.

Fixes #10077
2021-02-23 15:12:49 -06:00
Nick Ethier
a957a5cd20 Merge pull request #9937 from hashicorp/b-9728
scheduler: add tests and fix for detecting host_network and port `to` field changes
2021-02-02 13:54:41 -05:00
Nick Ethier
ee7a5369b1 scheduler: add tests and fix for detecting host_network and port `to` field changes 2021-02-01 15:56:43 -05:00
Drew Bailey
182aa4a459 Persist shared allocated ports for inplace update (#9830)
* Persist shared allocated ports for inplace update

Ports were not copied over when performing inplace updates in the
generic scheduler

* changelog

* drop spew
2021-01-15 12:45:12 -05:00
Drew Bailey
9c3ce6b6dc persist shared ports during inplace updates (#9736)
AllocatedSharedResources were not being copied over to the new
allocation struct the scheduler makes during inplace updates. This
caused downstream issues after the plan was applied, namely that the shared
ports were dropped, causing issues with service
registration and deregistration.

test that shared ports are preserved

change log, also carry over shared network

copy networks
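
The fix in miniature, with simplified stand-in types (the real allocation structs carry many more fields):

```go
package main

import "fmt"

// Simplified stand-ins for the allocation structs involved.
type AllocatedSharedResources struct {
	Ports []int // shared (group-level) ports
}

type Allocation struct {
	Shared AllocatedSharedResources
}

// inplaceUpdate sketches the fix: the new allocation the scheduler
// builds must carry over the existing shared resources, otherwise the
// ports are dropped and service (de)registration breaks downstream.
func inplaceUpdate(existing *Allocation) *Allocation {
	updated := &Allocation{}
	updated.Shared = existing.Shared // the line that was missing
	return updated
}

func main() {
	old := &Allocation{Shared: AllocatedSharedResources{Ports: []int{20000}}}
	fmt.Println(inplaceUpdate(old).Shared.Ports) // [20000]
}
```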
2021-01-08 09:00:41 -05:00
Kris Hicks
7747124ef0 Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Kris Hicks
85ed8ddd4f Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Kris Hicks
be6e5e9e6d scheduler: Fix always-false sort func (#9547)
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
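
The class of bug being fixed, in miniature (a hypothetical example, not the actual call site): a comparator that compares an element with itself is always false, so `sort.Slice` never reorders anything.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	scores := []int{3, 1, 2}

	// Buggy shape: comparing index i's element with itself is always
	// false, so this "sort" leaves the slice order unspecified.
	sort.Slice(scores, func(i, j int) bool { return scores[i] < scores[i] })

	// Fixed: compare the two distinct elements.
	sort.Slice(scores, func(i, j int) bool { return scores[i] < scores[j] })
	fmt.Println(scores) // [1 2 3]
}
```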
2020-12-08 09:57:47 -08:00
Nick Ethier
1c14f69853 command: remove task network usage from init examples 2020-11-23 10:25:11 -06:00
Seth Hoenig
3826e23047 scheduler: enable upgrade path for bridge network fingerprint
This PR enables users of Nomad < 0.12 to upgrade to Nomad 0.12
and beyond. Nomad 0.12 introduced a network fingerprinter for
bridge networks, which adds a constraint that is checked whenever a
bridge network is used. If users upgrade servers first, as is
recommended, suddenly no clients running older versions of Nomad
will satisfy the bridge network resource constraint. Instead,
this change only enforces the constraint if the Nomad client
version is also >= 0.12.

Closes #8423
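
A sketch of the version gate with hashicorp/go-version (the helper name is hypothetical):

```go
package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// requiresBridgeFingerprint is a hypothetical helper: the bridge network
// constraint is only enforced for clients that are new enough (>= 0.12.0)
// to produce the bridge network fingerprint in the first place.
func requiresBridgeFingerprint(clientVersion string) (bool, error) {
	v, err := version.NewVersion(clientVersion)
	if err != nil {
		return false, err
	}
	minVer := version.Must(version.NewVersion("0.12.0"))
	return v.Compare(minVer) >= 0, nil
}

func main() {
	for _, cv := range []string{"0.11.3", "0.12.1"} {
		enforce, _ := requiresBridgeFingerprint(cv)
		fmt.Printf("client %s: enforce bridge constraint = %v\n", cv, enforce)
	}
}
```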
2020-11-13 14:17:01 -06:00
Drew Bailey
7ce0b5017c Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Michael Schurter
116b2b8b35 Merge pull request #9055 from hashicorp/f-9017-resources
api: add field filters to /v1/{allocations,nodes}
2020-10-14 14:49:39 -07:00
Michael Schurter
a55f46e9ba api: add field filters to /v1/{allocations,nodes}
Fixes #9017

The ?resources=true query parameter includes resources in the object
stub listings. Specifically:

- For `/v1/nodes?resources=true` both the `NodeResources` and
  `ReservedResources` field are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
  included.

The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
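
For example, a minimal Go client exercising the new parameter against a local agent (address and error handling simplified):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// Fetch node stubs with resources included; assumes a local agent on
// the default port.
func main() {
	resp, err := http.Get("http://127.0.0.1:4646/v1/nodes?resources=true")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // stubs now carry NodeResources/ReservedResources
}
```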
2020-10-14 10:35:22 -07:00
Drew Bailey
004f634868 use Events to wrap index and events, store in events table 2020-10-14 12:44:39 -04:00
Drew Bailey
47d1a33eb5 writetxn can return error, add alloc and job generic events. Add events
table for durability
2020-10-14 12:44:39 -04:00
Drew Bailey
4f97bf8ef7 Events/eval alloc events (#9012)
* generic eval update event

first pass at alloc client update events

* api/event client
2020-10-14 12:44:37 -04:00
Drew Bailey
8d16846695 Events/deployment events (#9004)
* Node Drain events and Node Events (#8980)

Deployment status updates

handle deployment status updates (paused, failed, resume)

deployment alloc health

generate events from apply plan result

txn err check, slim down deployment event

one ndjson line per index

* consolidate down to node event + type

* fix UpdateDeploymentAllocHealth test invocations

* fix test
2020-10-14 12:44:37 -04:00
Tim Gross
7cff2de942 csi: allow more than 1 writer claim for multi-writer mode (#9040)
Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode
were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether
the volume had writer claims available for scheduling.

Extends the CSI claim endpoint test to exercise multi-reader mode and makes sure `WriteFreeClaims`
is exercised for multi-writer in the feasibility test.
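
In miniature, the corrected claim logic distinguishes the two modes like this (simplified types, not the actual `WriteFreeClaims` implementation):

```go
package main

import "fmt"

type accessMode string

const (
	singleNodeWriter      accessMode = "single-node-writer"
	multiNodeSingleWriter accessMode = "multi-node-single-writer"
	multiNodeMultiWriter  accessMode = "multi-node-multi-writer"
)

// writeFreeClaims sketches the fix: single-writer modes only have a free
// write claim when no writer holds the volume, while multi-writer mode
// always admits another writer.
func writeFreeClaims(mode accessMode, writeClaims int) bool {
	switch mode {
	case singleNodeWriter, multiNodeSingleWriter:
		return writeClaims == 0
	case multiNodeMultiWriter:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(writeFreeClaims(multiNodeSingleWriter, 1)) // false
	fmt.Println(writeFreeClaims(multiNodeMultiWriter, 1))  // true
}
```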
2020-10-07 10:43:23 -04:00
Seth Hoenig
37cd0eced3 consul/connect: trigger update as necessary on connect changes
This PR fixes a long-standing bug where submitting jobs with changes
to connect services would not trigger updates as expected. Previously,
service blocks were not considered as sources of destructive updates
since they could be synced with consul non-destructively. With Connect,
task group services that have changes to their connect block or to
the service port should be destructive, since the network plumbing of
the alloc is going to need updating.

Fixes #8596 #7991

Non-destructive half in #7192
2020-10-05 14:53:00 -05:00
Neil Mock
ed385bd0ea Fix multi-interface networking in the system scheduler (#8822) 2020-09-22 12:54:34 -04:00
Mahmood Ali
49a46188a5 Merge pull request #8867 from hashicorp/b-canary-substitution
scheduler: Revert requireCanary logic
2020-09-15 12:58:55 -05:00
Mahmood Ali
a80ccb82cb Only ignore rescheduled allocations if they got stopped 2020-09-14 21:11:52 -04:00
Mahmood Ali
c6848d3d45 add a test when .NextAllocation is set but alloc is still running 2020-09-14 17:12:53 -04:00
Mahmood Ali
64175dccee Revert the requireCanary check introduced in https://github.com/hashicorp/nomad/pull/8691/files#diff-1801138ac4d10f2064ba6f2e434ac9b4L430-R431 .
The change was intended to fix a case where a canary alloc may fail to
be rescheduled if all the other allocs fail as well (e.g. if all allocs
happen to be placed on a node that died).  However, it introduced some
unintended side-effects.

Reverting the change for now and will investigate further.
2020-09-10 14:59:02 -04:00
Mahmood Ali
2bb78ac2c1 test for rescheduling non-canaries 2020-09-10 14:59:02 -04:00
Mahmood Ali
6f6a93b262 Handle migration of non-deployment jobs
This handles the case where a job went from no-deployment to deployment
with canaries.

Consider a case where a `max_parallel=0` job is submitted as version 0,
then an update is submitted with `max_parallel=1, canary=1` as version 1.
In this case, we will have 1 canary alloc, and all remaining allocs will
be version 0.  Until the deployment is promoted, we ought to replace the
canaries with the version 0 job (which isn't associated with a deployment).
2020-08-26 10:36:34 -04:00
Mahmood Ali
f075bcc811 Update scheduler/reconcile.go
Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>
2020-08-25 17:37:19 -04:00
Mahmood Ali
3a28b85b8a simplify canary check
`(alloc.DeploymentStatus == nil || !alloc.DeploymentStatus.IsCanary())`
and `!alloc.DeploymentStatus.IsCanary()` are equivalent.
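
The equivalence holds because the method is nil-safe; a sketch of the pattern, simplified from the real struct:

```go
package structs

// Sketch of the nil-safe receiver pattern that makes the two expressions
// above equivalent: calling IsCanary on a nil *AllocDeploymentStatus is
// well-defined and returns false. Simplified from the real struct.
type AllocDeploymentStatus struct {
	Canary bool
}

func (a *AllocDeploymentStatus) IsCanary() bool {
	if a == nil {
		return false
	}
	return a.Canary
}
```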
2020-08-25 17:37:19 -04:00
Mahmood Ali
92bb3728c9 tweak stack job manipulation
To address review comments
2020-08-25 17:37:19 -04:00
Mahmood Ali
cb038b1a8c Have Plan.AppendAlloc accept the job 2020-08-25 17:22:09 -04:00