Commit Graph

822 Commits

Author SHA1 Message Date
Seth Hoenig
b21aeb8715 main: remove deprecated uses of rand.Seed (#16074)
* main: remove deprecated uses of rand.Seed

go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.

* cl: update cl

* mod: update go mod
2023-02-07 09:19:38 -06:00
Tim Gross
2d7d6338b7 scheduler: move utils into files specific to their scheduler type (#16051)
Many of the functions in the `utils.go` file are specific to a particular
scheduler, and very few of them have guards (or even names) that help avoid
misuse with features specific to a given scheduler type. Move these
functions (and their tests) into files specific to their scheduler type without
any functionality changes to make it clear which bits go with what.
2023-02-03 12:29:39 -05:00
Tim Gross
ba20138ffd System and sysbatch jobs always have zero index (#16030)
Service jobs should have unique allocation Names, derived from the
Job.ID. System jobs do not have unique allocation Names because the index is
intended to indicated the instance out of a desired count size. Because system
jobs do not have an explicit count but the results are based on the targeted
nodes, the index is less informative and this was intentionally omitted from the
original design.

Update docs to make it clear that NOMAD_ALLOC_INDEX is always zero for 
system/sysbatch jobs

Validate that `volume.per_alloc` is incompatible with system/sysbatch jobs.
System and sysbatch jobs always have a `NOMAD_ALLOC_INDEX` of 0. So
interpolation via `per_alloc` will not work as soon as there's more than one
allocation placed. Validate against this on job submission.
2023-02-02 16:18:01 -05:00
Charlie Voiselle
fe4ff5be2a Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
jmwilkinson
46f3977db2 Allow wildcard datacenters to be specified in job file (#11170)
Also allows for default value of `datacenters = ["*"]`
2023-02-02 09:57:45 -05:00
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
Yorick Gersie
24a575ab80 Allow per_alloc to be used with host volumes (#15780)
Disallowing per_alloc for host volumes in some cases makes life of a nomad user much harder.
When we rely on the NOMAD_ALLOC_INDEX for any configuration that needs to be re-used across
restarts we need to make sure allocation placement is consistent. With CSI volumes we can
use the `per_alloc` feature but for some reason this is explicitly disabled for host volumes.

Ensure host volumes understand the concept of per_alloc
2023-01-26 09:14:47 -05:00
Luiz Aoqui
1318477789 scheduler: allow using device ID as attribute (#15455)
Devices are fingerprinted as groups of similar devices. This prevented
specifying specific device by their ID in constraint and affinity rules.

This commit introduces the `${device.ids}` attribute that returns a
comma separated list of IDs that are part of the device group. Users can
then use the set operators to write rules.
2023-01-10 14:28:23 -05:00
Luiz Aoqui
a55c12418f scheduler: create placements for non-register MRD (#15325)
* scheduler: create placements for non-register MRD

For multiregion jobs, the scheduler does not create placements on
registration because the deployment must wait for the other regions.
Once of these regions will then trigger the deployment to run.

Currently, this is done in the scheduler by considering any eval for a
multiregion job as "paused" since it's expected that another region will
eventually unpause it.

This becomes a problem where evals not triggered by a job registration
happen, such as on a node update. These types of regional changes do not
have other regions waiting to progress the deployment, and so they were
never resulting in placements.

The fix is to create a deployment at job registration time. This
additional piece of state allows the scheduler to differentiate between
a multiregion change, where there are other regions engaged in the
deployment so no placements are required, from a regional change, where
the scheduler does need to create placements.

This deployment starts in the new "initializing" status to signal to the
scheduler that it needs to compute the initial deployment state. The
multiregion deployment will wait until this deployment state is
persisted and its starts is set to "pending". Without this state
transition it's possible to hit a race condition where the plan applier
and the deployment watcher may step of each other and overwrite their
changes.

* changelog: add entry for #15325
2022-11-25 12:45:34 -05:00
Tim Gross
8018dd0ee8 scheduler: set job on system stack for CSI feasibility check (#15372)
When the scheduler checks feasibility of each node, it creates a "stack" which
carries attributes of the job and task group it needs to check feasibility
for. The `system` and `sysbatch` scheduler use a different stack than `service`
and `batch` jobs. This stack was missing the call to set the job ID and
namespace for the CSI check. This prevents CSI volumes from being scheduled for
system jobs whenever the volume is in a non-default namespace.

Set the job ID and namespace to match the generic scheduler.
2022-11-23 16:47:35 -05:00
Luiz Aoqui
6a3cf74f32 scheduler: log stack in case of panic (#15303) 2022-11-17 18:59:33 -05:00
Luiz Aoqui
7828c02a50 Update alloc after reconnect and enforece client heartbeat order (#15068)
* scheduler: allow updates after alloc reconnects

When an allocation reconnects to a cluster the scheduler needs to run
special logic to handle the reconnection, check if a replacement was
create and stop one of them.

If the allocation kept running while the node was disconnected, it will
be reconnected with `ClientStatus: running` and the node will have
`Status: ready`. This combination is the same as the normal steady state
of allocation, where everything is running as expected.

In order to differentiate between the two states (an allocation that is
reconnecting and one that is just running) the scheduler needs an extra
piece of state.

The current implementation uses the presence of a
`TaskClientReconnected` task event to detect when the allocation has
reconnected and thus must go through the reconnection process. But this
event remains even after the allocation is reconnected, causing all
future evals to consider the allocation as still reconnecting.

This commit changes the reconnect logic to use an `AllocState` to
register when the allocation was reconnected. This provides the
following benefits:

  - Only a limited number of task states are kept, and they are used for
    many other events. It's possible that, upon reconnecting, several
    actions are triggered that could cause the `TaskClientReconnected`
    event to be dropped.
  - Task events are set by clients and so their timestamps are subject
    to time skew from servers. This prevents using time to determine if
    an allocation reconnected after a disconnect event.
  - Disconnect events are already stored as `AllocState` and so storing
    reconnects there as well makes it the only source of information
    required.

With the new logic, the reconnection logic is only triggered if the
last `AllocState` is a disconnect event, meaning that the allocation has
not been reconnected yet. After the reconnection is handled, the new
`ClientStatus` is store in `AllocState` allowing future evals to skip
the reconnection logic.

* scheduler: prevent spurious placement on reconnect

When a client reconnects it makes two independent RPC calls:

  - `Node.UpdateStatus` to heartbeat and set its status as `ready`.
  - `Node.UpdateAlloc` to update the status of its allocations.

These two calls can happen in any order, and in case the allocations are
updated before a heartbeat it causes the state to be the same as a node
being disconnected: the node status will still be `disconnected` while
the allocation `ClientStatus` is set to `running`.

The current implementation did not handle this order of events properly,
and the scheduler would create an unnecessary placement since it
considered the allocation was being disconnected. This extra allocation
would then be quickly stopped by the heartbeat eval.

This commit adds a new code path to handle this order of events. If the
node is `disconnected` and the allocation `ClientStatus` is `running`
the scheduler will check if the allocation is actually reconnecting
using its `AllocState` events.

* rpc: only allow alloc updates from `ready` nodes

Clients interact with servers using three main RPC methods:

  - `Node.GetAllocs` reads allocation data from the server and writes it
    to the client.
  - `Node.UpdateAlloc` reads allocation from from the client and writes
    them to the server.
  - `Node.UpdateStatus` writes the client status to the server and is
    used as the heartbeat mechanism.

These three methods are called periodically by the clients and are done
so independently from each other, meaning that there can't be any
assumptions in their ordering.

This can generate scenarios that are hard to reason about and to code
for. For example, when a client misses too many heartbeats it will be
considered `down` or `disconnected` and the allocations it was running
are set to `lost` or `unknown`.

When connectivity is restored the to rest of the cluster, the natural
mental model is to think that the client will heartbeat first and then
update its allocations status into the servers.

But since there's no inherit order in these calls the reverse is just as
possible: the client updates the alloc status and then heartbeats. This
results in a state where allocs are, for example, `running` while the
client is still `disconnected`.

This commit adds a new verification to the `Node.UpdateAlloc` method to
reject updates from nodes that are not `ready`, forcing clients to
heartbeat first. Since this check is done server-side there is no need
to coordinate operations client-side: they can continue sending these
requests independently and alloc update will succeed after the heartbeat
is done.

* chagelog: add entry for #15068

* code review

* client: skip terminal allocations on reconnect

When the client reconnects with the server it synchronizes the state of
its allocations by sending data using the `Node.UpdateAlloc` RPC and
fetching data using the `Node.GetClientAllocs` RPC.

If the data fetch happens before the data write, `unknown` allocations
will still be in this state and would trigger the
`allocRunner.Reconnect` flow.

But when the server `DesiredStatus` for the allocation is `stop` the
client should not reconnect the allocation.

* apply more code review changes

* scheduler: persist changes to reconnected allocs

Reconnected allocs have a new AllocState entry that must be persisted by
the plan applier.

* rpc: read node ID from allocs in UpdateAlloc

The AllocUpdateRequest struct is used in three disjoint use cases:

1. Stripped allocs from clients Node.UpdateAlloc RPC using the Allocs,
   and WriteRequest fields
2. Raft log message using the Allocs, Evals, and WriteRequest fields
3. Plan updates using the AllocsStopped, AllocsUpdated, and Job fields

Adding a new field that would only be used in one these cases (1) made
things more confusing and error prone. While in theory an
AllocUpdateRequest could send allocations from different nodes, in
practice this never actually happens since only clients call this method
with their own allocations.

* scheduler: remove logic to handle exceptional case

This condition could only be hit if, somehow, the allocation status was
set to "running" while the client was "unknown". This was addressed by
enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC
calls, so this scenario is not expected to happen.

Adding unnecessary code to the scheduler makes it harder to read and
reason about it.

* more code review

* remove another unused test
2022-11-04 16:25:11 -04:00
Charlie Voiselle
52a254ba22 template: error on missing key (#15141)
* Support error_on_missing_value for templates
* Update docs for template stanza
2022-11-04 13:23:01 -04:00
Tim Gross
1b434184e4 make version checks specific to region (1.4.x) (#14912)
* One-time tokens are not replicated between regions, so we don't want to enforce
  that the version check across all of serf, just members in the same region.
* Scheduler: Disconnected clients handling is specific to a single region, so we
  don't want to enforce that the version check across all of serf, just members in
  the same region.
* Variables: enforce version check in Apply RPC
* Cleans up a bunch of legacy checks.

This changeset is specific to 1.4.x and the changes for previous versions of
Nomad will be manually backported in a separate PR.
2022-10-17 16:23:51 -04:00
Seth Hoenig
fed329883e cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
Will Jordan
8bdeb58f20 Allow jobs not requiring any network resources (#14300)
Jobs not requiring any network resources should be allowed
even when the network fingerprinter is disabled.
2022-10-06 16:25:41 -04:00
Seth Hoenig
1e5f6188fb core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in schedular/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
Derek Strickland
85a4df5e58 scheduler: Fix bug where the would treat multiregion jobs as paused for job types that don't use deployments (#14659)
* scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments

Co-authored-by: Tim Gross <tgross@hashicorp.com>

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-22 14:31:27 -04:00
Seth Hoenig
ff1a30fe8d cleanup more helper updates (#14638)
* cleanup: refactor MapStringStringSliceValueSet to be cleaner

* cleanup: replace SliceStringToSet with actual set

* cleanup: replace SliceStringSubset with real set

* cleanup: replace SliceStringContains with slices.Contains

* cleanup: remove unused function SliceStringHasPrefix

* cleanup: fixup StringHasPrefixInSlice doc string

* cleanup: refactor SliceSetDisjoint to use real set

* cleanup: replace CompareSliceSetString with SliceSetEq

* cleanup: replace CompareMapStringString with maps.Equal

* cleanup: replace CopyMapStringString with CopyMap

* cleanup: replace CopyMapStringInterface with CopyMap

* cleanup: fixup more CopyMapStringString and CopyMapStringInt

* cleanup: replace CopySliceString with slices.Clone

* cleanup: remove unused CopySliceInt

* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice

* cleanup: replace CopyMap with maps.Clone

* cleanup: run go mod tidy
2022-09-21 14:53:25 -05:00
Mahmood Ali
757c3c94f2 scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
James Rasell
25e7c2ffa4 chore: remove use of "err" a log line context key for errors. (#14433)
Log lines which include an error should use the full term "error"
as the context key. This provides consistency across the codebase
and avoids a Go style which operators might not be aware of.
2022-09-01 15:06:10 +02:00
Piotr Kazmierczak
c4be2c6078 cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig
0c62f445c3 build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Abirdcfly
9bfed7893a fix minor unreachable code caused by t.Fatal
Signed-off-by: Abirdcfly <fp544037857@gmail.com>
2022-08-08 23:50:11 +08:00
Michael Schurter
f998a2b77b core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Tim Gross
596203c7ff snapshot restore-from-archive streaming and filtering (#13658)
Stream snapshot to FSM when restoring from archive
The `RestoreFromArchive` helper decompresses the snapshot archive to a
temporary file before reading it into the FSM. For large snapshots
this performs a lot of disk IO. Stream decompress the snapshot as we
read it, without first writing to a temporary file.

Add bexpr filters to the `RestoreFromArchive` helper.
The operator can pass these as `-filter` arguments to `nomad operator
snapshot state` (and other commands in the future) to include only
desired data when reading the snapshot.
2022-07-11 10:48:00 -04:00
Tim Gross
b74712321b CSI: no early return when feasibility check fails on eligible nodes (#13274)
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
2022-06-07 13:31:10 -04:00
Seth Hoenig
682dbaac5a core: reschedule evicted batch job when resources become available
This PR fixes a bug where an evicted batch job would not be rescheduled
once resources become available.

Closes #9890
2022-06-02 14:04:13 -05:00
Tim Gross
7c1c117e14 scheduler: volume updates should always be destructive (#13008) 2022-05-13 11:34:04 -04:00
Tim Gross
3aa520e0bd CSI: plugin config updates should always be destructive (#12774) 2022-04-25 12:59:25 -04:00
Derek Strickland
5b1413a1ae reconciler: Handle canaries when client disconnects (#12539)
* plan_apply: Allow node updates in disconnected node plans
* plan: Keep the job when persisting unknown allocs
* reconciler: stop unknown allocs when stopping all
* reconcile_util: reorder filtering to handle canaries; skip rescheduling unknown
* heartbeat: Fix bug in node heartbeating
2022-04-21 10:05:58 -04:00
Derek Strickland
f5de802993 system_scheduler: support disconnected clients (#12555)
* structs: Add helper method for checking if alloc is configured to disconnect
* system_scheduler: Add support for disconnected clients
2022-04-15 09:31:32 -04:00
Tim Gross
69cdc80984 allocs without max_client_disconnect should be lost on disconnect (#12529)
In the reconciler's filtering for tainted nodes, we use whether the
server supports disconnected clients as a gate to a bunch of our
logic, but this doesn't account for cases where the job doesn't have
`max_client_disconnect`. The only real consequence of this appears to
be that allocs on disconnected nodes are marked "complete" instead of
"lost".
2022-04-11 11:24:49 -04:00
Tim Gross
6a49a0fb81 set minimum version for disconnected client mode to 1.3.0 (#12530) 2022-04-08 16:48:37 -04:00
Derek Strickland
8863d1e45a disconnected clients: Support operator manual interventions (#12436)
* allocrunner: Remove Shutdown call in Reconnect
* Node.UpdateAlloc: Stop orphaned allocs.
* reconciler: Stop failed reconnects.
* Apply feedback from code review. Handle rebase conflict.
* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-04-06 09:33:32 -04:00
Derek Strickland
8ac3e642e6 reconciler: 2 phase reconnects and tests (#12333)
* structs: Add alloc.Expired & alloc.Reconnected functions. Add Reconnect eval trigger by.

* node_endpoint: Emit new eval for reconnecting unknown allocs.

* filterByTainted: handle 2 phase commit filtering rules.

* reconciler: Append AllocState on disconnect. Logic updates from testing and 2 phase reconnects.

* allocs: Set reconnect timestamp. Destroy if not DesiredStatusRun. Watch for unknown status.
2022-04-05 17:13:10 -04:00
Derek Strickland
9a82b63686 comments: update some stale comments referencing deprecated config name (#12271)
* comments: update some stale comments referencing deprecated config name
2022-04-05 17:12:23 -04:00
Derek Strickland
bab317300e Add description for allocs stopped due to reconnect (#12270) 2022-04-05 17:12:23 -04:00
Derek Strickland
6329f44148 disconnected clients: ensure servers meet minimum required version (#12202)
* planner: expose ServerMeetsMinimumVersion via Planner interface
* filterByTainted: add flag indicating disconnect support
* allocReconciler: accept and pass disconnect support flag
* tests: update dependent tests
2022-04-05 17:12:23 -04:00
Derek Strickland
5b5c853597 disconnected clients: Observability plumbing (#12141)
* Add disconnects/reconnect to log output and emit reschedule metrics

* TaskGroupSummary: Add Unknown, update StateStore logic, add to metrics
2022-04-05 17:12:23 -04:00
Derek Strickland
35752655b0 disconnected clients: Add reconnect task event (#12133)
* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push
2022-04-05 17:12:23 -04:00
DerekStrickland
97ce949f0e reconciler: fix loop control bug 2022-04-05 17:12:22 -04:00
Derek Strickland
786180601d reconciler: support disconnected clients (#12058)
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-04-05 17:10:37 -04:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Lars Lehtonen
8c8a8d2e23 scheduler: fix unused dstate variable (#12268) 2022-03-14 10:00:59 -04:00
Tim Gross
7d0f87b910 CSI: allow updates to volumes on re-registration (#12167)
CSI `CreateVolume` RPC is idempotent given that the topology,
capabilities, and parameters are unchanged. CSI volumes have many
user-defined fields that are immutable once set, and many fields that
are not user-settable.

Update the `Register` RPC so that updating a volume via the API merges
onto any existing volume without touching Nomad-controlled fields,
while validating it with the same strict requirements expected for
idempotent `CreateVolume` RPCs.

Also, clarify that this state store method is used for everything, not just
for the `Register` RPC.
2022-03-07 11:06:59 -05:00
Tim Gross
03a8d72dba CSI: implement support for topology (#12129) 2022-03-01 10:15:46 -05:00
Tim Gross
7f5a0c541c CSI: minor refactoring (#12105)
* rename method checking that free write claims are available
* use package-level variables for claim errors
* semgrep fix for testify
2022-02-23 11:13:51 -05:00
Lars Lehtonen
f8d472a18c scheduler: fix dropped test error 2022-02-14 22:11:45 -08:00
Derek Strickland
cefc58dd7b reconciler: refactor computeGroup (#12033)
The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include.

Renamed functions

- handleGroupCanaries -> cancelUnneededCanaries
- handleDelayedLost -> createLostLaterEvals
- handeDelayedReschedules -> createRescheduleLaterEvals

New functions

- filterAndStopAll
- initializeDeploymentState
- requiresCanaries
- computeCanaries
- computeUnderProvisionedBy
- computeReplacements
- computeDestructiveUpdates
- computeMigrations
- createDeployment
- isDeploymentComplete
2022-02-10 16:24:51 -05:00