mirror of
https://github.com/kemko/nomad.git
synced 2026-01-06 18:35:44 +03:00
Added sentence about job anti-affinity; Reflowed
This will create a concrete mention of job anti-affminity in the Nomad documentation. The only place we discuss it currently is in a similar sentence on the website itself. I borrowed liberally from that sentence in crafting this line.
This commit is contained in:
@@ -9,91 +9,101 @@ description: |-
|
||||
# Scheduling
|
||||
|
||||
Scheduling is a core function of Nomad. It is the process of assigning tasks
|
||||
from jobs to client machines. This process must respect the constraints as declared
|
||||
in the job, and optimize for resource utilization. This page documents the details
|
||||
of how scheduling works in Nomad to help both users and developers
|
||||
build a mental model. The design is heavily inspired by Google's
|
||||
work on both [Omega: flexible, scalable schedulers for large compute clusters](https://research.google.com/pubs/pub41684.html)
|
||||
and [Large-scale cluster management at Google with Borg](https://research.google.com/pubs/pub43438.html).
|
||||
from jobs to client machines. This process must respect the constraints as
|
||||
declared in the job, and optimize for resource utilization. This page documents
|
||||
the details of how scheduling works in Nomad to help both users and developers
|
||||
build a mental model. The design is heavily inspired by Google's work on both
|
||||
[Omega: flexible, scalable schedulers for large compute clusters][Omega] and
|
||||
[Large-scale cluster management at Google with Borg][Borg].
|
||||
|
||||
~> **Advanced Topic!** This page covers technical details
|
||||
of Nomad. You do not need to understand these details to
|
||||
effectively use Nomad. The details are documented here for
|
||||
those who wish to learn about them without having to go
|
||||
spelunking through the source code.
|
||||
~> **Advanced Topic!** This page covers technical details of Nomad. You do not
|
||||
~> need to understand these details to effectively use Nomad. The details are
|
||||
~> documented here for those who wish to learn about them without having to
|
||||
~> go spelunking through the source code.
|
||||
|
||||
# Scheduling in Nomad
|
||||
|
||||
[](/assets/images/nomad-data-model.png)
|
||||
[![Nomad Data Model][img-data-model]][img-data-model]
|
||||
|
||||
There are four primary "nouns" in Nomad; jobs, nodes, allocations, and evaluations.
|
||||
Jobs are submitted by users and represent a _desired state_. A job is a declarative description
|
||||
of tasks to run which are bounded by constraints and require resources. Tasks can be scheduled on
|
||||
nodes in the cluster running the Nomad client. The mapping of tasks in a job to clients is done
|
||||
using allocations. An allocation is used to declare that a set of tasks in a job should be run
|
||||
on a particular node. Scheduling is the process of determining the appropriate allocations and
|
||||
is done as part of an evaluation.
|
||||
There are four primary "nouns" in Nomad; jobs, nodes, allocations, and
|
||||
evaluations. Jobs are submitted by users and represent a _desired state_. A job
|
||||
is a declarative description of tasks to run which are bounded by constraints
|
||||
and require resources. Tasks can be scheduled on nodes in the cluster running
|
||||
the Nomad client. The mapping of tasks in a job to clients is done using
|
||||
allocations. An allocation is used to declare that a set of tasks in a job
|
||||
should be run on a particular node. Scheduling is the process of determining
|
||||
the appropriate allocations and is done as part of an evaluation.
|
||||
|
||||
An evaluation is created any time the external state, either desired or emergent, changes. The desired
|
||||
state is based on jobs, meaning the desired state changes if a new job is submitted, an
|
||||
existing job is updated, or a job is deregistered. The emergent state is based on the client
|
||||
nodes, and so we must handle the failure of any clients in the system. These events trigger
|
||||
the creation of a new evaluation, as Nomad must _evaluate_ the state of the world and reconcile
|
||||
it with the desired state.
|
||||
An evaluation is created any time the external state, either desired or
|
||||
emergent, changes. The desired state is based on jobs, meaning the desired
|
||||
state changes if a new job is submitted, an existing job is updated, or a job
|
||||
is deregistered. The emergent state is based on the client nodes, and so we
|
||||
must handle the failure of any clients in the system. These events trigger the
|
||||
creation of a new evaluation, as Nomad must _evaluate_ the state of the world
|
||||
and reconcile it with the desired state.
|
||||
|
||||
This diagram shows the flow of an evaluation through Nomad:
|
||||
|
||||
[](/assets/images/nomad-evaluation-flow.png)
|
||||
[![Nomad Evaluation Flow][img-eval-flow]][img-eval-flow]
|
||||
|
||||
The lifecycle of an evaluation begins with an event causing the evaluation to be
|
||||
created. Evaluations are created in the `pending` state and are enqueued into the
|
||||
evaluation broker. There is a single evaluation broker which runs on the leader server.
|
||||
The evaluation broker is used to manage the queue of pending evaluations, provide priority ordering,
|
||||
and ensure at least once delivery.
|
||||
The lifecycle of an evaluation begins with an event causing the evaluation to
|
||||
be created. Evaluations are created in the `pending` state and are enqueued
|
||||
into the evaluation broker. There is a single evaluation broker which runs on
|
||||
the leader server. The evaluation broker is used to manage the queue of pending
|
||||
evaluations, provide priority ordering, and ensure at least once delivery.
|
||||
|
||||
Nomad servers run scheduling workers, defaulting to one per CPU core, which are used to
|
||||
process evaluations. The workers dequeue evaluations from the broker, and then invoke
|
||||
the appropriate scheduler as specified by the job. Nomad ships with a `service` scheduler
|
||||
that optimizes for long-lived services, a `batch` scheduler that is used for fast placement
|
||||
of batch jobs, a `system` scheduler that is used to run jobs on every node,
|
||||
and a `core` scheduler which is used for internal maintenance.
|
||||
Nomad can be extended to support custom schedulers as well.
|
||||
Nomad servers run scheduling workers, defaulting to one per CPU core, which are
|
||||
used to process evaluations. The workers dequeue evaluations from the broker,
|
||||
and then invoke the appropriate scheduler as specified by the job. Nomad ships
|
||||
with a `service` scheduler that optimizes for long-lived services, a `batch`
|
||||
scheduler that is used for fast placement of batch jobs, a `system` scheduler
|
||||
that is used to run jobs on every node, and a `core` scheduler which is used
|
||||
for internal maintenance. Nomad can be extended to support custom schedulers as
|
||||
well.
|
||||
|
||||
Schedulers are responsible for processing an evaluation and generating an allocation _plan_.
|
||||
The plan is the set of allocations to evict, update, or create. The specific logic used to
|
||||
generate a plan may vary by scheduler, but generally the scheduler needs to first reconcile
|
||||
the desired state with the real state to determine what must be done. New allocations need
|
||||
to be placed and existing allocations may need to be updated, migrated, or stopped.
|
||||
Schedulers are responsible for processing an evaluation and generating an
|
||||
allocation _plan_. The plan is the set of allocations to evict, update, or
|
||||
create. The specific logic used to generate a plan may vary by scheduler, but
|
||||
generally the scheduler needs to first reconcile the desired state with the
|
||||
real state to determine what must be done. New allocations need to be placed
|
||||
and existing allocations may need to be updated, migrated, or stopped.
|
||||
|
||||
Placing allocations is split into two distinct phases, feasibility
|
||||
checking and ranking. In the first phase the scheduler finds nodes that are
|
||||
feasible by filtering unhealthy nodes, those missing necessary drivers, and those
|
||||
failing the specified constraints.
|
||||
Placing allocations is split into two distinct phases, feasibility checking and
|
||||
ranking. In the first phase the scheduler finds nodes that are feasible by
|
||||
filtering unhealthy nodes, those missing necessary drivers, and those failing
|
||||
the specified constraints.
|
||||
|
||||
The second phase is ranking, where the scheduler scores feasible nodes to find the best fit.
|
||||
Scoring is primarily based on bin packing, which is used to optimize the resource utilization
|
||||
and density of applications, but is also augmented by affinity and anti-affinity rules. One such
|
||||
anti-affinity rule exists to avoid colocating instances of the same service to reduce the
|
||||
The second phase is ranking, where the scheduler scores feasible nodes to find
|
||||
the best fit. Scoring is primarily based on bin packing, which is used to
|
||||
optimize the resource utilization and density of applications, but is also
|
||||
augmented by affinity and anti-affinity rules. One anti-affinity rule
|
||||
attempts to avoid colocating instances of the same service to reduce the
|
||||
probability of correlated failures.
|
||||
|
||||
Once the scheduler has ranked enough nodes, the highest ranking node is selected and
|
||||
added to the allocation plan.
|
||||
Once the scheduler has ranked enough nodes, the highest ranking node is
|
||||
selected and added to the allocation plan.
|
||||
|
||||
When planning is complete, the scheduler submits the plan to the leader which adds
|
||||
the plan to the plan queue. The plan queue manages pending plans, provides priority
|
||||
ordering, and allows Nomad to handle concurrency races. Multiple schedulers are running
|
||||
in parallel without locking or reservations, making Nomad optimistically concurrent.
|
||||
As a result, schedulers might overlap work on the same node and cause resource
|
||||
over-subscription. The plan queue allows the leader node to protect against this and
|
||||
do partial or complete rejections of a plan.
|
||||
When planning is complete, the scheduler submits the plan to the leader which
|
||||
adds the plan to the plan queue. The plan queue manages pending plans, provides
|
||||
priority ordering, and allows Nomad to handle concurrency races. Multiple
|
||||
schedulers are running in parallel without locking or reservations, making
|
||||
Nomad optimistically concurrent. As a result, schedulers might overlap work on
|
||||
the same node and cause resource over-subscription. The plan queue allows the
|
||||
leader node to protect against this and do partial or complete rejections of a
|
||||
plan.
|
||||
|
||||
As the leader processes plans, it creates allocations when there is no conflict
|
||||
and otherwise informs the scheduler of a failure in the plan result. The plan result
|
||||
provides feedback to the scheduler, allowing it to terminate or explore alternate plans
|
||||
if the previous plan was partially or completely rejected.
|
||||
and otherwise informs the scheduler of a failure in the plan result. The plan
|
||||
result provides feedback to the scheduler, allowing it to terminate or explore
|
||||
alternate plans if the previous plan was partially or completely rejected.
|
||||
|
||||
Once the scheduler has finished processing an evaluation, it updates the status of
|
||||
the evaluation and acknowledges delivery with the evaluation broker. This completes
|
||||
the lifecycle of an evaluation. Allocations that were created, modified or deleted
|
||||
as a result will be picked up by client nodes and will begin execution.
|
||||
Once the scheduler has finished processing an evaluation, it updates the status
|
||||
of the evaluation and acknowledges delivery with the evaluation broker. This
|
||||
completes the lifecycle of an evaluation. Allocations that were created,
|
||||
modified or deleted as a result will be picked up by client nodes and will
|
||||
begin execution.
|
||||
|
||||
[Omega]: https://research.google.com/pubs/pub41684.html
|
||||
[Borg]: https://research.google.com/pubs/pub43438.html
|
||||
[img-data-model]: /assets/images/nomad-data-model.png
|
||||
[img-eval-flow]: /assets/images/nomad-evaluation-flow.png
|
||||
Reference in New Issue
Block a user