diff --git a/website/content/api-docs/jobs.mdx b/website/content/api-docs/jobs.mdx
index f8ea1f134..b7e6c5048 100644
--- a/website/content/api-docs/jobs.mdx
+++ b/website/content/api-docs/jobs.mdx
@@ -559,7 +559,7 @@ $ curl \
 - `Type`: The type of job in terms of scheduling. It can have one of the
   following values:
   - `service`: Allocations are intended to remain alive.
   - `batch`: Allocations are intended to exit.
-  - `system`: Each client gets an allocation.
+  - `system`: Each client in the datacenter and node pool gets an allocation.

 ## Read Job Submission
diff --git a/website/content/docs/concepts/architecture.mdx b/website/content/docs/concepts/architecture.mdx
index 7549bf4d8..80e4b10b1 100644
--- a/website/content/docs/concepts/architecture.mdx
+++ b/website/content/docs/concepts/architecture.mdx
@@ -93,6 +93,18 @@ Nomad models infrastructure as regions and datacenters. A region will contain
 one or more datacenters. A set of servers joined together will represent a
 single region. Servers federate across regions to make Nomad globally aware.

+In federated clusters one of the regions must be defined as the [authoritative
+region](#authoritative-and-non-authoritative-regions).
+
+#### Authoritative and Non-Authoritative Regions
+
+The authoritative region is the region in a federated multi-region cluster that
+holds the source of truth for entities replicated across regions, such as ACL
+tokens, policies, and roles, as well as namespaces and node pools.
+
+All other regions are considered non-authoritative regions and replicate these
+entities by pulling them from the authoritative region.
+
 #### Datacenters

 Nomad models a datacenter as an abstract grouping of clients within a
@@ -101,6 +113,20 @@ servers they are joined with, but do need to be in the same region.
 Datacenters provide a way to express fault tolerance among jobs as well as
 isolation of infrastructure.

+#### Node
+
+A node is a more generic term for a machine running a Nomad agent in client
+mode. Despite being different concepts, you may find "node" used
+interchangeably with "client" in some materials and informal content.
+
+#### Node Pool
+
+Node pools group [nodes](#node) and can be used to restrict which
+[jobs](#job) are able to place [allocations](#allocation) on a given set of
+nodes. Example use cases for node pools include segmenting nodes by environment
+(development, staging, production), by department (engineering, finance,
+support), or by functionality (databases, ingress proxy, applications).
+
 #### Bin Packing

 Bin Packing is the process of filling bins with items in a way that maximizes
@@ -169,7 +195,9 @@ are more details available for each of the sub-systems. The [consensus protocol]
 [gossip protocol](/nomad/docs/concepts/gossip), and
 [scheduler design](/nomad/docs/concepts/scheduling/scheduling) are all documented in more detail.

-For other details, either consult the code, ask in IRC or reach out to the mailing list.
-
+For other details, either consult the code, [open an issue on
+GitHub][gh_issue], or ask a question in the [community forum][forum].
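+
+As a brief illustration of the federation concepts above, servers in a
+non-authoritative region point at the authoritative region through the
+`authoritative_region` parameter in their server configuration. The following
+is a minimal sketch with hypothetical region names, not a complete
+configuration.
+
+```hcl
+# Hypothetical server agent in the non-authoritative "eu-west" region.
+region = "eu-west"
+
+server {
+  enabled = true
+
+  # Name the region that holds the source of truth for replicated entities.
+  authoritative_region = "us-east"
+}
+```
+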
 [`update`]: /nomad/docs/job-specification/update
+[gh_issue]: https://github.com/hashicorp/nomad/issues/new/choose
+[forum]: https://discuss.hashicorp.com/c/nomad
diff --git a/website/content/docs/concepts/node-pools.mdx b/website/content/docs/concepts/node-pools.mdx
new file mode 100644
index 000000000..e87474f7c
--- /dev/null
+++ b/website/content/docs/concepts/node-pools.mdx
@@ -0,0 +1,379 @@
+---
+layout: docs
+page_title: Node Pools
+description: Learn about node pools and how they can be used to group clients.
+---
+
+# Node Pools
+
+Node pools are a way to group clients and segment infrastructure into logical
+units that can be targeted by jobs for stronger control over where allocations
+are placed.
+
+Without node pools, allocations for a job can be placed on any eligible client
+in the cluster. Affinities and constraints can help express preferences for
+certain nodes, but they do not easily prevent other jobs from placing
+allocations on a set of nodes.
+
+A node pool can be created using the [`nomad node pool apply`][cli_np_apply]
+command and passing a node pool [specification file][np_spec].
+
+```hcl
+# dev-pool.nomad.hcl
+node_pool "dev" {
+  description = "Nodes for the development environment."
+
+  meta {
+    environment = "dev"
+    owner       = "sre"
+  }
+}
+```
+
+```shell-session
+$ nomad node pool apply dev-pool.nomad.hcl
+Successfully applied node pool "dev"!
+```
+
+Clients can then be added to this node pool by setting the
+[`node_pool`][client_np] attribute in their configuration file, or using the
+equivalent [`-node-pool`][cli_agent_np] command line flag.
+
+```hcl
+client {
+  # ...
+  node_pool = "dev"
+  # ...
+}
+```
+
+To help streamline this process, nodes can create node pools on demand. If a
+client configuration references a node pool that does not exist yet, Nomad
+creates the node pool automatically on client registration.
+
+This behavior does not apply to clients in non-authoritative regions. Refer
+to [Multi-region Clusters](#multi-region-clusters) for more information.
+
+Jobs can then reference node pools using the [`node_pool`][job_np] attribute.
+
+```hcl
+job "app-dev" {
+  # ...
+  node_pool = "dev"
+  # ...
+}
+```
+
+Similar to the `namespace` attribute, the node pool must exist beforehand;
+otherwise the job registration results in an error. Only nodes in the given
+node pool are considered for placement. If none are available, the deployment
+remains pending until a client is added to the node pool.
+
+## Multi-region Clusters
+
+In federated multi-region clusters, node pools are automatically replicated
+from the authoritative region to all non-authoritative regions, and requests to
+create or modify a node pool are forwarded from non-authoritative regions to
+the authoritative region.
+
+Since replication data only flows in one direction, clients in
+non-authoritative regions are not able to create node pools on demand.
+
+A client in a non-authoritative region that references a node pool that does
+not exist yet is kept in the `initializing` status until the node pool is
+created and replicated to all regions.
+
+## Built-in Node Pools
+
+In addition to user-generated node pools, Nomad automatically creates two
+built-in node pools that cannot be deleted or modified.
+
+- `default`: Node pools are an optional feature of Nomad. The `node_pool`
+  attribute in both the client configuration and job files is optional. When
+  not specified, the value is set to the `default` built-in node pool.
+
+- `all`: In some situations, it is useful to be able to run a job across all
+  clients in a cluster, regardless of their node pool configuration. For these
+  scenarios the job may use the built-in `all` node pool, which always includes
+  all clients registered in the cluster. Unlike other node pools, the `all`
+  node pool can only be used in jobs and not in client configuration.
+
+## Nomad Enterprise
+
+Nomad Enterprise provides additional features that make node pools more
+powerful and easier to manage.
+
+### Scheduler Configuration
+
+Node pools in Nomad Enterprise are able to customize some aspects of the Nomad
+scheduler and override certain global configuration values per node pool.
+
+This allows experimenting with features such as memory oversubscription in
+isolation, or adjusting the scheduler algorithm between `spread` and `binpack`
+depending on the types of workload being deployed in a given set of clients.
+
+When using the built-in `all` node pool, the global scheduler configuration is
+applied.
+
+Refer to the [`scheduler_config`][np_spec_scheduler_config] parameter in the
+node pool specification for more information.
+
+### Node Pool Governance
+
+Node pools and namespaces share some similarities, with both providing a way to
+group resources in isolated logical units. Jobs are grouped into namespaces and
+clients into node pools.
+
+Node Pool Governance allows assigning a default node pool to a namespace, which
+is then automatically used by every job registered in the namespace. This
+feature simplifies job management as the node pool is inferred from the
+namespace configuration instead of having to be specified in every job.
+
+This connection is done using the [`default`][ns_spec_np_default] attribute in
+the namespace `node_pool_config` block.
+
+```hcl
+namespace "dev" {
+  description = "Jobs for the development environment."
+
+  node_pool_config {
+    default = "dev"
+  }
+}
+```
+
+Now any job in the `dev` namespace only places allocations on nodes in the
+`dev` node pool, and so the `node_pool` attribute may be omitted from the job
+specification.
+
+```hcl
+job "app-dev" {
+  # The "dev" node pool will be used because it is the
+  # namespace's default node pool.
+  namespace = "dev"
+  # ...
+}
+```
+
+Jobs can override the namespace default node pool by specifying a different
+`node_pool` value.
+
+The namespace can control whether this behavior is allowed, or limit which node
+pools can and cannot be used, with the [`allowed`][ns_spec_np_allowed] and
+[`denied`][ns_spec_np_denied] parameters.
+
+```hcl
+namespace "dev" {
+  description = "Jobs for the development environment."
+
+  node_pool_config {
+    default = "dev"
+    denied  = ["prod", "qa"]
+  }
+}
+```
+
+```hcl
+job "app-dev" {
+  namespace = "dev"
+
+  # Jobs in the "dev" namespace are not allowed to use the
+  # "prod" node pool and so this job will fail to register.
+  node_pool = "prod"
+  # ...
+}
+```
+
+### Multi-region Jobs
+
+Multi-region jobs can specify different node pools to be used in each region by
+overriding the top-level `node_pool` job value, or the namespace `default` node
+pool, in each `region` block.
+
+```hcl
+job "multiregion" {
+  node_pool = "dev"
+
+  multiregion {
+    # This region will use the top-level "dev" node pool.
+    region "north" {}
+
+    # While the regions below use their own specific node pools.
+    region "east" {
+      node_pool = "dev-east"
+    }
+
+    region "west" {
+      node_pool = "dev-west"
+    }
+  }
+  # ...
+}
+```
+
+## Node Pool Patterns
+
+The sections below describe some node pool patterns that can be used to achieve
+specific goals.
+
+### Infrastructure and System Jobs
+
+This pattern illustrates an example where node pools are used to reserve nodes
+for a specific set of jobs while also allowing system jobs to cross node pool
+boundaries.
+
+It is common for Nomad clusters to have certain jobs that are focused on
+providing the underlying infrastructure for more business-focused applications.
+Some examples include reverse proxies for traffic ingress, CSI plugins, and
+periodic maintenance jobs.
+
+These jobs can be isolated in their own namespace, but they may have different
+scheduling requirements.
+
+Reverse proxies, and only reverse proxies, may need to run on clients that are
+exposed to public traffic, and CSI controller plugins may require clients to
+have high-privileged access to cloud resources and APIs.
+
+Other jobs, like CSI node plugins and periodic maintenance jobs, may need to
+run as `system` jobs on all clients of the cluster.
+
+Node pools can be used to achieve the isolation required by the first set of
+jobs, and the built-in `all` node pool can be used for the jobs that must run
+on every client. To keep them organized, all jobs are registered in the same
+`infra` namespace.
+
+```hcl
+job "ingress-proxy" {
+  namespace = "infra"
+  node_pool = "ingress"
+  # ...
+}
+```
+
+```hcl
+job "csi-controller" {
+  namespace = "infra"
+  node_pool = "csi-controllers"
+  # ...
+}
+```
+
+```hcl
+job "csi-nodes" {
+  namespace = "infra"
+  node_pool = "all"
+  # ...
+}
+```
+
+```hcl
+job "maintenance" {
+  type      = "batch"
+  namespace = "infra"
+  node_pool = "all"
+
+  periodic { /* ... */ }
+  # ...
+}
+```
+
+Use positive and negative constraints to fine-tune placements when targeting
+the built-in `all` node pool.
+
+```hcl
+job "maintenance-linux" {
+  type      = "batch"
+  namespace = "infra"
+  node_pool = "all"
+
+  constraint {
+    attribute = "${attr.kernel.name}"
+    value     = "linux"
+  }
+
+  constraint {
+    attribute = "${node.pool}"
+    operator  = "!="
+    value     = "ingress"
+  }
+
+  periodic { /* ... */ }
+  # ...
+}
+```
+
+With Nomad Enterprise and Node Pool Governance, the `infra` namespace can be
+configured to use a specific node pool by default and to allow only the
+specific node pools required.
+
+```hcl
+namespace "infra" {
+  description = "Infrastructure jobs."
+
+  node_pool_config {
+    default = "infra"
+    allowed = ["ingress", "csi-controllers", "all"]
+  }
+}
+```
+
+### Mixed Scheduling Algorithms
+
+This pattern illustrates an example where different scheduling algorithms are
+used per node pool.
+
+Each of the scheduling algorithms provided by Nomad is best suited for
+different types of environments and workloads.
+
+The `binpack` algorithm aims to maximize resource usage and pack as much
+workload as possible into the given set of clients. This makes it ideal for
+cloud environments where infrastructure is billed by the hour and can be
+quickly scaled in and out. By maximizing workload density, a cluster running on
+cloud instances can reduce the number of clients needed to run all required
+workloads.
+
+The `spread` algorithm works in the opposite direction, making use of every
+available client to reduce density, potential noisy neighbors, and resource
+contention. This makes it ideal for environments where clients are
+pre-provisioned and scale more slowly, such as on-premises deployments.
+
+Clusters in a mixed environment can use node pools to adjust the scheduler
+algorithm per node type. Cloud instances may be placed in a node pool that uses
+the `binpack` algorithm, while bare-metal nodes are placed in a node pool
+configured to use `spread`.
+
+```hcl
+node_pool "cloud" {
+  # ...
+  scheduler_config {
+    scheduler_algorithm = "binpack"
+  }
+}
+```
+
+```hcl
+node_pool "on-prem" {
+  # ...
+  scheduler_config {
+    scheduler_algorithm = "spread"
+  }
+}
+```
+
+Another scenario where mixing algorithms may be useful is separating workloads
+that are more sensitive to noisy neighbors (and thus use the `spread`
+algorithm) from those that can be packed more tightly (`binpack`).
+
+[cli_np_apply]: /nomad/docs/commands/node-pool/apply
+[cli_agent_np]: /nomad/docs/commands/agent#node-pool
+[client_np]: /nomad/docs/configuration/client#node_pool
+[job_np]: /nomad/docs/job-specification/job#node_pool
+[np_spec]: /nomad/docs/other-specifications/node-pool
+[np_spec_scheduler_config]: /nomad/docs/other-specifications/node-pool#scheduler_config-parameters
+[ns_spec_np_allowed]: /nomad/docs/other-specifications/namespace#allowed
+[ns_spec_np_default]: /nomad/docs/other-specifications/namespace#default
+[ns_spec_np_denied]: /nomad/docs/other-specifications/namespace#denied
diff --git a/website/content/docs/concepts/scheduling/index.mdx b/website/content/docs/concepts/scheduling/index.mdx
index 9c17dc0d8..32fa6d8b6 100644
--- a/website/content/docs/concepts/scheduling/index.mdx
+++ b/website/content/docs/concepts/scheduling/index.mdx
@@ -13,6 +13,7 @@ both [Omega: flexible, scalable schedulers for large compute clusters][omega] and
 for implementation details on scheduling in Nomad.

 - [Scheduling Internals](/nomad/docs/concepts/scheduling/scheduling) - An overview of how the scheduler works.
+- [Placement](/nomad/docs/concepts/scheduling/placement) - Explains how placements are computed and how they can be adjusted.
 - [Preemption](/nomad/docs/concepts/scheduling/preemption) - Details of preemption, an advanced scheduler feature introduced in Nomad 0.9.

 [omega]: https://research.google.com/pubs/pub41684.html
diff --git a/website/content/docs/concepts/scheduling/placement.mdx b/website/content/docs/concepts/scheduling/placement.mdx
new file mode 100644
index 000000000..b50949ae0
--- /dev/null
+++ b/website/content/docs/concepts/scheduling/placement.mdx
@@ -0,0 +1,122 @@
+---
+layout: docs
+page_title: Placement
+description: Learn about how placements are computed in Nomad.
+---
+
+# Placement
+
+When the Nomad scheduler receives a job registration request, it needs to
+determine which clients will run allocations for the job.
+
+This process is called allocation placement, and understanding it helps you
+achieve important goals for your applications, such as high availability and
+resilience.
+
+By default, all nodes are considered for placement, but this process can be
+adjusted via agent and job configuration.
+
+There are several options that can be used depending on the desired outcome.
+
+### Affinities and Constraints
+
+Affinities and constraints allow users to define soft or hard requirements for
+their jobs.
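+
+For example, a job might combine both kinds of rules. The following is a
+minimal sketch with hypothetical attribute and metadata values; both blocks are
+described in more detail below.
+
+```hcl
+job "example" {
+  # Hard requirement: only Linux nodes are feasible for this job.
+  constraint {
+    attribute = "${attr.kernel.name}"
+    value     = "linux"
+  }
+
+  # Soft requirement: prefer nodes in the hypothetical rack "3", but allow
+  # placement elsewhere if none are available.
+  affinity {
+    attribute = "${meta.rack}"
+    value     = "3"
+    weight    = 50
+  }
+  # ...
+}
+```
+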
+The [`affinity`][job_affinity] block specifies a soft requirement on certain
+node properties, meaning allocations for the job prefer some nodes but may be
+placed elsewhere if the rules cannot be matched. The
+[`constraint`][job_constraint] block creates hard requirements, and allocations
+can only be placed on nodes that match these rules. Job placement fails if a
+constraint cannot be satisfied.
+
+These rules can reference intrinsic node characteristics, called
+[node attributes][], which are automatically detected by Nomad; static values
+defined in the agent configuration file by cluster administrators; or dynamic
+values defined after the agent starts.
+
+One restriction of using affinities and constraints is that they only express
+relationships from jobs to nodes, so it is not possible to use them to restrict
+a node to only receive allocations from specific jobs.
+
+Use affinities and constraints when some jobs have certain node preferences or
+requirements, but it is acceptable for other jobs to share the same nodes.
+
+The sections below describe the node values that can be configured and used in
+job affinity and constraint rules.
+
+#### Node Class
+
+Node class is an arbitrary value that can be used to group nodes based on some
+characteristic, such as the instance size or the presence of fast hard drives,
+and is specified in the client configuration file using the
+[`node_class`][config_client_node_class] parameter.
+
+#### Dynamic and Static Node Metadata
+
+Node metadata are arbitrary key-value mappings specified either in the client
+configuration file using the [`meta`][config_client_meta] parameter or
+dynamically via the [`nomad node meta`][cli_node_meta] command and the
+[`/v1/client/metadata`][api_client_metadata] API endpoint.
+
+There are no preconceived use cases for metadata values, and each team may
+choose to use them in different ways. Some examples of static metadata include
+resource ownership, such as `owner = "team-qa"`, or fine-grained locality,
+`rack = "3"`. Dynamic metadata may be used to track runtime information, such
+as the jobs running on a given client.
+
+### Datacenter
+
+A datacenter represents a geographical location in a region and can be used for
+fault tolerance and infrastructure isolation.
+
+The datacenter is defined in the agent configuration file using the
+[`datacenter`][config_datacenter] parameter. Unlike affinities and constraints,
+datacenters are opt-in at the job level: a job only places allocations in the
+datacenters it specifies and, more importantly, only jobs that use a given
+datacenter are allowed to place allocations on its nodes.
+
+Given the strong connotation of a geographical location, use datacenters to
+represent where a node resides rather than its intended use. The
+[`spread`][job_spread] block can help achieve fault tolerance across
+datacenters.
+
+### Node Pool
+
+Node pools group nodes that can be targeted by jobs to achieve workload
+isolation.
+
+Similar to datacenters, node pools are configured in the agent configuration
+file using the [`node_pool`][config_client_node_pool] attribute and are opt-in
+on jobs, allowing certain nodes to be restricted to specific jobs without extra
+configuration.
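+
+For example, the following minimal sketch shows a job opting in to a specific
+datacenter and node pool; the datacenter and node pool names used here are
+hypothetical.
+
+```hcl
+job "example" {
+  # Only nodes in this datacenter and node pool are considered for placement.
+  datacenters = ["us-east-1a"]
+  node_pool   = "databases"
+  # ...
+}
+```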
+
+Unlike datacenters, however, node pools don't carry a preconceived meaning and
+can be used for several purposes, such as segmenting infrastructure by
+environment (development, staging, production), by department (engineering,
+finance, support), or by functionality (databases, ingress proxy,
+applications).
+
+Node pools are also a first-class concept and can hold additional [metadata and
+configuration][spec_node_pool].
+
+Use node pools when there is a need to restrict and reserve certain nodes for
+specific workloads, or when you need to adjust specific [scheduler
+configuration][spec_node_pool_sched_config] values.
+
+Nomad Enterprise also allows associating a node pool with a namespace to
+facilitate managing the relationships between jobs, namespaces, and node pools.
+
+Refer to the [Node Pools][concept_np] concept page for more information.
+
+
+[api_client_metadata]: /nomad/api-docs/client#update-node-metadata
+[cli_node_meta]: /nomad/docs/commands/node/meta
+[concept_np]: /nomad/docs/concepts/node-pools
+[config_client_meta]: /nomad/docs/configuration/client#meta
+[config_client_node_class]: /nomad/docs/configuration/client#node_class
+[config_client_node_pool]: /nomad/docs/configuration/client#node_pool
+[config_datacenter]: /nomad/docs/configuration#datacenter
+[job_affinity]: /nomad/docs/job-specification/affinity
+[job_constraint]: /nomad/docs/job-specification/constraint
+[job_spread]: /nomad/docs/job-specification/spread
+[node attributes]: /nomad/docs/runtime/interpolation#node-attributes
+[spec_node_pool]: /nomad/docs/other-specifications/node-pool
+[spec_node_pool_sched_config]: /nomad/docs/other-specifications/node-pool#scheduler_config-parameters
diff --git a/website/content/docs/concepts/scheduling/scheduling.mdx b/website/content/docs/concepts/scheduling/scheduling.mdx
index 9c07454ff..e6b39f249 100644
--- a/website/content/docs/concepts/scheduling/scheduling.mdx
+++ b/website/content/docs/concepts/scheduling/scheduling.mdx
@@ -52,8 +52,9 @@ and existing allocations may need to be updated, migrated, or stopped.

 Placing allocations is split into two distinct phases, feasibility checking and
 ranking. In the first phase the scheduler finds nodes that are feasible by
-filtering unhealthy nodes, those missing necessary drivers, and those failing
-the specified constraints.
+filtering nodes in datacenters and node pools not used by the job, unhealthy
+nodes, those missing necessary drivers, and those failing the specified
+constraints.

 The second phase is ranking, where the scheduler scores feasible nodes to find
 the best fit. Scoring is primarily based on bin packing, which is used to
diff --git a/website/content/docs/job-specification/multiregion.mdx b/website/content/docs/job-specification/multiregion.mdx
index 53f5ee4b4..f03b42e37 100644
--- a/website/content/docs/job-specification/multiregion.mdx
+++ b/website/content/docs/job-specification/multiregion.mdx
@@ -147,6 +147,9 @@ The name of a region must match the name of one of the [federated regions].
   datacenters in the region which are eligible for task placement. If not
   provided, the `datacenters` field of the job will be used.

+- `node_pool` `(string: )` - The node pool to be used in this region.
+  It overrides the job-level `node_pool` and the namespace default node pool.
+
 - `meta` - `Meta: nil` - The meta block allows for user-defined arbitrary
   key-value pairs. The meta specified for each region will be merged with the
   meta block at the job level.
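+
+As an illustration of the region-level `node_pool` parameter described above,
+the following minimal sketch overrides the node pool in each region; the region
+and node pool names are hypothetical.
+
+```hcl
+job "multiregion-example" {
+  multiregion {
+    # Each region places allocations in its own node pool.
+    region "east" {
+      node_pool = "dev-east"
+    }
+
+    region "west" {
+      node_pool = "dev-west"
+    }
+  }
+  # ...
+}
+```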
diff --git a/website/content/docs/other-specifications/node-pool.mdx b/website/content/docs/other-specifications/node-pool.mdx index ebc4b78c2..f3ed36219 100644 --- a/website/content/docs/other-specifications/node-pool.mdx +++ b/website/content/docs/other-specifications/node-pool.mdx @@ -56,8 +56,8 @@ Successfully applied node pool "example"! the node pool. - `meta` `(map[string]string: )` - Sets optional metadata on the node - pool, defined as key-value pairs. The scheduler does not use node pool metadat - as part of scheduling. + pool, defined as key-value pairs. The scheduler does not use node pool + metadata as part of scheduling. - `scheduler_config` ([SchedulerConfig][sched-config]: nil) - Sets scheduler configuration options specific to the node pool. If not diff --git a/website/content/docs/schedulers.mdx b/website/content/docs/schedulers.mdx index b403fcf14..7eb58f3d5 100644 --- a/website/content/docs/schedulers.mdx +++ b/website/content/docs/schedulers.mdx @@ -60,6 +60,10 @@ Systems jobs are intended to run until explicitly stopped either by an operator or [preemption]. If a system task exits it is considered a failure and handled according to the job's [restart] block; system jobs do not have rescheduling. +When used with node pools, system jobs run on all nodes of the pool used by the +job. The built-in node pool `all` allows placing allocations on all clients in +the cluster. + ## System Batch The `sysbatch` scheduler is used to register jobs that should be run to completion @@ -80,7 +84,7 @@ Sysbatch jobs are intended to run until successful completion, explicitly stoppe by an operator, or evicted through [preemption]. Sysbatch tasks that exit with an error are handled according to the job's [restart] block. - Like the `batch` scheduler, task groups in system batch jobs may have a `count` + Like the `batch` scheduler, task groups in system batch jobs may have a `count` greater than 1 to control how many instances are run. Instances that cannot be immediately placed will be scheduled when resources become available, potentially on a node that has already run another instance of the same job. diff --git a/website/data/docs-nav-data.json b/website/data/docs-nav-data.json index 84849322a..31c95a7a4 100644 --- a/website/data/docs-nav-data.json +++ b/website/data/docs-nav-data.json @@ -129,6 +129,10 @@ "title": "Concepts", "path": "concepts/scheduling/scheduling" }, + { + "title": "Placement", + "path": "concepts/scheduling/placement" + }, { "title": "Preemption", "path": "concepts/scheduling/preemption" @@ -147,6 +151,10 @@ "title": "Gossip Protocol", "path": "concepts/gossip" }, + { + "title": "Node Pools", + "path": "concepts/node-pools" + }, { "title": "Security Model", "path": "concepts/security"