Commit Graph

3904 Commits

Author SHA1 Message Date
Tim Gross
2d4e5b8fe9 scheduler: fix quadratic performance with spread blocks (#11712)
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout to evaluate (60s). Typical
evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
2021-12-21 10:10:01 -05:00
Andy Assareh
20bbdba041 Mesh Gateway doc enhancements (#11354)
* Mesh Gateway doc enhancements

1. I believe this line should be corrected to add mesh as one of the choices
2. I found that we are not setting this meta, and it is a required element for wan federation. I believe it would be helpful and potentially time saving to note that right here.
2021-12-20 17:10:44 -05:00
Guilherme
649f1ab6df Fix 'check calculations' link (#11420) 2021-12-20 17:09:15 -05:00
Tim Gross
3740c24d9e api: respect wildcard in evaluations list API (#11710) 2021-12-20 12:23:50 -05:00
Luiz Aoqui
55018bdfe6 docs: add v1.2.0 upgrade guide about Nomad UI ACL change for job details page (#11689) 2021-12-16 14:32:20 -05:00
Luiz Aoqui
15db86a6af docs: add more references and examples to the template block (#11691) 2021-12-16 14:14:01 -05:00
Noel Quiles
495a46ee79 website: Disable alert banner (#11688) 2021-12-16 13:43:47 -05:00
Tim Gross
03ea7d1c17 cli: unhide advanced operator raft debugging commands (#11682)
The `nomad operator raft` and `nomad operator snapshot state`
subcommands for inspecting on-disk raft state were hidden and
undocumented. Expose and document these so that advanced operators
have support for these tools.
2021-12-16 10:32:11 -05:00
Tim Gross
97621ec3c5 nomad eval list command (#11675)
Use the new filtering and pagination capabilities of the `Eval.List`
RPC to provide filtering and pagination at the command line.

Also includes note that `nomad eval status -json` is deprecated and
will be replaced with a single evaluation view in a future version of
Nomad.
2021-12-15 11:58:38 -05:00
Noel Quiles
608cdfc71d website: Copy updates (#11677) 2021-12-14 16:35:21 -05:00
Noel Quiles
ed91a53475 website: Update website Docker image (#11667) 2021-12-13 16:40:46 -05:00
Kevin Wang
be08f0ae3f feat: versioned docs (#11407) 2021-12-13 16:21:57 -05:00
Tim Gross
35c22bcb6c provide -no-shutdown-delay flag for job/alloc stop (#11596)
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.

Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.

Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
2021-12-13 14:54:53 -05:00
Tim Gross
63329878c0 update download to Nomad v1.2.3 (#11664) 2021-12-13 09:56:12 -05:00
Tim Gross
972708aaed evaluations list pagination and filtering (#11648)
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.

Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.

Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.

Add filtering by job ID or evaluation status over the results we get
out of the state store.

Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
2021-12-10 13:43:03 -05:00
Kevin Wang
ddca508b0d feat(website): extract /plugins /tools docs (#11584)
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Mike Nomitch <mnomitch@hashicorp.com>
2021-12-09 14:25:18 -05:00
Brandon Romano
1fdf1b9122 Update the banner (#11656) 2021-12-09 12:08:58 -05:00
Lukas W
bb1fc6ec5f CLI: Return non-zero exit code when deployment fails in nomad run (#11550)
* Exit non-zero from run command if deployment fails
* Fix typo in deployment monitor introduced in 0edda11
2021-12-09 09:09:28 -05:00
Tim Gross
95fa1b30f4 docs: improve docs for troubleshooting and monitoring scheduler (#11623)
This changeset adds more specific recommendations as to what metrics
to monitor, and what resources should be examined during incident
response.

It also renames the "Telemetry" section to "Monitoring Nomad" to
surface the material better and distinguish it from the "Metric
Reference".

Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2021-12-07 15:52:13 -05:00
Noel Quiles
7d48372c1d website: Upgrade <HashiStackMenu /> to latest (#11615)
* Update @hashicorp/react-hashi-stack-menu

* Upgrade to latest

* One last upgrade
2021-12-07 15:25:28 -05:00
James Rasell
d2132b96b4 docs: add license expiry metric to metrics website doc. 2021-12-07 10:31:51 +00:00
Shantanu Gadgil
d174a38206 mention sysbatch in addition to batch (#11587) 2021-12-06 19:12:03 -05:00
Tim Gross
2c3db7ee1d scheduler: config option to reject job registration (#11610)
During incident response, operators may find that automated processes
elsewhere in the organization can be generating new workloads on Nomad
clusters that are unable to handle the workload. This changeset adds a
field to the `SchedulerConfiguration` API that causes all job
registration calls to be rejected unless the request has a management
ACL token.
2021-12-06 15:20:34 -05:00
Zachary Shilton
aa7aa6b12a website: bump deps to fix print styles (#11365)
* website: bump deps to fix print styles

* website: fix up print styles

* fix: hashi-stack-menu print selector
2021-12-03 10:14:21 -05:00
Tim Gross
216d4f8644 ui: change Consul/Vault base URL field name (#11589)
Give ourselves some room for extension in the UI configuration block
by naming the field `ui_url`, which will let us have an `api_url`.
Fix the template path to ensure we're getting the right value from the
API.
2021-11-30 13:20:29 -05:00
James Rasell
5a2e329408 Merge pull request #11577 from hashicorp/b-gh-11576
docs: add deprecation note to old style network task env vars.
2021-11-30 12:15:31 +01:00
Brandon Romano
9895a198c0 Updates use cases 2021-11-29 09:16:17 -08:00
Tim Gross
851ed6322f docs: mount_flags takes a slice of strings (#11583)
The `mount_flags` option takes a slice of strings, not a
comma-separated string like the flags passed to `mount(8)`.
2021-11-29 10:07:34 -05:00
James Rasell
481f5333c6 docs: add deprecation note to old style network task env vars. 2021-11-25 12:58:32 +01:00
Luiz Aoqui
6ad6ad67fe docs: document new Prometheus configuration for the Autoscaler APM plugin (#11562) 2021-11-24 17:37:35 -05:00
Luiz Aoqui
f9871fbae1 docs: add CLI and config docs for the Autoscaler policy source config (#11559) 2021-11-24 16:17:37 -05:00
Luiz Aoqui
53ab5ea169 update download to Nomad v1.2.2 (#11569) 2021-11-24 14:30:09 -05:00
Luiz Aoqui
9a0fee0a17 docs: add upgrade guide notes for Nomad 1.2.2 (#11567) 2021-11-24 14:24:20 -05:00
Tim Gross
1160817731 config: UI configuration block with Vault/Consul links (#11555)
Add `ui` block to agent configuration to enable/disable the web UI and
provide the web UI with links to Vault/Consul.
2021-11-24 11:20:02 -05:00
James Rasell
416b14ecef Merge pull request #11535 from hashicorp/docs-vault-token
docs: clarify vault.token only required on servers
2021-11-23 09:26:06 +01:00
James Rasell
80dcae7216 core: allow setting and propagation of eval priority on job de/registration (#11532)
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and override the use of the job priority to set
the eval priority.

In order to ensure all evaluations as a result of the request use
the same eval priority, the priority is shared to the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.

The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.

Any user supplied value is validated within the agent HTTP handler
removing the need to pass invalid requests to the server.

The register and deregister opts functions now all for setting
the eval priority on requests.

The change includes a small change to the DeregisterOpts function
which handles nil opts. This brings the function inline with the
RegisterOpts.
2021-11-23 09:23:31 +01:00
Luiz Aoqui
9dd93990c5 Merge tag 'v1.2.1' into merge-release-1.2.1-branch
Version 1.2.1
2021-11-22 10:47:04 -05:00
Luiz Aoqui
08af85cdf6 update download to Nomad v1.2.1 (#11553) 2021-11-22 10:24:39 -05:00
Tim Gross
40de248b94 qemu: add args_allowlist to sandbox VM command line inputs
The QEMU driver allows arbitrary command line options, but many of
these options give access to host resources that operators may not
want to expose such as devices. Add an optional allowlist to the
plugin configuration so that operators can limit the resources for
QEMU.
2021-11-19 11:11:52 -05:00
James Rasell
663698db95 docs: add global query param to API job deregister endpoint. 2021-11-19 13:45:24 +01:00
Michael Schurter
5204aa7e46 docs: clarify vault.token only required on servers
While it *is* clarified toward the bottom of this page, I've seen people
go to great lengths to configure tokens for clients anyway, so I think
it's worth noting on the parameter's docs as well.
2021-11-18 16:34:59 -08:00
Luiz Aoqui
fa1586cc47 update website banner (#11512) 2021-11-16 14:15:58 -05:00
Luiz Aoqui
cce9e9b046 update download to Nomad v1.2.0 (#11511) 2021-11-16 12:13:35 -05:00
Luiz Aoqui
18ce6caac7 docs: add note about the Nomad APM autoscaling plugin and scaling cluster to zero (#11494) 2021-11-16 11:58:26 -05:00
Luiz Aoqui
7cbdcd11cc docs: remove mutual-exclusion between node class and datacenter in scaling policies (#11499) 2021-11-16 11:58:14 -05:00
Luiz Aoqui
6f40905500 add Nomad v1.2.0-rc1 download box (#11485) 2021-11-09 16:37:09 -05:00
kfenech1
6bbcb180f2 docs: nomad.client.unallocated.memory is in Megabytes not bytes (#11468) 2021-11-08 11:05:11 -05:00
Alessandro De Blasis
9a5248b932 cli: show host_network in nomad status (#11432)
Enhance the CLI in order to return the host network in two flavors 
(default, verbose) of the `node status` command.

Fixes: #11223.
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
2021-11-05 09:02:46 -04:00
James Rasell
5ad8bb08d9 Merge pull request #11444 from hashicorp/b-update-apidocs-alloclist-sample-resp
docs: update API alloc list sample response to be current.
2021-11-05 08:09:23 +01:00
Florian Apolloner
b52f42db9a Added a -hcl2-strict flag to allow for lenient hcl variable parsing. (#11284)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2021-11-04 16:33:09 +01:00