Commit Graph

4134 Commits

Author SHA1 Message Date
Luiz Aoqui
d456cc1e7f Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at once, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Michael Schurter
f998a2b77b core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

* Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
* Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Tim Gross
f295396ef8 docs: rename Internals to Concepts (#13696) 2022-07-11 16:55:33 -04:00
Tim Gross
b209fc47da docs: move operator subcommands under their own trees (#13677)
The sidebar navigation tree for the `operator` sub-sub commands is
getting cluttered and we have a new set of commands coming to support
secure variables keyring as well. Move these all under their own
subtrees.
2022-07-11 14:00:24 -04:00
Seth Hoenig
64f35f9cf3 docs: move upgrade docs for max_client_timeout
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-07 16:46:26 -05:00
Seth Hoenig
cbcceb0625 docs: upgrade guide for client max_kill_timeout 2022-07-07 15:27:40 -05:00
Luiz Aoqui
52389ff726 cli: improve output of eval commands (#13581)
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.

Include the eval namespace in all output formats and always include the
job ID in `eval status` since even `node-update` evals are related to a
job.

Add Node ID to the evals table output to help differentiate
`node-update` evals.

Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 13:13:34 -04:00
Ted Behling
295021caad driver/docker: Don't pull InfraImage if it exists (#13265)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 17:44:06 +02:00
Seth Hoenig
142918ac9f docs: fixup from cr comments 2022-07-07 08:37:10 -05:00
Seth Hoenig
39fd91fe2e docs: add docs for simple load balancing nomad services
This PR adds a section to template docs for simple load balancing with nomad services.
2022-07-06 17:34:30 -05:00
James Rasell
11cb4c6d82 core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell
24220d0a02 core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new operator scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Michelle Noorali
b9e084a4b7 doc: explain permissions for Vault sys/capabilities-self 2022-07-06 10:01:30 -04:00
Yann Coleu
154bb23d23 docs: typo on command word (#13582) 2022-07-05 16:24:25 -04:00
Steven Collins
59ba8adeee docs: Add 'serial' attribute to usb driver (#13547) 2022-07-05 16:23:04 -04:00
Seth Hoenig
1d8d1ab819 Merge pull request #12862 from hashicorp/f-choose-services
api: enable selecting subset of services using rendezvous hashing
2022-06-30 15:17:40 -05:00
Derek Strickland
bbd11fd9b5 docs: update task leader to explain shutdown sequence. (#13498)
* docs: update task leader to explain shutdown sequence.
2022-06-29 05:13:45 -04:00
James Rasell
c635ae0f89 docs: fixup HCL2 index collection function documentation. (#13511) 2022-06-28 18:27:38 +02:00
Andrew
37e5accf09 Fix typo in Docker docs (#13497) 2022-06-28 11:05:50 +02:00
Seth Hoenig
bdead31863 api: enable selecting subset of services using rendezvous hashing
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.

The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).

Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.

Example:

curl 'localhost:4646/v1/service/redis?choose=2|abc123'

Note: consul-template v0.29.1 includes the necessary nomadServices functionality.
2022-06-25 10:37:37 -05:00
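The `choose` parameter from bdead31863 relies on rendezvous (highest random weight) hashing. The Go sketch below shows the general technique under stated assumptions: `parseChoose`, `score`, and `choose` are hypothetical names, and FNV-1a is used only because it is in the standard library, not because Nomad necessarily uses it.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// parseChoose splits a value of the form "<number>|<key>".
func parseChoose(param string) (int, string, error) {
	num, key, ok := strings.Cut(param, "|")
	if !ok {
		return 0, "", fmt.Errorf("choose must be <number>|<key>, got %q", param)
	}
	n, err := strconv.Atoi(num)
	if err != nil || n < 0 {
		return 0, "", fmt.Errorf("invalid number in choose: %q", num)
	}
	return n, key, nil
}

// score hashes the requester key together with a service ID.
func score(key, id string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(id))
	return h.Sum64()
}

// choose returns the n service IDs with the highest scores for key.
// The input slice is not modified.
func choose(ids []string, n int, key string) []string {
	ranked := append([]string(nil), ids...)
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(key, ranked[i]), score(key, ranked[j])
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j] // deterministic tie-break
	})
	if n > len(ranked) {
		n = len(ranked)
	}
	if n < 0 {
		n = 0
	}
	return ranked[:n]
}
```

Because each service is scored independently, the same key keeps selecting the same services as long as they exist, and removing a service that was not selected leaves the selection unchanged. That stability is what makes the scheme useful for simple load balancing.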
Seth Hoenig
f1cafd0789 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
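The rule stated in f1cafd0789 amounts to a small validation step. A minimal Go sketch, assuming a hypothetical `effectiveRaftProtocol` helper that is not taken from the Nomad source:

```go
package main

import "fmt"

// effectiveRaftProtocol returns the Raft protocol version to use for a
// configured raft_protocol value. Unset (0) defaults to 3; anything other
// than 0 or 3 is rejected.
func effectiveRaftProtocol(configured int) (int, error) {
	switch configured {
	case 0, 3:
		return 3, nil
	default:
		return 0, fmt.Errorf("raft_protocol must be 3 or unset, got %d", configured)
	}
}
```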
Michael Schurter
c52741ae1b docs: clarify total_escaped is just an optimization (#13460) 2022-06-22 11:39:56 -07:00
Elijah Voigt
009a4d9a85 Lob.com uses Nomad too! (#13295)
Lob.com has been ramping up our use of Nomad for ~6 months.
Now that we've started blogging about it we'd love to be on the _official_ list.
2022-06-21 09:10:08 -04:00
Derek Strickland
08811312cc Improve Autoscaler overview (#13396)
Improve Autoscaler overview documentation.
2022-06-17 05:15:22 -04:00
Nick Wales
37ee50010e Merge pull request #13401 from nickwales/tls_typo
Updates TLS documentation
2022-06-16 12:34:59 -05:00
Arthur Leclerc
7518f42d1c docs: Fix typo (#13389) 2022-06-16 13:24:18 -04:00
Nick Wales
a8dca34a3a Updates TLS documentation 2022-06-16 12:15:40 -05:00
James Hu
00d004ae12 Fix spelling error (#13397) 2022-06-16 12:41:49 -04:00
Luiz Aoqui
3737fb3c7d docs: create volume spec page (#13353)
In addition to jobs, there are other objects in Nomad that have a
specific format and can be provided to commands and API endpoints.

This commit creates a new menu section to hold the specification for
volumes and updates the command pages to point to the new centralized
definition.

Redirecting the previous entries is not possible with `redirect.js`
because they are done server-side and URL fragments are not accessible
to detect a match. So we provide hidden anchors with a link to the new
page to guide users towards the new documentation.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-14 14:08:25 -04:00
Luiz Aoqui
854a2c6d92 website: fix redirects with fragments (#13354)
* website: fix redirects with fragments

Vercel redirects don't support fragments in relative destination paths,
so an absolute URL must be specified instead.

* website: fix Vercel redirect documentation link
2022-06-14 11:27:34 -04:00
Grant Griffiths
2986f1f18a CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340)
Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>
2022-06-14 10:04:16 -04:00
Michael Schurter
34959b26df docs: explain behavior of system gc command (#13342) 2022-06-13 09:54:23 +02:00
Derek Strickland
dd71afb891 template: improve default language for max_stale and wait (#13334)
* template: improve default language for max_stale and wait

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-06-10 14:34:25 -04:00
Daniel Rossbach
9bb9aab714 qemu driver: Add option to configure drive_interface (#11864) 2022-06-10 10:03:51 -04:00
Raffaele Di Fazio
0b9fc17ae4 Update supplement.mdx with the right GitHub spelling (#13326) 2022-06-10 11:46:19 +02:00
phreakocious
f8774369d2 Add guest_agent config option for QEMU driver (#12800)
Add boolean 'guest_agent' config option for QEMU driver, which will
create the socket file for the QEMU Guest Agent in the task dir when
enabled.
2022-06-09 09:21:38 -04:00
Derek Strickland
e78a5908b9 docker: update images to reference hashicorpdev Docker organization (#12903)
docker: update images to reference hashicorpdev dockerhub organization
generate job_init.bindata_assetfs.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-06-08 15:06:00 -04:00
Derek Strickland
7899fd3fac consul-template: Add fault tolerant defaults (#13041)
consul-template: Add fault tolerant defaults

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-08 14:08:25 -04:00
Shantanu Gadgil
b1a84bb77e heartbeat_grace is a server parameter (#13288)
`heartbeat_grace` is a `server` parameter, not a `client` parameter.
2022-06-08 10:49:23 -04:00
Kevin Schoonover
d725acb380 parse ACL token from authorization header (#12534) 2022-06-06 15:51:02 -04:00
Conor Evans
2a01807d20 add filebase64 function (#11791)
Signed-off-by: Conor Evans <coevans@tcd.ie>
2022-06-06 11:58:17 -04:00
dgotlieb
99b9408c91 docs: update warning for gateway listener docs for non-tcp protos 2022-06-06 10:53:01 -04:00
Radek Simko
0246944d68 docs/job-spec: Fix formatting in network page (#13228) 2022-06-06 10:14:12 -04:00
dependabot[bot]
0bc084a7fa build(deps): bump semver-regex from 3.1.3 to 3.1.4 in /website (#13225)
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-06 09:11:22 -05:00
Radek Simko
cbde2ba94b docs/docker: fix broken link to bridge mode (#13221) 2022-06-06 09:59:36 -04:00
Radek Simko
ff87354665 docs: link to client reqs section for added clarity (#13215) 2022-06-06 09:56:29 -04:00
Lance Haig
eafc93902b Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
Luiz Aoqui
b7357fd325 update README Nomad logo (#13206) 2022-06-02 19:21:26 -04:00
Huan Wang
b6e07487c2 adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
Shantanu Gadgil
f0bc4cedca fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00