Commit Graph

869 Commits

Author SHA1 Message Date
Luiz Aoqui
e1e80f383e vault: add new nomad setup vault -check commmand (#19720)
The new `nomad setup vault -check` commmand can be used to retrieve
information about the changes required before a cluster is migrated from
the deprecated legacy authentication flow with Vault to use only
workload identities.
2024-01-12 15:48:30 -05:00
Luiz Aoqui
b2aa6ffd05 docs: fix Consul ACL requirements (#19721)
Even with the new workload identitiy based flow the Nomad servers still
need the `acl = "write"` permission in order to revoke service identity
tokens.
2024-01-11 15:52:23 -05:00
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
CJ
c9cd8480fa docs: considerations for Stateful Workloads (#19077)
Co-authored-by: Adrian Todorov <adrian.todorov@hashicorp.com>
2024-01-10 16:06:45 -05:00
Tim Gross
0935f443dc vault: support allowing tokens to expire without refresh (#19691)
Some users with batch workloads or short-lived prestart tasks want to derive a
Vaul token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.

When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.

Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.

Fixes: https://github.com/hashicorp/nomad/issues/8690
2024-01-10 14:49:02 -05:00
Luiz Aoqui
5267eec3ad vault: fix token revocation during workflow migration (#19689)
When transitioning from the legacy token-based workflow to the new JWT
workflow for Vault the previous code would instantiate a no-op Vault if
the server configuration had a `default_identity` block.

This no-op client returned an error for some of its operations were
called, such as `LookupToken` and `RevokeTokens`. The original intention
was that, in the new JWT workflow, none of these methods should be
called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even
in the JWT flow. When a leadership transition happens, the new server
looks for unused Vault accessors from state and tries to revoke them.
Similarly, the `RevokeTokens` method is called every time the
`Node.UpdataStatus` and `Node.UpdateAlloc` RPCs are made by clients, as
the Nomad server tries to find unused Vault tokens for the node/alloc.

Since the new JWT flow does not require Nomad servers to contact Vault,
calling `RevokeTokens` and `MarkForRevocation` is not able to complete
without a Vault token, so this commit changes the logic to use the no-op
Vault client when no token is configured. It also updates the client
itself to not error if these methods are called, but to rather just log
so operators can be made aware that there are Vault tokens created by
Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based
flow, Nomad operators must first upgrade the Nomad version without
removing any of the existing Vault configuration. Doing so can prevent
Nomad servers from managing and cleaning-up existing Vault tokens during
a leadership transition and node or alloc updates.

Operators must also resubmit all jobs with a `vault` block so they are
updated with an `identity` for Vault. Skipping this step may cause
allocations to fail if their Vault token expires (if, for example, the
Nomad client stops running for TTL/2) or if they are rescheduled, since
the new client will try to follow the legacy flow which will fail if the
Nomad server configuration for Vault has already been updated to remove
the Vault address and token.
2024-01-10 13:28:46 -05:00
Tim Gross
d3e5cae1eb consul: support admin partitions (#19665)
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: https://github.com/hashicorp/nomad/issues/13139
2024-01-10 10:41:29 -05:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Tim Gross
c875f3e49a docs: expand docs on implicit ACL capabilities grants (#19681)
An audit of Nomad's ACLs resulted in some confusion around whether the
`NamespaceValidator` method is conjunctive ("add", as implied by the docs) or
disjunctive ("or", as it is by design). Clarify the ACL documentation as
follows:

* Call out where fine-grained capabilities imply grants to other
  capabilities (for example, that `csi-read-volume` grants `csi-list-volume`).
* Fix an incorrectly documented ACL requirement for the CSI List External
  Volumes API.
* Clarify how ACLs are expected to work for the two search API endpoints, such
  that you need list/read access to the objects in the search context.
2024-01-09 13:25:05 -05:00
Tim Gross
a399f16a31 docs: describe cgroup controller requirements (#19493)
Nomad can only use cgroups to control resource requirements if all the cgroups
controllers are actually enabled. Add this to our requirements documentation as
well as the impacted `exec` and `java` task drivers.
2024-01-08 10:01:14 -05:00
am-ak
7dc82f233f [DOCS] Update docker.mdx (#19657)
Removed info regarding development of Nomad
2024-01-08 14:32:57 +00:00
Shantanu Gadgil
6bbd3b0cec reschedule is at group level (#19653)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2024-01-08 10:54:52 +00:00
Seth Hoenig
4b3ee77d6b docs: update raw_exec driver docs and 1.7 upgrade notes (#19598) 2024-01-04 08:26:46 -06:00
Seth Hoenig
ccfb13a72d e2e: add test for raw_exec memory_max configuration (#19596)
* e2e: add test for raw_exec memory_max configuration

* docs: note raw_exec supports memory_max in resources documentation
2024-01-04 08:25:56 -06:00
Tim Gross
e7ca2b51ad vault: ignore allow_unauthenticated config if identity is set (#19585)
When the server's `vault` block has a default identity, we don't check the
user's Vault token (and in fact, we warn them on job submit if they've provided
one). But the validation hook still checks for a token if
`allow_unauthenticated` is set to true. This is a misconfiguration but there's
no reason for Nomad not to do the expected thing here.

Fixes: https://github.com/hashicorp/nomad/issues/19565
2024-01-02 16:46:34 -05:00
Mike Nomitch
dd15bdff9c Adds vault role to JWT claims if specified in jobspec (#19535) 2023-12-20 15:51:34 -08:00
Piotr Kazmierczak
84115d732d docs: correct Nomad Autoscaler example link in HA vars documentation (#19537) 2023-12-20 16:26:35 +01:00
Etienne Bruines
f18d5c7c32 docs: fix migration to workload identity links (#19508)
Fixes #19507
2023-12-18 21:27:38 -05:00
Tim Gross
14200a800f docs: note replacement of - characters in meta env vars (#19501)
The keys of `meta` fields have all characters outside of `[A-Za-z0-9_.]`
replaced by underscores when we create `NOMAD_META` environment variables. Make
sure this replacement is documented.

Fixes: https://github.com/hashicorp/nomad/issues/15359
2023-12-15 15:48:23 -05:00
Luiz Aoqui
a8d1447550 docs: update Consul and Vault integration (#19424) 2023-12-14 15:14:55 -05:00
Mike Nomitch
31f4296826 Adds support for failures before warning to Consul service checks (#19336)
Adds support for failures before warning and failures before critical
to the automatically created Nomad client and server services in Consul
2023-12-14 11:33:31 -08:00
Tim Gross
0e42569ffb docs: note that 1.7.2 yanks 1.7.0-1.7.1 due to CPU fingeprint bug (#19474) 2023-12-14 11:32:13 -05:00
James Rasell
94b8b7769a docs: add reporting config block documentation. (#19470) 2023-12-14 15:11:29 +00:00
Grant Griffiths
9b2e8ae20f CSI: prevent stage_publish_base_dir from being subdir of mount_dir (#19441) 2023-12-13 14:31:40 -05:00
Charlie Voiselle
d2fc7cc0c4 [docs] Note reboot to update bridge_network_hairpin_mode (#19304) 2023-12-12 19:49:15 -05:00
Luiz Aoqui
99d72b7154 docs: fix placement of Consul auth method configs (#19404)
The auth method names are used by Nomad clients, not servers.
2023-12-11 09:16:57 -05:00
Tim Gross
e551814df5 docs: add warnings about backing up keyring to snapshot commands (#19400)
The `operator snapshot` commands and agent don't back up Nomad's key
material. Add some warnings about this to places where users might be looking
for information on cluster recovery.

Fixes: https://github.com/hashicorp/nomad/issues/19389
2023-12-08 16:05:05 -05:00
Tim Gross
ad9520c240 docs: add warning not to use 1.7.0 (#19399)
Nomad 1.7.0 should be considered "yanked". Add a note about this to the upgrade
guide.
2023-12-08 15:19:27 -05:00
Adrian Todorov
1eb1dbfa36 docs: update PKI example in template block with the new pkiCert function (#19394) 2023-12-08 14:23:12 -05:00
Seth Hoenig
39eb17f3ec docs: describe the need for dmidecode in docs (#19348) 2023-12-08 10:45:37 -06:00
Tim Gross
fb58dd835d docs: expand on Sentinel policy reference (#19335) 2023-12-07 14:04:43 -05:00
Luiz Aoqui
27d2ad1baf cli: add -dev-consul and -dev-vault agent mode (#19327)
The `-dev-consul` and `-dev-vault` flags add default identities and
configuration to the Nomad agent to connect and use the workload
identity integration with Consul and Vault.
2023-12-07 11:51:20 -05:00
Juana De La Cuesta
cf539c405e Add a new parameter to avoid starting a replacement for lost allocs (#19101)
This commit introduces the parameter preventRescheduleOnLost which indicates that the task group can't afford to have multiple instances running at the same time. In the case of a node going down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run.

In case of max_client_disconnect also being enabled, if there is a reschedule policy, an error will be returned.
Implements issue #10366

Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-12-06 12:28:42 +01:00
Tim Gross
1e51379e56 docs: clarify behavior and recommendations for mTLS vs TLS for HTTP (#19282)
Some of our documentation on `tls` configuration could be more clear as to
whether we're referring to mTLS or TLS. Also, when ACLs are enabled it's fine to
have `verify_https_client=false` (the default). Make it clear that this is an
acceptably secure configuration and that it's in fact recommended in order to
avoid pain of distributing client certs to user browsers.
2023-12-04 15:03:43 -05:00
Tim Gross
37df614da6 docs: fix recommended binding rules for Consul integration (#19299)
Fixes some errors in the documentation for the Consul integration, based on
tests locally without using the `nomad setup consul` command and updating the
docs to match.

* Consul CE doesn't support the `-namespace-rule-bind-namespace` option.
* The binding rule for services should not including the Nomad namespace in the
  `bind-name` parameter (the service is registered in the appropriate Consul
  namespace).
* The role for tasks should include the suffix "-tasks" in the name to match the
  binding rule we create.
* Fix the Consul bound audiences to be a list of strings
* Fix some quoting issues in the commands.
2023-12-04 11:56:03 -05:00
Piotr Kazmierczak
0a783d0046 wi: change setup cmds -cleanup flag to -destroy (#19295) 2023-12-04 15:28:17 +01:00
Piotr Kazmierczak
0ff190fa38 docs: setup helpers documentation (#19267) 2023-12-04 09:59:07 +01:00
James Rasell
d041ddc4ee docs: fix up HCL formatting on agent config examples. (#19254) 2023-12-04 08:44:00 +00:00
Luiz Aoqui
125dd4af38 docs: small updates to agent consul (#19285) 2023-12-01 16:40:06 -05:00
Seth Hoenig
b83c1e14c1 docs: fix documentation of client.reserved.cores (#19266) 2023-12-01 13:06:55 -06:00
Tim Gross
2ba459c73a docs: split consul config params into client vs server sections (#19258)
Some sections of the `consul` configuration are relevant only for clients or
servers. We updated our Vault docs to split these parameters out into their own
sections for clarity. Match that for the Consul docs.
2023-12-01 13:37:39 -05:00
Adrian Todorov
af71f4a55a Clarify docs around CSI volume context updates (#19216)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-12-01 15:19:04 +00:00
Phil Renaud
d104432cd3 Actions: API, command, and jobspec docs (#19166)
* API command and jobspec docs

* PR comments addressed

* API docs for job/jobid/action socket

* Removing a perhaps incorrect origin of job_id across the jobs api doc

* PR comments addressed
2023-11-30 14:13:37 -05:00
Piotr Kazmierczak
e57dcdf106 docs: adjust claim mappings for Consul auth method (#19244) 2023-11-30 20:01:18 +01:00
Seth Hoenig
5f3aae7340 website: fix spellcheck path and cleanup some misspellings (#19238) 2023-11-30 09:38:19 -06:00
Piotr Kazmierczak
d699b82df6 docs: update consul-integration to include ns changes (#19239)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-11-30 16:37:48 +01:00
Luiz Aoqui
d29ac461a7 cli: non-service jobs on job restart -reschedule (#19147)
The `-reschedule` flag stops allocations and assumes the Nomad scheduler
will create new allocations to replace them. But this is only true for
service and batch jobs.

Restarting non-service jobs with the `-reschedule` flag causes the
command to loop forever waiting for the allocations to be replaced,
which never happens.

Allocations for system jobs may be replaced by triggering an evaluation
after each stop to cause the reconciler to run again.

Sysbatch jobs should not be allowed to be rescheduled as they are never
replaced by the scheduler.
2023-11-29 13:01:19 -05:00
Piotr Kazmierczak
26b778bb0c docs: correction to Consul integration TLS note (#19207) 2023-11-28 19:22:02 +01:00
Tim Gross
8ab7ab0db4 docs: fix typos and markdown issues on CPU concepts page (#19205) 2023-11-28 11:27:27 -05:00
James Rasell
e2487698e6 docs: add alloc metrics note about possible cgroup variations. (#19195) 2023-11-28 14:32:08 +00:00