Commit Graph

912 Commits

Author SHA1 Message Date
Piotr Kazmierczak
0d14dd96ca eval_broker: track enqueue and dequeue times (#20329)
Adds new metrics to the eval broker that track times of evaluations enqueueing
and dequeueing.
2024-04-15 16:16:50 +02:00
Tim Gross
1739f94e84 docs: fix a broken link on the Consul index page (#20387) 2024-04-12 15:31:48 -04:00
Tim Gross
43281f6038 docs: provide guidance on using Consul DNS (#20369)
Add a standalone section to the Consul integration docs showing how to configure
both the Consul agent and the workload to take advantage of Consul DNS. Include
a reference to the new transparent proxy feature as well.

Fixes: https://github.com/hashicorp/nomad/issues/18305
2024-04-12 14:38:04 -04:00
Tim Gross
1e50090776 docs: clarify "best effort" for ephemeral disk migration (#20357)
The docs for ephemeral disk migration use the term "best effort" without
outlining the requirements or the cases under which the migration can
fail. Update the docs to make it obvious that ephemeral disk migration is
subject to data loss.

Fixes: https://github.com/hashicorp/nomad/issues/20355
2024-04-11 16:35:22 -04:00
astudentofblake
7b7ed12326 func: Allow custom paths to be added the the getter landlock (#20349)
* func: Allow custom paths to be added the the getter landlock

Fixes: 20315

* fix: slices imports
fix: more meaningful examples
fix: improve documentation
fix: quote error output
2024-04-11 15:17:33 -05:00
Tim Gross
8298d39e78 Connect transparent proxy support
Add support for Consul Connect transparent proxies

Fixes: https://github.com/hashicorp/nomad/issues/10628
2024-04-10 11:00:18 -04:00
Tim Gross
9340c77b12 docs: remove extra indents in tproxy HCL examples 2024-04-10 10:21:32 -04:00
Tim Gross
e2e561da88 tproxy: documentation improvements 2024-04-10 08:55:50 -04:00
James Rasell
a7c56a6563 docs: fix incorrect formatting within ACL policy spec. (#20339) 2024-04-09 14:46:06 +01:00
James Rasell
200b7134f0 docs: ensure namespace ACL policy capabilities are all listed. (#20306) 2024-04-09 13:57:10 +01:00
Tim Gross
8eaf176868 client: fix IPv6 parsing for client.servers block (#20324)
When the `client.servers` block is parsed, we split the port from the
address. This does not correctly handle IPv6 addresses when they are in URL
format (wrapped in brackets), which we require to disambiguate the port and
address.

Fix the parser to correctly split out the port and handle a missing port value
for IPv6. Update the documentation to make the URL format requirement clear.

Fixes: https://github.com/hashicorp/nomad/issues/20310
2024-04-08 15:06:27 -04:00
James Rasell
0cbd08ebf2 docs: add Digital Ocean Spaces artifact jobspec example. (#20304) 2024-04-08 08:15:07 +01:00
Tim Gross
d1f3a72104 tproxy: transparent_proxy reference docs (#20241)
Ref: https://github.com/hashicorp/nomad/pull/20175
2024-04-04 17:01:07 -04:00
Tim Gross
bb062deadc docs: update service mesh integration docs for transparent proxy (#20251)
Update the service mesh integration docs to explain how Consul needs to be
configured for transparent proxy. Update the walkthrough to assume that
`transparent_proxy` mode is the best approach, and move the manually-configured
`upstreams` to a separate section for users who don't want to use Consul DNS.

Ref: https://github.com/hashicorp/nomad/pull/20175
Ref: https://github.com/hashicorp/nomad/pull/20241
2024-04-04 17:01:07 -04:00
Tim Gross
a71632e3a4 docs: recommendation for maximum number of template dependencies (#20259) 2024-04-04 11:08:49 -04:00
James Rasell
fd5a42a6ca docs: clarify data dir default parameters and default creation. (#20268) 2024-04-04 09:20:47 +01:00
Seth Hoenig
6ad648bec8 networking: Inject implicit constraints on CNI plugins when using bridge mode (#15473)
This PR adds a job mutator which injects constraints on the job taskgroups
that make use of bridge networking. Creating a bridge network makes use of the
CNI plugins: bridge, firewall, host-local, loopback, and portmap. Starting
with Nomad 1.5 these plugins are fingerprinted on each node, and as such we
can ensure jobs are correctly scheduled only on nodes where they are available,
when needed.
2024-03-27 16:11:39 -04:00
Tim Gross
9c2286014f docs: update Consul compatibility matrix (#20242)
Version of Nomad and Consul that were known not to be compatible are no longer
supported in general. Update the compatibility matrix for Consul to match.
2024-03-27 16:11:14 -04:00
James Rasell
facc3e8013 agent: allow configuration of in-memory telemetry sink. (#20166)
This change adds configuration options for setting the in-memory
telemetry sink collection and retention durations. This sink backs
the metrics JSON API and previously had hard-coded default values.

The new options are particularly useful when running development or
debug environments, where metrics collection is desired at a fast
and granular rate.
2024-03-25 15:00:18 +00:00
Tim Gross
02d98b9357 operator debug: fix pprof interval handling (#20206)
The `nomad operator debug` command saves a CPU profile for each interval, and
names these files based on the interval.

The same functions takes a goroutine profile, heap profile, etc. but is missing
the logic to interpolate the file name with the interval. This results in the
operator debug command making potentially many expensive profile requests, and
then overwriting the data. Update the command to save every profile it scrapes,
and number them similarly to the existing CPU profile.

Additionally, the command flags for `-pprof-interval` and `-pprof-duration` were
validated backwards, which meant that we always coerced the `-pprof-interval` to
be the same as the `-pprof-duration`, which always resulted in a single profile
being taken at the start of the bundle. Correct the check as well as change the
defaults to be more sensible.

Fixes: https://github.com/hashicorp/nomad/issues/20151
2024-03-25 09:01:06 -04:00
Tim Gross
bdf3ff301e jobspec: add support for destination partition to upstream block (#20167)
Adds support for specifying a destination Consul admin partition in the
`upstream` block.

Fixes: https://github.com/hashicorp/nomad/issues/19785
2024-03-22 16:15:22 -04:00
Tim Gross
d3ddb0aa49 docs: make it clear that federation features require ACLs (#20196)
Our documentation has a hidden assumption that users know that federation
replication requires ACLs to be enabled and bootstrapped. Add notes at some of
the places users are likely to look for it.

A separate follow-up PR to the federation tutorial should point to the ACL
multi-region tutorial as well.

Fixes: https://github.com/hashicorp/nomad/issues/20128
2024-03-22 15:15:00 -04:00
Michael Schurter
976789b8de Small docs updates: bai rkt, cya openapi, lol ephemeral_disk "examples" (#20198)
* docs: rip openapi spec

* docs: remove useless ephemeral_disk examples
2024-03-22 11:53:25 -07:00
Tim Gross
10dd738a03 jobspec: update gateway.ingress.service Consul API fields (#20176)
Add support for further configuring `gateway.ingress.service` blocks to bring
this block up-to-date with currently available Consul API fields (except for
namespace and admin partition, which will need be handled under a different
PR). These fields are sent to Consul as part of the job endpoint submission hook
for Connect gateways.

Co-authored-by: Horacio Monsalvo <horacio.monsalvo@southworks.com>
2024-03-22 13:50:48 -04:00
Luiz Aoqui
b5573b7470 docs: fix invoke_scheduler metrics (#20172) 2024-03-21 10:57:30 -04:00
Juana De La Cuesta
56bf253474 Add docs for disconnected block (#20147)
Expand the job settings to include the disconnect block and set as deprecated the fields that will be replaced by it.
2024-03-20 10:08:16 +01:00
Tim Gross
dc39c20e66 docs: make recommendation for collection interval vs scrape interval (#20056)
Metrics tools that "pull" metrics, such as Prometheus, have a configurable
interval for how frequently they scrape metrics. This should be greater or equal
to the Nomad `telemetry.collection_interval` to avoid re-scraping metrics that
cannot have been updated in that interval.

Fixes: https://github.com/hashicorp/nomad/issues/20055
2024-03-19 08:56:29 -04:00
Tim Gross
c4253470a0 autopilot: add operator autopilot health command (#20156)
Add a command line operation that reports Enterprise autopilot data from the
`/operator/autopilot/health` API. I've pulled this feature out of
@lindleywhite's PR in the Enterprise repo.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1394

Co-authored-by: Lindley <lindley@hashicorp.com>
2024-03-18 14:46:18 -04:00
Tim Gross
695bb7ffcf docs: improve wording around autoconfiguration via Consul (#20139)
Fixes: https://github.com/hashicorp/nomad/issues/20132
2024-03-15 08:44:58 -04:00
Giovanni Avelar
26a27bb12c cli: add -json option on jobs status command (#18925) 2024-03-08 16:03:52 -05:00
Michael Schurter
3193ac204f docs: skipping a major release is fine (#20075)
Nomad has always placed an extremely high priority on backward
compatibility. We have always aimed to support N-2 major releases and
usually gone above and beyond that.

The new https://www.hashicorp.com/long-term-support policy also mentions
that N-2 is what we have always supported, so it's probably time for our
docs to reflect that reality.
2024-03-06 08:57:12 -08:00
Jeff Boruszak
57af1cdcbf docs: Consul Admin partition example (#20022) 2024-02-28 09:04:04 -06:00
Tim Gross
45b2c34532 cni: add DNS set by CNI plugins to task configuration (#20007)
CNI plugins may set DNS configuration, but this isn't threaded through to the
task configuration so that we can write it to the `/etc/resolv.conf` file as
needed. Add the `AllocNetworkStatus` to the alloc hook resources so they're
accessible from the taskrunner. Any DNS entries provided by the user will
override these values.

Fixes: https://github.com/hashicorp/nomad/issues/11102
2024-02-20 10:17:27 -05:00
Tim Gross
c1b5850473 docs: add warning not to enable Consul tls.grpc.verify_incoming (#19970)
Consul does not support incoming TLS verification of Envoy. This failure results
in hard-to-understand errors like `SSLV3_ALERT_BAD_CERTIFICATE` in the Envoy
allocation logs. Leave a warning about this to users.

Closes: https://github.com/hashicorp/nomad/issues/19772
Closes: https://github.com/hashicorp/nomad/issues/16854
Ref: https://github.com/hashicorp/consul/issues/13088
2024-02-14 08:56:35 -05:00
Seth Hoenig
37c497628c docs: describe cloud environments in fingerprint denylist (#19952)
This PR changes the example of the client config option "fingerprint.denylist"
to include all the cloud environment fingerprinters. Each one contains a
2 second HTTP timeout to a metadata endpoint that does not exist if you are not
in that particular cloud. When run in serial on startup, this results in
an 8 second wait where nothing useful is happening.

Closes #16727
2024-02-12 09:57:29 -06:00
Phil Renaud
41c783aec2 Noting action name restrictions, and correcting those of auth methods and roles (#19905) 2024-02-08 12:01:22 -05:00
Luiz Aoqui
2a348ba714 docs: expand impact of verify_https_client=false (#19916)
When Nomad is configured with `verify_https_client=false` endpoints that
do not require an ACL token can be accessed without any other type of
authentication. Expand the docs to mention this effect.
2024-02-08 10:55:40 -05:00
Luiz Aoqui
7daa854491 docs: remove duplicate entry for upstreams.config (#19877) 2024-02-06 20:44:02 -05:00
Luiz Aoqui
5825cefe51 docs: remove Docker cpuset_cpus config (#19882)
Nomad 1.7 refactored how CPU cores are assigned to tasks, making the
Docker-specific `cpuset_cpus` configuration no longer used.
2024-02-06 10:51:16 -05:00
Juana De La Cuesta
120c3ca3c9 Add granular control of SELinux labels for host mounts (#19839)
Add new configuration option on task's volume_mounts, to give a fine grained control over SELinux "z" label

* Update website/content/docs/job-specification/volume_mount.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* fix: typo

* func: make volume mount verification happen even on  mounts with no volume

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-05 10:05:33 +01:00
Adrian Todorov
044eb0e048 docs: warnings about template dependencies, HCL2 clarifications (#19779) 2024-01-19 14:07:15 -05:00
Vijesh
3b4afea974 docs: note script checks don't support some Consul options (#19770)
Script checks don't support Consul's `success_before_passing`, `failures_before_critical`, or `failures_before_warning` because they're run by Nomad and not by Consul
2024-01-18 08:38:57 -05:00
Tom Davies
5a11a28cac docs: updates link to Consul WLI migration docs (#19748) 2024-01-17 09:57:02 -05:00
Luiz Aoqui
e1e80f383e vault: add new nomad setup vault -check commmand (#19720)
The new `nomad setup vault -check` commmand can be used to retrieve
information about the changes required before a cluster is migrated from
the deprecated legacy authentication flow with Vault to use only
workload identities.
2024-01-12 15:48:30 -05:00
Luiz Aoqui
b2aa6ffd05 docs: fix Consul ACL requirements (#19721)
Even with the new workload identitiy based flow the Nomad servers still
need the `acl = "write"` permission in order to revoke service identity
tokens.
2024-01-11 15:52:23 -05:00
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
CJ
c9cd8480fa docs: considerations for Stateful Workloads (#19077)
Co-authored-by: Adrian Todorov <adrian.todorov@hashicorp.com>
2024-01-10 16:06:45 -05:00
Tim Gross
0935f443dc vault: support allowing tokens to expire without refresh (#19691)
Some users with batch workloads or short-lived prestart tasks want to derive a
Vaul token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.

When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.

Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.

Fixes: https://github.com/hashicorp/nomad/issues/8690
2024-01-10 14:49:02 -05:00
Luiz Aoqui
5267eec3ad vault: fix token revocation during workflow migration (#19689)
When transitioning from the legacy token-based workflow to the new JWT
workflow for Vault the previous code would instantiate a no-op Vault if
the server configuration had a `default_identity` block.

This no-op client returned an error for some of its operations were
called, such as `LookupToken` and `RevokeTokens`. The original intention
was that, in the new JWT workflow, none of these methods should be
called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even
in the JWT flow. When a leadership transition happens, the new server
looks for unused Vault accessors from state and tries to revoke them.
Similarly, the `RevokeTokens` method is called every time the
`Node.UpdataStatus` and `Node.UpdateAlloc` RPCs are made by clients, as
the Nomad server tries to find unused Vault tokens for the node/alloc.

Since the new JWT flow does not require Nomad servers to contact Vault,
calling `RevokeTokens` and `MarkForRevocation` is not able to complete
without a Vault token, so this commit changes the logic to use the no-op
Vault client when no token is configured. It also updates the client
itself to not error if these methods are called, but to rather just log
so operators can be made aware that there are Vault tokens created by
Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based
flow, Nomad operators must first upgrade the Nomad version without
removing any of the existing Vault configuration. Doing so can prevent
Nomad servers from managing and cleaning-up existing Vault tokens during
a leadership transition and node or alloc updates.

Operators must also resubmit all jobs with a `vault` block so they are
updated with an `identity` for Vault. Skipping this step may cause
allocations to fail if their Vault token expires (if, for example, the
Nomad client stops running for TTL/2) or if they are rescheduled, since
the new client will try to follow the legacy flow which will fail if the
Nomad server configuration for Vault has already been updated to remove
the Vault address and token.
2024-01-10 13:28:46 -05:00
Tim Gross
d3e5cae1eb consul: support admin partitions (#19665)
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: https://github.com/hashicorp/nomad/issues/13139
2024-01-10 10:41:29 -05:00