Commit Graph

162 Commits

Author SHA1 Message Date
tehut
55523ecf8e Add NodeMaxAllocations to client configuration (#25785)
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up cli config parsing
Integrate maxAllocs into AllocatedResources view
Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-22 12:49:27 -07:00
Daniel Bennett
15c01e5a49 ipv6: normalize addrs per RFC-5942 §4 (#25921)
https://datatracker.ietf.org/doc/html/rfc5952#section-4

* copy NormalizeAddr func from vault
  * PRs hashicorp/vault#29228 & hashicorp/vault#29517
* normalize bind/advertise addrs
* normalize consul/vault addrs
2025-05-22 14:21:30 -04:00
James Rasell
0b265d2417 encrypter: Track initial tasks for is ready calculation. (#25803)
The server startup could "hang" to the view of an operator if it
had a key that could not be decrypted or replicated loaded from
the FSM at startup.

In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. This bubbling of error message will also
flush to logs which will provide addition operator feedback.

The server only cares about keys loaded from the FSM snapshot and
trailing logs before the encrypter should be classed as ready. So
that the encrypter ready function does not get blocked by keys
added outside of the initial Raft load, we take a snapshot of the
decryption tasks as we enter the blocking call, and class these as
our barrier.
2025-05-07 15:38:16 +01:00
Michael Smithhisler
5c4d0e923d consul: Remove legacy token based authentication workflow (#25217) 2025-03-05 15:38:11 -05:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Tim Gross
7b89c0ee28 template: fix client's default retry configuration (#25113)
In #20165 we fixed a bug where a partially configured `client.template` retry
block would set any unset fields to nil instead of their default values. But
this patch introduced a regression in the default values, so we were now
defaulting to unlimited retries if the retry block was unset. Restore the
correct behavior and add better test coverage at both the config parsing and
template configuration code.

Ref: https://github.com/hashicorp/nomad/pull/20165
Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565
2025-02-14 09:25:41 -05:00
James Rasell
7d48aa2667 client: emit optional telemetry from prerun and prestart hooks. (#24556)
The Nomad client can now optionally emit telemetry data from the
prerun and prestart hooks. This allows operators to monitor and
alert on failures and time taken to complete.

The new datapoints are:
  - nomad.client.alloc_hook.prerun.success (counter)
  - nomad.client.alloc_hook.prerun.failed (counter)
  - nomad.client.alloc_hook.prerun.elapsed (sample)

  - nomad.client.task_hook.prestart.success (counter)
  - nomad.client.task_hook.prestart.failed (counter)
  - nomad.client.task_hook.prestart.elapsed (sample)

The hook execution time is useful to Nomad engineering and will
help optimize code where possible and understand job specification
impacts on hook performance.

Currently only the PreRun and PreStart hooks have telemetry
enabled, so we limit the number of new metrics being produced.
2024-12-12 14:43:14 +00:00
Piotr Kazmierczak
f7a4ded2c0 security: add CT executeTemplate to default function_denylist (#24541)
This PR adds Consul Template's executeTemplate function to the denylist by
default, in order to prevent accidental or malicious infinitely recursive
execution.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-11-22 19:33:56 +01:00
Tim Gross
c970d22164 keyring: support external KMS for key encryption key (KEK) (#23580)
In Nomad 1.4.0, we shipped support for encrypted Variables and signed Workload
Identities, but the key material is protected only by a AEAD encrypting the
KEK. Add support for Vault transit encryption and external KMS from major cloud
providers. The servers call out to the external service to decrypt each key in
the on-disk keystore.

Ref: https://hashicorp.atlassian.net/browse/NET-10334
Fixes: https://github.com/hashicorp/nomad/issues/14852
2024-07-18 09:42:28 -04:00
Piotr Kazmierczak
88e8973004 consul: additional unit test for consul config merging (#23495) 2024-07-03 16:09:16 +02:00
Piotr Kazmierczak
2a09abc477 metrics: quota utilization configuration and documentation (#22912)
Introduces support for (optional) quota utilization metrics

CE part of the hashicorp/nomad-enterprise#1488 change
2024-06-03 21:06:19 +02:00
James Rasell
facc3e8013 agent: allow configuration of in-memory telemetry sink. (#20166)
This change adds configuration options for setting the in-memory
telemetry sink collection and retention durations. This sink backs
the metrics JSON API and previously had hard-coded default values.

The new options are particularly useful when running development or
debug environments, where metrics collection is desired at a fast
and granular rate.
2024-03-25 15:00:18 +00:00
Tim Gross
7b9bce2d08 config: fix client.template config merging with defaults (#20165)
When loading the client configuration, the user-specified `client.template`
block was not properly merged with the default values. As a result, if the user
set any `client.template` field, all the other field defaulted to their zero
values instead of the documented defaults.

This changeset:
* Adds the missing `Merge` method for the client template config and ensures
  it's called.
* Makes a single source of truth for the default template configuration,
  instead of two different constructors.
* Extends the tests to cover the merge of a partial block better.

Fixes: https://github.com/hashicorp/nomad/issues/20164
2024-03-20 10:18:56 -04:00
Luiz Aoqui
27d2ad1baf cli: add -dev-consul and -dev-vault agent mode (#19327)
The `-dev-consul` and `-dev-vault` flags add default identities and
configuration to the Nomad agent to connect and use the workload
identity integration with Consul and Vault.
2023-12-07 11:51:20 -05:00
Piotr Kazmierczak
248b2ba5cd WI: use single auth method for Consul by default (#19169)
This simplifies the default setup of Nomad workloads WI-based
authentication for Consul by using a single auth method with 2 binding rules.

Users can still specify separate auth methods for services and tasks.
2023-11-28 12:22:27 +01:00
Tim Gross
9d075c44b2 config: remove old Vault/Consul config blocks from parser (#18997)
Remove the now-unused original configuration blocks for Consul and Vault from
the agent configuration parsing. When the agent needs to refer to a Consul or
Vault block it will always be for a specific cluster for the task/service (or
the default cluster for the agent's own use).

This is third of three changesets for this work.

Fixes: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Ref: https://github.com/hashicorp/nomad/pull/18994
2023-11-08 09:30:08 -05:00
Michael Schurter
a806363f6d OpenID Configuration Discovery Endpoint (#18691)
Added the [OIDC Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html) `/.well-known/openid-configuration` endpoint to Nomad, but it is only enabled if the `server.oidc_issuer` parameter is set. Documented the parameter, but without a tutorial trying to actually _use_ this will be very hard.

I intentionally did *not* use https://github.com/hashicorp/cap for the OIDC configuration struct because it's built to be a *compliant* OIDC provider. Nomad is *not* trying to be compliant initially because compliance to the spec does not guarantee it will actually satisfy the requirements of third parties. I want to avoid the problem where in an attempt to be standards compliant we ship configuration parameters that lock us in to a certain behavior that we end up regretting. I want to add parameters and behaviors as there's a demonstrable need.

Users always have the escape hatch of providing their own OIDC configuration endpoint. Nomad just needs to know the Issuer so that the JWTs match the OIDC configuration. There's no reason the actual OIDC configuration JSON couldn't live in S3 and get served directly from there. Unlike JWKS the OIDC configuration should be static, or at least change very rarely.

This PR is just the endpoint extracted from #18535. The `RS256` algorithm still needs to be added in hopes of supporting third parties such as [AWS IAM OIDC Provider](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-10-20 17:11:41 -07:00
James Rasell
1ffdd576bb agent: add config option to enable file and line log detail. (#18768) 2023-10-16 15:59:16 +01:00
Luiz Aoqui
7267be719f config: apply defaults to extra Consul and Vault (#18623)
* config: apply defaults to extra Consul and Vault

Apply the expected default values when loading additional Consul and
Vault cluster configuration. Without these defaults some fields would be
left empty.

* config: retain pointer of multi Consul and Vault

When calling `Copy()` the pointer reference from the `"default"` key of
the `Consuls` and `Vaults` maps to the `Consul` and `Vault` field of
`Config` was being lost.

* test: ensure TestAgent has the right reference to the default Consul config
2023-09-29 17:15:20 -03:00
Tim Gross
5001bf4547 consul: use constant instead of "default" literal (#18611)
Use the constant `structs.ConsulDefaultCluster` instead of the string literal
"default", as we've done for Vault.
2023-09-28 16:50:21 -04:00
Luiz Aoqui
868aba57bb vault: update identity name to start with vault_ (#18591)
* vault: update identity name to start with `vault_`

In the original proposal, workload identities used to derive Vault
tokens were expected to be called just `vault`. But in order to support
multiple Vault clusters it is necessary to associate identities with
specific Vault cluster configuration.

This commit implements a new proposal to have Vault identities named as
`vault_<cluster>`.
2023-09-27 15:53:28 -03:00
Juana De La Cuesta
124272c050 server: Add reporting option to agent (#18572)
* func: add reporting option to agent

* func: add test for merge and fix comments

* Update config_ce.go

* Update config_ce.go

* Update config_ce.go

* fix: add reporting config to default configuration and update to use must over require

* Update command/agent/config_parse.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update nomad/structs/config/reporting.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update nomad/structs/config/reporting.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* style: rename license and reporting config

* fix: use default function instead of empty struct

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-09-27 00:11:32 +02:00
stswidwinski
bd519dcbf4 Fix for https://github.com/hashicorp/nomad/issues/18493 (#18494)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-09-14 13:35:15 +01:00
Tim Gross
a8bad048b6 config: parsing support for multiple Consul clusters in agent config (#18255)
Add the plumbing we need to accept multiple Consul clusters in Nomad agent
configuration, to support upcoming Nomad Enterprise features. The `consul` blocks
are differentiated by a new `name` field, and if the `name` is omitted it
becomes the "default" Consul configuration. All blocks with the same name are
merged together, as with the existing behavior.

As with the `vault` block, we're still using HCL1 for parsing configuration and
the `Decode` method doesn't parse multiple blocks differentiated only by a field
name without a label. So we've had to add an extra parsing pass, similar to what
we've done for HCL1 jobspecs. This also revealed a subtle bug in the `vault`
block handling of extra keys when there are multiple `vault` blocks, which I've
fixed here.

For now, all existing consumers will use the "default" Consul configuration, so
there's no user-facing behavior change in this changeset other than the contents
of the agent self API.

Ref: https://github.com/hashicorp/team-nomad/issues/404
2023-08-18 15:25:16 -04:00
Tim Gross
74b796e6d0 config: parsing support for multiple Vault clusters in agent config (#18224)
Add the plumbing we need to accept multiple Vault clusters in Nomad agent
configuration, to support upcoming Nomad Enterprise features. The `vault` blocks
are differentiated by a new `name` field, and if the `name` is omitted it
becomes the "default" Vault configuration. All blocks with the same name are
merged together, as with the existing behavior.

Unfortunately we're still using HCL1 for parsing configuration and the `Decode`
method doesn't parse multiple blocks differentiated only by a field name without
a label. So we've had to add an extra parsing pass, similar to what we've done
for HCL1 jobspecs.

For now, all existing consumers will use the "default" Vault configuration, so
there's no user-facing behavior change in this changeset other than the contents
of the agent self API.

Ref: https://github.com/hashicorp/team-nomad/issues/404
2023-08-17 14:10:32 -04:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Luiz Aoqui
81f0b359dd node pools: register a node in a node pool (#17405) 2023-06-02 17:50:50 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Lance Haig
99f43c1144 Update ioutil library references to os and io respectively for command (#16329)
No user facing changes so I assume no change log is required
2023-03-08 09:20:04 -06:00
Alessio Perugini
365ccf4377 Allow configurable range of Job priorities (#16084) 2023-02-17 09:23:13 -05:00
stswidwinski
2285432424 GC: ensure no leakage of evaluations for batch jobs. (#15097)
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.

Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
2023-01-31 13:32:14 -05:00
Tim Gross
bcd5bbdad7 add metric for count of RPC requests (#15515)
Implement a metric for RPC requests with labels on the identity, so that
administrators can monitor the source of requests within the cluster. This
changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire
up the remaining RPCs once we've threaded the new pre-forwarding authentication
through the all.

Note that metrics are measured after we forward but before we return any
authentication error. This ensures that we only emit metrics on the server that
actually serves the request. We'll perform rate limiting at the same place.

Includes telemetry configuration to omit identity labels.
2023-01-24 11:54:20 -05:00
Seth Hoenig
dab4d7ed7a ci: swap freeport for portal in packages (#15661) 2023-01-03 11:25:20 -06:00
James Rasell
eaea9164a5 acl: correctly resolve ACL roles within client cache. (#14922)
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.

Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
2022-10-20 09:37:32 +02:00
James Rasell
7b3bd1017d Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main 2022-08-25 12:14:29 +01:00
Piotr Kazmierczak
c4be2c6078 cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig
0c62f445c3 build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00
Derek Strickland
696deb9600 Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
James Rasell
892ab8a07a Merge branch 'main' into f-gh-13120-sso-umbrella 2022-08-02 08:30:03 +01:00
Luiz Aoqui
d456cc1e7f Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
James Rasell
08845cef04 server: add ACL token expiration config parameters. (#13667)
This commit adds configuration parameters to control ACL token
expirations. This includes both limits on the min and max TTL
expiration values, as well as a GC threshold for expired tokens.
2022-07-12 13:43:25 +02:00
Eng Zer Jun
fca4ee8e05 test: use T.TempDir to create temporary test directory (#12853)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix TestLogmon_Start_restart on Windows

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestConsul_Integration

t.TempDir fails to perform the cleanup properly because the folder is
still in use

testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-05-12 11:42:40 -04:00
Derek Strickland
9d7ea218bb consul-template: revert function_denylist logic (#12071)
* consul-template: replace config rather than append
Co-authored-by: Seth Hoenig <seth.a.hoenig@gmail.com>
2022-04-18 13:57:56 -04:00
Michael Schurter
f87ec7e64e template: disallow writeToFile by default
Resolves #12095 by WONTFIXing it.

This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.

This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.

This PR also includes a bug fix that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.
2022-03-28 17:05:42 -07:00
James Rasell
d49cf2388a Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
James Rasell
b78e436f10 config: add native service discovery admin boolean parameter. 2022-03-14 11:48:13 +01:00
Derek Strickland
a30c7dd56b Update IsEmpty to check for pre-1.2.4 fields (#11930) 2022-01-26 11:31:37 -05:00
Derek Strickland
43edd0e709 Expose Consul template configuration parameters (#11606)
This PR exposes the following existing`consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza.

- `wait`

It also exposes the following`consul-template` configuration to Nomad operators in the `{client.template}` stanza.

- `max_stale`
- `block_query_wait`
- `consul_retry`
- `vault_retry` 
- `wait` 

Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure.

- `wait_bounds`

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-01-10 10:19:07 -05:00
Tim Gross
fe7a43e448 docs/tests for multiple HTTP address config (#11760) 2022-01-03 10:17:13 -05:00