25152 Commits

Author SHA1 Message Date
Kevin Wang
6dcc402188 chore(docs): update file HCL function (#18696) 2023-10-16 09:03:50 +01:00
Piotr Kazmierczak
299f3bf74b client: use WI-issued consul tokens in the template_hook (#18752)
ref https://github.com/hashicorp/team-nomad/issues/404
2023-10-16 09:39:20 +02:00
dependabot[bot]
cb2363f2fb chore(deps): bump github.com/hashicorp/go-bexpr from 0.1.12 to 0.1.13 (#18758) 2023-10-16 08:21:57 +01:00
Piotr Kazmierczak
b697de9dda client: correct consul block validation in the consul_hook (#18751) 2023-10-13 15:15:04 +02:00
Tim Gross
0931f2ba12 csi: add test that plugin allocs are filtered by namespace (#18753)
A CSI plugin can be made up of multiple jobs, which may not be in the same
namespace. When querying for a plugin and getting information about the
allocations that implement the plugin, we need to filter by the namespaces the
user has access to.

This test existed in the ENT code base and was never moved over to CE when we
made namespaces part of the CE product.
2023-10-13 09:06:36 -04:00
James Rasell
e02dd2a331 vault: use an importable const for Vault header string. (#18740) 2023-10-13 07:39:06 +01:00
Tim Gross
484f91b893 auth: remove "mixed auth" special casing for Variables endpoint (#18744)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the fourth in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves leveraging the refactored `auth` package to remove the weird
"mixed auth" helper functions that only support the Variables read/list RPC
handlers. Instead, pass the ACL object and claim together into the
`AllowVariableOperations` method in the usual `acl` package.
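
The hazard of a `nil` sentinel, and the alternative of an explicit non-nil object, can be sketched roughly as follows. This is a minimal illustration only: the `ACL` type, the `ACLsDisabled` value, and the `AllowVariableRead` method are hypothetical names for this sketch and are not Nomad's `acl` package API.

```go
package main

import "fmt"

// ACL is a hypothetical, simplified permission object. It is not Nomad's
// real acl.ACL type.
type ACL struct {
	management bool
	readVars   map[string]bool // namespace -> may read variables
}

// ACLsDisabled is an explicit, non-nil object that allows everything. It
// stands in for the old convention of "nil means ACLs are disabled".
var ACLsDisabled = &ACL{management: true}

// AllowVariableRead denies by default: a nil ACL is treated as "no
// permissions" rather than "ACLs are off".
func (a *ACL) AllowVariableRead(ns string) bool {
	if a == nil {
		return false
	}
	return a.management || a.readVars[ns]
}

func main() {
	var missing *ACL // e.g. an auth method that failed to resolve a token
	scoped := &ACL{readVars: map[string]bool{"dev": true}}

	fmt.Println(missing.AllowVariableRead("dev"))
	fmt.Println(scoped.AllowVariableRead("prod"))
	fmt.Println(ACLsDisabled.AllowVariableRead("prod"))
}
```

With this shape, a `nil` that leaks out of an auth method fails closed instead of being mistaken for "ACLs disabled".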

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730

Fixes: https://github.com/hashicorp/nomad/issues/15875
2023-10-12 16:43:11 -04:00
Piotr Kazmierczak
91753308b3 WI: set the right identity name for Consul tasks (#18742)
Consul tasks should only have 1 identity of the form consul/{consul_cluster_name}.
2023-10-12 20:34:15 +02:00
Tim Gross
3633ca0f8c auth: add client-only ACL (#18730)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.

Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
2023-10-12 12:21:48 -04:00
dependabot[bot]
cecd9b0472 chore(deps): bump golang.org/x/net from 0.14.0 to 0.17.0 (#18734) 2023-10-12 07:58:59 +01:00
Tim Gross
c7f97722ef consul hook: get WIs only for own task group (#18732)
The WID manager will only sign WI tokens for the allocation's task group. We're
accidentally looping over all the task groups, which for jobs with multiple task
groups results in a failure in the `consul_hook`.
2023-10-11 17:01:28 -04:00
Tim Gross
b39632fa6f testing: fix configuration for retry tests (#18731)
The retry tests in the `api` package set up a client but don't use `NewClient`,
so the address never gets parsed into a `url.URL` and that's causing some test
failures.
2023-10-11 14:06:31 -04:00
Charlie Voiselle
7266d267b0 Add unix domain socket support to API (#16872)
- Expose internal HTTP client's Do() via Raw
- Use URL parser to identify scheme
- Align more with curl output
- Add changelog
- Fix test failure; add tests for socket envvars
- Apply review feedback for tests
- Consolidate address parsing
- Address feedback from code reviews

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-11 11:04:12 -04:00
Tim Gross
a92461cdc9 auth: add server-only ACL (#18715)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the second in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves creating a new "virtual" ACL object for checking permissions
on server operations and a matching `AuthenticateServerOnly` method for
server-only RPCs that can produce that ACL.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
2023-10-11 10:59:31 -04:00
Tim Gross
7ca619fe97 deps: remove Vault SDK (#18725)
Nomad imports the Vault SDK to get testing helpers, but it turns out the only
thing actually in use was a single string constant for the Vault namespace
header. Remove this dependency and hardcode the constant to reduce dependency
churn.
2023-10-11 10:42:09 -04:00
Tim Gross
e22c5b82f3 WID manager: request signed identities for services (#18650)
Includes changes to WID Manager that make it request signed identities for
services, as well as a few improvements to WIHandle introduced in #18672.

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2023-10-11 12:07:16 +02:00
Juana De La Cuesta
70b020e583 server: Rename functions and use iterator function for clarity (#18716) 2023-10-11 09:47:10 +02:00
Tim Gross
635afee376 build: bump to go 1.21.3 (#18717)
Go 1.21.3 fixes an important HTTP2 CVE (see CVE-2023-39325 and
CVE-2023-44487). Nomad does not use HTTP2 and is not vulnerable. However, we
should pick up the toolchain bump, if for no other reason than to avoid having
to answer questions about it.
2023-10-10 16:37:24 -04:00
Luiz Aoqui
ef6814388c cli: remove default for ACL token type on update (#18689)
With a default value set to `client`, the `nomad acl token update`
command can silently downgrade a management token to client on update if
the command does not specify `-type=management` on every update.
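
A minimal sketch of the difference, using Go's standard `flag` package: when the flag has no default, an omitted `-type` leaves the stored value alone. The `Token` struct and `applyUpdate` helper are hypothetical and do not reflect the actual CLI implementation.

```go
package main

import (
	"flag"
	"fmt"
)

// Token is a hypothetical stand-in for an ACL token record.
type Token struct {
	Name string
	Type string
}

// applyUpdate parses update flags and applies only those that were set.
// The -type flag has no default, so omitting it leaves Type untouched.
func applyUpdate(tok Token, args []string) (Token, error) {
	fs := flag.NewFlagSet("update", flag.ContinueOnError)
	name := fs.String("name", "", "new token name")
	typ := fs.String("type", "", "new token type (client or management)")
	if err := fs.Parse(args); err != nil {
		return tok, err
	}
	if *name != "" {
		tok.Name = *name
	}
	if *typ != "" {
		tok.Type = *typ
	}
	return tok, nil
}

func main() {
	tok := Token{Name: "ops", Type: "management"}
	updated, err := applyUpdate(tok, []string{"-name=ops-renamed"})
	if err != nil {
		panic(err)
	}
	fmt.Println(updated.Name, updated.Type)
}
```

Had `-type` defaulted to `client`, the same call would have overwritten `management` without the user asking for it.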
2023-10-10 15:51:13 -04:00
Tim Gross
9c2ecbf1d3 auth: refactor Authenticate into its own package (#18703)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs.

This patchset is the first in a series to eliminate the use of `nil` ACLs as a
sentinel value for when ACLs are disabled. This one is entirely refactoring to
reduce the burden of reviewing the final patchsets that have the functional
changes:

* Move RPC auth into a new `nomad/auth` package, injecting the dependencies
  required from the server. Expose only those public methods on `nomad/auth`
  that are intended for use in the RPC handlers.
* Keep the existing large authentication test as an integration test.
* Add unit tests covering the methods of `nomad/auth` we intend on keeping. The
  assertions for many of these will change once we have no `nil` sentinels and
  can make safe assertions about permissions on the resulting `ACL` objects.
2023-10-10 11:01:24 -04:00
James Rasell
9c57ddd838 core: add preempt to desired updates stringer function return. (#18702) 2023-10-10 09:55:18 +01:00
dependabot[bot]
9a38a9c188 chore(deps): bump github.com/docker/cli (#18565) 2023-10-10 09:12:32 +01:00
dependabot[bot]
fbf792f895 chore(deps): bump github.com/docker/distribution (#18693) 2023-10-10 08:20:28 +01:00
Tim Gross
928a82a184 WID manager: save and restore signed WIs from client state DB (#18661)
When clients are restarted and the identity hook runs when we restore
allocations, the running allocations are likely to have already-signed Workload
Identities that are unexpired. Save these to the client's local state DB so that
we can avoid a thundering herd of RPCs during client restart. When we restore,
we'll check if there's at least one expired signed WI before making any initial
signing request.

Included:
* Renames `getIdentities` to `getInitialIdentities` to make the workflow clearer.
* Renames the existing `widmgr_test.go` file of integration tests, which is in its
  own package to avoid circular imports, to `widmgr_int_test.go`.
2023-10-09 09:16:23 -04:00
dependabot[bot]
5945ed5cfd chore(deps): bump google.golang.org/protobuf from 1.30.0 to 1.31.0 (#18694) 2023-10-09 11:39:51 +01:00
Luiz Aoqui
c6ce966d98 build: load time/tzdata on Windows (#18676)
Nomad uses `time.LoadLocation()` to translate a periodic job time zone
string value to a `time.Location`. From godocs:

    LoadLocation looks for the IANA Time Zone database in the following locations in order:

    * the directory or uncompressed zip file named by the ZONEINFO environment variable
    * on a Unix system, the system standard installation location
    * $GOROOT/lib/time/zoneinfo.zip
    * the time/tzdata package, if it was imported

So non-Unix systems require Go to be installed or `time/tzdata` to be
imported, otherwise running periodic jobs with a specific `time_zone`
value results in an error:

    Invalid time zone "America/Toronto": unknown time zone America/Toronto

This commit adds the `timetzdata` build tag on Windows to embed the time
zone data into the final binary. This results in a slightly bigger
binary, but from `time/tzdata` godocs:

    Importing this package will increase the size of a program by about 450 KB.
    [..]
    This package will be automatically imported if you build with -tags timetzdata.
2023-10-06 12:57:42 -04:00
Piotr Kazmierczak
597d835220 wi: introduce workload identity handler (#18672)
Any code that tracks workloads and their identities should not rely on string
comparisons, especially since we support 2 types of workload identities: those
that identify tasks and those that identify services. This means we cannot rely
on task.Name for workload-identity pairs.

The new type structs.WIHandle solves this problem by providing a uniform way of
identifying workloads and their identities.
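
The idea of a uniform handle can be sketched as a comparable struct used as a map key. The `Handle` type and its fields below are hypothetical; the fields of the real `structs.WIHandle` are not shown here and may differ.

```go
package main

import "fmt"

// WorkloadType distinguishes the two kinds of workload (hypothetical).
type WorkloadType int

const (
	TaskWorkload WorkloadType = iota
	ServiceWorkload
)

// Handle is a hypothetical comparable key for a workload and its identity.
// It is not the definition of Nomad's structs.WIHandle.
type Handle struct {
	IdentityName string
	WorkloadName string
	Type         WorkloadType
}

func main() {
	tokens := map[Handle]string{}
	// A task and a service may share a name; the handle keeps them apart.
	task := Handle{IdentityName: "consul/default", WorkloadName: "web", Type: TaskWorkload}
	svc := Handle{IdentityName: "consul/default", WorkloadName: "web", Type: ServiceWorkload}
	tokens[task] = "task-token"
	tokens[svc] = "service-token"
	fmt.Println(len(tokens), tokens[task], tokens[svc])
}
```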
2023-10-06 18:32:47 +02:00
Luiz Aoqui
0ccf942b26 scheduler: fix host volume feasibility check (#18679)
Host volumes were considered regular feasibility checks. This had two
unintended consequences.

The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.

If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.

```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
	feasible := check.Feasible(option)
	if !feasible {
		// If the job hasn't escaped, set it to be ineligible since it
		// failed a job check.
		if !jobEscaped {
			evalElig.SetJobEligibility(false, option.ComputedClass)
		}
		continue OUTER
	}
}
```

This results in all nodes with the same computed class being skipped,
even if they do have the desired host volume.

```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
	// Fast path the ineligible case
	metrics.FilterNode(option, "computed class ineligible")
	continue
```

The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.

But when the first allocation found a node match, it registered the node
class as eligible for the task group.

```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
	evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```

This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.

```go
case EvalComputedClassEligible:
	// Fast path the eligible case
	if w.available(option) {
		return option
	}
	// We match the class but are temporarily unavailable
	continue OUTER
```

These problems did not happen with CSI volumes, somewhat by accident.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).

And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).

By also placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers` we also avoid these problems on host volume feasibility.
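
Both consequences can be reproduced in a toy model of the computed-class fast path. This is a heavily simplified sketch: `Node`, `selectNodes`, and the `perNode` switch are hypothetical and only mimic the difference between a check whose verdict is cached per class (like `tgCheckers`) and one that runs for every node (like `tgAvailable`).

```go
package main

import "fmt"

// Node is a hypothetical, simplified scheduler node.
type Node struct {
	ID          string
	Class       string
	HostVolumes map[string]bool
}

// selectNodes returns the IDs of nodes considered feasible for volume.
// With perNode=false the first verdict for a class is cached and reused
// without looking at later nodes; with perNode=true every node is checked.
func selectNodes(nodes []Node, volume string, perNode bool) []string {
	classEligible := map[string]bool{}
	var out []string
	for _, n := range nodes {
		has := n.HostVolumes[volume]
		if !perNode {
			if verdict, seen := classEligible[n.Class]; seen {
				// Fast path: trust the class verdict, skip the node check.
				if verdict {
					out = append(out, n.ID)
				}
				continue
			}
			classEligible[n.Class] = has
		}
		if has {
			out = append(out, n.ID)
		}
	}
	return out
}

func main() {
	withVol := map[string]bool{"data": true}
	missFirst := []Node{{ID: "n1", Class: "c1"}, {ID: "n2", Class: "c1", HostVolumes: withVol}}
	hitFirst := []Node{{ID: "n1", Class: "c1", HostVolumes: withVol}, {ID: "n2", Class: "c1"}}

	fmt.Println(selectNodes(missFirst, "data", false)) // n2 wrongly skipped
	fmt.Println(selectNodes(hitFirst, "data", false))  // n2 wrongly accepted
	fmt.Println(selectNodes(missFirst, "data", true), selectNodes(hitFirst, "data", true))
}
```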
2023-10-06 11:00:48 -04:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replacing one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
Phil Renaud
533f293fa8 Wrap the passed path prop as a handlebars tag (#18598) 2023-10-05 12:47:18 -04:00
Luiz Aoqui
d425c90e0f client: remove null dynamic metadata keys (#18664)
Setting a node metadata key to null is expected to remove it from
subsequent reads. This is true both for static node metadata (defined in
the agent configuration file) as well as for dynamic node metadata
(defined via the Nomad API).

Null values for static metadata must be persisted to indicate that the
value has been removed, but strictly dynamic metadata null values can be
removed from state and client memory.
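
One way to picture these rules is the sketch below, where a `nil` pointer stands for a null value. `effectiveMeta` and `compactDynamic` are hypothetical helpers written for this illustration, not functions from the Nomad client.

```go
package main

import "fmt"

// effectiveMeta merges static and dynamic metadata. A nil dynamic value
// removes the key from the result.
func effectiveMeta(static map[string]string, dynamic map[string]*string) map[string]string {
	out := make(map[string]string, len(static))
	for k, v := range static {
		out[k] = v
	}
	for k, v := range dynamic {
		if v == nil {
			delete(out, k)
			continue
		}
		out[k] = *v
	}
	return out
}

// compactDynamic drops nil entries that do not shadow a static key, since
// there is nothing left for them to hide. Nulls over static keys are kept.
func compactDynamic(static map[string]string, dynamic map[string]*string) map[string]*string {
	out := make(map[string]*string, len(dynamic))
	for k, v := range dynamic {
		if _, isStatic := static[k]; v == nil && !isStatic {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	rack := "r2"
	static := map[string]string{"zone": "a"}
	dynamic := map[string]*string{"zone": nil, "rack": &rack, "tmp": nil}

	fmt.Println(effectiveMeta(static, dynamic))
	fmt.Println(len(compactDynamic(static, dynamic)))
}
```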
2023-10-05 11:41:44 -04:00
Luiz Aoqui
ed204e0fd9 client: ensure task only runs with prestart hooks (#18662)
Since the allocation in the task runner is updated in a separate
goroutine, a race condition may happen where the task is started but the
prestart hooks are skipped because the allocation became terminal.

Checking for a terminal allocation before proceeding with the task start
ensures the task only runs if the prestart hooks are also executed.

Since `shouldShutdown()` only uses terminal allocation status, it
remains `true` after the first transition, so it's safe to check it
again after the prestart hooks as it will never revert to `false`.
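
The ordering of the checks can be sketched as follows. The `taskRunner` type here is a hypothetical miniature, and the `concurrentUpdate` callback merely stands in for the goroutine that updates the allocation; none of this is the real task runner code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// taskRunner is a hypothetical miniature of a task runner.
type taskRunner struct {
	terminal atomic.Bool // set once the allocation becomes terminal
	hooksRan bool
	started  bool
}

// shouldShutdown never reverts to false once the allocation is terminal.
func (tr *taskRunner) shouldShutdown() bool { return tr.terminal.Load() }

// prestart skips the hooks when the allocation is already terminal.
func (tr *taskRunner) prestart() {
	if tr.shouldShutdown() {
		return
	}
	tr.hooksRan = true
}

// run re-checks shouldShutdown after prestart so the task can only start
// if the hooks were actually executed.
func (tr *taskRunner) run(concurrentUpdate func()) {
	if tr.shouldShutdown() {
		return
	}
	concurrentUpdate() // simulates the allocation changing mid-run
	tr.prestart()
	if tr.shouldShutdown() {
		return
	}
	tr.started = true
}

func main() {
	tr := &taskRunner{}
	tr.run(func() { tr.terminal.Store(true) })
	fmt.Println(tr.hooksRan, tr.started)
}
```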
2023-10-05 10:16:57 -04:00
Juana De La Cuesta
d701925ffa [f-gh-1106-reporting] Use full cluster metadata for reporting (#18660)
* func: add reporting config to server

* func: add reporting manager for ce

* func: change from clusterID to clusterMetadata and use it to start ent leadership

* Update leader.go

* style: typo
2023-10-05 09:32:54 +02:00
Piotr Kazmierczak
03cf9ae7ff vault: eliminate vaultclient test import cycle (#18652)
Eliminates the vaultclient test import cycle by putting the test file into the
client package and making vaultclient objects public.

Ref hashicorp/team-nomad#404
2023-10-05 09:17:16 +02:00
James Rasell
673a7713a8 scheduler: remove unused changes reconciler function. (#18656) 2023-10-05 08:10:01 +01:00
Charlie Voiselle
8a93ff3d2d [server] Directed leadership transfer CLI and API (#17383)
* Add directed leadership transfer func
* Add leadership transfer RPC endpoint
* Add ACL tests for leadership-transfer endpoint
* Add HTTP API route and implementation
* Add to Go API client
* Implement CLI command
* Add documentation
* Add changelog

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-04 12:20:27 -04:00
Piotr Kazmierczak
c885c08640 wi: service names should not be prepended with taskname if empty (#18657)
ref https://github.com/hashicorp/team-nomad/issues/404
2023-10-04 18:00:18 +02:00
Tim Gross
bf65e44a09 consul: only fetch Consul tokens for Consul-specific identities (#18649)
Only the workload identities signed specifically for Consul, named
for the task or service, should result in authenticating to Consul to get tokens.
2023-10-04 11:12:50 -04:00
Matthew Salsamendi
aa9ff3a5b3 fix: use interpolated address when performing health checks (#18584)
* Fix tests, add changelog

* Update .changelog/18584.txt

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-10-04 07:58:55 -05:00
Tim Gross
fb7582d596 services: get Consul token from hook resources (#18600)
When Workload Identity is being used with Consul, the `consul_hook` will add
Consul tokens to the alloc hook resources. Update the `group_service_hook` and
`service_hook` to use those tokens when available for registering and
deregistering Consul workloads.
2023-10-04 08:35:18 -04:00
Daniel Bennett
e7136f80c5 scaling: set Index on nil-job scale status reply (#18637)
Returning a nil error in a blockingOptions.run()
without increasing the reply Index can cause the
query to block indefinitely (until timeout).

This fixes that happening in Job.ScaleStatus
when the job is deleted: the job going away
should now return as not-found and provide a new
index for the caller to retry with if they so please.
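
Why the reply index matters can be seen in a toy blocking-query loop. Everything here is hypothetical: `Reply`, `blockingQuery`, and the bounded attempt count (standing in for the timeout) are simplifications and not Nomad's `blockingOptions` machinery.

```go
package main

import "fmt"

// Reply is a hypothetical query reply carrying a state index.
type Reply struct {
	Index uint64
	Found bool
}

// blockingQuery re-runs run until the reply index exceeds minIndex. The
// attempt limit stands in for the query timeout.
func blockingQuery(minIndex uint64, maxAttempts int, run func(*Reply)) (Reply, bool) {
	for i := 0; i < maxAttempts; i++ {
		var r Reply
		run(&r)
		if r.Index > minIndex {
			return r, true
		}
	}
	return Reply{}, false // gave up, as a timeout would
}

func main() {
	// A handler that returns for a deleted job without setting the index.
	_, ok := blockingQuery(10, 3, func(r *Reply) { r.Found = false })
	fmt.Println(ok)

	// A handler that reports not-found but still sets a newer index.
	r, ok := blockingQuery(10, 3, func(r *Reply) { r.Found = false; r.Index = 12 })
	fmt.Println(ok, r.Index, r.Found)
}
```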
2023-10-03 12:03:20 -05:00
Tim Gross
52ef476a72 sids_hook: read tokens from consul_hook when available (#18594)
The `sids_hook` runs for Connect sidecar/gateway tasks and gets Consul Service
Identity (SI) tokens for use by the Envoy bootstrap hook. When Workload Identity
is being used with Consul, the `consul_hook` will have already added these
tokens to the alloc hook resources. Update the `sids_hook` to use those tokens
instead and write them to the expected area of the taskdir.
2023-10-03 09:12:13 -04:00
James Rasell
df16c96a9f cli: use same offset when following single or multiple alloc logs. (#18604) 2023-10-03 08:43:14 +01:00
Piotr Kazmierczak
3d62438876 consul: consul taskrunner hook should only write tokens that belong to its task (#18635)
Ref hashicorp/team-nomad#404
2023-10-02 19:49:02 +02:00
Piotr Kazmierczak
62a0768775 consul: make service and task identity names unique (#18634)
Ref: hashicorp/team-nomad#404
2023-10-02 19:48:34 +02:00
Kevin Wang
e7b70adc2c cli: improve job and status text (#18628) 2023-10-02 10:31:57 -04:00
dependabot[bot]
ccafb94645 chore(deps): bump github.com/cyphar/filepath-securejoin (#18545) 2023-10-02 08:25:35 +01:00
Luiz Aoqui
7267be719f config: apply defaults to extra Consul and Vault (#18623)
* config: apply defaults to extra Consul and Vault

Apply the expected default values when loading additional Consul and
Vault cluster configuration. Without these defaults some fields would be
left empty.

* config: retain pointer of multi Consul and Vault

When calling `Copy()` the pointer reference from the `"default"` key of
the `Consuls` and `Vaults` maps to the `Consul` and `Vault` field of
`Config` was being lost.

* test: ensure TestAgent has the right reference to the default Consul config
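
The `Copy()` pointer issue described above can be illustrated with a reduced model. The `Config` and `ConsulConfig` types below are hypothetical miniatures that keep only the `Consul` field and the `Consuls` map; the real structs carry many more fields.

```go
package main

import "fmt"

// ConsulConfig and Config are hypothetical miniatures of the agent config.
type ConsulConfig struct{ Addr string }

type Config struct {
	Consul  *ConsulConfig            // expected to alias Consuls["default"]
	Consuls map[string]*ConsulConfig // one entry per cluster
}

// Copy deep-copies the map and then re-points Consul at the copied
// "default" entry, so the two references stay the same object.
func (c *Config) Copy() *Config {
	out := &Config{Consuls: make(map[string]*ConsulConfig, len(c.Consuls))}
	for name, cc := range c.Consuls {
		if cc == nil {
			continue
		}
		cp := *cc
		out.Consuls[name] = &cp
	}
	out.Consul = out.Consuls["default"]
	return out
}

func main() {
	def := &ConsulConfig{Addr: "127.0.0.1:8500"}
	cfg := &Config{Consul: def, Consuls: map[string]*ConsulConfig{"default": def}}

	cp := cfg.Copy()
	cp.Consul.Addr = "10.0.0.1:8500"
	fmt.Println(cp.Consul == cp.Consuls["default"], cp.Consuls["default"].Addr, cfg.Consul.Addr)
}
```

Copying `Consul` and the map entry independently would instead produce two separate objects, and later changes to one would not be visible through the other.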
2023-09-29 17:15:20 -03:00
Michael Schurter
3f9bd17687 client: prevent watching stale alloc state (#18612)
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.

Also check that the response is fresh before using it, as in #18269.
2023-09-29 12:46:28 -07:00
Tim Gross
aaee3076c2 consul: allow consul block in task scope (#18597)
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
2023-09-29 15:03:48 -04:00