Commit Graph

4851 Commits

Author SHA1 Message Date
Tim Gross
f0330d6df1 identity_hook: implement PreKill hook, not TaskStop hook (#18913)
The allocrunner's `identity_hook` implements the interface for TaskStop, but
this interface is only ever called for task-level hooks. This results in a
leaked goroutine that tries to periodically renew WIs until the client shuts
down gracefully.

Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so
that whenever an allocation is stopped or garbage collected we stop renewing its
Workload Identities. This also requires making the `Shutdown` method of `WIDMgr`
safe to call multiple times.
2023-10-30 10:54:22 -04:00
Luiz Aoqui
347389f9f9 vault: derive token using create_from_role (#18880)
Fallback to the ACL role defined in the client's `create_from_role`
configuration when using the JWT flow and the task does not specify a
role to use.
2023-10-27 13:03:44 -04:00
Justin Yang
b76e0429c4 client: add support for NetBSD clients (#18562)
Bumps `shirou/gopsutil` to v3.23.9
2023-10-27 10:33:00 -04:00
Seth Hoenig
8ed82416e3 client: fix detection of cpuset.mems on cgroups v1 systems (#18868) 2023-10-26 09:42:10 -05:00
Kerim Satirli
5e1bbf90fc docs: update all URLs to developer.hashicorp.com (#16247) 2023-10-24 11:00:11 -04:00
Seth Hoenig
951cde4e3b numa: fix cpu topology conversion for non linux systems (#18843) 2023-10-24 09:12:34 -05:00
Tim Gross
cb3fde3c96 metrics: prevent negative counter from iowait decrease (#18835)
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
2023-10-24 09:58:25 -04:00
Tim Gross
4d9cc73ed2 sids_hook: fix check for Consul token derived from WI (#18821)
The `sids_hook` serves the legacy Connect workflow, and we want to bypass it
when using workload identities. So the hook checks that there's not already a
Consul token in the alloc hook resources derived from the Workload
Identity. This check was looking for the wrong key. This would cause the hook to
ignore the Consul token we already have and then fail to derive a SI token
unless the Nomad agent has its own token with `acl:write` permission.

Fix the lookup and add tests covering the bypass behavior.
2023-10-23 08:57:02 -04:00
Seth Hoenig
0020139440 core: port common code changes from ENT for numa scheduling (#18818)
Some additional changes were made in the ENT PR to the common code in
support of numa scheduling; this PR copies those changes back to CE.
2023-10-20 13:19:02 -05:00
Luiz Aoqui
6d4b62200b log: add Consul and Vault cluster name to output (#18817)
Ensure Consul and Vault loggers have the cluster name as an attribute to
help differentiate log source.
2023-10-20 14:03:56 -04:00
Phil Renaud
8902afe651 Nomad Actions (#18794)
* Scaffolding actions (#18639)

* Task-level actions for job submissions and retrieval

* FIXME: Temporary workaround to get ember dev server to pass exec through to 4646

* Update api/tasks.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update command/agent/job_endpoint.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Diff and copy implementations

* Action structs get their own file, diff updates to behave like our other diffs

* Test to observe actions changes in a version update

* Tests migrated into structs/diff_test and modified with PR comments in mind

* APIActionToSTructsAction now returns a new value

* de-comment some plain parts, remove unused action lookup

* unused param in action converter

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* New endpoint: job/:id/actions (#18690)

* unused param in action converter

* backing out of parse_job level and moved toward new endpoint level

* Adds taskName and taskGroupName to actions at job level

* Unmodified job mock actions tests

* actionless job test

* actionless job test

* Multi group multi task actions test

* HTTP method check for GET, cleaner errors in job_endpoint_test

* decomment

* Actions aggregated at job model level (#18733)

* Removal of temporary fix to proxy to 4646

* Run Action websocket endpoint (#18760)

* Working demo for review purposes

* removal of cors passthru for websockets

* Remove job_endpoint-specific ws handlers and aimed at existing alloc exec handlers instead

* PR comments adressed, no need for taskGroup pass, better group and task lookups from alloc

* early return in action validate and removed jobid from req args per PR comments

* todo removal, we're checking later in the rpc

* boolean style change on tty

* Action CLI command (#18778)

* Action command init and stuck-notes

* Conditional reqpath to aim at Job action endpoint

* De-logged

* General CLI command cleanup, observe namespace, pass action as string, get random alloc w group adherence

* tab and varname cleanup

* Remove action param from Allocations().Exec calls

* changelog

* dont nil-check acl

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-20 13:05:55 -04:00
James Rasell
ca9e08e6b5 monitor: add log include location option on monitor CLI and API (#18795) 2023-10-20 07:55:22 +01:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm rstore

Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.

* lint: learn how to spell pseudo
2023-10-19 15:09:30 -05:00
Piotr Kazmierczak
0410b8acea client: remove unnecessary debugging from consul client mock (#18807) 2023-10-19 16:23:42 +02:00
Luiz Aoqui
8b9a5fde4e vault: add multi-cluster support on templates (#18790)
In Nomad Enterprise, a task may connect to a non-default Vault cluster,
requiring `consul-template` to be configured with a specific client
`vault` block.
2023-10-18 20:45:01 -04:00
Piotr Kazmierczak
16d71582f6 client: consul_hook tests (#18780)
ref https://github.com/hashicorp/team-nomad/issues/404
2023-10-18 20:02:35 +02:00
Tim Gross
d0957eb109 Consul: agent config updates for WI (#18774)
This changeset makes two changes:
* Removes the `consul.use_identity` field from the agent configuration. This behavior is properly covered by the presence of `consul.service_identity` / `consul.task_identity` blocks.
* Adds a `consul.task_auth_method` and `consul.service_auth_method` fields to the agent configuration. This allows the cluster administrator to choose specific Consul Auth Method names for their environment, with a reasonable default.
2023-10-17 14:42:14 -04:00
Tim Gross
ac56855f07 consul: add multi-cluster support to client constructors (#18624)
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.

Ref: https://github.com/hashicorp/team-nomad/issues/404
2023-10-17 13:46:49 -04:00
Luiz Aoqui
349c032369 vault: update task runner vault hook to support workload identity (#18534) 2023-10-16 19:37:57 -04:00
Tim Gross
cbd7248248 auth: use ACLsDisabledACL when ACLs are disabled (#18754)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the final patch in a series to
eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch adds a new virtual ACL policy field for when ACLs are disabled and
updates our authentication logic to use it. Included:

* Extends auth package tests to demonstrate that nil ACLs are treated as failed
  auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for
  pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the
  `nomad/acl.go` file.
* Update the semgrep rules to detect improper nil ACL checking and remove the
  old invalid ACL checks.
* Update the contributing guide for RPC authentication.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730
Ref: https://github.com/hashicorp/nomad/pull/18744
2023-10-16 09:30:24 -04:00
Piotr Kazmierczak
299f3bf74b client: use WI-issued consul tokens in the template_hook (#18752)
ref https://github.com/hashicorp/team-nomad/issues/404
2023-10-16 09:39:20 +02:00
Piotr Kazmierczak
b697de9dda client: correct consul block validation in the consul_hook (#18751) 2023-10-13 15:15:04 +02:00
James Rasell
e02dd2a331 vault: use an importable const for Vault header string. (#18740) 2023-10-13 07:39:06 +01:00
Piotr Kazmierczak
91753308b3 WI: set the right identity name for Consul tasks (#18742)
Consul tasks should only have 1 identity of the form consul/{consul_cluster_name}.
2023-10-12 20:34:15 +02:00
Tim Gross
3633ca0f8c auth: add client-only ACL (#18730)
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.

Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.

Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
2023-10-12 12:21:48 -04:00
Tim Gross
c7f97722ef consul hook: get WIs only for own task group (#18732)
The WID manager will only sign WI tokens for the allocation's task group. We're
accidentally looping over all the task groups, which for jobs with multiple task
groups results in a failure in the `consul_hook`.
2023-10-11 17:01:28 -04:00
Tim Gross
7ca619fe97 deps: remove Vault SDK (#18725)
Nomad imports the Vault SDK to get testing helpers, but it turns out the only
thing actually in use was a single string constant for the Vault namespace
header. Remove this dependency and hardcode the constant to reduce dependency
churn.
2023-10-11 10:42:09 -04:00
Tim Gross
e22c5b82f3 WID manager: request signed identities for services (#18650)
Includes changes to WID Manager that make it request signed identities for
services, as well as a few improvements to WIHandle introduced in #18672.

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2023-10-11 12:07:16 +02:00
Tim Gross
928a82a184 WID manager: save and restore signed WIs from client state DB (#18661)
When clients are restarted and the identity hook runs when we restore
allocations, the running allocations are likely to have already-signed Workload
Identities that are unexpired. Save these to the client's local state DB so that
we can avoid a thundering herd of RPCs during client restart. When we restore,
we'll check if there's at least one expired signed WI before making any initial
signing request.

Included:
* Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear.
* Renames the existing `widmgr_test.go` file of integration tests, which is in its
  own package to avoid circular imports to `widmgr_int_test.go`
2023-10-09 09:16:23 -04:00
Piotr Kazmierczak
597d835220 wi: introduce workload identity handler (#18672)
Any code that tracks workloads and their identities should not rely on string
comparisons, especially since we support 2 types of workload identities: those
that identify tasks and those that identify services. This means we cannot rely
on task.Name for workload-identity pairs.

The new type structs.WIHandle solves this problem by providing a uniform way of
identifying workloads and their identities.
2023-10-06 18:32:47 +02:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
Luiz Aoqui
d425c90e0f client: remove null dynamic metadata keys (#18664)
Setting a null value to a node metadata is expected to remove it from
subsequent reads. This is true both for static node metadata (defined in
the agent configuration file) as well as for dynamic node metadata
(defined via the Nomad API).

Null values for static metadata must be persisted to indicate that the
value has been removed, but strictly dynamic metadata null values can be
removed from state and client memory.
2023-10-05 11:41:44 -04:00
Luiz Aoqui
ed204e0fd9 client: ensure task only runs with prestart hooks (#18662)
Since the allocation in the task runner is updated in a separate
goroutine, a race condition may happen where the task is started but the
prestart hooks are skipped because the allocation became terminal.

Checking for a terminal allocation before proceeding with the task start
ensures the task only runs if the prestart hooks are also executed.

Since `shouldShutdown()` only uses terminal allocation status, it
remains `true` after the first transition, so it's safe to check it
again after the prestart hooks as it will never revert to `false`.
2023-10-05 10:16:57 -04:00
Piotr Kazmierczak
03cf9ae7ff vault: eliminate vaultclient test import cycle (#18652)
Eliminates the vaultclient test import cycle by putting the test file into the
client package and making vaultclient objects public.

Ref hashicorp/team-nomad#404
2023-10-05 09:17:16 +02:00
Tim Gross
bf65e44a09 consul: only fetch Consul tokens for Consul-specific identities (#18649)
Only the workload identities signed specifically for Consul, named
for the task or service, should result in authenticating to Consul to get tokens.
2023-10-04 11:12:50 -04:00
Matthew Salsamendi
aa9ff3a5b3 fix: use interpolated address when performing health checks (#18584)
* fix: use interpolated address when performing health checks

* Fix tests, add changelog

* Update .changelog/18584.txt

Co-authored-by: Seth Hoenig <shoenig@duck.com>

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-10-04 07:58:55 -05:00
Tim Gross
fb7582d596 services: get Consul token from hook resources (#18600)
When Workload Identity is being used with Consul, the `consul_hook` will add
Consul tokens to the alloc hook resources. Update the `group_service_hook` and
`service_hook` to use those tokens when available for registering and
deregistering Consul workloads.
2023-10-04 08:35:18 -04:00
Tim Gross
52ef476a72 sids_hook: read tokens from consul_hook when available (#18594)
The `sids_hook` runs for Connect sidecar/gateway tasks and gets Consul Service
Identity (SI) tokens for use by the Envoy bootstrap hook. When Workload Identity
is being used with Consul, the `consul_hook` will have already added these
tokens to the alloc hook resources. Update the `sids_hook` to use those tokens
instead and write them to the expected area of the taskdir.
2023-10-03 09:12:13 -04:00
Piotr Kazmierczak
3d62438876 consul: consul taskrunner hook should only write tokens that belong to its task (#18635)
Ref hashicorp/team-nomad#404
2023-10-02 19:49:02 +02:00
Michael Schurter
3f9bd17687 client: prevent watching stale alloc state (#18612)
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.

Also check to ensure the response is fresh before using it like #18269
2023-09-29 12:46:28 -07:00
Tim Gross
aaee3076c2 consul: allow consul block in task scope (#18597)
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
2023-09-29 15:03:48 -04:00
Piotr Kazmierczak
5dab41881b client: new consul_hook (#18557)
This PR introduces a new allocrunner-level consul_hook which iterates over
services and tasks, if their provider is consul, fetches consul tokens for all of
them, stores them in AllocHookResources and in task secret dirs.

Ref: hashicorp/team-nomad#404

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-09-29 17:41:48 +02:00
Piotr Kazmierczak
0a75a42d94 WI: WIDMgr should expose default identity signatures (#18610)
Since the identity_hook is meant to be the central place that makes signed
identities available to other hooks, it should also expose the default identity
that is signed by the plan applier.

Ref: hashicorp/team-nomad#404
2023-09-29 15:17:59 +02:00
Tim Gross
5001bf4547 consul: use constant instead of "default" literal (#18611)
Use the constant `structs.ConsulDefaultCluster` instead of the string literal
"default", as we've done for Vault.
2023-09-28 16:50:21 -04:00
Michael Schurter
e73026dd4c client: prevent using stale allocs (#18601)
Similar to #18269, it is possible that even if Node.GetClientAllocs
retrieves fresh allocs that the subsequent Alloc.GetAllocs call
retrieves stale allocs. While `diffAlloc(existing, updated)` properly
ignores stale alloc *updates*, alloc deletions have no such check.

So if a client retrieves an alloc created at index 123, and then a
subsequent Alloc.GetAllocs call hits a new server which returns results
at index 100, the client will stop the alloc created at 123 because it
will be missing from the stale response.

This change applies the same logic as #18269 and ensures only fresh
responses are used.

Glossary:
* fresh - modified at an index > the query index
* stale - modified at an index <= the query index
2023-09-28 11:42:57 -07:00
Luiz Aoqui
868aba57bb vault: update identity name to start with vault_ (#18591)
* vault: update identity name to start with `vault_`

In the original proposal, workload identities used to derive Vault
tokens were expected to be called just `vault`. But in order to support
multiple Vault clusters it is necessary to associate identities with
specific Vault cluster configuration.

This commit implements a new proposal to have Vault identities named as
`vault_<cluster>`.
2023-09-27 15:53:28 -03:00
Luiz Aoqui
19241964a4 config: fix some issues with workload identity and multi Consul and Vault (#18590)
* config: fix multi consul and vault config parse

Capture the loop variable when parsing multiple Consul and Vault
configuration blocks so the duration parse function uses the correct
field when it's called later on.

* client: build Vault client with right config

When setting up the multiple Vault clients, the code was always loading
the default configuration, resulting in all clients to be configured the
same way.

* config: fix WorkloadIdentityConfig.Copy() method

Ensure `WorkloadIdentityConfig.Copy()` does not return the original
pointer for the `TTL` field.
2023-09-27 14:41:11 -03:00
Tim Gross
02a5aab359 consul: provide workload's Consul token to service client (#18559)
This is a work-in-progress changeset to provide workload-specific Consul tokens
that are created by the `consul_hook` and attached to workload registration
requests by the `group_service_hook` and `service_hook`.

This requires unreleased updates to Consul's `api` package, so this changeset
includes a temporary `replace` directive in the go.mod file.
2023-09-26 14:13:29 -04:00
Tim Gross
20eadc7b29 config: move Consul getter out of fingerprinter (#18556) 2023-09-22 10:58:39 -04:00
Daniel Bennett
7bd5c6e84e test: Refactor mock CSI manager (#18554)
and MockCSIManager to support the call counting
that csi_hook_test expects

instead of implementing csimanager
interfaces in two separate places:
* client/allocrunner/csi_hook_test
* client/csi_endpoint_test

they can both use the same mocks defined in
client/pluginmanager/csimanager/
alongside the actual implementations of them.

also refactor TestCSINode_DetachVolume
to use use it like Node_ExpandVolume
so we can also test the happy path there
2023-09-21 16:03:53 -05:00