Commit Graph

25212 Commits

Author SHA1 Message Date
Seth Hoenig
b5469dd0eb Post 1.6.3 release (#18918)
* Generate files for 1.6.3 release

* Prepare for next release

* Merge release 1.6.3 files

---------

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2023-10-30 12:38:16 -05:00
Tim Gross
f0330d6df1 identity_hook: implement PreKill hook, not TaskStop hook (#18913)
The allocrunner's `identity_hook` implements the interface for TaskStop, but
this interface is only ever called for task-level hooks. This results in a
leaked goroutine that tries to periodically renew WIs until the client shuts
down gracefully.

Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so
that whenever an allocation is stopped or garbage collected we stop renewing its
Workload Identities. This also requires making the `Shutdown` method of `WIDMgr`
safe to call multiple times.
2023-10-30 10:54:22 -04:00
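A `Shutdown` method that is safe to call multiple times, as f0330d6df1 requires of `WIDMgr`, is commonly written in Go with `sync.Once`. The sketch below illustrates that general pattern only; it is not the actual Nomad change, and the `renewer` type, its fields, and `newRenewer` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// renewer is a hypothetical stand-in for a manager that owns a
// background renewal goroutine. It is not Nomad's WIDMgr.
type renewer struct {
	stopCh   chan struct{}
	stopOnce sync.Once
	done     chan struct{}
}

func newRenewer() *renewer {
	r := &renewer{stopCh: make(chan struct{}), done: make(chan struct{})}
	go r.run()
	return r
}

// run blocks until Shutdown is called, then signals that it has exited.
func (r *renewer) run() {
	defer close(r.done)
	<-r.stopCh
}

// Shutdown may be called any number of times. sync.Once guarantees the
// stop channel is closed exactly once, so a repeated call cannot panic
// with "close of closed channel".
func (r *renewer) Shutdown() {
	r.stopOnce.Do(func() { close(r.stopCh) })
	<-r.done // returns immediately once the goroutine has exited
}

func main() {
	r := newRenewer()
	r.Shutdown() // for example, from a PreKill hook
	r.Shutdown() // for example, again from a Destroy hook
	fmt.Println("shutdown is idempotent")
}
```

With this shape, both hooks can call `Shutdown` without coordinating over which one runs first.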
Dave May
1f4965e877 docs: Add code fence to Improvements example (#18902) 2023-10-30 14:13:19 +00:00
Tim Gross
9463d7f88a docs: add note about consul.service_identity ignoring fields (#18900)
The WI we get for Consul services is saved to the client state DB like all other
WIs, but the resulting JWT is never exposed to the task secrets directory
because (a) it's only intended for use with Consul service configuration,
and (b) for group services it could be ambiguous which task to expose it to.

Add a note to the `consul.service_identity` docs that these fields are ignored.
2023-10-30 09:19:15 -04:00
Luiz Aoqui
347389f9f9 vault: derive token using create_from_role (#18880)
Fall back to the ACL role defined in the client's `create_from_role`
configuration when using the JWT flow and the task does not specify a
role to use.
2023-10-27 13:03:44 -04:00
Luiz Aoqui
71a471b90a cli: deprecate -vault-token flag (#18881)
Apply the same deprecation notice from #18863 to the `nomad job plan`
command.
2023-10-27 12:48:11 -04:00
James Rasell
2daf49df9a server: use same receiver name for all server funcs. (#18896) 2023-10-27 16:36:10 +01:00
Tim Gross
694a5ec19d docs: remove stale note about generate_lease from template docs (#18895)
Prior to `consul-template` v0.22.0, automatic PKI renewal wouldn't work properly
based on the expiration of the cert. More recent versions of `consul-template`
can use the expiry to refresh the cert, so it's no longer necessary (and in fact
generates extra load on Vault) to set `generate_lease`. Remove this
recommendation from the docs.

Fixes: #18893
2023-10-27 11:09:09 -04:00
Justin Yang
b76e0429c4 client: add support for NetBSD clients (#18562)
Bumps `shirou/gopsutil` to v3.23.9
2023-10-27 10:33:00 -04:00
Tim Gross
139a96ad12 e2e: fix bind name to allow Connect reachability (#18878)
The `BindName` for JWT authentication should always bind to the `nomad_service` field in the JWT and not include the namespace, as the `nomad_service` is what's actually registered in Consul. 

* Fix the binding rule for the `consulcompat` test 
* Add a reachability assertion so that we don't miss regressions.
* Ensure we have a clean shutdown so that we don't leak state (containers and iptables) between tests.
2023-10-27 10:15:17 -04:00
James Rasell
3c8eb54dfc scheduler: ensure dup alloc names are fixed before plan submit. (#18873)
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placement calculation's assumption that the names of allocations
being replaced can be blindly copied to their replacements. This
is not correct in all cases, particularly when dealing with canaries.

The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicates are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
2023-10-27 14:16:41 +01:00
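The duplicate-name handling described in 3c8eb54dfc can be pictured with a small sketch. This is an illustration of the general idea (detect a repeated index and pick an unused one before submitting), not the scheduler's actual name index tracker; `dedupeIndexes` and the `example.web[N]` name format used in `main` are assumptions made for the example.

```go
package main

import "fmt"

// dedupeIndexes returns a copy of indexes in which every repeated
// index is replaced by the lowest index not already in use. The first
// occurrence of each index keeps its value.
func dedupeIndexes(indexes []int) []int {
	// Mark every index that appears at least once, so a replacement
	// never collides with an index that is legitimately in use.
	taken := make(map[int]bool, len(indexes))
	for _, idx := range indexes {
		taken[idx] = true
	}

	kept := make(map[int]bool, len(indexes))
	out := make([]int, len(indexes))
	next := 0
	for i, idx := range indexes {
		if !kept[idx] {
			kept[idx] = true
			out[i] = idx
			continue
		}
		// Duplicate found: advance to the lowest free index.
		for taken[next] {
			next++
		}
		taken[next] = true
		out[i] = next
	}
	return out
}

func main() {
	// Two placements were both given index 0.
	for _, idx := range dedupeIndexes([]int{0, 0, 1}) {
		fmt.Printf("example.web[%d]\n", idx)
	}
}
```

Here the second placement that claimed index 0 is moved to index 2, so no two names in the result collide.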
Juana De La Cuesta
e8efe2d251 fix: handling non reschedule disconnecting and reconnecting allocs (#18701)
This PR fixes a long-lived bug where disconnecting allocations were never rescheduled by their policy, but only because the group count was short. The default reschedule time for services and batches is 30 and 5 seconds respectively, so in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
2023-10-27 13:14:39 +02:00
Robert Sturla
23665a5685 docs: update link to tc-redirect-tap (#18879) 2023-10-26 14:21:10 -04:00
Seth Hoenig
fdde8a56ae docs: add job-specification docs for numa (#18864)
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-26 11:39:08 -05:00
Luiz Aoqui
61d4ee7e60 vault: validate tasks using non-default clusters (#18810)
Since Nomad servers only start a Vault client for the default cluster,
tasks using non-default clusters must provide an identity to be used for
token derivation, either in the task itself or in the agent
configuration.
2023-10-26 11:50:42 -04:00
Tim Gross
8f8265fa6d add deprecation warning for Vault/Consul token usage (#18863)
Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and
intended for removal in Nomad 1.9. Add a deprecation warning to the CLI when the
user passes in the appropriate flag or environment variable.

Nomad agents will no longer need a Vault token when configured with workload
identity, and we'll ignore Vault tokens in the agent config after Nomad 1.9. Log
a warning at agent startup.

Ref: https://github.com/hashicorp/nomad/issues/15617
Ref: https://github.com/hashicorp/nomad/issues/15618
2023-10-26 10:46:02 -04:00
Seth Hoenig
8ed82416e3 client: fix detection of cpuset.mems on cgroups v1 systems (#18868) 2023-10-26 09:42:10 -05:00
Tim Gross
47f2118f40 docs: Vault Workload Identity integration (#18704)
Documentation updates to support the new Vault integration with Nomad Workload
Identity. Included:

* Added a large section to the Vault integration docs to explain how to set up
  auth methods, roles, and policies (by hand, assuming we don't ship a `nomad
  setup-vault` tool for now), and how to safely migrate from the existing workflow
  to the new one.
* Shuffled around some of the existing text so that the legacy authentication
  method text is in its own section.
* Added a compatibility matrix to the Vault integration page.
2023-10-26 10:33:52 -04:00
Seth Hoenig
afac9d10dd deps: purge and prohibit use of go-set/v1 (#18869) 2023-10-26 08:56:43 -05:00
Piotr Kazmierczak
7f62dec473 consul WI: rename default auth method for services (#18867)
It should be called nomad-services instead of nomad-workloads.
2023-10-26 09:43:33 +02:00
Seth Hoenig
de28760928 cl: add changelog for numa (#18847) 2023-10-25 10:41:17 -05:00
James Rasell
b3e41bec2d scheduler: remove unused alloc index functions. (#18846) 2023-10-25 09:09:47 +01:00
Michael Schurter
9b3c38b3ed docs: deprecate rsadecrypt (#18856)
`rsadecrypt` uses PKCS #1 v1.5 padding which has multiple known
weaknesses. While it is possible to use safely in Nomad, we should not
encourage our users to use bad cryptographic primitives.

If users want to decrypt secrets in jobspecs we should choose a
cryptographic primitive designed for that purpose. `rsadecrypt` was
inherited from Terraform which only implemented it to support decrypting
Windows passwords on AWS EC2 instances:

https://github.com/hashicorp/terraform/pull/16647

This is not something that should ever be done in a jobspec, therefore
there's no reason for Nomad to support this HCL2 function.
2023-10-24 15:48:15 -07:00
Tim Gross
6c2d5a0fbb E2E: Consul compatibility matrix tests (#18799)
Set up a new test suite that exercises Nomad's compatibility with Consul. This
suite installs all currently supported versions of Consul, spins up a Consul
agent with appropriate configuration, and a Nomad agent running in dev
mode. Then it runs a Connect job against each pair.
2023-10-24 16:03:53 -04:00
Seth Hoenig
8de7af51cb cl: remove cgroup mountpoint (#18848)
* cl: remove cgroup mountpoint attribute

* cl: add changelog for cgroups attribute changes
2023-10-24 11:38:26 -05:00
Daniel Bennett
b46b41a2e9 scheduler: appropriately unblock evals with quotas (#18838)
When an eval is blocked due to e.g. cpu exhausted
on nodes, but there happens to also be a quota on
the job's namespace, the eval would not get auto-
unblocked when the node cpu got freed up.

This change ensures, when considering quota during
BlockedEvals.unblock(), that the block was due to
quota in the first place, so unblocking does not
get skipped due to the mere existence of a quota
on the namespace.
2023-10-24 11:22:24 -05:00
Seth Hoenig
5cf4c6cc06 cl: note breaking change of numcores attribute on apple systems (#18850)
I goofed the name the first time around, "power" should have been
"performance" which is consistent with both Apple and Intel branding.
2023-10-24 10:54:26 -05:00
Seth Hoenig
9ae4b10dc6 cl: minor features are listed as improvements (#18845)
The Features header is reserved for "tent-pole" features of a Nomad version.
2023-10-24 10:53:40 -05:00
James Rasell
f64ade2304 cli: ensure HCL env vars are added to the job submission object. (#18832) 2023-10-24 16:48:13 +01:00
Kerim Satirli
5e1bbf90fc docs: update all URLs to developer.hashicorp.com (#16247) 2023-10-24 11:00:11 -04:00
Seth Hoenig
951cde4e3b numa: fix cpu topology conversion for non linux systems (#18843) 2023-10-24 09:12:34 -05:00
Tim Gross
cb3fde3c96 metrics: prevent negative counter from iowait decrease (#18835)
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter increment. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
2023-10-24 09:58:25 -04:00
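The zero floor described in cb3fde3c96 amounts to clamping the computed increment. A minimal sketch, assuming the busy times are plain `float64` samples; `busyDelta` is a hypothetical name, not the function in Nomad's host CPU stats calculator.

```go
package main

import "fmt"

// busyDelta returns the increase in busy CPU time between two samples.
// Because iowait can decrease, cur may be smaller than prev; the result
// is floored at zero so callers never see a negative increment.
func busyDelta(prev, cur float64) float64 {
	delta := cur - prev
	if delta < 0 {
		return 0
	}
	return delta
}

func main() {
	fmt.Println(busyDelta(100, 130)) // normal case: busy time increased
	fmt.Println(busyDelta(130, 120)) // busy time went backwards: clamped
}
```

Clamping loses the (meaningless) negative sample rather than propagating it into counters that must only increase.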
Seth Hoenig
043b1a95a7 deps: bump go-set/v2 to alpha.3 (#18844)
fixes a rather critical bug in .Equals implementation
2023-10-24 08:23:25 -05:00
James Rasell
b55dcb3967 test: use must lib for bitmap tests. (#18834) 2023-10-24 07:40:02 +01:00
Luiz Aoqui
70b1862026 test: add E2E vaultcompat test for JWT auth flow (#18822)
Test the JWT auth flow using real Nomad and Vault agents.
2023-10-23 20:00:55 -04:00
Tim Gross
1b3920f96b cli: add prefix ID and wildcard namespace support for service info (#18836)
The `nomad service info` command doesn't support using a wildcard namespace with
a prefix match, the way that we do for many other commands. Update the command
to do a prefix match list query for the services before making the get query.

Fixes: #18831
2023-10-23 13:17:51 -04:00
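The "prefix match list query before the get query" flow in 1b3920f96b can be sketched as a resolution step over the listed names. This is an assumption-laden illustration, not the CLI's code: `resolvePrefix` and both error values are hypothetical, and the real command works against the Nomad API rather than an in-memory slice.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var (
	errNotFound  = errors.New("no service matched prefix")
	errAmbiguous = errors.New("prefix matched multiple services")
)

// resolvePrefix picks the single service name that prefix refers to.
// An exact match always wins; otherwise exactly one name must start
// with the prefix.
func resolvePrefix(names []string, prefix string) (string, error) {
	var matches []string
	for _, n := range names {
		if n == prefix {
			return n, nil
		}
		if strings.HasPrefix(n, prefix) {
			matches = append(matches, n)
		}
	}
	switch len(matches) {
	case 0:
		return "", errNotFound
	case 1:
		return matches[0], nil
	default:
		return "", errAmbiguous
	}
}

func main() {
	names := []string{"web", "web-api", "cache"}
	for _, p := range []string{"ca", "web", "we", "db"} {
		name, err := resolvePrefix(names, p)
		fmt.Printf("%q -> %q, err=%v\n", p, name, err)
	}
}
```

Only after the prefix resolves to one name would the follow-up get query be issued for that name.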
Tim Gross
8a311255a2 docs: Consul Workload Identity integration (#18685)
Documentation updates to support the new Consul integration with Nomad Workload
Identity. Included:

* Added a large section to the Consul integration docs to explain how to set up
  auth methods and binding rules (by hand, assuming we don't ship a `nomad
  setup-consul` tool for now), and how to safely migrate from the existing
  workflow to the new one.
* Move `consul` block out of `group` and onto its own page now that we have it
  available at the `task` scope, and expanded examples of its use.
* Added the `service_identity` and `task_identity` blocks to the Nomad agent
  configuration, and provided a recommended default.
* Added the `identity` block to the `service` block page.
* Added a rough compatibility matrix to the Consul integration page.
2023-10-23 09:17:22 -04:00
Tim Gross
4d9cc73ed2 sids_hook: fix check for Consul token derived from WI (#18821)
The `sids_hook` serves the legacy Connect workflow, and we want to bypass it
when using workload identities. So the hook checks that there's not already a
Consul token in the alloc hook resources derived from the Workload
Identity. This check was looking for the wrong key. This would cause the hook to
ignore the Consul token we already have and then fail to derive a SI token
unless the Nomad agent has its own token with `acl:write` permission.

Fix the lookup and add tests covering the bypass behavior.
2023-10-23 08:57:02 -04:00
Michael Schurter
a806363f6d OpenID Configuration Discovery Endpoint (#18691)
Added the [OIDC Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html) `/.well-known/openid-configuration` endpoint to Nomad, but it is only enabled if the `server.oidc_issuer` parameter is set. Documented the parameter, but without a tutorial, actually _using_ this will be very hard.

I intentionally did *not* use https://github.com/hashicorp/cap for the OIDC configuration struct because it's built to be a *compliant* OIDC provider. Nomad is *not* trying to be compliant initially because compliance to the spec does not guarantee it will actually satisfy the requirements of third parties. I want to avoid the problem where in an attempt to be standards compliant we ship configuration parameters that lock us in to a certain behavior that we end up regretting. I want to add parameters and behaviors as there's a demonstrable need.

Users always have the escape hatch of providing their own OIDC configuration endpoint. Nomad just needs to know the Issuer so that the JWTs match the OIDC configuration. There's no reason the actual OIDC configuration JSON couldn't live in S3 and get served directly from there. Unlike JWKS the OIDC configuration should be static, or at least change very rarely.

This PR is just the endpoint extracted from #18535. The `RS256` algorithm still needs to be added in hopes of supporting third parties such as [AWS IAM OIDC Provider](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-10-20 17:11:41 -07:00
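The "only enabled if the issuer is set" behavior in a806363f6d can be sketched with the standard library. This is not Nomad's handler: `discoveryDoc`, `discoveryHandler`, and `statusFor` are hypothetical, the document carries only two of the fields the OIDC Discovery spec defines, and the `/.well-known/jwks.json` path is an assumption made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// discoveryDoc is a deliberately small subset of an OIDC discovery
// document, kept minimal for illustration.
type discoveryDoc struct {
	Issuer  string `json:"issuer"`
	JWKSURI string `json:"jwks_uri"`
}

// discoveryHandler serves the discovery document only when an issuer
// is configured; with an empty issuer the endpoint responds 404.
func discoveryHandler(issuer string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if issuer == "" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(discoveryDoc{
			Issuer:  issuer,
			JWKSURI: issuer + "/.well-known/jwks.json", // assumed path
		})
	}
}

// statusFor exercises the handler in-process and returns the HTTP
// status code it produced for the given issuer setting.
func statusFor(issuer string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/.well-known/openid-configuration", nil)
	discoveryHandler(issuer)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""))                          // issuer unset
	fmt.Println(statusFor("https://nomad.example.com")) // issuer set
}
```

Because the document is static for a given issuer, the same JSON could equally be hosted elsewhere, which is the escape hatch the commit message describes.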
Seth Hoenig
0020139440 core: port common code changes from ENT for numa scheduling (#18818)
Some additional changes were made in the ENT PR to the common code in
support of numa scheduling; this PR copies those changes back to CE.
2023-10-20 13:19:02 -05:00
Luiz Aoqui
6d4b62200b log: add Consul and Vault cluster name to output (#18817)
Ensure Consul and Vault loggers have the cluster name as an attribute to
help differentiate log source.
2023-10-20 14:03:56 -04:00
Phil Renaud
8902afe651 Nomad Actions (#18794)
* Scaffolding actions (#18639)

* Task-level actions for job submissions and retrieval

* FIXME: Temporary workaround to get ember dev server to pass exec through to 4646

* Update api/tasks.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update command/agent/job_endpoint.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Diff and copy implementations

* Action structs get their own file, diff updates to behave like our other diffs

* Test to observe actions changes in a version update

* Tests migrated into structs/diff_test and modified with PR comments in mind

* APIActionToSTructsAction now returns a new value

* de-comment some plain parts, remove unused action lookup

* unused param in action converter

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* New endpoint: job/:id/actions (#18690)

* unused param in action converter

* backing out of parse_job level and moved toward new endpoint level

* Adds taskName and taskGroupName to actions at job level

* Unmodified job mock actions tests

* actionless job test

* actionless job test

* Multi group multi task actions test

* HTTP method check for GET, cleaner errors in job_endpoint_test

* decomment

* Actions aggregated at job model level (#18733)

* Removal of temporary fix to proxy to 4646

* Run Action websocket endpoint (#18760)

* Working demo for review purposes

* removal of cors passthru for websockets

* Remove job_endpoint-specific ws handlers and aimed at existing alloc exec handlers instead

* PR comments addressed, no need for taskGroup pass, better group and task lookups from alloc

* early return in action validate and removed jobid from req args per PR comments

* todo removal, we're checking later in the rpc

* boolean style change on tty

* Action CLI command (#18778)

* Action command init and stuck-notes

* Conditional reqpath to aim at Job action endpoint

* De-logged

* General CLI command cleanup, observe namespace, pass action as string, get random alloc w group adherence

* tab and varname cleanup

* Remove action param from Allocations().Exec calls

* changelog

* dont nil-check acl

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-20 13:05:55 -04:00
Seth Hoenig
3e8ebf85f5 lang: add a helper for iterating a map in order (#18809)
In some cases it is helpful to iterate a map in the sorted order of
the map's keyset - particularly in implementations of some function for
which the tests cannot be deterministic without order.
2023-10-20 08:11:35 -05:00
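Iterating a map in key order, the technique 3e8ebf85f5 adds a helper for, usually means collecting the keys, sorting them, and walking the sorted slice. The sketch below shows that approach for string keys; `walkSorted` is a hypothetical name and signature, not the helper added to Nomad's `lang` package.

```go
package main

import (
	"fmt"
	"sort"
)

// walkSorted calls f once per entry of m, visiting keys in ascending
// order. Go's built-in map iteration order is unspecified, so sorting
// the keys first is what makes the traversal deterministic.
func walkSorted[V any](m map[string]V, f func(key string, value V)) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		f(k, m[k])
	}
}

func main() {
	cores := map[string]int{"node-b": 8, "node-a": 4, "node-c": 16}
	walkSorted(cores, func(name string, n int) {
		fmt.Println(name, n)
	})
}
```

A test that asserts on the order of side effects can then rely on the same sequence on every run.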
James Rasell
1a0d1efb0d cli: use single dep func for opening URLs. (#18808) 2023-10-20 08:24:11 +01:00
James Rasell
ca9e08e6b5 monitor: add log include location option on monitor CLI and API (#18795) 2023-10-20 07:55:22 +01:00
Tim Gross
f5c5035fde testutil: add ACL bootstrapping to test server configuration (#18811)
Some of our `api` package tests have ACLs enabled, but none of those tests also
run clients and the "wait for the clients to be live" code reads from the Node
API. The caller can't bootstrap ACLs until `NewTestServer` returns, and this
makes for a circular dependency.

Allow developers to provide a bootstrap token to the test server config, and
if it's available, have the server bootstrap the ACL system with it before
checking for live clients.
2023-10-19 16:50:38 -04:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm restore

Handle the case where an upgraded server dequeues an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.

* lint: learn how to spell pseudo
2023-10-19 15:09:30 -05:00
Piotr Kazmierczak
0410b8acea client: remove unnecessary debugging from consul client mock (#18807) 2023-10-19 16:23:42 +02:00
Luiz Aoqui
8b9a5fde4e vault: add multi-cluster support on templates (#18790)
In Nomad Enterprise, a task may connect to a non-default Vault cluster,
requiring `consul-template` to be configured with a specific client
`vault` block.
2023-10-18 20:45:01 -04:00
Piotr Kazmierczak
16d71582f6 client: consul_hook tests (#18780)
ref https://github.com/hashicorp/team-nomad/issues/404
2023-10-18 20:02:35 +02:00