The `nomad service info` command doesn't support using a wildcard namespace with
a prefix match, the way that we do for many other commands. Update the command
to do a prefix match list query for the services before making the get query.
Fixes: #18831
Added the [OIDC Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html) `/.well-known/openid-configuration` endpoint to Nomad, but it is only enabled if the `server.oidc_issuer` parameter is set. Documented the parameter, but without a tutorial trying to actually _use_ this will be very hard.
I intentionally did *not* use https://github.com/hashicorp/cap for the OIDC configuration struct because it's built to be a *compliant* OIDC provider. Nomad is *not* trying to be compliant initially because compliance to the spec does not guarantee it will actually satisfy the requirements of third parties. I want to avoid the problem where in an attempt to be standards compliant we ship configuration parameters that lock us in to a certain behavior that we end up regretting. I want to add parameters and behaviors as there's a demonstrable need.
Users always have the escape hatch of providing their own OIDC configuration endpoint. Nomad just needs to know the Issuer so that the JWTs match the OIDC configuration. There's no reason the actual OIDC configuration JSON couldn't live in S3 and get served directly from there. Unlike JWKS the OIDC configuration should be static, or at least change very rarely.
This PR is just the endpoint extracted from #18535. The `RS256` algorithm still needs to be added in hopes of supporting third parties such as [AWS IAM OIDC Provider](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
* Scaffolding actions (#18639)
* Task-level actions for job submissions and retrieval
* FIXME: Temporary workaround to get ember dev server to pass exec through to 4646
* Update api/tasks.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update command/agent/job_endpoint.go
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Diff and copy implementations
* Action structs get their own file, diff updates to behave like our other diffs
* Test to observe actions changes in a version update
* Tests migrated into structs/diff_test and modified with PR comments in mind
* APIActionToSTructsAction now returns a new value
* de-comment some plain parts, remove unused action lookup
* unused param in action converter
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* New endpoint: job/:id/actions (#18690)
* unused param in action converter
* backing out of parse_job level and moved toward new endpoint level
* Adds taskName and taskGroupName to actions at job level
* Unmodified job mock actions tests
* actionless job test
* actionless job test
* Multi group multi task actions test
* HTTP method check for GET, cleaner errors in job_endpoint_test
* decomment
* Actions aggregated at job model level (#18733)
* Removal of temporary fix to proxy to 4646
* Run Action websocket endpoint (#18760)
* Working demo for review purposes
* removal of cors passthru for websockets
* Remove job_endpoint-specific ws handlers and aimed at existing alloc exec handlers instead
* PR comments adressed, no need for taskGroup pass, better group and task lookups from alloc
* early return in action validate and removed jobid from req args per PR comments
* todo removal, we're checking later in the rpc
* boolean style change on tty
* Action CLI command (#18778)
* Action command init and stuck-notes
* Conditional reqpath to aim at Job action endpoint
* De-logged
* General CLI command cleanup, observe namespace, pass action as string, get random alloc w group adherence
* tab and varname cleanup
* Remove action param from Allocations().Exec calls
* changelog
* dont nil-check acl
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm rstore
Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
This changeset makes two changes:
* Removes the `consul.use_identity` field from the agent configuration. This behavior is properly covered by the presence of `consul.service_identity` / `consul.task_identity` blocks.
* Adds a `consul.task_auth_method` and `consul.service_auth_method` fields to the agent configuration. This allows the cluster administrator to choose specific Consul Auth Method names for their environment, with a reasonable default.
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.
Ref: https://github.com/hashicorp/team-nomad/issues/404
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the final patch in a series to
eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch adds a new virtual ACL policy field for when ACLs are disabled and
updates our authentication logic to use it. Included:
* Extends auth package tests to demonstrate that nil ACLs are treated as failed
auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for
pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the
`nomad/acl.go` file.
* Update the semgrep rules to detect improper nil ACL checking and remove the
old invalid ACL checks.
* Update the contributing guide for RPC authentication.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730
Ref: https://github.com/hashicorp/nomad/pull/18744
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.
Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
- Expose internal HTTP client's Do() via Raw
- Use URL parser to identify scheme
- Align more with curl output
- Add changelog
- Fix test failure; add tests for socket envvars
- Apply review feedback for tests
- Consolidate address parsing
- Address feedback from code reviews
Co-authored-by: Tim Gross <tgross@hashicorp.com>
With a default value set to `client`, the `nomad acl token update`
command can silently downgrade a management token to client on update if
the command does not specify `-type=management` on every update.
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
When Workload Identity is being used with Consul, the `consul_hook` will add
Consul tokens to the alloc hook resources. Update the `group_service_hook` and
`service_hook` to use those tokens when available for registering and
deregistering Consul workloads.
* config: apply defaults to extra Consul and Vault
Apply the expected default values when loading additional Consul and
Vault cluster configuration. Without these defaults some fields would be
left empty.
* config: retain pointer of multi Consul and Vault
When calling `Copy()` the pointer reference from the `"default"` key of
the `Consuls` and `Vaults` maps to the `Consul` and `Vault` field of
`Config` was being lost.
* test: ensure TestAgent has the right reference to the default Consul config
To support Workload Identity with Consul for templates, we want templates to be
able to use the WI created at the task scope (either implicitly or set by the
user). But to allow different tasks within a group to be assigned to different
clusters as we're doing for Vault, we need to be able to set the `consul` block
with its `cluster` field at the task level to override the group.
The initial intention behind the `vault.use_identity` configuration was
to indicate to Nomad servers that they would need to sign a workload
identities for allocs with a `vault` block.
But in order to support identity renewal, #18262 and #18431 moved the
token signing logic to the alloc runner since a new token needs to be
signed prior to the TTL expiring.
So #18343 implemented `use_identity` as a flag to indicate that the
workload identity JWT flow should be used when deriving Vault tokens for
tasks.
But this configuration value is set on servers so it is not available to
clients at the time of token derivation, making its meaning not clear: a
job may end up using the identity-based flow even when `use_identity` is
`false`.
The only reliable signal available to clients at token derivation time
is the presence of an `identity` block for Vault, and this is already
configured with the `vault.default_identity` configuration block, making
`vault.use_identity` redundant.
This commit removes the `vault.use_identity` configuration and
simplifies the logic on when an implicit Vault identity is injected into
tasks.
* vault: update identity name to start with `vault_`
In the original proposal, workload identities used to derive Vault
tokens were expected to be called just `vault`. But in order to support
multiple Vault clusters it is necessary to associate identities with
specific Vault cluster configuration.
This commit implements a new proposal to have Vault identities named as
`vault_<cluster>`.
* config: fix multi consul and vault config parse
Capture the loop variable when parsing multiple Consul and Vault
configuration blocks so the duration parse function uses the correct
field when it's called later on.
* client: build Vault client with right config
When setting up the multiple Vault clients, the code was always loading
the default configuration, resulting in all clients to be configured the
same way.
* config: fix WorkloadIdentityConfig.Copy() method
Ensure `WorkloadIdentityConfig.Copy()` does not return the original
pointer for the `TTL` field.
This is a work-in-progress changeset to provide workload-specific Consul tokens
that are created by the `consul_hook` and attached to workload registration
requests by the `group_service_hook` and `service_hook`.
This requires unreleased updates to Consul's `api` package, so this changeset
includes a temporary `replace` directive in the go.mod file.
It includes the work over the state store, the PRC server, the HTTP server, the go API package and the CLI's command. To read more on the actuall functionality, refer to the RFCs [NMD-178] Locking with Nomad Variables and [NMD-179] Leader election using locking mechanism for the Autoscaler.
The original thinking for Workload Identity integration with Consul and Vault
was that we'd allow `template` blocks to specify their own identity. But because
the login to Consul/Vault to get tokens happens at the task level, this would
involve making the `template` block a new WID watcher on its own rather than
using the Consul and Vault hooks we're building at the group/task level.
So it doesn't make sense to have separate identities for individual `template`
blocks rather than at the level of tasks. Update the agent configuration to
rename the `template_identity` to the more accurate `task_identity`, which will
be used for any non-service hooks (just `template` today).
Update the implicit identities job mutation hook to create the identity we'll
need as well.
Nomad Enterprise will support configuring multiple Vault clients. Instead of
having a single Vault client field in the Nomad client, we'll have a function
that callers can parameterize by the Vault cluster name that returns the
correctly configured Vault API client wrapper.
This feature will help operator to remove a failed/left node from Serf layer immediately
without waiting for 24 hours for the node to be reaped
* Update CLI with prune flag
* Update API /v1/agent/force-leave with prune query string parameter
* Update CLI and API doc
* Add unit test
Add support for identity token TTL in agent configuration fields such as
Consul `service_identity` and `template_identity`.
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
In Nomad Enterprise when multiple Vault/Consul clusters are configured, cluster admins can control access to clusters for jobs via namespace ACLs, similar to how we've done so for node pools. This changeset updates the ACL configuration structs, but doesn't wire them up.
Rename the agent configuraion for workload identity to
`WorkloadIdentityConfig` to make its use more explicit and remove the
`ServiceName` field since it is never expected to be defined in a
configuration file.
Also update the job mutation to inject a service identity following
these rules:
1. Don't inject identity if `consul.use_identity` is false.
2. Don't inject identity if `consul.service_identity` is not specified.
3. Don't inject identity if service provider is not `consul`.
4. Set name and service name if the service specifies an identity.
5. Inject `consul.service_identity` if service does not specify an
identity.