In the spirit of #25909, this PR removes testify dependencies from the scheduler
package, along with reflect.DeepEqual removal. This is again a combination of
semgrep and hx editing magic.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
We've been gradually migrating from `testify` to `shoenig/test` on a
test-by-test basis. While working on a large refactoring in the state store, I
found this to create a lot of diffs incidental to the refactoring.
In this changeset, I've used a prototype collection of semgrep fix rules to
autofix most of the uses of testify in the `nomad/state` package. Then I went in
manually and fixed any resulting problems, as well as a few minor test bugs that
`shoenig/test` catches and `testify` does not because of its API. I've also
added a semgrep rule for marking a package as "testify clean", so that we don't
accidentally add it back to any package we manage to remove it from going
forward.
While I'm here, I've removed most of the uses of `reflect.DeepEqual` in the
tests as well as cleaned up some older idioms that Go has nicer syntax for now.
We have several semgrep rules forbidding imports of packages we don't
want. While testing out a new rule I discovered that the rule we have is
completely ineffective. Update the rule to detect imports using the Go language
plugin, including regex matching on some packages where it's forbidden to import
the root but fine to import a subpackage or different version.
The go-set import rule is an example of one where our `go-set/v3` imports fail
the rewritten check unless we use the regex syntax. If you replace the pattern
rule with `import "=~/github.com\/hashicorp\/go-set/v3$/"` it would fail.
Nomad client agents run as privileged processes and require access to much of
the cluster state, secrets, etc. to operate. But we can improve upon this by
tightening up the virtual policy that we use for RPC requests authenticated by
the node secret ID. This changeset removes the `node:read`, `plugin:read`, and
`plugin:list` policies, as well as namespace operations. In return, we add an
`AllowClientOp` check to the RPCs the client uses that would otherwise need
those policies.
Where possible, the update RPCs have also been changed to match on node ID so
that a client can only make the RPC that impacts itself. In future work, we may
be able to downscope further by adding node pool filtering to `AllowClientOp`.
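A minimal sketch of the check described above, assuming a simplified ACL type. Only the `AllowClientOp` name comes from this changeset; the `ACL` struct, `authorizeNodeUpdate`, and the error value are hypothetical stand-ins, not Nomad's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ACL is a hypothetical, simplified stand-in for the real ACL object.
type ACL struct {
	client bool // true when the ACL was produced from a node secret ID
}

// AllowClientOp reports whether the ACL belongs to a client (node) identity.
// A nil ACL is denied.
func (a *ACL) AllowClientOp() bool { return a != nil && a.client }

var errPermissionDenied = errors.New("permission denied")

// authorizeNodeUpdate is a hypothetical helper: it requires a client ACL and
// requires that the authenticated node matches the node being updated.
func authorizeNodeUpdate(a *ACL, authenticatedNodeID, targetNodeID string) error {
	if !a.AllowClientOp() {
		return errPermissionDenied
	}
	if authenticatedNodeID != targetNodeID {
		return errPermissionDenied
	}
	return nil
}

func main() {
	clientACL := &ACL{client: true}
	fmt.Println(authorizeNodeUpdate(clientACL, "node-1", "node-1"))
	fmt.Println(authorizeNodeUpdate(clientACL, "node-1", "node-2"))
}
```

The second call is rejected because a client may only make the update RPC for itself.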
Ref: https://github.com/hashicorp/nomad-enterprise/issues/1528
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1529
Ref: https://hashicorp.atlassian.net/browse/NET-9925
As of #18754 which shipped in Nomad 1.7, we no longer need to nil-check the
object returned by ResolveACL if there's no error return, because in the case
where ACLs are disabled we return a special "ACLs disabled" ACL object. Checking
nil is not a bug but should be discouraged because it opens us up to future bugs
that would bypass ACLs.
We fixed a bunch of these cases in https://github.com/hashicorp/nomad/pull/20150
but I didn't update the semgrep rule, which meant we missed a few more. Update
the semgrep rule and fix the remaining cases.
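To illustrate why the nil check is discouraged, here is a sketch with hypothetical types. The `ACL` struct, `AllowRead`, and the sentinel variable are assumptions made for the example and do not reflect the real `acl` package.

```go
package main

import "fmt"

// ACL is a hypothetical, simplified stand-in for the real ACL object.
type ACL struct {
	aclsDisabled bool // set only on the "ACLs disabled" sentinel
	canRead      bool
}

// aclsDisabledACL is an assumed sentinel returned when ACLs are turned off.
var aclsDisabledACL = &ACL{aclsDisabled: true}

// AllowRead fails closed on a nil receiver and allows everything on the
// "ACLs disabled" sentinel.
func (a *ACL) AllowRead() bool {
	if a == nil {
		return false
	}
	return a.aclsDisabled || a.canRead
}

// discouragedCheck is the nil-guarded pattern: a nil ACL skips the permission
// check entirely, so the request is allowed.
func discouragedCheck(a *ACL) bool {
	if a != nil && !a.AllowRead() {
		return false
	}
	return true
}

// preferredCheck calls the method unconditionally, so nil is denied.
func preferredCheck(a *ACL) bool {
	return a.AllowRead()
}

func main() {
	var missing *ACL
	fmt.Println(discouragedCheck(missing))       // true: fails open
	fmt.Println(preferredCheck(missing))         // false: fails closed
	fmt.Println(preferredCheck(aclsDisabledACL)) // true: ACLs are disabled
}
```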
Some packages licensed under MPL-2.0 were incorrectly importing code
from packages licensed under BUSL-1.1.
Not all imports are fixed here as they will require additional work to
untangle them. To help track progress this commit adds a Semgrep rule
that detects incorrect BUSL-1.1 imports in MPL-2.0 packages.
Added the [OIDC Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html) `/.well-known/openid-configuration` endpoint to Nomad, but it is only enabled if the `server.oidc_issuer` parameter is set. Documented the parameter, but without a tutorial trying to actually _use_ this will be very hard.
I intentionally did *not* use https://github.com/hashicorp/cap for the OIDC configuration struct because it's built to be a *compliant* OIDC provider. Nomad is *not* trying to be compliant initially because compliance to the spec does not guarantee it will actually satisfy the requirements of third parties. I want to avoid the problem where in an attempt to be standards compliant we ship configuration parameters that lock us in to a certain behavior that we end up regretting. I want to add parameters and behaviors as there's a demonstrable need.
Users always have the escape hatch of providing their own OIDC configuration endpoint. Nomad just needs to know the Issuer so that the JWTs match the OIDC configuration. There's no reason the actual OIDC configuration JSON couldn't live in S3 and get served directly from there. Unlike JWKS the OIDC configuration should be static, or at least change very rarely.
This PR is just the endpoint extracted from #18535. The `RS256` algorithm still needs to be added in hopes of supporting third parties such as [AWS IAM OIDC Provider](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the final patch in a series to
eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch adds a new virtual ACL policy field for when ACLs are disabled and
updates our authentication logic to use it. Included:
* Extends auth package tests to demonstrate that nil ACLs are treated as failed
auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for
pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the
`nomad/acl.go` file.
* Updates the semgrep rules to detect improper nil ACL checking and removes the
old invalid ACL checks.
* Updates the contributing guide for RPC authentication.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730
Ref: https://github.com/hashicorp/nomad/pull/18744
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.
Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
* auth: add server-only ACL
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the second in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on server operations and a matching `AuthenticateServerOnly` method for
server-only RPCs that can produce that ACL.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21
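For reference, the `min`/`max` item above relies on the builtins added in Go 1.21. A small sketch follows; `clampBatch` is a hypothetical function, not one from the Nomad codebase.

```go
package main

import "fmt"

// clampBatch is a hypothetical example: it bounds a requested size to the
// range [1, limit] using the min and max builtins available since Go 1.21.
func clampBatch(requested, limit int) int {
	return max(1, min(requested, limit))
}

func main() {
	fmt.Println(clampBatch(500, 100)) // 100
	fmt.Println(clampBatch(0, 100))   // 1
	fmt.Println(clampBatch(42, 100))  // 42
}
```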
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.
Part 1 of N components as part of making workload identities consumable by third party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with 3rd parties, so this merge does not yet document this endpoint.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.
Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Some of the core scheduler tests need the maximum batch size for writes to be
smaller than the usual `structs.MaxUUIDsPerWriteRequest`. But they do so by
unsafely modifying the global struct, which creates test flakes in other tests.
Modify the functions under test to take a batch size parameter. Production code
will pass the global while the tests can inject smaller values. Turn the
`structs.MaxUUIDsPerWriteRequest` into a constant, and add a semgrep rule for
avoiding this kind of thing in the future.
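A rough sketch of the approach, assuming a simple slice-partitioning function. `partitionIDs` and the constant's value are illustrative placeholders, not the actual functions under test or the real limit.

```go
package main

import "fmt"

// maxUUIDsPerWriteRequest stands in for structs.MaxUUIDsPerWriteRequest.
// The value is a placeholder chosen for the example.
const maxUUIDsPerWriteRequest = 1000

// partitionIDs is a hypothetical function under test. It takes the batch size
// as a parameter instead of reading a mutable package-level variable.
func partitionIDs(ids []string, batchSize int) [][]string {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]string
	for len(ids) > 0 {
		n := batchSize
		if n > len(ids) {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}

	// Production code passes the constant.
	fmt.Println(len(partitionIDs(ids, maxUUIDsPerWriteRequest))) // 1

	// Tests inject a small value without touching any global state.
	fmt.Println(len(partitionIDs(ids, 2))) // 3
}
```

Because the limit is a constant, a test can no longer reassign it, and parallel tests cannot observe each other's batch size.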
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.
* Extends the work done for pre-forwarding authentication to all RPCs that
support an HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
endpoints that are currently used to support both tokens and client secrets.
Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
policies. Ideally we'll figure out an efficient way to resolve those into real
policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
support HTTP endpoints) have not been updated with the new pre-forwarding auth.
We'll be doing this under a separate PR to support RPC rate metrics.
This changeset covers a sidebar discussion that @schmichael and I had around the
design for pre-forwarding auth. This includes some changes extracted out of
#15513 to make it easier to review both and leave a clean history.
* Remove fast path for NodeID. Previously-connected clients will have a NodeID
set on the context, and because this is a large portion of the RPCs sent we
fast-pathed it at the top of the `Authenticate` method. But the context is
shared for all yamux streams over the same yamux session (and TCP
connection). This lets an authenticated HTTP request to a client use the
NodeID for authentication, which is a privilege escalation. Remove the fast
path and annotate it so that we don't break it again.
* Add context to decisions around AuthenticatedIdentity. The `Authenticate`
method taken on its own looks like it wants to return an `acl.ACL` that folds
over all the various identity types (creating an ephemeral ACL on the fly if
necessary). But keeping these fields independent allows RPC handlers to
differentiate between internal and external origins so we most likely want to
avoid this. Leave some docstrings as a warning as to why this is built the way
it is.
* Mutate the request rather than returning. When reviewing #15513 we decided
that forcing the request handler to call `SetIdentity` was repetitive and
error prone. Instead, the `Authenticate` method mutates the request by setting
its `AuthenticatedIdentity`.
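The last point can be sketched as follows. The names `Authenticate`, `SetIdentity`, and `AuthenticatedIdentity` come from the text; the struct fields, `GetIdentity`, and the token lookup table are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthenticatedIdentity is a simplified stand-in; the field is hypothetical.
type AuthenticatedIdentity struct {
	TokenName string
}

// Request is a hypothetical RPC request carrying a token and, once
// authenticated, the resolved identity.
type Request struct {
	AuthToken string
	identity  *AuthenticatedIdentity
}

func (r *Request) SetIdentity(id *AuthenticatedIdentity) { r.identity = id }
func (r *Request) GetIdentity() *AuthenticatedIdentity   { return r.identity }

// knownTokens is an assumed lookup table standing in for the state store.
var knownTokens = map[string]string{"secret-1": "example-token"}

// Authenticate resolves the token and stores the identity on the request, so
// handlers cannot forget to call SetIdentity themselves.
func Authenticate(req *Request) error {
	name, ok := knownTokens[req.AuthToken]
	if !ok {
		return errors.New("authentication failed")
	}
	req.SetIdentity(&AuthenticatedIdentity{TokenName: name})
	return nil
}

func main() {
	req := &Request{AuthToken: "secret-1"}
	if err := Authenticate(req); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.GetIdentity().TokenName) // example-token
}
```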
Upcoming work to instrument the rate of RPC requests by consumer (and eventually
rate limit) requires that we authenticate an RPC request before forwarding. Add
a new top-level `Authenticate` method to the server and have it return an
`AuthenticatedIdentity` struct. RPC handlers will use the relevant fields of
this identity for performing authorization.
This changeset includes:
* The main implementation of `Authenticate`
* Provide a new RPC `ACL.WhoAmI` for debugging authentication. This endpoint
returns the same `AuthenticatedIdentity` that will be used by RPC handlers. At
some point we might want to give this an equivalent HTTP endpoint but I didn't
want to add that to our public API until some of the other Workload Identity
work is solidified, especially if we don't need it yet.
* A full coverage test of the `Authenticate` method. This sets up two server
nodes with mTLS and ACLs, some tokens, and some allocations with workload
identities.
* Wire up an example of using `Authenticate` in the `Namespace.Upsert` RPC and
see how authorization happens after forwarding.
* A new semgrep rule for `Authenticate`, which we'll need to update once we're
ready to wire up more RPC endpoints with authorization steps.
The List RPC correctly authorized against the prefix argument. But when
filtering results underneath the prefix, it only checked authorization for
standard ACL tokens and not Workload Identity. This results in WI tokens being
able to read List results (metadata only: variable paths and timestamps) for
variables under the `nomad/` prefix that belong to other jobs in the same
namespace.
Fix the filtering and split the `handleMixedAuthEndpoint` function into
separate authentication and authorization steps so that we don't need to
re-verify the claim token on each filtered object.
Also includes:
* update semgrep rule for mixed auth endpoints
* variables: List returns empty set when all results are filtered
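A simplified sketch of the split described above: authenticate once, then authorize each listed path against the already-verified claims. The `claims` type and the path rule in `canReadPath` are assumptions for illustration and are narrower than the real implicit policies.

```go
package main

import (
	"fmt"
	"strings"
)

// claims is a hypothetical, already-verified workload identity.
type claims struct {
	JobID string
}

// canReadPath is an assumed, simplified authorization rule: a workload may
// only read variables under nomad/jobs/<its own job ID>.
func canReadPath(c claims, path string) bool {
	prefix := "nomad/jobs/" + c.JobID
	return path == prefix || strings.HasPrefix(path, prefix+"/")
}

// filterPaths authorizes each result against the verified claims. The claim
// token is not re-verified per object. It returns an empty, non-nil slice
// when every result is filtered out.
func filterPaths(c claims, paths []string) []string {
	out := []string{}
	for _, p := range paths {
		if canReadPath(c, p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	paths := []string{
		"nomad/jobs/web",
		"nomad/jobs/web/frontend",
		"nomad/jobs/db",
	}
	fmt.Println(filterPaths(claims{JobID: "web"}, paths))
	fmt.Println(len(filterPaths(claims{JobID: "cache"}, paths)))
}
```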
Metrics state is local to the server and needs to use time, which is normally
forbidden in the FSM code. We have a bypass for this rule for
`metrics.MeasureSince` but needed one for `metrics.MeasureSinceWithLabels` as well.
* test: don't use loop vars in goroutines
fixes a data race in the test
* test: copy objects in statestore before mutating
fixes data race in test
* test: @lgfa29's semgrep rule for loops/goroutines
Found 2 places where we were improperly using loop variables inside
goroutines.
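For context, in Go versions before 1.22 a `for` loop variable is shared across iterations, so a goroutine that captures it can observe a later value. One common fix, sketched below with a hypothetical `squares` function, is to pass the loop variables as arguments.

```go
package main

import (
	"fmt"
	"sync"
)

// squares is a hypothetical example that computes each square in its own
// goroutine without capturing the loop variables.
func squares(inputs []int) []int {
	out := make([]int, len(inputs))
	var wg sync.WaitGroup
	for i, v := range inputs {
		wg.Add(1)
		// Passing i and v as arguments gives each goroutine its own copy.
		go func(i, v int) {
			defer wg.Done()
			out[i] = v * v // each goroutine writes a distinct element
		}(i, v)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(squares([]int{1, 2, 3})) // [1 4 9]
}
```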
In order to support implicit ACL policies for tasks to get their own
secrets, each task would need to have its own ACL token. This would
add extra raft overhead as well as new garbage collection jobs for
cleaning up task-specific ACL tokens. Instead, Nomad will create a
workload Identity Claim for each task.
An Identity Claim is a JSON Web Token (JWT) signed by the server’s
private key and attached to an Allocation at the time a plan is
applied. The encoded JWT can be submitted as the X-Nomad-Token header
to replace ACL token secret IDs for the RPCs that support identity
claims.
Whenever a key is added to a server’s keyring, the server will use the key
as the seed for an Ed25519 public-private keypair. That keypair
will be used for signing the JWT and for verifying the JWT.
This implementation is a ruthlessly minimal approach to support the
secure variables feature. When a JWT is verified, the allocation ID
will be checked against the Nomad state store, and non-existent or
terminal allocation IDs will cause the validation to be rejected. This
is sufficient to support the secure variables feature at launch
without requiring implementation of a background process to renew
soon-to-expire tokens.
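The allocation check during verification might look roughly like this. The `Allocation` type, the map-backed store, and `verifyClaimAlloc` are hypothetical simplifications, and signature verification is assumed to have already succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

// Allocation is a hypothetical, trimmed-down allocation record.
type Allocation struct {
	ID       string
	Terminal bool
}

// stateStore is an assumed in-memory stand-in for the Nomad state store.
type stateStore map[string]*Allocation

// verifyClaimAlloc rejects claims whose allocation is missing or terminal.
func verifyClaimAlloc(store stateStore, allocID string) error {
	alloc, ok := store[allocID]
	if !ok || alloc == nil {
		return errors.New("allocation does not exist")
	}
	if alloc.Terminal {
		return errors.New("allocation is terminal")
	}
	return nil
}

func main() {
	store := stateStore{
		"alloc-running": {ID: "alloc-running"},
		"alloc-done":    {ID: "alloc-done", Terminal: true},
	}
	fmt.Println(verifyClaimAlloc(store, "alloc-running"))
	fmt.Println(verifyClaimAlloc(store, "alloc-done"))
	fmt.Println(verifyClaimAlloc(store, "alloc-missing"))
}
```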
PR #11956 implemented a new mTLS RPC check to validate the role of the
certificate used in the request, but further testing revealed two flaws:
1. client-only endpoints did not accept server certificates so the
request would fail when forwarded from one server to another.
2. the certificate was being checked after the request was forwarded,
so the check would happen over the server certificate, not the
actual source.
This commit checks for the desired mTLS level, where the client level
accepts either a server or a client certificate. It also validates the
certificate before the request is forwarded.
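A sketch of the two fixes, with hypothetical names throughout (`tlsLevel`, `certRole`, `validateCert`, and `handleRPC` are not taken from the Nomad source): the client level accepts either role, and validation runs before any forwarding decision.

```go
package main

import (
	"errors"
	"fmt"
)

// certRole is the role encoded in the caller's certificate.
type certRole string

const (
	roleServer certRole = "server"
	roleClient certRole = "client"
)

// tlsLevel is the mTLS level an endpoint requires.
type tlsLevel int

const (
	levelServer tlsLevel = iota // server certificates only
	levelClient                 // server or client certificates
)

// validateCert checks the certificate role against the required level.
func validateCert(level tlsLevel, role certRole) error {
	switch level {
	case levelServer:
		if role == roleServer {
			return nil
		}
	case levelClient:
		if role == roleServer || role == roleClient {
			return nil
		}
	}
	return errors.New("certificate role not permitted")
}

// handleRPC validates the original caller's certificate first, and only then
// decides whether to forward the request to another server.
func handleRPC(level tlsLevel, role certRole, isLeader bool) (string, error) {
	if err := validateCert(level, role); err != nil {
		return "", err
	}
	if !isLeader {
		return "forwarded", nil
	}
	return "handled", nil
}

func main() {
	fmt.Println(handleRPC(levelClient, roleServer, true))  // handled <nil>
	fmt.Println(handleRPC(levelClient, roleClient, false)) // forwarded <nil>
	fmt.Println(handleRPC(levelServer, roleClient, false)) // rejected before forwarding
}
```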