ACL policies are parsed when creating, updating, or compiling the
resulting ACL object when used. This parsing was silently ignoring
duplicate singleton keys, or invalid keys which does not grant any
additional access, but is a poor UX and can be unexpected.
This change parses all new policy writes and updates, so that
duplicate or invalid keys return an error to the caller. This is
called strict parsing. In order to correctly handle upgrades of
clusters which have existing policies that would fall foul of the
change, a lenient parsing mode is also available. This allows
the policy to continue to be parsed and compiled after an upgrade
without the need for an operator to correct the policy document
prior to further use.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This changeset implements the ACLs required for dynamic host volumes RPCs:
* `host-volume-write` is a coarse-grained policy that implies all operations.
* `host-volume-register` is the highest fine-grained privilege because it
potentially bypasses quotas.
* `host-volume-create` is implicitly granted by `host-volume-register`
* `host-volume-delete` is implicitly granted only by `host-volume-write`
* `host-volume-read` is implicitly granted by `policy = "read"`,
These are namespaced operations, so the testing here is predominantly around
parsing and granting of implicit capabilities rather than the well-tested
`AllowNamespaceOperation` method.
This changeset does not include any changes to the `host_volumes` policy which
we'll need for claiming volumes on job submit. That'll be covered in a later PR.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
The path for a Variable never begins with a leading `/`, because it's stripped
off in the API before it ever gets to the state store. The CLI and UI allow the
leading `/` for convenience, but this can be misleading when it comes to writing
ACL policies. An ACL policy with a path starting with a leading `/` will never
match.
Update the ACL policy parser so that we prevent an incorrect variable path in
the policy.
Fixes: https://github.com/hashicorp/nomad/issues/23730
Nomad client agents run as privileged processes and require access to much of
the cluster state, secrets, etc. to operate. But we can improve upon this by
tightening up the virtual policy that use for RPC requests authenticated by the
node secret ID. This changeset removes the `node:read`, `plugin:read`, and
`plugin:list` policy, as well as namespace operations. In return, we add a
`AllowClientOp` check to the RPCs the client uses that would otherwise need
those policies.
Where possible, the update RPCs have also been changed to match on node ID so
that a client can only make the RPC that impacts itself. In future work, we may
be able to downscope further by adding node pool filtering to `AllowClientOp`.
Ref: https://github.com/hashicorp/nomad-enterprise/issues/1528
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1529
Ref: https://hashicorp.atlassian.net/browse/NET-9925
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the final patch in a series to
eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch adds a new virtual ACL policy field for when ACLs are disabled and
updates our authentication logic to use it. Included:
* Extends auth package tests to demonstrate that nil ACLs are treated as failed
auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for
pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the
`nomad/acl.go` file.
* Update the semgrep rules to detect improper nil ACL checking and remove the
old invalid ACL checks.
* Update the contributing guide for RPC authentication.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730
Ref: https://github.com/hashicorp/nomad/pull/18744
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.
Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
* auth: add server-only ACL
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the second in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on server operations and a matching `AuthenticateServerOnly` method for
server-only RPCs that can produce that ACL.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
ACL permissions for the search endpoints are done in three passes. The
first (the `sufficientSearchPerms` method) is for performance and coarsely
rejects requests based on the passed-in context parameter if the user has no
permissions to any object in that context. The second (the
`filteredSearchContexts` method) filters out contexts based on whether the user
has permissions either to the requested namespace or again by context (to catch
the "all" context). Finally, when iterating over the objects available, we do
the usual filtering in the iterator.
Internal testing found several bugs in this filtering:
* CSI plugins can be searched by any authenticated user.
* Variables can be searched if the user has `job:read` permissions to the
variable's namespace instead of `variable:list`.
* Variables cannot be searched by wildcard namespace.
This is an information leak of the plugin names and variable paths, which we
don't consider to be privileged information but intended to protect anyways.
This changeset fixes these bugs by ensuring CSI plugins are filtered in the 1st
and 2nd pass ACL filters, and changes variables to check `variable:list` in the
2nd pass filter unless the wildcard namespace is passed (at which point we'll
fallback to filtering in the iterator).
Fixes: CVE-2023-3300
Fixes: #17906
An ACL policy with a block without label generates unexpected results.
For example, a policy such as this:
```
namespace {
policy = "read"
}
```
Is applied to a namespace called `policy` instead of the documented
behaviour of applying it to the `default` namespace.
This happens because of the way HCL1 decodes blocks. Since it doesn't
know if a block is expected to have a label it applies the `key` tag to
the content of the block and, in the example above, the first key is
`policy`, so it sets that as the `namespace` block label.
Since this happens internally in the HCL decoder it's not possible to
detect the problem externally.
Fixing the problem inside the decoder is challenging because the JSON
and HCL parsers generate different ASTs that makes impossible to
differentiate between a JSON tree from an invalid HCL tree within the
decoder.
The fix in this commit consists of manually parsing the policy after
decoding to clear labels that were not set in the file. This allows the
validation rules to consistently catch and return any errors, no matter
if the policy is an invalid HCL or JSON.
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
The HCL parser allows for labels that aren't needed, which makes it easy to
accidentally write a `secure_variable` block that has the intended path as the
label for that block instead of the innner `path` block. This can result in
silent failure to lock down variables if an incorrectly specified block was used
to reduce the scope of capabilities (for example, if another correctly-written
rule allows access to `*`).
We can't detect the extraneous label in the HCL API, but we can detect if we're
missing `path` blocks entirely. Use this to block obvious user errors.
The List RPCs only checked the ACL for the Prefix argument of the request. Add
an ACL filter to the paginator for the List RPC.
Extend test coverage of ACLs in the List RPC and in the `acl` package, and add a
"deny" capability so that operators can deny specific paths or prefixes below an
allowed path.
The search RPC used a placeholder policy for searching within the secure
variables context. Now that we have ACL policies built for secure variables, we
can use them for search. Requires a new loose policy for checking if a token has
any secure variables access within a namespace, so that we can filter on
specific paths in the iterator.
This changeset includes some additional unit tests for secure
variables ACL policies, so that we have explicit coverage of edge
cases we're discussing with the UI folks.
Adds a new policy block inside namespaces to control access to secure
variables on the basis of path, with support for globbing.
Splits out VerifyClaim from ResolveClaim.
The ServiceRegistration RPC only needs to be able to verify that a
claim is valid for some allocation in the store; it doesn't care about
implicit policies or capabilities. Split this out to its own method on
the server so that the SecureVariables RPC can reuse it as a separate
step from resolving policies (see next commit).
Support implicit policies based on workload identity
Improve how the all namespaces wildcard (`*`) is handled when checking
ACL permissions. When using the wildcard namespace the `AllowNsOp` would
return false since it looks for a namespace called `*` to match.
This commit changes this behavior to return `true` when the queried
namespace is `*` and the token allows the operation in _any_ namespace.
Actual permission must be checked per object. The helper function
`AllowNsOpFunc` returns a function that can be used to make this
verification.
Add new namespace ACL requirement for the /v1/jobs/parse endpoint and
return early if HCLv2 parsing fails.
The endpoint now requires the new `parse-job` ACL capability or
`submit-job`.
state store: call-out to generic update of job recommendations from job update method
recommendations API work, and http endpoint errors for OSS
support for scaling polices in task block of job spec
add query filters for ScalingPolicy list endpoint
command: nomad scaling policy list: added -job and -type
- read-job-scaling
- scale-job
- list-scaling-policies
- read-scaling-policy
updated the read and right policy dispositions, added the new autoscaler disposition
* acl/policy: add the volume ACL policies
* nomad/csi_endpoint: enforce ACLs for volume access
* nomad/search_endpoint_oss: volume acls
* acl/acl: add plugin read as a global policy
* acl/policy: add PluginPolicy global cap type
* nomad/csi_endpoint: check the global plugin ACL policy
* nomad/mock/acl: PluginPolicy
* nomad/csi_endpoint: fix list rebase
* nomad/core_sched_test: new test since #7358
* nomad/csi_endpoint_test: use correct permissions for list
* nomad/csi_endpoint: allowCSIMount keeps ACL checks together
* nomad/job_endpoint: check mount permission for jobs
* nomad/job_endpoint_test: need plugin read, too
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.
Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release. As such, we maintain current behavior that doesn't require
passing the proper namespace in request. A future major release may
start enforcing checking declared namespace.
This adds an initial implementation of ACLs for HostVolumes.
Because HostVolumes are a cluster-wide resource, they cannot be tied to
a namespace, thus here we allow similar wildcard definitions based on
their names, tied to a set of capabilities.
Initially, the only available capabilities are deny, or mount. These
may be extended in the future to allow read-fs, mount-readonly and
similar capabilities.
This adds `alloc-exec` capability to allow operator to execute command into a
running task. Furthermore, it adds `alloc-node-exec` capability, required when
the alloc task is raw_exec or a driver with no FSIsolation.
This capability will gate access to features that allow interacting with
a running allocation, for example, signalling, stopping, and rescheduling
specific allocations.
This commit adds basic support for globbing namespaces in acl
definitions.
For concrete definitions, we merge all of the defined policies at load time, and
perform a simple lookup later on. If an exact match of a concrete
definition is found, we do not attempt to resolve globs.
For glob definitions, we merge definitions of exact replicas of a glob.
When loading a policy for a glob defintion, we choose the glob that has
the closest match to the namespace we are resolving for. We define the
closest match as the one with the _smallest character difference_
between the glob and the namespace we are matching.