Some of our `api` package tests have ACLs enabled, but none of those tests also
run clients and the "wait for the clients to be live" code reads from the Node
API. The caller can't bootstrap ACLs until `NewTestServer` returns, and this
makes for a circular dependency.
Allow developers to provide a bootstrap token to the test server config, and
if it's available, have the server bootstrap the ACL system with it before
checking for live clients.
* core: plumbing to support numa aware scheduling
* core: apply node resources compatibility upon fsm rstore
Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.
* lint: learn how to spell pseudo
In Nomad Enterprise, a task may connect to a non-default Vault cluster,
requiring `consul-template` to be configured with a specific client
`vault` block.
* ui: fix websocket connections on dev proxy
`ember-cli` includes a handler for websocket upgrade requests to start
proxying requests. Handling the same upgrade event by also proxying in
`api.js` causes some kind of conflict where the connection is closed
unexpectedly.
The only change necessary on upgrade is to overwrite the Origin header
so Nomad accepts the connection.
* ui: remove unused variables
This changeset makes two changes:
* Removes the `consul.use_identity` field from the agent configuration. This behavior is properly covered by the presence of `consul.service_identity` / `consul.task_identity` blocks.
* Adds a `consul.task_auth_method` and `consul.service_auth_method` fields to the agent configuration. This allows the cluster administrator to choose specific Consul Auth Method names for their environment, with a reasonable default.
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.
Ref: https://github.com/hashicorp/team-nomad/issues/404
per alloc task. this can save a bit of cpu when
running plans for tasks that already exist,
and prevents Nomad tokens from changing,
which can cause task template{}s to restart
unnecessarily.
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the final patch in a series to
eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch adds a new virtual ACL policy field for when ACLs are disabled and
updates our authentication logic to use it. Included:
* Extends auth package tests to demonstrate that nil ACLs are treated as failed
auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for
pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the
`nomad/acl.go` file.
* Update the semgrep rules to detect improper nil ACL checking and remove the
old invalid ACL checks.
* Update the contributing guide for RPC authentication.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
Ref: https://github.com/hashicorp/nomad/pull/18730
Ref: https://github.com/hashicorp/nomad/pull/18744
A CSI plugin can be made up of multiple jobs, which may not be in the same
namespace. When querying for a plugin and getting information about the
allocations that implement the plugin, we need to filter by the namespaces the
user has access to.
This test existed in the ENT code base and was never moved over to CE when we
made namespaces part of the CE product.
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the third in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on client operations and a matching `AuthenticateClientOnly` method for
client-only RPCs that can produce that ACL.
Unlike the server ACLs PR, this also includes a special case for "legacy" client
RPCs where the client was not previously sending the secret as it
should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but
it'll take a while before we can guarantee they'll be present during upgrades.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Ref: https://github.com/hashicorp/nomad/pull/18715
Ref: https://github.com/hashicorp/nomad/pull/16799
The WID manager will only sign WI tokens for the allocation's task group. We're
accidentally looping over all the task groups, which for jobs with multiple task
groups results in a failure in the `consul_hook`.
The retry tests in the `api` package set up a client but don't use `NewClient`,
so the address never gets parsed into a `url.URL` and that's causing some test
failures.
- Expose internal HTTP client's Do() via Raw
- Use URL parser to identify scheme
- Align more with curl output
- Add changelog
- Fix test failure; add tests for socket envvars
- Apply review feedback for tests
- Consolidate address parsing
- Address feedback from code reviews
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* auth: add server-only ACL
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs. This is the second in a series to eliminate
the use of `nil` ACLs as a sentinel value for when ACLs are disabled.
This patch involves creating a new "virtual" ACL object for checking permissions
on server operations and a matching `AuthenticateServerOnly` method for
server-only RPCs that can produce that ACL.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218
Ref: https://github.com/hashicorp/nomad/pull/18703
Nomad imports the Vault SDK to get testing helpers, but it turns out the only
thing actually in use was a single string constant for the Vault namespace
header. Remove this dependency and hardcode the constant to reduce dependency
churn.
Includes changes to WID Manager that make it request signed identities for
services, as well as a few improvements to WIHandle introduced in #18672.
---------
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
Go 1.21.3 fixes an important HTTP2 CVE (see CVE-2023-39325 and
CVE-2023-44487). Nomad does not use HTTP2 and is not vulnerable. However we
should pick up the toolchain bump if for no other reason than we don't have to
answer questions about that.
With a default value set to `client`, the `nomad acl token update`
command can silently downgrade a management token to client on update if
the command does not specify `-type=management` on every update.
The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By
using `nil` as a sentinel value, we have the risk of nil pointer exceptions and
improper handling of `nil` when returned from our various auth methods that can
lead to privilege escalation bugs.
This patchset is the first in a series to eliminate the use of `nil` ACLs as a
sentinel value for when ACLs are disabled. This one is entirely refactoring to
reduce the burden of reviewing the final patchsets that have the functional
changes:
* Move RPC auth into a new `nomad/auth` package, injecting the dependencies
required from the server. Expose only those public methods on `nomad/auth`
that are intended for use in the RPC handlers.
* Keep the existing large authentication test as an integration test.
* Add unit tests covering the methods of `nomad/auth` we intend on keeping. The
assertions for many of these will change once we have no `nil` sentinels and
can make safe assertions about permissions on the resulting `ACL` objects.
When clients are restarted and the identity hook runs when we restore
allocations, the running allocations are likely to have already-signed Workload
Identities that are unexpired. Save these to the client's local state DB so that
we can avoid a thundering herd of RPCs during client restart. When we restore,
we'll check if there's at least one expired signed WI before making any initial
signing request.
Included:
* Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear.
* Renames the existing `widmgr_test.go` file of integration tests, which is in its
own package to avoid circular imports to `widmgr_int_test.go`
Nomad uses `time.LoadLocation()` to translate a periodic job time zone
string value to a `time.Location`. From godocs:
LoadLocation looks for the IANA Time Zone database in the following locations in order:
* the directory or uncompressed zip file named by the ZONEINFO environment variable
* on a Unix system, the system standard installation location
* $GOROOT/lib/time/zoneinfo.zip
* the time/tzdata package, if it was imported
So non-Unix systems require Go to be installed or `time/tzdata` to be
imported, otherwise running periodic jobs with a specific `time_zone`
value results in an error:
Invalid time zone "America/Toronto": unknown time zone America/Toronto
This commit adds the `timetzdata` build tag on Windows to embed the time
zone data into the final binary. This results in a slightly bigger
binary, but from `time/tzdata` godocs:
Importing this package will increase the size of a program by about 450 KB.
[..]
This package will be automatically imported if you build with -tags timetzdata.
Any code that tracks workloads and their identities should not rely on string
comparisons, especially since we support 2 types of workload identities: those
that identify tasks and those that identify services. This means we cannot rely
on task.Name for workload-identity pairs.
The new type structs.WIHandle solves this problem by providing a uniform way of
identifying workloads and their identities.
Host volumes were considered regular feasibility checks. This had two
unintended consequences.
The first happened when scheduling an allocation with a host volume on a
set of nodes with the same computed class but where only some of them
had the desired host volume.
If the first node evaluated did not have the host volume, the entire
node class was considered ineligible for the task group.
```go
// Run the job feasibility checks.
for _, check := range w.jobCheckers {
feasible := check.Feasible(option)
if !feasible {
// If the job hasn't escaped, set it to be ineligible since it
// failed a job check.
if !jobEscaped {
evalElig.SetJobEligibility(false, option.ComputedClass)
}
continue OUTER
}
}
```
This results in all nodes with the same computed class to be skipped,
even if they do have the desired host volume.
```go
switch evalElig.JobStatus(option.ComputedClass) {
case EvalComputedClassIneligible:
// Fast path the ineligible case
metrics.FilterNode(option, "computed class ineligible")
continue
```
The second consequence is somewhat the opposite. When an allocation has
a host volume with `per_alloc = true` the node must have a host volume
that matches the allocation index, so each allocation is likely to be
placed in different nodes.
But when the first allocation found a node match, it registered the node
class as eligible for the task group.
```go
// Set the task group eligibility if the constraints weren't escaped and
// it hasn't been set before.
if !tgEscaped && tgUnknown {
evalElig.SetTaskGroupEligibility(true, w.tg, option.ComputedClass)
}
```
This could cause other allocations to be placed on nodes without the
expected host volume because of the computed node class fast path. The
node feasibility for the volume was never checked.
```go
case EvalComputedClassEligible:
// Fast path the eligible case
if w.available(option) {
return option
}
// We match the class but are temporarily unavailable
continue OUTER
```
These problems did not happen with CSI volumes kind of accidentally.
Since the `CSIVolumeChecker` was not placed in the `tgCheckers` list it
did not cause the node class to be considered ineligible on failure
(avoiding the first problem).
And, as illustrated in the code snippet above, the eligible node class
fast path checks `tgAvailable` (where `CSIVolumeChecker` is placed)
before returning the option (avoiding the second problem).
By also placing `HostVolumeChecker` in the `tgAvailable` list instead of
`tgCheckers` we also avoid these problems on host volume feasibility.
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
Setting a null value to a node metadata is expected to remove it from
subsequent reads. This is true both for static node metadata (defined in
the agent configuration file) as well as for dynamic node metadata
(defined via the Nomad API).
Null values for static metadata must be persisted to indicate that the
value has been removed, but strictly dynamic metadata null values can be
removed from state and client memory.
Since the allocation in the task runner is updated in a separate
goroutine, a race condition may happen where the task is started but the
prestart hooks are skipped because the allocation became terminal.
Checking for a terminal allocation before proceeding with the task start
ensures the task only runs if the prestart hooks are also executed.
Since `shouldShutdown()` only uses terminal allocation status, it
remains `true` after the first transition, so it's safe to check it
again after the prestart hooks as it will never revert to `false`.
* func: add reporting config to server
* func: add reporting manager for ce
* func: change from clusterID to clusterMetadata and use it to start ent ledearship
* Update leader.go
* style: typo
Eliminates the vaultclient test import cycle by putting the test file into the
client package and making vaultclient objects public.
Ref hashicorp/team-nomad#404