This changeset implements the RPC handlers for Dynamic Host Volumes, including
the plumbing needed to forward requests to clients. The client-side
implementation is stubbed and will be done under a separate PR.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
In Nomad 1.4, we implemented a root keyring to support encrypting Variables and
signing Workload Identities. The keyring was originally stored with the
AEAD-wrapped DEKs and the KEK together in a JSON keystore file on disk. We
recently added support for using an external KMS for the KEK to improve the
security model for the keyring. But we've encountered multiple instances of the
keystore files not getting backed up separately from the Raft snapshot,
resulting in failure to restore clusters from backup.
Move Nomad's root keyring into Raft (encrypted with a KMS/Vault where available)
in order to eliminate operational problems with the separate on-disk keystore.
Fixes: https://github.com/hashicorp/nomad/issues/23665
Ref: https://hashicorp.atlassian.net/browse/NET-10523
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `template` blocks, which need extra room to write a temporary file
to the secrets directory that is then atomically renamed over the old file.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
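As a rough sketch of the accounting described above (not the actual Nomad
structs): the `Resources` type, its field names, and the helper names below are
hypothetical, and `SecretsMB == 0` is assumed to mean "not set by the user".

```go
package main

import "fmt"

// defaultSecretsMB is the tmpfs size used when the job does not set one.
// Per the description above, this default is not charged to the allocation.
const defaultSecretsMB = 1

// Resources is a hypothetical, trimmed-down stand-in for a task's resources.
// SecretsMB == 0 is assumed to mean resources.secrets was not set.
type Resources struct {
	MemoryMB  int
	SecretsMB int
}

// ScheduledMemoryMB is the memory the scheduler accounts for: the task's
// memory plus any explicitly requested secrets tmpfs.
func (r Resources) ScheduledMemoryMB() int {
	return r.MemoryMB + r.SecretsMB
}

// TmpfsSizeMB is the size the secrets directory is mounted with.
func (r Resources) TmpfsSizeMB() int {
	if r.SecretsMB == 0 {
		return defaultSecretsMB
	}
	return r.SecretsMB
}

// DriverMemoryMB subtracts the tmpfs size again just before the task driver
// is configured, so the task's own limit stays at resources.memory.
func DriverMemoryMB(scheduledMB, secretsMB int) int {
	return scheduledMB - secretsMB
}

func main() {
	r := Resources{MemoryMB: 256, SecretsMB: 16}
	scheduled := r.ScheduledMemoryMB()
	// prints "272 256 16"
	fmt.Println(scheduled, DriverMemoryMB(scheduled, r.SecretsMB), r.TmpfsSizeMB())
}
```

With the field unset the scheduled memory equals `resources.memory`, while an
explicit value of 1 adds 1 MiB, matching the "no longer free" behavior above.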
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
The TLS configuration object includes a deprecated `prefer_server_cipher_suites`
field. In versions of Go prior to 1.17, this property controlled whether a TLS
connection would use the cipher suites preferred by the server or by the
client. This field is ignored as of 1.17 and, according to the `crypto/tls`
docs: "Servers now select the best mutually supported cipher suite based on
logic that takes into account inferred client hardware, server hardware, and
security."
This property has long been deprecated, and leaving it in place may lead to
false assumptions about how cipher suites are negotiated when connecting to a
server. So we want to remove it in Nomad 1.9.0.
Fixes: https://github.com/hashicorp/nomad-enterprise/issues/999
Ref: https://hashicorp.atlassian.net/browse/NET-10531
The batch deregister RPC endpoint is only used by the internal
garbage collection process; it is not exposed via the HTTP API or
used anywhere else.
The GC process ensures that a job can only be removed from state
if all related evaluations and allocations are in a state that
means they can also be removed from state. This means that we do
not need to create evaluations when jobs are being deregistered
via this endpoint.
When a `transparent_proxy` block is present and the network mode is `bridge`, use
a different CNI configuration that includes the `consul-cni` plugin. Before
invoking the CNI plugins, create a Consul SDK `iptables.Config` struct for the
allocation. This includes:
* All the `transparent_proxy` block fields are used
* The reserved ports are added to the inbound exclusion list so the alloc is
reachable from outside the mesh
* The `expose` blocks and `check` blocks with `expose=true` are added to the
inbound exclusion list so health checks work.
The `iptables.Config` is then passed as a CNI argument to the `consul-cni`
plugin.
Ref: https://github.com/hashicorp/nomad/issues/10628
Also add an explicit exit code to the subproc package for when a child
process is instructed to run an unrunnable command (i.e. one that cannot be
found or is not executable), using the 127 return code that folks using bash
are familiar with.
Replaces #18812
Upgraded with:
```
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/go-msgpack\/codec"/"github.com\/hashicorp\/go-msgpack\/v2\/codec/" '{}' ';'
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/net-rpc-msgpackrpc"/"github.com\/hashicorp\/net-rpc-msgpackrpc\/v2/" '{}' ';'
go get
go get -v -u github.com/hashicorp/raft-boltdb/v2
go get -v github.com/hashicorp/serf@5d32001edfaa18d1c010af65db707cdb38141e80
```
see https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0
for details
* exec: add a client.users configuration block
For now just add min/max dynamic user values; soon we can also absorb
the "user.denylist" and "user.checked_drivers" options from the
deprecated client.options map.
* give the no-op pool implementation a better name
* use explicit error types to make referencing them cleaner in tests
* use import alias to not shadow package name
* exec2: implement dynamic workload users taskrunner hook
This PR implements a TR hook for allocating dynamic workload users from
a pool managed by the Nomad client. This adds a new task driver Capability,
DynamicWorkloadUsers - which a task driver must indicate in order to make
use of this feature.
The client config plumbing is coming in a followup PR - in the RFC we
realized having a client.users block would be nice to have, with some
additional unrelated options being moved from the deprecated client.options
config.
* learn to spell
* exec2: implement a dynamic users pool
This PR adds an implementation of a Pool from which dynamic users can
be allocated on behalf of tasks making use of an upcoming feature of
Nomad client (dynamic users).
A task hook and client plumbing, etc. will be in follow up PRs.
* no need for randomness assertion
The Nomad client renders templates in the same privileged process used for most
other client operations. During internal testing, we discovered that a malicious
task can create a symlink that can cause template rendering to read and write to
arbitrary files outside the allocation sandbox. Because the Nomad agent can be
restarted without restarting tasks, we can't simply check that the path is safe
at the time we write without encountering a time-of-check/time-of-use race.
To protect Nomad client hosts from this attack, we'll now read and write
templates in a subprocess:
* On Linux/Unix, this subprocess is sandboxed via chroot to the allocation
directory. This requires that Nomad is running as a privileged process. A
non-root Nomad agent will warn that it cannot sandbox the template renderer.
* On Windows, this process is sandboxed via a Windows AppContainer which has
been granted access only to the allocation directory. This does not require
special privileges on Windows. (Creating symlinks in the first place can be
prevented by running workloads as non-Administrator or
non-ContainerAdministrator users.)
Both sandboxes cause encountered symlinks to be evaluated in the context of the
sandbox, which will result in a "file not found" or "access denied" error,
depending on the platform. This change will also require an update to
Consul-Template to allow callers to inject a custom `ReaderFunc` and
`RenderFunc`.
This design is intended as a workaround to allow us to fix this bug without
creating backwards compatibility issues for running tasks. A future version of
Nomad may introduce a read-only mount specifically for templates and artifacts
so that tasks cannot write into the same location that the Nomad agent writes to.
Fixes: https://github.com/hashicorp/nomad/issues/19888
Fixes: CVE-2024-1329
During allocation directory migration, the client was not checking that any
symlinks in the archive aren't pointing to somewhere outside the allocation
directory. While task driver sandboxing will prevent processes inside the task
from reading/writing through the symlink, this doesn't prevent the client
itself from performing unintended operations outside the sandbox.
This changeset includes two changes:
* Update the archive unpacking to check the source of symlinks and require that
they fall within the sandbox.
* Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't
work for paths in the sibling directories of the sandbox directory. This bug
doesn't appear to be exploitable but caused errors in testing.
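For illustration only, one way to express a containment check that treats
sibling directories correctly is a lexical prefix comparison with a trailing
separator. This is a sketch under stated assumptions, not the check used in
Nomad: `escapesSandbox` is a hypothetical helper, it assumes Unix-style paths
and a sandbox root other than `/`, and it does not touch the filesystem or
resolve other symlinks along the path.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// escapesSandbox reports whether a symlink located at linkPath with the given
// target would resolve, lexically, to a location outside root.
func escapesSandbox(root, linkPath, target string) bool {
	root = filepath.Clean(root)
	resolved := target
	if !filepath.IsAbs(target) {
		// Relative targets are resolved against the link's own directory.
		resolved = filepath.Join(filepath.Dir(linkPath), target)
	}
	resolved = filepath.Clean(resolved)
	if resolved == root {
		return false
	}
	// Comparing with a trailing separator keeps a sibling such as
	// "/alloc/abc-other" from matching the prefix "/alloc/abc".
	return !strings.HasPrefix(resolved, root+string(filepath.Separator))
}

func main() {
	root := "/alloc/abc"
	fmt.Println(escapesSandbox(root, "/alloc/abc/task/link", "../shared/data"))         // false
	fmt.Println(escapesSandbox(root, "/alloc/abc/task/link", "../../abc-other/secret")) // true
	fmt.Println(escapesSandbox(root, "/alloc/abc/link", "/etc/passwd"))                 // true
}
```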
Fixes: https://github.com/hashicorp/nomad/issues/19887
The current implementation of the `nomad tls ca create` command
overrides the value of the `-domain` flag with `"nomad"` if no
additional customization is provided.
This results in a certificate for the wrong domain or an error if the
`-name-constraint` flag is also used.
The logic for `IsCustom()` also seemed reversed. If all custom fields
are empty then the certificate is _not_ customized, so `IsCustom()`
should return false.
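The intended semantics can be sketched as follows. The `caOptions` type and
its field names are hypothetical and chosen only for illustration; they are
not the fields used by the command.

```go
package main

import "fmt"

// caOptions is a hypothetical stand-in for the CA customization flags.
type caOptions struct {
	Country      string
	Organization string
	CommonName   string
}

// IsCustom reports whether any customization field was set. With every field
// empty the certificate is not customized, so it returns false.
func (o caOptions) IsCustom() bool {
	return o.Country != "" || o.Organization != "" || o.CommonName != ""
}

func main() {
	// prints "false true"
	fmt.Println(caOptions{}.IsCustom(), caOptions{Organization: "Example"}.IsCustom())
}
```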
This PR refactors a helper function for getting the UID associated with
a given username to also return the GID and home directory. Also adds
unit tests for the known values of the root and nobody users on Ubuntu Linux.
Some packages licensed under MPL-2.0 were incorrectly importing code
from packages licensed under BUSL-1.1.
Not all imports are fixed here as they will require additional work to
untangle them. To help track progress this commit adds a Semgrep rule
that detects incorrect BUSL-1.1 imports in MPL-2.0 packages.
The Nomad state store function was recently updated to validate
certain parameters, fixing a panic condition. This change meant
the dummy FSM used for the snapshot state command was always failing
this validation and the command no longer worked.
This change adds the required parameter to pass validation and
therefore makes the CLI command functional again.
* Move group into a separate helper module for reuse
* Add shutdownCh to worker
The shutdown channel is used to signal that the worker has stopped.
* Make server shutdown block on workers' shutdownCh
* Fix waiting for eval broker state change blocking indefinitely
There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.
This commit fixes it by unblocking if the notifier has been stopped.
* Bound the amount of time server shutdown waits on worker completion
* Fix lostcancel linter error
* Fix worker test using unexpected worker constructor
* Add changelog
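The notifier fix in the list above (unblocking when the notifier has stopped)
can be sketched with a `select` that also watches a stop channel. The
`notifier` type below is a simplified, hypothetical stand-in and not the
GenericNotifier itself; an unbuffered channel with no reader stands in for the
full `unsubscribeCh`.

```go
package main

import (
	"fmt"
	"time"
)

// notifier is a simplified stand-in for a publish/subscribe notifier.
type notifier struct {
	unsubscribeCh chan chan string
	stopCh        chan struct{}
}

// unsubscribe hands ch to the run loop for removal. Selecting on stopCh as
// well releases the caller when the run loop has already exited and nothing
// will ever receive from unsubscribeCh again.
func (n *notifier) unsubscribe(ch chan string) bool {
	select {
	case n.unsubscribeCh <- ch:
		return true
	case <-n.stopCh:
		return false
	}
}

func main() {
	n := &notifier{
		unsubscribeCh: make(chan chan string), // no reader: a bare send would block forever
		stopCh:        make(chan struct{}),
	}
	close(n.stopCh) // the run loop has stopped

	done := make(chan bool, 1)
	go func() { done <- n.unsubscribe(make(chan string)) }()

	select {
	case delivered := <-done:
		fmt.Println("returned, delivered =", delivered)
	case <-time.After(time.Second):
		fmt.Println("still blocked")
	}
}
```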
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Nomad loads all plugins from `plugin_dir` regardless of whether they are listed
in the agent configuration file. This can cause unexpected binaries to be
executed.
This commit begins the deprecation process of this behaviour. The Nomad
agent will emit a warning log for every plugin binary found without a
corresponding agent configuration block.
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replacing one call of .Slice with .ForEach to avoid
making the intermediate copy.
In #12458 we added an in-memory connection buffer so that template runners that
want access to the Nomad API for Service Registration and Variables can
communicate with Nomad without having to create a real HTTP client. The size of
this buffer (1 MiB) was taken directly from its usage in Vault, and each
connection makes 2 such buffers (send and receive). Because each template runner
has its own connection, when there are large numbers of allocations this adds up
to significant memory usage.
The largest Nomad Variable payload is 64KiB plus a small amount of
metadata. Service Registration responses are much smaller, and we don't include
check results in them (as Consul does), so the size is relatively bounded. We
should be able to safely reduce the size of the buffer by a factor of 10 or more
without forcing the template runner to make multiple read calls over the buffer.
Fixes: #18508
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.
Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.
Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.
Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* build: update to go1.21
* go: eliminate helpers in favor of min/max
* build: run go mod tidy
* build: swap depguard for semgrep
* command: fixup broken tls error check on go1.21
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.
Part 1 of N components as part of making workload identities consumable by third party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with 3rd parties, so this merge does not yet document this endpoint.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>