No matter the passed region identifier, the CLI was always adding
"<role>.global.nomad" to the certificate DNS names. This is not
what we expect and has been removed.
While here, the long deprecated cluster-region flag has been
removed. This removal only impacts CLI functionality, so is safe
to do.
This change isolates all the code that deals with node selection in the
scheduler into its own package called feasible.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The server RPC handler and RPC connection pool both use a shared
configuration object for custom yamux configuration. Both
sub-systems were modifying the shared object which could cause a
data race. The passed object is now cloned before being modified.
This changes also moves where the yamux configuration is cloned
and modified to the relevant constructor function. This avoids
performing a clone per connection handle or per new connection
generated in the RPC pool.
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
In #24650 we switched to using ephemeral state for CNI plugins, so that when a
host reboots and we lose all the allocations we don't end up trying to use IPs
we created in network namespaces we just destroyed. Unfortunately upgrade
testing missed that in a non-reboot scenario, the existing CNI state was being
used by plugins like the ipam plugin to hand out the "next available" IP
address. So with no state carried over, we might allocate new addresses that
conflict with existing allocations. (This can be avoided by draining the node
first.)
As a compatibility shim, copy the old CNI state directory to the new CNI state
directory during agent startup, if the new CNI state directory doesn't already
exist.
Ref: https://github.com/hashicorp/nomad/pull/24650
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.
The diagram below illustrates the new logic:
SetVolumes() upsertAllocsImpl()
sets ns, job +-----------------checks if alloc requests
tg in the scheduler v sticky vols and consults
| +-----------------------+ state. If there is no claim,
| | TaskGroupVolumeClaim: | it creates one.
| | - namespace |
| | - jobID |
| | - tg name |
| | - vol ID |
v | uniquely identify vol |
hasVolumes() +----+------------------+
consults the state | ^
and returns true | | DeleteJobTxn()
if there's a match <-----------+ +---------------removes the claim from
or if there is no the state
previous claim
| | | |
+-----------------------------+ +------------------------------------------------------+
scheduler state store
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
This changeset implements the RPC handlers for Dynamic Host Volumes, including
the plumbing needed to forward requests to clients. The client-side
implementation is stubbed and will be done under a separate PR.
Ref: https://hashicorp.atlassian.net/browse/NET-11549
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
In Nomad 1.4, we implemented a root keyring to support encrypting Variables and
signing Workload Identities. The keyring was originally stored with the
AEAD-wrapped DEKs and the KEK together in a JSON keystore file on disk. We
recently added support for using an external KMS for the KEK to improve the
security model for the keyring. But we've encountered multiple instances of the
keystore files not getting backed up separately from the Raft snapshot,
resulting in failure to restore clusters from backup.
Move Nomad's root keyring into Raft (encrypted with a KMS/Vault where available)
in order to eliminate operational problems with the separate on-disk keystore.
Fixes: https://github.com/hashicorp/nomad/issues/23665
Ref: https://hashicorp.atlassian.net/browse/NET-10523
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `templates`, which need extra room to write a temporary file to the
secrets directory that gets renamed to the old file atomically.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
The TLS configuration object includes a deprecated `prefer_server_cipher_suites`
field. In version of Go prior to 1.17, this property controlled whether a TLS
connection would use the cipher suites preferred by the server or by the
client. This field is ignored as of 1.17 and, according to the `crypto/tls`
docs: "Servers now select the best mutually supported cipher suite based on
logic that takes into account inferred client hardware, server hardware, and
security."
This property has been long-deprecated and leaving it in place may lead to false
assumptions about how cipher suites are negotiated in connection to a server. So
we want to remove it in Nomad 1.9.0.
Fixes: https://github.com/hashicorp/nomad-enterprise/issues/999
Ref: https://hashicorp.atlassian.net/browse/NET-10531
The batch deregister RPC endpoint is only used by the internal
garbage collection process, it is not exposed via the HTTP API or
used anywhere else.
The GC process ensures that a job can only be removed from state
if all related evaluations and allocations are in a state that
means they can also be removed from state. This means that we do
not need to create evaluations when jobs are being deregistered
via this endpoint.
When `transparent_proxy` block is present and the network mode is `bridge`, use
a different CNI configuration that includes the `consul-cni` plugin. Before
invoking the CNI plugins, create a Consul SDK `iptables.Config` struct for the
allocation. This includes:
* Use all the `transparent_proxy` block fields
* The reserved ports are added to the inbound exclusion list so the alloc is
reachable from outside the mesh
* The `expose` blocks and `check` blocks with `expose=true` are added to the
inbound exclusion list so health checks work.
The `iptables.Config` is then passed as a CNI argument to the `consul-cni`
plugin.
Ref: https://github.com/hashicorp/nomad/issues/10628
Also add an explicit exit code to subproc package for when a child
process is instructed to run an unrunnable command (i.e. cannot be
found or is not executable) - with the 127 return code folks using bash
are familiar with
Replaces #18812
Upgraded with:
```
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/go-msgpack\/codec"/"github.com\/hashicorp\/go-msgpack\/v2\/codec/" '{}' ';'
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/net-rpc-msgpackrpc"/"github.com\/hashicorp\/net-rpc-msgpackrpc\/v2/" '{}' ';'
go get
go get -v -u github.com/hashicorp/raft-boltdb/v2
go get -v github.com/hashicorp/serf@5d32001edfaa18d1c010af65db707cdb38141e80
```
see https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0
for details
* exec: add a client.users configuration block
For now just add min/max dynamic user values; soon we can also absorb
the "user.denylist" and "user.checked_drivers" options from the
deprecated client.options map.
* give the no-op pool implementation a better name
* use explicit error types to make referencing them cleaner in tests
* use import alias to not shadow package name
* exec2: implement dynamic workload users taskrunner hook
This PR impelements a TR hook for allocating dynamic workload users from
a pool managed by the Nomad client. This adds a new task driver Capability,
DynamicWorkloadUsers - which a task driver must indicate in order to make
use of this feature.
The client config plumbing is coming in a followup PR - in the RFC we
realized having a client.users block would be nice to have, with some
additional unrelated options being moved from the deprecated client.options
config.
* learn to spell
* exec2: implement a dynamic users pool
This PR adds an implementation of a Pool from which dynamic users can
be allocated on behalf of tasks making use of an upcoming feature of
Nomad client (dynamic users).
A task hook and client plumbing, etc. will be in follow up PRs.
* no need for randomness assertion
The Nomad client renders templates in the same privileged process used for most
other client operations. During internal testing, we discovered that a malicious
task can create a symlink that can cause template rendering to read and write to
arbitrary files outside the allocation sandbox. Because the Nomad agent can be
restarted without restarting tasks, we can't simply check that the path is safe
at the time we write without encountering a time-of-check/time-of-use race.
To protect Nomad client hosts from this attack, we'll now read and write
templates in a subprocess:
* On Linux/Unix, this subprocess is sandboxed via chroot to the allocation
directory. This requires that Nomad is running as a privileged process. A
non-root Nomad agent will warn that it cannot sandbox the template renderer.
* On Windows, this process is sandboxed via a Windows AppContainer which has
been granted access to only to the allocation directory. This does not require
special privileges on Windows. (Creating symlinks in the first place can be
prevented by running workloads as non-Administrator or
non-ContainerAdministrator users.)
Both sandboxes cause encountered symlinks to be evaluated in the context of the
sandbox, which will result in a "file not found" or "access denied" error,
depending on the platform. This change will also require an update to
Consul-Template to allow callers to inject a custom `ReaderFunc` and
`RenderFunc`.
This design is intended as a workaround to allow us to fix this bug without
creating backwards compatibility issues for running tasks. A future version of
Nomad may introduce a read-only mount specifically for templates and artifacts
so that tasks cannot write into the same location that the Nomad agent is.
Fixes: https://github.com/hashicorp/nomad/issues/19888
Fixes: CVE-2024-1329
During allocation directory migration, the client was not checking that any
symlinks in the archive aren't pointing to somewhere outside the allocation
directory. While task driver sandboxing will protect against processes inside
the task from reading/writing thru the symlink, this doesn't protect against the
client itself from performing unintended operations outside the sandbox.
This changeset includes two changes:
* Update the archive unpacking to check the source of symlinks and require that
they fall within the sandbox.
* Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't
work for paths in the sibling directories of the sandbox directory. This bug
doesn't appear to be exploitable but caused errors in testing.
Fixes: https://github.com/hashicorp/nomad/issues/19887
The current implementation of the `nomad tls ca create` command
ovierrides the value of the `-domain` flag with `"nomad"` if no
additional customization is provided.
This results in a certificate for the wrong domain or an error if the
`-name-constraint` flag is also used.
THe logic for `IsCustom()` also seemed reversed. If all custom fields
are empty then the certificate is _not_ customized, so `IsCustom()`
should return false.
This PR refactors a helper function for getting the UID associated with
a given username to also return the GID and home directory. Also adds
unit tests on the known values of root and nobody user on Ubuntu Linux.
Some packages licensed under MPL-2.0 were incorrectly importing code
from packages licensed under BUSL-1.1.
Not all imports are fixed here as they will require additional work to
untangle them. To help track progress this commit adds a Semgrep rule
that detects incorrect BUSL-1.1 imports in MPL-2.0 packages.
The Nomad state store function was recently updated to validate
certain parameters, fixing a panic condition. This change meant
dummy FSM used for the snapshot state command was always failing
this validation and the command no longer worked.
This change adds the required parameter to pass validation and
therefore makes the CLI command functional again.
* Move group into a separate helper module for reuse
* Add shutdownCh to worker
The shutdown channel is used to signal that worker has stopped.
* Make server shutdown block on workers' shutdownCh
* Fix waiting for eval broker state change blocking indefinitely
There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.
This commit fixes it by unblocking if the notifier has been stopped.
* Bound the amount of time server shutdown waits on worker completion
* Fix lostcancel linter error
* Fix worker test using unexpected worker constructor
* Add changelog
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Nomad load all plugins from `plugin_dir` regardless if it is listed in
the agent configuration file. This can cause unexpected binaries to be
executed.
This commit begins the deprecation process of this behaviour. The Nomad
agent will emit a warning log for every plugin binary found without a
corresponding agent configuration block.
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.