Resolves symlink escape when unarchiving by removing existing paths within the same allocation directory which can occur by writing a header that points to a symlink that lives outside of the sandbox environment. This exploit requires first compromising the Nomad client agent at the source allocation.
Ref: https://hashicorp.atlassian.net/browse/NET-10607
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1725
* exec2: add client support for unveil filesystem isolation mode
This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.
* actually create alloc-mounts-dir directory
* fix doc strings about alloc mount dir paths
During allocation directory migration, the client was not checking that any
symlinks in the archive aren't pointing to somewhere outside the allocation
directory. While task driver sandboxing will protect against processes inside
the task from reading/writing thru the symlink, this doesn't protect against the
client itself from performing unintended operations outside the sandbox.
This changeset includes two changes:
* Update the archive unpacking to check the source of symlinks and require that
they fall within the sandbox.
* Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't
work for paths in the sibling directories of the sandbox directory. This bug
doesn't appear to be exploitable but caused errors in testing.
Fixes: https://github.com/hashicorp/nomad/issues/19887
* client/allocdir: use an interface in place of AllocDir structs
This PR replace *allocdir.AllocDir with allocdir.Interface such that we
may eventually have another implementation of alloc directories. This is
in support of the exec2 driver, which will need an implementation of the
alloc directory incompatibile with the current version.
* use rlock
When waiting on a previous alloc we must query against the leader before
switching to a stale query with index set.
Also check to ensure the response is fresh before using it like #18269
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.
There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.
For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!
There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.
Fixes: #18034
Tools like `nomad-nodesim` are unable to implement a minimal implementation of
an allocrunner so that we can test the client communication without having to
lug around the entire allocrunner/taskrunner code base. The allocrunner was
implemented with an interface specifically for this purpose, but there were
circular imports that made it challenging to use in practice.
Move the AllocRunner interface into an inner package and provide a factory
function type. Provide a minimal test that exercises the new function so that
consumers have some idea of what the minimum implementation required is.
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.
The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!
Update so that `migrate=true` implies `sticky=true` as follows:
* The failure mode when client ACLs are enabled comes from the server not passing
along a migration token. Update the server so that the server provides a
migration token whenever `migrate=true` and not just when `sticky=true` too.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate.
* Refactor the E2E tests to move them off the old framework and make the intention
of the test more clear.
Fixes#10200
**The bug**
A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:
```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```
The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high throughput batch scheduling
heavy systems.
However the RPC error made no sense! `previous_alloc` in the logs was a
valid 36 character UUID!
**The fix**
The fix is:
```
- prevAllocID: c.Alloc.PreviousAllocation,
+ prevAllocID: watchedAllocID,
```
The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!
**The repro**
I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.
First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.
However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
Fixes#2522
Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.
A bad agent configuration will look something like this:
```hcl
data_dir = "/tmp/nomad-badagent"
client {
enabled = true
chroot_env {
# Note that the source matches the data_dir
"/tmp/nomad-badagent" = "/ohno"
# ...
}
}
```
Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).
While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.
Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.
When running tests as root in a vm without the fix, the following error
occurs:
```
=== RUN TestAllocDir_SkipAllocDir
alloc_dir_test.go:520:
Error Trace: alloc_dir_test.go:520
Error: Received unexpected error:
Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
Test: TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```
Also removed unused Copy methods on AllocDir and TaskDir structs.
Thanks to @eveld for not letting me forget about this!
* Stopping an alloc is implemented via Updates but update hooks are
*not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
its own package to avoid cycles. Now only depends on alloc broadcaster
instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
latest version.
* Made AllocDir safe for concurrent use
Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.