Commit Graph

88 Commits

Author SHA1 Message Date
Conor Mongey
f7096fb9d6 docker: add cgroupns task config (#25927) 2025-06-11 13:50:44 -04:00
Allison Larson
d1d8945d2e Add docker plugin config option image_pull_timeout value for default timeout (#25489)
* Add docker plugin config image_pull_timeout value for default timeout

* Add image_pull_timeout docker plugin config to docs

* Add changelog
2025-03-24 13:03:14 -07:00
Piotr Kazmierczak
981ca36049 docker: use official client instead of fsouza/go-dockerclient (#23966)
This PR replaces fsouza/go-dockerclient 3rd party docker client library with
docker's official SDK.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-09-26 18:41:44 +02:00
Tim Gross
192d70cee7 docker: update infra_image to new registry (#23927)
The gcr.io container registry is shutting down in March. Update the default
`image_image` for Docker's "pause" containers to point to the new location
hosted by the k8s project.

Fixes: https://github.com/hashicorp/nomad/issues/23911
Ref: https://hashicorp.atlassian.net/browse/NET-10942
2024-09-06 14:34:03 -04:00
Tim Gross
6aa503f2bb docker: disable cpuset management for non-root clients (#23804)
Nomad clients manage a cpuset cgroup for each task to reserve or share CPU
cores. But Docker owns its own cgroups, and attempting to set a parent cgroup
that Nomad manages runs into conflicts with how runc manages cgroups via
systemd. Therefore Nomad must run as root in order for cpuset management to ever
be compatible with Docker.

However, some users running in unsupported configurations felt that the changes
we made in Nomad 1.7.0 to ensure Nomad was running correctly represented a
regression. This changeset disables cpuset management for non-root Nomad
clients. When running Nomad as non-root, the driver will not longer reconcile
cpusets with Nomad and `resources.cores` will behave incorrectly (but the driver
will still run).

Although this is one small step along the way to supporting a rootless Nomad
client, running Nomad as non-root is still unsupported. This PR is insufficient
by itself to have a secure and properly-working rootless Nomad client.

Ref: https://github.com/hashicorp/nomad/issues/18211
Ref: https://github.com/hashicorp/nomad/issues/13669
Ref: https://hashicorp.atlassian.net/browse/NET-10652
Ref: https://github.com/opencontainers/runc/blob/main/docs/systemd.md
2024-08-14 16:44:13 -04:00
Piotr Kazmierczak
0ece7b5c16 docker: validate that containers do not run as ContainerAdmin on Windows (#23443)
This enables checks for ContainerAdmin user on docker images on Windows. It's
only checked if users run docker with process isolation and not hyper-v,
because hyper-v provides its own, proper sandboxing.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-06-27 16:22:24 +02:00
Piotr Kazmierczak
0e8a67f0e1 docker: oom_score_adj support (#23297) 2024-06-12 10:49:20 +02:00
Piotr Kazmierczak
307fd590d7 docker: new container_exists_attempts configuration field (#22419)
This allows users to set a custom value of attempts that will be made to purge
an existing (not running) container if one is found during task creation.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-05-30 19:22:14 +02:00
Seth Hoenig
05937ab75b exec2: add client support for unveil filesystem isolation mode (#20115)
* exec2: add client support for unveil filesystem isolation mode

This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.

* actually create alloc-mounts-dir directory

* fix doc strings about alloc mount dir paths
2024-03-13 08:24:17 -05:00
Seth Hoenig
591394fb62 drivers: plumb hardware topology via grpc into drivers (#18504)
* drivers: plumb hardware topology via grpc into drivers

This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.

* cr: use enum instead of bool for core grade

* cr: fix test slit tables to be possible
2023-09-18 08:58:07 -05:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
Charlie Voiselle
3a687930bd Fix typos (#17962) 2023-07-18 13:31:36 -04:00
Seth Hoenig
ec4fa55bbf drivers/docker: refactor use of clients in docker driver (#17731)
* drivers/docker: refactor use of clients in docker driver

This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients
- one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short lived / async calls to docker. The other
has no timeout and is the responsibility of the caller to set a context
that will ensure the call eventually terminates.

The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.

This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.

Fixes #17023

* cr: followup items
2023-06-26 15:21:42 -05:00
KamilCuk
da9ec8ce1e Add group_add docker option (#17313) 2023-06-02 20:26:01 -04:00
Tim Gross
bf7b82b52b drivers: make internal DisableLogCollection capability public (#17196)
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.

This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.

Fixes: #14636 #15686
2023-05-16 09:16:03 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Michael Schurter
db2325b17a docker: default device.container_path to host_path (#16811)
* docker: default device.container_path to host_path

Matches docker cli behavior.

Fixes #16754
2023-04-06 14:44:33 -07:00
Tim Gross
03a90f497a docker: move pause container recovery to after SetConfig (#16713)
When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.

Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
2023-03-29 16:20:37 -04:00
Nick Wales
69fd1a0e4b docker: add option for Windows isolation modes (#15819) 2023-01-24 16:31:48 -05:00
Seth Hoenig
0bd42a501c docker: create a docker task config setting for disable built-in healthcheck
This PR adds a docker driver task configuration setting for turning off
built-in HEALTHCHECK of a container.

References)
https://docs.docker.com/engine/reference/builder/#healthcheck
https://github.com/docker/engine-api/blob/master/types/container/config.go#L16

Closes #5310
Closes #14068
2022-08-11 10:33:48 -05:00
Seth Hoenig
5da1a31e94 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Shishir
f2a37a0a03 Add support for setting pids_limit in docker plugin config. (#11526) 2021-12-21 13:31:34 -05:00
Shishir Mahajan
479442e682 Add support for --init to docker driver.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
2021-10-15 12:53:25 -07:00
Seth Hoenig
17ec5a5aa8 drivers: fixup linux version dependent test cases
The error output being checked depends on the linux caps supported
by the particular operating system. Fix these test cases to just
check that an error did occur.
2021-05-17 12:37:40 -06:00
Seth Hoenig
191144c3bf drivers/exec: enable setting allow_caps on exec driver
This PR enables setting allow_caps on the exec driver
plugin configuration, as well as cap_add and cap_drop in
exec task configuration. These options replicate the
functionality already present in the docker task driver.

Important: this change also reduces the default set of
capabilities enabled by the exec driver to match the
default set enabled by the docker driver. Until v1.0.5
the exec task driver would enable all capabilities supported
by the operating system. v1.0.5 removed NET_RAW from that
list of default capabilities, but left may others which
could potentially also be leveraged by compromised tasks.

Important: the "root" user is still special cased when
used with the exec driver. Older versions of Nomad enabled
enabled all capabilities supported by the operating system
for tasks set with the root user. To maintain compatibility
with existing clusters we continue supporting this "feature",
however we maintain support for the legacy set of capabilities
rather than enabling all capabilities now supported on modern
operating systems.
2021-05-17 12:37:40 -06:00
Seth Hoenig
003d68fe6d drivers/docker+exec+java: disable net_raw capability by default
The default Linux Capabilities set enabled by the docker, exec, and
java task drivers includes CAP_NET_RAW (for making ping just work),
which has the side affect of opening an ARP DoS/MiTM attack between
tasks using bridge networking on the same host network.

https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities

This PR disables CAP_NET_RAW for the docker, exec, and java task
drivers. The previous behavior can be restored for docker using the
allow_caps docker plugin configuration option.

A future version of nomad will enable similar configurability for the
exec and java task drivers.
2021-05-12 13:22:09 -07:00
Florian Apolloner
8b3ea4ea9a docker: support configuring default log driver in plugin options 2021-03-12 16:04:33 -05:00
Adrian Todorov
2748d2a895 driver/docker: add extra labels ( job name, task and task group name) 2021-03-08 08:59:52 -05:00
Tim Gross
153522f47f prefer TrimPrefix to checking HasPrefix first 2021-01-22 13:41:28 -05:00
Huan Wang
c4d8bc49c9 fix the inconsistency handling between infra image and normal task image 2021-01-22 13:41:28 -05:00
Mahmood Ali
8879645ab9 docker: introduce a new hcl2-friendly mount syntax (#9635)
Introduce a new more-block friendly syntax for specifying mounts with a new `mount` block type with the target as label:

```hcl
config {
  image = "..."

  mount {
    type = "..."
    target = "target-path"
    volume_options { ... }
  }
}
```

The main benefit here is that by `mount` being a block, it can nest blocks and avoids the compatibility problems noted in https://github.com/hashicorp/nomad/pull/9634/files#diff-2161d829655a3a36ba2d916023e4eec125b9bd22873493c1c2e5e3f7ba92c691R128-R155 .

The intention is for us to promote this `mount` blocks and quietly deprecate the `mounts` type, while still honoring to preserve compatibility as much as we could.

This addresses the issue in https://github.com/hashicorp/nomad/issues/9604 .
2020-12-15 14:13:50 -05:00
Shishir Mahajan
161bec4ada Add cpuset_cpus to docker driver. 2020-11-11 12:30:00 -08:00
Tim Gross
306cfabb62 docker: disallow volume mounts from host by default (#9321)
The default behavior for `docker.volumes.enabled` is intended to be `false`,
but the HCL schema defaults to `true` if the value is unset. Set the default
literal value to `true`.

Additionally, Docker driver mounts of type "volume" (but not "bind") are not
being properly sandboxed with that setting. Disable Docker mounts with type
"volume" entirely whenever the `docker.volumes.enabled` flag is set to
false. Note this is unrelated to the `volume_mount` feature, which is
constrained to preconfigured host volumes or whatever is mounted by a CSI
plugin.

This changeset includes updates to unit tests that should have been failing
under the documented behavior but were not.
2020-11-11 10:03:46 -05:00
Tim Gross
a9979db1d4 docker: image_delay default missing without gc stanza (#9101)
In the Docker driver plugin config for garbage collection, the `image_delay`
field was missing from the default we set if the entire `gc` stanza is
missing. This results in a default of 0s and immediate GC of Docker images.

Expanded docker gc config test fields.
2020-10-15 12:36:01 -04:00
Michael Schurter
f44c04ecd1 s/0.13/1.0/g
1.0 here we come!
2020-10-14 15:17:47 -07:00
Yoan Blanc
c14c616194 use allow/deny instead of the colored alternatives (#9019)
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-10-12 08:47:05 -04:00
Seth Hoenig
188a604ce3 drivers/docker: detect arch for default infra_image
The 'docker.config.infra_image' would default to an amd64 container.
It is possible to reference the correct image for a platform using
the `runtime.GOARCH` variable, eliminating the need to explicitly set
the `infra_image` on non-amd64 platforms.

Also upgrade to Google's pause container version 3.1 from 3.0, which
includes some enhancements around process management.

Fixes #8926
2020-09-23 13:54:30 -05:00
James Rasell
4f39d161ed Merge pull request #8589 from hashicorp/f-gh-5718
driver/docker: allow configurable pull context timeout setting.
2020-08-14 16:07:59 +02:00
James Rasell
a40a14064a driver/docker: allow configurable pull context timeout setting.
Pulling large docker containers can take longer than the default
context timeout. Without a way to change this it is very hard for
users to utilise Nomad properly without hacky work arounds.

This change adds an optional pull_timeout config parameter which
gives operators the possibility to account for increase pull times
where needed. The infra docker image also has the option to set a
custom timeout to keep consistency.
2020-08-12 08:58:07 +01:00
Nick Ethier
c11dbcd001 docker: support group allocated ports and host_networks (#8623)
* docker: support group allocated ports

* docker: add new ports driver config to specify which group ports are mapped

* docker: update port mapping docs
2020-08-11 18:30:22 -04:00
Mahmood Ali
ada8a39d20 docker: disable host volume binding by default 2020-06-23 13:43:37 -04:00
Seth Hoenig
186aa59fc9 driver/docker: enable setting hard/soft memory limits
Fixes #2093

Enable configuring `memory_hard_limit` in the docker config stanza for tasks.
If set, this field will be passed to the container runtime as `--memory`, and
the `memory` configuration from the task resource configuration will be passed
as `--memory_reservation`, creating hard and soft memory limits for tasks using
the docker task driver.
2020-06-01 09:22:45 -05:00
Mahmood Ali
5de673e6f2 tests: don't delete images after tests complete
Fix some docker test flakiness where image cleanup process may
contaminate other tests. A clean up process may attempt to delete an
image while it's used by another test.
2020-05-26 18:53:24 -04:00
Mahmood Ali
d6c75e301e cleanup driver eventor goroutines
This fixes few cases where driver eventor goroutines are leaked during
normal operations, but especially so in tests.

This change makes few modifications:

First, it switches drivers to use `Context`s to manage shutdown events.
Previously, it relied on callers invoking `.Shutdown()` function that is
specific to internal drivers only and require casting.  Using `Contexts`
provide a consistent idiomatic way to manage lifecycle for both internal
and external drivers.

Also, I discovered few places where we don't clean up a temporary driver
instance in the plugin catalog code, where we dispense a driver to
inspect and validate the schema config without properly cleaning it up.
2020-05-26 11:04:04 -04:00
Tim Gross
8860b72bc3 volumes: return better error messages for unsupported task drivers (#8030)
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.

This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.

Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.

Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
2020-05-21 09:18:02 -04:00
Mahmood Ali
9aa10cc4cd use allow_runtimes for consistency
Other allow lists use allow_ prefix (e.g. allow_caps, allow_privileged).
2020-05-12 11:03:08 -04:00
Mahmood Ali
88ec571375 Add a knob to restrict docker runtimes 2020-05-12 10:14:43 -04:00
Ben Buzbee
3bd675ea9d Rename OCIRuntime to Runtime; allow gpu conflicts is they are the same runtime; add conflict test 2020-04-03 12:15:11 -07:00
Ben Buzbee
5f58ac619a Support custom docker runtimes
This enables customers who want to use gvisor and have it configured on their clients.
2020-04-03 11:07:37 -07:00