nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-05 09:55:44 +03:00

Author	SHA1	Message	Date
Mahmood Ali	87c0c92ac7	Pass stats interval collection to executor This fixes a bug where executor based drivers emit stats every second, regardless of user configuration. When serializing the Stats request across grpc, the nomad agent dropped the Interval value, and then executor uses 1s as a default value.	2020-01-31 14:17:15 -05:00
John Schlederer	81592734b5	Making pull activity timeout configurable in Docker * Making pull activity timeout configurable in Docker plugin config, first pass * Fixing broken function call * Fixing broken tests * Fixing linter suggestion * Adding documentation on new parameter in Docker plugin config * Adding unit test * Setting min value for pull_activity_timeout, making pull activity duration a private var	2019-12-18 12:58:53 +01:00
Mahmood Ali	20f8227c0a	Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob driver: allow disabling log collection	2019-12-13 11:41:20 -05:00
Mahmood Ali	e82dad732b	address review comments	2019-12-13 11:21:00 -05:00
Mahmood Ali	f794b49ec6	simplify cgroup path lookup	2019-12-11 12:43:25 -05:00
Mahmood Ali	596d0be5d8	executor: stop joining executor to container cgroup Stop joining libcontainer executor process into the newly created task container cgroup, to ensure that the cgroups are fully destroyed on shutdown, and to make it consistent with other plugin processes. Previously, executor process is added to the container cgroup so the executor process resources get aggregated along with user processes in our metric aggregation. However, adding executor process to container cgroup adds some complications without much benefit: First, it complicates cleanup. We must ensure that the executor is removed from container cgroup on shutdown. Though, we had a bug where we missed removing it from the systemd cgroup. Because executor uses `containerState.CgroupPaths` on launch, which includes systemd, but `cgroups.GetAllSubsystems` which doesn't. Second, it may have adverse side-effects. When a user process is cpu bound or uses too much memory, executor should remain functioning without risk of being killed (by OOM killer) or throttled. Third, it is inconsistent with other drivers and plugins. Logmon and DockerLogger processes aren't in the task cgroups. Neither are containerd processes, though it is equivalent to executor in responsibility. Fourth, in my experience when executor process moves cgroup while it's running, the cgroup aggregation is odd. The cgroup `memory.usage_in_bytes` doesn't seem to capture the full memory usage of the executor process and becomes a red herring when investigating memory issues. For all the reasons above, I opted to have executor remain in nomad agent cgroup and we can revisit this when we have a better story for plugin process cgroup management.	2019-12-11 11:28:09 -05:00
Mahmood Ali	2f4b9da61a	drivers/exec: test all cgroups are destroyed	2019-12-11 11:12:29 -05:00
Seth Hoenig	94c60b4cfa	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Mahmood Ali	943854469d	driver: allow disabling log collection Operators commonly have docker logs aggregated using various tools and don't need nomad to manage their docker logs. Worse, Nomad uses a somewhat heavy docker api call to collect them and it seems to cause problems when a client runs hundreds of log collections. Here we add a knob to disable log aggregation completely for nomad. When log collection is disabled, we avoid running logmon and docker_logger for the docker tasks in this implementation. The downside here is once disabled, `nomad logs ...` commands and API no longer return logs and operators must correlate alloc-ids with their aggregated log info. This is meant as a stop gap measure. Ideally, we'd follow up with at least two changes: First, we should optimize behavior when we can such that operators don't need to disable docker log collection. Potentially by reverting to using pre-0.9 syslog aggregation in linux environments, though with different trade-offs. Second, when/if logs are disabled, nomad logs endpoints should lookup docker logs api on demand. This ensures that the cost of log collection is paid sparingly.	2019-12-08 14:15:03 -05:00
Mahmood Ali	ac9547e6b2	drivers: always initialize taskHandle.logger Looks like the RecoverTask doesn't set taskHandle.logger field causing a panic when the handle attempts to log (e.g. when Shutdown or Signaling fails).	2019-11-22 10:44:59 -05:00
Nick Ethier	ac239a3f0b	docker: set default cpu cfs period (#6737) * docker: set default cpu cfs period Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-11-19 19:05:15 -05:00
Mahmood Ali	bdef161e20	changelog and comment	2019-11-19 15:51:08 -05:00
Mahmood Ali	6878134a7f	always destroy	2019-11-18 21:31:29 -05:00
Mahmood Ali	a15bdc130d	Add tests for orphaned processes	2019-11-18 21:31:29 -05:00
Tim Gross	b9eaf6119e	remove misleading networking log line (#6588) When a job has a task group network, this log line ends up being misleading if you're trying to debug networking issues. We really only care about this when there's no port map set, in which case we get the error returned anyways.	2019-10-30 13:23:33 -04:00
Mahmood Ali	00a0be0df1	docs: Docker driver supports task user option Also, add a test case.	2019-10-24 14:00:37 -04:00
Mahmood Ali	95fe2cd805	driver/docker: ensure that defaults are populated Looks like we may need to pass the default literal at each layer so that defaults are set properly.	2019-10-18 18:27:28 -04:00
Mahmood Ali	c64647c218	add timeouts for docker reconciler docker calls	2019-10-18 15:31:13 -04:00
Mahmood Ali	04a2e05994	only set a single label for now Other labels aren't strictly necessary here, and we may follow up with a better way to customize.	2019-10-18 15:31:13 -04:00
Mahmood Ali	487b0d8349	Only start reconciler once in main driver driver.SetConfig is not appropriate for starting up reconciler goroutine. Some ephemeral driver instances are created for validating config and we ought not to start side-effecting goroutines for those. We currently lack a lifecycle hook to inject these, so I picked the `Fingerprinter` function for now, and reconciler should only run after fingerprinter started. Use `sync.Once` to ensure that we only start reconciler loop once.	2019-10-18 14:43:23 -04:00
Mahmood Ali	8c3136a666	docker label refactoring and additional tests	2019-10-17 10:45:13 -04:00
Mahmood Ali	ef4465dfa4	add docker labels	2019-10-17 10:45:12 -04:00
Mahmood Ali	24f6c2bf07	refactor reconciler code and address comments	2019-10-17 09:42:23 -04:00
Mahmood Ali	c8ba2d1b86	address code review comments	2019-10-17 08:36:02 -04:00
Mahmood Ali	3bf0ae995a	docker: explicit grace period for initial container reconciliation Ensure we wait for some grace period before killing docker containers that may have launched in earlier nomad restore.	2019-10-17 08:36:02 -04:00
Mahmood Ali	911d17e3ee	docker: periodically reconcile containers When running at scale, it's possible that Docker Engine starts containers successfully but gets wedged in a way where API call fails. The Docker Engine may remain unavailable for an arbitrarily long time. Here, we introduce a periodic reconciliation process that ensures that any container started by nomad is tracked, and killed if it is running unexpectedly. Basically, the periodic job inspects any container that isn't tracked in its handlers. A creation grace period is used to prevent killing newly created containers that aren't registered yet. Also, we aim to avoid killing unrelated containers started by host or through raw_exec drivers. The logic is to pattern against containers environment variables and mounts to infer if they are an alloc docker container. Lastly, the periodic job can be disabled to avoid any interference if need be.	2019-10-17 08:36:01 -04:00
Danielle Lancashire	afb59bedf5	volumes: Add support for mount propagation This commit introduces support for configuring mount propagation when mounting volumes with the `volume_mount` stanza on Linux targets. Similar to Kubernetes, we expose 3 options for configuring mount propagation: - private, which is equivalent to `rprivate` on Linux, which does not allow the container to see any new nested mounts after the chroot was created. - host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts that have been created _outside of the container_ to be visible inside the container after the chroot is created. - bidirectional, which is equivalent to `rshared` on Linux, which allows both the container to see new mounts created on the host, but importantly _allows the container to create mounts that are visible in other containers and on the host_. private and host-to-task are safe, but bidirectional mounts can be dangerous, as if the code inside a container creates a mount, and does not clean it up before tearing down the container, it can cause bad things to happen inside the kernel. To add a layer of safety here, we require that the user has ReadWrite permissions on the volume before allowing bidirectional mounts, as a defense in depth / validation case, although creating mounts should also require a privileged execution environment inside the container.	2019-10-14 14:09:58 +02:00
Nick Ethier	56fb3de0ed	executor: run exec commands in netns if set (#6405) executor: run exec commands in netns if set	2019-10-01 14:45:43 -04:00
Nick Ethier	149578ca1e	executor: rename wrapNetns to withNetworkIsolation	2019-09-30 21:38:31 -04:00
Nick Ethier	e6ce6d2c2b	comment wrapNetns	2019-09-30 12:06:52 -04:00
Nick Ethier	159b911820	executor: removed unused field from exec_utils.go	2019-09-30 11:57:34 -04:00
Nick Ethier	2f16eb9640	executor: run exec commands in netns if set	2019-09-30 11:50:22 -04:00
Tim Gross	d94e301219	driver/java: pass task network isolation to executor Without passing the network isolation configuration to the executor, java tasks are not placed in the same network namespace as the other processes in their task group, which breaks Consul Connect.	2019-09-27 08:26:54 -04:00
Tim Gross	e17901d667	driver/networking: don't recreate existing network namespaces	2019-09-25 14:58:17 -04:00
Nick Ethier	c36fe98198	driver: set correct network isolation caps for exec and java dr… (#6368)	2019-09-25 11:48:14 -04:00
rpramodd	a555ce8686	utils: add missing error info in case of cmd failure (#6355)	2019-09-24 09:33:27 -04:00
Mahmood Ali	8c29de2032	docker: remove containers on creation failures The docker creation API calls may fail with http errors (e.g. timeout) even if container was successfully created. Here, we force remove container if we got unexpected failure. We already do this in some error handlers, and this commit updates all paths. I stopped short from a more aggressive refactoring, as the code is ripe for refactoring and would rather do that in another PR.	2019-09-18 08:45:59 -04:00
Mahmood Ali	b5b445c101	add exponential backoff for docker api calls	2019-09-18 08:12:54 -04:00
Mahmood Ali	d5051687b8	retry transient docker errors within function	2019-09-13 15:25:31 -04:00
Mahmood Ali	2f47a6d86c	docker: defensive against failed starts This handles a bug where we may start a container successfully, yet we fail due to retries and startContainer not being idempotent call. Here, we ensure that when starting a container fails with 500 error, the retry succeeds if container was started successfully.	2019-09-13 13:02:35 -04:00
Mahmood Ali	bd6bbc9ca8	fix qemu and update docker with tests	2019-09-04 11:27:51 -04:00
Jasmine Dahilig	6190443d79	fix portmap envvars in docker driver	2019-09-04 11:26:13 -04:00
Michael Schurter	f02c163532	Merge pull request #6000 from Iqoqo/docker-convert-host-paths-to-host-native driver/docker: convert host bind path to os native	2019-09-03 09:34:56 -07:00
Danielle Lancashire	86838dbc02	docker: Fix driver spec hclspec.NewLiteral does not quote its values, which caused `3m` to be parsed as a nonsensical literal which broke the plugin loader during initialization. By quoting the value here, it starts correctly.	2019-09-03 08:53:37 +02:00
Zhiguang Wang	e7eede5f74	Add default value "3m" to image_delay, making it consistent with docs.	2019-09-02 16:40:00 +08:00
Mahmood Ali	8b688cc70e	tests: enable raw_exec driver	2019-08-29 20:26:50 -04:00
Mahmood Ali	e14619da45	raw_exec: be defensive when disabled Ensure that no raw_exec task can run on a client where it's disabled, even if a flaw led to the client being assigned a raw_exec task unexpectedly.	2019-08-29 09:09:40 -04:00
Danielle Lancashire	a921c21c8e	docker: Fix issue where an exec may never timeout	2019-08-16 15:40:03 +02:00
Michael Schurter	f189f1f250	docker: reword FromSlash(hostPath) comment	2019-08-12 14:38:31 -07:00
ilya guterman	0f47a7daba	Update utils.go	2019-08-12 19:31:34 +03:00

507 Commits