Commit Graph

1870 Commits

Author SHA1 Message Date
Michael Schurter
ba7694855d Merge pull request #11334 from hashicorp/f-chroot-skip-allocdir
client: never embed alloc_dir in chroot
2021-11-03 16:48:09 -07:00
Luiz Aoqui
89447fb42e Revert "Return SchedulerConfig instead of SchedulerConfigResponse struct (#10799)" (#11433) 2021-11-02 17:42:52 -04:00
Charlie Voiselle
4cfc6a0ed6 Making RPC Upgrade mode reloadable. (#11144)
- Making RPC Upgrade mode reloadable.
- Add suggestions from code review
- remove spurious comment
- switch to require(t,...) form for test.
- Add to changelog
2021-11-01 16:30:53 -04:00
Mahmood Ali
3ce89b75f4 logging: Log the cause behind agent startup failure (#11353)
Log the failure error when the agent fails to start. Previously, the
agent startup failure error would be emitted to the command UI but not
logged. So it doesn't get emitted to syslog or `log_file` if they are
set, and it makes debugging much harder. Also, logging the error again
before exit makes the error more visible: previously, the operator
needed to scroll to the top to find the error.

On a sample failure, the output will look like:
```
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from sample-configs/config-bad
==> Starting Nomad agent...
==> Error starting agent: setting up server node ID failed: mkdir /path-without-permission: read-only file system
    2021-10-20T14:38:51.179-0400 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-10-20T14:38:51.181-0400 [ERROR] agent: error starting agent: error="setting up server node ID failed: mkdir /path-without-permission: read-only file system"
```

This change adds the final `ERROR` message. It's easy to miss the `==>
Error starting agent` above.
2021-10-27 10:41:17 -07:00
Luiz Aoqui
796a91b0be prevent active log from being overwritten when agent starts (#11386) 2021-10-26 20:57:07 -04:00
Michael Schurter
37f053ff89 client: never embed alloc_dir in chroot
Fixes #2522

Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.

A bad agent configuration will look something like this:

```hcl
data_dir = "/tmp/nomad-badagent"

client {
  enabled = true

  chroot_env {
    # Note that the source matches the data_dir
    "/tmp/nomad-badagent" = "/ohno"
    # ...
  }
}
```

Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).
While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.

Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.

When running tests as root in a vm without the fix, the following error
occurs:

```
=== RUN   TestAllocDir_SkipAllocDir
    alloc_dir_test.go:520:
                Error Trace:    alloc_dir_test.go:520
                Error:          Received unexpected error:
                                Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
                Test:           TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```

Also removed unused Copy methods on AllocDir and TaskDir structs.

Thanks to @eveld for not letting me forget about this!
2021-10-18 09:22:01 -07:00
Luiz Aoqui
600bf12b75 Merge missing commits from 1.2.0-beta1 release branch (#11319) 2021-10-14 16:10:05 -04:00
Charlie Voiselle
8ba714e211 Return SchedulerConfig instead of SchedulerConfigResponse struct (#10799) 2021-10-13 21:23:13 -04:00
Michael Schurter
6a0dede9b6 Merge pull request #11167 from a-zagaevskiy/master
Support configurable dynamic port range
2021-10-13 16:47:38 -07:00
Michael Schurter
fc89835daf client: improve errors & tests for dynamic ports 2021-10-13 16:25:25 -07:00
Luiz Aoqui
713094ffb7 wrap log messages with hclog (#11291) 2021-10-12 14:38:44 -04:00
Aleksandr Zagaevskiy
0620bb04a5 fixup! Support configurable dynamic port range 2021-10-11 14:13:59 +03:00
Matt Mukerjee
0881b94201 Add FailoverHeartbeatTTL to config (#11127)
FailoverHeartbeatTTL is the amount of time to wait after a server leader failure
before considering reallocating client tasks. This TTL should be fairly long as
the new server leader needs to rebuild the entire heartbeat map for the
cluster. In deployments with a small number of machines, the default TTL (5m)
may be unnecessary long. Let's allow operators to configure this value in their
config files.
2021-10-06 18:48:12 -04:00
Luiz Aoqui
6cd62b3ef0 fix panic when Connect mesh gateway doesn't have a proxy block (#11257)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2021-10-04 15:52:07 -04:00
Mahmood Ali
6c414cd5f9 gofmt all the files
mostly to handle build directives in 1.17.
2021-10-01 10:14:28 -04:00
Michael Schurter
2968c01295 client: output reserved ports with min/max ports
Also add a little more min/max port testing and add the consts back that
had been removed: but unexported and as defaults.
2021-09-30 17:05:46 -07:00
Michael Schurter
33c91fd734 client: add NOMAD_LICENSE to default env deny list
By default we should not expose the NOMAD_LICENSE environment variable
to tasks.

Also refactor where the DefaultEnvDenyList lives so we don't have to
maintain 2 copies of it. Since client/config is the most obvious
location, keep a reference there to its unfortunate home buried deep
in command/agent/host. Since the agent uses this list as well for the
/agent/host endpoint the list must be accessible from both command/agent
and client.
2021-09-21 13:51:17 -07:00
James Rasell
e34fa583f9 allow configuration of Docker hostnames in bridge mode (#11173)
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.

In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.

The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
2021-09-16 08:13:09 +02:00
Aleksandr Zagaevskiy
e3b6f62198 Support configurable dynamic port range 2021-09-10 11:52:47 +03:00
James Rasell
3bffe443ac chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Luiz Aoqui
d74ab11d25 Don't timestamp active log file (#11070)
* don't timestamp active log file

* website: update log_file default value

* changelog: add entry for #11070

* website: add upgrade instructions for log_file in v1.14 and v1.2.0
2021-08-23 11:27:34 -04:00
Mahmood Ali
fdb8684004 Merge pull request #9160 from hashicorp/f-sysbatch
core: implement system batch scheduler
2021-08-16 09:30:24 -04:00
Michael Schurter
734f1d7eb6 Merge pull request #10848 from ggriffiths/listsnapshot_secrets
CSI Listsnapshot secrets support
2021-08-10 15:59:33 -07:00
Seth Hoenig
61ee443ee6 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Mahmood Ali
52c37e16aa Only initialize task.VolumeMounts when not-nil (#10990)
1.1.3 had a bug where task.VolumeMounts will be an empty slice instead of nil. Eventually, it gets canonicalized and is set to `nil`, but it seems to confuse dry-run planning.

The regression was introduced in https://github.com/hashicorp/nomad/pull/10855/files#diff-56b3c82fcbc857f8fb93a903f1610f6e6859b3610a4eddf92bad9ea27fdc85ecL1028-R1037 . Curiously, it's the only place where `len(apiTask.VolumeMounts)` check was dropped. I assume it was dropped accidentally.

Fixes #10981
2021-08-02 13:08:10 -04:00
Nomad Release bot
8c0c814099 Generate files for 1.1.3 release 2021-07-29 03:43:03 +00:00
Grant Griffiths
cba476eae6 CSI ListSnapshots secrets implementation
Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>
2021-07-28 11:30:29 -07:00
Mahmood Ali
428174ab55 Merge pull request #10875 from hashicorp/b-namespace-flag-override
cli: `-namespace` should override job namespace
2021-07-14 17:28:36 -04:00
Seth Hoenig
8ad8e1095a consul/connect: remove sidecar proxy before removing parent service
This PR will have Nomad de-register a sidecar proxy service before
attempting to de-register the parent service. Otherwise, Consul will
emit a warning and an error.

Fixes #10845
2021-07-08 13:30:19 -05:00
Seth Hoenig
72f431f4c4 Merge pull request #10872 from hashicorp/b-cc-regex-checkids
consul/connect: Avoid assumption of parent service when filtering connect proxies
2021-07-08 13:29:40 -05:00
Seth Hoenig
82f424585a consul/connect: improve regex from CR suggestions 2021-07-08 13:05:05 -05:00
Tim Gross
38af39e1b2 cli: -namespace should override job namespace
When a jobspec doesn't include a namespace, we provide it with the default
namespace, but this ends up overriding the explicit `-namespace` flag. This
changeset uses the same logic as region parsing to create an order of
precedence: the query string parameter (the `-namespace` flag) overrides the
API request body which overrides the jobspec.
2021-07-08 13:17:27 -04:00
Seth Hoenig
86283ad2dc consul/connect: Avoid assumption of parent service when filtering connect proxies
This PR uses regex-based matching for sidecar proxy services and checks when syncing
with Consul. Previously we would check if the parent of the sidecar was still being
tracked in Nomad. This is a false invariant - one which we must not depend when we
make #10845 work.

Fixes #10843
2021-07-08 09:43:41 -05:00
Mahmood Ali
0f0bcdb6ed Merge pull request #10806 from hashicorp/munda/idempotent-job-dispatch
Enforce idempotency of dispatched jobs using token on dispatch request
2021-07-08 10:23:31 -04:00
Tim Gross
b397edff4e cni: respect default cni_config_dir and cni_path (#10870)
The default agent configuration values were not set, which meant they were not
being set in the client configuration and this results in fingerprints failing
unless the values were set explicitly.
2021-07-08 09:56:57 -04:00
Alex Munda
2bd2f586f4 Set/parse idempotency_token query param 2021-07-07 16:26:55 -05:00
Seth Hoenig
421a6a8a7e consul: avoid extra sync operations when no action required
This PR makes it so the Consul sync logic will ignore operations that
do not specify an action to take (i.e. [de-]register [services|checks]).

Ideally such noops would be discarded at the callsites (i.e. users
of [Create|Update|Remove]Workload], but we can also be defensive
at the commit point.

Also adds 2 trace logging statements which are helpful for diagnosing
sync operations with Consul - when they happen and why.

Fixes #10797
2021-07-07 11:24:56 -05:00
Tim Gross
550fca9f83 csi: account for nil volume_mount in API-to-structs conversion (#10855)
Fix a nil pointer in the API struct to `nomad/structs` conversion when a
`volume_mount` block is empty.
2021-07-07 08:06:39 -04:00
Seth Hoenig
57fdb81433 consul: set task name only for group service checks
This PR fixes a bug introduced in a refactoring

https://github.com/hashicorp/nomad/pull/10764/files#diff-56b3c82fcbc857f8fb93a903f1610f6e6859b3610a4eddf92bad9ea27fdc85ec

where task level service checks would inherent the task name
field, when they shouldn't.

Fixes #10781
2021-06-18 12:16:27 -05:00
Seth Hoenig
7ba60b4e33 consul/connect: in-place update service definition when connect upstreams are modified
This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy
would not result Consul applying the changes, unless an additional change to
the job would trigger a task replacement (thus replacing the service definition).

The fix is to check if upstreams have been modified between Nomad's view of the
sidecar service definition, and the service definition for the sidecar that is
actually registered in Consul.

Fixes #8754
2021-06-16 16:48:26 -05:00
Seth Hoenig
b4a631c1c5 consul: make failures_before_critical and success_before_passing work with group services
This PR fixes some job submission plumbing to make sure the Consul Check parameters
- failure_before_critical
- success_before_passing

work with group-level services. They already work with task-level services.
2021-06-15 11:20:40 -05:00
James Rasell
a7d055a584 Merge pull request #10744 from hashicorp/b-remove-duplicate-imports
chore: remove duplicate import statements
2021-06-11 16:42:34 +02:00
James Rasell
a4156c3e94 tests: remove duplicate import statements. 2021-06-11 09:39:22 +02:00
Nomad Release bot
21465a592d Generate files for 1.1.1 release 2021-06-10 08:04:25 -04:00
Mahmood Ali
122a4cb844 tests: use standard library testing.TB
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
2021-06-09 16:18:45 -07:00
Isabel Suchanek
0edda116ad cli: add monitor flag to deployment status
Adding '-verbose' will print out the allocation information for the
deployment. This also changes the job run command so that it now blocks
until deployment is complete and adds timestamps to the output so that
it's more in line with the output of node drain.

This uses glint to print in place in running in a tty. Because glint
doesn't yet support cmd/powershell, Windows workflows use a different
library to print in place, which results in slightly different
formatting: 1) different margins, and 2) no spinner indicating
deployment in progress.
2021-06-09 16:18:45 -07:00
Seth Hoenig
9883b95ae4 consul: move consul acl tests into ent files
(cherry-pick ent back to oss)

This PR moves a lot of Consul ACL token validation tests into ent files,
so that we can verify correct behavior difference between OSS and ENT
Nomad versions.
2021-06-09 08:38:42 -05:00
Seth Hoenig
402b19c3b0 Merge pull request #10720 from hashicorp/f-cns-acl-check
consul: correctly check consul acl token namespace when using consul oss
2021-06-08 15:43:42 -05:00
Seth Hoenig
09c9a17a7f consul: correctly check consul acl token namespace when using consul oss
This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship
check when using Consul OSS (or Consul ENT without namespace support).

Nomad v1.1.0 introduced a regression where Nomad would fail the validation
when submitting Connect jobs and allow_unauthenticated set to true, with
Consul OSS - because it would do the namespace check against the Consul ACL
token assuming the "default" namespace, which does not work because Consul OSS
does not have namespaces.

Instead of making the bad assumption, expand the namespace check to handle
each special case explicitly.

Fixes #10718
2021-06-08 13:55:57 -05:00
Seth Hoenig
4b3ed53511 consul: pr cleanup namespace probe function signatures 2021-06-07 15:41:01 -05:00