25643 Commits

Author SHA1 Message Date
Seth Hoenig
286dce7a2a exec2: add a client.users configuration block (#20093)
* exec: add a client.users configuration block

For now just add min/max dynamic user values; soon we can also absorb
the "user.denylist" and "user.checked_drivers" options from the
deprecated client.options map.

* give the no-op pool implementation a better name

* use explicit error types to make referencing them cleaner in tests

* use import alias to not shadow package name
2024-03-08 16:02:32 -06:00
Giovanni Avelar
26a27bb12c cli: add -json option on jobs status command (#18925) 2024-03-08 16:03:52 -05:00
Luke Kysow
9c3bbd191a Bump consul-template to 0.37.2 (#20105) 2024-03-08 14:56:35 -05:00
Tim Gross
ac366521f2 deps: upgrade protobuf lib to 1.33.0 (#20100)
Although Nomad is not vulnerable to CVE-2024-24786 because it's configured to
discard unknown messages during unmarshaling, we should upgrade so that
third-party vulnerability scanners don't detect the vulnerable version and
complain.

Also update go1.22.1 changelog entry to include CVEs
2024-03-08 10:55:55 -05:00
Seth Hoenig
2c1f5daad7 more test refactoring (#20092)
* tests: swap testify for test in client/config

* tests: swap testify for test in logmon/
2024-03-07 11:04:16 -06:00
Michael Schurter
3193ac204f docs: skipping a major release is fine (#20075)
Nomad has always placed an extremely high priority on backward
compatibility. We have always aimed to support N-2 major releases and
usually gone above and beyond that.

The new https://www.hashicorp.com/long-term-support policy also mentions
that N-2 is what we have always supported, so it's probably time for our
docs to reflect that reality.
2024-03-06 08:57:12 -08:00
Michael Schurter
82fe2b5df6 docs: fix s/port-plan-failure (#20079)
Fixes #20070
2024-03-06 08:56:31 -08:00
Seth Hoenig
55b0795866 build: upgrade to go1.22 (#20066)
* build: upgrade to go1.22

* add cl

* build: use codecgen from go-msgpack v1.1.5+base32 and stringer 0.18.0

for compatibility with go1.22

* ci: update golangci-lint to 1.56.2

* build: update hclogvet for go1.22

* build: bump to go1.22.1
2024-03-06 09:54:04 -06:00
Seth Hoenig
67554b8f91 exec2: implement dynamic workload users taskrunner hook (#20069)
* exec2: implement dynamic workload users taskrunner hook

This PR implements a TR hook for allocating dynamic workload users from
a pool managed by the Nomad client. This adds a new task driver Capability,
DynamicWorkloadUsers - which a task driver must indicate in order to make
use of this feature.

The client config plumbing is coming in a followup PR - in the RFC we
realized having a client.users block would be nice to have, with some
additional unrelated options being moved from the deprecated client.options
config.

* learn to spell
2024-03-06 09:34:27 -06:00
Mark Johnston
3e7191ccb7 Fix wording of ACL error message (#20071)
When creating a job ACL, you must supply a job ID if you supply a
namespace.  If you try to give a namespace without a job ID, the error
states "JobACL.JobID without Namespace" instead.
2024-03-05 16:49:28 -08:00
Phil Renaud
7820df53ca [ui] Percy Stabilization (#20061)
* Some actions and inline-chart stabilization

* Weird little semicolon, are you my undoing?
2024-03-05 08:49:58 -05:00
Seth Hoenig
57bd39061b exec2: implement a dynamic users pool (#20065)
* exec2: implement a dynamic users pool

This PR adds an implementation of a Pool from which dynamic users can
be allocated on behalf of tasks making use of an upcoming feature of
Nomad client (dynamic users).

A task hook and client plumbing, etc. will be in follow up PRs.

* no need for randomness assertion
2024-03-05 07:35:20 -06:00
Seth Hoenig
06a4fcb7d5 build: update the actions/checkout version (#20067) 2024-03-04 13:01:38 -06:00
Soren L. Hansen
96acddbc13 Avoid NPE in nomad/command/job_restart.go (#20049)
stopAlloc() checks if an allocation represents a system job like this:
```
  if *alloc.Job.Type == api.JobTypeSystem {
    ...
  }
```

This caused the cli to crash:
```
==> 2024-02-29T08:45:53+01:00: Restarting 2 allocations
    2024-02-29T08:45:54+01:00: Rescheduling allocation "6a9da11a" for group "redacted-group"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x20 pc=0x10686affc]

goroutine 36 [running]:
github.com/hashicorp/nomad/command.(*JobRestartCommand).stopAlloc(0x14000b11040, {0x14000996dc0?, 0x0?})
        github.com/hashicorp/nomad/command/job_restart.go:968 +0x25c
github.com/hashicorp/nomad/command.(*JobRestartCommand).handleAlloc(0x14000b11040, {0x14000996dc0?, 0x0?})
        github.com/hashicorp/nomad/command/job_restart.go:868 +0x34
github.com/hashicorp/nomad/command.(*JobRestartCommand).Run.(*JobRestartCommand).Run.func1.func2()
        github.com/hashicorp/nomad/command/job_restart.go:392 +0x28
github.com/hashicorp/go-multierror.(*Group).Go.func1()
        github.com/hashicorp/go-multierror@v1.1.1/group.go:23 +0x60
created by github.com/hashicorp/go-multierror.(*Group).Go in goroutine 1
        github.com/hashicorp/go-multierror@v1.1.1/group.go:20 +0x84
```

Attaching a debugger revealed that `alloc.Job` was set, but
`alloc.Job.Type` was nil. After guarding the `.Type` check with a
`alloc.Job.Type != nil`, it still crashed. This time, `alloc.Job` was
nil.

I was scrambling to get the job running again, so I didn't have the
opportunity to find out why those values were nil, but this change
ensures the CLI does not crash in these situations.
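
The kind of guard described above can be sketched as follows. `Job` and `Allocation` are simplified stand-ins for the `api` package types, and `isSystemJob` is a hypothetical helper, not the function changed in the commit.

```go
package main

import "fmt"

// Job and Allocation are simplified stand-ins for the api package types;
// only the fields needed for the illustration are included.
type Job struct {
	Type *string
}

type Allocation struct {
	Job *Job
}

// jobTypeSystem stands in for api.JobTypeSystem.
const jobTypeSystem = "system"

// isSystemJob reports whether the allocation belongs to a system job.
// A missing Job or a missing Type is treated as "not a system job"
// instead of being dereferenced.
func isSystemJob(alloc *Allocation) bool {
	if alloc == nil || alloc.Job == nil || alloc.Job.Type == nil {
		return false
	}
	return *alloc.Job.Type == jobTypeSystem
}

func main() {
	system := jobTypeSystem
	fmt.Println(isSystemJob(&Allocation{Job: &Job{Type: &system}}))
	fmt.Println(isSystemJob(&Allocation{Job: &Job{}})) // Type is nil
	fmt.Println(isSystemJob(&Allocation{}))            // Job is nil
}
```

Checking both pointers in one place covers the two crashes seen in the debugger: a nil `Type` and a nil `Job`.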

Fixes #20048
2024-03-01 08:07:28 -06:00
Seth Hoenig
a66f7ba888 ci: update macos runners to macos-14 (apple silicon) (#20054) 2024-02-29 14:31:59 -06:00
Seth Hoenig
4d83733909 tests: swap testify for test in more places (#20028)
* tests: swap testify for test in plugins/csi/client_test.go

* tests: swap testify for test in testutil/

* tests: swap testify for test in host_test.go

* tests: swap testify for test in plugin_test.go

* tests: swap testify for test in utils_test.go

* tests: swap testify for test in scheduler/

* tests: swap testify for test in parse_test.go

* tests: swap testify for test in attribute_test.go

* tests: swap testify for test in plugins/drivers/

* tests: swap testify for test in command/

* tests: fixup some test usages

* go: run go mod tidy

* windows: cpuset test only on linux
2024-02-29 12:11:35 -06:00
Phil Renaud
c2fe51bf11 Fixes an issue where shift+num would not open an eval on the evaluations index table (#20047) 2024-02-29 11:25:52 -06:00
James Rasell
8f3f2a8c5c docs: fix autoscaler variable ACL policy example. (#20050) 2024-02-29 15:44:29 +00:00
Jeff Boruszak
57af1cdcbf docs: Consul Admin partition example (#20022) 2024-02-28 09:04:04 -06:00
James Rasell
dfda021aaf docs: add autoscaler ACL policy requirements. (#20041)
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2024-02-28 14:19:38 +00:00
Soren L. Hansen
14280e0820 Prevent NPE when service lacks identity (#19987)
Fixes a null pointer exception if `Alloc.SignIdentities` was called for
any service and any service lacked an identity.

Fixes #19986
2024-02-22 09:01:06 -05:00
Luiz Aoqui
cce72cddfd docs: add Autoscaler query_window_offset config (#19942) 2024-02-20 17:01:30 -05:00
Michael Schurter
b3a4c80f8c docs: fix s/envoy-bootstrap-error redirect (#20015)
And cleanup whitespace
2024-02-20 12:26:26 -08:00
Mike Nomitch
18e5e168f4 Remove accidental console log in namespace test setup (#19874) 2024-02-20 11:42:11 -08:00
Tim Gross
45b2c34532 cni: add DNS set by CNI plugins to task configuration (#20007)
CNI plugins may set DNS configuration, but this isn't threaded through to the
task configuration so that we can write it to the `/etc/resolv.conf` file as
needed. Add the `AllocNetworkStatus` to the alloc hook resources so they're
accessible from the taskrunner. Any DNS entries provided by the user will
override these values.

Fixes: https://github.com/hashicorp/nomad/issues/11102
2024-02-20 10:17:27 -05:00
Juana De La Cuesta
20cfbc82d3 Introduces Disconnect block into the TaskGroup configuration (#19886)
This PR is the first of two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
2024-02-19 16:41:35 +01:00
Phil Renaud
e8db588368 ui: Warning and better error handling for invalid-named variables and job templates (#19989)
* Warning and better error handling for invalid-named variables and job templates

* Warning and better error handling for invalid-named variables and job templates

* Tests for variable pathname warnings

* Only show the bad-name warning if the variable is being created and path is editable
2024-02-15 11:54:09 -05:00
Phil Renaud
c1cbc39a96 Confirmation on exit from exec as long as socket has been opened (#19985) 2024-02-15 11:52:23 -05:00
Tim Gross
3149e5393c connect: fix missing diff of expose block (#19990)
While working on #10628 I discovered that the `expose` block was missing an
implementation of Diff, which means it doesn't show up correctly in `job plan`
output.

Also, fix field comparison in `ServiceCheck.Equal`.
This is a bug in the method but it doesn't look like it impacts production code.
2024-02-15 09:01:53 -05:00
James Rasell
4b46ff8ce0 test: fix test datarace within helper broker. (#19974) 2024-02-15 08:54:56 +00:00
James Rasell
e4648551e5 test: fix test datarace within driver shared eventer. (#19975) 2024-02-15 07:39:43 +00:00
Tim Gross
a74775814c fingerprint: add DNS address and port to Consul fingerprint (#19969)
In order to provide a DNS address and port to Connect tasks configured for
transparent proxy, we need to fingerprint the Consul DNS address and port. The
client will pass this address/port to the iptables configuration provided to the
`consul-cni` plugin.

Ref: https://github.com/hashicorp/nomad/issues/10628
2024-02-14 12:15:58 -05:00
Cedric Le Roux
994a2b1036 client: fixed a bug where corrupt client state could panic the client (#19972) 2024-02-14 11:14:11 -05:00
Tim Gross
c1b5850473 docs: add warning not to enable Consul tls.grpc.verify_incoming (#19970)
Consul does not support incoming TLS verification of Envoy. This failure results
in hard-to-understand errors like `SSLV3_ALERT_BAD_CERTIFICATE` in the Envoy
allocation logs. Add a warning about this for users.

Closes: https://github.com/hashicorp/nomad/issues/19772
Closes: https://github.com/hashicorp/nomad/issues/16854
Ref: https://github.com/hashicorp/consul/issues/13088
2024-02-14 08:56:35 -05:00
Tim Gross
c364cb5729 Merge pull request #19968 from hashicorp/post-1.7.5-release
Post 1.7.5 release
2024-02-13 11:42:30 -05:00
Tim Gross
3978f96898 Merge release 1.7.5 files 2024-02-13 11:34:25 -05:00
hc-github-team-nomad-core
64c2e2b868 Prepare for next release 2024-02-13 11:32:59 -05:00
hc-github-team-nomad-core
6e08d9ffff Generate files for 1.7.5 release 2024-02-13 11:32:59 -05:00
Luiz Aoqui
62b7d6ffe9 vault: revert #18998 to fix potential deadlock (#19963)
* Revert "vault: always renew tokens using the renewal loop (#18998)"
  This reverts commit 7054fe1a8c.
* test: add case for concurrent Vault token renewal
2024-02-13 09:50:46 -05:00
Julien Castets
61941d8204 docs: autoscaler doc for max_scale_up and max_scale_down of target-value strategy (#19945)
See https://github.com/hashicorp/nomad-autoscaler/pull/848
2024-02-13 07:38:39 +00:00
Phil Renaud
1bde7a8fb4 [ui] Upgrades to build storybook on node v20 (#19953)
* Attempting to build storybook on node v20

* babel-plugin-dynamic-import-node added

* build without babel-plugin-dynamic-import-node explicitly declared
2024-02-12 16:51:47 -05:00
Tim Gross
a54657899c CNI: fix deprecation warnings (#19954)
We updated our `go-cni` dependency in #17582 but this left deprecation warnings
on the `cni.CNIResult` type (now `cni.Result`).
2024-02-12 15:35:43 -05:00
Seth Hoenig
37c497628c docs: describe cloud environments in fingerprint denylist (#19952)
This PR changes the example of the client config option "fingerprint.denylist"
to include all the cloud environment fingerprinters. Each one contains a
2 second HTTP timeout to a metadata endpoint that does not exist if you are not
in that particular cloud. When run in serial on startup, this results in
an 8 second wait where nothing useful is happening.

Closes #16727
2024-02-12 09:57:29 -06:00
Tim Gross
e986c298ac alloc exec: fix panics after stream close (#19932)
In #19172 we added a check on websocket errors to see if they were one of
several benign "close" messages. This change inadvertently assumed that other
messages used for close would not implement `HTTPCodedError`. When errors like
the following are received:

> msgpack decode error [pos 0]: io: read/write on closed pipe

they are sent from the inner loop as though they were a "real" error, but the
channel is already being closed with a "close" message.

This allowed many more attempts to pass through a previously-undiscovered race
condition in the two goroutines that stream RPC responses to the websocket. When
the input stream returns an error for any reason (for example, the command we're
executing has exited), it will unblock the "outer" goroutine and cause a write
to the websocket. If we're concurrently writing the "close error" discussed
above, this results in a panic from the websocket library.

This changeset includes two fixes:
* Catch "closed pipe" error correctly so that we're not sending unnecessary
  error messages.
* Move all writes to the websocket into the same response streaming
  goroutine. The main handler goroutine will block on a results channel, and the
  response streaming goroutine will send on that channel with the final error when
  it's done so it can be reported to the user.
2024-02-12 09:43:34 -05:00
Tim Gross
0985f96f8d state: fix state store corruption in plan apply (#19937)
The state store's `UpsertPlanResults` method canonicalizes allocations in order
to upgrade them to a new version. But the method does not copy the allocation
before doing so, which can potentially corrupt the state store. This hasn't been
implicated in any known user-facing bugs, but was detected when running Nomad
under a build with the Go toolchain's data race detection enabled.
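
A minimal sketch of the copy-before-mutate pattern, using simplified stand-in types: `Allocation`, `Store` and `Upsert` here are hypothetical and far smaller than the real state store, and the version bump in `Canonicalize` is only a placeholder for the real upgrade logic.

```go
package main

import "fmt"

// Allocation is a simplified stand-in for the real struct.
type Allocation struct {
	ID      string
	Version int
}

// Copy returns a copy that is safe to mutate. A plain value copy is
// enough here because this stand-in has no pointer or slice fields.
func (a *Allocation) Copy() *Allocation {
	c := *a
	return &c
}

// Canonicalize upgrades an allocation in place (placeholder logic).
func (a *Allocation) Canonicalize() {
	if a.Version == 0 {
		a.Version = 1
	}
}

// Store is a toy state store keyed by allocation ID.
type Store struct{ allocs map[string]*Allocation }

// Upsert stores a canonicalized copy and leaves the caller's value,
// which other readers may still hold, untouched.
func (s *Store) Upsert(in *Allocation) {
	out := in.Copy()
	out.Canonicalize()
	s.allocs[out.ID] = out
}

func main() {
	store := &Store{allocs: make(map[string]*Allocation)}
	plan := &Allocation{ID: "example-alloc"}
	store.Upsert(plan)
	fmt.Println(plan.Version, store.allocs[plan.ID].Version)
}
```

Mutating the copy keeps the upgrade from being visible to goroutines that still hold the original pointer, which is the sort of access a race detector flags.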
2024-02-12 08:59:04 -05:00
Luiz Aoqui
e2bfdf0c10 events: emit event when job is deleted (#19903)
When jobs are deregistered with the `purge` flag they are immediately
deleted from the state store instead of just updated to be marked as
stopped.

Without tracking job deletions the event stream would not receive a
`JobDeregistered` event when `purge` was set.
2024-02-09 18:19:33 -05:00
Luiz Aoqui
4a8b01430b scheduler: retain eval metrics on port collision (#19933)
When an allocation can't be placed because of a port collision the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metric was not being emitted when
preemption was enabled.
2024-02-09 18:18:48 -05:00
Luiz Aoqui
b52a44717e executor: limit the value of CPU shares (#19935)
The value for the executor cgroup CPU weight must be within the limits
imposed by the Linux kernel.

Nomad used the task `resource.cpu`, an unbounded value, directly as the
cgroup CPU weight, causing it to potentially go outside the imposed
values.

This commit clamps the CPU shares values to be within the limits
allowed.
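
The clamping can be sketched as below. The bounds are an assumption: they are the cgroup v1 `cpu.shares` range of 2 to 262144, and the limits Nomad actually applies may differ. `clampCPUShares` is a hypothetical name.

```go
package main

import "fmt"

// Assumed bounds: the cgroup v1 cpu.shares range. The limits used by
// Nomad itself may be different.
const (
	minCPUShares = 2
	maxCPUShares = 262144
)

// clampCPUShares forces shares into [minCPUShares, maxCPUShares].
func clampCPUShares(shares int64) int64 {
	if shares < minCPUShares {
		return minCPUShares
	}
	if shares > maxCPUShares {
		return maxCPUShares
	}
	return shares
}

func main() {
	// A task's resource.cpu is unbounded, so very large or very small
	// values are pulled back inside the allowed range.
	for _, cpu := range []int64{0, 500, 1_000_000} {
		fmt.Println(cpu, "->", clampCPUShares(cpu))
	}
}
```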

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-09 16:29:14 -05:00
Luiz Aoqui
db5ffde2b7 client: prevent start on cgroups init error (#19915)
The Nomad client expects certain cgroups paths to exist in order to
manage tasks. These paths are created when the agent first starts, but
if this process fails the agent would just log the error and proceed with its
initialization, despite not being able to run tasks.

This commit surfaces the errors back to the client initialization so the
process can stop early and make clear to operators that something went
wrong.
2024-02-09 13:45:29 -05:00
Tim Gross
110d93ab25 windows: remove LazyDLL calls for system modules (#19925)
On Windows, Nomad uses `syscall.NewLazyDLL` and `syscall.LoadDLL` functions to
load a few system DLL files, which does not prevent DLL hijacking
attacks. Hypothetically a local attacker on the client host that can place an
abusive library in a specific location could use this to escalate privileges to
the Nomad process. Although this attack does not fall within the Nomad security
model, it doesn't hurt to follow good practices here.

We can remove two of these DLL loads by using wrapper functions provided by the
stdlib in `x/sys/windows`

Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2024-02-09 08:47:48 -05:00