Commit Graph

24399 Commits

Author SHA1 Message Date
Tim Gross
c70bbd14ba agent: trim space when parsing X-Nomad-Token header (#16469)
Our auth token parsing code trims space around the `Authorization` header but
not around `X-Nomad-Token`. When using the UI, it's easy to accidentally
introduce a leading or trailing space, which results in spurious authentication
errors. Trim the space at the HTTP server.
2023-03-14 08:57:53 -04:00
Seth Hoenig
a42a33fa6b cgv1: do not disable cpuset manager if reserved interface already exists (#16467)
* cgv1: do not disable cpuset manager if reserved interface already exists

This PR fixes a bug where restarting a Nomad Client on a machine using cgroups
v1 (e.g. Ubuntu 20.04) would cause the cpuset cgroups manager to disable itself.

This is being caused by incorrectly interpreting a "file exists" error as
problematic when ensuring the reserved cpuset exists. If we get a "file exists"
error, that just means the Client was likely restarted.

Note that a machine reboot would fix the issue - the groups interfaces are
ephemoral.

* cl: add cl
2023-03-13 17:00:17 -05:00
Luiz Aoqui
f2bfbfaf03 acl: update job eval requirement to submit-job (#16463)
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
2023-03-13 17:13:54 -04:00
Luiz Aoqui
b2c873274b plugin: add missing fields to TaskConfig (#16434) 2023-03-13 15:58:16 -04:00
Dao Thanh Tung
f3a527b26f doc: Update nomad fmt doc to run against non-deprecated HCL2 jobspec only (#16435)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-03-13 15:26:27 -04:00
Michael Schurter
5f37b2fba0 build: update from go1.20.1 to go1.20.2 (#16427)
* build: update from go1.20.1 to go1.20.2

Note that the CVE fixed in go1.20.2 does *not* impact Nomad.

https://github.com/golang/go/issues/58647
2023-03-13 09:47:07 -07:00
dependabot[bot]
5febe9bd9b build(deps): bump go.uber.org/goleak from 1.2.0 to 1.2.1 (#16439)
Bumps [go.uber.org/goleak](https://github.com/uber-go/goleak) from 1.2.0 to 1.2.1.
- [Release notes](https://github.com/uber-go/goleak/releases)
- [Changelog](https://github.com/uber-go/goleak/blob/master/CHANGELOG.md)
- [Commits](https://github.com/uber-go/goleak/compare/v1.2.0...v1.2.1)

---
updated-dependencies:
- dependency-name: go.uber.org/goleak
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-13 11:23:56 -05:00
Tim Gross
b6d6cc41ed scheduler: refactor system util tests (#16416)
The tests for the system allocs reconciling code path (`diffSystemAllocs`)
include many impossible test environments, such as passing allocs for the wrong
node into the function. This makes the test assertions nonsensible for use in
walking yourself through the correct behavior.

I've pulled this changeset out of PR #16097 so that we can merge these
improvements and revisit the right approach to fix the problem in #16097 with
less urgency now that the PFNR bug fix has been merged. This changeset breaks up
a couple of tests, expands test coverage, and makes test assertions more
clear. It also corrects one bit of production code that behaves fine in
production because of canonicalization, but forces us to remember to set values
in tests to compensate.
2023-03-13 11:59:31 -04:00
Seth Hoenig
12688f2ea3 scheduler: add simple benchmark for tasksUpdated (#16422)
In preperation for some refactoring to tasksUpdated, add a benchmark to the
old code so it's easy to compare with the changes, making sure nothing goes
off the rails for performance.
2023-03-13 10:44:14 -05:00
Seth Hoenig
a34925f689 deps: remove replace statement for go-discover (#16304)
Which we no longer need since we no longer have consul as a dependency
2023-03-13 10:40:35 -05:00
Tim Gross
2a0e45bd75 Merge pull request #16445 from hashicorp/post-1.5.1-release
Post 1.5.1 release
2023-03-13 11:29:49 -04:00
Tim Gross
172f49fc92 Merge release 1.5.1 files 2023-03-13 11:15:04 -04:00
hc-github-team-nomad-core
6c91cc85c7 Prepare for next release 2023-03-13 11:13:27 -04:00
hc-github-team-nomad-core
669495bac6 Generate files for 1.5.1 release 2023-03-13 11:13:27 -04:00
Tim Gross
d0ddd5e84b acl: prevent privilege escalation via workload identity
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.

Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.

We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.

Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
2023-03-13 11:13:27 -04:00
Michael Schurter
9fefc18b77 e2e fixes: cli output, timing issue, and some cleanups (#16418)
* e2e: job expects alloc to run until stopped

* e2e: fix case changed by #16306

* e2e: couldn't find a bug but improved test+jobspecs
2023-03-10 13:14:51 -08:00
Luiz Aoqui
419c4bfa4e allocrunner: fix health check monitoring for Consul services (#16402)
Services must be interpolated to replace runtime variables before they
can be compared against the values returned by Consul.
2023-03-10 14:43:31 -05:00
Juana De La Cuesta
712c669657 cli: add -json and -t flag for alloc checks command (#16405)
* cli: add -json flag to alloc checks for completion

* CLI: Expand test to include testing the json flag for allocation checks

* Documentation: Add the checks command

* Documentation: Add example for alloc check command

* Update website/content/docs/commands/alloc/checks.mdx

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* CLI: Add template flag to alloc checks command

* Update website/content/docs/commands/alloc/checks.mdx

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* CLI: Extend test to include -t flag for alloc checks

* func: add changelog for added flags to alloc checks

* cli[doc]: Make usage section on alloc checks clearer

* Update website/content/docs/commands/alloc/checks.mdx

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Delete modd.conf

* cli[doc]: add -t flag to command description for alloc checks

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Co-authored-by: Juanita De La Cuesta Morales <juanita.delacuestamorales@juanita.delacuestamorales-LHQ7X0QG9X>
2023-03-10 16:58:53 +01:00
Michael Schurter
730adaa6a7 env/aws: update ec2 cpu info data (#16417)
Update AWS EC2 CPU tables using `make ec2info`
2023-03-09 14:33:21 -08:00
Luiz Aoqui
4fdb5c477e cli: remove hard requirement on list-jobs (#16380)
Most job subcommands allow for job ID prefix match as a convenience
functionality so users don't have to type the full job ID.

But this introduces a hard ACL requirement that the token used to run
these commands have the `list-jobs` permission, even if the token has
enough permission to execute the basic command action and the user
passed an exact job ID.

This change softens this requirement by not failing the prefix match in
case the request results in a permission denied error and instead using
the information passed by the user directly.
2023-03-09 15:00:04 -05:00
Bryce Kalow
1dd320387a docs: update content-conformance package (#16412) 2023-03-09 12:47:46 -06:00
James Rasell
0f7ad3ba56 cli: fix help output format on job init command. (#16407) 2023-03-09 18:17:15 +01:00
Tim Gross
c36d3bd2f9 scheduling: prevent self-collision in dynamic port network offerings (#16401)
When the scheduler tries to find a placement for a new allocation, it iterates
over a subset of nodes. For each node, we populate a `NetworkIndex` bitmap with
the ports of all existing allocations and any other allocations already proposed
as part of this same evaluation via its `SetAllocs` method. Then we make an
"ask" of the `NetworkIndex` in `AssignPorts` for any ports we need and receive
an "offer" in return. The offer will include both static ports and any dynamic
port assignments.

The `AssignPorts` method was written to support group networks, and it shares
code that selects dynamic ports with the original `AssignTaskNetwork`
code. `AssignTaskNetwork` can request multiple ports from the bitmap at a
time. But `AssignPorts` requests them one at a time and does not account for
possible collisions, and doesn't return an error in that case.

What happens next varies:

1. If the scheduler doesn't place the allocation on that node, the port
   conflict is thrown away and there's no problem.
2. If the node is picked and this is the only allocation (or last allocation),
   the plan applier will reject the plan when it calls `SetAllocs`, as we'd expect.
3. If the node is picked and there are additional allocations in the same eval
   that iterate over the same node, their call to `SetAllocs` will detect the
   impossible state and the node will be rejected. This can have the puzzling
   behavior where a second task group for the job without any networking at all
   can hit a port collision error!

It looks like this bug has existed since we implemented group networks, but
there are several factors that add up to making the issue rare for many users
yet frustratingly frequent for others:

* You're more likely to hit this bug the more tightly packed your range for
  dynamic ports is. With 12000 ports in the range by default, many clusters can
  avoid this for a long time.
* You're more likely to hit case (3) for jobs with lots of allocations or if a
  scheduler has to iterate over a large number of nodes, such as with system jobs,
  jobs with `spread` blocks, or (sometimes) jobs using `unique` constraints.

For unlucky combinations of these factors, it's possible that case (3) happens
repeatedly, preventing scheduling of a given job until a client state
change (ex. restarting the agent so all its allocations are rescheduled
elsewhere) re-opens the range of dynamic ports available.

This changeset:

* Fixes the bug by accounting for collisions in dynamic port selection in
  `AssignPorts`.
* Adds test coverage for `AssignPorts`, expands coverage of this case for the
  deprecated `AssignTaskNetwork`, and tightens the dynamic port range in a
  scheduler test for spread scheduling to more easily detect this kind of problem
  in the future.
* Adds a `String()` method to `Bitmap` so that any future "screaming" log lines
  have a human-readable list of used ports.
2023-03-09 10:09:54 -05:00
Proskurin Kirill
1227615ee5 Updated who-uses-nomad to add Behavox (#16339) 2023-03-08 19:43:12 -05:00
Seth Hoenig
95359b8c4c client: disable running artifact downloader as nobody (#16375)
* client: disable running artifact downloader as nobody

This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.

However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which when setup with the recommended directory
permissions structure prevents artifact downloads from working as intended.

Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.

* docs: update upgrade guide for 1.5.1 sandboxing

* docs: add cl

* docs: add title to upgrade guide fix
2023-03-08 15:58:43 -06:00
Seth Hoenig
40ab325594 e2e: setup nomad permissions correctly (client vs. server) (#16399)
This PR configures

- server nodes with a systemd unit running the agent as the nomad service user
- client nodes with a root owned nomad data directory
2023-03-08 14:41:08 -06:00
Phil Renaud
fcd51dce4b [ui] Fix: New toast notifications no longer last forever (#16384)
* Removes an errant console.log and corrects a default sticky=true on toast notifications

* Default so no need to refault
2023-03-08 14:50:18 -05:00
Lance Haig
48e7d70fcd deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318)
* Update ioutil deprecated library references to os and io respectively

* Deal with the errors produced.

Add error handling to filEntry info
Add error handling to info
2023-03-08 13:25:10 -06:00
Lance Haig
3160c76209 deps: Update ioutil library references to os and io respectively for drivers package (#16331)
* Update ioutil library references to os and io respectively for drivers package

No user facing changes so I assume no change log is required

* Fix failing tests
2023-03-08 10:31:09 -06:00
Lance Haig
0e74431b01 Update ioutil library references to os and io respectively for API and Plugins package (#16330)
No user facing changes so I assume no change log is required
2023-03-08 10:25:09 -06:00
Lance Haig
962b65f5bc Update ioutil library references to os and io respectively for e2e helper nomad (#16332)
No user facing changes so I assume no change log is required
2023-03-08 09:39:03 -06:00
Lance Haig
99f43c1144 Update ioutil library references to os and io respectively for command (#16329)
No user facing changes so I assume no change log is required
2023-03-08 09:20:04 -06:00
dependabot[bot]
37e9ecace1 build(deps): bump golang.org/x/crypto from 0.5.0 to 0.7.0 (#16337)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.5.0 to 0.7.0.
- [Release notes](https://github.com/golang/crypto/releases)
- [Commits](https://github.com/golang/crypto/compare/v0.5.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-08 09:14:49 -06:00
Phil Renaud
5d5740b533 Outage recovery link fix (#16365) 2023-03-07 15:52:26 -05:00
Seth Hoenig
24af468b67 e2e: fix permissions on nomad data directory (#16376)
This PR updates the provisioning step where we create /opt/nomad/data,
such that it is with 0700 permissions in line with our security guidance.
2023-03-07 14:41:54 -06:00
Seth Hoenig
b3f7559351 docker: fix bug where network pause containers would be erroneously reconciled (#16352)
* docker: fix bug where network pause containers would be erroneously gc'd

* docker: cl: thread context from driver into pause container restoration
2023-03-07 12:17:32 -06:00
James Rasell
b677ec7e99 docs: add 1.5.0, 1.4.5, and 1.3.10 pause regression upgrade note. (#16358) 2023-03-07 18:29:03 +01:00
James Rasell
003a567d10 cli: support json and t on acl binding-rule info command. (#16357) 2023-03-07 18:27:02 +01:00
Tim Gross
03d6a8c70a docs: note that secrets dir is usually mounted noexec (#16363) 2023-03-07 11:57:15 -05:00
Tim Gross
6f52a9194f scheduler: correctly detect inplace update with wildcard datacenters (#16362)
Wildcard datacenters introduced a bug where a job with any wildcard datacenters
will always be treated as a destructive update when we check whether a
datacenter has been removed from the jobspec.

Includes updating the helper so that callers don't have to loop over the job's
datacenters.
2023-03-07 10:05:59 -05:00
Ashlee M Boyer
605f15506b CI: delete test-link-rewrites.yml (#16354) 2023-03-06 15:41:01 -05:00
Seth Hoenig
78bcd32ad4 deps: update test to 0.6.2 for new functions (#16326) 2023-03-06 09:24:45 -06:00
Phil Renaud
a57f97eae3 [ui] Fix: Wildcard-datacenter system/sysbatch jobs stopped showing client links/chart (#16274)
* Fix for wildcard DC sys/sysbatch jobs

* A few extra modules for wildcard DC in systemish jobs

* doesMatchPattern moved to its own util as match-glob

* DC glob lookup using matchGlob

* PR feedback
2023-03-06 10:06:31 -05:00
Luiz Aoqui
b07af57618 client: don't emit task shutdown delay event if not waiting (#16281) 2023-03-03 18:22:06 -05:00
Luiz Aoqui
b24dddce2a api: set last index and request time on alloc stop (#16319)
Some of the methods in `Allocations()` incorrectly use the `putQuery`
in API calls where `put` is more appropriate since they are not reading
information back. These methods are also not returning request metadata
such as `LastIndex` back to callers, which can be useful to have in some
scenarios.

They also provide poor developer experience as they take an
`*api.Allocation` struct when only the allocation ID is necessary. This
can lead consumers to make unnecessary API calls to fetch the full
allocation.

Fixing these problems require updating the methods' signatures so they
take `*WriteOptions` instead of `*QueryOptions` and return `*WriteMeta`,
but this is a breaking change that requires advanced notice to consumers.

This commit adds a future breaking change notice and also fixes the
`Stop` method so it properly returns request metadata in a backwards
compatible way.
2023-03-03 15:52:41 -05:00
Tim Gross
ceed255d73 remove backcompat support for non-atomic job registration (#16305)
In Nomad 0.12.1 we introduced atomic job registration/deregistration, where the
new eval was written in the same raft entry. Backwards-compatibility checks were
supposed to have been removed in Nomad 1.1.0, but we missed that. This is long
safe to remove.
2023-03-03 15:52:22 -05:00
Luiz Aoqui
158d6a91c2 docs: fix alloc stop no_shutdown_delay (#16282) 2023-03-03 14:44:49 -05:00
Luiz Aoqui
0e824d363a cli: use shared logic for resolving job prefix (#16306)
Several `nomad job` subcommands had duplicate or slightly similar logic
for resolving a job ID from a CLI argument prefix, while others did not
have this functionality at all.

This commit pulls the shared logic to the command Meta and updates all
`nomad job` subcommands to use it.
2023-03-03 14:43:20 -05:00
Tim Gross
a4f792641b service: fix regression in task access to list/read endpoint (#16316)
When native service discovery was added, we used the node secret as the auth
token. Once Workload Identity was added in Nomad 1.4.x we needed to use the
claim token for `template` blocks, and so we allowed valid claims to bypass the
ACL policy check to preserve the existing behavior. (Invalid claims are still
rejected, so this didn't widen any security boundary.)

In reworking authentication for 1.5.0, we unintentionally removed this
bypass. For WIs without a policy attached to their job, everything works as
expected because the resulting `acl.ACL` is nil. But once a policy is attached
to the job the `acl.ACL` is no longer nil and this causes permissions errors.

Fix the regression by adding back the bypass for valid claims. In future work,
we should strongly consider getting turning the implicit policies into real
`ACLPolicy` objects (even if not stored in state) so that we don't have these
kind of brittle exceptions to the auth code.
2023-03-03 11:41:19 -05:00
Dao Thanh Tung
2ec6575675 api: add new test case for force-leave (#16260)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-03-03 10:38:40 -05:00