Commit Graph

25612 Commits

Author SHA1 Message Date
Tim Gross
a74775814c fingerprint: add DNS address and port to Consul fingerprint (#19969)
In order to provide a DNS address and port to Connect tasks configured for
transparent proxy, we need to fingerprint the Consul DNS address and port. The
client will pass this address/port to the iptables configuration provided to the
`consul-cni` plugin.

Ref: https://github.com/hashicorp/nomad/issues/10628
2024-02-14 12:15:58 -05:00
Cedric Le Roux
994a2b1036 client: fixed a bug where corrupt client state could panic the client (#19972) 2024-02-14 11:14:11 -05:00
Tim Gross
c1b5850473 docs: add warning not to enable Consul tls.grpc.verify_incoming (#19970)
Consul does not support incoming TLS verification of Envoy. This failure results
in hard-to-understand errors like `SSLV3_ALERT_BAD_CERTIFICATE` in the Envoy
allocation logs. Leave a warning about this to users.

Closes: https://github.com/hashicorp/nomad/issues/19772
Closes: https://github.com/hashicorp/nomad/issues/16854
Ref: https://github.com/hashicorp/consul/issues/13088
2024-02-14 08:56:35 -05:00
Tim Gross
c364cb5729 Merge pull request #19968 from hashicorp/post-1.7.5-release
Post 1.7.5 release
2024-02-13 11:42:30 -05:00
Tim Gross
3978f96898 Merge release 1.7.5 files 2024-02-13 11:34:25 -05:00
hc-github-team-nomad-core
64c2e2b868 Prepare for next release 2024-02-13 11:32:59 -05:00
hc-github-team-nomad-core
6e08d9ffff Generate files for 1.7.5 release 2024-02-13 11:32:59 -05:00
Luiz Aoqui
62b7d6ffe9 vault: revert #18998 to fix potential deadlock (#19963)
* Revert "vault: always renew tokens using the renewal loop (#18998)"
  This reverts commit 7054fe1a8c.
* test: add case for concurrent Vault token renewal
2024-02-13 09:50:46 -05:00
Julien Castets
61941d8204 docs: autoscaler doc for max_scale_up and max_scale_down of target-value strategy (#19945)
See https://github.com/hashicorp/nomad-autoscaler/pull/848
2024-02-13 07:38:39 +00:00
Phil Renaud
1bde7a8fb4 [ui] Upgrades to build storybook on node v20 (#19953)
* Attempting to build storybook on node v20

* babel-plugin-dynamic-import-node added

* build without babel-plugin-dynamic-import-node explicitly declared
2024-02-12 16:51:47 -05:00
Tim Gross
a54657899c CNI: fix deprecation warnings (#19954)
We updated our `go-cni` dependency in #17582 but this left deprecation warnings
on the `cni.CNIResult` type (now `cni.Result`).
2024-02-12 15:35:43 -05:00
Seth Hoenig
37c497628c docs: describe cloud environments in fingerprint denylist (#19952)
This PR changes the example of the client config option "fingerprint.denylist"
to include all the cloud environment fingerprinters. Each one contains a
2 second HTTP timeout to a metadata endpoint that does not exist if you are not
in that particular cloud. When run in serial on startup, this results in
an 8 second wait where nothing useful is happening.

Closes #16727
2024-02-12 09:57:29 -06:00
Tim Gross
e986c298ac alloc exec: fix panics after stream close (#19932)
In #19172 we added a check on websocket errors to see if they were one of
several benign "close" messages. This change inadvertently assumed that other
messages used for close would not implement `HTTPCodedError`. When errors like
the following are received:

> msgpack decode error [pos 0]: io: read/write on closed pipe"

they are sent from the inner loop as though they were a "real" error, but the
channel is already being closed with a "close" message.

This allowed many more attempts to pass thru a previously-undiscovered race
condition in the two goroutines that stream RPC responses to the websocket. When
the input stream returns an error for any reason (for example, the command we're
executing has exited), it will unblock the "outer" goroutine and cause a write
to the websocket. If we're concurrently writing the "close error" discussed
above, this results in a panic from the websocket library.

This changeset includes two fixes:
* Catch "closed pipe" error correctly so that we're not sending unnecessary
  error messages.
* Move all writes to the websocket into the same response streaming
  goroutine. The main handler goroutine will block on a results channel, and the
  response streaming goroutine will send on that channel with the final error when
  it's done so it can be reported to the user.
2024-02-12 09:43:34 -05:00
Tim Gross
0985f96f8d state: fix state store corruption in plan apply (#19937)
The state store's `UpsertPlanResults` method canonicalizes allocations in order
to upgrade them to a new version. But the method does not copy the allocation
before doing so, which can potentially corrupt the state store. This hasn't been
implicated in any known user-facing bugs, but was detected when running Nomad
under a build with Go toolchains data race detection enabled.
2024-02-12 08:59:04 -05:00
Luiz Aoqui
e2bfdf0c10 events: emit event when job is deleted (#19903)
When jobs are deregistered with the `purge` flag they are immediately
deleted from the state store instead of just updated to be marked as
stopped.

Without tracking job deletions the event stream would not receive a
`JobDeregistered` event when `purge` was set.
2024-02-09 18:19:33 -05:00
Luiz Aoqui
4a8b01430b scheduler: retain eval metrics on port collision (#19933)
When an allocation can't be placed because of a port collision the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metrics was not being emitted when
preemption was enabled.
2024-02-09 18:18:48 -05:00
Luiz Aoqui
b52a44717e executor: limit the value of CPU shares (#19935)
The value for the executor cgroup CPU weight must be within the limits
imposed by the Linux kernel.

Nomad used the task `resource.cpu`, an unbounded value, directly as the
cgroup CPU weight, causing it to potentially go outside the imposed
values.

This commit clamps the CPU shares values to be within the limits
allowed.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-09 16:29:14 -05:00
Luiz Aoqui
db5ffde2b7 client: prevent start on cgroups init error (#19915)
The Nomad client expects certain cgroups paths to exist in order to
manage tasks. These paths are created when the agent first starts, but
if process fails the agent would just log the error and proceed with its
initialization, despite not being able to run tasks.

This commit surfaces the errors back to the client initialization so the
process can stop early and make clear to operators that something went
wrong.
2024-02-09 13:45:29 -05:00
Tim Gross
110d93ab25 windows: remove LazyDLL calls for system modules (#19925)
On Windows, Nomad uses `syscall.NewLazyDLL` and `syscall.LoadDLL` functions to
load a few system DLL files, which does not prevent DLL hijacking
attacks. Hypothetically a local attacker on the client host that can place an
abusive library in a specific location could use this to escalate privileges to
the Nomad process. Although this attack does not fall within the Nomad security
model, it doesn't hurt to follow good practices here.

We can remove two of these DLL loads by using wrapper functions provided by the
stdlib in `x/sys/windows`

Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2024-02-09 08:47:48 -05:00
Tim Gross
62c57d208b fingerprint: eliminate spurious warning logs with Consul CE (#19923)
Support for fingerprinting the Consul admin partition was added in #19485. But
when the client fingerprints Consul CE, it gets a valid fingerprint and working
Consul but with a warn-level log. Return "ok" from the partition extractor, but
also ensure that we only add the Consul attribute if it actually has a value.

Fixes: https://github.com/hashicorp/nomad/issues/19756
2024-02-09 08:19:00 -05:00
Phil Renaud
81f868631f Fix vercel deployments EBADENGINE errors (#19914) 2024-02-08 14:14:57 -05:00
Phil Renaud
41c783aec2 Noting action name restrictions, and correcting those of auth methods and roles (#19905) 2024-02-08 12:01:22 -05:00
Tim Gross
fc26e0cb22 Post 1.7.4 release (#19918) 2024-02-08 10:58:50 -05:00
Luiz Aoqui
2a348ba714 docs: expand impact of verify_https_client=false (#19916)
When Nomad is configured with `verify_https_client=false` endpoints that
do not require an ACL token can be accessed without any other type of
authentication. Expand the docs to mention this effect.
2024-02-08 10:55:40 -05:00
Tim Gross
2970690355 Merge release 1.7.4 files 2024-02-08 10:41:11 -05:00
hc-github-team-nomad-core
33f0a5b268 Prepare for next release 2024-02-08 10:40:24 -05:00
hc-github-team-nomad-core
875e96cccc Generate files for 1.7.4 release 2024-02-08 10:40:24 -05:00
Tim Gross
df86503349 template: sandbox template rendering
The Nomad client renders templates in the same privileged process used for most
other client operations. During internal testing, we discovered that a malicious
task can create a symlink that can cause template rendering to read and write to
arbitrary files outside the allocation sandbox. Because the Nomad agent can be
restarted without restarting tasks, we can't simply check that the path is safe
at the time we write without encountering a time-of-check/time-of-use race.

To protect Nomad client hosts from this attack, we'll now read and write
templates in a subprocess:

* On Linux/Unix, this subprocess is sandboxed via chroot to the allocation
  directory. This requires that Nomad is running as a privileged process. A
  non-root Nomad agent will warn that it cannot sandbox the template renderer.

* On Windows, this process is sandboxed via a Windows AppContainer which has
  been granted access to only to the allocation directory. This does not require
  special privileges on Windows. (Creating symlinks in the first place can be
  prevented by running workloads as non-Administrator or
  non-ContainerAdministrator users.)

Both sandboxes cause encountered symlinks to be evaluated in the context of the
sandbox, which will result in a "file not found" or "access denied" error,
depending on the platform. This change will also require an update to
Consul-Template to allow callers to inject a custom `ReaderFunc` and
`RenderFunc`.

This design is intended as a workaround to allow us to fix this bug without
creating backwards compatibility issues for running tasks. A future version of
Nomad may introduce a read-only mount specifically for templates and artifacts
so that tasks cannot write into the same location that the Nomad agent is.

Fixes: https://github.com/hashicorp/nomad/issues/19888
Fixes: CVE-2024-1329
2024-02-08 10:40:24 -05:00
Tim Gross
0d3cd1427f migration: check symlink sources during archive unpack
During allocation directory migration, the client was not checking that any
symlinks in the archive aren't pointing to somewhere outside the allocation
directory. While task driver sandboxing will protect against processes inside
the task from reading/writing thru the symlink, this doesn't protect against the
client itself from performing unintended operations outside the sandbox.

This changeset includes two changes:

* Update the archive unpacking to check the source of symlinks and require that
  they fall within the sandbox.
* Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't
  work for paths in the sibling directories of the sandbox directory. This bug
  doesn't appear to be exploitable but caused errors in testing.

Fixes: https://github.com/hashicorp/nomad/issues/19887
2024-02-08 10:40:24 -05:00
hc-github-team-nomad-core
c03c735c99 Backport of deps: update dependencies indirectly bringing in older runc into release/1.7.x #19866
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-08 10:40:24 -05:00
hc-github-team-nomad-core
af7cf79df7 Backport of chore(deps): bump github.com/opencontainers/runc from 1.1.10 to 1.1.12 into release/1.7.x #19862
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-08 10:40:24 -05:00
Luiz Aoqui
7391a59695 docs: add note about stub list filtering (#19902)
When filtering list results, the filter expression is applied to the
full object, not the stub. This is useful because it allows users to
filter the list on fields not present in the object stub. But it can
also be confusing because some fields have different names, or only
exist in the stub, so the filter expression needs to reference fields
not present in returned data.

Filtering on the stub would reduce the confusion, but it would also
restrict users to only be able to filter on the fields in the stub,
which, by definition, are just a subset of the original fields.

Documenting this behaviour can help users understand unexpected errors
and results.
2024-02-07 16:41:07 -05:00
Luiz Aoqui
ce710d49fd cli: fix tls ca create command with -domain (#19892)
The current implementation of the `nomad tls ca create` command
ovierrides the value of the `-domain` flag with `"nomad"` if no
additional customization is provided.

This results in a certificate for the wrong domain or an error if the
`-name-constraint` flag is also used.

THe logic for `IsCustom()` also seemed reversed. If all custom fields
are empty then the certificate is _not_ customized, so `IsCustom()`
should return false.
2024-02-07 16:40:51 -05:00
Phil Renaud
15b06e8505 [ui] HashiCorp Design System upgraded to 3.6.0 (#19872)
* HashiCorp Design System upgraded to 3.6.0

* Fresh yarn

* Responses out of range are brought back within

* General pass at a11y fixes with updated components and node

* Further tooltip updates

* 3 more partitions worth of toggle and tooltip updates

* scale-events-accordion and topo-viz node fixes
2024-02-07 16:08:41 -05:00
Kiara Grouwstra
1e04fc4613 Libraries & SDKs: add nix-nomad (#19808) 2024-02-06 20:47:23 -05:00
Luiz Aoqui
7daa854491 docs: remove duplicate entry for upstreams.config (#19877) 2024-02-06 20:44:02 -05:00
Luiz Aoqui
5825cefe51 docs: remove Docker cpuset_cpus config (#19882)
Nomad 1.7 refactored how CPU cores are assigned to tasks, making the
Docker-specific `cpuset_cpus` configuration no longer used.
2024-02-06 10:51:16 -05:00
Phil Renaud
c927377700 Random exec assignment depends on taskGroup name if provided (#19878) 2024-02-05 23:23:01 -05:00
Luiz Aoqui
50c50a6328 cli: fix return code when job deployment succeeds (#19876)
When a job eval is blocked due to missing capacity, the `nomad job run`
command will monitor the deployment, which may succeed once additional
capacity is made available.

But the current implementation would return `2` even when the deployment
succeeded because it only took the first eval status into account.

This commit updates the eval monitoring logic to reset the scheduling
error state if the deployment eventually succeeds.
2024-02-05 18:32:25 -05:00
Juana De La Cuesta
120c3ca3c9 Add granular control of SELinux labels for host mounts (#19839)
Add new configuration option on task's volume_mounts, to give a fine grained control over SELinux "z" label

* Update website/content/docs/job-specification/volume_mount.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* fix: typo

* func: make volume mount verification happen even on  mounts with no volume

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-05 10:05:33 +01:00
Tim Gross
f1637bdd5f deps: update dependencies indirectly bringing in older runc (#19863)
Although Nomad itself is not vulnerable to CVE-2024-21626, we want to update
dependencies that bring in the vulnerable packages so as not to trip
vulnerability scanners. Update `containerd` and `go-dockerclient` as well as the
various transitive dependencies these bring in.
2024-02-02 16:08:22 -05:00
dependabot[bot]
b94a193c8a chore(deps): bump github.com/opencontainers/runc from 1.1.10 to 1.1.12 (#19851)
* chore(deps): bump github.com/opencontainers/runc from 1.1.10 to 1.1.12

Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.10 to 1.1.12.
- [Release notes](https://github.com/opencontainers/runc/releases)
- [Changelog](https://github.com/opencontainers/runc/blob/v1.1.12/CHANGELOG.md)
- [Commits](https://github.com/opencontainers/runc/compare/v1.1.10...v1.1.12)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runc
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* add changelog entry

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-02-02 10:18:53 -05:00
Tim Gross
334c383eb6 template: run template tests on Windows where possible (#19856)
We don't run the whole suite of unit tests on all platforms to keep CI times
reasonable, so the only things we've been running on Windows are
platform-specific.

I'm working on some platform-specific `template` related work and having these
tests run on Windows will reduce the risk of regressions. Our Windows CI box
doesn't have Consul or Vault, so I've skipped those tests for the time being,
and can follow up with that later. There's also a test with assertions looking
for specific paths, and the results are different on Windows. I've skipped those
for the moment as well and will follow up under a separate PR.

Also swap `testify` for `shoenig/test`
2024-02-02 09:22:03 -05:00
Heat Hamilton
556d44cd7a Merge pull request #19848 from hashicorp/heat/chore/update-website-dependencies
website: update dependencies
2024-01-30 15:07:11 -05:00
Heat Hamilton
0b29a7d727 Update dependencies to match Next v14 in Dev Portal; updated husky workflow to v9; updated nvmrc to v18 2024-01-30 13:43:36 -05:00
Daniel Bennett
e059adef98 e2e: PreCleanup and other jobs3 helpers (#19844) 2024-01-29 17:54:54 -06:00
Seth Hoenig
b50b81e488 users: refactor method for getting UID from username (#19840)
This PR refactors a helper function for getting the UID associated with
a given username to also return the GID and home directory. Also adds
unit tests on the known values of root and nobody user on Ubuntu Linux.
2024-01-29 13:56:30 -06:00
Luiz Aoqui
41277f823f license: fix some imports of BUSL-1.1 in MPL-2.0 (#19832)
Some packages licensed under MPL-2.0 were incorrectly importing code
from packages licensed under BUSL-1.1.

Not all imports are fixed here as they will require additional work to
untangle them. To help track progress this commit adds a Semgrep rule
that detects incorrect BUSL-1.1 imports in MPL-2.0 packages.
2024-01-29 12:04:12 -05:00
James Rasell
10324566ae driver/rawexec: populate OOM killed exit result. (#19829) 2024-01-29 08:54:52 +00:00
James Rasell
8d6067e987 driver/qemu: populate OOM killed exit result. (#19830) 2024-01-29 07:34:27 +00:00