Commit Graph

4949 Commits

Author SHA1 Message Date
Tim Gross
7d73065066 numa: fix scheduler panic due to topology serialization bug (#23284)
The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
2024-06-11 08:55:00 -04:00
Tim Gross
61608e43cb test: move NUMA platform scan out of testing global (#23289)
The `testing.go` test helpers file for the driver manager initializes the NUMA
scan as a package-global variable. This causes it to be pulled in even in
production builds, so even running commands like `nomad version` will cause the
NUMA scan to happen. Move the scan into the test helper setup.
2024-06-11 08:52:51 -04:00
nicoche
ffcb72bfe3 api: Add Notes field to service checks (#22397)
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2024-06-10 16:59:49 +02:00
Seth Hoenig
45da80bde2 client: cleanup empty task directory when using unveil filesystem isolation (#23237)
This PR fixes a bug where Nomad client would leave behind an empty directory
created on behalf of tasks making use of the unveil filesystem isolation
mode (i.e. using exec2 task driver). Once unmounting is complete, we should
remember to also delete the directory.

Fixes #22433
2024-06-06 10:47:23 -05:00
Tim Gross
140747240f consul: include admin partition in JWT login requests (#22226)
When logging into a JWT auth method, we need to explicitly supply the Consul
admin partition if the local Consul agent is in a partition. We can't derive
this from agent configuration because the Consul agent's configuration is
canonical, so instead we get the partition from the fingerprint (if
available). This changeset updates the Consul client constructor so that we
close over the partition from the fingerprint.

Ref: https://hashicorp.atlassian.net/browse/NET-9451
2024-05-29 16:31:09 -04:00
Daniel Bennett
4415fabe7d jobspec: time based task execution (#22201)
this is the CE side of an Enterprise-only feature.
a job trying to use this in CE will fail to validate.

to enable daily-scheduled execution entirely client-side,
a job may now contain:

task "name" {
  schedule {
    cron {
      start    = "0 12 * * * *" # may not include "," or "/"
      end      = "0 16"         # partial cron, with only {minute} {hour}
      timezone = "EST"          # anything in your tzdb
    }
  }
...

and everything about the allocation will be placed as usual,
but if outside the specified schedule, the taskrunner will block
on the client, waiting on the schedule start, before proceeding
with the task driver execution, etc.

this includes a taksrunner hook, which watches for the end of
the schedule, at which point it will kill the task.

then, restarts-allowing, a new task will start and again block
waiting for start, and so on.

this also includes all the plumbing required to pipe API calls
through from command->api->agent->server->client, so that
tasks can be force-run, force-paused, or resume the schedule
on demand.
2024-05-22 15:40:25 -05:00
Seth Hoenig
09bd11383c client: alloc_mounts directory must be sibling of data directory (#22199)
This PR adjusts the default location of -alloc-mounts-dir path to be a
sibling of the -data-dir path rather than a child. This is because on a
production-hardened systems the data dir is supposed to be chmod 0700
owned by root - preventing the exec2 task driver (and others using
unveil file system isolation features) from working properly.

For reference the directory structure from -data-dir now looks like this
after running an example job. Under the alloc_mounts directory, task
specific directories are mode 0710 and owned by the task user (which
may be a dynamic user UID/GID).

➜ sudo tree -p -d -u /tmp/mynomad
[drwxrwxr-x shoenig ]  /tmp/mynomad
├── [drwx--x--x root    ]  alloc_mounts
│   └── [drwx--x--- 80552   ]  c753b71d-c6a1-3370-1f59-47ab838fd8a6-mytask
│       ├── [drwxrwxrwx nobody  ]  alloc
│       │   ├── [drwxrwxrwx nobody  ]  data
│       │   ├── [drwxrwxrwx nobody  ]  logs
│       │   └── [drwxrwxrwx nobody  ]  tmp
│       ├── [drwxrwxrwx nobody  ]  local
│       ├── [drwxr-xr-x root    ]  private
│       ├── [drwx--x--- 80552   ]  secrets
│       └── [drwxrwxrwt nobody  ]  tmp
└── [drwx------ root    ]  data
    ├── [drwx--x--x root    ]  alloc
    │   └── [drwxr-xr-x root    ]  c753b71d-c6a1-3370-1f59-47ab838fd8a6
    │       ├── [drwxrwxrwx nobody  ]  alloc
    │       │   ├── [drwxrwxrwx nobody  ]  data
    │       │   ├── [drwxrwxrwx nobody  ]  logs
    │       │   └── [drwxrwxrwx nobody  ]  tmp
    │       └── [drwx--x--- 80552   ]  mytask
    │           ├── [drwxrwxrwx nobody  ]  alloc
    │           │   ├── [drwxrwxrwx nobody  ]  data
    │           │   ├── [drwxrwxrwx nobody  ]  logs
    │           │   └── [drwxrwxrwx nobody  ]  tmp
    │           ├── [drwxrwxrwx nobody  ]  local
    │           ├── [drwxrwxrwx nobody  ]  private
    │           ├── [drwx--x--- 80552   ]  secrets
    │           └── [drwxrwxrwt nobody  ]  tmp
    ├── [drwx------ root    ]  client
    └── [drwxr-xr-x root    ]  server
        ├── [drwx------ root    ]  keystore
        ├── [drwxr-xr-x root    ]  raft
        │   └── [drwxr-xr-x root    ]  snapshots
        └── [drwxr-xr-x root    ]  serf

32 directories
2024-05-22 13:14:34 -05:00
Tim Gross
b1657dd1fa CSI: track node claim before staging to prevent interleaved unstage (#20550)
The CSI hook for each allocation that claims a volume runs concurrently. If a
call to `MountVolume` happens at the same time as a call to `UnmountVolume` for
the same volume, it's possible for the second alloc to detect the volume has
already been staged, then for the original alloc to unpublish and unstage it,
only for the second alloc to then attempt to publish a volume that's been
unstaged.

The usage tracker on the volume manager was intended to prevent this behavior
but the call to claim the volume was made only after staging and publishing was
complete. Move the call to claim the volume for the usage tracker to the top of
the `MountVolume` workflow to prevent it from being unstaged until all consuming
allocations have called `UnmountVolume`.

Fixes: https://github.com/hashicorp/nomad/issues/20424
2024-05-16 09:45:07 -04:00
Tim Gross
953bfcc31e services: retry failed Nomad service deregistrations from client (#20596)
When the allocation is stopped, we deregister the service in the alloc runner's
`PreKill` hook. This ensures we delete the service registration and wait for the
shutdown delay before shutting down the tasks, so that workloads can drain their
connections. However, the call to remove the workload only logs errors and never
retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so
that transient errors give us an extra opportunity to deregister the service
before the tasks are stopped, before we need to fall back to the data integrity
improvements implemented in #20590.

Ref: https://github.com/hashicorp/nomad/issues/16616
2024-05-16 08:59:54 -04:00
Deniz Onur Duzgun
1cc99cc1b4 bug: resolve type conversion alerts (#20553) 2024-05-15 13:22:10 -04:00
Seth Hoenig
4148ca1769 client: mount shared alloc dir as nobody (#20589)
In the Unveil filesystem isolation mode we were mounting the shared
alloc dir with the UID/GID of the user of the task dir being mounted
and 0710 filesystem permissions. This was causing the actual task dir
to become inaccessible to other tasks in the allocation (a race where
the last mounter wins). Instead mount the shared alloc dir as nobody
with 0777 filesystem permissions.
2024-05-15 10:43:30 -05:00
James Rasell
04ba358266 client: expose network namespace CNI config as task env vars. (#11810)
This change exposes CNI configuration details of a network
namespace as environment variables. This allows a task to use
these value to configure itself; a potential use case is to run
a Raft application binding to IP and Port details configured using
the bridge network mode.
2024-05-14 09:02:06 +01:00
Juana De La Cuesta
169818b1bd [gh-6980] Client: clean up old allocs before running new ones using the exec task driver. (#20500)
Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
2024-05-14 09:51:27 +02:00
Tim Gross
65ae61249c CSI: include volume namespace in staging path (#20532)
CSI volumes are namespaced. But the client does not include the namespace in the
staging mount path. This causes CSI volumes with the same volume ID but
different namespace to collide if they happen to be placed on the same host. The
per-allocation paths don't need to be namespaced, because an allocation can only
mount volumes from its job's own namespace.

Rework the CSI hook tests to have more fine-grained control over the mock
on-disk state. Add tests covering upgrades from staging paths missing
namespaces.

Fixes: https://github.com/hashicorp/nomad/issues/18741
2024-05-13 11:24:09 -04:00
Tim Gross
623486b302 deps: vendor containernetworking/plugins functions for net NS utils (#20556)
We bring in `containernetworking/plugins` for the contents of a single file,
which we use in a few places for running a goroutine in a specific network
namespace. This code hasn't needed an update in a couple of years, and a good
chunk of what we need was previously vendored into `client/lib/nsutil`
already.

Updating the library via dependabot is causing errors in Docker driver tests
because it updates a lot of transient dependencies, and it's bringing in a pile
of new transient dependencies like opentelemetry. Avoid this problem going
forward by vendoring the remaining code we hadn't already.

Ref: https://github.com/hashicorp/nomad/pull/20146
2024-05-13 09:10:16 -04:00
James Rasell
7e42ad869a client: fix unallocated CPU metric when reserved cpu is set. (#20543) 2024-05-09 10:55:22 +01:00
Seth Hoenig
14a022cbc0 drivers/raw_exec: enable setting cgroup override values (#20481)
* drivers/raw_exec: enable setting cgroup override values

This PR enables configuration of cgroup override values on the `raw_exec`
task driver. WARNING: setting cgroup override values eliminates any
gauruntee Nomad can make about resource availability for *any* task on
the client node.

For cgroup v2 systems, set a single unified cgroup path using `cgroup_v2_override`.
The path may be either absolute or relative to the cgroup root.

config {
  cgroup_v2_override = "custom.slice/app.scope"
}

or

config {
  cgroup_v2_override = "/sys/fs/cgroup/custom.slice/app.scope"
}

For cgroup v1 systems, set a per-controller path for each controller using
`cgroup_v1_override`. The path(s) may be either absolute or relative to
the controller root.

config {
  cgroup_v1_override = {
    "pids": "custom/app",
    "cpuset": "custom/app",
  }
}

or

config {
  cgroup_v1_override = {
    "pids": "/sys/fs/cgroup/pids/custom/app",
    "cpuset": "/sys/fs/cgroup/cpuset/custom/app",
  }
}

* drivers/rawexec: ensure only one of v1/v2 cgroup override is set

* drivers/raw_exec: executor should error if setting cgroup does not work

* drivers/raw_exec: create cgroups in raw_exec tests

* drivers/raw_exec: ensure we fail to start if custom cgroup set and non-root

* move custom cgroup func into shared file

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2024-05-07 16:46:27 -07:00
Tim Gross
f41bc468eb consul: provide CONSUL_HTTP_TOKEN env var to tasks (#20519)
When available, we provide an environment variable `CONSUL_TOKEN` to tasks, but
this isn't the environment variable expected by the Consul CLI. Job
specifications like deploying an API Gateway become noticeably nicer if we can
instead provide the expected env var.
2024-05-03 11:30:33 -04:00
Seth Hoenig
5f64e42d73 client: fixup how alloc mounts directory are setup (#20463) 2024-04-26 07:29:52 -05:00
Tim Gross
6d58acd897 WI: ensure tasks within same alloc get different Consul tokens (#20411)
The `consul_hook` in the allocrunner gets a separate Consul token for each task,
even if the tasks' identities have the same name, but used the identity name as
the key to the alloc hook resources map. This means the last task in the group
overwrites the Consul tokens of all other tasks.

Fix this by adding the task name to the key in the allocrunner's
`consul_hook`. And update the taskrunner's `consul_hook` to expect the task
name in the key.

Fixes: https://github.com/hashicorp/nomad/issues/20374
Fixes: https://hashicorp.atlassian.net/browse/NOMAD-614
2024-04-17 11:29:58 -04:00
Luiz Aoqui
9d4f7bcb68 mock_driver: fix fingreprint key (#20351)
The `mock_driver` is an internal task driver used mostly for testing and
simulating workloads. During the allocrunner v2 work (#4792) its name
changed from `mock_driver` to just `mock` and then back to
`mock_driver`, but the fingreprint key was kept as `driver.mock`.

This results in tasks configured with `driver = "mock"` to be scheduled
(because Nomad thinks the client has a task driver called `mock`), but
fail to actually run (because the Nomad client can't find a driver
called `mock` in its catalog).

Fingerprinting the right name prevents the job from being scheduled in
the first place.

Also removes mentions of the mock driver from documentation since its an
internal driver and not available in any production release.
2024-04-16 07:16:55 +01:00
Tim Gross
9cb1ef3e3d CNI: fix bugs in parsing strings to port number integers (#20379)
Ports are a maximum of uint16, but we have a few places in the recent tproxy
code where we were parsing them as 64-bit wide integers and then downcasting
them to `int`, which is technically unsafe and triggers code scanning alerts. In
practice we've validated the range elsewhere and don't build for 32-bit
platforms. This changeset fixes the parsing to make everything a bit more robust
and silence the alert.

Fixes: https://github.com/hashicorp/nomad-enterprise/security/code-scanning/444
2024-04-12 13:31:25 -04:00
Seth Hoenig
ae6c4c8e3f deps: purge use of old x/exp packages (#20373) 2024-04-12 08:29:00 -05:00
astudentofblake
7b7ed12326 func: Allow custom paths to be added the the getter landlock (#20349)
* func: Allow custom paths to be added the the getter landlock

Fixes: 20315

* fix: slices imports
fix: more meaningful examples
fix: improve documentation
fix: quote error output
2024-04-11 15:17:33 -05:00
Tim Gross
d56e8ad1aa WI: ensure Consul hook and WID manager interpolate services (#20344)
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock
  signer and never call `Run` on it. It wasn't feasible to exercise the correct
  behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: https://github.com/hashicorp/nomad/issues/20025
2024-04-11 15:40:28 -04:00
Tim Gross
8298d39e78 Connect transparent proxy support
Add support for Consul Connect transparent proxies

Fixes: https://github.com/hashicorp/nomad/issues/10628
2024-04-10 11:00:18 -04:00
Tim Gross
4fef82e8e2 tproxy: refactor getPortMapping
The `getPortMapping` method forces callers to handle two different data
structures, but only one caller cares about it. We don't want to return a single
map or slice because the `cni.PortMapping` object doesn't include a label field
that we need for tproxy. Return a new datastructure that closes over both a
slice of `cni.PortMapping` and a map of label to index in that slice.
2024-04-10 10:16:13 -04:00
Tim Gross
8eaf176868 client: fix IPv6 parsing for client.servers block (#20324)
When the `client.servers` block is parsed, we split the port from the
address. This does not correctly handle IPv6 addresses when they are in URL
format (wrapped in brackets), which we require to disambiguate the port and
address.

Fix the parser to correctly split out the port and handle a missing port value
for IPv6. Update the documentation to make the URL format requirement clear.

Fixes: https://github.com/hashicorp/nomad/issues/20310
2024-04-08 15:06:27 -04:00
Tim Gross
76009d89af tproxy: networking hook changes (#20183)
When `transparent_proxy` block is present and the network mode is `bridge`, use
a different CNI configuration that includes the `consul-cni` plugin. Before
invoking the CNI plugins, create a Consul SDK `iptables.Config` struct for the
allocation. This includes:

* Use all the `transparent_proxy` block fields
* The reserved ports are added to the inbound exclusion list so the alloc is
  reachable from outside the mesh
* The `expose` blocks and `check` blocks with `expose=true` are added to the
  inbound exclusion list so health checks work.

The `iptables.Config` is then passed as a CNI argument to the `consul-cni`
plugin.

Ref: https://github.com/hashicorp/nomad/issues/10628
2024-04-04 17:01:07 -04:00
Seth Hoenig
6ad648bec8 networking: Inject implicit constraints on CNI plugins when using bridge mode (#15473)
This PR adds a job mutator which injects constraints on the job taskgroups
that make use of bridge networking. Creating a bridge network makes use of the
CNI plugins: bridge, firewall, host-local, loopback, and portmap. Starting
with Nomad 1.5 these plugins are fingerprinted on each node, and as such we
can ensure jobs are correctly scheduled only on nodes where they are available,
when needed.
2024-03-27 16:11:39 -04:00
Tim Gross
15162917c1 cni: fix regression in falling back to DNS owned by dockerd (#20189)
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: https://github.com/hashicorp/nomad/issues/20174
2024-03-22 10:54:16 -04:00
Michael Schurter
23e4b7c9d2 Upgrade go-msgpack to v2 (#20173)
Replaces #18812

Upgraded with:
```
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/go-msgpack\/codec"/"github.com\/hashicorp\/go-msgpack\/v2\/codec/" '{}' ';'
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/net-rpc-msgpackrpc"/"github.com\/hashicorp\/net-rpc-msgpackrpc\/v2/" '{}' ';'
go get
go get -v -u github.com/hashicorp/raft-boltdb/v2
go get -v github.com/hashicorp/serf@5d32001edfaa18d1c010af65db707cdb38141e80
```

see https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0
for details
2024-03-21 11:44:23 -07:00
Tim Gross
7b9bce2d08 config: fix client.template config merging with defaults (#20165)
When loading the client configuration, the user-specified `client.template`
block was not properly merged with the default values. As a result, if the user
set any `client.template` field, all the other field defaulted to their zero
values instead of the documented defaults.

This changeset:
* Adds the missing `Merge` method for the client template config and ensures
  it's called.
* Makes a single source of truth for the default template configuration,
  instead of two different constructors.
* Extends the tests to cover the merge of a partial block better.

Fixes: https://github.com/hashicorp/nomad/issues/20164
2024-03-20 10:18:56 -04:00
Charlie Voiselle
7b27bc344b [refactor] Move task directory destroy logic from alloc_dir.go to task_dir.go (#20006)
* Move task directory destroy logic from alloc_dir to task_dir
* Update errors to wrap error cause
* Use constants for file permissions
* Make multierror handling consistent.
* Make helpers for directory creation
* Move mount dir unlink to task_dir Unlink method
* Make constant for file mode 710

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2024-03-19 13:49:09 -04:00
Tim Gross
13617eee4b template: improve internal documentation around shutdown (#20134)
While investigating a report around possible consul-template shutdown issues,
which didn't bear fruit, I found that some of the logic around template runner
shutdown is unintuitive.

* Add some doc strings to the places where someone might think we should be
  obviously stopping the runner or returning early.
* Mark context argument for `Poststart`, `Stop`, and `Update` hooks as unused.

No functional code changes.
2024-03-14 15:33:32 -04:00
Amir Abbas
40b8f17717 Support insecure flag on artifact (#20126) 2024-03-14 10:59:20 -05:00
Seth Hoenig
bb54d16e4a exec2: setup RPC plumbing for dynamic workload users (#20129)
And pass the dynamic users pool from the client into the hook.
2024-03-13 14:06:52 -05:00
Seth Hoenig
05937ab75b exec2: add client support for unveil filesystem isolation mode (#20115)
* exec2: add client support for unveil filesystem isolation mode

This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.

* actually create alloc-mounts-dir directory

* fix doc strings about alloc mount dir paths
2024-03-13 08:24:17 -05:00
carrychair
5f5b34db0e remove repetitive words (#20110)
Signed-off-by: carrychair <linghuchong404@gmail.com>
2024-03-11 08:52:08 +00:00
Seth Hoenig
286dce7a2a exec2: add a client.users configuration block (#20093)
* exec: add a client.users configuration block

For now just add min/max dynamic user values; soon we can also absorb
the "user.denylist" and "user.checked_drivers" options from the
deprecated client.options map.

* give the no-op pool implementation a better name

* use explicit error types to make referencing them cleaner in tests

* use import alias to not shadow package name
2024-03-08 16:02:32 -06:00
Seth Hoenig
2c1f5daad7 more test refactoring (#20092)
* tests: swap testify for test in client/config

* tests: swap testify for test in logmon/
2024-03-07 11:04:16 -06:00
Seth Hoenig
67554b8f91 exec2: implement dynamic workload users taskrunner hook (#20069)
* exec2: implement dynamic workload users taskrunner hook

This PR impelements a TR hook for allocating dynamic workload users from
a pool managed by the Nomad client. This adds a new task driver Capability,
DynamicWorkloadUsers - which a task driver must indicate in order to make
use of this feature.

The client config plumbing is coming in a followup PR - in the RFC we
realized having a client.users block would be nice to have, with some
additional unrelated options being moved from the deprecated client.options
config.

* learn to spell
2024-03-06 09:34:27 -06:00
Tim Gross
45b2c34532 cni: add DNS set by CNI plugins to task configuration (#20007)
CNI plugins may set DNS configuration, but this isn't threaded through to the
task configuration so that we can write it to the `/etc/resolv.conf` file as
needed. Add the `AllocNetworkStatus` to the alloc hook resources so they're
accessible from the taskrunner. Any DNS entries provided by the user will
override these values.

Fixes: https://github.com/hashicorp/nomad/issues/11102
2024-02-20 10:17:27 -05:00
Juana De La Cuesta
20cfbc82d3 Introduces Disconnect block into the TaskGroup configuration (#19886)
This PR is the first on two that will implement the new Disconnect block. In this PR the new block is introduced to be backwards compatible with the fields it will replace. For more information refer to this RFC and this ticket.
2024-02-19 16:41:35 +01:00
Tim Gross
a74775814c fingerprint: add DNS address and port to Consul fingerprint (#19969)
In order to provide a DNS address and port to Connect tasks configured for
transparent proxy, we need to fingerprint the Consul DNS address and port. The
client will pass this address/port to the iptables configuration provided to the
`consul-cni` plugin.

Ref: https://github.com/hashicorp/nomad/issues/10628
2024-02-14 12:15:58 -05:00
Cedric Le Roux
994a2b1036 client: fixed a bug where corrupt client state could panic the client (#19972) 2024-02-14 11:14:11 -05:00
Luiz Aoqui
62b7d6ffe9 vault: revert #18998 to fix potential deadlock (#19963)
* Revert "vault: always renew tokens using the renewal loop (#18998)"
  This reverts commit 7054fe1a8c.
* test: add case for concurrent Vault token renewal
2024-02-13 09:50:46 -05:00
Tim Gross
a54657899c CNI: fix deprecation warnings (#19954)
We updated our `go-cni` dependency in #17582 but this left deprecation warnings
on the `cni.CNIResult` type (now `cni.Result`).
2024-02-12 15:35:43 -05:00
Luiz Aoqui
db5ffde2b7 client: prevent start on cgroups init error (#19915)
The Nomad client expects certain cgroups paths to exist in order to
manage tasks. These paths are created when the agent first starts, but
if process fails the agent would just log the error and proceed with its
initialization, despite not being able to run tasks.

This commit surfaces the errors back to the client initialization so the
process can stop early and make clear to operators that something went
wrong.
2024-02-09 13:45:29 -05:00
Tim Gross
62c57d208b fingerprint: eliminate spurious warning logs with Consul CE (#19923)
Support for fingerprinting the Consul admin partition was added in #19485. But
when the client fingerprints Consul CE, it gets a valid fingerprint and working
Consul but with a warn-level log. Return "ok" from the partition extractor, but
also ensure that we only add the Consul attribute if it actually has a value.

Fixes: https://github.com/hashicorp/nomad/issues/19756
2024-02-09 08:19:00 -05:00