Commit Graph

734 Commits

Author SHA1 Message Date
Tim Gross
78f4f76520 adjust prioritized client updates (#17541)
In #17354 we made client updates prioritized to reduce client-to-server
traffic. When the client has no previously-acknowledged update we assume that
the update is of typical priority; although we don't know that for sure in
practice an allocation will never become healthy quickly enough that the first
update we send is the update saying the alloc is healthy.

But that doesn't account for allocations that quickly fail in an unrecoverable
way because of allocrunner hook failures, and it'd be nice to be able to send
those failure states to the server more quickly. This changeset does so and adds
some extra comments on reasoning behind priority.
2023-06-26 09:14:24 -04:00
grembo
6f04b91912 Add disable_file parameter to job's vault stanza (#13343)
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using 
`image` filesystem isolation. As a result, more powerful tokens can be used 
in a job definition, allowing it to use template stanzas to issue all kinds of 
secrets (database secrets, Vault tokens with very specific policies, etc.), 
without sharing that issuing power with the task itself.

This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.

If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
2023-06-23 15:15:04 -04:00
Luiz Aoqui
30921a1bb4 client: fix panic on alloc stop in non-Linux environments (#17515)
Provide a no-op implementation of the drivers.DriverNetoworkManager
interface to be used by systems that don't support network isolation and
prevent panics where a network manager is expected.
2023-06-14 10:22:38 -04:00
Seth Hoenig
89ce092b20 docker: stop network pause container of lost alloc after node restart (#17455)
This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, the node comes back up. See the issue below for
full repro conditions.

Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning them and looking for the associated
allocID.

Fixes #17299
2023-06-09 08:46:29 -05:00
Tim Gross
893d4a77c8 prioritized client updates (#17354)
The allocrunner sends several updates to the server during the early lifecycle
of an allocation and its tasks. Clients batch-up allocation updates every 200ms,
but experiments like the C2M challenge has shown that even with this batching,
servers can be overwhelmed with client updates during high volume
deployments. Benchmarking done in #9451 has shown that client updates can easily
represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that
change the `ClientStatus` field are critical for progressing a deployment or
kicking off a reschedule to recover from failures.

Add a priority to the client allocation sync and update the `syncTicker`
receiver so that we only send an update if there's a high priority update
waiting, or on every 5th tick. This means when there are no high priority
updates, the client will send updates at most every 1s instead of
200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as
well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a
shared buffer, so as to split batching from sending and prevent backpressure
onto the allocrunner when the RPC is slow. This doesn't have a major performance
benefit in the benchmarks but makes the implementation of the prioritized update
simpler.

Fixes: #9451
2023-05-31 15:34:16 -04:00
Seth Hoenig
3cc25949fa client: ignore restart issued to terminal allocations (#17175)
* client: ignore restart issued to terminal allocations

This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook who would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

* e2e: add e2e test for alloc restart zombies

* cl: tweak text

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-05-16 10:19:41 -05:00
Tim Gross
bf7b82b52b drivers: make internal DisableLogCollection capability public (#17196)
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.

This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.

Fixes: #14636 #15686
2023-05-16 09:16:03 -04:00
Tim Gross
88323bab4a allocrunner: provide factory function so we can build mock ARs (#17161)
Tools like `nomad-nodesim` are unable to implement a minimal implementation of
an allocrunner so that we can test the client communication without having to
lug around the entire allocrunner/taskrunner code base. The allocrunner was
implemented with an interface specifically for this purpose, but there were
circular imports that made it challenging to use in practice.

Move the AllocRunner interface into an inner package and provide a factory
function type. Provide a minimal test that exercises the new function so that
consumers have some idea of what the minimum implementation required is.
2023-05-12 13:29:44 -04:00
Tim Gross
116f24d768 client: de-duplicate alloc updates and gate during restore (#17074)
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact
  with the servers. This means that if a server updates the desired state while
  the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
  RPC. This will typically result in the client updating the server with "running"
  and then immediately thereafter "complete".

* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
  if it's possible to assert that the server has definitely seen the client
  state. The allocrunner may queue-up updates even if we gate sending them. So
  then we end up with a race between the allocrunner updating its internal state
  to overwrite the previous update and `allocSync` sending the bogus or duplicate
  update.

This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.

The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
2023-05-11 09:05:24 -04:00
Seth Hoenig
2272ca9167 client: unveil /etc/ssh/ssh_known_hosts for artifact downloads (#17122)
This PR fixes a bug where nodes configured with populated
/etc/ssh/ssh_known_hosts files would be unable to read them during
artifact downloading.

Fixes #17086
2023-05-09 09:43:52 -05:00
Daniel Bennett
c2dc1c58dd full task cleanup when alloc prerun hook fails (#17104)
to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.

now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
 * set task state as dead
 * save task runner local state
 * task stop hooks

also done in tr.Run() now that it's not skipped:
 * handleKill() to kill tasks while respecting
   their shutdown delay, and retrying as needed
   * also includes task preKill hooks
 * clearDriverHandle() to destroy the task
   and associated resources
 * task exited hooks
2023-05-08 13:17:10 -05:00
Tim Gross
2aa3c746c4 logs: fix missing allocation logs after update to Nomad 1.5.4 (#17087)
When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
2023-05-04 16:01:18 -04:00
Seth Hoenig
c89f6a538e connect: remove unusable path for fallback envoy image names (#17044)
This PR does some cleanup of an old code path for versions of Consul that
did not support reporting the supported versions of Envoy in its API. Those
versions are no longer supported for years at this point, and the fallback
version of envoy hasn't been supported by any version of Consul for almost
as long. Remove this code path that is no longer useful.
2023-05-02 09:48:44 -05:00
Seth Hoenig
7744caed48 connect: use explicit docker.io prefix in default envoy image names (#17045)
This PR modifies references to the envoyproxy/envoy docker image to
explicitly include the docker.io prefix. This does not affect existing
users, but makes things easier for Podman users, who otherwise need to
specify the full name because Podman does not default to docker.io
2023-05-02 09:27:48 -05:00
Seth Hoenig
0b3bd454ec connect: do not restrict auto envoy version to docker task driver (#17041)
This PR updates the envoy_bootstrap_hook to no longer disable itself if
the task driver in use is not docker. In other words, make it work for
podman and other image based task drivers. The hook now only checks that

1. the task is a connect sidecar
2. the task.config block contains an "image" field
2023-05-01 15:07:35 -05:00
Seth Hoenig
889c5aa0f7 services: un-mark group services as deregistered if restart hook runs (#16905)
* services: un-mark group services as deregistered if restart hook runs

This PR may fix a bug where group services will never be deregistered if the
group undergoes a task restart.

* e2e: add test case for restart and deregister group service

* cl: add cl

* e2e: add wait for service list call
2023-04-24 14:24:51 -05:00
Tim Gross
30bc456f03 logs: allow disabling log collection in jobspec (#16962)
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.

This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.

---

Fixes: #11175

Co-authored-by: Thomas Weber <towe75@googlemail.com>
2023-04-24 10:00:27 -04:00
astudentofblake
206236039c fix: added landlock access to /usr/libexec for getter (#16900) 2023-04-20 11:16:04 -05:00
Luiz Aoqui
3f7143f50b allocrunner: prevent panic on network manager (#16921) 2023-04-18 13:39:13 -07:00
James Rasell
fc75e9d117 consul/connect: fixed a bug where restarting proxy tasks failed. (#16815)
The first start of a Consul Connect proxy sidecar triggers a run
of the envoy_version hook which modifies the task config image
entry. The modification takes into account a number of factors to
correctly populate this. Importantly, once the hook has run, it
marks itself as done so the taskrunner will not execute it again.

When the client receives a non-destructive update for the
allocation which the proxy sidecar is a member of, it will update
and overwrite the task definition within the taskerunner. In doing
so it overwrite the modification performed by the hook. If the
allocation is restarted, the envoy_version hook will be skipped as
it previously marked itself as done, and therefore the sidecar
config image is incorrect and causes a driver error.

The fix removes the hook in marking itself as done to the view of
the taskrunner.
2023-04-11 15:56:03 +01:00
Michael Schurter
a9f52f540e Merge pull request #16836 from hashicorp/compliance/add-headers
[COMPLIANCE] Add Copyright and License Headers
2023-04-10 16:32:03 -07:00
Daniel Bennett
dbe836595e gracefully recover tasks that use csi node plugins (#16809)
new WaitForPlugin() called during csiHook.Prerun,
so that on startup, clients can recover running
tasks that use CSI volumes, instead of them being
terminated and rescheduled because they need a
node plugin that is "not found" *yet*, only because
the plugin task has not yet been recovered.
2023-04-10 17:15:33 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
James Rasell
f257aec073 client: ensure envoy version hook uses all pointer receiver funcs. (#16813) 2023-04-06 14:47:00 +01:00
the-nando
385e3bab86 Do not set attributes when spawning the getter child (#16791)
* Do not set attributes when spawning the getter child

* Cleanup

* Cleanup

---------

Co-authored-by: the-nando <the-nando@invalid.local>
2023-04-05 11:47:51 -05:00
Tim Gross
f3fc54adcf CSI: set mounts in alloc hook resources atomically (#16722)
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).

The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.

Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
2023-04-03 11:03:36 -04:00
Seth Hoenig
ed498f8ddb nsd: always set deregister flag after deregistration of group (#16289)
* services: always set deregister flag after deregistration of group

This PR fixes a bug where the group service hook's deregister flag was
not set in some cases, causing the hook to attempt deregistrations twice
during job updates (alloc replacement).

In the tests ... we used to assert on the wrong behvior (remove twice) which
has now been corrected to assert we remove only once.

This bug was "silent" in the Consul provider world because the error logs for
double deregistration only show up in Consul logs; with the Nomad provider the
error logs are in the Nomad agent logs.

* services: cleanup group service hook tests
2023-03-17 09:44:21 -05:00
Seth Hoenig
995ab41cbc artifact: git needs more files for private repositories (#16508)
* landlock: git needs more files for private repositories

This PR fixes artifact downloading so that git may work when cloning from
private repositories. It needs

- file read on /etc/passwd
- dir read on /root/.ssh
- file write on /root/.ssh/known_hosts

Add these rules to the landlock rules for the artifact sandbox.

* cr: use nonexistent instead of devnull

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>

* cr: use go-homdir for looking up home directory

* pr: pull go-homedir into explicit require

* cr: fixup homedir tests in homeless root cases

* cl: fix root test for real

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-03-16 12:22:25 -05:00
Seth Hoenig
ea727dff9e artifact: do not set process attributes on darwin (#16511)
This PR fixes the non-root macOS use case where artifact downloads
stopped working. It seems setting a Credential on a SysProcAttr
used by the exec package will always cause fork/exec to fail -
even if the credential contains our own UID/GID or nil UID/GID.

Technically we do not need to set this as the child process will
inherit the parent UID/GID anyway... and not setting it makes
things work again ... /shrug
2023-03-16 11:31:18 -05:00
Luiz Aoqui
419c4bfa4e allocrunner: fix health check monitoring for Consul services (#16402)
Services must be interpolated to replace runtime variables before they
can be compared against the values returned by Consul.
2023-03-10 14:43:31 -05:00
Seth Hoenig
95359b8c4c client: disable running artifact downloader as nobody (#16375)
* client: disable running artifact downloader as nobody

This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.

However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which when setup with the recommended directory
permissions structure prevents artifact downloads from working as intended.

Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.

* docs: update upgrade guide for 1.5.1 sandboxing

* docs: add cl

* docs: add title to upgrade guide fix
2023-03-08 15:58:43 -06:00
Lance Haig
48e7d70fcd deps: Update ioutil deprecated library references to os and io respectively in the client package (#16318)
* Update ioutil deprecated library references to os and io respectively

* Deal with the errors produced.

Add error handling to filEntry info
Add error handling to info
2023-03-08 13:25:10 -06:00
Luiz Aoqui
b07af57618 client: don't emit task shutdown delay event if not waiting (#16281) 2023-03-03 18:22:06 -05:00
Farbod Ahmadian
fbd0dcbe9b tests: add functionality to skip a test if it's not running in CI and not with root user (#16222) 2023-03-02 13:38:27 -05:00
Tim Gross
f619b0bd7a populate Nomad token for task runner update hooks (#16266)
The `TaskUpdateRequest` struct we send to task runner update hooks was not
populating the Nomad token that we get from the task runner (which we do for the
Vault token). This results in task runner hooks like the template hook
overwriting the Nomad token with the zero value for the token. This causes
in-place updates of a task to break templates (but not other uses that rely on
identity but don't currently bother to update it, like the identity hook).
2023-02-27 10:48:13 -05:00
Seth Hoenig
30bcd51588 services: ensure task group is set on service hook (#16240)
This PR fixes a bug where the task group information was not being set
on the serviceHook.AllocInfo struct, which is needed later on for calculating
the CheckID of a nomad service check. The CheckID is calculated independently
from multiple callsites, and the information being passed in must be consistent,
including the group name.

The workload.AllocInfo.Group was not set at this callsite, due to the bug fixed in this PR.
 https://github.com/hashicorp/nomad/blob/main/client/serviceregistration/nsd/nsd.go#L114
2023-02-22 10:22:48 -06:00
Seth Hoenig
511d0c1e70 artifact: protect against unbounded artifact decompression (1.5.0) (#16151)
* artifact: protect against unbounded artifact decompression

Starting with 1.5.0, set defaut values for artifact decompression limits.

artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation

artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.

* artifact: assert limits cannot be nil in validation
2023-02-14 09:28:39 -06:00
Charlie Voiselle
bbf9b073fc [chore] Move TestUtil_loadVersionControlGlobalConfigs into build flagged file (#16114) 2023-02-09 14:25:26 -05:00
Seth Hoenig
77ea4e3239 users: eliminate LookupGroupId and its one use case (#16093)
This PR deletes the user.LookupGroupId function as it was only being used
in a single test case, and its value was not important to the test.
2023-02-08 14:57:09 -06:00
Luiz Aoqui
ae26057eec docs: update default Nomad bridge config (#16072) 2023-02-07 09:47:41 -05:00
Michael Schurter
9bab96ebd3 Task API via Unix Domain Socket (#15864)
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.

This PR contains the core feature and tests but followup work is required for the following TODO items:

- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-02-06 11:31:22 -08:00
Daniel Bennett
fad28e4265 docs: how to troubleshoot consul connect envoy (#15908)
* largely a doc-ification of this commit message:
  d47678074b
  this doesn't spell out all the possible failure modes,
  but should be a good starting point for folks.

* connect: add doc link to envoy bootstrap error

* add Unwrap() to RecoverableError
  mainly for easier testing
2023-02-02 14:20:26 -06:00
Charlie Voiselle
3225e5cead Update networking_bridge_linux.go (#16025)
* Removed line from previous implementation
* remove import

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-02-02 14:03:02 -05:00
Charlie Voiselle
fe4ff5be2a Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
Charlie Voiselle
55df5af4aa client: Add option to enable hairpinMode on Nomad bridge (#15961)
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-02-02 10:12:15 -05:00
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
舍我其谁
69b08bb706 volume: Add the missing option propagation_mode (#15626) 2023-01-30 09:32:07 -05:00
Seth Hoenig
d30e34261e client: always run alloc cleanup hooks on final update (#15855)
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
2023-01-27 09:59:31 -06:00
Luiz Aoqui
b668e74a4a template: restore driver handle on update (#15915)
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.

This caused the new template manager did not have a driver handler
because this was only being set on the Poststart hook, which is not
called for inplace updates.
2023-01-27 10:55:59 -05:00
Seth Hoenig
5f4368d83c artifact: enable reading system git/mercurial configuration (#15903)
This PR adjusts the artifact sandbox on Linux to enable reading from known
system-wide git or mercurial configuration, if they exist.

Folks doing something odd like specifying custom paths for global config will
need to use the standard locations, or disable artifact filesystem isolation.
2023-01-26 13:07:40 -06:00