Commit Graph

25833 Commits

Author SHA1 Message Date
Phil Renaud
8a9d58ae8f Storybook scripts and references removed (#22232) 2024-05-29 16:34:26 -04:00
Tim Gross
140747240f consul: include admin partition in JWT login requests (#22226)
When logging into a JWT auth method, we need to explicitly supply the Consul
admin partition if the local Consul agent is in a partition. We can't derive
this from agent configuration because the Consul agent's configuration is
canonical, so instead we get the partition from the fingerprint (if
available). This changeset updates the Consul client constructor so that we
close over the partition from the fingerprint.

Ref: https://hashicorp.atlassian.net/browse/NET-9451
2024-05-29 16:31:09 -04:00
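A minimal sketch of the idea in the commit above — deriving the admin partition from the client's fingerprinted attributes and closing over it when building the Consul client configuration. The attribute name, field names, and constructor shape are illustrative assumptions, not Nomad's actual internals.

package main

import "fmt"

// consulClientConfig stands in for the configuration Nomad builds for its
// Consul API client; the field names here are illustrative only.
type consulClientConfig struct {
	Address   string
	Partition string // empty string means the default partition
}

// newConsulClientConfig closes over the partition reported by the local
// Consul agent during fingerprinting, if one is available.
func newConsulClientConfig(attrs map[string]string) consulClientConfig {
	cfg := consulClientConfig{Address: "127.0.0.1:8500"}
	if p, ok := attrs["consul.partition"]; ok && p != "" {
		cfg.Partition = p // used later when issuing JWT login requests
	}
	return cfg
}

func main() {
	attrs := map[string]string{"consul.partition": "team-a"} // hypothetical fingerprint data
	fmt.Printf("%+v\n", newConsulClientConfig(attrs))
}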
Tim Gross
de38ff4189 consul: set partition for gateway config entries (#22228)
When we write Connect gateway configuration entries from the server, we're not
passing in the intended partition. This means we're using the server's own
partition to submit the configuration entries, which may not match the job's
intended partition. Note this requires that the Nomad server's token have
permission for that partition.

Also, move the config entry write after we check Sentinel policies. This allows
us to return early if we hit a Sentinel error without making Consul RPCs first.
2024-05-29 16:31:02 -04:00
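For illustration, a sketch of writing a gateway config entry with an explicit admin partition using the github.com/hashicorp/consul/api client — not Nomad's server code; the gateway name and partition are placeholders.

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	entry := &api.IngressGatewayConfigEntry{
		Kind: api.IngressGateway,
		Name: "my-ingress", // placeholder gateway service name
	}

	// Pass the job's intended partition explicitly instead of letting the
	// write default to the partition of the agent the server talks to.
	if _, _, err := client.ConfigEntries().Set(entry, &api.WriteOptions{
		Partition: "team-a", // placeholder admin partition
	}); err != nil {
		log.Fatal(err)
	}
}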
Daniel Bennett
c5dae2bf35 Merge pull request #22402 from hashicorp/post-1.8.0-release
Post 1.8.0 release
2024-05-29 13:46:07 -05:00
Daniel Bennett
05be289b24 Merge release 1.8.0 files 2024-05-29 13:57:57 -04:00
David Yu
42d72ff8a6 Merge pull request #22403 from hashicorp/david-yu-patch-1
docs: release note typo
2024-05-29 10:19:06 -07:00
David Yu
f083a27979 Update v1_8_x.mdx 2024-05-29 09:24:35 -07:00
hc-github-team-nomad-core
82e4ecd809 Prepare for next release 2024-05-29 11:48:56 -04:00
hc-github-team-nomad-core
32d820644a Generate files for 1.8.0 release 2024-05-29 11:48:55 -04:00
Seth Hoenig
9fb2b10ab6 e2e: no longer need consul terraform module (#22396) 2024-05-28 08:04:03 -05:00
David Yu
6493bc6c86 docs: Nomad 1.8 release notes (#22104) 2024-05-28 08:48:08 -04:00
David Yu
5f93bbb3cd docs: update CNI plugin version (#22341) 2024-05-28 08:47:43 -04:00
Tim Gross
91d422ec21 E2E: document how the AMIs are tagged and how those tags are used (#22237)
The process by which we tag AMIs with the commit SHA of the Packer directory
isn't documented in this repository, which makes it easy to accidentally build
an AMI that will break nightly E2E.
2024-05-24 11:11:00 -05:00
David Yu
ace3ccfcc2 Merge pull request #22234 from hashicorp/david-yu-patch-1
docs: small typo
2024-05-24 09:08:01 -07:00
James Rasell
81d87f1e9f config: fix panic in job using Vault cluster not in agent config. (#22227) 2024-05-24 15:13:20 +01:00
David Yu
1e90369c87 Update exec2.mdx
Small change, removal of extraneous open parentheses
2024-05-23 15:15:02 -07:00
Daniel Bennett
ac8fc25dd8 Merge pull request #22233 from hashicorp/post-1.8.0-rc.1-release
Post 1.8.0 rc.1 release
2024-05-23 16:17:02 -05:00
hc-github-team-nomad-core
5e1be121ad Prepare for next release 2024-05-23 16:55:05 -04:00
hc-github-team-nomad-core
c374bd375b Generate files for 1.8.0-rc.1 release 2024-05-23 16:55:05 -04:00
Daniel Bennett
032cddd7e8 Prepare release 1.8.0-rc.1 2024-05-23 16:55:05 -04:00
Piotr Kazmierczak
f0851bc989 job endpoint: fix implicit constraint mutation for task-level services (#22229)
Fixes a regression in Nomad 1.7 which caused task-level services to no longer
have their implicit Consul constraints created.
2024-05-23 19:27:47 +02:00
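A minimal sketch of a job mutator that restores the behavior described above, using simplified stand-in types (not Nomad's actual structs): the group gets the implicit Consul constraint when any of its tasks — not only the group itself — registers a Consul service.

package main

import "fmt"

// Simplified stand-ins for Nomad's job structs; field names are illustrative.
type Service struct{ Provider string }
type Task struct{ Services []Service }
type TaskGroup struct {
	Constraints []string
	Tasks       []Task
}

// addImplicitConsulConstraints ensures groups whose tasks register Consul
// services receive the implicit Consul constraint. A real implementation
// would also dedupe against existing constraints.
func addImplicitConsulConstraints(groups []*TaskGroup) {
	const consulConstraint = "${attr.consul.version} is set" // illustrative placeholder
	for _, tg := range groups {
		needed := false
		for _, task := range tg.Tasks {
			for _, svc := range task.Services {
				// An empty provider defaults to Consul.
				if svc.Provider == "" || svc.Provider == "consul" {
					needed = true
				}
			}
		}
		if needed {
			tg.Constraints = append(tg.Constraints, consulConstraint)
		}
	}
}

func main() {
	tg := &TaskGroup{Tasks: []Task{{Services: []Service{{Provider: "consul"}}}}}
	addImplicitConsulConstraints([]*TaskGroup{tg})
	fmt.Println(tg.Constraints)
}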
Phil Renaud
811b7e85f9 [ui] Better UX with filter expressions in the jobs index search box (#22100)
* Maintains rawSearchText separate from searchText

* Filter expression suggestions

* Now super-stops duelling queries on else-type error

* Filter suggestions and corrections

* Errorlink is now template standard and testfixes

* Mirage simulates healthy errors

* Test for bad filter expressions and snapshots
2024-05-22 23:39:37 -04:00
Phil Renaud
86c858cdc3 [ui] Sentinel Policies CRUD UI (#20483)
* Gallery allows picking stuff

* Small fixes

* added sentinel templates

* Can set enforcement level on policies

* Working on the interactive sentinel dev mode

* Very rough development flow on FE

* Changed position in gutter menu

* More sentinel stuff

* PR cleanup: removed testmode, removed unneeded mixins and deps

* Heliosification

* Index-level sentinel policy deletion and page title fixes

* Makes the Canaries sentinel policy real and then comments out the unfinished ones

* rename Access Control to Administration in prep for moving Sentinel Policies and Node Pool admin there

* Sentinel policies moved within the Administration section

* Mirage fixture for sentinel policy endpoints

* Description length check and 500 prevention

* Sync review PR feedback addressed, implied buttons on radio cards

* Cull unused sentinel policies

---------

Co-authored-by: Mike Nomitch <mail@mikenomitch.com>
2024-05-22 16:41:50 -04:00
Daniel Bennett
4415fabe7d jobspec: time based task execution (#22201)
this is the CE side of an Enterprise-only feature.
a job trying to use this in CE will fail to validate.

to enable daily-scheduled execution entirely client-side,
a job may now contain:

task "name" {
  schedule {
    cron {
      start    = "0 12 * * * *" # may not include "," or "/"
      end      = "0 16"         # partial cron, with only {minute} {hour}
      timezone = "EST"          # anything in your tzdb
    }
  }
...

and everything about the allocation will be placed as usual,
but if outside the specified schedule, the taskrunner will block
on the client, waiting on the schedule start, before proceeding
with the task driver execution, etc.

this includes a taskrunner hook, which watches for the end of
the schedule, at which point it will kill the task.

then, restarts-allowing, a new task will start and again block
waiting for start, and so on.

this also includes all the plumbing required to pipe API calls
through from command->api->agent->server->client, so that
tasks can be force-run, force-paused, or resume the schedule
on demand.
2024-05-22 15:40:25 -05:00
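A self-contained sketch of the client-side blocking behavior described above — the taskrunner waits for the schedule window to open before handing off to the task driver. This is not Nomad's actual hook code; the next start time would come from parsing the cron expression, but here it is passed in directly.

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForScheduleStart blocks until the next schedule window opens or the
// context is cancelled (e.g. the allocation is stopped while waiting).
func waitForScheduleStart(ctx context.Context, nextStart time.Time) error {
	wait := time.Until(nextStart)
	if wait <= 0 {
		return nil // already inside the schedule window
	}
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil // window opened; proceed to task driver execution
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx := context.Background()
	_ = waitForScheduleStart(ctx, time.Now().Add(2*time.Second))
	fmt.Println("schedule window open, starting task")
}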
David Yu
6a25c2fb12 docs: add installation section to exec2 driver (#22091)
* Update exec2.mdx

Add installation section

* Update exec2.mdx
2024-05-22 15:14:00 -05:00
Phil Renaud
e8b77fcfa0 [ui] Jobspec UI block: Descriptions and Links (#18292)
* Hacky but shows links and desc

* markdown

* Small pre-test cleanup

* Test for UI description and link rendering

* JSON jobspec docs and variable example job get UI block

* Jobspec documentation for UI block

* Description and links moved into the Title component and made into Helios components

* Marked version upgrade

* Allow links without a description and max description to 1000 chars

* Node 18 for setup-js

* markdown sanitization

* Ui to UI and docs change

* Canonicalize, copy and diff for job.ui

* UI block added to testJob for structs testing

* diff test

* Remove redundant reset

* For readability, changing the receiving pointer of copied job variables

* TestUI endpoint conversion tests

* -require +must

* Nil check on Links

* JobUIConfig.Links as pointer

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-05-22 15:00:45 -04:00
Seth Hoenig
09bd11383c client: alloc_mounts directory must be sibling of data directory (#22199)
This PR adjusts the default location of the -alloc-mounts-dir path to be a
sibling of the -data-dir path rather than a child. This is because on
production-hardened systems the data dir is supposed to be chmod 0700
owned by root - preventing the exec2 task driver (and others using
unveil file system isolation features) from working properly.

For reference the directory structure from -data-dir now looks like this
after running an example job. Under the alloc_mounts directory, task
specific directories are mode 0710 and owned by the task user (which
may be a dynamic user UID/GID).

➜ sudo tree -p -d -u /tmp/mynomad
[drwxrwxr-x shoenig ]  /tmp/mynomad
├── [drwx--x--x root    ]  alloc_mounts
│   └── [drwx--x--- 80552   ]  c753b71d-c6a1-3370-1f59-47ab838fd8a6-mytask
│       ├── [drwxrwxrwx nobody  ]  alloc
│       │   ├── [drwxrwxrwx nobody  ]  data
│       │   ├── [drwxrwxrwx nobody  ]  logs
│       │   └── [drwxrwxrwx nobody  ]  tmp
│       ├── [drwxrwxrwx nobody  ]  local
│       ├── [drwxr-xr-x root    ]  private
│       ├── [drwx--x--- 80552   ]  secrets
│       └── [drwxrwxrwt nobody  ]  tmp
└── [drwx------ root    ]  data
    ├── [drwx--x--x root    ]  alloc
    │   └── [drwxr-xr-x root    ]  c753b71d-c6a1-3370-1f59-47ab838fd8a6
    │       ├── [drwxrwxrwx nobody  ]  alloc
    │       │   ├── [drwxrwxrwx nobody  ]  data
    │       │   ├── [drwxrwxrwx nobody  ]  logs
    │       │   └── [drwxrwxrwx nobody  ]  tmp
    │       └── [drwx--x--- 80552   ]  mytask
    │           ├── [drwxrwxrwx nobody  ]  alloc
    │           │   ├── [drwxrwxrwx nobody  ]  data
    │           │   ├── [drwxrwxrwx nobody  ]  logs
    │           │   └── [drwxrwxrwx nobody  ]  tmp
    │           ├── [drwxrwxrwx nobody  ]  local
    │           ├── [drwxrwxrwx nobody  ]  private
    │           ├── [drwx--x--- 80552   ]  secrets
    │           └── [drwxrwxrwt nobody  ]  tmp
    ├── [drwx------ root    ]  client
    └── [drwxr-xr-x root    ]  server
        ├── [drwx------ root    ]  keystore
        ├── [drwxr-xr-x root    ]  raft
        │   └── [drwxr-xr-x root    ]  snapshots
        └── [drwxr-xr-x root    ]  serf

32 directories
2024-05-22 13:14:34 -05:00
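A tiny sketch of the sibling-directory default described above, assuming a hypothetical helper name; the real default path computation in Nomad may differ.

package main

import (
	"fmt"
	"path/filepath"
)

// defaultAllocMountsDir illustrates the new default: a sibling of the data
// directory rather than a child, so a chmod 0700 data dir doesn't block task
// users from traversing into their mount points.
func defaultAllocMountsDir(dataDir string) string {
	return filepath.Join(filepath.Dir(dataDir), "alloc_mounts")
}

func main() {
	fmt.Println(defaultAllocMountsDir("/tmp/mynomad/data")) // /tmp/mynomad/alloc_mounts
}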
Tim Gross
5bfb500932 refactor scheduler tests for node down/disconnected (#22198)
While working on #20462 #12319 I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch
  in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the
  pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
  test setup rather than implied by whether we re-process the pending/blocked
  eval.

Ref: https://github.com/hashicorp/nomad/issues/20462
Ref: https://github.com/hashicorp/nomad/pull/12319
2024-05-22 10:23:08 -04:00
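For context on the test migration mentioned above, a small sketch (not code from this changeset) contrasting testify-style and shoenig/test-style assertions, assuming both modules are available.

package example

import (
	"testing"

	"github.com/shoenig/test/must"
	"github.com/stretchr/testify/require"
)

func TestTestifyStyle(t *testing.T) {
	got, want := 2, 2
	require.NoError(t, nil)     // testify: require.* aborts the test on failure
	require.Equal(t, want, got) // interface{}-based equality
}

func TestShoenigTestStyle(t *testing.T) {
	got, want := 2, 2
	must.NoError(t, nil)  // shoenig/test: must.* aborts the test on failure
	must.Eq(t, want, got) // generics-based equality
}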
KeisukeYamashita
1b872c422c build: fix broken link to nomad in docker (#22191)
Signed-off-by: KeisukeYamashita <19yamashita15@gmail.com>
2024-05-22 12:02:25 +02:00
Nick Wales
1174019676 docs: typo fix (#22090) 2024-05-21 14:29:31 -04:00
Michael Schurter
a3b1810bdb doc: specify ca cert needs to be shared (#20620)
Specify that the Vault JWT auth method must be configured to trust Nomad's CA certificate when mTLS is enabled.
2024-05-17 14:49:48 -07:00
Tim Gross
5a6262d1c4 tproxy: add implicit constraint on client version (#20623)
The new transparent proxy feature already has an implicit constraint on the
presence of the CNI plugin. But if the CNI plugin is installed on a client
running an older version of Nomad, this isn't sufficient to protect against
placing tproxy workloads on clients that can't support it. Add a Nomad version
constraint as well.

Fixes: https://github.com/hashicorp/nomad/issues/20614
2024-05-17 11:57:06 -04:00
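A sketch of the pair of implicit constraints the commit above describes, using a simplified stand-in constraint struct; the attribute names and version string are illustrative, not the exact values Nomad uses.

package main

import "fmt"

// Constraint is a simplified stand-in for Nomad's constraint struct.
type Constraint struct {
	LTarget string
	RTarget string
	Operand string
}

// implicitTproxyConstraints sketches the idea: alongside the existing CNI
// plugin constraint, require a minimum Nomad client version.
func implicitTproxyConstraints() []Constraint {
	return []Constraint{
		{LTarget: "${attr.plugins.cni.version.consul-cni}", Operand: "is_set"},
		{LTarget: "${attr.nomad.version}", RTarget: ">= 1.8.0", Operand: "semver"},
	}
}

func main() {
	for _, c := range implicitTproxyConstraints() {
		fmt.Printf("%+v\n", c)
	}
}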
Piotr Kazmierczak
b5bca27c07 docs: add a note to binding rules docs about multiple rules application (#20624) 2024-05-17 17:40:12 +02:00
Seth Hoenig
7d00a494d9 windows: fix inefficient gathering of task processes (#20619)
* windows: fix inefficient gathering of task processes

* return set of just executor pid in case of ps error
2024-05-17 09:46:23 -05:00
Ben Roberts
a6f6384b71 Permit Consul Connect Gateways to be used with podman (#20611)
* Permit Consul Connect Gateways to be used with podman

Enable use of Consul Connect Gateways (ingress/terminating/mesh)
with the podman task driver.

An earlier PR added a heuristic that chose the task driver for
Connect-enabled sidecar service tasks: it used podman if any other task
in the same task group was using podman, or fell back to docker otherwise.

That PR did not consider Consul Connect gateways, which remained
hardcoded to always use the docker task driver.

This change applies the same heuristic also to gateway tasks,
enabling use of podman.

Limitations: The heuristic only works where the task group containing
the gateway also contains a podman task. Therefore it does not work
for the ingress example in the docs
(https://developer.hashicorp.com/nomad/docs/job-specification/gateway#ingress-gateway)
which uses connect native and requires the gateway be in a separate task.

* cl: add cl for connect gateway podman autodetect

* connect: add test ensuring we guess podman for gateway when possible

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-05-17 09:26:09 -05:00
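A minimal sketch of the heuristic described above, with simplified inputs (a list of the group's task driver names rather than Nomad's task structs): use podman for the injected gateway task when any other task in the group already uses podman, otherwise fall back to docker.

package main

import "fmt"

// gatewayTaskDriver picks the driver for the injected gateway task based on
// the drivers used by the other tasks in the same task group.
func gatewayTaskDriver(groupTaskDrivers []string) string {
	for _, driver := range groupTaskDrivers {
		if driver == "podman" {
			return "podman"
		}
	}
	return "docker"
}

func main() {
	fmt.Println(gatewayTaskDriver([]string{"podman"}))         // podman
	fmt.Println(gatewayTaskDriver([]string{"exec", "docker"})) // docker
}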
claire labry
e9d6c39dba SMRE/BPA Onboarding LTS (#20595)
Configuration changes to use backport assistant with LTS support. These include:

* adding a manifest file for active releases
* adding configuration to send backport to ENT repo
2024-05-17 08:21:42 -04:00
Tim Gross
5666065131 tests: update disconnected client scheduler tests to avoid blocking (#20615)
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients were making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.

In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
2024-05-16 12:16:27 -04:00
Tim Gross
c8c67da52d CSI: allow plugin GC to detect jobs with updated plugin IDs (#20555)
When a job that implements a plugin is updated to have a new plugin ID, the old
version of the plugin is never deleted. We want to delay deleting plugins until
garbage collection to avoid race conditions between a plugin being registered
and its allocations being marked healthy.

Add logic to the state store's `DeleteCSIPlugin` method (used only by GC) to
check whether any of the jobs associated with the plugin have no allocations and
either have been purged or have been updated to no longer implement that plugin
ID.

This changeset also updates the CSI plugin lifecycle tests in the state store to
use `shoenig/test` over `testify`, and removes a spurious error log that was
happening on every periodic plugin GC attempt.

Fixes: https://github.com/hashicorp/nomad/issues/20225
2024-05-16 10:29:07 -04:00
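One possible reading of the GC check described above, as a self-contained sketch with stand-in types (not the state store's actual code): the plugin is deletable only when every associated job has no allocations and has either been purged or been updated to a different plugin ID.

package main

import "fmt"

// pluginJob is a simplified view of a job associated with a CSI plugin.
type pluginJob struct {
	Purged    bool
	PluginID  string // plugin ID the current job version implements
	NumAllocs int
}

// canDeletePlugin reports whether garbage collection may remove the plugin.
func canDeletePlugin(pluginID string, jobs []pluginJob) bool {
	for _, j := range jobs {
		movedOff := j.Purged || j.PluginID != pluginID
		if j.NumAllocs == 0 && movedOff {
			continue
		}
		return false
	}
	return true
}

func main() {
	jobs := []pluginJob{{PluginID: "ebs-v2", NumAllocs: 0}}
	fmt.Println(canDeletePlugin("ebs-v1", jobs)) // true: job moved to a new plugin ID
}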
Tim Gross
b1657dd1fa CSI: track node claim before staging to prevent interleaved unstage (#20550)
The CSI hook for each allocation that claims a volume runs concurrently. If a
call to `MountVolume` happens at the same time as a call to `UnmountVolume` for
the same volume, it's possible for the second alloc to detect the volume has
already been staged, then for the original alloc to unpublish and unstage it,
only for the second alloc to then attempt to publish a volume that's been
unstaged.

The usage tracker on the volume manager was intended to prevent this behavior
but the call to claim the volume was made only after staging and publishing was
complete. Move the call to claim the volume for the usage tracker to the top of
the `MountVolume` workflow to prevent it from being unstaged until all consuming
allocations have called `UnmountVolume`.

Fixes: https://github.com/hashicorp/nomad/issues/20424
2024-05-16 09:45:07 -04:00
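A sketch of the usage-tracking idea above, with a simplified tracker (not the volume manager's real implementation): claims are taken at the very top of MountVolume, so a concurrent UnmountVolume for another alloc can see the volume is still in use and skip unstaging it.

package main

import (
	"fmt"
	"sync"
)

// usageTracker records which allocations currently claim each volume.
type usageTracker struct {
	mu     sync.Mutex
	claims map[string]map[string]struct{} // volID -> set of allocIDs
}

func newUsageTracker() *usageTracker {
	return &usageTracker{claims: make(map[string]map[string]struct{})}
}

// Claim is called before staging/publishing begins.
func (u *usageTracker) Claim(allocID, volID string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.claims[volID] == nil {
		u.claims[volID] = make(map[string]struct{})
	}
	u.claims[volID][allocID] = struct{}{}
}

// Free drops a claim and reports whether the volume is now unused and can
// therefore be unstaged safely.
func (u *usageTracker) Free(allocID, volID string) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	delete(u.claims[volID], allocID)
	return len(u.claims[volID]) == 0
}

func main() {
	t := newUsageTracker()
	t.Claim("alloc-1", "vol-1")
	t.Claim("alloc-2", "vol-1")
	fmt.Println(t.Free("alloc-1", "vol-1")) // false: alloc-2 still holds a claim
}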
Tim Gross
953bfcc31e services: retry failed Nomad service deregistrations from client (#20596)
When the allocation is stopped, we deregister the service in the alloc runner's
`PreKill` hook. This ensures we delete the service registration and wait for the
shutdown delay before shutting down the tasks, so that workloads can drain their
connections. However, the call to remove the workload only logs errors and never
retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so
that transient errors give us an extra opportunity to deregister the service
before the tasks are stopped, before we need to fall back to the data integrity
improvements implemented in #20590.

Ref: https://github.com/hashicorp/nomad/issues/16616
2024-05-16 08:59:54 -04:00
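A small sketch of the retry loop described above; the attempt count, backoff, and function shape are illustrative, not the actual RemoveWorkload signature.

package main

import (
	"errors"
	"fmt"
	"time"
)

// removeWorkloadWithRetry gives transient deregistration errors a few extra
// chances before the tasks stop and we have to fall back to the server-side
// data integrity fixes.
func removeWorkloadWithRetry(deregister func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = deregister(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * 200 * time.Millisecond)
	}
	return fmt.Errorf("failed to deregister service after retries: %w", err)
}

func main() {
	calls := 0
	err := removeWorkloadWithRetry(func() error {
		calls++
		if calls < 2 {
			return errors.New("transient RPC error")
		}
		return nil
	})
	fmt.Println(err, "calls:", calls)
}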
Dianne Laguerta
cabdd7eddb migrate GHA workflows to using single runner labels (#20581) 2024-05-16 13:35:10 +01:00
Szymon Nowicki-Korgol
898dddc5db structs: Fix job canonicalization for array type fields (#20522)
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2024-05-16 14:05:12 +02:00
Phil Renaud
6886edf033 Makes it so an empty state query blocks and changes the style to be more Nomadic (#20588) 2024-05-15 13:57:48 -04:00
Deniz Onur Duzgun
1cc99cc1b4 bug: resolve type conversion alerts (#20553) 2024-05-15 13:22:10 -04:00
Tim Gross
6d806a9934 services: fix data integrity errors for Nomad native services (#20590)
This changeset fixes three potential data integrity issues between allocations
and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and
  Consul tokens (for the pre-Workload Identity workflows) after we've written
  the node update to Raft. This is unavoidably non-transactional because the
  Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But
  we've unnecessarily mirrored this same behavior to deregister Nomad
  services. This makes it possible for the leader to successfully write the node
  update to Raft without removing services.

  To address this, move the delete into the same Raft transaction. One minor
  caveat with this approach is the upgrade path: if the leader is upgraded first
  and a node is marked down during this window, older followers will have stale
  information until they are also upgraded. This is unavoidable without
  requiring the leader to unconditionally make an extra Raft write for every
  down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in
  data integrity for stale reads seems like a reasonable tradeoff.

* When an allocation is marked client-terminal from the client in
  `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by
  deregistering services for that allocation.

* When an allocation is deleted during eval garbage collection, we have an
  opportunity to ensure data integrity by deregistering services for that
  allocation. This is a cheap no-op if the allocation has been previously marked
  client-terminal.

This changeset does not address client-side retries for the originally reported
issue, which will be done in a separate PR.

Ref: https://github.com/hashicorp/nomad/issues/16616
2024-05-15 11:56:07 -04:00
Seth Hoenig
4148ca1769 client: mount shared alloc dir as nobody (#20589)
In the Unveil filesystem isolation mode we were mounting the shared
alloc dir with the UID/GID of the user of the task dir being mounted
and 0710 filesystem permissions. This was causing the actual task dir
to become inaccessible to other tasks in the allocation (a race where
the last mounter wins). Instead mount the shared alloc dir as nobody
with 0777 filesystem permissions.
2024-05-15 10:43:30 -05:00
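A sketch of the fix above under the assumption that a "nobody" user exists on the host: the shared alloc dir is owned by nobody with 0777 permissions instead of the last-mounted task's user with 0710. The helper name is hypothetical, and chown requires root.

package main

import (
	"fmt"
	"log"
	"os"
	"os/user"
	"strconv"
)

// prepareSharedAllocDir makes the shared alloc dir reachable by every task in
// the allocation, regardless of which task's mount happened last.
func prepareSharedAllocDir(path string) error {
	if err := os.MkdirAll(path, 0777); err != nil {
		return err
	}
	nobody, err := user.Lookup("nobody")
	if err != nil {
		return err
	}
	uid, _ := strconv.Atoi(nobody.Uid)
	gid, _ := strconv.Atoi(nobody.Gid)
	if err := os.Chown(path, uid, gid); err != nil { // requires root
		return err
	}
	return os.Chmod(path, 0777) // MkdirAll is subject to umask, so chmod explicitly
}

func main() {
	if err := prepareSharedAllocDir("/tmp/example-alloc/alloc"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("shared alloc dir prepared")
}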
Tim Gross
c9fd93c772 connect: support volume_mount blocks for sidecar task overrides (#20575)
Users can override the default sidecar task for Connect workloads. This sidecar
task might need access to certificate stores on the host. Allow adding the
`volume_mount` block to the sidecar task override.

Also fixes a bug where `volume_mount` blocks would not appear in plan diff
outputs.

Fixes: https://github.com/hashicorp/nomad/issues/19786
2024-05-14 12:49:37 -04:00
James Rasell
04ba358266 client: expose network namespace CNI config as task env vars. (#11810)
This change exposes CNI configuration details of a network
namespace as environment variables. This allows a task to use
these values to configure itself; a potential use case is to run
a Raft application binding to IP and Port details configured using
the bridge network mode.
2024-05-14 09:02:06 +01:00
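A sketch of the idea above — turning recorded CNI details for a bridge-mode network namespace into task environment variables. The variable names and result fields here are illustrative, not necessarily the names Nomad exposes.

package main

import "fmt"

// cniResult is a simplified stand-in for the CNI result recorded for a
// network namespace.
type cniResult struct {
	Interface string
	IP        string
}

// cniTaskEnv builds environment variables a task could use to configure
// itself, e.g. a Raft application binding to the bridge-assigned IP.
func cniTaskEnv(r cniResult) map[string]string {
	return map[string]string{
		"NOMAD_ALLOC_INTERFACE": r.Interface,
		"NOMAD_ALLOC_IP":        r.IP,
	}
}

func main() {
	env := cniTaskEnv(cniResult{Interface: "eth0", IP: "172.26.64.11"})
	for k, v := range env {
		fmt.Printf("%s=%s\n", k, v)
	}
}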
Juana De La Cuesta
169818b1bd [gh-6980] Client: clean up old allocs before running new ones using the exec task driver. (#20500)
Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.

This commit introduces two mechanisms to avoid this behaviour:

* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre-start clean-up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
2024-05-14 09:51:27 +02:00
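A self-contained sketch of the signal-forwarding idea from the first bullet above — the executor catches SIGTERM and passes it on to the task process so the task doesn't outlive its executor. This is not the Nomad executor's actual implementation; the task command is a placeholder.

package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "300") // placeholder task command
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		sig := <-sigs
		// Forward the signal to the task before the executor exits.
		if err := cmd.Process.Signal(sig); err != nil {
			log.Printf("failed to forward %v to task: %v", sig, err)
		}
	}()

	if err := cmd.Wait(); err != nil {
		log.Printf("task exited: %v", err)
	}
}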
Tim Gross
5b328d9adc CSI: add support for wildcard namespaces on plugin status (#20551)
The `nomad plugin status :plugin_id` command lists allocations that implement
the plugin being queried. This list is filtered by the `-namespace` flag as
usual. Cluster admins will likely deploy plugins to a single namespace, but for
convenience they may want to have the wildcard namespace set in their command
environment.

Add support for handling the wildcard namespace to the CSI plugin RPC handler.

Fixes: https://github.com/hashicorp/nomad/issues/20537
2024-05-13 15:42:35 -04:00
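A sketch of the wildcard handling described above, with stand-in types rather than the real RPC handler: when the request namespace is "*", the namespace filter is skipped and plugin allocations are returned across all namespaces.

package main

import "fmt"

const allNamespaces = "*"

// alloc is a simplified stand-in for an allocation that implements a plugin.
type alloc struct {
	Namespace string
	PluginID  string
}

// filterPluginAllocs returns the allocations for a plugin, honoring the
// wildcard namespace.
func filterPluginAllocs(allocs []alloc, pluginID, namespace string) []alloc {
	var out []alloc
	for _, a := range allocs {
		if a.PluginID != pluginID {
			continue
		}
		if namespace != allNamespaces && a.Namespace != namespace {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	allocs := []alloc{
		{Namespace: "default", PluginID: "ebs"},
		{Namespace: "prod", PluginID: "ebs"},
	}
	fmt.Println(len(filterPluginAllocs(allocs, "ebs", "*"))) // 2
}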