26679 Commits

Author SHA1 Message Date
James Rasell
37fb418a16 deps: Update consul-template to 0.40.0 (#25140) 2025-02-18 14:14:14 +00:00
Juana De La Cuesta
af2ac87409 Simplify binary overrides on e2e provision (#25122)
* func: remove the lists to override the nomad_local_binary for servers and clients

* docs: add a note to the terraform e2e readme

* fix: remove the extra 'windows' from the aws_ami filter

* style: hcl fmt
2025-02-17 16:13:32 +01:00
dependabot[bot]
05681afa57 chore(deps): bump golang.org/x/sys from 0.29.0 to 0.30.0 (#25127)
Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.29.0 to 0.30.0.
- [Commits](https://github.com/golang/sys/compare/v0.29.0...v0.30.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-17 13:50:55 +01:00
dependabot[bot]
b7b18d1c50 chore(deps): bump go.etcd.io/bbolt from 1.3.11 to 1.4.0 (#25130)
Bumps [go.etcd.io/bbolt](https://github.com/etcd-io/bbolt) from 1.3.11 to 1.4.0.
- [Release notes](https://github.com/etcd-io/bbolt/releases)
- [Commits](https://github.com/etcd-io/bbolt/compare/v1.3.11...v1.4.0)

---
updated-dependencies:
- dependency-name: go.etcd.io/bbolt
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-17 13:50:00 +01:00
dependabot[bot]
21ad3ed938 chore(deps): bump github.com/hashicorp/go-bexpr from 0.1.13 to 0.1.14 (#25128)
Bumps [github.com/hashicorp/go-bexpr](https://github.com/hashicorp/go-bexpr) from 0.1.13 to 0.1.14.
- [Release notes](https://github.com/hashicorp/go-bexpr/releases)
- [Commits](https://github.com/hashicorp/go-bexpr/compare/v0.1.13...v0.1.14)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/go-bexpr
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-17 11:01:30 +01:00
dependabot[bot]
e87cf9d4b9 chore(deps): bump golang.org/x/time from 0.9.0 to 0.10.0 (#25131)
Bumps [golang.org/x/time](https://github.com/golang/time) from 0.9.0 to 0.10.0.
- [Commits](https://github.com/golang/time/compare/v0.9.0...v0.10.0)

---
updated-dependencies:
- dependency-name: golang.org/x/time
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-17 10:29:23 +01:00
Paweł Bęza
43885f6854 Allow for in-place update when affinity or spread was changed (#25109)
Similarly to #6732, this removes the affinity and spread checks for in-place updates.
Both affinity and spread should be treated as soft preferences by the Nomad scheduler rather than strict constraints. Therefore, modifying them should not trigger job reallocation.

Fixes #25070
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-14 14:33:18 -05:00
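Note on 43885f6854: the idea can be sketched as a comparison that ignores soft placement preferences. This is a minimal illustration, not the scheduler's code; the `taskGroup` type, its fields, and `needsReplacement` are hypothetical names invented for the example.

```go
package main

import (
	"fmt"
	"slices"
)

// taskGroup is a hypothetical, heavily reduced job spec fragment.
type taskGroup struct {
	Image       string
	Constraints []string // hard placement rules
	Affinities  []string // soft placement preferences
	Spreads     []string // soft placement preferences
}

// needsReplacement reports whether moving from old to updated requires new
// allocations. Affinities and Spreads are deliberately not compared.
func needsReplacement(old, updated taskGroup) bool {
	if old.Image != updated.Image {
		return true
	}
	return !slices.Equal(old.Constraints, updated.Constraints)
}

func main() {
	old := taskGroup{Image: "app:1", Affinities: []string{"rack=r1"}}

	softChange := old
	softChange.Affinities = []string{"rack=r2"}

	hardChange := old
	hardChange.Constraints = []string{"kernel=linux"}

	fmt.Println(needsReplacement(old, softChange)) // false
	fmt.Println(needsReplacement(old, hardChange)) // true
}
```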
Aimee Ukasick
f1a1ff678c Docs: Clarify Job status mapping on Job page (#25105)
* Add dead (stopped) to status mapping to clarify Stopped

CE-816

* Pull status mapping into partial and include in job status command

* change `complete` to dead in table after discussing with Michael

* added clarifications; add CLI status definitions

* fixed line endings

* fixed typoce816dead
2025-02-14 09:47:11 -06:00
Tim Gross
7b89c0ee28 template: fix client's default retry configuration (#25113)
In #20165 we fixed a bug where a partially configured `client.template` retry
block would set any unset fields to nil instead of their default values. But
this patch introduced a regression in the default values, so we were now
defaulting to unlimited retries if the retry block was unset. Restore the
correct behavior and add better test coverage at both the config parsing and
template configuration code.

Ref: https://github.com/hashicorp/nomad/pull/20165
Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565
2025-02-14 09:25:41 -05:00
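Note on 7b89c0ee28: the defaulting behavior described above can be sketched as follows. This is an assumption-laden illustration: `RetryConfig`, `withDefaults`, and the default values are hypothetical and are not taken from Nomad's source. The point is only that an unset field, or an entirely unset block, should end up with bounded defaults.

```go
package main

import (
	"fmt"
	"time"
)

// RetryConfig is a hypothetical stand-in for a template retry block.
// Pointer fields distinguish "unset" (nil) from an explicit value.
type RetryConfig struct {
	Attempts   *int
	Backoff    *time.Duration
	MaxBackoff *time.Duration
}

// Illustrative defaults only; these are not Nomad's real values.
const (
	defaultAttempts   = 12
	defaultBackoff    = 250 * time.Millisecond
	defaultMaxBackoff = time.Minute
)

func ptr[T any](v T) *T { return &v }

// withDefaults returns a copy of in with every unset field filled in.
// A nil block is treated as fully unset, so it also gets the defaults.
func withDefaults(in *RetryConfig) *RetryConfig {
	out := &RetryConfig{}
	if in != nil {
		*out = *in
	}
	if out.Attempts == nil {
		out.Attempts = ptr(defaultAttempts)
	}
	if out.Backoff == nil {
		out.Backoff = ptr(defaultBackoff)
	}
	if out.MaxBackoff == nil {
		out.MaxBackoff = ptr(defaultMaxBackoff)
	}
	return out
}

func main() {
	partial := withDefaults(&RetryConfig{Attempts: ptr(3)})
	fmt.Println(*partial.Attempts, *partial.Backoff, *partial.MaxBackoff)

	unset := withDefaults(nil)
	fmt.Println(*unset.Attempts, *unset.Backoff, *unset.MaxBackoff)
}
```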
Tim Gross
8c57fd5eb0 fingerprint: initial fingerprint of Vault/Consul should be periodic (#25102)
In #24526 we updated the Consul and Vault fingerprints so that they are no
longer periodic. This fixed a problem that cluster admins reported where rolling
updates of Vault or Consul would cause a thundering herd of fingerprint updates
across the whole cluster.

But if Consul/Vault is not available during the initial fingerprint, it will
never get fingerprinted again. This is challenging for cluster updates and black
starts because the implicit service startup ordering may require
reloads. Instead, have the fingerprinter run periodically but mark that it has
made its first successful fingerprint of all Consul/Vault clusters. At that
point, we can skip further periodic updates. The `Reload` method will reset the
mark and allow the subsequent fingerprint to run normally.

Fixes: https://github.com/hashicorp/nomad/issues/25097
Ref: https://github.com/hashicorp/nomad/pull/24526
Ref: https://github.com/hashicorp/nomad/issues/24049
2025-02-13 14:26:04 -05:00
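Note on 8c57fd5eb0: a rough sketch of "run periodically until the first success, then skip, and let `Reload` reset the mark". The `fingerprinter` type, its fields, and `errUnavailable` are hypothetical; a real implementation would also need locking, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error for an unreachable service.
var errUnavailable = errors.New("service unavailable")

// fingerprinter is a hypothetical sketch, not Nomad's actual type.
type fingerprinter struct {
	probe func() error // hypothetical check against Consul or Vault
	done  bool         // true once one fingerprint has succeeded
	runs  int          // number of probes actually performed
}

// Periodic is called on every tick. It does nothing once the first
// successful fingerprint has been recorded.
func (f *fingerprinter) Periodic() {
	if f.done {
		return
	}
	f.runs++
	if err := f.probe(); err == nil {
		f.done = true
	}
}

// Reload clears the mark so the next tick fingerprints again.
func (f *fingerprinter) Reload() { f.done = false }

func main() {
	available := false
	f := &fingerprinter{probe: func() error {
		if !available {
			return errUnavailable
		}
		return nil
	}}

	f.Periodic() // fails, so the next tick retries
	available = true
	f.Periodic() // succeeds and sets the mark
	f.Periodic() // skipped
	fmt.Println(f.runs) // 2

	f.Reload()
	f.Periodic() // runs again after the reload
	fmt.Println(f.runs) // 3
}
```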
Tim Gross
c2298e0999 Dynamic host volume reference documentation (#24797) 2025-02-13 12:25:58 -05:00
Jorge Marey
25426f0777 fingerprint: add config option to disable dmidecode (#25108) 2025-02-13 11:20:48 -05:00
Juana De La Cuesta
af735dce16 F net 11478 enos versions (#25092)
* fix: change the value of the version used for testing to account for ent versions

* func: add more specific test for servers stability

* func: change the criteria we use to verify the cluster stability after server upgrades

* style: syntax
2025-02-13 10:32:43 +01:00
Aimee Ukasick
35365bc1fb resolve merge conflicts 2025-02-12 11:43:21 -06:00
Tim Gross
716df52788 CNI: migrate from persistent state to ephemeral state during restart (#25093)
In #24650 we switched to using ephemeral state for CNI plugins, so that when a
host reboots and we lose all the allocations we don't end up trying to use IPs
we created in network namespaces we just destroyed. Unfortunately upgrade
testing missed that in a non-reboot scenario, the existing CNI state was being
used by plugins like the ipam plugin to hand out the "next available" IP
address. So with no state carried over, we might allocate new addresses that
conflict with existing allocations. (This can be avoided by draining the node
first.)

As a compatibility shim, copy the old CNI state directory to the new CNI state
directory during agent startup, if the new CNI state directory doesn't already
exist.

Ref: https://github.com/hashicorp/nomad/pull/24650
2025-02-12 09:25:50 -05:00
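Note on 716df52788: the compatibility shim amounts to "copy the old directory to the new one only if the new one does not exist yet". The sketch below shows that pattern with the standard library; `migrateState`, `runDemo`, and the file name are hypothetical, and it ignores symlinks, ownership, and original file modes for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// migrateState copies oldDir to newDir only when oldDir exists and newDir
// does not. It reports whether a copy happened.
func migrateState(oldDir, newDir string) (bool, error) {
	if _, err := os.Stat(newDir); err == nil {
		return false, nil // new state already present; leave it alone
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	if _, err := os.Stat(oldDir); errors.Is(err, fs.ErrNotExist) {
		return false, nil // nothing to migrate
	} else if err != nil {
		return false, err
	}
	err := filepath.WalkDir(oldDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(oldDir, path)
		if err != nil {
			return err
		}
		dst := filepath.Join(newDir, rel)
		if d.IsDir() {
			return os.MkdirAll(dst, 0o755)
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		return os.WriteFile(dst, data, 0o644)
	})
	return err == nil, err
}

// runDemo migrates a throwaway directory twice and returns both results.
func runDemo() (first, second bool, err error) {
	root, err := os.MkdirTemp("", "cni-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	oldDir := filepath.Join(root, "old")
	newDir := filepath.Join(root, "new")
	if err := os.MkdirAll(oldDir, 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(oldDir, "ipam"), []byte("10.0.0.7"), 0o644); err != nil {
		return false, false, err
	}
	if first, err = migrateState(oldDir, newDir); err != nil {
		return false, false, err
	}
	second, err = migrateState(oldDir, newDir)
	return first, second, err
}

func main() {
	first, second, err := runDemo()
	fmt.Println(first, second, err) // true false <nil>
}
```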
Tim Gross
f0d3c2834e upgrade testing: add README and fix authorization header (#25059)
Add a README describing the setup required for running upgrade testing via
Enos. Also fix the authorization header of our `wget` to use the proper header
for short-lived tokens, and the output path variable of the artifactory step.

Co-authored-by: Juanadelacuesta <8647634+Juanadelacuesta@users.noreply.github.com>
2025-02-12 08:56:47 -05:00
James Rasell
268e90dedf ci: Update semgrep container version to 1.107.0 (#25078) 2025-02-12 09:48:26 +00:00
James Rasell
d8841e011f semgrep: Fix invalid RPC rule and add validation GHA workflow. (#25088) 2025-02-12 09:44:27 +00:00
Daniel Bennett
1c0caddb98 Merge pull request #25094 from hashicorp/post-1.9.6-release
Post 1.9.6 release
2025-02-11 18:06:12 -05:00
Daniel Bennett
c16d318bbe Merge release 1.9.6 files 2025-02-11 17:36:15 -05:00
hc-github-team-nomad-core
ca21509631 Prepare for next release 2025-02-11 17:03:45 -05:00
hc-github-team-nomad-core
ac36990fe3 Generate files for 1.9.6 release 2025-02-11 17:03:45 -05:00
Piotr Kazmierczak
5468829260 stateful deployments: fix return in the hasVolumes feasibility check (#25084)
A return statement was missing in the sticky volume check: when we weren't able
to find a suitable volume, we did not return false. This was caught by an e2e
test.

This PR fixes the issue, and corrects and expands the unit test.
2025-02-11 18:57:48 +01:00
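Note on 5468829260: the class of bug described above can be illustrated with a feasibility check that has an early exit. This is a hypothetical reduction, not the actual `hasVolumes` code; `volume`, `request`, `nodeHasVolume`, and `feasible` are invented names. Without the marked `return false`, control would fall through to the final `return true`.

```go
package main

import "fmt"

// volume and request are hypothetical simplifications.
type volume struct{ ID, Node string }

type request struct {
	Sticky    bool
	ClaimedID string // empty when nothing has been claimed yet
}

// nodeHasVolume reports whether the node hosts the volume with the given ID.
func nodeHasVolume(node, id string, vols []volume) bool {
	for _, v := range vols {
		if v.ID == id && v.Node == node {
			return true
		}
	}
	return false
}

// feasible reports whether the node can satisfy the request.
func feasible(node string, req request, vols []volume) bool {
	if req.Sticky && req.ClaimedID != "" {
		if !nodeHasVolume(node, req.ClaimedID, vols) {
			return false // the kind of early return that was missing
		}
	}
	// Other, unrelated checks would follow here in a real checker.
	return true
}

func main() {
	vols := []volume{{ID: "vol-1", Node: "node-a"}}
	req := request{Sticky: true, ClaimedID: "vol-1"}
	fmt.Println(feasible("node-a", req, vols)) // true
	fmt.Println(feasible("node-b", req, vols)) // false
}
```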
Michael Smithhisler
c4f232f23e event stream: fix wildcard namespace bypass (#25089) 2025-02-11 11:06:29 -05:00
Daniel Bennett
92c90af542 e2e: task schedule: pauses vs restarts (#25085)
CE side of ENT PR:
task schedule: pauses are not restart "attempts"

distinguish between these two cases:
1. task dies because we "paused" it (on purpose)
   - should not count against restarts,
     because nothing is wrong.
2. task dies because it didn't work right
   - should count against restart attempts,
     so users can address application issues.

with this, the restart{} block is back to its normal
behavior, so its documentation applies without caveat.
2025-02-11 09:46:58 -06:00
Aimee Ukasick
8a597a172d Docs SEO: task drivers and plugins; refactor virt section (#24783)
* Docs SEO: task drivers and plugins; refactor virt section

* add redirects for virt driver files

* Some updates. committing rather than stashing

* fix content-check errors

* Remove docs/devices/ and redirect to plugins/devices

* Update docs/drivers descriptions

* Move USB device plugin up a level. Finish descriptions.

* Apply suggestions from Jeff's code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Apply title case suggestions from code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* apply title case suggestions; fix indentation

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2025-02-10 15:43:02 -06:00
Michael Smithhisler
ba71c299b1 test: add eval to state store periodic job test (#25083) 2025-02-10 13:39:08 -05:00
Michael Smithhisler
b5c157df29 state store: remove reschedulable check when getting job status (#25081) 2025-02-10 12:21:00 -05:00
Tim Gross
87741dd908 deps: remove gofakeit (#25073)
This dependency is only used to generate mock `Variables`. The only time the
faked values are meaningful would be in the state store and RPC handler tests,
where we are always setting the values directly so that we can control
unblocking behaviors. Remove most of the random generation and remove the
dependency.

Closes: https://github.com/hashicorp/nomad/pull/25066
2025-02-10 11:53:05 -05:00
stswidwinski
871585ee90 18529 nomad executes any file in plugins (#18530)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2025-02-10 16:08:22 +00:00
Juana De La Cuesta
cfc24116b3 Add tag to instances with OS and add merged output (#25071)
* func: add a new output that merges both windows and linux clients, but add tags to distinguish them

* fix: outputs can't reference other outputs in terraform

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-10 17:08:07 +01:00
Juana De La Cuesta
c5d74a96a3 Add module to upgrade clients (#25055)
* func: add module to upgrade clients

* func: add polling to verify the metadata to make sure all clients are up

* style: remove unused code

* fix: Give the allocations a little time to get to the expected number on the test health check, to avoid possible flaky tests in the future

* fix: set the upgrade version as clients version for the last health check
2025-02-10 17:03:54 +01:00
dependabot[bot]
493f664632 chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.62.0 (#25069)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.1 to 0.62.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.1...v0.62.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-10 09:39:31 -05:00
dependabot[bot]
43e6b5493f chore(deps): bump github.com/hashicorp/go-kms-wrapping/wrappers/azurekeyvault/v2 (#25068)
Bumps [github.com/hashicorp/go-kms-wrapping/wrappers/azurekeyvault/v2](https://github.com/hashicorp/go-kms-wrapping) from 2.0.11 to 2.0.13.
- [Commits](https://github.com/hashicorp/go-kms-wrapping/compare/v2.0.11...v2.0.13)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/go-kms-wrapping/wrappers/azurekeyvault/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-10 09:27:36 -05:00
dependabot[bot]
d999e88cef chore(deps): bump github.com/aws/aws-sdk-go-v2/config (#25067)
Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.29.4 to 1.29.6.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Changelog](https://github.com/aws/aws-sdk-go-v2/blob/main/changelog-template.json)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.29.4...config/v1.29.6)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-10 09:26:19 -05:00
dependabot[bot]
6eca129a9a chore(deps): bump github.com/containerd/go-cni from 1.1.11 to 1.1.12 (#25065)
Bumps [github.com/containerd/go-cni](https://github.com/containerd/go-cni) from 1.1.11 to 1.1.12.
- [Release notes](https://github.com/containerd/go-cni/releases)
- [Commits](https://github.com/containerd/go-cni/compare/v1.1.11...v1.1.12)

---
updated-dependencies:
- dependency-name: github.com/containerd/go-cni
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-10 09:25:25 -05:00
Juana De La Cuesta
cae81182dd fix: refactor to avoid flakiness (#25047) 2025-02-10 10:53:39 +01:00
Tim Gross
a11325863e E2E: dynamic host volumes (#25063)
I merged #24869 having forgotten we don't run these tests in PR CI, so there's a compile error in the test. Fix that error and add the no-op import we use to catch this kind of thing.

Ref: https://github.com/hashicorp/nomad/pull/24869
2025-02-07 16:27:36 -05:00
Tim Gross
3f2d4000a6 E2E: dynamic host volume tests for sticky volumes (#24869)
Add tests for dynamic host volumes where the claiming jobs have `volume.sticky =
true`. Includes a test for forced rescheduling and a test for node drain.

This changeset includes a new `e2e/v3`-style package for creating dynamic host
volumes, so we can reuse that across other tests.
2025-02-07 15:50:54 -05:00
Michael Smithhisler
a6523be478 state store: fix logic for evaluating job status (#24974) 2025-02-07 15:34:14 -05:00
Daniel Bennett
91194b3cc2 docker: refactor to handle futures more easily (#24992)
at least one bug has been created because it's
easy to miss a future.set() in pullImageImpl()

this pulls future.set() out to PullImage(),
the same level where it's created and wait()ed
2025-02-07 12:45:17 -06:00
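Note on 91194b3cc2: the refactor moves the single `set()` call to the same level where the future is created and waited on, so no return path in the worker can forget it. The sketch below shows that shape with a hypothetical `future` type and a fake `pullImageImpl`; it does not reproduce the Docker driver's real types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// future carries the result of one image pull to its waiter.
type future struct {
	done chan struct{}
	id   string
	err  error
}

func newFuture() *future { return &future{done: make(chan struct{})} }

// set stores the result and unblocks wait. It must be called exactly once.
func (f *future) set(id string, err error) {
	f.id, f.err = id, err
	close(f.done)
}

// wait blocks until set has been called.
func (f *future) wait() (string, error) {
	<-f.done
	return f.id, f.err
}

// pullImageImpl is a fake worker. It only returns values and never touches
// the future, so an early return cannot leave a waiter blocked.
func pullImageImpl(image string) (string, error) {
	if image == "" {
		return "", errors.New("empty image name")
	}
	return "sha256:" + image, nil
}

// PullImage creates the future, sets it once with whatever the worker
// returned (including errors), and waits on it.
func PullImage(image string) (string, error) {
	f := newFuture()
	go func() { f.set(pullImageImpl(image)) }()
	return f.wait()
}

func main() {
	fmt.Println(PullImage("redis"))
	fmt.Println(PullImage(""))
}
```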
Daniel Bennett
62ef621582 docker: respect image_pull_timeout (#24991)
I believe the docker driver stopped respecting image_pull_timeout
in Nomad 1.9.0 in 981ca36049

this makes the timeout apply again
2025-02-07 11:36:31 -06:00
Piotr Kazmierczak
611452e1af stateful deployments: use TaskGroupVolumeClaim table to associate volume requests with volume IDs (#24993)
We introduce an alternative solution to the one presented in #24960 which is
based on the state store and not previous-next allocation tracking in the
reconciler. This new solution reduces cognitive complexity of the scheduler
code at the cost of slightly more boilerplate code, but also opens up new
possibilities in the future, e.g., allowing users to explicitly "un-stick"
volumes with workloads still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store
2025-02-07 17:41:01 +01:00
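Note on 611452e1af: the flow in the diagram can be approximated with an in-memory table keyed by namespace, job ID, and task group name. This is a sketch under stated assumptions: `claimKey`, `claims`, and the method names are hypothetical, and a plain map stands in for the state store.

```go
package main

import "fmt"

// claimKey mirrors the fields listed in the diagram that identify a claim.
type claimKey struct {
	Namespace, JobID, TaskGroup string
}

// claims maps a claim key to the claimed volume ID. A map is a hypothetical
// stand-in for the state store table.
type claims map[claimKey]string

// upsert records a claim only if none exists yet for the key.
func (c claims) upsert(k claimKey, volID string) {
	if _, ok := c[k]; !ok {
		c[k] = volID
	}
}

// feasible returns true when the candidate volume matches the claim, or
// when there is no previous claim.
func (c claims) feasible(k claimKey, candidate string) bool {
	claimed, ok := c[k]
	return !ok || claimed == candidate
}

// deleteJob removes every claim that belongs to the given job.
func (c claims) deleteJob(namespace, jobID string) {
	for k := range c {
		if k.Namespace == namespace && k.JobID == jobID {
			delete(c, k)
		}
	}
}

func main() {
	c := claims{}
	k := claimKey{Namespace: "default", JobID: "db", TaskGroup: "primary"}

	fmt.Println(c.feasible(k, "vol-1")) // true: no previous claim
	c.upsert(k, "vol-1")
	fmt.Println(c.feasible(k, "vol-1")) // true: matches the claim
	fmt.Println(c.feasible(k, "vol-2")) // false: a different volume is claimed
	c.deleteJob("default", "db")
	fmt.Println(c.feasible(k, "vol-2")) // true: claim removed with the job
}
```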
Daniel Bennett
3493551c38 docker: surface image pull progress error (#24981)
set() on the future, so the caller can handle it
instead of wait()ing forever and causing the
allocation to get stuck "pending"
2025-02-07 10:36:09 -06:00
Tim Gross
d0a6424844 enos: improve documentation around required variables (#25051)
The variables definitions for Enos upgrade scenarios have a couple of unused
variables and some of the documentation strings are ambiguous:

* `nomad_region` and `binary_local_path` variables are unused and can be removed.
* `nomad_local_binary` refers to the directory where the binaries will be
  downloaded, not the binaries themselves. Rename to make it clear this belongs to
  the artifactory fetch and not the provisioning step (which uses the
  artifactory fetch outputs).
2025-02-07 11:35:50 -05:00
James Rasell
4fbacee328 sec: Remove yamux suppression as vuln has been revoked. (#25044) 2025-02-07 15:15:15 +00:00
Tim Gross
5d09d7ad07 reset max query time of blocking queries in client after retries (#25039)
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.

Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.

Fixes: https://github.com/hashicorp/nomad/issues/25033
2025-02-07 08:45:56 -05:00
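Note on 5d09d7ad07: the fix can be sketched as "remember the original max query time and restore it before returning", so a reused request struct does not keep the shortened value. The `queryOptions` type, the `rpc` signature, and `errRetryable` below are hypothetical and much simpler than the client's real RPC path.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryable is a hypothetical error that marks a call as worth retrying.
var errRetryable = errors.New("retryable error")

// queryOptions is a hypothetical request field set that callers reuse.
type queryOptions struct {
	MaxQueryTime time.Duration
}

// rpc invokes call up to attempts times. After a failure it shortens
// MaxQueryTime to fit within holdTimeout, and it always restores the
// original value before returning so the caller's struct is unchanged.
func rpc(opts *queryOptions, holdTimeout time.Duration, attempts int, call func(*queryOptions) error) error {
	original := opts.MaxQueryTime
	defer func() { opts.MaxQueryTime = original }()

	var err error
	for i := 0; i < attempts; i++ {
		if err = call(opts); err == nil {
			return nil
		}
		if opts.MaxQueryTime > holdTimeout {
			opts.MaxQueryTime = holdTimeout
		}
	}
	return err
}

func main() {
	opts := &queryOptions{MaxQueryTime: 5 * time.Minute}
	calls := 0
	err := rpc(opts, 5*time.Second, 3, func(o *queryOptions) error {
		calls++
		fmt.Println("attempt", calls, "max query time", o.MaxQueryTime)
		if calls == 1 {
			return errRetryable
		}
		return nil
	})
	fmt.Println(err, opts.MaxQueryTime) // <nil> 5m0s
}
```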
Tim Gross
b5faeff233 vault: fix bug in logging logic around renewals (#25040)
In #24409 we fixed a bug where some of the error messages we get from Vault
weren't being caught correctly. This fix itself contains a bug where we changed
the logic that logged the non-fatal errors so that it logs when there is no
renewal error.

Ref: https://github.com/hashicorp/nomad/pull/24409
Fixes: https://github.com/hashicorp/nomad/issues/24933
2025-02-07 08:45:33 -05:00
Juana De La Cuesta
cf0a046364 Module to upgrade servers (#24971)
* func: add initial enos skeleton

* style: add headers

* func: change the variables input to a map of objects to simplify the workloads creation

* style: formatting

* Add tests for servers and clients

* style: separate the tests in different scripts

* style: add missing headers

* func: add tests for allocs

* style: improve output

* func: add step to copy remote upgrade version

* style: hcl formatting

* fix: remove the terraform nomad provider

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: add missing license headers

* style: hcl fmt

* style: rename variables and fix format

* func: remove the template step on the workloads module and chop the nomad token output on the provide module

* fix: correct the jobspec path on the workloads module

* fix: add missing variable definitions on job specs for workloads

* style: formatting

* fix: Add clean token to remove extra new line added in provision

* func: add module to upgrade servers

* style: missing headers

* func: add upgrade module

* func: add install for windows as well

* func: add an intermediate module that runs the upgrade server for each server

* fix: add missing license headers

* fix: remove extra input variables and connect upgrade servers to the scenario

* fix: rename missing env variables for cluster health scripts

* func: move the cluster health test outside of the modules and into the upgrade scenario

* fix: fix the regex to ignore snap files on the gitignore file

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: remove extra input variables and connect upgrade servers to the scenario

* style: formatting

* fix: move taking and restoring snapshots out of the upgrade_single_server to avoid possible race conditions

* fix: rename variable in health test

* fix: Add clean token to remove extra new line added in provision

* func: add an intermediate module that runs the upgrade server for each server

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* func: fix the last_log_index check and add a versions check

* func: don't use for_each when upgrading the servers; hardcode each one to ensure they are upgraded one by one

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update enos/modules/upgrade_instance/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* func: make snapshot by calling every server and allowing stale data

* style: formatting

* fix: make the source for the upgrade binary unknown until apply

* func: use enos bundle to install remote upgrade version, enos_files is not meant for dynamic files

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-07 10:26:03 +01:00
salehjafarli
a914888c2c docs: Corrected meta keys example from sidecar_service documentation (#25042) 2025-02-07 08:43:13 +00:00