777 Commits

Author SHA1 Message Date
James Rasell
8bce0b0954 e2e: Migrate legacy Vault token based workflow to workload ID (#25139)
Nomad 1.10.0 is removing the legacy Vault token based workflow,
which means the legacy e2e compatibility tests will fail.

The Nomad e2e cluster was using the legacy Vault token based
workflow for initial cluster build. This change migrates to using
the workload identity flow which utilizes authentication methods,
roles, and policies.

The Nomad server network has been modified to allow traffic from
the HCP Vault HVN which is a private network peered into our AWS
account. This is required, so that Vault can pull JWKS
information from the Nomad API without going over the public
internet.

The cluster build will now also configure a Vault KV v2 mount at
a unique identifier for the e2e cluster. This allows all Nomad
workloads and tests to use this if required.

The vaultsecrets suite has been updated to accommodate the new
changes and extended to test the default workload ID flow for
allocations which use Vault for secrets.
2025-02-20 14:06:25 +00:00
Tim Gross
86e1d6da52 E2E: use repo root to find correct git sha for AMI (#25151)
The nightly E2E run only builds a new AMI when required by changes to the
build. The AMI is tagged with the SHA of the commit that forced that build,
which may not be the commit that's spawning a particular test run. So we have a
resource in the `provision-infra` module that finds that SHA.

But when we run upgrade testing via Enos, we're running the E2E Terraform
configuration from outside the `e2e/terraform` folder. So the script that
resource runs will fail and prevent us from getting the AMI. Fix the script so
it can be run from any folder.

We also have duplicate resources for the "ubuntu jammy" AMI, but this is because
the Enos matrix might (in the near future) test with ARM64. For now, we'll pin
the Consul server to AMD64. Rename the resource appropriately to make the source
of the duplicate obvious.
2025-02-19 08:59:22 -05:00
Juana De La Cuesta
af2ac87409 Simplify binary overrides on e2e provision (#25122)
* func: remove the lists to override the nomad_local_binary for servers and clients

* docs: add a note to the terraform e2e readme

* fix: remove the extra 'windows' from the aws_ami filter

* style: hcl fmt
2025-02-17 16:13:32 +01:00
Daniel Bennett
92c90af542 e2e: task schedule: pauses vs restarts (#25085)
CE side of ENT PR:
task schedule: pauses are not restart "attempts"

distinguish between these two cases:
1. task dies because we "paused" it (on purpose)
   - should not count against restarts,
     because nothing is wrong.
2. task dies because it didn't work right
   - should count against restart attempts,
     so users can address application issues.

with this, the restart{} block is back to its normal
behavior, so its documentation applies without caveat.
2025-02-11 09:46:58 -06:00
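The two cases distinguished in 92c90af542 can be sketched in Go as follows. This is only an illustration under assumptions: `exitReason`, `restartTracker` and its fields are hypothetical names invented here, not Nomad's actual restart tracking code. The point is that a deliberate pause leaves the restart budget untouched, while a real failure consumes an attempt.

```go
package main

import "fmt"

// exitReason is a hypothetical label for why a task stopped.
type exitReason int

const (
	exitPaused exitReason = iota // stopped on purpose by the task schedule
	exitFailed                   // the task itself did not work right
)

// restartTracker is an illustrative stand-in, not Nomad's real type.
type restartTracker struct {
	attempts    int // restart attempts consumed so far
	maxAttempts int // budget, as a restart block would configure it
}

// shouldRestart reports whether the task may be started again.
// Pauses never consume an attempt; failures do.
func (r *restartTracker) shouldRestart(reason exitReason) bool {
	if reason == exitPaused {
		return true
	}
	if r.attempts >= r.maxAttempts {
		return false
	}
	r.attempts++
	return true
}

func main() {
	rt := &restartTracker{maxAttempts: 2}
	for _, reason := range []exitReason{exitPaused, exitFailed, exitFailed, exitFailed} {
		ok := rt.shouldRestart(reason)
		fmt.Println(ok, rt.attempts)
	}
}
```

With a budget of two attempts, the pause is allowed without changing the counter, the next two failures each consume an attempt, and the third failure is refused.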
Juana De La Cuesta
cfc24116b3 Add tag to instances with OS and add merged output (#25071)
* func: add a new output that merges both windows and linux clients, but add tags to distinguish them

* fix: outputs can't reference other outputs in terraform

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-10 17:08:07 +01:00
Tim Gross
a11325863e E2E: dynamic host volumes (#25063)
I merged #24869 having forgotten we don't run these tests in PR CI, so there's a compile error in the test. Fix that error and add the no-op import we use to catch this kind of thing.

Ref: https://github.com/hashicorp/nomad/pull/24869
2025-02-07 16:27:36 -05:00
Tim Gross
3f2d4000a6 E2E: dynamic host volume tests for sticky volumes (#24869)
Add tests for dynamic host volumes where the claiming jobs have `volume.sticky =
true`. Includes a test for forced rescheduling and a test for node drain.

This changeset includes a new `e2e/v3`-style package for creating dynamic host
volumes, so we can reuse that across other tests.
2025-02-07 15:50:54 -05:00
Juana De La Cuesta
d53b8a7e98 func: remove triggers from resources that copy the binaries into the remote instances (#25036) 2025-02-06 17:11:19 +01:00
Juana De La Cuesta
3861c40220 func: add initial enos skeleton (#24787)
* func: add initial enos skeleton

* style: add headers

* func: change the variables input to a map of objects to simplify the workloads creation

* style: formatting

* Add tests for servers and clients

* style: separate the tests in different scripts

* style: add missing headers

* func: add tests for allocs

* style: improve output

* func: add step to copy remote upgrade version

* style: hcl formatting

* fix: remove the terraform nomad provider

* fix: Add clean token to remove extra new line added in provision

* fix: add missing license headers

* style: hcl fmt

* style: rename variables and fix format

* func: remove the template step on the workloads module and chop the nomad token output on the provide module

* fix: correct the jobspec path on the workloads module

* fix: add missing variable definitions on job specs for workloads

* style: formatting

* fix: rename variable in health test
2025-01-30 16:37:55 +01:00
Michael Smithhisler
47c14ddf28 remove remote task execution code (#24909) 2025-01-29 08:08:34 -05:00
Juana De La Cuesta
1b1ad896ec Add the path to the ssh key to connect to the cluster's instances as an output (#24969)
* fix: add the ssh key pem path to the outputs and fix the message with the correct path

* func: add ssh pem key as output
2025-01-28 18:25:02 +01:00
James Rasell
c8d7e741c8 e2e: Fix TF output SSH key path. (#24965) 2025-01-28 16:29:56 +00:00
James Rasell
8859cfa3f5 e2e: Ensure Consul client is running before starting Nomad service. (#24964) 2025-01-28 15:28:12 +00:00
Phil Renaud
7106ac1462 Update playwright to 1.50.0 for e2e ui tests (#24956) 2025-01-27 12:03:59 -05:00
Juana De La Cuesta
687335639b fix: add a dependency to avoid terraform errors when generating ssh keys (#24912) 2025-01-22 11:36:03 +01:00
Juana De La Cuesta
039da61d8f [F-net-11478] Make keys directory cluster grouped (#24883)
* func: make windows arch dependent

* func: unify keys and make them cluster grouped

* Update README.md

* Update e2e/terraform/provision-infra/provision-nomad/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update .gitignore

* style: add an output with the cluster identifier

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-01-20 10:18:38 +01:00
Tim Gross
1df94b1470 E2E: refactor volume_mounts test (#24876)
The volume_mounts test is flaky due to slow starts from the exec-driver and some
incorrect wait code. Refactor the volume_mounts test to use the `e2e/v3` package
helpers, and use these to give it enough time to start the exec tasks.
2025-01-17 08:31:50 -05:00
Tim Gross
6ea40cbfb2 E2E: dynamic host volumes test reliability (#24854)
The nightly runs for E2E have been failing the recently added dynamic host
volumes tests for a number of reasons:

* Adding timing logs to the tests shows that it can take over 5s (the original
  test timeout) for the client fingerprint to show up on the client. This seems
  like a lot but seems to be host-dependent because it's much faster locally.
  Extend the timeout and leave in the timing logs so that we can keep an eye on
  this problem in the future.

* The register test doesn't wait for the dispatched job to complete, and the
  dispatched job was actually broken when TLS was in use because we weren't using
  the Task API socket. Fix the jobspec for the dispatched job and add waiting
  for the dispatched allocation to be marked complete before checking for the
  volume on the server.

I've also changed both the mounter jobs to batch workloads, so that we don't have
to wait 10s for the deployment to complete.
2025-01-14 12:26:31 -05:00
Tim Gross
ef366ee166 E2E: update .gitignore files to avoid committing runtime files (#24855)
In #24694 we did a major refactoring of the E2E Terraform configuration. After
deploying a cluster this morning, I noticed a few moved/removed files were not
reflected in the .gitignore files. This changeset updates the .gitignore to have
no unstaged files after applying.
2025-01-14 12:16:01 -05:00
Juana De La Cuesta
b29a3736a4 Update e2e infra provision to expect providers (#24694)
* func: move infra provisioning to a module and remove providers

* func: update paths

* func: update more paths

* func: update path inside bootstrap script

* style: remove debug prints on bootstrap scripts

* Delete e2e/terraform/csi/input/volume-efs.hcl

* fix: update keys path to use module path instead of root

* fix: add missing headers

* fix: update keys directory inside provision-nomad

* style: format hcl files

* Update compute.tf

* Update e2e/terraform/main.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* fix: update more paths

* fix: fmt hcl files

* func: final paths revision for running e2e locally

* fix: make path of certs relative to module for the bootstrap

* func: final paths revision for running e2e locally

* Update network.tf

* fix: fix typo and add success message

* fix: remove the test name from token to avoid long names and use name for vol to avoid collisions

* func: unify the uploads folder

* func: make the uploads file one per cluster

* func: Add outputs with all data necessary to connect to the cluster

* fix: make nomad token a sensitive output

* Update bootstrap-nomad.sh

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-01-13 15:59:40 +01:00
Michael Smithhisler
606ce9dd90 deps: upgrade aws-sdk-go from v1 to v2 (#24720) 2025-01-09 17:27:19 -05:00
Tim Gross
997358d855 E2E: dynamic host volumes workflows (#24816)
Initial end-to-end tests for dynamic host volumes. This includes tests for two
workflows:

* One where a dynamic host volume is created by a plugin and then mounted by a job.
* Another where a dynamic host volume is created out-of-band and registered by a
  job, then mounted by another job.

This changeset also moves the existing `volumes` E2E test package to the
better-named `volume_mounts`.

Ref: https://hashicorp.atlassian.net/browse/NET-11551
2025-01-09 08:41:22 -05:00
James Rasell
359571df01 e2e: Account for non-default region in Prometheus scrape config. (#24807) 2025-01-08 14:08:17 +00:00
Juana De La Cuesta
2eb2b6c739 fix: update the dnsconfig script to handle multiple interfaces (#24800) 2025-01-07 21:12:18 +01:00
James Rasell
8bb7c1315d e2e: fix failing tests due to region name change. (#24713) 2024-12-19 14:21:17 +00:00
Tim Gross
abeae5c47b E2E: use a variable for region (#24693)
In #24644 we set the region to "e2e" but forgot to setup the TLS certificate
names appropriately. Swap the region out for a variable instead.
2024-12-17 10:28:22 -05:00
Tim Gross
75b0202f7f api: don't copy previously parsed URL when setting new address (#24644)
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d867), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster is non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: https://github.com/hashicorp/nomad/issues/24635
Fixes: https://github.com/hashicorp/nomad/issues/24609
Ref: https://github.com/hashicorp/nomad/pull/16872
Ref: https://github.com/hashicorp/nomad/pull/23785
Ref: 4d3b75d867
2024-12-16 11:05:29 -05:00
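The fix described in 75b0202f7f can be illustrated with a minimal Go sketch. The `config` type, its `parsed` and `withAddress` methods, and the example hostnames are all hypothetical simplifications made up for this illustration; they are not the real `api.Config`. The sketch only shows why carrying a cached parse result over to a copy causes the new address to be ignored, and how leaving the cache empty forces a re-parse.

```go
package main

import (
	"fmt"
	"net/url"
)

// config is a hypothetical, simplified stand-in for an API client config.
type config struct {
	Address string
	url     *url.URL // private cache of the parsed Address
}

// parsed returns the parsed address, parsing only when nothing is cached.
func (c *config) parsed() (*url.URL, error) {
	if c.url != nil {
		return c.url, nil
	}
	u, err := url.Parse(c.Address)
	if err != nil {
		return nil, err
	}
	c.url = u
	return u, nil
}

// withAddress returns a copy pointed at addr. The private cache is
// deliberately not copied, so the new address is parsed from scratch.
func (c *config) withAddress(addr string) *config {
	return &config{Address: addr}
}

func main() {
	server := &config{Address: "https://server.example:4646"}
	if _, err := server.parsed(); err != nil {
		panic(err)
	}

	client := server.withAddress("https://client.example:4646")
	u, err := client.parsed()
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Host)
}
```

Had `withAddress` copied the `url` field as well, `client.parsed()` would have returned the server's cached URL and skipped parsing the new address.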
Juana De La Cuesta
526c6375ad Make paths in e2e/terraform/ directory relative to the module (#24664)
* func: make paths relative

* func: make paths relative to the module inside the e2e terraform folder

* fix: add license files to gitignore

* func: move /etc and update all paths

* Uncomment forgotten code

* fix: update the path to the tls certificates to be local to the instance
2024-12-13 17:33:59 +01:00
Juana De La Cuesta
a9a0f71213 Remove sockaddr and use native tools (#24665)
* func: remove sockaddr and use native tools

* Update setup.sh
2024-12-13 17:24:53 +01:00
Yucong Sun
642e33ae41 CSI: fix topology matching logic (#24522)
Some plugins emit multiple topology segment entries for the same segment (ex. newer versions of AWS EBS) to accommodate convention changes in k8s. Check that segments are a superset instead of exactly equal to the plugin's topology segments.
2024-11-22 09:22:36 -05:00
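The superset check described in 642e33ae41 can be sketched in Go as below. The function name `isSuperset`, the parameter names, and the segment keys in `main` are illustrative assumptions, not Nomad's actual matcher; the commit message does not spell out which side holds the larger set, so the sketch stays generic about `have` and `want`.

```go
package main

import "fmt"

// isSuperset reports whether every key/value pair in want is also
// present in have. Extra entries in have are ignored, which is what
// separates this from an exact-equality comparison.
func isSuperset(have, want map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative keys: the same zone reported under two segment names.
	have := map[string]string{
		"topology.ebs.csi.aws.com/zone": "us-east-1a",
		"topology.kubernetes.io/zone":   "us-east-1a",
	}
	want := map[string]string{
		"topology.ebs.csi.aws.com/zone": "us-east-1a",
	}
	fmt.Println(isSuperset(have, want))
}
```

An exact-equality comparison would reject this pair because `have` carries one more entry than `want`, even though both agree on the zone.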
Phil Renaud
0023edd3ec Updates Playwright in response to an E2E nightly failure (#24487) 2024-11-20 09:33:27 -05:00
Juana De La Cuesta
270b4f97a6 Update some details of the terraform readme file for e2e provisioning (#24451)
* docs: update instructions to provision e2e cluster

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/terraform.tfvars

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2024-11-18 13:36:51 +01:00
Juana De La Cuesta
1f944196d9 Allow scaling system jobs to 0 (#24363)
* func: remove validation scaling for system jobs and dont canonicalize to 1

* test: update test to validate for 0 and improve error message

* func: remove the canonicalization to 1 from system jobs

* docs: add changelog

* func: add test for scaling system jobs

* temp: add logging to debug test

* fix: clean up after test is done

* fix: scaled down jobs will still have the stop allocation, update test to account for it

* Update the e2e test to accommodate for system jobs to have an alloc per node

* fix: filter to only count ready nodes on the node count

* fix: remove the datacenter constraint from the system job definition

* fix: compare alloc IDs to avoid flaky tests when verifying no alloc was stopped

* fix: remove duplicated code
2024-11-18 13:35:47 +01:00
Piotr Kazmierczak
73383ee755 e2e: unflake testDockerExecStdin (#24385) 2024-11-07 13:35:32 +01:00
Seth Hoenig
b18851617f docker: close response connection once stdin is exhausted (#24202) 2024-10-17 11:07:23 -05:00
Piotr Kazmierczak
a22e56390e e2e: fix failing tests due to docker plugin settings (#24234) 2024-10-17 11:12:59 +02:00
Piotr Kazmierczak
f9cbaaf6c7 docker: fix a bug where auth for private registries wasn't parsed correctly (#24215)
In #23966 we introduced an official Docker client and did not notice that in
contrast to our previous 3rd party client, the official SDK PullOptions object
expects a base64 encoded JSON with username and password, instead of username/
password pair.
2024-10-16 22:04:54 +02:00
Tim Gross
d261d58ea2 build: update hc-install to current (#24199)
Installing Vault and Consul from releases.hashicorp.com via `hc-install` has
been failing intermittently. Update the `hc-install` binaries to be current and
add one retry to downloads for our compat tests so that we can get builds more
reliably green while the underlying issue is being debugged.
2024-10-15 10:07:58 -04:00
Daniel Bennett
278a2df3af e2e: ui: update playwright to 1.48.0 (#24158)
steps to update:
 * edit run.sh IMAGE variable manually
 * run ./run.sh test
2024-10-09 10:34:53 -05:00
Tim Gross
e9ba630639 docker: fix script check execution (#24098)
In #24095 we made a fix for non-streaming exec into Docker tasks for script
checks and `change_mode = "script"`, but didn't complete E2E testing. We need to
use `ContainerExecAttach` in the new API in order to get stdout/stderr from
tasklets, but the previous `ContainerExecStart` call will prevent this from
running successfully with an error that the exec has already run.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=551618)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
* This should fix the remaining issues in nightly E2E for Docker.
2024-10-01 16:41:38 -04:00
Michael Smithhisler
6b6aa7cc26 identity: adds ability to specify custom filepath for saving workload identities (#24038) 2024-09-23 10:27:00 -04:00
Tim Gross
9247dc9108 E2E: allow Consul version to omit tags (#24024)
When we start the Consul agent in the `consulcompat` test package, we check that
the version matches the version we expect. But Consul agents may omit non-core
parts of the version string (ex. `1.20.0-rc1` displays `1.20.0`). Compare only
the core portions of the version string.
2024-09-20 14:46:01 -04:00
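The comparison described in 9247dc9108 can be sketched in Go as follows. `coreVersion` and `sameCore` are hypothetical helper names, and treating everything after the first `-` or `+` as a non-core suffix is an assumption made for this illustration rather than the logic of the actual test package.

```go
package main

import (
	"fmt"
	"strings"
)

// coreVersion drops an optional leading "v" and any pre-release or
// build suffix, e.g. "1.20.0-rc1" becomes "1.20.0".
func coreVersion(v string) string {
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	return v
}

// sameCore reports whether two version strings agree on their core part.
func sameCore(a, b string) bool {
	return coreVersion(a) == coreVersion(b)
}

func main() {
	fmt.Println(sameCore("1.20.0-rc1", "1.20.0"))
	fmt.Println(sameCore("1.20.0-rc1", "1.19.2"))
}
```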
Seth Hoenig
51215bf102 deps: update to go-set/v3 and refactor to use custom iterators (#23971)
* deps: update to go-set/v3

* deps: use custom set iterators for looping
2024-09-16 13:40:10 -05:00
Tim Gross
8739d7738c E2E: remove invalid HCLv1 field on submissions test (#23936)
HCLv1 support was removed entirely in #23912, but I missed this one test and
documentation reference.
2024-09-09 09:57:25 -04:00
Phil Renaud
faf95ef7b9 Update the pinned playwright version (#23929) 2024-09-06 15:57:19 -04:00
Tim Gross
a9beef7edd jobspec: remove HCL1 support (#23912)
This changeset removes support for parsing jobspecs via the long-deprecated
HCLv1.

Fixes: https://github.com/hashicorp/nomad/issues/20195
Ref: https://hashicorp.atlassian.net/browse/NET-10220
2024-09-05 09:02:45 -04:00
Seth Hoenig
4aeb279534 e2e: fix module name of an artifact we download (#23843)
Because this will definitely never change again, for sure, trust me.
2024-08-19 10:25:35 -05:00
Seth Hoenig
db0642099e build: update golangci-lint to 1.60.1 (#23807)
* build: update golangci-lint to 1.60.1

* ci: update golangci-lint to v1.60.1

Helps with go1.23 compatibility. Introduces some breaking changes / newly
enforced linter patterns so those are fixed as well.
2024-08-14 10:09:31 -05:00
Tim Gross
bc50eebebd workload identity: add support for extra claims config for Vault (#23675)
Although we encourage users to use Vault roles, sometimes they're going to want
to assign policies based on entity and pre-create entities and aliases based on
claims. This allows them to use a single default role (or at least a small number of
them) that has a templated policy, but have an escape hatch from that.

When defining Vault entities the `user_claim` must be unique. When writing Vault
binding rules for use with Nomad workload identities the binding rule won't be
able to create a 1:1 mapping because the selector language allows accessing only
a single field. The `nomad_job_id` claim isn't sufficient to uniquely identify a
job because of namespaces. It's possible to create a JWT auth role with
`bound_claims` to avoid this becoming a security problem, but this doesn't allow
for correct accounting of user claims.

Add support for an `extra_claims` block on the server's `default_identity`
blocks for Vault. This allows a cluster administrator to add a custom claim on
all allocations. The values for these claims are interpolatable with a limited
subset of fields, similar to how we interpolate the task environment.

Fixes: https://github.com/hashicorp/nomad/issues/23510
Ref: https://hashicorp.atlassian.net/browse/NET-10372
Ref: https://hashicorp.atlassian.net/browse/NET-10387
2024-08-05 15:01:54 -04:00
Daniel Bennett
10d3f1749b e2e: test all cni config formats (#23650) 2024-07-22 10:17:03 -05:00