254 Commits

Author SHA1 Message Date
Michael Smithhisler
37da98be1c Merge pull request #26681 from hashicorp/NMD-760-nomad-secrets-block
Secrets Block: merge feature branch to main
2025-09-09 10:46:18 -04:00
Daniel Bennett
1f7f51ceb4 e2e: update cni plugins (#26724)
> failed to configure network: plugin type="firewall" failed (add):
> incompatible CNI versions; config is "1.0.0", plugin supports ["0.4.0"]
2025-09-08 11:52:23 -04:00
Michael Smithhisler
10ed46cbd4 secrets: pass key/value config data to plugins as env (#26455)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 16:08:24 -04:00
Daniel Bennett
9682aa2724 consul connect: allow "cni/*" network mode (#26449)
don't require "bridge" network mode when using connect{}

we document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
2025-09-04 12:29:50 -04:00
Allison Larson
3fff1aa3cc Support IMDSv2 on windows e2e runners (#26629) 2025-08-25 15:37:50 -07:00
Tim Gross
767683ce3e E2E: allow setting instance_type variable (#26607)
When we refactored the E2E provisioning to allow it to be reused by the upgrade
testing, we didn't thread the `instance_type` variable from the main module down
into the `provision-infra` module. This prevents you from setting a custom
instance size when deploying the E2E cluster manually.
2025-08-22 15:22:10 -04:00
Allison Larson
f6a078c7e5 Disable IMDSv2 on windows test instances (#26606) 2025-08-21 16:29:35 -07:00
Allison Larson
694e0ac2e3 Require IMDSv2 for e2e EC2 instances (#26585)
Re-enables this now that go-discover is updated in all the right places.
2025-08-20 14:47:43 -07:00
Daniel Bennett
8675fba382 e2e: install exec2 driver v0.1.0 (#26578)
for auto-unveil of NOMAD_SECRETS_DIR
following f3e08d8aa9
2025-08-19 11:28:57 -04:00
Daniel Bennett
f3e08d8aa9 e2e: exec2: envoy binary version and tidying (#26558)
* e2e: update standalone envoy binary version

fix for:

> === FAIL: e2e/exec2 TestExec2/testCountdash (21.25s)
>     exec2_test.go:71:
> ...
> [warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to local_agent closed: 3, Envoy 1.29.4 is too old and is not supported by Consul

there's also this warning, but it doesn't seem so fatal:

> [warning][main] [source/server/server.cc:910] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.

picked latest supported from latest consul (1.21.4):

```
$ curl -s localhost:8500/v1/agent/self | jq .xDS.SupportedProxies
{
  "envoy": [
    "1.34.1",
    "1.33.2",
    "1.32.5",
    "1.31.8"
  ]
}
```

* e2e: exec2: remove extraneous bits

 * reschedule: no reschedule for batch jobs
 * unveil: nomad paths get auto-unveiled with unveil_defaults
   https://github.com/hashicorp/nomad-driver-exec2/blob/v0.1.0/plugin/driver.go#L514-L522
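The version choice described above (take the newest entry from Consul's `SupportedProxies` list) can be sketched as follows. This is only an illustration, not the tooling used in the commit: the `supported` type and the `less` and `latest` helpers are hypothetical names, and the JSON literal is the `jq` output quoted in the message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// supported mirrors only the fragment of the response quoted above.
type supported struct {
	Envoy []string `json:"envoy"`
}

// less reports whether version a sorts before version b, comparing
// dotted numeric components from left to right.
func less(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) && i < len(bs); i++ {
		x, _ := strconv.Atoi(as[i])
		y, _ := strconv.Atoi(bs[i])
		if x != y {
			return x < y
		}
	}
	return len(as) < len(bs)
}

// latest returns the highest version in the list, or "" if it is empty.
func latest(versions []string) string {
	best := ""
	for _, v := range versions {
		if best == "" || less(best, v) {
			best = v
		}
	}
	return best
}

func main() {
	raw := `{"envoy": ["1.34.1", "1.33.2", "1.32.5", "1.31.8"]}`
	var s supported
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	fmt.Println(latest(s.Envoy)) // 1.34.1
}
```

Comparing components numerically rather than as strings matters once a component reaches two digits, for example 1.9.0 versus 1.10.0.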
2025-08-18 14:58:00 -04:00
Tim Gross
7bfc04576a E2E: disable sdnotify for Consul agents (#26078)
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.

> consul.service: start operation timed out. Terminating.

This appears to be a known issue in Consul and we'll try to contribute some help
to hunt down the cause if they want help, but in the meantime let's remove it
from our systemd unit files for the Consul agents.

Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
2025-06-18 17:03:32 -04:00
Tim Gross
d6800c41c1 E2E: include Windows 2022 host in test targets (#26003)
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.

Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows will let us drop some of the extra PowerShell scripts we were
using as well.

Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
2025-06-16 12:12:15 -04:00
Piotr Kazmierczak
a10c2f6de7 e2e: mention in the terraform readme that we require a local Consul binary (#25944) 2025-05-28 17:12:57 +02:00
Tim Gross
0e728b87db E2E: remove dnsmasq and references to ECS plugin (#25892)
The DNS configuration for our E2E cluster uses dnsmasq to pass all DNS through
Consul. But there's a circular reference in systemd configurations that
sometimes causes the Docker service to fail. This is causing test flakes during
upgrade testing because we count the number of nodes and expect `system` jobs
using Docker to run on all nodes.

We no longer have any tests that require Consul DNS, so remove the complication
of dnsmasq to break the reference cycle. Also, while I was looking at this I
noticed we still had setup that would configure the ECS remote task driver
plugin, which is archived. Remove this as well.

Ref: https://hashicorp.atlassian.net/browse/NMD-162
2025-05-20 08:26:22 -04:00
Tim Gross
88dc842729 testing: use Docker Hub registry mirror for CI (#25703)
As of April 1, Docker Hub rate limits tightened. With only 10 pulls/hr/IP, we're
likely to encounter test failures. Switch all Docker images getting pulled from
this repository to use the HashiCorp managed registry mirror.

Note that most of our tests in `drivers/docker` don't pull from the remote
registry but load a local image, while others will need to pull from the remote
and fetch different images depending on OS/arch. Refactor the definition of test
task configuration to make it clear which is which, and de-factor some false
sharing of setup functions.

Updates the E2E tests to use that registry by configuring the Docker
daemon. This required changing out a few container images that we don't have in
the registry, but these new images are all smaller. There are a couple of tests
that still use explicitly-tagged `docker.io` images or other third-party
registries, which have been left in place.

Ref: https://hashicorp.atlassian.net/browse/NET-12233

update E2E images to those in the registry mirror

fix windows and docklog test build

fix stopsignal test

mop-up

more mop-up
2025-04-18 14:21:49 -04:00
James Rasell
311a83d706 e2e: Ensure UI is enabled. (#25620)
The `ui.enabled` parameter is a non-pointer bool which means the
merge function is unable to differentiate between false and not
set. When e2e introduced the `ui.show_cli_hints` configuration
parameter, the way we merge meant the UI became disabled.
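The merge problem described above can be reproduced in a minimal sketch. The `uiConfig` type and `merge` function below are hypothetical simplifications written for illustration, not Nomad's actual configuration code; they only assume that a plain `bool` field is copied from the overlay unconditionally, while a pointer field can be checked for nil.

```go
package main

import "fmt"

// uiConfig is a hypothetical, simplified config block (not Nomad's real type).
type uiConfig struct {
	Enabled      bool  // non-pointer: false and "not set" are indistinguishable
	ShowCLIHints *bool // pointer: nil means "not set"
}

// merge overlays other onto base the way a naive merge would.
func merge(base, other uiConfig) uiConfig {
	out := base
	out.Enabled = other.Enabled // an unset field overwrites true with false
	if other.ShowCLIHints != nil {
		out.ShowCLIHints = other.ShowCLIHints
	}
	return out
}

func main() {
	defaults := uiConfig{Enabled: true}
	hints := false
	// The overlay only sets show_cli_hints; Enabled is left at its zero value.
	overlay := uiConfig{ShowCLIHints: &hints}

	merged := merge(defaults, overlay)
	fmt.Println(merged.Enabled) // false
}
```

In this sketch, setting `Enabled: true` explicitly in the overlay keeps the UI on. Whether that is exactly what the commit changed is not stated in the message.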
2025-04-08 13:57:29 +01:00
Michael Smithhisler
c8cc519f54 e2e: disable cli hints for command parsing (#25584) 2025-04-02 09:12:36 -04:00
Michael Smithhisler
95c9029df0 e2e: update consul task policy and add empty consul block to task groups (#25580) 2025-04-01 16:29:47 -04:00
Michael Smithhisler
077c1921ef e2e: disable IMDSv2 in tests (#25564)
Consul needs to use a newer version of go-discover that can query IMDSv2
in order for our test infrastructure to be enabled with it.
2025-03-31 12:07:45 -04:00
Piotr Kazmierczak
a1fd9da705 e2e: require IMDSv2 for ec2 instances (#25541)
Require Instance Metadata Service v2 to access EC2 instance metadata for all VMs
that run our e2e suite.
2025-03-28 09:58:51 +01:00
Michael Smithhisler
f0e0215d56 e2e: fix consul e2e enterprise logic in bootstrapping (#25532) 2025-03-26 14:08:20 -04:00
Michael Smithhisler
c66269f8d0 e2e: fixes node write policy for consul agents (#25418) 2025-03-17 15:18:30 -04:00
Juana De La Cuesta
9b9d16421e Merge branch 'main' into NET-11546-enos-drain 2025-03-17 16:14:18 +01:00
Juanadelacuesta
4b0903789e func: add check script for vault workload 2025-03-14 17:03:35 +01:00
Juanadelacuesta
3af2da7362 fix: add default policy to consul acl configurations for the e2e cluster 2025-03-14 16:46:03 +01:00
Juanadelacuesta
4c1ba45d48 func: add workload to test vault workload identity 2025-03-13 17:55:59 +01:00
Tim Gross
5cc1b4e606 upgrade tests: add transparent proxy workload (#25176)
Add an upgrade test workload for Consul service mesh with transparent
proxy. Note this breaks from the "countdash" demo. The dashboard application
can only verify the backend is up by making a websocket connection, which we
can't do as a health check, and the health check it exposes for that purpose
only passes once the websocket connection has been made. So replace the
dashboard with a minimal nginx reverse proxy to the count-api instead.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
2025-03-07 15:25:26 -05:00
Tim Gross
916fe2c7fa upgrade testing: rework CSI test to use self-contained workload (#25285)
Getting the CSI test to work with AWS EFS or EBS has proven to be awkward
because we're having to deal with external APIs with their own consistency
guarantees, as well as challenges around teardown. Make the CSI test entirely
self-contained by using a userland NFS server and the rocketduck CSI plugin.

Ref: https://hashicorp.atlassian.net/browse/NET-12217
Ref: https://gitlab.com/rocketduck/csi-plugin-nfs
2025-03-05 11:48:19 -05:00
Michael Smithhisler
25cea5c16b e2e: allow consul access to nomad cluster (#25277) 2025-03-04 09:06:50 -05:00
Michael Smithhisler
7867957811 e2e: remove legacy consul token tests (#25174) 2025-02-28 11:31:33 -05:00
James Rasell
8bce0b0954 e2e: Migrate legacy Vault token based workflow to workload ID (#25139)
Nomad 1.10.0 is removing the legacy Vault token based workflow,
which means the legacy e2e compatibility tests will fail.

The Nomad e2e cluster was using the legacy Vault token based
workflow for initial cluster build. This change migrates to using
the workload identity flow which utilizes authentication methods,
roles, and policies.

The Nomad server network has been modified to allow traffic from
the HCP Vault HVN which is a private network peered into our AWS
account. This is required so that Vault can pull JWKS
information from the Nomad API without going over the public
internet.

The cluster build will now also configure a Vault KV v2 mount at
a unique identifier for the e2e cluster. This allows all Nomad
workloads and tests to use this if required.

The vaultsecrets suite has been updated to accommodate the new
changes and extended to test the default workload ID flow for
allocations which use Vault for secrets.
2025-02-20 14:06:25 +00:00
Tim Gross
86e1d6da52 E2E: use repo root to find correct git sha for AMI (#25151)
The nightly E2E run only builds a new AMI when required by changes to the
build. The AMI is tagged with the SHA of the commit that forced that build,
which may not be the commit that's spawning a particular test run. So we have a
resource in the `provision-infra` module that finds that SHA.

But when we run upgrade testing via Enos, we're running the E2E Terraform
configuration from outside the `e2e/terraform` folder. So the script that
resource runs will fail and prevent us from getting the AMI. Fix the script so
it can be run from any folder.

We also have duplicate resources for the "ubuntu jammy" AMI, but this is because
the Enos matrix might (in the near future) test with ARM64. For now, we'll pin
the Consul server to AMD64. Rename the resource appropriately to make the source
of the duplicate obvious.
2025-02-19 08:59:22 -05:00
Juana De La Cuesta
af2ac87409 Simplify binary overrides on e2e provision (#25122)
* func: remove the lists to override the nomad_local_binary for servers and clients

* docs: add a note to the terraform e2e readme

* fix: remove the extra 'windows' from the aws_ami filter

* style: hcl fmt
2025-02-17 16:13:32 +01:00
Juana De La Cuesta
cfc24116b3 Add tag to instances with OS and add merged output (#25071)
* func: add a new output that merges both windows and linux clients, but add tags to distinguish them

* fix: outputs can't reference other outputs in terraform

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-02-10 17:08:07 +01:00
Juana De La Cuesta
d53b8a7e98 func: remove triggers from resources that copy the binaries into the remote instances (#25036) 2025-02-06 17:11:19 +01:00
Juana De La Cuesta
3861c40220 func: add initial enos skeleton (#24787)
* func: add initial enos skeleton

* style: add headers

* func: change the variables input to a map of objects to simplify the workloads creation

* style: formatting

* Add tests for servers and clients

* style: separate the tests in different scripts

* style: add missing headers

* func: add tests for allocs

* style: improve output

* func: add step to copy remote upgrade version

* style: hcl formatting

* fix: remove the terraform nomad provider

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: Add clean token to remove extra new line added in provision

* fix: add missing license headers

* style: hcl fmt

* style: rename variables and fix format

* func: remove the template step on the workloads module and chop the nomad token output on the provide module

* fix: correct the jobspec path on the workloads module

* fix: add missing variable definitions on job specs for workloads

* style: formatting

* fix: rename variable in health test
2025-01-30 16:37:55 +01:00
Michael Smithhisler
47c14ddf28 remove remote task execution code (#24909) 2025-01-29 08:08:34 -05:00
Juana De La Cuesta
1b1ad896ec Add the path to the ssh key to connect to the cluster's instances as an output (#24969)
* fix: add the ssh key pem path to the outputs and fix the message with the correct path

* func: add ssh pem key as output
2025-01-28 18:25:02 +01:00
James Rasell
c8d7e741c8 e2e: Fix TF output SSH key path. (#24965) 2025-01-28 16:29:56 +00:00
James Rasell
8859cfa3f5 e2e: Ensure Consul client is running before starting Nomad service. (#24964) 2025-01-28 15:28:12 +00:00
Juana De La Cuesta
687335639b fix: add a dependency to avoid terraform errors when generating ssh keys (#24912) 2025-01-22 11:36:03 +01:00
Juana De La Cuesta
039da61d8f [F-net-11478] Make keys directory cluster grouped (#24883)
* func: make windows arch dependent

* func: unify keys and make them cluster grouped

* Update README.md

* Update e2e/terraform/provision-infra/provision-nomad/variables.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update .gitignore

* style: add an output with the cluster identifier

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-01-20 10:18:38 +01:00
Tim Gross
ef366ee166 E2E: update .gitignore files to avoid committing runtime files (#24855)
In #24694 we did a major refactoring of the E2E Terraform configuration. After
deploying a cluster this morning, I noticed a few moved/removed files were not
reflected in the .gitignore files. This changeset updates the .gitignore to have
no unstaged files after applying.
2025-01-14 12:16:01 -05:00
Juana De La Cuesta
b29a3736a4 Update e2e infra provision to expect providers (#24694)
* func: move infra provisioning to a module and remove providers

* func: update paths

* func: update more paths

* func: update path inside bootstrap script

* style: remove debug prints on bootstrap scripts

* Delete e2e/terraform/csi/input/volume-efs.hcl

* fix: update keys path to use module path instead of root

* fix: add missing headers

* fix: update keys directory inside provision-nomad

* style: format hcl files

* Update compute.tf

* Update e2e/terraform/main.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* fix: update more paths

* fix: fmt hcl files

* func: final paths revision for running e2e locally

* fix: make path of certs relative to module for the bootstrap

* func: final paths revision for running e2e locally

* Update network.tf

* fix: fix typo and add success message

* fix: remove the test name from token to avoid long names and use name for vol to avoid collisions

* func: unify the uploads folder

* func: make the uploads file one per cluster

* func: Add outputs with all data necessary to connect to the cluster

* fix: make nomad token a sensitive output

* Update bootstrap-nomad.sh

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-01-13 15:59:40 +01:00
Juana De La Cuesta
2eb2b6c739 fix: update the dnsconfig script to handle multiple interfaces (#24800) 2025-01-07 21:12:18 +01:00
Tim Gross
abeae5c47b E2E: use a variable for region (#24693)
In #24644 we set the region to "e2e" but forgot to set up the TLS certificate
names appropriately. Swap the region out for a variable instead.
2024-12-17 10:28:22 -05:00
Tim Gross
75b0202f7f api: don't copy previously parsed URL when setting new address (#24644)
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d867), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster's region isn't "global" and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: https://github.com/hashicorp/nomad/issues/24635
Fixes: https://github.com/hashicorp/nomad/issues/24609
Ref: https://github.com/hashicorp/nomad/pull/16872
Ref: https://github.com/hashicorp/nomad/pull/23785
Ref: 4d3b75d867
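The copy problem described in this message can be illustrated with a small sketch. The `Config` type, its cached `url` field, and the `withAddressBuggy` and `withAddressFixed` helpers are hypothetical simplifications, not the real `api` package. The sketch only shows how a cached parse result carried along with a copy makes the new address go unparsed; it does not reproduce the region handling.

```go
package main

import (
	"fmt"
	"net/url"
)

// Config is a hypothetical, simplified client configuration.
type Config struct {
	Address string
	url     *url.URL // cached result of parsing Address
}

// parsed returns the cached URL if present, otherwise parses Address.
func (c *Config) parsed() (*url.URL, error) {
	if c.url != nil {
		return c.url, nil // parsing is skipped when a cache exists
	}
	u, err := url.Parse(c.Address)
	if err != nil {
		return nil, err
	}
	c.url = u
	return u, nil
}

// withAddressBuggy copies the whole struct, so the stale cache travels along.
func withAddressBuggy(c Config, addr string) Config {
	c.Address = addr
	return c
}

// withAddressFixed leaves the private cache behind so addr is re-parsed.
func withAddressFixed(c Config, addr string) Config {
	return Config{Address: addr}
}

func main() {
	orig := Config{Address: "http://server.example:4646"}
	_, _ = orig.parsed() // populate the cache

	buggy := withAddressBuggy(orig, "http://client.example:4646")
	fixed := withAddressFixed(orig, "http://client.example:4646")

	bu, _ := buggy.parsed()
	fu, _ := fixed.parsed()
	fmt.Println(bu.Host) // server.example:4646
	fmt.Println(fu.Host) // client.example:4646
}
```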
2024-12-16 11:05:29 -05:00
Juana De La Cuesta
526c6375ad Make paths in e2e/terraform/ directory relative to the module (#24664)
* func: make paths relative

* func: make paths relative to the module inside the e2e terraform folder

* fix: add license files to gitignore

* func: move /etc and update all paths

* Uncomment forgotten code

* fix: update the path to the tls certificates to be local to the instance
2024-12-13 17:33:59 +01:00
Juana De La Cuesta
a9a0f71213 Remove sockaddr and use native tools (#24665)
* func: remove sockaddr and use native tools

* Update setup.sh
2024-12-13 17:24:53 +01:00
Juana De La Cuesta
270b4f97a6 Update some details of the terraform readme file for e2e provisioning (#24451)
* docs: update instructions to provision e2e cluster

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/terraform.tfvars

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2024-11-18 13:36:51 +01:00