837 Commits

Author SHA1 Message Date
Piotr Kazmierczak
f767db5639 e2e: fix TestScaling/TestScaling_System (#26804) 2025-09-19 15:31:32 +02:00
Piotr Kazmierczak
4874622ebd e2e: test canary updates for system jobs (#26776) 2025-09-17 10:20:03 +02:00
Daniel Bennett
f47cb5d10f e2e: adjust flaky timings (#26771)
hopefully fixes:

```
TestOversubscription/testExec:
    oversubscription_test.go:57: submitting job: "./input/exec.hcl"
    oversubscription_test.go:72:
        oversubscription_test.go:72: expected condition to pass within wait context
        ↪ error: wait: timeout exceeded: expect '31457280' in stdout, got: 'stat {...}/cat.stdout.0: no such file or directory'
```

and in separate runs,

```
TestTaskAPI/testTaskAPI_Auth:
     taskapi_test.go:85:
         taskapi_test.go:85: expected string to have suffix
         ↪ suffix: Unauthorized
         ↪ string:
```

```
TestTaskAPI/testTaskAPI_Auth:
     taskapi_test.go:85:
         taskapi_test.go:85: expected string to have suffix
         ↪ suffix: Forbidden
         ↪ string:
```
2025-09-15 15:54:53 -04:00
Michael Smithhisler
37da98be1c Merge pull request #26681 from hashicorp/NMD-760-nomad-secrets-block
Secrets Block: merge feature branch to main
2025-09-09 10:46:18 -04:00
Daniel Bennett
cb3e49f3e4 e2e: shorten restart delay in docker registry task (#26729)
tests that use this local docker registry (docker and podman tests)
occasionally flake, I think due to the timeout being reached,
despite passing after a restart.

> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task received by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Building Task Directory
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task started by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Exit Code: 1
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task restarting in 16.212149445s
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task started by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Exit Code: 0

setting the delay lower will (hopefully) keep within the job timeout.

I'm not sure why the `pledge` task apparently flakes like this;
I could find no useful info in the logs.
2025-09-08 15:21:08 -04:00
Daniel Bennett
1f7f51ceb4 e2e: update cni plugins (#26724)
> failed to configure network: plugin type="firewall" failed (add):
> incompatible CNI versions; config is "1.0.0", plugin supports ["0.4.0"]
2025-09-08 11:52:23 -04:00
Michael Smithhisler
10ed46cbd4 secrets: pass key/value config data to plugins as env (#26455)
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-09-05 16:08:24 -04:00
Michael Smithhisler
68167254e8 e2e: add initial tests for secrets block (#26397) 2025-09-05 16:08:23 -04:00
Daniel Bennett
9682aa2724 consul connect: allow "cni/*" network mode (#26449)
don't require "bridge" network mode when using connect{}

we document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
2025-09-04 12:29:50 -04:00
Daniel Bennett
3ad22ddad5 e2e: ui: fix token form fill (#26692)
look, I know I misspelled "locater" in the code comment, but it's easier to acknowledge that here in this commit message than it is to push a new commit with all the test/approval machinery in github.
2025-09-03 12:11:35 -04:00
James Rasell
267dc72f4e e2e: Correctly handle IMDSv2 when discovering UI proxy address. (#26674)
The call to IMDSv1 has been failing since we switched to v2, so the
UI e2e script attempted to use the service IP address for its
tests. The service IP address is the Nomad client's private
address, which is not routable from the e2e test runner, so the
test fails.

This change updates the IP discovery to use IMDSv2, so the address
is correctly populated and routable. The change also performs this
discovery through a job action within the proxy job, which
exercises that feature in the way it was designed to be used.
2025-09-02 11:02:48 +01:00
James Rasell
d5f2c0201e e2e: Wait for keyring before starting client intro client agents. (#26660)
Ensuring the keyring is ready before starting the Nomad client in
the client intro e2e test speeds up execution. This is because the
client does not have to wait to retry failed registrations due to
the keyring not being ready.
2025-09-01 07:32:40 +01:00
James Rasell
6bd8bc6c0c e2e: Add client intro test to exercise strict enforcement (#26648) 2025-08-29 08:40:48 +01:00
James Rasell
07bd1de72e e2e: Update UI playwright container image to v1.55.0 (#26650) 2025-08-28 16:41:57 +01:00
James Rasell
9e893ef2ad e2e: Add Client Intro test framework and initial test. (#26639)
The new client intro test mimics the Consul and Vault compat tests
and uses local agents to perform the required setup. This method
gives us the flexibility to test strict enforcement mode going
forward.

The test suite will now be triggered from the test-e2e CI run
and can also be called by a make target.
2025-08-28 09:53:07 +01:00
James Rasell
d0ffb31fea e2e: Add Client Identity get and renew tests. (#26632) 2025-08-26 13:49:06 +01:00
Allison Larson
3fff1aa3cc Support IMDSv2 on windows e2e runners (#26629) 2025-08-25 15:37:50 -07:00
Tim Gross
767683ce3e E2E: allow setting instance_type variable (#26607)
When we refactored the E2E provisioning to allow it to be reused by the upgrade
testing, we didn't thread the `instance_type` variable from the main module down
into the `provision-infra` module. This prevents you from setting a custom
instance size when deploying the E2E cluster manually.
2025-08-22 15:22:10 -04:00
Allison Larson
f6a078c7e5 Disable IMDSv2 on windows test instances (#26606) 2025-08-21 16:29:35 -07:00
Allison Larson
694e0ac2e3 Require IMDSv2 for e2e EC2 instances (#26585)
Re-enables this now that go-discover is updated in all the right places.
2025-08-20 14:47:43 -07:00
Daniel Bennett
8675fba382 e2e: install exec2 driver v0.1.0 (#26578)
for auto-unveil of NOMAD_SECRETS_DIR
following f3e08d8aa9
2025-08-19 11:28:57 -04:00
Daniel Bennett
f3e08d8aa9 e2e: exec2: envoy binary version and tidying (#26558)
* e2e: update standalone envoy binary version

fix for:

> === FAIL: e2e/exec2 TestExec2/testCountdash (21.25s)
>     exec2_test.go:71:
> ...
> [warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to local_agent closed: 3, Envoy 1.29.4 is too old and is not supported by Consul

there's also this warning, but it doesn't seem so fatal:

> [warning][main] [source/server/server.cc:910] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.

picked latest supported from latest consul (1.21.4):

```
$ curl -s localhost:8500/v1/agent/self | jq .xDS.SupportedProxies
{
  "envoy": [
    "1.34.1",
    "1.33.2",
    "1.32.5",
    "1.31.8"
  ]
}
```

* e2e: exec2: remove extraneous bits

 * reschedule: no reschedule for batch jobs
 * unveil: nomad paths get auto-unveiled with unveil_defaults
   https://github.com/hashicorp/nomad-driver-exec2/blob/v0.1.0/plugin/driver.go#L514-L522
2025-08-18 14:58:00 -04:00
Daniel Bennett
2c699b9794 sysbatch: fix panic from reschedule block (#26534)
* fix panic from nil ReschedulePolicy

commit 279775082c (pr #26279)
intended to return an error for sysbatch jobs with a reschedule block,
but in bypassing populating the `ReschedulePolicy`'s pointer fields,
a nil pointer panic occurred before the job could get rejected
with the intended error.

in particular, in `command/agent/job_endpoint.go`, `func ApiTgToStructsTG`,

```
if taskGroup.ReschedulePolicy != nil {
	tg.ReschedulePolicy = &structs.ReschedulePolicy{
		Attempts:      *taskGroup.ReschedulePolicy.Attempts,
		Interval:      *taskGroup.ReschedulePolicy.Interval,
```

`*taskGroup.ReschedulePolicy.Interval` was a nil pointer.

* fix e2e test jobs
2025-08-18 10:19:14 -04:00
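A minimal sketch of the nil pointer dereference described in 2c699b9794, using simplified stand-in types rather than Nomad's actual `api` and `structs` packages. The `derefOr` helper is hypothetical and only illustrates one way to guard pointer fields; it is not necessarily how the fix was implemented.

```go
package main

import (
	"fmt"
	"time"
)

// apiReschedulePolicy is a simplified stand-in for an API type whose
// fields are pointers, so "unset" can be told apart from a zero value.
type apiReschedulePolicy struct {
	Attempts *int
	Interval *time.Duration
}

// reschedulePolicy is a simplified stand-in for the internal type.
type reschedulePolicy struct {
	Attempts int
	Interval time.Duration
}

// derefOr (hypothetical helper) returns *p, or fallback when p is nil.
func derefOr[T any](p *T, fallback T) T {
	if p == nil {
		return fallback
	}
	return *p
}

// convert copies the API policy without dereferencing nil pointers.
// Writing *in.Interval directly would panic when Interval is unset.
func convert(in *apiReschedulePolicy) *reschedulePolicy {
	if in == nil {
		return nil
	}
	return &reschedulePolicy{
		Attempts: derefOr(in.Attempts, 0),
		Interval: derefOr(in.Interval, 0),
	}
}

func main() {
	attempts := 2
	// Interval is left nil, as when the pointer fields were never populated.
	out := convert(&apiReschedulePolicy{Attempts: &attempts})
	fmt.Println(out.Attempts, out.Interval)
}
```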
Aimee Ukasick
a30cb2f137 Update UI, code comment, and README links to docs, tutorials (#26429)
* Update UI, code comment, and README links to docs, tutorials

* fix typo in ephemeral disks learn more link url

* feedback on typo

Co-authored-by: Tim Gross <tgross@hashicorp.com>

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-08-06 09:40:23 -05:00
Tim Gross
6e5ecb6bb0 E2E: update Consul/Vault compat versions tested (#26369)
Update our E2E compatibility test for Consul and Vault to only include back to
the oldest-supported LTS versions of Consul and Vault. This will still leave
a few unsupported non-LTS versions in the matrix between the two oldest LTS, but
this is a small number of tests and fixing it would mean hard-coding the LTS
support matrix in our tests.
2025-07-28 12:03:30 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
Daniel Bennett
949b23602c e2e: ui: bump playwright version (#26119) 2025-06-23 13:31:11 -04:00
Tim Gross
7bfc04576a E2E: disable sdnotify for Consul agents (#26078)
In our E2E environment we've seen some flakiness with the Consul-related
tests. As it turns out, the Consul agents are getting restarted every 90s or so
because they're timing out their systemd notification.

> consul.service: start operation timed out. Terminating.

This appears to be a known issue in Consul. We'll offer to help hunt down
the cause, but in the meantime let's remove sdnotify from our systemd unit
files for the Consul agents.

Ref: https://github.com/hashicorp/consul/issues/16844#issuecomment-1913282248
2025-06-18 17:03:32 -04:00
Tim Gross
976ea854b0 E2E: fix scaling test assertion for extra Windows host (#26077)
* E2E: fix scaling test assertion for extra Windows host

The scaling test assumes that all nodes will receive the system job. But the job
can only run on Linux hosts, so the count will be wrong if we're running a
Windows host as part of the cluster. Filter the expected count by the OS.

While we're touching this test, let's also migrate it off the legacy framework.

* address comments from code review
2025-06-18 17:03:17 -04:00
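A minimal sketch of the assertion change described in 976ea854b0: the expected number of system job allocations is derived from the nodes that match the job's OS, not from the total node count. The `node` type and `countByOS` function are hypothetical stand-ins, not the test's actual code.

```go
package main

import "fmt"

// node is a simplified stand-in for a cluster node. In a real test the
// OS would come from a node attribute (assumption).
type node struct {
	Name string
	OS   string
}

// countByOS returns how many nodes report the given operating system.
func countByOS(nodes []node, os string) int {
	count := 0
	for _, n := range nodes {
		if n.OS == os {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{Name: "client-0", OS: "linux"},
		{Name: "client-1", OS: "linux"},
		{Name: "client-2", OS: "windows"},
	}
	// A Linux-only system job should be expected on Linux nodes only.
	expected := countByOS(nodes, "linux")
	fmt.Printf("expect %d allocations across %d nodes\n", expected, len(nodes))
}
```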
Tim Gross
3c67ba0516 E2E: update TaskAPI test for Windows (#26074)
The current version of Windows we're using ships with curl, so we don't need to
download it as an artifact anymore. Remove the broken reference to this in the TaskAPI
test for Windows.

Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/15708894856/job/44267973319
2025-06-17 16:03:50 -04:00
Tim Gross
d6800c41c1 E2E: include Windows 2022 host in test targets (#26003)
Some time ago the Windows host we were using as a Nomad client agent test target
started failing to allow ssh connections. The underlying problem appears to be
with sysprep but I wasn't able to debug the exact cause as it's not an area I
have a lot of expertise in.

Swap out the deprecated Windows 2016 host for a Windows 2022 host. This will use
a base image provided by Amazon and then we'll use a userdata script to
bootstrap ssh and some target directories for Terraform to upload files to. The
more modern Windows will let us drop some of the extra PowerShell scripts we were
using as well.

Fixes: https://hashicorp.atlassian.net/browse/NMD-151
Fixes: https://github.com/hashicorp/nomad-e2e/issues/125
2025-06-16 12:12:15 -04:00
Daniel Bennett
7519df8d06 task env: add NOMAD_UNIX_ADDR var (#25598)
for easier setup when using workload identity + task api
2025-06-11 15:56:51 -04:00
Piotr Kazmierczak
348177d118 e2e: correct TestSingleAffinities behavior (#25943)
TestSingleAffinities never expected a node with affinity score set to 0 in
the set of returned nodes. However, since #25800, this can happen. What the
test should be checking for instead is that the node with the highest normalized
score has the right affinity.
2025-05-30 19:46:08 +02:00
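A minimal sketch of the check described in 348177d118: the assertion looks only at the node with the highest normalized score and tolerates zero-score nodes elsewhere in the result set. The `scoredNode` type and its field names are hypothetical, not Nomad's score metadata types.

```go
package main

import "fmt"

// scoredNode is a simplified stand-in for one entry of placement score
// metadata. The field names are hypothetical.
type scoredNode struct {
	NodeID    string
	Matches   bool    // whether the node satisfies the job's affinity
	NormScore float64 // final normalized score
}

// topScored returns the entry with the highest normalized score.
// The second return value is false when the slice is empty.
func topScored(nodes []scoredNode) (scoredNode, bool) {
	if len(nodes) == 0 {
		return scoredNode{}, false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.NormScore > best.NormScore {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []scoredNode{
		{NodeID: "a", Matches: false, NormScore: 0}, // zero-score node is allowed
		{NodeID: "b", Matches: true, NormScore: 0.75},
	}
	if top, ok := topScored(nodes); ok {
		fmt.Println(top.NodeID, top.Matches)
	}
}
```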
Michael Smithhisler
4c8257d0c7 client: add once mode to template block (#25922) 2025-05-28 11:45:11 -04:00
Piotr Kazmierczak
a10c2f6de7 e2e: mention in the terraform readme that we require a local Consul binary (#25944) 2025-05-28 17:12:57 +02:00
Tim Gross
0e728b87db E2E: remove dnsmasq and references to ECS plugin (#25892)
The DNS configuration for our E2E cluster uses dnsmasq to pass all DNS through
Consul. But there's a circular reference in systemd configurations that
sometimes causes the Docker service to fail. This causes test flakes during
upgrade testing because we count the number of nodes and expect `system` jobs
using Docker to run on all nodes.

We no longer have any tests that require Consul DNS, so remove the complication
of dnsmasq to break the reference cycle. Also, while I was looking at this I
noticed we still had setup that would configure the ECS remote task driver
plugin, which is archived. Remove this as well.

Ref: https://hashicorp.atlassian.net/browse/NMD-162
2025-05-20 08:26:22 -04:00
James Rasell
4b40e10e68 e2e: Update UI playwright version to 1.52.0 (#25740) 2025-04-24 13:38:26 +01:00
James Rasell
717207bce0 e2e: Fix TestDocker/testRedis with increased timeout on deployment (#25739)
The fresh deployment of the Redis job took around 20s which is
also the default context timeout on the e2e util that monitors and
waits for a deployment to complete.

The tight timing meant the test often timed out but sometimes
would complete successfully. Increasing the timeout for this
deployment will remove the flakiness.
2025-04-24 09:09:33 +01:00
Tim Gross
88dc842729 testing: use Docker Hub registry mirror for CI (#25703)
As of April 1, Docker Hub rate limits tightened. With only 10 pulls/hr/IP, we're
likely to encounter test failures. Switch all Docker images getting pulled from
this repository to use the HashiCorp managed registry mirror.

Note that most of our tests in `drivers/docker` don't pull from the remote
registry but load a local image, while others will need to pull from the remote
and fetch different images depending on OS/arch. Refactor the definition of test
task configuration to make it clear which is which, and de-factor some false
sharing of setup functions.

Updates the E2E tests to use that registry by configuring the Docker
daemon. This required changing out a few container images that we don't have in
the registry, but these new images are all smaller. There are a couple of tests
that still use explicitly-tagged `docker.io` images or other third-party
registries, which have been left in place.

Ref: https://hashicorp.atlassian.net/browse/NET-12233

update E2E images to those in the registry mirror

fix windows and docklog test build

fix stopsignal test

mop-up

more mop-up
2025-04-18 14:21:49 -04:00
James Rasell
311a83d706 e2e: Ensure UI is enabled. (#25620)
The `ui.enabled` parameter is a non-pointer bool which means the
merge function is unable to differentiate between false and not
set. When e2e introduced the `ui.show_cli_hints` configuration
parameter, the way we merge meant the UI became disabled.
2025-04-08 13:57:29 +01:00
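A minimal sketch of the merge ambiguity described in 311a83d706, using a simplified stand-in config type rather than Nomad's actual one. The unconditional overwrite of `Enabled` in `merge` is an assumption made for illustration; the point is only that a plain bool cannot carry "not set".

```go
package main

import "fmt"

// uiConfig is a simplified stand-in for a UI config block.
type uiConfig struct {
	Enabled      bool  // plain bool: false and "not set" look identical
	ShowCLIHints *bool // pointer: nil means "not set"
}

// merge overlays one config on another. The overlay's Enabled always
// wins because there is no way to tell whether false was intentional.
func merge(base, overlay uiConfig) uiConfig {
	out := base
	out.Enabled = overlay.Enabled
	if overlay.ShowCLIHints != nil {
		out.ShowCLIHints = overlay.ShowCLIHints
	}
	return out
}

func main() {
	hints := false
	base := uiConfig{Enabled: true}
	// The overlay only mentions ShowCLIHints and says nothing about Enabled.
	overlay := uiConfig{ShowCLIHints: &hints}

	merged := merge(base, overlay)
	fmt.Println("enabled after merge:", merged.Enabled)
}
```

Setting `Enabled` explicitly in the overlay avoids the ambiguity, which is in line with the commit title.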
James Rasell
6c39285538 e2e: Ensure test resources are cleaned. (#25611)
I couldn't find any reason the exec2 HTTP jobs were not being run
with a generated cleanup function, so I added this.

The deletion of the DHV ACL policy does not seem like it would
have any negative impact.
2025-04-07 14:15:29 +01:00
Phil Renaud
afa9e65afa Update playwright to 1.51.0 for e2e ui tests (#25585) 2025-04-02 15:12:00 +01:00
Michael Smithhisler
c8cc519f54 e2e: disable cli hints for command parsing (#25584) 2025-04-02 09:12:36 -04:00
Michael Smithhisler
95c9029df0 e2e: update consul task policy and add empty consul block to task groups (#25580) 2025-04-01 16:29:47 -04:00
Michael Smithhisler
077c1921ef e2e: disable IMDSv2 in tests (#25564)
Consul needs to use a newer version of go-discover that can query IMDSv2
before we can enable IMDSv2 on our test infrastructure.
2025-03-31 12:07:45 -04:00
Michael Smithhisler
8e3625a716 e2e: create consul policies and roles in respective namespaces (#25546) 2025-03-28 13:52:49 -04:00
Piotr Kazmierczak
a1fd9da705 e2e: require IMDSv2 for ec2 instances (#25541)
Require Instance Metadata Service v2 to access EC2 instance metadata for all VMs
that run our e2e suite.
2025-03-28 09:58:51 +01:00
Michael Smithhisler
f0e0215d56 e2e: fix consul e2e enterprise logic in bootstrapping (#25532) 2025-03-26 14:08:20 -04:00
Michael Smithhisler
c66269f8d0 e2e: fixes node write policy for consul agents (#25418) 2025-03-17 15:18:30 -04:00
Juana De La Cuesta
9b9d16421e Merge branch 'main' into NET-11546-enos-drain 2025-03-17 16:14:18 +01:00