Commit Graph

720 Commits

Author SHA1 Message Date
Daniel Bennett
2da38ba9c4 e2e: jobs3 hcl vars differently (#23363)
and include jobspec and vars in registrations
(so they show up in the UI under job Definition)
2024-06-17 13:20:51 -05:00
Daniel Bennett
5a6e3d5ef0 e2e: add Enterprise Option for cluster3.Establish (#23362) 2024-06-17 12:59:37 -05:00
Tim Gross
288a048a2e e2e: add prerelease builds to Consul/Vault compatibility tests (#23287)
Update the Consul/Vault build downloader functions so that we include the
current prerelease build (if any) in the E2E compatibility testing we do on each
PR. This will automatically cycle out when the GA build is released, because
that build is "higher" in the sorted set.
2024-06-11 08:54:27 -04:00
Seth Hoenig
2054e87158 e2e: add tests for exec2 task driver (#22406)
* e2e: add tests for exec2 task driver

* e2e: use envoy 1.29.4 because consul

* e2e: add a bridge networking http test for exec driver

* e2e: split up http test so curl always starts after the server
2024-05-31 09:22:39 -05:00
Seth Hoenig
9fb2b10ab6 e2e: no longer need consul terraform module (#22396) 2024-05-28 08:04:03 -05:00
Tim Gross
91d422ec21 E2E: document how the AMIs are tagged and how those tags are used (#22237)
The process by which we tag AMIs with the commit SHA of the Packer directory
isn't documented in this repository, which makes it easy to accidentally build
an AMI that will break nightly E2E.
2024-05-24 11:11:00 -05:00
James Rasell
04ba358266 client: expose network namespace CNI config as task env vars. (#11810)
This change exposes CNI configuration details of a network
namespace as environment variables. This allows a task to use
these values to configure itself; a potential use case is to run
a Raft application binding to IP and Port details configured using
the bridge network mode.
2024-05-14 09:02:06 +01:00
Piotr Kazmierczak
abe9c0803a e2e: unflake TestWorkloadIdentity/testNobody (#20499)
sometimes the container quits too fast
2024-04-30 18:17:14 +02:00
Tim Gross
ff2d9de592 Revert "E2E: skip Vault 1.16.1 for JWT compatibility test (#20301)" (#20484)
This reverts commit 45b36371a12ffae5b5bfaaeadb08f801fb6bc98d. Now that Vault
1.16.2 has shipped, the E2E test will pick up only a working version.

Closes: https://github.com/hashicorp/nomad/issues/20298
2024-04-26 09:36:09 -04:00
Tim Gross
d40e23f939 E2E: clean up go mod cache after building consul-cni (#20378)
In #20296 we added a Go tool chain to the AMI we use for E2E tests, so that we
can build `consul-cni` for tproxy testing. This is intended to be temporary
until `consul-k8s` 1.4.2 is officially released. But the Go cache from building
`consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly
small disks. This causes the Nomad clients to aggressively GC client allocations
that stop, which breaks tests that run batch workloads and then read their logs.
2024-04-12 11:52:46 -04:00
Tim Gross
8298d39e78 Connect transparent proxy support
Add support for Consul Connect transparent proxies

Fixes: https://github.com/hashicorp/nomad/issues/10628
2024-04-10 11:00:18 -04:00
Tim Gross
548adb0fd4 tproxy: E2E tests (#20296)
Add the `consul-cni` plugin to the Linux AMI for E2E, and add a test case that
covers the transparent proxy feature. Add test assertions to the Connect tests
for upstream reachability.

Ref: https://github.com/hashicorp/nomad/pull/20175
2024-04-05 14:23:26 -04:00
Tim Gross
2382ab8776 E2E: ensure periodic test can't fail due to cron conflicts (#20300)
The E2E test for periodic dispatch jobs has a `cron` trigger for once a
minute. If the test happens to run at the top of the minute, it's possible for
the forced dispatch to run from the test code, then the periodic timer triggers
and leaves a running child job. This fails the test because it expects only a
single job in the "dead" state.

Make it so that the `cron` expression is implausible to run during our test
window, and migrate the test off the old framework while we're at it.
2024-04-05 08:45:35 -04:00
Tim Gross
648daceca1 E2E: skip Vault 1.16.1 for JWT compatibility test (#20301)
Vault 1.16.1 has a known issue around the JWT auth configuration that will
prevent this test from ever passing. Skip testing the JWT code path on
1.16.1. Once 1.16.2 ships it will no longer get skipped.

Ref: https://github.com/hashicorp/nomad/issues/20298
2024-04-04 17:00:35 -04:00
Tim Gross
c1f020d60f E2E: refactor Connect tests to use stdlib testing (#20278)
Migrate our E2E tests for Connect off the old framework in preparation for
writing E2E tests for transparent proxy and the updated workload identity
workflow. Mark the tests that cover the legacy Consul token submitted workflow.

Ref: https://github.com/hashicorp/nomad/pull/20175
2024-04-04 10:48:10 -04:00
Tim Gross
4ce728afbd E2E: make vault.create_from_role unique per cluster (#20267)
If an E2E cluster is destroyed after a different one has been created, the role
and policy we create in Vault for the cluster will be deleted and Vault-related
tests will fail. Note that before 1.9, we should figure out a way to give HCP
Vault access to the JWKS endpoint and have a different set of policies, but
we'll need to have a role-per-cluster in that case as well.

Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)
2024-04-03 08:45:01 -04:00
Tim Gross
cf25cf5cd5 E2E: use a self-hosted Consul for easier WI testing (#20256)
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul
token workflow, but they are limited to running single node tests. The E2E
cluster is network isolated, so using our HCP Consul cluster runs into a
problem validating WI tokens because it can't reach the JWKS endpoint. In real
production environments, you'd solve this with a CNAME pointing to a public IP
pointing to a proxy with a real domain name. But that's logistically
impractical for our ephemeral nightly cluster.

Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our
Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can
reach each other. This will allow us to update our Consul tests so they can use
Workload Identity, in a separate PR.

Ref: #19698
2024-04-02 15:24:51 -04:00
Tim Gross
de218d1919 E2E: change timing of vaultsecrets test to guarantee lease window (#20200)
We've been getting a couple of errors from this test on nightly where the
template hasn't rendered by the time we expect it to. I've run some tests
locally and this may be a timing issue introduced by recent code changes to
templates.

Move the start of the timer to after we're guaranteed that we've got a secret
lease TTL started, to eliminate this as a source of flakiness. In my tests this
adds another ~5s to a test that already takes over a minute to run anyways.
2024-03-22 16:12:00 -04:00
Daniel Bennett
e059adef98 e2e: PreCleanup and other jobs3 helpers (#19844) 2024-01-29 17:54:54 -06:00
Piotr Kazmierczak
543ba16e61 e2e: more retries for RequireConsulDeregistered (#19801) 2024-01-22 20:11:48 +01:00
Piotr Kazmierczak
8a4bd61caf e2e: WaitForJobStopped correction (#19749) 2024-01-22 11:38:22 +01:00
Piotr Kazmierczak
8226a85263 e2e: remove deprecated template_file dependency for tf (#19313)
This also allows running tf for our e2e suite locally on darwin.
2024-01-15 18:42:28 +01:00
Piotr Kazmierczak
609f3a60b5 e2e: purging jobs removes all allocs (#19744)
There's no need to wait for allocs since #19609; in fact, waiting for allocs to
stop will always fail, leading to e2e failures.
2024-01-15 17:54:35 +01:00
Piotr Kazmierczak
858a805d7d e2e: add a note about provisioning the infrastructure on macOS/Apple Silicon (#19727) 2024-01-12 14:09:50 +01:00
Seth Hoenig
a58f0eca8e e2e: move rawexec oversub tests into oversubscription e2e test suite (#19717)
* e2e: move rawexec oversub tests into oversubscription e2e test suite

This PR moves two tests for raw_exec and memory oversubscription into
the oversubscription test suite, which has the necessary plumbing to
activate and restore the oversubscription configuration of the scheduler
during the test.

* cr: rename files for better readability
2024-01-11 14:27:05 -06:00
Piotr Kazmierczak
930339a0fa e2e: remove broken Consul WI test (#19697) 2024-01-10 21:31:18 +01:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Seth Hoenig
ccfb13a72d e2e: add test for raw_exec memory_max configuration (#19596)
* e2e: add test for raw_exec memory_max configuration

* docs: note raw_exec supports memory_max in resources documentation
2024-01-04 08:25:56 -06:00
Piotr Kazmierczak
aa197cf824 e2e: pass Nomad address to Consul WI test (#19603) 2024-01-04 08:52:39 +01:00
Piotr Kazmierczak
a87aa71f55 e2e: fix typo in Consul e2e (#19589) 2024-01-03 09:34:38 +01:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.

We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00
Piotr Kazmierczak
bb3d2227a2 e2e: add a test for checking default WI Consul workflow for services and tasks (#19500) 2024-01-02 16:02:32 +01:00
Daniel Bennett
eb23add189 e2e: sleep in docker job (#19434) 2023-12-11 15:38:14 -06:00
Tom Davies
c983a8f0ad Fixes Consul token checking when policies exist within namespaces (#18516)
* e2e/connect: adds test for namespace policies

* consul: use token namespace when fetching policies

* changelog

* fixup! e2e/connect: adds test for namespace policies
2023-12-11 10:07:32 -06:00
Seth Hoenig
f3cbe2e29a e2e: sleep a bit in short lived docker jobs (#19384) 2023-12-08 10:44:43 -06:00
Daniel Bennett
e9ff6d74d3 e2e: unflake oversubscription.testExec (#19373)
poll with must.Wait() instead of hard-coded sleep
waiting for poststart task to run, and wait for longer
2023-12-08 10:20:18 -06:00
Daniel Bennett
7baf3c012c e2e: even more time for exec+java tests (#19347) 2023-12-07 10:23:39 -06:00
Seth Hoenig
8cde7a4f70 e2e: turn off extremely verbose metrics test logging (#19330) 2023-12-06 16:08:49 -06:00
Tim Gross
340c9ebd47 E2E: extend timeout on CSI snapshot test (#19338)
The EBS snapshot operation can take a long time to complete. Recent runs have
shown we sometimes get up to the 10s timeout on the context we're giving the CLI
command. Extend this so that we're not getting spurious timeouts.

Fixes: https://github.com/hashicorp/nomad/issues/19118
2023-12-06 16:34:54 -05:00
Daniel Bennett
36f69a8e88 e2e: more occasionally slow exec tasks (#19337) 2023-12-06 15:22:15 -06:00
Daniel Bennett
9fe1f0aadc e2e: fix ConsulNamespaces tests (#19325)
* cleanup consul tokens by accessor id
rather than secret id, which has been failing for some time with:
> 404 (Cannot find token to delete)

* expect subset of consul namespaces
the consul test cluster may have namespaces from other unrelated tests
2023-12-06 12:21:27 -06:00
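
The "expect subset" assertion described in 9fe1f0aadc can be sketched as a check that every expected namespace is present while extra ones are tolerated. This is an illustration only: `containsAll` is a hypothetical helper, and the namespace names are invented rather than taken from the test.

```go
package main

import "fmt"

// containsAll reports whether every name in want appears in got.
// Extra entries in got are ignored.
func containsAll(got, want []string) bool {
	seen := make(map[string]struct{}, len(got))
	for _, g := range got {
		seen[g] = struct{}{}
	}
	for _, w := range want {
		if _, ok := seen[w]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// The shared cluster may hold namespaces left by unrelated tests.
	got := []string{"default", "apple", "banana", "left-by-another-test"}
	want := []string{"apple", "banana"}
	fmt.Println(containsAll(got, want))
}
```

An exact-equality assertion would fail here because of the leftover namespace, while the subset check passes.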
Seth Hoenig
87e7bf4ab2 e2e: skip connect test that does a restart of nomad agent (#19316) 2023-12-05 09:15:09 -06:00
Seth Hoenig
35ccb7ecdb e2e: use correct url to download zip file from go-getter repository (#19315) 2023-12-05 09:11:08 -06:00
Seth Hoenig
cc65f39c82 e2e/v3: dump eval if detected as cancelled (#19310) 2023-12-05 09:07:12 -06:00
Daniel Bennett
c7d01705f5 e2e: push nomad token to servers (#19312)
so humans with root shell access can use it to debug

not ideal security, but this is a short-lived test cluster
2023-12-05 08:54:57 -06:00
Seth Hoenig
6779d7c7b4 e2e: add a ShowState() option to cluster3.Establish options (#19303)
This will dump much of the interesting parts of cluster state, including
available nodes and their status, existing allocations and their status,
and existing evaluations and their status.
2023-12-04 12:37:21 -06:00
Daniel Bennett
d34788896f e2e: jobs3-submitted jobs automatically cleanup (#19284)
so that cleanup occurs even if the job fails to run
(unless configured not to)
2023-12-01 15:57:23 -06:00
Daniel Bennett
bfb2263f30 e2e: give isolation test jobs more time to start (#19276) 2023-12-01 14:03:40 -06:00
Seth Hoenig
5b3416bd97 e2e: set e2e/v3 debug logging on metrics test (#19263) 2023-12-01 10:03:55 -06:00
Tim Gross
05fe2ad191 E2E: fix assertion in CT native service lookup test (#19249)
When porting the `ConsulTemplate` test, I made a last-minute refactor to the
assertions for waiting on files, and accidentally inverted the test assertion in
the process.

Also, when running `jobs3.Submit` you need to include the `Namespace` option so
that the cleanup function that gets returned deletes the job from the correct
namespace. This was causing the namespace cleanup to fail because the job
deletion had failed.
2023-12-01 08:54:55 -05:00