Commit Graph

212 Commits

Author SHA1 Message Date
Tim Gross
ef366ee166 E2E: update .gitignore files to avoid committing runtime files (#24855)
In #24694 we did a major refactoring of the E2E Terraform configuration. After
deploying a cluster this morning, I noticed a few moved/removed files were not
reflected in the .gitignore files. This changeset updates the .gitignore to have
no unstaged files after applying.
2025-01-14 12:16:01 -05:00
Juana De La Cuesta
b29a3736a4 Update e2e infra provision to expect providers (#24694)
* func: move infra provisionining to a module and remove providers

* func: update paths

* func: update more paths

* func: update path inside bootstrap scrip

* style: remove debug prints on bootstrap scripts

* Delete e2e/terraform/csi/input/volume-efs.hcl

* fix: update keys path to use module path instead pf root

* fix: add missing headers

* fix: update keys directory inside provision-nomad

* style; format hcl files

* Update compute.tf

* Update e2e/terraform/main.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Update e2e/terraform/provision-infra/compute.tf

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* fix: update more paths

* fix: fmt hcl files

* func: final paths revision for running e2e locally

* fix: make path of certs relative to module for the bootstrap

* func: final paths revision for running e2e locally

* Update network.tf

* fix: fix typo and add success message

* fix: remove the test name from token to avoid long names and use name for vol to avoid colisions

* func: unify the uploads folder

* func: make the uploads file one per cluster

* func: Add outputs with all data necessary to connect to the cluster

* fix: make nomad token a sensitive output

* Update bootstrap-nomad.sh

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-01-13 15:59:40 +01:00
Juana De La Cuesta
2eb2b6c739 fix: update the dnsconfig script to handle multiple interfaces (#24800) 2025-01-07 21:12:18 +01:00
Tim Gross
abeae5c47b E2E: use a variable for region (#24693)
In #24644 we set the region to "e2e" but forgot to setup the TLS certificate
names appropriately. Swap the region out for a variable instead.
2024-12-17 10:28:22 -05:00
Tim Gross
75b0202f7f api: don't copy previously parsed URL when setting new address (#24644)
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d867), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: https://github.com/hashicorp/nomad/issues/24635
Fixes: https://github.com/hashicorp/nomad/issues/24609
Ref: https://github.com/hashicorp/nomad/pull/16872
Ref: https://github.com/hashicorp/nomad/pull/23785
Ref: 4d3b75d867
2024-12-16 11:05:29 -05:00
Juana De La Cuesta
526c6375ad Make paths in e2e/terraform/ directory relative to the module (#24664)
* func: make paths relative

* func: make paths relative to the module inside the e2e terraform folder

* fix: add license files to gitignore

* func: move /etc and update all paths

* Uncomment forgotten code

* fix: update the path to the tls certificates to be local to the instance
2024-12-13 17:33:59 +01:00
Juana De La Cuesta
a9a0f71213 Remove sockaddr and use native tools (#24665)
* func: remove sockaddr and use native tools

* Update setup.sh
2024-12-13 17:24:53 +01:00
Juana De La Cuesta
270b4f97a6 Update some details of the terraform readme file for e2e provisioning (#24451)
* docs: update instructions to provision e2e cluster

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/terraform.tfvars

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

* Update e2e/terraform/README.md

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>

---------

Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2024-11-18 13:36:51 +01:00
Piotr Kazmierczak
a22e56390e e2e: fix failing tests due to docker plugin settings (#24234) 2024-10-17 11:12:59 +02:00
Piotr Kazmierczak
f9cbaaf6c7 docker: fix a bug where auth for private registries wasn't parsed correctly (#24215)
In #23966 we introduced an official Docker client and did not notice that in
contrast to our previous 3rd party client, the official SDK PullOptions object
expects a base64 encoded JSON with username and password, instead of username/
password pair.
2024-10-16 22:04:54 +02:00
Tim Gross
d261d58ea2 build: update hc-install to current (#24199)
Installing Vault and Consul from releases.hashicorp.com via `hc-install` has
been failing intermittently. Update the `hc-install` binaries to be current and
add one retry to downloads for our compat tests so that we can get builds more
reliably green while the underlying issue is being debugged.
2024-10-15 10:07:58 -04:00
Daniel Bennett
10d3f1749b e2e: test all cni config formats (#23650) 2024-07-22 10:17:03 -05:00
Tim Gross
a29f9b6fc0 keyring: E2E testing for KMS/rotation (#23601)
In #23580 we're implementing support for encrypting Nomad's key material with
external KMS providers or Vault Transit. This changeset breaks out the E2E
infrastructure and testing from that PR to keep the review manageable.

Ref: https://hashicorp.atlassian.net/browse/NET-10334
Ref: https://github.com/hashicorp/nomad/issues/14852
Ref: https://github.com/hashicorp/nomad/pull/23580
2024-07-19 13:49:48 -04:00
Daniel Bennett
de10efa3fa e2e: hc-install consul-cni (#23612)
now that the version with tproxy CNI_ARGS is on releases.hashicorp.com
2024-07-17 14:26:40 -05:00
Daniel Bennett
afbd283c1b e2e: skip missing windows ami if windows clients=0 (#23610)
and tweak Makefile to generate a custom.tfvars
instead of specifying vars separately via CLI.
hoping this makes it a little more obvious
if there is no consul/nomad license.
2024-07-17 12:45:41 -05:00
Martina Santangelo
bc81c85ec7 e2e: cni args tests (#23597)
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2024-07-15 17:08:50 -04:00
Deniz Onur Duzgun
c82dd76a1b security: update tls cipher suites (#23551) 2024-07-11 14:01:45 -04:00
Seth Hoenig
2054e87158 e2e: add tests for exec2 task driver (#22406)
* e2e: add tests for exec2 task driver

* e2e: use envoy 1.29.4 because consul

* e2e: add a bridge networking http test for exec driver

* e2e: split up http test so curl always starts after the server
2024-05-31 09:22:39 -05:00
Seth Hoenig
9fb2b10ab6 e2e: no lnoger need consul terraform module (#22396) 2024-05-28 08:04:03 -05:00
Tim Gross
91d422ec21 E2E: document how the AMIs are tagged and how those tags are used (#22237)
The process by which we tag AMIs with the commit SHA of the Packer directory
isn't documented in this repository, which makes it easy to accidentally build
an AMI that will break nightly E2E.
2024-05-24 11:11:00 -05:00
Tim Gross
d40e23f939 E2E: clean up go mod cache after building consul-cni (#20378)
In #20296 we added a Go tool chain to the AMI we use for E2E tests, so that we
can build `consul-cni` for tproxy testing. This is intended to be temporary
until `consul-k8s` 1.4.2 is officially released. But the Go cache from building
`consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly
small disks. This causes the Nomad clients to aggressively GC client allocations
that stop, which breaks tests that run batch workloads and then read their logs.
2024-04-12 11:52:46 -04:00
Tim Gross
548adb0fd4 tproxy: E2E tests (#20296)
Add the `consul-cni` plugin to the Linux AMI for E2E, and add a test case that
covers the transparent proxy feature. Add test assertions to the Connect tests
for upstream reachability

Ref: https://github.com/hashicorp/nomad/pull/20175
2024-04-05 14:23:26 -04:00
Tim Gross
4ce728afbd E2E: make vault.create_from_role unique per cluster (#20267)
If a E2E cluster is destroyed after a different one has been created, the role
and policy we create in Vault for the cluster will be deleted and Vault-related
tests will fail. Note that before 1.9, we should figure out a way to give HCP
Vault access to the JWKS endpoint and have a different set of policies, but
we'll need to have a role-per-cluster in that case as well.

Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)
2024-04-03 08:45:01 -04:00
Tim Gross
cf25cf5cd5 E2E: use a self-hosted Consul for easier WI testing (#20256)
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul
token workflow, but they are limited to running single node tests. The E2E
cluster is network isolated, so using our HCP Consul cluster runs into a
problem validating WI tokens because it can't reach the JWKS endpoint. In real
production environments, you'd solve this with a CNAME pointing to a public IP
pointing to a proxy with a real domain name. But that's logisitcally
impractical for our ephemeral nightly cluster.

Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our
Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can
reach each other. This will allow us to update our Consul tests so they can use
Workload Identity, in a separate PR.

Ref: #19698
2024-04-02 15:24:51 -04:00
Piotr Kazmierczak
8226a85263 e2e: remove deprecated template_file dependency for tf (#19313)
This also allows running tf for our e2e suite locally on darwin.
2024-01-15 18:42:28 +01:00
Piotr Kazmierczak
858a805d7d e2e: add a note about provisioning the infrastructure on macOS/Apple Silicon (#19727) 2024-01-12 14:09:50 +01:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.

We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00
Daniel Bennett
c7d01705f5 e2e: push nomad token to servers (#19312)
so humans with root shell access can use it to debug

not ideal security, but this is a short-lived test cluster
2023-12-05 08:54:57 -06:00
Daniel Bennett
4ec9343447 e2e: use tf variable defaults (#19108) 2023-11-16 14:50:11 -06:00
Seth Hoenig
f211a0ab7c e2e: update terrform lock file for 1.6.3 (#19049)
Using the latest version of terraform, the lock file is not the same
as when it was generated. Seems like the http module is not needed?
versioned? present? anymore.
2023-11-09 10:49:49 -06:00
Seth Hoenig
402540f7fb e2e: bump packer build instances because faster (#19046) 2023-11-09 09:33:30 -06:00
Seth Hoenig
a28e5b6965 e2e: refactor metrics test to use NSD and WI (#19022)
* e2e: remove old metrics suite

* e2e: install stress on e2e jammy image

* e2e: overhaul metrics test to use nomad service discovery, workload identity

* e2e: format metrics hcl files and copywrite

* e2e: undo tf lock file

* e2e: undo reg auth file perms

* e2e: format cpustress.hcl
2023-11-09 08:21:16 -06:00
Seth Hoenig
63da22063b e2e: update pledge driver to 0.3.0 (#19020) 2023-11-08 06:58:59 -06:00
Seth Hoenig
a2f7ab2645 e2e disable windows (#19012)
* e2e: disable windows client

* e2e: disable windows artifact test
2023-11-07 09:34:18 -06:00
Daniel Bennett
a51d46c65c e2e: packer windows from "ECS_Optimized" image (#18453)
"Containers" AMIs evaporated at some point...
https://aws.amazon.com/marketplace/pp/prodview-yfve3zjgfjtug
> This version has been removed and is no longer
> available to new customers.
2023-09-11 12:26:32 -05:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Seth Hoenig
8d28946993 e2e podman private registry (#17642)
* e2e: add tests for using private registry with podman driver

This PR adds e2e tests that stands up a private docker registry
and has a podman tasks run a container from an image in that private
registry.

Tests
 - user:password set in task config
 - auth_soft_fail works for public images when auth is set in driver
 - credentials helper is set in driver auth config
 - config auth.json file is set in driver auth config

* packer: use nomad-driver-podman v0.5.0

* e2e: eliminate unnecessary chmod

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>

* cr: no need to install nomad twice

* cl: no need to install docker twice

---------

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-07-19 15:59:36 -05:00
Seth Hoenig
159bf51120 e2e: add some e2e tests for pledge task driver (#17909)
* e2e: setup nomad for pledge driver

* e2e: add some e2e tests for pledge task driver
2023-07-12 11:56:08 -05:00
Daniel Bennett
6bd509869b e2e: use DNS instead of HTTP to get my_public_ipv4 (#17759) 2023-06-28 13:11:57 -05:00
Daniel Bennett
748aea1c61 e2e: fix windows client docker (#17572)
the windows docker install script stopped working.

after trying various things to fix the script,
I opted instead for a base image that comes with
docker already installed.

error output during build was:
  Installing Docker.
  WARNING: Cannot find path 'C:\Users\Administrator\AppData\Local\Temp\DockerMsftProvider\DockerDefault_DockerSearchIndex.json' because it does not exist.
  WARNING: Cannot bind argument to parameter 'downloadURL' because it is an empty string.
  WARNING: The property 'AbsoluteUri' cannot be found on this object. Verify that the property exists.
  WARNING: The property 'RequestMessage' cannot be found on this object. Verify that the property exists.
  Failed to install Docker.
  Install-Package : No match was found for the specified search criteria and package name 'docker'.
2023-06-20 10:17:16 -05:00
Seth Hoenig
6975409386 e2e: cleanup podman installation in jammy image (#17558)
* e2e: cleanup podman installation in jammy image

The original steps were copied over from the bionic image and does a lot
of hoop jumping we do not need anymore.

For the moment just hard-code installing the v0.4.2 version of the driver,
but I may follow up and modify hc-install to support installing @latest
like go itself.

* use releases for hc-install
2023-06-15 18:17:31 -05:00
Seth Hoenig
6b2834559f e2e: purge bionic packer image scripts (#17559)
Bionic is dead, long live the Jammy!
2023-06-15 15:15:01 -05:00
Shawn
9898e85d09 fix: typo (#16873) 2023-04-12 16:18:13 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Tim Gross
6cb69e5609 E2E: test enforcement of ACL system (#16796)
This changeset provides a matrix test of ACL enforcement across several
dimensions:
  * anonymous vs bogus vs valid tokens
  * permitted vs not permitted by policy
  * request sent to server vs sent to client (and forwarded)
2023-04-06 09:11:20 -04:00
Michael Schurter
282e3bcfcc Enable ACLs on E2E test clients (#16530)
* e2e: uniformly enable acls across all agents

* docs: clarify that acls should be set everywhere
2023-03-16 14:22:41 -07:00
Seth Hoenig
40ab325594 e2e: setup nomad permissions correctly (client vs. server) (#16399)
This PR configures

- server nodes with a systemd unit running the agent as the nomad service user
- client nodes with a root owned nomad data directory
2023-03-08 14:41:08 -06:00
Seth Hoenig
24af468b67 e2e: fix permissions on nomad data directory (#16376)
This PR updates the provisioning step where we create /opt/nomad/data,
such that it is with 0700 permissions in line with our security guidance.
2023-03-07 14:41:54 -06:00
Tim Gross
517ad9c5bf E2E: add multi-home networking to test infrastructure (#16218)
Add an Elastic Network Interface (ENI) to each Linux host, on a secondary subnet
we have provisioned in each AZ. Revise security groups as follows:

* Split out client security groups from servers so that we can't have clients
  accidentally accessing serf addresses or other unexpected cross-talk.
* Add new security groups for the secondary subnet that only allows
  communication within the security group so we can exercise behaviors with
  multiple IPs.

This changeset doesn't include any Nomad configuration changes needed to take
advantage of the extra network interface. I'll include those with testing for
PR #16217.
2023-02-20 10:08:28 +01:00
Seth Hoenig
6e4410a9b1 e2e: fix 1 of 4 client disconnect tests (#15357)
This PR modifies the disconnect helper job to run as root, which is necesary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the sheduler more
time to do its thing).

Skips the other 3 tests, which fail and I cannot yet figure out what is going on.
2022-11-22 08:51:53 -06:00