In #24694 we did a major refactoring of the E2E Terraform configuration. After
deploying a cluster this morning, I noticed a few moved/removed files were not
reflected in the .gitignore files. This changeset updates the .gitignore to have
no unstaged files after applying.
* func: move infra provisionining to a module and remove providers
* func: update paths
* func: update more paths
* func: update path inside bootstrap scrip
* style: remove debug prints on bootstrap scripts
* Delete e2e/terraform/csi/input/volume-efs.hcl
* fix: update keys path to use module path instead pf root
* fix: add missing headers
* fix: update keys directory inside provision-nomad
* style; format hcl files
* Update compute.tf
* Update e2e/terraform/main.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Update e2e/terraform/provision-infra/compute.tf
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* fix: update more paths
* fix: fmt hcl files
* func: final paths revision for running e2e locally
* fix: make path of certs relative to module for the bootstrap
* func: final paths revision for running e2e locally
* Update network.tf
* fix: fix typo and add success message
* fix: remove the test name from token to avoid long names and use name for vol to avoid colisions
* func: unify the uploads folder
* func: make the uploads file one per cluster
* func: Add outputs with all data necessary to connect to the cluster
* fix: make nomad token a sensitive output
* Update bootstrap-nomad.sh
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.
When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.
This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d867), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.
Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.
Fixes: https://github.com/hashicorp/nomad/issues/24635
Fixes: https://github.com/hashicorp/nomad/issues/24609
Ref: https://github.com/hashicorp/nomad/pull/16872
Ref: https://github.com/hashicorp/nomad/pull/23785
Ref: 4d3b75d867
* func: make paths relative
* func: make paths relative to the module inside the e2e terraform folder
* fix: add license files to gitignore
* func: move /etc and update all paths
* Uncomment forgotten code
* fix: update the path to the tls certificates to be local to the instance
In #23966 we introduced an official Docker client and did not notice that in
contrast to our previous 3rd party client, the official SDK PullOptions object
expects a base64 encoded JSON with username and password, instead of username/
password pair.
Installing Vault and Consul from releases.hashicorp.com via `hc-install` has
been failing intermittently. Update the `hc-install` binaries to be current and
add one retry to downloads for our compat tests so that we can get builds more
reliably green while the underlying issue is being debugged.
and tweak Makefile to generate a custom.tfvars
instead of specifying vars separately via CLI.
hoping this makes it a little more obvious
if there is no consul/nomad license.
* e2e: add tests for exec2 task driver
* e2e: use envoy 1.29.4 because consul
* e2e: add a bridge networking http test for exec driver
* e2e: split up http test so curl always starts after the server
The process by which we tag AMIs with the commit SHA of the Packer directory
isn't documented in this repository, which makes it easy to accidentally build
an AMI that will break nightly E2E.
In #20296 we added a Go tool chain to the AMI we use for E2E tests, so that we
can build `consul-cni` for tproxy testing. This is intended to be temporary
until `consul-k8s` 1.4.2 is officially released. But the Go cache from building
`consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly
small disks. This causes the Nomad clients to aggressively GC client allocations
that stop, which breaks tests that run batch workloads and then read their logs.
Add the `consul-cni` plugin to the Linux AMI for E2E, and add a test case that
covers the transparent proxy feature. Add test assertions to the Connect tests
for upstream reachability
Ref: https://github.com/hashicorp/nomad/pull/20175
If a E2E cluster is destroyed after a different one has been created, the role
and policy we create in Vault for the cluster will be deleted and Vault-related
tests will fail. Note that before 1.9, we should figure out a way to give HCP
Vault access to the JWKS endpoint and have a different set of policies, but
we'll need to have a role-per-cluster in that case as well.
Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul
token workflow, but they are limited to running single node tests. The E2E
cluster is network isolated, so using our HCP Consul cluster runs into a
problem validating WI tokens because it can't reach the JWKS endpoint. In real
production environments, you'd solve this with a CNAME pointing to a public IP
pointing to a proxy with a real domain name. But that's logisitcally
impractical for our ephemeral nightly cluster.
Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our
Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can
reach each other. This will allow us to update our Consul tests so they can use
Workload Identity, in a separate PR.
Ref: #19698
* drivers/executor: set oom_score_adj for raw_exec
This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.
We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.
Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.
I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.
We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
* drivers/executor: minor cleanup of setting oom adjustment
* e2e: add test for raw_exec oom adjust score
* e2e: set oom score adjust to -999
* cl: add cl
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
Using the latest version of terraform, the lock file is not the same
as when it was generated. Seems like the http module is not needed?
versioned? present? anymore.
* e2e: add tests for using private registry with podman driver
This PR adds e2e tests that stands up a private docker registry
and has a podman tasks run a container from an image in that private
registry.
Tests
- user:password set in task config
- auth_soft_fail works for public images when auth is set in driver
- credentials helper is set in driver auth config
- config auth.json file is set in driver auth config
* packer: use nomad-driver-podman v0.5.0
* e2e: eliminate unnecessary chmod
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
* cr: no need to install nomad twice
* cl: no need to install docker twice
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
the windows docker install script stopped working.
after trying various things to fix the script,
I opted instead for a base image that comes with
docker already installed.
error output during build was:
Installing Docker.
WARNING: Cannot find path 'C:\Users\Administrator\AppData\Local\Temp\DockerMsftProvider\DockerDefault_DockerSearchIndex.json' because it does not exist.
WARNING: Cannot bind argument to parameter 'downloadURL' because it is an empty string.
WARNING: The property 'AbsoluteUri' cannot be found on this object. Verify that the property exists.
WARNING: The property 'RequestMessage' cannot be found on this object. Verify that the property exists.
Failed to install Docker.
Install-Package : No match was found for the specified search criteria and package name 'docker'.
* e2e: cleanup podman installation in jammy image
The original steps were copied over from the bionic image and does a lot
of hoop jumping we do not need anymore.
For the moment just hard-code installing the v0.4.2 version of the driver,
but I may follow up and modify hc-install to support installing @latest
like go itself.
* use releases for hc-install
This changeset provides a matrix test of ACL enforcement across several
dimensions:
* anonymous vs bogus vs valid tokens
* permitted vs not permitted by policy
* request sent to server vs sent to client (and forwarded)
This PR configures
- server nodes with a systemd unit running the agent as the nomad service user
- client nodes with a root owned nomad data directory
Add an Elastic Network Interface (ENI) to each Linux host, on a secondary subnet
we have provisioned in each AZ. Revise security groups as follows:
* Split out client security groups from servers so that we can't have clients
accidentally accessing serf addresses or other unexpected cross-talk.
* Add new security groups for the secondary subnet that only allows
communication within the security group so we can exercise behaviors with
multiple IPs.
This changeset doesn't include any Nomad configuration changes needed to take
advantage of the extra network interface. I'll include those with testing for
PR #16217.
This PR modifies the disconnect helper job to run as root, which is necesary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the sheduler more
time to do its thing).
Skips the other 3 tests, which fail and I cannot yet figure out what is going on.