nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-06 18:35:44 +03:00

Author	SHA1	Message	Date
Tim Gross	ef366ee166	E2E: update .gitignore files to avoid committing runtime files (#24855 ) In #24694 we did a major refactoring of the E2E Terraform configuration. After deploying a cluster this morning, I noticed a few moved/removed files were not reflected in the .gitignore files. This changeset updates the .gitignore to have no unstaged files after applying.	2025-01-14 12:16:01 -05:00
Juana De La Cuesta	b29a3736a4	Update e2e infra provision to expect providers (#24694 ) * func: move infra provisioning to a module and remove providers * func: update paths * func: update more paths * func: update path inside bootstrap script * style: remove debug prints on bootstrap scripts * Delete e2e/terraform/csi/input/volume-efs.hcl * fix: update keys path to use module path instead of root * fix: add missing headers * fix: update keys directory inside provision-nomad * style: format hcl files * Update compute.tf * Update e2e/terraform/main.tf Co-authored-by: Tim Gross <tgross@hashicorp.com> * Update e2e/terraform/provision-infra/compute.tf Co-authored-by: Tim Gross <tgross@hashicorp.com> * fix: update more paths * fix: fmt hcl files * func: final paths revision for running e2e locally * fix: make path of certs relative to module for the bootstrap * func: final paths revision for running e2e locally * Update network.tf * fix: fix typo and add success message * fix: remove the test name from token to avoid long names and use name for vol to avoid collisions * func: unify the uploads folder * func: make the uploads file one per cluster * func: Add outputs with all data necessary to connect to the cluster * fix: make nomad token a sensitive output * Update bootstrap-nomad.sh --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-01-13 15:59:40 +01:00
Juana De La Cuesta	2eb2b6c739	fix: update the dnsconfig script to handle multiple interfaces (#24800 )	2025-01-07 21:12:18 +01:00
Tim Gross	abeae5c47b	E2E: use a variable for region (#24693 ) In #24644 we set the region to "e2e" but forgot to set up the TLS certificate names appropriately. Swap the region out for a variable instead.	2024-12-17 10:28:22 -05:00
Tim Gross	75b0202f7f	api: don't copy previously parsed URL when setting new address (#24644 ) In #16872 we added support for unix domain sockets, but this required mutating the `Config` when parsing the address so as to remove the port number. In #23785 we fixed a bug where if the configuration was used across multiple clients that mutation would happen multiple times and the address would be incorrectly parsed. When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have line-of-sight to the client, we attempt to make an HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private `url` field and that causes the URL parsing to be skipped for the new client. This results in the region always being set to the string literal `"global"` (because of mTLS handling code introduced all the way back in `4d3b75d867`), unless the user has set the region specifically. This fails with an error "no path to region" when the cluster's region isn't `"global"` and requests are sent to a non-leader. Arguably the "right" way of fixing this would be for `ClientConfig` not to change the API client's region to `"global"` in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed. Fixes: https://github.com/hashicorp/nomad/issues/24635 Fixes: https://github.com/hashicorp/nomad/issues/24609 Ref: https://github.com/hashicorp/nomad/pull/16872 Ref: https://github.com/hashicorp/nomad/pull/23785 Ref: `4d3b75d867`	2024-12-16 11:05:29 -05:00
Juana De La Cuesta	526c6375ad	Make paths in e2e/terraform/ directory relative to the module (#24664 ) * func: make paths relative * func: make paths relative to the module inside the e2e terraform folder * fix: add license files to gitignore * func: move /etc and update all paths * Uncomment forgotten code * fix: update the path to the tls certificates to be local to the instance	2024-12-13 17:33:59 +01:00
Juana De La Cuesta	a9a0f71213	Remove sockaddr and use native tools (#24665 ) * func: remove sockaddr and use native tools * Update setup.sh	2024-12-13 17:24:53 +01:00
Juana De La Cuesta	270b4f97a6	Update some details of the terraform readme file for e2e provisioning (#24451 ) * docs: update instructions to provision e2e cluster * Update e2e/terraform/README.md Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com> * Update e2e/terraform/terraform.tfvars Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com> * Update e2e/terraform/README.md Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com> --------- Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>	2024-11-18 13:36:51 +01:00
Piotr Kazmierczak	a22e56390e	e2e: fix failing tests due to docker plugin settings (#24234 )	2024-10-17 11:12:59 +02:00
Piotr Kazmierczak	f9cbaaf6c7	docker: fix a bug where auth for private registries wasn't parsed correctly (#24215 ) In #23966 we introduced an official Docker client and did not notice that, in contrast to our previous 3rd party client, the official SDK PullOptions object expects a base64 encoded JSON with username and password, instead of a username/password pair.	2024-10-16 22:04:54 +02:00
Tim Gross	d261d58ea2	build: update hc-install to current (#24199 ) Installing Vault and Consul from releases.hashicorp.com via `hc-install` has been failing intermittently. Update the `hc-install` binaries to be current and add one retry to downloads for our compat tests so that we can get builds more reliably green while the underlying issue is being debugged.	2024-10-15 10:07:58 -04:00
Daniel Bennett	10d3f1749b	e2e: test all cni config formats (#23650 )	2024-07-22 10:17:03 -05:00
Tim Gross	a29f9b6fc0	keyring: E2E testing for KMS/rotation (#23601 ) In #23580 we're implementing support for encrypting Nomad's key material with external KMS providers or Vault Transit. This changeset breaks out the E2E infrastructure and testing from that PR to keep the review manageable. Ref: https://hashicorp.atlassian.net/browse/NET-10334 Ref: https://github.com/hashicorp/nomad/issues/14852 Ref: https://github.com/hashicorp/nomad/pull/23580	2024-07-19 13:49:48 -04:00
Daniel Bennett	de10efa3fa	e2e: hc-install consul-cni (#23612 ) now that the version with tproxy CNI_ARGS is on releases.hashicorp.com	2024-07-17 14:26:40 -05:00
Daniel Bennett	afbd283c1b	e2e: skip missing windows ami if windows clients=0 (#23610 ) and tweak Makefile to generate a custom.tfvars instead of specifying vars separately via CLI. hoping this makes it a little more obvious if there is no consul/nomad license.	2024-07-17 12:45:41 -05:00
Martina Santangelo	bc81c85ec7	e2e: cni args tests (#23597 ) Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2024-07-15 17:08:50 -04:00
Deniz Onur Duzgun	c82dd76a1b	security: update tls cipher suites (#23551 )	2024-07-11 14:01:45 -04:00
Seth Hoenig	2054e87158	e2e: add tests for exec2 task driver (#22406 ) * e2e: add tests for exec2 task driver * e2e: use envoy 1.29.4 because consul * e2e: add a bridge networking http test for exec driver * e2e: split up http test so curl always starts after the server	2024-05-31 09:22:39 -05:00
Seth Hoenig	9fb2b10ab6	e2e: no longer need consul terraform module (#22396 )	2024-05-28 08:04:03 -05:00
Tim Gross	91d422ec21	E2E: document how the AMIs are tagged and how those tags are used (#22237 ) The process by which we tag AMIs with the commit SHA of the Packer directory isn't documented in this repository, which makes it easy to accidentally build an AMI that will break nightly E2E.	2024-05-24 11:11:00 -05:00
Tim Gross	d40e23f939	E2E: clean up go mod cache after building `consul-cni` (#20378 ) In #20296 we added a Go tool chain to the AMI we use for E2E tests, so that we can build `consul-cni` for tproxy testing. This is intended to be temporary until `consul-k8s` 1.4.2 is officially released. But the Go cache from building `consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly small disks. This causes the Nomad clients to aggressively GC client allocations that stop, which breaks tests that run batch workloads and then read their logs.	2024-04-12 11:52:46 -04:00
Tim Gross	548adb0fd4	tproxy: E2E tests (#20296 ) Add the `consul-cni` plugin to the Linux AMI for E2E, and add a test case that covers the transparent proxy feature. Add test assertions to the Connect tests for upstream reachability Ref: https://github.com/hashicorp/nomad/pull/20175	2024-04-05 14:23:26 -04:00
Tim Gross	4ce728afbd	E2E: make `vault.create_from_role` unique per cluster (#20267 ) If an E2E cluster is destroyed after a different one has been created, the role and policy we create in Vault for the cluster will be deleted and Vault-related tests will fail. Note that before 1.9, we should figure out a way to give HCP Vault access to the JWKS endpoint and have a different set of policies, but we'll need to have a role-per-cluster in that case as well. Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)	2024-04-03 08:45:01 -04:00
Tim Gross	cf25cf5cd5	E2E: use a self-hosted Consul for easier WI testing (#20256 ) Our `consulcompat` tests exercise both the Workload Identity and legacy Consul token workflow, but they are limited to running single node tests. The E2E cluster is network isolated, so using our HCP Consul cluster runs into a problem validating WI tokens because it can't reach the JWKS endpoint. In real production environments, you'd solve this with a CNAME pointing to a public IP pointing to a proxy with a real domain name. But that's logistically impractical for our ephemeral nightly cluster. Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can reach each other. This will allow us to update our Consul tests so they can use Workload Identity, in a separate PR. Ref: #19698	2024-04-02 15:24:51 -04:00
Piotr Kazmierczak	8226a85263	e2e: remove deprecated template_file dependency for tf (#19313 ) This also allows running tf for our e2e suite locally on darwin.	2024-01-15 18:42:28 +01:00
Piotr Kazmierczak	858a805d7d	e2e: add a note about provisioning the infrastructure on macOS/Apple Silicon (#19727 )	2024-01-12 14:09:50 +01:00
Matt Robenolt	656bb5cafa	drivers/executor: set oom_score_adj for raw_exec (#19515 ) * drivers/executor: set oom_score_adj for raw_exec This might not be wholly true since I don't know all configurations of Nomad, but in our use cases, we run some of our tasks as `raw_exec` for reasons. We observed that our tasks were running with `oom_score_adj = -1000`, which prevents them from being OOM'd. This value is being inherited from the nomad agent parent process, as configured by systemd. Similar to #10698, we also were shocked to have this value inherited down to every child process and believe that we should also set this value to 0 explicitly. I have no idea if there are other paths that might leverage this or other ways that `raw_exec` can manifest, but this is how I was able to observe and fix in one of our configurations. We have been running in production our tasks wrapped in a script that does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue. * drivers/executor: minor cleanup of setting oom adjustment * e2e: add test for raw_exec oom adjust score * e2e: set oom score adjust to -999 * cl: add cl --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2024-01-02 13:35:09 -06:00
Daniel Bennett	c7d01705f5	e2e: push nomad token to servers (#19312 ) so humans with root shell access can use it to debug not ideal security, but this is a short-lived test cluster	2023-12-05 08:54:57 -06:00
Daniel Bennett	4ec9343447	e2e: use tf variable defaults (#19108 )	2023-11-16 14:50:11 -06:00
Seth Hoenig	f211a0ab7c	e2e: update terraform lock file for 1.6.3 (#19049 ) Using the latest version of terraform, the lock file is not the same as when it was generated. Seems like the http module is not needed? versioned? present? anymore.	2023-11-09 10:49:49 -06:00
Seth Hoenig	402540f7fb	e2e: bump packer build instances because faster (#19046 )	2023-11-09 09:33:30 -06:00
Seth Hoenig	a28e5b6965	e2e: refactor metrics test to use NSD and WI (#19022 ) * e2e: remove old metrics suite * e2e: install stress on e2e jammy image * e2e: overhaul metrics test to use nomad service discovery, workload identity * e2e: format metrics hcl files and copywrite * e2e: undo tf lock file * e2e: undo reg auth file perms * e2e: format cpustress.hcl	2023-11-09 08:21:16 -06:00
Seth Hoenig	63da22063b	e2e: update pledge driver to 0.3.0 (#19020 )	2023-11-08 06:58:59 -06:00
Seth Hoenig	a2f7ab2645	e2e disable windows (#19012 ) * e2e: disable windows client * e2e: disable windows artifact test	2023-11-07 09:34:18 -06:00
Daniel Bennett	a51d46c65c	e2e: packer windows from "ECS_Optimized" image (#18453 ) "Containers" AMIs evaporated at some point... https://aws.amazon.com/marketplace/pp/prodview-yfve3zjgfjtug > This version has been removed and is no longer > available to new customers.	2023-09-11 12:26:32 -05:00
hashicorp-copywrite[bot]	a9d61ea3fd	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:29 -05:00
Seth Hoenig	8d28946993	e2e podman private registry (#17642 ) * e2e: add tests for using private registry with podman driver This PR adds e2e tests that stand up a private docker registry and have a podman task run a container from an image in that private registry. Tests - user:password set in task config - auth_soft_fail works for public images when auth is set in driver - credentials helper is set in driver auth config - config auth.json file is set in driver auth config * packer: use nomad-driver-podman v0.5.0 * e2e: eliminate unnecessary chmod Co-authored-by: Daniel Bennett <dbennett@hashicorp.com> * cr: no need to install nomad twice * cl: no need to install docker twice --------- Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2023-07-19 15:59:36 -05:00
Seth Hoenig	159bf51120	e2e: add some e2e tests for pledge task driver (#17909 ) * e2e: setup nomad for pledge driver * e2e: add some e2e tests for pledge task driver	2023-07-12 11:56:08 -05:00
Daniel Bennett	6bd509869b	e2e: use DNS instead of HTTP to get my_public_ipv4 (#17759 )	2023-06-28 13:11:57 -05:00
Daniel Bennett	748aea1c61	e2e: fix windows client docker (#17572 ) the windows docker install script stopped working. after trying various things to fix the script, I opted instead for a base image that comes with docker already installed. error output during build was: Installing Docker. WARNING: Cannot find path 'C:\Users\Administrator\AppData\Local\Temp\DockerMsftProvider\DockerDefault_DockerSearchIndex.json' because it does not exist. WARNING: Cannot bind argument to parameter 'downloadURL' because it is an empty string. WARNING: The property 'AbsoluteUri' cannot be found on this object. Verify that the property exists. WARNING: The property 'RequestMessage' cannot be found on this object. Verify that the property exists. Failed to install Docker. Install-Package : No match was found for the specified search criteria and package name 'docker'.	2023-06-20 10:17:16 -05:00
Seth Hoenig	6975409386	e2e: cleanup podman installation in jammy image (#17558 ) * e2e: cleanup podman installation in jammy image The original steps were copied over from the bionic image and do a lot of hoop jumping we do not need anymore. For the moment just hard-code installing the v0.4.2 version of the driver, but I may follow up and modify hc-install to support installing @latest like go itself. * use releases for hc-install	2023-06-15 18:17:31 -05:00
Seth Hoenig	6b2834559f	e2e: purge bionic packer image scripts (#17559 ) Bionic is dead, long live the Jammy!	2023-06-15 15:15:01 -05:00
Shawn	9898e85d09	fix: typo (#16873 )	2023-04-12 16:18:13 -04:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	6cb69e5609	E2E: test enforcement of ACL system (#16796 ) This changeset provides a matrix test of ACL enforcement across several dimensions: * anonymous vs bogus vs valid tokens * permitted vs not permitted by policy * request sent to server vs sent to client (and forwarded)	2023-04-06 09:11:20 -04:00
Michael Schurter	282e3bcfcc	Enable ACLs on E2E test clients (#16530 ) * e2e: uniformly enable acls across all agents * docs: clarify that acls should be set everywhere	2023-03-16 14:22:41 -07:00
Seth Hoenig	40ab325594	e2e: setup nomad permissions correctly (client vs. server) (#16399 ) This PR configures - server nodes with a systemd unit running the agent as the nomad service user - client nodes with a root owned nomad data directory	2023-03-08 14:41:08 -06:00
Seth Hoenig	24af468b67	e2e: fix permissions on nomad data directory (#16376 ) This PR updates the provisioning step where we create /opt/nomad/data, such that it is with 0700 permissions in line with our security guidance.	2023-03-07 14:41:54 -06:00
Tim Gross	517ad9c5bf	E2E: add multi-home networking to test infrastructure (#16218 ) Add an Elastic Network Interface (ENI) to each Linux host, on a secondary subnet we have provisioned in each AZ. Revise security groups as follows: * Split out client security groups from servers so that we can't have clients accidentally accessing serf addresses or other unexpected cross-talk. * Add new security groups for the secondary subnet that only allows communication within the security group so we can exercise behaviors with multiple IPs. This changeset doesn't include any Nomad configuration changes needed to take advantage of the extra network interface. I'll include those with testing for PR #16217.	2023-02-20 10:08:28 +01:00
Seth Hoenig	6e4410a9b1	e2e: fix 1 of 4 client disconnect tests (#15357 ) This PR modifies the disconnect helper job to run as root, which is necessary for manipulating iptables as it does. Also re-organizes the final test logic to wait for client re-connect before looking for the replacement (3rd) allocation in case that client was needed to run the alloc (also giving the scheduler more time to do its thing). Skips the other 3 tests, which fail and I cannot yet figure out what is going on.	2022-11-22 08:51:53 -06:00

212 Commits