nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
James Rasell	d3e077a78e	enos: Modify Windows TF variable to match new 2022 value. (#26067 )	2025-06-17 08:13:36 +01:00
Tim Gross	9ee2582379	upgrade test: remove change mode from Vault workload (#25861 ) During the upgrade test we can trigger a re-render of the Vault secret due to client restart before the allocrunner has marked the task as running, which triggers the change mode on the template and restarts the task. This results in a race where the alloc is still "pending" when we go to check it. We never change the value of this secret in upgrade testing, so paper over this race condition by setting a "noop" change mode.	2025-05-15 10:10:58 -04:00
Tim Gross	6c9f2fdd29	reduce upgrade testing flakes (#25839 ) This changeset includes several adjustments to the upgrade testing scripts to reduce flakes and make problems more understandable: * When a node is drained prior to the 3rd client upgrade, it's entirely possible the 3rd client to be upgraded is the drained node. This results in miscounting the expected number of allocations because many of them will be "complete" (service/batch) or "pending" (system). Leave the system jobs running during drains and only count the running allocations at that point as the expected set. Move the inline script that gets this count into a script file for legibility. * When the last initial workload is deployed, it's possible for it to be briefly still in "pending" when we move to the next step. Poll for a short window for the expected count of jobs. * Make sure that any scripts that are being run right after a server or client is coming back up can handle temporary unavailability gracefully. * Change the debugging output of several scripts to avoid having the debug output run into the error message (Ex. "some allocs are not running" looked like the first allocation running was the missing allocation). * Add some notes to the README about running locally with `-dev` builds and tagging a cluster with your own name. Ref: https://hashicorp.atlassian.net/browse/NMD-162	2025-05-13 08:40:22 -04:00
Juana De La Cuesta	695ba2c159	Fix the verify alloc script (#25837 ) * fix: use the raw option on jq to avoid trating the " like a char * Update verify_allocs.sh	2025-05-12 14:53:28 +02:00
Juana De La Cuesta	cb09696b1c	Nojira upgrade3 (#25817 ) * fix: typo * fix: correct the script for unbound var * fix: typo * fix: typo	2025-05-06 18:21:33 +02:00
Juana De La Cuesta	f68203549b	Fix the verify allocs, missing `echo` (#25816 ) * fix: typo * fix: correct the script for unbound var * fix: typo	2025-05-06 17:16:56 +02:00
Juana De La Cuesta	42d4067d55	Nojira upgrade3 (#25815 ) * fix: typo * fix: correct the script for unbound var	2025-05-06 16:57:44 +02:00
Juana De La Cuesta	da0ea9935d	fix: typo (#25814 )	2025-05-06 16:44:25 +02:00
Juana De La Cuesta	22921418b6	Check for allocs running before checking for IDs after a client upgrade (#25790 ) * fix: wait for all allocs to be running before checking for their IDs after client upgrade * style: linter fix * fix: filter running allocs per client ID when checking for allocs after upgrade	2025-05-06 16:22:45 +02:00
Juanadelacuesta	2f02c90391	func: expand on some logs to get more info in case of a failure	2025-04-15 14:37:57 -04:00
Tim Gross	e4d2fc93cd	upgrade testing: temporarily disable CSI workload (#25589 ) The CSI workload we're using for upgrade testing seems to be flaky to come up. The plugin jobs don't launch in a timely fashion despite several attempts. In order to not block running the rest of the upgrade testing, let's disable this workload temporarily. We'll fix this in NET-12430. Ref: https://hashicorp.atlassian.net/browse/NET-12430	2025-04-03 08:53:20 -04:00
Juanadelacuesta	332e859da0	Typo: Wrong function name	2025-03-26 10:06:40 +01:00
Juana De La Cuesta	2bd5dc5970	Merge pull request #25479 from hashicorp/NET-11546-enos-same-allocs Add a test for re attaching allocs after client restart	2025-03-24 16:03:57 +01:00
Juanadelacuesta	c3258ab0f6	fix: reuse client_id when checking for running allocs	2025-03-24 15:11:33 +01:00
Juanadelacuesta	ce261be358	style: linter fix	2025-03-21 15:26:25 +01:00
Juanadelacuesta	b1dbc14499	func: make the csi_plugin health timeout a little longer help the test run better locally	2025-03-21 15:16:00 +01:00
Juanadelacuesta	82fcc62c46	func: add verification for allocs correctly reattaching after client restarts	2025-03-21 15:14:00 +01:00
Juanadelacuesta	e0d3be81da	fix: declare license inputs as sensitive variables	2025-03-19 19:53:32 +01:00
Juanadelacuesta	cd1640e59a	style: linter fix	2025-03-17 16:19:29 +01:00
Juana De La Cuesta	9b9d16421e	Merge branch 'main' into NET-11546-enos-drain	2025-03-17 16:14:18 +01:00
Juana De La Cuesta	9d5359886e	Update drain.sh	2025-03-17 14:37:23 +01:00
Juana De La Cuesta	9574a0d319	Update enos/modules/drain_nodes/scripts/drain.sh Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-03-17 14:36:57 +01:00
Juanadelacuesta	134441b4a7	func: add .gitignore entry to avoid commiting the rendered vault job spec	2025-03-17 14:29:35 +01:00
Juanadelacuesta	0239e0e915	fix: add missing command to enable drain eligibility	2025-03-17 13:56:28 +01:00
Juanadelacuesta	cfd4ee1756	fix: add missing variables for drain module	2025-03-14 17:57:26 +01:00
Juanadelacuesta	fba2efa728	func: add a step to drain a node as part of the upgrade process	2025-03-14 17:43:36 +01:00
Juanadelacuesta	4b0903789e	func: add check script for vault workload	2025-03-14 17:03:35 +01:00
Juanadelacuesta	4c1ba45d48	func: add workload to test vault workload identity	2025-03-13 17:55:59 +01:00
Tim Gross	8cf34bde62	upgrade testing: allow configurable artifactory repo (#25350 ) Prerelease builds are in a different Artifactory repository than release builds. Make this a variable option so we can test prerelease builds in the nightly/weekly runs.	2025-03-13 10:32:02 -04:00
Juana De La Cuesta	ad7dc7a4eb	Merge pull request #25348 from hashicorp/NET-11546-enos-linux Add instructions to add new workloads to the tests.	2025-03-13 10:38:47 +01:00
Juanadelacuesta	ebeb3047c8	docs: add note about workloads life expectancy	2025-03-12 16:51:03 +01:00
Juana De La Cuesta	3de2a6b1d6	Update README.md	2025-03-11 17:51:56 +01:00
Juana De La Cuesta	b1ea04a4d1	Update README.md	2025-03-11 17:50:26 +01:00
Juana De La Cuesta	859f257d32	Update README.md	2025-03-11 17:48:45 +01:00
Juanadelacuesta	08f386e8e5	docs: Add section of readme to add workloads	2025-03-11 17:48:14 +01:00
Tim Gross	61bbff9c24	upgrade testing: Variables, Workload Identity, and Task API (#25229 ) Add an upgrade test workload for that continuously writes to a Nomad Variable. In order to run this workload, we'll need to deploy a Workload-Associated ACL policy. So this extends the `run_workloads` module to allow for a "pre script" to be run before a given job is deployed. We can use that as a model for other test workloads. Ref: https://hashicorp.atlassian.net/browse/NET-12217	2025-03-11 08:48:40 -04:00
Tim Gross	5cc1b4e606	upgrade tests: add transparent proxy workload (#25176 ) Add an upgrade test workload for Consul service mesh with transparent proxy. Note this breaks from the "countdash" demo. The dashboard application only can verify the backend is up by making a websocket connection, which we can't do as a health check, and the health check it exposes for that purpose only passes once the websocket connection has been made. So replace the dashboard with a minimal nginx reverse proxy to the count-api instead. Ref: https://hashicorp.atlassian.net/browse/NET-12217	2025-03-07 15:25:26 -05:00
Tim Gross	f528022e3a	upgrade testing: add missing dependency during client upgrades (#25306 ) The check to read back node metadata depends on a resource that waits for the Nomad API, but that resource doesn't wait for the metadata to be written in the first place (and the client subsequently upgraded). Add this dependency so that we're reading back the node metadata as the last step. Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13690355150/job/38282457406	2025-03-07 09:06:04 -05:00
Tim Gross	694b10d71c	upgrade testing: commit missing volume specification (#25305 ) In #25285 we converted the CSI workload for upgrade testing to use a self-hosted NFS. But the volume spec name got changed to `volume.hcl` in the process, which is in our `.gitignore` file for the repo. We missed this during testing because the file existed locally, but it fails in nightly runs. Ref: https://github.com/hashicorp/nomad/pull/25285 Ref: https://github.com/hashicorp/nomad-e2e/actions/runs/13703979647/job/38324786351	2025-03-06 14:36:34 -05:00
Tim Gross	916fe2c7fa	upgrade testing: rework CSI test to use self-contained workload (#25285 ) Getting the CSI test to work with AWS EFS or EBS has proven to be awkward because we're having to deal with external APIs with their own consistency guarantees, as well as challenges around teardown. Make the CSI test entirely self-contained by using a userland NFS server and the rocketduck CSI plugin. Ref: https://hashicorp.atlassian.net/browse/NET-12217 Ref: https://gitlab.com/rocketduck/csi-plugin-nfs	2025-03-05 11:48:19 -05:00
Tim Gross	7a051991bd	upgrade testing: temporarily disable CSI test (#25283 ) The CSI workload is failing and creating complications for teardown, so I'm reworking it. But this work is taking a while to finish, so while that's in progress let's disable the CSI workload so that we're running the upgrade tests all the way through to the end. I expect to be able to revert this in the next couple days.	2025-03-04 11:21:45 -05:00
Tim Gross	9cc0e2eae0	upgrade testing: make cluster name prefix a variable (#25281 ) During initial development of upgrade testing, we had a hard-coded prefix to distinguish between clusters created for this vs those created by GHA runners. Update the prefix to be a variable so that developers can add their own prefix during test workload development.	2025-03-04 11:11:02 -05:00
Juana De La Cuesta	2dadf9fe6c	Improve stability (#25244 ) * func: add dependencies to avoid race conditions and move the update to each client to the main upgrade scenario * Update enos/enos-scenario-upgrade.hcl Co-authored-by: Tim Gross <tgross@hashicorp.com> * Update enos/enos-scenario-upgrade.hcl Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-03-04 16:23:07 +01:00
Tim Gross	4a62d1b75c	upgrade tests: add CSI workload (#25223 ) Add an upgrade test workload for CSI with the AWS EFS plugin. In order to validate this workload, we'll need to deploy the plugin job and then register a volume with it. So this extends the `run_workloads` module to allow for "pre scripts" and "post scripts" to be run before and after a given job has been deployed. We can use that as a model for other test workloads. Ref: https://hashicorp.atlassian.net/browse/NET-12217	2025-02-27 15:16:04 -05:00
Tim Gross	6ae1444cf4	upgrade testing: debugging assistance (#25232 ) Enos buries the Terraform output from provisioning. Add a shell script to load the environment from provisioning for debugging Nomad during development of upgrade tests.	2025-02-27 08:35:45 -05:00
Juana De La Cuesta	461d4268e2	func: add python servers to raw exec workloads (#25230 )	2025-02-26 18:05:46 +01:00
Juana De La Cuesta	b13132043b	Add new workloads (#25106 ) * func: Add more workloads * Update jobs.sh * Update versions.sh * style: format * Update enos/modules/test_cluster_health/scripts/allocs.sh Co-authored-by: Tim Gross <tgross@hashicorp.com> * docs: improve outputs descriptions * func: change docker workloads to be redis boxes and add healthchecks * func: register the services on consul * style: format --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-02-26 17:02:27 +01:00
Tim Gross	8c95f5f17e	upgrade testing: make sure we capture last error if not exiting (#25186 ) While testing #25172 I found a few spots where #25152 wasn't capturing the errors from transient failures correctly or exiting early instead of retrying. Ref: https://hashicorp.atlassian.net/browse/NET-11546	2025-02-24 09:37:17 -05:00
Juana De La Cuesta	0529c0247d	Only take one snapshot when upgrading servers (#25187 ) * func: add possibility of having different binaries for server and clients * style: rename binaries modules * func: remove the check for last configuration log, and only take one snapshot when upgrading the servers * Update enos/modules/upgrade_servers/main.tf Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-02-24 15:06:16 +01:00
Juana De La Cuesta	4a75d2de63	Adjust the servers to be always linux instances (#25172 ) * func: add possibility of having different binaries for server and clients * style: rename binaries modules * docs: update comments * fix: correct the token input variable for fetch binaries	2025-02-24 13:09:57 +01:00

1 2

61 Commits