Tear down the volume-consuming job between subtests, rather than after
all the tests are complete. For good measure, use a different ID for
the volume-consuming job as well.
* lint: require should not be aliased in core_sched_test
* lint: require should not be aliased in volumes_watcher_test
* testing: don't alias state package in core_sched_test
The client configuration options for drivers have been deprecated
since 0.9. We haven't torn them out completely but because they're
deprecated it's been hard to guarantee correct behavior. Remove the
documentation so that users aren't misled about their viability.
This is a test around upgrading from Nomad 0.8, which is long since
no longer supported. The test is slow, flaky, and imports consul/sdk.
Remove this test as it is no longer relevant.
These tend to fail on GHA, where I believe the client is not
starting up fast enough before making requests. So wait on
the client agent first.
```
=== RUN TestDebug_CapturedFiles
operator_debug_test.go:422: serverName: TestDebug_CapturedFiles.global, clientID, 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4
operator_debug_test.go:492:
Error Trace: operator_debug_test.go:492
Error: Should be empty, but was No node(s) with prefix "1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4" found
Failed to retrieve clients, 0 nodes found in list: 1afb00e6-13f2-d8d6-d0f9-745a3fd6e8e4
Test: TestDebug_CapturedFiles
--- FAIL: TestDebug_CapturedFiles (0.08s)
```
* Wait longer for node to go down in disconnected clients test.
The existing helper only waits 10s, but there's a jitter on heartbeats
that we need to account for. Wait for 30s for node to go down to give
us plenty of room
* Port disconnected clients to stdlib-style test
In #12112 and #12113 we solved for the problem of races in releasing
volume claims, but there was a case that we missed. During a node
drain with a controller attach/detach, we can hit a race where we call
controller publish before the unpublish has completed. This is
discouraged in the spec but plugins are supposed to handle it
safely. But if the storage provider's API is slow enough and the
plugin doesn't handle the case safely, the volume can get "locked"
into a state where the provider's API won't detach it cleanly.
Check the claim before making any external controller publish RPC
calls so that Nomad is responsible for the canonical information about
whether a volume is currently claimed.
This has a couple side-effects that also had to get fixed here:
* Changing the order means that the volume will have a past claim
without a valid external node ID because it came from the client, and
this uncovered a separate bug where we didn't assert the external node
ID was valid before returning it. Fallthrough to getting the ID from
the plugins in the state store in this case. We avoided this
originally because of concerns around plugins getting lost during node
drain but now that we've fixed that we may want to revisit it in
future work.
* We should make sure we're handling `FailedPrecondition` cases from
the controller plugin the same way we handle other retryable cases.
* Several tests had to be updated because they were assuming we fail
in a particular order that we're no longer doing.
Resolves#12095 by WONTFIXing it.
This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.
This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.
This PR also includes a bug fix that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.
Concurrent E2E runs can collide when provisioning policies on HCP
Consul and HCP Vault. Namespace these by the test run name, as we do
for most everything else.
Our E2E "framework" has a bunch of features around test discovery and
standing up infra that were never completed or fully used, and we
ended up building out a large test suite that ignored all that in lieu
of Terraform-provided infrastructure for the last couple years.
This changeset is a proposal (and demonstration) for gradually
migrating our E2E tests off the framework code so that developers can
write fairly ordinary golang stdlib testing tests.
This test exercises the behavior of clients that become disconnected
and have their allocations replaced. Future test cases will exercise
the `max_client_disconnect` field on the job spec.
test_checks.sh was removed in 2019 and now just breaks if VERBOSE is
set when running tests via make targets
in GHA, use verbose mode to display what tests are running