Nomad 1.10.0 is removing the legacy Vault token based workflow
which means the legacy e2e compatibility tests will fail and not
work.
The Nomad e2e cluster was using the legacy Vault token based
workflow for initial cluster build. This change migrates to using
the workload identity flow which utilizes authentication methods,
roles, and policies.
The Nomad server network has been modified to allow traffic from
the HCP Vault HVN which is a private network peered into our AWS
account. This is required, so that Vault can pull JWKS
information from the Nomad API without going over the public
internet.
The cluster build will now also configure a Vault KV v2 mount at
a unique indentifier for the e2e cluster. This allows all Nomad
workloads and tests to use this if required.
The vaultsecrets suite has been updated to accommodate the new
changes and extended to test the default workload ID flow for
allocations which use Vault for secrets.
We've been getting a couple of errors from this test on nightly where the
template hasn't rendered by the time we expect it to. I've run some tests
locally and this may be a timing issue introduced by recent code changes to
templates.
Move the start of the timer to after we're guaranteed that we've got a secret
lease TTL started, to eliminate this as a source of flakiness. In my tests this
adds another ~5s to a test that already takes over a minute to run anyways.
fixes VaultSecrets test - it was failing due to a
regex mismatch (`^job` stopped matching when
copywrite headers got prepended to the jobspec).
but RegisterFromJobspec (which had the bug)
was only used in the one spot, so instead this
refactors the whole test to the v3 format
with testing.T and some additional fun stuff
that we can take advantage of with it.
some improvements:
* use a namespace
* use and extend existing test helpers
* add more test helpers
If any E2E test hangs, it'll eventually timeout and panic, causing the
all the remaining tests to fail. External commands should use a short
context whenever possible so we can fail the test quickly and move on
to the next test.
Increase the timeout for vaultsecrets. As the default interval is 0.1s, 10
retries mean it only retries for one second, a very short time for some waiting
scenarios in the test (e.g. starting allocs, etc).
The `$NOMAD_SECRETS_DIR` environment variable is rendered as `/secrets`, which
prior to the recent security patch would unintentionally escape the file
sandbox and get dropped in a directory named `/secrets` where the Nomad client
binary was running. The `VaultSecrets` test was accidentally relying on this
behavior and that causes the test to fail.