Migrate our E2E tests for Connect off the old framework in preparation for
writing E2E tests for transparent proxy and the updated workload identity
workflow. Mark the tests that cover the legacy Consul token submitted workflow.
Ref: https://github.com/hashicorp/nomad/pull/20175
If an E2E cluster is destroyed after a different one has been created, the role
and policy we create in Vault for the cluster will be deleted and Vault-related
tests will fail. Note that before 1.9, we should figure out a way to give HCP
Vault access to the JWKS endpoint and have a different set of policies, but
we'll need to have a role-per-cluster in that case as well.
Fixes: https://github.com/hashicorp/nomad-e2e/issues/138 (internal)
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul
token workflow, but they are limited to running single node tests. The E2E
cluster is network isolated, so using our HCP Consul cluster runs into a
problem validating WI tokens because it can't reach the JWKS endpoint. In real
production environments, you'd solve this with a CNAME pointing to a public IP
pointing to a proxy with a real domain name. But that's logistically
impractical for our ephemeral nightly cluster.
Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our
Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can
reach each other. This will allow us to update our Consul tests so they can use
Workload Identity, in a separate PR.
Ref: #19698
We've been getting a couple of errors from this test on nightly where the
template hasn't rendered by the time we expect it to. I've run some tests
locally and this may be a timing issue introduced by recent code changes to
templates.
Move the start of the timer to after we're guaranteed that we've got a secret
lease TTL started, to eliminate this as a source of flakiness. In my tests this
adds another ~5s to a test that already takes over a minute to run anyway.
* e2e: move rawexec oversub tests into oversubscription e2e test suite
This PR moves two tests for raw_exec and memory oversubscription into
the oversubscription test suite, which has the necessary plumbing to
activate and restore the oversubscription configuration of the scheduler
during the test.
* cr: rename files for better readability
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit
This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.
This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.
* cl: add cl
* drivers/executor: set oom_score_adj for raw_exec
This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.
We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.
Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.
I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.
In production, we have been running our tasks wrapped in a script that
does `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
* drivers/executor: minor cleanup of setting oom adjustment
* e2e: add test for raw_exec oom adjust score
* e2e: set oom score adjust to -999
* cl: add cl
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* e2e/connect: adds test for namespace policies
* consul: use token namespace when fetching policies
* changelog
* fixup! e2e/connect: adds test for namespace policies
The EBS snapshot operation can take a long time to complete. Recent runs have
shown we sometimes get up to the 10s timeout on the context we're giving the CLI
command. Extend this so that we're not getting spurious timeouts.
Fixes: https://github.com/hashicorp/nomad/issues/19118
* cleanup consul tokens by accessor id
rather than secret id, which has been failing for some time with:
> 404 (Cannot find token to delete)
* expect subset of consul namespaces
the consul test cluster may have namespaces from other unrelated tests
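A subset assertion of this kind could look like the sketch below. `missingFrom` is a hypothetical helper name, not the one used in the test suite.

```go
package main

// missingFrom returns the entries of want that do not appear in got.
// An empty result means want is a subset of got, so extra namespaces
// left behind by unrelated tests do not fail the assertion.
func missingFrom(got, want []string) []string {
	seen := make(map[string]struct{}, len(got))
	for _, g := range got {
		seen[g] = struct{}{}
	}
	var missing []string
	for _, w := range want {
		if _, ok := seen[w]; !ok {
			missing = append(missing, w)
		}
	}
	return missing
}
```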
This will dump much of the interesting parts of cluster state, including
available nodes and their status, existing allocations and their status,
and existing evaluations and their status.
When porting the `ConsulTemplate` test, I made a last-minute refactor to the
assertions for waiting on files, and accidentally inverted the test assertion in
the process.
Also, when running `jobs3.Submit` you need to include the `Namespace` option so
that the cleanup function that gets returned deletes the job from the correct
namespace. This was causing the namespace cleanup to fail because the job
deletion had failed.
and error more verbosely if it fails
also, add extra information to a failed evaluation
for more error visibility in other tests
---------
Co-authored-by: Juanadelacuesta <juanita.delacuestamorales@hashicorp.com>
When configuring Consul for multi-namespace support, the JWT auth method
needs to specify namespace rules. This attribute is set to `nil` in CE
but is used in Nomad ENT.
The `TestTemplateUpdateTriggers` test is flaky because of what turned out to be
incompatibility between the Consul agent on the E2E cluster and the HCP Consul
server we were running but hadn't upgraded in a while. Upgrading the HCP Consul
server seems to have fixed the tests, but while I'm in here I've updated this
test suite:
* Port all the consul template test suite off of the old framework, and upgrade to
using e2e "v3" where feasible.
* Clean up some of the assertions in the update triggers test to make the
purpose of the test more clear.
* Remove unnecessary default fields from the job specs.
Closes: #19075
fixes VaultSecrets test - it was failing due to a
regex mismatch (`^job` stopped matching when
copywrite headers got prepended to the jobspec).
but RegisterFromJobspec (which had the bug)
was only used in the one spot, so instead this
refactors the whole test to the v3 format
with testing.T and some additional fun stuff
that we can take advantage of with it.
some improvements:
* use a namespace
* use and extend existing test helpers
* add more test helpers
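The regex mismatch described above can be reproduced with a small sketch. The header text and jobspec are made up for illustration, and whether the original pattern used any flags is an assumption; without the multi-line flag, `^` in Go's `regexp` only matches at the start of the input.

```go
package main

import "regexp"

var (
	// startOfInput anchors at the very beginning of the text only.
	startOfInput = regexp.MustCompile(`^job`)
	// startOfLine uses the (?m) flag so ^ matches after any newline too.
	startOfLine = regexp.MustCompile(`(?m)^job`)
)

// bareJobspec is an illustrative jobspec with no header.
const bareJobspec = "job \"secrets\" {\n}\n"

// headedJobspec is the same jobspec with an illustrative header prepended.
const headedJobspec = "# Copyright (c) Example, Inc.\n\n" + bareJobspec
```

Here `startOfInput` matches `bareJobspec` but not `headedJobspec`, while `startOfLine` matches both.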
This simplifies the default setup of WI-based authentication to Consul for
Nomad workloads by using a single auth method with two binding rules.
Users can still specify separate auth methods for services and tasks.
The ACL role test asserts that the role has various permissions by listing jobs
in namespaces. It never creates jobs, because we can make all the assertions we
need by checking the error. But the test included an assertion that the
namespace was empty. Usually this will be the case, but if the previous test
case has not completed its GC (which is async), then it's possible a stopped job
will be in the namespace. Because this assertion is irrelevant for this test,
remove it.
In #18664 we changed how null values worked with dynamic node metadata so that
they were no longer returned if there wasn't also a static value for that
key. The test assertion in E2E was not updated to match the new behavior.
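The rule described above can be written down as a small sketch. This encodes only the sentence in this message, not the actual implementation from #18664; `visibleDynamic` and the use of a nil pointer to represent a null value are assumptions made for illustration.

```go
package main

// visibleDynamic returns the dynamic metadata entries that would be
// reported. A nil (null) dynamic value is kept only when a static value
// exists for the same key. Hypothetical helper for illustration.
func visibleDynamic(static map[string]string, dynamic map[string]*string) map[string]*string {
	out := make(map[string]*string, len(dynamic))
	for k, v := range dynamic {
		if v == nil {
			if _, ok := static[k]; !ok {
				continue // null with no static value: not returned
			}
		}
		out[k] = v
	}
	return out
}
```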
Fixes: #19112
In Nomad 1.5 we started masking the specific error returned from the
authentication method and returned the "permission denied" error instead. Update
the E2E test that covers token expiration to ensure we're asserting the correct
error here.
Fixes: https://github.com/hashicorp/nomad/issues/16803
The E2E test suite for rescheduling had a few bugs:
* Using the command line to stop a job with a failing deployment returns a non-zero exit
code, which would cause an otherwise passing test to fail.
* Two of the input jobs were actually invalid but were only correctly detected
as such because of #17342
This changeset also updates the whole test suite to move it off the v1
"framework". A few test assertions are also de-flaked.
Fixes: #19076
We want to run the Vault compatibility E2E test with Vault Enterprise binaries
and use Vault namespaces. Refactor the `vaultcompat` test so as to parameterize
most of the test setup logic with the namespace, and add the appropriate build
tag for the CE version of the test.
We want to run the Consul compatibility E2E test with Consul Enterprise binaries
and use Consul namespaces. Refactor the `consulcompat` test so as to
parameterize most of the test setup logic with the namespace, and add the
appropriate build tag for the CE version of the test.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1305
Just because an alloc is running does not mean nomad is ready to serve
task logs. In a test case where you immediately read logs after starting
a task, it could be that nomad responds with "no logs found" when you
try to read logs, in which case you just need to wait longer. Do so in
the v3 TaskLogs helper function.