Commit Graph

429 Commits

Author SHA1 Message Date
Tim Gross
7c7569674c CSI: unique volume per allocation
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.

Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
2021-03-18 15:35:11 -04:00
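The suffix rule described in commit 7c7569674c can be sketched roughly as follows. This is a minimal illustration, not Nomad's actual code: the helper name `volumeSourceID` and its parameters are hypothetical, and only the `[0]`-style suffix format is taken from the commit message.

```go
package main

import "fmt"

// volumeSourceID is a hypothetical helper (not part of Nomad's API).
// When perAlloc is true, it appends the allocation index suffix
// (ex. "[0]") to the volume source ID; otherwise it returns the
// exact source ID unchanged.
func volumeSourceID(source string, perAlloc bool, allocIndex int) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}

func main() {
	fmt.Println(volumeSourceID("data", false, 2)) // exact source ID
	fmt.Println(volumeSourceID("data", true, 2))  // source ID plus index suffix
}
```

Under this assumption, both the scheduler's feasibility check and the client's volume claim would apply the same rule, so they agree on which volume an allocation uses.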
Charlie Voiselle
d914990e5f Fixup uses of sanity (#10187)
* Fixup uses of `sanity`
* Remove unnecessary comments.

These checks are better explained by earlier comments about
the context of the test. Per @tgross, moved the tests together
to better reinforce the overall shared context.

* Update nomad/fsm_test.go
2021-03-16 18:05:08 -04:00
Tim Gross
03a1192c12 docs: swap master for main in Nomad repo 2021-03-08 14:26:31 -05:00
Mahmood Ali
c2ab63adf9 Merge pull request #9935 from hashicorp/e2e-segment-e2e-clusters
e2e: segment e2e clusters
2021-03-01 09:23:21 -05:00
Drew Bailey
6fc62aa235 Merge pull request #9955 from hashicorp/on-update-services
Service and Check on_update configuration option (readiness checks)
2021-02-24 10:11:05 -05:00
Seth Hoenig
2a35c35a6e dist: place systemd unit options correctly
This PR places StartLimitIntervalSec and StartLimitBurst in the
Unit section of systemd unit files, rather than the Service section.

https://www.freedesktop.org/software/systemd/man/systemd.unit.html

Fixes #10065
2021-02-22 19:23:00 -06:00
Drew Bailey
2f99d6495d E2e/fix periodic (#10047)
* fix periodic

* update periodic to not use template

`nomad job inspect` no longer returns an apiliststub, so the fields required to query the job summary are no longer there; parse CLI output instead

* rm tmp makefile entry

* fix typo

* revert makefile change
2021-02-18 12:21:53 -05:00
James Rasell
7cb48abb5a e2e: account for race condition in periodic dispatch test. 2021-02-11 11:08:48 +01:00
Seth Hoenig
c7b5ae65fd Merge pull request #9990 from hashicorp/f-nsiso-task
drivers/exec+java: Add task configuration to restore previous PID/IPC isolation behavior
2021-02-09 13:29:14 -06:00
Seth Hoenig
af48777ddd consul/connect: enable custom sidecars to use expose checks
This PR enables jobs configured with a custom sidecar_task to make
use of the `service.expose` feature for creating checks on services
in the service mesh. Before we would check that sidecar_task had not
been set (indicating that something other than envoy may be in use,
which would not support envoy's expose feature). However Consul has
not added support for anything other than envoy and probably never
will, so having the restriction in place seems like an unnecessary
hindrance. If Consul ever does support something other than Envoy,
they will likely find a way to provide the expose feature anyway.

Fixes #9854
2021-02-09 10:49:37 -06:00
Seth Hoenig
836ee9e4a2 drivers/exec+java: Add task configuration to restore previous PID/IPC isolation behavior
This PR adds pid_mode and ipc_mode options to the exec and java task
driver config options. By default these will defer to the default_pid_mode
and default_ipc_mode agent plugin options created in #9969. Setting
these values to "host" mode disables isolation for the task. Doing so
is not recommended, but may be necessary to support legacy job configurations.

Closes #9970
2021-02-08 14:26:35 -06:00
Drew Bailey
24c0e3ccf5 address pr comments 2021-02-08 13:43:05 -05:00
Drew Bailey
7217bf8f06 on_update check_restart e2e 2021-02-08 10:49:25 -05:00
Drew Bailey
74e7bbb7d2 e2e test for on_update service checks
check_restart not compatible with on_update=ignore

reword caveat
2021-02-08 08:32:40 -05:00
Chris Baker
81fef152a0 e2e packer build: upgrade jdk to java 14 2021-02-02 17:33:48 +00:00
Mahmood Ali
d161c40f34 e2e: segment e2e clusters
Ensure that the e2e clusters are isolated and never attempt to autojoin
with another e2e cluster.

This ensures that each cluster's servers have a unique `ConsulAutoJoin`,
to be used for discovery.
2021-02-01 08:04:21 -05:00
Chris Baker
7f06adf1af Merge tag 'v1.0.3' into post-release-1.0.3
Version 1.0.3
2021-01-29 19:30:08 +00:00
Chris Baker
bcb78f15bf lint some nomad HCL job specs 2021-01-28 12:03:19 +00:00
Chris Baker
c9905747e6 e2e: java driver isolation tests 2021-01-28 12:03:19 +00:00
Chris Baker
3eb9cdf740 additional e2e utils for multi-task allocs 2021-01-28 12:03:19 +00:00
Kris Hicks
87f80b1042 Add a little comment 2021-01-28 12:03:19 +00:00
Kris Hicks
c0f6df7cfd Add test for alloc exec 2021-01-28 12:03:19 +00:00
Kris Hicks
ea7bab0714 Add e2e test for raw exec 2021-01-28 12:03:19 +00:00
Kris Hicks
677353a205 Add PID namespacing and e2e test 2021-01-28 12:03:19 +00:00
Mahmood Ali
38a7e73c91 e2e: skip node drain deadline/force tests 2021-01-27 08:42:16 -05:00
Mahmood Ali
e5bdc5cf71 e2e: use f.NoError instead of requires 2021-01-27 08:36:23 -05:00
Mahmood Ali
1349e49bec e2e: Disable Connect tests
The connect tests are very disruptive: they restart consul/nomad agents with new
tokens.  The test seems particularly flaky, failing 32 times out of 73 in my
sample.

The tests are particularly problematic because they are disruptive and affect
other tests. On failure, the nomad or consul agent on the client can get into a
wedged state, so health/deployment info in subsequent tests may be wrong. In
some cases, the node will be deemed as failed, and then the subsequent tests may
fail when the node is deemed lost and the test allocations get migrated unexpectedly.
2021-01-26 10:01:14 -05:00
Mahmood Ali
78ccc93c2b e2e: deflake nodedrain test
The nodedrain deadline test asserts that all allocations are migrated by the
deadline. However, when the deadline is short (e.g. 10s), the test may fail
because of scheduler/client-propagation delays.

In one failing test, it took ~15s from the RPC call to the moment
the scheduler issued the migration update, and then 3 seconds for the alloc to be
stopped.

Here, I increase the timeouts to avoid such false positives.
2021-01-26 10:01:14 -05:00
Mahmood Ali
1290eb75f9 e2e: vault increase timeout
Increase the timeout for vaultsecrets. As the default interval is 0.1s, 10
retries mean it only retries for one second, a very short time for some waiting
scenarios in the test (e.g. starting allocs, etc).
2021-01-26 10:01:14 -05:00
Mahmood Ali
fe9929270c e2e: prefer testutil.WaitForResultRetries
Prefer testutil.WaitForResultRetries, which emits more descriptive errors on
failures. `require.Eventually` fails with an opaque "Condition never satisfied"
error message.
2021-01-26 10:01:14 -05:00
Mahmood Ali
b49df6e9ae e2e: special case "Unexpected EOF" errors
This is an attempt at deflaking the e2e exec tests, and a way to improve
messages.

e2e tests occasionally fail with "unexpected EOF" even though the exec output matches
expectations. I suspect there is a race in handling EOF in server/http handling.

Here, we special case this error and ensure we get all failures,
to help debug the case better.
2021-01-26 10:01:14 -05:00
Mahmood Ali
25f10e13e5 e2e: tweak failure messages
Tweak the error messages for the flakiest tests, so that on test failure, we get
more output
2021-01-26 09:16:48 -05:00
Mahmood Ali
f7acda4260 e2e: use testify requires instead of t.Fatal
testify requires offer better error messages that are easier to notice when seeing
a wall of text in the builds.
2021-01-26 09:14:47 -05:00
Mahmood Ali
30573f048e e2e: deflake consul/CheckRestart test
Ensure we pass the alloc ID to status.  Otherwise, the test may fail if there is
another spurious allocation running from another test.
2021-01-26 09:12:20 -05:00
Mahmood Ali
fcb7e160da e2e: Fix build script and pass shellcheck 2021-01-26 09:11:37 -05:00
Mahmood Ali
2867e262f1 Merge pull request #9798 from hashicorp/e2e-terraform-tweaks-20200113
This PR makes two ergonomics changes, meant to get e2e builds more reproducible and ease changes.

### AMI Management

First, we pin the server AMIs to the commits associated with the build.  No more using the latest AMI a developer built in a test branch, or accidentally using a stale AMI because we forgot to build one!  Packer is to tag the AMI images with the commit sha used to generate the image, and then Terraform would look up only the AMIs associated with that sha. To minimize churn, we use the SHA associated with the latest Packer configurations, rather than SHA of all.

This has a few benefits: reproducibility and avoiding accidental AMI changes and contamination of changes across branches. Also, the change is a stepping stone to an e2e pipeline that builds new AMIs automatically if Packer files changed.

The downside is that new AMIs will be generated even for irrelevant changes (e.g. spelling, commits), but I suspect that's OK. Also, an engineer will be forced to build the AMI whenever they change Packer files while iterating on e2e scripts; this hasn't been an issue for me yet, and I'll be open for iterating on that later if it proves to be an issue.

### Config Files and Packer

Second, this PR moves e2e config hcl management to Terraform instead of Packer. Currently, the config files live in `./terraform/config`, but they are baked into the servers by Packer and changes are ignored.  This current behavior surprised me, as I spent a bit of time debugging why my config changes weren't applied.  Having Terraform manage them would ease engineers' iteration.  Also, it makes Packer management more consistent (Packer only works in `e2e/terraform/packer`) and eases the logic for AMI change detection.

The config directory is very small (100KB), and having it as an upload step adds negligible time to `terraform apply`.
2021-01-25 13:20:28 -05:00
Mahmood Ali
c45c8e8bb6 update readme about profiles and packer build 2021-01-25 11:40:26 -05:00
Seth Hoenig
ceae8ad1cf consul/connect: Add support for Connect terminating gateways
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.

https://www.consul.io/docs/connect/gateways/terminating-gateway

These gateways are declared as part of a task group level service
definition within the connect stanza.

service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      terminating {
        // terminating-gateway configuration entry
      }
    }
  }
}

Currently Envoy is the only supported gateway implementation in
Consul. The gateway task can be customized by configuring the
connect.sidecar_task block.

When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.

Gateways require Consul 1.8.0+, checked by a node constraint.

Closes #9445
2021-01-25 10:36:04 -06:00
Tim Gross
d0da4544aa e2e: added tests for check restart behavior 2021-01-22 10:55:40 -05:00
Drew Bailey
3cb1132693 prevent double job status update (#9768)
* Prevent Job Statuses from being calculated twice

https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval
insertion with job (de-)registration. This change removes a now obsolete
guard which checked if the index was equal to the job.CreateIndex, which
would empty the status. Now that the job registration eval insertion is
atomic with the registration this check is no longer necessary to set
the job statuses correctly.

* test to ensure only single job event for job register

* periodic e2e

* separate job update summary step

* fix updatejobstability to use copy instead of modified reference of job

* update envoygatewaybindaddresses copy to prevent job diff on null vs empty

* set ConsulGatewayBindAddress to empty map instead of nil

fix nil assertions for empty map

rm unnecessary guard
2021-01-22 09:18:17 -05:00
Mahmood Ali
5ba061359a e2e: show command output on failure
When a command fails, it's nice to have the full output, as it contains
diagnostic information. The status code isn't sufficient for debugging.
2021-01-21 10:32:16 -05:00
Mahmood Ali
db63d31241 e2e: deflake TestVolumeMounts
After submitting an update, the test ought to wait until the new
allocations are placed. Previously, we'd use the original to-be-stopped
allocations and the test fails when attempting to exec.
2021-01-21 10:28:41 -05:00
Mahmood Ali
60bb50c432 e2e deflake namespaces: only check namespace jobs
Deflake namespace e2e test by only asserting on jobs related to the
namespace tests. During our e2e tests, some leftover jobs (e.g.
prometheus) are left running while being shut down and cause the test to
fail.
2021-01-21 10:26:24 -05:00
Mahmood Ali
f045ec2d0b e2e: deflake events
Handle streamCh channel being closed.
2021-01-21 10:25:42 -05:00
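Handling a closed channel, as commit f045ec2d0b describes, usually relies on Go's two-value receive. The sketch below is an assumption about the general pattern, not the test's actual code: the `drain` helper and the string event type are hypothetical, and only the name `streamCh` comes from the commit message.

```go
package main

import "fmt"

// drain is a hypothetical helper: it collects events from streamCh until
// the channel is closed, then returns what it has seen. The second value
// of the receive is false once the channel is closed and empty, which
// avoids spinning on zero-value events.
func drain(streamCh <-chan string) []string {
	var events []string
	for {
		ev, ok := <-streamCh
		if !ok {
			return events
		}
		events = append(events, ev)
	}
}

func main() {
	ch := make(chan string, 2)
	ch <- "JobRegistered"
	ch <- "AllocationUpdated"
	close(ch)
	fmt.Println(len(drain(ch)), "events received before close")
}
```

Without the `ok` check, a receive on a closed channel returns the zero value immediately, so a consumer loop can misread empty events as real ones.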
Seth Hoenig
7ff2f9c1bc consul/connect: Enable running multiple ingress gateways per Nomad agent
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.

Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.

Fixes #9834
2021-01-19 12:58:36 -06:00
Mahmood Ali
906cbdfda5 add helper for building ami 2021-01-15 10:49:13 -05:00
Mahmood Ali
9fdd9a5428 set sha 2021-01-15 10:49:13 -05:00
Mahmood Ali
21f77f576d change ami naming 2021-01-15 10:49:12 -05:00
Mahmood Ali
da74d8c549 move config files to terraform 2021-01-15 10:49:12 -05:00
Seth Hoenig
74c1828431 e2e: use jobspec2 Parse for parsing jobfile in e2e utils
We directly parse job files in e2eutil, but currently use the jobspec
package. Instead, use the Parse method from the jobspec2 package so
we can parse job files with new features.
2021-01-13 14:00:40 -06:00