Commit Graph

21484 Commits

Huan Wang
df74bc0cbb update-gopsutil (#10790) 2021-06-21 10:19:39 -04:00
Seth Hoenig
d61bf59105 Merge pull request #10789 from hashicorp/b-cns-mups
consul/connect: Validate uniqueness of Connect upstreams within task group
2021-06-21 09:01:02 -05:00
Seth Hoenig
adcbcc129b consul/connect: Validate uniqueness of Connect upstreams within task group
This PR adds validation during job submission that Connect proxy upstreams
within a task group use different listener addresses. Otherwise, a
duplicate Envoy listener would be created and fail to bind.

Closes #7833
2021-06-18 16:50:53 -05:00
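
A minimal sketch of the kind of uniqueness check the commit above describes, using hypothetical `Upstream`/`TaskGroup` stand-ins rather than Nomad's actual job structs: reject the group when two upstreams would ask Envoy to bind the same local address and port.

```go
package main

import "fmt"

// Upstream and TaskGroup are simplified stand-ins for Nomad's job structs.
type Upstream struct {
	DestinationName  string
	LocalBindAddress string // treated as 127.0.0.1 when empty
	LocalBindPort    int
}

type TaskGroup struct {
	Name      string
	Upstreams []Upstream
}

// validateUpstreamListeners returns an error if two upstreams in the group
// would bind the same local address:port, which would produce a duplicate
// Envoy listener that cannot bind.
func validateUpstreamListeners(tg TaskGroup) error {
	seen := map[string]string{} // listener address -> destination that claimed it
	for _, up := range tg.Upstreams {
		addr := up.LocalBindAddress
		if addr == "" {
			addr = "127.0.0.1"
		}
		key := fmt.Sprintf("%s:%d", addr, up.LocalBindPort)
		if other, ok := seen[key]; ok {
			return fmt.Errorf("group %q: upstreams %q and %q both use listener %s",
				tg.Name, other, up.DestinationName, key)
		}
		seen[key] = up.DestinationName
	}
	return nil
}

func main() {
	tg := TaskGroup{
		Name: "api",
		Upstreams: []Upstream{
			{DestinationName: "db", LocalBindPort: 8080},
			{DestinationName: "cache", LocalBindPort: 8080}, // duplicate listener
		},
	}
	fmt.Println(validateUpstreamListeners(tg))
}
```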
Russell Rollins
c56251b9dd Adds error handling for client error in getRandomJobAlloc. (#10787) 2021-06-18 16:26:43 -04:00
Seth Hoenig
6dcada4346 Merge pull request #10784 from hashicorp/b-dlskf
e2e: fix a couple recent e2e bugs
2021-06-18 13:17:20 -05:00
Seth Hoenig
15d39f0dee e2e: use -detach mode when registering jobs with cli
This PR changes the e2e job registration helper to set the -detach option
when registering a job with the CLI instead of the API. This is
necessary for jobs which never become healthy, as the deployment
never finishes for failing jobs and the command never returns,
causing the test to time out after 10 minutes.
2021-06-18 12:18:40 -05:00
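
For illustration, a hedged sketch of registering a job via the CLI in detach mode from a test helper; the `registerDetached` helper and its error handling are hypothetical, but `nomad job run -detach` is the invocation the commit above relies on.

```go
package main

import (
	"fmt"
	"os/exec"
)

// registerDetached shells out to the nomad CLI with -detach so the command
// returns as soon as the evaluation is submitted, instead of blocking until
// the deployment finishes (which never happens for jobs that never become
// healthy).
func registerDetached(jobFile string) error {
	out, err := exec.Command("nomad", "job", "run", "-detach", jobFile).CombinedOutput()
	if err != nil {
		return fmt.Errorf("job run -detach failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := registerDetached("example.nomad"); err != nil {
		fmt.Println(err)
	}
}
```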
Seth Hoenig
57fdb81433 consul: set task name only for group service checks
This PR fixes a bug introduced in a refactoring

https://github.com/hashicorp/nomad/pull/10764/files#diff-56b3c82fcbc857f8fb93a903f1610f6e6859b3610a4eddf92bad9ea27fdc85ec

where task-level service checks would inherit the task name
field when they shouldn't.

Fixes #10781
2021-06-18 12:16:27 -05:00
Tim Gross
2520d83e85 tests: allocrunner CNI tests are Linux-only (#10783)
The `client/allocrunner` tests fail to compile on macOS because the
CNI test file depends on the CNI network configurator, which lives in a
Linux-only file.
2021-06-18 11:34:31 -04:00
Tim Gross
77f6ecbbbf deps: bump go-getter to 1.5.4 (#10778) 2021-06-17 16:30:00 -04:00
Seth Hoenig
2d8fc6b344 Merge pull request #10776 from hashicorp/b-cns-sysjob-ups
consul/connect: in-place update service definition when connect upstreams are modified
2021-06-17 10:13:56 -05:00
Tim Gross
ad3070a1c2 docs: host_network does support Docker task port mapping (#10774) 2021-06-17 09:11:10 -04:00
Tim Gross
b0922e90a7 changelog entry for #10756 2021-06-16 22:02:10 -04:00
Seth Hoenig
7ba60b4e33 consul/connect: in-place update service definition when connect upstreams are modified
This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy
would not result in Consul applying the changes unless an additional change to
the job triggered a task replacement (thus replacing the service definition).

The fix is to check if upstreams have been modified between Nomad's view of the
sidecar service definition, and the service definition for the sidecar that is
actually registered in Consul.

Fixes #8754
2021-06-16 16:48:26 -05:00
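
A rough sketch of the comparison the fix above performs, with simplified stand-ins for Nomad's desired sidecar definition and the registration Consul actually holds; if the upstream sets differ, the sidecar service should be re-registered in place.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Upstream is a simplified view of a Connect sidecar upstream.
type Upstream struct {
	DestinationName string
	LocalBindPort   int
}

// canonical renders a set of upstreams in a stable, order-independent form
// so that two registrations can be compared for equivalence.
func canonical(ups []Upstream) string {
	keys := make([]string, 0, len(ups))
	for _, u := range ups {
		keys = append(keys, fmt.Sprintf("%s:%d", u.DestinationName, u.LocalBindPort))
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

// upstreamsChanged reports whether Nomad's desired upstreams differ from the
// upstreams currently registered in Consul; a true result means the sidecar
// service definition must be updated in place.
func upstreamsChanged(desired, registered []Upstream) bool {
	return canonical(desired) != canonical(registered)
}

func main() {
	desired := []Upstream{{DestinationName: "db", LocalBindPort: 5432}}
	registered := []Upstream{} // upstream was added after initial registration
	fmt.Println(upstreamsChanged(desired, registered)) // true -> re-register sidecar
}
```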
Tim Gross
2a640f0b2d docker: generate /etc/hosts file for bridge network mode (#10766)
When `network.mode = "bridge"`, we create a pause container in Docker with no
networking so that we have a process to hold the network namespace we create
in Nomad. The default `/etc/hosts` file of that pause container is then used
for all the Docker tasks that share that network namespace. Some applications
rely on this file being populated.

This changeset generates a `/etc/hosts` file and bind-mounts it to the
container when Nomad owns the network, so that the container's hostname has an
IP in the file as expected. The hosts file will include the entries added by
the Docker driver's `extra_hosts` field.

In this changeset, only the Docker task driver will take advantage of this
option, as the `exec`/`java` drivers currently copy the host's `/etc/hosts`
file and this can't be changed without breaking backwards compatibility. But
the fields are available in the task driver protobuf for community task
drivers to use if they'd like.
2021-06-16 14:55:22 -04:00
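
A simplified sketch of generating such a hosts file; the function and field names are illustrative rather than Nomad's implementation, but they show the shape of the output: loopback entries, the container hostname mapped to the shared namespace IP, and any `extra_hosts` entries appended.

```go
package main

import (
	"fmt"
	"strings"
)

// buildHostsFile renders an /etc/hosts payload for a task sharing a
// Nomad-owned network namespace. extraHosts entries are expected in the
// Docker driver's "hostname:ip" format.
func buildHostsFile(hostname, ip string, extraHosts []string) string {
	var b strings.Builder
	b.WriteString("127.0.0.1 localhost\n")
	b.WriteString("::1 localhost\n")
	// Map the container hostname to the shared namespace IP so that
	// applications resolving their own hostname get a usable address.
	fmt.Fprintf(&b, "%s %s\n", ip, hostname)
	for _, eh := range extraHosts {
		parts := strings.SplitN(eh, ":", 2)
		if len(parts) == 2 {
			fmt.Fprintf(&b, "%s %s\n", parts[1], parts[0])
		}
	}
	return b.String()
}

func main() {
	fmt.Print(buildHostsFile("web-1", "172.26.64.10", []string{"db.internal:10.0.0.5"}))
}
```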
dependabot[bot]
3b5bca63b6 build(deps): bump postcss from 7.0.35 to 7.0.36 in /website (#10772)
Bumps [postcss](https://github.com/postcss/postcss) from 7.0.35 to 7.0.36.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/7.0.35...7.0.36)

---
updated-dependencies:
- dependency-name: postcss
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-06-16 12:18:43 -04:00
dependabot[bot]
0983b073e3 build(deps): bump ws from 7.3.1 to 7.4.6 in /scripts/screenshots/src (#10671)
Bumps [ws](https://github.com/websockets/ws) from 7.3.1 to 7.4.6.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](https://github.com/websockets/ws/compare/7.3.1...7.4.6)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-06-16 11:09:34 -04:00
Seth Hoenig
e16a35167b Merge pull request #10765 from hashicorp/b-java-fp-version
client/fingerprint/java: improve java version string regex matching
2021-06-15 17:14:13 -05:00
Seth Hoenig
674183c35d client/fingerprint/java: improve java version string regex matching
This PR improves the regular expression used for matching the java
version string, which varies a lot depending on the java vendor and
version.

These are the example strings we now test for:

java version "1.7.0_80"
openjdk version "11.0.1" 2018-10-16
openjdk version "11.0.1" 2018-10-16
java version "1.6.0_36"
openjdk version "1.8.0_192"
openjdk 11.0.11 2021-04-20 LTS

The last one is a new test case added for #6081, which is
still broken with today's CentOS 7 default JDK package.

openjdk 11.0.11 2021-04-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing)

==> Evaluation "21c6caf7" finished with status "complete" but failed to place all allocations:
    Task Group "example" (failed to place 1 allocation):
      * Constraint "${driver.java.version} >= 11.0.0": 1 nodes excluded by filter
    Evaluation "2b737d48" waiting for additional capacity to place remainder

Fixes #6081
2021-06-15 14:15:01 -05:00
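
A small standalone sketch of matching the version strings listed above; the pattern is illustrative and not necessarily Nomad's exact fingerprint regex.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe tolerates both quoted ("java version \"1.8.0_192\"") and
// unquoted ("openjdk 11.0.11 2021-04-20 LTS") version strings.
var versionRe = regexp.MustCompile(`(?:java|openjdk)(?: version)? "?([0-9]+(?:[._][0-9]+)*)`)

// parseJavaVersion extracts the version component from the first line of
// `java -version` style output, reporting whether a match was found.
func parseJavaVersion(output string) (string, bool) {
	m := versionRe.FindStringSubmatch(output)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	samples := []string{
		`java version "1.7.0_80"`,
		`openjdk version "11.0.1" 2018-10-16`,
		`java version "1.6.0_36"`,
		`openjdk version "1.8.0_192"`,
		`openjdk 11.0.11 2021-04-20 LTS`, // the CentOS 7 style from #6081
	}
	for _, s := range samples {
		v, ok := parseJavaVersion(s)
		fmt.Println(ok, v)
	}
}
```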
Seth Hoenig
52bf197790 Merge pull request #10764 from hashicorp/b-passfail-lost
consul: make failures_before_critical and success_before_passing work with group services
2021-06-15 12:41:04 -05:00
Seth Hoenig
0ef0b2ef2b docs: add bugfix note to 1.0.8 2021-06-15 12:40:44 -05:00
Seth Hoenig
b4a631c1c5 consul: make failures_before_critical and success_before_passing work with group services
This PR fixes some job submission plumbing to make sure the Consul Check parameters
- failures_before_critical
- success_before_passing

work with group-level services. They already work with task-level services.
2021-06-15 11:20:40 -05:00
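
A hedged sketch of the plumbing involved, with simplified stand-ins for Nomad's check stanza and the Consul-side registration rather than the real types; the point is only that both parameters must be copied through for group-level services just as for task-level ones.

```go
package main

import "fmt"

// ServiceCheck is a simplified view of a Nomad service check stanza.
type ServiceCheck struct {
	Name                   string
	SuccessBeforePassing   int
	FailuresBeforeCritical int
}

// ConsulCheckReg is a simplified stand-in for a Consul check registration.
type ConsulCheckReg struct {
	Name                   string
	SuccessBeforePassing   int
	FailuresBeforeCritical int
}

// toConsulCheck copies the flapping-protection parameters through; the bug
// described above was that this copy happened for task-level services but
// was dropped along the group-level service path.
func toConsulCheck(c ServiceCheck) ConsulCheckReg {
	return ConsulCheckReg{
		Name:                   c.Name,
		SuccessBeforePassing:   c.SuccessBeforePassing,
		FailuresBeforeCritical: c.FailuresBeforeCritical,
	}
}

func main() {
	reg := toConsulCheck(ServiceCheck{Name: "alive", SuccessBeforePassing: 3, FailuresBeforeCritical: 2})
	fmt.Printf("%+v\n", reg)
}
```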
Seth Hoenig
ab9b589b33 Merge pull request #10762 from hashicorp/docs-update-cl-2
docs: update changelog
2021-06-15 09:25:51 -05:00
Seth Hoenig
d7530f04ae docs: update changelog 2021-06-15 09:17:06 -05:00
James Rasell
c3b15b8733 Merge pull request #10758 from hashicorp/b-fix-test-datarace-plugins
plugins: fix test data race.
2021-06-15 14:33:53 +02:00
James Rasell
ff4cd338d9 plugins: fix test data race. 2021-06-15 09:31:08 +02:00
Isabel Suchanek
ca010f9f87 cli: check deployment exists before monitoring (#10757)
System and batch jobs don't create deployments, which means Nomad tries
to monitor a non-existent deployment when it runs a job and outputs an
error message. This adds a check to make sure a deployment exists before
monitoring. Also fixes some formatting.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2021-06-14 16:42:38 -07:00
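
A minimal sketch of the guard described above, assuming an illustrative deployment lookup rather than Nomad's actual CLI internals: fetch the latest deployment for the job and only start the monitor when one exists.

```go
package main

import "fmt"

// Deployment is a simplified stand-in for a Nomad deployment object.
type Deployment struct{ ID string }

// latestDeployment is an illustrative lookup; system and batch jobs
// return nil because they never create deployments.
func latestDeployment(jobID string) *Deployment {
	if jobID == "batch-import" {
		return nil
	}
	return &Deployment{ID: "9f2e-example"}
}

// monitorIfPresent only starts deployment monitoring when a deployment
// actually exists, avoiding the spurious error the old code printed.
func monitorIfPresent(jobID string) {
	dep := latestDeployment(jobID)
	if dep == nil {
		fmt.Printf("job %q has no deployment to monitor\n", jobID)
		return
	}
	fmt.Printf("monitoring deployment %s\n", dep.ID)
}

func main() {
	monitorIfPresent("web")
	monitorIfPresent("batch-import")
}
```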
Mahmood Ali
8052ae1d11 deployment watcher: Reuse allocsCh if allocIndex remains the same (#10756)
Fix deployment watchers to avoid creating unnecessary deployment watcher goroutines and blocking queries. `deploymentWatcher.getAllocsCh` creates a new goroutine that makes a blocking query to fetch updates of deployment allocs.

## Background

When operators submit a new or updated service job, Nomad creates a new deployment by default. The deployment object controls how fast to place the allocations through [`max_parallel`](https://www.nomadproject.io/docs/job-specification/update#max_parallel) and health check configuration.

The `scheduler` and `deploymentwatcher` packages collaborate to achieve deployment logic: the scheduler only places the canaries and `max_parallel` allocations for a new deployment; the `deploymentwatcher` monitors alloc progress and then enqueues a new evaluation whenever the scheduler should reprocess the job and place the next `max_parallel` round of allocations.

The `deploymentwatcher` package makes blocking queries against the state store to fetch all deployments and the relevant allocs for each running deployment. If `deploymentwatcher` fails or is hindered from fetching the state, the deployments fail to make progress.

`Deploymentwatcher` logic only runs on the leader.

## Why unnecessary deployment watchers can halt cluster progress
Previously, `getAllocsCh` was called on every iteration of the for loop in the `deploymentWatcher.watch()` function. However, the for loop may iterate many times before the allocs get updated. In fact, whenever a new deployment is created/updated/deleted, *all* `deploymentWatcher`s get notified through `w.deploymentUpdateCh`. The `getAllocsCh` goroutines and blocking queries spike significantly and grow quadratically with respect to the number of running deployments. The growth leads to two adverse outcomes:

1. it spikes CPU/memory usage, potentially leading to OOM or very slow processing
2. it activates the [query rate limiter](abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L896-L898)), so the watcher later fails to get updates and consequently fails to make progress towards placing new allocations for the deployment!

So the cluster fails to catch up and fails to make progress in almost all deployments. The cluster recovers after a leader transition: the deposed leader stops all watchers and frees up goroutines and blocking queries; the new leader recreates the watchers without the quadratic growth and remains under the rate limiter. Well, until a spike of deployments is created, triggering the condition again.

### Relevant Code References

Path for deployment monitoring:
* [`Watcher.watchDeployments`](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L164-L192)) loops waiting for deployment updates.
* On every deployment update, [`w.getDeploys`](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L194-L229)) returns all deployments in the system
* `watchDeployments` calls `w.add(d)` on every active deployment
* which, in turn, [updates the existing watcher if one is found](abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L251-L255)).
* The deployment watcher [updates the local deployment field and triggers the `deploymentUpdateCh` channel](abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L136-L147))
* The [deployment watcher `deploymentUpdateCh` selector is activated](abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L455-L489)). Most of the time the selector clause is a no-op, because the flow was triggered by another deployment's update
* The `watch` for-loop iterates again, and in the previous code we created yet another goroutine and blocking call that risked being rate limited.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2021-06-14 16:01:01 -04:00
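
A stripped-down sketch of the loop-shape change, with illustrative types standing in for the real watcher: start a new blocking query only when the alloc index advances, and otherwise keep selecting on the channel created earlier, so deployment-update notifications alone no longer spawn goroutines.

```go
package main

import (
	"fmt"
	"time"
)

// allocUpdate is a simplified stand-in for a batch of allocation updates
// returned by a blocking query at a given raft index.
type allocUpdate struct{ index uint64 }

// getAllocsCh is illustrative: it starts one goroutine that performs a
// blocking query from the given index and delivers the result on a channel.
func getAllocsCh(fromIndex uint64) <-chan allocUpdate {
	ch := make(chan allocUpdate, 1)
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend blocking query
		ch <- allocUpdate{index: fromIndex + 1}
	}()
	return ch
}

func watch(deploymentUpdateCh <-chan struct{}, done <-chan struct{}) {
	allocIndex := uint64(1)
	allocsCh := getAllocsCh(allocIndex)
	for {
		select {
		case <-done:
			return
		case <-deploymentUpdateCh:
			// A deployment changed somewhere in the cluster; re-check local
			// state but do NOT start a new blocking query. The old code
			// effectively recreated allocsCh on every loop iteration.
		case update := <-allocsCh:
			if update.index > allocIndex {
				allocIndex = update.index
				// Only now is a new blocking query worth starting.
				allocsCh = getAllocsCh(allocIndex)
			}
			fmt.Println("processed allocs at index", allocIndex)
		}
	}
}

func main() {
	done := make(chan struct{})
	updates := make(chan struct{})
	go watch(updates, done)
	time.Sleep(50 * time.Millisecond)
	close(done)
}
```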
Seth Hoenig
3a5cbc4713 Merge pull request #10754 from hashicorp/b-client-connect-constraint
consul/connect: remove unnecessary connect constraint on clients
2021-06-14 09:41:25 -05:00
James Rasell
7019bc2b9c Merge pull request #10752 from hashicorp/b-fix-test-datarace-volumewatcher
volumewatcher: fix test data race.
2021-06-14 16:30:34 +02:00
Tim Gross
2b63a093ac quotas: evaluate quota feasibility last in scheduler (#10753)
The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, and as a result we count nodes which
will later be filtered out by constraints. Therefore, for jobs with
constraints, nodes that are feasibility-checked but fail have been counted
against quotas. This failure mode is order dependent: if all the unfiltered
nodes happen to be quota-checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
2021-06-14 10:11:40 -04:00
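
An abstract sketch of the ordering change, with hypothetical check functions standing in for Nomad's scheduler iterators: quota accounting runs after the other feasibility checks, so it only ever sees nodes the constraints already accepted.

```go
package main

import "fmt"

// FeasibilityCheck is an illustrative stand-in for the scheduler's
// feasibility iterators; each filters the candidate node stream.
type FeasibilityCheck func(node string) bool

// buildChecks returns the checks in the post-fix order: constraint-style
// checks first, quota accounting last, so quota usage is only counted for
// nodes that actually remain feasible.
func buildChecks(constraint, quota FeasibilityCheck) []FeasibilityCheck {
	return []FeasibilityCheck{constraint, quota}
}

// feasibleNodes runs each node through the checks in order, dropping it at
// the first check it fails.
func feasibleNodes(nodes []string, checks []FeasibilityCheck) []string {
	var out []string
	for _, n := range nodes {
		ok := true
		for _, check := range checks {
			if !check(n) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	constraint := func(n string) bool { return n != "node-b" } // node-b fails a constraint
	counted := 0
	quota := func(n string) bool { counted++; return true } // counts usage per node it sees

	nodes := []string{"node-a", "node-b", "node-c"}
	fmt.Println(feasibleNodes(nodes, buildChecks(constraint, quota)))
	fmt.Println("nodes charged against quota:", counted) // node-b was never charged
}
```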
Seth Hoenig
0d13ef0c75 consul/connect: remove unnecessary connect constraint on clients
PR https://github.com/hashicorp/nomad/pull/10702 added two new constraints
for Connect jobs: one for the Consul gRPC listener, and one for Connect being
enabled on Consul clients. Connect only needs to be enabled on Consul servers,
not on clients, so remove the extra constraint.

Discuss:
https://discuss.hashicorp.com/t/nomad-1-1-1-and-consul-connect-enabled-on-consul-clients/25295
2021-06-14 08:01:45 -05:00
James Rasell
b6505c2350 volumewatcher: fix test data race. 2021-06-14 12:11:35 +02:00
Brandon Romano
ff2e2c113b Merge pull request #10750 from hashicorp/br.quote-image
Fix headshot image 404
2021-06-11 15:38:09 -07:00
Brandon Romano
399dd84acc Fix headshot image 404 2021-06-11 15:31:05 -07:00
Luiz Aoqui
5cfc104a38 fix agent-info help message formatting (#10747) 2021-06-11 15:39:28 -04:00
James Rasell
88e456d962 Merge pull request #10745 from hashicorp/b-fix-test-datarace-deploymentwatcher
deploymentwatcher: fix test data race.
2021-06-11 17:23:03 +02:00
James Rasell
a7d055a584 Merge pull request #10744 from hashicorp/b-remove-duplicate-imports
chore: remove duplicate import statements
2021-06-11 16:42:34 +02:00
Mahmood Ali
c467de331e Merge pull request #10742 from hashicorp/deflake-tests-20210608
Deflaking Test 2021 June edition
2021-06-11 09:14:40 -04:00
James Rasell
9c926d7a64 Merge pull request #10739 from hashicorp/f-remove-unused-types-pkg
core: remove unused types pkg and PeriodicCallback type.
2021-06-11 13:27:22 +02:00
James Rasell
9a25e89b00 deploymentwatcher: fix test data race. 2021-06-11 11:55:21 +02:00
James Rasell
a4156c3e94 tests: remove duplicate import statements. 2021-06-11 09:39:22 +02:00
James Rasell
d1db141472 jobspec2: remove duplicate imports statements. 2021-06-11 09:38:47 +02:00
James Rasell
b3fe60a7f3 drivers: remove duplicate import statements. 2021-06-11 09:38:09 +02:00
James Rasell
55f0611aec e2e: remove duplicate import statements. 2021-06-11 09:37:23 +02:00
Mahmood Ali
7254fb34d1 deflake TestNomad_BootstrapExpect and other leader tests
The test fails reliably locally on my machine. The test uses non-dev mode,
where Raft actions get committed to disk, causing operations to exceed
the tight 50ms Raft deadlines.

So, here we ensure that non-dev servers use the default Raft config,
which has longer timeouts.

Also, I noticed that the test queries a server that may be a follower with
stale state.

I've updated the test to ensure we query the leader for its state. The
Barrier call ensures that the leader is a "stable" leader with committed
entries, and protects against a window where a new leader reports the
previous term before it commits a Raft log entry.
2021-06-10 22:04:10 -04:00
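
A hedged sketch of the dev/non-dev timing split; the concrete durations and the `raftTiming` struct are illustrative assumptions, not Nomad's actual server config.

```go
package main

import (
	"fmt"
	"time"
)

// raftTiming is a simplified stand-in for the Raft timing knobs a test
// server might override.
type raftTiming struct {
	HeartbeatTimeout time.Duration
	ElectionTimeout  time.Duration
}

// timingFor picks tight timeouts only for in-memory dev servers; non-dev
// servers commit Raft entries to disk and need roomier, default-like values.
func timingFor(devMode bool) raftTiming {
	if devMode {
		return raftTiming{HeartbeatTimeout: 50 * time.Millisecond, ElectionTimeout: 50 * time.Millisecond}
	}
	return raftTiming{HeartbeatTimeout: 1 * time.Second, ElectionTimeout: 1 * time.Second}
}

func main() {
	fmt.Printf("dev:     %+v\n", timingFor(true))
	fmt.Printf("non-dev: %+v\n", timingFor(false))
}
```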
Mahmood Ali
14994d666d tests: deflake TestAgentProfile_RemoteClient
The TestAgentProfile_RemoteClient test must wait for the client node to be
registered in the Raft state store, and not merely for the server to have a
network connection from the client.

In https://app.circleci.com/pipelines/github/hashicorp/nomad/15539/workflows/8dcbc3f3-946b-4da0-b089-9093788bc0c9/jobs/147919, notice how the `node registration complete` log line occurred after the test had already failed.

This is another case of flakiness due to not waiting for client
registration.
2021-06-10 22:00:15 -04:00
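
A generic sketch of the wait-for-registration pattern, with an illustrative `nodeRegistered` lookup standing in for a query against the server's Raft-backed state: poll until the client node appears before the test body runs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// registeredAt simulates the moment the client node lands in server state;
// in a real test this would be observed, not scheduled.
var registeredAt = time.Now().Add(200 * time.Millisecond)

// nodeRegistered is an illustrative lookup standing in for a query against
// the server's state store.
func nodeRegistered(nodeID string) bool {
	return time.Now().After(registeredAt)
}

// waitForNode polls until the client node shows up in server state or the
// deadline passes; tests that skip this step race client registration.
func waitForNode(nodeID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if nodeRegistered(nodeID) {
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	return errors.New("timed out waiting for node registration")
}

func main() {
	if err := waitForNode("client-1", 2*time.Second); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("node registered; test can proceed")
}
```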
Mahmood Ali
a1800abb7f tests: deflake TestMonitor_Monitor_RemoteServer and cross-region tests
Ensure that all servers are joined to each other before the test proceeds,
instead of just joining them to the first server and relying on
background serf propagation.

Relying on background serf propagation is a cause of flakiness,
especially for tests with multiple regions. The server receiving the RPC
may not be aware of the region and may fail to forward the RPC accordingly.

For example, consider the `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045 where you can observe:
* `nomad-117` is joined to `nomad-118` and `nomad-119`
* `nomad-119` is the foreign region
* `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader
*  search the logs for `nomad: adding server` and notice that `nomad-118`
   only added `nomad-117` and `nomad-118`, but not `nomad-119`!
* so the query to the non-leader in the test fails to be forwarded to
  the appropriate region.
2021-06-10 21:27:55 -04:00
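
A tiny sketch of the full-mesh join described above; the `join` callback is a placeholder for however a given test wires servers together.

```go
package main

import "fmt"

// joinAll is an illustrative full-mesh join: every server is joined to every
// other server up front, instead of joining only to the first server and
// relying on background serf propagation.
func joinAll(servers []string, join func(a, b string)) {
	for i := range servers {
		for j := i + 1; j < len(servers); j++ {
			join(servers[i], servers[j])
		}
	}
}

func main() {
	joinAll([]string{"nomad-117", "nomad-118", "nomad-119"}, func(a, b string) {
		fmt.Printf("join %s <-> %s\n", a, b)
	})
}
```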
Mahmood Ali
7badf0fda2 tests: deflake CSI forwarding tests
This updates `client.Ready()` so it returns once the client node has been
registered with the servers. Previously, it returned when the fingerprinters'
first batch completed, without ensuring that the node is stored in the Raft
data. The tests may then fail later with unknown node errors.

`client.Ready()` seems to only be called in CSI and some client stats code
now.

This class of bug, assuming the client is registered without checking, is a
source of flakiness elsewhere. Other tests use other mechanisms for
checking node readiness, though not consistently.
2021-06-10 21:26:34 -04:00
Isabel Suchanek
398d81ccf1 Merge pull request #10740 from hashicorp/docs-deploy-monitor
docs: add deployment monitor to docs, changelog
2021-06-10 13:53:03 -07:00
Isabel Suchanek
34aecd0e82 docs: add deployment monitor to docs, changelog
Updates the deployment status and job run docs
2021-06-10 10:51:33 -07:00