Commit Graph

22499 Commits

Author SHA1 Message Date
Seth Hoenig
6c7a934033 build: use single-path GOPATH set in makefile
When GOBIN is not set, BIN must be set to the single-path workaround
value of GOPATH, because Circle.
2022-02-17 09:26:13 -06:00
Luiz Aoqui
36e31c51cb initial base work for implementing sorting and filter across API endpoints (#12076) 2022-02-16 14:34:36 -05:00
Seth Hoenig
3ebfd7b414 Merge pull request #12077 from hashicorp/b-makefile-use-gobin
build: respect GOBIN when using make targets
2022-02-16 13:25:03 -06:00
Seth Hoenig
06613d652b build: respect GOBIN when using make targets
This PR updates GNUMakefile to respect $GOBIN if it is set in the
environment or via an $GOENV file. Previously we hard-coded the output
to $GOPATH/bin, which is not necessarily the desired behavior.
2022-02-16 12:05:55 -06:00
Luiz Aoqui
fafb7cecce Add go-bexpr filters to evals and deployment list endpoints (#12034) 2022-02-16 11:40:30 -05:00
Tiernan
1fabefd27e interpolate network.dns block on client (#12021) 2022-02-16 08:39:44 -05:00
Tim Gross
b775a73ded CSI: make gRPC client creation more robust (#12057)
Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started.

* Refactors the gRPC client to connect on first use. This provides the
  plugin manager instance the ability to retry the gRPC client
  connection until success.
* Add a 30s timeout to the plugin supervisor so that we don't poll
  forever waiting for a plugin that will never come back up.

Minor improvements:
* The plugin supervisor hook creates a new gRPC client for every probe
  and then throws it away. Instead, reuse the client as we do for the
  plugin manager.
* The gRPC client constructor has a 1 second timeout. Clarify that this
  timeout applies to the connection and not the rest of the client
  lifetime.
2022-02-15 16:57:29 -05:00
Seth Hoenig
07f4227d76 Merge pull request #12054 from hashicorp/b-creation-indexes
api: return sorted results in certain list endpoints
2022-02-15 15:08:38 -06:00
Seth Hoenig
b432f377cf api: return sorted results in certain list endpoints
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.

- Eval.List
- Deployment.List
2022-02-15 13:48:28 -06:00
Seth Hoenig
53577ea398 Merge pull request #11955 from hashicorp/f-update-gopsutil
Update gopsutil to 3.21.12
2022-02-15 08:31:57 -06:00
Seth Hoenig
5ac59de999 cl: shorten changelog entry 2022-02-15 08:31:25 -06:00
Tim Gross
7c02750308 changelog entry (#12072) 2022-02-15 09:00:30 -05:00
Seth Hoenig
a56b79589a Merge pull request #12066 from hashicorp/f-make-golint-faster
build: allow golangci-lint to use more than 1 core
2022-02-15 08:00:07 -06:00
Alex Holyoake
11dcb87512 config: merge ReservableCores in clientConfig (#12044) 2022-02-15 08:36:37 -05:00
Seth Hoenig
34de8b5676 Merge pull request #12069 from alrs/scheduler-test-err
scheduler: fix dropped test error
2022-02-15 07:29:50 -06:00
Lars Lehtonen
f8d472a18c scheduler: fix dropped test error 2022-02-14 22:11:45 -08:00
Seth Hoenig
9ec605ea83 build: allow golangci-lint to use more than 1 core
Since switching to `golangci-lint` we have set the `-j 1` flag, which
restricts the tool to using 1 CPU thread.

This PR removes the flag so `make check` takes less time on good
computers.
2022-02-14 16:56:58 -06:00
James Rasell
282eb10a40 Merge pull request #12052 from hashicorp/b-taskrunner-track-deregistered-call
client: track service deregister call so it's only called once.
2022-02-14 09:01:26 +01:00
Tim Gross
4afc67b700 csi: volume cli prefix matching should accept exact match (#12051)
The `volume detach`, `volume deregister`, and `volume status` commands
accept a prefix argument for the volume ID. Update the behavior on
exact matches so that if there is more than one volume that matches
the prefix, we should only return an error if one of the volume IDs is
not an exact match. Otherwise we won't be able to use these commands
at all on those volumes. This also makes the behavior of these commands
consistent with `job stop`.
2022-02-11 08:53:03 -05:00
Tim Gross
16baefcb45 csi: provide CSI_ENDPOINT env var to plugins (#12050)
The CSI specification says:
> The CO SHALL provide the listen-address for the Plugin by way of the
`CSI_ENDPOINT` environment variable.

Note that plugins without filesystem isolation won't have the plugin
dir bind-mounted to their alloc dir, but we can provide a path to the
socket anyways.

Refactor to use opts struct for plugin supervisor hook config.
The parameter list for configuring the plugin supervisor hook has
grown enough where is makes sense to use an options struct similiar to
many of the other task runner hooks (ex. template).
2022-02-11 08:46:21 -05:00
James Rasell
2606188664 Merge pull request #12053 from marcaurele/fix-typo
doc(typo): technical typo in advertised example
2022-02-11 14:27:12 +01:00
James Rasell
d1ffc23715 Merge pull request #12041 from hashicorp/b-gh-12040
changelog: add entry for #12040
2022-02-11 10:15:09 +01:00
James Rasell
72f411c986 client: track service deregister call so it's only called once.
In certain task lifecycles the taskrunner service deregister call
could be called three times for a task that is exiting. Whilst
each hook caller of deregister has its own purpose, we should try
and ensure it is only called once during the shutdown lifecycle of
a task.

This change therefore tracks when deregister has been called, so
that subsequent calls are noop. In the event the task is
restarting, the deregister value is reset to ensure proper
operation.
2022-02-11 09:29:38 +01:00
Derek Strickland
cefc58dd7b reconciler: refactor computeGroup (#12033)
The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include.

Renamed functions

- handleGroupCanaries -> cancelUnneededCanaries
- handleDelayedLost -> createLostLaterEvals
- handeDelayedReschedules -> createRescheduleLaterEvals

New functions

- filterAndStopAll
- initializeDeploymentState
- requiresCanaries
- computeCanaries
- computeUnderProvisionedBy
- computeReplacements
- computeDestructiveUpdates
- computeMigrations
- createDeployment
- isDeploymentComplete
2022-02-10 16:24:51 -05:00
Luiz Aoqui
6a3368a08f docs: add upgrade note and ACL requirements for the job submit endpoint (#12046) 2022-02-10 15:35:16 -05:00
Luiz Aoqui
6d7813d571 update download to Nomad v1.2.6 (#12042) 2022-02-10 15:33:28 -05:00
Luiz Aoqui
af33237371 Merge pull request #12045 from hashicorp/merge-release-1.2.6-branch
Merge release 1.2.6 branch
2022-02-10 15:12:40 -05:00
Luiz Aoqui
096934a5a5 prepare for next release 2022-02-10 14:56:11 -05:00
Luiz Aoqui
bc333c2560 Merge tag 'v1.2.6' into merge-release-1.2.6-branch
Version 1.2.6
2022-02-10 14:55:34 -05:00
Marc-Aurèle Brothier
0cc28e9578 small typo in advertised example 2022-02-10 13:53:05 +01:00
James Rasell
7f0435ae7e changelog: add entry for #12040 2022-02-10 08:36:32 +01:00
Nomad Release Bot
95514d5696 Release v1.2.6 2022-02-10 03:26:34 +00:00
Nomad Release bot
a6c6b475db Generate files for 1.2.6 release 2022-02-10 02:47:03 +00:00
Luiz Aoqui
a3319d7d76 docs: add 1.2.6 to changelog 2022-02-09 19:59:37 -05:00
Tim Gross
c49359ad58 scheduler: prevent panic in spread iterator during alloc stop
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.

To trigger this state:
* The job must have `max_parallel = 0` and a `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
  failed allocation.

In this scenario, the desired changes include `(place 1+) (stop
1+), (ignore n) (canary 1)`. Before the scheduler can place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.

This changeset resets the spread iterator internal state when setting
the job, logging with a bypass around the bug in case we hit similar
cases, and a test that panics the scheduler without the patch.
2022-02-09 19:53:06 -05:00
Luiz Aoqui
1aa3b56108 api: prevent excessice CPU load on job parse
Add new namespace ACL requirement for the /v1/jobs/parse endpoint and
return early if HCLv2 parsing fails.

The endpoint now requires the new `parse-job` ACL capability or
`submit-job`.
2022-02-09 19:51:47 -05:00
Seth Hoenig
b3c0e6a7a5 client: check escaping of alloc dir using symlinks
This PR adds symlink resolution when doing validation of paths
to ensure they do not escape client allocation directories.
2022-02-09 19:50:13 -05:00
Seth Hoenig
6445da9baf client: fix race condition in use of go-getter
go-getter creates a circular dependency between a Client and Getter,
which means each is inherently thread-unsafe if you try to re-use
on or the other.

This PR fixes Nomad to no longer make use of the default Getter objects
provided by the go-getter package. Nomad must create a new Client object
on every artifact download, as the Client object controls the Src and Dst
among other things. When Caling Client.Get, the Getter modifies its own
Client reference, creating the circular reference and race condition.

We can still achieve most of the desired connection caching behavior by
re-using a shared HTTP client with transport pooling enabled.
2022-02-09 19:48:28 -05:00
Charlie Voiselle
1e29872d15 Add changelog 2022-02-09 19:31:42 -05:00
Tim Gross
05b99001ca CSI: use job status not alloc status for plugin updates from summary (#12027)
When an allocation is updated, the job summary for the associated job
is also updated. CSI uses the job summary to set the expected count
for controller and node plugins. We incorrectly used the allocation's
server status instead of the job status when deciding whether to
update or remove the job from the plugins. This caused a node drain or
other terminal state for an allocation to clear the expected count for
the entire plugin.

Use the job status to guide whether to update or remove the expected
count.

The existing CSI tests for the state store incorrectly modeled the
updates we received from servers vs those we received from clients,
leading to test assertions that passed when they should not.

Rework the tests to clarify each step in the lifecycle and rename CSI state
store functions for clarity
2022-02-09 11:51:49 -05:00
Tim Gross
b3212a5b21 docs and changelog for nomad config validate (#12031) 2022-02-09 10:20:45 -05:00
Kevin Schoonover
6633f8d908 fingerprint: remove metadata from digitalocean (#12032) 2022-02-09 07:31:45 -05:00
Thomas Lefebvre
41f84c657a Add config command and config validate subcommand to nomad CLI (#9198) 2022-02-08 16:52:35 -05:00
Tim Gross
79e8d394b4 fingerprint: digitalocean fingerprint test requires metadata header (#12028) 2022-02-08 16:35:13 -05:00
Seth Hoenig
0ae882a3da Merge pull request #12026 from hashicorp/f-update-aws
env: update aws cpu configs
2022-02-08 13:56:50 -06:00
Seth Hoenig
652de761bf env: update aws cpu configs
By running the tools/ec2info tool
2022-02-08 12:44:00 -06:00
Tim Gross
b0b7a49439 scheduler: seed random shuffle nodes with eval ID (#12008)
Processing an evaluation is nearly a pure function over the state
snapshot, but we randomly shuffle the nodes. This means that
developers can't take a given state snapshot and pass an evaluation
through it and be guaranteed the same plan results.

But the evaluation ID is already random, so if we use this as the seed
for shuffling the nodes we can greatly reduce the sources of
non-determinism. Unfortunately golang map iteration uses a global
source of randomness and not a goroutine-local one, but arguably
if the scheduler behavior is impacted by this, that's a bug in the
iteration.
2022-02-08 12:16:33 -05:00
Seth Hoenig
e2b69dcb62 Merge pull request #12024 from hashicorp/docs-update-cl
changelog: update changelog for DO
2022-02-08 10:29:09 -06:00
Seth Hoenig
da42b2845d cl: fix DO name
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-02-08 10:28:57 -06:00
Seth Hoenig
06b73afdfa changelog: update changelog for DO 2022-02-08 08:43:49 -06:00