Commit Graph

22492 Commits

Author SHA1 Message Date
Seth Hoenig
07f4227d76 Merge pull request #12054 from hashicorp/b-creation-indexes
api: return sorted results in certain list endpoints
2022-02-15 15:08:38 -06:00
Seth Hoenig
b432f377cf api: return sorted results in certain list endpoints
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.

- Eval.List
- Deployment.List
2022-02-15 13:48:28 -06:00
Seth Hoenig
53577ea398 Merge pull request #11955 from hashicorp/f-update-gopsutil
Update gopsutil to 3.21.12
2022-02-15 08:31:57 -06:00
Seth Hoenig
5ac59de999 cl: shorten changelog entry 2022-02-15 08:31:25 -06:00
Tim Gross
7c02750308 changelog entry (#12072) 2022-02-15 09:00:30 -05:00
Seth Hoenig
a56b79589a Merge pull request #12066 from hashicorp/f-make-golint-faster
build: allow golangci-lint to use more than 1 core
2022-02-15 08:00:07 -06:00
Alex Holyoake
11dcb87512 config: merge ReservableCores in clientConfig (#12044) 2022-02-15 08:36:37 -05:00
Seth Hoenig
34de8b5676 Merge pull request #12069 from alrs/scheduler-test-err
scheduler: fix dropped test error
2022-02-15 07:29:50 -06:00
Lars Lehtonen
f8d472a18c scheduler: fix dropped test error 2022-02-14 22:11:45 -08:00
Seth Hoenig
9ec605ea83 build: allow golangci-lint to use more than 1 core
Since switching to `golangci-lint` we have set the `-j 1` flag, which
restricts the tool to using 1 CPU thread.

This PR removes the flag so `make check` takes less time on good
computers.
2022-02-14 16:56:58 -06:00
James Rasell
282eb10a40 Merge pull request #12052 from hashicorp/b-taskrunner-track-deregistered-call
client: track service deregister call so it's only called once.
2022-02-14 09:01:26 +01:00
Tim Gross
4afc67b700 csi: volume cli prefix matching should accept exact match (#12051)
The `volume detach`, `volume deregister`, and `volume status` commands
accept a prefix argument for the volume ID. Update the behavior on
exact matches so that if more than one volume matches the prefix, we
only return an error if none of the volume IDs is an exact match.
Otherwise we won't be able to use these commands
at all on those volumes. This also makes the behavior of these commands
consistent with `job stop`.
2022-02-11 08:53:03 -05:00
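A minimal sketch of the exact-match rule described in 4afc67b700, assuming a plain list of volume IDs. The function `resolveVolumeID` and both error values are hypothetical names for illustration and are not taken from the Nomad CLI code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical errors for this sketch only.
var (
	ErrNotFound        = errors.New("no volume matched the prefix")
	ErrAmbiguousPrefix = errors.New("prefix matched multiple volumes")
)

// resolveVolumeID returns the single volume ID that prefix refers to.
// An exact match wins even when other IDs share the same prefix.
func resolveVolumeID(prefix string, ids []string) (string, error) {
	var matches []string
	for _, id := range ids {
		if id == prefix {
			return id, nil // exact match: accept immediately
		}
		if strings.HasPrefix(id, prefix) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", ErrNotFound
	case 1:
		return matches[0], nil
	default:
		return "", ErrAmbiguousPrefix
	}
}

func main() {
	ids := []string{"vol", "vol-1", "vol-2"}

	id, err := resolveVolumeID("vol", ids)
	fmt.Println(id, err) // "vol" is usable despite sharing a prefix

	_, err = resolveVolumeID("vol-", ids)
	fmt.Println(err) // ambiguous: two candidates, none exact
}
```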
Tim Gross
16baefcb45 csi: provide CSI_ENDPOINT env var to plugins (#12050)
The CSI specification says:
> The CO SHALL provide the listen-address for the Plugin by way of the
`CSI_ENDPOINT` environment variable.

Note that plugins without filesystem isolation won't have the plugin
dir bind-mounted to their alloc dir, but we can provide a path to the
socket anyways.

Refactor to use opts struct for plugin supervisor hook config.
The parameter list for configuring the plugin supervisor hook has
grown enough where it makes sense to use an options struct similar to
many of the other task runner hooks (ex. template).
2022-02-11 08:46:21 -05:00
James Rasell
2606188664 Merge pull request #12053 from marcaurele/fix-typo
doc(typo): technical typo in advertised example
2022-02-11 14:27:12 +01:00
James Rasell
d1ffc23715 Merge pull request #12041 from hashicorp/b-gh-12040
changelog: add entry for #12040
2022-02-11 10:15:09 +01:00
James Rasell
72f411c986 client: track service deregister call so it's only called once.
In certain task lifecycles the taskrunner service deregister call
could be called three times for a task that is exiting. Whilst
each hook caller of deregister has its own purpose, we should try
and ensure it is only called once during the shutdown lifecycle of
a task.

This change therefore tracks when deregister has been called, so
that subsequent calls are noop. In the event the task is
restarting, the deregister value is reset to ensure proper
operation.
2022-02-11 09:29:38 +01:00
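The once-only tracking described in 72f411c986 could look roughly like the sketch below. The `serviceHook` type, its fields, and the `calls` counter are hypothetical and exist only to illustrate the idea; the real task runner hooks are more involved.

```go
package main

import (
	"fmt"
	"sync"
)

// serviceHook is a hypothetical stand-in for a task runner service hook.
type serviceHook struct {
	mu           sync.Mutex
	deregistered bool
	calls        int // counts deregistrations actually performed (demo only)
}

// deregister performs the deregistration at most once per task run.
func (h *serviceHook) deregister() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.deregistered {
		return // subsequent calls are no-ops
	}
	h.calls++ // the real work would happen here
	h.deregistered = true
}

// restart resets the flag so the next shutdown deregisters again.
func (h *serviceHook) restart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.deregistered = false
}

func main() {
	h := &serviceHook{}
	h.deregister()
	h.deregister()
	h.deregister()
	fmt.Println("after three calls:", h.calls)

	h.restart()
	h.deregister()
	fmt.Println("after restart and one call:", h.calls)
}
```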
Derek Strickland
cefc58dd7b reconciler: refactor computeGroup (#12033)
The allocReconciler's computeGroup function contained a significant amount of inline logic whose intent was difficult to understand. This commit extracts that inline logic into the following intention-revealing subroutines. It also updates the function internals to improve maintainability and renames some existing functions for the same purpose. New or renamed functions include:

Renamed functions

- handleGroupCanaries -> cancelUnneededCanaries
- handleDelayedLost -> createLostLaterEvals
- handleDelayedReschedules -> createRescheduleLaterEvals

New functions

- filterAndStopAll
- initializeDeploymentState
- requiresCanaries
- computeCanaries
- computeUnderProvisionedBy
- computeReplacements
- computeDestructiveUpdates
- computeMigrations
- createDeployment
- isDeploymentComplete
2022-02-10 16:24:51 -05:00
Luiz Aoqui
6a3368a08f docs: add upgrade note and ACL requirements for the job submit endpoint (#12046) 2022-02-10 15:35:16 -05:00
Luiz Aoqui
6d7813d571 update download to Nomad v1.2.6 (#12042) 2022-02-10 15:33:28 -05:00
Luiz Aoqui
af33237371 Merge pull request #12045 from hashicorp/merge-release-1.2.6-branch
Merge release 1.2.6 branch
2022-02-10 15:12:40 -05:00
Luiz Aoqui
096934a5a5 prepare for next release 2022-02-10 14:56:11 -05:00
Luiz Aoqui
bc333c2560 Merge tag 'v1.2.6' into merge-release-1.2.6-branch
Version 1.2.6
2022-02-10 14:55:34 -05:00
Marc-Aurèle Brothier
0cc28e9578 small typo in advertised example 2022-02-10 13:53:05 +01:00
James Rasell
7f0435ae7e changelog: add entry for #12040 2022-02-10 08:36:32 +01:00
Nomad Release Bot
95514d5696 Release v1.2.6 2022-02-10 03:26:34 +00:00
Nomad Release bot
a6c6b475db Generate files for 1.2.6 release 2022-02-10 02:47:03 +00:00
Luiz Aoqui
a3319d7d76 docs: add 1.2.6 to changelog 2022-02-09 19:59:37 -05:00
Tim Gross
c49359ad58 scheduler: prevent panic in spread iterator during alloc stop
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.

To trigger this state:
* The job must have `max_parallel = 0` and a `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
  failed allocation.

In this scenario, the desired changes include `(place 1+) (stop
1+), (ignore n) (canary 1)`. Before the scheduler can place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.

This changeset resets the spread iterator's internal state when setting
the job, adds logging with a bypass around the bug in case we hit similar
cases, and adds a test that panics the scheduler without the patch.
2022-02-09 19:53:06 -05:00
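The fix in c49359ad58 has two parts: reset cached state in `SetJob`, and guard the lookup so a missing entry is logged and bypassed instead of dereferenced. The sketch below shows that pattern on simplified, hypothetical types (`job`, `spreadInfo`, `spreadIterator`, `weight`); it does not reproduce the scheduler's actual data structures.

```go
package main

import (
	"fmt"
	"log"
)

// job is a hypothetical, simplified job: task group name -> spread attributes.
type job struct {
	version int
	spreads map[string][]string
}

// spreadInfo is hypothetical per-task-group state cached by the iterator.
type spreadInfo struct {
	weights map[string]int
}

type spreadIterator struct {
	job       *job
	groupInfo map[string]*spreadInfo
}

// SetJob rebuilds the cached state from scratch so that nothing computed
// for a previously set job version leaks into the next pass.
func (it *spreadIterator) SetJob(j *job) {
	it.job = j
	it.groupInfo = make(map[string]*spreadInfo) // reset internal state
	for tg, attrs := range j.spreads {
		info := &spreadInfo{weights: make(map[string]int)}
		for _, a := range attrs {
			info.weights[a] = 1
		}
		it.groupInfo[tg] = info
	}
}

// weight looks up cached state defensively: a missing entry is logged
// and bypassed rather than dereferenced.
func (it *spreadIterator) weight(tg, attr string) int {
	info := it.groupInfo[tg]
	if info == nil {
		log.Printf("no spread info for task group %q, skipping", tg)
		return 0
	}
	return info.weights[attr]
}

func main() {
	it := &spreadIterator{}
	it.SetJob(&job{version: 1, spreads: map[string][]string{"web": {"dc"}}})
	fmt.Println(it.weight("web", "dc"))

	// The new version has no spread block; old state must not survive.
	it.SetJob(&job{version: 2})
	fmt.Println(len(it.groupInfo), it.weight("web", "dc"))
}
```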
Luiz Aoqui
1aa3b56108 api: prevent excessive CPU load on job parse
Add new namespace ACL requirement for the /v1/jobs/parse endpoint and
return early if HCLv2 parsing fails.

The endpoint now requires the new `parse-job` ACL capability or
`submit-job`.
2022-02-09 19:51:47 -05:00
Seth Hoenig
b3c0e6a7a5 client: check escaping of alloc dir using symlinks
This PR adds symlink resolution when doing validation of paths
to ensure they do not escape client allocation directories.
2022-02-09 19:50:13 -05:00
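A rough sketch of the kind of check b3c0e6a7a5 describes: resolve symlinks first, then test whether the result still lies under the allocation directory. The helpers `escapes` and `setupDemo` are hypothetical, and this version assumes the path already exists (`filepath.EvalSymlinks` returns an error otherwise), which the real validation may handle differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// escapes reports whether path (relative to root) points outside root
// once all symlinks have been resolved.
func escapes(root, path string) (bool, error) {
	realRoot, err := filepath.EvalSymlinks(root)
	if err != nil {
		return false, err
	}
	realPath, err := filepath.EvalSymlinks(filepath.Join(realRoot, path))
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(realRoot, realPath)
	if err != nil {
		return false, err
	}
	up := ".."
	return rel == up || strings.HasPrefix(rel, up+string(filepath.Separator)), nil
}

// setupDemo builds a temp tree: alloc/local is a normal directory and
// alloc/link is a symlink to a sibling directory outside alloc.
func setupDemo() (alloc string, cleanup func(), err error) {
	base, err := os.MkdirTemp("", "escape-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(base) }
	alloc = filepath.Join(base, "alloc")
	outside := filepath.Join(base, "outside")
	for _, d := range []string{alloc, outside, filepath.Join(alloc, "local")} {
		if err := os.Mkdir(d, 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	if err := os.Symlink(outside, filepath.Join(alloc, "link")); err != nil {
		cleanup()
		return "", nil, err
	}
	return alloc, cleanup, nil
}

func main() {
	alloc, cleanup, err := setupDemo()
	if err != nil {
		panic(err)
	}
	defer cleanup()

	for _, p := range []string{"local", "link", "../outside"} {
		esc, err := escapes(alloc, p)
		fmt.Printf("%-12s escapes=%v err=%v\n", p, esc, err)
	}
}
```

A purely lexical check would accept `link` because the string never leaves the allocation directory; resolving the symlink is what exposes the escape.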
Seth Hoenig
6445da9baf client: fix race condition in use of go-getter
go-getter creates a circular dependency between a Client and Getter,
which means each is inherently thread-unsafe if you try to re-use
one or the other.

This PR fixes Nomad to no longer make use of the default Getter objects
provided by the go-getter package. Nomad must create a new Client object
on every artifact download, as the Client object controls the Src and Dst
among other things. When calling Client.Get, the Getter modifies its own
Client reference, creating the circular reference and race condition.

We can still achieve most of the desired connection caching behavior by
re-using a shared HTTP client with transport pooling enabled.
2022-02-09 19:48:28 -05:00
Charlie Voiselle
1e29872d15 Add changelog 2022-02-09 19:31:42 -05:00
Tim Gross
05b99001ca CSI: use job status not alloc status for plugin updates from summary (#12027)
When an allocation is updated, the job summary for the associated job
is also updated. CSI uses the job summary to set the expected count
for controller and node plugins. We incorrectly used the allocation's
server status instead of the job status when deciding whether to
update or remove the job from the plugins. This caused a node drain or
other terminal state for an allocation to clear the expected count for
the entire plugin.

Use the job status to guide whether to update or remove the expected
count.

The existing CSI tests for the state store incorrectly modeled the
updates we received from servers vs those we received from clients,
leading to test assertions that passed when they should not.

Rework the tests to clarify each step in the lifecycle and rename CSI state
store functions for clarity.
2022-02-09 11:51:49 -05:00
Tim Gross
b3212a5b21 docs and changelog for nomad config validate (#12031) 2022-02-09 10:20:45 -05:00
Kevin Schoonover
6633f8d908 fingerprint: remove metadata from digitalocean (#12032) 2022-02-09 07:31:45 -05:00
Thomas Lefebvre
41f84c657a Add config command and config validate subcommand to nomad CLI (#9198) 2022-02-08 16:52:35 -05:00
Tim Gross
79e8d394b4 fingerprint: digitalocean fingerprint test requires metadata header (#12028) 2022-02-08 16:35:13 -05:00
Seth Hoenig
0ae882a3da Merge pull request #12026 from hashicorp/f-update-aws
env: update aws cpu configs
2022-02-08 13:56:50 -06:00
Seth Hoenig
652de761bf env: update aws cpu configs
By running the tools/ec2info tool
2022-02-08 12:44:00 -06:00
Tim Gross
b0b7a49439 scheduler: seed random shuffle nodes with eval ID (#12008)
Processing an evaluation is nearly a pure function over the state
snapshot, but we randomly shuffle the nodes. This means that
developers can't take a given state snapshot and pass an evaluation
through it and be guaranteed the same plan results.

But the evaluation ID is already random, so if we use this as the seed
for shuffling the nodes we can greatly reduce the sources of
non-determinism. Unfortunately golang map iteration uses a global
source of randomness and not a goroutine-local one, but arguably
if the scheduler behavior is impacted by this, that's a bug in the
iteration.
2022-02-08 12:16:33 -05:00
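To illustrate the idea in b0b7a49439, the sketch below derives a seed from the evaluation ID and uses it for the shuffle, so the same ID always yields the same node order. How the seed is actually derived is not stated in the commit message; hashing the ID with FNV-1a is an assumption, and `seedFromEvalID` and `shuffleNodes` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// seedFromEvalID turns an evaluation ID into a deterministic seed.
// Using FNV-1a here is an assumption made for this sketch.
func seedFromEvalID(evalID string) int64 {
	h := fnv.New64a()
	h.Write([]byte(evalID))
	return int64(h.Sum64())
}

// shuffleNodes shuffles nodes in place using a source seeded from the
// evaluation ID instead of the global random source.
func shuffleNodes(evalID string, nodes []string) {
	r := rand.New(rand.NewSource(seedFromEvalID(evalID)))
	r.Shuffle(len(nodes), func(i, j int) {
		nodes[i], nodes[j] = nodes[j], nodes[i]
	})
}

func main() {
	first := []string{"node-a", "node-b", "node-c", "node-d"}
	second := []string{"node-a", "node-b", "node-c", "node-d"}

	shuffleNodes("eval-1234", first)
	shuffleNodes("eval-1234", second)

	// Both slices end up in the same order because the seed is the same.
	fmt.Println(first)
	fmt.Println(second)
}
```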
Seth Hoenig
e2b69dcb62 Merge pull request #12024 from hashicorp/docs-update-cl
changelog: update changelog for DO
2022-02-08 10:29:09 -06:00
Seth Hoenig
da42b2845d cl: fix DO name
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-02-08 10:28:57 -06:00
Seth Hoenig
06b73afdfa changelog: update changelog for DO 2022-02-08 08:43:49 -06:00
Seth Hoenig
fa0d8901c4 Merge pull request #12015 from kevinschoonover/main
client/fingerprint: add digitalocean fingerprinter
2022-02-08 08:41:03 -06:00
Dylan Staley
21f7d0113c Merge pull request #11936 from hashicorp/ds.ie11-warning
website: display warning in IE 11
2022-02-07 13:59:41 -08:00
Kevin Schoonover
5cea36639d address comments
Co-authored-by: Seth Hoenig <seth.a.hoenig@gmail.com>
2022-02-07 09:03:48 -08:00
Tim Gross
f811169267 scheduler: recover from panic (#12009)
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.

Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
2022-02-07 11:47:53 -05:00
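The recovery approach in f811169267 relies on Go's `defer`/`recover`. A minimal sketch, assuming a hypothetical `processEval` wrapper rather than the real scheduler `Process` methods:

```go
package main

import "fmt"

// processEval runs process for one evaluation and converts a panic into
// an ordinary error, so the caller can nack the evaluation instead of
// crashing the whole server process.
func processEval(evalID string, process func(evalID string) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic processing eval %q: %v", evalID, r)
		}
	}()
	return process(evalID)
}

func main() {
	err := processEval("eval-1", func(string) error {
		panic("unexpected nil spread attribute map")
	})
	fmt.Println("panicking scheduler:", err)

	err = processEval("eval-2", func(string) error { return nil })
	fmt.Println("healthy scheduler:", err)
}
```

The named return value `err` is what lets the deferred function replace the result after the panic has unwound the stack.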
Kevin Schoonover
7b6f9540db small fixes 2022-02-05 22:23:43 -08:00
Kevin Schoonover
4d4c839796 add digitalocean fingerprinter 2022-02-05 22:17:36 -08:00
Derek Strickland
0263650f27 reconciler: improve variable names and extract methods from inline logic (#12010)
* reconciler: improved variable names and extract methods from inline logic

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-02-05 04:54:19 -05:00