Commit Graph

23015 Commits

Author SHA1 Message Date
James Rasell
bab219a8ba agent: fix panic when logging about protocol version config use. (#12962)
The log line comes before the agent logger has been set up,
therefore we need to use the UI logging to avoid a panic.
2022-05-13 09:28:43 +02:00
Michael Schurter
e611b099d3 docs: link s/port-plan-failure to more helpful doc (#12968)
The shortlink /s/port-plan-failure is logged when a plan for a node is
rejected to help users debug and mitigate repeated `plan for node
rejected` failures.

The current link to #9506 is... less than useful. It is not clear to
users what steps they should take to either fix their cluster or
contribute to the issue.

While .../monitoring-nomad#progess isn't as comprehensive as it could
be, it's a much more gentle introduction to the class of bug than the
original issue.
2022-05-12 13:59:17 -07:00
Tim Gross
1231d8140b docs: note that already-dispatched jobs cannot be updated (#12973) 2022-05-12 16:18:42 -04:00
Phil Renaud
11472408e1 Visual diff tests seed-stabilized by default (#12965)
* Seed-stabilization by default

* Hide right-column of topology viz route

* Remove seedless run from the test:* suite

* Related evals paths render too late

* Vis:Hidden another topo viz unstable item
2022-05-12 16:09:19 -04:00
Tim Gross
f0031cf163 docs: remove beta tag for CSI from sidebar (#12970) 2022-05-12 14:12:40 -04:00
Eng Zer Jun
fca4ee8e05 test: use T.TempDir to create temporary test directory (#12853)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, a temporary directory created using `ioutil.TempDir`
needed to be removed manually by calling `os.RemoveAll`, which was omitted
in some tests. The error handling boilerplate e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}()
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix TestLogmon_Start_restart on Windows

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestConsul_Integration

t.TempDir fails to perform the cleanup properly because the folder is
still in use

testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-05-12 11:42:40 -04:00
Michael Schurter
9347613d9a docs: add sysbatch to scheduling internals (#12954) 2022-05-11 17:06:17 -07:00
Luiz Aoqui
3bb3b1b161 prepare for next release (#12956) 2022-05-11 17:42:53 -04:00
Seth Hoenig
894c2e61dd build: use new version of hc-install (#12937)
https://github.com/shoenig/hc-install/pull/2

Uses new version of hc-install which supports the new
json content type reported by api.releases.hashicorp.com
2022-05-10 15:28:29 -04:00
Georges-Etienne Legendre
992c2f6c62 Fix Exec not working with reverse proxy X-Nomad-Token (#12925)
* Capture token secret on fetch

* Fix tests

* Fix lint errors
2022-05-10 13:42:12 -04:00
modrake
b5665129cd Merge pull request #12913 from hashicorp/mdrake/svc-acct-codeowner
add service acct to codeowners for backport merging
2022-05-06 10:44:31 -07:00
Morgan Drake
a0ecdac67a add service acct to codeowners for backport merging 2022-05-06 10:06:20 -07:00
Chetan Sarva
76e6b5d27e docs: add version note to nomad services template (#12910) 2022-05-06 17:39:27 +02:00
Phil Renaud
592222bbca Changelog for visual diff tests (#12909) 2022-05-06 11:29:10 -04:00
Luiz Aoqui
4df648593f ci: update backport assistant workflow (#12899)
Remove the step to automatically backport `backport/website` PRs to the
latest release. This will be done manually by adding the proper tags.

Also use squash backports to match the pattern we use for `main`.
2022-05-06 10:15:59 -04:00
James Rasell
3956854cc4 fsm: add service registration snapshot persistence. (#12896) 2022-05-06 15:53:27 +02:00
Luiz Aoqui
d7d578b3f4 ci: revert file changes and add some checks (#12873)
During the release there are several files that need to be modified:

  - .release/ci.hcl: the notification channel needs to be updated to a
    channel with greater team visibility during the release.
  - version/version.go: the Version and VersionPrerelease variables
    need to be set so they match the release version.

After the release these files need to be reverted.

For GA releases the following additional changes also need to happen:

  - version/version.go: the Version variable needs to be bumped to the
    next version number.
  - GNUMakefile: the LAST_RELEASE variable needs to be set to the
    version that was just released.

Since the release process will commit file changes to the branch being
used for the release, it should _never_ run on main, so the first step
is now to protect against that.

It also adds a validation to make sure the user input version is correct.

After looking at the different release options and steps I noticed that
automatic CHANGELOG generation is actually the exception, so it would be
better to have the default to be false.
2022-05-05 18:07:51 -04:00
Phil Renaud
f1fdca55f4 Chronological most-recent evals by default (#12847)
* Chronological most-recent evals by default

* Adding reverse: true to the list of expected queryparams in test

* changelog
2022-05-05 16:11:27 -04:00
Phil Renaud
f34938d9f0 Percy snapshot tests (#12872)
* Sample percy test added

* Node engine up to 14.x for UI prep

* Force ui test rerun

* Updated config.yml

* Node v upgraded to 14 for docker image

* Expect length in test

* Running ember tests under percy exec

* Percy exec format

* Percy cli added

* Noop to rerun tests with updated percy_token

* Evals full list and details open snapshots

* Pretty legit use of assert so disable the warning

* Jobs list tests

* Snapshots for top-level clients, servers, ACL, topology, and storage lists

* Expect caveat for Topology test

* Stabilizing tests with faker seeded to 1

* Seed-stabilizing any tests with percySnapshots

* Faker import

* Drop unused param

* Assets and test audit using an older node version

* New strategy: avoid seeding, just use percyCSS to hide certain things
2022-05-05 16:05:13 -04:00
Seth Hoenig
7c91ac0712 Merge pull request #12875 from hashicorp/b-cgroupsv2-task-restarts
cgroups: make sure cgroup still exists after task restart
2022-05-05 10:54:29 -05:00
Tim Gross
29c014fbb8 docs: add missing set_contains_any constraint docs (#12886)
This constraint and affinity was added in 0.9.x but was only
documented for affinities. Close that documentation gap.
2022-05-05 11:11:05 -04:00
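
A rough sketch of what a `set_contains_any` check amounts to, assuming both the node attribute and the constraint value are comma-separated lists and that surrounding whitespace is ignored. The function names are hypothetical and this is not Nomad's scheduler code.

```go
package main

import "strings"

// splitSet turns a comma-separated string into a set of trimmed,
// non-empty elements.
func splitSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, part := range strings.Split(s, ",") {
		if item := strings.TrimSpace(part); item != "" {
			set[item] = struct{}{}
		}
	}
	return set
}

// setContainsAny reports whether at least one element of wanted appears
// in attribute. Both arguments are comma-separated lists.
func setContainsAny(attribute, wanted string) bool {
	have := splitSet(attribute)
	for item := range splitSet(wanted) {
		if _, ok := have[item]; ok {
			return true
		}
	}
	return false
}
```

Under these assumptions, `setContainsAny("a,b,c", "x, b")` is true because `b` is shared, while `setContainsAny("a,b", "x,y")` is false.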
Bryce Kalow
9412a8409b website: remove source code and related CI jobs (#12596)
* remove website source code and related circle jobs

* remove data files

* updates platform-cli

* update local instructions

* updates package-lock
2022-05-05 09:53:22 -05:00
Seth Hoenig
37ffd2ffa2 cgroups: make sure cgroup still exists after task restart
This PR modifies raw_exec and exec to ensure the cgroup for a task
they are driving still exists during a task restart. These drivers
have the same bug but with different root cause.

For raw_exec, we were removing the cgroup in 2 places - the cpuset
manager, and in the unix containment implementation (the thing that
uses freezer cgroup to clean house). During a task restart, the
containment would remove the cgroup, and when the task runner hooks
went to start again would block on waiting for the cgroup to exist,
which will never happen, because it gets created by the cpuset manager
which only runs as an alloc pre-start hook. The fix here is to simply
not delete the cgroup in the containment implementation; killing the
PIDs is enough. The removal happens in the cpuset manager later anyway.

For exec, it's the same idea, except DestroyTask is called on task
failure, which in turn calls into libcontainer, which in turn deletes
the cgroup. In this case we do not have control over the deletion of
the cgroup, so instead we hack the cgroup back into life after the
call to DestroyTask.

All of this only applies to cgroups v2.
2022-05-05 09:51:03 -05:00
James Rasell
0310a963b1 core: add namespace to plan for node rejected log line. (#12868) 2022-05-05 10:56:40 +02:00
James Rasell
52faa167dd release: fix hcl linting error within CI file. (#12867) 2022-05-04 10:48:42 +02:00
Michele Degges
ed0d375ef4 Add config key to the promote-staging event (#12857) 2022-05-03 20:33:14 -07:00
Michele Degges
d551cda6f5 Add config key to the promote-staging event 2022-05-03 08:51:19 -07:00
Tim Gross
9d5c7b5d94 CSI: node drain should end once only plugins remain (#12846)
In #12324 we made it so that plugins wait until the node drain is
complete, as we do for system jobs. But we neglected to mark the node
drain as complete once only plugins (or system jobs) remain, which
means that the node drain is left in a draining state until the
`deadline` time expires. This was incorrectly documented as expected
behavior in #12324.
2022-05-03 10:20:22 -04:00
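
The completion rule described in 9d5c7b5d94 can be illustrated with a small predicate. This is a sketch under stated assumptions: the `alloc` struct, its fields, and `drainComplete` are hypothetical simplifications, not the drainer's real types.

```go
package main

// alloc is a hypothetical, pared-down view of an allocation on a
// draining node.
type alloc struct {
	jobType   string // for example "service", "batch", or "system"
	csiPlugin bool   // true if the allocation runs a CSI plugin
	terminal  bool   // true if the allocation has already stopped
}

// drainComplete reports whether a node drain can be marked complete:
// every allocation still running must be a system job or a CSI plugin.
func drainComplete(allocs []alloc) bool {
	for _, a := range allocs {
		if a.terminal {
			continue // stopped allocations never block the drain
		}
		if a.jobType == "system" || a.csiPlugin {
			continue // these are allowed to outlive the drain
		}
		return false // a regular workload is still running
	}
	return true
}
```

With a check of this shape, the drain finishes as soon as the last regular workload stops instead of waiting for the `deadline` to expire.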
Alex Carpenter
e0ca2f4fd4 [WIP] feat: homepage and use case pages redesign (#11873)
* feat: connect homepage and use case pages

* fix: internalLink usage

* fix: query name

* chore: add homepage patterns

* chore: remove offerings

* chore: add intro features

* chore: bump subnav

* chore: updating patterns

* chore: add use case to the subnav

* chore: cleanup unused import

* chore: remove subnav border
2022-05-03 09:06:00 -04:00
Luiz Aoqui
d3f26a5536 Update CHANGELOG for 1.3.0-rc.1 (#12849) 2022-05-02 16:52:00 -04:00
Seth Hoenig
4d404b3958 Merge pull request #12740 from hashicorp/cleanup-makefile-help
build: add missing help descriptions to makefile
2022-05-02 10:33:22 -05:00
Seth Hoenig
30ec18da28 Merge pull request #12840 from hashicorp/docs-nvidia-updates
docs: update nvidia driver documentation
2022-05-02 10:07:02 -05:00
Luiz Aoqui
c333eb6071 ui: fix an error when navigating to a task group (#12832)
Clicking on a task group row in the job details page would throw the
error:

Uncaught Error: You didn't provide enough string/numeric parameters to satisfy all of the dynamic segments for route jobs.job.task-group. Missing params: name
    createParamHandlerInfo http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4814
    applyToHandlers http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4804
    applyToState http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4801
    getTransitionByIntent http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4843
    transitionByIntent http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4836
    refresh http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4885
    refresh http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2254
    queryParamsDidChange http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2326
    k http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2423
    triggerEvent http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2349
    fireQueryParamDidChange http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4863
    getTransitionByIntent http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4848
    transitionByIntent http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4836
    doTransition http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4853
    transitionTo http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:4882
    _doTransition http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2392
    transitionTo http://localhost:4646/ui/assets/vendor-194b1e0d68d11ef7a4bf334eb30ba74d.js:2177
    gotoTaskGroup http://localhost:4646/ui/assets/nomad-ui-4a2c1941e03e60e1feef715f23cf268c.js:623
...

This was caused because the attribute being passed to the transitionTo
function was not the task group name, but the whole model.
2022-05-02 11:01:19 -04:00
Seth Hoenig
d352ab25c4 docs: update nvidia driver documentation
notably:
- name of the compiled binary is 'nomad-device-nvidia', not 'nvidia-gpu'
- link to Nvidia docs for installing the container runtime toolkit
- list docker v19.03 as minimum version, to track with nvidia's new container runtime toolkit
2022-05-02 09:11:05 -05:00
Matus Goljer
89a794c905 nomad can also install autocomplete for fish shell (#12834) 2022-05-02 09:26:55 -04:00
Luiz Aoqui
dfda28daab ci: remove unused CircleCI Makefile (#12828)
This Makefile was used to generate the full config.yml from smaller
sub-files, but this is not done anymore.
2022-04-29 15:25:23 -04:00
Tim Gross
342a4ee735 docs: clarify capacity_min/max for volumes (#12825)
The capacity fields for `create volume` set bounds on the resulting
size of the volume, but the ultimate size of the volume will be
determined by the storage provider (between the min and max). Clarify
this in the documentation and provide a suggestion for how to set an
exact size.
2022-04-29 13:38:30 -04:00
Derek Strickland
2118226ca6 docs: Add known limitations callouts to Max Client Disconnect section (#12801)
* docs: Add known limitations callouts to Max Client Disconnect section
2022-04-28 16:17:14 -04:00
Phil Renaud
3c4d09cd61 Moves the evaluations table toolbar outside of the table-container (#12799) 2022-04-28 16:08:46 -04:00
Luiz Aoqui
2ffa710859 ci: update the hashicorp/actions-generate-metadata action version (#12813) 2022-04-28 15:24:55 -04:00
Jai
c180c8d463 fix broken link to task-group in Recent Allocation table in jobs.job.index (#12765)
* chore:  run prettier on hbs files

* ui:  ensure to pass a real job object to task-group link

* chore:  add changelog entry

* chore: prettify template

* ui:  template helper for formatting jobId in LinkTo component

* ui:  handle async relationship

* ui:  pass in job id to model arg instead of job model

* update test for serialized namespace

* ui:  defend against null  in tests

* ui:  prettified template added whitespace

* ui:  rollback ember-data to 3.24 because watcher return undefined on abort

* ui: use format-job-helper instead of job model via alloc

* ui: fix whitespace in template caused by prettier using template helper

* ui: update test for new namespace

* ui: revert prettier change

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-04-28 14:02:15 -04:00
Dave May
522b630825 debug: add version constraint to avoid pprof panic (#12807) 2022-04-28 13:18:55 -04:00
Luiz Aoqui
9dccbb1cb0 ci: fix build workflow trigger on push (#12806) 2022-04-28 11:15:54 -04:00
Luiz Aoqui
d63158786f ci: setup release process with CRT (#12781) 2022-04-27 20:14:23 -04:00
Derek Strickland
de59d73009 e2e: Wait for deployment to finish before disconnect (#12795)
* Wait for deployment to finish
* Don't reschedule disconnect or restart-node jobs
2022-04-27 12:27:03 -04:00
Phil Renaud
bae5bc16b0 [ui, mirage] Evaluation mocks (#12471)
* Linear and Branching mock evaluations

* De-comment

* test-trigger

* Making evaluation trees dynamic

* Reinstated job relationship on eval mock

* Dasherize job prefix back to normal

* Handle bug where UUIDKey is not present on job

* Appending node to eval

* Job ID as a passed property

* Remove unused import

* Branching evals set up as generatable
2022-04-27 12:11:24 -04:00
Tim Gross
3671ea6a8f remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
Michael Schurter
e7924e35cb client: fix waiting on preempted alloc (#12779)
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high throughput batch scheduling
heavy systems.

However the RPC error made no sense! `previous_alloc` in the logs was a
valid 36 character UUID!

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.

First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
2022-04-26 13:14:43 -07:00
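
The class of bug fixed in e7924e35cb, a constructor that ignores the ID passed to it, can be reproduced in miniature. The types and function names below are hypothetical stand-ins, not the real `allocwatcher` package.

```go
package main

// Alloc is a hypothetical, pared-down allocation.
type Alloc struct {
	ID                 string
	PreviousAllocation string // empty for an alloc that preempts others
}

// Config carries the newly placed allocation to a watcher constructor.
type Config struct {
	Alloc *Alloc
}

// watcher records which earlier allocation it waits on.
type watcher struct {
	allocID     string
	prevAllocID string
}

// newWatcherBuggy ignores watchedAllocID and reads PreviousAllocation,
// which is empty when the new alloc is a preemptor.
func newWatcherBuggy(c Config, watchedAllocID string) *watcher {
	return &watcher{allocID: c.Alloc.ID, prevAllocID: c.Alloc.PreviousAllocation}
}

// newWatcherFixed uses the ID supplied by the caller.
func newWatcherFixed(c Config, watchedAllocID string) *watcher {
	return &watcher{allocID: c.Alloc.ID, prevAllocID: watchedAllocID}
}

// newPreemptionWatchers creates one watcher per preempted allocation,
// using whichever constructor is passed in as build.
func newPreemptionWatchers(c Config, preempted []string, build func(Config, string) *watcher) []*watcher {
	watchers := make([]*watcher, 0, len(preempted))
	for _, id := range preempted {
		watchers = append(watchers, build(c, id))
	}
	return watchers
}
```

With the buggy constructor every watcher ends up with an empty `prevAllocID`, which matches the empty-ID lookup the commit message describes; with the fixed one each watcher tracks the ID it was given.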
Tim Gross
059c89dff0 E2E: move volume mounts test to use golang's stdlib test runner (#12788)
Part of ongoing work to remove the old E2E framework code.
2022-04-26 14:28:20 -04:00
Tim Gross
26b0e04717 E2E: remove old CLI for driving provisioning (#12787)
We moved off the old provisioning process for nightly E2E to one driven
entirely by Terraform quite a while back now. We're in the slow
process of removing the framework code for this test-by-test, but this
chunk of code no longer has any callers.
2022-04-26 13:43:25 -04:00