21882 Commits

Author SHA1 Message Date
Michael Schurter
13cc8b3c4a Merge pull request #11331 from shishir-a412ed/init
Add support for --init to docker driver.
2021-10-20 10:49:51 -07:00
Michael Schurter
fceb6cea2f Merge pull request #11347 from shishir-a412ed/cleanup
Code cleanup: Remove extra if clause.
2021-10-20 09:37:10 -07:00
Mahmood Ali
6d35e2fb58 Fix preemption panic (#11346)
Fix a bug where the scheduler may panic when preemption is enabled. The conditions are a bit complicated:
A higher-priority job schedules multiple allocations that preempt multiple other allocations on the same node, due to port/network/device assignments.

The cause of the bug is incidental mutation of internal cached data. `RankedNode` computes and caches proposed allocations in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L42-L53 . But the scheduler then mutates the list to remove preemptible allocs in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L293-L294 , and `RemoveAllocs` mutates the cached slice and sets its tail to `nil`s, triggering a nil-pointer dereference.

I fixed the issue by avoiding the mutation in `RemoveAllocs` - the micro-optimization there doesn't seem necessary.

Fixes https://github.com/hashicorp/nomad/issues/11342
2021-10-19 20:22:03 -04:00
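The slice aliasing described in 6d35e2fb58 above can be illustrated with a minimal sketch. The `Alloc` type and the `removeInPlace` and `removeCopy` functions are hypothetical stand-ins, not Nomad's actual `RemoveAllocs` code; they only show why filtering a slice in place can leave `nil` entries in a cached slice that shares the same backing array, and why building a fresh slice avoids that.

```go
package main

import "fmt"

// Alloc is a hypothetical stand-in for a scheduler allocation.
type Alloc struct{ ID string }

// removeInPlace filters allocs by overwriting the input's backing array and
// nil-ing the unused tail. Any other slice sharing that array now sees nils.
func removeInPlace(allocs []*Alloc, remove map[string]bool) []*Alloc {
	n := 0
	for _, a := range allocs {
		if !remove[a.ID] {
			allocs[n] = a
			n++
		}
	}
	for i := n; i < len(allocs); i++ {
		allocs[i] = nil
	}
	return allocs[:n]
}

// removeCopy builds a fresh slice and leaves the input untouched.
func removeCopy(allocs []*Alloc, remove map[string]bool) []*Alloc {
	out := make([]*Alloc, 0, len(allocs))
	for _, a := range allocs {
		if !remove[a.ID] {
			out = append(out, a)
		}
	}
	return out
}

// countNil reports how many entries of a slice are nil.
func countNil(allocs []*Alloc) int {
	n := 0
	for _, a := range allocs {
		if a == nil {
			n++
		}
	}
	return n
}

// newCache returns a small example of a cached allocation list.
func newCache() []*Alloc {
	return []*Alloc{{ID: "a"}, {ID: "b"}, {ID: "c"}}
}

func main() {
	remove := map[string]bool{"a": true}

	cached := newCache()
	removeInPlace(cached, remove)
	fmt.Println("nil entries after in-place removal:", countNil(cached))

	cached = newCache()
	removeCopy(cached, remove)
	fmt.Println("nil entries after copying removal:", countNil(cached))
}
```

The copying variant costs one extra allocation per call, which is the micro-optimization the commit message considers unnecessary.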
Shishir Mahajan
e14e3555c5 Code cleanup: Remove extra if clause.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
2021-10-19 16:52:11 -07:00
Michael Schurter
94d06a8dcb docs: add #11331 to changelog 2021-10-19 16:30:06 -07:00
Brandon Romano
1dce6ecabf Merge pull request #11341 from hashicorp/nq.update-alert-banner-hcg2021-live
website: Update alert banner for HashiConf
2021-10-19 07:01:04 -07:00
Noel Quiles
b5eccd50a4 Update alert banner for HashiConf
Final cleanup/closer exp date
2021-10-18 11:52:29 -04:00
Shishir Mahajan
479442e682 Add support for --init to docker driver.
Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
2021-10-15 12:53:25 -07:00
Mahmood Ali
c46c530a58 ease building Linux binaries on macOS (#11329)
Meant for development purposes only, so one can compile a binary on a
macOS host, then start a Docker container or scp the binary to a Linux
host easily.

The resulting binary is statically linked and has very subtle
differences, e.g. static binaries use Go's native network stack, which
honors /etc/hosts and /etc/resolv.conf differently from the glibc
implementation. In a development environment, I don't expect these to
materially change our experience.
2021-10-15 11:12:59 -04:00
Florian Apolloner
cc8d9443d2 Follow up fixes for #11237 (#11260) 2021-10-14 17:23:38 -04:00
Luiz Aoqui
600bf12b75 Merge missing commits from 1.2.0-beta1 release branch (#11319) 2021-10-14 16:10:05 -04:00
Luiz Aoqui
d17b6a2c2b Merge release branch (#11317) 2021-10-14 13:06:04 -04:00
Luiz Aoqui
f5d560d360 fix nomad job allocs command name (#11314) 2021-10-14 12:44:59 -04:00
Luiz Aoqui
681eeca515 docs: update Nvidia device plugin as external (#11313) 2021-10-14 12:22:31 -04:00
Dave May
bf94aad36f Remove vendor folder during make clean (#11315)
* Remove vendor folder during make clean
* Add vendor warning to make dev build command
2021-10-14 11:32:19 -04:00
Luiz Aoqui
9d2be2aee6 changelog: add entry for #10796 (#11312) 2021-10-14 09:01:43 -04:00
James Rasell
fa5addc4a1 Merge pull request #11280 from benbuzbee/log-err
Log error if there are no event handlers registered
2021-10-14 14:49:22 +02:00
Mahmood Ali
feb450a393 executor: set CpuWeight in cgroup-v2 (#11287)
Cgroup-v2 uses the `cpu.weight` property instead of cpu shares:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu-interface-files
It also uses a different range (i.e. `[1, 10000]`) from cpu.shares
(i.e. `[2, 262144]`) to make things more interesting.

Luckily, libcontainer provides a helper function to perform the
conversion:
[`ConvertCPUSharesToCgroupV2Value`](https://pkg.go.dev/github.com/opencontainers/runc@v1.0.2/libcontainer/cgroups#ConvertCPUSharesToCgroupV2Value).

I have confirmed that docker/libcontainer performs the conversion as
well in:
https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/specconv/spec_linux.go#L536-L541
I have also confirmed that CpuShares is ignored by libcontainer in:
https://github.com/opencontainers/runc/blob/v1.0.2/libcontainer/cgroups/fs2/cpu.go#L24-L29
2021-10-14 08:46:07 -04:00
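The range conversion described in feb450a393 above can be sketched as a linear mapping from `[2, 262144]` onto `[1, 10000]`. This is a standalone illustration, not the libcontainer helper itself: `cpuSharesToWeight` is a hypothetical name, the formula is my understanding of what the runc v1.0.x helper computes, and the clamping of out-of-range inputs is an assumption added here for safety.

```go
package main

import "fmt"

// cpuSharesToWeight maps a cgroup-v1 cpu.shares value in [2, 262144] onto a
// cgroup-v2 cpu.weight value in [1, 10000] using integer linear scaling.
// A value of 0 is treated as "unset" and passed through unchanged.
// Clamping out-of-range inputs is an assumption of this sketch.
func cpuSharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	for _, s := range []uint64{2, 1024, 262144} {
		fmt.Printf("cpu.shares=%d -> cpu.weight=%d\n", s, cpuSharesToWeight(s))
	}
}
```

Because the division truncates, many distinct share values map to the same weight, so the conversion is not reversible.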
Luiz Aoqui
c0a1d3adb9 changelog: add entries for #9160 and #11078 (#11290) 2021-10-14 08:43:36 -04:00
Charlie Voiselle
8ba714e211 Return SchedulerConfig instead of SchedulerConfigResponse struct (#10799) 2021-10-13 21:23:13 -04:00
Michael Schurter
6a0dede9b6 Merge pull request #11167 from a-zagaevskiy/master
Support configurable dynamic port range
2021-10-13 16:47:38 -07:00
Michael Schurter
fc89835daf client: improve errors & tests for dynamic ports 2021-10-13 16:25:25 -07:00
Dave May
1d30caafad cli: rename paths in debug bundle for clarity (#11307)
* Rename folders to reflect purpose
* Improve captured files test coverage
* Rename CSI plugins output file
* Add changelog entry
* fix test and make changelog message more explicit

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2021-10-13 18:00:55 -04:00
Mahmood Ali
ff1b2f7623 tests: ensure that tests restore env-var values (#11309)
Fix a test corruption issue, where a test accidentally unsets
the `NOMAD_LICENSE` environment variable, that's relied on by some
tests.

As a habit, tests should always restore the environment variable value
on test completion. Golang 1.17 introduced
[`t.Setenv`](https://pkg.go.dev/testing#T.Setenv) to address this issue.
However, as 1.0.x and 1.1.x branches target golang 1.15 and 1.16, I
opted to use a helper function to ease backports.
2021-10-13 17:26:56 -04:00
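A helper of the kind ff1b2f7623 describes might look roughly like the sketch below. The names `setEnv` and `demo` and the variable name `EXAMPLE_VAR` are hypothetical, not the helper actually added in that commit. The helper records whether the variable was set before overriding it, so that restoring can distinguish "was empty" from "was unset".

```go
package main

import (
	"fmt"
	"os"
)

// setEnv sets key to value and returns a function that restores the previous
// state, unsetting the variable if it was not set before.
func setEnv(key, value string) (restore func()) {
	prev, had := os.LookupEnv(key)
	os.Setenv(key, value)
	return func() {
		if had {
			os.Setenv(key, prev)
		} else {
			os.Unsetenv(key)
		}
	}
}

// demo overrides key temporarily and returns the value observed during the
// override and after restoring.
func demo(key string) (during, after string) {
	os.Setenv(key, "original")
	restore := setEnv(key, "temporary")
	during = os.Getenv(key)
	restore()
	after = os.Getenv(key)
	return during, after
}

func main() {
	during, after := demo("EXAMPLE_VAR")
	fmt.Println("during:", during)
	fmt.Println("after:", after)
}
```

In a test, the returned function could be deferred or passed to `t.Cleanup`, which works on Go versions that predate `t.Setenv`.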
Dave May
6852f21ddd cli: Improved autocomplete support for job dispatch and operator debug (#11270)
* Add autocomplete to nomad job dispatch
* Add autocomplete to nomad operator debug
* Update incorrect comment
* Update test to verify autocomplete
* Add changelog
* Apply lint suggestions
* Create dynamic slices instead of specific length
* Align style across predictors
2021-10-12 20:01:54 -04:00
Jorge Marey
833247600b Add os-nova nomad autoscaler repo link (#11277) 2021-10-12 17:04:58 -04:00
Dave May
1bd132f09d debug: Improve namespace and region support (#11269)
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
2021-10-12 16:58:41 -04:00
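The commit 1bd132f09d above names two prefix helpers without showing them. The sketch below is a guess at what such helpers could do, based only on their names: the lowercase function names, signatures, and semantics are assumptions and may differ from the real implementations.

```go
package main

import (
	"fmt"
	"strings"
)

// sliceStringHasPrefix reports whether any element of items starts with
// prefix. Assumed semantics, inferred from the helper's name only.
func sliceStringHasPrefix(items []string, prefix string) bool {
	for _, item := range items {
		if strings.HasPrefix(item, prefix) {
			return true
		}
	}
	return false
}

// stringHasPrefixInSlice reports whether s starts with any of the given
// prefixes. Assumed semantics, inferred from the helper's name only.
func stringHasPrefixInSlice(s string, prefixes []string) bool {
	for _, prefix := range prefixes {
		if strings.HasPrefix(s, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical server member names in "<name>.<region>" form.
	members := []string{"server1.global", "server2.eu"}
	fmt.Println(sliceStringHasPrefix(members, "server2"))
	fmt.Println(stringHasPrefixInSlice("server1.global", []string{"client", "server1"}))
}
```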
Florian Apolloner
75cd30c548 Fixed plan diffing to handle non-unique service names. (#10965) 2021-10-12 16:42:39 -04:00
Luiz Aoqui
d4c3989e2a Update job details box (#11288) 2021-10-12 16:36:10 -04:00
Dave May
f545ac1bc4 cli: Add nomad job allocs command (#11242) 2021-10-12 16:30:36 -04:00
Luiz Aoqui
713094ffb7 wrap log messages with hclog (#11291) 2021-10-12 14:38:44 -04:00
Ben Buzbee
337c5d765b Log error if there are no event handlers registered
We see this error all the time
```
no handler registered for event
event.Message=, event.Annotations=, event.Timestamp=0001-01-01T00:00:00Z, event.TaskName=, event.AllocID=, event.TaskID=,
```

So we're handling an event with all default fields. I noted that this can
happen if only err is set as in

```
func (d *driverPluginClient) handleTaskEvents(reqCtx context.Context, ch chan *TaskEvent, stream proto.Driver_TaskEventsClient) {
	defer close(ch)
	for {
		ev, err := stream.Recv()
		if err != nil {
			if err != io.EOF {
				ch <- &TaskEvent{
					Err: grpcutils.HandleReqCtxGrpcErr(err, reqCtx, d.doneCtx),
				}
			}
```

In this case Err fails to be serialized by the logger; see this test:

```

	ev := &drivers.TaskEvent{
		Err: fmt.Errorf("errz"),
	}
	i.logger.Warn("ben test", "event", ev)
	i.logger.Warn("ben test2", "event err str", ev.Err.Error())
	i.logger.Warn("ben test3", "event err", ev.Err)
	ev.Err = nil
	i.logger.Warn("ben test4", "nil error", ev.Err)

2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.643900Z","driver":"mock_driver","event":{"TaskID":"","TaskName":"","AllocID":"","Timestamp":"0001-01-01T00:00:00Z","Message":"","Annotations":null,"Err":{}}}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test2","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644226Z","driver":"mock_driver","event err str":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test3","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644240Z","driver":"mock_driver","event err":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test4","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644252Z","driver":"mock_driver","nil error":null}
```

Note that in the first example Err is serialized as an empty object and
the error is lost.

What we want is the form used in the later examples, which call out the
err field explicitly so we can see what it is in this case.
2021-10-11 19:44:52 +00:00
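The lost error in 337c5d765b above is consistent with how Go's `encoding/json` treats an `error` held in a struct field: the usual error implementations have no exported fields, so they marshal as `{}`. The sketch below reproduces this with the standard library only. It assumes the logger's JSON output behaves like `encoding/json` here, and the trimmed-down `TaskEvent` type and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// TaskEvent is a hypothetical, trimmed-down version of the event struct.
type TaskEvent struct {
	Message string
	Err     error
}

// sampleEvent returns an event where only Err is set.
func sampleEvent() *TaskEvent {
	return &TaskEvent{Err: errors.New("errz")}
}

// marshalWhole serializes the entire event; the error text is lost because
// the error value exposes no exported fields.
func marshalWhole(ev *TaskEvent) string {
	b, err := json.Marshal(ev)
	if err != nil {
		return "marshal failed: " + err.Error()
	}
	return string(b)
}

// marshalExplicit pulls the error out as a string before serializing, so the
// message survives.
func marshalExplicit(ev *TaskEvent) string {
	fields := map[string]interface{}{"message": ev.Message}
	if ev.Err != nil {
		fields["error"] = ev.Err.Error()
	}
	b, err := json.Marshal(fields)
	if err != nil {
		return "marshal failed: " + err.Error()
	}
	return string(b)
}

func main() {
	ev := sampleEvent()
	fmt.Println(marshalWhole(ev))
	fmt.Println(marshalExplicit(ev))
}
```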
Bryce Kalow
721f388f43 website: upgrade deps to fix search styles (#11294) 2021-10-11 11:33:59 -05:00
Aleksandr Zagaevskiy
0620bb04a5 fixup! Support configurable dynamic port range 2021-10-11 14:13:59 +03:00
James Rasell
8378d00d66 Merge pull request #11283 from hashicorp/f-update-hclog-dep
deps: update hashicorp/go-hclog to v1.0.0
2021-10-11 08:39:41 +02:00
Jai
0564f9fa68 System Batch UI, Client Status Bar Chart and Client Tab page view (#11078) 2021-10-07 17:11:38 -04:00
Michael Lange
c50b75178f Merge pull request #11279 from hashicorp/f-ui/storybook-upgrade
UI: Storybook upgrade
2021-10-07 09:17:27 -07:00
James Rasell
dd07f07ec8 changelog: add entry for #11283 2021-10-07 08:16:05 +01:00
James Rasell
594ba94878 deps: update hashicorp/go-hclog to v1.0.0 2021-10-07 07:48:41 +01:00
Matt Mukerjee
0881b94201 Add FailoverHeartbeatTTL to config (#11127)
FailoverHeartbeatTTL is the amount of time to wait after a server leader failure
before considering reallocating client tasks. This TTL should be fairly long as
the new server leader needs to rebuild the entire heartbeat map for the
cluster. In deployments with a small number of machines, the default TTL (5m)
may be unnecessarily long. Let's allow operators to configure this value in their
config files.
2021-10-06 18:48:12 -04:00
Michael Lange
b9937dfc38 Migrate: New hierarchical separator 2021-10-06 14:05:32 -07:00
Michael Lange
90eabb6955 Migrate decorator to new file layout 2021-10-06 14:05:32 -07:00
Michael Lange
51d2873c3d Override the app rootURL for storybook
Hopefully this work gets merged into ember-cli-storybook. For the time
being, we get a fork instead.
2021-10-06 14:05:32 -07:00
Michael Lange
95d4af91f2 Storybook for ember workaround 2021-10-06 14:05:32 -07:00
Michael Lange
5385021b2c Upgrade Storybook configuration for v6 2021-10-06 14:05:32 -07:00
Amit Shuster
215bf04bc6 Lightrun Integration - External task driver (#11203) 2021-10-06 15:34:34 -04:00
Shantanu Gadgil
20b44d77bd auth_soft_fail needed for public images when agent is configured with auth (#11190) 2021-10-06 15:30:23 -04:00
Leela Venkaiah G
3eb852fcfe [demo] Kadalu CSI support for Nomad (#11207) 2021-10-06 15:29:15 -04:00
Michael Lange
f9bc9ca6f7 Upgrade storybook from 5 to 6 2021-10-06 11:06:57 -07:00
Mahmood Ali
bc2a51d43a executor: suppress spurious log messages (#11273)
Suppress stats streaming error log messages when task finishes.
Streaming errors are expected when a task finishes and they aren't
actionable to users.

Also, note that the task runner Stats hook retries collecting stats
after a delay. If the connection terminates prematurely, it will be
retried, and closing the stats stream is not very disruptive.

Ideally, the executor would terminate cleanly when the task exits, but that's a more
substantial change that may require changing the executor/drivers interface.

Fixes #10814
2021-10-06 12:42:35 -04:00