nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
Tim Gross	dc58f247ed	docs: clarify reschedule, migrate, and replacement terminology (#24929) Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover: * restart: when the tasks of an allocation fail and we try to restart the tasks in place. * reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again. * migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker. * replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count`. * replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker. Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa. Fixes: https://github.com/hashicorp/nomad/issues/24918 Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-18 09:31:03 -05:00
Seth Hoenig	51215bf102	deps: update to go-set/v3 and refactor to use custom iterators (#23971) * deps: update to go-set/v3 * deps: use custom set iterators for looping	2024-09-16 13:40:10 -05:00
Soren L. Hansen	96acddbc13	Avoid NPE in nomad/command/job_restart.go (#20049) stopAlloc() checks if an allocation represents a system job like this: ``` if alloc.Job.Type == api.JobTypeSystem { ... } ``` This caused the CLI to crash: ``` ==> 2024-02-29T08:45:53+01:00: Restarting 2 allocations 2024-02-29T08:45:54+01:00: Rescheduling allocation "6a9da11a" for group "redacted-group" panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x2 addr=0x20 pc=0x10686affc] goroutine 36 [running]: github.com/hashicorp/nomad/command.(*JobRestartCommand).stopAlloc(0x14000b11040, {0x14000996dc0?, 0x0?}) github.com/hashicorp/nomad/command/job_restart.go:968 +0x25c github.com/hashicorp/nomad/command.(*JobRestartCommand).handleAlloc(0x14000b11040, {0x14000996dc0?, 0x0?}) github.com/hashicorp/nomad/command/job_restart.go:868 +0x34 github.com/hashicorp/nomad/command.(*JobRestartCommand).Run.(*JobRestartCommand).Run.func1.func2() github.com/hashicorp/nomad/command/job_restart.go:392 +0x28 github.com/hashicorp/go-multierror.(*Group).Go.func1() github.com/hashicorp/go-multierror@v1.1.1/group.go:23 +0x60 created by github.com/hashicorp/go-multierror.(*Group).Go in goroutine 1 github.com/hashicorp/go-multierror@v1.1.1/group.go:20 +0x84 ``` Attaching a debugger revealed that `alloc.Job` was set, but `alloc.Job.Type` was nil. After guarding the `.Type` check with an `alloc.Job.Type != nil` condition, it still crashed. This time, `alloc.Job` was nil. I was scrambling to get the job running again, so I didn't have the opportunity to find out why those values were nil, but this change ensures the CLI does not crash in these situations. Fixes #20048	2024-03-01 08:07:28 -06:00
Luiz Aoqui	d29ac461a7	cli: non-service jobs on `job restart -reschedule` (#19147) The `-reschedule` flag stops allocations and assumes the Nomad scheduler will create new allocations to replace them. But this is only true for service and batch jobs. Restarting non-service jobs with the `-reschedule` flag causes the command to loop forever waiting for the allocations to be replaced, which never happens. Allocations for system jobs may be replaced by triggering an evaluation after each stop to cause the reconciler to run again. Sysbatch jobs should not be allowed to be rescheduled as they are never replaced by the scheduler.	2023-11-29 13:01:19 -05:00
Luiz Aoqui	bdac8d9583	cli: prevent panic on CTRL+C during a question (#19154) Fix a panic when a question receives an interrupt signal before the signal handler is initialized.	2023-11-23 14:51:56 -05:00
Luiz Aoqui	d2849b8a76	cli: skip allocs with replacements on job restart (#19155) The `nomad job restart` command should skip allocations that already have replacements. Restarting an allocation with a replacement is a no-op because the allocation status is terminal and the command's replacement monitor returns immediately. But by not skipping them, the effective batch size is computed incorrectly.	2023-11-23 14:51:10 -05:00
Seth Hoenig	e3c8700ded	deps: upgrade to go-set/v2 (#18638) No functional changes, just cleaning up deprecated usages that are removed in v2 and replace one call of .Slice with .ForEach to avoid making the intermediate copy.	2023-10-05 11:56:17 -05:00
Seth Hoenig	d9341f0664	update go1.21 (#18184) * build: update to go1.21 * go: eliminate helpers in favor of min/max * build: run go mod tidy * build: swap depguard for semgrep * command: fixup broken tls error check on go1.21	2023-08-14 08:43:27 -05:00
hashicorp-copywrite[bot]	a9d61ea3fd	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:29 -05:00
Luiz Aoqui	389dff42a1	cli: fix panic on job restart (#17346) When monitoring the replacement allocation, if the `Allocations().Info()` request fails, the `alloc` variable is `nil`, so it should not be read.	2023-05-30 11:08:49 -04:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Luiz Aoqui	fffdbdff06	cli: job restart command (#16278) Implement the new `nomad job restart` command that allows operators to restart allocation tasks or reschedule the entire allocation. Restarts can be batched to target multiple allocations in parallel. Between each batch the command can stop and hold for a predefined time or until the user confirms that the process should proceed. This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour. Co-authored-by: Shishir Mahajan <smahajan@roblox.com>	2023-03-23 18:28:26 -04:00

12 Commits
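
The nil-pointer crash described in 96acddbc13 came from dereferencing `alloc.Job.Type` when either `alloc.Job` or its `Type` field was nil. The following is a minimal sketch of that kind of guard, not the code from the commit: `Job`, `Allocation`, `jobTypeSystem`, and `isSystemAlloc` are hypothetical stand-ins, and the only detail taken from the commit message is that `Type` is a pointer that can be nil.

```go
package main

import "fmt"

// Job and Allocation are hypothetical stand-ins for the Nomad API types.
// Type is modelled as a pointer because the commit message reports it being nil.
type Job struct {
	Type *string
}

type Allocation struct {
	ID  string
	Job *Job
}

// jobTypeSystem is an assumed constant for illustration only.
const jobTypeSystem = "system"

// isSystemAlloc reports whether alloc belongs to a system job.
// Every pointer on the path is checked before it is dereferenced,
// so a missing Job or a missing Type yields false instead of a panic.
func isSystemAlloc(alloc *Allocation) bool {
	if alloc == nil || alloc.Job == nil || alloc.Job.Type == nil {
		return false
	}
	return *alloc.Job.Type == jobTypeSystem
}

func main() {
	system := jobTypeSystem
	allocs := []*Allocation{
		{ID: "with-type", Job: &Job{Type: &system}},
		{ID: "nil-type", Job: &Job{}},
		{ID: "nil-job"},
	}
	for _, a := range allocs {
		fmt.Printf("%s: system=%v\n", a.ID, isSystemAlloc(a))
	}
}
```

Treating unknown values as "not a system job" is a design choice made for this sketch; the commit message only states that the change keeps the CLI from crashing in these situations.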