12 Commits

Author SHA1 Message Date
Tim Gross
dc58f247ed docs: clarify reschedule, migrate, and replacement terminology (#24929)
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: https://github.com/hashicorp/nomad/issues/24918
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-18 09:31:03 -05:00
Seth Hoenig
51215bf102 deps: update to go-set/v3 and refactor to use custom iterators (#23971)
* deps: update to go-set/v3

* deps: use custom set iterators for looping
2024-09-16 13:40:10 -05:00
Soren L. Hansen
96acddbc13 Avoid NPE in nomad/command/job_restart.go (#20049)
stopAlloc() checks if an allocation represents a system job like this:
```
  if *alloc.Job.Type == api.JobTypeSystem {
    ...
  }
```

This caused the cli to crash:
```
==> 2024-02-29T08:45:53+01:00: Restarting 2 allocations
    2024-02-29T08:45:54+01:00: Rescheduling allocation "6a9da11a" for group "redacted-group"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x20 pc=0x10686affc]

goroutine 36 [running]:
github.com/hashicorp/nomad/command.(*JobRestartCommand).stopAlloc(0x14000b11040, {0x14000996dc0?, 0x0?})
        github.com/hashicorp/nomad/command/job_restart.go:968 +0x25c
github.com/hashicorp/nomad/command.(*JobRestartCommand).handleAlloc(0x14000b11040, {0x14000996dc0?, 0x0?})
        github.com/hashicorp/nomad/command/job_restart.go:868 +0x34
github.com/hashicorp/nomad/command.(*JobRestartCommand).Run.(*JobRestartCommand).Run.func1.func2()
        github.com/hashicorp/nomad/command/job_restart.go:392 +0x28
github.com/hashicorp/go-multierror.(*Group).Go.func1()
        github.com/hashicorp/go-multierror@v1.1.1/group.go:23 +0x60
created by github.com/hashicorp/go-multierror.(*Group).Go in goroutine 1
        github.com/hashicorp/go-multierror@v1.1.1/group.go:20 +0x84
```

Attaching a debugger revealed that `alloc.Job` was set, but
`alloc.Job.Type` was nil. After guarding the `.Type` check with a
`alloc.Job.Type != nil`, it still crashed. This time, `alloc.Job` was
nil.

I was scrambling to get the job running again, so I didn't have the
opportunity to find out why those values were nil, but this change
ensures the CLI does not crash in these situations.

Fixes #20048
2024-03-01 08:07:28 -06:00
Luiz Aoqui
d29ac461a7 cli: non-service jobs on job restart -reschedule (#19147)
The `-reschedule` flag stops allocations and assumes the Nomad scheduler
will create new allocations to replace them. But this is only true for
service and batch jobs.

Restarting non-service jobs with the `-reschedule` flag causes the
command to loop forever waiting for the allocations to be replaced,
which never happens.

Allocations for system jobs may be replaced by triggering an evaluation
after each stop to cause the reconciler to run again.

Sysbatch jobs should not be allowed to be rescheduled as they are never
replaced by the scheduler.
2023-11-29 13:01:19 -05:00
Luiz Aoqui
bdac8d9583 cli: prevent panic on CTRL+C during a question (#19154)
Fix a panic when a question receives an interrupt signal before the
signal handler is initialized.
2023-11-23 14:51:56 -05:00
Luiz Aoqui
d2849b8a76 cli: skip allocs with replacements on job restart (#19155)
The `nomad job restart` command should skip allocations that already
have replacements. Restarting an allocation with a replacement is a
no-op because the allocation status is terminal and the command's
replacement monitor returns immediatelly.

But by not skipping them, the effective batch size is computed
incorrectly.
2023-11-23 14:51:10 -05:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
Seth Hoenig
d9341f0664 update go1.21 (#18184)
* build: update to go1.21

* go: eliminate helpers in favor of min/max

* build: run go mod tidy

* build: swap depguard for semgrep

* command: fixup broken tls error check on go1.21
2023-08-14 08:43:27 -05:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Luiz Aoqui
389dff42a1 cli: fix panic on job restart (#17346)
When monitoring the replacement allocation, if the
`Allocations().Info()` request fails, the `alloc` variable is `nil`, so
it should not be read.
2023-05-30 11:08:49 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Luiz Aoqui
fffdbdff06 cli: job restart command (#16278)
Implement the new `nomad job restart` command that allows operators to
restart allocations tasks or reschedule then entire allocation.

Restarts can be batched to target multiple allocations in parallel.
Between each batch the command can stop and hold for a predefined time
or until the user confirms that the process should proceed.

This implements the "Stateless Restarts" alternative from the original
RFC
(https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180).
The original concept is still worth implementing, as it allows this
functionality to be exposed over an API that can be consumed by the
Nomad UI and other clients. But the implementation turned out to be more
complex than we initially expected so we thought it would be better to
release a stateless CLI-based implementation first to gather feedback
and validate the restart behaviour.

Co-authored-by: Shishir Mahajan <smahajan@roblox.com>
2023-03-23 18:28:26 -04:00