46 Commits

Author SHA1 Message Date
Mahmood Ali
a80ccb82cb Only ignore rescheduled allocations if they got stopped 2020-09-14 21:11:52 -04:00
Mahmood Ali
5720266c91 Respect alloc job version for lost/failed allocs
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest version, even if that version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in Canary stage, and if so, it
ensures that any lost/failed allocation is replaced by one with the latest
promoted version instead.
2020-08-19 09:52:48 -04:00
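The replacement rule described in 5720266c91 can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: the `Deployment` type, its fields, and both helper functions are hypothetical names invented for the example.

```go
package main

import "fmt"

// Deployment is a hypothetical, trimmed-down stand-in for a deployment record.
type Deployment struct {
	Active   bool // the deployment is still in progress
	Canaries int  // number of canaries requested
	Promoted bool // the canaries have been promoted
}

// inCanaryStage reports whether the deployment still has unpromoted canaries.
func inCanaryStage(d *Deployment) bool {
	return d != nil && d.Active && d.Canaries > 0 && !d.Promoted
}

// replacementVersion picks the job version used to replace a lost or failed
// alloc: the last promoted version during the canary stage, otherwise the
// latest version.
func replacementVersion(d *Deployment, latest, lastPromoted uint64) uint64 {
	if inCanaryStage(d) {
		return lastPromoted
	}
	return latest
}

func main() {
	d := &Deployment{Active: true, Canaries: 1}
	fmt.Println(replacementVersion(d, 3, 2)) // canary stage: prints 2

	d.Promoted = true
	fmt.Println(replacementVersion(d, 3, 2)) // after promotion: prints 3
}
```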
Lang Martin
9ccec0afbb scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread followupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Lang Martin
cd6d34425f server: stop after client disconnect (#7939)
* jobspec, api: add stop_after_client_disconnect

* nomad/state/state_store: error message typo

* structs: alloc methods to support stop_after_client_disconnect

1. a global AllocStates to track status changes with timestamps. We
   need this to track the time at which the alloc became lost
   originally.

2. ShouldClientStop() and WaitClientStop() to actually do the math

* scheduler/reconcile_util: delayByStopAfterClientDisconnect

* scheduler/reconcile: use delayByStopAfterClientDisconnect

* scheduler/util: updateNonTerminalAllocsToLost comments

This was set up to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now.

* scheduler/testing: AssertEvalStatus just use require

* scheduler/generic_sched: don't create a blocked eval if delayed

* scheduler/generic_sched_test: several scheduling cases
2020-05-13 16:39:04 -04:00
Mahmood Ali
c62c246ad9 Stop allocs to be rescheduled
Currently, when an alloc fails and is rescheduled, the alloc desired
state remains as "run" and the nomad client may not free the resources.

Here, we ensure that an alloc is marked as stopped when it's
rescheduled.

Notice the Desired Status and Description before and after this change:

Before:
```
mars-2:nomad notnoop$ nomad alloc status 02aba49e
ID                   = 02aba49e
Eval ID              = bb9ed1d2
Name                 = example-reschedule.nodes[0]
Node ID              = 5853d547
Node Name            = mars-2.local
Job ID               = example-reschedule
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = run
Desired Description  = <none>
Created              = 10s ago
Modified             = 5s ago
Replacement Alloc ID = d6bf872b

Task "payload" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  24 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2019-06-06T21:12:45Z
Finished At    = 2019-06-06T21:12:50Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2019-06-06T17:12:50-04:00  Not Restarting  Policy allows no restarts
2019-06-06T17:12:50-04:00  Terminated      Exit Code: 1
2019-06-06T17:12:45-04:00  Started         Task started by client
2019-06-06T17:12:45-04:00  Task Setup      Building Task Directory
2019-06-06T17:12:45-04:00  Received        Task received by client

```

After:

```
ID                   = 5001ccd1
Eval ID              = 53507a02
Name                 = example-reschedule.nodes[0]
Node ID              = a3b04364
Node Name            = mars-2.local
Job ID               = example-reschedule
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 13s ago
Modified             = 3s ago
Replacement Alloc ID = 7ba7ac20

Task "payload" is "dead"
Task Resources
CPU         Memory          Disk     Addresses
21/100 MHz  24 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2019-06-06T21:22:50Z
Finished At    = 2019-06-06T21:22:55Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2019-06-06T17:22:55-04:00  Not Restarting  Policy allows no restarts
2019-06-06T17:22:55-04:00  Terminated      Exit Code: 1
2019-06-06T17:22:50-04:00  Started         Task started by client
2019-06-06T17:22:50-04:00  Task Setup      Building Task Directory
2019-06-06T17:22:50-04:00  Received        Task received by client
```
2019-06-06 17:27:12 -04:00
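The behavior change in c62c246ad9 amounts to flipping the desired status of a rescheduled alloc from "run" to "stop" and recording why. A minimal sketch, assuming a hypothetical `Alloc` struct and `markRescheduled` helper (only the status values and the description string are taken from the output above):

```go
package main

import "fmt"

// Alloc is a hypothetical, trimmed-down allocation record.
type Alloc struct {
	ID                 string
	ClientStatus       string
	DesiredStatus      string
	DesiredDescription string
}

// rescheduledDesc matches the Desired Description shown in the "After" output.
const rescheduledDesc = "alloc was rescheduled because it failed"

// markRescheduled marks failed allocs that are being replaced as stopped,
// so the client is told to free their resources.
func markRescheduled(allocs []*Alloc) {
	for _, a := range allocs {
		if a.ClientStatus == "failed" && a.DesiredStatus == "run" {
			a.DesiredStatus = "stop"
			a.DesiredDescription = rescheduledDesc
		}
	}
}

func main() {
	allocs := []*Alloc{
		{ID: "5001ccd1", ClientStatus: "failed", DesiredStatus: "run"},
	}
	markRescheduled(allocs)
	fmt.Printf("%s: %s (%s)\n", allocs[0].ID, allocs[0].DesiredStatus, allocs[0].DesiredDescription)
}
```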
Preetha Appan
242cc191a1 Work in progress - force rescheduling of failed allocs 2018-05-08 17:26:57 -05:00
Alex Dadgar
ca588f9ce0 clarify comment 2018-05-07 14:55:01 -05:00
Alex Dadgar
0e1fb91189 Reschedule when we have canaries properly 2018-05-07 14:55:01 -05:00
Alex Dadgar
ff7b1bebcc Allow canary count greater than desired 2018-05-07 14:50:01 -05:00
Preetha Appan
32557a1a99 Only use DesiredTransition.Reschedule in reconciler when it's an active deployment 2018-05-07 14:50:01 -05:00
Alex Dadgar
b1df4611fe Only reschedule allowed deployment allocs 2018-05-07 14:50:01 -05:00
Preetha Appan
7ebf6032dc remove unnecessary check and other fixes from code review 2018-04-04 07:35:20 -05:00
Preetha Appan
aa4a0cff50 Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Preetha Appan
2ba965fa7a rename skip->ignore and improve comment formatting 2018-03-29 15:11:10 -05:00
Preetha Appan
fc50ab930f Refactored for readability, pair programmed with @dadgar 2018-03-29 13:28:37 -05:00
Preetha Appan
fefbdd3178 Filter out allocs with DesiredState = stop, and unit tests 2018-03-29 09:28:52 -05:00
Preetha Appan
f401044600 Fix edge case in reconciler where service jobs with ClientstatusComplete were not replaced 2018-03-23 18:41:00 -05:00
Michael Schurter
a466f97cba scheduler: migrate non-terminal migrating allocs
filterByTainted node should always migrate non-terminal migrating allocs
2018-03-21 16:49:48 -07:00
Michael Schurter
832b1d5694 switch to new raft DesiredTransition message 2018-03-21 16:49:48 -07:00
Alex Dadgar
48d637dad1 RPC, FSM, State Store for marking DesiredTransition
fix build tag
2018-03-21 16:49:48 -07:00
Preetha Appan
abeab12b9e Get reschedule policy from the alloc directly 2018-03-14 16:10:32 -05:00
Preetha Appan
e80d0d8156 Cleaner handling of batched evals 2018-03-14 16:10:32 -05:00
Preetha Appan
3eebacb53e Remove unnecessary check against 5 second window for determining immediate scheduling eligibility 2018-03-14 16:10:32 -05:00
Preetha Appan
9628454d7a Scheduler and Reconciler changes to support delayed rescheduling 2018-03-14 16:10:32 -05:00
Josh Soref
0cc21f8c57 spelling: reschedulable 2018-03-11 18:48:12 +00:00
Preetha Appan
24c04d67d5 Fixes bug in reconciler where previously rescheduled allocs are rescheduled again. Simplified logic and added test case to catch this. 2018-02-20 12:07:56 -06:00
Preetha Appan
a49ad471f9 Address more code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan
c5f81b426f Make sure that reschedule trackers are not added for node drain replacements 2018-01-31 09:56:53 -06:00
Preetha Appan
d96873c827 Reconcile with changes to structs for reschedule tracking 2018-01-31 09:56:53 -06:00
Preetha Appan
cc54e11802 Fix some comments and lint warnings, remove unused method 2018-01-31 09:56:53 -06:00
Preetha Appan
5ecb7895bb Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Preetha Appan
ef1a2e94f7 Fix some typos 2017-12-14 13:29:27 -06:00
Alex Dadgar
a9e3a41407 Enable more linters 2017-09-26 15:26:33 -07:00
Alex Dadgar
aabf2c0334 fixes 2017-08-15 12:27:05 -07:00
Alex Dadgar
7e6b14cf5d Fix panic occurring from improper bitmap size
This PR fixes an alignment calculation when determining the bitmap
size.

Fixes https://github.com/hashicorp/nomad/issues/3008
2017-08-12 15:37:02 -07:00
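The log does not show the corrected calculation from 7e6b14cf5d. As an illustration only, assuming the bitmap is sized in bits and must cover whole bytes, a rounding helper (the name `alignedBitmapSize` is hypothetical) could look like this:

```go
package main

import "fmt"

// alignedBitmapSize rounds n up to the next multiple of 8 so that a bitmap
// holding n bits always occupies a whole number of bytes.
// Assumption: the bitmap requires a byte-aligned size.
func alignedBitmapSize(n uint) uint {
	if rem := n % 8; rem != 0 {
		n += 8 - rem
	}
	return n
}

func main() {
	for _, n := range []uint{1, 8, 9, 10} {
		fmt.Println(n, "->", alignedBitmapSize(n))
	}
}
```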
Luke Farnell
7a56971508 fixed all spelling mistakes for goreport 2017-08-07 17:13:05 -04:00
Alex Dadgar
d457735b2f Treat destructive updates atomically 2017-07-16 10:35:38 -07:00
Alex Dadgar
989aa56304 Remove canary 2017-07-07 12:10:04 -07:00
Alex Dadgar
71c7c45cf6 Change canary handling 2017-07-07 12:10:04 -07:00
Alex Dadgar
e5b1e3171c Remove promoted bit from allocation 2017-07-07 12:10:04 -07:00
Alex Dadgar
af7f93b56b Fix canary handling 2017-07-07 12:03:11 -07:00
Alex Dadgar
369a04b135 Deployment tests 2017-07-07 12:03:11 -07:00
Alex Dadgar
f32a9a5539 Non-Canary/Deployment Tests 2017-07-07 12:03:11 -07:00
Alex Dadgar
85e0d6fccd assign names 2017-07-07 12:03:11 -07:00
Alex Dadgar
1dabd206bb handle batch filtering 2017-07-07 12:03:11 -07:00
Alex Dadgar
4bbf24a875 Split reconcile file 2017-07-07 12:03:11 -07:00