Files
nomad/scheduler
Tim Gross c205688857 scheduler: fix state corruption from rescheduler tracker updates (#25698)
In #12319 we fixed a bug where updates to the reschedule tracker would be
dropped if the follow-up allocation failed to be placed by the scheduler in the
later evaluation. We did this by mutating the previous allocation's reschedule
tracker. But we did this without copying the previous allocation first and then
making sure the updated copy was in the plan. This is unfortunately unsafe and
corrupts the state store on the server where the scheduler ran; it may cause a
race condition in RPC handlers and it causes the server to be out of sync with
the other servers. This was discovered while trying to make all our tests
race-free, but likely impacts production users.

Copy the previous allocation before updating the reschedule tracker, and swap
out the updated allocation in the plan. This also requires that we include the
reschedule tracker in the "normalized" (stripped-down) allocations we send to
the leader as part of a plan.

Ref: https://github.com/hashicorp/nomad/pull/12319
Fixes: https://hashicorp.atlassian.net/browse/NET-12357
2025-04-18 08:42:54 -04:00
..