23 Commits

Author SHA1 Message Date
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Tim Gross
116f24d768 client: de-duplicate alloc updates and gate during restore (#17074)
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:

* The `allocSync` method that updates the servers isn't gated on first contact
  with the servers. This means that if a server updates the desired state while
  the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
  RPC. This will typically result in the client updating the server with "running"
  and then immediately thereafter "complete".

* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
  if it's possible to assert that the server has definitely seen the client
  state. The allocrunner may queue up updates even if we gate sending them. So
  then we end up with a race between the allocrunner updating its internal state
  to overwrite the previous update and `allocSync` sending the bogus or duplicate
  update.

This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.
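
To make the approach concrete, here is a minimal sketch with hypothetical names (`ackTracker`, `allocState`), not the actual allocrunner types: an update identical to the last server-acknowledged state is dropped before it is added to the batch, and the acknowledged state is recorded only after the update RPC returns successfully.

```go
// Minimal sketch of the de-duplication idea (hypothetical types, not the
// real allocrunner): skip updates the servers have already acknowledged.
package main

import (
	"fmt"
	"sync"
)

// allocState stands in for the client-owned fields sent to the server.
type allocState struct {
	ClientStatus string
	TaskStates   map[string]string
}

func (a allocState) equal(b allocState) bool {
	if a.ClientStatus != b.ClientStatus || len(a.TaskStates) != len(b.TaskStates) {
		return false
	}
	for task, state := range a.TaskStates {
		if b.TaskStates[task] != state {
			return false
		}
	}
	return true
}

// ackTracker remembers the last state the servers acknowledged per alloc ID.
type ackTracker struct {
	mu   sync.Mutex
	last map[string]allocState
}

// needsUpdate reports whether an update differs from the acknowledged state.
func (t *ackTracker) needsUpdate(allocID string, update allocState) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	prev, ok := t.last[allocID]
	return !ok || !prev.equal(update)
}

// acknowledge records a state once the update RPC has returned successfully.
func (t *ackTracker) acknowledge(allocID string, update allocState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.last[allocID] = update
}

func main() {
	t := &ackTracker{last: map[string]allocState{}}
	update := allocState{ClientStatus: "complete"}

	if t.needsUpdate("alloc-1", update) {
		// ... send the batched update to the servers here ...
		t.acknowledge("alloc-1", update)
	}
	// An identical update arriving later is dropped before batching.
	fmt.Println(t.needsUpdate("alloc-1", update)) // false
}
```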

The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.

Fixes #16381
2023-05-11 09:05:24 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Mahmood Ali
c70f2a1269 Revert "client: defensive against getting stale alloc updates" 2020-06-19 15:39:44 -04:00
Mahmood Ali
2e1978eb1f client: defensive against getting stale alloc updates
When fetching node alloc assignments, be defensive against stale reads before
killing the local node's allocs.

The bug: when both client and servers are restarting and the client requests
its node's allocations, it may get stale data because the server hasn't
finished applying all of the restored Raft transactions to its state store.

Consequently, the client would kill and destroy the alloc locally, only to
fetch it again moments later once the server store is up to date.

The bug can be reproduced quite reliably with a single-node setup (configured
with persistence). I suspect it's too edge-casey to occur in a production
cluster with multiple servers, but we may need to examine leader failover
scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more
recent than the alloc index. This seems like a cheap resiliency fix, and it's
the same check we already use for detecting alloc updates.
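
Loosely, the guard looks like the sketch below (hypothetical helper, not the actual client code): an alloc missing from the server's response is only destroyed when the response index is newer than the index at which the client last saw that alloc.

```go
// Sketch of the staleness guard (hypothetical names): skip removing an alloc
// when the server response could be older than the alloc itself.
package sketch

type alloc struct {
	ID               string
	AllocModifyIndex uint64
}

// allocsToRemove returns local allocs that are safe to destroy: they are
// absent from the pulled set AND the response index proves the server state
// is newer than the client's view of the alloc.
func allocsToRemove(local map[string]*alloc, pulled map[string]struct{}, responseIndex uint64) []string {
	var remove []string
	for id, a := range local {
		if _, stillAssigned := pulled[id]; stillAssigned {
			continue
		}
		if responseIndex <= a.AllocModifyIndex {
			// The response may be a stale read; keep the alloc for now.
			continue
		}
		remove = append(remove, id)
	}
	return remove
}
```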

A more proper fix would be to ensure that a Nomad server only serves RPC calls
once its state store is fully restored, or up to date in leadership-transition
cases.
2019-06-29 04:17:35 -05:00
Mahmood Ali
9dcebcd8a3 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` calls `retryRegisterNode()` almost
immediately after `registerAndHeartbeat()`, roughly 5 seconds later.

This second registration is unnecessary and made debugging a bit harder.
So here, we ensure that we only re-register the node for new node events,
not for the initial registration.
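
The intended flow, as a rough sketch with hypothetical names (not the actual client loop): the initial registration happens once during startup, and the watcher only re-registers when it actually observes a node change.

```go
// Sketch (hypothetical names): only node-change events observed after startup
// trigger another registration; the initial registration is not repeated here.
package sketch

type nodeEvent struct {
	Updated bool // true when the node's attributes actually changed
}

type client struct {
	register func() // e.g. a retryRegisterNode-style helper
}

// watchNodeUpdates re-registers the node only for genuine update events.
func (c *client) watchNodeUpdates(events <-chan nodeEvent) {
	for ev := range events {
		if !ev.Updated {
			continue // ignore events that don't represent a node change
		}
		c.register()
	}
}
```
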
2019-04-19 09:12:50 -04:00
Michael Schurter
53a4b3fe99 example redis job "runs" on arv2! see below
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup

Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy, things
get racy fast.
2018-10-16 16:53:29 -07:00
Alex Dadgar
a62e412b88 Refactor - wip 2018-06-12 10:23:45 -07:00
Alex Dadgar
3f1ccf7278 Respond to comments 2017-05-09 10:50:24 -07:00
Alex Dadgar
e22393aeb8 Restore state + upgrade path 2017-05-02 18:21:49 -07:00
Alex Dadgar
9def7e1a14 Don't deepcopy job when retrieving copy of Alloc
This PR removes deep-copying of the job attached to the allocation in the
alloc runner. This operation is called very often, so removing reflection
from the code path, along with the potentially large number of mallocs
needed to create a copy of the job, reduces memory and CPU pressure.
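
The pattern is roughly the sketch below (hypothetical types, not the real `structs.Allocation`): return a shallow copy of the alloc that shares the job pointer, which is safe as long as the job is treated as immutable.

```go
// Sketch (hypothetical types): Alloc() returns a shallow copy of the alloc
// that shares the job pointer instead of deep-copying it with reflection.
package sketch

type job struct {
	Name string
	// ... potentially large nested structures ...
}

type allocation struct {
	ID  string
	Job *job
}

type allocRunner struct {
	alloc *allocation
}

// Alloc copies the allocation struct itself but leaves Job pointing at the
// shared, read-only job, avoiding the mallocs needed to rebuild it.
func (ar *allocRunner) Alloc() *allocation {
	a := *ar.alloc
	return &a
}
```
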
2017-05-01 14:50:34 -07:00
Michael Schurter
73e99211f6 Fix string formatting 2016-12-01 11:22:51 -08:00
Michael Schurter
6f2b09f676 Add sanity check to SaveState
Also just reuse the task states snapshot taken by `Alloc()` instead of
doing a redundant copy.
2016-09-02 16:07:06 -07:00
Cameron Davison
aad6203b6d write state to temp file and then rename 2016-06-27 12:29:33 -05:00
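
The atomic-write pattern this refers to can be sketched as follows (illustrative only, not the client's actual state-saving code): write the serialized state to a temporary file in the same directory, sync it, then rename it over the final path so readers never observe a partially written file.

```go
// Sketch of write-to-temp-then-rename (illustrative only): the rename is
// atomic when the temp file and destination live on the same filesystem.
package sketch

import (
	"os"
	"path/filepath"
)

// atomicWrite persists data to path via a temporary file plus a rename.
func atomicWrite(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".state-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```
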
Sean Chittenden
7db2eb03c4 Use consul/lib's RandomStagger
Removes four redundant copies of the method in the process.
2016-06-10 15:48:36 -04:00
Alex Dadgar
410ae593e7 Fix double pull with introduction of AllocModifyIndex 2016-02-01 15:43:59 -08:00
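
Loosely, the fix relies on a per-alloc index that only moves when client-relevant fields change; a hedged sketch of that idea (hypothetical helper, not the actual client code) is below.

```go
// Sketch (hypothetical helper): pull an alloc only when the server-side
// AllocModifyIndex is newer than the copy the client already has, so
// unrelated server-side changes don't trigger a second pull.
package sketch

// allocsToPull returns the alloc IDs whose content actually changed.
func allocsToPull(serverIndexes, localIndexes map[string]uint64) []string {
	var pull []string
	for id, serverIdx := range serverIndexes {
		localIdx, known := localIndexes[id]
		if !known || serverIdx > localIdx {
			pull = append(pull, id)
		}
	}
	return pull
}
```
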
Ryan Uber
e4c29dc579 client: alloc dirs tolerate missing directories 2015-09-11 20:32:55 -07:00
Armon Dadgar
8a02dbc481 Use a single implementation of GenerateUUID 2015-09-07 15:23:03 -07:00
Armon Dadgar
17e8860a03 client: adding state save helpers 2015-08-29 18:03:00 -07:00
Armon Dadgar
5ac8546c99 client: working with alloc diffs 2015-08-23 14:54:52 -07:00
Armon Dadgar
398d1b723a client: alloc diffing 2015-08-23 14:47:51 -07:00
Armon Dadgar
5e8d4ef647 client: register on start 2015-08-20 17:49:04 -07:00
Armon Dadgar
feabeb8167 client: skeleton package 2015-08-20 16:07:26 -07:00