nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-06 02:15:43 +03:00

Author	SHA1	Message	Date
Drew Bailey	264932dae4	Return FailedTGAlloc metric instead of no node err If an existing system allocation is running and the node it's running on is marked as ineligible, subsequent plan/applys return an RPC error instead of a more helpful plan result. This change logs the error, and appends a failedTGAlloc for the placement.	2020-01-22 10:07:15 -05:00
Drew Bailey	81a24098f0	Update Evicted allocations to lost when lost If an alloc is being preempted and marked as evict, but the underlying node is lost before the migration takes place, the allocation currently stays as desired evict, status running forever, or until the node comes back online. This commit updates updateNonTerminalAllocsToLost to check for a desired status of Evict as well as Stop when updating allocations on tainted nodes. switch to table test for lost node cases	2020-01-07 13:34:18 -05:00
Preetha Appan	be897cadc3	More error->debug for logging in the bin packing iterator	2019-12-12 15:50:16 -06:00
Preetha Appan	ed1f30e799	Use debug logging for scheduler internals We currently log an error if preemption is unable to find a suitable set of allocations to preempt. This commit changes that to debug level since not finding preemptable allocations is not an error condition.	2019-12-12 12:05:29 -06:00
Michael Schurter	b2dd21a19e	Merge pull request #6792 from hashicorp/b-propose-panic scheduler: fix panic when preempting and evicting allocs	2019-12-03 10:40:19 -08:00
Tim Gross	3716a67b30	scheduler: fix job update placement on prev node penalized (#6781) Fixes #5856 When the scheduler looks for a placement for an allocation that's replacing another allocation, it's supposed to penalize the previous node if the allocation had been rescheduled or failed. But we're currently always penalizing the node, which leads to unnecessary migrations on job update. This commit leaves in place the existing behavior where if the previous alloc was itself rescheduled, its previous nodes are also penalized. This is conservative but the right behavior especially on larger clusters where a group of hosts might be having correlated trouble (like an AZ failure). Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>	2019-12-03 06:14:49 -08:00
Michael Schurter	f12bfdb193	scheduler: update tests with modern error helper	2019-12-02 20:25:52 -08:00
Michael Schurter	6112ad9f92	scheduler: fix panic when preempting and evicting Fixes #6787 In ProposedAllocs the proposed alloc slice was being copied while its contents were not. Since RemoveAllocs nils elements of the proposed alloc slice and is called twice, it could panic on the second call when erroneously accessing a nil'd alloc. The fix is to not copy the proposed alloc slice and pass the slice returned by the 1st RemoveAllocs call to the 2nd call, thus maintaining the trimmed length.	2019-12-02 20:22:22 -08:00
Michael Schurter	62751321bf	Merge pull request #6699 from hashicorp/f-semver-constraints Add new "semver" constraint	2019-11-19 12:18:43 -08:00
Drew Bailey	89964c989a	Removes checking constraints for inplace update	2019-11-19 13:34:41 -05:00
Michael Schurter	75d6d4ec5e	core: add semver constraint The existing version constraint uses logic optimized for package managers, not schedulers, when checking prereleases: - 1.3.0-beta1 will not satisfy ">= 0.6.1" - 1.7.0-rc1 will not satisfy ">= 1.6.0-beta1" This is due to package managers wishing to favor final releases over prereleases. In a scheduler versions more often represent the earliest release all required features/APIs are available in a system. Whether the constraint or the version being evaluated are prereleases has no impact on ordering. This commit adds a new constraint - `semver` - which will use Semver v2.0 ordering when evaluating constraints. Given the above examples: - 1.3.0-beta1 satisfies ">= 0.6.1" using `semver` - 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver` Since existing jobspecs may rely on the old behavior, a new constraint was added and the implicit Consul Connect and Vault constraints were updated to use it.	2019-11-19 08:40:19 -08:00
Drew Bailey	c87c6415eb	DOCS: Spread stanza does not exist on task Fixes documentation inaccuracy for spread stanza placement. Spreads can only exist on the top level job struct or within a group. comment about nil assumption	2019-11-19 08:26:36 -05:00
Drew Bailey	1607a203a8	Check for changes to affinity and constraints Adds checks for affinity and constraint changes when determining if we should update inplace. refactor to check all levels at once check for spread changes when checking inplace update	2019-11-19 08:26:34 -05:00
Chris Baker	bb964defa4	changed all tests to require from t.Fatalf	2019-11-07 22:39:47 +00:00
Chris Baker	bcd1243471	the scheduler checks whether task changes require a restart, this needed to be updated to consider devices	2019-11-07 17:51:15 +00:00
Michael Schurter	08a17854ce	core: fix panic when AllocatedResources is nil Fix for #6540	2019-10-28 14:38:21 -07:00
Danielle Lancashire	ab5ba7aa9b	config: Hoist volume.config.source into volume Currently, using a Volume in a job uses the following configuration: ``` volume "alias-name" { type = "volume-type" read_only = true config { source = "host_volume_name" } } ``` This commit migrates to the following: ``` volume "alias-name" { type = "volume-type" source = "host_volume_name" read_only = true } ``` The original design was chosen because of uncertainty about the future of storage plugins, and to allow maximum flexibility. However, this causes a few issues, namely: - We frequently need to parse this configuration during submission, scheduling, and mounting - It complicates the configuration from an end user's perspective - It complicates the ability to do validation As we understand the problem space of CSI a little more, it has become clear that we won't need the `source` to be in config, as it will be used in the majority of cases: - Host Volumes: Always need a source - Preallocated CSI Volumes: Always needs a source from a volume or claim name - Dynamic Persistent CSI Volumes: Always needs a source to attach the volumes to for managing upgrades and to avoid dangling. - Dynamic Ephemeral CSI Volumes: Less thought out, but `source` will probably point to the plugin name, and a `config` block will allow you to pass meta to the plugin. Or will point to a pre-configured ephemeral config. *If implemented The new design simplifies this by merging the source into the volume stanza to solve the above issues with usability, performance, and error handling.	2019-09-13 04:37:59 +02:00
Preetha Appan	654c72a7b4	update comment	2019-09-05 18:43:30 -05:00
Preetha Appan	87e998d043	Fix inplace updates bug with group level networks During inplace updates, we should be using network information from the previous allocation being updated.	2019-09-05 18:37:24 -05:00
Jasmine Dahilig	c346a47b5b	add default update stanza and max_parallel=0 disables deployments (#6191)	2019-09-02 10:30:09 -07:00
Mahmood Ali	8a0647c9cf	schedulers: check all drivers on node When checking driver feasibility for an alloc with multiple drivers, we must check that all drivers are detected and healthy. Nomad 0.9 and 0.8 have a bug where we may check a single driver only, and which driver is checked depends on map traversal order, which the Go spec leaves unspecified.	2019-08-29 09:03:31 -04:00
Mahmood Ali	542d17e745	scheduler: tests for multiple drivers in TG	2019-08-29 09:03:31 -04:00
Danielle Lancashire	41292055de	scheduler: Implicit constraint on readonly hostvol When a Client declares a volume is ReadOnly, we should only schedule it for requests for ReadOnly volumes. This change means that if a host exposes a readonly volume, we then validate that the group level requests for the volume are all read only for that host.	2019-08-21 20:57:05 +02:00
Danielle Lancashire	af5d42c058	structs: Unify Volume and VolumeRequest	2019-08-12 15:39:08 +02:00
Danielle	0f5cf5fa91	Update scheduler/feasible.go Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>	2019-08-12 15:39:08 +02:00
Danielle Lancashire	709abbc675	scheduler: Add a feasibility checker for Host Vols	2019-08-12 15:39:08 +02:00
Preetha Appan	5a1dd79179	Code review feedback	2019-07-31 01:04:08 -04:00
Preetha Appan	b561816343	Scheduler changes to support network at task group level Also includes unit tests for binpacker and preemption. The tests verify that network resources specified at the task group level are properly accounted for	2019-07-31 01:04:08 -04:00
Nick Ethier	4cb99a1112	scheduler: fix disk constraints	2019-07-31 01:04:08 -04:00
Nick Ethier	e910fdbb32	fix failing tests	2019-07-31 01:04:07 -04:00
Nick Ethier	e15005bdcb	networking: Add new bridge networking mode implementation	2019-07-31 01:04:06 -04:00
Nick Ethier	c742f8b580	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	e20fa7ccc1	Add network lifecycle management Adds new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Lang Martin	2d8bfb8d11	system_sched submits failed evals as blocked	2019-07-18 10:32:12 -04:00
Preetha Appan	bead05f05f	Fix more tests	2019-06-26 16:30:53 -05:00
Preetha Appan	913427428a	Remove compat code associated with many previous versions of nomad This removes compat code for namespaces (0.7), Drain(0.8) and other older features from releases older than Nomad 0.7	2019-06-25 19:05:25 -05:00
Mahmood Ali	2899991ccd	Merge pull request #5790 from hashicorp/b-reschedule-desired-state Mark rescheduled allocs as stopped.	2019-06-13 17:28:59 -04:00
Mahmood Ali	25b44b18db	Test behavior no reschedule for service/batch jobs	2019-06-13 16:41:19 -04:00
Mahmood Ali	34a66835db	Don't stop rescheduleLater allocations When an alloc is due to be rescheduleLater, it goes through the reconciler twice: once to be ignored with a follow up evals, and once again when processing the follow up eval where they appear as rescheduleNow. Here, we ignore them in the first run and mark them as stopped in second iteration; rather than stop them twice.	2019-06-13 09:44:41 -04:00
Mahmood Ali	d342a24ba0	Only preempt for network when there is a network When examining preemption for networks, only consider allocs that have networks. Fixes https://github.com/hashicorp/nomad/issues/5793	2019-06-07 18:55:55 -04:00
Mahmood Ali	2808674fac	test: add tests for network devices and preemption	2019-06-07 18:55:02 -04:00
Mahmood Ali	c62c246ad9	Stop allocs to be rescheduled Currently, when an alloc fails and is rescheduled, the alloc desired state remains as "run" and the nomad client may not free the resources. Here, we ensure that an alloc is marked as stopped when it's rescheduled. Notice the Desired Status and Description before and after this change: Before: ``` mars-2:nomad notnoop$ nomad alloc status 02aba49e ID = 02aba49e Eval ID = bb9ed1d2 Name = example-reschedule.nodes[0] Node ID = 5853d547 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = run Desired Description = <none> Created = 10s ago Modified = 5s ago Replacement Alloc ID = d6bf872b Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 0/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:12:45Z Finished At = 2019-06-06T21:12:50Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:12:50-04:00 Not Restarting Policy allows no restarts 2019-06-06T17:12:50-04:00 Terminated Exit Code: 1 2019-06-06T17:12:45-04:00 Started Task started by client 2019-06-06T17:12:45-04:00 Task Setup Building Task Directory 2019-06-06T17:12:45-04:00 Received Task received by client ``` After: ``` ID = 5001ccd1 Eval ID = 53507a02 Name = example-reschedule.nodes[0] Node ID = a3b04364 Node Name = mars-2.local Job ID = example-reschedule Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = stop Desired Description = alloc was rescheduled because it failed Created = 13s ago Modified = 3s ago Replacement Alloc ID = 7ba7ac20 Task "payload" is "dead" Task Resources CPU Memory Disk Addresses 21/100 MHz 24 MiB/300 MiB 300 MiB Task Events: Started At = 2019-06-06T21:22:50Z Finished At = 2019-06-06T21:22:55Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2019-06-06T17:22:55-04:00 Not Restarting Policy allows no restarts 
2019-06-06T17:22:55-04:00 Terminated Exit Code: 1 2019-06-06T17:22:50-04:00 Started Task started by client 2019-06-06T17:22:50-04:00 Task Setup Building Task Directory 2019-06-06T17:22:50-04:00 Received Task received by client ```	2019-06-06 17:27:12 -04:00
Mahmood Ali	5574b2f3d0	tests: Migrated allocs aren't lost Fix `TestServiceSched_NodeDown` for checking that the migrated allocs are actually marked to be stopped. The boolean logic in test made it skip actually checking client status as long as desired status was stop. Here, we mark some jobs for migration while leaving others as running, and we check that lost flag is only set for non-migrated allocs.	2019-06-06 16:05:07 -04:00
Lang Martin	21dccdf8dd	describe a pending deployment with auto_promote accurately	2019-05-22 12:32:08 -04:00
Lang Martin	2165f8be94	sched reconcile copy AutoPromote to DeploymentState	2019-05-22 12:32:08 -04:00
Preetha Appan	566dd71486	Fix comment and assert score in test case	2019-05-15 12:35:57 -05:00
Nick Ethier	5709bf7b54	fix missing brace	2019-05-15 13:02:04 -04:00
Nick Ethier	ea843a507a	scheduler: add check to prohibit returning inf during spread boost calculation	2019-05-15 13:00:24 -04:00
Lang Martin	b1228536e8	system_sched & test cleanup comments	2019-05-01 12:25:26 -04:00
Lang Martin	d1308420b2	system_sched_test extend the test to check ineligible nodes	2019-05-01 12:25:26 -04:00
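
Note on commit 75d6d4ec5e (semver constraint): the message says the new `semver` constraint orders versions by Semver v2.0 precedence, so whether either side is a prerelease does not affect whether it can match. The following is a minimal sketch of that ordering using only the standard library. It is not Nomad's implementation; `compareSemver`, `satisfiesGTE`, and the helper names are hypothetical, and build metadata and input validation are left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cmpInt returns -1, 0, or 1 for x < y, x == y, x > y.
func cmpInt(x, y int) int {
	switch {
	case x < y:
		return -1
	case x > y:
		return 1
	}
	return 0
}

// numAt parses parts[i] as an integer, treating missing or invalid fields as 0.
func numAt(parts []string, i int) int {
	if i >= len(parts) {
		return 0
	}
	n, _ := strconv.Atoi(parts[i])
	return n
}

// comparePre compares dot-separated prerelease identifiers: numeric fields
// compare numerically, numeric sorts below alphanumeric, alphanumeric fields
// compare as ASCII strings, and a longer list wins when all shared fields tie.
func comparePre(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) && i < len(bs); i++ {
		x, xErr := strconv.Atoi(as[i])
		y, yErr := strconv.Atoi(bs[i])
		switch {
		case xErr == nil && yErr == nil:
			if c := cmpInt(x, y); c != 0 {
				return c
			}
		case xErr == nil:
			return -1
		case yErr == nil:
			return 1
		default:
			if c := strings.Compare(as[i], bs[i]); c != 0 {
				return c
			}
		}
	}
	return cmpInt(len(as), len(bs))
}

// compareSemver orders versions of the form MAJOR.MINOR.PATCH[-PRERELEASE].
func compareSemver(a, b string) int {
	aCore, aPre, _ := strings.Cut(a, "-")
	bCore, bPre, _ := strings.Cut(b, "-")
	as, bs := strings.Split(aCore, "."), strings.Split(bCore, ".")
	for i := 0; i < 3; i++ {
		if c := cmpInt(numAt(as, i), numAt(bs, i)); c != 0 {
			return c
		}
	}
	switch {
	case aPre == "" && bPre == "":
		return 0
	case aPre == "":
		return 1 // a final release outranks a prerelease of the same core version
	case bPre == "":
		return -1
	}
	return comparePre(aPre, bPre)
}

// satisfiesGTE reports whether version meets a ">= constraint" check.
func satisfiesGTE(version, constraint string) bool {
	return compareSemver(version, constraint) >= 0
}

func main() {
	// The two examples from the commit message.
	fmt.Println(satisfiesGTE("1.3.0-beta1", "0.6.1"))
	fmt.Println(satisfiesGTE("1.7.0-rc1", "1.6.0-beta1"))
}
```

Under this ordering both examples from the commit message satisfy their constraints, because the core version is compared first and the prerelease tag only breaks ties between equal core versions.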

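Note on commit 6112ad9f92 (panic when preempting and evicting): the message describes an in-place removal that nils slice elements, called twice, with the second call reading a nil'd entry. The sketch below illustrates that hazard and the described fix of feeding the first call's result into the second. It assumes a swap-with-last in-place filter; the `alloc` type and `removeAllocs` function are hypothetical stand-ins, not Nomad's code.

```go
package main

import "fmt"

// alloc is a hypothetical stand-in for an allocation record.
type alloc struct{ ID string }

// removeAllocs filters allocs in place: each match is overwritten by the last
// live element, the vacated tail slot is set to nil, and the live length
// shrinks. Only the returned slice has the correct (trimmed) length.
func removeAllocs(allocs, remove []*alloc) []*alloc {
	removeSet := make(map[string]struct{}, len(remove))
	for _, r := range remove {
		removeSet[r.ID] = struct{}{}
	}
	n := len(allocs)
	for i := 0; i < n; i++ {
		if _, ok := removeSet[allocs[i].ID]; ok {
			allocs[i], allocs[n-1] = allocs[n-1], nil
			i--
			n--
		}
	}
	return allocs[:n]
}

func main() {
	proposed := []*alloc{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}}

	// First removal. The original slice header keeps its old length, but its
	// tail slot is now nil, so passing `proposed` to a second call would
	// dereference a nil pointer.
	trimmed := removeAllocs(proposed, []*alloc{{ID: "b"}})
	fmt.Println(len(proposed), proposed[3] == nil)

	// The fix described in the commit: pass the slice returned by the first
	// call to the second call, so the trimmed length is preserved.
	trimmed = removeAllocs(trimmed, []*alloc{{ID: "c"}})
	for _, a := range trimmed {
		fmt.Println(a.ID)
	}
}
```

Copying only the slice header does not help here, since both headers share the same backing array and the copy still sees the nil'd tail at its original length.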

632 Commits