nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-06 02:15:43 +03:00

Author	SHA1	Message	Date
James Rasell	847c2cc528	client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309 ) * client: accommodate Consul 1.14.0 gRPC and agent self changes. Consul 1.14.0 changed the way in which gRPC listeners are configured, particularly when using TLS. Prior to the change, a single listener was responsible for handling plain-text and encrypted gRPC requests. In 1.14.0 and beyond, separate listeners will be used for each, defaulting to 8502 and 8503 for plain-text and TLS respectively. The change means that Nomad’s Consul Connect integration would not work when integrated with Consul clusters using TLS and running 1.14.0 or greater. The Nomad Consul fingerprinter identifies the gRPC port Consul has exposed using the "DebugConfig.GRPCPort" value from Consul’s “/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only represents the plain-text gRPC port which is likely to be disabled in clusters running TLS. In order to fix this issue, Nomad now takes into account the Consul version and configured scheme to optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent self return. The “consul_grpc_socket” allocrunner hook has also been updated so that the fingerprinted gRPC port attribute is passed in. This provides a better fallback method, when the operator does not configure the “consul.grpc_address” option. * docs: modify Consul Connect entries to detail 1.14.0 changes. * changelog: add entry for #15309 * fixup: tidy tests and clean version match from review feedback. * fixup: use strings tolower func.	2022-11-21 09:19:09 -06:00
Tim Gross	21c2d1593a	remove deprecated `AllocUpdateRequestType` raft entry (#15285 ) After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft log entry was no longer in use. Mark this as deprecated, remove the associated dead code, and remove references to the metrics it emits from the docs. We'll leave the entry itself just in case we encounter old raft logs that we need to be able to safely load.	2022-11-17 12:08:04 -05:00
Tim Gross	65b3d01aab	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, we still need to JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look at the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Douglas Jose	1217a96edf	Fix wrong reference to `vault` (#15228 )	2022-11-14 10:49:09 +01:00
Kyle Root	263ed6f9c6	Fix broken URL to nvidia device plugin (#15234 )	2022-11-14 10:37:06 +01:00
Tim Gross	11a5f79084	exec: allow running commands from host volume (#14851 ) The exec driver and other drivers derived from the shared executor check the path of the command before handing off to libcontainer to ensure that the command doesn't escape the sandbox. But we don't check any host volume mounts, which should be safe to use as a source for executables if we're letting the user mount them to the container in the first place. Check the mount config to verify the executable lives in the mount's host path, but then return an absolute path within the mount's task path so that we can hand that off to libcontainer to run. Includes a good bit of refactoring here because the anchoring of the final task path has different code paths for inside the task dir vs inside a mount. But I've fleshed out the test coverage of this a good bit to ensure we haven't created any regressions in the process.	2022-11-11 09:51:15 -05:00
Seth Hoenig	106dce9c9f	docs: clarify how to access task meta values in templates (#15212 ) This PR updates template and meta docs pages to give examples of accessing meta values in templates. To do so one must use the environment variable form of the meta key name, which isn't obvious and wasn't yet documented.	2022-11-10 16:11:53 -06:00
twunderlich-grapl	1b5eedc07a	Fix s3 example URLs in the artifacts docs (#15123 ) * Fix s3 URLs so that they work Unfortunately, s3 urls prefixed with https:// do NOT work with the underlying go-getter library. As such, this fixes the examples so that they are working examples that won't cause problems for people reading the docs. See discussion in https://github.com/hashicorp/nomad/issues/1113 circa 2016. * Use s3:// protocol schema for artifact examples Per the discussion in https://github.com/hashicorp/nomad/pull/15123, we're going to use the explicit s3 protocol in the examples since that is the likeliest to work in all scenarios	2022-11-07 14:14:57 -05:00
Charlie Voiselle	52a254ba22	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Tim Gross	6b2da83f6a	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key is replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialized. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
Tim Gross	b363c56c96	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lot of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	b583f7822a	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Zach Shilton	c81fe3cf40	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Anthony	6dcf008fbb	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
Bryce Kalow	f49b3a95dd	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Kevin Wang	57dc7c2ab1	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Seth Hoenig	9e7e5e081e	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Damian Czaja	e4efedbbe4	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Giovani Avelar	2b9158b73e	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Michael Schurter	0779a5bc10	docs: clarify nomad vars vs vault (#14831 ) * docs: clarify nomad vars vs vault I think we should make the difference in root key management between Nomad and Vault clear in the concept docs. I didn't see anywhere else in the docs we compared it. I also s/secrets/variables everywhere except the first sentence since the feature is intended to be more generic than secrets. Right now it's more of a complement to Consul's kv than Vault due to root key handling and featureset. * Update website/content/docs/concepts/variables.mdx Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-06 13:17:26 -07:00
Tim Gross	f70fcf659e	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Elijah Voigt	5fdcbf085f	Docs(job-specification/periodic): Add enabled toggle (#14767 ) This is probably undocumented for a reason, but the `enabled` toggle in the `periodic` stanza is very useful so I figured I'd try adding it to the docs. The feature has been secretly available since #9142 and was called out in that PR as being a dubious addition, only added to avoid regressions. The use case for disabling a periodic job in this way is to prevent it from running without modifying the schedule. Ideally Nomad would make it more clear that this was the case, and allow you to force a run of the job, but even with those rough edges I think users would benefit from knowing about this toggle.	2022-10-03 15:08:24 -04:00
Tim Gross	98deb8d8a0	internals documentation with diagrams (#14750 ) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-03 14:06:41 -04:00
Tim Gross	fb1f5ea2d9	Revert removing deprecated client options docs (#14753 ) This reverts PR #12416 and commit `6668ce022a`. While the driver options are well and truly deprecated, this documentation also covers features like `fingerprint.denylist` that are not available any other way. Let's revert this until #12420 is ready.	2022-09-30 08:38:03 -04:00
Derek Strickland	58e76c64d5	Merge pull request #14664 from hashicorp/docs-multiregion-dispatch multiregion: Added a section for multiregion parameterized job dispatch	2022-09-28 15:40:11 -04:00
Derek Strickland	3c63967107	link from dispatch command	2022-09-28 08:30:22 -04:00
Derek Strickland	2c1df34fee	Apply suggestions from code review	2022-09-28 08:18:56 -04:00
Derek Strickland	998f662ecd	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:20:11 -04:00
Derek Strickland	6ac87c396f	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:19:54 -04:00
Seth Hoenig	1e5f6188fb	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in scheduler/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	a6dc5ea585	docs: write a lot of words about heartbeats (#14679 ) * docs: write a lot of words about heartbeats Alternative to #14670 * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * use descriptive title for link * rework example of high failover ttl Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-26 14:43:34 -07:00
Michael Schurter	2e059c624f	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Michael Schurter	d677b48625	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new behavior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Tim Gross	e7e9713d2e	variables: document restrictions on path and size (#14687 )	2022-09-26 11:40:53 -04:00
Tim Gross	786dc5ff94	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	a001abdcdb	Update multiregion.mdx	2022-09-22 14:56:21 -04:00
Derek Strickland	76909e8b0f	multiregion: Added a section for multiregion parameterized job dispatch	2022-09-22 14:50:15 -04:00
Tim Gross	d1e90a17d6	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Bryce Kalow	67d39725b1	website: content updates for developer (#14473 ) Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com> Co-authored-by: Anthony <russo555@gmail.com> Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com> Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com> Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com> Co-authored-by: Kevin Wang <kwangsan@gmail.com>	2022-09-16 10:38:39 -05:00
Mahmood Ali	757c3c94f2	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save a few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	06686c84ed	docs: tweak some copy in the concept docs (#14566 )	2022-09-13 13:21:09 -04:00
Seth Hoenig	9cc39de738	Merge pull request #14559 from hashicorp/docs-nsd-check-watcher docs: add documentation for nomad service check restarts	2022-09-13 10:52:01 -05:00
Ashlee M Boyer	faa5b4cf65	docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549 ) * Fixing heading order, adding text for links * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * Applying more suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-13 10:59:02 -04:00
Seth Hoenig	37906ca213	docs: update docs for NSD check restart	2022-09-13 09:59:02 -05:00
Tim Gross	93a147e482	docs: include path in ACL requirements for variables (#14561 ) Also add links to the ACL policy reference and variables concepts docs near the top of the page.	2022-09-13 10:21:29 -04:00
Charlie Voiselle	df04cd15d6	Variables CLI documentation (#14249 )	2022-09-12 16:44:31 -04:00
Tim Gross	6574717c55	docs: update `template` for Nomad Variables (#14527 )	2022-09-12 16:36:18 -04:00
Tim Gross	0c82b1dec9	remove root keyring install API (#14514 ) * keyring rotate API should require put/post method * remove keyring install API	2022-09-09 08:50:35 -04:00
James Rasell	11496d1816	hcl2: add strlen function and update docs. (#14463 )	2022-09-06 18:42:40 +02:00

559 Commits