nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-05 01:45:44 +03:00

Author	SHA1	Message	Date
Luiz Aoqui	329807bd7f	docs: add cpu-allocated and memory-allocated (#15299 ) Document the Autoscaler Nomad APM parameters `cpu-allocated` and `memory-allocated` that were implemented in https://github.com/hashicorp/nomad-autoscaler/pull/324 and https://github.com/hashicorp/nomad-autoscaler/pull/334	2022-11-18 10:55:17 -05:00
Tim Gross	21c2d1593a	remove deprecated `AllocUpdateRequestType` raft entry (#15285 ) After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft log entry was no longer in use. Mark this as deprecated, remove the associated dead code, and remove references to the metrics it emits from the docs. We'll leave the entry itself just in case we encounter old raft logs that we need to be able to safely load.	2022-11-17 12:08:04 -05:00
Ayrat Badykov	322c6b3dce	fix create snapshot request docs (#15242 )	2022-11-17 08:43:40 +01:00
Nikita Beletskii	b55ab6318e	Fix variable create API example in docs (#15248 )	2022-11-15 16:04:11 +01:00
Tim Gross	65b3d01aab	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, we still need to JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look at the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes.
Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Douglas Jose	1217a96edf	Fix wrong reference to `vault` (#15228 )	2022-11-14 10:49:09 +01:00
Kyle Root	263ed6f9c6	Fix broken URL to nvidia device plugin (#15234 )	2022-11-14 10:37:06 +01:00
Tim Gross	11a5f79084	exec: allow running commands from host volume (#14851 ) The exec driver and other drivers derived from the shared executor check the path of the command before handing off to libcontainer to ensure that the command doesn't escape the sandbox. But we don't check any host volume mounts, which should be safe to use as a source for executables if we're letting the user mount them to the container in the first place. Check the mount config to verify the executable lives in the mount's host path, but then return an absolute path within the mount's task path so that we can hand that off to libcontainer to run. Includes a good bit of refactoring here because the anchoring of the final task path has different code paths for inside the task dir vs inside a mount. But I've fleshed out the test coverage of this a good bit to ensure we haven't created any regressions in the process.	2022-11-11 09:51:15 -05:00
Seth Hoenig	106dce9c9f	docs: clarify how to access task meta values in templates (#15212 ) This PR updates template and meta docs pages to give examples of accessing meta values in templates. To do so one must use the environment variable form of the meta key name, which isn't obvious and wasn't yet documented.	2022-11-10 16:11:53 -06:00
twunderlich-grapl	1b5eedc07a	Fix s3 example URLs in the artifacts docs (#15123 ) * Fix s3 URLs so that they work Unfortunately, s3 urls prefixed with https:// do NOT work with the underlying go-getter library. As such, this fixes the examples so that they are working examples that won't cause problems for people reading the docs. See discussion in https://github.com/hashicorp/nomad/issues/1113 circa 2016. * Use s3:// protocol schema for artifact examples Per the discussion in https://github.com/hashicorp/nomad/pull/15123, we're going to use the explicit s3 protocol in the examples since that is the likeliest to work in all scenarios	2022-11-07 14:14:57 -05:00
Tim Gross	ce0e0768ff	API for `Eval.Count` (#15147 ) Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is designed to support interactive use in the `nomad eval delete` command to get a count of evals expected to be deleted before doing so. The state store operations to do this sort of thing are somewhat expensive, but it's cheaper than serializing a big list of evals to JSON. Note that although it seems like this could be done as an extra parameter and response field on `Eval.List`, having it as its own endpoint avoids having to change the response body shape and lets us avoid handling the legacy filter params supported by `Eval.List`.	2022-11-07 08:53:19 -05:00
Charlie Voiselle	52a254ba22	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Phil Renaud	85f472189a	Accidentally trailed off on a docs paragraph (#15118 )	2022-11-02 23:33:41 -04:00
Phil Renaud	1a29e72f7f	[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833 ) * Adds meta to job list stub and displays a pack logo on the jobs index * Changelog * Modifying struct for optional meta param * Explicitly ask for meta anytime I look up a job from index or job page * Test case for the endpoint * adding meta field to API struct and omitting from response if empty * passthru method added to api/jobs.list * Meta param listed in docs for jobs list * Update api/jobs.go Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-11-02 16:58:24 -04:00
Tim Gross	6b2da83f6a	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key is replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialized. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
Tim Gross	b363c56c96	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lot of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	b583f7822a	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
Zach Shilton	563e5e3d57	docs: add details to redirects file (#15020 )	2022-10-24 13:16:07 -04:00
Luiz Aoqui	f2318ed2ec	docs: use of `node_class` when autoscaling (#14950 ) Document how the value of `node_class` is used during cluster scaling. https://github.com/hashicorp/nomad-autoscaler/issues/255	2022-10-21 10:35:45 -04:00
James Rasell	1c9b4e398d	acl: add ACL roles to event stream topic and resolve policies. (#14923 ) This change adds ACL role creation and deletion to the event stream. It is exposed as a single topic with two types; the filter is primarily the role ID but also includes the role name. While conducting this work it was also discovered that the events stream has its own ACL resolution logic. This did not account for ACL tokens which included role links, or tokens with expiry times. ACL role links are now resolved to their policies and tokens are checked for expiry correctly.	2022-10-20 09:43:35 +02:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Luiz Aoqui	56816f2f93	docs: expand Autoscaling documentation (#14937 ) Rename `Internals` section to `Concepts` to match core docs structure and expand on how policies are evaluated. Also include missing documentation for check grouping and fix examples to use the new feature.	2022-10-19 17:57:08 -04:00
Luiz Aoqui	3fd800c600	docs: add autoscaling debug (#14941 )	2022-10-19 14:17:41 -04:00
Luiz Aoqui	38606a6a5b	docs: move autoscaling `source` agent config (#14947 ) Move the Autoscaler agent configuration `source` to the `policy` page since they are very closely related. Also update all headers in this section so they follow the proper `h1 > h2 > h3 > ...` hierarchy.	2022-10-19 14:17:09 -04:00
Luiz Aoqui	876ea90075	docs: explain autoscaler target-value strategy (#14951 ) Provide more technical details about how the `target-value` strategy calculates new scaling actions.	2022-10-19 14:16:17 -04:00
Zach Shilton	c81fe3cf40	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Anthony	6dcf008fbb	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
HashiBot	bf279ac019	chore: Update Digital Team Files (#14945 ) * Update generated scripts (website-start.sh) * Update generated scripts (should-build.sh) * Update generated scripts (website-build.sh) * Update generated website Makefile	2022-10-18 17:43:31 -04:00
HashiBot	c9bd653815	chore: Update Digital Team Files (#14940 ) * Update generated scripts (should-build.sh) * Update generated scripts (website-build.sh) * Update generated scripts (website-start.sh) * Update generated website Makefile	2022-10-18 12:36:24 -04:00
Zach Shilton	cc2b449911	website: redirects to empty array (#14921 )	2022-10-18 11:57:36 -04:00
Bryce Kalow	f49b3a95dd	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Kevin Wang	57dc7c2ab1	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Seth Hoenig	9e7e5e081e	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Damian Czaja	e4efedbbe4	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Giovani Avelar	2b9158b73e	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Michael Schurter	0779a5bc10	docs: clarify nomad vars vs vault (#14831 ) * docs: clarify nomad vars vs vault I think we should make the difference in root key management between Nomad and Vault clear in the concept docs. I didn't see anywhere else in the docs we compared it. I also s/secrets/variables everywhere except the first sentence since the feature is intended to be more generic than secrets. Right now it's more of a complement to Consul's kv than Vault due to root key handling and featureset. * Update website/content/docs/concepts/variables.mdx Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-06 13:17:26 -07:00
HashiBot	873d4f33c8	website: upgrade next version (#14830 ) Co-authored-by: Bryce Kalow <bkalow@hashicorp.com>	2022-10-06 13:48:11 -05:00
Tim Gross	f70fcf659e	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Elijah Voigt	5fdcbf085f	Docs(job-specification/periodic): Add enabled toggle (#14767 ) This is probably undocumented for a reason, but the `enabled` toggle in the `periodic` stanza is very useful so I figured I'd try adding it to the docs. The feature has been secretly available since #9142 and was called out in that PR as being a dubious addition, only added to avoid regressions. The use case for disabling a periodic job in this way is to prevent it from running without modifying the schedule. Ideally Nomad would make it more clear that this was the case, and allow you to force a run of the job, but even with those rough edges I think users would benefit from knowing about this toggle.	2022-10-03 15:08:24 -04:00
Tim Gross	98deb8d8a0	internals documentation with diagrams (#14750 ) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-03 14:06:41 -04:00
dependabot[bot]	6aee370969	build(deps-dev): bump @hashicorp/platform-cli in /website (#14541 ) Bumps [@hashicorp/platform-cli](https://github.com/hashicorp/web-platform-packages/tree/HEAD/packages/cli) from 2.1.0 to 2.3.0. - [Release notes](https://github.com/hashicorp/web-platform-packages/releases) - [Changelog](https://github.com/hashicorp/web-platform-packages/blob/main/packages/cli/CHANGELOG.md) - [Commits](https://github.com/hashicorp/web-platform-packages/commits/@hashicorp/platform-cli@2.3.0/packages/cli) --- updated-dependencies: - dependency-name: "@hashicorp/platform-cli" dependency-type: direct:development update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-09-30 14:59:55 -04:00
Tim Gross	fb1f5ea2d9	Revert removing deprecated client options docs (#14753 ) This reverts PR #12416 and commit `6668ce022a`. While the driver options are well and truly deprecated, this documentation also covers features like `fingerprint.denylist` that are not available any other way. Let's revert this until #12420 is ready.	2022-09-30 08:38:03 -04:00
Derek Strickland	58e76c64d5	Merge pull request #14664 from hashicorp/docs-multiregion-dispatch multiregion: Added a section for multiregion parameterized job dispatch	2022-09-28 15:40:11 -04:00
Derek Strickland	3c63967107	link from dispatch command	2022-09-28 08:30:22 -04:00
Derek Strickland	2c1df34fee	Apply suggestions from code review	2022-09-28 08:18:56 -04:00
Derek Strickland	998f662ecd	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:20:11 -04:00
Derek Strickland	6ac87c396f	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:19:54 -04:00
Seth Hoenig	1e5f6188fb	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in scheduler/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	a6dc5ea585	docs: write a lot of words about heartbeats (#14679 ) * docs: write a lot of words about heartbeats Alternative to #14670 * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * use descriptive title for link * rework example of high failover ttl Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-26 14:43:34 -07:00
Michael Schurter	2e059c624f	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
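
Commit 1e5f6188fb in the log above says constraint comparisons become numeric rather than lexical when both operands are integers or floats. The Go sketch below illustrates that idea only. The helper name `compareOperands`, its return convention, and the integer-then-float-then-lexical fallback order are assumptions made for illustration; this is not the code in Nomad's scheduler.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareOperands is a hypothetical helper (not part of Nomad's API).
// It returns -1, 0, or 1 when l is less than, equal to, or greater than r.
func compareOperands(l, r string) int {
	// Try int64 first so that large integers keep full precision.
	li, lerr := strconv.ParseInt(l, 10, 64)
	ri, rerr := strconv.ParseInt(r, 10, 64)
	if lerr == nil && rerr == nil {
		switch {
		case li < ri:
			return -1
		case li > ri:
			return 1
		}
		return 0
	}

	// Otherwise, compare as floats if both operands parse as floats.
	lf, lerr := strconv.ParseFloat(l, 64)
	rf, rerr := strconv.ParseFloat(r, 64)
	if lerr == nil && rerr == nil {
		switch {
		case lf < rf:
			return -1
		case lf > rf:
			return 1
		}
		return 0
	}

	// Fall back to a lexical comparison for non-numeric operands.
	return strings.Compare(l, r)
}

func main() {
	// Numeric comparison: 9 is less than 10.
	fmt.Println(compareOperands("9", "10")) // prints -1
	// Purely lexical comparison orders "9" after "10".
	fmt.Println(strings.Compare("9", "10")) // prints 1
}
```

The two calls in `main` show why the distinction matters: a lexical comparison orders "9" after "10", while the numeric path orders it before.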


4268 Commits