nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-07 02:45:42 +03:00

Author	SHA1	Message	Date
Tim Gross	6b2da83f6a	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialize. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
Tim Gross	b363c56c96	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lots of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	b583f7822a	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Zach Shilton	c81fe3cf40	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Anthony	6dcf008fbb	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
Bryce Kalow	f49b3a95dd	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Kevin Wang	57dc7c2ab1	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Seth Hoenig	9e7e5e081e	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Damian Czaja	e4efedbbe4	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Giovani Avelar	2b9158b73e	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Michael Schurter	0779a5bc10	docs: clarify nomad vars vs vault (#14831 ) * docs: clarify nomad vars vs vault I think we should make the difference in root key management between Nomad and Vault clear in the concept docs. I didn't see anywhere else in the docs we compared it. I also s/secrets/variables everywhere except the first sentence since the feature is intended to be more generic than secrets. Right now it's more of a compliment to Consul's kv than Vault due to root key handling and featureset. * Update website/content/docs/concepts/variables.mdx Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-06 13:17:26 -07:00
Tim Gross	f70fcf659e	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Elijah Voigt	5fdcbf085f	Docs(job-specification/periodic): Add enabled toggle (#14767 ) This is probably undocumented for a reason, but the `enabled` toggle in the `periodic` stanza is very useful so I figured I try adding it to the docs. The feature has been secretly avaliable since #9142 and was called out in that PR as being a dubious addition, only added to avoid regressions. The use case for disabling a periodic job in this way is to prevent it from running without modifying the schedule. Ideally Nomad would make it more clear that this was the case, and allow you to force a run of the job, but even with those rough edges I think users would benefit from knowing about this toggle.	2022-10-03 15:08:24 -04:00
Tim Gross	98deb8d8a0	internals documentation with diagrams (#14750 ) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-03 14:06:41 -04:00
Tim Gross	fb1f5ea2d9	Revert removing deprecated client options docs (#14753 ) This reverts PR #12416 and commit `6668ce022a`. While the driver options are well and truly deprecated, this documentation also covers features like `fingerprint.denylist` that are not available any other way. Let's revert this until #12420 is ready.	2022-09-30 08:38:03 -04:00
Derek Strickland	58e76c64d5	Merge pull request #14664 from hashicorp/docs-multiregion-dispatch multiregion: Added a section for multiregion parameterized job dispatch	2022-09-28 15:40:11 -04:00
Derek Strickland	3c63967107	link from dispatch command	2022-09-28 08:30:22 -04:00
Derek Strickland	2c1df34fee	Apply suggestions from code review	2022-09-28 08:18:56 -04:00
Derek Strickland	998f662ecd	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:20:11 -04:00
Derek Strickland	6ac87c396f	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:19:54 -04:00
Seth Hoenig	1e5f6188fb	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in schedular/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	a6dc5ea585	docs: write a lot of words about heartbeats (#14679 ) * docs: write a lot of words about heartbeats Alternative to #14670 * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * use descriptive title for link * rework example of high failover ttl Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-26 14:43:34 -07:00
Michael Schurter	2e059c624f	fingerprint: add node attr for reserverable cores (#14694 ) * fingerprint: add node attr for reserverable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Michael Schurter	d677b48625	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new bevahior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Tim Gross	e7e9713d2e	variables: document restrictions on path and size (#14687 )	2022-09-26 11:40:53 -04:00
Tim Gross	786dc5ff94	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evalution for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	a001abdcdb	Update multiregion.mdx	2022-09-22 14:56:21 -04:00
Derek Strickland	76909e8b0f	multiregion: Added a section for multiregion parameterized job dispatch	2022-09-22 14:50:15 -04:00
Tim Gross	d1e90a17d6	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Bryce Kalow	67d39725b1	website: content updates for developer (#14473 ) Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com> Co-authored-by: Anthony <russo555@gmail.com> Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com> Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com> Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com> Co-authored-by: Kevin Wang <kwangsan@gmail.com>	2022-09-16 10:38:39 -05:00
Mahmood Ali	757c3c94f2	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	06686c84ed	docs: tweak some copy in the concept docs (#14566 )	2022-09-13 13:21:09 -04:00
Seth Hoenig	9cc39de738	Merge pull request #14559 from hashicorp/docs-nsd-check-watcher docs: add documentation for nomad service check restarts	2022-09-13 10:52:01 -05:00
Ashlee M Boyer	faa5b4cf65	docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549 ) * Fixing heading order, adding text for links * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * Applying more suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-13 10:59:02 -04:00
Seth Hoenig	37906ca213	docs: update docs for NSD check restart	2022-09-13 09:59:02 -05:00
Tim Gross	93a147e482	docs: include path in ACL requirements for variables (#14561 ) Also add links to the ACL policy reference and variables concepts docs near the top of the page.	2022-09-13 10:21:29 -04:00
Charlie Voiselle	df04cd15d6	Variables CLI documentation (#14249 )	2022-09-12 16:44:31 -04:00
Tim Gross	6574717c55	docs: update `template` for Nomad Variables (#14527 )	2022-09-12 16:36:18 -04:00
Tim Gross	0c82b1dec9	remove root keyring install API (#14514 ) * keyring rotate API should require put/post method * remove keyring install API	2022-09-09 08:50:35 -04:00
James Rasell	11496d1816	hcl2: add strlen function and update docs. (#14463 )	2022-09-06 18:42:40 +02:00
Luiz Aoqui	7d88937751	connect: interpolate task env in config values (#14445 ) When configuring Consul Service Mesh, it's sometimes necessary to provide dynamic value that are only known to Nomad at runtime. By interpolating configuration values (in addition to configuration keys), user are able to pass these dynamic values to Consul from their Nomad jobs.	2022-09-02 15:00:28 -04:00
Luiz Aoqui	4f9beb0b39	docs: add warning about changing region config (#14443 )	2022-09-01 16:47:06 -04:00
Luiz Aoqui	99ebd0ab26	cli: set -hcl2-strict to false if -hcl1 is defined (#14426 ) These options are mutually exclusive but, since `-hcl2-strict` defaults to `true` users had to explicitily set it to `false` when using `-hcl1`. Also return `255` when job plan fails validation as this is the expected code in this situation.	2022-09-01 10:42:08 -04:00
Tim Gross	3a4345ef32	docs: clarify CSI plugin compatibility (#14434 ) Nomad is generally compliant with the CSI specification for Container Orchestrators (CO), except for unimplemented features. However, some storage vendors have built CSI plugins that are not compliant with the specification or which expect that they're only deployed on Kubernetes. Nomad cannot vouch for the compatibility of any particular plugin, so clarify this in the docs. Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>	2022-09-01 10:06:44 -04:00
Brett Larson	baa3b05d00	Update ephemeral_disk.mdx (#14356 ) It is really unclear on how to use this feature. it took me a while to find this, so I thought I would purpose how to use this.	2022-08-31 20:17:41 -04:00
James Rasell	20f71cf3e4	docs: add documentation for ACL token expiration and ACL roles. (#14332 ) The ACL command docs are now found within a sub-dir like the operator command docs. Updates to the ACL token commands to accommodate token expiry have also been added. The ACL API docs are now found within a sub-dir like the operator API docs. The ACL docs now include the ACL roles endpoint as well as updated ACL token endpoints for token expiration. The configuration section is also updated to accommodate the new ACL and server parameters for the new ACL features.	2022-08-31 16:13:47 +02:00
Tim Gross	b7fea76f7f	keyring: wrap root key in key encryption key (#14388 ) Update the on-disk format for the root key so that it's wrapped with a unique per-key/per-server key encryption key. This is a bit of security theatre for the current implementation, but it uses `go-kms-wrapping` as the interface for wrapping the key. This provides a shim for future support of external KMS such as cloud provider APIs or Vault transit encryption. * Removes the JSON serialization extension we had on the `RootKey` struct; this struct is now only used for key replication and not for disk serialization, so we don't need this helper. * Creates a helper for generating cryptographically random slices of bytes that properly accounts for short reads from the source. * No observable functional changes outside of the on-disk format, so there are no test updates.	2022-08-30 10:59:25 -04:00
Tim Gross	8dfa6b06e7	docs: fixing a few more places we missed "secure" during rename (#14395 )	2022-08-30 10:08:50 -04:00
quoing	92680a83f9	docs: template change script example correction (#14368 ) "path" parameter doesn't work, should be command	2022-08-30 12:09:55 +02:00

1 2 3 4 5 ...

550 Commits