nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
James Rasell	b44cef0e66	docs: make upgrade version detail clearer. (#18608 ) Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-09-29 08:31:14 +01:00
Luiz Aoqui	54c45ed106	acl: fix parsing of policies with blocks w/o label An ACL policy with a block without label generates unexpected results. For example, a policy such as this: ``` namespace { policy = "read" } ``` Is applied to a namespace called `policy` instead of the documented behaviour of applying it to the `default` namespace. This happens because of the way HCL1 decodes blocks. Since it doesn't know if a block is expected to have a label it applies the `key` tag to the content of the block and, in the example above, the first key is `policy`, so it sets that as the `namespace` block label. Since this happens internally in the HCL decoder it's not possible to detect the problem externally. Fixing the problem inside the decoder is challenging because the JSON and HCL parsers generate different ASTs, which makes it impossible to differentiate a JSON tree from an invalid HCL tree within the decoder. The fix in this commit consists of manually parsing the policy after decoding to clear labels that were not set in the file. This allows the validation rules to consistently catch and return any errors, whether the policy is invalid HCL or invalid JSON.	2023-07-19 10:38:08 -04:00
Michael Schurter	5169950562	docs: v1.6.0 requires ipc_lock cap for mlock (#17881 ) Fixes #17780	2023-07-10 11:53:07 -07:00
Bruce Lok	8953e78dc4	fix typo peers.json (#17538 )	2023-06-19 07:56:51 +01:00
Tim Gross	bd59893956	build: remove 386 builds for Nomad 1.6.0 (#17239 ) The 32-bit Intel builds (aka "386") are not tested and likely have bugs involving platform-sized integers when operated at any non-trivial scale. Remove these builds from the upcoming Nomad 1.6.0 and provide recommendations in the upgrade notes for those users who might have hobbyist boards running 32-bit ARM (this will primarily be the RaspberryPi Zero or older spins of the RaspPi). DO NOT BACKPORT TO 1.5.x OR EARLIER!	2023-05-22 13:27:17 -04:00
Lance Haig	7e93f150b5	cli: tls certs not created with correct SANs (#16959 ) The `nomad tls cert` command did not create certificates with the correct SANs for them to work with non-default domain and region names. This changeset updates the code to support non-default domains and regions in the certificates.	2023-05-22 09:31:56 -04:00
Tim Gross	6155ba3bcf	docs: add note to upgrade guide about yanked version (#17115 ) Nomad 1.5.4 shipped with a logmon bug that we rolled out a fix for in Nomad 1.5.5. Unfortunately we can't yank the release but we should leave a note in the upgrade guide telling users to avoid it.	2023-05-08 13:28:45 -04:00
Tim Gross	3ee02ebc97	post release 1.5.5 (#17098 ) * changelog entries for 1.5.5 and missing merge of changelog for 1.5.4, 1.4.9, and 1.3.14 * note on deprecation of `logs.enabled` field	2023-05-05 11:46:08 -04:00
Tim Gross	c3002db815	client: allow `drain_on_shutdown` configuration (#16827 ) Adds a new configuration to clients to optionally allow them to drain their workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting itself and then monitors the drain state as seen by the server until the drain is complete or the deadline expires. If it loses connection with the server, it will monitor local client status instead to ensure allocations are stopped before exiting.	2023-04-14 15:35:32 -04:00
Daniel Bennett	e6da5c70dc	Update enterprise licensing documentation (#16615 ) updated various docs for new expiration behavior and new command `nomad license inspect` to validate pre-upgrade	2023-03-30 16:40:19 -05:00
Luiz Aoqui	f2bfbfaf03	acl: update job eval requirement to `submit-job` (#16463 ) The job evaluate endpoint creates a new evaluation for the job which is a write operation. This change modifies the necessary capability from `read-job` to `submit-job` to better reflect this.	2023-03-13 17:13:54 -04:00
Seth Hoenig	95359b8c4c	client: disable running artifact downloader as nobody (#16375 ) * client: disable running artifact downloader as nobody This PR reverts a change from Nomad 1.5 where artifact downloads were executed as the nobody user on Linux systems. This was done as an attempt to improve the security model of artifact downloading where third party tools such as git or mercurial would be run as the root user with all the security implications thereof. However, doing so conflicts with Nomad's own advice for securing the Client data directory - which when set up with the recommended directory permissions structure prevents artifact downloads from working as intended. Artifact downloads are at least still now executed as a child process of the Nomad agent, and on modern Linux systems make use of the kernel Landlock feature for limiting filesystem access of the child process. * docs: update upgrade guide for 1.5.1 sandboxing * docs: add cl * docs: add title to upgrade guide fix	2023-03-08 15:58:43 -06:00
James Rasell	b677ec7e99	docs: add 1.5.0, 1.4.5, and 1.3.10 pause regression upgrade note. (#16358 )	2023-03-07 18:29:03 +01:00
Tim Gross	8373434b69	docs: clarify upgrade note on 1.4.0 panics (#16171 ) The panic bug for upgrades with older servers that shipped in 1.4.0 was fixed in 1.4.1, which makes the versions described in the warning in the upgrade guide misleading. Clarify the upgrade guide.	2023-02-14 11:26:33 -05:00
Seth Hoenig	511d0c1e70	artifact: protect against unbounded artifact decompression (1.5.0) (#16151 ) * artifact: protect against unbounded artifact decompression Starting with 1.5.0, set default values for artifact decompression limits. artifact.decompression_size_limit (default "100GB") - the maximum amount of data that will be decompressed before triggering an error and cancelling the operation artifact.decompression_file_count_limit (default 4096) - the maximum number of files that will be decompressed before triggering an error and cancelling the operation. * artifact: assert limits cannot be nil in validation	2023-02-14 09:28:39 -06:00
Tim Gross	88cd93bd94	docs: fix links in 1.5.0 upgrade guide (#16106 )	2023-02-09 09:39:49 -05:00
Tim Gross	6145cdcd11	cli: remove deprecated `keyring` and `keygen` commands (#16068 ) These commands were marked as deprecated in 1.4.0 with intent to remove in 1.5.0. Remove them and clean up the docs.	2023-02-07 09:49:52 -05:00
jmwilkinson	46f3977db2	Allow wildcard datacenters to be specified in job file (#11170 ) Also allows for default value of `datacenters = ["*"]`	2023-02-02 09:57:45 -05:00
Luiz Aoqui	e23e366e8a	docs: add upgrade notice for batch GC changes (#15985 )	2023-01-31 15:04:12 -05:00
Piotr Kazmierczak	949a6f60c7	renamed stanza to block for consistency with other projects (#15941 )	2023-01-30 15:48:43 +01:00
Ashlee M Boyer	3444ece549	docs: Migrate link formats (#15779 ) * Adding check-legacy-links-format workflow * Adding test-link-rewrites workflow * chore: updates link checker workflow hash * Migrating links to new format Co-authored-by: Kendall Strautman <kendallstrautman@gmail.com>	2023-01-25 09:31:14 -08:00
Tim Gross	9bdb6a5b7d	Rename `nomad.broker.total_blocked` metric (#15835 ) This changeset fixes a long-standing point of confusion in metrics emitted by the eval broker. The eval broker has a queue of "blocked" evals that are waiting for an in-flight ("unacked") eval of the same job to be completed. But this "blocked" state is not the same as the `blocked` status that we write to raft and expose in the Nomad API to end users. There's a second metric `nomad.blocked_eval.total_blocked` that refers to evaluations in that state. This has caused ongoing confusion in major customer incidents and even in our own documentation! (Fixed in this PR.) There's little functional change in this PR aside from the name of the metric emitted, but there's a bit of refactoring to clean up the names in `eval_broker.go` so that there aren't name collisions and multiple names for the same state. Changes included are: * Everything that was previously called "pending" referred to entities that were associated with the "ready" metric. These are all now called "ready" to match the metric. * Everything named "blocked" in `eval_broker.go` is now named "pending", except for a couple of comments that actually refer to blocked RPCs. * Added a note to the upgrade guide docs for 1.5.0. * Fixed the scheduling performance metrics docs because the description for `nomad.broker.total_blocked` was actually the description for `nomad.blocked_eval.total_blocked`.	2023-01-20 14:23:56 -05:00
Anthony Davis	abe088954e	Fix rejoin_after_leave behavior (#15552 )	2023-01-11 16:39:24 -05:00
Seth Hoenig	493389e861	artifact: enable inheriting environment variables from client (#15514 ) * artifact: enable inheriting environment variables from client This PR adds client configuration for specifying environment variables that should be inherited by the artifact sandbox process from the Nomad Client agent. Most users should not need to set these values but the configuration is provided to ensure backwards compatibility. Configuration of go-getter should ideally be done through the artifact block in a jobspec task. e.g. ```hcl client { artifact { set_environment_variables = "TMPDIR,GIT_SSH_OPTS" } } ``` Closes #15498 * website: update set_environment_variables text to mention PATH	2022-12-09 15:46:07 -06:00
Seth Hoenig	990537e8ba	artifact: add client toggle to disable filesystem isolation (#15503 ) This PR adds the client config option for turning off filesystem isolation, applicable on Linux systems where filesystem isolation is possible and enabled by default. ```hcl client{ artifact { disable_filesystem_isolation = <bool:false> } } ``` Closes #15496	2022-12-08 12:29:23 -06:00
Seth Hoenig	cfc67c3422	client: sandbox go-getter subprocess with landlock (#15328 ) * client: sandbox go-getter subprocess with landlock This PR re-implements the getter package for artifact downloads as a subprocess. Key changes include On all platforms, run getter as a child process of the Nomad agent. On Linux platforms running as root, run the child process as the nobody user. On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock). On all platforms, restrict environment variables of the child process to a static set. notably TMP/TEMP now points within the allocation's task directory kernel.landlock attribute is fingerprinted (version number or unavailable) These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector. Adds new e2e/artifact suite for ensuring artifact downloading works. TODO: Windows git test (need to modify the image, etc... followup PR) * landlock: fixup items from cr * cr: fixup tests and go.mod file	2022-12-07 16:02:25 -06:00
Zach Shilton	c81fe3cf40	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Bryce Kalow	f49b3a95dd	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Tim Gross	f70fcf659e	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Seth Hoenig	1e5f6188fb	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in scheduler/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	d677b48625	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new behavior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Tim Gross	786dc5ff94	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Tim Gross	d1e90a17d6	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Mahmood Ali	757c3c94f2	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Piotr Kazmierczak	2d4acce3da	docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212 ) This PR documents a change made in the enterprise version of Nomad that addresses the following issue: When a user tries to filter audit logs, they do so with a stanza that looks like the following: audit { enabled = true filter "remove deletes" { type = "HTTPEvent" endpoints = ["*"] stages = ["OperationComplete"] operations = ["DELETE"] } } When specifying both an "endpoint" and a "stage", the events with both a matching "endpoint" AND a matching "stage" will be filtered. When specifying both an "endpoint" and an "operation", the events with both a matching "endpoint" AND a matching "operation" will be filtered. When specifying both a "stage" and an "operation", the events with a matching "stage" OR a matching "operation" will be filtered. The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".	2022-08-24 16:31:49 +02:00
Luiz Aoqui	934bafb922	template: use pointer values for gid and uid (#14203 ) When a Nomad agent starts and loads jobs that already existed in the cluster, the default template uid and gid was being set to 0, since this is the zero value for int. This caused these jobs to fail in environments where it was not possible to use 0, such as in Windows clients. In order to differentiate between an explicit 0 and a template where these properties were not set we need to use a pointer.	2022-08-22 16:25:49 -04:00
Seth Hoenig	f8c4ad8cde	docs: remove consul 1.12.0 warning	2022-07-12 09:53:17 -05:00
Tim Gross	f295396ef8	docs: rename Internals to Concepts (#13696 )	2022-07-11 16:55:33 -04:00
Seth Hoenig	64f35f9cf3	docs: move upgrade docs for max_client_timeout Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-07-07 16:46:26 -05:00
Seth Hoenig	cbcceb0625	docs: upgrade guide for client max_kill_timeout	2022-07-07 15:27:40 -05:00
Seth Hoenig	f1cafd0789	core: remove support for raft protocol version 2 This PR checks server config for raft_protocol, which must now be set to 3 or unset (0). When unset, version 3 is used as the default.	2022-06-23 14:37:50 +00:00
Arthur Leclerc	7518f42d1c	docs: Fix typo (#13389 )	2022-06-16 13:24:18 -04:00
Michael Schurter	3968509886	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00
Luiz Aoqui	59ce4f8caf	docs: add Consul 1.12.0 upgrade notice	2022-05-16 18:44:26 -04:00
Tim Gross	3671ea6a8f	remove pre-0.9 driver code and related E2E test (#12791 ) This test exercises upgrades between 0.8 and Nomad versions greater than 0.9. We have not supported 0.8.x in a very long time and in any case the test has been marked to skip because the downloader doesn't work.	2022-04-27 09:53:37 -04:00
James Rasell	89b74632d4	docs: add upgrade note for Consul implicit constraint. (#12749 )	2022-04-22 15:53:27 +02:00
Seth Hoenig	b2a2f77d40	docs: update documentation with connect acls changes This PR updates the changelog, adds notes to the 1.3 upgrade guide, and updates the connect integration docs with documentation about the new requirement on Consul ACL policies of Consul agent default anonymous ACL tokens.	2022-04-18 08:22:33 -05:00
Seth Hoenig	f2ea1fab5a	connect: prefix tag with nomad.; merge into envoy_stats_tags; update docs This PR expands on the work done in #12543 to - prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags - merge into pre-existing envoy_stats_tags fields - update the upgrade guide docs - update changelog	2022-04-14 12:52:52 -05:00
Seth Hoenig	be80a63584	docs: fixup title formatting in upgrade guide	2022-04-08 11:50:54 -05:00
Luiz Aoqui	c3e36bb367	docs: fix upgrade specific broken link and conflict tag (#12521 )	2022-04-08 12:36:47 -04:00
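
Commit 1e5f6188fb in the log above changes constraint comparisons to be numeric rather than lexical when both operands are integers or floats. The following is a minimal Python sketch of that idea only, not Nomad's implementation; the helper names `_as_number` and `less_than` are hypothetical and chosen for illustration.

```python
def _as_number(text):
    """Return text parsed as an int, then as a float, or None if neither works.

    Hypothetical helper: this is not how Nomad parses operands.
    """
    for parse in (int, float):
        try:
            return parse(text.strip())
        except ValueError:
            continue
    return None


def less_than(left, right):
    """Compare two operand strings, numerically when both are numbers."""
    l_num, r_num = _as_number(left), _as_number(right)
    if l_num is not None and r_num is not None:
        # Both operands are numeric: "9" sorts before "10".
        return l_num < r_num
    # Otherwise fall back to a plain lexical (string) comparison.
    return left < right
```

With a purely lexical comparison, `"9" < "10"` is false because the character `9` sorts after `1`; the numeric branch avoids that surprise while leaving non-numeric operands unchanged.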

92 Commits