nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
Brendan MacDonell	26485c45a2	Add job_max_count option to keep Nomad server from running out of memory (#26858 ) If a Nomad job is started with a large number of instances (e.g. 4 billion), then the Nomad servers that attempt to schedule it will run out of memory and crash. While it's unlikely that anyone would intentionally schedule a job with 4 billion instances, we have occasionally run into issues with bugs in external automation. For example, an automated deployment system running on a test environment had an off-by-one error, and deployed a job with count = uint32(-1), causing the Nomad servers for that environment to run out of memory and crash. To prevent this, this PR introduces a job_max_count Nomad server configuration parameter. job_max_count limits the number of allocs that may be created from a job. The default value is 50000 - this is low enough that a job with the maximum possible number of allocs will not require much memory on the server, but is still much higher than the number of allocs in the largest Nomad job we have ever run.	2025-10-06 09:35:10 -04:00
James Rasell	7466dd71b2	server: Add new `server.client_introduction` config block. (#26315 ) The new configuration block exposes some key options which allow cluster administrators to control certain client introduction behaviours. This change introduces the new block and plumbing, so that it is exposed in the Nomad server for consumption via internal processes.	2025-07-22 08:50:19 +01:00
Tim Gross	3f59860254	host volumes: add configuration to GC on node GC (#25903 ) When a node is garbage collected, any dynamic host volumes on the node are orphaned in the state store. We generally don't want to automatically collect these volumes and risk data loss, and have provided a CLI flag to `-force` remove them in #25902. But for clusters running on ephemeral cloud instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add excessive friction. Add a configuration knob to the client configuration to remove host volumes from the state store on node GC. Ref: https://github.com/hashicorp/nomad/pull/25902 Ref: https://github.com/hashicorp/nomad/issues/25762 Ref: https://hashicorp.atlassian.net/browse/NMD-705	2025-05-27 10:22:08 -04:00
tehut	55523ecf8e	Add NodeMaxAllocations to client configuration (#25785 ) * Set MaxAllocations in client config Add NodeAllocationTracker struct to Node struct Evaluate MaxAllocations in AllocsFit function Set up cli config parsing Integrate maxAllocs into AllocatedResources view Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-05-22 12:49:27 -07:00
Daniel Bennett	15c01e5a49	ipv6: normalize addrs per RFC-5942 §4 (#25921 ) https://datatracker.ietf.org/doc/html/rfc5952#section-4 * copy NormalizeAddr func from vault * PRs hashicorp/vault#29228 & hashicorp/vault#29517 * normalize bind/advertise addrs * normalize consul/vault addrs	2025-05-22 14:21:30 -04:00
Tim Gross	8a5a057d88	offline license utilization reporting (#25844 ) Nomad Enterprise users operating in air-gapped or otherwise secured environments don't want to send license reporting metrics directly from their servers. Implement manual/offline reporting by periodically recording usage metrics snapshots in the state store, and providing an API and CLI by which cluster administrators can download the snapshot for review and out-of-band transmission to HashiCorp. This is the CE portion of the work required for implemention in the Enterprise product. Nomad CE does not perform utilization reporting. Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673 Ref: https://hashicorp.atlassian.net/browse/NMD-68 Ref: https://go.hashi.co/rfc/nmd-210	2025-05-14 09:51:13 -04:00
James Rasell	0b265d2417	encrypter: Track initial tasks for is ready calculation. (#25803 ) The server startup could "hang" to the view of an operator if it had a key that could not be decrypted or replicated loaded from the FSM at startup. In order to prevent this happening, the server startup function will now use a timeout to wait for the encrypter to be ready. If the timeout is reached, the error is sent back to the caller which fails the CLI command. This bubbling of error message will also flush to logs which will provide addition operator feedback. The server only cares about keys loaded from the FSM snapshot and trailing logs before the encrypter should be classed as ready. So that the encrypter ready function does not get blocked by keys added outside of the initial Raft load, we take a snapshot of the decryption tasks as we enter the blocking call, and class these as our barrier.	2025-05-07 15:38:16 +01:00
Nikita Eliseev	76fb3eb9a1	rpc: added configuration for yamux session (#25466 ) Fixes: https://github.com/hashicorp/nomad/issues/25380	2025-04-02 10:58:23 -04:00
Michael Smithhisler	5c4d0e923d	consul: Remove legacy token based authentication workflow (#25217 )	2025-03-05 15:38:11 -05:00
James Rasell	7268053174	vault: Remove legacy token based authentication workflow. (#25155 ) The legacy workflow for Vault whereby servers were configured using a token to provide authentication to the Vault API has now been removed. This change also removes the workflow where servers were responsible for deriving Vault tokens for Nomad clients. The deprecated Vault config options used byi the Nomad agent have all been removed except for "token" which is still in use by the Vault Transit keyring implementation. Job specification authors can no longer use the "vault.policies" parameter and should instead use "vault.role" when not using the default workload identity. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-28 07:40:02 +00:00
James Rasell	7d48aa2667	client: emit optional telemetry from prerun and prestart hooks. (#24556 ) The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete. The new datapoints are: - nomad.client.alloc_hook.prerun.success (counter) - nomad.client.alloc_hook.prerun.failed (counter) - nomad.client.alloc_hook.prerun.elapsed (sample) - nomad.client.task_hook.prestart.success (counter) - nomad.client.task_hook.prestart.failed (counter) - nomad.client.task_hook.prestart.elapsed (sample) The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance. Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.	2024-12-12 14:43:14 +00:00
Juliano Martinez	4a74fda8ce	Allow client template config block to be parsed when using json config (#24007 ) - Adds tests - Adds sample test data for parsing hcl and json - Adds changelog	2024-10-01 15:44:36 -04:00
Daniel Bennett	2f5cf8efae	networking: option to enable ipv6 on bridge network (#23882 ) by setting bridge_network_subnet_ipv6 in client config Co-authored-by: Martina Santangelo <martina.santangelo@hashicorp.com>	2024-09-04 10:17:10 -05:00
Tim Gross	9d4686c0df	tls: remove deprecated `prefer_server_cipher_suites` field (#23712 ) The TLS configuration object includes a deprecated `prefer_server_cipher_suites` field. In version of Go prior to 1.17, this property controlled whether a TLS connection would use the cipher suites preferred by the server or by the client. This field is ignored as of 1.17 and, according to the `crypto/tls` docs: "Servers now select the best mutually supported cipher suite based on logic that takes into account inferred client hardware, server hardware, and security." This property has been long-deprecated and leaving it in place may lead to false assumptions about how cipher suites are negotiated in connection to a server. So we want to remove it in Nomad 1.9.0. Fixes: https://github.com/hashicorp/nomad-enterprise/issues/999 Ref: https://hashicorp.atlassian.net/browse/NET-10531	2024-08-01 08:52:05 -04:00
Tim Gross	c8be863bc8	reporting: allow export interval and address to be configurable (#23674 ) The go-census library supports configuration to send metrics to a local development version of the collector. Add "undocumented" configuration options to the `reporting` block allow developers to debug and verify we're sending the data we expect with real Nomad servers and not just unit tests. Ref: https://hashicorp.atlassian.net/browse/NET-10057 Ref: https://github.com/hashicorp/nomad-enterprise/pull/1708	2024-07-24 08:29:59 -04:00
Tim Gross	c970d22164	keyring: support external KMS for key encryption key (KEK) (#23580 ) In Nomad 1.4.0, we shipped support for encrypted Variables and signed Workload Identities, but the key material is protected only by a AEAD encrypting the KEK. Add support for Vault transit encryption and external KMS from major cloud providers. The servers call out to the external service to decrypt each key in the on-disk keystore. Ref: https://hashicorp.atlassian.net/browse/NET-10334 Fixes: https://github.com/hashicorp/nomad/issues/14852	2024-07-18 09:42:28 -04:00
James Rasell	facc3e8013	agent: allow configuration of in-memory telemetry sink. (#20166 ) This change adds configuration options for setting the in-memory telemetry sink collection and retention durations. This sink backs the metrics JSON API and previously had hard-coded default values. The new options are particularly useful when running development or debug environments, where metrics collection is desired at a fast and granular rate.	2024-03-25 15:00:18 +00:00
Seth Hoenig	05937ab75b	exec2: add client support for unveil filesystem isolation mode (#20115 ) * exec2: add client support for unveil filesystem isolation mode This PR adds support for a new filesystem isolation mode, "Unveil". The mode introduces a "alloc_mounts" directory where tasks have user-owned directory structure which are bind mounts into the real alloc directory structure. This enables a task driver to use landlock (and maybe the real unveil on openbsd one day) to isolate a task to the task owned directory structure, providing sandboxing. * actually create alloc-mounts-dir directory * fix doc strings about alloc mount dir paths	2024-03-13 08:24:17 -05:00
Piotr Kazmierczak	248b2ba5cd	WI: use single auth method for Consul by default (#19169 ) This simplifies the default setup of Nomad workloads WI-based authentication for Consul by using a single auth method with 2 binding rules. Users can still specify separate auth methods for services and tasks.	2023-11-28 12:22:27 +01:00
Luiz Aoqui	5ff6cce3ab	vault: update default JWT auth method path (#19188 ) Update default auth method path to be `jwt-nomad` to avoid potential conflicts when Vault's `jwt` default is already being used for something else.	2023-11-27 17:48:12 -05:00
Tim Gross	9d075c44b2	config: remove old Vault/Consul config blocks from parser (#18997 ) Remove the now-unused original configuration blocks for Consul and Vault from the agent configuration parsing. When the agent needs to refer to a Consul or Vault block it will always be for a specific cluster for the task/service (or the default cluster for the agent's own use). This is third of three changesets for this work. Fixes: https://github.com/hashicorp/nomad/issues/18947 Ref: https://github.com/hashicorp/nomad/pull/18991 Ref: https://github.com/hashicorp/nomad/pull/18994	2023-11-08 09:30:08 -05:00
Piotr Kazmierczak	7f62dec473	consul WI: rename default auth method for services (#18867 ) It should be called nomad-services instead of nomad-workloads.	2023-10-26 09:43:33 +02:00
Tim Gross	d0957eb109	Consul: agent config updates for WI (#18774 ) This changeset makes two changes: * Removes the `consul.use_identity` field from the agent configuration. This behavior is properly covered by the presence of `consul.service_identity` / `consul.task_identity` blocks. * Adds a `consul.task_auth_method` and `consul.service_auth_method` fields to the agent configuration. This allows the cluster administrator to choose specific Consul Auth Method names for their environment, with a reasonable default.	2023-10-17 14:42:14 -04:00
James Rasell	1ffdd576bb	agent: add config option to enable file and line log detail. (#18768 )	2023-10-16 15:59:16 +01:00
Luiz Aoqui	7267be719f	config: apply defaults to extra Consul and Vault (#18623 ) * config: apply defaults to extra Consul and Vault Apply the expected default values when loading additional Consul and Vault cluster configuration. Without these defaults some fields would be left empty. * config: retain pointer of multi Consul and Vault When calling `Copy()` the pointer reference from the `"default"` key of the `Consuls` and `Vaults` maps to the `Consul` and `Vault` field of `Config` was being lost. * test: ensure TestAgent has the right reference to the default Consul config	2023-09-29 17:15:20 -03:00
Luiz Aoqui	a4b29a29cb	vault: add `jwt_backend_path` agent config (#18606 ) Add agent configuration to allow cluster operators to define the path where the JWT auth method backend is mounted.	2023-09-28 18:02:30 -03:00
Tim Gross	5001bf4547	consul: use constant instead of "default" literal (#18611 ) Use the constant `structs.ConsulDefaultCluster` instead of the string literal "default", as we've done for Vault.	2023-09-28 16:50:21 -04:00
Luiz Aoqui	fed1992cea	vault: remove `use_identity` agent config (#18592 ) The initial intention behind the `vault.use_identity` configuration was to indicate to Nomad servers that they would need to sign a workload identities for allocs with a `vault` block. But in order to support identity renewal, #18262 and #18431 moved the token signing logic to the alloc runner since a new token needs to be signed prior to the TTL expiring. So #18343 implemented `use_identity` as a flag to indicate that the workload identity JWT flow should be used when deriving Vault tokens for tasks. But this configuration value is set on servers so it is not available to clients at the time of token derivation, making its meaning not clear: a job may end up using the identity-based flow even when `use_identity` is `false`. The only reliable signal available to clients at token derivation time is the presence of an `identity` block for Vault, and this is already configured with the `vault.default_identity` configuration block, making `vault.use_identity` redundant. This commit removes the `vault.use_identity` configuration and simplifies the logic on when an implicit Vault identity is injected into tasks.	2023-09-27 17:44:07 -03:00
Luiz Aoqui	868aba57bb	vault: update identity name to start with `vault_` (#18591 ) * vault: update identity name to start with `vault_` In the original proposal, workload identities used to derive Vault tokens were expected to be called just `vault`. But in order to support multiple Vault clusters it is necessary to associate identities with specific Vault cluster configuration. This commit implements a new proposal to have Vault identities named as `vault_<cluster>`.	2023-09-27 15:53:28 -03:00
Luiz Aoqui	19241964a4	config: fix some issues with workload identity and multi Consul and Vault (#18590 ) * config: fix multi consul and vault config parse Capture the loop variable when parsing multiple Consul and Vault configuration blocks so the duration parse function uses the correct field when it's called later on. * client: build Vault client with right config When setting up the multiple Vault clients, the code was always loading the default configuration, resulting in all clients to be configured the same way. * config: fix WorkloadIdentityConfig.Copy() method Ensure `WorkloadIdentityConfig.Copy()` does not return the original pointer for the `TTL` field.	2023-09-27 14:41:11 -03:00
Juana De La Cuesta	124272c050	server: Add reporting option to agent (#18572 ) * func: add reporting option to agent * func: add test for merge and fix comments * Update config_ce.go * Update config_ce.go * Update config_ce.go * fix: add reporting config to default configuration and update to use must over require * Update command/agent/config_parse.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update nomad/structs/config/reporting.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update nomad/structs/config/reporting.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * style: rename license and reporting config * fix: use default function instead of empty struct --------- Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-09-27 00:11:32 +02:00
Tim Gross	d7bd47d60f	config: remove `consul.template_identity` in lieu of `task_identity` (#18540 ) The original thinking for Workload Identity integration with Consul and Vault was that we'd allow `template` blocks to specify their own identity. But because the login to Consul/Vault to get tokens happens at the task level, this would involve making the `template` block a new WID watcher on its own rather than using the Consul and Vault hooks we're building at the group/task level. So it doesn't make sense to have separate identities for individual `template` blocks rather than at the level of tasks. Update the agent configuration to rename the `template_identity` to the more accurate `task_identity`, which will be used for any non-service hooks (just `template` today). Update the implicit identities job mutation hook to create the identity we'll need as well.	2023-09-20 15:43:08 -04:00
Luiz Aoqui	3534307d0d	vault: add `use_identity` and `default_identity` agent configuration and implicit workload identity (#18343 )	2023-09-12 13:53:37 -03:00
Luiz Aoqui	82372fecb8	config: add TTL to agent identity config (#18457 ) Add support for identity token TTL in agent configuration fields such as Consul `service_identity` and `template_identity`. Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2023-09-12 11:13:09 -03:00
Luiz Aoqui	7466496608	config: fix identity config for Consul service (#18363 ) Rename the agent configuraion for workload identity to `WorkloadIdentityConfig` to make its use more explicit and remove the `ServiceName` field since it is never expected to be defined in a configuration file. Also update the job mutation to inject a service identity following these rules: 1. Don't inject identity if `consul.use_identity` is false. 2. Don't inject identity if `consul.service_identity` is not specified. 3. Don't inject identity if service provider is not `consul`. 4. Set name and service name if the service specifies an identity. 5. Inject `consul.service_identity` if service does not specify an identity.	2023-08-31 11:22:48 -03:00
Piotr Kazmierczak	b430d21a67	agent: add `consul.service_identity` and `consul.template_identity` blocks (#18279 ) This PR introduces updates to the agent config required for workload identity support.	2023-08-24 17:45:34 +02:00
Tim Gross	a8bad048b6	config: parsing support for multiple Consul clusters in agent config (#18255 ) Add the plumbing we need to accept multiple Consul clusters in Nomad agent configuration, to support upcoming Nomad Enterprise features. The `consul` blocks are differentiated by a new `name` field, and if the `name` is omitted it becomes the "default" Consul configuration. All blocks with the same name are merged together, as with the existing behavior. As with the `vault` block, we're still using HCL1 for parsing configuration and the `Decode` method doesn't parse multiple blocks differentiated only by a field name without a label. So we've had to add an extra parsing pass, similar to what we've done for HCL1 jobspecs. This also revealed a subtle bug in the `vault` block handling of extra keys when there are multiple `vault` blocks, which I've fixed here. For now, all existing consumers will use the "default" Consul configuration, so there's no user-facing behavior change in this changeset other than the contents of the agent self API. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-08-18 15:25:16 -04:00
Tim Gross	74b796e6d0	config: parsing support for multiple Vault clusters in agent config (#18224 ) Add the plumbing we need to accept multiple Vault clusters in Nomad agent configuration, to support upcoming Nomad Enterprise features. The `vault` blocks are differentiated by a new `name` field, and if the `name` is omitted it becomes the "default" Vault configuration. All blocks with the same name are merged together, as with the existing behavior. Unfortunately we're still using HCL1 for parsing configuration and the `Decode` method doesn't parse multiple blocks differentiated only by a field name without a label. So we've had to add an extra parsing pass, similar to what we've done for HCL1 jobspecs. For now, all existing consumers will use the "default" Vault configuration, so there's no user-facing behavior change in this changeset other than the contents of the agent self API. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-08-17 14:10:32 -04:00
hashicorp-copywrite[bot]	a9d61ea3fd	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:29 -05:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Alessio Perugini	365ccf4377	Allow configurable range of Job priorities (#16084 )	2023-02-17 09:23:13 -05:00
visweshs123	7d4ccf11bc	csi: add option to configure CSIVolumeClaimGCInterval (#16195 )	2023-02-16 10:41:15 -05:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
James Rasell	7b3bd1017d	Merge branch 'main' into f-gh-13120-sso-umbrella-merged-main	2022-08-25 12:14:29 +01:00
Piotr Kazmierczak	c4be2c6078	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
James Rasell	892ab8a07a	Merge branch 'main' into f-gh-13120-sso-umbrella	2022-08-02 08:30:03 +01:00
Luiz Aoqui	d456cc1e7f	Track plan rejection history and automatically mark clients as ineligible (#13421 ) Plan rejections occur when the scheduler work and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at one, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
James Rasell	08845cef04	server: add ACL token expiration config parameters. (#13667 ) This commit adds configuration parameters to control ACL token expirations. This includes both limits on the min and max TTL expiration values, as well as a GC threshold for expired tokens.	2022-07-12 13:43:25 +02:00
James Rasell	d442e1b4c1	agent: test full object when performing test config parse. (#13668 )	2022-07-11 16:21:36 +02:00
Seth Hoenig	b242957990	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00

1 2 3 4

158 Commits