nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-06 10:25:42 +03:00

Author	SHA1	Message	Date
Michael Smithhisler	00ef9cacab	secrets: add common secrets plugins impl (#26335 ) Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2025-09-05 16:08:23 -04:00
James Rasell	ad508616dc	Merge branch 'main' into f-NMD-763-introduction	2025-08-05 08:56:51 +01:00
James Rasell	80a26306bf	intro: Add node introduction flow for Nomad client registration. (#26405 ) This change implements the client -> server workflow for Nomad node introduction. A Nomad node can optionally be started with an introduction token, which is a signed JWT containing claims for the node registration. The server handles this according to the enforcement configuration. The introduction token can be provided by env var, cli flag, or by placing it within a default filesystem location. The latter option does not override the CLI or env var. The region claims has been removed from the initial claims set of the intro identity. This boundary is guarded by mTLS and aligns with the node identity.	2025-08-05 08:23:44 +01:00
tehut	d709accaf5	Add nomad monitor export command (#26178 ) * Add MonitorExport command and handlers * Implement autocomplete * Require nomad in serviceName * Fix race in StreamReader.Read * Add and use framer.Flush() to coordinate function exit * Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer * Parameterize StreamFixed stream size	2025-08-01 10:26:59 -07:00
Tim Gross	3f59860254	host volumes: add configuration to GC on node GC (#25903 ) When a node is garbage collected, any dynamic host volumes on the node are orphaned in the state store. We generally don't want to automatically collect these volumes and risk data loss, and have provided a CLI flag to `-force` remove them in #25902. But for clusters running on ephemeral cloud instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add excessive friction. Add a configuration knob to the client configuration to remove host volumes from the state store on node GC. Ref: https://github.com/hashicorp/nomad/pull/25902 Ref: https://github.com/hashicorp/nomad/issues/25762 Ref: https://hashicorp.atlassian.net/browse/NMD-705	2025-05-27 10:22:08 -04:00
tehut	55523ecf8e	Add NodeMaxAllocations to client configuration (#25785 ) * Set MaxAllocations in client config Add NodeAllocationTracker struct to Node struct Evaluate MaxAllocations in AllocsFit function Set up cli config parsing Integrate maxAllocs into AllocatedResources view Co-authored-by: Tim Gross <tgross@hashicorp.com> --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2025-05-22 12:49:27 -07:00
Nikita Eliseev	76fb3eb9a1	rpc: added configuration for yamux session (#25466 ) Fixes: https://github.com/hashicorp/nomad/issues/25380	2025-04-02 10:58:23 -04:00
grembo	b6d925987c	Allow disabling wait in client configuration (#25255 ) Before the fixes in #20165, the wait feature was disabled by default. After these changes, it's always enabled, which - at least on some platforms - leads to a significant increase in load (5-7x). This patch allows disabling the wait feature in the client stanza of the configuration file by setting min and max to 0: wait { min = "0" max = "0" } Per-template wait blocks in the task description still work like one would expect.	2025-03-03 16:38:46 -05:00
Tim Gross	7b89c0ee28	template: fix client's default retry configuration (#25113 ) In #20165 we fixed a bug where a partially configured `client.template` retry block would set any unset fields to nil instead of their default values. But this patch introduced a regression in the default values, so we were now defaulting to unlimited retries if the retry block was unset. Restore the correct behavior and add better test coverage at both the config parsing and template configuration code. Ref: https://github.com/hashicorp/nomad/pull/20165 Ref: https://github.com/hashicorp/nomad/issues/23305#issuecomment-2643731565	2025-02-14 09:25:41 -05:00
Jorge Marey	25426f0777	fingerprint: add config option to disable dmidecode (#25108 )	2025-02-13 11:20:48 -05:00
Daniel Bennett	49c147bcd7	dynamic host volumes: change env vars, fixup auto-delete (#24943 ) * plugin env: DHV_HOST_PATH->DHV_VOLUMES_DIR * client config: host_volumes_dir * plugin env: add namespace+nodepool * only auto-delete after error saving client state on initial create	2025-01-27 10:36:53 -06:00
Daniel Bennett	46a39560bb	dynamic host volumes: fingerprint client plugins (#24589 )	2024-12-19 09:25:54 -05:00
James Rasell	7d48aa2667	client: emit optional telemetry from prerun and prestart hooks. (#24556 ) The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete. The new datapoints are: - nomad.client.alloc_hook.prerun.success (counter) - nomad.client.alloc_hook.prerun.failed (counter) - nomad.client.alloc_hook.prerun.elapsed (sample) - nomad.client.task_hook.prestart.success (counter) - nomad.client.task_hook.prestart.failed (counter) - nomad.client.task_hook.prestart.elapsed (sample) The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance. Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.	2024-12-12 14:43:14 +00:00
Piotr Kazmierczak	f7a4ded2c0	security: add CT executeTemplate to default function_denylist (#24541 ) This PR adds Consul Template's executeTemplate function to the denylist by default, in order to prevent accidental or malicious infinitely recursive execution. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-11-22 19:33:56 +01:00
Martijn Vegter	3ecf0d21e2	metrics: introduce client config to include alloc metadata as part of the base labels (#23964 )	2024-10-02 10:55:44 -04:00
Juliano Martinez	4a74fda8ce	Allow client template config block to be parsed when using json config (#24007 ) - Adds tests - Adds sample test data for parsing hcl and json - Adds changelog	2024-10-01 15:44:36 -04:00
Daniel Bennett	2f5cf8efae	networking: option to enable ipv6 on bridge network (#23882 ) by setting bridge_network_subnet_ipv6 in client config Co-authored-by: Martina Santangelo <martina.santangelo@hashicorp.com>	2024-09-04 10:17:10 -05:00
guifran001	1c44521543	client: Add a preferred address family option for network-interface (#23389 ) to prefer ipv4 or ipv6 when deducing IP from network interface Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2024-07-12 15:30:38 -05:00
Tim Gross	df67e74615	Consul: add preflight checks for Envoy bootstrap (#23381 ) Nomad creates Consul ACL tokens and service registrations to support Consul service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a tokens or services that have not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token and that negative entry gets cached, preventing any retries from succeeding. To workaround this, we'll use a method described by our friends over on `consul-k8s` where after creating the objects in Consul we try to read them from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust. This changset adds preflight checks for the objects we create in Consul: * We add a preflight check for ACL tokens after we login via via Workload Identity and in the function we use to derive tokens in the legacy workflow. We do this check early because we also want to use this token for registering group services in the allocrunner hooks. * We add a preflight check for services right before we bootstrap Envoy in the taskrunner hook, so that we have time for our service client to batch updates to the local Consul agent in addition to the local agent sync. We've added the timeouts to be configurable via node metadata rather than the usual static configuration because for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing. Fixes: https://github.com/hashicorp/nomad/issues/9307 Fixes: https://github.com/hashicorp/nomad/issues/10451 Fixes: https://github.com/hashicorp/nomad/issues/20516 Ref: https://github.com/hashicorp/consul-k8s/pull/887 Ref: https://hashicorp.atlassian.net/browse/NET-10051 Ref: https://hashicorp.atlassian.net/browse/NET-9273 Follow-up: https://hashicorp.atlassian.net/browse/NET-10138	2024-06-27 10:15:37 -04:00
Tim Gross	7b9bce2d08	config: fix `client.template` config merging with defaults (#20165 ) When loading the client configuration, the user-specified `client.template` block was not properly merged with the default values. As a result, if the user set any `client.template` field, all the other field defaulted to their zero values instead of the documented defaults. This changeset: * Adds the missing `Merge` method for the client template config and ensures it's called. * Makes a single source of truth for the default template configuration, instead of two different constructors. * Extends the tests to cover the merge of a partial block better. Fixes: https://github.com/hashicorp/nomad/issues/20164	2024-03-20 10:18:56 -04:00
Seth Hoenig	05937ab75b	exec2: add client support for unveil filesystem isolation mode (#20115 ) * exec2: add client support for unveil filesystem isolation mode This PR adds support for a new filesystem isolation mode, "Unveil". The mode introduces a "alloc_mounts" directory where tasks have user-owned directory structure which are bind mounts into the real alloc directory structure. This enables a task driver to use landlock (and maybe the real unveil on openbsd one day) to isolate a task to the task owned directory structure, providing sandboxing. * actually create alloc-mounts-dir directory * fix doc strings about alloc mount dir paths	2024-03-13 08:24:17 -05:00
Seth Hoenig	286dce7a2a	exec2: add a client.users configuration block (#20093 ) * exec: add a client.users configuration block For now just add min/max dynamic user values; soon we can also absorb the "user.denylist" and "user.checked_drivers" options from the deprecated client.options map. * give the no-op pool implementation a better name * use explicit error types to make referencing them cleaner in tests * use import alias to not shadow package name	2024-03-08 16:02:32 -06:00
Tim Gross	50f0ce5412	config: remove old Vault/Consul config blocks from client (#18994 ) Remove the now-unused original configuration blocks for Consul and Vault from the client. When the client needs to refer to a Consul or Vault block it will always be for a specific cluster for the task/service. Add a helper for accessing the default clusters (for the client's own use). This is two of three changesets for this work. The remainder will implement the same changes in the `command/agent` package. As part of this work I discovered and fixed two bugs: * The gRPC proxy socket that we create for Envoy is only ever created using the default Consul cluster's configuration. This will prevent Connect from being used with the non-default cluster. * The Consul configuration we use for templates always comes from the default Consul cluster's configuration, but will use the correct Consul token for the non-default cluster. This will prevent templates from being used with the non-default cluster. Ref: https://github.com/hashicorp/nomad/issues/18947 Ref: https://github.com/hashicorp/nomad/pull/18991 Fixes: https://github.com/hashicorp/nomad/issues/18984 Fixes: https://github.com/hashicorp/nomad/issues/18983	2023-11-07 09:15:37 -05:00
Luiz Aoqui	347389f9f9	vault: derive token using `create_from_role` (#18880 ) Fallback to the ACL role defined in the client's `create_from_role` configuration when using the JWT flow and the task does not specify a role to use.	2023-10-27 13:03:44 -04:00
Tim Gross	5001bf4547	consul: use constant instead of "default" literal (#18611 ) Use the constant `structs.ConsulDefaultCluster` instead of the string literal "default", as we've done for Vault.	2023-09-28 16:50:21 -04:00
Luiz Aoqui	868aba57bb	vault: update identity name to start with `vault_` (#18591 ) * vault: update identity name to start with `vault_` In the original proposal, workload identities used to derive Vault tokens were expected to be called just `vault`. But in order to support multiple Vault clusters it is necessary to associate identities with specific Vault cluster configuration. This commit implements a new proposal to have Vault identities named as `vault_<cluster>`.	2023-09-27 15:53:28 -03:00
Seth Hoenig	591394fb62	drivers: plumb hardware topology via grpc into drivers (#18504 ) * drivers: plumb hardware topology via grpc into drivers This PR swaps out the temporary use of detecting system hardware manually in each driver for using the Client's detected topology by plumbing the data over gRPC. This ensures that Client configuration is taken to account consistently in all references to system topology. * cr: use enum instead of bool for core grade * cr: fix test slit tables to be possible	2023-09-18 08:58:07 -05:00
Seth Hoenig	2e1974a574	client: refactor cpuset partitioning (#18371 ) * client: refactor cpuset partitioning This PR updates the way Nomad client manages the split between tasks that make use of resources.cpus vs. resources.cores. Previously, each task was explicitly assigned which CPU cores they were able to run on. Every time a task was started or destroyed, all other tasks' cpusets would need to be updated. This was inefficient and would crush the Linux kernel when a client would try to run ~400 or so tasks. Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently manage cpusets. * cr: tweaks for feedback	2023-09-12 09:11:11 -05:00
Seth Hoenig	f5b0da1d55	all: swap exp packages for maps, slices (#18311 )	2023-08-23 15:42:13 -05:00
Tim Gross	a8bad048b6	config: parsing support for multiple Consul clusters in agent config (#18255 ) Add the plumbing we need to accept multiple Consul clusters in Nomad agent configuration, to support upcoming Nomad Enterprise features. The `consul` blocks are differentiated by a new `name` field, and if the `name` is omitted it becomes the "default" Consul configuration. All blocks with the same name are merged together, as with the existing behavior. As with the `vault` block, we're still using HCL1 for parsing configuration and the `Decode` method doesn't parse multiple blocks differentiated only by a field name without a label. So we've had to add an extra parsing pass, similar to what we've done for HCL1 jobspecs. This also revealed a subtle bug in the `vault` block handling of extra keys when there are multiple `vault` blocks, which I've fixed here. For now, all existing consumers will use the "default" Consul configuration, so there's no user-facing behavior change in this changeset other than the contents of the agent self API. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-08-18 15:25:16 -04:00
Tim Gross	74b796e6d0	config: parsing support for multiple Vault clusters in agent config (#18224 ) Add the plumbing we need to accept multiple Vault clusters in Nomad agent configuration, to support upcoming Nomad Enterprise features. The `vault` blocks are differentiated by a new `name` field, and if the `name` is omitted it becomes the "default" Vault configuration. All blocks with the same name are merged together, as with the existing behavior. Unfortunately we're still using HCL1 for parsing configuration and the `Decode` method doesn't parse multiple blocks differentiated only by a field name without a label. So we've had to add an extra parsing pass, similar to what we've done for HCL1 jobspecs. For now, all existing consumers will use the "default" Vault configuration, so there's no user-facing behavior change in this changeset other than the contents of the agent self API. Ref: https://github.com/hashicorp/team-nomad/issues/404	2023-08-17 14:10:32 -04:00
hashicorp-copywrite[bot]	2d35e32ec9	Update copyright file headers to BUSL-1.1	2023-08-10 17:27:15 -05:00
Seth Hoenig	a4cc76bd3e	numa: enable numa topology detection (#18146 ) * client: refactor cgroups management in client * client: fingerprint numa topology * client: plumb numa and cgroups changes to drivers * client: cleanup task resource accounting * client: numa client and config plumbing * lib: add a stack implementation * tools: remove ec2info tool * plugins: fixup testing for cgroups / numa changes * build: update makefile and package tests and cl	2023-08-10 17:05:30 -05:00
Tim Gross	88323bab4a	allocrunner: provide factory function so we can build mock ARs (#17161 ) Tools like `nomad-nodesim` are unable to implement a minimal implementation of an allocrunner so that we can test the client communication without having to lug around the entire allocrunner/taskrunner code base. The allocrunner was implemented with an interface specifically for this purpose, but there were circular imports that made it challenging to use in practice. Move the AllocRunner interface into an inner package and provide a factory function type. Provide a minimal test that exercises the new function so that consumers have some idea of what the minimum implementation required is.	2023-05-12 13:29:44 -04:00
Daniel Bennett	c2dc1c58dd	full task cleanup when alloc prerun hook fails (#17104 ) to avoid leaking task resources (e.g. containers, iptables) if allocRunner prerun fails during restore on client restart. now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event, mark the task as failed, and cancel the tr's killCtx, so then ar.runTasks() -> tr.Run() can take care of the actual cleanup. removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run(): * set task state as dead * save task runner local state * task stop hooks also done in tr.Run() now that it's not skipped: * handleKill() to kill tasks while respecting their shutdown delay, and retrying as needed * also includes task preKill hooks * clearDriverHandle() to destroy the task and associated resources * task exited hooks	2023-05-08 13:17:10 -05:00
Tim Gross	c3002db815	client: allow `drain_on_shutdown` configuration (#16827 ) Adds a new configuration to clients to optionally allow them to drain their workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting itself and then monitors the drain state as seen by the server until the drain is complete or the deadline expires. If it loses connection with the server, it will monitor local client status instead to ensure allocations are stopped before exiting.	2023-04-14 15:35:32 -04:00
hashicorp-copywrite[bot]	f005448366	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Michael Schurter	9bab96ebd3	Task API via Unix Domain Socket (#15864 ) This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled. This PR contains the core feature and tests but followup work is required for the following TODO items: - Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink - Unit tests for auth middleware - Caching for auth middleware - Rate limiting on negative lookups for auth middleware --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>	2023-02-06 11:31:22 -08:00
Charlie Voiselle	55df5af4aa	client: Add option to enable hairpinMode on Nomad bridge (#15961 ) * Add `bridge_network_hairpin_mode` client config setting * Add node attribute: `nomad.bridge.hairpin_mode` * Changed format string to use `%q` to escape user provided data * Add test to validate template JSON for developer safety Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>	2023-02-02 10:12:15 -05:00
Karl Johann Schubert	588392cabc	client: add disk_total_mb and disk_free_mb config options (#15852 )	2023-01-24 09:14:22 -05:00
James Rasell	eaea9164a5	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Michael Schurter	f91100bda3	client: remove unused LogOutput and LogLevel (#14867 ) * client: remove unused LogOutput * client: remove unused config.LogLevel	2022-10-11 09:24:40 -07:00
Seth Hoenig	fed329883e	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
Seth Hoenig	ff1a30fe8d	cleanup more helper updates (#14638 ) * cleanup: refactor MapStringStringSliceValueSet to be cleaner * cleanup: replace SliceStringToSet with actual set * cleanup: replace SliceStringSubset with real set * cleanup: replace SliceStringContains with slices.Contains * cleanup: remove unused function SliceStringHasPrefix * cleanup: fixup StringHasPrefixInSlice doc string * cleanup: refactor SliceSetDisjoint to use real set * cleanup: replace CompareSliceSetString with SliceSetEq * cleanup: replace CompareMapStringString with maps.Equal * cleanup: replace CopyMapStringString with CopyMap * cleanup: replace CopyMapStringInterface with CopyMap * cleanup: fixup more CopyMapStringString and CopyMapStringInt * cleanup: replace CopySliceString with slices.Clone * cleanup: remove unused CopySliceInt * cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice * cleanup: replace CopyMap with maps.Clone * cleanup: run go mod tidy	2022-09-21 14:53:25 -05:00
Luiz Aoqui	43fe45d972	fix minor issues found durint ENT merge (#14250 )	2022-08-23 17:22:18 -04:00
Michael Schurter	01648e615a	client: fix data races in config handling (#14139 ) Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked. This change takes the following approach to safely handling configs in the client: 1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex 2. Since the mutex only protects the config pointer itself and not fields inside the Config struct: all config mutation must be done on a copy of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates. 3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion. 4. To facilitate deep copying I made an internally backward incompatible API change: our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"	2022-08-18 16:32:04 -07:00
Piotr Kazmierczak	c4be2c6078	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 ) Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.	2022-08-17 18:26:34 +02:00
Derek Strickland	696deb9600	Add Nomad RetryConfig to agent template config (#13907 ) * add Nomad RetryConfig to agent template config	2022-08-03 16:56:30 -04:00
Derek Strickland	7899fd3fac	consul-template: Add fault tolerant defaults (#13041 ) consul-template: Add fault tolerant defaults Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-06-08 14:08:25 -04:00
Michael Schurter	3968509886	artifact: fix numerous go-getter security issues Fix numerous go-getter security issues: - Add timeouts to http, git, and hg operations to prevent DoS - Add size limit to http to prevent resource exhaustion - Disable following symlinks in both artifacts and `job run` - Stop performing initial HEAD request to avoid file corruption on retries and DoS opportunities. Approach Since Nomad has no ability to differentiate a DoS-via-large-artifact vs a legitimate workload, all of the new limits are configurable at the client agent level. The max size of HTTP downloads is also exposed as a node attribute so that if some workloads have large artifacts they can specify a high limit in their jobspecs. In the future all of this plumbing could be extended to enable/disable specific getters or artifact downloading entirely on a per-node basis.	2022-05-24 16:29:39 -04:00

1 2 3 4

173 Commits