If a Nomad job is started with a large number of instances (e.g. 4 billion),
then the Nomad servers that attempt to schedule it will run out of memory and
crash. While it's unlikely that anyone would intentionally schedule a job with 4
billion instances, we have occasionally run into issues with bugs in external
automation. For example, an automated deployment system running on a test
environment had an off-by-one error, and deployed a job with count = uint32(-1),
causing the Nomad servers for that environment to run out of memory and crash.
To guard against this, this PR introduces a `job_max_count` Nomad server
configuration parameter, which limits the number of allocs that may be created
from a single job. The default value is 50000; this is low enough that a job
with the maximum possible number of allocs will not require much memory on the
server, but still much higher than the number of allocs in the largest Nomad
job we have ever run.
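A minimal sketch of the guard this implies at registration time, with illustrative types and names rather than Nomad's actual code:

```go
import "fmt"

type TaskGroup struct{ Count int }
type Job struct{ TaskGroups []*TaskGroup }

// validateJobCount is a hypothetical helper: it sums the group counts and
// rejects the job if the total exceeds the configured job_max_count.
func validateJobCount(job *Job, jobMaxCount int) error {
	total := 0
	for _, tg := range job.TaskGroups {
		total += tg.Count
	}
	if total > jobMaxCount {
		return fmt.Errorf("job requires %d allocs, exceeding job_max_count (%d)",
			total, jobMaxCount)
	}
	return nil
}
```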
* Add preserve-resources flag when registering a job
* Add preserve-resources flag to website docs
* Add changelog
* Update tests, docs
* Preserve counts & resources in fsm
* Update doc
* Update preservation of resources/count to happen in StateStore
On Windows, the `os.Process.Signal` method returns an error when sending
`os.Interrupt` (SIGINT) because it isn't implemented. This causes test servers
in the `testutil` packages to break on Windows. Use the platform specific
syscalls to generate the SIGINT instead.
The agent's signal handler also did not correctly handle Ctrl-C because we
were masking os.Interrupt instead of SIGINT.
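A hedged sketch of the platform-specific call, assuming `golang.org/x/sys/windows`; the helper name is illustrative:

```go
//go:build windows

import "golang.org/x/sys/windows"

// sendInterrupt replaces os.Process.Signal(os.Interrupt), which is
// unimplemented on Windows, with a console ctrl event. CTRL_BREAK_EVENT
// can be delivered to a specific process group; CTRL_C_EVENT cannot.
func sendInterrupt(pid int) error {
	return windows.GenerateConsoleCtrlEvent(windows.CTRL_BREAK_EVENT, uint32(pid))
}
```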
Fixes: https://github.com/hashicorp/nomad/issues/26775
Co-authored-by: Chris Roberts <croberts@hashicorp.com>
When calling the client identity renew API, the target node ID may be
provided either in the URI or within the request body. This change fixes a
bug where all calls using a node_id query parameter would be rejected
because the handler failed to decode the empty request body.
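The shape of the fix, sketched with illustrative names (not Nomad's exact handler code): an empty body must be tolerated when the node ID already arrived via the query string.

```go
import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
)

// decodeBody tolerates an empty request body: json.Decoder returns io.EOF
// when there is nothing to read, which is fine if node_id came from the URI.
func decodeBody(req *http.Request, out any) error {
	if err := json.NewDecoder(req.Body).Decode(out); err != nil && !errors.Is(err, io.EOF) {
		return err // a genuinely malformed body is still rejected
	}
	return nil
}
```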
Co-authored-by: Tim Gross <tgross@hashicorp.com>
don't require "bridge" network mode when using connect{}
we document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
When creating constants with a custom type, each constant in the group should
repeat the type. If only the first constant specifies it, the remaining
constants take the default type of their values rather than the custom type.
This change fixes occurrences of this and enables SA9004 within CI
linting to catch future problems while the change is in review.
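An illustrative example of the pattern SA9004 catches (not code from the Nomad tree):

```go
type TaskState string

const (
	// Only the first constant names the type, so TaskRunning below is an
	// untyped string constant rather than a TaskState -- exactly what
	// staticcheck's SA9004 flags.
	TaskPending TaskState = "pending"
	TaskRunning           = "running"
)

const (
	// Correct: every constant in the group repeats the type.
	TaskDead    TaskState = "dead"
	TaskStopped TaskState = "stopped"
)
```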
Defines a `winsvc.Event` type which can be sent using the `winsvc.SendEvent`
function. If Nomad is running on Windows and can send to the Windows
Eventlog, the event will be sent. Initial event types are defined for
starting, ready, stopped, and log message.
The `winsvc.EventLogger` provides an `io.WriteCloser` that can be included
in the logger's writers collection. It will extract the log level from
log lines and write them appropriately to the eventlog. The eventlog
only supports error, warning, and info levels so messages with other
levels will be ignored.
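A minimal sketch of the level extraction, assuming hclog-style `[LEVEL]` markers in the log lines; these names are illustrative rather than the actual winsvc implementation:

```go
import "bytes"

// levelOf maps a log line to the eventlog severity it should be written
// with; lines at other levels (debug, trace) return "" and are dropped.
func levelOf(line []byte) string {
	switch {
	case bytes.Contains(line, []byte("[ERROR]")):
		return "error"
	case bytes.Contains(line, []byte("[WARN]")):
		return "warning"
	case bytes.Contains(line, []byte("[INFO]")):
		return "info"
	default:
		return ""
	}
}
```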
A new configuration block is included for enabling logging to the
eventlog. Logging must first be enabled with the `log_level` option; the
`eventlog.level` value can then be set to the same or a higher severity.
The HTTP request body contains the node ID to which the request should
be routed; without decoding it, we cannot route to anything other than
local nodes.
The `RetryJoin` function checks for an error and logs it before
retrying. The error variables were shadowed which resulted in
the errors never being logged. This predefines the variables
to prevent them from being shadowed.
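The bug in miniature, with illustrative names rather than the actual RetryJoin code:

```go
// Broken: ':=' declares a fresh err inside the loop, shadowing the outer
// variable, so the post-loop logging never fires.
func retryJoinBroken(discover func() (string, error), logf func(string, ...any)) {
	var err error
	for i := 0; i < 3; i++ {
		addr, err := discover() // shadows the outer err
		_ = addr
		if err == nil {
			return
		}
	}
	if err != nil {
		logf("join failed: %v", err) // unreachable: outer err is always nil
	}
}

// Fixed: predeclare the variables so the loop assigns rather than declares.
func retryJoinFixed(discover func() (string, error), logf func(string, ...any)) {
	var (
		addr string
		err  error
	)
	for i := 0; i < 3; i++ {
		addr, err = discover()
		if err == nil {
			_ = addr // use the discovered address
			return
		}
		logf("join attempt failed: %v", err) // now actually logged
	}
}
```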
The testlog package was also updated to support providing a custom
writer which allows logging output to be easily caught and inspected.
The Nomad clients store their Nomad identity in memory and within
their state store. While active, it is not possible to dump the
state to view the stored identity token, so having a way to view
the current claims while running aids debugging and operations.
This change adds a client identity workflow, allowing operators
to view the current claims of the node's identity. It does not
return any of the signing key material.
This change implements the client -> server workflow for Nomad
node introduction. A Nomad node can optionally be started with an
introduction token, which is a signed JWT containing claims for
the node registration. The server handles this according to the
enforcement configuration.
The introduction token can be provided by env var, cli flag, or
by placing it within a default filesystem location. The latter
option does not override the CLI or env var.
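A hedged sketch of the lookup order this describes, with illustrative names rather than Nomad's actual identifiers:

```go
import (
	"errors"
	"os"
	"strings"
)

// resolveIntroToken prefers the CLI flag, then the environment variable,
// and only then the default file location; a missing file simply means no
// token was provided.
func resolveIntroToken(flagValue, envVar, defaultPath string) (string, error) {
	if flagValue != "" {
		return flagValue, nil
	}
	if v := os.Getenv(envVar); v != "" {
		return v, nil
	}
	b, err := os.ReadFile(defaultPath)
	if errors.Is(err, os.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}
```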
The region claim has been removed from the initial claims set of
the intro identity. This boundary is guarded by mTLS and aligns
with the node identity.
* Add -log-file-export and -log-lookback commands to add historical log to
debug capture
* use monitor.PrepFile() helper for other historical log tests
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
The node introduction workflow will utilise JWTs that can be used
as authentication tokens on initial client registration. This
change implements the basic builder for this JWT claim type and
the RPC and HTTP handler functionality that will expose this to
the operator.
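For shape only, a hedged sketch of such a builder using the generic golang-jwt library; Nomad's actual claim fields, JWT library, and signing setup will differ:

```go
import (
	"crypto/ed25519"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// buildIntroToken signs a short-lived JWT carrying hypothetical node
// introduction claims; the claim names here are illustrative only.
func buildIntroToken(key ed25519.PrivateKey, nodeName, nodePool string) (string, error) {
	claims := jwt.MapClaims{
		"iss":       "nomad",                          // hypothetical issuer
		"node_name": nodeName,                         // hypothetical claim
		"node_pool": nodePool,                         // hypothetical claim
		"exp":       time.Now().Add(time.Hour).Unix(), // short-lived
	}
	return jwt.NewWithClaims(jwt.SigningMethodEdDSA, claims).SignedString(key)
}
```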
The new configuration block exposes some key options which allow
cluster administrators to control certain client introduction
behaviours.
This change introduces the new block and plumbing, so that it is
exposed in the Nomad server for consumption via internal processes.
The Nomad client will have its identity renewed according to the
TTL which defaults to 24h. In certain situations such as root
keyring rotation, operators may want to force clients to renew
their identities before the TTL threshold is met. This change
introduces a client HTTP and RPC endpoint which will instruct the
node to request a new identity at its next heartbeat. This can be
used via the API or a new command.
While this is a manual intervention step on top of any keyring
rotation, it dramatically reduces the initial feature complexity
as it provides an asynchronous and efficient method of renewal that
utilises existing functionality.
When performing a graceful shutdown, the client drain configuration
is checked for a deadline which is appended to the timeout. When the
agent is running as a server only, the client will not be set, and
attempting to get the drain deadline will result in a panic. This
change checks that the client is available prior to fetching the
deadline value.
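The guard, sketched with stub types standing in for the real agent structures:

```go
import "time"

type client struct{ deadline time.Duration }

func (c *client) drainDeadline() time.Duration { return c.deadline }

type agent struct{ client *client }

// shutdownTimeout only consults the drain deadline when a client exists;
// server-only agents leave a.client nil.
func (a *agent) shutdownTimeout(base time.Duration) time.Duration {
	if a.client == nil {
		return base
	}
	return base + a.client.drainDeadline()
}
```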
When a Nomad client registers or re-registers, the RPC handler will
generate and return a node identity if required. When an identity
is generated, the signing key ID will be stored within the node
object, to ensure a root key is not deleted while it is still in use.
During normal client operation it will periodically heartbeat to
the Nomad servers to indicate aliveness. The RPC handler that
is used for this action has also been updated to conditionally
perform identity generation. Performing it here means no extra RPC
handlers are required and we inherit the jitter in identity
generation from the heartbeat mechanism.
The identity generation check methods are performed from the RPC
request arguments, so they are scoped to the required behaviour and
can handle the nuance of each RPC. Failure to generate an identity
is considered terminal to the RPC call. The client will include
behaviour to retry this error, which is always caused by the
encrypter not being ready unless the server's keyring has been
corrupted.
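A sketch of that retry behaviour under an assumed sentinel error; none of these names are Nomad's actual API:

```go
import (
	"errors"
	"time"
)

var errEncrypterNotReady = errors.New("encrypter not ready")

// registerWithRetry retries only the not-ready error; anything else is
// terminal, matching the behaviour described above.
func registerWithRetry(register func() error) error {
	for {
		err := register()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errEncrypterNotReady) {
			return err
		}
		time.Sleep(time.Second) // key material should become available shortly
	}
}
```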
When a test starts an agent with the client enabled, we can wait
until the client reaches the ready state within the setup method.
This mimics what we already do with leadership and the root
keyring, and should reduce flaky tests that assume the client
is ready as soon as the setup function returns, which is not
guaranteed.
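A hedged sketch of the readiness gate inside the test setup, using Nomad's `testutil.WaitForResult` helper; the `Ready()` accessor is illustrative:

```go
testutil.WaitForResult(func() (bool, error) {
	if !agent.Client().Ready() { // hypothetical readiness accessor
		return false, errors.New("client not ready")
	}
	return true, nil
}, func(err error) {
	t.Fatalf("timed out waiting for client readiness: %v", err)
})
```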
The change exposed a couple of TLS reload tests which were not
using the test agent correctly. They were setting up a client even
though it would never be able to join the cluster due to TLS
configuration issues. These have been fixed.
When performing a graceful shutdown, a channel is used to wait for
the agent to leave. The channel is closed when the agent leaves
successfully, but it is also closed within a deferral. If the
agent successfully leaves and closes the channel, a panic will
occur when the channel is closed a second time within the
deferral. To prevent this from occurring, the channel close
is wrapped within a `OnceFunc` so the channel is only closed
once.
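A minimal sketch of the fix using `sync.OnceFunc` (Go 1.21+):

```go
import "sync"

func gracefulLeave(leave func() error) {
	left := make(chan struct{})
	closeLeft := sync.OnceFunc(func() { close(left) })
	defer closeLeft() // the second call is now a no-op, not a panic

	go func() {
		_ = leave()
		closeLeft() // success path also closes the channel
	}()

	<-left
}
```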
While waiting for the agent to leave during a graceful shutdown,
the wait can be interrupted immediately if another signal is
received. It is common that while waiting a `SIGPIPE` is received
from journald, causing the wait to end early. This results in the
agent not finishing the leave process and reporting an error when
the process has stopped. Instead of allowing any signal to interrupt
the wait, the received signal is checked, and if it is a `SIGPIPE`
the agent continues waiting.
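Sketched as a wait loop with illustrative names:

```go
import (
	"os"
	"syscall"
)

// waitForLeave keeps waiting through SIGPIPE (commonly sent by journald)
// so the leave process is not cut short.
func waitForLeave(left <-chan struct{}, signals <-chan os.Signal) {
	for {
		select {
		case <-left:
			return // leave completed
		case sig := <-signals:
			if sig == syscall.SIGPIPE {
				continue // ignore and keep waiting
			}
			return // any other signal interrupts the wait
		}
	}
}
```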
When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to automatically collect
these volumes and risk data loss, and have provided a CLI flag to `-force`
remove them in #25902. But for clusters running on ephemeral cloud
instances (ex. AWS EC2 in an autoscaling group), deleting host volumes may add
excessive friction. Add a configuration knob to the client configuration to
remove host volumes from the state store on node GC.
Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
* Set MaxAllocations in client config
* Add NodeAllocationTracker struct to Node struct
* Evaluate MaxAllocations in AllocsFit function (sketched below)
* Set up CLI config parsing
* Integrate maxAllocs into AllocatedResources view
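A hedged sketch of the count check; Nomad's real AllocsFit also evaluates CPU, memory, disk, and network feasibility, and these struct fields are illustrative:

```go
type Node struct{ MaxAllocations int }
type Allocation struct{}

// allocsFit rejects a proposal outright when it exceeds the node's
// configured allocation ceiling; zero means unlimited.
func allocsFit(node *Node, proposed []*Allocation) (bool, string) {
	if node.MaxAllocations > 0 && len(proposed) > node.MaxAllocations {
		return false, "max allocations exceeded"
	}
	// ... resource feasibility checks continue here ...
	return true, ""
}
```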
Co-authored-by: Tim Gross <tgross@hashicorp.com>
This introduces a new HTTP endpoint (and an associated CLI command) for querying
the ACL policies associated with a workload identity. It allows users who want
to understand their ACL capabilities from within WI tasks to see what sort of
policies are enabled.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.
This is the CE portion of the work required for the implementation in the
Enterprise product. Nomad CE does not perform utilization reporting.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
From an operator's point of view, server startup could appear to
"hang" if a key loaded from the FSM at startup could not be
decrypted or replicated.
To prevent this from happening, the server startup function
will now use a timeout when waiting for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller, which
fails the CLI command. Bubbling the error up also flushes it to the
logs, providing additional operator feedback.
Only keys loaded from the FSM snapshot and trailing logs matter
before the encrypter can be classed as ready. So that the encrypter's
ready function does not get blocked by keys added outside of the
initial Raft load, we take a snapshot of the in-flight decryption
tasks as we enter the blocking call, and class these as our barrier.
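A sketch of the bounded wait, assuming an illustrative `WaitReady` method on the encrypter:

```go
import (
	"context"
	"fmt"
	"time"
)

// waitForEncrypter bounds the startup wait so an undecryptable key fails
// the CLI command with a visible error instead of hanging forever.
func waitForEncrypter(enc interface{ WaitReady(context.Context) error }) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	if err := enc.WaitReady(ctx); err != nil {
		return fmt.Errorf("timed out waiting for keyring to be ready: %w", err)
	}
	return nil
}
```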