If a Nomad job is started with a large number of instances (e.g. 4 billion),
then the Nomad servers that attempt to schedule it will run out of memory and
crash. While it's unlikely that anyone would intentionally schedule a job with 4
billion instances, we have occasionally run into issues with bugs in external
automation. For example, an automated deployment system running on a test
environment had an off-by-one error, and deployed a job with count = uint32(-1),
causing the Nomad servers for that environment to run out of memory and crash.
To prevent this, this PR introduces a job_max_count Nomad server configuration
parameter, which limits the number of allocs that may be created from a job.
The default value is 50000: low enough that a job with the maximum
possible number of allocs will not require much memory on the server, but is
still much higher than the number of allocs in the largest Nomad job we have
ever run.
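As a rough illustration of where such a ceiling applies (the function and field names below are assumptions, not the actual implementation), the check amounts to summing every task group's count during job validation:

```go
package nomad

import "fmt"

const defaultJobMaxCount = 50000

// validateJobCount sums the count of every task group and rejects the job if
// the total exceeds the server's configured ceiling.
func validateJobCount(groupCounts []int, jobMaxCount int) error {
	if jobMaxCount <= 0 {
		jobMaxCount = defaultJobMaxCount
	}
	total := 0
	for _, count := range groupCounts {
		total += count
	}
	if total > jobMaxCount {
		return fmt.Errorf("job requests %d allocations, which exceeds job_max_count (%d)",
			total, jobMaxCount)
	}
	return nil
}
```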
* Add preserve-resources flag when registering a job
* Add preserve-resources flag to website docs
* Add changelog
* Update tests, docs
* Preserve counts & resources in fsm
* Update doc
* Update preservation of resources/count to happen in StateStore
The transit keyring uses the go-kms-wrapping library for parsing the Vault
config, and that parsing errors if tlsSkipVerify is an empty string.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
When we attempt to drop unneeded evals from the eval broker, if the eval has
been GC'd before the check is made, we hit a nil pointer. Check that the eval
actually exists before attempting to remove it from the broker.
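The shape of the guard is roughly the following sketch; the lookup and broker types here are simplified stand-ins, not Nomad's actual API:

```go
package nomad

// Evaluation is a minimal stand-in for the real eval struct.
type Evaluation struct{ ID string }

// evalLookup returns (nil, nil) when the eval no longer exists in state.
type evalLookup interface {
	EvalByID(id string) (*Evaluation, error)
}

// evalBroker removes an eval from the broker's pending work.
type evalBroker interface {
	Drop(eval *Evaluation) error
}

// dropEvalIfPresent confirms the eval still exists before asking the broker to
// drop it, so an eval that was garbage-collected in the meantime is a no-op
// rather than a nil pointer dereference.
func dropEvalIfPresent(lookup evalLookup, broker evalBroker, evalID string) error {
	eval, err := lookup.EvalByID(evalID)
	if err != nil {
		return err
	}
	if eval == nil {
		return nil // already GC'd; nothing to remove from the broker
	}
	return broker.Drop(eval)
}
```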
Fixes: https://github.com/hashicorp/nomad/issues/26871
ACL policies are parsed when they are created or updated, and again when
compiling the resulting ACL object at use time. This parsing silently
ignored duplicate singleton keys and invalid keys. Ignoring them does not
grant any additional access, but it is poor UX and can be unexpected.
This change parses all new policy writes and updates, so that
duplicate or invalid keys return an error to the caller. This is
called strict parsing. In order to correctly handle upgrades of
clusters which have existing policies that would fall foul of the
change, a lenient parsing mode is also available. This allows
the policy to continue to be parsed and compiled after an upgrade
without the need for an operator to correct the policy document
prior to further use.
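A rough sketch of the two modes follows; this is generic illustrative code, not Nomad's actual parser, and all names here (the mode constants, the checkKeys helper, and the key-count map it assumes the decoder can produce) are assumptions:

```go
package acl

import (
	"fmt"
	"log"
)

type ParseMode int

const (
	ParseModeStrict  ParseMode = iota // new policy writes and updates
	ParseModeLenient                  // policies already stored before the upgrade
)

// checkKeys validates the keys observed while decoding a policy document.
// seen maps each key to the number of times it appeared; valid is the set of
// keys the policy grammar supports.
func checkKeys(seen map[string]int, valid map[string]bool, mode ParseMode) error {
	for key, count := range seen {
		var problem string
		switch {
		case !valid[key]:
			problem = fmt.Sprintf("invalid key %q", key)
		case count > 1:
			problem = fmt.Sprintf("duplicate key %q", key)
		default:
			continue
		}
		if mode == ParseModeStrict {
			return fmt.Errorf("policy parse error: %s", problem)
		}
		log.Printf("[WARN] acl: ignoring %s in existing policy", problem)
	}
	return nil
}
```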
Co-authored-by: Tim Gross <tgross@hashicorp.com>
In #26831 we're preventing unexpected node RPCs by ensuring that the volume
watcher only unpublishes when allocations are client-terminal.
To mitigate any remaining similar issues, add serialization of node plugin RPCs,
as we did for controller plugin RPCs in #17996 and as recommended ("SHOULD") by
the CSI specification. Here we can do per-volume serialization rather than
per-plugin serialization.
Reorder the methods of the `volumeManager` in the client so that each interface
method and its directly-associated helper methods read from top-to-bottom,
instead of a mix of directions.
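Per-volume serialization can be done with one mutex per volume ID; the sketch below uses illustrative names and a sync.Map, not the exact code in the volumeManager:

```go
package csimanager

import "sync"

// volumeLocks hands out one mutex per volume ID so node RPCs for a given
// volume are serialized without blocking RPCs for other volumes served by
// the same plugin.
type volumeLocks struct {
	locks sync.Map // volume ID -> *sync.Mutex
}

// lock acquires the per-volume mutex and returns the matching unlock func.
func (v *volumeLocks) lock(volID string) func() {
	mu, _ := v.locks.LoadOrStore(volID, &sync.Mutex{})
	m := mu.(*sync.Mutex)
	m.Lock()
	return m.Unlock
}
```

A caller would wrap each node RPC in `unlock := locks.lock(volID); defer unlock()`, keeping the critical section scoped to a single volume.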
Ref: https://github.com/hashicorp/nomad/pull/17996
Ref: https://github.com/hashicorp/nomad/pull/26831
The volume watcher checks whether any allocations that have claims are terminal
so that it knows if it's safe to unpublish the volume. This check was
considering a claim as unpublishable if the allocation was terminal on either
the server or client, rather than the client alone. In many circumstances this
is safe.
But if an allocation takes a while to stop (ex. it has a `shutdown_delay`), it's
possible for garbage collection to run in the window between when the alloc is
marked server-terminal and when the task is actually stopped. The server
unpublishes the volume which sends a node plugin RPC. The plugin unmounts the
volume while it's in use, and then unmounts it again when the allocation stops
and the CSI postrun hook runs. If the task writes to the volume during the
unmounting process, some providers end up in a broken state and the volume is
not usable unless it's detached and reattached.
Fix this by considering a claim a "past claim" only when the allocation is
client terminal. This way if garbage collection runs while we're waiting for
allocation shutdown, the alloc will only be server-terminal and we won't send
the extra node RPCs.
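The corrected condition is roughly the following sketch, with a trimmed-down allocation type standing in for the real structs:

```go
package volumewatcher

// allocView is a simplified stand-in for the allocation fields the watcher
// needs; the real code uses *structs.Allocation.
type allocView struct {
	ClientStatus string
}

// isPastClaim reports whether it is safe to treat a claim as a past claim and
// unpublish the volume: only once the client reports the allocation terminal
// (complete, failed, or lost), not merely when the server has marked it for
// stopping.
func isPastClaim(alloc *allocView) bool {
	if alloc == nil {
		// The alloc has already been garbage-collected.
		return true
	}
	switch alloc.ClientStatus {
	case "complete", "failed", "lost":
		return true
	default:
		return false
	}
}
```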
Fixes: https://github.com/hashicorp/nomad/issues/24130
Fixes: https://github.com/hashicorp/nomad/issues/25819
Ref: https://hashicorp.atlassian.net/browse/NMD-1001
Nomad's periodic block includes a "time_zone" parameter which lets
operators set the time zone against which the next launch interval is
checked. For this to work, Nomad needs to use "time.LoadLocation",
which in turn can use multiple TZ data sources.
When the Docker image is used to run Nomad and handle job registrations,
it currently does not have access to any TZ data, meaning it is only
aware of UTC. Adding the tzdata package contents to the release image
provides the required data for this to work.
It would also have been possible to pass a build tag via "-tags" when
releasing Nomad, which would embed a copy of the timezone database in
the binary. We decided against the build tag approach as it is a subtle
way that we could introduce bugs that are very difficult to track down,
and we prefer the approach taken in this commit.
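As a concrete illustration (a minimal standalone program, not Nomad code), the dependency looks like this; the commented-out time/tzdata import is the embed alternative described above:

```go
package main

import (
	"fmt"
	"time"
	// Uncommenting the import below embeds the IANA database into the
	// binary; this is the build-time alternative we decided against:
	// _ "time/tzdata"
)

func main() {
	loc, err := time.LoadLocation("Europe/Berlin")
	if err != nil {
		// In an image with no /usr/share/zoneinfo and no embedded database,
		// LoadLocation fails and periodic time_zone values cannot be resolved.
		fmt.Println("LoadLocation failed:", err)
		return
	}
	fmt.Println("launch intervals would be evaluated in", loc)
}
```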
In most RPC endpoints we use the resolved ACL object to determine whether a
given auth token or identity has access to the object of interest to the
RPC. In #15870 we adjusted this across most of the RPCs to handle workload identity.
But in the ACL endpoints that read policies, we can't use the resolved ACL
object and have to go back to the original token and lookup the policies it has
access to. So we need to resolve any workload-associated policies during that
lookup as well.
Fixes: https://github.com/hashicorp/nomad/issues/26764
Ref: https://hashicorp.atlassian.net/browse/NMD-990
Ref: https://github.com/hashicorp/nomad/pull/15870
In Nomad Enterprise we can fingerprint multiple Consul datacenters. If none of
them is `"default"`, we end up with warning logs about adding a "link".
The `Link` field on the `Node` struct is a map of attributes that only
contributes to the node's computed hash. The `"consul"` key's value is derived
from the `unique.consul.name` attribute, which only exists if there's a default
Consul cluster.
Update the fingerprint to skip setting the link field if there's no
`unique.consul.name`, and lower the warning log for malformed fields to debug.
The link field is only a minor scheduling optimization, largely captured by
existing Consul fields in the node's computed class; the only reason not to
remove it entirely is to avoid changing computed classes on existing large
clusters.
Fixes: https://github.com/hashicorp/nomad/issues/26781
Ref: https://hashicorp.atlassian.net/browse/NMD-998
On Windows, the `os.Process.Signal` method returns an error when sending
`os.Interrupt` (SIGINT) because it isn't implemented. This causes test servers
in the `testutil` packages to break on Windows. Use the platform specific
syscalls to generate the SIGINT instead.
The agent's signal handler also did not correctly handle Ctrl-C because we
were masking os.Interrupt instead of SIGINT.
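Roughly, the platform-specific send path looks like the sketch below, assuming the golang.org/x/sys/windows helpers; the surrounding function name is illustrative, not the exact code:

```go
//go:build windows

package testutil

import "golang.org/x/sys/windows"

// sendInterrupt delivers a console control event to the test server's process
// group, since os.Process.Signal(os.Interrupt) is not implemented on Windows.
func sendInterrupt(pid int) error {
	return windows.GenerateConsoleCtrlEvent(windows.CTRL_BREAK_EVENT, uint32(pid))
}
```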
Fixes: https://github.com/hashicorp/nomad/issues/26775
Co-authored-by: Chris Roberts <croberts@hashicorp.com>
The metrics on the eval broker include labels for the job ID, but under a high
volume of dispatch workloads, this results in excessive heap usage on the
leader. Dispatch workloads should use their parent ID rather than their child ID
for any metrics we collect.
Also, eliminate an extra copy of the labels and remove the extremely high
cardinality `"eval_id"` label from the `nomad.broker.eval_waiting` metric.
Fixes: https://github.com/hashicorp/nomad/issues/26657
The allocation network hook was not properly restoring network status from state when the network had previously been set up. This led to missing environment variables, a misconfigured hosts file, and a misconfigured resolv.conf when a task was restarted after the Nomad agent had restarted.
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
A small optimization in the scheduler required users to specify specific
models of devices if the required count was higher than the number available
from any individual model/vendor on the node. This change removes that
optimization to allow for more intuitive device scheduling when different
vendor/model device types exist on a node.
The `go-getter` update in https://github.com/hashicorp/nomad/pull/26713 is not passing tests upstream (apparently https://github.com/hashicorp/go-getter/pull/548 is the origin of the problem but that PR did not ever run tests). The issue being fixed isn't a critical vulnerability, so in the interest of preparing us for the next release, revert the `go-getter` change but keep the Go toolchain update.
We'll skip go-getter 1.8.0 and pick up the next patch version once its issues are fixed.
Reverts commit 8a96929870.
During a large volume dispatch load test, I discovered that a lot of the total
scheduling time is being spent calling `structs.ParsePortRanges` repeatedly, in
order to parse the reserved ports configuration of the node (ex. converting
`"80,8000-8001"` to `[]int{80, 8000, 8001}`). A close examination of the
profiles shows that the bulk of the time is being spent hashing the keys for the
map of ports we use for de-duplication, and then sorting the resulting slice.
The `(*NetworkIndex) SetNode` method that calls the offending `ParsePortRanges`
merges all the ports into the `UsedPorts` map of bitmaps at scheduling time,
which means the consumer of the slice is already de-duplicating and doesn't
care about the order. The only other caller of `ParsePortRanges` is when
we validate the configuration file, and that throws away the slice entirely.
By skipping de-duplication and not sorting, we can cut down the runtime of this
function by 30x and memory usage by 3x.
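The cheaper parse is essentially the stripped-down sketch below (not the actual function, and with error handling trimmed): expand `"80,8000-8001"` into a flat slice with no de-duplication map and no final sort, since the scheduler folds the result into a bitmap anyway.

```go
package structs

import (
	"strconv"
	"strings"
)

// parsePortRanges expands a spec like "80,8000-8001" into a flat slice,
// skipping the de-duplication and sorting the old implementation performed.
func parsePortRanges(spec string) ([]uint64, error) {
	var ports []uint64
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.ParseUint(bounds[0], 10, 64)
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.ParseUint(bounds[1], 10, 64); err != nil {
				return nil, err
			}
		}
		for p := lo; p <= hi; p++ {
			ports = append(ports, p)
		}
	}
	return ports, nil
}
```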
Ref: https://github.com/hashicorp/nomad/blob/v1.10.4/nomad/structs/network.go#L201
Fixes: https://github.com/hashicorp/nomad/issues/26654
In #8435 (shipped in 0.12.1), we updated the `Job.Register` RPC to atomically
write the eval along with the job. But this didn't get copied to
`Job.Dispatch`. Under excessive load testing we demonstrated this can result in
dispatched jobs without corresponding evals.
Update the dispatch RPC to write the eval in the same Raft log as the job
registration. Note that we don't need to version-check this change for upgrades,
because the register and dispatch RPCs share the same `JobRegisterRequestType`
Raft message, and therefore all supported server versions already look for the
eval in the FSM. If an updated leader includes the eval, older followers will
write the eval. If a non-updated leader writes the eval in a separate Raft
entry, updated followers will write those evals normally.
Fixes: https://github.com/hashicorp/nomad/issues/26655
Ref: https://hashicorp.atlassian.net/browse/NMD-947
Ref: https://github.com/hashicorp/nomad/pull/8435
Typically the `LOGNAME` environment variable should be set according
to the values within `/etc/passwd` and represents the name of the
logged-in user. This should be set, where possible, alongside the
USER and HOME variables for all drivers that use the shared
executor and do not use a sub-shell.
don't require "bridge" network mode when using connect{}
we document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
Currently, every time a client starts, it creates a new Consul token per service or task. This PR changes that behaviour: it persists the Consul ACL token to the client state, and the client looks up an existing token before creating a new one.
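The lookup-before-create flow is roughly the sketch below; the store and Consul interfaces are hypothetical stand-ins for the client state DB and token API, not the real method names:

```go
package consulhook

// tokenStore is a stand-in for the client state DB where the token is persisted.
type tokenStore interface {
	GetConsulToken(allocID, taskName string) (string, error)
	PutConsulToken(allocID, taskName, token string) error
}

// tokenCreator is a stand-in for the Consul token derivation call.
type tokenCreator interface {
	CreateToken(taskName string) (string, error)
}

// deriveToken reuses a previously persisted token before minting a new one,
// so an agent restart does not create another Consul ACL token per task.
func deriveToken(store tokenStore, consul tokenCreator, allocID, taskName string) (string, error) {
	if tok, err := store.GetConsulToken(allocID, taskName); err == nil && tok != "" {
		return tok, nil
	}
	tok, err := consul.CreateToken(taskName)
	if err != nil {
		return "", err
	}
	if err := store.PutConsulToken(allocID, taskName, tok); err != nil {
		return "", err
	}
	return tok, nil
}
```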
Fixes: #20184
Fixes: #20185
Adds a new `windows` command which is available when running on
a Windows host. The command includes two new subcommands:
* `service install`
* `service uninstall`
The `service install` command will install the called binary into
the Windows program files directory, create a new Windows service,
set up configuration and data directories, and register the service
with the Windows eventlog. If the service and/or binary already
exist, the service will be stopped, the service and eventlog updated
if needed, the binary replaced, and the service started again.
The `service uninstall` command will stop the service, remove the
Windows service, and deregister the service from the eventlog. It
will not remove the configuration/data directory, nor will it remove
the installed binary.
This adds artifact inspection after download to detect any issues
with the content fetched. Currently this means checking for any
symlinks within the artifact that resolve outside the task or
allocation directories. On platforms where lockdown is available
(some Linux) this inspection is not performed.
The inspection can be disabled with the DisableArtifactInspection
option. A dedicated option for disabling this behavior allows
the DisableFilesystemIsolation option to be enabled but still
have artifacts inspected after download.
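A self-contained sketch of the symlink check follows, assuming the goal is simply "no symlink may resolve outside the allowed roots"; the function name and option wiring (DisableArtifactInspection) in the real artifact subsystem differ, and symlink chains are not followed here:

```go
package artifact

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// inspectSymlinks walks dir and returns an error for any symlink whose target
// escapes every directory in allowedRoots (e.g. the task and alloc dirs).
func inspectSymlinks(dir string, allowedRoots ...string) error {
	return filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.Type()&fs.ModeSymlink == 0 {
			return err
		}
		target, err := os.Readlink(path)
		if err != nil {
			return err
		}
		if !filepath.IsAbs(target) {
			target = filepath.Join(filepath.Dir(path), target)
		}
		target = filepath.Clean(target)
		for _, root := range allowedRoots {
			rel, err := filepath.Rel(root, target)
			if err == nil && rel != ".." &&
				!strings.HasPrefix(rel, ".."+string(os.PathSeparator)) {
				return nil // resolves inside an allowed root
			}
		}
		return fmt.Errorf("artifact symlink %s escapes task/alloc directories", path)
	})
}
```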
When a node misses a heartbeat and is marked down, Nomad deletes service
registration instances for that node. But if the node then successfully
heartbeats before its allocations are marked lost, the services are never
restored. The node is unaware that it has missed a heartbeat and there's no
anti-entropy on the node in any case.
We already delete services when the plan applier marks allocations as stopped,
so deleting the services when the node goes down is only an optimization to more
quickly divert service traffic. But because the state after a plan apply is the
"canonical" view of allocation health, this breaks correctness.
Remove the code path that deletes services from nodes when nodes go down. Retain
the state store code that deletes services when allocs are marked terminal by
the plan applier. Also add a path in the state store to delete services when
allocs are marked terminal by the client. This gets back some of the
optimization but avoids the correctness bug because marking the allocation
client-terminal is a one way operation.
Fixes: https://github.com/hashicorp/nomad/issues/16983
* Add -log-file-export and -log-lookback flags to add historical logs to the
debug capture
* use monitor.PrepFile() helper for other historical log tests
the executor dies, leaving an orphaned process still running.
the panic fix:
* don't `panic()`
* and return an empty, but non-nil, func on cgroup error
feature fix:
* allow non-root agent to proceed with exec when cgroups are off
* Add MonitorExport command and handlers
* Implement autocomplete
* Require nomad in serviceName
* Fix race in StreamReader.Read
* Add and use framer.Flush() to coordinate function exit
* Add LogFile to client/Server config and read NomadLogPath in rpcHandler instead of HTTPServer
* Parameterize StreamFixed stream size
Improved the `acl policy self` CLI command to handle both management and client tokens.
Management tokens now display a clear message indicating global access with no individual policies.
Fixes: https://github.com/hashicorp/nomad/issues/26389
When a task group is removed from a jobspec, the reconciler stops all
allocations and immediately returns from `computeGroup`. We can do the same for
when the group has been scaled-to-zero, but doing so runs into an inconsistency
in the way that server-terminal allocations are handled.
Prior to this change server-terminal allocations fall through `computeGroup`
without being marked as `ignore`, unless they are terminal canaries, in which
case they are marked `stop` (but this is a no-op). This inconsistency causes a
_tiny_ amount of extra `Plan.Submit`/Raft traffic, but more importantly makes it
more difficult to make test assertions for `stop` vs `ignore` vs
fallthrough. Remove this inconsistency by filtering out server-terminal
allocations early in `computeGroup`.
This brings the cluster reconciler's behavior closer to the node reconciler's
behavior, except that the node reconciler discards _all_ terminal allocations
because it doesn't support rescheduling.
This changeset required adjustments to two tests, but the tests themselves were
a bit of a mess:
* In https://github.com/hashicorp/nomad/pull/25726 we added a test of how
canaries were treated when on draining nodes. But the test didn't correctly
configure the job with an update block, leading to misleading test
behavior. Fix the test to exercise the intended behavior and refactor for
clarity.
* While working on reconciler behaviors around stopped allocations, I found it
extremely hard to follow the intent of the disconnected client tests because
many of the fields in the table-driven test are switches for more complex
behavior or just tersely named. Attempt to make this a little more legible by
moving some branches directly into fields, renaming some fields, and
flattening out some branching.
Ref: https://hashicorp.atlassian.net/browse/NMD-819
System and sysbatch jobs don't support the reschedule block, because we'd always
replace allocations back onto the same node. The job validation for system jobs
asserts that the user hasn't set a `reschedule` block so that users aren't
submitting jobs expecting it to be supported. But this validation was missing
for sysbatch jobs.
Validate that sysbatch jobs don't have a reschedule block.
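The added check is essentially the sketch below, with trimmed stand-ins for the real job and task group structs; the real validation lives alongside the existing system-job assertion:

```go
package nomad

import "fmt"

// ReschedulePolicy and TaskGroup are trimmed stand-ins for the real structs.
type ReschedulePolicy struct{ Attempts int }

type TaskGroup struct {
	Name             string
	ReschedulePolicy *ReschedulePolicy
}

// validateNoReschedule rejects a reschedule block on job types whose
// allocations would only ever be replaced onto the same node.
func validateNoReschedule(jobType string, groups []*TaskGroup) error {
	if jobType != "system" && jobType != "sysbatch" {
		return nil
	}
	for _, tg := range groups {
		if tg.ReschedulePolicy != nil {
			return fmt.Errorf("task group %q: %s jobs should not have a reschedule block",
				tg.Name, jobType)
		}
	}
	return nil
}
```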
The RPC handler for deleting dynamic host volumes has a check that any
allocations associated with a volume are client-terminal before deleting the
volume. But the state store delete that happens after we send client RPCs to the
plugin checks that the allocs are non-terminal on both server and client.
This can improperly allow deleting a volume from a client but then not being
able to delete it from the state store because of a time-of-check / time-of-use
bug. If the allocation fails/completes on the client before the server marks its
desired status as terminal, or if the allocation is marked server-terminal
during the client RPC, we can get a volume that passes the first check but not
the second check that happens in the state store and cannot be deleted.
Update the state store delete method to require that any allocation for a volume
is client terminal in order to delete the volume, not just server terminal.
Fixes: https://github.com/hashicorp/nomad/issues/26140
Ref: https://hashicorp.atlassian.net/browse/NMD-883
The output of the reconciler stage of scheduling is only visible via debug-level
logs, typically accessible only to the cluster admin. We can give job authors
better ability to understand what's happening to their jobs if we expose this
information to them in the `eval status` command.
Add the reconciler's desired updates to the evaluation struct so it can be
exposed in the API. This increases the size of evals by roughly 15% in the state
store, or a bit more when there are preemptions (but we expect this will be a
small minority of evals).
Ref: https://hashicorp.atlassian.net/browse/NMD-818
Fixes: https://github.com/hashicorp/nomad/issues/15564