A small optimization in the scheduler required users to specify specific
device models if the required count was higher than the count available from
any individual model/vendor on the node. This change removes that optimization
to allow for more intuitive device scheduling when devices of different
vendors/models exist on a node.
The `go-getter` update in https://github.com/hashicorp/nomad/pull/26713 is not passing tests upstream (apparently https://github.com/hashicorp/go-getter/pull/548 is the origin of the problem, but that PR never ran tests). The issue being fixed isn't a critical vulnerability, so in the interest of preparing for the next release, revert the `go-getter` change but keep the Go toolchain update.
We'll skip go-getter 1.8.0 and pick up the next patch version once its issues are fixed.
Reverts commit 8a96929870.
tests that use this local Docker registry (the docker and podman tests)
occasionally flake, I think because the job timeout is reached even though
the task passes after a restart:
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task received by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Building Task Directory
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task started by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Exit Code: 1
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task restarting in 16.212149445s
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Task started by client
> jobs3.go:658: tg 'create-files' task 'create-auth-file' event: Exit Code: 0
Setting the restart delay lower will (hopefully) keep the test within the job timeout.
I'm not sure why the `pledge` task apparently flakes like this;
I could find no useful info in the logs.
When configuring Nomad Enterprise with Consul Enterprise and multiple
namespaces, you need to include the `consul_namespace` mapping in the auth
method configuration. Otherwise you'll see an error like "unknown variable
accessed: value.consul_namespace". The docs don't include an example of the updated
auth method configuration you need, which makes this detail easy to miss when we show the
claim being used in the following `consul acl auth-method create` command.
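As an illustration only, here is a minimal sketch of an auth method that includes the `consul_namespace` claim mapping, written against the Consul Go API client rather than the CLI; the method name, JWKS URL, audiences, and the other claim names are assumptions, not the configuration from the docs:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Illustrative JWT auth method for Nomad workload identities. The key
	// detail is the ClaimMappings entry for consul_namespace; without it,
	// rules that reference value.consul_namespace fail with
	// "unknown variable accessed: value.consul_namespace".
	method := &api.ACLAuthMethod{
		Name: "nomad-workloads", // assumed name
		Type: "jwt",
		Config: map[string]interface{}{
			"JWKSURL":          "https://nomad.example.com:4646/.well-known/jwks.json", // assumed address
			"JWTSupportedAlgs": []string{"RS256"},
			"BoundAudiences":   []string{"consul.io"},
			"ClaimMappings": map[string]string{
				"nomad_namespace":  "nomad_namespace",
				"nomad_job_id":     "nomad_job_id",
				"consul_namespace": "consul_namespace", // the mapping this docs change calls out
			},
		},
	}

	if _, _, err := client.ACL().AuthMethodCreate(method, nil); err != nil {
		log.Fatal(err)
	}
}
```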
During a large volume dispatch load test, I discovered that a lot of the total
scheduling time is being spent calling `structs.ParsePortRanges` repeatedly, in
order to parse the reserved ports configuration of the node (ex. converting
`"80,8000-8001"` to `[]int{80, 8000, 8001}`). A close examination of the
profiles shows that the bulk of the time is being spent hashing the keys for the
map of ports we use for de-duplication, and then sorting the resulting slice.
The `(*NetworkIndex) SetNode` method that calls the offending `ParsePortRanges`
merges all the ports into the `UsedPorts` map of bitmaps at scheduling time,
which means the consumer of the slice already de-duplicates and doesn't care
about the order. The only other caller of `ParsePortRanges` is configuration
file validation, which throws away the slice entirely.
By skipping de-duplication and not sorting, we can cut down the runtime of this
function by 30x and memory usage by 3x.
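A minimal sketch of the optimized approach (not the actual Nomad implementation; error handling is simplified and the function name here is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortRanges expands a reserved-ports spec such as "80,8000-8001" into a
// flat slice. Unlike the original implementation it does not de-duplicate or
// sort, because the scheduler merges the result into the NetworkIndex bitmap
// anyway and config validation discards the slice.
func parsePortRanges(spec string) ([]uint64, error) {
	var ports []uint64
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		bounds := strings.Split(part, "-")
		switch len(bounds) {
		case 1:
			p, err := strconv.ParseUint(bounds[0], 10, 16)
			if err != nil {
				return nil, err
			}
			ports = append(ports, p)
		case 2:
			lower, err := strconv.ParseUint(bounds[0], 10, 16)
			if err != nil {
				return nil, err
			}
			upper, err := strconv.ParseUint(bounds[1], 10, 16)
			if err != nil {
				return nil, err
			}
			if upper < lower {
				return nil, fmt.Errorf("invalid range %q", part)
			}
			for p := lower; p <= upper; p++ {
				ports = append(ports, p)
			}
		default:
			return nil, fmt.Errorf("invalid port spec %q", part)
		}
	}
	return ports, nil
}

func main() {
	ports, _ := parsePortRanges("80,8000-8001")
	fmt.Println(ports) // [80 8000 8001]
}
```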
Ref: https://github.com/hashicorp/nomad/blob/v1.10.4/nomad/structs/network.go#L201
Fixes: https://github.com/hashicorp/nomad/issues/26654
In #8435 (shipped in 0.12.1), we updated the `Job.Register` RPC to atomically
write the eval along with the job. But this didn't get copied to
`Job.Dispatch`. Under heavy load testing we demonstrated that this can result in
dispatched jobs without corresponding evals.
Update the dispatch RPC to write the eval in the same Raft log as the job
registration. Note that we don't need to version-check this change for upgrades,
because the register and dispatch RPCs share the same `JobRegisterRequestType`
Raft message, and therefore all supported server versions already look for the
eval in the FSM. If an updated leader includes the eval, older followers will
write the eval. If a non-updated leader writes the eval in a separate Raft
entry, updated followers will write those evals normally.
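To make the atomicity argument concrete, here is a toy sketch of the shape of the change, using simplified stand-in types rather than Nomad's real structs:

```go
package main

import "fmt"

// Simplified stand-ins for the real structs; field sets are illustrative.
type Job struct{ ID string }
type Evaluation struct{ ID, JobID string }

// jobRegisterRequest models the shared request shape: the dispatched child
// job plus an optional eval written in the same Raft entry.
type jobRegisterRequest struct {
	Job  *Job
	Eval *Evaluation // nil when written by a leader that predates the change
}

// applyJobRegister stands in for the FSM apply of the shared
// JobRegisterRequestType message: one log entry, both writes.
func applyJobRegister(state map[string]any, req jobRegisterRequest) {
	state["job/"+req.Job.ID] = req.Job
	if req.Eval != nil {
		state["eval/"+req.Eval.ID] = req.Eval
	}
}

func main() {
	state := map[string]any{}
	applyJobRegister(state, jobRegisterRequest{
		Job:  &Job{ID: "batch/dispatch-example"},
		Eval: &Evaluation{ID: "eval-1", JobID: "batch/dispatch-example"},
	})
	fmt.Println(len(state)) // 2: the job and its eval land atomically
}
```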
Fixes: https://github.com/hashicorp/nomad/issues/26655
Ref: https://hashicorp.atlassian.net/browse/NMD-947
Ref: https://github.com/hashicorp/nomad/pull/8435
This changeset adds system scheduler tests of various permutations of the `update`
block. It also fixes a number of bugs discovered in the process.
* Don't create deployment for in-flight rollout. If a system job is in the
middle of a rollout prior to upgrading to a version of Nomad with system
deployments, we'll end up creating a system deployment which might never
complete because previously placed allocs will not be tracked. Check to see if
we have existing allocs that should belong to the new deployment and prevent a
deployment from being created in that case.
* Ensure we call `Copy` on `Deployment` to avoid state store corruption.
* Don't limit canary counts by `max_parallel`.
* Never create deployments for `sysbatch` jobs.
Ref: https://hashicorp.atlassian.net/browse/NMD-761
In the system scheduler, we need to keep track of which nodes were previously
used as "canary nodes" and not pick canary nodes at random, in case of
previously failed canaries or changes to the number of canaries in the jobspec.
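As a rough illustration only (a hypothetical helper, not the scheduler's actual data structures), node selection could look something like:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickCanaryNodes is a hypothetical sketch: reuse nodes that already hosted
// canaries for this deployment, then top up from the remaining eligible
// nodes only if the canary count grew.
func pickCanaryNodes(eligible, previousCanaryNodes []string, want int) []string {
	picked := make([]string, 0, want)
	used := map[string]bool{}

	// Keep previously used canary nodes first, so failed canaries or a changed
	// canary count don't cause churn onto fresh nodes.
	for _, node := range previousCanaryNodes {
		if len(picked) == want {
			break
		}
		picked = append(picked, node)
		used[node] = true
	}

	// Fill any remainder from nodes that haven't hosted a canary yet.
	rest := make([]string, 0, len(eligible))
	for _, node := range eligible {
		if !used[node] {
			rest = append(rest, node)
		}
	}
	rand.Shuffle(len(rest), func(i, j int) { rest[i], rest[j] = rest[j], rest[i] })
	for _, node := range rest {
		if len(picked) == want {
			break
		}
		picked = append(picked, node)
	}
	return picked
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4"}
	fmt.Println(pickCanaryNodes(nodes, []string{"n2"}, 2)) // n2 is always reused
}
```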
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Typically the `LOGNAME` environment variable should be set according
to the values within `/etc/passwd` and represents the name of the
logged-in user. Where possible, it should be set alongside the
`USER` and `HOME` variables for all drivers that use the shared
executor and do not use a sub-shell.
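A minimal sketch of how an executor could derive these values using the standard `os/user` package (the function name and how it plugs into the task environment are assumptions):

```go
package main

import (
	"fmt"
	"os/user"
)

// identityEnvVars looks up the task user (as /etc/passwd reports it on most
// Unix systems) and returns the identity-related environment variables an
// executor could set when there is no sub-shell to do it for us.
func identityEnvVars(username string) (map[string]string, error) {
	u, err := user.Lookup(username)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"LOGNAME": u.Username,
		"USER":    u.Username,
		"HOME":    u.HomeDir,
	}, nil
}

func main() {
	env, err := identityEnvVars("nobody")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(env)
}
```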
don't require "bridge" network mode when using `connect{}`
We document this as "at your own risk" because CNI configuration
is so flexible that we can't guarantee a user's network will work,
but Nomad's "bridge" CNI config may be used as a reference.
Currently, every time a client starts it creates a new Consul token per service or task. This PR changes that behaviour: it persists Consul ACL tokens to the client state, and the client starts by looking up an existing token before creating a new one.
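A rough sketch of the lookup-before-create flow; the `TokenStore` interface and method names below are hypothetical stand-ins for the client state store, not its real API:

```go
package main

import "fmt"

// TokenStore is a hypothetical persistence interface standing in for the
// client state DB.
type TokenStore interface {
	GetConsulToken(key string) (string, bool)
	PutConsulToken(key, token string)
}

type memStore struct{ tokens map[string]string }

func (m *memStore) GetConsulToken(key string) (string, bool) {
	tok, ok := m.tokens[key]
	return tok, ok
}
func (m *memStore) PutConsulToken(key, token string) { m.tokens[key] = token }

// tokenFor returns a persisted token for the service/task if one exists,
// and only derives (and persists) a new one otherwise.
func tokenFor(store TokenStore, key string, derive func() string) string {
	if tok, ok := store.GetConsulToken(key); ok {
		return tok // reuse across client restarts
	}
	tok := derive()
	store.PutConsulToken(key, tok)
	return tok
}

func main() {
	store := &memStore{tokens: map[string]string{}}
	derives := 0
	derive := func() string { derives++; return fmt.Sprintf("token-%d", derives) }

	fmt.Println(tokenFor(store, "alloc1/task1", derive)) // token-1 (derived)
	fmt.Println(tokenFor(store, "alloc1/task1", derive)) // token-1 (reused)
}
```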
Fixes: #20184
Fixes: #20185
look, I know I misspelled "locater" in the code comment, but it's easier to acknowledge that here in this commit message than it is to push a new commit with all the test/approval machinery in github.
This changeset adjusts the handling of allocation placements when we're
promoting a deployment, and it corrects the behavior of `isDeploymentComplete`,
which previously would never mark a promoted deployment as complete.
The `TestVolumeWatch_LeadershipTransition` test was a little racy,
and the fix required adding an eventually wrapper to the end of
the test. While doing this work, it also seemed fitting to move the
package to the `must` test library.
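For reference, this is the kind of "eventually" wrapper meant here, sketched with the `shoenig/test` `must` and `wait` packages; the condition is a placeholder rather than the test's real check:

```go
package example

import (
	"testing"
	"time"

	"github.com/shoenig/test/must"
	"github.com/shoenig/test/wait"
)

// TestEventually shows the "eventually" pattern: retry a condition until it
// succeeds or the timeout elapses, instead of asserting on a racy snapshot.
func TestEventually(t *testing.T) {
	start := time.Now()

	must.Wait(t, wait.InitialSuccess(
		wait.BoolFunc(func() bool {
			// placeholder condition; a real test would check the state the
			// leadership transition eventually produces
			return time.Since(start) > 50*time.Millisecond
		}),
		wait.Timeout(5*time.Second),
		wait.Gap(10*time.Millisecond),
	))
}
```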
When creating constants with a custom type, each definition should
include the type. If only the first constant declares the type, it
will have a different type from the other constants.
This change fixes occurrences of this and enables SA9004 within CI
linting to catch future problems while the change is in review.
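A minimal example of the pitfall that SA9004 flags:

```go
package example

// Status is a custom type for a set of constants.
type Status int

// Buggy: only the first constant is declared as Status; the others have an
// explicit value but no type, so they end up as untyped integer constants
// (default type int), not Status. This is what SA9004 reports.
const (
	StatusPending Status = 0
	StatusRunning        = 1 // int, not Status
	StatusDone           = 2 // int, not Status
)

// Fixed: every constant repeats the type (an iota block would also work,
// since omitted expressions inherit the previous spec's type).
const (
	StateA Status = 0
	StateB Status = 1
	StateC Status = 2
)
```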
Adds a new `windows` command which is available when running on
a Windows host. The command includes two new subcommands:
* `service install`
* `service uninstall`
The `service install` command will install the called binary into
the Windows program files directory, create a new Windows service,
set up configuration and data directories, and register the service
with the Windows eventlog. If the service and/or binary already
exist, the service will be stopped, the service and eventlog
registrations updated if needed, the binary replaced, and the
service started again.
The `service uninstall` command will stop the service, remove the
Windows service, and deregister the service from the eventlog. It
will not remove the configuration/data directories, nor will it
remove the installed binary.
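For a sense of what the install step involves under the hood, here is a minimal sketch using `golang.org/x/sys/windows/svc`; this is not the actual implementation, and the service name, display name, and binary path are assumptions:

```go
//go:build windows

package main

import (
	"log"

	"golang.org/x/sys/windows/svc/eventlog"
	"golang.org/x/sys/windows/svc/mgr"
)

// installService registers an already-copied binary as a Windows service and
// sets up an eventlog source for it. The service name and binary path are
// illustrative, not the values the command actually uses.
func installService(name, exePath string) error {
	m, err := mgr.Connect()
	if err != nil {
		return err
	}
	defer m.Disconnect()

	s, err := m.CreateService(name, exePath, mgr.Config{
		DisplayName: "Nomad",
		StartType:   mgr.StartAutomatic,
	})
	if err != nil {
		return err
	}
	defer s.Close()

	// Register the service with the Windows eventlog so its logs are visible
	// in Event Viewer.
	return eventlog.InstallAsEventCreate(name, eventlog.Error|eventlog.Warning|eventlog.Info)
}

func main() {
	if err := installService("nomad", `C:\Program Files\nomad\nomad.exe`); err != nil {
		log.Fatal(err)
	}
}
```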