The Nomad client can now optionally emit telemetry data from the
prerun and prestart hooks. This allows operators to monitor and
alert on failures and time taken to complete.
The new datapoints are:
- nomad.client.alloc_hook.prerun.success (counter)
- nomad.client.alloc_hook.prerun.failed (counter)
- nomad.client.alloc_hook.prerun.elapsed (sample)
- nomad.client.task_hook.prestart.success (counter)
- nomad.client.task_hook.prestart.failed (counter)
- nomad.client.task_hook.prestart.elapsed (sample)
The hook execution time is useful to Nomad engineering and will
help optimize code where possible and understand job specification
impacts on hook performance.
Currently only the PreRun and PreStart hooks have telemetry
enabled, so we limit the number of new metrics being produced.
The retry_join logic was not allowing for retries to happen and
was exiting after the first failed discovery attempt. This change
fixes that behaviour and adds a test to ensure no further
regressions.
This PR adds Consul Template's executeTemplate function to the denylist by
default, in order to prevent accidental or malicious infinitely recursive
execution.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
A more comprehensive env.denylist that now includes more token, token file and
license variables.
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
* func: User url rules to scape non alphanumeric values in hcl variables
* docs: add changelog
* func: unscape flags before returning
* use JSON.stringify instead of bespoke value quoting to handle in-value-multi-line cases
---------
Co-authored-by: Phil Renaud <phil@riotindustries.com>
When the service client syncs to Consul, we accumulate service sync errors in a
multierror before reading all the local checks. If the API call to the local
checks fails, we either return that error or append it to the multierror and
return the set of errors. But `multierror.Error.Len()` doesn't nil-check, so we
need to do this ourselves.
I've also made a quick pass through the rest of the code base looking for
multierror `Len` method calls to see if we have this pattern elsewhere.
Fixes: https://github.com/hashicorp/nomad/issues/24512
this opens up dispatching parameterized jobs by systems
that do not allow modifying what http request body they send
e.g. these two things are equal:
POST '{"Payload": "'"$(base64 <<< "hello")"'"}' /v1/job/my-job/dispatch
POST 'hello' /v1/job/my-job/dispatch/payload
* cli: trim job init example jobspec
* cli: trim job init -connect example jobspec
---------
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
* detect ipv6 on "bridge" network and set
service.connect.sidecar_proxy.config.bind_address
for envoy to "::" instead of "0.0.0.0"
* allow users to set bind_address in jobspec
e.g. "" would defer to consul proxy-defaults
* caveat: tproxy still does not work, because
the CNI plugin does not configure ip6tables
When the local Consul agent receives a deregister request, it performs a
pre-flight check using the locally cached ACL token. The agent then sends the
request upstream to the Consul servers as part of anti-entropy, using its own
token. This requires that the token we use for deregistration is valid even
though that's not the token used to write to the Consul server.
There are several cases where the service identity token might no longer exist
at the time of deregistration:
* A race condition between the sync and destroying the allocation.
* Misconfiguration of the Consul auth method with a TTL.
* Out-of-band destruction of the token.
Additionally, Nomad's sync with Consul returns early if there are any errors,
which means that a single broken token can prevent any other service on the
Nomad agent from being registered or deregistered.
Update Nomad's sync with Consul to use the Nomad agent's own Consul token for
deregistration, regardless of which token the service was registered
with. Accumulate errors from the sync so that they no longer block
deregistration of other services.
Fixes: https://github.com/hashicorp/nomad/issues/20159
* jobspec: add a chown option to artifact block
This PR adds a boolean 'chown' field to the artifact block.
It indicates whether the Nomad client should chown the downloaded files
and directories to be owned by the task.user. This is useful for drivers
like raw_exec and exec2 which are subject to the host filesystem user
permissions structure. Before, these drivers might not be able to use or
manage the downloaded artifacts since they would be owned by the root
user on a typical Nomad client configuration.
* api: no need for pointer of chown field
We initialize this slice with a zeroed array and then append to it, which means
we then have to clean out the empty strings later. Initialize to the correct
capacity up front so there are no empty values.
Ref: https://github.com/hashicorp/nomad/pull/24104
In #24007 we merged new HCL files but they were missing copywrite headers
because the scan didn't run on this PR for some reason. I've already backported
this to the Enterprise branches.
When jobs are submitted with a scaling policy, the scaling policy's target only
includes the job's namespace if the `namespace` field is set in the jobspec and
not from the request. Normally jobs are canonicalized in the RPC handler before
being written to Raft. But the scaling policy targets are instead written during
the conversion from `api.Job` to `structs.Job`. We populate the `structs.Job`
namespace from the request here as well, but only after the conversion has
occurred. Swap the order of these operations so that the conversion is always
happening with a correct namespace.
Long-term we should not be making mutations during conversion either. But we
can't remove it immediately because API requests may come from any agent across
upgrades. Move the scaling target creation into the `Canonicalize` method and
mark it for future removal in the API conversion code path.
Fixes: https://github.com/hashicorp/nomad/issues/24039
* TaggedVersion information in structs, rather than job_endpoint (#23841)
* TaggedVersion information in structs, rather than job_endpoint
* Test for taggedVersion description length
* Some API plumbing
* Tag and Untag job versions (#23863)
* Tag and Untag at API level on down, but am I unblocking the wrong thing?
* Code and comment cleanup
* Unset methods generally now I stare long into the namespace abyss
* Namespace passes through with QueryOptions removed from a write requesting struct
* Comment and PR review cleanup
* Version back to VersionStr
* Generally consolidate unset logic into apply for version tagging
* Addressed some PR comments
* Auth check and RPC forwarding
* uint64 instead of pointer for job version after api layer and renamed copy
* job tag command split into apply and unset
* latest-version convenience handling moved to CLI command level
* CLI tests for tagging/untagging
* UI parts removed
* Add to job table when unsetting job tag on latest version
* Vestigial no more
* Compare versions by name and version number with the nomad history command (#23889)
* First pass at passing a tagname and/or diff version to plan/versions requests
* versions API now takes compare_to flags
* Job history command output can have tag names and descriptions
* compare_to to diff-tag and diff-version, plus adding flags to history command
* 0th version now shows a diff if a specific diff target is requested
* Addressing some PR comments
* Simplify the diff-appending part of jobVersions and hide None-type diffs from CLI
* Remove the diff-tag and diff-version parts of nomad job plan, with an eye toward making them a new top-level CLI command soon
* Version diff tests
* re-implement JobVersionByTagName
* Test mods and simplification
* Documentation for nomad job history additions
* Prevent pruning and reaping of TaggedVersion jobs (#23983)
tagged versions should not count against JobTrackedVersions
i.e. new job versions being inserted should not evict tagged versions
and GC should not delete a job if any of its versions are tagged
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
* [ui] Version Tags on the job versions page (#24013)
* Timeline styles and their buttons modernized, and tags added
* styled but not yet functional version blocks
* Rough pass at edit/unedit UX
* Styles consolidated
* better UX around version tag crud, plus adapter and serializers
* Mirage and acceptance tests
* Modify percy to not show time-based things
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
* Job revert command and API endpoint can take a string version tag name (#24059)
* Job revert command and API endpoint can take a string version tag name
* RevertOpts as a signature-modified alternative to Revert()
* job revert CLI test
* Version pointers in endpoint tests
* Dont copy over the tag when a job is reverted to a version with a tag
* Convert tag name to version number at CLI level
* Client method for version lookup by tag
* No longer double-declaring client
* [ui] Add tag filter to the job versions page (#24064)
* Rough pass at the UI for version diff dropdown
* Cleanup and diff fetching via adapter method
* TaggedVersion now VersionTag (#24066)
---------
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
In #18925 we added a `-json` flag to the `job status` command, but the argument
handling had a bug where it would always set the `-json` flag if either the `-t`
or `-json` flags were set, resulting in a misleading error. Instead, pass the
`-json` flag value into the formatter.
Fixes: https://github.com/hashicorp/nomad/issues/24050
In #23977 we moved the keyring into Raft, which can expose key material in Raft
snapshots when using the less-secure AEAD keyring instead of KMS. This changeset
adds tools for redacting this material from snapshots:
* The `operator snapshot state` command gains the ability to display key
metadata (only), which respects the `-filter` option.
* The `operator snapshot save` command gains a `-redact` option that removes key
material from the snapshot after it's downloaded.
* A new `operator snapshot redact` command allows removing key material from an
existing snapshot.
so more than one copy of a program can run
at a time on the same port with SO_REUSEPORT.
requires host network mode.
some task drivers (like docker) may also need
config {
network_mode = "host"
}
but this is not validated prior to placement.
* build: update golangci-lint to 1.60.1
* ci: update golangci-lint to v1.60.1
Helps with go1.23 compatability. Introduces some breaking changes / newly
enforced linter patterns so those are fixed as well.