Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
The configuration docs for `client.template.vault_retry`, `consul_retry`, and
`nomad_retry` incorrectly document the default number of attempts to be
unlimited (0). When we added these config blocks, we defaulted the fields to
`nil` for backwards compatibility, which causes them to fall back to the default
consul-template configuration values.
* artifact: protect against unbounded artifact decompression
Starting with 1.5.0, set defaut values for artifact decompression limits.
artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation
artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.
* artifact: assert limits cannot be nil in validation
In #13374 we updated the commented-out `license_path` in the packaged example
configuration file to match the existing documentation. Although this config
value was commented-out, it was reported that changing the value was
confusing. Update the commented-out line to the previous value and update the
documented examples to match that. This matches most of the examples for
Consul/Vault licensing as well. I've double-checked the tutorials and it looks
like it'd been left on the previous value there, so no additional work to be
done.
* Demoable state
* Demo mirage color
* Label as a block with foreground and background colours
* Test mock updates
* Go test updated
* Documentation update for label support
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety
Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.
Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
* Add config elements
* Wire in snapshot configuration to raft
* Add hot reload of raft config
* Add documentation for new raft settings
* Add changelog
* [no ci] first pass at plumbing grpc_ca_file
* consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+
This PR adds client config to Nomad for specifying consul.grpc_ca_file
These changes combined with https://github.com/hashicorp/consul/pull/15913 should
finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections.
* consul: add cl entgry for grpc_ca_file
* docs: mention grpc_tls changes due to Consul 1.14
* artifact: enable inheriting environment variables from client
This PR adds client configuration for specifying environment variables that
should be inherited by the artifact sandbox process from the Nomad Client agent.
Most users should not need to set these values but the configuration is provided
to ensure backwards compatability. Configuration of go-getter should ideally be
done through the artifact block in a jobspec task.
e.g.
```hcl
client {
artifact {
set_environment_variables = "TMPDIR,GIT_SSH_OPTS"
}
}
```
Closes#15498
* website: update set_environment_variables text to mention PATH
This PR adds the client config option for turning off filesystem isolation,
applicable on Linux systems where filesystem isolation is possible and
enabled by default.
```hcl
client{
artifact {
disable_filesystem_isolation = <bool:false>
}
}
```
Closes#15496
* client: accommodate Consul 1.14.0 gRPC and agent self changes.
Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.
The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.
The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.
The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.
* docs: modify Consul Connect entries to detail 1.14.0 changes.
* changelog: add entry for #15309
* fixup: tidy tests and clean version match from review feedback.
* fixup: use strings tolower func.
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.
Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
that returns an error, to make the workflow more clear and reduce nesting. Log
the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
on initializing the keyring, so there's a race condition in the keyring tests
where we can test for the existence of the root key before the keyring has
been initialize. Change this to an "eventually" test.
But these fixes aren't enough to fix#14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.
Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
This reverts PR #12416 and commit 6668ce022a.
While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
* docs: write a lot of words about heartbeats
Alternative to #14670
* Apply suggestions from code review
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* use descriptive title for link
* rework example of high failover ttl
Co-authored-by: Tim Gross <tgross@hashicorp.com>
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.
The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.
The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.
As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.
In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.
While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.
To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
Fixes#13505
This fixes#13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).
As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:
Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
Fix numerous go-getter security issues:
- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
retries and DoS opportunities.
**Approach**
Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.
The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.
In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
Move some common Vault API data struct decoding out of the Vault client
so it can be reused in other situations.
Make Vault job validation its own function so it's easier to expand it.
Rename the `Job.VaultPolicies` method to just `Job.Vault` since it
returns the full Vault block, not just their policies.
Set `ChangeMode` on `Vault.Canonicalize`.
Add some missing tests.
Allows specifying an entity alias that will be used by Nomad when
deriving the task Vault token.
An entity alias assigns an indentity to a token, allowing better control
and management of Vault clients since all tokens with the same indentity
alias will now be considered the same client. This helps track Nomad
activity in Vault's audit logs and better control over Vault billing.
Add support for a new Nomad server configuration to define a default
entity alias to be used when deriving Vault tokens. This default value
will be used if the task doesn't have an entity alias defined.
The client configuration options for drivers have been deprecated
since 0.9. We haven't torn them out completely but because they're
deprecated it's been hard to guarantee correct behavior. Remove the
documentation so that users aren't misled about their viability.
Resolves#12095 by WONTFIXing it.
This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.
This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.
This PR also includes a bug fix that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.